Introduction: Inner voice is estimated to occur during at least a quarter of people's conscious waking life. A substantial body of research asserts that the inner voice plays important roles in cognitive functions such as self-regulation and self-reflection. Virtual cognitions are a stream of simulated thoughts that people can hear while immersed in a virtual reality environment, intended to mimic the inner voice and thus simulate its effect. Presenting and manipulating virtual cognitions in learning and training may be a useful intervention method for influencing people's behavior and beliefs. Exposing people to virtual cognitions, presented as an inner voice, requires simulating such a voice and therefore an understanding of the underlying sound parameters. Many researchers believe that there is a relationship between people's inner and outer voice, some even suggesting that the inner voice resembles the features of a person's own outer voice. The work presented here, therefore, explored people's perception of their simulated inner voice by considering several core sound parameters of their outer voice. Methods: Using a specially developed audio recording and modification software tool, 15 participants (11 males, 4 females) adjusted key sound parameters of a recording of their own voice to match their perception of either their inner or their outer voice. After reading nine sentences aloud, they modified seven sound parameters of the recordings: pitch, speed, echo, and the volume of four frequency bands (20-320 Hz, 320-1280 Hz, 1280-5120 Hz, and 5120-20480 Hz). Conclusion: The results of the study indicate that people perceive their inner and outer voice differently. Individual variation was also found in how the differences between inner and outer voice were perceived. For developers who want to simulate an inner voice in a virtual environment, these findings suggest that the inner voice has its own distinct characteristics compared to the outer voice. 
The volume setting for the 1280-5120 Hz frequency band can be based on group perception, whereas speed and echo settings may require individualization. © 2018, Interactive Media Institute. All rights reserved.
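For illustration, the seven adjustable parameters described in the abstract (pitch, speed, echo, and one volume per frequency band) could be bundled as follows. This is a minimal sketch: the function name, parameter names, units, and neutral default values are assumptions for clarity, not taken from the study's actual software tool.

```python
# Hypothetical representation of the seven adjustable sound parameters;
# names and defaults are illustrative assumptions, not the study's tool.

FREQUENCY_BANDS_HZ = [(20, 320), (320, 1280), (1280, 5120), (5120, 20480)]

def make_voice_profile(pitch_shift=0.0, speed=1.0, echo=0.0,
                       band_volumes=(1.0, 1.0, 1.0, 1.0)):
    """Bundle the seven parameters: pitch, speed, echo, and one
    volume per frequency band (defaults are neutral placeholders)."""
    if len(band_volumes) != len(FREQUENCY_BANDS_HZ):
        raise ValueError("expected one volume per frequency band")
    profile = {"pitch_shift": pitch_shift, "speed": speed, "echo": echo}
    for (lo, hi), vol in zip(FREQUENCY_BANDS_HZ, band_volumes):
        profile[f"volume_{lo}_{hi}Hz"] = vol
    return profile

# A participant's hypothetical "inner voice" setting: slightly slower,
# with the 1280-5120 Hz band attenuated relative to the recording.
inner = make_voice_profile(speed=0.9, band_volumes=(1.0, 1.0, 0.7, 1.0))
```

Such a per-participant profile structure reflects the abstract's finding that some settings (e.g. speed and echo) may need individualization while others can use group-level values.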