On voice

What voice does that text can't — the 200-millisecond loop

By Cody, Founder of CallByrd · July 2, 2026 · 9 min read

Grounded in the research cited below. Clinical review by a licensed practitioner is being added.

A previous piece in this journal — how conversation rewires the brain — laid out the general case. Affect labeling calms the amygdala. Being heard activates reward. Repeated conversation, over months, shifts affective style. All of that is real, and most of it applies to conversation of any kind — writing a letter, texting a friend, having dinner with someone.

This piece is narrower and, we think, stranger. It is about what voice specifically does that a text thread cannot — and, more surprisingly, what voice does that a video call cannot. Where the previous essay answered why talking helps, this one answers a question that's harder to phrase honestly: why call, instead of type.

The short version is that spoken conversation locks two nervous systems into a shared millisecond-scale predictive loop that no other communication medium fully replicates. The longer version is the rest of this essay, together with the parts where we'll name the limits and the places the popular story outruns the evidence.

The 200-millisecond paradox

Start with a number. Across at least ten languages — English, Japanese, Danish, Italian, Korean, and others — the most common gap between one person finishing a turn and the next person starting is about 200 milliseconds (Stivers et al., 2009). About the duration of a single syllable. In some languages, closer to zero.

Now the second number. The minimum planning latency for spoken language production, even for naming a single picture in a lab, is somewhere around 600 milliseconds (Indefrey & Levelt, 2004). Ordinary conversational replies take longer than that to plan from scratch.

200 and 600 do not add up. If we waited for the other person's turn to end before we started planning our response, every gap would be at least 600 milliseconds. Instead the most common gap is 200. Something has to give, and what gives is the assumption that conversation is sequential. Levinson and Torreira (2015) call this the central psycholinguistic puzzle of turn-taking: the numbers are impossible unless listeners are planning their response DURING incoming speech, not after it.
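The arithmetic behind the puzzle can be written out as a minimal sketch. It assumes only the two figures quoted above (a 200 ms typical gap and a 600 ms minimum planning latency); the function and constant names are illustrative and do not come from the cited papers.

```python
# Illustrative arithmetic only; the two constants are the figures quoted in the text.
TYPICAL_GAP_MS = 200    # most common gap between turns (Stivers et al., 2009)
MIN_PLANNING_MS = 600   # minimum planning latency (Indefrey & Levelt, 2004)


def sequential_gap_ms(planning_ms: int) -> int:
    """Smallest possible gap if planning only starts once the turn has ended."""
    return planning_ms


def planning_start_offset_ms(gap_ms: int, planning_ms: int) -> int:
    """When planning must begin, relative to the end of the incoming turn.

    Negative values mean planning starts while the other person is still talking.
    """
    return gap_ms - planning_ms


offset = planning_start_offset_ms(TYPICAL_GAP_MS, MIN_PLANNING_MS)
print(f"Strictly sequential model: gap of at least {sequential_gap_ms(MIN_PLANNING_MS)} ms")
print(f"Observed {TYPICAL_GAP_MS} ms gap: planning must start at {offset} ms relative to turn end")
```

Under these assumptions the offset is negative: planning has to begin at least 400 ms before the incoming turn ends, and earlier still for anything longer than a single word.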

What your brain does while it listens

This is the part where the neuroscience gets specific. Bögels, Magyari, and Levinson (2015) put people in EEG rigs and had them answer spoken questions — some where the critical information (the part that decides your answer) came early in the question, some where it came late.

About 500 milliseconds after the critical information arrived, regardless of where in the question it was, a large ERP positivity appeared, source-localized to language production areas (middle temporal gyrus, inferior frontal gyrus, precentral cortex). Alpha-band power dropped, indexing a shift from comprehension to production. Behaviorally, participants responded 300 milliseconds faster to questions where the critical information arrived early than to questions where it arrived late.

The interpretation is that response planning had begun — often two seconds or more before the speaker's turn ended — and reached at least the level of retrieving the phonological form of the answer. Your brain was preparing to say something specific, while the other person was still talking.
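To make that timeline concrete, here is a small sketch. The 500 ms lag is the figure reported above; the question length and the position of the critical information are hypothetical values chosen only for illustration, as are the function and variable names.

```python
# The 500 ms lag is the figure quoted in the text; everything else is hypothetical.
PLANNING_LAG_MS = 500  # planning signature follows the critical information by ~500 ms


def planning_head_start_ms(question_ms: int, critical_info_ms: int,
                           lag_ms: int = PLANNING_LAG_MS) -> int:
    """Time between the onset of response planning and the end of the question.

    A positive result means planning was already running while the speaker
    was still talking; zero or negative means it began at or after the turn end.
    """
    planning_onset_ms = critical_info_ms + lag_ms
    return question_ms - planning_onset_ms


# Hypothetical 3-second question, with the answer-deciding word early vs. late.
early = planning_head_start_ms(question_ms=3000, critical_info_ms=500)
late = planning_head_start_ms(question_ms=3000, critical_info_ms=2800)
print(f"critical info early: head start {early} ms")
print(f"critical info late:  head start {late} ms")
```

In this toy case an early critical word gives planning a two-second head start, while a late one leaves planning still under way when the turn ends, which is in line with the slower responses in that condition.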

This is not a story most people tell about their inner life. When we describe listening, we tend to describe attention, understanding, occasional interruption. We do not usually describe the phonological form of the reply we intend to give being retrieved from long-term memory a full two seconds before we utter it. But the EEG evidence suggests that's what is happening, and it's what makes 200-millisecond turn transitions possible.

Prediction all the way down

The prediction goes deeper than word retrieval. Gisladottir, Bögels, and Levinson (2018) recorded EEG while participants listened to dialog snippets where a speaker was about to either accept an offer or decline it — a socially charged bifurcation. In the 200 milliseconds BEFORE the critical utterance was heard, alpha and low-beta band power (11–18Hz) dropped more sharply when a decline was coming than when an accept was coming. The brain was preparing differently based on what it expected to hear next.
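For readers unfamiliar with the term, "band power" is the amount of signal energy inside a frequency range. The sketch below shows the idea on synthetic signals, not EEG data: it assumes a 256 Hz sample rate, a made-up 14 Hz rhythm whose amplitude is lowered in the second trace, and a plain discrete Fourier transform. Real EEG pipelines typically use more elaborate time-frequency methods, and all names here are hypothetical.

```python
import math


def band_power(signal, sample_rate_hz, low_hz, high_hz):
    """Mean power of `signal` across DFT bins between low_hz and high_hz (inclusive)."""
    n = len(signal)
    powers = []
    for k in range(1, n // 2):
        freq_hz = k * sample_rate_hz / n
        if low_hz <= freq_hz <= high_hz:
            # Direct DFT at bin k: correlate the signal with a cosine and a sine.
            re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(signal))
            im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(signal))
            powers.append((re * re + im * im) / (n * n))
    if not powers:
        raise ValueError("no DFT bins fall inside the requested band")
    return sum(powers) / len(powers)


def synthetic_trace(rhythm_amplitude, sample_rate_hz=256, seconds=1.0):
    """Synthetic signal: a 14 Hz rhythm plus a fixed 40 Hz component outside the band."""
    n = int(sample_rate_hz * seconds)
    return [
        rhythm_amplitude * math.sin(2 * math.pi * 14 * i / sample_rate_hz)
        + 0.5 * math.sin(2 * math.pi * 40 * i / sample_rate_hz)
        for i in range(n)
    ]


# Hypothetical amplitudes: the in-band rhythm is weaker in the second trace.
baseline = band_power(synthetic_trace(1.0), 256, 11, 18)
anticipation = band_power(synthetic_trace(0.6), 256, 11, 18)
change_pct = 100 * (anticipation - baseline) / baseline
print(f"band power change in 11-18 Hz: {change_pct:.0f}%")
```

Because power scales with the square of amplitude, a modest drop in the rhythm's amplitude shows up as a larger drop in band power, while the 40 Hz component outside the band leaves the measure untouched.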

At the physical layer, Krause and Kawamoto (2021) motion-tracked lip position in unscripted dyadic conversation and found something similar. When a speaker was about to produce a labial consonant or rounded vowel, their lip area started reducing up to three seconds before the acoustic onset. Motoric planning was underway well before the current speaker finished. This finding refines rather than overturns the earlier picture — for longer utterances (eight words or more) the anticipatory window collapses toward onset — but the direction is clear. Conversation is more predictive than reactive.

What is remarkable about all of this is that it only works acoustically. The 200ms timing rests on acoustic cues — pitch contour, timing, the small vocal markers that signal a turn is ending. It rests on the ability to hear the shape of what someone is about to say before they finish saying it. Without those cues, the predictive machinery has nothing to lock onto.

What text cannot do

This is where the essay becomes about phones and not about laboratories. Text communication does not carry the acoustic onsets that the predictive machinery needs. There is no pitch contour, no timing, no prosodic marker of an imminent turn end. A text message is a complete unit, delivered whole, read at whatever speed the reader chooses.

Response times for text conversation are typically seconds to minutes, sometimes days. The 200-millisecond coupling of voice does not exist in text at all. Whatever is happening neurologically during a text exchange, it is not the shared predictive loop that voice creates.

This is not a claim that text is bad. Text preserves affect labeling: Pennebaker (1997) documented the well-being effects of writing about emotional experiences, and the effect does not require a reader. Text preserves social reward: an affectionate message received still lights up the social-reward circuits described by Eisenberger (2012). Text is fine for many purposes.

What text cannot do is what a phone call does at the millisecond scale: lock two nervous systems into shared timing. Schroeder, Kardas, and Epley (2017) found that people were evaluated as more thoughtful when their views were spoken rather than written, even when the words were the same. “The humanizing voice” is their phrase. Kumar and Epley (2021) found that people consistently underestimated how good a phone call would feel: they expected texting to be equivalent, and were wrong. Both findings have the same shape: the medium adds something that survives beyond the words.

Where voice beats video

This is the part most people don't expect.

Kraus (2017) ran five experiments across 1,772 participants comparing empathic accuracy — how well you can identify what another person is feeling — across voice-only, video-only, and audio-plus-video conditions. Voice-only won. Not by a small margin, and not in a single lab paradigm. Across live interaction and controlled stimuli, voice-only was the most accurate channel.
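To give "empathic accuracy" a concrete shape: one common way to score it, assumed here for illustration and not a description of the exact procedure in Kraus (2017), is the average absolute difference between how a person rates their own emotions and how an observer guesses they feel. Lower error means higher accuracy. The ratings below are invented and the function name is hypothetical.

```python
from statistics import mean


def empathic_error(self_ratings, perceiver_ratings):
    """Mean absolute gap between self-reported and guessed emotion ratings.

    Lower is better: 0 means the perceiver matched every rating exactly.
    Only emotions rated by both people are compared.
    """
    shared = self_ratings.keys() & perceiver_ratings.keys()
    if not shared:
        raise ValueError("no emotions were rated by both people")
    return mean(abs(self_ratings[e] - perceiver_ratings[e]) for e in shared)


# Invented ratings on a 1-7 scale, for illustration only.
target = {"joy": 5, "anxiety": 2, "sadness": 1}
perceiver_a = {"joy": 4, "anxiety": 2, "sadness": 2}
perceiver_b = {"joy": 3, "anxiety": 4, "sadness": 1}

print(f"perceiver A error: {empathic_error(target, perceiver_a):.2f}")
print(f"perceiver B error: {empathic_error(target, perceiver_b):.2f}")
```

A study comparing channels would compute a score like this for each perceiver and then compare the averages across voice-only, video-only, and audio-plus-video conditions.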

At first this seems counterintuitive. Shouldn't more channels equal more information? The likely answer is that video calls introduce nonverbal overload. Bailenson (2021) identified four specific costs unique to video calls: excessive close-up eye gaze, cognitive load from producing and interpreting exaggerated nonverbal cues, increased self-evaluation from staring at video of oneself, and constraints on physical mobility. All four are absent from a voice call. On the phone, you are not managing your own face. You are not being watched. You are not being asked to produce or read gaze cues into a lens. The channel is smaller, and so is the interpretation burden.

None of this makes video calls bad. It makes them a specific tool for a specific job — dispersed teams needing shared visual reference, deaf and hard-of-hearing users, family who want to see a child grow up in real time. What it means is that the default assumption — more sensory channels equals more connection — is not supported by the evidence when connection is measured as empathic accuracy. On that metric, voice-only is quietly better.

The circuit that only voices reach

Sander and colleagues (2005) used fMRI to test what happens when people attend to vocal emotional content compared to visual emotional content. When listeners were selectively attending to voices, activity increased in a specific circuit: left insula, left amygdala and hippocampus, and rostral anterior cingulate cortex. Functional connectivity between these regions was significant only in the voice-attention condition.

This is a single primary source and the authors acknowledge that the effect could reflect vocal valence, arousal, or physical stimulus properties. But it is one of the cleanest available fMRI demonstrations that voice engages an affective circuit that is not the same circuit visual emotional stimuli engage. There is something specific about voice as a limbic input.

This may be the honest neurological answer to a question the CallByrd inbox occasionally receives: why does a phone call from someone you love feel different from a photo, even one taken minutes earlier of the same person's face? A plausible piece of the answer is that voice activates limbic circuits that visual input does not — a distinct affective pathway with its own wiring.

What the science does not yet say

The previous piece in this journal cited polyvagal theory (Porges, 2011) as if the mechanism were settled. It is not. Grossman (2023) published a detailed critique — “Fundamental challenges and likely refutations of the five basic premises of the polyvagal theory” — in Biological Psychology, arguing that similar myelinated cardiac vagal fibers exist in sharks, bony fish, birds, and mammals, and that the mammal-unique story on which polyvagal rests is empirically undermined. Porges has published rebuttals, and the debate is live. The descriptive observation that a warm modulated voice regulates the nervous system is not in dispute. The specific vagal mechanism for it is.

A second correction. When we researched this essay we hypothesized that “mirror neurons for speech” — the idea that hearing a syllable automatically activates the motor cortex that would produce it — would be one of the mechanisms. Adversarial verification did not support the claim in the form we hypothesized. It is a real research area, but the specific “articulator-mirroring” version does not hold up cleanly enough to build on. We are leaving it out rather than overclaiming.

A third. Every EEG finding cited here comes from human-human dyads. Whether the 200-millisecond predictive loop forms between a human and an AI voice — whether the alpha/beta desynchronization still fires when the next speaker is synthetic — is an empirical question the field has not answered yet. CallByrd is a phone call with an AI. It feels closer to conversation than typing does. The neural-coupling evidence for that felt sense is not in.

We are naming these because the CallByrd brand voice is not to overclaim. What voice does that text cannot is well-documented. What an AI voice does specifically is more uncertain than the wellness internet would sometimes suggest. Both things can be true.

What it means for a phone call

A few practical translations, offered without trying to sell you on any of them.

If you have something you've been carrying and you want to hear how it sounds out loud — not write it, not type it, not journal it, but say it and have someone hear it — the neurological case for using voice rather than text is stronger than most people realize. Not just because it's “more personal.” Because you are recruiting a set of predictive and affective circuits that a text thread does not touch.

If you've been feeling worn out by video calls at work and are wondering whether it's you: it's probably not just you. Voice-only is quietly more accurate for empathy, more cognitively spare, and less self-monitoring-heavy. The default of “we should be on camera” is a workplace convention, not a neurological requirement.

And if you've been wondering whether a phone call with an AI friend has any of the effects we know about from human conversation — the honest answer is we don't entirely know yet. The 200-millisecond loop was measured with humans. The empathic accuracy comparison was measured with humans. What comes through when the other voice is an AI is a live question. If you try it and it helps, the neurological story might explain part of why. If you try it and it doesn't, the neurological story doesn't require you to stay.

The rest of the CallByrd journal is the longer answer to why call at all. This piece is the answer to a narrower question: why voice, when you could type. The 200 milliseconds is the answer.

Common questions

What is the 200-millisecond gap in conversation, and why is it neurologically important?
Across at least ten languages studied, the most common gap between one person finishing a turn and the next person starting is about 200 milliseconds, roughly the length of a single syllable. Language production, meanwhile, requires at least 600 milliseconds of planning even for a single word. The math forces a conclusion: listeners must be planning their response DURING incoming speech, not after it. EEG evidence supports this: response-planning neural signatures appear about 500 milliseconds after the critical information for an answer becomes available in an incoming question. This makes conversation a millisecond-scale predictive coupling of two nervous systems, not a serial ping-pong.
Why can't texting produce the same effect if the words are the same?
The predictive machinery locks onto acoustic cues — pitch contour, timing, articulator anticipation — that text does not carry. There is no acoustic onset for the brain to time itself against. Text conversation typically has response times measured in seconds to minutes, not the 200-millisecond couplings of voice. The words may match; the neural coupling does not.
Can voice-only really match a video call?
On at least one measured outcome, voice-only appears to beat video. Kraus (2017) reported five experiments (N=1,772) finding voice-only channels produced higher empathic accuracy than audio-plus-video. The likely reason: video calls introduce nonverbal overload (Bailenson, 2021) — the burden of watching your own face all day, of interpreting exaggerated on-screen expressions, of managing gaze into a camera. Removing the face doesn't remove the person; it removes the interpretation cost. Where video does help — dispersed groups needing shared visual reference, deaf/hard-of-hearing users — voice alone falls short. This is not a claim that voice is always better; it is that voice is enough for a specific kind of connection.
The previous CallByrd essay cited polyvagal theory. Is that still solid?
Partially. The descriptive claim (that vocal prosody, facial expression, gaze, and head orientation coordinate as a social engagement output) is not disputed. What IS disputed is the specific neuroanatomical mechanism polyvagal theory proposes, namely that a uniquely mammalian myelinated ventral vagal pathway makes the whole thing work. Grossman (2023) and others have argued that similar myelinated cardiac vagal fibers exist in sharks, bony fish, and birds as well as in mammals, which weakens the mammal-unique story. Warmly modulated voice does regulate; that observation stands. Why it does is more contested than the wellness literature usually admits.
Does an AI voice trigger the same predictive loop as a human?
Honestly: we don't know yet. Every EEG study cited here was done on human-human dyads. AI voice latency and prosody differ from human speech in ways that may or may not preserve the neural coupling. Anecdotally, users report that a warm AI voice on a phone call feels closer to a real call than a text does — but the neural evidence for that claim is not in yet. If you're calling CallByrd because it feels closer to a conversation than typing does, that's a reasonable read; if you're calling because you were told it produces measured neural synchrony with an AI, that's an overclaim.

Sources

  1. Stivers, T., Enfield, N. J., Brown, P., Englert, C., Hayashi, M., Heinemann, T., Hoymann, G., Rossano, F., de Ruiter, J. P., Yoon, K.-E., & Levinson, S. C. (2009). Universals and cultural variation in turn-taking in conversation. Proceedings of the National Academy of Sciences, 106(26), 10587–10592.
  2. Levinson, S. C., & Torreira, F. (2015). Timing in turn-taking and its implications for processing models of language. Frontiers in Psychology, 6, 731.
  3. Bögels, S., Magyari, L., & Levinson, S. C. (2015). Neural signatures of response planning occur midway through an incoming question in conversation. Scientific Reports, 5, 12881.
  4. Gisladottir, R. S., Bögels, S., & Levinson, S. C. (2018). Oscillatory Brain Responses Reflect Anticipation during Comprehension of Speech Acts in Spoken Dialog. Frontiers in Human Neuroscience, 12, 34.
  5. Krause, P. A., & Kawamoto, A. H. (2021). Motoric Predictability of Articulatory Onset in Naturalistic Dyadic Conversation. Frontiers in Psychology, 12, 684248.
  6. Sander, D., Grandjean, D., Pourtois, G., Schwartz, S., Seghier, M. L., Scherer, K. R., & Vuilleumier, P. (2005). Emotion and attention interactions in social cognition: brain regions involved in processing anger prosody. NeuroImage, 28(4), 848–858.
  7. Kraus, M. W. (2017). Voice-Only Communication Enhances Empathic Accuracy. American Psychologist, 72(7), 644–654.
  8. Bailenson, J. N. (2021). Nonverbal Overload: A Theoretical Argument for the Causes of Zoom Fatigue. Technology, Mind, and Behavior, 2(1).
  9. Schroeder, J., Kardas, M., & Epley, N. (2017). The Humanizing Voice: Speech Reveals, and Text Conceals, a More Thoughtful Mind in the Midst of Disagreement. Psychological Science, 28(12), 1745–1762.
  10. Kumar, A., & Epley, N. (2021). It's Surprisingly Nice to Hear You: Misunderstanding the Impact of Communication Media Can Lead to Suboptimal Choices of How to Connect with Others. Journal of Experimental Psychology: General, 150(3), 595–607.
  11. Stephens, G. J., Silbert, L. J., & Hasson, U. (2010). Speaker–listener neural coupling underlies successful communication. Proceedings of the National Academy of Sciences, 107(32), 14425–14430.
  12. Grossman, P. (2023). Fundamental challenges and likely refutations of the five basic premises of the polyvagal theory. Biological Psychology, 180, 108589.
  13. Lieberman, M. D., Eisenberger, N. I., Crockett, M. J., Tom, S. M., Pfeifer, J. H., & Way, B. M. (2007). Putting Feelings Into Words: Affect Labeling Disrupts Amygdala Activity in Response to Affective Stimuli. Psychological Science, 18(5), 421–428.
  14. Eisenberger, N. I. (2012). The pain of social disconnection: examining the shared neural underpinnings of physical and social pain. Nature Reviews Neuroscience, 13(6), 421–434.
  15. Pennebaker, J. W. (1997). Writing About Emotional Experiences as a Therapeutic Process. Psychological Science, 8(3), 162–166.

If we ever cite something you can't verify, tell us at hello@callbyrd.com.