Defending The Motor Theory Of Speech Perception
May 9, 2012
I delivered the paper below to the Ockham Society at the University of Oxford last night.
Humans have superior abilities to perceive speech rapidly and accurately even in conditions where the signal is of poor quality. These abilities are markedly better than the corresponding abilities in relation to perceiving non-speech sounds. This suggests that there is a special nature to speech perception. The Motor Theory of speech perception was proposed to account for this special nature. It postulates that the mechanical and neural elements involved in the human production of speech are also involved in the perception of speech. This would explain the special nature of speech perception, because humans have the ability to produce speech sounds specifically and not sounds in general. One might illustrate the theory by seeing it as the claim that speech perception is the offline running of the systems that, when online, actually produce speech.
Mole1 does not support the Motor Theory. He agrees that speech is special, but not that it is special in such a way as to support the Motor Theory. In §2 I will challenge Mole’s account in five ways.
In §2.1 I will deny that Mole’s metamer is a relevant counterexample, on the grounds that simultaneity of multiple stimuli is the important distinction in the case of speech. The counterexample purports to show that the mode of lack of invariance in speech is mirrored in visual phenomena and thus not in need of special explanation.
In §2.2, I will deny Mole’s claim that we could never understand how face recognition proceeds from analyzing mathematical retinal arrays by arguing that precisely this is what occurs when researchers describe computer face recognition algorithms. This casts doubt on Mole’s claim that the invariances in speech perception are not special, thus weakening his challenge to a key motivation of the Motor Theory.
In §2.3, I will argue that Mole’s use of cross-modal data from two experiments does not support his claim that McGurk-like effects are seen in areas other than speech perception.
1See [1, Ch. 10].
In §2.4, I will deny Mole’s claim that there is a problem accounting for how persons who cannot speak can understand speech using the Motor Theory.
In §2.5 I will consider additional data that pose a challenge for Mole.
2 Challenges To Mole
2.1 Mole’s Counterexample Is Disanalogous
Any single phoneme will be understood by its hearer despite the fact that there will be many different sound patterns associated with it. It is clearly a very useful ability for humans to be able to ignore – when irrelevant – details about pitch and accent etc. in order to focus purely on the phonemes which convey meaning. This ‘lack of invariance’ is a feature of speech perception but not of sound perception generally, a situation which motivated the proposal of the Motor Theory.
For supporters of the Motor Theory, lack of invariance is evidence that the perceptual object in speech perception is a gesture – the phoneme that the speaker intended to pronounce. Mole does not agree that this lack of invariance can count as evidence for the special nature of speech.2 Mole’s argument is as follows. He agrees that there is not a one-to-one mapping between stimulus and perceived phoneme in speech perception. He then denies that this means that speech perception is special, on the grounds that there is not in general a one-to-one mapping between stimulus and percept in perception. He produces a putative counterexample in vision by noting that metamers exist. Metamers are pairs of stimuli with different spectral compositions that are nevertheless perceived to be the same color. Note that color is defined here by physical spectrum rather than phenomenology. So Mole has indeed produced a further example of a situation where there is no one-to-one mapping between stimulus and percept.
However, this lack of one-to-one mapping is not exactly what supporters of the Motor Theory cite as the source of the special nature of speech perception. That derives instead from the phenomenon of ‘co-articulation’: the way in which we are generally articulating more than one phoneme at a time.3 So while it is indeed the case that multiple stimuli are presented which result in a single percept, it is the temporal overlap between those stimuli that is the key factor, not the mere fact of their multiplicity.
This means that Mole’s metamer counterexample is disanalogous, because it deals only with the multiplicity of the stimuli in the mapping and not with their temporal overlap. We can see this if we imagine a lighting rig that is capable of projecting arbitrary colors and also of projecting more than one color at the same time. In that case, we could not say that the perception of a color being projected at a particular time was changed by the other colors being projected with it. That situation would simply be the projection of a different color. So a projection of red light with green light does not produce a modified red; it produces yellow light. We cannot then say that yellow is a modified red without abandoning any meaning for separate colors altogether – every color would be a modified version of every other color. A rig that projected such a ‘modified red’ is impossible, yet that is what Mole needs to cite to have a genuine counterexample, because it would be a case of multiple stimuli being projected at the same time and resulting in activation of the same perceptual category.

2Mole writes: [1, p. 217] “Even if speech were processed in an entirely non-special way, one would not expect there to be an invariant relationship between [...] properties of speech sounds [...] and phonemes heard for we do not [...] expect perceptual categories to map onto simple features of stimuli in a one-to-one fashion.”

3As Liberman and Mattingly write: [2, p. 4] “coarticulation means that the changing shape of the vocal tract, and hence the resulting signal, is influenced by several gestures at the same time” so the “relation between gesture and signal [...] is systematic in a way that is peculiar to speech”.
In sum, a metamer is an example of a non one-to-one mapping between stimulus and perceptual category where the different stimuli are not simultaneous. A co-articulation is an example of a non one-to-one mapping between stimulus and perceptual category where the different stimuli are indeed simultaneous. Since, for supporters of the Motor Theory, it is that very simultaneity that is the key to the special nature of the systematic relation between gesture and signal, Mole does not have a counterexample.
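The lighting-rig point can be made concrete with a toy additive-mixing model (an illustration of mine, not anything from Mole or the paper): summing red and green channels lands the resulting stimulus squarely in the yellow category, not in a ‘modified red’ one.

```python
# A toy additive-mixing model (illustrative only): projecting red and green
# light together yields a stimulus in a *different* color category (yellow),
# rather than a "modified red".

def mix(*rgbs):
    """Additive mixing: sum each channel across stimuli, clipped to [0, 255]."""
    return tuple(min(255, sum(c[i] for c in rgbs)) for i in range(3))

def category(rgb):
    """Crude nearest-neighbor classifier over a few named color categories."""
    named = {
        "red": (255, 0, 0),
        "green": (0, 255, 0),
        "blue": (0, 0, 255),
        "yellow": (255, 255, 0),
    }
    sq_dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(named, key=lambda name: sq_dist(named[name], rgb))

red, green = (255, 0, 0), (0, 255, 0)
print(category(red))              # prints "red"
print(category(mix(red, green)))  # prints "yellow": a new category, not a modified red
```

The point of the sketch is only that simultaneous stimuli produce a percept in a different category, never a ‘modified’ member of one component’s category.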
2.2 Face Recognition Not Impossible For Computers
Mole appeals to face recognition in support of his claim that we could create the appearance of a special lack of invariance for any categorization task by moving to a low enough level of description. For example, it would be difficult to decide which painting was in question given only a specification of the molecular structure presented by the Mona Lisa. He thus again challenges the idea that the lack of invariance in speech perception is evidence for the special nature of speech perception. His claim is that face recognition is another example of a lack of invariance.
Mole allows that we use invariances in face recognition, but denies that this could ever be understood by examination of retinal data.4 However, this can be questioned as follows. Since the only thing that computers can do in terms of accepting data is to read in a mathematical array, Mole’s claim is in fact equivalent to the claim that it cannot be understood how computers perform face recognition. That claim is false. To be very fair to Mole, his precise claim is only that the task might appear impossible, but I shall now show that since it is in fact possible, it should not appear impossible either.
Fraser et al describe an algorithm that performs the face recognition task better than the best algorithm in a ‘reference suite’ of such algorithms. The computer has a gallery of pictures of faces and a target face. Its task is to sort the gallery such that the target face is near the top. The authors report that their algorithm is successful at performing this task nearly 80% of the time.5
4Mole writes: [1, p. 218] “The invariances which one exploits in face recognition are at such a high level of description that if one were trying to work out how it was done given a moment-by-moment mathematical description of the retinal array, it might well appear impossible.”
5Fraser et al write: [3, p. 836] “We tested our techniques by applying them to a face recognition task and found that they reduce the error rate by more than 20% (from an error rate of 26.7% to an error rate of 20.6%).”.
So we see firstly that the computer can recognize a face. Then we turn to the claim that how the computer does this cannot be understood. That is refuted by the entire paper, which is an extended discussion of exactly that. Since this is an active area of research, we can take it that such understanding is widely available in computational circles.
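The figures Fraser et al quote can be checked with a line of arithmetic (assuming, naturally, that success rate is simply one minus error rate):

```python
# Checking the quoted figures: an error rate falling from 26.7% to 20.6%
# is a relative reduction of just over 20%, and the resulting success
# rate is indeed "nearly 80%".
error_before, error_after = 0.267, 0.206

relative_reduction = (error_before - error_after) / error_before
success_rate = 1 - error_after

print(f"{relative_reduction:.1%}")  # prints "22.8%" (more than 20%)
print(f"{success_rate:.1%}")        # prints "79.4%" (nearly 80%)
```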
It may be true in one sense that we could not efficiently perform the same feat as the computer – in the sense of physically taking the mathematical data representing the retinal array and explicitly manipulating it in a sequence of complex ways in order to perform the face recognition task. In another sense, we could, of course. It is what we do every time we actually recognize a face. The mechanics of our eyes and the functioning of our perceptual processing system have the effect of performing those same mathematical manipulations. We know this because we do in fact perform face recognition using only the retinal array as input data.
This casts doubt on Mole’s claim that there is any level of description of the data in this particular task which would produce an invariance problem; the invariances we exploit may in fact not be at a high level of description. Therefore Mole has not here provided a further example of a lack of invariance, and he has not thereby questioned the specialness of speech perception, which does indeed exhibit a lack of invariance.
2.3 Experimental Data Do Not Show Cross-Modal Fusion
2.3.1 Cello Experiment
Mole argues that an experiment on judgments made as to whether a cello was being bowed or plucked shows the same illusory optical/acoustic combinations as are seen in the McGurk effect. The McGurk effect6 is observed in subjects hearing a /ba/ stimulus and seeing a /ga/ stimulus. The subjects report that they have perceived a /da/ stimulus. It is important to note that this is not one of the stimuli presented; it is a fusion or averaging of the two stimuli. So an optical stimulus and an acoustical stimulus have combined to produce an illusory result.
If Mole’s claim that the cello experiment shows McGurk-like effects is true, this would show that these illusory effects are not special to speech, thus challenging the claim that there is anything special about speech for the Motor Theory to explain.7 Unfortunately, the data Mole cites do not show the same type of illusory combination, and so Mole is unable to discharge the specialness of speech perception as he intends.
The Motor Theory postulates that the gesture intended by the speaker is the object of the perception, and not the acoustical signal produced. The theory explains this by also postulating a psychological gesture recognition module which will make use of the speech production capacities in performing speech perception tasks. Thus the McGurk effect constitutes strong evidence for the Motor Theory, because the theory explains it as the module weighing optical and acoustical inputs in deciding what gesture has been intended by the speaker. This strong evidence would be weakened if Mole could show that McGurk-like effects occur other than in speech perception, because the proponents of the Motor Theory would then be committed to the existence of multiple modules, and their original motivation by the observed specialness of speech would be put in question.

6See [4].

7Mole writes: [1, p. 221] “judgments of whether a cello sounds like it is being plucked or bowed are subject to McGurk-like interference from visual stimuli”.
The paper8 Mole cites describes an experimental attempt to find non-speech cross-modal interference effects using a cello as the source of acoustic and optical stimuli. There are two ways to make a cello produce sound: it can be plucked or it can be bowed. The experimenters proceed by presenting subjects with discrepant stimuli – for example, an optical stimulus of a bow accompanied by an acoustical stimulus of a pluck. The experimenters found that the reported percepts were adjusted slightly by a discrepant stimulus in the direction of that stimulus.
However, to see a McGurk effect, we need the subjects to report that the gesture they perceive is a fusion of a pluck and a bow. Naturally enough, this did not occur, and indeed it is unclear what exactly such a fusion might be. Therefore, Mole has not here produced evidence that there are McGurk effects outside the domain of speech perception.
Mole’s response is to admit that speech perception is special, but to deny that it is special in a way helpful to proponents of the Motor Theory. Mole’s criterion for the type of specialness that will support the Motor Theory is whether it involves a qualitative or a merely quantitative difference.9 As we have seen, Mole is wrong to claim there is only a quantitative difference between the McGurk effect observed in speech perception and the cross-modal effects observed in the cello experiment, because only in the former were fusion effects observed. That is most certainly a major qualitative difference.
Mole’s claim that the cello results are only quantitatively different to the McGurk effect produces further severe difficulties when we consider in detail the experimental results obtained. The experimenters describe a true McGurk effect as being one where there is a complete shift to a different entity – the syllable is reported as clearly heard and is entirely different to the one in the acoustic stimulus.10 The cello data were not able to make a pluck sound exactly like a bow, and in fact the discrepant optical stimuli were only able to slightly shift the responses in their direction, by less than a standard deviation, and in some cases not at all. This is not the McGurk effect at all and so Mole cannot say it is only quantitatively different.11

8See [5]. The authors state prominently in their abstract that their work suggests “the nonspeech visual influence was not a true McGurk effect”, in direct contradiction of Mole’s reason for citing them.

9Mole writes: [1, p. 221] “The McGurk effect does reveal an aspect of speech that is in need of a special explanation because the McGurk effect is of a much greater magnitude than analogous cross-modal context effects for non-speech sounds”.

10Saldaña and Rosenblum [5, p. 409] describe these McGurk data as meaning: “continuum endpoints can be visually influenced to sound like their opposite endpoints”.

11As Saldaña and Rosenblum [5, p. 410] put it: “[t]his would seem quite different from the speech McGurk effect”.
In sum, the cross-modal fusion effect that Mole needs is physically impossible in the cello case, and the data actually found do not even represent a non-speech analog of the McGurk effect, as is confirmed by the authors.
2.3.2 Sound Localization Experiment
The other experiment that Mole cites also fails to bear out his claim that there are McGurk-like effects outside the domain of speech perception. The second experiment12 concerns the ventriloquism effect, whereby the perceived location of a sound is different to its actual source as a result of discrepant optical data. As above, the result that Mole needs is an effect that is a good analog of the McGurk effect in a non-speech domain.
This experiment uses tones and lights as its acoustic and optical stimuli. It investigates the ventriloquism effect quantitatively in both the spatial and temporal domains. The idea is that separate optical and acoustic events will tend to be perceived as a unified single event with optical and acoustical effects. This will only occur if the spatial or temporal separation of the component events is below certain thresholds. The authors of the paper Mole cites, Lewald and Guski, have proposed13 a “spatio-temporal window for audio-visual integration” within which separate events will be perceived as unified.
Lewald and Guski suggest maximum values of 3° for spatial separation and 100 ms for temporal separation. Thus a scenario in which a light flash occurs less than 3° away from the source of a tone burst, and within 100 ms of it, will produce a unified percept of a single optical/acoustical event. Since the two stimuli in fact occurred at slightly different times or locations, this effect entails that at least one of the stimuli is perceived to have occurred at a different time or location than it actually did.
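The window can be sketched as a simple predicate (an illustration of mine, using the threshold values quoted above; it assumes, as seems natural, that integration requires both separations to fall inside the window):

```python
# A minimal sketch of Lewald and Guski's "spatio-temporal window for
# audio-visual integration", with the 3° and 100 ms values quoted above.
# Assumption (mine): integration requires the discrepant stimuli to fall
# inside BOTH the spatial and the temporal window.
MAX_SPATIAL_SEPARATION_DEG = 3.0
MAX_TEMPORAL_SEPARATION_MS = 100.0

def perceived_as_unified(separation_deg: float, delay_ms: float) -> bool:
    """True when a light flash and a tone burst are close enough in space
    and time to be perceived as a single audio-visual event."""
    return (abs(separation_deg) <= MAX_SPATIAL_SEPARATION_DEG
            and abs(delay_ms) <= MAX_TEMPORAL_SEPARATION_MS)

print(perceived_as_unified(2.0, 50.0))   # prints "True": inside both windows
print(perceived_as_unified(5.0, 50.0))   # prints "False": spatially too far apart
```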
In the McGurk effect, discrepant optical and acoustic stimuli result in a percept that is different to either of the two stimuli and is a fusion of them. We may allow to Mole that Lewald and Guski do indeed report subjects perceive a single event comprising a light flash and a tone burst. However, that is insufficient to constitute an analogy to the McGurk effect. Subjects do not report that their percept is some fusion of a light flash and a tone burst – as with the cello experiment, it is unclear what such a fusion could be – they merely report that an event has resulted in these two observable effects.14
Indeed, the subjects were not even asked whether they perceived some fused event. They were asked whether the sound and the light had a common cause, were co-located, or were synchronous.15
11As Saldaña and Rosenblum [5, p. 410] put it: “[t]his would seem quite different from the speech McGurk effect”.
12See [6].

13See [6, p. 469].
14We should note that Lewald and Guski do not take themselves to be searching for non-speech analogs of the McGurk effect; the term does not appear in their paper or their 88 references.
15As Lewald and Guski write: [6, p. 470] “In Experiment 1, participants were instructed to judge the likelihood that sound and light had a common cause. In Experiment 2, participants had to judge the likelihood that sound and light sources were in the same position. In Experiment 3, participants judged the synchrony of sound and light pulses”. A ‘common cause’ might have been some particular event, but it is not the sound and the light, and those were the only things perceived; therefore the instructions do not even consider the possibility that a fused event was perceived.
Since Lewald and Guski are measuring the extent to which participants agree that a light and a tone had a common cause, were co-located or were synchronous, it is puzzling that Mole cites them to support his claim that perceived flash count can be influenced by perceived tone count.16
Neither this experiment nor the cello experiment supports Mole’s summation:17 “[i]t is not special to speech that sound and vision can interact to produce hybrid perceptions influenced by both modalities” in the way he needs. There are no hybrid perceptions in either case, if that means ‘a perception of an event which is neither of the stimulus events’, as is seen in the McGurk effect. There are cross-modal effects between non-speech sound stimuli and optical stimuli, but that is inadequate to support Mole’s claim that speech is not special.
2.4 Mute Perceivers Can Be Accommodated
One of Mole’s challenges is that the Motor Theory cannot explain how some people can have the capacity to perceive speech that they lack the capacity to produce.18
There is an equivocation here, though, on what is meant by ‘capacity to produce’. Mole reads that term so that the claim under attack is that someone who is unable to use their mouth to produce speech thereby lacks the capacity to perceive speech. Since such mute people can indeed, as he claims, understand speech, he takes his objection to be made out.
However, in the article cited by Mole, it is clear that this is not what is understood by ‘capacity to produce’. In the Fadiga study described, the neuronal activation related to tongue muscles is not sufficient to generate movement.19 Thus the question is whether the subject has the capacity to produce such a sub-threshold activation, and not the capacity to produce speech via a super-threshold activation. Naturally, since all the subjects had normal speech, they could produce both a sub-threshold and a super-threshold activation, with the latter resulting in speech.
However, someone could be able to activate their tongue muscles below the threshold to generate overt movement but not be able to activate those muscles above the threshold. That would mean that they lacked ‘capacity to produce’ in Mole’s sense, but retained it in Fadiga’s sense. This would be a good categorisation of the mute people who can understand speech they cannot utter. Those people would retain the ability to produce the neural activity that Fadiga observes, which does not result in tongue muscle movement.
16Mole writes: [1, p. 221] “The number of flashes that a subject seems to see can be influenced by the number of concurrent tones that he hears (Lewald and Guski 2003)”.
17See [1, p. 221].
18Mole writes: [1, p. 226] “Any move that links our ability to perceive speech to our ability to speak is an unappealing move, since it ought to be possible to hear speech without being able to speak oneself”.
19Fadiga writes: [7, Fig. 1] “The observed motor facilitation is under-threshold for overt movement generation, as assessed by high sensitivity electromyography showing that during the task the participants’ tongue muscles were absolutely relaxed”.
Similarly, we can resolve Mole’s puzzle about how one can understand regional accents that one cannot mimic. The capacity to understand results from our ability to generate sub-threshold activations of that form. We nevertheless do not have the ability to produce super-threshold activations of that form and actually speak in the regional accent because that is just what it is to be unable to do so. Had we acquired that regional accent, it would exactly mean that our super-threshold muscle activation capacities were of the required form.
We may imagine here an analogy between speech perception and speech production. Mole has already outlined20 how infants can perceive all speech sound category distinctions, but eventually lose the ability to discriminate the ones that do not represent a phoneme distinction in their language. So we may postulate that all infants are born with the neural capacity to learn to generate super-threshold activations of all regional accents, but eventually retain that capacity only at the sub-threshold level – because they can later understand regional accents – and lose the capacity at the super-threshold level – for those regional accents they cannot mimic.
Mole is describing the apparent problem of accounting for perceptual capacity in those that lack productive ability because it is what motivated Liberman and Mattingly to revise the Motor Theory. That theory as revised became a claim about a model of the vocal tract as opposed to a claim about the vocal tract. Yet given the preceding analysis, we can see that a better way for the revision to be understood is that the vocal tract can function as a model of itself. This means that the sub-threshold activation functions as a model of the super-threshold activation, or in other words, that perceptual capacities involve the former modelling the latter exactly as the Motor Theory predicts.21
2.5 Further Brief Challenges To Mole
2.5.1 Cerebellar Involvement In Dyslexia
Mole cites a paper by Ivry22 in order to make the case that supporters of the Motor Theory see lack of invariance as the theory’s sole major explanandum. We may note that citing a two-page letter written in response to a target article may mean that the authors lacked the space to convey all of their views on what the theory explains. In any case, the target article to which Ivry et al refer shows that 80% of dyslexia cases are associated with cerebellar impairments. Since the cerebellum is generally regarded as a motor area, and dyslexia is most definitely a language disorder, we have clear evidence for a link between the two areas. That is naturally a result that can be clearly accommodated by the Motor Theory. It is not open to Mole to claim that the link is only between motor control areas and writing skills, because although writing skills are the primary area of deficit for dyslexic subjects, the authors also found impairments in reading ability to be strongly associated with the cerebellar impairments.

20See [1, p. 216].

21Note that this approach does not commit the Motor Theory to the modelling/perception neurons controlling the sub-threshold activations being the same as the production neurons controlling speech production.

22See [1, p. 216] for Mole’s citation of [8].
2.5.2 Links Between Speech Production And Perception In Infants
Mole does not address some important results supplied by Liberman and Mattingly23 that link perception and production of speech. These data show that infants preferred to look at a face producing the vowel they were hearing rather than the same face with the mouth shaped to produce a different vowel. That effect is not seen when the vowel sounds were replaced with non-speech tones matched for amplitude and duration with the spoken vowels. What this means is that the infants are able to match the acoustic signal to the optical one. In a separate study, the same extended looking effect was seen in infants when a disyllable was the test speech sound. These data cannot be understood without postulating a link between speech production and speech perception abilities, because differentiating between mouth shapes is a production-linked task – albeit one mediated by perception – and differentiating between speech percepts is a perceptual task.
2.5.3 Neural Stimulation Of Motor Areas Enhances Perception
An experiment was conducted in which TMS – Transcranial Magnetic Stimulation – was applied to areas of the brain known to be involved in motor control of articulators. Articulators are the physical elements that produce speech, such as the tongue and lips. After the TMS, the subjects were tested on their abilities to perceive speech sounds. It was found that the stimulation of speech production areas improved the ability of the subjects to perceive speech. The authors suggest that the effect is due to the TMS priming the relevant neural areas such that they are more liable to activation.
Even more remarkably, the experimenters find more fine grained effects such that stimulation of the exact area involved in production of a sound enhanced perceptual abilities in relation to that sound.24 This constitutes powerful evidence for the Motor Theory’s claim that the neural areas responsible for speech production are also involved in speech perception.
23See [2, p. 18].
24As D’Ausilio et al [9, p. 383] report: “the perception of a given speech sound was facilitated by magnetically stimulating the motor representation controlling the articulator producing that sound, just before the auditory presentation”.
[1] M. Nudds and C. O’Callaghan, Sounds and Perception: New Philosophical Essays. Oxford University Press, 2010.

[2] A. M. Liberman and I. G. Mattingly, “The motor theory of speech perception revised,” Cognition, vol. 21, pp. 1–36, Oct. 1985.

[3] A. M. Fraser, N. W. Hengartner, K. R. Vixie, and B. E. Wohlberg, “Classification modulo invariance, with application to face recognition,” Journal of Computational and Graphical Statistics, vol. 12, no. 4, pp. 829–852, 2003.

[4] H. McGurk and J. MacDonald, “Hearing lips and seeing voices,” Nature, vol. 264, no. 5588, pp. 746–748, 1976.

[5] H. M. Saldaña and L. D. Rosenblum, “Visual influences on auditory pluck and bow judgments,” Perception & Psychophysics, vol. 54, no. 3, pp. 406–416, 1993.

[6] J. Lewald and R. Guski, “Cross-modal perceptual integration of spatially and temporally disparate auditory and visual stimuli,” Brain Res Cogn Brain Res, vol. 16, pp. 468–478, May 2003.

[7] L. Fadiga, L. Craighero, G. Buccino, and G. Rizzolatti, “Speech listening specifically modulates the excitability of tongue muscles: a TMS study,” Eur J Neurosci, vol. 15, no. 2, pp. 399–402, 2002.

[8] R. B. Ivry and T. C. Justus, “A neural instantiation of the motor theory of speech perception,” Trends Neurosci, vol. 24, no. 9, pp. 513–515, 2001.

[9] A. D’Ausilio, F. Pulvermüller, P. Salmas, I. Bufalari, C. Begliomini, and L. Fadiga, “The motor somatotopy of speech perception,” Current Biology, vol. 19, pp. 381–385, Mar. 2009.