I Speak Analogue, You Hear Digital

John Higgins

A version of this paper was first presented at a plenary session of the Canadian CALL conference in Guelph, 1989. At that time computer sound cards and speech recognition software existed but were expensive and fairly primitive by present standards.


A speaker uses breath pulses, modulated by the flexible cavities of the vocal tract, to produce cyclical air pressure variations. These spread in waves to where they may encounter a listener and produce minute but infinitely variable distortions of the eardrum. A writer moves a pen or hammers a key to produce an irregular darkening of a white surface, which may then be put in front of a reader in enough light to form an image on the retina. These are essentially analogue processes.

Language, however, imposes a digital system on the analogue facts. Noises are assigned unequivocally to phonemes, shapes to graphemes, and sequences of these to words and sentences. Aided by considerable processing on the part of the receiver, sentences become meanings or messages. Sometimes, of course, the data being received is contradictory or indeterminate; the signal is corrupted and the system has to go into repair mode: "Sorry, what was that?" or "I can't read your sister's handwriting."

The title of this paper is a truism, something that linguists have known for a long time, though they may not have expressed it in those words. It is, after all, the basis of the whole of phonology. It doesn't matter, in a word like lamb, whether I use a clear l, a dark l, or anything in between. Because I am speaking English and you are listening in English, you assign the noise I am making to the /l/ phoneme. If I say ram, you assign the first noise of that word to the /r/ phoneme. But if you are listening in Japanese, you may find that more difficult to do. Listening is digital, but the base is different in different languages.

Think how easy it is to spot the exact moment wo man in einer anderen Sprache zu sprechen anfaengt, und auch genau wo you start to speak the first language again. Funnily enough this becomes even easier when one of the languages in the pair is completely unfamiliar. If I start in English lewkoo gamlang phuud pasaa thai, khun mai khawchai leui theung wela I start speaking English again. I imagine you found it quite easy to identify the crossover points, even without knowing what was going on in the middle. There is no gradual blending. I am speaking English or I am not speaking English.

A digital speaking test

One area in which we could take more notice of the analogue/digital distinction is in testing pronunciation. Most techniques, from the interview down to the most basic read aloud or repeat activities, involve a candidate who goes through the analogue process of making noises and an assessor who makes analogue judgements as to how close the performance is to some criterion:

 "Yes, that was definitely /ɪ/ rather than /i:/";
or  "Your /y/ was insufficiently rounded."

This is totally unlike the decision processes of communication, which is hardly surprising since no communication is taking place. The problem is that the listener knows in advance what the speaker is trying to say. This is where a very simple computer program can help. The computer can set the learner a task, e.g., to read aloud a sentence or answer a question, while the assessor is not told which one of several possible tasks has been set. Thus the question in the assessor's mind is not "Was that a good attempt to say the word ship?" but rather "Was the candidate talking about a ship or a sheep?"

To test single phoneme contrasts, one makes the computer print out a task sheet for the candidate like this:

Read these sentences aloud.

One hardly ever sees any whales nowadays.
The pilots were responsible for many shipwrecks.
You've got to heat it.
Be careful! That heel's dangerous.
I was surprised when he gave me the rice.
The rats needed more room to breed.

(BABAAB)

Each sheet is randomly generated, so the actual sentences will vary from candidate to candidate. Meanwhile the assessor gets a check sheet like this:

One hardly ever sees any (A) veils nowadays.
                         (B) whales

The (A) pilots were responsible for many...
    (B) pirates

You've got to (A) eat it.
              (B) heat

Be careful! That (A) heel's dangerous.
                 (B) hill's

I was surprised when he gave me the (A) rice.
                                    (B) lice.

The rats needed more room to (A) breathe.
                             (B) breed.

The assessor listens to the sentences, checks A or B for each, and then compares the list with what is printed at the bottom of the candidate's sheet. In effect what we are doing here is to have the candidate give the assessor a listening test. We are certainly making the assessor behave more like a listener dealing digitally with the question "What is the candidate trying to tell me?" rather than like a judge dealing in an analogue way with the question "How well can the candidate make that sound?"
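
The generating program need be nothing more elaborate than a random choice between the two members of each pair, printed twice over in different forms. A minimal sketch in Python, with an illustrative item bank and invented function names, might run like this:

    import random

    # Hypothetical item bank: each entry is a sentence frame plus a minimal
    # pair; one member of the pair is chosen at random for each candidate.
    ITEMS = [
        ("One hardly ever sees any {} nowadays.", ("veils", "whales")),
        ("The {} were responsible for many shipwrecks.", ("pilots", "pirates")),
        ("You've got to {} it.", ("eat", "heat")),
        ("Be careful! That {} dangerous.", ("heel's", "hill's")),
        ("I was surprised when he gave me the {}.", ("rice", "lice")),
        ("The rats needed more room to {}.", ("breathe", "breed")),
    ]

    def make_sheets(items):
        """Return the candidate's reading sheet, the assessor's check sheet,
        and the answer key (a string such as 'BABAAB')."""
        candidate, assessor, key = [], [], []
        for frame, (option_a, option_b) in items:
            choice = random.choice("AB")
            key.append(choice)
            word = option_a if choice == "A" else option_b
            candidate.append(frame.format(word))
            assessor.append(frame.format("(A) " + option_a + " / (B) " + option_b))
        return candidate, assessor, "".join(key)

    candidate_sheet, assessor_sheet, answer_key = make_sheets(ITEMS)
    print("Read these sentences aloud.\n")
    print("\n".join(candidate_sheet))
    print("\n(" + answer_key + ")")      # printed at the foot of the candidate's sheet
    print("\nAssessor's check sheet:\n")
    print("\n".join(assessor_sheet))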

There are still problems. The tasks are not necessarily of equal difficulty, and candidates who get the word breed can consider themselves "luckier" than those who get breathe. This may need to be adjusted by having certain contrasts turn up several times, with the harder option more frequent, though not so frequent that it makes the word predictable. If the test is prepared for diverse language backgrounds, a number of items will present no difficulty to particular learners. One might eventually, when records have been amassed, be able to make the test sensitive to specific languages, so that a test sheet for a Japanese learner would contain more r/l contrasts and no s/z contrasts, for instance.
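
If records show which member of a pair learners find harder, the random choice can simply be weighted rather than even. A sketch, with invented figures:

    import random

    # Hypothetical weighting: the harder member of the pair turns up more
    # often than the easier one, but not so often that it becomes predictable.
    PAIR = ("breed", "breathe")
    WEIGHTS = (0.35, 0.65)                      # illustrative figures only

    word = random.choices(PAIR, weights=WEIGHTS, k=1)[0]
    print(word)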

Supra-segmentals

One can to some extent use the same technique for features other than segmental phonemes.

For example:

(Read this sentence as if you are talking to (A) Miss Jenkins / (B) a different person)

This is my colleague, Miss Jenkins.

(Say this sentence as an answer to the question (A) "Was he a clever student?" or (B) "Wasn't he rather a bad student?")

He was an extremely clever student.

You need some ingenuity to devise items that cover a wide range of stress and intonation features, and there are problems with the metalanguage used to describe the tasks. However, it may still be worth including some items of this kind if students have been suitably prepared. For the time being we are waiting for an opportunity to carry out trials to see whether the technique is reliable and economical.

Recognition

Language is riddled with metaphors for the recognition process which emphasise its suddenness: the penny dropped, it clicked, it fell into place. Recognition and understanding occur in quantum leaps, not in a gradual process. A quantum leap, don't forget, though often used to suggest something very large and significant, is actually for physicists something quite minute. The metaphorical quantum leaps that occur in even a short stretch of language defy enumeration. The tininess of these quantum leaps is relevant when it comes to considering how language is understood and learned. If learning consists of being taught, then language seems virtually unlearnable. If you try to enumerate all the separate acts of discrimination and synthesis that the listener carries out in the process of understanding a long utterance, the list quickly becomes longer than any course syllabus. A phonemic transcription is already a huge simplification of what is present and significant in the stream of speech. And yet we do learn language. It cannot be by a separate conscious act of learning for each item.

The analogue/digital paradox

There is a curious paradox here concerning the computer itself. Computers are digital devices, and yet they handle the analogue process of speaking or presenting language much better than the digital process of speech recognition. To do anything with speech they use analogue-to-digital converters, which take the waveform of sound as captured by a microphone and find the closest approximation to its shape representable in whole numbers; these numbers can then be analysed into values for the energy present in various parts of the sound spectrum, or sent back through a digital-to-analogue converter to agitate the cone of a loudspeaker. Computers in fact are pretty good at making digital analyses of sounds, something I am well aware of whenever I play music CDs. They can make a stab at matching waveforms to phonemes, but only in rather gross ways which achieve 95% accuracy with slow careful speech from a known speaker. They do not seem very good at generalising the process of matching, at abstracting what is common to the waveform of the [f] sound in my speech and my wife's. They are still worse at matching sounds to messages, at deciding what speech means. They do it after a fashion, but tentatively and laboriously. Humans seem to do it with little effort and instantaneously.
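
The digitizing chain itself is easy to demonstrate. The sketch below, with a synthetic tone standing in for a microphone signal and entirely arbitrary figures, rounds a waveform to whole numbers and then sums the energy in a few bands of the spectrum:

    import numpy as np

    RATE = 8000                                   # samples per second
    t = np.arange(0, 0.1, 1.0 / RATE)             # 100 ms of signal
    waveform = 0.6 * np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 2200 * t)

    # Analogue-to-digital conversion: closest whole-number approximation
    samples = np.round(waveform * 32767).astype(np.int16)

    # Spectral analysis: energy present in a few parts of the spectrum
    spectrum = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), 1.0 / RATE)
    for low, high in [(0, 1000), (1000, 2000), (2000, 4000)]:
        band = spectrum[(freqs >= low) & (freqs < high)]
        print(f"{low:4d}-{high:4d} Hz: energy {band.sum():.3g}")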

Simple uses of a digitizer

The investigation of the unconscious (for humans) process of making sense out of noise is something I have been interested in for a number of years, and it is a rewarding application of digitizing software. Even the editing facilities of a straightforward sound card can transform introductory phonetics classes. By recording the word eye, reversing it and playing it back, you turn it into something like the word ear (with British pronunciation), thus demonstrating that the vowel element is the /aɪ/ diphthong. I used to do this trick with a loop tape-recorder; believe me, the sound card makes it easier. If you record a word like ham, you can display the oscilloscope trace, showing clearly that there is no silence between the component sounds; the stream of speech is indeed a stream. One can print out the /h/ elements of ham, his and who, showing the way the following vowel colours the sound. You can get simplified spectrograms of the vowels, showing the high second formant in /i/ and the huge burst of energy in the middle of the frequency range for /a/ where the first and second formants come close together. Demonstrations like this bring phonetics to life.
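
The reversal trick itself takes only a few lines once the word has been recorded. A sketch, assuming a 16-bit mono WAV file and using placeholder file names:

    import wave
    import numpy as np

    # Read a recording of the word "eye" (16-bit mono assumed)
    with wave.open("eye.wav", "rb") as infile:
        params = infile.getparams()
        frames = infile.readframes(infile.getnframes())

    samples = np.frombuffer(frames, dtype=np.int16)
    reversed_samples = samples[::-1]              # play the waveform backwards

    # Write it out again; played back, it sounds something like "ear"
    with wave.open("eye_reversed.wav", "wb") as outfile:
        outfile.setparams(params)
        outfile.writeframes(reversed_samples.tobytes())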

Speech-gating

A small research application was undertaken by a graduate student, Anne Graham, who recorded a number of everyday utterances such as:
How about coming for a cup of coffee in the common room?

The library has thousands of books, but there isn't one of them that I really want to read.

A meaningless fragment, just a syllable or so, was picked out of each sentence, and the fragment was then extended outwards in both directions four times in steps of about a quarter of a second. At the sixth step subjects heard the full sentence.

fra
ingfra
omingfracu
comingfracupof
boutcomingfracupofcoff
Howaboutcomingfracupofcoffeeinthecommonroom

Subjects were asked to write down each fragment they heard in any way they could represent it. The object was to see at which point they stopped writing nonsense syllables and started writing words that made sense. There were twenty volunteers: ten native speakers and ten overseas students of intermediate or better standard in English. They were divided into two sub-groups, one hearing the sentences in a well-established context and the other hearing the sentences "cold." All the tests were carried out individually, and subjects could ask for as many repetitions of each fragment as they wanted.
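
The cutting itself is mechanical once the utterance has been digitized. A sketch of the gating procedure, assuming the recording is held as an array of samples (the function and parameter names are mine, and the step sizes only approximate what was actually done):

    def gated_fragments(samples, rate, centre_s, steps=4, step_s=0.25):
        """Cut progressively longer fragments from a digitized utterance,
        working outwards from a centre point in steps of step_s seconds;
        the final element is the complete recording."""
        centre = int(centre_s * rate)
        half = int(0.5 * step_s * rate)           # initial fragment, a syllable or so
        fragments = []
        for i in range(steps + 1):
            width = half + i * int(step_s * rate)
            fragments.append(samples[max(0, centre - width):centre + width])
        fragments.append(samples)                 # sixth step: the whole sentence
        return fragments

    # e.g. six fragments centred on "fra", about 1.3 seconds into the recording:
    # fragments = gated_fragments(samples, rate=22050, centre_s=1.3)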

Several rather unexpected findings came out of this little experiment. One was that, although the penny dropped more or less as predicted for the native speakers, most of whom could predict the whole sentence by the third or fourth fragment, most of the foreign learners never made sense of the utterances and produced highly garbled transcriptions of them. The other interesting point was that knowledge of context seemed to make no measurable difference for either native or non-native speakers. We are still trying to work out the implications of that finding.

Hidden words

The other project that interests me at the moment also concerns listening. The Cambridge Local Examinations Syndicate once used a form of listening test in which candidates heard a sentence and had to decide which of four words actually occurred in it. For instance, they might hear:

I wouldn't do that if I were you.

and see

  1. WOOD
  2. TIFF
  3. FIRE
  4. WERE

Only word 4 occurred, but all the distractors were sound sequences which had occurred in the spoken sentence though not as meaningful units. If the sentence had been a meaningless blur of sound, then they had "heard" all four words, and it was only if they had succeeded in "writing out" the complete sentence in their minds that they could reliably spot the real word among the distractors.
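
The format is easy to mock up. A sketch of how one such item might be presented and scored, with the item written out by hand purely for illustration (in practice the sentence would be played aloud, not displayed):

    import random

    # One hidden-words item, supplied by hand for illustration
    ITEM = {
        "sentence": "I wouldn't do that if I were you.",   # heard, not seen
        "options": ["WOOD", "TIFF", "FIRE", "WERE"],
        "answer": "WERE",             # the only option occurring as a real word
    }

    def administer(item):
        """Show the four options in random order and score the response."""
        options = item["options"][:]
        random.shuffle(options)
        print("Which of these words occurred in the sentence you just heard?")
        for number, word in enumerate(options, start=1):
            print("  " + str(number) + ". " + word)
        choice = int(input("Your answer (1-4): "))
        return options[choice - 1] == item["answer"]

    if administer(ITEM):
        print("Correct!")
    else:
        print("No - listen again.")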

For some reason this testing technique seemed to fall out of favour, and it has not been used to my knowledge since 1969. I have often wondered why. I suspect it is because the technique had low face validity: candidates suspected a trap of some kind and resented the form of the test. This happened to some extent when I used similar types of test in entrance examinations in Turkey. The test itself performed magnificently well as far as the statistics were concerned, but it was unpopular. However, I still feel that the activity supplies reliable indications of the listening skill and might even have applications in training learners to listen. For the moment this is just a hunch, and it needs to be examined; I have several students working on this kind of activity.

The public forms of this test that were used in the '60s all presented isolated sentences out of context. One thing I am keen to investigate is how the test is affected by being carried out with sentences which form a coherent text. One project was carried out in 1989 by a Bristol student on Omani schoolchildren, using texts taken from a local textbook. Preliminary results here suggest that familiarity with content (falling short of recent close study of the text) may have very little effect on performance. In a follow-up study in Stirling conducted by Eiman Marafi of Kuwait, the learning effect of having to listen for specific types of function words has been examined; does an expectation that the key word may be "there" or "can", for instance, affect the way that learners listen for weak forms? Follow this link to see a web-based example of the technique.

Testing and learning

Everything I have described so far has something to do with testing, and may suggest the magisterial side of testing, i.e., something carried out by teachers for teachers. In the long run, however, I hope to see the computer making the testing process so easy that it becomes a self access resource, something which learners go to just because they are inquisitive about their own standard of performance. One can make a reasonable prediction that greater self-awareness is likely to lead to enhanced performance. I do not see the computer functioning as an adviser, telling students what they should do, but rather as a completely neutral instrument which does nothing but report facts. If learners use the machine to discover facts about themselves, then it is up to them rather than the machine to decide what to do next. What I would like to see is a world in which learners turn themselves into critical consumers of test data, not passive receivers of test results.

Meanwhile the computer in the context of my own work is already turning the whole area of pronunciation and phonetics into a more experimental subject, allowing my students to look at sound as well as hear it, and to get rid of some of their preconceptions about, for instance, sounds and spellings, even at the simple level of discovering that doubled letters in spelling are not double articulations. This is part of a general service that computers in their slave role can provide for language learners, namely the opportunity for learners (not just teachers) to experiment. Expressed as a slogan, it puts the trial back into trial-and-error.


John Higgins
Revised Shaftesbury, July 2014.