(from English Phonetics, the journal of the EPSJ, Vol 21, January 2017, a Festschrift for Professor Jack Windsor Lewis on the occasion of his 90th birthday.)

    English homographs and text-to-speech algorithms

    John Higgins (Stirling University, retired)

    I had the good fortune to get to know Jack Windsor Lewis in Oslo in 1966 and we have remained friends ever since. We collaborated on a book of exercise material for my adult education learners (Higgins and Lewis, 1968) while my wife worked with him preparing his English pronouncing dictionary (Lewis, 1972). His excellent Guide to English Pronunciation (1969) was what aroused my interest in homographs, especially Chapter Five on word rhythm. This book has, by the way, perhaps the most striking cover design of any book in the field of phonetics.

    *****

    The anomalous English spelling system has two troublesome by-products. The first is the existence of nearly three thousand homophones, words spelled differently but pronounced alike, such as fair/fare. These lead to many errors in writing, among native speakers as well as foreign learners, some of the commonest being the confusion of their with there, or principle with principal. Homophones are also a substantial problem in automatic speech recognition, and many hilarious errors have been reported when such software has been used to generate sub-titles during live broadcasts: The air to the thrown prints Charles, for example. Not that English is unique in this respect. French, too, is notorious for its homophone problems.

    A second by-product, less problematical and more local to English, is the occurrence of homographs, words spelled alike but pronounced differently, such as wound (past tense of “to wind” versus “injury”) or moped (“was miserable” versus “small motorcycle”). These are sometimes known as heteronyms. There are around six hundred and fifty of these, listed on my website http://www.minpairs.talktalk.net/graph.html. Although they may occasionally be mispronounced, they rarely cause much misunderstanding. However, they must cause problems for anyone who aims to create an accurate text-to-speech algorithm for a computer. Not only must two pronunciations be stored for the one word, but there must also be a means of selecting the relevant one. This entails either picking up contextual markers or using real-time parsing of the whole utterance.

    To see how well this is done at present, I selected four devices with inbuilt algorithms, namely a third-generation Kindle e-reader from 2010 and an eighth-generation model from 2016, the Narrator software in Windows 10, and the voice on my Apple iPad using Apple iOS9. Each of these offers a single female voice, though I believe alternative male voices are available, certainly for Windows.

    I also chose four on-line companies out of many offering text-to-speech services: Acapela, Ivona, Oddcast and ReadSpeaker, all of which offer a demonstration mode on their web pages. These companies offered a selection of voices, male, female and child, with a range of regional accents. In some cases thirty or more different voices were selectable. What I soon learned was that the voices did not all perform consistently on my test sentences. Clearly they were not drawing on the same dictionary resource. I did not have the time to run every sentence through every voice, so I normally took a sample of three or four voices, recording diversity when I observed it.

    I have anonymised the results to some extent rather than identifying individual suppliers, since I do not wish to promote one product over another; anyone who wants to assess an individual company’s product can replicate my tests if they wish. In the tables below, the first two columns are the results of the two generations of Kindle e-reader, then Windows 10, followed by Apple, while columns E to H are the four on-line companies though not in the order listed above.

    Read

    I began with perhaps the severest test of all, the choice of /ri:d/ versus /red/ as pronunciations of read. Interestingly this is something that native speakers hardly ever get wrong in spontaneous speech and rarely in reading aloud. When they do make an error in reading, it is usually corrected spontaneously except in highly ambiguous contexts. This is an interesting fragment of evidence to suggest that our mental lexicon is stored as an inventory of sounds, with little dependence on spelling. If software is to match this, it must use a full parse that can distinguish a noun from a verb, present from past, and a past participle from an infinitive, rather than simply using proximity testing for to or will or have/had.
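
    As an illustration of why proximity testing alone falls short, here is a minimal Python sketch. The cue words are those named above; the function name read_by_proximity, the three-word look-back window and the /i:/ default are assumptions made for the example, and it is not a description of how any of the engines tested here actually work.

```python
import re

# Cue words named in the paragraph above; the window size is an assumption.
PAST_CUES = {"have", "has", "had"}
BASE_CUES = {"to", "will"}


def read_by_proximity(sentence, window=3):
    """Guess 'ri:d' or 'red' for each 'read' using nearby cue words only."""
    words = re.findall(r"[a-z']+", sentence.lower())
    guesses = []
    for i, word in enumerate(words):
        if word != "read":
            continue
        guess = "ri:d"  # default when no cue is found in the window
        # Scan backwards from the word so that the closest cue wins.
        for cue in reversed(words[max(0, i - window):i]):
            if cue in PAST_CUES:
                guess = "red"
                break
            if cue in BASE_CUES:
                guess = "ri:d"
                break
        guesses.append(guess)
    return guesses


print(read_by_proximity("I have read it already."))   # ['red']
print(read_by_proximity("I don't want to read it now."))   # ['ri:d']
print(read_by_proximity(
    "The contents of this shelf are restricted to previously read books."
))   # ['ri:d'], although a native speaker says /red/
```

    The last call shows the weakness: the nearest cue to the participle is “to”, so the sketch wrongly returns /ri:d/, where a parse that recognised a participle modifying books would give /red/.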

    I started with four sentences with verbs which were unequivocally /i:/: unmarked present, infinitive after a modal verb, infinitive with to, and imperative.

    When I read his books, I am never disappointed.

    I will read it when I have time.

    I don’t want to read it now.

    Here’s her new book. Read it and then pass it on.

    As expected, these gave no problems. I followed this with four which were equally certainly /e/, though one of them was deliberately tricky, including a “to” in front of a “read” past participle to smoke out algorithms which rely on “to” to identify an infinitive.

    I have read it already.

    Have you ever read his first book?

    When he read it, he was bowled over.

    The contents of this shelf are restricted to previously read books.

    Sentence | Kindle 2010 | Kindle 2016 | Windows 10 | Apple | E  | F  | G  | H
    1        | e           | e           | e          | e     | e  | e  | e  | e
    2        | i:          | i:          | e          | i:    | i: | i: | i: | i:
    3        | e           | e           | i:         | e     | i: | e  | e  | e
    4        | i:          | i:          | i:         | i:    | i: | i: | i: | i:

    These results were very disappointing. All the engines picked up have read, but only three of the eight recognised the question form in Have you ever read. The on-line engines picked up /e/ in he read, but the built-in algorithms all failed this one. Rather as I expected, none of them escaped the trap I set in the last sentence.

    I then tried four sentences which were less clear cut and might cause even a native speaker to hesitate.

    When I go on holiday I read books like that.

    When I used to go on holiday I read books like that.

    I read almost everything she wrote.

    I read almost everything she writes.

    Here, again, results were disappointing with all the algorithms playing safe with /i:/ apart from one internet engine which inexplicably played safe in the opposite direction, giving /e/ for all.

    Finally, for completeness, I included two cases of nouns (always /i:/).

    It’s a good read.

    We need some good reads around Christmas.

    As expected, these caused no problems, with /i:/ being used in all the algorithms. However, the conclusion seems to be that very little real-time parsing is being applied. With few exceptions, the engines seem to rely on a default pronunciation plus a short list of stored contexts, perhaps storing the phrases he/she read and have/has read with the /e/ pronunciation and everything else with /i:/.
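
    This conjecture can be made concrete with a small Python sketch. The phrase list and the names DEFAULT, STORED_PHRASES and read_by_lookup are hypothetical, chosen only to mirror the guess above; the real engines may work quite differently.

```python
import re

DEFAULT = "ri:d"

# Hypothetical exception list: only these two-word phrases get /red/.
STORED_PHRASES = {
    ("he", "read"): "red",
    ("she", "read"): "red",
    ("have", "read"): "red",
    ("has", "read"): "red",
}


def read_by_lookup(sentence):
    """Return one pronunciation per 'read', using stored phrases or the default."""
    words = re.findall(r"[a-z']+", sentence.lower())
    results = []
    # Pair every word with the word immediately before it.
    for previous, word in zip([""] + words, words):
        if word == "read":
            results.append(STORED_PHRASES.get((previous, word), DEFAULT))
    return results


print(read_by_lookup("I have read it already."))               # ['red']
print(read_by_lookup("When he read it, he was bowled over."))  # ['red']
print(read_by_lookup("Have you ever read his first book?"))    # ['ri:d']
```

    With this table the sketch gets have read and he read right, but gives /i:/ for Have you ever read and for previously read books, since neither matches a stored phrase.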

    Other homographs

    The 650 English homographs can be broken down into several different categories, not all of equal importance. I have made a more or less random selection of items from these groups and assembled sentences to display the contrast. In the tables below I have used the following symbols:

    the symbol + shows clearly distinguished outputs matching what one expects from a native speaker, e.g. ‘Please /ri`ko:d/ my objection. You did not make a /`reko:d/ last time.’

    the symbol ~ shows the reverse of this, both words wrong, e.g. ‘Please /`reko:d/ my objection. You did not make a /ri`ko:d/ last time.’

    the symbol < shows the first pronunciation used for both words, e.g. ‘Please /ri`ko:d/ my objection. You did not make a /ri`ko:d/ last time.’

    and the symbol > shows the second pronunciation used for both words, e.g. ‘Please /`reko:d/ my objection. You did not make a /`reko:d/ last time.’
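
    The four symbols amount to a simple comparison of what was expected with what was heard, which can be sketched in Python as follows. The function names classify and cell are hypothetical, and the idea that a cell holding several symbols records several voices of one engine disagreeing is my reading of the tables, not a stated rule.

```python
def classify(expected, heard):
    """Return '+', '~', '<' or '>' for one pair of outputs, or '' if none fits.

    expected and heard are (first, second) pronunciation pairs; the two
    expected pronunciations are assumed to differ.
    """
    first, second = expected
    if heard == (first, second):
        return "+"   # both right
    if heard == (second, first):
        return "~"   # both wrong, the distinction reversed
    if heard == (first, first):
        return "<"   # first pronunciation used for both words
    if heard == (second, second):
        return ">"   # second pronunciation used for both words
    return ""


def cell(expected, heard_by_voice):
    """Combine the symbols observed across several voices of one engine."""
    symbols = {classify(expected, heard) for heard in heard_by_voice} - {""}
    return " ".join(s for s in "+~<>" if s in symbols)


record = ("ri`ko:d", "`reko:d")   # verb first, noun second, as in the example
print(classify(record, ("`reko:d", "`reko:d")))                          # >
print(cell(record, [record, ("ri`ko:d", "ri`ko:d"), ("`reko:d", "`reko:d")]))  # + < >
```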

    Double stress

    One minor category is the set of double-stress words (or variable stress as they are called in JWL’s Guide, p. 53). These are words which vary according to their position, front-stressed before a noun or end-stressed when final in the phrase. These include individual words such as afternoon, upstairs, outside, routine, as well as numbers from 13 to 99 apart from multiples of 10, nationality adjectives ending in -ese, and many compound adjectives such as easy-going or home-made. The difference is usually fairly clear in such examples as “an outside toilet” opposed to “Take it outside, please.” However, there may be a tendency towards greater front-stressing. Words like downhill, princess or sardine, which are often assigned to this category, no longer sound wrong when front-stressed in sentences like “They’re going downhill”, “She’s a real princess” or “I can’t stand sardines.”
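
    The positional rule just described can be sketched in Python. This is a deliberately crude illustration: the word list is taken from the examples above, while TOY_NOUNS is a hypothetical stand-in for a real part-of-speech tagger and stress_pattern is an invented name. Anything not directly followed by a known noun is treated as phrase-final.

```python
import re

DOUBLE_STRESS = {"afternoon", "upstairs", "outside", "routine"}
TOY_NOUNS = {"toilet", "tea", "window", "check"}   # hypothetical toy noun list


def stress_pattern(text):
    """Return (word, 'front' or 'end') for each double-stress word in text."""
    # Keep punctuation as separate tokens so phrase boundaries stay visible.
    tokens = re.findall(r"[A-Za-z-]+|[.,;:!?]", text)
    results = []
    for i, token in enumerate(tokens):
        if token.lower() not in DOUBLE_STRESS:
            continue
        following = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        # Front stress before a noun, end stress otherwise.
        stress = "front" if following in TOY_NOUNS else "end"
        results.append((token.lower(), stress))
    return results


print(stress_pattern("an outside toilet"))         # [('outside', 'front')]
print(stress_pattern("Take it outside, please."))  # [('outside', 'end')]
```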

    The following sentences were used to test this feature in the eight algorithms:

    Give me the inside story.  You were on the inside, weren’t you?

    Can you speak Chinese?  No, but I can play Chinese whispers.

    He was born in nineteen-twenty-seven. When he came back from the war he was just nineteen.

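    Sentence | Kindle 2010 | Kindle 2016 | Windows 10 | Apple | E | F | G | H
    1        |             | +           | +          | +     | + | + |   | +

    Sentences 2 and 3, non-blank cells in reading order: +, +, +, +, +, +, +, +, +, +, +, +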

    What this shows immediately is the advance made by the Kindle, where the early model had all these words invariably front-stressed. The occasional anomalies I noted in F, G and H are the product of even stressing of the two syllables rather than completely wrong stressing. Generally the results were much better than expected.

    Variable stress

    The second category, much the most numerous, is variable stress where one form, usually a noun or adjective, is front-stressed while a matching verb is end-stressed. In some cases the stress difference is the only distinction, e.g. import, digest or torment. In other cases there is a reduction of the unstressed vowel: affix, conduct, or produce. The 1989 edition of the Advanced Learner’s Dictionary listed nearly 300 words of this type, though it is probably true that some of them are losing the distinction and becoming front-stressed in all contexts, such as decrease, increase, replay, and possibly the -port words, import, export and transport, all of which are often heard front-stressed as verbs. In some cases one of the uses is obscure and might be got wrong by a native speaker, for instance collect as a noun meaning a prayer or second as a verb meaning to send away on temporary duty, so I would not expect the algorithms to make the distinction. However, I included them, along with other more clear-cut cases.

    Her conduct was deplorable. We cannot allow such a person to conduct the choir.

    He got his just deserts. They sent him to the deserts of North Africa.

    He would frequent the bars around Soho, and we had frequent visits in the autumn.

    Take that object away. I object to its presence.

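    Sentence | Kindle 2010 | Kindle 2016 | Windows 10 | Apple | E | F | G | H
    1        |             | +           | +          |       | + | + | + | +

    Sentences 2 to 4, non-blank cells in reading order: +, +, + >, + < >, + <, +, +, +, +, +, + >, +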

    G performed very inconsistently, some of its thirty different voices making the distinctions correctly while others did not. There was an interesting inconsistency in H where the American voices dealt correctly with deserts while the British voices failed.

    At present there is no need to present him with a reward.

    Please record my disagreement. You did not make a record last time.

    I refuse to stay in this room. There is far too much refuse everywhere.

    Then follows the collect for the second Sunday in Advent after which we collect your freewill offerings.

    We will need to second two officers to the security section. Yes, I second that.

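    Sentence | Kindle 2010 | Kindle 2016 | Windows 10 | Apple | E | F | G | H
    1        | +           | +           | +          | +     | + | + | + | +

    Sentences 2 to 5, non-blank cells in reading order: +, +, +, +, +, +, +, +, +, + >, + <, +, +, +, +, +, + >, > <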

    Again G performed very inconsistently. As expected second was wrong throughout, but collect fared better, though the perceived correct versions may have been due to even stressing of the two syllables rather than front stressing.

    –ate words

    She advocates a cautious approach, but she is a poor advocate for her cause.

    He alternates happiness and misery on alternate days.

    There is no need to duplicate your effort. I made a duplicate of the document yesterday.

    She will go into the graduate programme, provided that she graduates next year.

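    Sentence | Kindle 2010 | Kindle 2016 | Windows 10 | Apple | E | F | G | H
    1        | +           | +           | +          | +     | + | + | + | +

    Sentences 2 to 4, non-blank cells in reading order: +, +, +, +, +, +, +, + <, + <, + <, +, +, +, +, +, +, +, +, +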

    I was surprised how well this group was handled since it requires some parsing to distinguish nouns from verbs.

    Voicing

    I will not put up with this abuse any longer. Nor will I let you abuse your friends.

    That was a close call. Close the door, please.

    What’s the use of waiting? We will use up all the sugar.

    I used to enjoy James Bond books. They used up my leisure time well.

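    Sentence | Kindle 2010 | Kindle 2016 | Windows 10 | Apple | E | F | G     | H
    1        |             | +           | +          | +     |   | + | + < > | + >

    Sentences 2 to 4, non-blank cells in reading order: +, +, +, +, +, +, +, +, +, +, + < >, +, +, +, +, +, +, +, +, +, +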

    Again the level of success was gratifying since some parsing must be taking place. Only F used a fully voiced /z/ in “I used to enjoy …”. In G and H there were unexpected differences between the available voices.

    re-

    You will have to re-mark this paper. I don’t like the remark you made at the top.

    You will have to remark this paper. I don’t like the remark you made at the top.

    Sentence    | Kindle 2010 | Kindle 2016 | Windows 10 | Apple | E | F | G | H
    1 (re-mark) |             | +           | +          | +     | + | + | + | +
    2 (remark)  |             |             |            |       |   |   |   |

    There were no problems with the hyphenated spelling other than with the older Kindle, and I assume the same would be true with the other words in this set, re-sign, re-form, re-join, etc. With the hyphen omitted the words are not treated as homographs.

    –ed

    It’s a blessed nuisance. He’s not blessed with much intelligence.

    I never learned how to read books by learned professors.

    Sentence | Kindle 2010 | Kindle 2016 | Windows 10 | Apple | E | F | G   | H
    1        |             |             |            | ~     |   |   | + > | + >

    Sentence 2, non-blank cells in reading order: + <, + <

    I was surprised that this was so poorly handled by most of the algorithms, and at how the iPad got blessed completely the wrong way round.

    French plurals

    We’ll make a rendezvous, but he has missed the last five rendezvous, and I am not hopeful.

    Sentence | Kindle 2010 | Kindle 2016 | Windows 10 | Apple | E | F | G   | H
    1        |             |             |            |       |   |   | < > |

    I was not surprised that there were no /z/ plurals audible here. Many native speakers might hesitate when reading aloud, although in spontaneous speech they would probably put in a /z/ here or even with proper names: “The restaurant stocks a good range of Beaujolais and Chablis.”

    True homographs

    This leaves a large group of homographs arising from a variety of causes. I have selected ten of these more or less at random.

    How do you calculate the arithmetic mean? It doesn’t need much arithmetic.

    The stamp attaches here. It was issued by the attachés at the embassy.

    He often catches sea bass. Then he goes to church and sings bass in the choir.

    She won the bow and arrow contest, then took a bow in front of the audience.

    What does it mean when the does assemble in Richmond Park?

    You entrance me. From your first entrance I was overwhelmed.

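    Sentence | Kindle 2010 | Kindle 2016 | Windows 10 | Apple | E | F | G     | H
    1        |             | +           |            | ~     | + |   | > < ~ | + >

    Sentences 2 to 6, non-blank cells in reading order: +, +, +, +, +, +, +, +, > <, + <, +, +, +, +, + <, + <, < >, +, +, +, +, +, +, +, +, +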

    Arithmetic was better dealt with than expected, but I was again surprised that the iPad made the distinction but in reverse. A big disappointment was the word bow. Only one of the voices in G got it right. To balance that, I was surprised to find all the algorithms succeeding with entrance.

    To finish the discussion he made a fine distinction. Well a finish distinction I would say.

    How can you write French when there is no grave accent on your keyboard? Voltaire would turn in his grave.

    This permit is invalid. This parking space is reserved for an invalid.

    I had quite a job persuading people to read the Book of Job.

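    Sentence | Kindle 2010 | Kindle 2016 | Windows 10 | Apple | E | F | G | H
    1        |             |             |            |       |   |   |   |

    Sentences 2 to 4, non-blank cells in reading order: +, +, +, +, + >, + >, + <, + <, + <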

    The only success with the job/Job distinction was with the American voice in F and some but not all the American voices in G and H. Does this reflect the strength of religious observance in the USA? I was rather mischievous to include the two pronunciations of finish, which defeated all the algorithms as expected. I was disappointed that grave also defeated them.

    Overall the results are mixed. Text-to-speech has made huge advances on the robotic voice we have associated over the last twenty years with Professor Stephen Hawking. While none of the algorithms is yet capable of rendering an extended passage of prose fiction as pleasantly as a trained actor, all of them deliver fully comprehensible speech, often with pleasant natural timbres and an acceptable range of intonation. However, it is clear that the handling of homographs remains problematical. Mistakes rarely cause complete misunderstanding, but they do interfere with smooth comprehension and pleasurable listening. The better results obtained from the 2016 Kindle as against the earlier model show that the field is developing, and by the time this paper appears it may be that many of the specific shortcomings in current software will have been fixed. It is worth remembering that this work does owe something to the pioneering work of phoneticians in the last century among whom Jack Windsor Lewis was a significant figure.

    References

    Higgins, J.J. and Lewis, J. Windsor. Pronunciation and Listening Practice, a language laboratory course. Illustrated by David Handforth. Oslo. Studentersamfundets Fri Undervisnings Forlag, 1968.

    Lewis, J. Windsor. A Guide to English Pronunciation for users of English as a Foreign Language. Oslo, Bergen, Tromso. Universitetsforlaget, 1969.

    Lewis, J. Windsor. A Concise Pronouncing Dictionary of British and American English. London. Oxford University Press, 1972.