PlainTalk

PlainTalk is a group of Apple Speech technologies providing your Macintosh with speech recognition and speech synthesis capabilities. PlainTalk 1.5 Software consists of three components: English Speech Recognition, English Text-to-Speech, and Mexican Spanish Text-to-Speech.

=English Speech Recognition=

===About English Speech Recognition===

With a program using Apple Speech Recognition, you can ask "What time is it?" and hear if you're late for dinner. You can open your spreadsheet by saying "Open the February forecast." Or you can use voice commands to control your starship in a game. The possibilities are as limitless as the human imagination.

Apple Speech Recognition lets your Macintosh understand what you say, giving you a new dimension for interacting with and controlling your computer by voice. You don't even have to train it to understand your voice, because it already understands you, from your very first word. You can speak naturally, without pausing or stopping, and even add your own words (like "Denibian Slime Devil," if you're a Trekkie). Apple's leadership in speech recognition technology makes it possible; we're bringing a whole new dimension to the user interface: speech.

Speech Recognition System Requirements

  • A Power Macintosh with 16-bit sound
  • System 7.5 or later
  • A microphone, such as Apple's PlainTalk Microphone or the built-in microphone on some Apple AudioVision monitors

=English Text-to-Speech=

===About English Text-to-Speech===

English Text-to-Speech (also known as Speech Synthesis) converts text to spoken words. Just imagine having this web page read aloud to you, or having information on your Mac read to you over the phone! These applications, and more, are now possible with PlainTalk speech synthesis technology.

Hear It for Yourself

Apple's English Text-to-Speech supports just about any Macintosh ever made, from the Macintosh Plus to the Power Macintosh. (Our speech synthesizer engines, called MacinTalk 2, 3, and Pro, automatically scale to your system configuration.) You can hear some digitized voice samples as you learn about our synthesizer engines.

English Speech Synthesis System Requirements

  • Any Macintosh or Power Macintosh Computer
  • System Software 6.0.7 or later, depending on voice quality
  • From 300K to 1.5 MB of available RAM, depending on voice quality
  • 5 MB of available hard disk space

=Speech Technologies: An Emerging Trend=

Remember when science-fiction shows like Star Trek showed the fantasy of people interacting with computers by giving spoken commands? Or when the computer named HAL in the science-fiction film 2001 communicated with people by talking to them? These far-out sci-fi fantasies, which seemed so out of reach when they were written, were right on target about how humans would prefer to interact with machines. Today, these examples are not as out of reach as one might think, thanks to modern desktop computer-based speech technologies.

Speech technologies are about talking to a computer and having it talk back to you. Speech-capable computers are widely seen as a coming trend, and many of the large computer and software corporations are working to include these technologies in their products. Apple has been shipping speech technology products for its computers since 1993, giving it a leadership position in providing speech technologies on personal computers. Apple calls this technology PlainTalk; it includes both speech recognition and speech synthesis.

To be able to talk to computers and have them respond accordingly requires a technology called "speech recognition." To have computers talk to you requires "speech synthesis." Apple offers both technologies today and continues to improve recognition accuracy and voice synthesis quality. For Apple, speech technology is more than a novel human-machine interface; it is a path to enhanced user experience and productivity.

Speech is natural; we have been using it all our lives. Using voice commands doesn't require people to master the keyboard or memorize cryptic commands. Thus, computers that can give and receive information via speech communication offer greater productivity potential.

Historically, text-oriented machines have preceded speech-oriented machines, sometimes by centuries. Word recorders (e.g., typewriters) came well before voice/sound recorders (e.g., phonographs and tape recorders). Computers have also followed that cycle. Since 1947, computers have been text-oriented machines that we wrote to and read from. Now, less than 50 years later, we are on the brink of a transition to speech interaction with personal computers, and Apple is leading the way.

=Speech Recognition=

===How a Computer Recognizes Speech===

Automatic Speech Recognition (ASR) is the process by which a computer listens to the human voice communicating a message, and converts it into written text and/or commands. This process is made up of several subprocesses: analog-to-digital conversion, signal processing, and recognition search. This may be followed by an optional natural language understanding process.

The analog-to-digital conversion process captures the analog input signal (continuous speech) and converts it into a series of digital values by sampling the speech at regular time intervals. The digitized samples are then passed through a signal processing module.
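
As a concrete illustration of this step (not Apple's code, and with an assumed 22 kHz sample rate typical of Macintosh audio hardware), a C sketch of sampling and 16-bit quantization might look like this:

<pre>
#include <math.h>
#include <stdio.h>

#define SAMPLE_RATE 22050  /* assumed rate; Macintosh hardware commonly used 22 kHz */
#define NUM_SAMPLES 2205   /* 100 ms of audio */

int main(void)
{
    short samples[NUM_SAMPLES];  /* 16-bit digital values */
    int i;

    for (i = 0; i < NUM_SAMPLES; i++) {
        double t = (double)i / SAMPLE_RATE;            /* time of sample i */
        double v = sin(2.0 * 3.14159265 * 440.0 * t);  /* stand-in for the microphone signal */
        samples[i] = (short)(v * 32767.0);             /* quantize to 16 bits */
    }
    printf("captured %d samples at %d Hz; first sample = %d\n",
           NUM_SAMPLES, SAMPLE_RATE, samples[0]);
    return 0;
}
</pre>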

The signal processing module further simplifies the input into a sequence of patterns. These patterns represent the signal in a manner similar to that used by the human auditory system.

The recognition search process takes these patterns and compares them to a set of templates or models that has been created from several sources of knowledge such as acoustic models, language models, and dictionaries. The acoustic model is made up of the patterns of speech sounds, such as phonemes (the smallest units of sound) or words. The language model incorporates the set of words and sentences that are allowed and expected within the context of the application.

The output of the search process is a group of words corresponding to the best match found in the set of templates. The recognized sequence of words can be used to trigger a command or script. This, in turn, can initiate an action such as creating, retrieving, or printing a document.
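
The search itself can be pictured as scoring the input pattern against each stored template and keeping the best match. The C sketch below is purely illustrative: the words, feature values, and Euclidean distance measure are invented for the example, and real recognizers use probabilistic acoustic and language models rather than raw distances.

<pre>
#include <float.h>
#include <stdio.h>

#define PATTERN_LEN 8   /* length of one simplified feature pattern */
#define NUM_WORDS   3

/* Squared Euclidean distance between two feature patterns. */
static double distance(const double *a, const double *b)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < PATTERN_LEN; i++) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

int main(void)
{
    const char *words[NUM_WORDS] = { "open", "close", "print" };
    double templates[NUM_WORDS][PATTERN_LEN] = {
        { 0.1, 0.8, 0.6, 0.2, 0.1, 0.0, 0.0, 0.0 },
        { 0.7, 0.2, 0.1, 0.5, 0.9, 0.3, 0.1, 0.0 },
        { 0.3, 0.3, 0.9, 0.8, 0.2, 0.6, 0.4, 0.1 }
    };
    double input[PATTERN_LEN] = { 0.1, 0.7, 0.6, 0.3, 0.1, 0.1, 0.0, 0.0 };
    double best = DBL_MAX;
    int bestWord = 0, i;

    /* Score the input against every template; keep the closest. */
    for (i = 0; i < NUM_WORDS; i++) {
        double d = distance(input, templates[i]);
        if (d < best) { best = d; bestWord = i; }
    }
    printf("best match: \"%s\"\n", words[bestWord]);
    return 0;
}
</pre>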

Natural language processing takes the recognized sequence of words, and attempts to interpret and extract their meaning in the context of the application. This interpretation is then used to effect underlying actions.

===Features of a Good Speech Recognition System===

The ideal speech recognition system is characterized by the following features and capabilities: speaker independence, continuous speech, and flexible vocabulary.

A speaker-independent system can typically recognize the speech of any person without requiring that the user go through a lengthy training period in order for the system to understand his or her voice. This capability is critical for certain applications, such as command-and-control (controlling the computer with voice commands) and over-the-phone speech applications, where the interaction is short and the user requires an immediate action by the computer.

Continuous speech capability enables the user to talk naturally, in whole phrases and sentences, without inserting unnatural pauses between words. Recognition of natural, continuous speech is absolutely essential in order for speech recognition to become widely adopted as a capability within applications, and a valuable addition to the graphical user interface.

Flexible vocabulary enables an ASR system to recognize almost any set of words, and allows easy customization to suit varying tasks and applications. Flexible vocabulary is achieved through dictionary technologies in which words are modeled as a sequence of subword units, or phonemes. The active vocabulary size determines the effective size of vocabulary that can be reliably recognized at any given time. To be fully functional, a dictation system would need an active vocabulary of around 20,000 words. A command-and-control system, on the other hand, needs to handle only up to a couple of hundred words or phrases at a time.
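
A minimal sketch of the dictionary idea follows, with invented words and informal phoneme spellings (not Apple's actual phoneme symbol set):

<pre>
#include <stdio.h>
#include <string.h>

/* Each word is modeled as a sequence of phonemes, so new words can be
   added to the vocabulary without recording new speech. */
struct DictEntry {
    const char *word;
    const char *phonemes;
};

static struct DictEntry dictionary[] = {
    { "schedule", "S K EH JH UH L"  },
    { "meeting",  "M IY T IH NG"    },
    { "manager",  "M AE N IH JH ER" }
};

int main(void)
{
    const char *query = "meeting";
    size_t i;
    for (i = 0; i < sizeof(dictionary) / sizeof(dictionary[0]); i++) {
        if (strcmp(dictionary[i].word, query) == 0)
            printf("%s -> %s\n", dictionary[i].word, dictionary[i].phonemes);
    }
    return 0;
}
</pre>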

===What Makes Apple's Speech Recognition Unique?===

To date, Apple's ASR focus has been to use speech recognition to augment its graphical user interface for desktop command-and-control and to enable new functionality of the personal computer. Apple leads its competitors in bringing this technology to the market and has been shipping speech products on Macintosh systems since 1993.

Speech recognition, which is a part of Apple's PlainTalk speech product family, is a speaker-independent system that enables a Power Macintosh computer to recognize the voice of virtually any speaker of North American English. To develop this technology, voice samples of over 500 adult speakers, male and female, from several different regions of the continent were recorded. These voice samples were used to develop the acoustic models, allowing accommodation of a wide variety of dialects and voices.

Apple's speech recognition system can recognize continuous speech input. This means that the user can speak naturally, without pauses between words. For example, the command "Schedule a meeting with my manager for tomorrow at 9 A.M." is easily recognized by a calendar program using Apple's speech recognition technology.

PlainTalk has a flexible vocabulary system as well, enabling the easy incorporation of domain-specific terms into an application. It is just as easy to add speech capabilities to an application that helps users schedule meetings as it is to permit speech navigation within a computer-aided design application.

In addition to speaker independence, continuous speech, and flexible vocabulary, PlainTalk has built-in speech recognition capabilities that enable it to operate reliably and flexibly in real-world environments. It has tolerance for common noises like coughs and slamming doors, and for differences between acoustic environments, such as a conference room versus an office.

===Apple's Speech Recognition Applied===

To allow third-party applications to take maximum advantage of the aforementioned capabilities, Apple developed the PlainTalk Speech Recognition Toolbox, which has an application program interface (API) called the Speech Recognition Manager.

To use speech recognition, an application need only make calls to the Speech Recognition Manager. This API provides speech recognition services for applications. By creating a language model, the application can define the words and phrases it is to recognize. For example, an application designed to teach children about different animals could display the pictures of ten animals in a window, one at a time. The application would build a language model containing ten phrases (the names of those ten animals) and recognize when the child spoke the name of the animal shown on the screen.
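
A sketch of that animal-names example in C appears below. The call names (SROpenRecognitionSystem, SRNewRecognizer, SRNewLanguageModel, SRAddText, SRSetLanguageModel, SRStartListening) follow Apple's published Speech Recognition Manager API, but exact signatures should be checked against the headers; the animal list is invented, and error handling plus the callback that receives recognition results are omitted for brevity.

<pre>
#include <SpeechRecognition.h>
#include <string.h>

/* Build a ten-animal language model and start listening. */
static OSErr ListenForAnimals(SRRecognitionSystem *outSystem,
                              SRRecognizer *outRecognizer)
{
    const char *animals[10] = { "lion", "tiger", "zebra", "elephant",
                                "giraffe", "bear", "wolf", "otter",
                                "eagle", "dolphin" };   /* hypothetical list */
    SRLanguageModel animalModel;
    OSErr err;
    int i;

    err = SROpenRecognitionSystem(outSystem, kSRDefaultRecognitionSystemID);
    if (err != noErr) return err;

    err = SRNewRecognizer(*outSystem, outRecognizer, kSRDefaultSpeechSource);
    if (err != noErr) return err;

    /* A language model containing the ten phrases to be recognized. */
    err = SRNewLanguageModel(*outSystem, &animalModel, "animals", 7);
    if (err != noErr) return err;
    for (i = 0; i < 10; i++)
        SRAddText(animalModel, animals[i], strlen(animals[i]), i);

    /* Install it as the active language model and start listening. */
    SRSetLanguageModel(*outRecognizer, animalModel);
    return SRStartListening(*outRecognizer);
}
</pre>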

An application can create several language models and install one of them as the active language model. This is useful if the words or phrases to be recognized might change according to context. Similarly, an application can create language models that contain other language models. A language model is associated with a recognizer, the part of the Speech Recognition Manager that performs the work of recognizing spoken utterances and reporting its results to the application.
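
For context switching, the sketch below (under the same hedged API assumptions as above) swaps the active language model on a recognizer. Pausing recognition around the swap is a conservative choice for this sketch, not a documented requirement, and the context models are assumed to have been built earlier.

<pre>
#include <SpeechRecognition.h>

/* Swap the active language model when the application's context changes.
   contextModel is assumed to have been built already with
   SRNewLanguageModel and SRAddText. */
static OSErr UseContext(SRRecognizer recognizer, SRLanguageModel contextModel)
{
    OSErr err = SRStopListening(recognizer);   /* pause while swapping */
    if (err != noErr) return err;

    err = SRSetLanguageModel(recognizer, contextModel);
    if (err != noErr) return err;

    return SRStartListening(recognizer);       /* resume with the new vocabulary */
}
</pre>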

Speech recognition provides users and developers with a rich set of functionality available as part of the operating system. The next section covers the second half of PlainTalk-speech synthesis-and makes clear what can be accomplished with speech technologies.

=Speech Synthesis=

===How to Make a Computer Talk===

Speech synthesis, or Text-to-Speech (TTS), is the process by which a computer converts any readable text into audible speech. Speech synthesis technology makes it possible to generate spoken output from a computer without requiring prior recordings of a human speaker, and without any restrictions on the vocabulary. This process is made up of several subprocesses: text processing, prosodic processing, and signal processing.

''Text processing''. Some text needs to be expanded and substituted, or "normalized," to ensure that the right words are generated. For example, "$100 million" should not be spoken as "one hundred dollars million." Many abbreviations are ambiguous, such as "Dr. Smith lives on Smith Dr." Text normalization makes sure these and other potentially misinterpreted phrases are spoken properly by the computer.
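
A toy C sketch of rule-based normalization, using only the examples above (real normalizers are far more elaborate and context-sensitive, choosing "Doctor" or "Drive" from the surrounding grammar rather than from fixed phrases):

<pre>
#include <stdio.h>
#include <string.h>

/* Rewrite a few fixed patterns into speakable words. */
struct Rule { const char *from; const char *to; };

static const struct Rule rules[] = {
    { "$100 million", "one hundred million dollars" },
    { "Dr. Smith",    "Doctor Smith" },
    { "Smith Dr.",    "Smith Drive"  }
};

int main(void)
{
    const char *phrases[] = { "$100 million", "Dr. Smith", "Smith Dr." };
    size_t i, j;

    for (i = 0; i < 3; i++) {
        const char *out = phrases[i];
        for (j = 0; j < sizeof(rules) / sizeof(rules[0]); j++)
            if (strcmp(phrases[i], rules[j].from) == 0)
                out = rules[j].to;   /* apply the matching rule */
        printf("%s -> %s\n", phrases[i], out);
    }
    return 0;
}
</pre>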

In addition to generating the right words, the system has to derive their correct pronunciation. In isolation, this is difficult enough. Compare the pronunciation of "ough" in tough, cough, though, thought, through, plough, and thorough. Putting the emPHAsis on the wrong sylLAble is a common mistake among non-native speakers, as well as computers. In addition, the pronunciation of many words depends on their grammatical analysis; for example, "Wind your watch when the wind blows from the west." Names are particularly difficult to pronounce because they often do not follow normal English spelling conventions. Apple's text processing uses extensive context-sensitive grammatical rules, part-of-speech tagging, and a 70,000-word dictionary to address these problems. After the correct words and pronunciations have been generated, the next step is to define the prosody.

''Prosodic processing''. Prosody is the tune, phrasing, rhythm, and emphasis that jointly make the difference between isolated words and connected speech. Getting the prosody right is crucial to conveying the correct meaning. Consider "Sam struck out my friend" versus "Sam struck out, my friend." Apple has implemented a new algorithm for generating the tune in synthetic speech. This markedly improves the quality and naturalness of the speech, making it less robotic, more pleasant to hear, and easier to understand.

''Signal processing''. This is the final stage of speech synthesis: creating an acoustic signal that sounds like a human voice saying the correct words with the desired prosody. This is the most computationally expensive component of the whole process because it requires many mathematical operations per sample of the output speech. However, it is completely software based, with no special hardware requirements. Apple provides a range of voices that balance quality and naturalness against memory and computational requirements.

===What Makes Apple's Speech Synthesis Unique?===

When a computer acquires a speaking voice, it acquires a new degree of personality and accessibility. Speech synthesis provides a more natural and less obtrusive means of providing information to the user, or requesting further information in return. Natural speech output augments the graphical user interface and ushers in a new range of usage for the personal computer. This is increasingly important as screen real estate becomes smaller and more expensive, devices become more portable, and computer access over the telephone becomes pervasive.

Apple's speech synthesis leads the personal computer industry in intelligibility and naturalness. In addition, Apple makes it possible to further optimize the speech quality for any particular application. For example, developers can modify pronunciations, emphasis, or speaking rate according to their specific needs. The footprint and architecture are scalable to fit into a variety of environments and Apple platforms. Equally important, each of the three stages of speech synthesis is customizable by the developer or user to perform optimally within an application scenario.

The prosody can easily be specified by rules that generate the correct intonation for texts within the context of a particular application. These rules can be automatically applied to any text within that application, to better convey the intended meaning. Alternatively, it is possible to "hand-sculpt" the intonation of a specific sentence to create any desired effect.
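
One published mechanism for such hand-sculpting is the Speech Synthesis Manager's embedded commands, delimited by double brackets inside the text itself. The sketch below assumes an already-open speech channel; "slnc" (silence, in milliseconds) and "emph" (emphasis) are documented commands, though the particular values here are illustrative.

<pre>
#include <Speech.h>
#include <string.h>

/* Hand-sculpt one sentence's prosody with embedded commands: "slnc"
   inserts a pause and "emph +" stresses the next word. */
static OSErr SpeakWithProsody(SpeechChannel chan)
{
    const char *text = "Sam struck out [[slnc 300]] [[emph +]] my friend.";
    return SpeakText(chan, text, strlen(text));
}
</pre>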

In addition, it is possible to choose between high-quality male and female voices, or from a wide range of smaller-footprint voices that include several amusing and entertaining personalities. The dictionary can also be customized to contain application-specific jargon. Currently, three synthesis engines are available. All run in software, with no special hardware requirements. They vary in RAM footprint, intelligibility, and naturalness.
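
Voice selection goes through the same published API. A sketch (error handling trimmed, and simply opening a channel with the last voice enumerated) might look like this:

<pre>
#include <Speech.h>
#include <stdio.h>

/* List the installed voices, then open a channel with the last one found.
   A shipping program would let the user choose and check every error. */
static OSErr OpenLastVoice(SpeechChannel *outChan)
{
    VoiceSpec voice;
    short numVoices, i;
    OSErr err = CountVoices(&numVoices);
    if (err != noErr || numVoices == 0) return err;

    for (i = 1; i <= numVoices; i++) {
        VoiceDescription desc;
        GetIndVoice(i, &voice);
        GetVoiceDescription(&voice, &desc, sizeof(desc));
        /* desc.name is a Pascal string: length byte, then characters */
        printf("voice %d: %.*s\n", i, desc.name[0], (char *)&desc.name[1]);
    }
    return NewSpeechChannel(&voice, outChan);
}
</pre>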

''MacinTalk 2''. The smallest synthesizer, MacinTalk 2 is an extremely compact implementation of speech synthesis. It uses the technique known as wavetable synthesis, adopted from music synthesizers. It runs on all Macintosh systems, from a Macintosh Classic running System 6.1 to today's Power Macintosh computers, and has ten voices.

''MacinTalk 3''. More natural and less robotic than MacinTalk 2, MacinTalk 3 is based on an acoustic model of the human vocal tract, and has a wide range of voices including robots, talking bubbles, whispering, and even a voice that sings the text rather than simply speaking it. These voices, 19 in all, require a 33-MHz 68030 or higher processor.

''MacinTalk Pro''. This is the highest-end synthesizer. It is based on samples of real human speech, and so has the most natural and human-sounding voice. This technology is also the most successful for generating female voices. The pronunciations are derived from a dictionary of about 65,000 words, plus the 5,000 most common names in the U.S. The prosody is generated by a state-of-the-art model built on many years of research into the acoustic structure of spoken language.

MacinTalk Pro has three English voices, each capable of three different quality levels. They require a 68040 or higher processor. MacinTalk Pro also has two Spanish voices, each with three quality levels. The Spanish voices require a 68020 or higher processor.

The three different synthesis engines jointly supply a synthesis solution for every Macintosh and every configuration. From the smallest to the largest combination of engine and voice, Apple's speech synthesis provides a range of voices to meet different needs.

Apple's speech synthesis technology also provides multiple voices in multiple languages. This range of voices spans market needs from entertainment and education, to proofreading for desktop publishing, to accessing e-mails and databases over the phone.

Apple's grammatical analysis and text processing enable speech synthesis to cope with the wide variability encountered in normal human-generated text, correctly speaking such items as currency amounts, times, dates, and ambiguous words.

===Apple's Speech Synthesis Applied===

A developer may want an application to be able to speak dialog-box messages to users. A word-processing application or spreadsheet might provide the ability to read back selected portions of a document to help the user check for errors. A calendar might have the capability to read out the day's appointments or announce upcoming meetings. An e-mail system might read out messages over the phone. A multimedia application might use speech synthesis to provide narration of a [[QuickTime]] movie instead of including sampled-sound data on a movie track. Because sound samples can take up large amounts of room on a disk, using text in place of recorded, digitized sound is much more efficient.

It is easy to incorporate speech synthesis into applications by using the PlainTalk Speech Synthesis Toolbox, whose API is called the Speech Synthesis Manager. The Speech Synthesis Manager provides an easy-to-use, well-documented API that is the same for all synthesis engines. Therefore, an application does not need to be changed to accommodate the different voices available on the Macintosh. Similarly, localization to other languages is straightforward because the calls to a synthesis engine that speaks a different language are exactly the same.

At its simplest, speech synthesis requires that only one line of code be added to an application. The Speech Synthesis Manager also makes available more complex features to give an application much more detailed control. For example, an application can change the pitch, volume, or speaking rate of a voice; create, install, and manipulate customized pronunciation dictionaries; or even send the synthesized speech output to a telephone via a GeoPort Telecom Adapter.
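
For example, the single line could be a SpeakString call, and channel-level control looks like the sketch below. The calls are from the published Speech Synthesis Manager API; the rate and pitch values are arbitrary examples.

<pre>
#include <Speech.h>
#include <string.h>

/* The one-line case: SpeakString("\pReady.") speaks a Pascal string with
   the default voice. A speech channel gives finer control; Fixed is a
   16.16 fixed-point type, so (n << 16) represents the value n. */
static OSErr SpeakSlowly(const char *text)
{
    SpeechChannel chan;
    OSErr err = NewSpeechChannel(NULL, &chan);  /* NULL selects the default voice */
    if (err != noErr) return err;

    SetSpeechRate(chan, 150L << 16);   /* about 150 words per minute */
    SetSpeechPitch(chan, 46L << 16);   /* roughly a low male pitch   */
    err = SpeakText(chan, text, strlen(text));
    /* Call DisposeSpeechChannel(chan) once speech has finished. */
    return err;
}
</pre>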

=Solutions Incorporating Speech Technologies=

===Apple's Speech Solution Focus Areas===

Speech benefits a host of applications in significant ways. It allows a more natural interaction through voice input and output. This enriches the user's experience, makes learning more effective, and makes games more fun. Speech can also increase productivity by enabling a new class of solutions that were not previously possible with personal computers. For example, it allows remote access to information on the computer using only a telephone, or dictation to a computer with the spoken words automatically transformed into text. Apple's solution focus areas for speech technologies and products include education, entertainment, computer telephony, and productivity tools.

''Education''. Imagine children learning reading, math, and problem-solving skills by interacting with computers that speak. Imagine learning new languages by having a computer automatically check your pronunciation and give pronunciation feedback. Computers have been shown to be effective educational tools, particularly when they encourage interaction and stimulate motivation. Until recently, educational software has typically required that users already have reading ability. But research has proven that learning is fastest and most effective when it builds on what people already do well (usually speaking and listening) to teach new knowledge and skills. For example, an application that teaches biology to older children via environmental simulation but that requires fluent reading skills could be used by a much younger child if the developer incorporated Apple's speech synthesis. Talking dictionaries with unlimited vocabulary in targeted languages help language students of all ages. Apple's speech technologies open up a whole new way to learn.

''Entertainment''. Imagine, in the heat of a race-car simulation, being able to inform your crew before a pit stop using voice commands, or having the computer speak to inform you of low fuel or a flat tire. Personal computers are being used more and more for entertainment applications. With speech synthesis, developers can add a variety of voice responses or interactive dialogs to their applications. Until recently, this has been limited because of the need for storage-intensive recorded sound bites. With speech synthesis, only the text needs to be stored. And with speech recognition, voice command-and-control features can add a new dimension to interaction, making games more engaging. In entertainment applications such as simulations, where actions are often hand-and-eye intensive, being able to command by voice may be the only logical choice for added control. Apple's speech technologies enable these novel capabilities that can make entertainment applications truly captivating.

''Computer telephony''. Imagine being able to call your Macintosh from any telephone, anywhere in the world, and have it read out your e-mail messages, your day's appointments, or the latest sales figures. Previously, a user on the move had to carry at least a notebook machine or a PDA to communicate with the office computer or send e-mail while in the field. With Apple's speech synthesis, information from remote computers can be obtained with nothing more than a common telephone. No extra equipment costs or hassles are necessary. And with speech recognition, users will soon be able to use voice commands to transcend the limits of a telephone keypad: to quickly navigate databases, access information, or control their message retrieval.

''Productivity tools''. Imagine being able to dictate freely to a computer (memos, letters, and e-mail messages) using natural, fully continuous speech. And imagine having those spoken words transcribed to text, on the screen, in real time. This is part of Apple's vision and solution focus: to make computers easier to use and to enhance productivity. Apple's speech technologies are setting the foundation for practical personal computer-based dictation. The initial focus is on providing dictation solutions for ideographic languages such as Chinese and Japanese, where keyboard entry of text is the most cumbersome and nonintuitive. With computerized voice dictation, the user can significantly increase the effectiveness and efficiency of text entry in these languages, thereby raising productivity.

Other productivity applications include spreadsheet software using text-to-speech to read out a column of numbers to check accounting accuracy, or a word-processing application reading back text for proofreading and for checking the flow of the written piece. As the technology becomes more pervasive, more and more productivity-boosting applications are appearing on the market.

===Speech on the Power Macintosh===

There is a reason why Macintosh computers have led the personal computer market in the adoption of advanced technologies: the Macintosh platform architecture. The Macintosh was the first personal computer to offer an intuitive graphical user interface. It was the computer that revolutionized desktop publishing. Its made-for-multimedia architecture and processor speed are the reasons desktop presentations and multimedia authoring have become so popular on the Macintosh.

The Macintosh operating system has been continuously extended with advanced technology extensions, such as [[QuickDraw]], [[QuickTime]], [[PowerTalk]], PowerShare, and PlainTalk. These permit the integration of advanced technologies into popular applications, and permit users to take advantage of these technologies with little or no additional investment. The [[Power Macintosh]] family offers more functionality and performance to provide a robust platform that supports these enhanced applications.

Speech recognition requires digital signal processing, and the Power Macintosh has that ability built in. There's no need to add costly add-in boards with special digital signal processing (DSP) chips. The reason is the PowerPC chip, with its reduced instruction set computing (RISC) architecture. This processor has so much horsepower that it can execute DSP functions written in software, such as the digital signal processing code written into PlainTalk; this ability is known as native signal processing, or NSP. Other platforms that don't utilize the PowerPC chip may need an add-on DSP card because the processor is not powerful enough to do native signal processing.

Speech recognition needs a recognition search engine, and the PlainTalk extension to the operating system provides one. Speech synthesis needs speech synthesizers, and PlainTalk provides those, too. Ultimately, speech synthesis must create the sound of a voice, and the Sound Manager operating system extension works hand-in-hand with the speech synthesizer to produce those sounds. Developers of speech-enhanced applications deal only with the speech APIs; all other hardware and software integration is transparent and seamless.

When Apple embraces an advanced technology, it does so from two perspectives. First, it develops operating system extensions and APIs that make it easy for developers to incorporate those advanced technologies in their applications. Second, it adds those operating system extensions to the platform to enable users to take advantage of the technologies without having to invest in add-in cards and specialized software modules. As of this writing, several Power Macintosh computers are sold with speech technologies included, and earlier Power Macintosh computers can be easily and inexpensively updated with the PlainTalk extensions. Virtually every Macintosh, from the Macintosh SE to the present Power Macintosh models, can support PlainTalk speech synthesis technology.

=See Also=