Xuedong Huang, Alex Acero, Fil Alleva, Mei-Yuh Hwang, Li Jiang
and Milind Mahajan

Microsoft Corporation, One Microsoft Way, Redmond, WA 98052,
USA

<h1>ABSTRACT</h1>

<p>
Since January 1993, we have been working to refine and extend
Sphinx-II technologies in order to develop practical speech
recognition at Microsoft. The result of that work is Whisper
(Windows Highly Intelligent Speech Recognizer). Whisper offers
significantly improved recognition efficiency, usability, and
accuracy compared with the Sphinx-II system. In addition, Whisper
provides speech input capabilities for Microsoft Windows and can be
scaled to meet different PC platform configurations. It provides
features such as continuous speech recognition, speaker
independence, on-line adaptation, noise robustness, and dynamic
vocabularies and grammars. For typical Windows command-and-control
applications (less than 1,000 words), Whisper provides a
software-only solution on PCs equipped with a 486DX, 4MB of memory,
a standard sound card, and a desktop microphone.
<h1>INTRODUCTION</h1>

<p>
To make Sphinx-II [10,6] usable in a PC environment, we need to
tackle issues of recognition accuracy, computational efficiency,
and usability simultaneously. A large amount of RAM and a high-end
workstation are unrealistic for today's popular PC environments,
where low-cost implementations are critical. The system must also
be speaker-adaptive, because there will always be some speakers
for whom the recognition error rate is much higher than average,
due to variation in dialect, accent, cultural background, or simply
vocal tract shape. The ability of the system to reject noise is
also crucial to the success of commercial speech applications.
Noises include not only environmental noises such as phone rings,
key clicks, and air conditioning, but also vocal noises such as
coughs, as well as ungrammatical utterances and out-of-vocabulary
(OOV) words. For a 20,000-word dictation system, on average more
than 3% of the words in an unconstrained test set are missing
from the dictionary, and even when we increase the vocabulary
size to 64,000 words, the OOV rate remains higher than 1.5%.
Lastly, recognition accuracy remains one of the most important
challenges. Even if we exclude utterances containing OOV words,
the word error rate of the best research systems remains higher
than 9% for a 20,000-word continuous dictation task [6].
<p>
Whisper [8] not only inherits all the major features of the
state-of-the-art research system Sphinx-II, but also incorporates
context-free grammar decoding, noise rejection, improved channel
normalization, and on-line speaker adaptation. Whisper supports
Windows 95 and Windows NT, and offers speaker-independent
continuous speech recognition for typical Windows
command-and-control applications. In this paper, we selectively
describe several strategies used in Whisper to tackle efficiency
and usability problems for command-and-control applications.
<h1>EFFICIENCY ISSUES</h1>

<p>
We have dramatically improved Whisper's computational and memory
requirements. In comparison with Sphinx-II (under the same accuracy
constraints), the necessary RAM was reduced by a factor of 20,
and the speed was improved by a factor of 5. Efficiency is largely
determined by the size of the models and by the search architecture
[3], which in turn depend on data structures and algorithm design
as well as on appropriate acoustic and language modeling
technologies. In this section we discuss the two most important
improvements, namely acoustic model compression and the
context-free grammar search architecture.
<p>
By using the techniques described in the following sections,
Whisper can run in real time on a 486DX PC, offering
speaker-independent continuous speech recognition with the CFG
required by a typical command-and-control application, within 800KB
of RAM including code and all data. In a command-and-control task
with 260 words, Whisper achieved a word error rate of 1.4%. For
large-vocabulary continuous dictation, more computational power
(CPU and memory) is required, traded off against Whisper's error
rate.
<h3>Acoustic model compression</h3>

<p>
The acoustic model typically requires a large amount of memory.
In addition, likelihood computation for the acoustic model is
a major factor in determining the final speed of the system. In
order to accommodate both command-and-control and more demanding
dictation applications, Whisper is scalable, allowing several
configurations of the number of codebooks and the number of
<i>senones</i> [9]. In addition, to reduce the memory required by
the acoustic model, Whisper uses a compression scheme that provides
a good compression ratio while avoiding significant computational
overhead for model decompression. The compression also offers
improved memory locality and cache performance, which results in a
small improvement in speed.
<p>
Like discrete HMMs, semi-continuous HMMs (SCHMMs) use common
codebooks for every output distribution. Since the common codebooks
are used for every senone output distribution, the output
probability value for the same codeword entry is often identical
for similar senones. For example, the context-independent phone
<i>AA</i> uses about 260 senones in the top-of-the-line 7000-senone
configuration. These senones describe different context variations
for the phone <i>AA</i>. We arranged the output probabilities
according to the codeword index instead of according to senones, as
is conventionally done. We observed a very strong output
probability correlation within similar context-dependent senones.
This suggested to us that compressing output probabilities across
senones may lead to some savings. We compressed all the output
probabilities with run-length encoding. The run-length encoding is
lossless and extremely efficient to decode. To illustrate the basic
idea, we display in Table 1 the output probabilities of senones 1
through 260 for phone <i>AA</i>.

<pre>
               Sen 1    Sen 2    Sen 3    Sen 4    ...
Codeword 1     0.020    0.020    0.020    0.0      ...
Codeword 2     0.28     0.30     0.020    0.0      ...
Codeword 3     0.035    0.035    0.035    0.035    ...
Codeword 4     0.0      0.0      0.0      0.0      ...
Codeword 5     0.0      0.0      0.0      0.0      ...
Codeword 6     0.0      0.0      0.0      0.076    ...
Codeword 7     0.0      0.0      0.0      0.070    ...
Codeword 8     0.057    0.051    0.055    0.054    ...
Codeword 9     0.057    0.051    0.054    0.051    ...
...            ...      ...      ...      ...      ...
Codeword 256   0.0      0.0      0.0      0.080    ...
</pre>

<p>
Table 1. Uncompressed acoustic output probabilities for a
7000-senone Whisper configuration with 4 codebooks. We show the
probabilities for one of the codebooks.
<p>
In Table 1, the sum of each column equals 1.0, corresponding to the
senone-dependent output probability distribution. For the
run-length encoding, we chose to compress each row instead of each
column. This allows us to make full use of the correlation among
different senones. The compressed form is illustrated in Table 2,
where a run of identical probabilities is encoded as a single value
and repeat count. For example, in codeword 1, probability 0.020
appears successively in senones 1, 2, and 3, so we encode the run
as (0.020, 3), as illustrated in Table 2.

<pre>
Codeword 1     (0.020, 3), 0.0, ...
Codeword 2     0.28, 0.30, 0.020, 0.0, ...
Codeword 3     (0.035, 4), ...
Codeword 4     (0.0, 4), ...
...            ...
Codeword 256   (0.0, 3), 0.08, ...
</pre>

<p>
Table 2. Run-length compressed acoustic output probabilities for
a 7000-senone Whisper configuration with 4 codebooks. We show
the probabilities for one of the codebooks after run-length
encoding of the values in Table 1.
<p>
The proposed compression scheme reduced the acoustic model size
by more than 35% in comparison with the baseline [7]. It is not
only lossless but also measurably speeds up acoustic model
evaluation in the decoder, because identical output probabilities
no longer need to be evaluated repeatedly when computing
semi-continuous output probabilities; they can be precomputed
before evaluating Viterbi paths.
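
<p>
To make the row-wise encoding concrete, the following Python sketch
run-length encodes one codeword row across senones and decodes it
back. It is a minimal illustration of the idea, not Whisper's
implementation; the function names and tolerance are our own.

<pre>
# Illustrative sketch: row-wise run-length encoding of output
# probabilities across senones (values taken from Table 1).

def rle_encode(row, tol=1e-9):
    """Encode a row of per-senone probabilities as (value, count) runs."""
    runs = []
    for p in row:
        if runs and abs(runs[-1][0] - p) <= tol:
            runs[-1][1] += 1
        else:
            runs.append([p, 1])
    return [(v, c) for v, c in runs]

def rle_decode(runs):
    """Inverse of rle_encode; the encoding is lossless by construction."""
    return [v for v, c in runs for _ in range(c)]

# Codeword 1 of phone AA, senones 1..4.
row = [0.020, 0.020, 0.020, 0.0]
runs = rle_encode(row)            # [(0.020, 3), (0.0, 1)]
assert rle_decode(runs) == row    # lossless round trip
</pre>

<p>
Because each run stores a value once, a per-frame weighted sum over
senones can also reuse one multiplication per run instead of one per
senone, which is the source of the decoder speed-up mentioned above.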
<h3>Search architecture</h3>

<p>
Statistical language models based on bigrams and trigrams [12]
have long been used for large-vocabulary speech recognition because
they provide the best accuracy, and Whisper uses them for
large-vocabulary recognition. However, when designing the
command-and-control version of Whisper, we decided to use
context-free grammars (CFGs). Although they have the disadvantage
of being restrictive and unforgiving, particularly with novice
users, we chose the CFG as our preferred language model for its
advantages: (1) compact representation; (2) efficient operation;
and (3) ease of grammar creation and modification for new tasks.
Users can easily modify the CFG and add new words to the system.
Whenever a new word is added for a non-terminal node in the CFG, a
spelling-to-pronunciation component is activated to augment the
lexicon.
<p>
The CFG consists of a set of productions, or rules, that expand
non-terminals into a sequence of terminals and non-terminals.
Non-terminals in the grammar tend to refer to high-level,
task-specific concepts such as dates, font names, etc. Terminals
are words in the vocabulary. We allow some regular expression
operators on the right-hand side of a production as a notational
convenience. We disallow left recursion for ease of implementation.
The grammar format achieves sharing of sub-grammars through the
use of shared non-terminal definition rules.
<p>
During decoding, the search engine pursues several paths through
the CFG at the same time. Associated with each path is a grammar
state that describes completely how the path can be extended
further. When the decoder hypothesizes the end of a word, it asks
the grammar module for all possible one-word extensions of the
grammar state associated with the word just completed. A grammar
state consists of a stack of production rules. Each element of the
stack also contains the position, within the production rule, of
the symbol that is currently being explored. The grammar state
stack starts with the production rule for the grammar start
non-terminal, at its first symbol. When the path needs to be
extended, we look at the next symbol in the production. If it is a
terminal, the path is extended with the terminal and the search
engine tries to match it against the acoustic data. If it is a
non-terminal, the production rule that defines it is pushed onto
the stack and we start scanning the new rule from its first symbol
instead. When we reach the end of a production rule, we pop the
ending rule off the stack and advance the rule below it by one
position, over the non-terminal symbol that we have just finished
exploring. When we reach the end of the production rule at the very
bottom of the stack, we have reached an accepting state in which we
have seen a complete grammatical sentence; a sketch of this
machinery follows.
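
<p>
The following Python sketch implements the grammar-state stack just
described on a small hypothetical grammar. The rule format, the toy
grammar, and all names are our own assumptions for illustration,
not Whisper's actual data structures.

<pre>
# Toy CFG: non-terminals expand into sequences of terminals and
# non-terminals; left recursion is disallowed, as in the text.
GRAMMAR = {
    "<start>": [["open", "<app>"], ["close", "<app>"]],
    "<app>":   [["notepad"], ["mail"]],
}

def pop_finished(stack):
    """Pop completed rules, advancing the rule below past the
    non-terminal just finished. An empty result is an accepting state."""
    while stack and stack[-1][1] == len(stack[-1][0]):
        stack = stack[:-1]
        if stack:
            rule, pos = stack[-1]
            stack = stack[:-1] + ((rule, pos + 1),)
    return stack

def extend(stack):
    """All one-word extensions of a grammar state, as (word, next_state).
    A state is a tuple of (production, position) frames."""
    stack = pop_finished(stack)
    if not stack:
        return []                      # accepting state: nothing to extend
    rule, pos = stack[-1]
    sym = rule[pos]
    if sym in GRAMMAR:                 # non-terminal: push each definition
        return [ext for prod in GRAMMAR[sym]
                for ext in extend(stack + ((tuple(prod), 0),))]
    return [(sym, stack[:-1] + ((rule, pos + 1),))]   # terminal: emit word

state = ((tuple(GRAMMAR["<start>"][0]), 0),)   # scanning "open <app>"
for word, nxt in extend(state):                # -> "open"
    for word2, nxt2 in extend(nxt):            # -> "notepad", "mail"
        print(word, word2, "accepting" if not pop_finished(nxt2) else "")
</pre>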
<p>
For the sake of efficiency, the decoder does not actually pursue
all possible paths. When a particular path is no longer promising,
it is pruned. Pruning is a source of additional errors, since the
correct path, which looks unpromising now, may prove to be the
best when all the data is considered. To relax our pruning
heuristic we use a strategy that we have dubbed the <i>&quot;Rich
Get Richer&quot;</i> (RGR). RGR enables us to focus on the most
promising paths and treat them with detailed acoustic evaluations
and relaxed path-pruning thresholds. Less promising paths, on the
other hand, are still extended, but with less expensive acoustic
evaluations and less forgiving path-pruning thresholds. In this
way, locally optimal candidates continue to receive the maximum
attention, while less optimal candidates are retained but evaluated
using less precise, and less computationally expensive, acoustic
and/or linguistic models. The RGR strategy gives us finer control
over the creation of new paths. This is particularly important for
CFG-based grammars, since the number of potential paths is
exponential. Furthermore, RGR gives us control over the
working-memory size, which is important for relatively small PC
platforms.
<p>
One instance of RGR used in Whisper is control over the level of
acoustic detail used in the search. Our goal is to reduce the
number of context-dependent senone probability computations
required. Let <i>&alpha;<sub>p</sub>(t)</i> be the best accumulated
score at frame <i>t</i> for all instances of phone <i>p</i> in the
beam, and <i>b<sub>p</sub>(t)</i> the output probability of the
context-independent model for phone <i>p</i> at frame <i>t</i>.
Then the context-dependent senones associated with a phone <i>p</i>
are evaluated for frame <i>t</i> if
<p>
&alpha;<sub>p</sub>(t) &middot; b<sub>p</sub>(t)<sup>a</sup> &ge; K &middot; &alpha;(t)
<p>
where <i>&alpha;(t)</i> is the accumulated score for the best
hypothesis at frame <i>t</i>, and <i>a</i> and <i>K</i> are
constants.
<p>
In the event that <i>p</i> does not fall within the threshold, the
senone probabilities corresponding to <i>p</i> are estimated using
the context-independent senones corresponding to <i>p</i>. In Table
3 we show the improvements of this technique for Whisper's
20,000-word dictation application.

<pre>
Reduction in senone computation   80%    95%
Error rate increase                1%    15%
</pre>

<p>
Table 3. Reduction in senone computation vs. error rate increase
using the RGR strategy in a 7000-senone, 20,000-word configuration
of Whisper.
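
<p>
The gate above is cheap to apply per phone and per frame. A
log-domain Python sketch follows; the values of <i>a</i> and
<i>K</i> and the function names are illustrative assumptions, not
Whisper's actual constants.

<pre>
import math

def use_detailed(log_alpha_p, log_b_p, log_alpha_best,
                 a=0.5, log_K=math.log(1e-3)):
    """The gate alpha_p(t) * b_p(t)**a >= K * alpha(t), rewritten as
    a sum of logs. Returns True if the detailed context-dependent
    senones of phone p should be evaluated at this frame."""
    return log_alpha_p + a * log_b_p >= log_K + log_alpha_best

# A phone close to the best path gets the detailed evaluation; one
# far behind falls back to its cheap context-independent senone score.
print(use_detailed(-10.0, -2.0, -9.0))   # True
print(use_detailed(-40.0, -2.0, -9.0))   # False
</pre>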
<h1>USABILITY ISSUES</h1>

<p>
To make Whisper more usable, we need to tackle problems such as
environmental and speaker variations, ill-formed speech input, and
sounds not intended for the system, as well as speaker adaptation.
<h3>Improved channel normalization</h3>

<p>
Cepstral mean normalization [11] plays an important role in robust
speech recognition under variations of channel, microphone, and
speaker. However, mean normalization does not discriminate between
silence and voice when computing the utterance mean, so the mean is
affected by the amount of noise included in the calculation. For
improved speech recognition accuracy, we propose a new, efficient
normalization procedure that differentiates noise and speech during
normalization and computes a different mean for each. The new
procedure reduced the error rate slightly for the case of
same-environment testing, and reduced the error rate by 25%
relative to standard mean normalization when an environmental
mismatch exists [2].
<p>
The proposed technique consists of subtracting a correction vector
<b>r</b><sub>i</sub> from each incoming cepstrum vector
<b>x</b><sub>i</sub>:
<p>
<b>x&#770;</b><sub>i</sub> = <b>x</b><sub>i</sub> &minus; <b>r</b><sub>i</sub>
<p>
where the correction vector is given by
<p>
<b>r</b><sub>i</sub> = p<sub>i</sub>(<b>n</b> &minus; <b>n&#772;</b>) + (1 &minus; p<sub>i</sub>)(<b>s</b> &minus; <b>s&#772;</b>)
<p>
with p<sub>i</sub> being the <i>a posteriori</i> probability of
frame <i>i</i> being noise, <b>n</b> and <b>s</b> being the average
noise and average speech cepstral vectors for the current
utterance, and <b>n&#772;</b> and <b>s&#772;</b> the average noise
and speech cepstral vectors for the database used to train the
system. Since this normalization is applied to the training
utterances as well, we see that after compensation the average
noise cepstral vector for all utterances will be <b>n&#772;</b>,
and the average speech vector for all utterances will be
<b>s&#772;</b>. The use of the <i>a posteriori</i> probability
allows a smooth interpolation between noise and speech, much like
the SDCN and ISDCN algorithms [1].
<p>
Although more sophisticated modeling could be used to estimate
p<sub>i</sub>, we made the approximation that it can be obtained
exclusively from the energy of the current frame. A threshold
separating speech from noise is constantly updated based on a
histogram of log-energies. This results in a very simple
implementation that is also very effective [2].
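
<p>
A minimal NumPy sketch of this two-mean normalization is shown
below, assuming a hard energy threshold in place of the smooth
posterior p<sub>i</sub>, and a threshold supplied by the caller
rather than updated from a log-energy histogram; all names are our
own.

<pre>
import numpy as np

def two_mean_normalize(cepstra, log_energy, threshold, n_bar, s_bar):
    """cepstra: (T, D) vectors x_i for one utterance;
    log_energy: (T,) per-frame log-energies;
    n_bar, s_bar: (D,) database-wide noise / speech average cepstra.
    Assumes the utterance contains both noise and speech frames."""
    # Hard 0/1 stand-in for the a posteriori noise probability p_i.
    p = (log_energy < threshold).astype(float)[:, None]
    # Utterance-level average noise and speech cepstra (n and s).
    n = cepstra[p[:, 0] == 1.0].mean(axis=0)
    s = cepstra[p[:, 0] == 0.0].mean(axis=0)
    # r_i = p_i (n - n_bar) + (1 - p_i)(s - s_bar); x_hat_i = x_i - r_i
    r = p * (n - n_bar) + (1.0 - p) * (s - s_bar)
    return cepstra - r
</pre>

<p>
After this correction, noise frames are pulled toward
<b>n&#772;</b> and speech frames toward <b>s&#772;</b>, so training
and testing environments are mapped to a common reference.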
<h3>Noise rejection</h3>

<p>
The ability to detect and notify the user of utterances containing
out-of-vocabulary words, ungrammatical utterances, and
non-utterances such as phone rings is essential to the usability of
a recognizer. This is particularly true when the language model is
a tight context-free grammar, as users may initially have
difficulty confining their speech to such a model. We have added
rejection functionality to Whisper that assigns a confidence level
to each segment in a recognition result, as well as to the whole
utterance, which can be used for an improved user interface.
<p>
Previous work on detecting noise words includes an all-phone
representation of the input [4] and the use of noise-specific
models [13,14]. We have observed for continuous small-vocabulary
tasks that the path determined by the best context-independent
senone score per frame is a relatively reliable rejection path. We
use the output of a fully connected network of context-independent
phones, evaluated by a Viterbi beam search with a separate beam
width that may be adjusted to trade off speed against rejection
accuracy. Transitions between phones are weighted by phonetic
bigram probabilities trained using a 60,000-word dictionary and
language model.
<p>
We use one noise model and one garbage model in the system. The
noise model is like a phonetic HMM; its parameters are estimated
using noise-specific data such as phone rings and coughs. The
garbage model is a one-state Markov model whose output probability
is guided by the rejection path. A garbage word based on this
model may be placed anywhere in the grammar as a kind of phonetic
wildcard, absorbing ungrammatical phonetic segments or alerting the
user to them after recognition. The noise model, in turn, absorbs
non-speech noise data.
<p>
We measure the rejection accuracy using a multi-speaker data set
with a mixture of grammatical and ungrammatical utterances as
well as noise. With our rejection models, Whisper rejects 76%
of utterances that are ungrammatical or noise and 20% of
misrecognized grammatical utterances, while falsely rejecting fewer
than 3% of correctly recognized grammatical utterances. Feedback
supplied by the user is used to train the confidence threshold;
this increases per-speaker rejection accuracy, especially for
non-native speakers.
<p>
One interesting result is that the confidence measures used for
noise rejection can also be used to improve recognition accuracy.
Here, word transitions are penalized by a function of the
confidence measure, so the less confident theories in the search
beam are penalized more than theories with higher confidence. This
provides information different from the accumulated probability of
each path in the beam, and is in the same spirit as the general RGR
strategy used throughout the system. We found that the error rate
for our command-and-control task was reduced by more than 20% by
incorporating this penalty [5].
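
<p>
As a concrete illustration, the sketch below derives a per-word
confidence from the gap between a word's acoustic path score and
the rejection-path score over the same frames, and turns it into a
transition penalty. The linear penalty form and its weight are our
own assumptions; the paper does not specify the function.

<pre>
def word_confidence(word_log_score, rejection_log_score):
    """Log-domain gap between the word's path and the rejection path
    over the same frames; 0 is best, more negative is less confident."""
    return word_log_score - rejection_log_score

def transition_penalty(confidence, weight=2.0):
    """Hypothetical penalty applied at word transitions: the less
    confident a theory, the more it pays, in the spirit of RGR."""
    return weight * max(0.0, -confidence)

# A word that nearly matches the rejection path passes almost freely;
# a poorly matching word pays a much larger penalty.
print(transition_penalty(word_confidence(-100.0, -98.0)))   # 4.0
print(transition_penalty(word_confidence(-120.0, -98.0)))   # 44.0
</pre>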
<h3>Speaker adaptation</h3>

<p>
To bridge the gap between speaker-dependent and speaker-independent
speech recognition, we incorporated speaker adaptation into the
Whisper system. We modify the two most important parameter sets for
each speaker, i.e., the vector quantization codebooks (the SCHMM
mixture components) and the output distributions (the SCHMM mixing
coefficients), in the framework of semi-continuous models. We are
interested in developing adaptation algorithms that are consistent
with the estimation criterion used in either speaker-independent or
speaker-dependent systems. We observed in general a 50% error
reduction when a small amount of enrollment data is used.
Adaptation is particularly important for non-native English
speakers.
<h1>SUMMARY</h1>

<p>
We have significantly improved Whisper's accuracy, efficiency,
and usability over the past two years. On a 260-word Windows
continuous command-and-control task, with an 800KB working-memory
configuration (all the RAM required, including code and data), the
average speaker-independent word recognition error rate was 1.4% on
a 1,160-utterance test set. The system runs in real time on a PC
equipped with a 486DX and 4MB of memory.
<p>
The emergence of an advanced speech interface is a significant
event that will change today's dominant GUI-based computing
paradigm. It is obvious that the paradigm shift will require not
only accurate speech recognition, but also integrated natural
language understanding, as well as a new model for building
application user interfaces. The speech interface cannot be
considered highly intelligent until we make it transparent,
natural, and easy to use. Through our ongoing research efforts, we
believe that we can continue to push the quality of our system
above and beyond what is implied by its acronym.
<h1>ACKNOWLEDGMENTS</h1>

<p>
The authors would like to express their gratitude to Douglas
Beeferman, Jack McLaughlin, Rick Rashid, and Shenzhi Zhang for
their help in Whisper development.
<h1>REFERENCES</h1>

<ol>
<li>Acero, A. &quot;Acoustical and Environmental Robustness in
Automatic Speech Recognition&quot;. Kluwer Publishers, 1993.
<li>Acero, A. and Huang, X. &quot;Robust Mean Normalization for
Speech Recognition&quot;. US patent pending, 1994.
<li>Alleva, F., Huang, X., and Hwang, M. &quot;An Improved Search
Algorithm for Continuous Speech Recognition&quot;. <i>IEEE
International Conference on Acoustics, Speech, and Signal
Processing</i>, 1993.
<li>Asadi, O., Schwartz, R., and Makhoul, J. &quot;Automatic
Modeling of Adding New Words to a Large-Vocabulary Continuous
Speech Recognition System&quot;. <i>IEEE International Conference
on Acoustics, Speech, and Signal Processing</i>, 1991.
<li>Beeferman, D. and Huang, X. &quot;Confidence Measure and Its
Applications to Speech Recognition&quot;. US patent pending, 1994.
<li>Huang, X., Alleva, F., Hwang, M., and Rosenfeld, R. &quot;An
Overview of the Sphinx-II Speech Recognition System&quot;.
<i>Proceedings of the ARPA Human Language Technology Workshop</i>,
March 1993.
<li>Huang, X. and Zhang, S. &quot;Data Compression for Speech
Recognition&quot;. US patent pending, 1993.
<li>Huang, X., Acero, A., Alleva, F., Beeferman, D., Hwang, M., and
Mahajan, M. &quot;From CMU Sphinx-II to Microsoft Whisper - Making
Speech Recognition Usable&quot;, in <i>Automatic Speech and Speaker
Recognition - Advanced Topics</i>. Lee, Paliwal, and Soong,
editors, Kluwer Publishers, 1994.
<li>Hwang, M. and Huang, X. &quot;Subphonetic Modeling with Markov
States -- Senone&quot;. <i>IEEE International Conference on
Acoustics, Speech, and Signal Processing</i>, 1992.
<li>Lee, K.F. &quot;Automatic Speech Recognition: The Development
of the SPHINX System&quot;. Kluwer Publishers, Boston, 1989.
<li>Liu, F., Stern, R., Huang, X., and Acero, A. &quot;Efficient
Cepstral Normalization for Robust Speech Recognition&quot;.
<i>Proceedings of the ARPA Human Language Technology Workshop</i>,
March 1993.
<li>Jelinek, F. &quot;Up From Trigrams&quot;. <i>Proceedings of the
Eurospeech Conference</i>, Genova, Italy, 1991.
<li>Ward, W. &quot;Modeling Non-Verbal Sounds for Speech
Recognition&quot;. <i>Proceedings of the DARPA Speech and Language
Workshop</i>, October 1989.
<li>Wilpon, J., Rabiner, L., Lee, C., and Goldman, E.
&quot;Automatic Recognition of Keywords in Unconstrained Speech
Using Hidden Markov Models&quot;. <i>IEEE Transactions on
Acoustics, Speech, and Signal Processing</i>, Vol. ASSP-38,
pp. 1870-1878, 1990.
</ol>

=See Also=
* [[Microsoft Research]]

[[Category:Microsoft]]