The omnipresent eye of
the HAL 9000 computer in
2001: A Space Odyssey introduced the world to
modern speech recognition in 1968. What was science fiction
then is close to science fact now.
This is the first part of a series on
speech recognition software. See related articles
listed on the right.
Reliable speech recognition
has long been sought after, but has only recently
become practical on normal computers.
The extraordinary computing
power of a modern home computer, and the evolving
capabilities of speech recognition software now offer the
promise, and possibly the reality, of being able to effortlessly
control and communicate with and via one's computer merely by
talking normally to it.
Read through this and the rest
of our five-part series to understand what
speech recognition is now capable of, whether it might be suitable
for you and your needs, and how best to use it in
your own work environment.
A Short History of Speech
Speech versus voice
First, perhaps we need to
define some terms. We are using the term 'speech
recognition' to refer to a computer being able to listen to an
ordinary speaking voice and understand the words and sentences being spoken.
Voice recognition is
something different. We consider voice recognition to be the
ability to hear someone speaking and identify the person whose
voice is being heard. This is a completely different
process: voice recognition may not actually
involve understanding any of the words, and might be
limited to simply recognizing the voice.
This article is all about
speech recognition, not voice recognition.
Slightly more than 40 years of
technology has often been incorporated into science fiction, but
for a long time it seemed as fanciful and impossible as death
rays and faster than light interstellar travel.
Death rays are now a
reality. Faster than light travel - at least at the
subatomic level - is becoming a possibility, and after 40 years
of hard slog, so too is speech recognition.
Of course the greatest
enabler of modern speech recognition capabilities is the
ever-increasing computing power of a modern computer. But even
limitless computing power would be useless without the
appropriate programming to drive a speech recognition
AT&T's Bell Labs developed
the first-ever speech recognition device way back in the late
1940s and early 1950s. But this was more a proof of
concept than a practical device that could be deployed in
the real world. Until the late 1960s, the focus was on
developing systems that would recognize 'discrete' words - that
is, words spoken separately and distinctly. (A fascinating
and detailed history can be found
While such systems might
have some limited application in some specialized fields, modern
'continuous' speech recognition capabilities first started to be
developed in the early 1970s, when research into the theoretical concepts
that allow for speech recognition, developed at Princeton University, was taken up
by several ARPA (Advanced Research Projects Agency -- the same
agency that brought us the Internet) contractors.
Some of the underlying theory
In case you wondered, the underlying theory involves using a technique
known as 'Hidden Markov Modeling'. This is a way of
identifying something without actually seeing the thing itself,
by inferring what it probably is, based on other things
associated with it. For example, if you wondered what the
temperature was outside, and if you saw a person walking down
the street wearing only a T-shirt and shorts, you might
reasonably infer that it was warm.
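To make the T-shirt example a little more concrete, here is a minimal sketch in Python of the kind of inference a Hidden Markov Model performs. Everything in it is an assumption made up for illustration: the two hidden states, the three kinds of clothing, and every probability. It is not how Dragon or any other product is actually built. The function uses the standard Viterbi method to find the most likely sequence of hidden states (the weather) given only what can be seen (the clothing).

```python
# Hidden states are what we cannot see directly (the weather).
STATES = ("warm", "cold")

# All probabilities below are invented purely for illustration.
START = {"warm": 0.5, "cold": 0.5}
TRANSITION = {  # chance of tomorrow's weather given today's
    "warm": {"warm": 0.8, "cold": 0.2},
    "cold": {"warm": 0.3, "cold": 0.7},
}
EMISSION = {  # chance of seeing each kind of clothing in each kind of weather
    "warm": {"t-shirt": 0.7, "sweater": 0.25, "coat": 0.05},
    "cold": {"t-shirt": 0.05, "sweater": 0.35, "coat": 0.6},
}


def viterbi(observations):
    """Return the most likely sequence of hidden states for the observations."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (START[s] * EMISSION[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            # Pick the most probable way of arriving in state s.
            prob, path = max(
                (best[prev][0] * TRANSITION[prev][s], best[prev][1])
                for prev in STATES
            )
            new_best[s] = (prob * EMISSION[s][obs], path + [s])
        best = new_best
    return max(best.values())[1]


print(viterbi(["t-shirt", "t-shirt", "coat"]))  # ['warm', 'warm', 'cold']
```

With these made-up numbers, seeing a T-shirt on two days and then a coat on the third suggests warm, warm, then cold. A speech recognizer applies the same idea with sounds as the visible clues and words as the hidden thing being guessed.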
The magic of this with
speech recognition is that it enables a computer to identify words imprecisely, and then to
'fill in the gaps' based
on the surrounding words, more or less the same way
we do ourselves when we are listening to someone speak.
The context of a word gives clues as to which word it is -
particularly with words that sound the same (for example,
consider the phrase 'He gave two balls to the other boy too' -
with three different words to/too/two all sounding the same but,
based on context, being clearly different).
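The to/too/two example can be sketched in a few lines of Python. This is only a toy under stated assumptions: the placeholder `TO?` stands for a sound that could be any of the three spellings, and the little scoring tables are hand-written by me for this one sentence, whereas a real system would learn such figures from a very large amount of text.

```python
HOMOPHONES = ("to", "too", "two")

# Invented scores: how plausible each spelling is AFTER a given previous word
# and BEFORE a given next word. Unknown combinations get a neutral score of 1.
AFTER = {
    "gave": {"two": 3},
    "balls": {"to": 3},
    "boy": {"too": 3},
}
BEFORE = {
    "balls": {"two": 3},
    "the": {"to": 3},
    None: {"too": 3},  # None marks the end of the sentence
}


def resolve(words, placeholder="TO?"):
    """Replace each placeholder with the spelling that best fits its neighbours."""
    resolved = []
    for i, word in enumerate(words):
        if word != placeholder:
            resolved.append(word)
            continue
        prev_word = words[i - 1] if i > 0 else None
        next_word = words[i + 1] if i + 1 < len(words) else None
        # Combine the evidence from the word before and the word after.
        scores = {
            c: AFTER.get(prev_word, {}).get(c, 1) * BEFORE.get(next_word, {}).get(c, 1)
            for c in HOMOPHONES
        }
        resolved.append(max(scores, key=scores.get))
    return resolved


heard = ["he", "gave", "TO?", "balls", "TO?", "the", "other", "boy", "TO?"]
print(" ".join(resolve(heard)))  # he gave two balls to the other boy too
```

The three identical sounds come out as three different spellings purely because of the words on either side of them.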
This leads to the second
'magic' part of modern speech recognition. Statistically
speaking, computers can accurately predict the next word in a
phrase based on the words before it. As an
immediate and trivial example, consider the sentence
immediately before this one: if the last word were missing, you
could probably guess that it would be 'it'.
Studies have shown that computer statistical models are more
accurate at completing phrases than we as people are when we
intuitively do the same thing.
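Here is a minimal sketch of that statistical guessing, again in Python. It simply counts which word most often follows each word in some sample text (a so-called bigram count). The sample text is invented for illustration, and real products use far larger collections of text and more sophisticated models than this.

```python
from collections import Counter, defaultdict

# A tiny invented sample; a real system would use millions of sentences.
SAMPLE_TEXT = (
    "she left before it rained he arrived before it started "
    "we ate before it closed they woke before dawn"
)


def train_bigrams(text):
    """Count, for every word, which words follow it and how often."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the word seen most often after `word`, or None if never seen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


model = train_bigrams(SAMPLE_TEXT)
print(predict_next(model, "before"))  # it
```

In the sample, 'before' is followed by 'it' three times and by 'dawn' once, so 'it' is the guess.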
Early products released to the
public in the mid 1990s
The various techniques for
speech recognition were massively refined during the 1980s.
After various experimental
and high end products had been released to limited markets, 1995
saw the release of the first public speech recognition software.
This software, released by Dragon, was a "discrete word" product
that required the speaker to clearly enunciate each individual word.
Two years later, in 1997,
speech recognition software as we know it today appeared.
This new Dragon product, called "NaturallySpeaking", did
exactly what its name implies. No longer did a speaker need
to sound each word individually. Instead, they could speak
in a normal conversational voice, and the computer would be able
to break a steady flow of sound into individual words, even if
there was no perceptible pause or break between the end of one
word and the start of the next.
Since that time, the various
companies offering speech recognition software have
all merged, and there is one major company remaining -- Nuance
Software, which sells its product under the Dragon
The product, now at version
10.1, has continued to improve over the years, and to make
better use of the ever more powerful computers available.
One could pointlessly debate whether earlier versions of
the software were truly ready for prime time; the key
issue which this article series attempts to address is whether
the current version is now something that you should consider
The Difference between Discrete
and Continuous Speech
Think about how you or
anyone else normally talks. You run your words together,
with almost no pause between the end of one word and the start
of the next. Indeed, people will sometimes use the end of one
word to modify the start of the next word, either deliberately
as a type of slang, or unconsciously because it makes for easier
For example, the phrase
'It's a big one' might be pronounced 'It sa bigwun'.
The first two words have been broken at a point so that part of
the first word spills into the second, making both words
sound different, and the last two words are pronounced as
if they were a single word. Or maybe the first two words
will be run together in the same way as the last two, as 'itsa'.
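To give a feel for the problem of finding word boundaries, here is a small Python sketch that splits an unbroken run of letters into known words. It is only an analogy under stated assumptions: it works on text rather than sound, the vocabulary is a handful of words I picked for this phrase, and it simply prefers the split with the fewest words. A real continuous speech system has to do something similar with sounds and a vocabulary of many thousands of words.

```python
# A tiny hand-picked vocabulary, for illustration only.
VOCABULARY = {"i", "it", "its", "a", "big", "on", "one"}


def segment(stream, vocabulary):
    """Split an unbroken string into known words, or return None if impossible."""
    # best[i] holds a split of stream[:i] using the fewest words, or None.
    best = [None] * (len(stream) + 1)
    best[0] = []
    for end in range(1, len(stream) + 1):
        for start in range(end):
            piece = stream[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(stream)]


print(segment("itsabigone", VOCABULARY))  # ['its', 'a', 'big', 'one']
```

Feeding it 'itsabigwun' returns None, because 'wun' is not in the vocabulary. That is one reason a real system works from how words sound rather than from how they are spelled.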
A discrete speech
recognition system would require each word to be carefully
sounded out separately. This is not the way we talk, and
so makes discrete speech recognition systems less convenient.
A continuous speech
recognition system will happily understand what you say. To
prove my point, I will pronounce that short phrase four different
ways: first, sounding each word separately; second, running
the words together in a single utterance without pause; third, as three words with the first two words broken in
the wrong place; and fourth, breaking the phrase into two
two-word groups. Let's see how Dragon understands me.
You can also see the CPU loading on the computer while Dragon is
hard at work.
How to best watch the sample
If you have a reasonably
fast Internet connection, I would recommend that after you click
on the play button, you then increase the resolution of the
video from its default 360 setting, and possibly keep on going up past
480, perhaps all the way to either 720 or 1080. You should then
increase the video size so it fills your screen. With
the larger video image and the higher resolution, you can clearly see the text appearing on the video of my screen
as I speak.
The option to change the
resolution appears on the bottom line, but only after you have
started playing the video. If you want the video to go
fullscreen, you should click on the button next to the video
resolution option button that has the four arrows pointing out
to the corners.
Alternatively, click on
this link to open up a regular YouTube page in a separate
Note - the second video in
this two part video will be available next week.
Technical notes about this
This test was done on a
Dell E6400, with an Intel Core 2 Duo T9600 CPU at 2.8GHz and
with 4GB of DDR2 memory, running Windows 7 32-bit with a Logitech
ClearChat Pro USB headset.
NOTE: The sound you hear
is NOT from the Logitech headset; it was recorded from the
microphone on the camcorder. The sound that Dragon would
hear from the Logitech headset would be very much better, and
with less background noise.
Summary of Part 1 of this
Modern speech recognition
systems are designed to work best when you speak normally, and
in a continuous flow. The software, which has evolved over
the last 40 or so years, is still not perfect, but it is getting
Please read on to the
second part of our
series, where we talk about whether your type of work is
well suited for speech recognition or not.
(And, of course, there's
lots more good stuff in the subsequent parts of the series too.)
7 May 2010, last update
28 Nov 2012
You may freely reproduce or distribute this article for noncommercial purposes as long as you give credit to me as original writer.