Design Process
Developing a product using Sensory ICs and technologies requires hardware platform development, software development, product integration, and human-interaction testing. For good speech recognition and synthesis performance, each of these design areas must be error-free. Please refer to the FAQ page to understand the issues inherent in speech recognition and good product design.
Sensory provides free design consultation and design reviews to ensure the quality of your finished product. Send us your schematics, flowcharts, code, board layout, microphone enclosure specifications and a production prototype, and we will review them and give you our recommendations.
Please contact Sensory Sales for assistance.
In general, these are the steps to take to maximize your experience with Sensory speech technologies:
- Define product flow chart, product concept, and product specification.
- Determine which of Sensory's technologies are needed for the product.
- Provide product flow, concept, code, schematics, board layout and microphone enclosure to Sensory for free evaluation and input.
- Make appropriate changes to specification suggested by Sensory.
- Develop prototype or utilize Sensory's professional services to create prototype.
- Test prototype and make modifications.
- Place order for chips.
A brief description of design considerations follows. For a complete explanation of the issues involved in designing an application using speech, please refer to the Design Notes section.
Types of Speech Recognition
The two general classes of speech recognition are called "speaker-dependent" (SD) and "speaker-independent" (SI) recognition. For speaker-dependent recognition, the speaker trains the system to recognize his or her voice by speaking each of the words in a recognition vocabulary set. In speaker-independent recognition, the system uses recognition sets that are pre-trained for virtually any speaker of a particular target language (accents matter). SD suits products that require end-user customization, and SI suits products that must work “out of the box.” Sensory provides the Quick T2SI (text-to-speaker-independent) tool for quickly creating SI sets for many languages.
Because a speaker-dependent recognition system must save the end user's custom training data, such a device requires programmable storage such as EEPROM, RAM, or Flash. Speaker-independent recognition is therefore the lowest-cost option and should be used unless end-user customization is required.
Selection of Vocabulary
The success rate of a speech recognition product depends primarily on the number of alternative words in a recognition set, and the distinctiveness of the words in the set - different types of sounds and numbers of syllables. Care must be given to the choice of the words in a recognition inventory. For example, if the task is to distinguish "cat" from "rat," the product will be improved if the words are changed to "cat" and "mouse."
The active recognition set can change during the use of a speech recognition product. For example, at some stage, the speech synthesizer may ask a question and expect to distinguish "yes" from "no." At another time, the speech synthesizer may ask a different question whose answer is either "dog," "horse," "elephant," or "dinosaur." In this example, keeping the number of words in each recognition set small, and ensuring the words in the set have different numbers of syllables and types of sounds, will greatly enhance recognition accuracy. In general, the vocabulary to be recognized at any point in the product should contain the smallest number of words or phrases possible, and each word or phrase should be different in terms of the number of syllables and types of sounds. Sensory recommends keeping set sizes under 10 words or phrases per set, though it is possible to exceed 20 given a benign usage model.
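The vocabulary guidelines above can be turned into a quick design-time check. The following Python sketch is illustrative only and is not a Sensory tool: the function name and the idea of supplying syllable counts by hand are assumptions. It flags sets that reach the recommended size limit and pairs of words with the same syllable count. A flagged pair is a prompt to look closer, not a failure, since words can still differ in their types of sounds.

```python
from itertools import combinations

# The text recommends keeping sets under 10 words or phrases.
RECOMMENDED_SET_LIMIT = 10


def review_recognition_set(syllables_by_word, limit=RECOMMENDED_SET_LIMIT):
    """Return a list of warnings for one recognition set.

    syllables_by_word maps each word or phrase to a syllable count
    supplied by the designer (hypothetical input format).
    """
    warnings = []
    if len(syllables_by_word) >= limit:
        warnings.append(
            f"set has {len(syllables_by_word)} entries; recommended is under {limit}"
        )
    # Compare every pair once and flag equal syllable counts.
    for (word_a, count_a), (word_b, count_b) in combinations(
        syllables_by_word.items(), 2
    ):
        if count_a == count_b:
            warnings.append(
                f"'{word_a}' and '{word_b}' both have {count_a} syllable(s)"
            )
    return warnings


if __name__ == "__main__":
    for line in review_recognition_set({"cat": 1, "rat": 1}):
        print(line)
```

A check like this covers only set size and syllable count. Judging whether two words use different types of sounds still requires listening tests or a phonetic comparison.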
Trigger Words and Commands
There are two general operation modes for the recognizer - trigger word or command word. The recognizer may listen continuously for a key word or trigger word, which one may think of as the product’s name. Listening continuously may make the recognizer more prone to false firing, so limiting the continuously active vocabulary to a single word reduces that probability. Once the trigger word gets the attention of the product, the product can listen for a variety of commands. A command set will typically time out after 2-3 seconds, so the chances of a false fire are small even with multiple commands being weighed for recognition.
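The two modes described above can be sketched as a small state machine. This Python example is a minimal illustration, not Sensory firmware or a Sensory API: the class name, method names, and the idea of feeding already-recognized words with timestamps are all assumptions. The 3-second default follows the 2-3 second time-out mentioned in the text.

```python
from enum import Enum


class Mode(Enum):
    TRIGGER = "trigger"   # listening continuously for one word
    COMMAND = "command"   # listening briefly for several commands


class VoiceControl:
    """Hypothetical trigger-then-command controller."""

    def __init__(self, trigger_word, commands, command_timeout_s=3.0):
        self.trigger_word = trigger_word
        self.commands = set(commands)
        self.command_timeout_s = command_timeout_s
        self.mode = Mode.TRIGGER
        self._deadline_s = None

    def on_recognized(self, word, now_s):
        """Handle one recognized word; return an accepted command or None."""
        # Fall back to trigger mode if the command window has expired.
        if self.mode is Mode.COMMAND and now_s >= self._deadline_s:
            self._to_trigger_mode()

        if self.mode is Mode.TRIGGER:
            if word == self.trigger_word:
                self.mode = Mode.COMMAND
                self._deadline_s = now_s + self.command_timeout_s
            return None  # commands are ignored outside the command window

        if word in self.commands:
            self._to_trigger_mode()
            return word
        return None

    def _to_trigger_mode(self):
        self.mode = Mode.TRIGGER
        self._deadline_s = None


if __name__ == "__main__":
    vc = VoiceControl("robot", {"forward", "stop"})
    vc.on_recognized("robot", now_s=1.0)
    print(vc.on_recognized("stop", now_s=2.0))
```

In this sketch a command spoken without the trigger word is ignored, which mirrors the reason for using a trigger word in the first place.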
Power Consumption
In operation, the speech recognition circuit may draw a current of 10 milliamperes. If it is powered on continuously to listen for a trigger word, it will drain a large alkaline battery in a few days. So if an application requires the recognizer be on continuously, it should operate from AC wall power. Conversely, if the product is to operate on batteries, then it must be designed to be awakened from a low power "sleep" mode by some external I/O event. The product should be designed to be awakened by the user each time a recognition event is desired, followed by returning to sleep.
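The trade-off above can be estimated with simple arithmetic: battery life is roughly capacity divided by average current. The Python sketch below is an idealized estimate under stated assumptions. The 10 mA active current comes from the text, while the battery capacity, sleep current, and active fraction are placeholder values, not datasheet figures. Real capacity also varies with discharge rate, temperature, and cutoff voltage.

```python
def battery_life_hours(capacity_mah, current_ma):
    """Idealized battery life: capacity (mAh) divided by current draw (mA)."""
    return capacity_mah / current_ma


def average_current_ma(active_ma, sleep_ma, active_fraction):
    """Time-weighted average current for a duty-cycled product."""
    return active_ma * active_fraction + sleep_ma * (1.0 - active_fraction)


if __name__ == "__main__":
    ACTIVE_MA = 10.0          # from the text: recognizer running
    CAPACITY_MAH = 2400.0     # placeholder capacity, an assumption
    SLEEP_MA = 0.01           # placeholder sleep current, an assumption
    ACTIVE_FRACTION = 0.01    # placeholder: awake 1% of the time

    always_on = battery_life_hours(CAPACITY_MAH, ACTIVE_MA)
    duty_cycled = battery_life_hours(
        CAPACITY_MAH, average_current_ma(ACTIVE_MA, SLEEP_MA, ACTIVE_FRACTION)
    )
    print(f"always on:   {always_on / 24:.1f} days")
    print(f"duty cycled: {duty_cycled / 24:.1f} days")
```

Substituting the actual battery rating and measured currents for the placeholders gives a first-order comparison between continuous listening and a wake-on-event design.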
The Signal Volume
If the distance from a speaker’s mouth to a microphone is doubled (e.g. 6 inches to 12 inches), the signal power decreases by a factor of four. The difference between a loud and a soft voice can also be more than a factor of four. Thus, the recognizer must function over a wide dynamic range of input signal strength. Its recognition accuracy will be diminished if the input signal either saturates the electronics or is too small. Sensory chips include automatic gain control (AGC) to help address this issue. This circuit changes the system gain to compensate for too small or too large a signal. The application developer must design the microphone circuit gain to be well centered between the two extremes.
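The factor-of-four figure above follows from the inverse-square law, which holds for an idealized point source in open space (an assumption; enclosures and room reflections change the result). The Python sketch below expresses the power ratios from the text in decibels. The function names are illustrative.

```python
import math


def relative_power(near_distance, far_distance):
    """Power at far_distance relative to near_distance (inverse-square law)."""
    return (near_distance / far_distance) ** 2


def power_ratio_to_db(ratio):
    """Convert a power ratio to decibels."""
    return 10.0 * math.log10(ratio)


if __name__ == "__main__":
    ratio = relative_power(6, 12)  # 6 inches to 12 inches
    print(f"power ratio: {ratio}, {power_ratio_to_db(ratio):.1f} dB")

    # Distance (4x) combined with loud versus soft voice (4x).
    print(f"combined range: {power_ratio_to_db(4 * 4):.1f} dB")
```

Doubling the distance gives a ratio of 0.25, or about -6 dB. Combining a factor of four from distance with a factor of four from voice level gives a factor of 16, or about 12 dB, which is the kind of range the microphone gain and AGC must accommodate.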
Noise and Selection of a Microphone
An electronic speech recognizer has difficulty recognizing words in a noisy environment, just as humans do. Speech recognition products will always work best in a quiet environment with the speaker's mouth in close physical proximity to the microphone. If the product is meant to be used in a noisy environment, care must be taken to manage the signal-to-noise ratio (SNR). For example, if speech recognition is used in a video game with sound effects, a headset will dramatically improve recognition performance by separating the effects from the input speech. In some cases where the user’s location relative to the product is always known, a directional microphone may be suitable. Since directional microphones have a frequency response that depends on their distance and angle from the sound source, such microphones should be used with care, and speaker-independent sets may require special processing with these microphone characteristics in mind. For many applications an inexpensive omni-directional electret capacitor microphone is acceptable and is all the application bill-of-materials (BOM) can allow. In this case, the user will be required to speak loudly enough to exceed the background noise and any background speech by a comfortable margin.
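During prototype testing, SNR can be estimated from two short recordings made with the product's microphone: one while the user speaks and one of the background alone. The Python sketch below is a rough illustration, and the function names are assumptions. Because the speech recording also contains background noise, the result slightly overstates the true SNR.

```python
import math


def mean_power(samples):
    """Average of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_samples, noise_samples):
    """Estimated signal-to-noise ratio in decibels."""
    return 10.0 * math.log10(mean_power(speech_samples) / mean_power(noise_samples))


if __name__ == "__main__":
    speech = [0.5, -0.5, 0.5, -0.5]      # placeholder samples
    noise = [0.05, -0.05, 0.05, -0.05]   # placeholder samples
    print(f"estimated SNR: {snr_db(speech, noise):.1f} dB")
```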
Add Prompts to Reduce Recognition Errors
When recognizing words or phrases, the Sensory recognition technology judges its probability of success. The product can thus be designed to prompt for cues if the desired accuracy isn't reached. For example, if a voice dialing phone is told to "place call" and the recognizer is confident of its accuracy, it can accept the command. If it is unsure but suspects this command, it can confirm by asking "Do you want to place a call?" If it has very low confidence, it can ask, "What did you say?" In some cases it will be less frustrating to wait quietly, as it is natural for a user to repeat themselves if a person or product does not respond the first time. The application developer should judge which approach is appropriate.
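The prompting strategy above amounts to comparing a confidence value against two thresholds. The Python sketch below is illustrative only: the 0-to-1 confidence scale, the threshold values, and the function name are assumptions and do not describe how Sensory reports confidence.

```python
ACCEPT_THRESHOLD = 0.80    # hypothetical: act on the command directly
CONFIRM_THRESHOLD = 0.50   # hypothetical: ask the user to confirm


def choose_action(confidence, reprompt_on_low=True):
    """Map a recognition confidence (assumed 0.0 to 1.0) to a response."""
    if confidence >= ACCEPT_THRESHOLD:
        return "accept"      # e.g. place the call
    if confidence >= CONFIRM_THRESHOLD:
        return "confirm"     # e.g. "Do you want to place a call?"
    if reprompt_on_low:
        return "reprompt"    # e.g. "What did you say?"
    return "wait"            # stay quiet and let the user repeat


if __name__ == "__main__":
    for value in (0.9, 0.6, 0.2):
        print(value, choose_action(value))
```

The `reprompt_on_low` flag reflects the design choice left to the application developer: whether to ask again or to wait quietly for the user to repeat the command.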