Creating a Good Speech Recognition Experience

Everyone is on the go. We’re driving, walking, eating, washing our hands. We don’t always have the time or inclination to do the thumb-dance on our phone’s screen. If we can use our voice instead, that’s often our first choice. But the voice recognition has to do what we expect, or it can be frustrating. Here I will address some of the technical issues at a high level, as well as some ways to make sure your users have a positive experience.

I worked on an iOS app that tracks nutrition information based on natural language processing. For example, you can say, “I had a tuna sandwich and a banana for lunch yesterday,” and the calories and macronutrients automatically appear in your meal log. I worked with a skilled designer to make this an easy and delightful experience for users. There were certain challenges that came up in the design and implementation, which I’ll teach you about today so you don’t have to learn them the hard way.

The technical aspects here are mostly geared toward iOS, but much of it applies to Android as well.

Choosing a Service Provider

iPhone and iPad

With current technology, all speech recognition needs to be done on a server. There is limited ability to catch certain keywords locally on the device, but it’s not enough to recognize arbitrary words and sentences.

An official Speech Framework became available in iOS 10, allowing you to use the same voice recognition technology that Siri uses. Now, depending on your experience with Siri, you may see this as a positive or a negative – the quality of the actual recognition is not always the best. But it is currently the officially sanctioned way to recognize speech on iOS devices, and it is free for you, up to certain limits.

If you are not happy with Apple’s Speech Framework, or you need to support devices running older versions of iOS, another alternative is Nuance SpeechKit. It was available to developers before Apple’s framework was released, which is how Nuance snagged the official-sounding name SpeechKit. It has better recognition, more advanced features, and offers you more control over the experience. The tradeoff is that it costs money.

For developers, my recommendation is to understand both SDKs and create a simple wrapper interface that can work with either one. This is what I did, and it allowed the choice of provider to be a purely business decision, based on the tradeoff of features and quality vs. cost.
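To illustrate, here is a minimal sketch of what such a wrapper might look like in Swift. The protocol name SpeechService and its methods are my own invention for this example, not part of either SDK.

```swift
/// A thin abstraction over whichever speech recognition SDK you choose.
/// One conforming type can wrap Apple's Speech framework; another could wrap Nuance SpeechKit.
protocol SpeechService {
    /// Starts listening and reports transcribed text as it arrives.
    /// `isFinal` is true for the last result of a session.
    func startListening(onResult: @escaping (_ text: String, _ isFinal: Bool) -> Void,
                        onError: @escaping (Error) -> Void)

    /// Stops listening and releases the microphone.
    func stopListening()
}
```

The rest of the app only ever talks to SpeechService, so swapping providers becomes a change in one place: wherever you construct the concrete implementation.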

Android

On Android, the official speech recognition SDK has been available for much longer, and Google’s recognition is of higher quality than Apple’s. As usual, though, you have to be aware of the fragmentation of the Android ecosystem, so be prepared for varying microphone quality, API versions, and user experiences. Even so, on Android it is an easier choice to stick with the built-in capabilities.

One reason you may want to use Nuance’s paid service is if you have a cross-platform app and you want to make it easier to port parts of the code between the two platforms. In that case, the fact that Nuance has both an iOS and an Android version of its SDK could be helpful.

The Basic Experience

People use speech recognition features to make things easier, so you want your user’s experience to be as easy and seamless as possible. The basic flow goes like this:

1. Press a big button
2. Talk
3. Get results

You want it to be just that simple. There is deceptive complexity behind the scenes, and a lot can go wrong. The key is to hide as much of that from your user as possible.
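To make that flow concrete, here is a hedged sketch of the press-the-button, talk, get-results loop on iOS, using Apple’s Speech framework and AVAudioEngine. The SpeechListener class, its error handling, and the audio session configuration are simplifications of mine, not the app’s actual code.

```swift
import AVFoundation
import Speech

final class SpeechListener {
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private let audioEngine = AVAudioEngine()
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    /// Called when the big button is pressed. Streams microphone audio to the
    /// recognizer and reports partial and final transcriptions.
    func startListening(onResult: @escaping (_ text: String, _ isFinal: Bool) -> Void) throws {
        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.record, mode: .measurement, options: .duckOthers)
        try session.setActive(true, options: .notifyOthersOnDeactivation)

        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true  // lets you show words as they arrive
        self.request = request

        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()

        task = recognizer?.recognitionTask(with: request) { [weak self] result, error in
            if let result = result {
                onResult(result.bestTranscription.formattedString, result.isFinal)
            }
            if error != nil || result?.isFinal == true {
                self?.stopListening()
            }
        }
    }

    /// Called on timeout, cancellation, or when a final result arrives.
    func stopListening() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
        task?.cancel()
        task = nil
        request = nil
    }
}
```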

The Big Button

I do mean big. Don’t hide your speech recognition functionality behind a tiny little microphone icon or multi-layered menu. When we want to talk at our phone, we want to do so because we don’t have the time or ability to poke around at the touch screen. Make the button big, obvious, and easy to tap while driving or otherwise occupied.

Talking and Listening

After receiving the button press, make it really obvious that the phone is now listening. Provide a visual cue through animation or a screen change, and an audio cue by playing what’s known as an “earcon” (a play on icon, but for ears). An earcon is just a sound that says, “go ahead, I’m listening,” à la Frasier Crane. (Of course, I don’t recommend actually saying that out loud, but the sound you choose should give that impression.)

You should have at least three, possibly more, different earcons, for situations like:

· Started listening
· Finished listening successfully
· Listening failed to recognize anything
· Understood some words but couldn’t find a corresponding action
· Timed out

As your app listens, you should also give visual feedback. If you get partial results from the API, show them on the screen. If not, provide some kind of animation to give the user confidence that something is happening. Ideally the animation should be tied to the audio you’re recording – something like an oscilloscope, spectrograph, or simple volume meter. I recommend adding in some randomness to provide movement even when the speaker is silent between words.
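As a sketch of the earcon side, one simple approach is a small enum plus a player class. The sound file names below are placeholders for whatever you bundle with your app, and a production version would use AVAudioPlayerDelegate to detect the end of playback rather than the duration-based timer used here.

```swift
import AVFoundation

/// The distinct listening states that deserve their own sound.
enum Earcon: String {
    case startedListening  = "earcon_start"
    case finishedListening = "earcon_success"
    case nothingRecognized = "earcon_failure"
    case noMatchingAction  = "earcon_no_match"
    case timedOut          = "earcon_timeout"
}

final class EarconPlayer {
    private var player: AVAudioPlayer?

    /// Plays the earcon and calls `completion` when it should be finished,
    /// so you can wait before opening the microphone (see the synchronization notes later).
    func play(_ earcon: Earcon, completion: @escaping () -> Void) {
        guard let url = Bundle.main.url(forResource: earcon.rawValue, withExtension: "caf"),
              let player = try? AVAudioPlayer(contentsOf: url) else {
            completion()  // missing or unreadable sound file: fail quietly
            return
        }
        self.player = player
        player.play()
        DispatchQueue.main.asyncAfter(deadline: .now() + player.duration) {
            completion()
        }
    }
}
```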

Time Limit

You need to set a time limit on how long you will listen. Choose a limit appropriate to how much text you expect to interpret. And check with your service provider on what their limit is. At the time of this writing, Apple has a 1-minute maximum, for example.

Provide some visual indication to the user of the time limit. Maybe a progress bar that slowly fills up, a countdown, or a changing color.
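A listening time limit is easy to enforce with a cancellable DispatchWorkItem. This is a minimal sketch; the 30-second default is an arbitrary choice that simply stays well under Apple’s roughly one-minute cap.

```swift
import Foundation

final class ListeningTimer {
    private var timeoutWork: DispatchWorkItem?

    /// Call when listening starts. `onTimeout` should stop recognition and play the timeout earcon.
    func start(limit: TimeInterval = 30, onTimeout: @escaping () -> Void) {
        let work = DispatchWorkItem { onTimeout() }
        timeoutWork = work
        DispatchQueue.main.asyncAfter(deadline: .now() + limit, execute: work)
    }

    /// Call when listening finishes or is cancelled before the limit.
    func cancel() {
        timeoutWork?.cancel()
        timeoutWork = nil
    }
}
```

Drive the progress bar or countdown from the same limit value, so the visual indication and the actual cutoff never drift apart.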

Getting Results

Once the speech recognition finishes with some results, play your “success” earcon, then transition the screen appropriately. Be sure to display the text on the screen so the user can see what was recognized.

I recommend pausing a second or two with the text visible to give the user a chance to cancel or retry if it came out wrong. You wouldn’t want to act on the phrase “drop the bomb” if what they really said was “remain calm.”

Things that can go wrong

Be prepared for failure. Lots can go wrong.

That means don’t make speech recognition the only way to enter data. Provide a keyboard interface or some other way to input the same information. This is of course crucial for accessibility, but beyond that, there are many technical points of failure.

Rate Limits

Service providers limit how much speech recognition you can do, because it puts a load on their servers. Apple plays coy on what the specific limits are, but you should be prepared to gracefully handle failure if you do run into them.

Permissions

On all mobile platforms, you need to get permission from the user to perform speech recognition. So be prepared for some people to say no.

On iOS you need two separate permissions – one to access the microphone, and one to perform speech recognition. Be sure to test the sequencing to make sure you ask for both permissions in the right order before attempting to listen, and be able to handle either one being denied.

Do not undervalue the user experience for permission deniers – those who deny app permissions are often the most vocal users, and most likely to complain or tell their friends about a bad experience. If you really need the permissions, provide a convincing reason. If they still say no, make sure the app works anyway. Don’t nag them, but do offer a way for them to change their mind.
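On iOS, the two requests look roughly like this. This sketch assumes your Info.plist already contains the NSMicrophoneUsageDescription and NSSpeechRecognitionUsageDescription strings, which is where your convincing reason goes.

```swift
import AVFoundation
import Speech

/// Requests microphone access first, then speech recognition authorization.
/// Calls back on the main queue with `true` only if both were granted.
func requestSpeechPermissions(completion: @escaping (Bool) -> Void) {
    AVAudioSession.sharedInstance().requestRecordPermission { micGranted in
        guard micGranted else {
            DispatchQueue.main.async { completion(false) }
            return
        }
        SFSpeechRecognizer.requestAuthorization { status in
            DispatchQueue.main.async { completion(status == .authorized) }
        }
    }
}
```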

Incorrect Words

As Jim Croce lamented, sometimes the words just come out wrong. There are many reasons this can happen, including noisy environments, speakers with accents, or just the limitations of the hardware and software. It’s going to happen at some point, so give the user a chance to correct the words before acting on them, and if possible offer an “undo” function after the fact.

Depending on what your app does, try to perform some kind of sanity check on the results. For example, if you think the user said “I ate five thousand tuna sandwiches,” there’s a chance you got something wrong.
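The right sanity check depends entirely on your domain. For the nutrition example, something as crude as the toy check below (the ParsedFood type is hypothetical) is enough to catch the five-thousand-sandwich case and prompt the user to confirm instead of silently logging it.

```swift
/// A hypothetical result from the natural language layer.
struct ParsedFood {
    let name: String
    let quantity: Int
}

/// Flags quantities nobody would plausibly log in one sitting.
func isPlausible(_ item: ParsedFood) -> Bool {
    (1...50).contains(item.quantity)
}
```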

Security Considerations

All of this data is going over the network, being displayed on the screen, and being spoken aloud within earshot of possible bad guys, so do think about security. Don’t use speech recognition to ask for sensitive information. Don’t store recordings longer than necessary. And don’t store text that the user has corrected or deleted.

Technical Considerations

This section is mostly iOS-focused, but similar issues can come up on any platform.

Apple’s audio-visual SDKs leave much to be desired. To ensure you use them correctly, write an abstract wrapper around them.

One common gotcha is synchronization. Make sure the earcon finishes playing before you start listening, or you’ll just record that sound. If you are doing any other audio playback or recording in your app, make sure it doesn’t overlap with speech recognition. Since the user can trigger recognition at any time through the big button, this takes some careful concurrency considerations.

For Evolve, I created a high-level SoundService to keep track of and synchronize all the different ways the app can interact with the audio hardware. This is in addition to a SpeechService class to wrap the speech recognition, as I described above. Be sure to follow good software engineering practices and design these classes so that they can be mocked out for testing purposes.

Since there is only one copy of the audio hardware, I created a mechanism to only allow one audio action to proceed at a time. If you try to listen while playback is occurring, you have to wait until playback finishes. And vice versa. I accomplished this using the classic Apple concurrency technique of semaphores and dispatch queues. If there’s interest, I’ll write another post with more details.

Getting this right is especially important on iOS. Apple provides several ways to control audio playback and recording, at different layers of abstraction (AVKit, AVFoundation, Core Audio), but they all access the same hardware. You can read the high-level documentation on Apple’s developer site. One thing Apple doesn’t mention is that if you get the synchronization wrong, they don’t return an error – they crash your entire app. The exact details of where and how it crashes vary depending on the iOS version, the device, and other random conditions. Usually it looks like EXC_BREAKPOINT or an assertion failure. You may also see “unable to set preferred number of channels” or “required condition is false: IsFormatSampleRateAndChannelCountValid(format)”. This can cause much frustration because the error messages are unrelated to the actual error, so spend the time to make sure you synchronize all your audio access.
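To give a flavor of the semaphore-and-dispatch-queue approach, here is a stripped-down sketch of serializing audio access. The AudioCoordinator name and details are mine, not Evolve’s actual SoundService, but the idea is the same: every playback or recording action goes through one serial queue, and a semaphore holds that queue until the current action reports completion.

```swift
import Foundation

final class AudioCoordinator {
    // One serial queue gates access to the single set of audio hardware.
    private let queue = DispatchQueue(label: "audio.coordinator.serial")

    /// `action` must call `done` exactly once, when its audio work
    /// (an earcon finishing, a recognition session ending, etc.) is complete.
    func perform(_ action: @escaping (_ done: @escaping () -> Void) -> Void) {
        queue.async {
            let semaphore = DispatchSemaphore(value: 0)
            // Run the actual audio work on the main queue, as UIKit and most
            // audio completion handlers expect.
            DispatchQueue.main.async {
                action { semaphore.signal() }
            }
            // Block the serial queue (never the main thread) until the action finishes,
            // so the next queued audio action cannot start early.
            semaphore.wait()
        }
    }
}

// Usage: play an earcon, then listen, with no chance of overlap.
// coordinator.perform { done in earconPlayer.play(.startedListening) { done() } }
// coordinator.perform { done in try? listener.startListening { _, isFinal in if isFinal { done() } } }
```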

Testing

As I mentioned above, include the ability to mock your audio and speech classes, so you can perform unit tests without relying on actual hardware and network communication (a small mock sketch follows the list below). But you also need to do a lot of manual testing on actual devices with the actual software. There is no substitute for that. Be sure to test all the following:

· Denying permission
· Receiving a phone call while recognition is happening
· Receiving a text while recognition is happening
· Triggering recognition while your app is playing audio
· Triggering recognition while another app is playing audio in the background
· Noisy environments
· Timeout
· Cancellation
· Correction of incorrect words
· Slow network connection
· No network connection (graceful error handling)
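For the unit-test side, a mock that conforms to the SpeechService protocol sketched earlier keeps tests away from the microphone and the network entirely; the canned values here are just examples.

```swift
import Foundation

final class MockSpeechService: SpeechService {
    var cannedText = "I had a tuna sandwich and a banana for lunch yesterday"
    var shouldFail = false

    func startListening(onResult: @escaping (_ text: String, _ isFinal: Bool) -> Void,
                        onError: @escaping (Error) -> Void) {
        if shouldFail {
            onError(NSError(domain: "MockSpeechService", code: 1))
        } else {
            onResult(cannedText, true)  // deliver one final result immediately
        }
    }

    func stopListening() {}
}
```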

Future Directions

The most glaring weakness of current speech recognition technology is the need to rely on a server. If the connection is weak or not there, it just won’t work. As devices get more powerful, I predict this limitation will go away, although you will probably have to go through unofficial channels at first.

The applications for speech recognition are just barely being tapped. Think of any time you want to accomplish something but don’t want to stop what you’re doing to type on a touchscreen. That is an opportunity.

What ideas will you come up with?