Is multi-modality and network-based speech recognition the future?

I was recently asked where I see speech recognition heading in the near future. In my mind, the two directions I’m seeing have to do with who is using it and why they are using it.

On one hand, I see speech recognition becoming a critical player in the multi-modal market. This market is dominated by tech-savvy, young customers who are used to being always online, anywhere, on various devices and not only cell phones. They therefore bring to the phone the same expectations they have formed online. This can be observed in the amount of Web 2.0 and social media activity, where users are given the choice of their preferred input method (text, audio, video, etc.) while being offered innovative applications of the technology as well as well-crafted “Mashup” applications and services.

On the other hand, I see speech recognition evolving from a server architecture towards a networked model where the cell phone simply becomes the equivalent of a ‘web browser’ that gives users access to a whole suite of services that exist on the network cloud. This can be observed in the amount of consolidation activity taking place in our industry (Nuance and Bevocal, Microsoft and Tellme), in the integration of speech recognition as part of the platform itself or the Operating System (as seen in the recent Nuance competition and Microsoft’s Vista), as well as in the new offerings from startups such as Mobeus, which will soon offer speech-to-text capabilities for cell phones.

As explained in this video, Mobeus follows the network-based model: the cell phone contains a small ‘client’ that performs the voice capture (end-pointing, compression, etc.) and sends the utterance to a network of powerful servers, which process it and return the results to the client. This is a similar play to what AT&T has been offering its wireless subscribers as the #121 service (aka “Voice Info”), which is basically a shortcut into Tellme’s services.
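
To make the division of labor concrete, here is a rough Python sketch of that kind of thin-client pipeline. It is not Mobeus’s or Tellme’s actual implementation: the function names, the energy threshold, the frame size, and the use of zlib as a stand-in for a real speech codec are all assumptions made for illustration, and the ‘server’ is a stub that only reports how much audio it received.

```python
import struct
import zlib


def endpoint(samples, threshold=500, frame=160):
    """Trim leading and trailing silence (a very simple energy-based end-pointer).

    A frame counts as speech when its mean absolute amplitude reaches
    `threshold`. Both values are illustrative assumptions.
    """
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    voiced = [
        i for i, f in enumerate(frames)
        if sum(abs(s) for s in f) / len(f) >= threshold
    ]
    if not voiced:
        return []
    start = voiced[0] * frame
    end = (voiced[-1] + 1) * frame
    return samples[start:end]


def client_capture(samples):
    """Client side: end-point the audio, then compress it for the network."""
    speech = endpoint(samples)
    # Pack as 16-bit little-endian PCM; zlib stands in for a real speech codec.
    pcm = struct.pack(f"<{len(speech)}h", *speech)
    return zlib.compress(pcm)


def server_recognize(payload):
    """Server side (stub): decompress the audio and return a result.

    A real service would run a recognizer here; this placeholder only
    reports how many samples arrived.
    """
    pcm = zlib.decompress(payload)
    return {"text": "<transcript placeholder>", "samples": len(pcm) // 2}


if __name__ == "__main__":
    # Silence, then a burst of "speech", then silence again.
    audio = [0] * 320 + [1000] * 480 + [0] * 160
    print(server_recognize(client_capture(audio)))
```

The point of the sketch is only that the phone does the cheap work (finding where the utterance starts and ends, and shrinking it), while everything expensive happens on the network side.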

What do you think? Are you seeing other trends in our industry? Which of these models do you think will ultimately survive?

4 thoughts on “Is multi-modality and network-based speech recognition the future?”

  1. Interesting ideas. I think network-based speech is a prerequisite to any sort of wide adoption of speech rec. Multi-modal works well in hands-free settings such as driving or field service work. It doesn’t make as much sense for cell phones, which are typically held to the ear or (with Bluetooth) worn on a belt. That being said, one real holy grail is speech-to-text dictation of notes. There would be a huge market for the ability to speak an e-mail. This is not currently possible over our legacy 64 kb/s telecom network, with its poor audio quality and 4 kHz bandwidth. Network-based speech breaks this boundary by capturing high-quality audio and transmitting it as data to speech rec engines on the network.

    The car, because it is a private space, and the user’s hands and eyes are busy, offers huge opportunities for speech apps.

  2. I totally agree with you that speech-to-text is likely to remain very important as speech recognition technologies continue to evolve. But one very interesting trend I’ve been seeing among the new services where you leave a voice mail and then receive the text version of it on your cell phone (Jott, Simulscribe, etc.) is that the way they ‘solve’ the accuracy issue is simply by outsourcing the transcription effort to agents in another country, where the labor costs certainly justify using them as an alternative to any sort of automated technology. For example, one company taking advantage of this setup is Truemors: you can call in a rumor, and by using one of these services the company transcribes the contents of your message and publishes it on the website.

  3. This kind of conversation usually gets me debating internally about how the networks truly exchange data with each other. I guess it harks back to when the net was first being looked at by Berners-Lee and the primary objectives at CERN.

  4. lol, a few of the comments bloggers post are a little out there. Sometimes I wonder whether they truly read the content before writing, or whether they simply read the titles and write down the first thought that drifts into their heads. Nevertheless, it’s useful to read clever commentary once in a while rather than the same old blog vomit I frequently find on the net. I’m off to have fun with a few rounds of Facebook poker. Cheers.
