Archive for July, 2009

A very interesting report from Nielsen was recently published highlighting some of the challenges mobile users face when accessing web information.

Aside from the sad news about average success rates being around 59%, it was interesting to me to see how most of the mobile problems outlined in the report can actually be seen as opportunities to seriously consider the use of Speech Recognition.

I know most companies suggest Speech Recognition as the killer app for mobile devices, but I would argue that it should be seen instead as the ideal complementary mode of interaction when navigating the internet and retrieving information on mobile devices, not as the silver bullet that would solve all mobility hurdles.

For example, thinking about speech in the context of those problems raised in the report:

  • Small screens: Yes, small size is a natural result of being portable. Yet having a limited number of options at any given time and relying on short-term memory are the bread and butter of most Speech Recognition systems. Therefore, adding an audible element and allowing users to express themselves in more natural ways helps compensate for those visual limitations. Furthermore, multislot interactions and natural language understanding help alleviate the challenge of multiple windows and advanced behaviors present in purely visual interactions.
  • Awkward input (especially for typing): Once again, Speech Recognition shines here, since speech is the de facto way of interaction amongst humans. Words can easily trump visual counterparts such as menus, buttons, and links, not only because spoken interaction feels natural but also because speech avoids the inherent limitations of tiny keypads, trackballs and mini-keyboards.
  • Download delays: Speech cannot make screens download any faster, but it can help whenever information can be delivered in audible form. Users can keep interacting with the system and moving toward their intended goal, because prompts and logic can either be embedded in the device, requiring no network connectivity, or be optimized and compressed for faster delivery.

A very interesting debate was triggered by the recent tests (Part 1 and Part 2) of Google Voice performed by readers of the Gadgetwise blog.

The overall premise was for readers to call the article writer's phone number and leave a creative voicemail, in order to gauge the effectiveness of Google's voicemail transcription system.

Even though at first glance this seems to be another “why speech recognition isn’t ready for prime time” type of article (yes, I know the author claimed they wanted to test the boundaries of the technology, but as many readers pointed out, individual items such as the president's name aren't that far-fetched and should've worked), I think it also brings up some interesting issues often faced when testing speech recognition systems:

1) What are we testing? - This one very often depends on who you're talking to, particularly on the business side of things. Some team members care about containment rates (how many individuals stay in a system without having to talk to an agent), others care about transfer rates (the increase or decrease in the volume of calls going into the call center), and others care about customer experience (how long it takes for someone to accomplish their goal). A few (sorry to say, upper management included) even call systems to see how well they work when given odd statements or commands or, even worse, how closely the system matches their particular expectations (without taking into consideration how the system was designed in the first place). So for me, this is one of the most important aspects of any system, and it should be captured as part of the requirements phase: knowing what project owners want to test allows you to steer your design in the right direction (and push back when necessary as early as possible). For example, in the case of these Google Voice tests, there were some very interesting comments from readers because some felt they were testing the accuracy of the transcriptions, others thought the test should only involve how well the overall intent is captured, and others (sadly) thought they were testing how well speech recognition works nowadays.

2) How are we testing it? - This one depends a lot on what the answer to #1 might be. For example, in the case of Google Voice, I felt the test would've been much more valid if readers had been asked to forward samples of their own voicemails to the writer's voicemail (meaning real-world examples) instead of having them come up with messages, which seemed to turn into a challenge to see whose message broke the system the most. Going back to some of the things business owners normally want to test, the methods needed to test those items can vary significantly. For example, to test containment or transfer rates, one should look not only at the raw numbers but also at the reasons behind them: it's very different if the numbers are driven by users exceeding a failure threshold than if they are due to users pressing 0, or if they are truly due to business requirements whose proper behavior is to retain/transfer the user.

3) What do these results mean? - Particularly when dealing with numbers and percentages, the interpretation of results is very often tricky. For example, would you modify a menu if 50% of your users end up making the wrong selection? (I'm sure your gut reaction is “yes”, “of course”)… but what if that number is based on 2 out of 4 users that someone listened to during a morning's test? Similarly, we sometimes run into situations where decisions are based solely on someone's like or dislike (often C-level individuals) of how the system is performing (subjective analysis), without any consideration for the reasons behind the choice, the data from a much larger sample, or the fact that the system might still be in a pre-pilot phase and will eventually get tuned. I felt this was probably one of the main things lacking from the article.
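
To put a number on the "2 out of 4 users" problem, here is a minimal sketch of how much uncertainty sits behind an observed 50%. It is not taken from the article or from any particular reporting tool: it uses the Wilson score interval, a standard textbook method for proportions, and the function name `wilson_interval` and the 400-user comparison are illustrative assumptions of mine.

```python
import math


def wilson_interval(errors: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an observed error proportion.

    `errors` is the number of users who made the wrong selection and `total`
    is the number of users observed. z = 1.96 corresponds to roughly 95%.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= errors <= total:
        raise ValueError("errors must be between 0 and total")

    p = errors / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Same observed 50% error rate, very different sample sizes (hypothetical).
    for errors, total in [(2, 4), (200, 400)]:
        low, high = wilson_interval(errors, total)
        print(f"{errors}/{total} wrong selections: plausible range {low:.0%} to {high:.0%}")
```

With 2 out of 4 users the plausible range runs from roughly 15% to 85%, which is far too wide to justify redesigning a menu. With 200 out of 400 it narrows to roughly 45% to 55%, and the same "50%" becomes something worth acting on.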

The examples are definitely interesting (and sometimes funny), but I think it would've been worth doing some sort of analysis of the possible reasons behind some of those misrecognitions (line quality, odd pausing, users' accents, etc.), as well as giving a more detailed explanation of what the transcription process really is. Some readers might think the results reflect the accuracy of an advanced speech recognition engine, when in reality most transcription processes on the market involve a hybrid environment: the recognition engine might perform the first pass, and then human beings perform a second pass, reviewing what the machine recognized and/or interpreting those segments the machine might not have been able to recognize in the first place.

Have you tried it yet?