Google Voice and three truths about testing
Posted by: eolvera, in Customer Experience, Speech Industry, What's Broken
A very interesting debate was triggered by the recent tests (Part 1 and Part 2) of Google Voice performed by readers of the Gadgetwise blog.
The overall premise was for readers to call the article writer's phone number and leave a creative voicemail to gauge the effectiveness of Google's voicemail transcription system.
Even though at first glance this seems to be another “why speech recognition isn’t ready for prime time” type of article (yes, I know the author claimed they wanted to test the boundaries of the technology, but as many readers pointed out, individual items such as the president's name aren't that far-fetched and should've worked), I think it also brings up some interesting issues often faced when testing speech recognition systems:
1) What are we testing? This one very often depends on who you're talking to, particularly on the business side of things. Some team members care about containment rates (how many callers stay in the system without having to talk to an agent), others care about transfer rates (the increase or decrease in the volume of calls going into the call center), others care about customer experience (how long it takes someone to accomplish their goal), and a few (sorry to say, upper management included) call the system to see how well it works when given odd statements or commands or, even worse, how closely it matches their particular expectations (without taking into consideration how the system was designed in the first place). For me, this is one of the most important aspects of any system, and it should be captured during the requirements phase: knowing what project owners want to test allows you to steer your design in the right direction (and push back when necessary, as early as possible). In the case of these Google Voice tests, there were some very interesting comments from readers: some felt they were testing the accuracy of the transcriptions, others thought the test should only involve how well the overall intent was being captured, and some others (sadly) thought they were testing how speech recognition works nowadays.
2) How are we testing it? This one depends a lot on the answer to #1. For example, in the case of Google Voice, I felt the test would've been much more valid if readers had been asked to forward samples of their own voicemails to the writer's voicemail (meaning real-world examples) instead of coming up with messages, which seems to have turned into a contest to see who could break the system the most. Going back to the things business owners normally want to test, the methods needed to test them can vary significantly. For example, to test containment or transfer rates, one should look not only at the raw numbers but at the reasons behind them: it's very different if the numbers are driven by users exceeding a failure threshold than if they are due to users pressing 0, or if they are truly due to business requirements whose proper behavior is to retain or transfer the user.
3) What do these results mean? Particularly when dealing with numbers and percentages, the interpretation of results is very often tricky. For example, would you modify a menu if 50% of your users ended up making the wrong selection? (I'm sure your gut reaction is “yes”, “of course”)… but what if that number is based on 2 out of 4 users that someone listened to during a morning's test? Similarly, we sometimes run into situations where decisions are based solely on someone's like or dislike (often a C-level individual's) of how the system is performing (subjective analysis), without any consideration for the reasons behind the choice, the data from a much larger sample, or the fact that the system might still be in a pre-pilot phase and will eventually get tuned. I felt this was probably one of the main things lacking from the article.
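To put some numbers on the "2 out of 4" example in #3, here is a minimal sketch of how much uncertainty hides behind the same 50% at different sample sizes. It uses the Wilson score interval, which is my choice for illustration and not something the Gadgetwise article or any particular tuning process prescribes. The function name `wilson_interval` and the 200-out-of-400 comparison sample are hypothetical.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Approximate 95% Wilson score interval for an observed error rate.

    errors: number of callers who made the wrong selection (illustrative)
    n:      number of callers observed
    z:      normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if n <= 0:
        raise ValueError("need at least one observation")
    p = errors / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Same observed rate (50%), very different amounts of evidence.
for errors, n in [(2, 4), (200, 400)]:
    low, high = wilson_interval(errors, n)
    print(f"{errors}/{n}: plausible error rate between {low:.0%} and {high:.0%}")
```

With 2 out of 4 callers, the plausible range runs from roughly 15% to 85%, which is far too wide to justify redesigning a menu. With 200 out of 400 it narrows to roughly 45% to 55%, and the "yes, of course" reaction starts to be defensible.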
The examples are definitely interesting (and sometimes funny), but I think it would've been worth doing some sort of analysis of the possible reasons behind some of those misrecognitions (line quality, odd pausing, users' accents, etc.), as well as giving a more detailed explanation of what the transcription process really is. Some readers might think the results reflect the accuracy of an advanced speech recognition engine, when in reality most transcription processes on the market involve a hybrid environment: the recognition engine might perform the first pass, and then human beings perform a second pass, reviewing what the machine recognized and/or interpreting the segments the machine was not able to recognize in the first place.
Have you tried it yet?