daniel bigham.ca

Wolfram Search
March 9, 2009

Alongside Powerset comes Wolfram Search, due to launch in May. This is exciting stuff, and I can't wait to see what they've come up with. From the little I've read, it sounds as though they're actually modeling the knowledge and then using natural language parsing to allow the user to query that model. Very cool.

Microsoft Research at TechFest 2009
March 5, 2009

I was really pleased to read this article yesterday detailing some of the interesting things that Microsoft Research is working on.

The thing that I was most interested in, personally, was Commute UX which has a video on this page. See also this poster.

Basically, it's exactly the kind of use case I was working on this fall, but in the car rather than the kitchen. They've got a groovy little car simulator that they use to test the software out, which makes a lot of sense... testing it out in a real car would be a little dangerous!

Anyway, it's very cool to see Microsoft working on these problems, and I was impressed with their demo. The key thing for me is that it's a natural language voice interface that actually works well. Some cars on the market today have voice interfaces, but you need to know the specific command language to make them work, and even if you take the time to learn it, it's easy to forget.

This stuff is similar to the work of a local company here in Waterloo named iLane.

Evaluation of Grace: Part I: 3rd Party Technology
February 26, 2009

I have been using "Grace" for a couple of months now, so it's time to do some evaluation of the technologies I've used to make this application work. (The context here is that last year I wrote an application named "Grace" that runs on a computer in the kitchen and can be interacted with via voice.)

Here are the biggest challenges, the things that don't work well:

Did you say something?

An aspect of SAPI (5.1) that I have found very frustrating of late is how increasingly often it interprets/recognizes non-vocal noise as if it were speech. Back in January when I first started using Grace, this was a significant but manageable issue, but in the last couple of weeks it has made Grace almost unusable in the noisy kitchen environment. Simply walking across the room or opening a drawer causes SAPI to recognize the command "Grace". Put a glass on the kitchen counter and it will recognize the sentence "Open my inbox". This is where I draw the line: Behavior like that is ridiculous, especially the latter example. A couple of days ago I was making some bread and chatting with Meredith, and apparently it heard the sentence "Open my inbox" about 5 times.

I expect that one of the culprits here is that SAPI tries to "learn" over time, adjusting its internal probabilities so that words or phrases that it has heard more often are more likely to be recognized. The obvious problem with this approach is that once you have used a command or phrase a few dozen times, it becomes weighted so heavily that more and more often background noise will match the word or phrase, to the point that you start seeing behavior like I have described above. I believe there is a way to disable this adaptation, which I will likely have to do, but there is a downside to doing this, because I expect that for the most part, this adaptation has a positive effect on recognition rates.

Overall, this is a commentary on where voice recognition technology is at for use in environments that aren't perfectly quiet. If I were to assign a grade on how well SAPI protects itself from recognizing noise as speech, it would have to be an "F". More research needs to be done in this area.

Keeping the monitor off

Grace is primarily a voice interface: You speak a command or query, and it speaks back the answer. To make this work, the computer needs to be running, but there is no need for the monitor to be on unless and until information needs to be displayed to the user. Indeed, in today's world where the environment and energy conservation are important issues, it would be very wasteful to have a computer monitor on all day when it's not needed.

There are Windows APIs that a program can use to put an LCD monitor into and then later out of sleep mode, and at first glance, this seems to solve the problem: The software can keep the monitor off until information needs to be displayed, at which point, it can turn the monitor on. LCDs can come out of standby mode within a second or two -- perfect, right?

Unfortunately, SAPI contains a "feature" whereby audio input automatically takes the monitor out of standby mode. The reasoning is that if a computer is employing a voice interface, audio input is the equivalent of a mouse movement or keyboard key press. Thus, if you're in the kitchen and open a cupboard or even shift in your chair, the monitor turns back on.

The only workaround that I've come up with is to run a loop that tells the monitor to go to standby mode 20 times a second, so that when SAPI goes to bring the monitor out of standby mode, the software immediately overrides it. I worry though that this may be causing additional stress on the hardware. And even with this workaround in place, the software needs to make sure that a black window is completely obscuring the screen, otherwise when you move around in the kitchen the monitor is constantly flickering as it comes out of and then back into standby mode, displaying the Windows desktop for a fraction of a second each time. Gross.
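For illustration, the re-send loop could be sketched roughly as below. This is a minimal sketch, not the actual Grace code: it assumes the standby request is the Win32 WM_SYSCOMMAND / SC_MONITORPOWER message sent through ctypes, and the names send_standby, keep_monitor_off and should_stop are made up for the example. The default interval of 0.05 seconds corresponds to 20 requests a second.

```python
import sys
import time

# Win32 constants for the monitor-power system command (assumed mechanism).
HWND_BROADCAST = 0xFFFF
WM_SYSCOMMAND = 0x0112
SC_MONITORPOWER = 0xF170
MONITOR_OFF = 2  # lParam: 2 = power off, 1 = low power, -1 = on


def send_standby():
    """Ask Windows to power the monitor down; does nothing on other platforms."""
    if sys.platform != "win32":
        return False
    import ctypes
    ctypes.windll.user32.SendMessageW(
        HWND_BROADCAST, WM_SYSCOMMAND, SC_MONITORPOWER, MONITOR_OFF)
    return True


def keep_monitor_off(should_stop, send=send_standby, interval=0.05):
    """Re-send the standby request every `interval` seconds until should_stop() is true.

    `should_stop` is a hypothetical callback, e.g. "there is now something to display".
    Returns how many requests were sent.
    """
    sent = 0
    while not should_stop():
        send()
        sent += 1
        time.sleep(interval)
    return sent
```

Passing the sender in as a parameter keeps the loop testable without touching real hardware, which also makes it easier to experiment with a slower interval if the hardware-stress worry turns out to be justified.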

Microsoft: The conclusion here is that for SAPI to be used in an always-on environment where electricity needs to be conserved by keeping a monitor in standby mode, this setting needs to be configurable. Until that time, ugly ugly hacks are required.

When to listen

Another challenge is for the software to know when to listen and when not to listen. For example, if you are playing some music in the kitchen, you obviously don't want SAPI listening. Fortunately, iTunes offers a COM interface that allows the software to know when music starts and stops, so recognition can be enabled or disabled.
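The gating logic itself is simple, and a rough sketch looks something like this. It is not the actual Grace code: ListeningGate and the set_recognition_enabled callback are hypothetical names, and the on_play / on_stop methods are assumed to be wired to whatever play and stop notifications the player provides.

```python
class ListeningGate:
    """Pauses speech recognition while any audio source is playing."""

    def __init__(self, set_recognition_enabled):
        # `set_recognition_enabled` is a hypothetical callback taking a bool;
        # in a real system it would switch the recognizer on or off.
        self._set_enabled = set_recognition_enabled
        self._playing = set()
        self._set_enabled(True)

    def on_play(self, source):
        """Call from a player's 'playback started' notification."""
        was_idle = not self._playing
        self._playing.add(source)
        if was_idle:
            self._set_enabled(False)

    def on_stop(self, source):
        """Call from a player's 'playback stopped' notification."""
        self._playing.discard(source)
        if not self._playing:
            self._set_enabled(True)
```

Tracking a set of sources rather than a single flag means recognition only comes back once everything has gone quiet, which would matter if a second player were ever hooked in.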

Unfortunately, I'm not currently aware of any integrations for Windows Media Player, so there doesn't seem to be any way of being smart about stopping/starting recognition while listening to a radio station through Media Player. Perhaps there is a more direct way to accomplish this via DirectShow, etc.

...

Ok, so those are the challenges, the things that don't work very well. Here are the things that work pretty well, but have room for improvement:

...

Recognition accuracy

While far from perfect, I'm relatively happy with recognition accuracy, that is, when you are actually speaking to the software. Grace uses a fairly complex command and control grammar that allows for natural language commands and queries, and accuracy isn't bad. I'm sure this is an area of research that will improve over time, but I can live with where things are at.

One area that hasn't worked that well is numbers. For example, the recognizer seems to have a lot of difficulty distinguishing between words like "seventy" and "seventeen".

Occasionally it will recognize completely bizarre statements that are nothing even close to what I said, but this doesn't happen too often. Interestingly, accuracy seems to improve when commands and queries are longer rather than shorter. For instance, playing a song by saying "play the song Chariots of Fire" will result in fewer mis-recognitions than if the grammar allowed for "play Chariots of Fire". This is a nice attribute to have for a system that prefers commands and queries be spoken in natural language, but sometimes it does make more sense for a command to be short and concise, and it's frustrating when that translates to more mis-recognitions.

iTunes

iTunes has turned out to be an important component of a kitchen computer: Music playback, yes, but more importantly video podcasts. I can watch the nightly news by saying "Play the ABC news podcast", and likewise the NASA podcast and the TED podcast.

Having a COM interface has made interfacing with iTunes possible. Without a COM interface, there would have been some serious problems such as knowing when to listen and when not to listen. And as it turns out, many podcasts seem to have a relatively low volume, so the software can also adjust the system volume to an appropriate level when a podcast is being viewed, and then restore it to the default level when it stops being played.
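The volume handling amounts to "remember the current level, raise it, put it back afterwards". A minimal sketch, with hypothetical names (PodcastVolume, get_volume, set_volume) standing in for whatever the real system-volume calls are:

```python
class PodcastVolume:
    """Raises the volume while a podcast plays and restores it afterwards."""

    def __init__(self, get_volume, set_volume, podcast_level):
        # get_volume / set_volume are hypothetical callbacks wrapping the
        # system volume; podcast_level is the level to use during playback.
        self._get = get_volume
        self._set = set_volume
        self._level = podcast_level
        self._saved = None

    def on_start(self):
        """Call when a podcast starts playing."""
        if self._saved is None:  # don't overwrite the level we need to restore
            self._saved = self._get()
        self._set(self._level)

    def on_stop(self):
        """Call when the podcast stops; restores the earlier level."""
        if self._saved is not None:
            self._set(self._saved)
            self._saved = None
```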

While iTunes has been a very important piece, there are frustrations: For instance, if the Windows tray opens an information balloon, video playback drops to about 0.2 frames per second, and you have to get up and fight with the computer trying to close the darn thing before you can continue watching your program. It also seems impossible to make the video full screen via the COM API, which is unfortunate.

...

And finally, things that have worked very well:

...

iMac

The iMac hardware is really ideal for a kitchen installation. It's very quiet, pretty, and compact, all of which are very important. And of course, it now runs Windows.

What I'm most impressed by is how quiet it is: Probably an order of magnitude quieter than many desktop computers I've owned, and it ends up being virtually silent in the kitchen environment. Noise can easily be a show stopper for a kitchen installation since a noisy fan is extremely tiring to listen to, and many people, myself included, wouldn't have patience for it.

I also love how the iMac looks: The screen is a beautiful glossy black when it's off, which looks great in the corner of the kitchen, and the anodized aluminum looks similarly nice. I wouldn't want an ugly computer in the corner of my kitchen, so this is an important attribute for it to have.

As for compactness, I couldn't be more pleased: It saved me drilling a hole in my kitchen counter, which would have been required if I had used a desktop + LCD monitor. Even the keyboard is understated. Perfect.

And finally, the Apple remote! What a wonderful gadget, and this turns out to be quite important because there's no way to pause audio or video, skip tracks, or adjust the volume using a voice interface because SAPI isn't going to be able to hear you over the audio that the computer is playing.

My one gripe has been that the wireless adapter appears to have gone flaky and then died on me -- and what's with Apple mice? I replaced the standard mouse with a wireless Microsoft mouse.

Anyway, the iMac has been a very important component of this project and has worked remarkably well. It was Meredith's idea too, so good thinking Meredith!

VoiceTracker Array Microphone

I'm very happy with this purchase: It's an array microphone that even works from 12 feet away, albeit with moderate performance at times from that distance.

A project like this is only really possible with a high quality array microphone. I experimented with Bluetooth headsets, but:

1.	Who wants to wear one around the house? Not me.

2.	Recognition accuracy sucked.

Another alternative would have been to use a high quality wireless microphone, but the whole idea here is for the system to be hands free, because when you're in the kitchen, you're often busy doing things, or have wet or grimy hands and don't want to have to stop what you're doing to handle a device.

So bravo to the VoiceTracker team!

My one beef here is that the USB adapter they send you has been gimped so that it only produces 1/10th the volume that it would by default. This makes recognition from 12 feet lousy. I would normally just bypass this and plug the microphone directly into the iMac, but as I discovered, the iMac doesn't have a microphone input. How's that for frustrating! I ended up purchasing something called an 'iBooster' to get around this, but I'm unclear as to how well this is working. I wonder whether it is causing clipping when I'm actually close to the computer, and I'm a bit confused because the line input volume seems to jump around: Does Windows automatically adjust line input volume when it's used for SR? I'll need to do some more playing around with this.
