Improving Siri and OK Google Voice Assistants

Google I/O and Apple's WWDC are coming up fast, each raising expectations about what the next versions of their respective operating systems will include. Many folks out there have been asking for an API into Siri or OK Google. At the very least, it looks like Android may be providing such a capability. There is a talk scheduled for Google I/O titled "Building voice actions for your Android app" (https://events.google.com/io2015/schedule?sid=4c59a197-b6d4-e411-b87f-00155d5066d7#). The summary states:

In this talk, attendees will learn how to drive traffic to their Android app using voice actions. We'll cover how apps should inform Google which “OK Google” requests they can handle on phones, tablets, and watches.

It sounds promising, but will it be enough? And will Apple offer similar access so developers can call their apps through Siri? I've done a decent amount of work in this genre of mobile apps, most recently with Credit Sesame's Siri integration, where members who have the app installed can ask Siri "What's my credit score?" and the app will launch and tell them their credit score. It's pretty cool to see in action, but it's also very limited. I think voice assistants on mobile devices should be extended beyond predefined commands and merely launching apps.

 

The Case for Providing Developer Access to Voice Assistants

One of the core concepts for mobile design is honing the mission of the product to serve a very focused goal. This means eliminating complexity, deferring to the content or primary functionality, and hiding or eliminating secondary features. Voice assistants can provide the purest, most streamlined nature of the mobile experience. Consider the following typical flow of a mobile app:

  • Unlock phone
  • Swipe to screen the app is on
  • Tap on the app (or tap on a folder that contains the app, and then tap on the app)
  • App launches; find the feature you want to access, then tap to perform the function.

 

Basically, tap-tap-tap-tap with a few swipes in between. Sure, this is light years ahead of a desktop experience, but consider a voice-assistant-driven experience:

  • Tell phone what you want to do and the phone does it for you

You've basically eliminated all the pre-launch steps of navigating to find the app, eliminated all the steps to access the functionality of the app, and allowed the phone to do the heavy lifting for you. Now, one might consider this too simple an example, but I'm thinking of one segment that would benefit tremendously from this type of capability: the vision impaired.

There is a lot of existing material on how to provide good user experiences for people with vision impairments, especially on mobile. In fact, smartphones, with their cameras, GPS, compasses, and other sensors, offer some of the biggest benefits for the vision impaired. The phone can tell you which way to go, what you're looking at, and more. But the core accessibility experience still seems to be birthed from the desktop era, where the operating system attempts to describe the screen or whatever you are tapping on. This is not a good experience. Many people have been able to work with it, but that doesn't make it good. People deserve a better mobile experience.

Instead, it's almost better to imagine you had no screen at all. What if there was nothing to tap? How would you navigate and take advantage of the device's features and capabilities? You already know what you want to do, but the difference between a screen and voice is that with a screen you are doing the work to translate your intent into action to get a result. With voice access, the phone interprets your intent and provides you with the result.

 

The Current State of Mobile Voice Assistants

When Siri was embedded into iOS, many people had high expectations, myself included. Although it launched in beta, it soon became clear that it would not live up to the hype of even its own commercials. OK Google launched later with its own commercials, but considerably less fanfare and similar limitations. Microsoft even got into the game with Cortana, which provides the same basic features. In a nutshell these are:

  • Launch an app
  • Perform a set of predefined tasks integrated with existing system apps (set a reminder, make a phone call, send a text message, etc.)
  • Search a predefined set of services (Yelp, Fandango, etc.), with the possibility to launch an app if it is installed.
  • Search the web, Wikipedia, etc.
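
To make that ceiling concrete, here is roughly what plugging into one of these predefined hooks looks like on Android today: an activity receiving Google's "search inside this app" voice action. This is a minimal sketch of the receiving side only; it assumes the matching intent filter for com.google.android.gms.actions.SEARCH_ACTION is declared in the app's manifest, and showResultsFor is a hypothetical helper.

```java
// Minimal sketch of handling the existing "OK Google, search for X in <app>"
// hook on Android. The activity must declare an intent filter for
// "com.google.android.gms.actions.SEARCH_ACTION" in the manifest; this class
// only shows the receiving side.
import android.app.Activity;
import android.app.SearchManager;
import android.content.Intent;
import android.os.Bundle;

public class VoiceSearchActivity extends Activity {

    private static final String VOICE_SEARCH_ACTION =
            "com.google.android.gms.actions.SEARCH_ACTION";

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        Intent intent = getIntent();
        if (VOICE_SEARCH_ACTION.equals(intent.getAction())) {
            // The spoken query arrives as a plain string; the app is still
            // responsible for interpreting it.
            String query = intent.getStringExtra(SearchManager.QUERY);
            showResultsFor(query);
        }
    }

    private void showResultsFor(String query) {
        // App-specific search UI would go here (hypothetical helper).
    }
}
```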

 

This is limited access. Google has expanded it to include more features with its launch of Android Wear (https://developer.android.com/training/wearables/apps/voice.html), but it’s still limited access. Apple has included some capability for Siri with HomeKit (https://developer.apple.com/homekit/ui-guidelines/), but again, it’s limited to actions associated with HomeKit.

 

The Challenge of Building a Voice Assistant with Wide Access

The problem with building a feature like this for developers to access is that you are required to take semi-structured content and interpret it into actionable intents – in multiple languages. This is a luxury that an icon driven GUI largely does not need to worry about. Pictures can say a thousand words; it’s harder to draw an accurate picture from just a few words.

That's not to say there isn't existing work out there. The folks at API.ai (http://api.ai/) have been developing a system to integrate third-party apps with their voice assistant technology. There is also a significant body of pre-existing natural language processing work that translates web search phrases into searchable terms. Although the former might be easier to manage on a technical level, it still relies on the latter and doesn't get you integrated with the operating system. What is needed is to hone in on the core use cases that turn user intent into action.

 

Use Cases for Developer Enabled Voice Assistant Capability

Translating an intent into an action to provide a result has three basic stages:

  • Establishing the type of the intent
  • Finding the app to provide the action
  • Generating the result
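
As a rough sketch of how these three stages might chain together, the Java outline below uses entirely hypothetical types (ParsedIntent, ActionProvider); it illustrates the shape of the pipeline rather than any real Siri or OK Google API.

```java
// Hypothetical outline of the three stages; none of these types correspond
// to a real Siri or OK Google API.
import java.util.List;

public class VoicePipelineSketch {

    /** Stage 1 output: what the user meant, in structured form. */
    static class ParsedIntent {
        final String type;       // e.g. "QUESTION" or "COMMAND"
        final String utterance;  // the raw spoken text

        ParsedIntent(String type, String utterance) {
            this.type = type;
            this.utterance = utterance;
        }
    }

    /** Something (usually an installed app) that can act on an intent. */
    interface ActionProvider {
        boolean canHandle(ParsedIntent intent);
        String perform(ParsedIntent intent);
    }

    private final List<ActionProvider> registeredProviders;

    VoicePipelineSketch(List<ActionProvider> registeredProviders) {
        this.registeredProviders = registeredProviders;
    }

    String handle(String utterance) {
        // Stage 1: establish the type of the intent. This is a crude
        // stand-in; a fuller classifier sketch appears later in the post.
        String type = utterance.trim().endsWith("?") ? "QUESTION" : "COMMAND";
        ParsedIntent intent = new ParsedIntent(type, utterance);

        // Stage 2: find an app that has registered for this kind of intent.
        for (ActionProvider provider : registeredProviders) {
            if (provider.canHandle(intent)) {
                // Stage 3: the app generates the result.
                return provider.perform(intent);
            }
        }
        // Fallback when nothing installed can act on the intent.
        return "No matching app; searching the web for: " + utterance;
    }
}
```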

 

Establishing the Type of the Intent

When you look at the variety of queries that Siri or OK Google can respond to, there doesn't seem to be much structure. Take a step back, though, and there are really two main types of intent: questions and commands.

 

Questions               | Commands
What is …?              | Do …
Who is …?               | Set … (alarm / reminder / room temperature)
Where is …?             | Call …
Can I do …?             | Launch …
When does … happen?     | Send …
Tell me …               | Play (or watch) …

 

This is not an exhaustive list, but it is patterned after the existing queries that both Siri and OK Google have programmed. There may not even be much of a distinction, since a question is merely a command phrased as a question. The primary difference is that questions may more often be sent to an online service, whereas commands may rely more on an installed app to provide the response.
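
As an illustration of how thin this first stage could be, here is a keyword-based classifier whose patterns mirror the table above. Real assistants rely on far more sophisticated language models (and handle multiple languages), so treat this purely as a sketch of the question/command split.

```java
// A keyword-based stand-in for stage one, patterned on the table above.
// Real assistants use statistical language models; this is only a sketch.
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class IntentTypeSketch {

    enum IntentType { QUESTION, COMMAND, UNKNOWN }

    private static final List<String> QUESTION_PREFIXES = Arrays.asList(
            "what is", "who is", "where is", "can i", "when does", "tell me");

    private static final List<String> COMMAND_PREFIXES = Arrays.asList(
            "do", "set", "call", "launch", "send", "play", "watch");

    static IntentType classify(String utterance) {
        String lower = utterance.toLowerCase(Locale.US).trim();
        for (String prefix : QUESTION_PREFIXES) {
            if (lower.startsWith(prefix)) {
                return IntentType.QUESTION;
            }
        }
        for (String prefix : COMMAND_PREFIXES) {
            if (lower.startsWith(prefix)) {
                return IntentType.COMMAND;
            }
        }
        // Anything unrecognized could fall back to a plain web search.
        return IntentType.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("What is my credit score?")); // QUESTION
        System.out.println(classify("Call John at work"));        // COMMAND
    }
}
```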

 

Turning the Intent into an Action

Once an intent has been established, it's time to figure out how to turn it into an action. This is where the app developers start their work. There needs to be a set of core actions associated with a given set of intents. For example, a phone app could have a set of actions associated with "Call [name] at [location]" or "What is [name]'s cellphone number?". These actions would act as the primary corpus to drive an app response, and could be stored in the app bundle/manifest (like other intents and app-launching mechanisms), or alternatively within the app's metadata in its App Store / Google Play listing. The latter would allow the intents to be managed without releasing a new version of the app, and would let the stores extend their own search capabilities.
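
Below is a minimal sketch of what such a declared action might look like, assuming a hypothetical template syntax where slots are written as {name} and {location}. Nothing here corresponds to an actual manifest key or store metadata field; it only shows how one declared pattern could both advertise a capability and extract its parameters.

```java
// Sketch of how an app might declare the voice actions it supports, either
// in its bundle/manifest or in its store metadata, as suggested above.
// The pattern syntax ("call {name} at {location}") and these classes are
// hypothetical; they only illustrate the idea of a declared action corpus.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeclaredVoiceAction {

    private final String id;        // e.g. "com.example.phone.CALL_CONTACT"
    private final Pattern pattern;  // compiled from the declared template

    DeclaredVoiceAction(String id, String template) {
        this.id = id;
        // Turn "call {name} at {location}" into a regex with named groups.
        String regex = template.replaceAll("\\{(\\w+)\\}", "(?<$1>.+)");
        this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    /** Returns the extracted slots if the utterance matches, or null. */
    Map<String, String> match(String utterance, String... slotNames) {
        Matcher m = pattern.matcher(utterance.trim());
        if (!m.matches()) {
            return null;
        }
        Map<String, String> slots = new LinkedHashMap<>();
        for (String slot : slotNames) {
            slots.put(slot, m.group(slot));
        }
        return slots;
    }

    public static void main(String[] args) {
        DeclaredVoiceAction call = new DeclaredVoiceAction(
                "com.example.phone.CALL_CONTACT", "call {name} at {location}");
        // Prints {name=John, location=work} for the spoken phrase below.
        System.out.println(call.match("Call John at work", "name", "location"));
    }
}
```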

There are five main cases that the voice assistant will have to accommodate for every intent provided, in order to translate it into an action:

 

Use Case                    | Action
App is not installed        | Search store for provided intent, alternatively search web
App is installed            | Call the app to provide the result
Multiple apps installed     | Provide for a choice between the apps
App is on screen            | App provides the result directly
Different app is on screen  | Provide result or call app to provide result
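
In code, that dispatch might reduce to something like the sketch below, where the resolution values and inputs are hypothetical stand-ins for state the operating system already tracks (installed packages and the foreground app).

```java
// Sketch of the dispatch table above: given the device state for a matched
// intent, pick how the assistant should respond. The enum and method names
// are hypothetical.
import java.util.List;

public class IntentDispatchSketch {

    enum Resolution {
        SEARCH_STORE_OR_WEB,   // app is not installed
        CALL_INSTALLED_APP,    // a matching app is installed
        ASK_USER_TO_CHOOSE,    // multiple matching apps are installed
        USE_FOREGROUND_APP     // a matching app is already on screen
    }

    static Resolution resolve(List<String> installedMatches, String foregroundApp) {
        if (installedMatches.isEmpty()) {
            return Resolution.SEARCH_STORE_OR_WEB;
        }
        if (installedMatches.contains(foregroundApp)) {
            // The app on screen can provide the result directly.
            return Resolution.USE_FOREGROUND_APP;
        }
        if (installedMatches.size() > 1) {
            return Resolution.ASK_USER_TO_CHOOSE;
        }
        // Covers both "app is installed" and "a different app is on screen":
        // the matching installed app is called to provide the result.
        return Resolution.CALL_INSTALLED_APP;
    }
}
```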

 

There are other sub-cases, including whether the intent was provided from the lock screen or a wearable device, whether the device is plugged in, whether internet services are available, and so on. The cases above may apply globally, and allow for further extension later.

 

Generating the Result

This may sound like the easiest part, but it could actually be the most challenging. After all, you've only established that an app could provide a result. It is now up to the app to interpret the intent. This is probably why there hasn't been any breakthrough in providing an open API for developers. Should each developer write its own translation mechanism? That seems like a lot to ask and could hurt adoption of the API. On the other hand, a centralized service that parses an intent into command logic could reuse the previously mentioned infrastructure, but it could also become overly complicated and introduce latency issues. There are three major ways the voice assistant could provide a result to the user (assuming a related app is installed):

  • Open the related app: The app could be launched with the intent parameters, which would trigger the required feature
  • Query the related app: Instead of launching the app, the assistant could launch a sub-process of the app to get a result without presenting the app's UI. This could be like a database query.
  • Query a service associated with the app: Sometimes the app may not have the data required, like sports scores or movie times. The voice assistant could have direct access to an online service that would provide the answers. This could be more efficient than launching an app sub-process to get the answers from an online service.
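
The sketch below illustrates those three paths. Only the Intent launch in the first path is a real Android mechanism; the deep-link URI scheme, the headless query interface, and the service endpoint are all hypothetical.

```java
// Sketch of the three result paths described above. Only the Android Intent
// launch is a real platform mechanism; everything else is a stand-in.
import android.content.Context;
import android.content.Intent;
import android.net.Uri;

public class ResultStrategies {

    /** 1. Open the related app with the intent parameters in a deep link. */
    static void openApp(Context context, String deepLink) {
        // e.g. "creditapp://score" -- hypothetical URI scheme for illustration.
        Intent launch = new Intent(Intent.ACTION_VIEW, Uri.parse(deepLink));
        launch.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK);
        context.startActivity(launch);
    }

    /** 2. Query the app without showing its UI, like a database query.
     *  On Android this could be a ContentProvider or bound service; this
     *  interface is only a stand-in for that idea. */
    interface HeadlessQuery {
        String query(String intentPhrase);
    }

    static String queryApp(HeadlessQuery appQueryEndpoint, String intentPhrase) {
        return appQueryEndpoint.query(intentPhrase);
    }

    /** 3. Query an online service associated with the app directly,
     *  skipping the app process entirely (hypothetical endpoint). */
    static String queryService(String serviceUrl, String intentPhrase) {
        // A real implementation would make an HTTP request here and parse
        // the response; omitted to keep the sketch short.
        return "result from " + serviceUrl + " for: " + intentPhrase;
    }
}
```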

 

This of course does not preclude the ability of the voice assistant to query the web or the store to get other results or find an app that could provide the results.

 

Final Thoughts

We’ll see very soon whether Apple or Google are serious about extending the features of their voice assistants and enabling developers to start plugging into them. Their plans may differ from what I’ve mentioned, but I’ve tried to lay out a framework on how it might come together. Treating voice assistant technology as a platform, instead of a feature, and enabling developers to build on that platform is really the only way Siri and OK Google will be taken seriously and be used as deeply as they are in the commercials.
