Notes on a talk from Dr. Jason Mars of Clinc
- “The new frontier of free-form, voice enabled gaming”
- Goal: Reduce complexity of UI interactions
- Voice UI has been “one shot”—no context, dumb single commands
- Clinc rejects computational linguistics in favor of recurrent neural nets
- No part-of-speech trees, so you don’t have to encode ahead of time what the system should understand
- Goal is to understand language like a human does (including previously unknown words and phrases)
- Goal is to extract key semantic features directly, without the bottom-up parsing of traditional methods (rough sketch below)
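A minimal sketch of what “semantic features without parsing” can look like: a recurrent net that consumes raw token IDs and emits intent logits directly, with no parse tree in between. This is my own illustration in PyTorch, not Clinc’s actual model; all names and sizes are made up.

```python
# Illustrative only -- not Clinc's model. A GRU maps a token sequence
# straight to intent logits; no syntactic parse is ever built.
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128,
                 hidden_dim=256, num_intents=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_intents)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer tensor
        embedded = self.embed(token_ids)
        _, last_hidden = self.rnn(embedded)       # (1, batch, hidden)
        return self.head(last_hidden.squeeze(0))  # logits over intents

model = IntentClassifier()
utterance = torch.randint(0, 10_000, (1, 12))  # a fake 12-token utterance
print(model(utterance).shape)                  # torch.Size([1, 8])
```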
- Live demo
- Pretty good, but has problems with text-to-speech, and understanding context is hit or miss
- The demoer may be feeding the system more context in his speech than a real user would
- Conversations are less constrained, less linear than demos from Google and the like
- Looks like it’s 70–80% of the way there
- Use cases for us:
- Control a copilot
- Interact with ATC
- Virtual assistant for onboarding
- Multilingual support is “free” as long as you have training data for that language
- Demo in Booth P1657
- Runs on-premises… can even run inference on an Arduino (not training, obviously)
- Internal model
- Has conversational flow, each state of which is represented by a “competency” (a thing it can do), like add something to your cart or confirm a transaction
- This is “a thing the system knows how to do”
- These are all stateful: it doesn’t matter what order the info it needs arrives in
- Actions can be attached to each state transition, depending on what information has been gathered so far (see the sketch after this block)
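A minimal sketch of a “competency” as described above: a stateful slot-filling unit that accepts information in any order and fires an action once everything it needs is present. This is my reconstruction of the idea, not Clinc’s API; all names are illustrative, and `on_complete` stands in for the action attached to the final state transition.

```python
# Illustrative reconstruction of a stateful "competency" -- not Clinc's API.
from dataclasses import dataclass, field

@dataclass
class Competency:
    name: str
    required_slots: tuple
    slots: dict = field(default_factory=dict)

    def update(self, **extracted):
        """Merge whatever semantic features the model extracted this turn."""
        self.slots.update(extracted)
        if self.is_complete():
            return self.on_complete()  # action on the completing transition
        missing = [s for s in self.required_slots if s not in self.slots]
        return f"Still need: {', '.join(missing)}"

    def is_complete(self):
        return all(s in self.slots for s in self.required_slots)

    def on_complete(self):
        return f"[{self.name}] executing with {self.slots}"

# Order doesn't matter: fill "quantity" before "item" or vice versa.
add_to_cart = Competency("add_to_cart", required_slots=("item", "quantity"))
print(add_to_cart.update(quantity=2))      # Still need: item
print(add_to_cart.update(item="headset"))  # [add_to_cart] executing with ...
```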
- Speech recognition: the model is trained on text, so speech-to-text has to come first; they integrate with whatever speech-to-text engine is already available on the client side (usually the OS-provided one), as in the sketch below
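A sketch of that integration point: since the understanding model only ever sees text, any client-side speech-to-text engine can be plugged in as a plain callable. Both `speech_to_text` and `understand` here are hypothetical stand-ins for whatever the platform and model actually provide.

```python
# Illustrative pipeline: audio -> text (client-side STT) -> semantic frame.
# `speech_to_text` and `understand` are hypothetical stand-ins.
from typing import Callable

def handle_voice_turn(audio: bytes,
                      speech_to_text: Callable[[bytes], str],
                      understand: Callable[[str], dict]) -> dict:
    """Run one conversational turn."""
    text = speech_to_text(audio)  # e.g. the OS-provided recognizer
    return understand(text)       # the text-trained model does the rest

# Usage with stand-in implementations:
frame = handle_voice_turn(
    b"...",  # raw audio (elided)
    speech_to_text=lambda audio: "add two headsets to my cart",
    understand=lambda text: {"intent": "add_to_cart",
                             "item": "headset", "quantity": 2},
)
print(frame)
```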