
Grammar-based recognition (stt, tnl)

Grammar-based recognition constrains a recognizer to a set of words and structures defined by a grammar. Focusing recognition on a limited set of phrases can improve speed and accuracy at the expense of recognizing arbitrary input.

Build-capable LVCSR models support grammar-based recognition. STT models that support grammar decoding use the same grammar syntax. Some special symbols are model-specific; those cases are noted below.

Syntax

A context-free grammar is a set of rules that describes the sequences of words that a recognizer can match.

Definition

  1. Grammars use UTF-8 encoding.
  2. # marks the start of a comment, which extends to the end of the line.
  3. A grammar is a series of rules representing variable definitions. The final rule in a grammar specifies the recognition vocabulary and typically references rules defined earlier. It should include the sentence start (<s>) and end (</s>) markers.
  4. A rule is an assignment written as name = expr and terminated by a semicolon. name is a symbol, and expr is a sequence of symbols and operators. expr is a type of regular expression.
  5. A symbol is a sequence of characters that contains no whitespace or operators, optionally prefixed by the sigil $ or ~. A symbol without a sigil is called a terminal and is part of the recognition vocabulary, for example, temperature. Special symbols are predefined terminals that describe input characteristics such as pauses and the edges of an utterance.
  6. The $ sigil performs rule substitution at build time: the parser replaces $name with the value of the rule named name. Substitutions include an implicit grouping operator. In the grammar a = 1 | 2 | 3; b = <s> $a </s>;, rule b is equivalent to b = <s> (1 | 2 | 3) </s>;.
  7. The ~ sigil substitutes a named recognition class at runtime.
    • Each class is a recognizer with its own grammar, separate from the main grammar.
    • All references to a class use instances of the same class recognizer.
    • You can update each class in isolation, without having to recompile the main grammar.
    • If you have a large rule that's referenced multiple times, converting it to a class can significantly reduce build time.
    • Use classes to augment a recognition vocabulary at runtime. In a voice dialing application, for example, you can define the entire recognition grammar at build time but use ~contacts instead of a predefined list of contact names. Once loaded, the application can scan the address book and build only the ~contacts class.
    • Specify class definitions with grammar-stream.classname or phrases-stream.classname, for example phrases-stream.contacts.
    • LVCSR models can also use class libraries, which are pre-built binary class repositories supplied separately from a grammar. Class libraries are LVCSR-only.
  8. Operators include grouping parentheses, brackets, and braces, infix operators that indicate logical AND and OR between symbols, and postfix operators that change how the preceding symbol matches input. The operator precedence table lists the order and direction in which the parser applies operators.
  9. Grouping
    • ( ) Parentheses enclose items that are grouped together.
    • [ ] Square brackets enclose optional items. [...] is equivalent to (...)?.
    • { } Braces implement slot-capturing lightweight NLU markup.
      • {slotName a b c} makes a b c available as the nlu-slot-value of nlu-slot-name slotName when the recognizer matches a b c to the input audio.
      • You can nest NLU slots to an arbitrary depth.
      • Outermost slots are treated as intents, and all slots nested within an intent are treated as entities.
      • Each identified intent invokes handlers registered for ^nlu-intent and ^nlu-slot.
      • {rule} is shorthand for {rule $rule}.
      • With this grammar:
        seconds = 1 | 2 | 4 | 8 | half:0.5 a:? | a:? quarter:0.25 [of: a:];
        shutterSpeed = set shutter speed to {seconds} ( second | seconds );
        cmd = <s> {shutterSpeed} </s>;
        
        an utterance of "set shutter speed to a quarter of a second" will produce set shutter speed to 0.25 second as recognition output, with an additional ^nlu-intent callback for the top-level shutterSpeed slot:
        NLU intent: shutterSpeed (0) = set shutter speed to 0.25 second
        NLU entity:   seconds (0) = 0.25
        
  10. Infix operators
    • These are valid between symbols and may be surrounded by whitespace.
    • ^ is the conjunction operator and is implied between adjacent terminals: Grammar g = one two three; will recognize only the sequence "one two three".
    • | is the disjunctive operator. It separates alternative items. Grammar g = one | two | three; will recognize "one", or "two", or "three".
  11. Postfix operators
    • These directly follow a symbol without any intervening whitespace.
    • ? A question mark following a symbol makes that symbol optional: it matches zero or one occurrence of the symbol.
    • + A plus sign following a symbol or a group matches one or more repetitions of it.
    • * An asterisk following a symbol or a group matches zero or more repetitions of it.
    • : is the rewrite operator.
      • left:right recognizes symbol left but produces terminal right as a recognition result.
      • left: recognizes symbol left but rewrites that to an empty string, eliding left from the recognition result.
      • :right inserts right into the recognition result. If you say "one two three", grammar g = <s> one :mississippi two :mississippi three </s>; produces "one mississippi two mississippi three".
    • / A forward slash after a symbol, followed by a floating-point number, assigns a weight to that symbol. If the symbol uses the rewrite operator (:), the slash must follow the rewritten-to terminal, for example one:een/0.123. Weights are in the log-probability domain. To convert a probability \(p \in (0, 1]\) to a weight, use \(w = -\log_{10}(p)\). The default symbol weight is 0, which corresponds to a probability of 1.0.
  12. \ is the escape symbol. To include a literal special character in a grammar specification, escape it with a backslash. The characters that can be escaped this way are ^, |, *, +, ?, =, [ ], ( ), ;, #, and :.
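An application may generate grammar text itself, for example a ~contacts class built from an address book. In that case, any term containing special characters needs the backslash escapes described in item 12.

The Python sketch below escapes only the characters listed in item 12. It does not handle a leading $ or ~, braces, or /, because the list above does not say those characters can be escaped. The helper names and the single-rule layout it emits are illustrative assumptions, not the format that a grammar-stream or phrases-stream setting requires. Multi-word phrases need no extra parentheses, because the implied ^ binds more tightly than |.

```python
# Hypothetical helpers for generating grammar text; not part of any SDK.
# Escapable characters as listed in item 12 of the syntax definition.
SPECIAL_CHARS = frozenset("^|*+?=[]();#:")


def escape_term(word: str) -> str:
    """Prefix each listed special character in a terminal with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in word)


def alternatives_rule(name: str, phrases: list[str]) -> str:
    """Build 'name = phrase1 | phrase2 | ...;' from whitespace-separated phrases.

    Adjacent words are joined by the implied ^ operator, which binds more
    tightly than |, so each phrase forms one alternative without parentheses.
    """
    alternatives = [
        " ".join(escape_term(word) for word in phrase.split()) for phrase in phrases
    ]
    return f"{name} = {' | '.join(alternatives)};"


print(alternatives_rule("contacts", ["Jane Doe", "Bob (work)", "Ann"]))
# prints: contacts = Jane Doe | Bob \(work\) | Ann;
```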

See also: grammar-stream, phrases-stream, nlu-grammar-stream, ^nlu-intent, ^nlu-slot
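The build-time $ substitution in item 6 behaves like textual expansion with an implicit grouping operator. The following Python sketch models only that behavior, under simplifying assumptions:

- Rules appear before they are referenced.
- Comments, escapes, {rule} shorthand, and ~ classes are ignored.

The function name is hypothetical. This is not the grammar compiler's implementation.

```python
import re

# Simplified, hypothetical model of build-time $ substitution.
# Assumes rules are defined before they are referenced; ignores comments,
# escaped characters, {rule} shorthand, and ~class references.
RULE_RE = re.compile(r"([^\s=;]+)\s*=\s*([^;]*);")
# A $ reference ends at whitespace, a grouping character, an infix
# operator, a postfix operator, or the end of the rule.
REF_RE = re.compile(r"\$([^\s()\[\]{}|^;?+*:/]+)")


def expand_rules(grammar: str) -> dict[str, str]:
    """Return each rule's expression with $name replaced by (value)."""
    expanded: dict[str, str] = {}
    for name, expr in RULE_RE.findall(grammar):

        def substitute(match: re.Match) -> str:
            ref = match.group(1)
            if ref not in expanded:
                raise KeyError(f"rule ${ref} is not defined before {name}")
            # The substitution carries an implicit grouping operator.
            return f"({expanded[ref]})"

        expanded[name] = REF_RE.sub(substitute, expr.strip())
    return expanded


rules = expand_rules("a = 1 | 2 | 3; b = <s> $a </s>;")
print(f"b = {rules['b']};")  # b = <s> (1 | 2 | 3) </s>;
```

The implicit parentheses matter when a postfix operator follows the reference. For example, $digit+ repeats the whole substituted rule, not just its last terminal.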

Operator precedence

The following table lists the precedence and associativity of grammar operators. Operators are listed in descending precedence: level 0 is applied first and level 5 last.

Precedence  Operator  Description                     Associativity
0           :         Rewrite output
0           /         Symbol weight
1           ( )       Grouping
1           [ ]       Optional group
1           { }       Slot-capturing semantic markup
2           ?         Zero-or-one symbol              left-to-right
2           +         One-or-more symbols             left-to-right
2           *         Zero-or-more symbols            left-to-right
3           ^         And, implied between symbols    right-to-left
4           |         Alternative                     right-to-left
5           =         Rule assignment                 right-to-left

This grammar:

a = one | two three four;
g = <s> ( $a | five six) </s>;

will recognize only these phrases. The implied ^ binds more tightly than |, so two three four forms a single alternative:

one
two three four
five six
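A small enumerator makes it easy to check how precedence shapes the set of phrases a grammar accepts. The Python sketch below is a hypothetical, simplified model.

It handles:

- terminals
- the implied ^
- |
- ( ) and [ ]
- postfix ?
- $ references

It treats <s> and </s> as contributing no words. It does not support rewrites, weights, explicit ^, { } slots, ~ classes, other special symbols, or + and *. Because + and * describe unbounded repetition, enumeration only makes sense for this finite subset.

```python
import re

# Hypothetical, simplified enumerator for a finite subset of the grammar
# syntax: terminals, implied ^, |, ( ), [ ], postfix ?, and $name.
# Rewrites, weights, + and *, { }, explicit ^, and ~classes are not handled.
Phrases = set[tuple[str, ...]]
TOKEN_RE = re.compile(r"[()\[\]|?]|[^\s()\[\]|?]+")
SILENCE = {"<s>", "</s>"}  # sentence markers contribute no words


class Parser:
    def __init__(self, expr: str, rules: dict[str, Phrases]):
        self.tokens = TOKEN_RE.findall(expr)
        self.pos = 0
        self.rules = rules

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def alternation(self) -> Phrases:
        # Level 4: | has lower precedence than the implied ^.
        result = self.sequence()
        while self.peek() == "|":
            self.take("|")
            result = result | self.sequence()
        return result

    def sequence(self) -> Phrases:
        # Level 3: adjacent items are concatenated (implied ^).
        result: Phrases = {()}
        while self.peek() not in (None, "|", ")", "]"):
            item = self.postfix()
            result = {left + right for left in result for right in item}
        return result

    def postfix(self) -> Phrases:
        # Level 2: ? makes the preceding item optional.
        item = self.primary()
        while self.peek() == "?":
            self.take("?")
            item = item | {()}
        return item

    def primary(self) -> Phrases:
        # Level 1: grouping, optional groups, rule references, terminals.
        token = self.take()
        if token in ("(", "["):
            inner = self.alternation()
            self.take(")" if token == "(" else "]")
            return inner if token == "(" else inner | {()}
        if token.startswith("$"):
            return self.rules[token[1:]]  # whole rule acts as one group
        if token in SILENCE:
            return {()}
        return {(token,)}


def language(grammar: str) -> set[str]:
    """Enumerate the phrases matched by the final rule of a finite grammar."""
    rules: dict[str, Phrases] = {}
    last = None
    for name, expr in re.findall(r"([^\s=;]+)\s*=\s*([^;]*);", grammar):
        parser = Parser(expr, rules)
        rules[name] = parser.alternation()
        if parser.peek() is not None:
            raise ValueError(f"unexpected {parser.peek()!r} in rule {name}")
        last = name
    # The final rule specifies the recognition vocabulary.
    return {" ".join(words) for words in rules[last]}


grammar = """
a = one | two three four;
g = <s> ( $a | five six) </s>;
"""
for phrase in sorted(language(grammar)):
    print(phrase)
# five six
# one
# two three four
```

Parentheses override the default precedence. For example, if rule a were written as (one | two) three four, it would match one three four or two three four.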

Special symbols

A grammar can include these special symbols:

  • <s> - The silence at the start of a sentence.
  • </s> - The silence at the end of a sentence.
  • <wp> - Short pauses between words. The grammar compiler automatically adds these where needed, so there is no need to add them explicitly. Do not add <wp> to NLU grammars; use <pause/> instead.
  • <pause/> - An explicit short pause.
  • <no-match/> (tnl) - Matches when none of the alternatives is likely (that is, "none of the above").
    • Recognition results at the phrase level can include <no-match/> even if the grammar does not use this symbol explicitly. This indicates that the result was rejected because of search.frame-nota, or that RAM or CPU constraints limited the recognizer's ability to produce a result.
  • <unknown/> (tnl) - Similar to <no-match/>. In some LVCSR models, the threshold for deciding whether this symbol matches better than any alternative differs from that of <no-match/>.
    • (stt) In STT grammars, <unknown/> matches an out-of-grammar word span at a specific point. Use it when the grammar should keep matching even if part of the utterance is not in the fixed vocabulary. In word-level STT models, <unknown/> expands to the primitive wildcard sequence <unknown-start/> <unknown-cont/>*. In character-level STT models, it expands to (<unknown-start/> | <unknown-cont/>).
    • (stt) By default, <unknown/> carries an out-of-vocabulary (OOV) penalty, so in-grammar words win when they fit the audio. Add an explicit symbol weight, for example <unknown/>/2, to tune this penalty.
    • (stt) In NLU output, the matched word is emitted unless the grammar uses a rewrite. Use <unknown/>: to match the word and drop it from the NLU result.
  • <unknown-start/> / <unknown-cont/> (stt) - Primitive OOV wildcard symbols used by <unknown/>. Use them directly only when you need explicit control over word-boundary matching or continuation weighting.
  • <dictation/> (stt) - Hands off recognition to the STT model's statistical language component for free-form speech. This is a one-way hand-off; matching does not return to the grammar afterward.
  • . - When used with lightweight NLU grammars, a single period matches any input word. If . is also used as the output label, the matched input word is echoed. Use .:* to match any input words and remove them from the NLU result.

Out-of-grammar words in STT grammars

Use <unknown/> to allow an STT grammar slot to absorb words outside the fixed grammar vocabulary. Add the empty rewrite operator (:) when the grammar should match the unknown word but omit it from the captured value:

digit = one | two | three | four | five | six | seven | eight | nine | zero | <unknown/>:;
g = <s> {number $digit+} </s>;
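The empty rewrite on <unknown/>: works like the rewrites in the shutter-speed and mississippi examples. A simple mental model helps: picture a matched path as a list of (input, output) pairs.

- left:right emits right.
- left: emits nothing.
- :right emits right without consuming an input word.

The Python sketch below implements only that model. It uses hypothetical names and a made-up out-of-grammar word, "uh". It is not the recognizer's API or its internal representation.

```python
from typing import Optional

# Mental model only: each matched arc is (input_word, output_label).
# input_word is None for an insertion (:right); output_label is "" for an
# elision (left:), such as the empty rewrite on <unknown/>:.
Arc = tuple[Optional[str], str]


def render_output(path: list[Arc]) -> str:
    """Join the non-empty output labels, as a recognition result would read."""
    return " ".join(label for _, label in path if label)


def heard_words(path: list[Arc]) -> str:
    """Join the input words that were matched against the audio."""
    return " ".join(word for word, _ in path if word is not None)


# "set shutter speed to a quarter of a second" via a:? quarter:0.25 [of: a:]
shutter = [("set", "set"), ("shutter", "shutter"), ("speed", "speed"),
           ("to", "to"), ("a", ""), ("quarter", "0.25"), ("of", ""),
           ("a", ""), ("second", "second")]
# "one two three" via one :mississippi two :mississippi three
counting = [("one", "one"), (None, "mississippi"), ("two", "two"),
            (None, "mississippi"), ("three", "three")]
# "one uh two", where the made-up word "uh" is absorbed by <unknown/>:
digits = [("one", "one"), ("uh", ""), ("two", "two")]

print(render_output(shutter))   # set shutter speed to 0.25 second
print(render_output(counting))  # one mississippi two mississippi three
print(render_output(digits))    # one two
```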

A non-zero weight overrides the default OOV penalty for that arc:

item = <unknown/>/2;
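The conversion in the syntax definition, \(w = -\log_{10}(p)\), shows what this weight means. A weight of 2 corresponds to a probability of \(10^{-2} = 0.01\), so larger weights mean lower probability and act as a penalty. Similarly, one:een/0.123 corresponds to a probability of about 0.753.

The helpers below are a hypothetical Python sketch of that arithmetic and of formatting a weighted symbol. The three-significant-digit formatting is an arbitrary choice.

```python
import math


def weight_from_probability(p: float) -> float:
    """Convert a probability in (0, 1] to a symbol weight: w = -log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # Adding 0.0 turns -0.0 into 0.0 when p == 1.
    return -math.log10(p) + 0.0


def probability_from_weight(w: float) -> float:
    """Invert the conversion: p = 10 ** -w."""
    return 10.0 ** -w


def weighted_symbol(symbol: str, p: float) -> str:
    """Hypothetical helper: append /weight to a symbol (after any rewrite)."""
    return f"{symbol}/{weight_from_probability(p):.3g}"


print(f"{weight_from_probability(0.5):.3f}")    # 0.301
print(f"{probability_from_weight(2):.4f}")      # 0.0100
print(f"{probability_from_weight(0.123):.3f}")  # 0.753
print(weighted_symbol("<unknown/>", 0.01))      # <unknown/>/2
```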