LVCSR tnl¶

These recognizers use a phonetic acoustic model and an FST vocabulary decoder. They are suitable for small to medium vocabulary tasks, but not for unconstrained audio transcription.

These models have task-type==lvcsr and filenames that by convention match lvcsr-*.snsr.

You can create LVCSR recognizers with VoiceHub or by specifying a grammar with a build-capable¹ model.

LVCSR recognizers include support for decoding with statistical language models, but Sensory does not distribute the tools used to create these². Language models can provide improved accuracy for constrained target domains. For transcription-type tasks, an STT model is a better fit.

The Sensory FST decoder supports hybrid models that contain both grammar-based and language model components.

LVCSR models included in this distribution.

Operation¶

flowchart TD
    start((start))
    fetch[/samples from ->audio-pcm/]
    audio(^sample-count)
    process
    partial(^result-partial)
    intent(^nlu-intent)
    slot(^nlu-slot)
    result(^result)
    nlu{NLU<br>match?}
    start --> fetch
    fetch --> audio
    audio --> process
    process --> fetch
    process -->|hypothesis| partial
    partial --> fetch
    process -->|VAD endpoint<br>or STREAM_END| nlu
    nlu -->|yes| intent
    nlu -->|no| result
    intent --> slot
    slot --> result
    slot -->|more| intent
    result --> fetch

Recognition flow.

1. Read audio data from ->audio-pcm.
2. Invoke ^sample-count.
3. Invoke ^result-partial with interim recognition hypotheses every partial-result-interval ms.
4. Continue processing until STREAM_END occurs on ->audio-pcm, one of the event handlers returns a code other than OK, or an external VAD detects a speech endpoint.
5. If NLU is configured, invoke ^nlu-intent and ^nlu-slot for each top-level result that matches.
6. Invoke ^result with the final recognition hypothesis.
7. Resume processing from step 1.
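
The order in which the steps above fire events can be illustrated with a toy simulation. This is a minimal sketch and not the SDK API: `Chunk`, `run_flow`, the `nlu` callback, and the string `"OK"` are hypothetical stand-ins. It also makes several simplifying assumptions: hypotheses are supplied as pre-computed input, ^sample-count receives a running total, a partial result fires on every read rather than every partial-result-interval ms, and the loop simply stops when a handler returns anything other than OK.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple

OK = "OK"  # stand-in for the SDK's success code; any other value stops the loop


@dataclass
class Chunk:
    samples: int            # audio samples delivered by this read
    hypothesis: str         # toy recognizer output after this read
    endpoint: bool = False  # True when an external VAD reports end of speech


Event = Tuple[str, object]
Nlu = Callable[[str], Optional[Tuple[str, Dict[str, str]]]]


def run_flow(chunks: Iterable[Chunk],
             handlers: Dict[str, Callable[[object], str]],
             nlu: Optional[Nlu] = None) -> List[Event]:
    """Replay the recognition flow over pre-computed chunks; return fired events."""
    fired: List[Event] = []

    def emit(name: str, value: object) -> bool:
        # Record the event, then call the optional handler and check its code.
        fired.append((name, value))
        handler = handlers.get(name)
        return handler is None or handler(value) == OK

    def finish(text: str) -> bool:
        match = nlu(text) if nlu else None
        if match:                                   # step 5: NLU events
            intent, slots = match
            if not emit("^nlu-intent", intent):
                return False
            for slot, value in slots.items():
                if not emit("^nlu-slot", (slot, value)):
                    return False
        return emit("^result", text)                # step 6: final hypothesis

    total, text, pending = 0, "", False
    for chunk in chunks:                            # step 1: read audio
        total += chunk.samples
        text, pending = chunk.hypothesis, True
        if not emit("^sample-count", total):        # step 2
            return fired
        if chunk.endpoint:                          # step 4: VAD endpoint
            if not finish(text):
                return fired
            text, pending = "", False               # step 7: resume from step 1
        elif not emit("^result-partial", text):     # step 3
            return fired
    if pending:                                     # step 4: STREAM_END
        finish(text)
    return fired
```

Calling `run_flow` with an empty `handlers` dictionary returns the full event sequence, which makes it easy to see that ^result is only produced at a VAD endpoint or at the end of the stream.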

Note

LVCSR recognizers do not produce a final recognition hypothesis until they run out of audio samples to process, or an external VAD detects a speech endpoint.

With live audio you should use these with a VAD template such as tpl-vad-lvcsr, tpl-opt-spot-vad-lvcsr, or tpl-spot-vad-lvcsr.

Settings¶

^nlu-intent, ^nlu-slot, ^result, ^result-partial, ^sample-count

none

audio-stream, audio-stream-first, audio-stream-last

->audio-pcm, audio-stream-from, audio-stream-to, grammar-stream, phrases-stream

ac-prune-top-k, audio-stream-size, complete-only, partial-result-interval, ram-limit, samples-per-second, search.frame-nota, show-silence

lvcsr

live-spot.c, snsr-eval.c, PhraseSpot.java, segmentSpottedAudio.java

Notes¶

Sensory optimizes hybrid models that include a background component only for detecting speech that is not in the specified grammar, not for transcribing it. These models report an nlu-intent-name of background when they detect out-of-grammar utterances. Do not rely on the recognition text returned for out-of-grammar utterances, as it will have a high word error rate. Consider using STT for transcription tasks instead.

phrases-stream provides a convenient way to specify a recognition vocabulary from an exhaustive list of alternative utterances.

LVCSR grammar-based recognition 6.7.0¶

Sensory's LVCSR models use grammars to constrain the possible utterances they can recognize. Focusing on a limited set of words and structures defined in these grammars improves recognition speed and accuracy at the expense of recognizing arbitrary input.

You can create a custom recognizer by specifying a fixed grammar during development if the recognition vocabulary is entirely known, or at runtime if it is not. You can also use a hybrid approach and build the invariant parts during development, and delay adding variable parts (such as a list of favorite TV channels) until runtime.

See Grammar-based recognition for the shared grammar syntax, operator precedence, NLU markup, and special symbols supported by grammar-based LVCSR and STT recognizers.

Creating a recognizer¶

Create a grammar-based recognizer using the command-line tools. This example uses data/grammars/enrollments.txt which contains a sample grammar specification for the enrollment recordings in data/enrollments/.

To create a custom recognizer using this grammar with snsr-edit, specify an LVCSR model that supports building and grammar-stream.

data/grammars/enrollments.txt

# LVCSR grammar specification for test utterances in data/enrollments/
#
# In a tpl-spot-vad-lvcsr pipeline the prefix would be consumed by the spotter.
prefix = armadillo | jackalope | terminator;

# List of known utterances in the *-c.wav files.
sentence =
 18 percent of 643 |
 call the nearest target |
 how far away is winco |
 play more songs by this artist |
 record a video |
 start a timer for 20 minutes |
 i'm running low on gas |
 cancel all my meetings on friday |
 directions to susan's house |
 do i have any new texts |
 open my calendar to next week |
 set an alarm for 6 am tomorrow;

# Match the prefix and zero or one of the sentences.
# <s> and </s> are sentence start and end markers that
# match silence and small amounts of extraneous speech.
g = <s> $prefix $sentence? </s>;

% cd $HOME/Sensory/TrulyNaturalSDK/7.9.0-pre.0+19.ged1a5d37de

% bin/snsr-edit -vv -t model/lvcsr-build-enUS-14.0.2-5MB.snsr \
    -f grammar-stream data/grammars/enrollments.txt \
    -o lvcsr-enrollments.snsr
Loading "model/lvcsr-build-enUS-14.0.2-5MB.snsr" as the template model.
Loading "data/grammars/enrollments.txt" into setting "grammar-stream".
Saved edited model to "lvcsr-enrollments.snsr".

Run the new model with snsr-eval:

% bin/snsr-eval -t lvcsr-enrollments.snsr \
    -s partial-result-interval=0 \
    data/enrollments/armadillo-1-3-c.wav
   165   2745 armadillo play more songs by this artist

Setting partial-result-interval=0 shows only the final recognition hypothesis.

For small grammars such as this the build time is negligible. snsr-eval can build and run the recognizer in a single operation:

% bin/snsr-eval -t model/lvcsr-build-enUS-14.0.2-5MB.snsr \
    -f grammar-stream data/grammars/enrollments.txt \
    -s partial-result-interval=0 \
    data/enrollments/armadillo-1-3-c.wav
   165   2745 armadillo play more songs by this artist
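
When scripting around snsr-eval, the result lines shown above can be split into fields. The following is a minimal sketch under stated assumptions: the two leading integers are taken to be the begin and end positions of the utterance, and the parenthesized value that appears with -v is taken to be a score. Neither interpretation is spelled out in this section, and the names `Hypothesis` and `parse_result` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Hypothesis(NamedTuple):
    begin: int               # first leading integer (assumed start of the utterance)
    end: int                 # second leading integer (assumed end of the utterance)
    score: Optional[float]   # parenthesized value printed with -v, if present
    text: str                # recognized words


# Two integers, an optional "(number)" field, then the recognized text.
LINE = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+"
    r"(?:\((\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\)\s+)?"
    r"(.+?)\s*$"
)


def parse_result(line: str) -> Optional[Hypothesis]:
    """Parse one snsr-eval result line; return None for lines in another format."""
    m = LINE.match(line)
    if m is None:
        return None
    begin, end, score, text = m.groups()
    return Hypothesis(int(begin), int(end), float(score) if score else None, text)


if __name__ == "__main__":
    print(parse_result("   165   2745 armadillo play more songs by this artist"))
```

Lines in other formats, such as the `NLU intent:` lines printed for NLU matches, do not start with two integers and so return `None`.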

Classes¶

A symbol that starts with the tilde ~ sigil specifies a recognition class. Class recognizers have their own grammar specifications, separate from the top-level grammar. The behavior of a class-based recognizer is similar to that specified by a rule. Classes, however, can be updated without recompiling the rest of the grammar, and all references to a class use the same recognizer. This can reduce the recognizer size and improve build speed.

This example uses a modified enrollment grammar which references two toy classes: ~number and ~place:

enrollments-class.txt

# LVCSR grammar specification for test utterances in data/enrollments/
# This references two class sub-recognizers: ~number and ~place
#
# In a tpl-spot-vad-lvcsr pipeline the prefix would be consumed by the spotter.
prefix = armadillo | jackalope | terminator;

# List of known utterances in the *-c.wav files.
sentence =
 ~number percent of ~number |
 call the nearest ~place |
 how far away is ~place |
 play more songs by this artist |
 record a video |
 start a timer for ~number minutes |
 i'm running low on gas |
 cancel all my meetings on friday |
 directions to ~place |
 do i have any new texts |
 open my calendar to next week |
 set an alarm for ~number am tomorrow;

# Match the prefix and zero or one of the sentences.
# <s> and </s> are sentence start and end markers that
# match silence and small amounts of extraneous speech.
g = <s> $prefix $sentence? </s>;

place.txt

# Example place name class recognizer.

g = target | winco | susan's house;

The ~number and ~place classes referenced in enrollments-class.txt create two new dynamic settings for these classes: grammar-stream.number and grammar-stream.place. Specify these to create a complete recognizer:

% snsr-edit -v -t model/lvcsr-build-enUS-14.0.2-5MB.snsr \
    -f grammar-stream enrollments-class.txt \
    -f grammar-stream.place place.txt \
    -g grammar-stream.number "g = 18 | 643 | 20 | 6;" \
    -o lvcsr-enrollments-class.snsr
Output written to "lvcsr-enrollments-class.snsr".

snsr-edit's -g option sets the grammar-stream.number stream to a string argument. A file can also be used for the number grammar.
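
Because a class grammar can be passed as a string, it is straightforward to generate one from a list that is only known at runtime. The sketch below is a minimal illustration: the helper names `class_grammar` and `snsr_edit_args` are hypothetical, it uses only the snsr-edit flags shown above (-t, -f, -g, -o), and it assumes the list items need no escaping in the grammar syntax.

```python
import shlex
from typing import Dict, Iterable, List


def class_grammar(alternatives: Iterable[str]) -> str:
    """Join alternatives into a one-rule class grammar such as 'g = a | b;'."""
    items = [a.strip() for a in alternatives if a.strip()]
    if not items:
        raise ValueError("a class grammar needs at least one alternative")
    return "g = " + " | ".join(items) + ";"


def snsr_edit_args(template: str, grammar_file: str,
                   class_grammars: Dict[str, str], output: str) -> List[str]:
    """Build an snsr-edit argument list that sets one -g option per class."""
    args = ["snsr-edit", "-t", template, "-f", "grammar-stream", grammar_file]
    for name, grammar in class_grammars.items():
        # Each class ~name is configured through the grammar-stream.name setting.
        args += ["-g", f"grammar-stream.{name}", grammar]
    return args + ["-o", output]


if __name__ == "__main__":
    numbers = class_grammar(["18", "643", "20", "6"])
    args = snsr_edit_args("model/lvcsr-build-enUS-14.0.2-5MB.snsr",
                          "enrollments-class.txt",
                          {"number": numbers},
                          "lvcsr-enrollments-class.snsr")
    # Print a shell-quoted command line for inspection; pass args to
    # subprocess.run() to execute it.
    print(shlex.join(args))
```

Building the command as a list rather than a single string avoids shell-quoting problems with the `|` and `;` characters in the grammar.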

Run the recognizer:

% snsr-eval -v -t lvcsr-enrollments-class.snsr \
    -s partial-result-interval=0 \
    data/enrollments/armadillo-1-0-c.wav
   375   3150 (1.863e-08) armadillo 18 percent of 643

Class libraries 6.15.0¶

TrulyNatural 6.15.0 introduced support for pre-built binary class repositories. These contain classes built from frequently used grammar fragments such as dates, times, and numbers.

Class libraries are supported by LVCSR models only. Load binary class repositories into the same Session as an LVCSR model to add this capability to the model. If a grammar references a class that's not explicitly defined, the class name is looked up in the provided class library or libraries. System class libraries provided by Sensory use a prefix of s. for all class names.

See lvcsr-lib-enUS-14.0.2.snsr for a description of the classes used below.

class-lib.txt

# Example recognizer with classes from a class library
call = call {number ~s.phone-number};
emergency = ~s.call-emergency;
timer = {timer ~s.timer-phrases};
commands = {call} | {emergency} | $timer;
g = <s> $commands </s>;

This example uses live audio, so it needs snsr-eval's -a flag to add a VAD to find the end of each utterance and signal the recognizer to produce a final hypothesis.

% snsr-eval -a -t model/lvcsr-build-enUS-14.0.2-5MB.snsr \
    -t model/lvcsr-lib-enUS-14.0.2.snsr \
    -f grammar-stream class-lib.txt \
    -s partial-result-interval=0

# Say: Call 1 800 555 1212
NLU intent: call (0) = call one eight hundred five five five one two one two
NLU entity:   number (0) = one eight hundred five five five one two one two
  3360   6855 call one eight hundred five five five one two one two

# Say: Set a timer for 31 minutes.
NLU intent: timer (0) = set a timer for thirty one minutes
 14610  16770 set a timer for thirty one minutes

# Say: Call the fire department.
NLU intent: emergency (0) = call the fire department
 24540  25890 call the fire department
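
Before building a recognizer it can help to list the classes a grammar references, so you know which need a grammar-stream.NAME setting and which are expected to come from a class library. The sketch below is a rough illustration only. It assumes that `#` always starts a comment, that class names consist of letters, digits, `_`, `.`, and `-`, and that every class without the `s.` prefix is user-defined. A class without that prefix could also be resolved from another class library, which this sketch does not model. The function names are hypothetical.

```python
import re
from typing import Dict, List

# A class reference is the ~ sigil followed by the class name.
CLASS_REF = re.compile(r"~([A-Za-z0-9_.\-]+)")


def class_references(grammar: str) -> List[str]:
    """Return the distinct class names referenced in a grammar, in order."""
    names: List[str] = []
    for line in grammar.splitlines():
        code = line.split("#", 1)[0]  # drop comments (assumes # is never quoted)
        for name in CLASS_REF.findall(code):
            if name not in names:
                names.append(name)
    return names


def split_classes(names: List[str]) -> Dict[str, List[str]]:
    """Separate s.-prefixed system classes from classes that need a setting."""
    return {
        "library": [n for n in names if n.startswith("s.")],
        "settings": [f"grammar-stream.{n}" for n in names if not n.startswith("s.")],
    }


if __name__ == "__main__":
    grammar = """
    # Example recognizer with classes from a class library
    call = call {number ~s.phone-number};
    timer = set a timer for ~number minutes;
    g = <s> $call | $timer </s>;
    """
    print(split_classes(class_references(grammar)))
```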


Configuring class-based recognition with the C API:

SnsrSession s;

snsrNew(&s);
snsrLoad(s, snsrStreamFromFileName("model/tpl-vad-lvcsr-3.17.0.snsr", "r"));
snsrSetStream(s, SNSR_SLOT_0,
              snsrStreamFromFileName("model/lvcsr-build-enUS-14.0.2-5MB.snsr", "r"));
snsrLoad(s, snsrStreamFromFileName("model/lvcsr-lib-enUS-14.0.2.snsr", "r"));
snsrSetStream(s, SNSR_GRAMMAR_STREAM,
              snsrStreamFromFileName("class-lib.txt", "r"));
if (snsrRC(s) != SNSR_RC_OK) {
    fprintf(stderr, "ERROR: %s\n", snsrErrorDetail(s));
    return snsrRC(s);
}

Configuring class-based recognition with the Java API:

SnsrSession s = new SnsrSession();
try {
    s.load(SnsrStream.fromFileName("model/tpl-vad-lvcsr-3.17.0.snsr", "r"));
    s.setStream(Snsr.SLOT_0,
                SnsrStream.fromFileName("model/lvcsr-build-enUS-14.0.2-5MB.snsr", "r"));
    s.load(SnsrStream.fromFileName("model/lvcsr-lib-enUS-14.0.2.snsr", "r"));
    s.setStream(Snsr.GRAMMAR_STREAM,
                SnsrStream.fromFileName("class-lib.txt", "r"));
} catch (IOException e) {
    e.printStackTrace();
    return s.rC();
}

Configuring class-based recognition with the Python API:

try:
    with snsr.Session() as s:
        s.load("model/tpl-vad-lvcsr-3.17.0.snsr")
        s.set_stream(
            snsr.SLOT_0,
            snsr.Stream.from_filename("model/lvcsr-build-enUS-14.0.2-5MB.snsr", "r"),
        )
        s.load("model/lvcsr-lib-enUS-14.0.2.snsr")
        s.set_stream(
            snsr.GRAMMAR_STREAM,
            snsr.Stream.from_filename("class-lib.txt", "r"),
        )
except snsr.Error as e:
    print(f"ERROR: {e.message}")

¹ LVCSR models created by VoiceHub include build components only if the grammar references at least one user-defined class, such as ~dynamic-1. If the grammar contains no unresolved classes, VoiceHub removes the build components to reduce model file size and RAM use.
² Contact your sales representative if you would like to explore using a custom language model for your application.