---
source_path: "models/types/stt.md"
canonical_url: "https://doc.sensory.com/tnl/7.9/models/types/stt/"
---

# Speech To Text _(STT only)_

STT models transcribe audio using transformers with a language model and CTC decoding.

STT models have [task-type](https://doc.sensory.com/tnl/7.9/api/setting-keys/configuration.md#task-type)` == `[lvcsr](https://doc.sensory.com/tnl/7.9/api/setting-keys/values.md#lvcsr) and filenames that
by convention match `stt-*.snsr`.
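
As a quick illustration of the naming convention, the following sketch checks a path against the `stt-*.snsr` pattern. The helper name `looks_like_stt_model` and the example filenames are hypothetical and not part of the SDK. The filename is only a convention, so treat a match as a hint rather than proof of the model type.

```python
from fnmatch import fnmatchcase
from os.path import basename


def looks_like_stt_model(path: str) -> bool:
    """Return True if the file name follows the stt-*.snsr convention."""
    # Only the final path component is compared against the pattern.
    return fnmatchcase(basename(path), "stt-*.snsr")


# Hypothetical file names, used only to exercise the pattern.
print(looks_like_stt_model("model/stt-example.snsr"))
print(looks_like_stt_model("model/other-example.snsr"))
```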

Some STT models support [grammar-based recognition](https://doc.sensory.com/tnl/7.9/reference/grammar.md#grammar-based-recognition) through
[grammar-stream](https://doc.sensory.com/tnl/7.9/api/setting-keys/runtime.md#grammar-stream). Grammar-constrained STT recognition uses the same grammar
syntax as LVCSR grammar recognition, with STT-only support for
[`<unknown/>`](https://doc.sensory.com/tnl/7.9/reference/grammar.md#grammar-special) and [`<dictation/>`](https://doc.sensory.com/tnl/7.9/reference/grammar.md#grammar-special).

**Also see these related items:** [STT models](https://doc.sensory.com/tnl/7.9/models/index.md#stt-models) included in this distribution.

## Operation

```mermaid
flowchart TD
    start((start))
    fetch[/samples from ->audio-pcm/]
    audio(^sample-count)
    process[process]
    partial(^result-partial)
    intent(^nlu-intent)
    slot(^nlu-slot)
    result(^result)
    nlu{NLU<br>match?}

    slm{SLM<br>included?}
    generate[generate]
    slmstart(^slm-start)
    slmresultpartial(^slm-result-partial)
    slmresult(^slm-result)

    start --> fetch
    fetch --> audio
    audio --> process
    process --> fetch
    process -->|hypothesis| partial
    partial --> fetch
    process -->|VAD endpoint<br>or STREAM_END| nlu
    nlu -->|yes| intent
    nlu -->|no| result
    intent --> slot
    slot --> result
    slot -->|more| intent

    result --> slm
    slm -->|yes| slmstart
    slm -->|no| fetch
    slmstart -->|OK| generate
    slmstart -->|STOP| fetch
    generate -->|response| slmresultpartial
    slmresultpartial --> generate
    generate -->|done| slmresult
    slmresult --> fetch
```

Recognition flow.

1. Read audio data from [->audio-pcm](https://doc.sensory.com/tnl/7.9/api/setting-keys/runtime.md#-audio-pcm).
2. Invoke [^sample-count](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#sample-count-event).
3. Invoke [^result-partial](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#result-partial) with interim recognition hypotheses
   every [partial-result-interval](https://doc.sensory.com/tnl/7.9/api/setting-keys/configuration.md#partial-result-interval) ms.
4. Continue processing until [STREAM_END](https://doc.sensory.com/tnl/7.9/api/inference.md#rc_stream_end) occurs on [->audio-pcm](https://doc.sensory.com/tnl/7.9/api/setting-keys/runtime.md#-audio-pcm),
   one of the event handlers returns a code other than [OK](https://doc.sensory.com/tnl/7.9/api/inference.md#rc_ok), or
   an external [VAD](https://doc.sensory.com/tnl/7.9/api/setting-keys/values.md#vad) detects a speech endpoint.
5. If NLU is configured, invoke [^nlu-intent](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#nlu-intent) and [^nlu-slot](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#nlu-slot) for each
   top-level result that matches.
6. Invoke [^result](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#result) with the final recognition hypothesis.
7. If an SLM is not available, resume processing at step 1.
8. Invoke [^slm-start](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#slm-start). If the handler returns [STOP](https://doc.sensory.com/tnl/7.9/api/inference.md#rc_stop),
   resume processing at step 1.
9. Invoke [^slm-result-partial](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#slm-result-partial) as the model generates text.
10. Invoke [^slm-result](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#slm-result) when text generation is complete.
11. Resume processing at step 1.
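
The event order in the steps above can be traced with a small simulation. This is a sketch only and does not call the SDK: `trace_utterance`, its parameters, and the sample payloads are hypothetical. It assumes a single utterance that ends with STREAM_END, that every handler returns OK except `^slm-start`, and that each NLU match carries an intent with a list of slots.

```python
def trace_utterance(chunks, final_text, nlu_matches=(), slm_tokens=None,
                    slm_start_rc="OK"):
    """Return the (event, payload) sequence for one simulated utterance.

    chunks      -- one entry per audio read; None, or an interim hypothesis
    nlu_matches -- (intent, [slots]) pairs; empty if NLU is not configured
    slm_tokens  -- generated text pieces, or None if no SLM is available
    """
    events = []
    # Audio loop: each read reports progress, some reads yield a hypothesis.
    for partial in chunks:
        events.append(("^sample-count", None))
        if partial is not None:
            events.append(("^result-partial", partial))
    # The chunk list is exhausted here, which stands in for STREAM_END.
    for intent, slots in nlu_matches:
        events.append(("^nlu-intent", intent))
        for slot in slots:
            events.append(("^nlu-slot", slot))
    events.append(("^result", final_text))
    if slm_tokens is None:
        return events  # no SLM: recognition would resume with new audio
    events.append(("^slm-start", None))
    if slm_start_rc == "STOP":
        return events  # handler declined generation
    for token in slm_tokens:
        events.append(("^slm-result-partial", token))
    events.append(("^slm-result", "".join(slm_tokens)))
    return events


for name, payload in trace_utterance(
        [None, "turn", "turn on"], "turn on the light",
        nlu_matches=[("lights_on", ["light"])], slm_tokens=["Done", "."]):
    print(name, payload)
```

Passing `slm_start_rc="STOP"` ends the trace right after `^slm-start`, which mirrors the handler returning STOP to skip text generation.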

**Note:**

STT recognizers do **not** produce a final recognition hypothesis until they
run out of audio samples to process, or an external VAD detects a speech
endpoint.

With live audio, use STT models with a VAD template such as
[tpl-vad-lvcsr](https://doc.sensory.com/tnl/7.9/models/index.md#tpl-vad-lvcsr), [tpl-opt-spot-vad-lvcsr](https://doc.sensory.com/tnl/7.9/models/index.md#tpl-opt-spot-vad-lvcsr), or [tpl-spot-vad-lvcsr](https://doc.sensory.com/tnl/7.9/models/index.md#tpl-spot-vad-lvcsr).

## Settings

**Available events:** [^nlu-intent](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#nlu-intent), [^nlu-slot](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#nlu-slot), [^result](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#result), [^result-partial](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#result-partial), [^sample-count](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#sample-count-event), [^slm-result](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#slm-result), [^slm-result-partial](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#slm-result-partial), [^slm-start](https://doc.sensory.com/tnl/7.9/api/setting-keys/events.md#slm-start)

**Available iterators:** _none_

**Available results:** [audio-stream](https://doc.sensory.com/tnl/7.9/api/setting-keys/results.md#audio-stream), [audio-stream-first](https://doc.sensory.com/tnl/7.9/api/setting-keys/results.md#audio-stream-first), [audio-stream-last](https://doc.sensory.com/tnl/7.9/api/setting-keys/results.md#audio-stream-last)

**Available runtime settings:** [->audio-pcm](https://doc.sensory.com/tnl/7.9/api/setting-keys/runtime.md#-audio-pcm), [audio-stream-from](https://doc.sensory.com/tnl/7.9/api/setting-keys/runtime.md#audio-stream-from), [audio-stream-to](https://doc.sensory.com/tnl/7.9/api/setting-keys/runtime.md#audio-stream-to), [grammar-stream](https://doc.sensory.com/tnl/7.9/api/setting-keys/runtime.md#grammar-stream), [nlu-grammar-stream](https://doc.sensory.com/tnl/7.9/api/setting-keys/runtime.md#nlu-grammar-stream)

**Available configuration settings:** [audio-stream-size](https://doc.sensory.com/tnl/7.9/api/setting-keys/configuration.md#audio-stream-size), [custom-vocab](https://doc.sensory.com/tnl/7.9/api/setting-keys/configuration.md#custom-vocab), [partial-result-interval](https://doc.sensory.com/tnl/7.9/api/setting-keys/configuration.md#partial-result-interval), [samples-per-second](https://doc.sensory.com/tnl/7.9/api/setting-keys/configuration.md#samples-per-second), [stt-profile](https://doc.sensory.com/tnl/7.9/api/setting-keys/configuration.md#stt-profile)

**Available values:** [lvcsr](https://doc.sensory.com/tnl/7.9/api/setting-keys/values.md#lvcsr)

**Also see these related items:** [live-spot.c](https://doc.sensory.com/tnl/7.9/api/sample/c/live-spot.md#live-spot-code), [snsr-eval.c](https://doc.sensory.com/tnl/7.9/api/sample/c/snsr-eval.md#snsr-eval-code), [PhraseSpot.java](https://doc.sensory.com/tnl/7.9/api/sample/android/enroll-trigger.md#et-code), [segmentSpottedAudio.java](https://doc.sensory.com/tnl/7.9/api/sample/java/segmentSpottedAudio.md#segmentspottedaudio-code)

## STT grammar-based recognition

STT models that support grammar decoding accept grammar specifications through
[grammar-stream](https://doc.sensory.com/tnl/7.9/api/setting-keys/runtime.md#grammar-stream). See [Grammar-based recognition](https://doc.sensory.com/tnl/7.9/reference/grammar.md#grammar-based-recognition) for syntax, operators,
NLU markup, special symbols, and weights.

STT grammars can use runtime classes supplied through `grammar-stream.classname`
or `phrases-stream.classname`, but they do not support LVCSR binary
[class libraries](https://doc.sensory.com/tnl/7.9/models/types/lvcsr.md#grammar-class-libraries).

Use `<unknown/>` in an STT grammar to allow an out-of-grammar span at a specific
position. Use `<dictation/>` to hand off free-form speech to the STT model's
statistical language component. This hand-off is one-way; matching does not
return to the grammar afterward.
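
To make the difference between the two symbols concrete, here is a toy matcher over plain word lists. It is only an illustration of the behavior described above, not the SDK's decoder and not real grammar syntax: `toy_match`, the flat pattern format, and the sample phrases are all hypothetical. It assumes `<unknown/>` absorbs words until the next literal in the pattern, and that `<dictation/>` takes everything that remains.

```python
UNKNOWN = "<unknown/>"
DICTATION = "<dictation/>"


def toy_match(pattern, words):
    """Match a word list against a flat pattern of literals and symbols.

    Returns None on mismatch, otherwise a dict with the words captured
    by each special symbol.
    """
    captured = {"unknown": [], "dictation": []}
    i = 0  # current position in `words`
    for k, token in enumerate(pattern):
        if token == DICTATION:
            # One-way hand-off: all remaining words are free-form, and
            # pattern tokens after this point are never consulted.
            captured["dictation"] = words[i:]
            return captured
        if token == UNKNOWN:
            # Absorb words until the literal that follows in the pattern.
            stop = pattern[k + 1] if k + 1 < len(pattern) else None
            while i < len(words) and words[i] != stop:
                captured["unknown"].append(words[i])
                i += 1
            continue
        if i >= len(words) or words[i] != token:
            return None
        i += 1
    return captured if i == len(words) else None


print(toy_match(["call", UNKNOWN, "now"], "call doctor smith now".split()))
print(toy_match(["note", DICTATION, "end"], "note buy milk end".split()))
```

In the second call the trailing `end` literal is never matched: once dictation starts, it captures the rest of the input, which is the one-way behavior described above.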

<!-- Abbreviation definitions from includes/abbreviations.md -->
*[API]: Application Programming Interface
*[LVCSR]: Large Vocabulary Continuous Speech Recognition model, feed-forward neural net acoustic model with FST decoder
*[NLU]: Natural Language Understanding model
*[SLM]: Generative Small Language Model
*[STT]: Speech To Text: transformers with language model and CTC decoding
*[TNL]: TrulyNatural, Sensory's large-vocabulary speech recognition technology
*[VAD]: Voice Activity Detector
