Assist Pipelines
The Assist pipeline integration runs the common steps of a voice assistant:
- Wake word detection
- Speech to text
- Intent recognition
- Text to speech
Pipelines are run via a WebSocket API:
{
  "type": "assist_pipeline/run",
  "start_stage": "stt",
  "end_stage": "tts",
  "input": {
    "sample_rate": 16000
  }
}
The following input fields are available:
Name | Type | Description |
---|---|---|
start_stage | enum | Required. The first stage to run. One of wake_word, stt, intent, tts. |
end_stage | enum | Required. The last stage to run. One of stt, intent, tts. |
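input | dict | Depends on start_stage: for stt, audio settings such as sample_rate; for wake_word, timeout and the audio enhancement settings described under Wake word detection. |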
pipeline | string | Optional. ID of the pipeline (use assist_pipeline/pipeline/list to get names). |
conversation_id | string | Optional. Unique id for conversation. |
timeout | number | Optional. Number of seconds before pipeline times out (default: 300). |
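The fields in the table above can be assembled programmatically. The following is a minimal sketch, not an official client: `build_run_message` is a hypothetical helper name, and it only checks the stage names listed in the table. Any envelope fields the WebSocket transport may require beyond those shown in the example are left out.

```python
import json

# Stage names taken from the table above.
START_STAGES = ("wake_word", "stt", "intent", "tts")
END_STAGES = ("stt", "intent", "tts")


def build_run_message(start_stage, end_stage, input_data, *,
                      pipeline=None, conversation_id=None, timeout=None):
    """Build an assist_pipeline/run message as a plain dict (hypothetical helper)."""
    if start_stage not in START_STAGES:
        raise ValueError(f"unknown start_stage: {start_stage!r}")
    if end_stage not in END_STAGES:
        raise ValueError(f"unknown end_stage: {end_stage!r}")

    message = {
        "type": "assist_pipeline/run",
        "start_stage": start_stage,
        "end_stage": end_stage,
        "input": dict(input_data),
    }
    # Optional fields are only included when the caller sets them.
    if pipeline is not None:
        message["pipeline"] = pipeline
    if conversation_id is not None:
        message["conversation_id"] = conversation_id
    if timeout is not None:
        message["timeout"] = timeout
    return message


if __name__ == "__main__":
    print(json.dumps(build_run_message("stt", "tts", {"sample_rate": 16000}), indent=2))
```

Serializing the result with `json.dumps` avoids hand-written JSON mistakes such as trailing commas.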
Events
As the pipeline runs, it emits events back over the WebSocket connection. The following events can be emitted:
Name | Description | Emitted | Attributes |
---|---|---|---|
run-start | Start of pipeline run | always | pipeline: ID of the pipeline; language: language used for pipeline; runner_data: extra WebSocket data |
run-end | End of pipeline run | always | |
wake_word-start | Start of wake word detection | audio only | engine: wake engine used; metadata: incoming audio metadata; timeout: seconds before wake word timeout |
wake_word-end | End of wake word detection | audio only | wake_word_output: detection result data |
stt-start | Start of speech to text | audio only | engine: STT engine used; metadata: incoming audio metadata |
stt-vad-start | Start of voice command | audio only | timestamp: time relative to start of audio stream (milliseconds) |
stt-vad-end | End of voice command | audio only | timestamp: time relative to start of audio stream (milliseconds) |
stt-end | End of speech to text | audio only | stt_output: object with text, the detected text |
intent-start | Start of intent recognition | always | engine: agent engine used; language: processing language; intent_input: input text to agent |
intent-end | End of intent recognition | always | intent_output: conversation response |
tts-start | Start of text to speech | audio only | engine: TTS engine used; language: output language; voice: output voice; tts_input: text to speak |
tts-end | End of text to speech | audio only | media_id: Media Source ID of the generated audio; url: URL to the generated audio; mime_type: MIME type of the generated audio |
error | Error in pipeline | on error | code: error code (see below); message: error message |
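A client typically folds these events into a small amount of state. The sketch below assumes the caller has already extracted the event name and its attributes from the incoming WebSocket message; how events are wrapped on the wire is not covered here. `PipelineRunTracker` and `handle` are hypothetical names, and only attributes listed in the table are read.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PipelineRunTracker:
    """Collects the outcome of one pipeline run from its events (illustrative)."""

    finished: bool = False
    transcript: Optional[str] = None
    error_code: Optional[str] = None
    error_message: Optional[str] = None
    seen: list = field(default_factory=list)

    def handle(self, event_type: str, data: Optional[dict] = None) -> None:
        data = data or {}
        self.seen.append(event_type)
        if event_type == "stt-end":
            # stt_output is an object with "text", the detected text.
            self.transcript = (data.get("stt_output") or {}).get("text")
        elif event_type == "error":
            self.error_code = data.get("code")
            self.error_message = data.get("message")
        elif event_type == "run-end":
            self.finished = True


if __name__ == "__main__":
    tracker = PipelineRunTracker()
    tracker.handle("run-start", {"pipeline": "example", "language": "en"})
    tracker.handle("stt-end", {"stt_output": {"text": "turn on the light"}})
    tracker.handle("run-end")
    print(tracker.transcript, tracker.finished)
```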
Error codes
The following codes are returned from the pipeline error event:
- wake-engine-missing - No wake word engine is installed
- wake-provider-missing - Configured wake word provider is not available
- wake-stream-failed - Unexpected error during wake word detection
- wake-word-timeout - Wake word was not detected within timeout
- stt-provider-missing - Configured speech-to-text provider is not available
- stt-provider-unsupported-metadata - Speech-to-text provider does not support audio format (sample rate, etc.)
- stt-stream-failed - Unexpected error during speech-to-text
- stt-no-text-recognized - Speech-to-text did not return a transcript
- intent-not-supported - Configured conversation agent is not available
- intent-failed - Unexpected error during intent recognition
- tts-not-supported - Configured text-to-speech provider is not available or options are not supported
- tts-failed - Unexpected error during text-to-speech
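Because each code above starts with the name of the stage that failed, a client can group errors by stage. The following sketch relies on that observed naming pattern, which is an assumption rather than a documented guarantee; `error_stage` is a hypothetical helper.

```python
from typing import Optional

# Prefix of each error code mapped to the pipeline stage it belongs to.
_PREFIX_TO_STAGE = {
    "wake": "wake_word",
    "stt": "stt",
    "intent": "intent",
    "tts": "tts",
}


def error_stage(code: str) -> Optional[str]:
    """Return the pipeline stage an error code refers to, or None if unknown."""
    prefix = code.split("-", 1)[0]
    return _PREFIX_TO_STAGE.get(prefix)


if __name__ == "__main__":
    for code in ("wake-word-timeout", "stt-no-text-recognized", "tts-failed"):
        print(code, "->", error_stage(code))
```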
Sending speech data
After starting a pipeline with stt as the first stage of the run and receiving an stt-start event, speech data can be sent over the WebSocket connection as binary data. Audio should be sent as soon as it is available, with each chunk prefixed with a byte for the stt_binary_handler_id.
For example, if stt_binary_handler_id is 1 and the audio chunk is a1b2c3, the message would be (in hex):
stt_binary_handler_id
||
01a1b2c3
  ||||||
  audio
To indicate the end of sending speech data, send a binary message containing a single byte with the stt_binary_handler_id.
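The framing described in this section is small enough to sketch directly. The helper names below (`frame_audio_chunk`, `end_of_audio`, `frame_stream`) are hypothetical; the handler ID is assumed to fit in one byte, as the single-byte prefix implies, and sending the resulting bytes over a WebSocket connection is left to whichever client library is in use.

```python
from typing import Iterable, Iterator


def frame_audio_chunk(handler_id: int, chunk: bytes) -> bytes:
    """Prefix one audio chunk with the stt_binary_handler_id byte."""
    if not 0 <= handler_id <= 255:
        raise ValueError("handler_id must fit in a single byte")
    return bytes([handler_id]) + chunk


def end_of_audio(handler_id: int) -> bytes:
    """Message that signals the end of speech data: the handler ID byte alone."""
    return frame_audio_chunk(handler_id, b"")


def frame_stream(handler_id: int, chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield one framed message per chunk, followed by the end-of-audio message."""
    for chunk in chunks:
        yield frame_audio_chunk(handler_id, chunk)
    yield end_of_audio(handler_id)


if __name__ == "__main__":
    # Reproduces the hex example from the text.
    print(frame_audio_chunk(1, bytes.fromhex("a1b2c3")).hex())
```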
Wake word detection
When start_stage is set to wake_word, the pipeline will not run until a wake word has been detected. Clients should avoid unnecessary audio streaming by using a local voice activity detector (VAD) to only start streaming when human speech is detected.
For wake_word, the input object should contain a timeout float value. This is the number of seconds of silence before the pipeline will time out during wake word detection (error code wake-word-timeout).
If enough speech is detected by Home Assistant's internal VAD, the timeout will be continually reset.
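The reset-on-speech behaviour can be pictured as a deadline that moves forward whenever speech is detected. The class below is only an illustration of that idea under assumed semantics; it is not Home Assistant's implementation, and `WakeWordTimeout` and its methods are hypothetical names.

```python
import time
from typing import Callable


class WakeWordTimeout:
    """Deadline that expires after `timeout` seconds without detected speech."""

    def __init__(self, timeout: float, clock: Callable[[], float] = time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._deadline = clock() + timeout

    def speech_detected(self) -> None:
        # Speech pushes the deadline out by a full timeout period.
        self._deadline = self._clock() + self._timeout

    def expired(self) -> bool:
        return self._clock() >= self._deadline


if __name__ == "__main__":
    now = [0.0]  # fake clock so the example runs instantly
    timer = WakeWordTimeout(3.0, clock=lambda: now[0])
    now[0] = 2.0
    timer.speech_detected()
    now[0] = 4.5
    print(timer.expired())
```

Injecting the clock keeps the sketch testable without real waiting.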
Audio enhancements
The following settings are available as part of the input object when start_stage is set to wake_word:
- noise_suppression_level - level of noise suppression (0 = disabled, 4 = max)
- auto_gain_dbfs - automatic gain control (0 = disabled, 31 = max)
- volume_multiplier - audio samples multiplied by constant (1.0 = no change, 2.0 = twice as loud)
If your device's microphone is fairly quiet, the recommended settings are:
- noise_suppression_level - 2
- auto_gain_dbfs - 31
- volume_multiplier - 2.0
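As a sketch, the wake word input object can be built with range checks taken from the limits listed above. `build_wake_word_input` and `QUIET_MICROPHONE` are hypothetical names, and the defaults used here are simply the "disabled" and "no change" values, not necessarily the server's defaults. Other input fields, such as sample_rate, are not covered by this sketch.

```python
def build_wake_word_input(timeout: float, *,
                          noise_suppression_level: int = 0,
                          auto_gain_dbfs: int = 0,
                          volume_multiplier: float = 1.0) -> dict:
    """Build the input object for start_stage = wake_word (illustrative helper)."""
    if not 0 <= noise_suppression_level <= 4:
        raise ValueError("noise_suppression_level must be between 0 and 4")
    if not 0 <= auto_gain_dbfs <= 31:
        raise ValueError("auto_gain_dbfs must be between 0 and 31")
    if volume_multiplier <= 0:
        # Assumption: a zero or negative multiplier is never useful.
        raise ValueError("volume_multiplier must be positive")
    return {
        "timeout": float(timeout),
        "noise_suppression_level": noise_suppression_level,
        "auto_gain_dbfs": auto_gain_dbfs,
        "volume_multiplier": volume_multiplier,
    }


# Recommended settings from the text for a fairly quiet microphone.
QUIET_MICROPHONE = {
    "noise_suppression_level": 2,
    "auto_gain_dbfs": 31,
    "volume_multiplier": 2.0,
}

if __name__ == "__main__":
    print(build_wake_word_input(3, **QUIET_MICROPHONE))
```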
Increasing noise_suppression_level or volume_multiplier may cause audio distortion.
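One way a large volume_multiplier can distort audio is clipping: multiplied samples that exceed the sample range are cut off at the limit. The sketch below illustrates this on 16-bit signed PCM samples in native byte order, which is an assumption about the audio format; it is not how Home Assistant applies the setting, and `apply_volume_multiplier` is a hypothetical name.

```python
import array

INT16_MIN, INT16_MAX = -32768, 32767


def apply_volume_multiplier(pcm: bytes, multiplier: float) -> bytes:
    """Scale 16-bit signed PCM samples, clipping values that leave the range."""
    samples = array.array("h")
    samples.frombytes(pcm)  # length must be a multiple of the sample size
    scaled = array.array(
        "h",
        (max(INT16_MIN, min(INT16_MAX, int(s * multiplier))) for s in samples),
    )
    return scaled.tobytes()


if __name__ == "__main__":
    quiet_and_loud = array.array("h", [1000, 20000, -20000]).tobytes()
    doubled = array.array("h")
    doubled.frombytes(apply_volume_multiplier(quiet_and_loud, 2.0))
    # The quiet sample doubles cleanly; the loud ones hit the limits.
    print(list(doubled))
```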