Overview
The AudioSeal generator (AudioSealWM class) creates imperceptible watermark signals that can be added to audio. The watermark has the same length as the input audio and is designed to be robust to various audio transformations.
Generator Architecture
The generator consists of three main components working in sequence:
1. SEANet Encoder
The encoder compresses the audio signal into a latent representation:
- Temporal downsampling through strided convolutions
- Residual blocks with dilated convolutions to enlarge the receptive field
- Default ratios: [8, 5, 4, 2] → 320x temporal compression (16 kHz audio to a 50 Hz latent)
The encoder reduces the temporal dimension significantly, enabling efficient processing of long audio sequences.
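The compression figures follow directly from the stride ratios. The sketch below works through that arithmetic in plain Python. The helper name `latent_frames` is illustrative, and rounding the frame count up assumes the encoder pads the final partial frame, which this page does not confirm.

```python
from math import ceil, prod

RATIOS = [8, 5, 4, 2]   # default stride ratios from the text
SAMPLE_RATE = 16_000    # Hz
DIMENSION = 128         # default latent size per frame

HOP = prod(RATIOS)               # input samples per latent frame
FRAME_RATE = SAMPLE_RATE / HOP   # latent frames per second


def latent_frames(num_samples: int, hop: int = HOP) -> int:
    """Number of latent frames for a clip (assumes the last frame is padded)."""
    return ceil(num_samples / hop)


ten_seconds = 10 * SAMPLE_RATE
frames = latent_frames(ten_seconds)
print(HOP, FRAME_RATE, frames, frames * DIMENSION)
```

Under these assumptions, ten seconds of 16 kHz audio becomes 500 latent frames of 128 values each.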
2. Message Processor (Optional)
If a secret message is provided, it is embedded into the latent representation.
The message embedding uses a learned torch.nn.Embedding layer with 2 * nbits entries, allowing the model to learn optimal representations for each bit value.
3. SEANet Decoder
The decoder upsamples the latent representation back to audio rate:
- Transposed convolutions for temporal upsampling
- Matches the downsampling ratios of the encoder
- Trimmed to exact input length
- Optional final activation (e.g., tanh) for bounded output
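The trimming step matters because upsampling can only produce whole frames. The sketch below is a length calculation only. It assumes the decoder emits exactly one hop of samples per latent frame and ignores any convolution padding details. `decoded_length` and `trim_to_input` are hypothetical helper names.

```python
from math import ceil

HOP = 8 * 5 * 4 * 2  # product of the default ratios


def decoded_length(num_samples: int, hop: int = HOP) -> int:
    """Length after encoding to whole frames and upsampling back."""
    frames = ceil(num_samples / hop)
    return frames * hop


def trim_to_input(decoded, num_samples: int):
    """Drop the extra tail so the watermark matches the input length."""
    return decoded[:num_samples]


num_samples = 16_001                           # one sample past a second
decoded = [0.0] * decoded_length(num_samples)  # stand-in decoder output
watermark = trim_to_input(decoded, num_samples)
print(len(decoded), len(watermark))
```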
Normalization and Envelope Fitting
A critical component for imperceptibility is the NormalizationProcessor, which ensures the watermark fits within the audio’s natural envelope.
Envelope Fitting Process
Step-by-Step Process
- Window the Signals: Divide both audio and watermark into overlapping windows (window_size=5, overlap=50%)
- Compute RMS: Calculate root mean square energy for each window
- Calculate Gain: Determine scaling factor to fit watermark under audio envelope
- Apply Hann Window: Smooth transitions between windows
- Reconstruct: Use torch.nn.Fold to reconstruct the normalized signal
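The steps above can be sketched in plain Python. This is not the NormalizationProcessor implementation. The gain rule `min(1, audio_rms / watermark_rms)` is an assumed reading of "fit under the envelope", overlap-add stands in for torch.nn.Fold, and the sketch uses an even window of 8 samples because this page does not state the units of the default window_size of 5.

```python
import math


def rms(values):
    """Root mean square energy; 0.0 for an empty window."""
    return math.sqrt(sum(v * v for v in values) / len(values)) if values else 0.0


def hann(n):
    """Periodic Hann window; copies shifted by n/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def fit_envelope(audio, watermark, window=8, eps=1e-8):
    """Scale the watermark per window so its RMS stays under the audio's."""
    if len(audio) != len(watermark):
        raise ValueError("audio and watermark must have the same length")
    if window < 2 or window % 2:
        raise ValueError("window must be an even number >= 2")
    hop = window // 2                      # 50% overlap
    taper = hann(window)
    n = len(audio)
    out = [0.0] * n
    weight = [0.0] * n
    for start in range(-hop, n, hop):      # start early so edges are covered
        idx = [i for i in range(start, start + window) if 0 <= i < n]
        a_rms = rms([audio[i] for i in idx])
        w_rms = rms([watermark[i] for i in idx])
        gain = min(1.0, a_rms / (w_rms + eps))   # assumed gain rule
        for i in idx:
            w = taper[i - start]
            out[i] += gain * w * watermark[i]    # overlap-add
            weight[i] += w
    return [o / w for o, w in zip(out, weight)]


# Loud first half, quiet second half; constant-level watermark.
audio = [(0.5 if i < 64 else 0.01) * (-1) ** i for i in range(128)]
watermark = [0.1 * (-1) ** i for i in range(128)]
fitted = fit_envelope(audio, watermark)
print(abs(fitted[10]), abs(fitted[100]))
```

In the loud half the watermark passes through unchanged. In the quiet half it is scaled down to roughly the level of the audio.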
Why This Matters
- Imperceptibility: Watermark is scaled to be quieter than the original audio
- Adaptive: Different scaling for different parts of the audio (louder during loud sections, quieter during quiet sections)
- Smooth: Hann windowing prevents audible artifacts at window boundaries
Complete Generation Process
The get_watermark method ties these stages together into the full workflow.
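The sketch below shows only the order of the stages described on this page. It is not the AudioSealWM source. `watermark_pipeline` and the `toy_*` functions are hypothetical placeholders, and whether envelope fitting runs inside get_watermark itself is an assumption.

```python
def watermark_pipeline(audio, encode, decode,
                       embed_message=None, message=None, fit=None):
    """Encode, optionally inject a message, decode, trim, optionally fit."""
    hidden = encode(audio)
    if embed_message is not None and message is not None:
        hidden = embed_message(hidden, message)
    watermark = decode(hidden)[: len(audio)]   # trim to exact input length
    if fit is not None:
        watermark = fit(audio, watermark)
    return watermark


# Toy stand-ins so the sketch runs; they are not the real networks.
def toy_encode(audio, hop=4):
    chunks = [audio[i:i + hop] for i in range(0, len(audio), hop)]
    return [sum(c) / len(c) for c in chunks]


def toy_decode(hidden, hop=4):
    return [h for h in hidden for _ in range(hop)]


def toy_embed(hidden, message):
    offset = 0.01 * sum(message)
    return [h + offset for h in hidden]


audio = [0.1 * i for i in range(10)]
plain = watermark_pipeline(audio, toy_encode, toy_decode)
keyed = watermark_pipeline(audio, toy_encode, toy_decode,
                           toy_embed, message=[1, 0, 1])
print(len(plain), len(keyed))
```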
Usage Examples
Basic Watermark Generation
With Custom Message
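A message is a sequence of nbits binary values (16 by default). The sketch below, which is independent of the AudioSeal API, packs an integer identifier into such a bit list and recovers it. The helper names and the most-significant-bit-first order are illustrative choices, not something this page specifies.

```python
NBITS = 16  # default message length from the specifications table


def int_to_bits(value: int, nbits: int = NBITS) -> list:
    """Encode an integer as a list of bits, most significant bit first."""
    if not 0 <= value < 2 ** nbits:
        raise ValueError(f"value must fit in {nbits} bits")
    return [(value >> (nbits - 1 - i)) & 1 for i in range(nbits)]


def bits_to_int(bits) -> int:
    """Inverse of int_to_bits."""
    out = 0
    for b in bits:
        out = (out << 1) | int(b)
    return out


message = int_to_bits(40_000)
print(message, bits_to_int(message))
```

A bit list like this would then be converted to a tensor of shape (batch, nbits) before being passed to the model. That shape is assumed here rather than taken from this page.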
With Alpha Scaling
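Alpha scaling trades imperceptibility against robustness by weighting the watermark before it is added to the audio. The sketch below assumes the combination rule is `audio + alpha * watermark`, which this page does not spell out. `apply_watermark` is a hypothetical helper operating on plain lists.

```python
def apply_watermark(audio, watermark, alpha=1.0):
    """Add a scaled watermark to the audio, sample by sample."""
    if len(audio) != len(watermark):
        raise ValueError("audio and watermark must have the same length")
    return [a + alpha * w for a, w in zip(audio, watermark)]


audio = [0.0, 1.0]
watermark = [0.5, -0.5]
half_strength = apply_watermark(audio, watermark, alpha=0.5)
print(half_strength)
```

Smaller alpha values make the watermark quieter, while larger values make it stronger and potentially more audible.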
Streaming Mode
The streaming() context manager enables convolutional caching, allowing efficient processing of audio streams without redundant computation.
Design Choices
Why Encoder-Decoder Architecture?
Efficiency
Processing in compressed latent space is much faster than operating directly on audio samples
Receptive Field
Encoder captures long-range context, allowing watermark to adapt to audio characteristics
Imperceptibility
Latent representation enables learning perceptually-aware watermarks
Message Embedding
Easy to inject message information in the compact latent space
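To make the injection concrete, the sketch below builds an embedding table with 2 * nbits rows and adds the looked-up vectors to every latent frame. The row layout (`2 * i + bit`), the sum over bits, and the broadcast over time are assumptions for illustration. Random values stand in for the learned torch.nn.Embedding weights.

```python
import random


def make_table(nbits, dim, seed=0):
    """Random stand-in for a learned table with 2 * nbits rows."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(2 * nbits)]


def message_indices(bits):
    """One row per (bit position, bit value) pair: index = 2 * i + bit."""
    return [2 * i + b for i, b in enumerate(bits)]


def message_vector(bits, table):
    """Sum the selected rows into a single latent-sized vector."""
    rows = [table[k] for k in message_indices(bits)]
    return [sum(col) for col in zip(*rows)]


def inject(hidden, vector):
    """Add the same message vector to every latent frame."""
    return [[h + v for h, v in zip(frame, vector)] for frame in hidden]


table = make_table(nbits=4, dim=3)
bits = [1, 0, 1, 1]
hidden = [[0.0, 0.0, 0.0] for _ in range(5)]   # 5 frames, 3 dims
injected = inject(hidden, message_vector(bits, table))
print(message_indices(bits))
```

Because the latent has only 50 frames per second, adding one vector per frame is cheap compared with modifying every audio sample.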
Why Envelope Fitting?
Without envelope fitting, the watermark might be:
- Too loud in quiet sections → audible artifacts
- Too quiet in loud sections → reduced robustness
Technical Specifications
| Parameter | Default Value | Description |
|---|---|---|
| channels | 1 | Number of audio channels (mono) |
| dimension | 128 | Latent representation size |
| n_filters | 32 | Base channel width |
| n_residual_layers | 3 | Residual blocks per stage |
| ratios | [8, 5, 4, 2] | Temporal compression ratios |
| nbits | 16 | Message length in bits |
| window_size | 5 | Envelope fitting window size |
| reference_rms | 0.1 | Loudness normalization target |
Performance Considerations
Memory Usage
- Encoder latent: ~320x fewer time steps than the input audio (each step is a 128-dimensional vector by default)
- Batch processing supported for multiple files
- Gradient checkpointing available during training
Speed
- Real-time factor: ~0.1x (10x faster than real-time on GPU)
- Streaming mode: minimal latency with convolutional caching
- Batch processing: linear speedup with batch size
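The convolutional caching behind streaming mode can be illustrated with a toy causal filter. The sketch below keeps the last `kernel_size - 1` input samples between calls, so processing a stream chunk by chunk gives the same output as processing it in one pass. `StreamingConv` is a hypothetical class for illustration and is not the AudioSeal streaming implementation.

```python
class StreamingConv:
    """Causal 1-D convolution that caches left context between chunks."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Zero history before the stream starts (causal left padding).
        self.cache = [0.0] * (len(self.kernel) - 1)

    def process(self, chunk):
        k = len(self.kernel)
        padded = self.cache + list(chunk)
        out = [sum(self.kernel[j] * padded[i + j] for j in range(k))
               for i in range(len(chunk))]
        # Keep only the context the next chunk needs.
        self.cache = padded[len(padded) - (k - 1):]
        return out


kernel = [0.5, 0.25, 0.25]
signal = [float(i) for i in range(10)]

full = StreamingConv(kernel).process(signal)   # one pass over everything

stream = StreamingConv(kernel)
chunked = []
for chunk in (signal[:3], signal[3:4], signal[4:]):
    chunked += stream.process(chunk)           # no samples are recomputed

print(chunked == full)
```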
Compatibility
- Supports any sample rate (trained on 16 kHz)
- Works with mono and stereo audio
- TorchScript compatible (except envelope fitting)
Next Steps
Detection
Learn how to detect watermarks
Localization
Understand sample-level precision
API Reference
Full API documentation
