Overview
The AudioSeal detector (AudioSealDetector class) identifies watermarked audio segments and decodes embedded messages with sample-level precision. Unlike traditional watermark detectors that output a single binary decision, AudioSeal provides frame-by-frame probabilities, enabling localized detection in edited or concatenated audio.
Detector Architecture
The detector is simpler than the generator, consisting of two main components:
1. SEANet Encoder (Keep Dimension)
The detector uses SEANetEncoderKeepDimension instead of the regular SEANetEncoder.
Why Keep Dimension?
The standard encoder downsamples audio by a factor of 320 (with default ratios), collapsing temporal information. The detector needs to maintain temporal resolution to provide frame-by-frame detection probabilities.
SEANetEncoderKeepDimension processes audio while preserving the temporal dimension, enabling localized watermark detection.
Architecture Details
- Same convolutional structure as the generator encoder
- No temporal downsampling (or compensated with appropriate padding/upsampling)
- Outputs: (batch, output_dim=32, frames), where frames ≈ input_samples
- Much larger output than the compressed encoder
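To make the size difference concrete, the following is a minimal arithmetic sketch rather than AudioSeal code. It assumes stride ratios of (8, 5, 4, 2), whose product matches the factor of 320 quoted above, and a 16 kHz sample rate; the helper names are hypothetical.

```python
import math

# Assumed stride ratios; their product is the 320x factor quoted in the text.
ASSUMED_RATIOS = (8, 5, 4, 2)


def downsampled_frames(num_samples: int, ratios=ASSUMED_RATIOS) -> int:
    """Approximate frame count after a strided encoder (ceiling division)."""
    hop = math.prod(ratios)
    return math.ceil(num_samples / hop)


def keep_dimension_frames(num_samples: int) -> int:
    """A keep-dimension encoder emits roughly one frame per input sample."""
    return num_samples


sample_rate = 16_000
one_second = sample_rate  # number of samples in one second of audio
hop = math.prod(ASSUMED_RATIOS)

print(hop)                                # 320
print(downsampled_frames(one_second))     # 50
print(keep_dimension_frames(one_second))  # 16000
print(1000 * hop / sample_rate)           # 20.0 (milliseconds per downsampled frame)
```

Under these assumptions a downsampling encoder could localize a watermark only to about 20 ms, whereas the keep-dimension variant yields one prediction per sample.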
2. Detection Head (1x1 Convolution)
A simple 1x1 convolution projects the encoder output to detection and message logits:
- Channels 0-1: Detection logits (watermark present/absent)
- Channels 2 to 1+nbits: Message decoding logits (16 channels for a 16-bit message)
The 1x1 convolution acts as a learned linear projection applied independently to each time frame, enabling efficient frame-by-frame prediction.
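The per-frame projection can be illustrated with a plain-Python sketch. This is not the library's implementation: the weights are random placeholders, and `conv1x1` is a hypothetical helper that mimics what a 1x1 convolution computes on a `[channels][frames]` array.

```python
import random


def conv1x1(features, weight, bias):
    """Apply a 1x1 convolution to features shaped [in_channels][frames].

    weight is [out_channels][in_channels]; bias is [out_channels].
    Each output frame depends only on the input frame at the same index.
    """
    in_channels = len(features)
    frames = len(features[0])
    out = []
    for w_row, b in zip(weight, bias):
        out.append([
            b + sum(w_row[c] * features[c][t] for c in range(in_channels))
            for t in range(frames)
        ])
    return out


rng = random.Random(0)
output_dim, nbits, frames = 32, 16, 5
out_channels = 2 + nbits  # 2 detection channels + nbits message channels

# Random stand-ins for encoder features and learned weights (illustrative only).
features = [[rng.gauss(0, 1) for _ in range(frames)] for _ in range(output_dim)]
weight = [[rng.gauss(0, 0.1) for _ in range(output_dim)] for _ in range(out_channels)]
bias = [0.0] * out_channels

logits = conv1x1(features, weight, bias)
detection_logits = logits[:2]           # channels 0-1
message_logits = logits[2:2 + nbits]    # channels 2 to 1+nbits

print(len(detection_logits), len(message_logits), len(logits[0]))  # 2 16 5
```

Because no output frame mixes information from neighbouring frames, the head itself adds no temporal smearing; any context comes from the encoder's receptive field.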
Detection Process
The forward pass consists of several steps:
Step 1: Optional Loudness Normalization
Step 2: Encoder Processing
Step 3: Detection Probability Calculation
- Channel 0: P(no watermark)
- Channel 1: P(watermark present)
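A minimal sketch of the two-channel softmax, written in plain Python on a `[2][frames]` list of made-up logits. The function name is hypothetical and this is not the library's code.

```python
import math


def softmax_over_channels(detection_logits):
    """Softmax across the two detection channels, independently per frame.

    detection_logits is [2][frames]: row 0 = "no watermark", row 1 = "watermark".
    """
    no_wm_logits, wm_logits = detection_logits
    probs_no_wm, probs_wm = [], []
    for a, b in zip(no_wm_logits, wm_logits):
        m = max(a, b)  # subtract the max for numerical stability
        ea, eb = math.exp(a - m), math.exp(b - m)
        probs_no_wm.append(ea / (ea + eb))
        probs_wm.append(eb / (ea + eb))
    return [probs_no_wm, probs_wm]


# Three frames: clearly unwatermarked, ambiguous, clearly watermarked.
example_logits = [[2.0, 0.0, -1.5], [-2.0, 0.0, 3.0]]
probs = softmax_over_channels(example_logits)

for frame in range(3):
    total = probs[0][frame] + probs[1][frame]
    print(frame, round(probs[1][frame], 3), round(total, 6))
```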
After softmax, result[:, 0, :] + result[:, 1, :] = 1.0 for each frame, ensuring a valid probability distribution.
Step 4: Message Decoding
Why Average Across Frames?
The same message is embedded throughout the entire watermarked audio. By averaging predictions across all frames, we:
- Reduce noise and improve accuracy
- Aggregate evidence from the entire audio
- Obtain a single consensus message prediction
Why Sigmoid?
After averaging, the raw logits are passed through sigmoid to convert to probabilities in [0, 1], where:
- Values close to 0 indicate bit = 0
- Values close to 1 indicate bit = 1
- Values near 0.5 indicate uncertainty
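The averaging and sigmoid steps described above can be sketched as follows. This is an illustrative plain-Python version with invented logits and a hypothetical function name, not the detector's actual code.

```python
import math


def decode_message_probs(message_logits):
    """Average each bit's logits over all frames, then apply a sigmoid.

    message_logits is [nbits][frames]; returns one probability per bit.
    """
    bit_probs = []
    for bit_logits in message_logits:
        mean_logit = sum(bit_logits) / len(bit_logits)
        bit_probs.append(1.0 / (1.0 + math.exp(-mean_logit)))
    return bit_probs


# Three bits over four frames. The first two rows each contain one noisy frame
# whose sign disagrees with the rest; the third row carries no evidence.
example_message_logits = [
    [4.0, 5.0, -1.0, 4.0],    # mostly positive -> bit likely 1
    [-3.0, -4.0, -5.0, 1.0],  # mostly negative -> bit likely 0
    [0.5, -0.5, 0.2, -0.2],   # averages to zero -> uncertain
]
bit_probs = decode_message_probs(example_message_logits)
print([round(p, 3) for p in bit_probs])
```

The single dissenting frame in each of the first two rows is outvoted by the average, which is the noise-reduction effect described above.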
High-Level Detection API
The detect_watermark method provides a convenient interface.
Threshold Parameters
Two key thresholds control detection behavior:
Detection Threshold
Lower Threshold (e.g., 0.3)
- More sensitive detection
- Higher recall (fewer false negatives)
- More false positives
Higher Threshold (e.g., 0.7)
- More conservative detection
- Higher precision (fewer false positives)
- More false negatives
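The trade-off can be seen on a handful of synthetic frame probabilities. The sketch below assumes the threshold is compared against each frame's P(watermark present) and that frames are then summarized as a fraction; how the library aggregates frames is not shown here, so treat `fraction_detected` as a hypothetical helper.

```python
def fraction_detected(watermark_probs, threshold=0.5):
    """Fraction of frames whose P(watermark present) exceeds the threshold."""
    flagged = sum(1 for p in watermark_probs if p > threshold)
    return flagged / len(watermark_probs)


# Synthetic per-frame probabilities, from confident to unlikely.
frame_probs = [0.95, 0.9, 0.65, 0.6, 0.45, 0.4, 0.35, 0.2, 0.1, 0.05]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, fraction_detected(frame_probs, threshold))
# 0.3 0.7
# 0.5 0.4
# 0.7 0.2
```

Lowering the threshold flags more frames (including borderline ones), while raising it keeps only the most confident frames.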
Message Threshold
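As a minimal sketch, assuming the message threshold is compared against each per-bit probability to produce a hard 0/1 decision, the effect of changing it can be shown on made-up probabilities. Both helper functions are hypothetical and not part of the AudioSeal API.

```python
def bits_from_probs(bit_probs, message_threshold=0.5):
    """Binarize per-bit probabilities: 1 if above the threshold, else 0."""
    return [1 if p > message_threshold else 0 for p in bit_probs]


def bit_accuracy(decoded, expected):
    """Fraction of decoded bits that match the expected message."""
    matches = sum(1 for d, e in zip(decoded, expected) if d == e)
    return matches / len(expected)


# Invented 16-bit example; two bits (0.52 and 0.48) sit close to the boundary.
bit_probs = [0.97, 0.03, 0.88, 0.52, 0.10, 0.91, 0.48, 0.99,
             0.02, 0.95, 0.07, 0.85, 0.15, 0.93, 0.05, 0.90]
expected = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

decoded_default = bits_from_probs(bit_probs)        # threshold 0.5
decoded_strict = bits_from_probs(bit_probs, 0.6)    # threshold 0.6

print(bit_accuracy(decoded_default, expected))  # 1.0
print(bit_accuracy(decoded_strict, expected))   # 0.9375
```

Only bits whose probability lies near the threshold change when it moves, so confident bits are unaffected by the choice.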
Usage Examples
Basic Detection
Low-Level Detection (Frame-by-Frame)
Custom Thresholds
Localized Detection in Edited Audio
This localized detection enables identifying which parts of an audio file are watermarked, even if the audio has been edited or concatenated with unwatermarked content.
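One way to turn per-frame probabilities into time spans is to group consecutive frames above a threshold. The sketch below is illustrative only: it assumes one probability per audio sample at 16 kHz, uses synthetic probabilities in place of real detector output, and `watermarked_segments` is a hypothetical helper.

```python
def watermarked_segments(watermark_probs, sample_rate=16_000, threshold=0.5):
    """Return (start_sec, end_sec) spans of consecutive frames above threshold.

    Assumes one probability per audio sample, so index / sample_rate is time.
    """
    segments = []
    start = None
    for i, p in enumerate(watermark_probs):
        if p > threshold and start is None:
            start = i  # a watermarked run begins
        elif p <= threshold and start is not None:
            segments.append((start / sample_rate, i / sample_rate))
            start = None  # the run has ended
    if start is not None:  # run extends to the end of the audio
        segments.append((start / sample_rate, len(watermark_probs) / sample_rate))
    return segments


# Synthetic clip: 1 s unwatermarked, 2 s watermarked, 1 s unwatermarked.
sr = 16_000
probs = [0.05] * sr + [0.95] * (2 * sr) + [0.05] * sr
print(watermarked_segments(probs, sr))  # [(1.0, 3.0)]
```

On real detector output the probabilities fluctuate, so a practical version would likely smooth them or merge short gaps before reporting segments.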
Performance Characteristics
Speed
Single forward pass through a convolutional network. Up to 100x faster than iterative decoding methods.
Accuracy
State-of-the-art detection performance even after compression, noise, and editing.
Localization
Frame-level precision enables detection in edited audio at 1/16,000 second resolution.
Scalability
Efficient batch processing for large-scale detection tasks.
Robustness to Audio Transformations
The detector is trained to be robust against common audio manipulations:
Compression
- MP3 encoding (various bitrates)
- AAC encoding
- Opus codec
Noise and Interference
- Additive Gaussian noise
- Environmental noise
- Background music
Editing Operations
- Cutting and splicing
- Concatenation
- Speed changes
- Volume adjustments
Resampling
- Different sample rates (24kHz, 44.1kHz, 48kHz)
- Sample rate conversion
Technical Specifications
| Parameter | Value | Description |
|---|---|---|
| encoder.output_dim | 32 | Encoder output channels |
| nbits | 16 | Message length in bits (0 for detection-only) |
| detection_threshold | 0.5 | Default frame-level threshold |
| message_threshold | 0.5 | Default message bit threshold |
| frames_per_second | ~16,000 | Temporal resolution at 16 kHz |
Design Choices
Why Frame-by-Frame Detection?
Traditional watermark detectors output a single binary decision for an entire audio file. AudioSeal’s frame-by-frame approach enables:
- Localized Detection: Identify which parts are watermarked
- Edit Detection: Find where audio was cut or modified
- Robustness: Aggregate evidence across multiple frames
- Flexibility: Apply different thresholds for different use cases
Why Separate Detection and Message Channels?
The detector outputs both detection logits (2 channels) and message logits (16 channels) simultaneously:
- Detection is always active and works even without a message
- Message is optional metadata that doesn’t affect detection
- Allows using the same model for both 0-bit (detection-only) and 16-bit (detection + message) watermarking
You can train a detector with nbits=0 for detection-only applications, reducing model size and complexity.
Next Steps
Localized Watermarking
Understand sample-level precision
Generation
Learn about watermark generation
API Reference
Full detector API documentation
