What is Localized Watermarking?

Localized watermarking is AudioSeal’s key innovation: the ability to detect watermarks at sample-level precision rather than treating entire audio files as a single unit. This means the detector can identify exactly which portions of an audio signal contain watermarks, down to 1/16,000 of a second (at 16 kHz sample rate).

Traditional Watermarking

File-level detection
Outputs: “This entire file is watermarked” or “This entire file is not watermarked”

Localized Watermarking (AudioSeal)

Frame-level detection
Outputs: “Frames 0-1000 are watermarked, frames 1001-2000 are not, frames 2001-3000 are watermarked…”

Sample-Level Precision

Temporal Resolution

At a 16 kHz sample rate, each detection step spans approximately:
1 / 16,000 seconds = 0.0000625 seconds = 62.5 microseconds
This is far finer than human perception of audio events (~10 milliseconds), enabling:

1. Precise Localization: Identify exact start and end times of watermarked segments
2. Edit Detection: Detect where audio has been cut, spliced, or modified
3. Partial Watermarking: Handle audio that’s only partially watermarked
4. Real-Time Tracking: Monitor watermark presence continuously during playback
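
As a small illustration of the arithmetic above, the sketch below converts sample indices into timestamps. The helper name `sample_to_seconds` is hypothetical and not part of AudioSeal.

```python
SAMPLE_RATE = 16_000  # Hz, the rate used throughout this page


def sample_to_seconds(index: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Return the time offset (in seconds) of a sample index."""
    return index / sample_rate


# Duration of a single sample, in microseconds
resolution_us = 1_000_000 / SAMPLE_RATE
print(resolution_us)             # 62.5
print(sample_to_seconds(8_000))  # 0.5
```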

How It Works

The localized detection is enabled by the detector’s architecture:
# From audioseal/models.py:369
encoder = SEANetEncoderKeepDimension(**detector_config)
Unlike the generator, which compresses audio temporally, the detector uses SEANetEncoderKeepDimension to preserve temporal information:
Generator Encoder (SEANetEncoder)
Input:  (batch, 1, 16000)     # 1 second at 16kHz
Output: (batch, 128, 50)      # Compressed 320x
Temporal compression enables efficient watermark generation.
Detector Encoder (SEANetEncoderKeepDimension)
Input:  (batch, 1, 16000)     # 1 second at 16kHz  
Output: (batch, 32, ~16000)   # Temporal dimension preserved
Temporal preservation enables frame-by-frame detection.
After the encoder, a 1x1 convolution produces per-frame predictions:
last_layer = torch.nn.Conv1d(encoder.output_dim, 2 + nbits, 1)
result = self.detector(x)  # Shape: (batch, 2+nbits, frames)
Each time step in the output corresponds to a prediction for that specific moment in the audio.
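
To make the per-frame idea concrete, here is a minimal pure-Python sketch of a kernel-size-1 convolution: each output frame depends only on the input features at that same frame, so the temporal length is unchanged. The toy feature values, weights, and the two-way softmax are illustrative assumptions, not AudioSeal's actual parameters or code.

```python
import math


def pointwise_conv(x, weight, bias):
    """Apply a kernel-size-1 convolution.

    x:      in_channels lists, each of length `frames`
    weight: out_channels lists, each of length in_channels
    bias:   out_channels floats
    Returns out_channels lists, each of length `frames`.
    """
    frames = len(x[0])
    return [
        [b + sum(w_k * x[k][t] for k, w_k in enumerate(w)) for t in range(frames)]
        for w, b in zip(weight, bias)
    ]


def softmax_second(a, b):
    """Two-way softmax; returns the probability assigned to the second logit."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return eb / (ea + eb)


# Toy encoder output: 3 channels x 5 frames (made-up values)
features = [
    [0.0, 0.1, 2.0, 2.1, 0.0],
    [1.0, 0.9, 0.0, 0.1, 1.0],
    [0.5, 0.5, 0.5, 0.5, 0.5],
]
# Two output channels: "not watermarked" and "watermarked" logits
weight = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
bias = [0.0, 0.0]

logits = pointwise_conv(features, weight, bias)
wm_prob = [softmax_second(logits[0][t], logits[1][t]) for t in range(len(logits[0]))]

print(len(wm_prob))                # 5: one probability per input frame
print([p > 0.5 for p in wm_prob])  # [False, False, True, True, False]
```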

Visualizing Localized Detection

Let’s see what localized detection looks like in practice:
import matplotlib.pyplot as plt
import torch
from audioseal import AudioSeal

# Load detector
detector = AudioSeal.load_detector("audioseal_detector_16bits")
detector.eval()

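# Detect watermarks (`audio` is a 16 kHz waveform tensor loaded beforehand)
with torch.no_grad():  # inference only; also allows .numpy() on the output
    result, message = detector(audio)  # audio shape: (1, 1, 16000)
wm_prob = result[0, 1, :].cpu().numpy()  # Watermark probability per frame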

# Create time axis
time = torch.arange(len(wm_prob)) / 16000  # Convert frames to seconds

# Plot
plt.figure(figsize=(12, 4))
plt.plot(time, wm_prob)
plt.axhline(y=0.5, color='r', linestyle='--', label='Threshold')
plt.xlabel('Time (seconds)')
plt.ylabel('Watermark Probability')
plt.title('Localized Watermark Detection')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
The plot would show probability spikes in watermarked regions and low probabilities in non-watermarked regions, clearly visualizing where watermarks exist.

Benefits Over Traditional Watermarking

1. Robustness to Editing

Imagine a 10-second watermarked audio clip is inserted into the middle of a 60-second unwatermarked recording:
[0-25s: Clean] + [25-35s: Watermarked] + [35-60s: Clean]
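The scenario above can be sketched with synthetic data. The per-sample probabilities below are made up (0.95 inside the clip, 0.05 outside) rather than produced by the detector, and `synthetic_probs` and `find_regions` are hypothetical helpers. Under these assumptions, a single file-level score is about 0.17, which a 0.5 cutoff would read as "not watermarked", while the per-sample view recovers the 25-35 s span.

```python
SAMPLE_RATE = 16_000
THRESHOLD = 0.5


def synthetic_probs(duration_s, wm_start_s, wm_end_s, sample_rate=SAMPLE_RATE):
    """Made-up per-sample probabilities: 0.95 inside the clip, 0.05 outside."""
    start, end = wm_start_s * sample_rate, wm_end_s * sample_rate
    return [0.95 if start <= i < end else 0.05
            for i in range(duration_s * sample_rate)]


def find_regions(probs, threshold=THRESHOLD, sample_rate=SAMPLE_RATE):
    """Return (start_s, end_s) pairs for runs of samples above threshold."""
    regions, start = [], None
    for i, p in enumerate(probs):
        if p > threshold and start is None:
            start = i
        elif p <= threshold and start is not None:
            regions.append((start / sample_rate, i / sample_rate))
            start = None
    if start is not None:  # region runs to the end of the audio
        regions.append((start / sample_rate, len(probs) / sample_rate))
    return regions


probs = synthetic_probs(duration_s=60, wm_start_s=25, wm_end_s=35)
file_level_score = sum(p > THRESHOLD for p in probs) / len(probs)
regions = find_regions(probs)

print(round(file_level_score, 3))  # 0.167
print(regions)                     # [(25.0, 35.0)]
```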

2. Tamper Detection

Localized detection enables identifying audio manipulation:
import numpy as np

def detect_tampering(audio, detector, threshold=0.5):
    """Detect if watermarked audio has been edited."""
    result, _ = detector(audio)
    # detach() so .numpy() works even when gradients are being tracked
    wm_prob = result[0, 1, :].detach().cpu().numpy()

    # Per-frame decision and the points where it flips
    is_watermarked = wm_prob > threshold
    transitions = np.diff(is_watermarked.astype(int))
    num_transitions = int(np.sum(np.abs(transitions)))

    if num_transitions == 0:
        # No flips: the clip is either entirely watermarked or entirely clean
        if is_watermarked.all():
            return "Fully watermarked (no edits detected)"
        return "No watermark detected"
    elif num_transitions <= 2:  # One or two flips: a single boundary or one enclosed segment
        return "Partially watermarked (possible concatenation)"
    else:
        return f"Multiple transitions detected ({num_transitions}) - likely edited"
This enables tamper-evident watermarking: you can detect not just whether audio is watermarked, but whether it’s been modified.

3. Streaming and Real-Time Detection

Because detection is frame-by-frame, you can monitor watermarks in real time:
# Pseudo-code for streaming detection
# get_audio_stream() is a placeholder for your own source of 16 kHz
# tensors of shape (1, 1, samples), e.g. a microphone reader
import torch
from audioseal import AudioSeal

detector = AudioSeal.load_detector("audioseal_detector_16bits")
detector.eval()

audio_stream = get_audio_stream()

for chunk in audio_stream:
    with torch.no_grad():
        result, message = detector(chunk)
    wm_prob = result[0, 1, :].mean().item()  # Average probability for chunk
    
    if wm_prob > 0.5:
        print(f"⚠️ Watermarked audio detected! Message: {message}")
        # Take action (e.g., flag content, trigger alert)
    else:
        print("✓ Clean audio")
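The loop above leaves the chunking itself open. One possible way to cut a buffered signal into fixed-size chunks while tracking each chunk's start time is sketched below. `iter_chunks` is a hypothetical helper, and it operates on a plain list of samples rather than on tensors.

```python
from typing import Iterator, List, Tuple


def iter_chunks(
    samples: List[float], chunk_size: int, sample_rate: int = 16_000
) -> Iterator[Tuple[float, List[float]]]:
    """Yield (start_time_seconds, chunk) pairs covering `samples` in order."""
    for start in range(0, len(samples), chunk_size):
        yield start / sample_rate, samples[start:start + chunk_size]


# 2.5 seconds of silence split into 1-second chunks
signal = [0.0] * 40_000
chunks = list(iter_chunks(signal, chunk_size=16_000))
print([(t, len(c)) for t, c in chunks])
# [(0.0, 16000), (1.0, 16000), (2.0, 8000)]
```

Carrying the start time alongside each chunk lets an alert report where in the stream a watermark was seen. The final chunk may be shorter than the others, as in the output above.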

4. Forensic Analysis

Localized detection supports detailed forensic investigation:
import numpy as np

def forensic_analysis(audio, detector, sample_rate=16000):
    """Detailed analysis of watermark presence."""
    result, message = detector(audio)
    # detach() so .numpy() works even when gradients are being tracked
    wm_prob = result[0, 1, :].detach().cpu().numpy()
    
    # Statistics
    total_frames = len(wm_prob)
    watermarked_frames = np.sum(wm_prob > 0.5)
    watermark_percentage = (watermarked_frames / total_frames) * 100
    
    # Temporal analysis
    watermarked_regions = []
    in_region = False
    start = 0
    
    for i, prob in enumerate(wm_prob):
        if prob > 0.5 and not in_region:
            start = i
            in_region = True
        elif prob <= 0.5 and in_region:
            watermarked_regions.append((start / sample_rate, i / sample_rate))
            in_region = False
    
    if in_region:
        watermarked_regions.append((start / sample_rate, len(wm_prob) / sample_rate))
    
    # Report
    print(f"Total duration: {total_frames / sample_rate:.2f}s")
    print(f"Watermarked: {watermark_percentage:.1f}%")
    print(f"Detected message: {message}")
    print(f"\nWatermarked regions:")
    for i, (start, end) in enumerate(watermarked_regions, 1):
        print(f"  Region {i}: {start:.3f}s - {end:.3f}s ({end-start:.3f}s)")
    
    return watermarked_regions

Technical Implementation

Frame-by-Frame Processing

The detector outputs a probability for each frame:
# From audioseal/models.py:390
def detect_watermark(
    self,
    x: torch.Tensor,
    detection_threshold: float = 0.5,
) -> Tuple[torch.Tensor, torch.Tensor]:
    result, message = self.forward(x)
    
    # result[:, 1, :] contains per-frame watermark probability
    # Shape: (batch, frames)
    
    # Count frames above threshold
    detect_prob = (
        torch.count_nonzero(
            torch.gt(result[:, 1, :], detection_threshold), dim=-1
        ) / result.shape[-1]
    )
    
    return detect_prob, message
The overall detection probability is simply the proportion of frames whose watermark probability exceeds the threshold. This aggregation yields a single score per clip, while the fine-grained per-frame probabilities remain available from the detector’s forward pass.
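
As a worked example of this aggregation, the sketch below computes the same proportion in plain Python on made-up frame probabilities. `detection_score` is a hypothetical stand-in for illustration, not the library method.

```python
def detection_score(frame_probs, threshold=0.5):
    """Fraction of frames whose watermark probability exceeds `threshold`."""
    above = sum(1 for p in frame_probs if p > threshold)
    return above / len(frame_probs)


# Eight made-up per-frame probabilities; five of them exceed 0.5
frame_probs = [0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.95, 0.3]
print(detection_score(frame_probs))  # 0.625
```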

Memory and Computational Considerations

Localized detection requires more memory than compressed representations:
Generator (with temporal compression):
1 second audio (16,000 samples) → 50 latent frames
Memory: ~50 × 128 × 4 bytes = 25.6 KB (per batch item)
Detector (without temporal compression):
1 second audio (16,000 samples) → ~16,000 frames
Memory: ~16,000 × 32 × 4 bytes = 2,048 KB (per batch item)
In this estimate, the detector uses ~80x more memory than the generator for these intermediate representations.
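
The figures above can be reproduced with a few lines of arithmetic. This sketch assumes 4-byte float32 values and counts only the single activation tensor named in each case. `activation_bytes` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4


def activation_bytes(frames, channels, bytes_per_value=BYTES_PER_FLOAT32):
    """Size in bytes of one (channels x frames) activation tensor."""
    return frames * channels * bytes_per_value


generator = activation_bytes(frames=50, channels=128)    # 25,600 bytes
detector = activation_bytes(frames=16_000, channels=32)  # 2,048,000 bytes

print(generator / 1000, "KB")  # 25.6 KB
print(detector / 1000, "KB")   # 2048.0 KB
print(detector / generator)    # 80.0
```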

Comparison with Other Approaches

Traditional Spread Spectrum

Detection: Global correlation
Pros: Well-established, theoretically sound
Cons: Slow, file-level only, not robust to edits

Patchwork/LSB

Detection: Statistical analysis of regions
Pros: Fast embedding
Cons: Not robust, limited to specific domains, no localization

AudioSeal (Neural)

Detection: Deep learning, frame-by-frame
Pros: Fast, robust, localized, high accuracy
Cons: Requires training, GPU for best speed

Practical Applications

1. Content Verification

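def verify_audio_authenticity(audio_file, detector):
    """Check if audio is fully watermarked (not tampered)."""
    # load_audio is a placeholder for your own loader; it should return
    # a 16 kHz tensor of shape (1, 1, samples)
    audio = load_audio(audio_file)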
    result, message = detector(audio)
    wm_prob = result[0, 1, :]
    
    coverage = (wm_prob > 0.5).float().mean().item()
    
    if coverage > 0.99:
        return "Authentic", message
    elif coverage > 0.5:
        return "Partially modified", message
    else:
        return "Not authentic or heavily modified", None

2. AI-Generated Content Detection

def check_ai_generated(audio, detector, expected_message):
    """Verify if audio was generated by a specific AI model."""
    detect_prob, message = detector.detect_watermark(audio)
    
    if detect_prob < 0.5:
        return "Not watermarked - origin unknown"
    
    if torch.equal(message, expected_message):
        return f"Confirmed: Generated by AI model {expected_message}"
    else:
        return f"Watermarked but message mismatch: {message}"
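An exact match can be strict if a single bit is decoded incorrectly. The sketch below shows one way to binarize a decoded message and count matching bits, assuming the detector yields one probability per message bit. The probabilities, the expected bits, and the helper names `binarize` and `matching_bits` are all made up for illustration.

```python
def binarize(bit_probs, threshold=0.5):
    """Turn per-bit probabilities into a list of 0/1 bits."""
    return [1 if p > threshold else 0 for p in bit_probs]


def matching_bits(decoded, expected):
    """Count positions where two equal-length bit lists agree."""
    if len(decoded) != len(expected):
        raise ValueError("messages must have the same number of bits")
    return sum(d == e for d, e in zip(decoded, expected))


# Made-up 16-bit example
bit_probs = [0.91, 0.12, 0.85, 0.40, 0.07, 0.66, 0.95, 0.03,
             0.88, 0.21, 0.79, 0.55, 0.10, 0.93, 0.35, 0.72]
expected = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1]

decoded = binarize(bit_probs)
print(decoded)
print(matching_bits(decoded, expected), "of", len(expected), "bits match")
```

Whether a near match should count as a confirmation is an application-level decision and is not something this page specifies.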

3. Broadcast Monitoring

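import time
import torch

def monitor_broadcast(audio_stream, detector, model_id):
    """Monitor live audio stream for AI-generated content."""
    for chunk in audio_stream:
        result, message = detector(chunk)
        wm_prob = result[0, 1, :].mean()
        # Assumption: the forward pass returns per-bit probabilities, so
        # binarize them before comparing with the expected bits in model_id
        bits = (message > 0.5).to(model_id.dtype)
        
        if wm_prob > 0.5 and torch.equal(bits, model_id):
            timestamp = time.time()
            # alert() is a placeholder for your own notification hook
            alert(f"AI-generated content detected at {timestamp}")
            # Log, alert, or take other actions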

Limitations and Considerations

While localized detection is powerful, there are some considerations:

At the edges of watermarked regions, detection probability may gradually transition rather than showing sharp boundaries. This is due to:
  • Receptive field of the convolutional network
  • Temporal smoothing in the architecture
Typically affects ~0.1-0.5 seconds at boundaries.

Very short watermarked segments (< 0.5 seconds) may be harder to detect reliably due to:
  • Limited evidence to aggregate
  • Boundary effects being proportionally larger
Best performance is achieved with segments > 1 second.

While robust to many transformations, extreme modifications can affect localization accuracy:
  • Time stretching > 20%
  • Pitch shifting > 2 semitones
  • Very aggressive compression (< 16 kbps)

Summary

Localized watermarking is what makes AudioSeal uniquely powerful:
  • Sample-level precision (1/16,000 second) enables detection in edited audio
  • Frame-by-frame probabilities provide fine-grained information about watermark presence
  • Tamper detection identifies where audio has been modified
  • Real-time monitoring tracks watermarks continuously in streaming audio
  • Forensic analysis supports detailed investigation of audio authenticity
