Research Paper

AudioSeal is based on peer-reviewed research published at ICML 2024. This page provides information about the paper, citations, and related work.

Paper Details

Publication Information

Proactive Detection of Voice Cloning with Localized Watermarking

Authors: Robin San Roman, Pierre Fernandez, Hady Elsahar, Alexandre Défossez, Teddy Furon, Tuan Tran
Conference: International Conference on Machine Learning (ICML) 2024
Acceptance: May 31, 2024
arXiv: 2401.17264

Quick Links

arXiv Paper

Read the full paper on arXiv

Project Webpage

Interactive demos and visualizations

Official Blog

Meta AI announcement and overview

Press Coverage

MIT Technology Review article

Abstract

AudioSeal introduces an audio watermarking technique that combines localized watermarking with a novel perceptual loss. The method jointly trains two components:

Generator: Embeds an imperceptible watermark into audio
Detector: Identifies watermark fragments in long or edited audio files

Key Innovations

Localized Watermarking

AudioSeal performs watermarking at the sample level (1/16,000 of a second at a 16 kHz sampling rate), enabling precise detection even in heavily edited audio. Because detection is localized, it can identify which specific segments of audio contain watermarks, making it robust against:

Audio splicing and editing
Concatenation with non-watermarked audio
Partial audio extraction

The model works well with multiple sampling rates, including 16 kHz, 24 kHz, 44.1 kHz, and 48 kHz.
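To illustrate what sample-level localization makes possible, here is a minimal sketch that turns a sequence of per-sample detection scores into watermarked time spans. It does not call the AudioSeal library: the function name `watermarked_segments`, the 0.5 threshold, and the toy score values are all assumptions chosen for illustration, and a real detector's output would be noisier and would likely need smoothing.

```python
from typing import List, Tuple


def watermarked_segments(
    probs: List[float],
    sample_rate: int = 16_000,
    threshold: float = 0.5,
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose per-sample score exceeds threshold.

    `probs` is a hypothetical per-sample watermark probability, one value
    per audio sample. The threshold is an illustrative choice.
    """
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first sample in the currently open span
    for i, p in enumerate(probs):
        if p > threshold and start is None:
            start = i  # a watermarked span begins here
        elif p <= threshold and start is not None:
            segments.append((start / sample_rate, i / sample_rate))
            start = None  # the span ended on the previous sample
    if start is not None:
        # The audio ends while still inside a watermarked span.
        segments.append((start / sample_rate, len(probs) / sample_rate))
    return segments


# Toy input at 16 kHz: 1 s unmarked, 2 s watermarked, 1 s unmarked,
# as might happen when a watermarked clip is spliced into other audio.
probs = [0.02] * 16_000 + [0.97] * 32_000 + [0.03] * 16_000
print(watermarked_segments(probs))  # [(1.0, 3.0)]
```

Passing a different `sample_rate` (for example 48_000) changes only the conversion from sample indices to seconds; the span-finding logic is the same.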

Perceptual Quality

AudioSeal uses a novel perceptual loss function that ensures watermarks remain imperceptible to human listeners while maintaining detectability. The watermarking process:

Has minimal impact on audio quality
Preserves the naturalness of speech and music
Maintains audio fidelity across different content types

Robustness

The model demonstrates state-of-the-art robustness against various audio manipulations:

Compression: MP3, AAC, Opus at various bitrates
Re-encoding: Multiple encode-decode cycles
Noise addition: Background noise, distortion
Re-sampling: Sample rate conversions
Speed changes: Time stretching and compression
Filtering: Low-pass, high-pass, band-pass filters

Detection Speed

AudioSeal achieves detection speeds two orders of magnitude faster than existing models through:

Single-pass detection architecture
Efficient neural network design
Optimized inference pipeline
Real-time processing capabilities

This makes AudioSeal ideal for large-scale and real-time applications where millions of audio files need to be processed.

Citation

If you use AudioSeal in your research, please cite:

@article{sanroman2024proactive,
  title={Proactive Detection of Voice Cloning with Localized Watermarking},
  author={San Roman, Robin and Fernandez, Pierre and Elsahar, Hady and D\'efossez, Alexandre and Furon, Teddy and Tran, Tuan},
  journal={ICML},
  year={2024}
}

Please use this citation format in academic papers, technical reports, and publications that build upon or evaluate AudioSeal.

Key Contributions

The paper makes several significant contributions to the field of audio watermarking:

1. Novel Architecture

First localized audio watermarking system operating at sample-level precision
Joint training of generator and detector for optimal performance
Efficient neural network design enabling real-time processing

2. Perceptual Loss Function

Custom loss function balancing imperceptibility and robustness
Multi-scale perceptual evaluation
Quality preservation across diverse audio content

3. Optional Message Embedding

Support for 16-bit secret messages (65,536 possible values)
Message embedding without affecting detection performance
Useful for model versioning and content tracking
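As a rough illustration of what a 16-bit message can carry, the sketch below converts an integer identifier (such as a model version number) to a 16-element bit list and back, and counts bit mismatches between a sent and a decoded message. The helper names `int_to_bits`, `bits_to_int`, and `bit_errors`, and the most-significant-bit-first ordering, are assumptions made for this example; they are not part of the AudioSeal API.

```python
from typing import List

N_BITS = 16  # message length described above: 2**16 = 65,536 values


def int_to_bits(value: int, n_bits: int = N_BITS) -> List[int]:
    """Encode an integer as a list of bits, most significant bit first."""
    if not 0 <= value < 2 ** n_bits:
        raise ValueError(f"value must be in [0, {2 ** n_bits - 1}]")
    return [(value >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]


def bits_to_int(bits: List[int]) -> int:
    """Decode a most-significant-bit-first bit list back into an integer."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def bit_errors(sent: List[int], received: List[int]) -> int:
    """Count positions where the decoded message differs from the original."""
    return sum(s != r for s, r in zip(sent, received))


message = int_to_bits(2024)          # hypothetical model/version identifier
print(len(message))                  # 16
print(bits_to_int(message))          # 2024

# Simulate a decode in which one bit was flipped by an audio edit.
corrupted = message.copy()
corrupted[0] ^= 1
print(bit_errors(message, corrupted))  # 1
```

Counting bit errors rather than requiring an exact match is one possible design choice when decoding from edited audio, since a few flipped bits can still point to the closest known identifier.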

4. Comprehensive Evaluation

Extensive robustness testing against common attacks
Comparison with state-of-the-art methods
Speed benchmarks demonstrating 100x improvement

5. Open Source Release

Full implementation released under MIT license
Pre-trained models on Hugging Face Hub
Training code and evaluation tools provided

Related Work

The AudioSeal team has also developed other open-source watermarking solutions for different media types:

WMAR

Autoregressive watermarking for images
Advanced image watermarking using autoregressive models for imperceptible and robust watermark embedding.

Video Seal

Open and efficient video watermarking
Extend watermarking techniques to video content with temporal consistency and efficient processing.

WAM

Watermark Anything with Localization
General-purpose watermarking framework that can be applied to any image with localization capabilities.

These projects share similar design philosophies emphasizing robustness, imperceptibility, and open-source availability.

Use Cases

The research enables several practical applications:

Voice Cloning Detection

Proactively detect AI-generated voice clones by watermarking synthetic speech at generation time.

Content Authentication

Verify the authenticity of audio recordings by checking for watermarks embedded by trusted sources.

Copyright Protection

Protect audio content from unauthorized distribution while maintaining audio quality.

Model Version Tracking

Embed model version information in generated audio for traceability and accountability.

Forensic Analysis

Identify which portions of edited audio contain watermarks for forensic investigations.

Press and Media Coverage

MIT Technology Review

“Meta has created a way to watermark AI-generated speech”
June 18, 2024
In-depth coverage of AudioSeal’s technology and implications for AI-generated content detection.

Additional Coverage

Meta AI Blog: Releasing new AI research models to accelerate innovation at scale
Project Webpage: Interactive demos and technical details

Updates and Timeline

January 2024

Initial paper submitted to arXiv (2401.17264)

April 2024

License updated to full MIT license for code and model weights, enabling commercial use

May 2024

Paper accepted at ICML 2024

June 2024

Training code released with comprehensive documentation

December 2024

AudioSeal 0.2 released with streaming support and improvements

Technical Resources

For researchers and developers:

Paper: arXiv:2401.17264
Code: GitHub Repository
Models: Hugging Face Hub
Training Guide: TRAINING.md
Examples: Jupyter Notebooks

When using AudioSeal for research, ensure you cite the paper and acknowledge the use of pre-trained models from Meta AI.

Contact and Collaboration

For research collaborations, questions about the paper, or technical discussions:

Open an issue on GitHub
Visit the project webpage
Check the Discussions section

The research team welcomes contributions, bug reports, and suggestions for improvements. See the Contributing Guide for details.
