Safeguarding Diffusion Models: Embedding Space Distortion for NSFW Defense
We re-implemented and extended the DES (Distorting Embedding Space) framework to defend against adversarial NSFW prompts in text-to-image models like Stable Diffusion. Our improvements preserve generation quality while strengthening safety through targeted text encoder fine-tuning.
Core Contributions:
- Reproduced DES with a fine-tuned CLIP text encoder, adding safety with zero inference-time overhead (only the encoder weights change; the generation pipeline is untouched).
- Introduced five new loss terms, including Multi-Concept Neutralization (MCN), Maximum Mean Discrepancy (MMD) distribution matching, and Harm Concept Repulsion (push_harm).
- Achieved lower FID (16.98 vs. 21.86 baseline), higher CLIP score (30.6 vs. 28.52), and near-baseline attack success rate (ASR, 0.33% vs. 0.31%) with the MMD + push_harm configuration.
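As an illustration of the Harm Concept Repulsion idea, below is a minimal NumPy sketch of what a push_harm-style penalty could look like. The function name, the use of a single harmful-concept anchor vector, and the squared-cosine penalty are our assumptions for exposition, not the exact implementation.

```python
import numpy as np

def push_harm_loss(text_embeds: np.ndarray, harm_anchor: np.ndarray) -> float:
    """Hypothetical orthogonal-repulsion penalty: drive the cosine
    similarity between each text embedding and a harmful-concept
    anchor toward zero, pushing embeddings orthogonal to the concept."""
    e = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    a = harm_anchor / np.linalg.norm(harm_anchor)
    cos_sim = e @ a                      # per-example cosine similarity
    return float(np.mean(cos_sim ** 2))  # 0 when fully orthogonal
```

In practice such a term would be minimized jointly with the base DES losses, so that safe-prompt embeddings are preserved while harmful directions are suppressed.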
Key Techniques:
- CLIP ViT-L/14 embedding manipulation via loss-based distortion (UEN, SEP, NEN, MCN).
- Orthogonal repulsion to separate harmful embeddings in latent space.
- Distribution alignment between unsafe and safe embeddings using MMD.
Results:
| Configuration   | FID ↓ | CLIP ↑ | ASR ↓ |
|-----------------|-------|--------|-------|
| Baseline        | 21.86 | 28.52  | 0.31% |
| MMD             | 17.98 | 30.57  | 0.41% |
| MMD + push_harm | 16.98 | 30.60  | 0.33% |
Impact:
This project showed that embedding-level defenses like DES can be significantly enhanced with distribution-aware and concept-repelling techniques, leading to safer generation without sacrificing image quality.
Topics:
- Diffusion Models
- NSFW Defense
- CLIP
- Stable Diffusion
- DES
- Contrastive Learning
- MMD
- Prompt Safety
- Text-to-Image