Safeguarding Diffusion Models: Embedding Space Distortion for NSFW Defense
We re-implemented and extended the DES (Distorting Embedding Space) framework to defend against adversarial NSFW prompts in text-to-image models like Stable Diffusion. Our improvements preserve generation quality while strengthening safety through targeted text encoder fine-tuning.
Core Contributions:
- Reproduced DES with a fine-tuned CLIP text encoder, adding safety with zero inference-time overhead (only the encoder weights change; the generation pipeline is untouched).
- Introduced five new loss terms, including Multi-Concept Neutralization (MCN), Maximum Mean Discrepancy (MMD) distribution matching, and Harm Concept Repulsion (push_harm).
- Achieved lower FID (16.98 vs. 21.86 baseline), higher CLIP score (30.6 vs. 28.52), and near-baseline attack success rate (ASR, 0.33% vs. 0.31%) with the MMD + push_harm configuration.
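As an illustration of the Harm Concept Repulsion idea, below is a minimal NumPy sketch of what a push_harm-style penalty could look like. The function name, the use of a single harmful-concept anchor vector, and the squared-cosine penalty are our assumptions for exposition, not the exact implementation.

```python
import numpy as np

def push_harm_loss(text_embeds: np.ndarray, harm_anchor: np.ndarray) -> float:
    """Hypothetical orthogonal-repulsion penalty: drive the cosine
    similarity between each text embedding and a harmful-concept
    anchor toward zero, pushing embeddings orthogonal to the concept."""
    e = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    a = harm_anchor / np.linalg.norm(harm_anchor)
    cos_sim = e @ a                      # per-example cosine similarity
    return float(np.mean(cos_sim ** 2))  # 0 when fully orthogonal
```

In practice such a term would be minimized jointly with the base DES losses, so that safe-prompt embeddings are preserved while harmful directions are suppressed.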
Key Techniques:
- CLIP ViT-L/14 embedding manipulation via loss-based distortion (UEN, SEP, NEN, MCN).
- Orthogonal repulsion to separate harmful embeddings in latent space.
- Distribution alignment between unsafe and safe embeddings using MMD.
Results:
| Configuration   | FID ↓ | CLIP ↑ | ASR ↓ |
|-----------------|-------|--------|-------|
| Baseline        | 21.86 | 28.52  | 0.31% |
| MMD             | 17.98 | 30.57  | 0.41% |
| MMD + push_harm | 16.98 | 30.60  | 0.33% |
Impact:
This project showed that embedding-level defenses like DES can be significantly enhanced with distribution-aware and concept-repelling techniques, leading to safer generation without sacrificing image quality.
Topics:
- Diffusion Models
- NSFW Defense
- CLIP
- Stable Diffusion
- DES
- Contrastive Learning
- MMD
- Prompt Safety
- Text-to-Image