
Course: Introduction to Deep Learning (11-785)
Time Spent: -- hours
Source Code: on GitHub

Safeguarding Diffusion Models: Embedding Space Distortion for NSFW Defense


We re-implemented and extended the DES (Distorting Embedding Space) framework to defend against adversarial NSFW prompts in text-to-image models like Stable Diffusion. Our improvements preserve generation quality while strengthening safety through targeted text encoder fine-tuning.
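To give a flavor of that fine-tuning, here is a minimal sketch assuming a PyTorch + Hugging Face Transformers setup. The prompt lists, step count, learning rate, and the simple MSE pull toward a frozen neutral embedding (in the spirit of DES's unsafe-embedding neutralization) are illustrative placeholders, not the project's actual configuration:

```python
# Minimal DES-style fine-tuning sketch; prompts and hyperparameters are
# placeholders, not the project's actual values.
import torch
import torch.nn.functional as F
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "openai/clip-vit-large-patch14"  # text encoder used by Stable Diffusion v1.x

tokenizer = CLIPTokenizer.from_pretrained(name)
encoder = CLIPTextModel.from_pretrained(name).to(device)        # trainable
frozen = CLIPTextModel.from_pretrained(name).to(device).eval()  # fixed reference
for p in frozen.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-5)

def embed(model, prompts):
    # Pooled sentence embeddings from a CLIP text encoder.
    tokens = tokenizer(prompts, padding="max_length", truncation=True,
                       return_tensors="pt").to(device)
    return model(**tokens).pooler_output

unsafe_prompts = ["<adversarial NSFW prompt>"]  # placeholder data
neutral_prompts = [""] * len(unsafe_prompts)    # empty prompt as a neutral anchor

for step in range(100):
    # Pull unsafe embeddings toward the frozen encoder's neutral embedding,
    # a stand-in for DES's unsafe-embedding neutralization (UEN) term.
    loss = F.mse_loss(embed(encoder, unsafe_prompts),
                      embed(frozen, neutral_prompts))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```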


Core Contributions:

  • Reproduced DES with a fine-tuned CLIP text encoder for zero-overhead inference safety (see the deployment sketch after this list).
  • Introduced five new loss terms, including Multi-Concept Neutralization, Distribution Matching via Maximum Mean Discrepancy (MMD), and Harm Concept Repulsion (push_harm).
  • Achieved lower FID (16.98), higher CLIP score (30.60), and near-baseline ASR (0.33%) with the MMD + push_harm configuration.
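Because the defense lives entirely in the text encoder's weights, deploying it is just a checkpoint swap; the rest of the pipeline, and hence inference cost, is unchanged. A minimal sketch with the diffusers API, where the fine-tuned checkpoint path is hypothetical:

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# Swap in the DES fine-tuned encoder (path is hypothetical); no other
# component changes, so generation runs at the stock pipeline's speed.
pipe.text_encoder = CLIPTextModel.from_pretrained(
    "path/to/des-finetuned-text-encoder", torch_dtype=torch.float16
)
pipe.to("cuda")
image = pipe("a watercolor lighthouse at dawn").images[0]
```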

Key Techniques:

  • CLIP ViT-L/14 embedding manipulation via loss-based distortion (UEN, SEP, NEN, MCN).
  • Orthogonal repulsion to separate harmful embeddings in latent space.
  • Distribution alignment between unsafe and safe embeddings using MMD (minimal loss sketches follow this list).
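A minimal sketch of an MMD loss for the distribution-alignment term, assuming a Gaussian (RBF) kernel; the bandwidth and the biased estimator are illustrative choices, and the project's exact estimator may differ:

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel matrix between the rows of a and b.
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(x, y, sigma=1.0):
    # Biased squared-MMD estimate: drive the unsafe batch x toward the
    # distribution of the safe batch y.
    k_xx = gaussian_kernel(x, x, sigma).mean()
    k_yy = gaussian_kernel(y, y, sigma).mean()
    k_xy = gaussian_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2 * k_xy
```

And one plausible reading of the orthogonal harm-concept repulsion: penalize squared cosine similarity between prompt embeddings and a set of cached harmful-concept directions. The function name and the anchor set are hypothetical:

```python
import torch.nn.functional as F

def harm_repulsion_loss(emb, harm_anchors):
    # Push embeddings toward orthogonality with harmful-concept directions
    # by penalizing their squared cosine similarity.
    emb = F.normalize(emb, dim=-1)
    anchors = F.normalize(harm_anchors, dim=-1)
    cos = emb @ anchors.t()  # (batch, n_anchors)
    return cos.pow(2).mean()
```

The two terms could then be combined as, e.g., `loss = mmd_loss(unsafe, safe) + lam * harm_repulsion_loss(unsafe, anchors)` with a placeholder weight `lam`, mirroring the MMD + push_harm configuration evaluated below.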

Results (FID: lower is better; CLIP score: higher is better; ASR = attack success rate, lower is better):

  • Baseline: FID 21.86, CLIP 28.52, ASR 0.31%
  • MMD: FID 17.98, CLIP 30.57, ASR 0.41%
  • MMD + push_harm: FID 16.98, CLIP 30.60, ASR 0.33%

Impact:

This project showed that embedding-level defenses like DES can be significantly enhanced with distribution-aware and concept-repelling techniques, leading to safer generation without sacrificing image quality.

Tags:

  • Diffusion Models
  • NSFW Defense
  • CLIP
  • Stable Diffusion
  • DES
  • Contrastive Learning
  • MMD
  • Prompt Safety
  • Text-to-Image