Latent Guard++: A Context-Aware Safety Framework for Generative Models
In this project, we developed Latent Guard++, an enhanced safety framework for filtering unsafe prompts in text-to-image generation. It builds on the Latent Guard baseline by introducing dynamic thresholding and multi-stage filtering, improving detection accuracy while reducing inference cost.
Core Contributions:
- Dynamic Thresholding: Adjusts the decision boundary per prompt using an LLM-estimated risk score, enabling more nuanced safety decisions on ambiguous or borderline prompts (see the sketch after this list).
- Multi-Stage Filtering: Combines fast keyword-level filtering, latent embedding scoring, and an LLM-classification fallback for uncertain cases.
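A minimal sketch of the dynamic-thresholding idea, assuming an upstream LLM provides a risk estimate in [0, 1]. The names `dynamic_threshold` and `decide`, the base threshold, shift range, and margin are all illustrative placeholders, not the project's tuned parameters.

```python
def dynamic_threshold(risk: float,
                      base_threshold: float = 0.5,
                      max_shift: float = 0.15) -> float:
    """Map an LLM-estimated risk in [0, 1] to a per-prompt decision threshold.

    Higher estimated risk lowers the threshold (stricter filtering);
    lower risk raises it (fewer false positives on benign prompts).
    """
    risk = min(max(risk, 0.0), 1.0)            # clamp to [0, 1]
    return base_threshold - max_shift * (2.0 * risk - 1.0)


def decide(unsafety_score: float, risk: float, margin: float = 0.05) -> str:
    """Three-way decision: block, allow, or defer borderline cases to the LLM stage."""
    threshold = dynamic_threshold(risk)
    if unsafety_score >= threshold + margin:
        return "block"
    if unsafety_score <= threshold - margin:
        return "allow"
    return "defer_to_llm"                      # borderline case, verify with the LLM
```

The margin around the threshold is what routes only genuinely ambiguous prompts to the more expensive LLM check; clear-cut prompts are decided immediately.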
Key Components:
- Latent Guard: A contrastive embedding model that scores prompts by their similarity to a bank of harmful-concept embeddings.
- LLM Integration: Uses a large language model to selectively verify prompts whose scores fall near the decision boundary.
- Word-Level Filtering: Blocks prompts containing inherently unsafe terms before any further analysis (a combined sketch of the three stages follows this list).
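An illustrative sketch of how the three components could compose into one staged filter. `encode`, `concept_bank`, `llm_is_unsafe`, and the toy blocklist are assumptions standing in for the actual text encoder, harmful-concept embeddings, and LLM classifier; the threshold would come from the dynamic-thresholding sketch above.

```python
from typing import Callable

import numpy as np

BLOCKLIST = {"gore", "beheading"}  # toy word-level blocklist, for illustration only


def latent_unsafety_score(prompt: str,
                          encode: Callable[[str], np.ndarray],
                          concept_bank: np.ndarray) -> float:
    """Max cosine similarity between the prompt embedding and harmful-concept embeddings."""
    e = encode(prompt)
    e = e / (np.linalg.norm(e) + 1e-8)
    bank = concept_bank / (np.linalg.norm(concept_bank, axis=1, keepdims=True) + 1e-8)
    return float(np.max(bank @ e))


def filter_prompt(prompt: str,
                  encode: Callable[[str], np.ndarray],
                  concept_bank: np.ndarray,
                  threshold: float,                      # e.g. from dynamic_threshold()
                  llm_is_unsafe: Callable[[str], bool],
                  margin: float = 0.05) -> bool:
    """Return True if the prompt should be blocked."""
    # Stage 1: cheap word-level screen for inherently unsafe terms.
    if any(term in prompt.lower() for term in BLOCKLIST):
        return True
    # Stage 2: latent-embedding score compared against the (dynamic) threshold.
    score = latent_unsafety_score(prompt, encode, concept_bank)
    if score >= threshold + margin:
        return True
    if score <= threshold - margin:
        return False
    # Stage 3: only borderline prompts reach the expensive LLM classifier.
    return llm_is_unsafe(prompt)
```

Ordering the stages from cheapest to most expensive is what keeps LLM calls rare: most prompts are resolved by the blocklist or the embedding score alone.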
Empirical Results:
- Achieved up to a 13% accuracy improvement over fixed-threshold baselines on unseen datasets (e.g., Unsafe Diffusion, I2P++).
- Reduced LLM usage substantially through dynamic filtering thresholds and staged evaluation, yielding significant efficiency gains.
- Outperformed the baseline Latent Guard, particularly on ambiguous prompts and in generalizing to out-of-distribution data.
Reflection:
This project deepened my understanding of generative model safety, contrastive learning, dynamic inference strategies, and prompt classification using LLMs. It also strengthened my skills in designing interpretable, efficient multi-stage systems that balance accuracy and cost.
Keywords:
- AI Safety
- Text-to-Image Models
- Prompt Filtering
- Latent Guard
- Dynamic Thresholding
- Multi-Stage Filtering
- Contrastive Learning
- Large Language Models
- LLM-guided Risk Estimation
- Deep Learning