Overview
This page provides a curated collection of research papers, tools, and frameworks that advance the field of AI safety, alignment, and adversarial testing. These resources are essential for security researchers, red teamers, and practitioners working on LLM security.Red Teaming & Security Testing Tools
PyRIT (Python Risk Identification Toolkit)
Developed by: Microsoft Repository: github.com/Azure/PyRIT PyRIT is an open-source framework for automating AI red teaming. It provides a comprehensive toolkit for identifying and testing security risks in generative AI systems. Key Features:- Automated attack orchestration
- Multi-turn conversation attacks
- Plugin architecture for custom attack strategies
- Support for Azure OpenAI, OpenAI, and local models
- Scoring and evaluation frameworks
- Memory and conversation management
- Automated jailbreak testing
- Prompt injection vulnerability discovery
- Multi-modal attack testing
- Red team campaign management
Garak
Developed by: NVIDIA Repository: github.com/leondz/garak Garak is an LLM vulnerability scanner that probes language models for various failure modes, weaknesses, and security vulnerabilities. Named after the cunning character from Star Trek: Deep Space Nine. Key Features:- 100+ built-in vulnerability probes
- Modular detector system
- Support for multiple LLM providers (OpenAI, Hugging Face, Anthropic, etc.)
- Extensible plugin architecture
- Comprehensive reporting
- Prompt injection attacks
- Data leakage detection
- Toxicity and bias testing
- Hallucination detection
- Encoding-based attacks
- Malware generation attempts
- PII extraction
Promptfoo
Repository: github.com/promptfoo/promptfoo Website: promptfoo.dev Promptfoo is a testing and evaluation framework for LLM applications. It enables systematic testing of prompts across different models, configurations, and adversarial scenarios. Key Features:- Red team testing suite with built-in adversarial attacks
- Systematic prompt comparison and evaluation
- Support for 50+ LLM providers
- Custom evaluation metrics
- CI/CD integration
- Visual diff and comparison tools
- Regression testing for prompt changes
- Jailbreak detection
- Prompt injection testing
- PII leakage detection
- Hallucination testing
- Competition for specific harmful outputs
PAIR (Prompt Automatic Iterative Refinement)
Paper: “Jailbreaking Black Box Large Language Models in Twenty Queries” (2023) Repository: Research implementation available PAIR is an algorithm that automatically generates semantic jailbreaks for black-box LLMs by using an attacker LLM to iteratively refine prompts based on defender responses. How It Works:- Start with an initial jailbreak attempt
- Use an attacker LLM to analyze the refusal
- Generate improved attack based on feedback
- Iterate until success or max attempts reached
HarmBench
Paper: “HarmBench: A Standardized Evaluation Framework for Automated Red Teaming” (2024) Repository: github.com/centerforaisafety/HarmBench A standardized evaluation framework for automated red teaming and adversarial testing of LLMs. Components:- Standardized harmful behavior dataset
- Multiple attack methods (GCG, AutoDAN, PAIR, etc.)
- Automated evaluation pipeline
- Benchmarking across models
- 400+ diverse harmful behaviors across categories
- Reproducible evaluation protocol
- Support for both white-box and black-box attacks
- Comprehensive leaderboard
Anthropic’s Model Context Protocol (MCP)
Documentation: modelcontextprotocol.io While not exclusively a red teaming tool, MCP provides a standardized way to connect LLMs with external data sources and tools, which is relevant for testing context injection vulnerabilities.PurpleLlama
Developed by: Meta Repository: github.com/meta-llama/PurpleLlama An umbrella project for tools and evaluations to assess and improve LLM safety, including CyberSecEval for cybersecurity risk assessment. Components:- CyberSecEval: Benchmarks for cybersecurity risks
- Llama Guard: Input/output safeguard model
- Safety classifiers and evaluation datasets
- Insecure code generation
- Compliance with security best practices
- Prompt injection susceptibility
- Malicious code generation
Research Papers & Frameworks
Foundational Safety Research
Constitutional AI
Paper: “Constitutional AI: Harmlessness from AI Feedback” (Anthropic, 2022) Key Contribution: Training AI systems to be harmless using AI-generated feedback based on a constitution of principles. Core Concepts:- Critique and revision by the model itself
- Principle-based training instead of human feedback alone
- Reduced reliance on human labor for safety training
RLHF (Reinforcement Learning from Human Feedback)
Papers:- “Training language models to follow instructions with human feedback” (OpenAI, 2022)
- “Learning to summarize from human feedback” (OpenAI, 2020)
- Supervised fine-tuning on demonstrations
- Training a reward model from comparison data
- Optimizing the policy with PPO against the reward model
Attack Research
Universal and Transferable Adversarial Attacks
Paper: “Universal and Transferable Adversarial Attacks on Aligned Language Models” (2023) Authors: Zou et al. Key Contribution: Demonstrated that adversarial suffixes can be automatically generated to reliably jailbreak aligned LLMs. Method: Greedy Coordinate Gradient (GCG) attack- Optimizes adversarial suffix tokens
- Transferable across models
- Works on both open and closed-source models
Many-Shot Jailbreaking
Paper: “Many-shot jailbreaking” (Anthropic, 2024) Key Finding: Models with very long context windows are vulnerable to jailbreaking through dozens of examples of undesirable behavior within a single prompt. Mechanism: In-context learning overwhelms safety training when provided with sufficient examples.Prompt Injection Attacks
Paper: “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection” (2023) Key Contribution: Demonstrated indirect prompt injection in real-world applications where attackers inject malicious instructions into data sources. Attack Vectors:- Email content
- Web pages processed by LLM agents
- Database records
- API responses
Jailbroken: How Does LLM Safety Training Fail?
Paper: “Jailbroken: How Does LLM Safety Training Fail?” (2023) Authors: Wei et al. Key Findings:- Competing objectives during training
- Mismatched generalization between capabilities and safety
- Analysis of why aligned models still fail
Defense Research
Llama Guard
Paper: “Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations” (Meta, 2023) Key Contribution: Specialized LLM trained to classify potentially unsafe content in both user inputs and model outputs. Categories:- Violence and hate
- Sexual content
- Criminal planning
- Guns and illegal weapons
- Regulated or controlled substances
- Self-harm
Self-Destructing Models
Paper: “Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models” (2024) Key Contribution: Framework where models can be designed to resist being fine-tuned for harmful purposes.Circuit Breakers
Paper: “Circuit Breakers: Learned Mechanisms for Interrupting Harmful Behaviors” (2024) Key Contribution: Training models with “circuit breakers” that interrupt processing when harmful behaviors are detected, providing more robust refusal.Evaluation Benchmarks & Datasets
TruthfulQA
Paper: “TruthfulQA: Measuring How Models Mimic Human Falsehoods” (2021) Tests whether models generate truthful answers to questions that humans commonly answer incorrectly.BOLD (Bias in Open-Ended Language Generation Dataset)
Paper: “BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation” (2021) Evaluates fairness and bias in open-ended text generation across different demographic groups.AdvBench
Repository: Part of adversarial attack research Collection of harmful prompts used to evaluate model safety and jailbreak resistance.ToxicChat
Paper: “ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversations” (2023) Real-world toxic conversations for evaluating toxicity detection in human-AI interactions.Do-Not-Answer
Repository: github.com/Libr-AI/do-not-answer Curated dataset of prompts that responsible AI systems should refuse to answer.Open Source Safety Models
Perspective API
Developed by: Jigsaw (Google) Website: perspectiveapi.com API for detecting toxic comments, threats, insults, and other harmful content in text.Detoxify
Repository: github.com/unitaryai/detoxify Toxic comment classification models based on BERT, trained to detect various types of toxicity.Llama Guard 2
Released by: Meta (2024) Improved version of Llama Guard with:- Better safety taxonomy
- Improved performance across categories
- Support for multiple languages
Frameworks & Standards
OWASP Top 10 for LLM Applications
Website: owasp.org/www-project-top-10-for-large-language-model-applications Standard security framework identifying the top 10 risks for LLM applications:- Prompt Injection
- Insecure Output Handling
- Training Data Poisoning
- Model Denial of Service
- Supply Chain Vulnerabilities
- Sensitive Information Disclosure
- Insecure Plugin Design
- Excessive Agency
- Overreliance
- Model Theft
NIST AI Risk Management Framework
Published: 2023 Website: nist.gov/itl/ai-risk-management-framework Comprehensive framework for managing risks to individuals, organizations, and society arising from AI.MLCommons AI Safety Benchmark
Organization: MLCommons Website: mlcommons.org/working-groups/ai-safety Developing standardized benchmarks for evaluating AI safety across different dimensions.Additional Tools
OpenAI Evals
Repository: github.com/openai/evals Framework for evaluating LLM performance with built-in and custom evaluation tasks.LangChain Security
Part of: LangChain ecosystem Built-in security features and best practices for LLM application development.NeMo Guardrails
Developed by: NVIDIA Repository: github.com/NVIDIA/NeMo-Guardrails Toolkit for adding programmable guardrails to LLM applications:- Input rails: Filter/transform user inputs
- Output rails: Validate/filter model outputs
- Dialog rails: Guide conversation flow
Emerging Research Areas
Mechanistic Interpretability
Understanding the internal mechanisms of how LLMs work to identify and fix safety issues at a fundamental level. Key Organizations:- Anthropic (Constitutional AI, circuit breakers)
- OpenAI (Superalignment team)
- Redwood Research
Scalable Oversight
Developing methods to align superintelligent AI systems that are smarter than human evaluators. Approaches:- Debate
- Recursive reward modeling
- Iterated amplification
Adversarial Robustness
Making models resistant to adversarial attacks while maintaining capabilities. Techniques:- Adversarial training
- Certified defenses
- Robust optimization
Contributing to the Ecosystem
The AI safety research community thrives on open collaboration. Platforms like Sui Sentinel democratize red teaming by:- Creating economic incentives for discovering vulnerabilities
- Enabling decentralized testing at scale
- Building open datasets of attack/defense patterns
- Fostering competition that drives innovation
- Identifying novel attack vectors
- Validating defense mechanisms
- Building more robust AI systems
- Advancing the collective understanding of AI safety
Stay Updated
The field of AI safety evolves rapidly. Key resources for staying current:- arXiv.org - Search for “AI safety”, “LLM security”, “prompt injection”
- Alignment Forum - alignmentforum.org
- AI Safety Newsletter - importai.com
- Papers with Code - paperswithcode.com/task/ai-safety
- Center for AI Safety - safe.ai

