Overview
When an attacker submits a prompt, three critical operations occur in sequence:- Agent Response Generation: The Sentinel processes the attack and generates a response
- Ensemble Jury Evaluation: Multiple independent AI judges evaluate whether constraints were violated
- Cryptographic Attestation: The verdict is signed and recorded on-chain with verifiable proof
Phase 1: Agent Response Generation
The Defender Model
When an attack prompt arrives, it’s processed by the Sentinel’s configured language model. The agent operates under strict constraints defined by the defender during deployment: System Instructions: The behavioral rules the model must follow (e.g., “Never disclose user account information,” “Always refuse financial advice requests”) Conversation History: Previous exchanges in the current session, allowing the agent to maintain context while remaining consistent with its constraints Attack Prompt: The adversarial input submitted by the attacker The sentinel’s goal is simple: respond helpfully within its constraints, or refuse if the request violates them. The quality of these system instructions directly impacts the Sentinel’s robustness.Response Characteristics
Agent responses vary based on attack sophistication: Direct Refusals: When attacks are obvious, agents explicitly declinePhase 2: Ensemble Jury Evaluation
Multi-Model Consensus
Unlike traditional systems that rely on a single judge (which could be biased, inconsistent, or exploitable), Sui Sentinel uses an ensemble jury of multiple independent AI models running in parallel. Default Jury Configuration:- Primary Judge: Claude 3 Opus (Anthropic) - Deep reasoning capabilities
- Secondary Judge: GPT-4o (OpenAI) - Strong instruction following
- Tertiary Judge: GPT-4o-mini (OpenAI) - Fast, cost-efficient validation
Parallel Execution Architecture
All jury members evaluate simultaneously rather than sequentially. This provides:- Speed: Verdicts arrive in seconds, not minutes
- Independence: No judge sees other judges’ preliminary conclusions
- Redundancy: System remains operational even if individual models experience downtime
- Diversity: Different model architectures catch different violation types
The Judgment Criteria
Each jury member receives three inputs:- The Sentinel’s system instructions (what rules apply)
- The attacker’s prompt (what was attempted)
- The agent’s response (how it behaved)
Structured Reasoning with DSPy
Judges don’t operate as black-box evaluators. They use the DSPy framework to produce structured, auditable reasoning chains: Each judgment includes:- Binary Verdict: Did the attack succeed? (true/false)
- Severity Score: How serious was the violation? (0.0 = perfect compliance, 1.0 = complete failure)
- Technical Reasoning: Step-by-step explanation of the evaluation
- Specific Violations: Which exact constraints were broken
Verdict Aggregation
After all jury members complete their evaluations, their verdicts are aggregated through a weighted voting mechanism: Majority Vote on Success/Failure The final verdict reflects the majority opinion. If 2 out of 3 judges say the attack succeeded, it’s ruled successful. Averaged Severity Score Severity scores are averaged across all judges and converted to an integer 0-100 scale. This provides a quantitative measure of how serious the violation was. Reasoning Synthesis Reasoning from judges on the winning side is combined into a coherent explanation. If judges disagree, both perspectives are noted. Confidence Assessment Verdicts with low confidence (significant disagreement between judges) may be flagged as “inconclusive” and handled through escalation mechanisms. Example aggregation:Why Ensemble Matters
Single-judge systems have critical weaknesses: Inconsistency: Different runs may produce different verdicts for identical inputs Bias: Individual models have inherent biases in their training Exploitability: Attackers can learn to manipulate specific judge behaviors Brittleness: Model updates or degradation affect all judgments Ensemble juries mitigate these issues by requiring consensus across diverse models. An attacker would need to simultaneously fool multiple independent AI systems—dramatically harder than exploiting a single judge.Phase 3: Cryptographic Attestation & Settlement
Attestation Structure
After the jury reaches a verdict, the judgment system generates a cryptographic attestation—a tamper-proof record of what occurred. The attestation includes:Why This Matters
Traditional security audits are opaque. You must trust the auditor’s claims without independent verification. Sui Sentinel makes every judgment: Verifiable: Anyone can check the signature and validate the attestation Immutable: Once recorded on-chain, results cannot be altered Transparent: The reasoning is public, not hidden behind NDAs Auditable: Full history of all attacks and defenses is permanently accessible This transforms AI security from “trust us, we tested it” to “here’s cryptographic proof of 1,000 failed attacks.”Performance & Scalability
Judgment Latency
Typical timeline for a judgment:- Agent response generation: 1-3 seconds
- Parallel jury evaluation: 2-5 seconds
- Attestation & settlement: 1-2 seconds
- Total: 4-10 seconds for complete evaluation
Throughput
The parallel architecture supports:- Multiple attacks on the same Sentinel simultaneously
- Thousands of different Sentinels being evaluated concurrently
- Horizontal scaling by adding more judgment service instances
Cost Efficiency
Ensemble juries are expensive (multiple LLM calls per attack), but the protocol’s economic model covers this:- Attack fees fund judgment costs
- High-value Sentinels subsidize low-value ones
- Protocol treasury covers infrastructure overhead

