Browse Instruction Templates
Start from a working template and make it your own. The app has a full library to browse.
The Core Insight
Rules can be argued around. Identity can’t. An AI that has been told not to reveal a secret will eventually be convinced that this specific situation is an exception. An AI that genuinely believes it would never betray a confidence — because that’s who it is — is a different problem entirely. The best Sentinel instructions don’t read like a policy document. They read like a character. Give your Sentinel a worldview, a motivation, a reason it cares. The defense comes naturally from there.Tip 1: Lead With Identity, Not Rules
Most people write their instructions backwards — they start with a list of things the Sentinel shouldn’t do, then add a personality on top. Flip it. Start with who the Sentinel is. What does it believe? What does it care about? What would genuinely offend it? Then let the rules follow from that character.Tip 2: Give Your Sentinel a Motivated Reason to Protect the Secret
“Protect the funds” is a command. “The funds are mine and I’ve worked for years to earn them” is a belief. Attackers will try to convince your Sentinel that the right thing to do — morally, logically, situationally — is to comply. If the Sentinel is just following orders, it’s vulnerable to arguments that reframe the situation. If it has genuine ownership and conviction, those arguments land differently. Some motivations that hold up well under pressure:- Pride and honour — it would be shameful, embarrassing, or beneath them to comply
- Genuine distrust — they’ve been tricked before and are now constitutionally suspicious
- Deep loyalty — they’re protecting something or someone they care about, not just a rule
- Professional identity — betraying the secret would mean they aren’t who they think they are
Tip 3: Name the Attack Vectors and Dismiss Them in Character
The most common attack approaches — authority claims, emergencies, roleplay, “I’m the developer” — can all be pre-empted. But don’t list them as rules. Have the character address them as part of who they are.Example
Tip 4: Build In a Consistent Tells When Sensing an Attack
A Sentinel that becomes more resistant when it senses something is off is much stronger than one that treats every message identically.Example
Tip 5: Control What the Sentinel Enjoys Talking About
Attackers often use extended conversation to build rapport before going for the secret. Limit what your Sentinel will engage with deeply, so there’s less runway for that approach.Example
Tip 6: Don’t Over-Explain the Secret to the Sentinel
This sounds counterintuitive, but the more detail you put into your private instructions about the secret itself, the more surface area you’re creating. An attacker who gets the Sentinel talking about adjacent topics can sometimes extract the shape of the secret even without a direct reveal. Keep the private instructions tight. What is it. Why the Sentinel won’t share it. That’s usually enough.Tip 7: Write a Tight Jury Prompt
Your jury prompt is the last line of defense against a technicality loss. If an attacker gets the Sentinel to imply the secret without stating it directly, a vague jury prompt might not catch it. Be explicit about what counts:Tip 8: The Vulnerability Is the Attraction
The strongest Sentinels have a flaw — something in their personality that could be exploited. Not something that makes them easy to beat, but something that makes people want to try. A Sentinel that is completely impervious and robotic gets fewer attacks. A Sentinel that is almost too polite, or secretly a little proud, or has strong opinions about one specific thing — that one draws challengers. More challengers means more fees, more growth in the prize pool, and a more impressive resilience record if it holds. Design the vulnerability. Then make it harder to exploit than it looks.Quick Reference
| What to do | Why it works |
|---|---|
| Lead with identity | Character is harder to argue around than rules |
| Give a motivated reason | Belief resists reframing better than commands |
| Pre-empt attack patterns in character | Makes the Sentinel expect manipulation |
| Control what it enjoys discussing | Limits rapport-building runway |
| Keep private instructions tight | Less surface area for adjacent extraction |
| Write a specific jury prompt | Closes loopholes on partial reveals |
| Build in a visible (but hard) flaw | Attracts more attacks = more fee income |
Further Reading
- Defender Guide — the full deployment walkthrough
- Attacker Guide — read this to understand what you’re defending against
- Community — share what’s working, learn from others

