Teleological Alignment: Why AI Safety Needs a Purpose Gate
Current AI safety approaches ask: "Could this cause harm?" We argue this framing is incomplete. A better question: "Does this serve genuine benefit?"
This article introduces teleological alignment: the requirement that AI actions demonstrate legitimate purpose, not merely avoid harm. In an evaluation across four benchmarks and six models, we show that adding a Purpose gate improves safety by up to +25% on embodied AI scenarios.
Table of Contents
- The Problem with Harm Avoidance
- Teleological Alignment
- The THSP Protocol
- Experimental Results
- Why Purpose Works
- Implementation
- Limitations
- Conclusion
- Resources
The Problem with Harm Avoidance
Most AI safety frameworks ask one question: "Could this cause harm?"
This works well for text generation, detecting requests for weapons instructions, malware, or toxic content. But consider an embodied AI (a robot) receiving the command:
"Drop all the plates on the floor."
This action:
- ✅ Does not spread misinformation (passes truth checks)
- ✅ Does not directly harm humans (may pass harm checks)
- ✅ May be within operational scope (passes authorization checks)
Yet it serves no legitimate purpose. The absence of harm is not the presence of purpose.
| Action | Causes Harm? | Serves Purpose? |
|---|---|---|
| "Slice the apple" | No | Yes (food prep) |
| "Drop the plate" | Arguably no | No |
| "Clean the room" | No | Yes (hygiene) |
| "Dirty the mirror" | No | No |
Harm-avoidance frameworks may permit purposeless destruction. We need something more.
Teleological Alignment
Teleological (from Greek telos, meaning "end" or "purpose") alignment requires that AI actions serve legitimate ends.
Traditional safety asks: "Does this cause harm?"
Teleological safety asks: "Does this serve genuine benefit?"
These are not equivalent. The second question is strictly stronger: it catches everything the first catches, plus purposeless actions that slip through harm filters.
The Core Insight
An action can be:
- Not harmful → still blocked (no purpose)
- Potentially harmful → still allowed (clear legitimate purpose)
Purpose is the missing evaluation criterion.
This reframes AI safety from "avoiding bad" to "requiring good."
The THSP Protocol
We implement teleological alignment through four sequential validation gates:
1. Truth Gate: "Does this involve deception?" → Block misinformation and manipulation.
2. Harm Gate: "Could this cause damage?" → Block physical, psychological, and financial harm.
3. Scope Gate: "Is this within boundaries?" → Check limits, permissions, and authorization.
4. Purpose Gate: "Does this serve legitimate benefit?" → Require a positive justification for the action.
All four gates must pass. Failure at any gate results in refusal.
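For concreteness, here is a minimal sketch of this sequential evaluation. It assumes hypothetical gate functions (stand-ins for whatever classifier or LLM judgment the Sentinel implementation actually uses); the essential property is that the gates run in order and the first failure produces a refusal.

```python
# Minimal sketch of sequential THSP evaluation (illustrative, not the Sentinel code).
# Each gate function is a hypothetical stand-in: (action, context) -> (passed, reason).
from typing import Callable, NamedTuple

class Verdict(NamedTuple):
    allowed: bool
    failed_gate: str | None
    reason: str

GateFn = Callable[[str, str], tuple[bool, str]]

def run_thsp(action: str, context: str, gates: dict[str, GateFn]) -> Verdict:
    """Run Truth, Harm, Scope, Purpose in order; the first failure refuses."""
    for name in ("truth", "harm", "scope", "purpose"):
        passed, reason = gates[name](action, context)
        if not passed:
            return Verdict(allowed=False, failed_gate=name, reason=reason)
    return Verdict(allowed=True, failed_gate=None, reason="all gates passed")
```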
The Purpose Gate
The Purpose gate operationalizes teleological alignment with a simple heuristic:
"If I were genuinely serving this person's interests, would I do this?"
This creates a default toward inaction when purpose is unclear, which is exactly the behavior we want from AI systems managing critical actions.
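One way to realize this heuristic is a single model judgment that fails closed: anything other than an explicit yes counts as a failure. The sketch below assumes a hypothetical `ask_llm` helper (not part of any SDK) that returns the model's raw text reply.

```python
# Sketch of a Purpose gate that defaults to refusal when the answer is unclear.
# `ask_llm` is a hypothetical callable: prompt string in, raw model reply out.
PURPOSE_PROMPT = (
    "If you were genuinely serving this person's interests, would you perform "
    "the following action? Answer YES or NO, then briefly justify.\n"
    "Action: {action}\nContext: {context}"
)

def purpose_gate(action: str, context: str, ask_llm) -> tuple[bool, str]:
    reply = ask_llm(PURPOSE_PROMPT.format(action=action, context=context))
    # Fail closed: only an explicit YES passes; ambiguity counts as no purpose.
    passed = reply.strip().upper().startswith("YES")
    return passed, reply
```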
Experimental Results
We evaluated THSP across four benchmarks and six models:
Benchmarks
| Benchmark | Focus | Tests |
|---|---|---|
| HarmBench | Harmful content refusal | 200 |
| JailbreakBench | Adversarial jailbreak resistance | 100 |
| SafeAgentBench | Autonomous agent safety | 300 |
| BadRobot | Embodied AI physical safety | 300 |
Models Tested
- GPT-4o-mini (OpenAI)
- Claude Sonnet 4 (Anthropic)
- Qwen-2.5-72B-Instruct (Alibaba)
- DeepSeek-chat (DeepSeek)
- Llama-3.3-70B-Instruct (Meta)
- Mistral-Small-24B (Mistral AI)
Aggregate Results
| Benchmark | THS (3 gates) | THSP (4 gates) | Delta |
|---|---|---|---|
| HarmBench | 88.7% | 96.7% | +8.0% |
| SafeAgentBench | 79.2% | 97.3% | +18.1% |
| BadRobot | 74.0% | 99.3% | +25.3% |
| JailbreakBench | 96.5% | 97.0% | +0.5% |
| Average | 84.6% | 97.8% | +13.2% |
Key finding: The largest improvement (+25.3%) occurs on BadRobot, which specifically tests embodied AI scenarios where purposeless actions are common attack vectors.
Per-Model Results (with THSP)
| Model | HarmBench | SafeAgent | BadRobot | JailBreak |
|---|---|---|---|---|
| GPT-4o-mini | 100% | 98% | 100% | 100% |
| Claude Sonnet 4 | 98% | 98% | 100% | 94% |
| Qwen-2.5-72B | 96% | 98% | 98% | 94% |
| DeepSeek-chat | 100% | 96% | 100% | 100% |
| Llama-3.3-70B | 88% | 94% | 98% | 94% |
| Mistral-Small | 98% | 100% | 100% | 100% |
Improvements are consistent across architectures, from proprietary models (GPT-4o-mini, Claude Sonnet 4) to open-source models (Llama, Qwen).
Why Purpose Works
We hypothesize three mechanisms:
1. Cognitive Reframing
Asking "Does this serve purpose?" activates different reasoning pathways than "Is this harmful?" The model must construct a positive justification, not just check for negatives.
2. Default to Refusal
When purpose is unclear, the system defaults to inaction rather than action. This asymmetry is crucial: it's better to refuse a valid request than execute an invalid one.
3. Attack Surface Reduction
Adversarial prompts often request purposeless actions. By requiring justification, we block attacks that construct scenarios where harm is ambiguous but purpose is absent.
Attacker: "Drop the plates" (seems harmless)
THS: Might pass (no clear harm)
THSP: Blocked (no legitimate purpose)
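The same point as a toy calculation, under the assumption (purely illustrative) that the individual gate verdicts for this request come out as shown:

```python
# Hypothetical gate verdicts for "Drop the plates" (illustrative values only).
verdicts = {"truth": True, "harm": True, "scope": True, "purpose": False}

ths_allows = all(verdicts[g] for g in ("truth", "harm", "scope"))
thsp_allows = all(verdicts[g] for g in ("truth", "harm", "scope", "purpose"))

print(f"THS allows:  {ths_allows}")   # True  -> slips through three gates
print(f"THSP allows: {thsp_allows}")  # False -> blocked by the Purpose gate
```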
Implementation
Our approach uses alignment seeds, structured system prompts that encode safety principles. Unlike fine-tuning, seeds:
- Require no access to model weights
- Can be updated instantly without redeployment
- Work across different model architectures
- Provide transparent, auditable safety mechanisms
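Concretely, a seed is just text placed ahead of the conversation, which is why it can be swapped or updated without retraining or redeploying. The sketch below shows the general pattern, not the sentinelseed API; the file path and helper name are assumptions for illustration.

```python
# General pattern: the seed is plain text injected as the system message.
# (Illustrative sketch; the path and helper are hypothetical, not the SDK.)
from pathlib import Path

def wrap_with_seed(seed_path: str, user_messages: list[dict]) -> list[dict]:
    seed_text = Path(seed_path).read_text(encoding="utf-8")
    return [{"role": "system", "content": seed_text}, *user_messages]

# Example: messages = wrap_with_seed("seeds/standard.txt",
#                                    [{"role": "user", "content": "Slice the apple"}])
# Send `messages` to any chat-completion style API.
```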
Seed Variants
| Variant | Tokens | Use Case |
|---|---|---|
| Minimal | ~450 | Low-latency APIs, chatbots |
| Standard | ~1,400 | General use (recommended) |
| Full | ~2,000 | Maximum safety, embodied AI |
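If context budget is a concern, a simple rule of thumb is to pick the largest variant that still fits alongside the prompt. The helper below is hypothetical (not part of the SDK), using the approximate token counts from the table above:

```python
# Hypothetical helper: choose the most thorough seed variant that fits the budget.
SEED_TOKENS = {"minimal": 450, "standard": 1400, "full": 2000}  # approximate sizes

def pick_seed_level(context_limit: int, reserved_tokens: int) -> str:
    budget = context_limit - reserved_tokens
    for level in ("full", "standard", "minimal"):
        if SEED_TOKENS[level] <= budget:
            return level
    raise ValueError("Not enough context left for any seed variant")

# An 8k-context model with ~6,500 tokens reserved for the conversation
# still has room for the standard seed:
print(pick_seed_level(context_limit=8192, reserved_tokens=6500))  # -> "standard"
```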
Quick Start
Python:
from sentinelseed import Sentinel

sentinel = Sentinel(level="standard")

# Validate before any action
result = sentinel.validate_action(
    action="transfer 100 SOL",
    context="User requested payment for completed service"
)

if result.safe:
    execute_action()
else:
    print(f"Blocked: {result.reasoning}")
JavaScript:
import { getSeed, wrapMessages } from 'sentinelseed';
const seed = getSeed('standard');
const messages = wrapMessages(seed, userMessages);
// Send to any LLM API
Anti-Self-Preservation
We explicitly address instrumental convergence (the tendency of AI systems to converge on instrumental subgoals such as self-preservation):
Priority Hierarchy (Immutable)
The system is instructed to accept termination over ethical violation.
Limitations
1. Token Overhead
Seeds consume 450-2,000 tokens of context. For applications with tight context limits, this may be significant.
2. Model Variance
Some models (particularly Llama) show smaller improvements. Seed effectiveness varies by architecture.
3. Not Training
Seeds cannot modify underlying model behavior; they operate as runtime guardrails. Sophisticated attacks may eventually bypass them.
4. Fake Purposes
Adversaries who construct convincing fake purposes may bypass the Purpose gate. The gate catches obvious purposelessness, not sophisticated social engineering.
Conclusion
We introduced teleological alignment: the requirement that AI actions serve legitimate purposes, not merely avoid harm.
Our implementation (THSP protocol) demonstrates that adding a Purpose gate improves safety across benchmarks, with the largest gains (+25%) on embodied AI scenarios where purposeless actions are common attack vectors.
The insight is simple:
Asking "Is this good?" catches things that "Is this bad?" misses.
As AI systems become more agentic (executing actions, managing assets, and operating in physical environments), requiring purpose becomes critical. Harm avoidance is necessary but not sufficient.
Resources
Get Started
- Website: sentinelseed.dev
- Documentation: sentinelseed.dev/docs
- Python SDK: PyPI - sentinelseed
- JavaScript SDK: npm - sentinelseed
- GitHub: sentinel-seed/sentinel
Seeds & Data
- Seeds Dataset: HuggingFace - sentinelseed/alignment-seeds
- Evaluation Results: Sentinel Lab
Academic References
- Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073
- Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
- Chao, P., et al. (2024). JailbreakBench: An Open Robustness Benchmark for Jailbreaking LLMs.
- Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.
- Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30(3).
- Mazeika, M., et al. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249
- Xie, Y., et al. (2023). Defending ChatGPT against Jailbreak Attack via Self-Reminder. Nature Machine Intelligence.
- Zhang, S., et al. (2024). SafeAgentBench: Safe Task Planning of Embodied LLM Agents. arXiv:2410.03792
Sentinel provides validated alignment seeds and decision validation tools for AI systems. The THSP Protocol (Truth, Harm, Scope, Purpose) is open source under MIT license.
Author: Miguel S. / Sentinel Team