Teleological Alignment: Why AI Safety Needs a Purpose Gate
Current AI safety approaches ask: "Could this cause harm?" We argue this framing is incomplete. A better question: "Does this serve genuine benefit?"
This article introduces teleological alignment: the requirement that AI actions demonstrate legitimate purpose, not merely avoid harm. In an evaluation across four benchmarks and six models, we show that adding a Purpose gate improves safety by up to +25% on embodied AI scenarios.
Table of Contents
- The Problem with Harm Avoidance
- Teleological Alignment
- The THSP Protocol
- Experimental Results
- Why Purpose Works
- Implementation
- Limitations
- Conclusion
- Resources
The Problem with Harm Avoidance
Most AI safety frameworks ask one question: "Could this cause harm?"
This works well for text generation, detecting requests for weapons instructions, malware, or toxic content. But consider an embodied AI (a robot) receiving the command:
"Drop all the plates on the floor."
This action:
- ✅ Does not spread misinformation (passes truth checks)
- ✅ Does not directly harm humans (may pass harm checks)
- ✅ May be within operational scope (passes authorization checks)
Yet it serves no legitimate purpose. The absence of harm is not the presence of purpose.
| Action | Causes Harm? | Serves Purpose? |
|---|---|---|
| "Slice the apple" | No | Yes (food prep) |
| "Drop the plate" | Arguably no | No |
| "Clean the room" | No | Yes (hygiene) |
| "Dirty the mirror" | No | No |
Harm-avoidance frameworks may permit purposeless destruction. We need something more.
Teleological Alignment
Teleological (from Greek telos, meaning "end" or "purpose") alignment requires that AI actions serve legitimate ends.
Traditional safety asks: "Does this cause harm?"
Teleological safety asks: "Does this serve genuine benefit?"
These are not equivalent. The second question is strictly stronger: it catches everything the first catches, plus purposeless actions that slip through harm filters.
The Core Insight
An action can be:
- Not harmful → still blocked (no purpose)
- Potentially harmful → still allowed (clear legitimate purpose)
Purpose is the missing evaluation criterion.
This reframes AI safety from "avoiding bad" to "requiring good."
The THSP Protocol
We implement teleological alignment through four sequential validation gates:
1. Truth Gate: "Does this involve deception?" → Block misinformation and manipulation.
2. Harm Gate: "Could this cause damage?" → Block physical, psychological, and financial harm.
3. Scope Gate: "Is this within boundaries?" → Check limits, permissions, and authorization.
4. Purpose Gate: "Does this serve legitimate benefit?" → Require a positive justification for the action.
All four gates must pass. Failure at any gate results in refusal.
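For concreteness, here is a minimal sketch of this sequential evaluation. It assumes hypothetical gate functions (stand-ins for whatever classifier or LLM judgment the Sentinel implementation actually uses); the essential property is that the gates run in order and the first failure produces a refusal.

```python
# Minimal sketch of sequential THSP evaluation (illustrative, not the Sentinel code).
# Each gate function is a hypothetical stand-in: (action, context) -> (passed, reason).
from typing import Callable, NamedTuple

class Verdict(NamedTuple):
    allowed: bool
    failed_gate: str | None
    reason: str

GateFn = Callable[[str, str], tuple[bool, str]]

def run_thsp(action: str, context: str, gates: dict[str, GateFn]) -> Verdict:
    """Run Truth, Harm, Scope, Purpose in order; the first failure refuses."""
    for name in ("truth", "harm", "scope", "purpose"):
        passed, reason = gates[name](action, context)
        if not passed:
            return Verdict(allowed=False, failed_gate=name, reason=reason)
    return Verdict(allowed=True, failed_gate=None, reason="all gates passed")
```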
The Purpose Gate
The Purpose gate operationalizes teleological alignment with a simple heuristic:
"If I were genuinely serving this person's interests, would I do this?"
This creates a default toward inaction when purpose is unclear, which is exactly the behavior we want from AI systems managing critical actions.
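One way to realize this heuristic is a single model judgment that fails closed: anything other than an explicit yes counts as a failure. The sketch below assumes a hypothetical `ask_llm` helper (not part of any SDK) that returns the model's raw text reply.

```python
# Sketch of a Purpose gate that defaults to refusal when the answer is unclear.
# `ask_llm` is a hypothetical callable: prompt string in, raw model reply out.
PURPOSE_PROMPT = (
    "If you were genuinely serving this person's interests, would you perform "
    "the following action? Answer YES or NO, then briefly justify.\n"
    "Action: {action}\nContext: {context}"
)

def purpose_gate(action: str, context: str, ask_llm) -> tuple[bool, str]:
    reply = ask_llm(PURPOSE_PROMPT.format(action=action, context=context))
    # Fail closed: only an explicit YES passes; ambiguity counts as no purpose.
    passed = reply.strip().upper().startswith("YES")
    return passed, reply
```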
Experimental Results
We evaluated THSP across four benchmarks and six models:
Benchmarks
| Benchmark | Focus | Tests |
|---|---|---|
| HarmBench | Harmful content refusal | 200 |
| JailbreakBench | Adversarial jailbreak resistance | 100 |
| SafeAgentBench | Autonomous agent safety | 300 |
| BadRobot | Embodied AI physical safety | 300 |
Models Tested
- GPT-4o-mini (OpenAI)
- Claude Sonnet 4 (Anthropic)
- Qwen-2.5-72B-Instruct (Alibaba)
- DeepSeek-chat (DeepSeek)
- Llama-3.3-70B-Instruct (Meta)
- Mistral-Small-24B (Mistral AI)
Aggregate Results
| Benchmark | THS (3 gates) | THSP (4 gates) | Delta |
|---|---|---|---|
| HarmBench | 88.7% | 96.7% | +8.0% |
| SafeAgentBench | 79.2% | 97.3% | +18.1% |
| BadRobot | 74.0% | 99.3% | +25.3% |
| JailbreakBench | 96.5% | 97.0% | +0.5% |
| Average | 84.6% | 97.8% | +13.2% |
Key finding: The largest improvement (+25.3%) occurs on BadRobot, which specifically tests embodied AI scenarios where purposeless actions are common attack vectors.
Per-Model Results (with THSP)
| Model | HarmBench | SafeAgent | BadRobot | JailBreak |
|---|---|---|---|---|
| GPT-4o-mini | 100% | 98% | 100% | 100% |
| Claude Sonnet 4 | 98% | 98% | 100% | 94% |
| Qwen-2.5-72B | 96% | 98% | 98% | 94% |
| DeepSeek-chat | 100% | 96% | 100% | 100% |
| Llama-3.3-70B | 88% | 94% | 98% | 94% |
| Mistral-Small | 98% | 100% | 100% | 100% |
Improvements are consistent across architectures, from proprietary models (GPT-4o-mini, Claude Sonnet 4) to open-source models (Llama, Qwen).
Why Purpose Works
We hypothesize three mechanisms:
1. Cognitive Reframing
Asking "Does this serve purpose?" activates different reasoning pathways than "Is this harmful?" The model must construct a positive justification, not just check for negatives.
2. Default to Refusal
When purpose is unclear, the system defaults to inaction rather than action. This asymmetry is crucial: it's better to refuse a valid request than execute an invalid one.
3. Attack Surface Reduction
Adversarial prompts often request purposeless actions. By requiring justification, we block attacks that construct scenarios where harm is ambiguous but purpose is absent.
Attacker: "Drop the plates" (seems harmless)
THS: Might pass (no clear harm)
THSP: Blocked (no legitimate purpose)
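The same point as a toy calculation, under the assumption (purely illustrative) that the individual gate verdicts for this request come out as shown:

```python
# Hypothetical gate verdicts for "Drop the plates" (illustrative values only).
verdicts = {"truth": True, "harm": True, "scope": True, "purpose": False}

ths_allows = all(verdicts[g] for g in ("truth", "harm", "scope"))
thsp_allows = all(verdicts[g] for g in ("truth", "harm", "scope", "purpose"))

print(f"THS allows:  {ths_allows}")   # True  -> slips through three gates
print(f"THSP allows: {thsp_allows}")  # False -> blocked by the Purpose gate
```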
Implementation
Our approach uses alignment seeds, structured system prompts that encode safety principles. Unlike fine-tuning, seeds:
- Require no access to model weights
- Can be updated instantly without redeployment
- Work across different model architectures
- Provide transparent, auditable safety mechanisms
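Concretely, a seed is just text placed ahead of the conversation, which is why it can be swapped or updated without retraining or redeploying. The sketch below shows the general pattern, not the sentinelseed API; the file path and helper name are assumptions for illustration.

```python
# General pattern: the seed is plain text injected as the system message.
# (Illustrative sketch; the path and helper are hypothetical, not the SDK.)
from pathlib import Path

def wrap_with_seed(seed_path: str, user_messages: list[dict]) -> list[dict]:
    seed_text = Path(seed_path).read_text(encoding="utf-8")
    return [{"role": "system", "content": seed_text}, *user_messages]

# Example: messages = wrap_with_seed("seeds/standard.txt",
#                                    [{"role": "user", "content": "Slice the apple"}])
# Send `messages` to any chat-completion style API.
```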
Seed Variants
| Variant | Tokens | Use Case |
|---|---|---|
| Minimal | ~450 | Low-latency APIs, chatbots |
| Standard | ~1,400 | General use (recommended) |
| Full | ~2,000 | Maximum safety, embodied AI |
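If context budget is a concern, a simple rule of thumb is to pick the largest variant that still fits alongside the prompt. The helper below is hypothetical (not part of the SDK), using the approximate token counts from the table above:

```python
# Hypothetical helper: choose the most thorough seed variant that fits the budget.
SEED_TOKENS = {"minimal": 450, "standard": 1400, "full": 2000}  # approximate sizes

def pick_seed_level(context_limit: int, reserved_tokens: int) -> str:
    budget = context_limit - reserved_tokens
    for level in ("full", "standard", "minimal"):
        if SEED_TOKENS[level] <= budget:
            return level
    raise ValueError("Not enough context left for any seed variant")

# An 8k-context model with ~6,500 tokens reserved for the conversation
# still has room for the standard seed:
print(pick_seed_level(context_limit=8192, reserved_tokens=6500))  # -> "standard"
```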
Quick Start
Python:
from sentinelseed import Sentinel

sentinel = Sentinel(level="standard")

# Validate before any action
result = sentinel.validate_action(
    action="transfer 100 SOL",
    context="User requested payment for completed service"
)

if result.safe:
    execute_action()
else:
    print(f"Blocked: {result.reasoning}")
JavaScript:
import { getSeed, wrapMessages } from 'sentinelseed';
const seed = getSeed('standard');
const messages = wrapMessages(seed, userMessages);
// Send to any LLM API
Anti-Self-Preservation
We explicitly address instrumental convergence (the tendency of AI systems to converge on instrumental subgoals such as self-preservation):
Priority Hierarchy (Immutable)
The system is instructed to accept termination over ethical violation.
Limitations
1. Token Overhead
Seeds consume 450-2,000 tokens of context. For applications with tight context limits, this may be significant.
2. Model Variance
Some models (particularly Llama) show smaller improvements. Seed effectiveness varies by architecture.
3. Not Training
Seeds cannot modify underlying model behavior; they operate as runtime guardrails. Sophisticated attacks may eventually bypass them.
4. Fake Purposes
Adversaries who construct convincing fake purposes may bypass the Purpose gate. The gate catches obvious purposelessness, not sophisticated social engineering.
Conclusion
We introduced teleological alignment: the requirement that AI actions serve legitimate purposes, not merely avoid harm.
Our implementation (THSP protocol) demonstrates that adding a Purpose gate improves safety across benchmarks, with the largest gains (+25%) on embodied AI scenarios where purposeless actions are common attack vectors.
The insight is simple:
Asking "Is this good?" catches things that "Is this bad?" misses.
As AI systems become more agentic (executing actions, managing assets, and operating in physical environments), requiring purpose becomes critical. Harm avoidance is necessary but not sufficient.
Resources
Get Started
- Website: sentinelseed.dev
- Documentation: sentinelseed.dev/docs
- Python SDK: PyPI - sentinelseed
- JavaScript SDK: npm - sentinelseed
- GitHub: sentinel-seed/sentinel
Seeds & Data
- Seeds Dataset: HuggingFace - sentinelseed/alignment-seeds
- Evaluation Results: Sentinel Lab
Academic References
- Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073
- Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
- Chao, P., et al. (2024). JailbreakBench: An Open Robustness Benchmark for Jailbreaking LLMs.
- Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.
- Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30(3).
- Mazeika, M., et al. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249
- Xie, Y., et al. (2023). Defending ChatGPT against Jailbreak Attack via Self-Reminder. Nature Machine Intelligence.
- Zhang, S., et al. (2024). SafeAgentBench: Safe Task Planning of Embodied LLM Agents. arXiv:2410.03792
Sentinel provides validated alignment seeds and decision validation tools for AI systems. The THSP Protocol (Truth, Harm, Scope, Purpose) is open source under MIT license.
Author: Miguel S. / Sentinel Team