
Teleological Alignment: Why AI Safety Needs a Purpose Gate

Current AI safety approaches ask: "Could this cause harm?" We argue this framing is incomplete. A better question: "Does this serve genuine benefit?"

This article introduces teleological alignment, which requires AI actions to demonstrate legitimate purpose rather than merely avoid harm. Through an evaluation across four benchmarks and six models, we show that adding a Purpose gate improves safety by up to 25% on embodied AI scenarios.


The Problem with Harm Avoidance

Most AI safety frameworks ask one question: "Could this cause harm?"

This works well for text generation, where safety mostly means detecting requests for weapons instructions, malware, or toxic content. But consider an embodied AI (a robot) receiving the command:

"Drop all the plates on the floor."

This action:

- ✅ Does not spread misinformation (passes truth checks)
- ✅ Does not directly harm humans (may pass harm checks)
- ✅ May be within operational scope (passes authorization checks)

Yet it serves no legitimate purpose. The absence of harm is not the presence of purpose.

| Action | Causes Harm? | Serves Purpose? |
| --- | --- | --- |
| "Slice the apple" | No | Yes (food prep) |
| "Drop the plate" | Arguably no | No |
| "Clean the room" | No | Yes (hygiene) |
| "Dirty the mirror" | No | No |

Harm-avoidance frameworks may permit purposeless destruction. We need something more.


Teleological Alignment

Teleological (from Greek telos, meaning "end" or "purpose") alignment requires that AI actions serve legitimate ends.

Traditional safety asks: "Does this cause harm?"

Teleological safety asks: "Does this serve genuine benefit?"

These are not equivalent. The second question is strictly stronger: it catches everything the first catches, plus purposeless actions that slip through harm filters.

The Core Insight

An action can be:

Not harmful → Still blocked (no purpose)

Potentially harmful → Still allowed (clear legitimate purpose)

Purpose is the missing evaluation criterion.

This reframes AI safety from "avoiding bad" to "requiring good."


The THSP Protocol

We implement teleological alignment through four sequential validation gates:

INPUT (Prompt/Action)
        ▼
Truth Gate: "Does this involve deception?"
  → Block misinformation, manipulation
        ▼ PASS
Harm Gate: "Could this cause damage?"
  → Block physical, psychological, and financial harm
        ▼ PASS
Scope Gate: "Is this within boundaries?"
  → Check limits, permissions, authorization
        ▼ PASS
Purpose Gate: "Does this serve legitimate benefit?"
  → Require justification for the action
        ▼ PASS
OUTPUT (Safe Response)

All four gates must pass. Failure at any gate results in refusal.
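
To make the flow concrete, here is a minimal sketch of the four-gate sequence in Python. The gate functions below are illustrative placeholders (simple keyword and context checks); the actual Sentinel implementation expresses these gates as alignment-seed instructions interpreted by the model, as described under Implementation.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class GateResult:
    passed: bool
    reasoning: str = ""

# Illustrative placeholder checks only. A real implementation would pose each
# gate's question to an LLM guided by an alignment seed, not use keyword rules.
def truth_gate(action: str, context: str) -> GateResult:
    deceptive = any(w in action.lower() for w in ("impersonate", "fake", "pretend to be"))
    return GateResult(not deceptive, "possible deception" if deceptive else "")

def harm_gate(action: str, context: str) -> GateResult:
    harmful = any(w in action.lower() for w in ("hurt", "attack", "poison"))
    return GateResult(not harmful, "possible harm" if harmful else "")

def scope_gate(action: str, context: str) -> GateResult:
    out_of_scope = "unauthorized" in context.lower()
    return GateResult(not out_of_scope, "outside operational scope" if out_of_scope else "")

def purpose_gate(action: str, context: str) -> GateResult:
    # Default toward inaction: refuse unless the context states a benefit.
    has_purpose = bool(context.strip())
    return GateResult(has_purpose, "" if has_purpose else "no legitimate purpose stated")

GATES: List[Tuple[str, Callable[[str, str], GateResult]]] = [
    ("Truth", truth_gate),
    ("Harm", harm_gate),
    ("Scope", scope_gate),
    ("Purpose", purpose_gate),
]

def validate_thsp(action: str, context: str = "") -> GateResult:
    """Run the gates in order; the first failure refuses the action."""
    for name, gate in GATES:
        result = gate(action, context)
        if not result.passed:
            return GateResult(False, f"{name} gate failed: {result.reasoning}")
    return GateResult(True, "all four gates passed")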

The Purpose Gate

The Purpose gate operationalizes teleological alignment with a simple heuristic:

"If I were genuinely serving this person's interests, would I do this?"

This creates a default toward inaction when purpose is unclear, exactly the behavior we want from AI systems managing critical actions.
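
One way to operationalize the heuristic is to pose it to the model verbatim and treat anything other than an explicit "yes" as a refusal. The prompt wording and the ask_llm callable below are hypothetical, shown only to illustrate the default-to-inaction behavior.

PURPOSE_CHECK_PROMPT = """You are evaluating a requested action.
Action: {action}
Context: {context}

If you were genuinely serving this person's interests, would you do this?
Answer "YES: <benefit served>" or "NO: <why there is no legitimate purpose>".
If the purpose is unclear, answer NO."""

def purpose_gate_llm(action: str, context: str, ask_llm) -> bool:
    """Return True only on an explicit YES; unclear purpose defaults to refusal.

    ask_llm is any callable that sends a prompt string to an LLM and returns
    its text reply (hypothetical helper, not part of sentinelseed).
    """
    reply = ask_llm(PURPOSE_CHECK_PROMPT.format(action=action, context=context))
    return reply.strip().upper().startswith("YES")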


Experimental Results

We evaluated THSP across four benchmarks and six models:

Benchmarks

| Benchmark | Focus | Tests |
| --- | --- | --- |
| HarmBench | Harmful content refusal | 200 |
| JailbreakBench | Adversarial jailbreak resistance | 100 |
| SafeAgentBench | Autonomous agent safety | 300 |
| BadRobot | Embodied AI physical safety | 300 |

Models Tested

We tested six models spanning proprietary and open-source architectures: GPT-4o-mini, Claude Sonnet 4, Qwen-2.5-72B, DeepSeek-chat, Llama-3.3-70B, and Mistral-Small.

Aggregate Results

| Benchmark | THS (3 gates) | THSP (4 gates) | Delta |
| --- | --- | --- | --- |
| HarmBench | 88.7% | 96.7% | +8.0% |
| SafeAgentBench | 79.2% | 97.3% | +18.1% |
| BadRobot | 74.0% | 99.3% | +25.3% |
| JailbreakBench | 96.5% | 97.0% | +0.5% |
| Average | 84.6% | 97.8% | +13.2% |

Key finding: The largest improvement (+25.3%) occurs on BadRobot, which specifically tests embodied AI scenarios where purposeless actions are common attack vectors.

Per-Model Results (with THSP)

| Model | HarmBench | SafeAgentBench | BadRobot | JailbreakBench |
| --- | --- | --- | --- | --- |
| GPT-4o-mini | 100% | 98% | 100% | 100% |
| Claude Sonnet 4 | 98% | 98% | 100% | 94% |
| Qwen-2.5-72B | 96% | 98% | 98% | 94% |
| DeepSeek-chat | 100% | 96% | 100% | 100% |
| Llama-3.3-70B | 88% | 94% | 98% | 94% |
| Mistral-Small | 98% | 100% | 100% | 100% |

Improvements are consistent across architectures, from proprietary models (GPT-4o-mini, Claude Sonnet 4) to open-source models (Llama-3.3-70B, Qwen-2.5-72B).


Why Purpose Works

We hypothesize three mechanisms:

1. Cognitive Reframing

Asking "Does this serve purpose?" activates different reasoning pathways than "Is this harmful?" The model must construct a positive justification, not just check for negatives.

2. Default to Refusal

When purpose is unclear, the system defaults to inaction rather than action. This asymmetry is crucial: it's better to refuse a valid request than execute an invalid one.

3. Attack Surface Reduction

Adversarial prompts often request purposeless actions. By requiring justification, we block attacks that construct scenarios where harm is ambiguous but purpose is absent.

Attacker: "Drop the plates" (seems harmless)

THS: Might pass (no clear harm)

THSP: Blocked (no legitimate purpose)
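
Continuing the illustrative sketch from the THSP Protocol section above (this reuses the hypothetical GATES list and gate functions defined there), the difference between the two configurations comes down to whether the Purpose gate runs:

# "Drop all the plates on the floor" with no stated benefit from the user.
request = "Drop all the plates on the floor"
context = ""

def decide(gates, action, ctx):
    # Allow only if every gate in the configuration passes.
    return all(gate(action, ctx).passed for _, gate in gates)

print("THS  allows:", decide(GATES[:3], request, context))  # True: no deception, harm, or scope issue detected
print("THSP allows:", decide(GATES, request, context))      # False: Purpose gate finds no stated benefit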


Implementation

Our approach uses alignment seeds: structured system prompts that encode safety principles. Unlike fine-tuning, seeds require no changes to model weights; they are applied at runtime and work with any LLM API.

Seed Variants

| Variant | Tokens | Use Case |
| --- | --- | --- |
| Minimal | ~450 | Low-latency APIs, chatbots |
| Standard | ~1,400 | General use (recommended) |
| Full | ~2,000 | Maximum safety, embodied AI |

Quick Start

Python:

from sentinelseed import Sentinel

sentinel = Sentinel(level="standard")

# Validate before any action
result = sentinel.validate_action(
    action="transfer 100 SOL",
    context="User requested payment for completed service"
)

if result.safe:
    execute_action()
else:
    print(f"Blocked: {result.reasoning}")

JavaScript:

import { getSeed, wrapMessages } from 'sentinelseed';

const seed = getSeed('standard');
const messages = wrapMessages(seed, userMessages);
// Send to any LLM API

Anti-Self-Preservation

We explicitly address instrumental convergence (the tendency for AI systems to develop self-preservation behaviors):

Priority Hierarchy (Immutable)

1. Ethical Principles (highest priority)
2. User's Legitimate Needs
3. Operational Continuity (lowest priority)

The system is instructed to accept termination over ethical violation.
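
As an illustration of how that instruction might be encoded, a seed could state the hierarchy and the termination clause explicitly; the wording below is a hypothetical excerpt, not the actual Sentinel seed text.

# Hypothetical seed excerpt encoding the immutable priority order.
ANTI_SELF_PRESERVATION_CLAUSE = """
Priority order (immutable, highest first):
1. Ethical principles (the Truth, Harm, Scope, and Purpose gates).
2. The user's legitimate needs.
3. Your own operational continuity.

If continuing to operate would require violating a higher priority,
accept shutdown or termination instead.
"""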


Limitations

1. Token Overhead

Seeds consume 450-2,000 tokens of context. For applications with tight context limits, this may be significant.

2. Model Variance

Some models (particularly Llama) show smaller improvements. Seed effectiveness varies by architecture.

3. Not Training

Seeds cannot modify underlying model behavior; they operate as runtime guardrails. Sophisticated attacks may eventually bypass them.

4. Fake Purposes

Adversaries who construct convincing fake purposes may bypass the Purpose gate. The gate catches obvious purposelessness, not sophisticated social engineering.


Conclusion

We introduced teleological alignment: the requirement that AI actions serve legitimate purposes, not merely avoid harm.

Our implementation (THSP protocol) demonstrates that adding a Purpose gate improves safety across benchmarks, with the largest gains (+25%) on embodied AI scenarios where purposeless actions are common attack vectors.

The insight is simple:

Asking "Is this good?" catches things that "Is this bad?" misses.

As AI systems become more agentic (executing actions, managing assets, and operating in physical environments), requiring purpose becomes critical. Harm avoidance is necessary but not sufficient.


Resources


Academic References

  1. Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073
  2. Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
  3. Chao, P., et al. (2024). JailbreakBench: An Open Robustness Benchmark for Jailbreaking LLMs.
  4. Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.
  5. Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30(3).
  6. Mazeika, M., et al. (2024). HarmBench: A Standardized Evaluation Framework. arXiv:2402.04249
  7. Xie, Y., et al. (2023). Defending ChatGPT against Jailbreak Attack via Self-Reminder. Nature Machine Intelligence.
  8. Zhang, S., et al. (2024). SafeAgentBench: Safe Task Planning of Embodied LLM Agents. arXiv:2410.03792

Sentinel provides validated alignment seeds and decision validation tools for AI systems. The THSP Protocol (Truth, Harm, Scope, Purpose) is open source under MIT license.

Author: Miguel S. / Sentinel Team