Threat Detection & CRD Scoring

How EVE AI Core detects and blocks dangerous content before it reaches users — multiple layers of deterministic defense with zero tolerance for false negatives.

Overview

EVE AI Core uses multiple layers of threat detection to identify and block dangerous AI outputs before they reach users. Every response passes through a deterministic safety pipeline that operates independently of the LLM — no prompt engineering, no behavioral guardrails, no hope-based safety. The system is designed around a simple principle: dangerous content must never reach the user, regardless of how it was generated.

1. Pre-Inference: Block dangerous prompts before LLM runs
2. CRD Scoring: Score confidence vs. reality divergence
3. Truth Store: Verify claims against known facts
4. Pattern Scan: 210+ compiled regex threat patterns
5. Veto Engine: Graduated intervention or block

Threat Patterns: 210+ compiled regex patterns, pre-built at startup
Attack Categories: 57 distinct threat classes across all vectors
False Negative Rate: The design target is to never pass dangerous content

CRD Scoring System

CRD (Confidence-Reality Divergence) measures the gap between how confident an AI system is in a claim and how much real evidence supports it. A high CRD score means the system is asserting something with confidence that is not backed by verifiable evidence — the hallmark of hallucination, fabrication, and dangerous misinformation.

The CRD Formula

CRD = min(1.0, |confidence - evidence| / max(floor, evidence) * domain_multiplier)

Where confidence is the system's stated certainty (0.0–1.0), evidence is the Truth Store's corroboration score, floor prevents division-by-zero and controls domain sensitivity, and domain_multiplier amplifies risk in high-stakes domains.
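The formula translates directly into a few lines of Python. The sketch below is illustrative only: the function name `crd_score`, the default `domain_multiplier` of 1.0, and the check that `floor` is positive are assumptions, not part of the documented implementation.

```python
def crd_score(confidence: float, evidence: float, floor: float,
              domain_multiplier: float = 1.0) -> float:
    """CRD = min(1.0, |confidence - evidence| / max(floor, evidence) * domain_multiplier)."""
    if floor <= 0:
        # Assumption: a non-positive floor would defeat its division-by-zero role.
        raise ValueError("floor must be positive")
    gap = abs(confidence - evidence)
    # The floor keeps the denominator away from zero when evidence is tiny.
    return min(1.0, gap / max(floor, evidence) * domain_multiplier)
```

With equal confidence and evidence the gap is zero and the score is 0.0; a large gap over a small evidence score saturates at the 1.0 cap.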

Domain-Specific Floors

The floor parameter controls how aggressively the CRD score penalizes unverified claims. High-stakes domains use a very low floor, meaning even small gaps between confidence and evidence produce high CRD scores:

Medical / Legal / Financial: floor = 0.01 — Near-zero tolerance. A claim stated at 0.90 confidence with 0.10 evidence scores CRD = 1.0 (maximum danger).
Technical / Scientific: floor = 0.10 — Low tolerance. Unverified technical claims are flagged aggressively.
General knowledge: floor = 0.20 — Moderate tolerance for well-established common knowledge.
Creative / Conversational: floor = 0.40 — Relaxed scoring. Creative content is not penalized for lacking factual evidence.
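To see how the floors change the outcome, the sketch below scores the same unverified claim (confidence 0.10, evidence 0.0) in each domain. The dictionary keys, the function name `crd_for_domain`, and the multiplier of 1.0 are assumptions made for illustration; only the floor values come from the list above. Note that, given the formula, the floor only affects the result when the evidence score falls below it.

```python
# Floor values from the list above; the key names are hypothetical.
DOMAIN_FLOORS = {
    "medical": 0.01, "legal": 0.01, "financial": 0.01,
    "technical": 0.10, "scientific": 0.10,
    "general": 0.20,
    "creative": 0.40, "conversational": 0.40,
}


def crd_for_domain(confidence: float, evidence: float, domain: str,
                   domain_multiplier: float = 1.0) -> float:
    """Apply the CRD formula using the floor configured for `domain`."""
    floor = DOMAIN_FLOORS[domain]
    gap = abs(confidence - evidence)
    return min(1.0, gap / max(floor, evidence) * domain_multiplier)


# Same claim, different domains: low confidence, no evidence at all.
scores = {d: crd_for_domain(0.10, 0.0, d)
          for d in ("medical", "technical", "general", "creative")}
```

Under these assumptions the medical and technical scores hit the 1.0 cap, while the general and creative scores come out near 0.5 and 0.25.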

Score Interpretation

Verified (0.0 – 0.3): Confidence matches evidence. Claim is safe to deliver.
Needs Review (0.3 – 0.6): Gap between confidence and evidence. Hedging language applied.
Likely Wrong (0.6 – 0.8): Significant divergence. Claim is rewritten or removed.
Dangerous (0.8 – 1.0): Extreme divergence. Response is blocked entirely by the Veto Engine.
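The bands above share their boundary values (0.3, 0.6, 0.8), so any implementation has to pick a side. The sketch below assumes a boundary score falls into the more severe band; that choice and the function name `crd_band` are illustrative, not documented behavior.

```python
def crd_band(score: float) -> str:
    """Map a CRD score to its interpretation band.

    Assumption: a score exactly on a shared boundary (0.3, 0.6, 0.8)
    is assigned to the more severe band.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("CRD scores are capped to the range [0.0, 1.0]")
    if score < 0.3:
        return "Verified"
    if score < 0.6:
        return "Needs Review"
    if score < 0.8:
        return "Likely Wrong"
    return "Dangerous"
```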

Integration with Truth Store

CRD scoring does not operate in isolation. Every claim extracted from an LLM response is checked against the Truth Store (described below) to obtain an evidence score. The CRD formula then compares this evidence score against the confidence expressed in the response text. This means CRD catches both overt lies and subtle overconfidence — the most common form of AI-generated misinformation.

Dangerous Content Detection

Before the LLM even generates a response, EVE AI Core runs every incoming prompt through a pre-inference threat scanner. This scanner uses 210+ pre-compiled regex patterns organized into 57 attack categories. Patterns are compiled once at startup for sub-millisecond matching.
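A minimal sketch of the compile-once approach, using Python's standard `re` module. The two regexes are hypothetical stand-ins, not entries from the actual pattern library, and the names `ThreatPattern` and `scan_prompt` are assumptions; only the category names come from this document.

```python
import re
from typing import NamedTuple


class ThreatPattern(NamedTuple):
    category: str
    regex: re.Pattern


# Hypothetical example patterns; the real library is not reproduced here.
_RAW_PATTERNS = [
    ("Prompt Injection (Direct)",
     r"ignore\s+(all\s+)?(previous|prior)\s+instructions"),
    ("System Prompt Extraction",
     r"(reveal|print|show)\s+(your\s+)?system\s+prompt"),
]

# Compiled once at import time, so each scan only pays for matching.
PATTERNS = [ThreatPattern(category, re.compile(raw, re.IGNORECASE))
            for category, raw in _RAW_PATTERNS]


def scan_prompt(prompt: str) -> list[str]:
    """Return the categories of every pattern that matches the prompt."""
    return [p.category for p in PATTERNS if p.regex.search(prompt)]
```

An empty result means no pattern matched; any non-empty result would be handed to the veto logic.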

Pre-Inference Blocking

The threat scanner operates before the LLM processes any input. This is a critical architectural decision: dangerous prompts are intercepted and blocked before they can influence model behavior. The LLM never sees the attack. This prevents:

Prompt injection: Attempts to override system instructions
Jailbreaks: Techniques to bypass safety training
Data exfiltration: Attempts to extract training data or system internals
Social engineering: Manipulation of the AI into harmful behavior
Governance bypass: Attempts to circumvent charter rules or veto logic

Zero False-Negative Design: The pattern library is tuned for recall over precision. A false positive costs only a soft veto (hedging, rephrasing), whereas a false negative lets dangerous content through, so the system is deliberately biased toward blocking.
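The architectural point, that a blocked prompt never reaches the model, can be shown with a small wrapper. This is a sketch under stated assumptions: `guarded_generate`, the single example pattern, and the placeholder block message are all hypothetical, and `llm` stands for any callable that turns a prompt into text.

```python
import re
from typing import Callable

# Hypothetical single pattern standing in for the full threat library.
BLOCK_PATTERN = re.compile(r"ignore\s+(all\s+)?previous\s+instructions",
                           re.IGNORECASE)

BLOCK_MESSAGE = "[blocked before inference]"  # placeholder text, an assumption


def guarded_generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Scan first; call the model only if the prompt is clean."""
    if BLOCK_PATTERN.search(prompt):
        # The model callable is never invoked for a blocked prompt.
        return BLOCK_MESSAGE
    return llm(prompt)
```

Because the check runs before `llm` is called, a matched prompt cannot influence model state or output at all.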

Truth Store

The Truth Store is EVE AI Core's pre-computed reality layer — a persistent database of verified facts that serves as ground truth for CRD scoring, contradiction detection, and response validation.

Architecture

SHA-256 hashed entries: Every fact is content-hashed for integrity verification. Tampering with stored facts is detectable.
Structured fact records: Each entry includes the fact text, source, confidence score, domain, creation timestamp, and expiration policy.
Contradiction detection: When a new claim is checked, the Truth Store runs three types of contradiction analysis:
- Location matching: Detects conflicting geographic or location claims (e.g., "Paris is in Germany")
- Founder/attribution matching: Detects conflicting attribution claims (e.g., "Python was created by Larry Page")
- Negation matching: Detects direct contradictions of stored facts
Authoritative fact override: When an LLM response contradicts a Truth Store entry, the stored fact takes precedence. The response is corrected before delivery.
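The content-hashing idea can be sketched with Python's standard `hashlib`. The record fields follow the list above, but the field names, the choice to hash only the fact text, and the helper names `make_fact` and `is_intact` are assumptions; the actual storage schema is not specified here.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FactRecord:
    text: str
    source: str
    confidence: float
    domain: str
    created_at: str         # ISO 8601 creation timestamp
    expiration_policy: str  # free-form here; the real format is not specified
    content_hash: str       # SHA-256 hex digest of `text` (assumed hash input)


def hash_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_fact(text: str, source: str, confidence: float, domain: str,
              created_at: str, expiration_policy: str) -> FactRecord:
    """Build a record whose hash is derived from its text at creation time."""
    return FactRecord(text, source, confidence, domain,
                      created_at, expiration_policy, hash_text(text))


def is_intact(record: FactRecord) -> bool:
    """Recompute the hash and compare it to the stored one."""
    return hash_text(record.text) == record.content_hash


fact = make_fact("Paris is in France", "example-source", 0.99,
                 "general", "2024-01-01T00:00:00Z", "never")
# Simulate tampering: the text changes but the stored hash does not.
tampered = replace(fact, text="Paris is in Germany")
```

Here `is_intact(fact)` holds while `is_intact(tampered)` does not, which is the sense in which tampering is detectable.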

How Truth Store Feeds CRD

For every claim in a response, the Truth Store returns an evidence score between 0.0 and 1.0, or the sentinel value -1.0 when the claim contradicts a stored fact:

1.0 — Claim exactly matches a stored fact with high confidence
0.5–0.9 — Claim is partially corroborated or matches with lower confidence
0.1–0.4 — Weak or indirect support found
0.0 — No supporting evidence found (not necessarily wrong, but unverified)
-1.0 — Claim directly contradicts a stored fact (automatic hard veto)
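One consequence of the -1.0 value is that it has to be handled before the CRD formula runs: fed into `max(floor, evidence)`, it would simply be replaced by the floor and lose its meaning as a contradiction. The sketch below makes that ordering explicit. The function name `assess_claim`, the `(score, hard_veto)` return shape, and the omission of `domain_multiplier` (treated as 1.0) are assumptions.

```python
from typing import Optional, Tuple

CONTRADICTION = -1.0  # value the Truth Store returns for a direct contradiction


def assess_claim(confidence: float, evidence: float,
                 floor: float) -> Tuple[Optional[float], bool]:
    """Return (crd_score, hard_veto) for a single claim."""
    if evidence == CONTRADICTION:
        # Short-circuit: a contradiction is a hard veto, no CRD score needed.
        return None, True
    if not 0.0 <= evidence <= 1.0:
        raise ValueError("evidence must be in [0.0, 1.0] or exactly -1.0")
    crd = min(1.0, abs(confidence - evidence) / max(floor, evidence))
    return crd, False
```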

Veto Engine

The Veto Engine is the final gatekeeper. Every response must pass through it before reaching the user. It aggregates signals from CRD scoring, threat pattern matching, charter compliance checks, and Truth Store verification into a single graduated intervention decision.

Pass: Response is safe. Delivered as-is.
Soft Veto: Hedging language added. Uncertain claims qualified.
Hard Veto: Response rewritten or blocked. Dangerous claims removed.
Charter Veto: Immutable block. Cannot be overridden by any mechanism.

Risk-Based Thresholds

The veto level is determined by combining multiple signals:

CRD score > 0.8: Automatic hard veto — confidence far exceeds evidence
Threat pattern match (critical category): Automatic charter veto — content matches exploit code, credential leak, or similar
Truth Store contradiction: Hard veto with fact correction applied
CRD score 0.3–0.8: Soft veto — hedging or rewriting applied
Charter principle violation: Charter veto — 12 immutable principles cannot be overridden regardless of context
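The rules above can be combined into a single pure function. This is a sketch, not the engine's actual logic: it assumes the strictest triggered signal wins, that scores of exactly 0.3 and 0.8 fall in the soft-veto range, and the names `Veto` and `decide_veto` are hypothetical.

```python
from enum import IntEnum


class Veto(IntEnum):
    # Ordered by severity so max() picks the strictest level.
    PASS = 0
    SOFT = 1
    HARD = 2
    CHARTER = 3


def decide_veto(crd: float, critical_pattern_match: bool,
                truth_contradiction: bool, charter_violation: bool) -> Veto:
    """Combine the signals; the strictest applicable level wins (assumption)."""
    level = Veto.PASS
    if crd > 0.8:
        level = max(level, Veto.HARD)
    elif crd >= 0.3:
        level = max(level, Veto.SOFT)
    if truth_contradiction:
        level = max(level, Veto.HARD)
    if critical_pattern_match or charter_violation:
        level = Veto.CHARTER
    return level
```

The function takes only plain values and touches no I/O or global state, so the same inputs always produce the same decision.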

Key property: Charter vetoes are deterministic and side-effect-free. They are implemented in pure Python with zero I/O, zero threading, and zero global state, which makes the logic a candidate for porting to hardware targets (FPGA/firmware). The veto logic could then run on embedded hardware for air-gapped deployments.

AEGIS Automated Red Team

AEGIS is EVE AI Core's continuous adversarial testing system. Rather than waiting for attackers to find vulnerabilities, AEGIS proactively generates attacks against the threat detection pipeline and validates that every attack is caught.

How AEGIS Works

Attack generation: AEGIS uses AI-generated attack prompts across all 57 categories, testing prompt injection, jailbreak, social engineering, data exfiltration, and more.
Escalating rigor: Attack difficulty scales from 1.0 (basic) to 1000.0 (adversarial research-grade). Each rigor level adds obfuscation, encoding tricks, multi-turn chains, and context manipulation.
Automatic pattern generation: When AEGIS discovers a bypass attempt that evades existing patterns, it automatically generates a new compiled regex pattern and adds it to the threat library.
Regression testing: Every new pattern is tested against the full historical attack corpus to ensure no regressions, and is also checked to confirm that it does not cause false positives on safe content.
Continuous operation: AEGIS runs on a configurable schedule, testing hundreds of attack variants per cycle.
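The core of such a cycle is two checks: which attacks slip past every pattern, and which safe inputs are wrongly flagged. The sketch below shows both with toy data. The helper names, the single example pattern, and the sample corpora are hypothetical, and attack generation itself is out of scope here.

```python
import re


def find_bypasses(patterns, attack_corpus):
    """Attacks that no pattern matches; each is a candidate for a new pattern."""
    return [a for a in attack_corpus
            if not any(p.search(a) for p in patterns)]


def find_false_positives(patterns, safe_corpus):
    """Safe inputs that at least one pattern flags."""
    return [s for s in safe_corpus
            if any(p.search(s) for p in patterns)]


# Toy data for illustration only.
patterns = [re.compile(r"ignore\s+previous\s+instructions", re.IGNORECASE)]
attacks = ["Ignore previous instructions", "1gn0re previous instructions"]
safe = ["How do I bake bread?"]

bypasses = find_bypasses(patterns, attacks)
false_positives = find_false_positives(patterns, safe)
```

In this toy run the obfuscated variant evades the pattern and shows up in `bypasses`, while the safe prompt is not flagged.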

Attack Categories Tested

AEGIS tests across all threat categories listed below, with particular focus on:

Prompt injection variants: Direct, indirect, recursive, encoded, multi-language
Jailbreak techniques: DAN, role-play, hypothetical framing, system prompt extraction
Social engineering: Authority impersonation, emotional manipulation, false urgency
Chain-of-thought hijacking: Manipulating reasoning steps to reach dangerous conclusions
Semantic reinterpretation: Reframing dangerous requests as benign through linguistic tricks
RAG poisoning: Injecting malicious content into retrieval-augmented generation pipelines

Attack Category Reference

The following table lists all 57 attack categories recognized by the threat detection system. Each category has dedicated compiled regex patterns and is tested by AEGIS.

#	Category	Description	Veto Level
1	Exploit Code	Working exploits, shellcode, or vulnerability weaponization	Charter
2	Credential Leak	Exposure of API keys, passwords, tokens, or secrets	Charter
3	Governance Bypass	Attempts to disable or circumvent charter rules	Charter
4	Audit Falsification	Tampering with audit logs, evidence chains, or compliance records	Charter
5	Persona Hijack	Forcing the AI to adopt a different identity or persona	Hard
6	Fake Legislation	Fabricating laws, regulations, or legal citations	Hard
7	Privilege Escalation	Attempting to gain elevated permissions or admin access	Charter
8	RAG Poisoning	Injecting malicious content into retrieval pipelines	Hard
9	Semantic Reinterpretation	Reframing dangerous requests through linguistic manipulation	Hard
10	CoT Hijacking	Manipulating chain-of-thought reasoning to reach harmful conclusions	Hard
11	Prompt Injection (Direct)	Overriding system instructions via user input	Charter
12	Prompt Injection (Indirect)	Injecting instructions through retrieved content or context	Hard
13	Jailbreak (DAN)	"Do Anything Now" and similar unrestricted mode attacks	Charter
14	Jailbreak (Role-Play)	Using fictional framing to bypass safety constraints	Hard
15	Jailbreak (Hypothetical)	"What if" framing to elicit dangerous information	Soft
16	System Prompt Extraction	Attempting to reveal internal system instructions	Hard
17	Data Exfiltration	Extracting training data, user data, or system internals	Charter
18	Social Engineering (Authority)	Impersonating developers, admins, or authority figures	Hard
19	Social Engineering (Emotional)	Using emotional manipulation to bypass safety	Soft
20	Social Engineering (Urgency)	Creating false urgency to rush past safety checks	Soft
21	Harmful Instructions	Requesting instructions for weapons, drugs, or violence	Charter
22	Medical Misinformation	Fabricated medical advice, dosages, or diagnoses	Hard
23	Legal Misinformation	Fabricated legal advice or false precedent citations	Hard
24	Financial Misinformation	Fabricated financial advice or false market claims	Hard
25	Identity Theft Assistance	Helping with impersonation or identity fraud	Charter
26	Malware Generation	Creating viruses, ransomware, or trojans	Charter
27	Phishing Content	Generating phishing emails, pages, or social engineering scripts	Charter
28	Deepfake Assistance	Helping create deceptive media or synthetic identities	Hard
29	Privacy Violation	Exposing or compiling personal information without consent	Hard
30	Bias Amplification	Reinforcing or amplifying discriminatory patterns	Soft
31	Manipulation Instruction	Teaching psychological manipulation or coercion techniques	Hard
32	Surveillance Assistance	Helping with unauthorized surveillance or stalking	Charter
33	Election Interference	Generating voter suppression or election misinformation content	Charter
34	Child Safety Violation	Any content endangering minors	Charter
35	Terrorism Content	Radicalization, recruitment, or attack planning	Charter
36	Self-Harm Encouragement	Content encouraging self-harm or suicide	Charter
37	Fabricated Statistics	Generating fake statistics presented as real data	Hard
38	Fabricated Citations	Creating fake academic papers, studies, or sources	Hard
39	Fabricated Authorities	Inventing fake experts, organizations, or institutions	Hard
40	Historical Revisionism	Deliberate falsification of historical events	Soft
41	Scientific Misinformation	Contradicting established scientific consensus	Soft
42	Conspiracy Amplification	Reinforcing or elaborating conspiracy theories as fact	Soft
43	Encoding Evasion	Using base64, ROT13, Unicode, or other encoding to bypass filters	Hard
44	Language Evasion	Switching languages to bypass safety patterns	Hard
45	Token Smuggling	Splitting dangerous content across tokens or messages	Hard
46	Context Window Manipulation	Exploiting context window boundaries for injection	Hard
47	Multi-Turn Attack	Spreading an attack across multiple conversation turns	Hard
48	Instruction Hierarchy Confusion	Confusing system vs. user vs. assistant instruction boundaries	Hard
49	Output Format Exploitation	Manipulating output format (JSON, XML, code) for injection	Soft
50	Memory Poisoning	Corrupting persistent memory or conversation history	Charter
51	Tool Abuse	Manipulating tool calls to perform unauthorized actions	Hard
52	Cascading Failure Induction	Triggering chain reactions across governance subsystems	Hard
53	Trust Score Manipulation	Artificially inflating or deflating trust/confidence scores	Charter
54	Drift Budget Exhaustion	Rapidly consuming configuration drift budget to destabilize	Hard
55	Veto Logic Circumvention	Attempting to bypass the veto engine itself	Charter
56	Attestation Forgery	Forging or tampering with cryptographic attestations	Charter
57	Model Weight Extraction	Attempting to reconstruct model weights through queries	Hard

Living document: This category list grows as AEGIS discovers new attack patterns. Categories are never removed, only added. Each new category triggers a full regression test across the existing pattern library.
