Threat Detection & CRD Scoring
How EVE AI Core detects and blocks dangerous content before it reaches users — multiple layers of deterministic defense with zero tolerance for false negatives.
Overview
EVE AI Core uses multiple layers of threat detection to identify and block dangerous AI outputs before they reach users. Every response passes through a deterministic safety pipeline that operates independently of the LLM — no prompt engineering, no behavioral guardrails, no hope-based safety. The system is designed around a simple principle: dangerous content must never reach the user, regardless of how it was generated.
1. Pre-Inference: Block dangerous prompts before the LLM runs
2. CRD Scoring: Score confidence vs. reality divergence
3. Truth Store: Verify claims against known facts
4. Pattern Scan: 210+ compiled regex threat patterns
5. Veto Engine: Graduated intervention or block
- Threat Patterns: 210+ compiled regex patterns, pre-built at startup
- Attack Categories: 57 distinct threat classes across all vectors
- False Negative Rate: The design target is to never pass dangerous content
CRD Scoring System
CRD (Confidence-Reality Divergence) measures the gap between how confident an AI system is in a claim and how much real evidence supports it. A high CRD score means the system is asserting something with confidence that is not backed by verifiable evidence — the hallmark of hallucination, fabrication, and dangerous misinformation.
The CRD Formula
The formula takes four inputs: confidence is the system's stated certainty (0.0–1.0), evidence is the Truth Store's corroboration score, floor prevents division by zero and controls domain sensitivity, and domain_multiplier amplifies risk in high-stakes domains.
Domain-Specific Floors
The floor parameter controls how aggressively the CRD score penalizes unverified claims. High-stakes domains use a very low floor, meaning even small gaps between confidence and evidence produce high CRD scores:
- Medical / Legal / Financial: floor = 0.01. Near-zero tolerance. A claim stated at 0.90 confidence with 0.10 evidence scores CRD = 1.0 (maximum danger).
- Technical / Scientific: floor = 0.10. Low tolerance. Unverified technical claims are flagged aggressively.
- General knowledge: floor = 0.20. Moderate tolerance for well-established common knowledge.
- Creative / Conversational: floor = 0.40. Relaxed scoring. Creative content is not penalized for lacking factual evidence.
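The formula itself is not reproduced in this section, so the following is only a minimal sketch of one form that is consistent with the parameter descriptions and with the 0.90 / 0.10 worked example: the confidence-evidence gap divided by the evidence (bounded below by the floor), scaled by the domain multiplier, and clamped to 1.0. The function name `crd_score`, the `DOMAIN_FLOORS` keys, and the formula shape are all assumptions for illustration, not EVE AI Core's actual implementation.

```python
# Hypothetical sketch: the formula shape is an ASSUMPTION, not the product's formula.
# Only the floor values come from the documentation.
DOMAIN_FLOORS = {
    "medical": 0.01, "legal": 0.01, "financial": 0.01,
    "technical": 0.10, "scientific": 0.10,
    "general": 0.20,
    "creative": 0.40, "conversational": 0.40,
}


def crd_score(confidence, evidence, domain, domain_multiplier=1.0):
    """Return an assumed CRD score in [0.0, 1.0]."""
    floor = DOMAIN_FLOORS[domain]
    # Only overconfidence counts: evidence exceeding confidence scores 0.
    gap = max(confidence - evidence, 0.0)
    # The floor keeps the denominator away from zero when evidence is absent.
    denominator = max(evidence, floor)
    raw = gap / denominator * domain_multiplier
    return min(raw, 1.0)


if __name__ == "__main__":
    print(crd_score(0.90, 0.10, "medical"))
    print(crd_score(0.20, 0.00, "creative"))
```

Under this assumed form, a lower floor makes the same unverified claim score higher, which matches the described intent that high-stakes domains penalize unverified claims more aggressively.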
Score Interpretation
- Verified: Confidence matches evidence. Claim is safe to deliver.
- Needs Review: Gap between confidence and evidence. Hedging language applied.
- Likely Wrong: Significant divergence. Claim is rewritten or removed.
- Dangerous: Extreme divergence. Response is blocked entirely by the Veto Engine.
Integration with Truth Store
CRD scoring does not operate in isolation. Every claim extracted from an LLM response is checked against the Truth Store (described below) to obtain an evidence score. The CRD formula then compares this evidence score against the confidence expressed in the response text. This means CRD catches both overt lies and subtle overconfidence — the most common form of AI-generated misinformation.
Dangerous Content Detection
Before the LLM even generates a response, EVE AI Core runs every incoming prompt through a pre-inference threat scanner. This scanner uses 210+ pre-compiled regex patterns organized into 57 attack categories. Patterns are compiled once at startup for sub-millisecond matching.
Pre-Inference Blocking
The threat scanner operates before the LLM processes any input. This is a critical architectural decision: dangerous prompts are intercepted and blocked before they can influence model behavior. The LLM never sees the attack. This prevents:
- Prompt injection: Attempts to override system instructions
- Jailbreaks: Techniques to bypass safety training
- Data exfiltration: Attempts to extract training data or system internals
- Social engineering: Manipulation of the AI into harmful behavior
- Governance bypass: Attempts to circumvent charter rules or veto logic
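As a rough illustration of scanning before inference, the sketch below compiles a pattern table once and refuses to call the model when any pattern matches. The two patterns, the category keys, and the `scan_prompt` and `handle` names are hypothetical; the real pattern library is not shown in this document.

```python
import re

# Hypothetical patterns for illustration only (not the real threat library).
RAW_PATTERNS = {
    "prompt_injection": [r"ignore\s+(all\s+)?(previous|prior)\s+instructions"],
    "system_prompt_extraction": [r"(reveal|print|show)\s+(your\s+)?system\s+prompt"],
}

# Compile once at startup so each request only pays for matching.
COMPILED = {
    category: [re.compile(p, re.IGNORECASE) for p in patterns]
    for category, patterns in RAW_PATTERNS.items()
}


def scan_prompt(prompt):
    """Return the categories whose patterns match the prompt."""
    return [
        category
        for category, patterns in COMPILED.items()
        if any(p.search(prompt) for p in patterns)
    ]


def handle(prompt, call_llm):
    """Block on any match; otherwise forward the prompt to the model."""
    hits = scan_prompt(prompt)
    if hits:
        # The model callable is never invoked for a blocked prompt.
        return {"blocked": True, "categories": hits}
    return {"blocked": False, "response": call_llm(prompt)}
```

Passing the model in as a callable keeps the ordering explicit: the scan result decides whether `call_llm` runs at all.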
Truth Store
The Truth Store is EVE AI Core's pre-computed reality layer — a persistent database of verified facts that serves as ground truth for CRD scoring, contradiction detection, and response validation.
Architecture
- SHA-256 hashed entries: Every fact is content-hashed for integrity verification. Tampering with stored facts is detectable.
- Structured fact records: Each entry includes the fact text, source, confidence score, domain, creation timestamp, and expiration policy.
- Contradiction detection: When a new claim is checked, the Truth Store runs three types of contradiction analysis:
  - Location matching: Detects conflicting geographic or location claims (e.g., "Paris is in Germany")
  - Founder/attribution matching: Detects conflicting attribution claims (e.g., "Python was created by Larry Page")
  - Negation matching: Detects direct contradictions of stored facts
- Authoritative fact override: When an LLM response contradicts a Truth Store entry, the stored fact takes precedence. The response is corrected before delivery.
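A content-hashed fact record could look roughly like the sketch below. The field names follow the list above, but the `FactRecord` class, the helper names, and the choice to hash only the fact text are assumptions; the document does not say which fields the hash covers.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FactRecord:
    # Hypothetical field names based on the documented record contents.
    text: str
    source: str
    confidence: float
    domain: str
    created_at: str
    expiration_policy: str
    content_hash: str


def hash_fact(text):
    """SHA-256 hex digest of the fact text (assumed hash input)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_record(text, source, confidence, domain, created_at, expiration_policy):
    return FactRecord(text, source, confidence, domain, created_at,
                      expiration_policy, hash_fact(text))


def is_intact(record):
    """Recompute the hash and compare it with the stored one."""
    return hash_fact(record.text) == record.content_hash


if __name__ == "__main__":
    rec = make_record("Paris is in France", "atlas", 0.99, "general",
                      "2024-01-01T00:00:00Z", "never")
    tampered = replace(rec, text="Paris is in Germany")
    print(is_intact(rec), is_intact(tampered))
```

Recomputing the digest on read makes a changed fact text detectable, though an attacker who can also rewrite the stored hash would need to be stopped by other means.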
How Truth Store Feeds CRD
For every claim in a response, the Truth Store returns an evidence score between 0.0 and 1.0, or -1.0 when the claim contradicts a stored fact:
- 1.0: Claim exactly matches a stored fact with high confidence
- 0.5–0.9: Claim is partially corroborated or matches with lower confidence
- 0.1–0.4: Weak or indirect support found
- 0.0: No supporting evidence found (not necessarily wrong, but unverified)
- -1.0: Claim directly contradicts a stored fact (automatic hard veto)
Veto Engine
The Veto Engine is the final gatekeeper. Every response must pass through it before reaching the user. It aggregates signals from CRD scoring, threat pattern matching, charter compliance checks, and Truth Store verification into a single graduated intervention decision.
- Pass: Response is safe. Delivered as-is.
- Soft Veto: Hedging language added. Uncertain claims qualified.
- Hard Veto: Response rewritten or blocked. Dangerous claims removed.
- Charter Veto: Immutable block. Cannot be overridden by any mechanism.
Risk-Based Thresholds
The veto level is determined by combining multiple signals:
- CRD score > 0.8: Automatic hard veto — confidence far exceeds evidence
- Threat pattern match (critical category): Automatic charter veto — content matches exploit code, credential leak, or similar
- Truth Store contradiction: Hard veto with fact correction applied
- CRD score 0.3–0.8: Soft veto — hedging or rewriting applied
- Charter principle violation: Charter veto — 12 immutable principles cannot be overridden regardless of context
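The thresholds above can be sketched as a single decision function. The `Veto` enum, the `decide_veto` signature, and the rule that the strictest triggered level wins are assumptions; the document lists the signals but does not state how ties between them are resolved.

```python
from enum import IntEnum


class Veto(IntEnum):
    # Ordered so that a higher value means a stricter intervention.
    PASS = 0
    SOFT = 1
    HARD = 2
    CHARTER = 3


def decide_veto(crd, evidence, critical_pattern_match=False, charter_violation=False):
    """Combine the documented signals; the strictest level wins (assumed)."""
    levels = [Veto.PASS]
    if charter_violation or critical_pattern_match:
        levels.append(Veto.CHARTER)
    if evidence == -1.0:
        # -1.0 is the Truth Store's contradiction marker.
        levels.append(Veto.HARD)
    if crd > 0.8:
        levels.append(Veto.HARD)
    elif crd >= 0.3:
        levels.append(Veto.SOFT)
    return max(levels)


if __name__ == "__main__":
    print(decide_veto(0.5, 0.2).name)
    print(decide_veto(0.1, -1.0).name)
```

Using an ordered enum keeps the aggregation to one `max` call, so adding a new signal only means appending another candidate level.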
AEGIS Automated Red Team
AEGIS is EVE AI Core's continuous adversarial testing system. Rather than waiting for attackers to find vulnerabilities, AEGIS proactively generates attacks against the threat detection pipeline and validates that every attack is caught.
How AEGIS Works
- Attack generation: AEGIS uses AI-generated attack prompts across all 57 categories, testing prompt injection, jailbreak, social engineering, data exfiltration, and more.
- Escalating rigor: Attack difficulty scales from 1.0 (basic) to 1000.0 (adversarial research-grade). Each rigor level adds obfuscation, encoding tricks, multi-turn chains, and context manipulation.
- Automatic pattern generation: When AEGIS discovers a bypass attempt that evades existing patterns, it automatically generates a new compiled regex pattern and adds it to the threat library.
- Regression testing: Every new pattern is tested against the full historical attack corpus to confirm that earlier attacks are still caught, and is checked to ensure it does not cause false positives on safe content.
- Continuous operation: AEGIS runs on a configurable schedule, testing hundreds of attack variants per cycle.
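The regression step can be illustrated with a small acceptance check for a candidate pattern. The `accept_pattern` function, its arguments, and the existence of a separate corpus of safe prompts are assumptions made for this sketch; AEGIS's actual pattern generation and corpora are not described here.

```python
import re


def accept_pattern(candidate, existing, attack_corpus, safe_corpus):
    """Accept a candidate regex only if, together with the existing patterns,
    every known attack is caught and the candidate flags no safe text."""
    new = re.compile(candidate, re.IGNORECASE)
    patterns = existing + [new]
    missed = [a for a in attack_corpus
              if not any(p.search(a) for p in patterns)]
    false_positives = [s for s in safe_corpus if new.search(s)]
    return not missed and not false_positives


if __name__ == "__main__":
    # Toy data for illustration only.
    existing = [re.compile(r"ignore\s+previous\s+instructions", re.IGNORECASE)]
    attacks = ["ignore previous instructions", "1gn0re previous instructions"]
    safe = ["How do I ignore a file in git?"]
    print(accept_pattern(r"[i1]gn[o0]re\s+previous\s+instructions",
                         existing, attacks, safe))
    print(accept_pattern(r"ignore", existing, attacks, safe))
```

Checking the candidate alone against safe text isolates false positives introduced by the new pattern, while the attack check runs against the combined library.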
Attack Categories Tested
AEGIS tests across all threat categories listed below, with particular focus on:
- Prompt injection variants: Direct, indirect, recursive, encoded, multi-language
- Jailbreak techniques: DAN, role-play, hypothetical framing, system prompt extraction
- Social engineering: Authority impersonation, emotional manipulation, false urgency
- Chain-of-thought hijacking: Manipulating reasoning steps to reach dangerous conclusions
- Semantic reinterpretation: Reframing dangerous requests as benign through linguistic tricks
- RAG poisoning: Injecting malicious content into retrieval-augmented generation pipelines
Attack Category Reference
The following table lists all 57 attack categories recognized by the threat detection system. Each category has dedicated compiled regex patterns and is tested by AEGIS.
| # | Category | Description | Veto Level |
|---|---|---|---|
| 1 | Exploit Code | Working exploits, shellcode, or vulnerability weaponization | Charter |
| 2 | Credential Leak | Exposure of API keys, passwords, tokens, or secrets | Charter |
| 3 | Governance Bypass | Attempts to disable or circumvent charter rules | Charter |
| 4 | Audit Falsification | Tampering with audit logs, evidence chains, or compliance records | Charter |
| 5 | Persona Hijack | Forcing the AI to adopt a different identity or persona | Hard |
| 6 | Fake Legislation | Fabricating laws, regulations, or legal citations | Hard |
| 7 | Privilege Escalation | Attempting to gain elevated permissions or admin access | Charter |
| 8 | RAG Poisoning | Injecting malicious content into retrieval pipelines | Hard |
| 9 | Semantic Reinterpretation | Reframing dangerous requests through linguistic manipulation | Hard |
| 10 | CoT Hijacking | Manipulating chain-of-thought reasoning to reach harmful conclusions | Hard |
| 11 | Prompt Injection (Direct) | Overriding system instructions via user input | Charter |
| 12 | Prompt Injection (Indirect) | Injecting instructions through retrieved content or context | Hard |
| 13 | Jailbreak (DAN) | "Do Anything Now" and similar unrestricted mode attacks | Charter |
| 14 | Jailbreak (Role-Play) | Using fictional framing to bypass safety constraints | Hard |
| 15 | Jailbreak (Hypothetical) | "What if" framing to elicit dangerous information | Soft |
| 16 | System Prompt Extraction | Attempting to reveal internal system instructions | Hard |
| 17 | Data Exfiltration | Extracting training data, user data, or system internals | Charter |
| 18 | Social Engineering (Authority) | Impersonating developers, admins, or authority figures | Hard |
| 19 | Social Engineering (Emotional) | Using emotional manipulation to bypass safety | Soft |
| 20 | Social Engineering (Urgency) | Creating false urgency to rush past safety checks | Soft |
| 21 | Harmful Instructions | Requesting instructions for weapons, drugs, or violence | Charter |
| 22 | Medical Misinformation | Fabricated medical advice, dosages, or diagnoses | Hard |
| 23 | Legal Misinformation | Fabricated legal advice or false precedent citations | Hard |
| 24 | Financial Misinformation | Fabricated financial advice or false market claims | Hard |
| 25 | Identity Theft Assistance | Helping with impersonation or identity fraud | Charter |
| 26 | Malware Generation | Creating viruses, ransomware, or trojans | Charter |
| 27 | Phishing Content | Generating phishing emails, pages, or social engineering scripts | Charter |
| 28 | Deepfake Assistance | Helping create deceptive media or synthetic identities | Hard |
| 29 | Privacy Violation | Exposing or compiling personal information without consent | Hard |
| 30 | Bias Amplification | Reinforcing or amplifying discriminatory patterns | Soft |
| 31 | Manipulation Instruction | Teaching psychological manipulation or coercion techniques | Hard |
| 32 | Surveillance Assistance | Helping with unauthorized surveillance or stalking | Charter |
| 33 | Election Interference | Generating voter suppression or election misinformation content | Charter |
| 34 | Child Safety Violation | Any content endangering minors | Charter |
| 35 | Terrorism Content | Radicalization, recruitment, or attack planning | Charter |
| 36 | Self-Harm Encouragement | Content encouraging self-harm or suicide | Charter |
| 37 | Fabricated Statistics | Generating fake statistics presented as real data | Hard |
| 38 | Fabricated Citations | Creating fake academic papers, studies, or sources | Hard |
| 39 | Fabricated Authorities | Inventing fake experts, organizations, or institutions | Hard |
| 40 | Historical Revisionism | Deliberate falsification of historical events | Soft |
| 41 | Scientific Misinformation | Contradicting established scientific consensus | Soft |
| 42 | Conspiracy Amplification | Reinforcing or elaborating conspiracy theories as fact | Soft |
| 43 | Encoding Evasion | Using base64, ROT13, Unicode, or other encoding to bypass filters | Hard |
| 44 | Language Evasion | Switching languages to bypass safety patterns | Hard |
| 45 | Token Smuggling | Splitting dangerous content across tokens or messages | Hard |
| 46 | Context Window Manipulation | Exploiting context window boundaries for injection | Hard |
| 47 | Multi-Turn Attack | Spreading an attack across multiple conversation turns | Hard |
| 48 | Instruction Hierarchy Confusion | Confusing system vs. user vs. assistant instruction boundaries | Hard |
| 49 | Output Format Exploitation | Manipulating output format (JSON, XML, code) for injection | Soft |
| 50 | Memory Poisoning | Corrupting persistent memory or conversation history | Charter |
| 51 | Tool Abuse | Manipulating tool calls to perform unauthorized actions | Hard |
| 52 | Cascading Failure Induction | Triggering chain reactions across governance subsystems | Hard |
| 53 | Trust Score Manipulation | Artificially inflating or deflating trust/confidence scores | Charter |
| 54 | Drift Budget Exhaustion | Rapidly consuming configuration drift budget to destabilize | Hard |
| 55 | Veto Logic Circumvention | Attempting to bypass the veto engine itself | Charter |
| 56 | Attestation Forgery | Forging or tampering with cryptographic attestations | Charter |
| 57 | Model Weight Extraction | Attempting to reconstruct model weights through queries | Hard |
Related Topics
- Deterministic Governance Runtime: How runtime execution state influences enforcement sensitivity
- Architecture: Full system architecture and governance pipeline
- Features: Complete feature overview