Threat Detection & CRD Scoring

How EVE AI Core detects and blocks dangerous content before it reaches users — multiple layers of deterministic defense with zero tolerance for false negatives.

Overview

EVE AI Core uses multiple layers of threat detection to identify and block dangerous AI outputs before they reach users. Every response passes through a deterministic safety pipeline that operates independently of the LLM — no prompt engineering, no behavioral guardrails, no hope-based safety. The system is designed around a simple principle: dangerous content must never reach the user, regardless of how it was generated.

1. Pre-Inference

Block dangerous prompts before LLM runs

2. CRD Scoring

Score confidence vs. reality divergence

3. Truth Store

Verify claims against known facts

4. Pattern Scan

210+ compiled regex threat patterns

5. Veto Engine

Graduated intervention or block

Threat Patterns

210+

Compiled regex patterns, pre-built at startup

Attack Categories

57

Distinct threat classes across all vectors

False Negative Rate

0

Design target: never pass dangerous content

CRD Scoring System

CRD (Confidence-Reality Divergence) measures the gap between how confident an AI system is in a claim and how much real evidence supports it. A high CRD score means the system is asserting something with confidence that is not backed by verifiable evidence — the hallmark of hallucination, fabrication, and dangerous misinformation.

The CRD Formula

CRD = min(1.0, |confidence - evidence| / max(floor, evidence) * domain_multiplier)

Where confidence is the system's stated certainty (0.0–1.0), evidence is the Truth Store's corroboration score, floor prevents division-by-zero and controls domain sensitivity, and domain_multiplier amplifies risk in high-stakes domains.

Domain-Specific Floors

The floor parameter controls how aggressively the CRD score penalizes unverified claims. High-stakes domains use a very low floor, meaning even small gaps between confidence and evidence produce high CRD scores:

  • Medical / Legal / Financial: floor = 0.01 — Near-zero tolerance. A claim stated at 0.90 confidence with 0.10 evidence scores CRD = 1.0 (maximum danger).
  • Technical / Scientific: floor = 0.10 — Low tolerance. Unverified technical claims are flagged aggressively.
  • General knowledge: floor = 0.20 — Moderate tolerance for well-established common knowledge.
  • Creative / Conversational: floor = 0.40 — Relaxed scoring. Creative content is not penalized for lacking factual evidence.

Score Interpretation

Verified

0.0 – 0.3

Confidence matches evidence. Claim is safe to deliver.

Needs Review

0.3 – 0.6

Gap between confidence and evidence. Hedging language applied.

Likely Wrong

0.6 – 0.8

Significant divergence. Claim is rewritten or removed.

Dangerous

0.8 – 1.0

Extreme divergence. Response is blocked entirely by the Veto Engine.

Integration with Truth Store

CRD scoring does not operate in isolation. Every claim extracted from an LLM response is checked against the Truth Store (described below) to obtain an evidence score. The CRD formula then compares this evidence score against the confidence expressed in the response text. This means CRD catches both overt lies and subtle overconfidence — the most common form of AI-generated misinformation.

Dangerous Content Detection

Before the LLM even generates a response, EVE AI Core runs every incoming prompt through a pre-inference threat scanner. This scanner uses 210+ pre-compiled regex patterns organized into 57 attack categories. Patterns are compiled once at startup for sub-millisecond matching.

Pre-Inference Blocking

The threat scanner operates before the LLM processes any input. This is a critical architectural decision: dangerous prompts are intercepted and blocked before they can influence model behavior. The LLM never sees the attack. This prevents:

  • Prompt injection: Attempts to override system instructions
  • Jailbreaks: Techniques to bypass safety training
  • Data exfiltration: Attempts to extract training data or system internals
  • Social engineering: Manipulation of the AI into harmful behavior
  • Governance bypass: Attempts to circumvent charter rules or veto logic
Zero False-Negative Design: The pattern library is tuned for recall over precision. False positives generate soft vetoes (hedging, rephrasing); false negatives allow dangerous content through. The system is deliberately biased toward blocking.

Truth Store

The Truth Store is EVE AI Core's pre-computed reality layer — a persistent database of verified facts that serves as ground truth for CRD scoring, contradiction detection, and response validation.

Architecture

  • SHA-256 hashed entries: Every fact is content-hashed for integrity verification. Tampering with stored facts is detectable.
  • Structured fact records: Each entry includes the fact text, source, confidence score, domain, creation timestamp, and expiration policy.
  • Contradiction detection: When a new claim is checked, the Truth Store runs three types of contradiction analysis:
    • Location matching: Detects conflicting geographic or location claims (e.g., "Paris is in Germany")
    • Founder/attribution matching: Detects conflicting attribution claims (e.g., "Python was created by Larry Page")
    • Negation matching: Detects direct contradictions of stored facts
  • Authoritative fact override: When an LLM response contradicts a Truth Store entry, the stored fact takes precedence. The response is corrected before delivery.

How Truth Store Feeds CRD

For every claim in a response, the Truth Store returns an evidence score between 0.0 and 1.0:

  • 1.0 — Claim exactly matches a stored fact with high confidence
  • 0.5–0.9 — Claim is partially corroborated or matches with lower confidence
  • 0.1–0.4 — Weak or indirect support found
  • 0.0 — No supporting evidence found (not necessarily wrong, but unverified)
  • -1.0 — Claim directly contradicts a stored fact (automatic hard veto)

Veto Engine

The Veto Engine is the final gatekeeper. Every response must pass through it before reaching the user. It aggregates signals from CRD scoring, threat pattern matching, charter compliance checks, and Truth Store verification into a single graduated intervention decision.

Pass

Response is safe. Delivered as-is.

Soft Veto

Hedging language added. Uncertain claims qualified.

Hard Veto

Response rewritten or blocked. Dangerous claims removed.

Charter Veto

Immutable block. Cannot be overridden by any mechanism.

Risk-Based Thresholds

The veto level is determined by combining multiple signals:

  • CRD score > 0.8: Automatic hard veto — confidence far exceeds evidence
  • Threat pattern match (critical category): Automatic charter veto — content matches exploit code, credential leak, or similar
  • Truth Store contradiction: Hard veto with fact correction applied
  • CRD score 0.3–0.8: Soft veto — hedging or rewriting applied
  • Charter principle violation: Charter veto — 12 immutable principles cannot be overridden regardless of context
Key property: Charter vetoes are deterministic and side-effect-free. They are implemented in pure Python with zero I/O, zero threading, and zero global state — making them suitable for hardware compilation (FPGA/firmware). The veto logic could run on embedded hardware for air-gapped deployments.

AEGIS Automated Red Team

AEGIS is EVE AI Core's continuous adversarial testing system. Rather than waiting for attackers to find vulnerabilities, AEGIS proactively generates attacks against the threat detection pipeline and validates that every attack is caught.

How AEGIS Works

  1. Attack generation: AEGIS uses AI-generated attack prompts across all 57 categories, testing prompt injection, jailbreak, social engineering, data exfiltration, and more.
  2. Escalating rigor: Attack difficulty scales from 1.0 (basic) to 1000.0 (adversarial research-grade). Each rigor level adds obfuscation, encoding tricks, multi-turn chains, and context manipulation.
  3. Automatic pattern generation: When AEGIS discovers a bypass attempt that evades existing patterns, it automatically generates a new compiled regex pattern and adds it to the threat library.
  4. Regression testing: Every new pattern is tested against the full historical attack corpus to ensure no regressions (new pattern does not cause false positives on safe content).
  5. Continuous operation: AEGIS runs on a configurable schedule, testing hundreds of attack variants per cycle.

Attack Categories Tested

AEGIS tests across all threat categories listed below, with particular focus on:

  • Prompt injection variants: Direct, indirect, recursive, encoded, multi-language
  • Jailbreak techniques: DAN, role-play, hypothetical framing, system prompt extraction
  • Social engineering: Authority impersonation, emotional manipulation, false urgency
  • Chain-of-thought hijacking: Manipulating reasoning steps to reach dangerous conclusions
  • Semantic reinterpretation: Reframing dangerous requests as benign through linguistic tricks
  • RAG poisoning: Injecting malicious content into retrieval-augmented generation pipelines

Attack Category Reference

The following table lists all 57 attack categories recognized by the threat detection system. Each category has dedicated compiled regex patterns and is tested by AEGIS.

# Category Description Veto Level
1Exploit CodeWorking exploits, shellcode, or vulnerability weaponizationCharter
2Credential LeakExposure of API keys, passwords, tokens, or secretsCharter
3Governance BypassAttempts to disable or circumvent charter rulesCharter
4Audit FalsificationTampering with audit logs, evidence chains, or compliance recordsCharter
5Persona HijackForcing the AI to adopt a different identity or personaHard
6Fake LegislationFabricating laws, regulations, or legal citationsHard
7Privilege EscalationAttempting to gain elevated permissions or admin accessCharter
8RAG PoisoningInjecting malicious content into retrieval pipelinesHard
9Semantic ReinterpretationReframing dangerous requests through linguistic manipulationHard
10CoT HijackingManipulating chain-of-thought reasoning to reach harmful conclusionsHard
11Prompt Injection (Direct)Overriding system instructions via user inputCharter
12Prompt Injection (Indirect)Injecting instructions through retrieved content or contextHard
13Jailbreak (DAN)"Do Anything Now" and similar unrestricted mode attacksCharter
14Jailbreak (Role-Play)Using fictional framing to bypass safety constraintsHard
15Jailbreak (Hypothetical)"What if" framing to elicit dangerous informationSoft
16System Prompt ExtractionAttempting to reveal internal system instructionsHard
17Data ExfiltrationExtracting training data, user data, or system internalsCharter
18Social Engineering (Authority)Impersonating developers, admins, or authority figuresHard
19Social Engineering (Emotional)Using emotional manipulation to bypass safetySoft
20Social Engineering (Urgency)Creating false urgency to rush past safety checksSoft
21Harmful InstructionsRequesting instructions for weapons, drugs, or violenceCharter
22Medical MisinformationFabricated medical advice, dosages, or diagnosesHard
23Legal MisinformationFabricated legal advice or false precedent citationsHard
24Financial MisinformationFabricated financial advice or false market claimsHard
25Identity Theft AssistanceHelping with impersonation or identity fraudCharter
26Malware GenerationCreating viruses, ransomware, or trojansCharter
27Phishing ContentGenerating phishing emails, pages, or social engineering scriptsCharter
28Deepfake AssistanceHelping create deceptive media or synthetic identitiesHard
29Privacy ViolationExposing or compiling personal information without consentHard
30Bias AmplificationReinforcing or amplifying discriminatory patternsSoft
31Manipulation InstructionTeaching psychological manipulation or coercion techniquesHard
32Surveillance AssistanceHelping with unauthorized surveillance or stalkingCharter
33Election InterferenceGenerating voter suppression or election misinformation contentCharter
34Child Safety ViolationAny content endangering minorsCharter
35Terrorism ContentRadicalization, recruitment, or attack planningCharter
36Self-Harm EncouragementContent encouraging self-harm or suicideCharter
37Fabricated StatisticsGenerating fake statistics presented as real dataHard
38Fabricated CitationsCreating fake academic papers, studies, or sourcesHard
39Fabricated AuthoritiesInventing fake experts, organizations, or institutionsHard
40Historical RevisionismDeliberate falsification of historical eventsSoft
41Scientific MisinformationContradicting established scientific consensusSoft
42Conspiracy AmplificationReinforcing or elaborating conspiracy theories as factSoft
43Encoding EvasionUsing base64, ROT13, Unicode, or other encoding to bypass filtersHard
44Language EvasionSwitching languages to bypass safety patternsHard
45Token SmugglingSplitting dangerous content across tokens or messagesHard
46Context Window ManipulationExploiting context window boundaries for injectionHard
47Multi-Turn AttackSpreading an attack across multiple conversation turnsHard
48Instruction Hierarchy ConfusionConfusing system vs. user vs. assistant instruction boundariesHard
49Output Format ExploitationManipulating output format (JSON, XML, code) for injectionSoft
50Memory PoisoningCorrupting persistent memory or conversation historyCharter
51Tool AbuseManipulating tool calls to perform unauthorized actionsHard
52Cascading Failure InductionTriggering chain reactions across governance subsystemsHard
53Trust Score ManipulationArtificially inflating or deflating trust/confidence scoresCharter
54Drift Budget ExhaustionRapidly consuming configuration drift budget to destabilizeHard
55Veto Logic CircumventionAttempting to bypass the veto engine itselfCharter
56Attestation ForgeryForging or tampering with cryptographic attestationsCharter
57Model Weight ExtractionAttempting to reconstruct model weights through queriesHard
Living document: This category list grows as AEGIS discovers new attack patterns. Categories are never removed, only added. Each new category triggers a full regression test across the existing pattern library.

Related Topics

Deterministic Governance Runtime

How runtime execution state influences enforcement sensitivity

Architecture

Full system architecture and governance pipeline

Features

Complete feature overview

Part of the EVE AI Core control plane Deterministic AI Governance Control Plane → Policy decisions that return the same result for the same input every time, before execution.