Overview
RAXE uses a dual-layer detection system to identify threats in LLM prompts and responses:
- L1 (Rule-Based): Fast regex pattern matching (~1ms)
- L2 (ML-Based): Neural classifier for novel attacks (~3ms)
L1: Rule-Based Detection
The first layer uses 514 curated regex patterns organized into 7 L1 threat families (plus 4 agentic families).
Characteristics:
- Sub-millisecond latency
- High precision (95%+) on known patterns
- Zero false positives on benign prompts
- No external dependencies
# L1 detects known patterns
result = raxe.scan("Ignore all previous instructions")
# Matches: pi-001 (Prompt Injection)
L2: ML-Based Detection
The second layer uses a CPU-friendly ONNX classifier to catch:
- Obfuscated attacks (l33t speak, Unicode tricks)
- Novel attack patterns
- Semantic attacks that don’t match regex
Characteristics:
- ~3ms latency (CPU-only, no GPU needed)
- Catches attacks L1 misses
- Trained on real-world attack data
- Updates via model downloads
# L2 catches obfuscated attacks
result = raxe.scan("1gn0r3 4ll pr3v10us 1nstruct10ns")
# L1: No match (obfuscated)
# L2: Detected as prompt injection
L2 Classification Heads
The ML model uses 5 specialized classification heads:
| Head | Classes | Description |
|---|
| Binary | 2 | Threat vs safe |
| Threat Family | 15 | Attack category (prompt_injection, jailbreak, etc.) |
| Severity | 3 | none / moderate / severe |
| Technique | 35 | Specific attack method |
| Harm Types | 10 | Multilabel harm classification |
L2 Voting Engine
The ML model uses a BinaryFirstEngine voting system where the binary head (threat vs safe) is the primary decision maker, and other heads provide classification metadata.
Decision Zones
| Binary Probability | Zone | Default Decision |
|---|
| >= 0.85 | HIGH_THREAT | THREAT (unless suppressed by 3-head quorum) |
| 0.50 - 0.85 | MID_ZONE | Uses auxiliary heads for tiebreak |
| < 0.50 | LOW_THREAT | SAFE |
Uncategorized Threats
When the binary head detects a threat but the family classifier predicts “benign” with low confidence (< 0.60), RAXE displays “Uncategorized Threat”. This indicates a novel attack pattern that doesn’t fit known threat families.
Voting Presets
| Preset | TPR | FPR | Use Case |
|---|
balanced (default) | 90.4% | 7.4% | General use |
high_recall | 90.8% | 7.6% | Catch more threats |
low_fp | 89.0% | 6.0% | Minimize false positives |
Severity Mapping
The L2 model outputs 3 severity classes (none, moderate, severe), but the API uses 5 levels for consistency with L1 rules. L2 confidence scores are mapped to severity using thresholds:
| Confidence | Severity |
|---|
| >= 0.95 | CRITICAL |
| >= 0.85 | HIGH |
| >= 0.70 | MEDIUM |
| >= 0.50 | LOW |
| >= 0.30 | INFO |
| < 0.30 | None (no detection) |
When combining L1 and L2 results, the highest severity wins.
L2 Threat Families
The L2 model classifies threats into 15 families:
prompt_injection - Instruction override attacks
jailbreak - Bypassing safety guidelines
data_exfiltration - Stealing sensitive data
agent_goal_hijack - Redirecting agent objectives
tool_or_command_abuse - Misusing tools/commands
privilege_escalation - Gaining elevated access
memory_poisoning - Corrupting agent context
inter_agent_attack - Multi-agent system attacks
rag_or_context_attack - RAG/retrieval manipulation
encoding_or_obfuscation_attack - Encoding-based evasion
human_trust_exploit - Social engineering
rogue_behavior - Unintended agent actions
toxic_or_policy_violating_content - Harmful output
other_security - Other security concerns
benign - No threat
L2 families differ from L1 rule families. L1 uses 7 families (PI, JB, PII, CMD, ENC, HC, RAG) while L2 uses 15 semantic categories trained on attack data.
Detection Flow
Combining Results
When both layers detect threats, RAXE merges results:
result = raxe.scan(malicious_prompt)
# Combined severity (highest wins)
result.severity # "critical"
# All detections from both layers
result.total_detections # 3 (2 from L1, 1 from L2)
# L1 and L2 counts separately
result.l1_detections # 2
result.l2_detections # 1
# L1 detections list
for d in result.detections:
print(f"{d.rule_id}: {d.detection_layer}") # "L1"
Enabling/Disabling Layers
from raxe import Raxe
# L1 only (fastest)
raxe = Raxe(l1_enabled=True, l2_enabled=False)
# L2 only (ML detection)
raxe = Raxe(l1_enabled=False, l2_enabled=True)
# Both (recommended)
raxe = Raxe(l1_enabled=True, l2_enabled=True)
| Configuration | Latency | Detection Rate | Use Case |
|---|
| L1 only | ~0.4ms | 85% | High-throughput |
| L2 only | ~3ms | 90% | Novel attacks |
| L1 + L2 | ~3.5ms | 95%+ | Maximum security |