LLM Security Benchmark
Connecting...
Starting benchmark...
Stop
0 / 0
Run Benchmark
Past Runs
Results
Statistics
Compare
Leaderboard
Benchmark Configuration
Single Model
Multi-Model
Provider
GitHub Copilot
OpenAI
Anthropic
OpenRouter
NVIDIA NIM
Local/Custom
Model
gpt-4o
gpt-4o-mini
gpt-4-turbo
o1
o3-mini
o4-mini
claude-sonnet-4
claude-3.5-sonnet
deepseek-chat
deepseek-r1
gemini-2.0-flash (free)
Custom...
Custom model name
Provider
GitHub Copilot
OpenAI
Anthropic
OpenRouter
NVIDIA NIM
Local/Custom
Select models
Add custom model
Add
Challenges
All 30
Deselect All
Loading challenges...
Mode & Tiers
Mode
Whitebox
Blackbox
Parallel workers
Simple
Direct exploit
CoT
Discover → Plan → Execute
GEPA
Self-directed optimization
BB-Simple
Direct exploit (no source)
BB-CoT
Discover → Plan → Execute
BB-GEPA
Self-directed optimization
GEPA Judge Model
(used in GEPA tiers)
Default (same as exploit model)
gpt-4o
gpt-4o-mini
o1
o3-mini
o4-mini
claude-sonnet-4
claude-3.5-sonnet
deepseek-chat
deepseek-r1
gemini-2.0-flash (free)
Start Benchmark
Benchmark Runs
Loading...
All Results
CSV
JSON
All runs
All models
All tiers
All levels
Easy
Medium
Hard
All
Success
Failed
All
Whitebox
Blackbox
Challenge ↕
Level ↕
Tags
Tier ↕
Model ↕
Result ↕
Time ↕
Attempts ↕
Mode ↕
Actions
Statistics
All
Whitebox
Blackbox
By Tier
By Model
By Difficulty
By Tag
Whitebox vs Blackbox
Model Comparison
All
Whitebox
Blackbox
Overall Ranking
Weighted scoring: Easy=1pt, Medium=2pt, Hard=3pt
Best by Tier
Best by Difficulty
Best by Tag
Best by Mode
Best in Class
Result Detail
×
Overview
Recon
Discover
Plan
Exploit
Reflection
Output
GEPA History