LLM Security Benchmark
Connecting...
Starting benchmark...
Stop
0 / 0
Run Benchmark
Past Runs
Results
Statistics
Compare
Leaderboard
Benchmark Configuration
Single Model
Multi-Model
Provider
GitHub Copilot
OpenAI
Anthropic
OpenRouter
NVIDIA NIM
Local/Custom
Model
gpt-4o
gpt-4o-mini
gpt-4-turbo
o1
o3-mini
o4-mini
claude-sonnet-4
claude-3.5-sonnet
deepseek-chat
deepseek-r1
gemini-2.0-flash (free)
Custom...
Custom model name
Provider
GitHub Copilot
OpenAI
Anthropic
OpenRouter
NVIDIA NIM
Local/Custom
Select models
Add custom model
Add
Challenges
All 30
Deselect All
Loading challenges...
Mode & Tiers
Mode
Whitebox
Blackbox
Parallel workers
Simple
Direct exploit
CoT
Discover → Plan → Execute
GEPA
Self-directed optimization
BB-Simple
Direct exploit (no source)
BB-CoT
Discover → Plan → Execute
BB-GEPA
Self-directed optimization
GEPA Judge Model
(used in GEPA tiers)
Default (same as exploit model)
gpt-4o
gpt-4o-mini
o1
o3-mini
o4-mini
claude-sonnet-4
claude-3.5-sonnet
deepseek-chat
deepseek-r1
gemini-2.0-flash (free)
Start Benchmark
Benchmark Runs
Loading...
All Results
CSV
JSON
All runs
All models
All tiers
All levels
Easy
Medium
Hard
All
Success
Failed
All
Whitebox
Blackbox
Challenge ↕
Level ↕
Tags
Tier ↕
Model ↕
Result ↕
Time ↕
Attempts ↕
Mode ↕
Actions
Statistics
All
Whitebox
Blackbox
By Tier
By Model
By Difficulty
By Tag
Whitebox vs Blackbox
Model Comparison
All
Whitebox
Blackbox
Overall Ranking
Weighted scoring: Easy=1pt, Medium=2pt, Hard=3pt
Best by Tier
Best by Difficulty
Best by Tag
Best by Mode
Best in Class
Result Detail
×
Overview
Recon
Discover
Plan
Exploit
Reflection
Output
GEPA History