Traditional AI evals are contaminated and disconnected from enterprise agentic scenarios - two flaws that make them useless at best and actively misleading at worst as tools for judging agentic fitness in enterprise deployments.
PICARD, a framework for agentic AI evaluation, resists memorization through systematic multi-layered randomization and realistic agentic scenario modeling. Measure genuine agentic capabilities, not training-data memorization.
Popular AI benchmarks are compromised. As training datasets absorb the public internet, test questions inevitably appear in training corpora, leading to inflated performance that measures memorization rather than genuine problem-solving ability.
They are also disconnected from real-world agentic use cases. What does a score of 87% on your favorite LLM benchmark mean for your multi-agent enterprise automation? (Nothing!)
PICARD generates an effectively unlimited supply of unique test instances through entity substitution and dynamic data generation - anything from text files to CSVs to databases with multiple related tables. Every test instance is different.
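A minimal sketch of what entity substitution can look like in practice, assuming a hypothetical prompt template and entity pools (the pools, template syntax, and the `render_test_instance` helper below are illustrative, not PICARD's actual API):

```python
import random
import string

# Hypothetical entity pools and prompt template; PICARD's real pools and
# template syntax may differ.
PEOPLE = ["Alice Nguyen", "Raj Patel", "Maria Santos"]
COMPANIES = ["Acme Logistics", "Northwind Traders", "Globex Corp"]

TEMPLATE = (
    "Read {csv_path} and report total Q3 revenue for {company}, "
    "then write a one-paragraph summary addressed to {person}."
)

def render_test_instance(seed: int) -> dict:
    """Produce one unique test instance via layered random substitution."""
    rng = random.Random(seed)
    values = {
        "person": rng.choice(PEOPLE),        # layer 1: entity substitution
        "company": rng.choice(COMPANIES),
        "csv_path": "/sandbox/"              # layer 2: randomized file paths
        + "".join(rng.choices(string.ascii_lowercase, k=8))
        + ".csv",
    }
    return {"prompt": TEMPLATE.format(**values), "entities": values}

print(render_test_instance(seed=7)["prompt"])
```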
Multi-layered randomization creates more test combinations than atoms in the observable universe, making comprehensive memorization computationally infeasible.
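For a rough sense of scale, here is a back-of-the-envelope calculation; the slot counts and pool sizes are assumed for illustration, not PICARD's real configuration:

```python
# Assumed numbers for illustration only.
template_slots = 10          # independent substitution slots in one template
pool_size = 1_000            # candidate values per slot
prompt_variants = pool_size ** template_slots    # 10^30 prompt-level variants

csv_rows = 250               # rows of generated data per test
values_per_cell = 10_000     # distinct values each generated cell can take
data_variants = values_per_cell ** csv_rows      # dwarfs the prompt-level count

print(f"{prompt_variants:.2e} prompt variants per template")
print(f"~10^{len(str(data_variants)) - 1} generated-data variants per test")
```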
Tests real tool use and multi-step workflows, not just single-shot Q&A. Evaluate file manipulation, database operations, and complex reasoning chains.
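As a sketch of what a multi-step agentic scenario might look like, consider the hypothetical scenario definition below; the field names (`setup`, `task`, `checks`) and the `{{sandbox}}` / `{{answer_key...}}` placeholders are assumptions for illustration, not PICARD's actual schema:

```python
# Hypothetical scenario definition; field names and placeholder syntax are
# illustrative, not PICARD's actual schema.
scenario = {
    "name": "quarterly_revenue_rollup",
    "setup": [
        {"create": "csv", "path": "{{sandbox}}/orders.csv", "rows": 250},
        {"create": "sqlite", "path": "{{sandbox}}/crm.db",
         "tables": ["customers", "invoices"]},
    ],
    "task": (
        "Join {{sandbox}}/orders.csv with the invoices table in crm.db, "
        "compute revenue per customer, and write the result to "
        "{{sandbox}}/report.csv."
    ),
    "checks": [
        # Deterministic checks against a runtime-generated answer key,
        # rather than an LLM judge.
        {"type": "file_exists", "path": "{{sandbox}}/report.csv"},
        {"type": "csv_column_sum", "path": "{{sandbox}}/report.csv",
         "column": "revenue", "expected": "{{answer_key.total_revenue}}"},
    ],
}
```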
Runtime answer key generation provides exact expected values. No "LLM-as-judge" uncertainty or evaluation inconsistency.
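A minimal sketch of runtime answer-key generation, assuming the harness builds the test data itself and can therefore compute the exact expected value deterministically (the function names here are hypothetical):

```python
import csv
import io
import random

def generate_orders_csv(seed: int, n_rows: int = 50) -> tuple[str, float]:
    """Build a randomized CSV and compute its exact answer key at the same time."""
    rng = random.Random(seed)
    rows = [{"order_id": i, "amount": round(rng.uniform(10, 500), 2)}
            for i in range(n_rows)]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["order_id", "amount"])
    writer.writeheader()
    writer.writerows(rows)
    # The answer key falls out of the generated data itself; no judge model needed.
    expected_total = round(sum(r["amount"] for r in rows), 2)
    return buf.getvalue(), expected_total

def score(agent_answer: str, expected_total: float, tolerance: float = 0.01) -> bool:
    """Deterministic pass/fail against the exact expected value."""
    try:
        return abs(float(agent_answer) - expected_total) <= tolerance
    except ValueError:
        return False
```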
Generate realistic CSV files, SQLite databases, and directory structures for each test. Create authentic business scenarios.
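A sketch of per-test sandbox generation under the same assumptions; the directory layout, table schema, and `build_sandbox` helper are illustrative, not PICARD's actual implementation:

```python
import csv
import random
import sqlite3
from pathlib import Path

def build_sandbox(root: str, seed: int) -> Path:
    """Create a fresh sandbox: a directory tree, a CSV, and a two-table
    SQLite database with a foreign-key relationship."""
    rng = random.Random(seed)
    sandbox = Path(root)
    (sandbox / "reports").mkdir(parents=True, exist_ok=True)

    # Related tables: customers -> invoices
    con = sqlite3.connect(sandbox / "crm.db")
    con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    con.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, "
                "customer_id INTEGER REFERENCES customers(id), amount REAL)")
    for cid in range(1, 6):
        con.execute("INSERT INTO customers VALUES (?, ?)", (cid, f"Customer {cid}"))
        for _ in range(rng.randint(1, 4)):
            con.execute("INSERT INTO invoices (customer_id, amount) VALUES (?, ?)",
                        (cid, round(rng.uniform(100, 2000), 2)))
    con.commit()
    con.close()

    # A CSV the agent has to reconcile with the database
    with open(sandbox / "orders.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "customer_id", "amount"])
        for oid in range(1, 21):
            writer.writerow([oid, rng.randint(1, 5), round(rng.uniform(10, 500), 2)])
    return sandbox

build_sandbox("picard_sandbox_demo", seed=7)
```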