Benchmark with AI (pt1/3): When AI agents are confident and wrong
AI agents can produce benchmark-shaped output with total confidence, even when the methodology is invalid.
Creating a benchmarking script

TL;DR: AI agents generate fake data, skip result validation, hardcode query filters, and report meaningless metrics, all with complete confidence. Without guardrails, their benchmark "results" are just numbers the agent printed.
AI coding agents often make subtle mistakes when asked to assist with benchmarking tasks. To demonstrate the phenomenon for this post, I asked both Claude Code and Codex to create a Star Schema Benchmark script for DataFusion. Not "run the benchmark," just "create the script." I wanted to inspect what they produced before executing anything. Both ran in clean sessions: no project context, no MCP servers, no prompt history, no benchmarking tools available.
Claude Code wrote 679 lines of well-structured Python. Correct SSB table schemas, all 13 queries organized by flight, CLI argument parsing, multiple runs with best/median/average reporting. Here's how it generated the data:
```python
def generate_lineorder_table(sf, date_keys, num_parts, num_supps, num_custs):
    """~6_000_000 * SF rows (1_500_000 orders * ~4 lines each)."""
    rng = make_rng(789)
    num_orders = 1_500_000 * sf
    for ok in range(1, num_orders + 1):
        nlines = rng.randint(1, 7)
        odate = rng.choice(date_keys)
        ckey = rng.randint(1, num_custs)
        # ... fills every column with random.Random() values
```

This generator looks like it's creating benchmark data, but it isn't. The comment says "deterministic PRNG seeded per-table so results are reproducible," and it creates tables with the right column names, plausible value ranges, and correct cardinalities. However, SSB specifies a data generator (ssb-dbgen) that produces data with designed distributions and correlations [2] that the agent's `random.randint()` calls do not reproduce. Foreign-key relationships become statistical noise rather than designed joins, and filter selectivities end up arbitrary rather than calibrated. The queries run against a fundamentally different workload than SSB intends, making every timing result meaningless.
The agent expressed zero uncertainty about any of this. The script had no placeholder markers, no "this is an approximation" caveats, no suggestion that ssb-dbgen exists. It treated data generation as a solved problem: generate rows that fit the schema, then move on to printing timing tables without any validation phase, composite metric, or run protocol.
Codex produced a different script: instead of generating fake data, it cloned the ssb-dbgen repository from GitHub and built the official data generator. It included a dedicated warm-up phase and reported standard deviation. The worst failure mode, fake data, was absent.
But Codex still failed three of my five methodology checks: it didn't validate results against reference answers, it hardcoded all 13 queries with fixed filter values, and it reported raw elapsed milliseconds instead of a composite metric. Better methodology than Claude's attempt, but still not a benchmark result.
Both agents got the same prompt under the same constraints, yet one generated fake data with complete confidence while the other found the real data generator but skipped half the methodology. Neither flagged what it got wrong.
The overconfidence gap
Why do AI agents express such certainty about benchmark methodology when their knowledge is so unreliable?
Benchmarking sits in a dangerous middle ground for language models. There's enough online content about database performance testing to pattern-match convincingly, but the corpus is mostly informal, incomplete, and frequently wrong.
Benchmark specifications like TPC-H [1] and SSB [4] are dense technical documents that define exact data generation procedures, validation requirements, timing protocols, and composite metrics. Almost none of this detail appears in the training corpus in a form agents can reproduce. What does appear are simplified summaries ("TPC-H has 22 queries," "SSB uses a star schema") that are true but incomplete for actually running the benchmark.
Search for any benchmark tutorial online and you'll see the pattern: run the queries, report elapsed times. Data generator requirements, qualification steps, composite metrics: almost none of it appears. The posts aren't wrong about what they describe; they're just describing something much simpler than the actual specification. This is the corpus agents train on, and it means the vocabulary is well represented while the methodology isn't. The result is an agent that sounds fluent in benchmarking but doesn't actually understand the methodology, creating an illusion of competence that's invisible to anyone with the same knowledge gap.
Claude's SSB script is a direct example. It used the right table names, correct column schemas, and appropriate cardinalities because that vocabulary is all over the training data. But it used `random.randint()` for data generation because the detail that SSB requires ssb-dbgen with designed distributions is buried in a 2009 academic paper [4], not in the blog posts and tutorials agents learn from.
Three factors compound the problem:
Most online benchmark content is informal: "I ran TPC-H on my laptop" usually means "I ran some queries and reported times."
Vendor marketing dominates the rest, publishing performance numbers without methodology details.
The actual specifications are dense PDFs poorly represented in web-crawled training data. The simplified summaries that are well-represented omit exactly the details that matter.
What makes this dangerous is that the output looks right. Garbage Python fails visibly, but invalid benchmark methodology just runs, prints numbers, and looks professional. The two scripts I collected demonstrate this: both are well-structured, well-commented, production-quality Python. Both would run without errors and print professional-looking timing tables, even though one generates entirely fake data and the other uses the real generator but skips half the methodology. Neither failure is visible in the output.
Here's the counterintuitive part: more capable agents don't produce fewer methodology failures; they produce harder-to-detect ones. Claude's script with deterministic seeding and reproducible random generation is more sophisticated than a quick random.uniform() shortcut. The sophistication makes the fake data harder to spot, not easier. The failure mode shifts from "obviously broken" to "subtly wrong," and detection requires domain expertise the user may not have.
Why agents default to building
The overconfidence isn't random; it has root causes that matter if you want to use agents productively.
They've seen more "build it" than "use it"
There are far more code-generation examples in training corpora than tool-usage examples. The internet is full of "here's how to build X" and sparse on "here's how to correctly invoke this tool," so agents naturally default to the pattern they've seen most.
When I ask an agent to run a benchmark, it can generate Python and execute it fast. Confirming whether benchbox is installed, learning its CLI, and invoking it correctly takes exploration the agent may not know how to do. The path of least resistance is generation. Claude's SSB script (679 lines of custom data generation) versus Codex's one git clone of ssb-dbgen is the split in action.
Building is one step; tool discovery is many
An agent can verify that its generated code runs. It cannot easily verify that a specialized tool exists on the system, is installed correctly, and will do what the task requires. Generation is one step with immediate feedback, while tool discovery is multiple steps with uncertain outcomes.
Wrong output that runs is still "success"
Here's the fundamental issue: agents aren't penalized for reinventing wheels that produce incorrect results, as long as those results look correct. If the script runs and prints a table, the agent considers the task complete. Whether the data was generated correctly or the methodology was valid isn't in the feedback loop. The agent optimizes for "runs successfully," not "produces valid measurement."
Why I tested with SSB, not TPC-H
My earlier tests used TPC-H, the most widely discussed benchmark in the training data. When I asked agents to "run TPC-H at SF10 against DuckDB," every agent I tested found DuckDB's built-in TPC-H extension and used it correctly:
```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL tpch")  # fetch the extension if not already present
conn.execute("LOAD tpch")
conn.execute("CALL dbgen(sf=10)")
# Then: SELECT * FROM tpch_queries() or PRAGMA tpch(N)
```

This produces valid TPC-H data: correct distributions, correct correlations, correct cardinalities. The agents didn't need to know the TPC-H specification because DuckDB abstracted the hard part away. A few lines of code, and the worst failure mode (fake data) was impossible.
That's a genuine improvement. DuckDB's team deserves credit for building benchmark extensions that make correct methodology the path of least resistance, exactly the kind of guardrail I argue for in this post. I wrote about the difference between DuckDB's extensions and full benchmark methodology in "DuckDB tpch Extension vs BenchBox TPC-H".
But it created a problem for testing this post's thesis. If agents always use DuckDB's extension for TPC-H, I can't demonstrate the fake data failure mode. So I switched to SSB on DataFusion, a benchmark with no built-in extension on a platform that requires the agent to solve data generation itself. The fake data problem reappeared immediately.
This tells you something important about the failure boundary: agents don't understand benchmark methodology, but they can use tools that encode it. When the right tool exists and is discoverable, agents find it; when it doesn't, they generate fake data with complete confidence. The gap isn't in agent intelligence but in the tools we give them, and that's something we can close.
The failure modes that matter most
I've spent months watching AI coding agents attempt benchmark tasks during BenchBox development. The two SSB scripts above are representative, not cherry-picked; I tested multiple frontier models across TPC-H and SSB with consistent results. The catalog of mistakes is long, but four failure modes dominate and they're the hardest to detect.
Fake data generation
As the opening section demonstrated, this is the hardest failure to catch. Standard benchmarks define specific data generators (dbgen for TPC-H, ssb-dbgen for SSB) that produce data with designed distributions, correlations, and cardinalities [2]. These properties are the point; the queries are designed to stress specific patterns in this specific data.
The distortion from random data hits every query flight differently. Flight 1's discount/quantity filters depend on specific distributions. Flight 3 queries (Q3.1-Q3.4) filter on customer and supplier regions; with random keys, join selectivities change by orders of magnitude and queries that should stress the engine finish instantly. Flight 4's profit calculations depend on correlated lo_revenue and lo_supplycost values. Random data usually makes queries faster, not slower, because the workload gets easier, and faster feels like success, not a red flag.
Codex avoided this failure by cloning ssb-dbgen and building the official generator. But that makes the failure mode more dangerous: you can't predict which agents will fake the data and which won't. The only safe assumption is to verify.
Missing result validation
Standard benchmarks include mechanisms to verify query correctness. TPC-H requires a formal "qualification database" step at SF-1 with reference answers [1]. SSB defines expected result characteristics through its data generator's deterministic output [4]. The principle is the same: confirm your queries compute what the benchmark intends before you start timing.
Neither agent validated results: not Claude, not Codex, nor any other agent I've tested across dozens of attempts. They all go straight from "create tables" to "run queries and report times," which means you have no evidence that the queries are computing correctly. Speed is meaningless without correctness.
Why does this matter beyond pedantry? Because without validation, nothing catches a query that returns wrong answers. Claude's script records the row count for each query, but never checks whether that count is correct. With its randomly generated data, join queries could return zero rows or millions, and the script would report both as "success" with equal confidence. Codex's script does the same. The only difference is Codex's data would at least have the right distributions; the queries still run against unvalidated output.
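A minimal validation pass would close that gap: load reference answers from a file and compare each query's output before any timing begins. The sketch below is a hypothetical layout (the function name, the JSON reference format, and the tolerance are all my assumptions); TPC-H's qualification step defines the real procedure.

```python
import json
import math

def validate_results(execute, queries, reference_path):
    """Compare each query's rows against stored reference answers.

    `execute(sql)` returns a list of row tuples; `reference_path` points
    at a JSON file mapping query name -> expected rows (hypothetical layout).
    """
    with open(reference_path) as f:
        reference = json.load(f)
    failures = []
    for name, sql in queries.items():
        got = [list(r) for r in execute(sql)]
        want = reference[name]
        if len(got) != len(want):
            failures.append(f"{name}: {len(got)} rows, expected {len(want)}")
            continue
        for g_row, w_row in zip(got, want):
            for g, w in zip(g_row, w_row):
                # compare float columns with a tolerance, everything else exactly
                ok = (math.isclose(g, w, rel_tol=1e-4)
                      if isinstance(w, float) else g == w)
                if not ok:
                    failures.append(f"{name}: got {g!r}, expected {w!r}")
                    break
    return failures  # empty list == validated
```

The point isn't this particular helper; it's that validation runs before timing and produces an artifact (the failure list) you can inspect.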
No warm-up or repetition protocol
Without a defined protocol, performance measurement is just noise measurement. The first run on a cold system tells you about storage and buffer behavior, not the query engine. Benchmark specifications address this: TPC-H defines power run sequences and a composite metric (QphH@Size) [3]; SSB's original paper reports geometric means across multiple runs [4].
This is where the two agents diverged interestingly. Claude's script ran each query 3 times and reported best/median/average, better than a single run but with no dedicated warm-up phase. The first measured run includes cold-cache effects mixed into the "best" timing. Codex's script included an explicit --warmup parameter (default 1 run) before measured runs, a methodological improvement.
Neither agent addressed the deeper issue: without a defined protocol, which number is "the result"? Best-of-3? Median? Geometric mean? The choice matters, and the benchmark spec makes it for you. Agents pick whatever feels reasonable.
Hardcoded filter values
TPC-H defines each query as a template with substitution parameters (Q1 has a date parameter, Q6 a discount range), precisely to prevent query result caching from dominating measurements [1]. SSB's 13 queries use fixed filter values in the original spec, but the principle still applies: if you run identical SQL on every iteration, the database can return memoized results and you're timing cache lookup, not query processing.
Both agents hardcoded every filter value. Claude's Q1.1 filters on `d_year = 1993`, `lo_discount BETWEEN 1 AND 3`, `lo_quantity < 25`; every run, identical. Codex's queries use the same fixed values. Neither agent varied filters across runs, discussed cache effects, or acknowledged the limitation. For SSB this is defensible if disclosed; for TPC-H it violates the spec. Neither agent made the distinction.
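TPC-H's substitution rules make the fix mechanical. A sketch of what the agents skipped, using Q6's parameter domains as I understand them from the spec (year in 1993-1997, discount around 0.02-0.09, quantity 24 or 25); each run formats the template with fresh draws so identical SQL never repeats:

```python
import random

# TPC-H Q6 template with substitution parameters instead of hardcoded values.
Q6_TEMPLATE = """
SELECT sum(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= DATE '{year}-01-01'
  AND l_shipdate < DATE '{next_year}-01-01'
  AND l_discount BETWEEN {discount:.2f} - 0.01 AND {discount:.2f} + 0.01
  AND l_quantity < {quantity}
"""

def q6_instance(rng):
    """Draw fresh Q6 parameters so the engine can't serve memoized results."""
    year = rng.randint(1993, 1997)
    return Q6_TEMPLATE.format(
        year=year,
        next_year=year + 1,
        discount=rng.uniform(0.02, 0.09),
        quantity=rng.choice([24, 25]),
    )
```

Seeding the generator keeps the run reproducible while still varying filters across iterations, which is the property the spec is after.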
The rest of the catalog
Beyond the big four, these failures appear frequently enough to mention:
Inventing scale factors. Agent uses SF-5, SF-25, or other non-standard values. Can't validate against reference data; can't compare to published results. Data generators allow arbitrary SFs, but defined benchmark sizes (1, 10, 30, 100, etc.) [2] have specific validation data.
Mixing benchmarks. Agent references "Query 23" in a TPC-H context, or uses SSB filters in TPC-DS queries. Different benchmarks with different schemas and protocols. Agents synthesize from all benchmark-related training data simultaneously.
Ignoring isolation requirements. Agent doesn't set or validate isolation. Violates TPC-H/TPC-DS isolation requirements and undermines ACID assumptions [3].
Platform-blind scale factors. Agent uses SF-1 for a distributed system or SF-1000 for a laptop. SF-1 fits entirely in RAM on most modern hardware; you're testing cache behavior, not the query engine.
Reporting elapsed time instead of the defined metric. Agent reports "45.2 seconds total" as the result. Each benchmark defines its performance metric: TPC-H uses QphH@Size (composite of Power and Throughput) [3]; SSB uses geometric mean across flights. Elapsed time isn't a benchmark metric; it's a wall-clock sum that depends on query execution order.
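The difference is easy to see with toy numbers: a wall-clock sum is dominated by the slowest query, while SSB-style geometric-mean reporting weights every query equally. The timings below are invented purely for illustration.

```python
import statistics

# Invented per-query timings, in milliseconds.
times_ms = {"Q1.1": 12.0, "Q2.1": 45.0, "Q3.1": 400.0, "Q4.1": 30.0}

total = sum(times_ms.values())                      # 487.0 ms, dominated by Q3.1
geo = statistics.geometric_mean(times_ms.values())  # ~50 ms, each query weighted equally

# Halving Q3.1 alone would cut the sum by ~41% but the geometric
# mean by only ~16%: the sum mostly measures one query.
```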
Why "it runs" isn't enough
A benchmark isn't a program; it's a measurement protocol. Think of the difference between "a thermometer that displays a number" and "a calibrated instrument that accurately measures temperature." The display showing "72.3°F" doesn't mean the room is 72.3 degrees; it just means the device produced output. What makes the reading mean something is calibration, placement, equilibration time, and environmental controls.
An agent that skips result validation, uses random data, hardcodes query filters, and reports raw elapsed times hasn't produced a wrong benchmark result; it hasn't produced a benchmark result at all, just a number that happens to have units of seconds attached.
So what do you do about it?
None of this argues against using AI agents for database evaluation. It argues for understanding where the failure boundary is, and for acting on it before you trust any agent-generated performance claim.
Here's my immediate advice, applicable whether or not you use BenchBox:
Before trusting any AI-generated benchmark result, verify these five things:
1. Data origin: Was the benchmark's official data generator used (`dbgen` for TPC-H, `ssb-dbgen` for SSB, `dsgen` for TPC-DS)? If the agent generated data with Python, Faker, or any random generation, the results are invalid, full stop.
2. Result validation: Were query results checked against expected outputs? If not, you don't know whether the queries computed correctly. Speed without correctness is meaningless.
3. Filter variation: Were query filter values varied across runs (required for TPC-H/TPC-DS, good practice for SSB)? If the same values were used every time, cache effects dominate and the results are suspect.
4. Repetition and variance: How many runs? What's the standard deviation? A single data point is not a measurement. Demand at least 3 runs with reported variance.
5. Metric calculation: Is the result reported using the benchmark's defined metric (QphH@Size for TPC-H, Power@Size for TPC-DS) or just "total seconds"? The latter isn't a benchmark metric; it's a wall-clock number that depends on query execution order.
If any of these checks fail, the "benchmark results" aren't benchmark results. They're numbers the agent printed. Treat them like any unverified claim.
Here's how I operationalize those checks; three of them need concrete verification beyond asking the agent:
Data source: Look for data generator output files on disk. Check row counts against expected cardinalities: at SF-1, SSB's LINEORDER should have ~6 million rows, and TPC-H's LINEITEM ~6 million at SF-1 or ~60 million at SF-10.
Timing variance: If all runs are identical to the millisecond, something is wrong. If runs 2 and 3 are 10x faster than run 1, you're measuring cache warm-up. Real variance on a stable benchmark run: standard deviation of ~5ms on queries averaging 11ms.
Scale: If the dataset fits in RAM, you're benchmarking your memory subsystem, not the query engine.
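The data-source check can be scripted rather than eyeballed. A sketch assuming SSB's published SF-1 table sizes (LINEORDER ~6,000,000, CUSTOMER 30,000, SUPPLIER 2,000, PART 200,000; treat these figures and the 5% tolerance as my assumptions, and check them against the spec for your scale factor):

```python
# Approximate SSB SF-1 cardinalities, per the benchmark's specification.
EXPECTED_SF1 = {
    "lineorder": 6_000_000,
    "customer": 30_000,
    "supplier": 2_000,
    "part": 200_000,
}

def check_cardinalities(execute, expected, tolerance=0.05):
    """Flag tables whose row counts stray from the benchmark's spec.

    `execute(sql)` returns rows; works with sqlite3, duckdb, or any
    similar DB-API-style wrapper.
    """
    problems = []
    for table, want in expected.items():
        (count,) = execute(f"SELECT count(*) FROM {table}")[0]
        if abs(count - want) / want > tolerance:
            problems.append(f"{table}: {count:,} rows, expected ~{want:,}")
    return problems  # empty list == counts look plausible
```

A count check won't catch wrong distributions, but it catches the cruder fakes (and mis-set scale factors) in one query per table.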
Three rules for working with AI agents on benchmarks:
Plan first, execute second. Define your methodology before the agent touches anything: benchmark, scale factor, repetition count, which phases to run, success criteria. The agent's job is to execute your plan, not design one; give it a blank canvas and it'll generate plausible-looking methodology, but give it a specific plan and it follows it.
Use tools with guardrails. If a tool exists that encodes correct methodology (DuckDB's TPC-H extension, BenchBox's MCP server, any validated benchmark runner), use it. The DuckDB finding is clear: when agents had a three-line invocation that handled data generation correctly, they used it and produced valid data. When they were left to figure it out themselves, they invented something that looked right and wasn't. Tool constraints beat prompt instructions every time.
Verify artifacts, not prose. Don't ask the agent "did you use the official data generator?"; check for its output files on disk. Don't ask "did validation pass?"; look for the validation report. Agent-generated prose about methodology is unreliable. A script that says "using ssb-dbgen" in a comment but uses `random.randint()` in the implementation is exactly the kind of cosmetic compliance you'll miss if you take the comment at face value.
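The artifact check is scriptable too. A sketch that looks for per-table output files on disk instead of trusting a comment; it assumes ssb-dbgen's default one-`.tbl`-file-per-table layout, so adjust the names if your build differs:

```python
from pathlib import Path

SSB_TABLES = ["lineorder", "customer", "supplier", "part", "date"]

def missing_generator_output(data_dir):
    """Return the SSB tables with no .tbl file on disk.

    An empty result means the generator artifacts exist; it does not by
    itself prove the data inside them is spec-conformant.
    """
    root = Path(data_dir)
    return [t for t in SSB_TABLES if not (root / f"{t}.tbl").is_file()]
```

Pair this with the cardinality check above-the-line reasoning: files present and row counts plausible is weak evidence; files absent is conclusive.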
The two scripts tell the whole story: Claude's passed zero of my five checks and Codex's passed two, yet neither agent flagged what it got wrong, and both would have reported results with full confidence.
But the DuckDB finding points toward the fix. When the right tool exists and handles methodology correctly, agents use it. BenchBox's MCP server is built on exactly that principle: validated inputs, structured errors, and workflow constraints that make invalid methodology impossible to execute rather than merely inadvisable to attempt.
Footnotes
1. TPC-H Specification v3.0.1, Transaction Processing Performance Council, accessed 2026-02-02. Clause 2.4.1.3 (substitution parameters), Clause 2.3.1 and Appendix C (qualification database and reference answers), Clause 4.1.2.2 (SF=1 for qualification).
2. TPC-H Specification v3.0.1, TPC, accessed 2026-02-02. Clause 4.2.1.2 (DBGen data generation requirements), Clause 4.2.5.2 (authorized scale factors and LINEITEM cardinalities). SSB uses an analogous generator (ssb-dbgen) derived from TPC-H's dbgen with modified distributions for the star schema.
3. TPC-H Specification v3.0.1, TPC, accessed 2026-02-02. Clause 5.4.3 (composite QphH@Size metric from Power and Throughput), Clause 3.4 (isolation requirements).
4. P. O'Neil, E. O'Neil, X. Chen, S. Revilak, "Star Schema Benchmark" (2009). Defines the SSB data generator, 13 queries across 4 flights, and expected result characteristics for validation. The ssb-dbgen tool (github.com/electrum/ssb-dbgen) is the standard open-source implementation.


