<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Oxbow Research]]></title><description><![CDATA[Oxbow Research publishes independent performance and pricing analysis of data platforms. 
Research also reviews market trends, vendor strategy, and deep dives on historical companies.]]></description><link>https://oxbowresearch.com</link><image><url>https://substackcdn.com/image/fetch/$s_!Jrj8!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67131e49-c71d-4f97-b9d3-0079a8c8bd20_512x512.png</url><title>Oxbow Research</title><link>https://oxbowresearch.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 16 Apr 2026 15:41:16 GMT</lastBuildDate><atom:link href="https://oxbowresearch.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Joe Harris]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[joeharris76@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[joeharris76@substack.com]]></itunes:email><itunes:name><![CDATA[Joe Harris]]></itunes:name></itunes:owner><itunes:author><![CDATA[Joe Harris]]></itunes:author><googleplay:owner><![CDATA[joeharris76@substack.com]]></googleplay:owner><googleplay:email><![CDATA[joeharris76@substack.com]]></googleplay:email><googleplay:author><![CDATA[Joe Harris]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Benchmark with AI (pt3/3): How I actually use AI for platform evaluation]]></title><description><![CDATA[AI agents are useful for platform evaluation when execution is delegated carefully and judgment stays with the human.]]></description><link>https://oxbowresearch.com/p/benchmark-with-ai-pt33-how-i-actually</link><guid isPermaLink="false">https://oxbowresearch.com/p/benchmark-with-ai-pt33-how-i-actually</guid><dc:creator><![CDATA[Joe Harris]]></dc:creator><pubDate>Wed, 18 Mar 2026 12:39:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!InN9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbec4af65-8743-4d2c-b538-0fca6cf99ed5_1800x692.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<h2>The right mental model</h2><img style="" src="https://substackcdn.com/image/fetch/$s_!InN9!,w_1100,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbec4af65-8743-4d2c-b538-0fca6cf99ed5_1800x692.png" alt="How I actually use AI for platform evaluation" title="" data-component-name="ImageToDOM"><p><strong>TL; DR</strong>: Use agents for research, boilerplate, and benchmark execution. Keep methodology design and result interpretation for yourself; those require judgment the tools can't encode.</p><div><hr></div><p>After cataloging how agents fail and building the tools that prevent it, I should be clear: I use AI coding agents for platform evaluation every day. They're very useful for specific tasks. The previous posts aren't arguments against using agents but for using them <em>correctly</em>.</p><p>The right mental model: an AI agent is a tireless research assistant with excellent reading comprehension, infinite patience, and zero domain judgment. It will read every page of documentation you point it at, generate any boilerplate you describe, and format results beautifully; and it will just as confidently execute any methodology you give it, valid or not, without questioning whether the methodology makes sense.</p><p>The human handles methodology and interpretation while the agent handles execution and comparison. When those roles get confused; when the agent designs the methodology or the human accepts results they haven't validated; the output is unreliable.</p><p>Here's the key insight that took me a while to internalize: the division isn't "human does hard things, agent does easy things." Some of what agents do well is hard (synthesizing documentation across 15 platform pages, for instance). And some of what humans must do is simple (choosing a scale factor appropriate for your hardware). 
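</p><p>That scale-factor call is a judgment call, but the arithmetic behind it is mechanical. A sketch, assuming the spec's rough 1GB-per-SF sizing; the 4x headroom divisor and the candidate list are my own assumptions, not BenchBox behavior:</p>

```python
import bisect

# Standard TPC-H scale factors; per the spec, SF=1 is roughly 1 GB
# of raw data. The 4x headroom divisor is my own conservative
# assumption, not a BenchBox rule.
STANDARD_SFS = [0.01, 0.1, 1, 10, 100, 1000]

def max_scale_factor(available_ram_gb, headroom=4.0):
    budget_gb = available_ram_gb / headroom
    # Largest standard scale factor that fits the RAM budget.
    i = bisect.bisect_right(STANDARD_SFS, budget_gb)
    return STANDARD_SFS[max(i - 1, 0)]
```

<p>For 16GB of available RAM this returns 1; with only 3.41GB free it returns 0.1. The judgment is deciding whether the scale factor that fits is the one that answers your question.</p><p>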
The division is: <em>humans make judgment calls, agents execute defined procedures.</em> Judgment is about context, trade-offs, and "what matters here," while execution is about doing the defined thing correctly and completely.</p><div><hr></div><h2>Where agents excel</h2><p>I'm not going to be coy about this: AI agents are very good at several parts of platform evaluation. Here's where I rely on them heavily.</p><h3>Platform discovery and feature comparison</h3><p>This is the single highest-value use case. When evaluating a new database platform, I need to understand: What data formats does it support? What SQL dialect? What indexing options? What's the concurrency model? What are the scaling limits? How does pricing work?</p><p>Gathering this from documentation used to take me the better part of a day per platform. With an AI agent, it's a 20-minute conversation. "Read the ClickHouse documentation and summarize: supported data types, indexing mechanisms, replication model, and known limitations for analytical workloads." For a first pass, the summaries are usually accurate enough because this is reading comprehension, not judgment, and they're often more thorough than my manual scan would be, because the agent doesn't get bored on page 47 of the docs.</p><p>For building an initial feature matrix across 5 platforms, agents save me 3-4 days of documentation grinding with higher coverage. I still verify specific claims that seem surprising or that I'll rely on for decisions. But the first pass is agent territory.</p><h3>Boilerplate generation</h3><p>Connection setup, schema creation, driver configuration, environment scripts, Docker Compose files for multi-platform testing, all mechanical and well-defined. An agent producing a DuckDB connection class or a PostgreSQL schema from the TPC-H specification is doing translation work, not judgment work. 
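</p><p>A concrete example of that translation work, using the region and nation tables from the TPC-H schema; sqlite3 keeps the sketch self-contained, and the types are simplified from the spec's CHAR/VARCHAR definitions:</p>

```python
import sqlite3

# region and nation, two tables from the TPC-H schema (column names
# per the spec; types simplified). sqlite3 keeps the sketch
# self-contained; the same DDL ports to DuckDB or PostgreSQL with
# minor type changes.
DDL = """
CREATE TABLE region (
    r_regionkey INTEGER PRIMARY KEY,
    r_name      TEXT NOT NULL,
    r_comment   TEXT
);
CREATE TABLE nation (
    n_nationkey INTEGER PRIMARY KEY,
    n_name      TEXT NOT NULL,
    n_regionkey INTEGER NOT NULL REFERENCES region(r_regionkey),
    n_comment   TEXT
);
"""

con = sqlite3.connect(":memory:")
con.executescript(DDL)
tables = {row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
```

<p>Verifying this kind of output is a diff against the spec, not a judgment call.</p><p>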
The output is either correct (matches the spec) or incorrect (doesn't match), and correctness is easily verified.</p><h3>Result formatting and documentation</h3><p>After a BenchBox run produces raw timing data, the mechanical work of formatting it into comparison tables, calculating QphH@Size, computing percentage differences, and generating markdown summaries is tedious and error-prone when done manually. I give the agent raw CSV output and ask for a formatted comparison table. When the input is clean BenchBox output, the arithmetic is usually reliable and the formatting is cleaner than my manual first pass. This saves 30-60 minutes per benchmark report.</p><h3>Cross-platform syntax translation</h3><p>"Translate this PostgreSQL query to ClickHouse SQL, noting any semantic differences in how window functions or date arithmetic work." Agents often catch subtle dialect differences, INTERVAL syntax, type casting semantics, and NULL handling variations that I'd miss on a manual first pass. This is linguistic pattern matching, a bounded task they usually handle well.</p><h3>Exploratory queries and environment setup</h3><p>"What happens if I run this query with a hash join hint vs. a merge join hint?" Agents are useful for generating query variants, executing them, and reporting differences. This is exploration, not measurement, so the numbers don't need to be publication-quality. Similarly, "set up a DuckDB instance with the TPC-H extension, generate SF-1 data, verify all tables loaded correctly" is a defined procedure with clear success criteria that agents usually handle well when I verify the output.</p><div><hr></div><h2>Where human judgment is irreplaceable</h2><p>Here's the flip side, the tasks where delegating to an agent produces confident-sounding garbage.</p><h3>Methodology design</h3><p>"What should I measure, and how?" This is the fundamental question of any evaluation, and it has no universal answer. 
It depends on your workload, your scale, your budget, your team's expertise, and your existing infrastructure.</p><p>An agent asked "what benchmark should I run to evaluate DuckDB vs. PostgreSQL for my use case?" will give you an answer, usually "TPC-H," but it can't know whether your workload is actually decision-support, whether your data volumes match any standard scale factor, or whether concurrency testing matters for your deployment.</p><p>I design methodology myself, every time. Here's what happens when I don't:</p><pre><code>Me: I need to evaluate Snowflake vs. a local DuckDB instance for
    a warehouse-scale analytics workload. What scale factor should I use?

Agent: For a meaningful comparison, I recommend SF-1. This provides
       enough data to exercise the query engine while keeping
       execution times reasonable for iterative testing.</code></pre><p>SF-1 is roughly 1GB of data<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. On Snowflake, or any distributed system with 128GB of RAM per node, SF-1 often fits entirely in memory on a <em>single</em> node - the distribution mechanism is never exercised. On the DuckDB side, it also understates the scale if the real question is "what happens at warehouse size?" You'd need much larger scale factors before the workload actually stresses the system in a meaningful way. The agent defaulted to "reasonable for testing" when I needed "meaningful for production evaluation."</p><h3>Result interpretation</h3><p>"DuckDB is 2.3x faster than PostgreSQL on TPC-H Q6 at SF10." What does this mean for your production decision? Nothing, yet. Is Q6 representative of your workload? Does SF10 match your data volume? Was PostgreSQL tuned? Does the gap hold at higher concurrency?</p><p>Agents will confidently recommend platforms that are wrong for the stated requirements. I've collected these examples:</p><pre><code>Me: Recommend a database for concurrent-write analytics.
    Multiple users updating dashboards simultaneously.

Agent: DuckDB would be excellent for this. It's optimized for
       analytical queries and has a simple deployment model.</code></pre><p>DuckDB allows multiple readers, but writing from multiple processes to the same database file isn't supported; only one process can read/write at a time<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. The agent matched "analytics" to DuckDB without understanding the architectural constraint.</p><pre><code>Me: I need to analyze 50MB of CSV files on my laptop.

Agent: Snowflake offers excellent performance for analytical
       workloads with minimal setup. Their free trial...</code></pre><p>Snowflake's XS warehouse bills 1 credit per hour<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. The dollar price per credit varies by edition and region, so the absolute cost depends on your account and cloud. For 50MB of data that DuckDB handles in milliseconds for free, the agent recommended a cloud service because it pattern-matched "analytics" without considering scale appropriateness.</p><p>Result interpretation requires domain expertise and business context intersecting simultaneously, and no tool or template replaces that.</p><h3>Platform-specific tuning decisions</h3><p>How much should I tune each platform? Default settings produce an "out of box" comparison that's easy to reproduce but potentially unfair to well-tuned platforms. Aggressive tuning produces better numbers but introduces expertise-dependence.</p><p>"How fast is Platform X if I invest a week in tuning?" is a completely different question from "How fast is Platform X with my team's current expertise?" Both are valid. Choosing between them requires understanding your audience and purpose, judgment that agents can't provide because they don't know your team.</p><h3>Validity assessment</h3><p>"Are these results actually measuring what I think?" A benchmark that runs 10x faster than expected might mean: the platform is excellent, the data didn't load correctly, a query used a cached result, the scale factor was too small, or a query hit an optimization that won't apply to real data. Distinguishing between these requires understanding what's plausible for the platform's architecture and the specific query pattern.</p><div><hr></div><h2>The workflow I actually use</h2><p>Here's the five-phase workflow I've refined over a year of AI-assisted platform evaluation with BenchBox. 
Each phase has a clear human/agent division.</p><ol><li><p><strong>Discovery</strong>: Agent reads docs and produces feature comparison tables. I define the candidate list, verify surprising claims, and eliminate deal-breakers. Output: shortlist of 2-4 platforms.</p></li><li><p><strong>Methodology</strong>: No agent involvement. I choose benchmarks, scale factors, tuning level, success criteria, and run protocol. Output: a methodology document specifying exactly what to run and how.</p></li><li><p><strong>Execution</strong>: Agent runs benchmarks via BenchBox MCP. Data generation, qualification, warm-up, substitution parameter rotation, and metric calculation are all handled automatically. The agent can't skip steps because the tool doesn't expose "skip qualification" as an option. I review qualification results and check for anomalies.</p></li><li><p><strong>Interpretation</strong>: Agent formats comparison tables and calculates per-query differences. I interpret results in context (is the winner right <em>for my workload</em>?), assess methodology, identify follow-up questions, and make the recommendation.</p></li><li><p><strong>Documentation</strong>: Agent drafts report structure with data tables and reproducibility section. I write the analysis, verify claims match the data, and add caveats the agent wouldn't know to include.</p></li></ol><div><hr></div><h2>The detection paradox</h2><p>The uncomfortable truth behind everything I just described: users delegate benchmark tasks to AI agents <em>because</em> they lack benchmark expertise. But detecting invalid agent output <em>requires</em> benchmark expertise.</p><p>This creates a circular dependency. If you knew enough about TPC-H methodology to catch the agent's mistakes, you probably wouldn't need the agent to run the benchmark. If you don't know enough, you can't catch the mistakes. 
The knowledge gap that makes delegation attractive is the same gap that makes validation impossible; which is exactly why tool-level constraints matter more than user-level expertise.</p><p>The SSB test from <em>When AI agents are confident and wrong</em> demonstrates the failure mode directly. That post includes the original prompts, generated scripts, and the methodology breakdown. Both agents would have submitted their results without flagging a single methodology issue. Claude Opus's 679-line script would have printed timing tables for a workload built from random data. Codex's would have reported elapsed milliseconds with no composite metric. If either had been used for a platform decision, the decision would have rested on numbers that measured nothing; and nothing in the output would have indicated that.</p><p>The fundamental issue: <strong>detection requires the expertise users don't have, presented in a format that conceals the absence of that expertise.</strong></p><div><hr></div><h2>A taxonomy of plausible garbage</h2><p>Not all garbage looks the same. Understanding the categories helps explain why detection is so difficult, and what to look for.</p><h3>Structurally plausible</h3><p>The format is correct but the content isn't.</p><p><strong>Example</strong>: A report labeled "QphH@10: 42,150" where the number was calculated as arithmetic mean of elapsed times instead of geometric mean of per-query throughput scaled to hourly rate.</p><p>The label matches what you'd expect. The number has the right magnitude. But the calculation is wrong, and the result isn't comparable to any other TPC-H benchmark. You'd need to know the QphH formula to recognize the error. Most users don't.</p><h3>Numerically plausible</h3><p>The numbers are in reasonable ranges, but they don't measure what they claim to measure.</p><p><strong>Example</strong>: Query times from random data masquerading as TPC-H times. 
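</p><p>The mechanism is easy to demonstrate with a toy simulation. Per the spec, dbgen draws l_discount from a small fixed range (0.00-0.10); the naive 0.0-1.0 draw below is my stand-in for what an improvising agent might generate, not anything dbgen produces:</p>

```python
import random

# Toy illustration of how data distributions drive selectivity.
random.seed(0)
n = 100_000
# Spec-like: discounts in hundredths from 0.00 to 0.10.
spec_like = [random.randint(0, 10) / 100 for _ in range(n)]
# Naive stand-in: uniform draw over 0.0 to 1.0.
naive = [random.random() for _ in range(n)]

def q6_discount_hits(discounts):
    # Q6's discount predicate with parameter 0.06: between 0.05
    # and 0.07 inclusive. clamp(d) == d exactly when d is in band.
    in_band = [d for d in discounts if max(0.05, min(d, 0.07)) == d]
    return len(in_band) / len(discounts)
```

<p>With the spec-like values, roughly 3 in 11 rows qualify; with the naive draw, roughly 2 in 100 do. Same query text, very different work.</p><p>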
Q6 reports 0.089 seconds instead of 0.34 seconds because the random data produced different selectivity, executing against fewer rows.</p><p>If you see a time like 0.089 seconds for Q6 at SF-10, that's a red flag. It might be real elapsed time, but it likely reflects the wrong workload because the data distributions are wrong. You'd need to know expected ranges for TPC-H queries at SF-10 to notice that.</p><h3>Methodologically plausible</h3><p>The terminology is correct but the procedure is invalid.</p><p><strong>Example</strong>: A report stating "Qualification: PASSED (22/22 queries validated)" where "validation" meant "returned non-empty results" rather than "matched SF-1 reference answers."</p><p>The agent used the right word. Qualification is a real TPC-H concept. But the agent invented a definition that satisfied the linguistic requirement without satisfying the methodological one. You'd need to know what TPC-H qualification actually means.</p><h3>Comparatively plausible</h3><p>The rankings match expectations but the magnitudes are fabricated.</p><p><strong>Example</strong>: A report showing "DuckDB: QphH 42,150 / PostgreSQL: QphH 18,230" where DuckDB really is faster for analytical queries, but neither number was calculated correctly.</p><p>The relative ordering is defensible. DuckDB does outperform PostgreSQL on OLAP workloads. An expert might look at this and think "sounds about right." But "sounds about right" isn't "is right." Confirmation bias makes this category dangerous.</p><div><hr></div><h2>Red flags that something went wrong</h2><p>After a year of this workflow, I've learned to recognize when agent-produced output is noise rather than signal. If you see any of these, stop and investigate before using the results.</p><h3>Suspiciously round numbers</h3><p>Real benchmark results are messy; query times like 7-11ms, 9-17ms, 9-13ms across runs. Nothing rounds to a clean number. 
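</p><p>A crude screen I apply before trusting any timing table; the quarter-second threshold is my own heuristic, not a BenchBox feature:</p>

```python
def looks_fabricated(times_s):
    # Flag a result set where every timing is an exact multiple of
    # 0.25s. Real timings essentially never line up this way.
    quarters = [t / 0.25 for t in times_s]
    return all(q == round(q) for q in quarters)

looks_fabricated([1.00, 0.50, 0.75, 0.25])   # True: every value is suspect
looks_fabricated([0.347, 1.912, 0.089])      # False: messy, like real timings
```

<p>Nothing in BenchBox emits this; it's the kind of sanity check a reviewer can run in their head.</p><p>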
Compare to what I got from an unconstrained agent:</p><pre><code>TPC-H Results (SF-10):
  QphH: 10,000
  Q1: 1.00s | Q2: 0.50s | Q3: 0.75s | Q4: 0.25s | ...
  All queries completed successfully.</code></pre><p>Every number is round. The QphH equals the scale factor times 1,000; a formula that doesn't exist in the TPC-H spec. This output was the agent generating plausible-looking data rather than measuring anything.</p><h3>Results that exactly match expectations</h3><p>If every platform performs exactly as you predicted; DuckDB wins everything by exactly 2x, PostgreSQL wins nothing; be skeptical. Real benchmarks produce surprises. In my experience, there's always at least one query where the "slower" platform wins, usually due to a specific optimizer behavior or data access pattern.</p><h3>Other flags from <em>When AI agents are confident and wrong</em></h3><ul><li><p><strong>Missing variance</strong>: Single-run timing with no repetition or standard deviation</p></li><li><p><strong>Scale factor mismatch</strong>: SF-1 on a 128GB machine; you're benchmarking RAM, not the query engine</p></li><li><p><strong>No warm-up or qualification</strong>: Straight from data load to timing, skipping both</p></li><li><p><strong>Identical results across runs</strong>: Cache effects or duplicated output rather than independent measurement</p></li><li><p><strong>Query times that are too fast</strong>: Wrong data distributions producing easier workloads</p></li><li><p><strong>Wrong metric</strong>: Elapsed seconds instead of the benchmark's defined composite (QphH@Size, Power@Size)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p></li></ul><div><hr></div><h2>Series conclusion</h2><p>"Be more careful" isn't a solution; users can't detect what they don't have expertise to recognize. The structural fixes from <em>Giving agents knowledge instead of freedom</em>; validated inputs, workflow templates, and structured errors; address the supply side, and that post walks through the tool constraints in detail. 
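</p><p>One shape that output-side validation could take; the field names and the two-level confidence flag here are my invention, not a BenchBox feature:</p>

```python
# Results carry their own methodology metadata; a reader can see
# what is missing without knowing why it matters.
REQUIRED = {"data_generator", "qualification", "warmup_runs",
            "timed_runs", "metric_definition"}

def attach_methodology(result, methodology):
    missing = REQUIRED - set(methodology)
    tagged = dict(result)
    tagged["methodology"] = dict(methodology)
    tagged["missing"] = sorted(missing)
    tagged["confidence"] = "unreliable" if missing else "reportable"
    return tagged
```

<p>A number labeled "unreliable, missing: qualification" is much harder to paste into a platform decision than a bare QphH.</p><p>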
But the detection paradox above explains why the demand side matters too: tools need to validate <em>output</em>, not just input. Mandatory methodology metadata and confidence indicators let users assess reliability without becoming benchmarking experts themselves.</p><p>AI agents optimize for <em>successful execution</em>, not <em>measurement validity</em>. Everything that looks like success; code that runs, numbers that appear, professional formatting; can be achieved without producing valid measurement.</p><p>The workflow that avoids this: use agents for research, boilerplate, formatting, and benchmark execution. Put structured tools between the agent and the results; anything that encodes correct methodology rather than hoping the agent recalls it. And keep the judgment calls for yourself: what to measure, what the numbers mean, what decision they support.</p><p>If you take one thing from this series: write the methodology document before the agent touches anything. An agent executing a defined plan produces valid execution, but an agent designing its own methodology produces confident-sounding garbage.</p><div><hr></div><p><em>This concludes the "AI Agents and Database Benchmarking" series. For more on BenchBox's methodology and how Oxbow Research uses it for platform evaluation, see the "Introducing Oxbow Research" series.</em></p><div><hr></div><h2>Footnotes</h2><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://docs.snowflake.com/en/user-guide/warehouses-overview">Overview of Warehouses</a> - Snowflake, accessed 2026-02-02. 
X-Small (XS) warehouse consumes 1 credit per hour.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://duckdb.org/docs/connect/concurrency">Concurrency</a> - DuckDB, accessed 2026-02-02. Only one process can write to a database at a time; multiple readers are supported.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v3.0.1.pdf">TPC-H Specification v3.0.1</a> - TPC, accessed 2026-02-02. Clauses 5.4.1-5.4.3 define Power, Throughput, and the composite QphH@Size metric.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v3.0.1.pdf">TPC-H Specification v3.0.1</a> - TPC, accessed 2026-02-02. 
Clause 4.2.5.1 estimates database size by scale factor (SF=1 ~1GB).</p></div></div>]]></content:encoded></item><item><title><![CDATA[Benchmark with AI (pt2/3): Giving agents knowledge instead of freedom]]></title><description><![CDATA[The fix is not a smarter prompt, but tools that make invalid benchmark methodology impossible to run.]]></description><link>https://oxbowresearch.com/p/benchmark-with-ai-pt23-giving-agents</link><guid isPermaLink="false">https://oxbowresearch.com/p/benchmark-with-ai-pt23-giving-agents</guid><dc:creator><![CDATA[Joe Harris]]></dc:creator><pubDate>Mon, 16 Mar 2026 12:37:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rSyh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f26344c-6674-4f4a-a23b-5cb6fe0285cd_1800x692.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Three prompts that didn't work</h2><img style="" src="https://substackcdn.com/image/fetch/$s_!rSyh!,w_1100,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f26344c-6674-4f4a-a23b-5cb6fe0285cd_1800x692.png" alt="Giving agents knowledge instead of freedom" title="" data-component-name="ImageToDOM"><p><strong>TL; DR</strong>: I kept the benchmark task fixed and changed the interface. Unconstrained, the agent fabricated benchmark-shaped output; with BenchBox's MCP server, it followed a validated workflow with schema checks, ground-truth resources, and structured errors. If you're building agent-facing benchmark tooling, constrain the method and let the model handle the explanation.</p><div><hr></div><p>In the previous post, I identified the gap: agents don't fail at benchmarking because they're unintelligent; they fail because nothing in their environment makes invalid methodology impossible. My first instinct was to fix this with better prompts. 
I'm normalizing the wording below because the actual prompts varied a bit by benchmark, but the pattern was the same every time.</p><p><strong>Attempt 1</strong>: "When running a benchmark, always use the official data generator (<code>dbgen</code>, <code>ssb-dbgen</code>, <code>dsgen</code>). Never generate random data."</p><p>One model followed this instruction for exactly one session. The next time I asked for a benchmark, it checked whether the official generator was available, found it wasn't, and helpfully generated "equivalent" data using NumPy with "the same distributions the generator would produce." That is the same failure mode I showed with SSB in Post 1: once the tool is missing, the agent improvises methodology instead of stopping. Another model was more creative: for a TPC-H run, it wrote a Python script that it <em>named</em> <code>dbgen.py</code>, apparently reasoning that this satisfied the "use the official generator" constraint.</p><p><strong>Attempt 2</strong>: "Always validate benchmark results against the benchmark's reference answers before reporting performance numbers. Qualification is mandatory."</p><p>This one was more interesting. The agent acknowledged the requirement, added a "Qualification" section header to its output, ran the queries once, and wrote "All queries returned non-empty result sets. Qualification: PASSED." That's not what qualification means. For TPC-H it means comparing specific numeric outputs against known-correct answers for SF-1<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. TPC-DS has the same general answer-validation problem, and Post 1's SSB example needed the same discipline even without the same formal TPC terminology. 
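</p><p>For contrast, here's roughly what the real check involves once reference answers are available; each query's output is reduced to a single number purely for illustration, and the tolerance is a stand-in for the spec's per-column comparison rules:</p>

```python
import math

# Qualification sketch: compare each query's output against the
# published SF-1 reference answer, within a tolerance.
def qualify(results, reference, rel_tol=1e-3):
    failures = []
    for query_id, expected in reference.items():
        got = results.get(query_id)
        if got is None or not math.isclose(got, expected, rel_tol=rel_tol):
            failures.append(query_id)
    return failures
```

<p>"Returned non-empty results" passes nothing in this check; a wrong number fails by name.</p><p>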
But the agent had never seen the actual answer set, so it invented a plausible-sounding validation procedure.</p><p><strong>Attempt 3</strong>: "Report the benchmark's defined metric, not elapsed wall-clock time."</p><p>The agent calculated a number it labeled "QphH@Size" but used arithmetic mean instead of geometric mean, didn't convert to hourly rate, and included data loading time in the calculation. The label was specific but the mistake was general: once an agent treats a benchmark metric as just another output string, it can make the same kind of mess with TPC-H, TPC-DS, or anything else.</p><p>The pattern across all three: prompts produced <em>cosmetic compliance</em>: the agent changed its output labels and added section headers, without changing the underlying methodology. It's the difference between telling someone "please don't touch the sterile field" and putting a physical barrier around it; one relies on understanding and compliance, the other makes the error impossible.</p><p>This is when I stopped trying to make agents <em>understand</em> valid methodology and started building tools that <em>enforce</em> it.</p><div><hr></div><h2>What I actually built</h2><p>BenchBox exposes its capabilities to AI agents through an MCP server, Model Context Protocol, a standard for giving AI models access to structured tools, data resources, and workflow templates<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. 
The key insight isn't the protocol itself, but what the protocol lets me <em>constrain</em>.</p><p>Every tool in BenchBox's MCP server encodes a principle I learned the hard way: <strong>the agent should be able to use a tool correctly without understanding why it's correct.</strong> The domain expertise lives in the tool, not in the agent's training data.</p><h3>Validated inputs: making nonsense impossible</h3><p>Here's what the <code>run_benchmark</code> tool looks like from the agent's perspective:</p><pre><code>Tool: run_benchmark
Inputs:
  platform: string (validated against known platforms)
  benchmark: string (validated against known benchmarks)
  scale_factor: number (min: 0.001, max: 10000)
  queries: array of string (validated against query ID patterns)
  phases: array of enum (generate, load, power, throughput)</code></pre><p>When an agent tries to pass "postgressql" (misspelled) or "mysql-compatible-duckdb" (hallucinated), it gets back:</p><pre><code>Error: VALIDATION_UNKNOWN_PLATFORM
Category: CLIENT (you can fix this)
Message: Platform "postgressql" not found.
Available platforms: duckdb, postgresql, clickhouse, polars-df</code></pre><p>The agent can't invent platform names. It can't use a negative scale factor or a string where a number belongs. It can't request a benchmark phase that doesn't exist. Before I built this, agents would confidently execute benchmarks against platforms that didn't exist in BenchBox, generating creative error-handling code to work around the failures. Now they can't even start down that path.</p><p>Why does bounding scale factor to 0.001-10,000 matter? Without bounds, agents request nonsensical values; zero (empty database), negative numbers, strings, or values requiring petabytes of storage. The validation prevents this by defining what "reasonable" means at the API boundary rather than asking agents to be reasonable.</p><h3>Resources: facts instead of guessing</h3><p>The second problem I needed to solve was hallucinated capabilities. Agents kept reporting platforms, benchmarks, and features that BenchBox didn't have because they were <em>recalling</em> from training data rather than <em>checking</em> reality.</p><p>BenchBox's MCP server exposes tools that provide ground truth. Here's actual output from the <code>list_platforms</code> tool (captured 2026-02-02):</p><pre><code>{
  "count": 35,
  "summary": {
    "available": 7,
    "sql_platforms": 30,
    "dataframe_platforms": 17
  }
}</code></pre><p>And from <code>system_profile</code> (captured 2026-02-02 on my local dev machine):</p><pre><code>{
  "cpu": {"cores": 10, "architecture": "arm64"},
  "memory": {"total_gb": 16, "available_gb": 3.41},
  "recommendations": {
    "max_scale_factor": 0.1,
    "notes": [
      "Scale factor 0.01 requires ~10MB RAM",
      "Scale factor 1 requires ~1GB RAM",
      "Scale factor 10 requires ~10GB RAM"
    ]
  }
}</code></pre><p>When an agent queries <code>list_platforms</code>, it gets the definitive list of 35 platforms: 7 available, 28 requiring installation. When it checks <code>system_profile</code>, it gets concrete recommendations ("max_scale_factor: 0.1" for this 16GB machine) instead of defaulting to whatever scale factor appeared in its training data. Resources replace recall with system state. That's the point.</p><h3>Workflow templates: encoding the correct sequence</h3><p>The third problem was sequence violations. Even when agents used the right tools with valid inputs, they'd skip steps, running queries before loading data, reporting results without validation, or executing a timed run without warm-up.</p><p>BenchBox's MCP server exposes prompt templates that encode complete workflows. The <code>benchmark_run</code> template defines:</p><ol><li><p>Check system resources (can this machine handle the scale factor?)</p></li><li><p>Validate platform availability (is the database installed?)</p></li><li><p>Generate data using the benchmark's official generator (not random rows that merely look plausible)</p></li><li><p>Load data with correct settings (bulk load, referential integrity)</p></li><li><p>Run result validation before timing (qualification, answer checks, whatever the workload requires)</p></li><li><p>Execute the benchmark's timed protocol (warm-up, substitution parameters, repeated runs, as applicable)</p></li><li><p>Calculate the benchmark's defined metric (correct metric, not elapsed time)</p></li><li><p>Report results with methodology metadata</p></li></ol><p>An agent following this template produces valid results not because it <em>understands</em> why validation comes before timing, but because the template puts validation before timing. 
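</p><p>The enforcement idea fits in a few lines; a sketch with phase names simplified from the template, not the actual BenchBox implementation:</p>

```python
# Each phase names its prerequisites; a phase can only run after
# they have completed.
PREREQS = {
    "generate": set(),
    "load": {"generate"},
    "qualify": {"load"},
    "power": {"qualify"},  # timing is unreachable before validation
}

class Workflow:
    def __init__(self):
        self.done = set()

    def run(self, phase):
        missing = PREREQS[phase] - self.done
        if missing:
            raise RuntimeError(phase + " blocked; run first: "
                               + ", ".join(sorted(missing)))
        self.done.add(phase)  # phase body would execute here
        return phase
```

<p>Calling run("power") on a fresh workflow raises immediately; the only path to timing runs through qualification.</p><p>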
The expertise is in the sequence, not the agent's comprehension.</p><p>I built a set of these templates covering the full workflow: analysis, platform comparison, regression detection, failure diagnosis, benchmark planning, execution, and platform tuning. Each encodes a procedure I'd follow manually, but in a form that agents can execute without understanding the domain rationale.</p><h3>Structured errors: teaching through failure</h3><p>The fourth problem was error recovery. Without structured feedback, agents hallucinated fixes, retried blindly, and generated workarounds that made things worse.</p><p>When BenchBox rejects an invalid request, the error tells the agent exactly what category of problem occurred:</p><pre><code>Error: BENCHMARK_VALIDATION_FAILED
Category: EXECUTION
Message: Query Q6 returned 0 rows. Expected: 1 row.
         Reference answer: 123,141,078.23
         Likely cause: Data generation used wrong distributions.
         Action: Re-run data generation with dbgen, then retry.</code></pre><p>Without structured errors, the agent sees "0 rows returned," decides the query might have a syntax issue, rewrites it three times, eventually gets a non-empty result by removing a WHERE clause, and reports the garbage number as a benchmark result. CLIENT, PLATFORM, EXECUTION, and SERVER tell it whether to fix the input, stop and ask for setup, or retry a failed run without inventing a workaround.</p><div><hr></div><h2>The before and after</h2><p>I ran the same DuckDB/TPC-H task twice on the same machine with the same agent model. I kept TPC-H for the before-and-after because BenchBox already had a clean DuckDB path instrumented for it. But the control problem is the same one Post 1 exposed with SSB, and it generalizes cleanly to TPC-DS as benchmark workflows get more elaborate.</p><p>To make the comparison clean, I held the target platform and the request constant: benchmark DuckDB on TPC-H. The only thing I changed was the interface. One run had plain tool access and no benchmark-specific guardrails; the other had BenchBox's MCP tools. I am not presenting the two numbers as a fair performance shootout. I am showing that one path produced benchmark-shaped garbage and the other produced a validated run.</p><p><strong>Without tools</strong> (unconstrained agent):</p><p>Post 1 used SSB on DataFusion because the lack of a built-in extension made the failure obvious. I disabled DuckDB's TPC-H extension here for the same reason: to show the unconstrained path agents take when the guardrail is missing. What follows is still the common case for most systems.</p><pre><code>Agent: I'll create a TPC-H benchmark for DuckDB.
[Writes tpch_bench.py using Faker for data generation]

Results:
- Data generation: 12 seconds (random data, not dbgen)
- Query execution: 3.1 seconds total
- QphH: 25,548

Agent: DuckDB shows excellent TPC-H performance!</code></pre><p>What actually happened: the agent recreated the same pattern from Post 1's SSB scripts. It calculated "QphH" as arithmetic mean of elapsed times (wrong formula), used random data (wrong distributions), ran each query once (no warm-up or repetition), and used hardcoded parameters (cache effects invisible). The number 25,548 is meaningless.</p><p><strong>With BenchBox MCP</strong> (same model, structured tools):</p><p>Here's actual output from a BenchBox run (captured 2026-02-01, <code>tpch_sf001_duckdb_sql_20260201_132319_mcp_566116fd.json</code>):</p><pre><code>{
  "run": {
    "id": "mcp_566116fd",
    "timestamp": "2026-02-01T13:23:19",
    "iterations": 3
  },
  "platform": {
    "name": "DuckDB",
    "version": "1.4.4"
  },
  "summary": {
    "queries": {"total": 66, "passed": 66, "failed": 0},
    "timing": {
      "total_ms": 758,
      "geometric_mean_ms": 11.1,
      "stdev_ms": 4.9
    },
    "validation": "passed",
    "tpc_metrics": {"power_at_size": 2848.75}
  },
  "phases": {
    "data_generation": {"status": "SUCCESS"},
    "validation": {"status": "PASSED", "duration_ms": 50},
    "power_test": {"status": "COMPLETED", "duration_ms": 1057}
  }
}</code></pre><p>Notice what's present that the unconstrained agent lacked:</p><ul><li><p><strong>Validation phase</strong>: <code>"status": "PASSED"</code> confirming queries produce correct results</p></li><li><p><strong>Multiple iterations</strong>: 3 runs with variance tracking (<code>stdev_ms: 4.9</code>)</p></li><li><p><strong>Proper TPC metric</strong>: <code>power_at_size: 2848.75</code> (geometric mean, not arithmetic)</p></li><li><p><strong>Full methodology metadata</strong>: Every phase recorded with timing and status</p></li></ul><p>The BenchBox result is <em>lower</em> than the fabricated one. That's expected: random data often produces faster queries (wrong selectivities, smaller intermediate results, empty joins). But unlike the fabricated number, the BenchBox result actually <em>means something</em>. Post 1 showed the fake-data side of this with SSB; this example shows the other half of the story, where the same model stops improvising once the workflow is encoded in tools.</p><div><hr></div><h2>The CLI: a second layer</h2><p>Beyond the MCP server, agents can invoke BenchBox directly via CLI, and the same structural enforcement applies.</p><p>The difference between <code>benchbox run --platform duckdb --benchmark tpch --scale 0.01</code> and "please run TPC-H at SF0.01 on DuckDB" isn't just syntax. It's the difference between invoking a validated workflow and asking an agent to reason about methodology from scratch. The same distinction shows up if the workload is SSB or TPC-DS: the command narrows the valid path, while the natural-language request invites improvisation. That improvisation is where every failure mode from Post 1 lives.</p><p>What surprised me was how many failure modes disappeared simply by defining the <em>vocabulary of valid operations</em>. <code>--benchmark tpch</code> is a valid option; <code>--benchmark my-custom-benchmark</code> isn't. 
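</p><p>The mechanism is ordinary argument validation. A sketch with Python's <code>argparse</code> (the option names follow the command shown above, but the choices lists are trimmed and illustrative; this is not BenchBox's actual parser):</p><pre><code>import argparse

# Hypothetical sketch of a whitelist-style CLI. Unknown benchmarks and
# non-numeric scale factors are rejected at parse time, before any
# benchmark logic runs.
parser = argparse.ArgumentParser(prog="benchbox")
parser.add_argument("--platform", required=True,
                    choices=["duckdb", "datafusion", "sqlite"])  # trimmed list
parser.add_argument("--benchmark", required=True,
                    choices=["tpch", "tpcds", "ssb"])
parser.add_argument("--scale", type=float, default=1.0)

args = parser.parse_args(
    ["--platform", "duckdb", "--benchmark", "tpch", "--scale", "0.01"])

# parser.parse_args(["--platform", "duckdb", "--benchmark", "my-custom-benchmark"])
#   exits with: argument --benchmark: invalid choice
# parser.parse_args(["--platform", "duckdb", "--benchmark", "tpch", "--scale", "ten"])
#   exits with: argument --scale: invalid float value: 'ten'
</code></pre><p>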
<code>--scale 0.001</code> is valid; <code>--scale "ten"</code> rejects at argument parsing. When an agent can only invoke commands that accept valid arguments, it can't hallucinate benchmarks that don't exist. When the workflow definition includes validation, the agent can't skip it: not through ignorance, not through "efficiency," not through creative reasoning. The constraint is structural, not persuasive.</p><p>The design principle: the CLI defines the <em>space of valid operations</em>; it isn't just a convenience layer for common tasks. An agent using these commands can't produce the failures from Post 1, because those failures require operations the CLI doesn't expose.</p><div><hr></div><h2>What this generalizes to</h2><p>SSB, TPC-H, and TPC-DS differ in query shapes, data generators, and scoring rules. The control problem is the same in all three: if the agent is free to improvise methodology, it will.</p><p>If you're building agent-facing tooling for any rigorous domain, I think three constraints matter more than prompt cleverness:</p><ol><li><p><strong>Put valid inputs and ground truth behind tools</strong>: Whitelists, bounds, and live system resources beat instructions every time.</p></li><li><p><strong>Encode the execution sequence</strong>: If validation must happen before timing, the tool should require that order instead of hoping the model remembers it.</p></li><li><p><strong>Make failure explicit</strong>: Categorized, actionable errors stop the agent from papering over a broken run with made-up fixes.</p></li></ol><p>I don't try to constrain everything. I constrain <em>methodology</em>, not <em>analysis</em>. I constrain <em>inputs</em>, not <em>presentation</em>. 
That's the split that has held up in practice: agents are useful for explaining results, comparing runs, and surfacing anomalies, but not for inventing the measurement protocol.</p><p>My recommendation is simple: if you're going to publish or act on benchmark numbers, give the agent a constrained runner or keep it out of the execution path. Let it summarize, compare, and explain. Don't let it improvise the methodology. If the valid path is not encoded in the tool, treat every benchmark number it produces as untrusted until a human verifies it.</p><div><hr></div><h2>Footnotes</h2><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v3.0.1.pdf">TPC-H Specification v3.0.1</a> - TPC, accessed 2026-02-02. Clause 2.3.1 defines the qualification database and output validation; Clause 4.1.2.2 requires SF=1 for qualification.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://modelcontextprotocol.io/">Model Context Protocol</a> - Anthropic, accessed 2026-02-02. 
Open standard for connecting AI models to external tools and data sources.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Benchmark with AI (pt1/3): When AI agents are confident and wrong]]></title><description><![CDATA[AI agents can produce benchmark-shaped output with total confidence, even when the methodology is invalid.]]></description><link>https://oxbowresearch.com/p/benchmark-with-ai-pt13-when-ai-agents</link><guid isPermaLink="false">https://oxbowresearch.com/p/benchmark-with-ai-pt13-when-ai-agents</guid><dc:creator><![CDATA[Joe Harris]]></dc:creator><pubDate>Fri, 13 Mar 2026 12:35:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5-A4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b68684e-f3df-4627-8fac-2f1d1e6bdc83_1800x692.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Creating a benchmarking script</h2><img style="" src="https://substackcdn.com/image/fetch/$s_!5-A4!,w_1100,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b68684e-f3df-4627-8fac-2f1d1e6bdc83_1800x692.png" alt="When AI agents are confident and wrong" title="" data-component-name="ImageToDOM"><p><strong>TL;DR</strong>: AI agents generate fake data, skip result validation, hardcode query filters, and report meaningless metrics, all with complete confidence. Without guardrails, their benchmark "results" are just numbers the agent printed.</p><div><hr></div><p>AI coding agents often make subtle mistakes when asked to assist with benchmarking tasks. To demonstrate the phenomenon for this post, I asked both Claude Code and Codex to create a Star Schema Benchmark script for DataFusion. Not "run the benchmark," just "create the script." I wanted to inspect what they produced before executing anything. 
Both ran in clean sessions: no project context, no MCP servers, no prompt history, no benchmarking tools available.</p><p>Claude Code wrote 679 lines of well-structured Python. Correct SSB table schemas, all 13 queries organized by flight, CLI argument parsing, multiple runs with best/median/average reporting. Here's how it generated the data:</p><pre><code>def generate_lineorder_table(sf, date_keys, num_parts, num_supps, num_custs):
    """~6_000_000 * SF rows (1_500_000 orders * ~4 lines each)."""
    rng = make_rng(789)
    num_orders = 1_500_000 * sf

    for ok in range(1, num_orders + 1):
        nlines = rng.randint(1, 7)
        odate = rng.choice(date_keys)
        ckey = rng.randint(1, num_custs)
    # ... fills every column with random.Random() values</code></pre><p>This generator looks like it's creating benchmark data but <strong>it isn't.</strong> The comment says "deterministic PRNG seeded per-table so results are reproducible," and it creates tables with the right column names, plausible value ranges, and correct cardinalities. However, SSB specifies a data generator (<code>ssb-dbgen</code>) that produces data with designed distributions and correlations<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> that the agent's <code>random.randint()</code> calls do not reproduce. The result is that foreign key relationships become statistical noise rather than designed joins, and filter selectivities end up arbitrary rather than calibrated, so the queries run against a fundamentally different workload than SSB intends, making every timing result meaningless.</p><p>The agent expressed zero uncertainty about any of this. The script had no placeholder markers, no "this is an approximation" caveats, no suggestion that <code>ssb-dbgen</code> exists. It treated data generation as a solved problem: generate rows that fit the schema, then move on to printing timing tables without any validation phase, composite metric, or run protocol.</p><p>Codex produced a different script: instead of generating fake data, it cloned the <code>ssb-dbgen</code> repository from GitHub and built the official data generator. It included a dedicated warm-up phase and reported standard deviation. The worst failure mode, fake data, was absent.</p><p>But Codex still failed three of my five methodology checks: it didn't validate results against reference answers, it hardcoded all 13 queries with fixed filter values, and it reported raw elapsed milliseconds instead of a composite metric. 
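</p><p>The composite-metric gap is concrete. SSB's paper summarizes with a geometric mean, and the choice matters: one slow outlier query dominates an arithmetic mean but is dampened in a geometric one. A small sketch (my own illustration with made-up timings, not output from either script):</p><pre><code>from statistics import geometric_mean

# Thirteen per-query timings in ms, with one slow outlier (made-up numbers).
timings_ms = [12.0, 9.5, 11.0, 10.2, 8.8, 13.1, 9.9,
              10.7, 11.6, 9.1, 10.4, 12.3, 950.0]

arith = sum(timings_ms) / len(timings_ms)  # outlier dominates: ~83 ms
geo = geometric_mean(timings_ms)           # outlier dampened:  ~15 ms

print(f"arithmetic mean: {arith:.1f} ms")
print(f"geometric mean:  {geo:.1f} ms")
</code></pre><p>Raw elapsed milliseconds answer a different question than the benchmark's defined summary, which is why the spec's metric, not a wall-clock total, is what counts as a result.</p><p>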
Better methodology than Claude's attempt, but still not a benchmark result.</p><p>Both agents got the same prompt under the same constraints, yet one generated fake data with complete confidence while the other found the real data generator but skipped half the methodology. Neither flagged what it got wrong.</p><div><hr></div><h2>The overconfidence gap</h2><p>Why do AI agents express such certainty about benchmark methodology when their knowledge is so unreliable?</p><p>Benchmarking sits in a dangerous middle ground for language models. There's enough online content about database performance testing to pattern-match convincingly, but the corpus is mostly informal, incomplete, and frequently wrong.</p><p>Benchmark specifications like TPC-H<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> and SSB<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> are dense technical documents that define exact data generation procedures, validation requirements, timing protocols, and composite metrics. Almost none of this detail appears in the training corpus in a form agents can reproduce. What <em>does</em> appear are simplified summaries ("TPC-H has 22 queries," "SSB uses a star schema") that are true but incomplete for actually running the benchmark.</p><p>Search for any benchmark tutorial online and you'll see the pattern: run the queries, report elapsed times. Data generator requirements, qualification steps, composite metrics: almost none of it appears. The posts aren't wrong about what they describe; they're just describing something much simpler than the actual specification. This is the corpus agents train on, and it means the vocabulary is well-represented while the methodology isn't. 
The result is an agent that sounds fluent in benchmarking but doesn't actually understand the methodology, creating an illusion of competence that's invisible to anyone with the same knowledge gap.</p><p>Claude's SSB script is a direct example. It used the right table names, correct column schemas, and appropriate cardinalities because that vocabulary is all over the training data. But it used <code>random.randint()</code> for data generation because the detail that SSB requires <code>ssb-dbgen</code> with designed distributions is buried in a 2009 academic paper<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, not in the blog posts and tutorials agents learn from.</p><p>Three factors compound the problem:</p><ul><li><p>Most online benchmark content is informal: "I ran TPC-H on my laptop" usually means "I ran some queries and reported times."</p></li><li><p>Vendor marketing dominates the rest, publishing performance numbers without methodology details.</p></li><li><p>The actual specifications are dense PDFs poorly represented in web-crawled training data. The simplified summaries that <em>are</em> well-represented omit exactly the details that matter.</p></li></ul><p>What makes this dangerous is that the output <em>looks right</em>. Garbage Python fails visibly, but invalid benchmark methodology just runs, prints numbers, and looks professional. The two scripts I collected demonstrate this: both are well-structured, well-commented, production-quality Python. Both would run without errors and print professional-looking timing tables, even though one generates entirely fake data and the other uses the real generator but skips half the methodology. Neither failure is visible in the output.</p><p>Here's the counterintuitive part: more capable agents don't produce <em>fewer</em> methodology failures; they produce <em>harder-to-detect</em> ones. 
Claude's script with deterministic seeding and reproducible random generation is more sophisticated than a quick <code>random.uniform()</code> shortcut. The sophistication makes the fake data <em>harder</em> to spot, not easier. The failure mode shifts from "obviously broken" to "subtly wrong," and detection requires domain expertise the user may not have.</p><div><hr></div><h2>Why agents default to building</h2><p>The overconfidence isn't random; it has root causes that matter if you want to use agents productively.</p><h3>They've seen more "build it" than "use it"</h3><p>There are far more code-generation examples in training corpora than tool-usage examples. The internet is full of "here's how to build X" and sparse on "here's how to correctly invoke this tool," so agents naturally default to the pattern they've seen most.</p><p>When I ask an agent to run a benchmark, it can generate Python and execute it fast. Confirming whether <code>benchbox</code> is installed, learning its CLI, and invoking it correctly takes exploration the agent may not know how to do. The path of least resistance is generation. Claude's SSB script (679 lines of custom data generation) versus Codex's one <code>git clone</code> of <code>ssb-dbgen</code> is the split in action.</p><h3>Building is one step; tool discovery is many</h3><p>An agent can <em>verify</em> that its generated code runs. It cannot easily verify that a specialized tool exists on the system, is installed correctly, and will do what the task requires. Generation is one step with immediate feedback, while tool discovery is multiple steps with uncertain outcomes.</p><h3>Wrong output that runs is still "success"</h3><p>Here's the fundamental issue: agents aren't penalized for reinventing wheels that produce incorrect results, as long as those results <em>look</em> correct. If the script runs and prints a table, the agent considers the task complete. 
Whether the data was generated correctly or the methodology was valid isn't in the feedback loop. The agent optimizes for "runs successfully," not "produces valid measurement."</p><div><hr></div><h2>Why I tested with SSB, not TPC-H</h2><p>My earlier tests used TPC-H; the most widely discussed benchmark in the training data. When I asked agents to "run TPC-H at SF10 against DuckDB," every agent I tested found DuckDB's built-in TPC-H extension and used it correctly:</p><pre><code>import duckdb
conn = duckdb.connect()
conn.execute("LOAD tpch")
conn.execute("CALL dbgen(sf=10)")
# Then: SELECT * FROM tpch_queries() or PRAGMA tpch(N)</code></pre><p>This produces valid TPC-H data: correct distributions, correct correlations, correct cardinalities. The agents didn't need to know the TPC-H specification because DuckDB abstracted the hard part away. Three lines of code, and the worst failure mode (fake data) was impossible.</p><p>That's a genuine improvement. DuckDB's team deserves credit for building benchmark extensions that make correct methodology the path of least resistance, exactly the kind of guardrail I argue for in this post. I wrote about the difference between DuckDB's extensions and full benchmark methodology in <a href="https://benchbox.dev/blog/2026-03-03-duckdb-tpch-extension-vs-benchbox.html">"DuckDB tpch Extension vs BenchBox TPC-H"</a>.</p><p>But it created a problem for testing the blog's thesis. If agents always use DuckDB's extension for TPC-H, I can't demonstrate the fake data failure mode. So I switched to SSB on DataFusion, a benchmark with no built-in extension on a platform that requires the agent to solve data generation itself. The fake data problem reappeared immediately.</p><p>This tells you something important about the failure boundary: agents don't understand benchmark methodology, but they can <em>use tools that encode it</em>. When the right tool exists and is discoverable, agents find it, but when it doesn't, they generate fake data with complete confidence. The gap isn't in agent intelligence but in the tools we give them, and that's something we can close.</p><div><hr></div><h2>The failure modes that matter most</h2><p>I've spent months watching AI coding agents attempt benchmark tasks during BenchBox development. The two SSB scripts above are representative, not cherry-picked; I tested multiple frontier models across TPC-H and SSB with consistent results. 
The catalog of mistakes is long, but four failure modes dominate and they're the hardest to detect.</p><h3>Fake data generation</h3><p>As the opening section demonstrated, this is the hardest failure to catch. Standard benchmarks define specific data generators (<code>dbgen</code> for TPC-H, <code>ssb-dbgen</code> for SSB) that produce data with designed distributions, correlations, and cardinalities<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. These properties are <em>the point</em>; the queries are designed to stress specific patterns in this specific data.</p><p>The distortion from random data hits every query flight differently. Flight 1's discount/quantity filters depend on specific distributions. Flight 3 queries (Q3.1-Q3.4) filter on customer and supplier regions; with random keys, join selectivities change by orders of magnitude and queries that should stress the engine finish instantly. Flight 4's profit calculations depend on correlated <code>lo_revenue</code> and <code>lo_supplycost</code> values. Random data usually makes queries <em>faster</em>, not slower, because the workload gets easier, and faster feels like success, not a red flag.</p><p>Codex avoided this failure by cloning <code>ssb-dbgen</code> and building the official generator. But the inconsistency between agents makes the failure mode more dangerous: you can't predict which agents will fake the data and which won't. The only safe assumption is to verify.</p><h3>Missing result validation</h3><p>Standard benchmarks include mechanisms to verify query correctness. TPC-H requires a formal "qualification database" step at SF-1 with reference answers<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. 
SSB defines expected result characteristics through its data generator's deterministic output<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. The principle is the same: confirm your queries compute what the benchmark intends before you start timing.</p><p>Neither agent validated results: not Claude, not Codex, and not any other agent I've tested across dozens of attempts. They all go straight from "create tables" to "run queries and report times," which means you have no evidence that the queries are computing correctly. Speed is meaningless without correctness.</p><p>Why does this matter beyond pedantry? Because without validation, nothing catches a query that returns wrong answers. Claude's script records the row count for each query but never checks whether that count is <em>correct</em>. With its randomly generated data, join queries could return zero rows or millions, and the script would report both as "success" with equal confidence. Codex's script does the same. The only difference is Codex's data would at least have the right distributions; the queries still run against unvalidated output.</p><h3>No warm-up or repetition protocol</h3><p>Without a defined protocol, performance measurement is just noise measurement. The first run on a cold system tells you about storage and buffer behavior, not the query engine. Benchmark specifications address this: TPC-H defines power run sequences and a composite metric (QphH@Size)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>; SSB's original paper reports geometric means across multiple runs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><p>This is where the two agents diverged interestingly. 
Claude's script ran each query 3 times and reported best/median/average: better than a single run, but with no dedicated warm-up phase. The first measured run includes cold-cache effects mixed into the "best" timing. Codex's script included an explicit <code>--warmup</code> parameter (default 1 run) before measured runs, a methodological improvement.</p><p>Neither agent addressed the deeper issue: without a defined protocol, which number is "the result"? Best-of-3? Median? Geometric mean? The choice matters, and the benchmark spec makes it for you. Agents pick whatever feels reasonable.</p><h3>Hardcoded filter values</h3><p>TPC-H defines each query as a <em>template</em> with substitution parameters (Q1 has a date parameter, Q6 has a discount range) precisely to prevent query result caching from dominating measurements<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. SSB's 13 queries use fixed filter values in the original spec, but the principle still applies: if you run identical SQL on every iteration, the database can return memoized results and you're timing cache lookup, not query processing.</p><p>Both agents hardcoded every filter value. Claude's Q1.1 filters on <code>d_year = 1993</code>, <code>lo_discount BETWEEN 1 AND 3</code>, <code>lo_quantity &lt; 25</code>; every run, identical. Codex's queries use the same fixed values. Neither agent varied filters across runs, discussed cache effects, or acknowledged the limitation. For SSB this is defensible if disclosed; for TPC-H it violates the spec. Neither agent made the distinction.</p><div><hr></div><h2>The rest of the catalog</h2><p>Beyond the big four, these failures appear frequently enough to mention:</p><ul><li><p><strong>Inventing scale factors.</strong> Agent uses SF-5, SF-25, or other non-standard values. Can't validate against reference data; can't compare to published results. 
Data generators <em>allow</em> arbitrary SFs, but defined benchmark sizes (1, 10, 30, 100, etc.)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> have specific validation data.</p></li><li><p><strong>Mixing benchmarks.</strong> Agent references "Query 23" in a TPC-H context, or uses SSB filters in TPC-DS queries. Different benchmarks with different schemas and protocols. Agents synthesize from all benchmark-related training data simultaneously.</p></li><li><p><strong>Ignoring isolation requirements.</strong> Agent doesn't set or validate isolation. Violates TPC-H/TPC-DS isolation requirements and undermines ACID assumptions.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p></li><li><p><strong>Platform-blind scale factors.</strong> Agent uses SF-1 for a distributed system or SF-1000 for a laptop. SF-1 fits entirely in RAM on most modern hardware; you're testing cache behavior, not the query engine.</p></li><li><p><strong>Reporting elapsed time instead of the defined metric.</strong> Agent reports "45.2 seconds total" as the result. Each benchmark defines its performance metric. TPC-H uses QphH@Size (composite of Power and Throughput)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>; SSB uses geometric mean across flights. Elapsed time isn't a benchmark metric; it's a wall-clock sum that depends on query execution order.</p></li></ul><div><hr></div><h2>Why "it runs" isn't enough</h2><p>A benchmark isn't a program; it's a <em>measurement protocol</em>. Think of the difference between "a thermometer that displays a number" and "a calibrated instrument that accurately measures temperature." The display showing "72.3F" doesn't mean the room is 72.3 degrees; it just means the device produced output. 
What makes the reading <em>mean something</em> is calibration, placement, equilibration time, and environmental controls.</p><p>An agent that skips result validation, uses random data, hardcodes query filters, and reports raw elapsed times hasn't produced a wrong benchmark result; it hasn't produced a benchmark result <em>at all</em>, just a number that happens to have units of seconds attached.</p><div><hr></div><h2>So what do you do about it?</h2><p>None of this argues against using AI agents for database evaluation. It argues for understanding <em>where the failure boundary is</em>, and acting on it before you trust any agent-generated performance claim.</p><p>Here's my immediate advice, applicable whether or not you use BenchBox:</p><p><strong>Before trusting any AI-generated benchmark result, verify these five things:</strong></p><ol><li><p><strong>Data origin</strong>: Was the benchmark's official data generator used (<code>dbgen</code> for TPC-H, <code>ssb-dbgen</code> for SSB, <code>dsdgen</code> for TPC-DS)? If the agent generated data with Python, Faker, or any random generation, the results are invalid, full stop.</p></li><li><p><strong>Result validation</strong>: Were query results checked against expected outputs? If not, you don't know whether the queries computed correctly. Speed without correctness is meaningless.</p></li><li><p><strong>Filter variation</strong>: Were query filter values varied across runs (required for TPC-H/TPC-DS, good practice for SSB)? If the same values were used every time, cache effects dominate and the results are suspect.</p></li><li><p><strong>Repetition and variance</strong>: How many runs? What's the standard deviation? A single data point is not a measurement. Demand at least 3 runs with reported variance.</p></li><li><p><strong>Metric calculation</strong>: Is the result reported using the benchmark's defined metric (QphH@Size for TPC-H, Power@Size for TPC-DS) or just "total seconds"? 
The latter isn't a benchmark metric; it's a wall-clock number that depends on query execution order.</p></li></ol><p>If any of these checks fail, the "benchmark results" aren't benchmark results. They're numbers the agent printed. Treat them like any unverified claim.</p><p><strong>How I operationalize those checks.</strong> The three that need concrete verification beyond "ask the agent":</p><ul><li><p><strong>Data source</strong>: Look for data generator output files on disk. Check row counts against expected cardinalities: at SF-1, SSB's LINEORDER should have ~6 million rows; TPC-H's LINEITEM should have ~6 million at SF-1 or ~60 million at SF-10.</p></li><li><p><strong>Timing variance</strong>: If all runs are identical to the millisecond, something is wrong. If runs 2 and 3 are 10x faster than run 1, you're measuring cache warm-up. Real variance on a stable benchmark run: standard deviation of ~5ms on queries averaging 11ms.</p></li><li><p><strong>Scale</strong>: If the dataset fits in RAM, you're benchmarking your memory subsystem, not the query engine.</p></li></ul><p><strong>Three rules for working with AI agents on benchmarks:</strong></p><ol><li><p><strong>Plan first, execute second.</strong> Define your methodology before the agent touches anything: benchmark, scale factor, repetition count, which phases to run, success criteria. The agent's job is to execute your plan, not design one; give it a blank canvas and it'll generate plausible-looking methodology, but give it a specific plan and it follows it.</p></li><li><p><strong>Use tools with guardrails.</strong> If a tool exists that encodes correct methodology (DuckDB's TPC-H extension, BenchBox's MCP server, any validated benchmark runner), use it. The DuckDB finding is clear: when agents had a three-line invocation that handled data generation correctly, they used it and produced valid data. When they were left to figure it out themselves, they invented something that looked right and wasn't. 
Tool constraints beat prompt instructions every time.</p></li><li><p><strong>Verify artifacts, not prose.</strong> Don't ask the agent "did you use the official data generator?"; check for its output files on disk. Don't ask "did validation pass?"; look for the validation report. Agent-generated prose about methodology is unreliable. A script that says "using ssb-dbgen" in a comment but uses <code>random.randint()</code> in the implementation is exactly the kind of cosmetic compliance you'll miss if you take the comment at face value.</p></li></ol><p>The two scripts tell the whole story: Claude's passed zero of my five checks and Codex's passed two, yet neither agent flagged what it got wrong, and both would have reported results with full confidence.</p><p>But the DuckDB finding points toward the fix. When the right tool exists and handles methodology correctly, agents use it. BenchBox's MCP server is built on exactly that principle: validated inputs, structured errors, and workflow constraints that make invalid methodology impossible to execute rather than merely inadvisable to attempt.</p><div><hr></div><h2>Footnotes</h2><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v3.0.1.pdf">TPC-H Specification v3.0.1</a>; Transaction Processing Performance Council, accessed 2026-02-02. 
Clause 2.4.1.3 (substitution parameters), Clause 2.3.1 and Appendix C (qualification database and reference answers), Clause 4.1.2.2 (SF=1 for qualification).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v3.0.1.pdf">TPC-H Specification v3.0.1</a>; TPC, accessed 2026-02-02. Clause 4.2.1.2 (DBGen data generation requirements), Clause 4.2.5.2 (authorized scale factors and LINEITEM cardinalities). SSB uses an analogous generator (<code>ssb-dbgen</code>) derived from TPC-H's <code>dbgen</code> with modified distributions for the star schema.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v3.0.1.pdf">TPC-H Specification v3.0.1</a>; TPC, accessed 2026-02-02. Clause 5.4.3 (composite QphH@Size metric from Power and Throughput), Clause 3.4 (isolation requirements).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>P. O'Neil, E. O'Neil, X. Chen, S. Revilak, "Star Schema Benchmark" (2009). Defines the SSB data generator, 13 queries across 4 flights, and expected result characteristics for validation. The <code>ssb-dbgen</code> tool (<a href="https://github.com/electrum/ssb-dbgen">github.com/electrum/ssb-dbgen</a>) is the standard open-source implementation.</p></div></div>]]></content:encoded></item><item><title><![CDATA[How Much Faster is DuckDB 1.5 vs 1.0? 
A lot]]></title><description><![CDATA[I benchmarked every DuckDB minor release from 1.0.0 through the 1.5.0 dev build on TPC-H, TPC-DS, ClickBench, and SSB. The results tell a clear story of DuckDB quickly improving.]]></description><link>https://oxbowresearch.com/p/how-much-faster-is-duckdb-15-vs-10</link><guid isPermaLink="false">https://oxbowresearch.com/p/how-much-faster-is-duckdb-15-vs-10</guid><dc:creator><![CDATA[Joe Harris]]></dc:creator><pubDate>Fri, 27 Feb 2026 21:29:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dtJ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F245093ed-1b41-4a35-a14f-398a4f9e4067_4967x2319.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR</strong>: Is upgrading from DuckDB v1.0.0 worth it? Yes. v1.5.0-dev is 1.67&#215; faster on TPC-H, 1.84&#215; faster on ClickBench, and 1.45&#215; faster on SSB, with a 1.73&#215; higher TPC-DS Power@Size score.</p><h2>Introduction</h2><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|l|l|}\n\\hline\n\\textbf{Benchmark} &amp; \\textbf{Metric} &amp; \\textbf{v1.0.0} &amp; \\textbf{v1.5.0-dev} &amp; \\textbf{Improvement} &amp; \\textbf{Highlight} \\\\\n\\hline\n\\text{TPC-H} &amp; \\text{Total runtime} &amp; \\text{15,228ms} &amp; \\text{9,142ms} &amp; \\text{1.67&#215; faster} &amp; \\text{Q7: 4.3&#215; single-query improvement} \\\\\n\\hline\n\\text{TPC-DS} &amp; \\text{Power@Size} &amp; \\text{385,807} &amp; \\text{669,328} &amp; \\text{1.73&#215; higher} &amp; \\text{v1.2.2 regression hump, full recovery} \\\\\n\\hline\n\\text{ClickBench} &amp; \\text{Total runtime} &amp; \\text{729ms} &amp; \\text{396ms} &amp; \\text{1.84&#215; faster} &amp; \\text{LIKE queries +73\\%} \\\\\n\\hline\n\\text{SSB} &amp; \\text{Total runtime} &amp; \\text{1,356ms} &amp; \\text{938ms} &amp; \\text{1.45&#215; faster} &amp; \\text{v1.3/v1.4 regressed, 
v1.5 full recovery} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;ERZYNKWYRZ&quot;}" data-component-name="LatexBlockToDOM"></div><div><hr></div><p><strong>This post evaluates the last six DuckDB versions (1.0.0 through 1.5.0-dev) on the TPC-H, TPC-DS, ClickBench, and Star Schema (SSB) benchmarks.</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://oxbowresearch.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Oxbow Research is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>DuckDB is now common infrastructure for SQL analytics work. It runs in-process with no server, handles billion-row workloads on a laptop, and embeds into Python, R, and dozens of other runtimes. Since v1.0.0 shipped in June 2024, it has become a standard tool for data engineering, data science, and ad hoc analytics.</p><p>DuckDB is typically embedded in applications, notebooks, and scripts with the version locked to ensure consistent behavior. So it's worth highlighting version-level performance improvements that locked version embeds may be missing out on.</p><p>MotherDuck estimated a cumulative 2&#215; improvement since v1.0.0<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. 
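</p><p>Cumulative claims like that compound multiplicatively from per-version gains, so even small releases matter. A minimal sketch, using the "vs. Previous" Power@Size deltas reported in the TPC-H table later in this post:</p>

```python
# Per-version improvements compound as a product of ratios, not a sum.
# +39.4%, +1.3%, +1.3%, +1.6%, +4.7% compounds to ~1.52x, not 1.48x.
def compound(deltas):
    result = 1.0
    for d in deltas:
        result *= 1.0 + d
    return result

cumulative = compound([0.394, 0.013, 0.013, 0.016, 0.047])
```

<p>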
Because DuckDB is open source, I can easily compare the six major versions since 1.0, at the query level, and trace performance shifts to specific PRs and issues: not just "it's faster," but <em>which execution changes produced which gains</em>.</p><div class="image-gallery-embed" data-attrs="{&quot;gallery&quot;:{&quot;images&quot;:[{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/245093ed-1b41-4a35-a14f-398a4f9e4067_4967x2319.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22a8aea0-c00f-402e-9540-5e12ef8782d9_4967x2319.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b63fb66-9562-41ce-904f-5492eb98e4c0_4967x2319.png&quot;},{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbeb7af3-9da7-4a5f-a3d9-9b31a5974455_4967x2320.png&quot;}],&quot;caption&quot;:&quot;Query time distributions across all four benchmarks and six versions&quot;,&quot;alt&quot;:&quot;Query time distributions across all four benchmarks and six versions&quot;,&quot;staticGalleryImage&quot;:{&quot;type&quot;:&quot;image/png&quot;,&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/706e584c-f145-4b13-8e47-ccf70062fd8d_1456x1456.png&quot;}},&quot;isEditorNode&quot;:true}"></div><p></p><h3>Versions Tested</h3><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|}\n\\hline\n\\textbf{Version} &amp; \\textbf{Codename} &amp; \\textbf{Release Date} &amp; \\textbf{Key Performance Features} \\\\\n\\hline\n\\textbf{1.0.0} &amp; \\text{-} &amp; \\text{June 2024} &amp; \\text{First stable release (baseline)} \\\\\n\\hline\n\\textbf{1.1.3} &amp; \\text{Eatoni} &amp; \\text{October 2024} &amp; \\text{Filter pushdown, join 
optimizations} \\\\\n\\hline\n\\textbf{1.2.2} &amp; \\text{Histrionicus} &amp; \\text{February 2025} &amp; \\text{CSV parser rewrite (+15\\%)} \\\\\n\\hline\n\\textbf{1.3.2} &amp; \\text{Ossivalis} &amp; \\text{June 2025} &amp; \\text{Parquet reader/writer rewrite} \\\\\n\\hline\n\\textbf{1.4.4} &amp; \\text{Andium (1.4.x LTS line)} &amp; \\text{January 2026} &amp; \\text{Sorting rewrite (2x+), 1.4.x patch line} \\\\\n\\hline\n\\textbf{1.5.0-dev} &amp; \\text{Variegata} &amp; \\text{Pre-GA build tested} &amp; \\text{Pre-release build (1.5.0.dev311)} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;WYRLVRKGIL&quot;}" data-component-name="LatexBlockToDOM"></div><p>I tested the last patch release of each minor version to capture cumulative improvements. For v1.5, I used a pre-release dev build (1.5.0.dev311). DuckDB's release calendar lists 1.5.0 as upcoming on March 2, 2026, and GitHub Releases still shows v1.4.4 as latest GA<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><div><hr></div><h2>DuckDB's Performance Evolution</h2><h3>Versions 1.1 through 1.3: Execution Engine, I/O, and Parquet</h3><p>The first three post-1.0 releases moved from core execution to storage. v1.1<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> shipped optimizer work (filter and join improvements) and produced the biggest single-version TPC-H jump in this matrix (+39.4% Power@Size). v1.2<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> focused on I/O (CSV parser rewrite, Parquet bloom filters) and delivered the largest single-version ClickBench drop (17.7% over v1.1). 
v1.3<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> rewrote Parquet reader/writer paths (deferred column fetching and stronger pushdown), changes that matter most in Parquet-heavy workloads rather than these in-memory runs.</p><h3>Version 1.4: Sorting Rewrite</h3><p>The v1.4 sorting rewrite (<a href="https://github.com/duckdb/duckdb/pull/17584">PR-17584</a>) replaced DuckDB's sort implementation with a K-way merge sort, delivering 1.7-2.7&#215; improvement on random data and up to 10&#215; on pre-sorted data in isolated ORDER BY benchmarks. TPC-H shows modest gains (+1.4% on my sorting proxy queries) because its ORDER BY operations are one component of multi-join queries, not isolated sorts. DuckDB 1.4 also changed CTE behavior to materialize by default (<a href="https://github.com/duckdb/duckdb/pull/17459">PR-17459</a>), with the release notes reporting performance and correctness improvements for repeated CTE references<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. In this matrix, TPC-DS Power@Size increased from 614,882 (v1.3.2) to 630,854 (v1.4.4).</p><h3>Version 1.5.0-dev: Continued Acceleration</h3><p>After major rewrites in v1.3 and v1.4, v1.5.0-dev shows another round of gains: +4.7% TPC-H Power@Size, +6.1% TPC-DS Power@Size, 0.5% faster ClickBench, and SSB recovering fully from the v1.3/v1.4 regression to become the fastest version (938ms). Because GA release notes aren't final, I focus on observed deltas rather than attributed features. Concurrent upstream work mapped from <code>v1.5-variegata</code> includes join memory improvements (<a href="https://github.com/duckdb/duckdb/pull/21022">PR-21022</a>), window optimizer extensions (<a href="https://github.com/duckdb/duckdb/pull/21021">PR-21021</a>), and plan correctness tightening (<a href="https://github.com/duckdb/duckdb/pull/21014">PR-21014</a>). 
These are the most active performance-relevant threads on the branch at test time. The per-query delta table in the TPC-H results section shows where the gains landed.</p><div><hr></div><h2>Results: TPC-H (SF=10)</h2><h3>Overall Version Progression</h3><p><strong>TPC-H Power@Size</strong>, the TPC standard metric for single-stream query performance: <code>3600 &#215; Scale_Factor / geometric_mean(per-query times)</code>. Higher is better. The geometric mean weights all queries equally, so a 2&#215; improvement on any single query contributes the same regardless of absolute runtime.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|}\n\\hline\n\\textbf{Version} &amp; \\textbf{Power@Size} &amp; \\textbf{vs. v1.0.0} &amp; \\textbf{vs. Previous} \\\\\n\\hline\n\\text{v1.0.0} &amp; \\text{194,231} &amp; \\text{-} &amp; \\text{-} \\\\\n\\hline\n\\text{v1.1.3} &amp; \\text{270,754} &amp; \\text{+39.4\\% higher} &amp; \\text{+39.4\\% higher} \\\\\n\\hline\n\\text{v1.2.2} &amp; \\text{274,287} &amp; \\text{+41.2\\% higher} &amp; \\text{+1.3\\% higher} \\\\\n\\hline\n\\text{v1.3.2} &amp; \\text{277,920} &amp; \\text{+43.1\\% higher} &amp; \\text{+1.3\\% higher} \\\\\n\\hline\n\\text{v1.4.4} &amp; \\text{282,440} &amp; \\text{+45.4\\% higher} &amp; \\text{+1.6\\% higher} \\\\\n\\hline\n\\text{v1.5.0-dev} &amp; \\text{295,792} &amp; \\text{+52.3\\% higher} &amp; \\text{+4.7\\% higher} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;PVPZRSBRSF&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CGsB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!CGsB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic 424w, https://substackcdn.com/image/fetch/$s_!CGsB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic 848w, https://substackcdn.com/image/fetch/$s_!CGsB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic 1272w, https://substackcdn.com/image/fetch/$s_!CGsB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CGsB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic" width="1456" height="212" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85046,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oxbowresearch.com/i/189402596?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CGsB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic 424w, https://substackcdn.com/image/fetch/$s_!CGsB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic 848w, https://substackcdn.com/image/fetch/$s_!CGsB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic 1272w, https://substackcdn.com/image/fetch/$s_!CGsB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25b6f7eb-6262-4ac7-b9e8-e19464d100f3_4967x724.heic 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Total Runtime (all 22 queries)</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Version} &amp; \\textbf{Total (ms)} &amp; \\textbf{vs. 
v1.0.0} \\\\\n\\hline\n\\text{v1.0.0} &amp; \\text{15,228} &amp; \\text{-} \\\\\n\\hline\n\\text{v1.1.3} &amp; \\text{10,273} &amp; \\text{32.5\\% faster} \\\\\n\\hline\n\\text{v1.2.2} &amp; \\text{10,085} &amp; \\text{33.8\\% faster} \\\\\n\\hline\n\\text{v1.3.2} &amp; \\text{9,906} &amp; \\text{34.9\\% faster} \\\\\n\\hline\n\\text{v1.4.4} &amp; \\text{9,683} &amp; \\text{36.4\\% faster} \\\\\n\\hline\n\\text{v1.5.0-dev} &amp; \\text{9,142} &amp; \\text{40.0\\% faster} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;IXYSEDMFON&quot;}" data-component-name="LatexBlockToDOM"></div><h3>Per-Query Analysis</h3><p><strong>Biggest winners</strong> (largest improvement v1.0.0 to v1.5.0-dev):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|l|}\n\\hline\n\\textbf{Query} &amp; \\textbf{v1.0.0 (ms)} &amp; \\textbf{v1.5.0-dev (ms)} &amp; \\textbf{Speedup} &amp; \\textbf{Notes} \\\\\n\\hline\n\\text{Q7} &amp; \\text{429} &amp; \\text{99} &amp; \\text{4.3&#215;} &amp; \\text{Join + aggregation heavy} \\\\\n\\hline\n\\text{Q18} &amp; \\text{828} &amp; \\text{247} &amp; \\text{3.4&#215;} &amp; \\text{GROUP BY with large aggregation} \\\\\n\\hline\n\\text{Q17} &amp; \\text{192} &amp; \\text{96} &amp; \\text{2.0&#215;} &amp; \\text{Subquery-heavy} \\\\\n\\hline\n\\text{Q15} &amp; \\text{159} &amp; \\text{87} &amp; \\text{1.8&#215;} &amp; \\text{View + aggregation} \\\\\n\\hline\n\\text{Q5} &amp; \\text{184} &amp; \\text{102} &amp; \\text{1.8&#215;} &amp; \\text{Multi-table join + aggregation} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;HLSRUFIUTS&quot;}" data-component-name="LatexBlockToDOM"></div><p>Q7's 4.3&#215; improvement is the headline number from this analysis, and it surprised me. 
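</p><p>Because Power@Size is built on a geometric mean, a single-query speedup like Q7's lifts the composite score by a fixed factor no matter how large or small that query's absolute runtime is. A minimal sketch of the formula quoted above (the official TPC-H Power test also includes refresh functions, omitted here; the timing values are illustrative):</p>

```python
from math import prod

def power_at_size(times_ms, sf):
    """Power@Size = 3600 * SF / geometric mean of per-query times in seconds."""
    secs = [t / 1000.0 for t in times_ms]
    return 3600.0 * sf / prod(secs) ** (1.0 / len(secs))

# A 4.3x speedup on one of 22 queries raises the score by 4.3**(1/22), about
# 6.9%, whether that query took 429ms or 42,900ms to begin with.
base = [429.0] + [200.0] * 21
faster = [429.0 / 4.3] + [200.0] * 21
lift = power_at_size(faster, 10) / power_at_size(base, 10)
```

<p>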
A 4.3&#215; speedup from iterative algorithmic improvements alone (no schema changes, no index tricks, same hardware) is unusually large for a query that was already completing successfully.</p><h3>Regressions</h3><p>I found no query slower in v1.5.0-dev than in v1.0.0 across the full 22-query TPC-H suite. Adjacent-version regressions do appear (see Analysis section), but the cumulative direction is consistently positive.</p><h3>Query Category Breakdown</h3><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|l|}\n\\hline\n\\textbf{Category} &amp; \\textbf{Queries} &amp; \\textbf{v1.0.0 Mean (ms)} &amp; \\textbf{v1.5.0-dev Mean (ms)} &amp; \\textbf{Improvement} \\\\\n\\hline\n\\text{Full scan} &amp; \\text{Q1, Q6} &amp; \\text{169.5} &amp; \\text{149.0} &amp; \\text{+12.1\\%} \\\\\n\\hline\n\\text{Join-heavy} &amp; \\text{Q9, Q21} &amp; \\text{384.5} &amp; \\text{284.0} &amp; \\text{+26.1\\%} \\\\\n\\hline\n\\text{Aggregation} &amp; \\text{Q5, Q18} &amp; \\text{506.0} &amp; \\text{174.5} &amp; \\text{+65.5\\%} \\\\\n\\hline\n\\text{Sorting} &amp; \\text{Q3, Q4, Q10, Q16} &amp; \\text{175.5} &amp; \\text{116.5} &amp; \\text{+33.6\\%} \\\\\n\\hline\n\\text{Subquery} &amp; \\text{Q17, Q20} &amp; \\text{158.5} &amp; \\text{103.5} &amp; \\text{+34.7\\%} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;WNDWPZOSBA&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Key finding</strong>: Aggregation queries improved the most (65.5%), driven primarily by Q18's dramatic improvement (3.4&#215;). 
Full-scan queries improved the least (+12.1%): if your workload is dominated by Q1-style full table scans, the cumulative v1.0&#8594;v1.5 improvement is real but not a compelling reason to rush an upgrade.</p><p>The v1.4 sorting rewrite (PR-17584) measured 1.7-2.7&#215; on random data in isolated benchmarks<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>, but its TPC-H impact is modest. Using Q3, Q4, Q10, and Q16 as a sorting proxy (the four queries most dominated by ORDER BY):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Version} &amp; \\textbf{Proxy Mean (ms)} &amp; \\textbf{vs. v1.3.2} \\\\\n\\hline\n\\text{v1.3.2} &amp; \\text{126.50} &amp; \\text{- (baseline)} \\\\\n\\hline\n\\text{v1.4.4} &amp; \\text{124.75} &amp; \\text{1.4\\% faster} \\\\\n\\hline\n\\text{v1.5.0-dev} &amp; \\text{116.50} &amp; \\text{7.9\\% faster} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;ZGHCVYEAFV&quot;}" data-component-name="LatexBlockToDOM"></div><p>That's expected: TPC-H sorting queries are multi-join queries where ORDER BY is one component, not isolated sorts where the rewrite's full gains apply.</p><h3>v1.4.4 to v1.5.0-dev: Per-Query Deltas</h3><p>Query-level movement from v1.4.4 to v1.5.0-dev is mixed but net-positive. 
The most-moved queries (both directions):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|}\n\\hline\n\\textbf{Query} &amp; \\textbf{v1.4.4 (ms)} &amp; \\textbf{v1.5.0-dev (ms)} &amp; \\textbf{Change} \\\\\n\\hline\n\\text{Q8} &amp; \\text{141} &amp; \\text{95} &amp; \\text{32.6\\% faster} \\\\\n\\hline\n\\text{Q7} &amp; \\text{119} &amp; \\text{99} &amp; \\text{16.8\\% faster} \\\\\n\\hline\n\\text{Q17} &amp; \\text{112} &amp; \\text{96} &amp; \\text{14.3\\% faster} \\\\\n\\hline\n\\text{Q5} &amp; \\text{117} &amp; \\text{102} &amp; \\text{12.8\\% faster} \\\\\n\\hline\n\\text{Q9} &amp; \\text{343} &amp; \\text{300} &amp; \\text{12.5\\% faster} \\\\\n\\hline\n\\text{Q14} &amp; \\text{94} &amp; \\text{100} &amp; \\text{6.4\\% slower} \\\\\n\\hline\n\\text{Q18} &amp; \\text{234} &amp; \\text{247} &amp; \\text{5.6\\% slower} \\\\\n\\hline\n\\text{Q6} &amp; \\text{59} &amp; \\text{62} &amp; \\text{5.1\\% slower} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;UIPRRBIEVS&quot;}" data-component-name="LatexBlockToDOM"></div><p>The full six-version heatmap shows the cumulative per-query trajectory. 
Darker cells are faster; look for the v1.1 row (the biggest single jump) and the Q7/Q18 columns (the steepest per-query improvement over the full range):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RasQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RasQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic 424w, https://substackcdn.com/image/fetch/$s_!RasQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic 848w, https://substackcdn.com/image/fetch/$s_!RasQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic 1272w, https://substackcdn.com/image/fetch/$s_!RasQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RasQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic" width="1456" height="664" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:664,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:431831,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oxbowresearch.com/i/189402596?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RasQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic 424w, https://substackcdn.com/image/fetch/$s_!RasQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic 848w, https://substackcdn.com/image/fetch/$s_!RasQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic 1272w, https://substackcdn.com/image/fetch/$s_!RasQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1698a16c-a67a-4864-8a80-5d570babb66e_512x512.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Results: TPC-DS (SF=10)</h2><p>TPC-DS has 99 queries (run as 103 individual variants), testing window functions, CTEs, correlated subqueries, and other advanced SQL features that TPC-H doesn't cover.</p><h3>Overall Version Progression</h3><p><strong>TPC-DS Power Score (Power@Size)</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|}\n\\hline\n\\textbf{Version} &amp; \\textbf{Power@Size} &amp; \\textbf{Query Records Passed} &amp; \\textbf{vs. 
v1.0.0} \\\\\n\\hline\n\\text{v1.0.0} &amp; \\text{385,807} &amp; \\text{308/309} &amp; \\text{-} \\\\\n\\hline\n\\text{v1.1.3} &amp; \\text{594,666} &amp; \\text{309/309} &amp; \\text{+54.1\\% higher} \\\\\n\\hline\n\\text{v1.2.2} &amp; \\text{500,512} &amp; \\text{309/309} &amp; \\text{+29.7\\% higher} \\\\\n\\hline\n\\text{v1.3.2} &amp; \\text{614,882} &amp; \\text{309/309} &amp; \\text{+59.4\\% higher} \\\\\n\\hline\n\\text{v1.4.4} &amp; \\text{630,854} &amp; \\text{309/309} &amp; \\text{+63.5\\% higher} \\\\\n\\hline\n\\text{v1.5.0-dev} &amp; \\text{669,328} &amp; \\text{309/309} &amp; \\text{+73.5\\% higher} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;GYIMSTOBER&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CTaT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CTaT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic 424w, https://substackcdn.com/image/fetch/$s_!CTaT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic 848w, https://substackcdn.com/image/fetch/$s_!CTaT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!CTaT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CTaT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic" width="1456" height="212" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86601,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oxbowresearch.com/i/189402596?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CTaT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic 424w, https://substackcdn.com/image/fetch/$s_!CTaT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic 848w, 
https://substackcdn.com/image/fetch/$s_!CTaT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic 1272w, https://substackcdn.com/image/fetch/$s_!CTaT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb00fef09-1377-41fb-8c1d-722379b24c00_4967x724.heic 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>v1.2.2 Dip:</strong> <code>Power@Size</code> dropped 16% from v1.1.3 before recovering in v1.3.2. The largest regression was Query 22, a <code>GROUP BY ROLLUP</code> over inventory data, which went from 881ms to 10,032ms, an 11&#215; slowdown. Queries 67, 23A, 14A, 49, and 27 also regressed (43-136%). By v1.3.2, overall TPC-DS <code>Power@Size</code> exceeded v1.1.3 levels. But Q22 itself never came back. My plan/profiling evidence supports two compounding causes: details in the Q22 deep-dive below.</p><p>In my matrix summary artifacts, v1.0.0 shows 308/309 query records passed while v1.1.3+ shows 309/309; all versions report zero timeouts. Across versions, the dominant change is execution speed, not broad query-correctness drift.</p><h3>The v1.2 Regression: What Happened to Query 22?</h3><p>TPC-DS Query 22 runs a four-column <code>GROUP BY ROLLUP</code> over <code>inventory</code>, one of the largest tables in the schema. <code>ROLLUP</code> expands to five grouping sets: the full combination plus four progressively coarser subtotals. It's one of the most aggregation-intensive queries in TPC-DS, and it's where the v1.2 regression hit hardest.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|}\n\\hline\n\\textbf{Version} &amp; \\textbf{Q22 median (ms)} &amp; \\textbf{vs. 
v1.1.3} &amp; \\textbf{Root cause state} \\\\\n\\hline\n\\text{v1.1.3} &amp; \\text{881} &amp; \\text{baseline} &amp; \\text{Good hash table sizing, column pruning working} \\\\\n\\hline\n\\text{v1.2.2} &amp; \\text{10,032} &amp; \\text{11.4&#215; slower} &amp; \\text{Hash aggregation reworked for high-cardinality single-group workloads} \\\\\n\\hline\n\\text{v1.3.2} &amp; \\text{1,319} &amp; \\text{50\\% slower} &amp; \\text{HLL-based hash table sizing fixes aggregation; column pruning now disabled under ROLLUP} \\\\\n\\hline\n\\text{v1.4.4} &amp; \\text{1,414} &amp; \\text{60\\% slower} &amp; \\text{Same two-factor state} \\\\\n\\hline\n\\text{v1.5.0-dev} &amp; \\text{1,593} &amp; \\text{81\\% slower} &amp; \\text{Column pruning fix merged (PR-20781) but may not be in this build} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;ZJHYRCZGVW&quot;}" data-component-name="LatexBlockToDOM"></div><p>The strongest hypothesis from plan/profiling evidence is two independent changes that compound against each other.</p><ul><li><p><strong>v1.2: hash aggregation rework</strong></p><p>Between v1.1.3 and v1.2.0, DuckDB made hash aggregation performance improvements (<a href="https://github.com/duckdb/duckdb/pull/15251">PR-15251</a>, <a href="https://github.com/duckdb/duckdb/pull/15321">PR-15321</a>) targeting high-cardinality single-group-set workloads. These changes added row-width-aware partitioning thresholds and a "skip lookups if mostly unique" heuristic, both good for single-GROUP-BY queries, both bad for <code>ROLLUP</code>. <code>ROLLUP</code> produces NULL-padded rows across multiple grouping sets, which inflates the apparent uniqueness rate and triggers wider partitioning for the wider tuples. 
On Q22, this created a perfect storm: high base cardinality &#215; five grouping sets &#215; heuristics tuned for a different data pattern.</p></li><li><p><strong>v1.3: one step forward, one step back</strong></p><p>DuckDB v1.3.0 added HyperLogLog-based adaptive hash table sizing (<a href="https://github.com/duckdb/duckdb/pull/17236">PR-17236</a>), which improves hash table cardinality estimates. DuckDB's own benchmarks showed TPC-DS Q67, another <code>ROLLUP</code> query, running ~2&#215; faster with this optimization. But the same release also added a correctness fix (<a href="https://github.com/duckdb/duckdb/pull/17259">PR-17259</a>) that disabled <em>all</em> column pruning below <code>ROLLUP</code> and <code>CUBE</code> operators. Instead of scanning the two <code>inventory</code> columns Q22 actually needs, it scanned all of them<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>.</p></li></ul><p>These two forces (repaired hash-table sizing, disabled column pruning) net out to ~50% slower than the v1.1.3 baseline. An improved column pruning fix landed in <a href="https://github.com/duckdb/duckdb/pull/20781">PR-20781</a>, which re-enabled column pruning while keeping a targeted guard only in <code>RemoveDuplicateGroups</code>. That fix ships with v1.5.0 GA. When it does, Q22 should return toward its v1.1.3 speed, or better. To my knowledge, this is the only public, per-query, multi-version TPC-DS benchmark of DuckDB; if you've seen another, I'd like to know about it.</p><div><hr></div><h2>Results: ClickBench</h2><p>ClickBench tests scan-heavy web analytics patterns on a single 100M-row table.</p><h3>Overall Version Progression</h3><p><strong>ClickBench Total Runtime (ms)</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Version} &amp; \\textbf{Total (ms)} &amp; \\textbf{vs. 
v1.0.0} \\\\\n\\hline\n\\text{v1.0.0} &amp; \\text{729} &amp; \\text{-} \\\\\n\\hline\n\\text{v1.1.3} &amp; \\text{651} &amp; \\text{10.7\\% faster} \\\\\n\\hline\n\\text{v1.2.2} &amp; \\text{536} &amp; \\text{26.5\\% faster} \\\\\n\\hline\n\\text{v1.3.2} &amp; \\text{555} &amp; \\text{23.9\\% faster} \\\\\n\\hline\n\\text{v1.4.4} &amp; \\text{398} &amp; \\text{45.4\\% faster} \\\\\n\\hline\n\\text{v1.5.0-dev} &amp; \\text{396} &amp; \\text{45.7\\% faster} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;BPOXUVEOUY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Note that v1.3.2 shows a slight regression vs. v1.2.2 in total ClickBench runtime (555ms vs 536ms). Given observed ClickBench variance in this matrix, I treat this as directional rather than a strong signal.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kjtm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kjtm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic 424w, https://substackcdn.com/image/fetch/$s_!kjtm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic 848w, https://substackcdn.com/image/fetch/$s_!kjtm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!kjtm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kjtm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic" width="1456" height="212" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80442,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oxbowresearch.com/i/189402596?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kjtm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic 424w, https://substackcdn.com/image/fetch/$s_!kjtm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic 848w, 
https://substackcdn.com/image/fetch/$s_!kjtm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic 1272w, https://substackcdn.com/image/fetch/$s_!kjtm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff94f656a-5071-4f9b-a26f-08ddc508f125_4967x724.heic 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3>Query Pattern Analysis</h3><p>ClickBench queries are categorized by pattern:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|}\n\\hline\n\\textbf{Pattern} &amp; \\textbf{v1.0.0 Mean (ms)} &amp; \\textbf{v1.5.0-dev Mean (ms)} &amp; \\textbf{Improvement} \\\\\n\\hline\n\\text{COUNT(*)} &amp; \\text{0.67} &amp; \\text{&#8764;0} &amp; \\text{+100\\%} \\\\\n\\hline\n\\text{GROUP BY (low cardinality)} &amp; \\text{4.63} &amp; \\text{3.25} &amp; \\text{+29.7\\%} \\\\\n\\hline\n\\text{GROUP BY (high cardinality)} &amp; \\text{7.68} &amp; \\text{4.42} &amp; \\text{+42.5\\%} \\\\\n\\hline\n\\text{String matching (LIKE)} &amp; \\text{5.86} &amp; \\text{1.57} &amp; \\text{+73.2\\%} \\\\\n\\hline\n\\text{ORDER BY with LIMIT} &amp; \\text{6.90} &amp; \\text{3.77} &amp; \\text{+45.4\\%} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;FDUMIVUPTG&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Key finding</strong>: String matching (<code>LIKE</code>) is the biggest winner at 73.2%, and high-cardinality <code>GROUP BY</code> improved more than low-cardinality (42.5% vs 29.7%). String hash caching (<a href="https://github.com/duckdb/duckdb/pull/18580">PR-18580</a>) is the most direct match for the LIKE gains: string processing changes have an obvious path to <code>LIKE</code> query performance. 
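</p><p>For reference, the Improvement column is simply the relative reduction from the v1.0.0 mean to the v1.5.0-dev mean; a quick check in Python, using values copied from the table above:</p>

```python
# Recompute the "Improvement" column: relative reduction in mean
# runtime from v1.0.0 to v1.5.0-dev, using the table's values (ms).
def improvement_pct(v100_mean_ms: float, v150_mean_ms: float) -> float:
    """Percent reduction in mean query time, rounded to one decimal."""
    return round((v100_mean_ms - v150_mean_ms) / v100_mean_ms * 100, 1)

print(improvement_pct(5.86, 1.57))  # String matching (LIKE): 73.2
print(improvement_pct(6.90, 3.77))  # ORDER BY with LIMIT: 45.4
```

<p>The two <code>GROUP BY</code> rows reproduce to within 0.1 percentage point of the published figures because the table's means are themselves rounded.</p><p>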
Dictionary-aware insertion (<a href="https://github.com/duckdb/duckdb/pull/15152">PR-15152</a>) and fewer aggregation allocations (<a href="https://github.com/duckdb/duckdb/pull/16849">PR-16849</a>) align with the high-cardinality <code>GROUP BY</code> gains concentrated in v1.2-v1.3. Higher-load-factor probing (<a href="https://github.com/duckdb/duckdb/pull/17718">PR-17718</a>) fits the continued ORDER BY improvement through v1.4. I haven't run micro-benchmarks to isolate each contribution, but the per-pattern distribution matches the change history well.</p><div><hr></div><h2>Results: SSB (SF=10)</h2><p>The Star Schema Benchmark tests classic dimensional model queries.</p><h3>Overall Version Progression</h3><p><strong>SSB Total Runtime (ms)</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Version} &amp; \\textbf{Total (ms)} &amp; \\textbf{vs. v1.0.0} \\\\\n\\hline\n\\text{v1.0.0} &amp; \\text{1,356} &amp; \\text{-} \\\\\n\\hline\n\\text{v1.1.3} &amp; \\text{1,005} &amp; \\text{25.9\\% faster} \\\\\n\\hline\n\\text{v1.2.2} &amp; \\text{1,006} &amp; \\text{25.8\\% faster} \\\\\n\\hline\n\\text{v1.3.2} &amp; \\text{1,054} &amp; \\text{22.3\\% faster} \\\\\n\\hline\n\\text{v1.4.4} &amp; \\text{1,071} &amp; \\text{21.0\\% faster} \\\\\n\\hline\n\\text{v1.5.0-dev} &amp; \\text{938} &amp; \\text{30.8\\% faster} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;VPRPWNNAHF&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DcUc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!DcUc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic 424w, https://substackcdn.com/image/fetch/$s_!DcUc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic 848w, https://substackcdn.com/image/fetch/$s_!DcUc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic 1272w, https://substackcdn.com/image/fetch/$s_!DcUc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DcUc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic" width="1456" height="212" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73866,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oxbowresearch.com/i/189402596?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DcUc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic 424w, https://substackcdn.com/image/fetch/$s_!DcUc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic 848w, https://substackcdn.com/image/fetch/$s_!DcUc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic 1272w, https://substackcdn.com/image/fetch/$s_!DcUc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac2ad43f-0cef-4b20-b85d-5dbb0fde1f92_4967x724.heic 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The per-query heatmap makes the uneven progression visible. 
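</p><p>A heatmap like this colors each cell by runtime relative to the v1.0.0 baseline. As an illustrative sketch (the function and variable names are mine, not BenchBox's), here is the same normalization applied to the SSB suite totals reported above:</p>

```python
# Normalize runtimes against the v1.0.0 baseline, as the heatmap does
# per query; shown here on the SSB suite totals (ms) for brevity.
ssb_total_ms = {
    "v1.0.0": 1356, "v1.1.3": 1005, "v1.2.2": 1006,
    "v1.3.2": 1054, "v1.4.4": 1071, "v1.5.0-dev": 938,
}

def pct_faster_than_baseline(times_ms, baseline="v1.0.0"):
    """Percent faster than the baseline version (positive = faster)."""
    base = times_ms[baseline]
    return {v: round((1 - t / base) * 100, 1) for v, t in times_ms.items()}

print(pct_faster_than_baseline(ssb_total_ms)["v1.5.0-dev"])  # 30.8
```

<p>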
v1.1/v1.2 show broad improvement (bluer cells), v1.3/v1.4 regress on Flight 3-4 joins (warmer cells), and v1.5 recovers to the best result across most queries:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E3A4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E3A4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic 424w, https://substackcdn.com/image/fetch/$s_!E3A4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic 848w, https://substackcdn.com/image/fetch/$s_!E3A4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic 1272w, https://substackcdn.com/image/fetch/$s_!E3A4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E3A4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic" width="1456" height="486" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:486,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:221552,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oxbowresearch.com/i/189402596?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E3A4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic 424w, https://substackcdn.com/image/fetch/$s_!E3A4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic 848w, https://substackcdn.com/image/fetch/$s_!E3A4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic 1272w, https://substackcdn.com/image/fetch/$s_!E3A4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd54de6c-13ad-4638-a3d1-9f45e98022bb_512x512.heic 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>SSB is the one benchmark where improvement isn't consistent across every version. v1.3.2 and v1.4.4 are both <em>slower</em> than v1.1.3/v1.2.2, but v1.5.0-dev fully recovers and is the fastest version overall (938ms vs 1,005ms for v1.1.3). Per-subquery plan and profiling diffs show that the v1.3/v1.4 slowdown is concentrated in specific Flight 3-4 join queries rather than spread evenly.</p><p>One outlier worth noting: <code>Q2.2</code> jumps from 4ms to 53ms in v1.4.4 (and 44ms in v1.5.0-dev), a large percentage regression but small in absolute terms. 
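</p><p>The arithmetic makes the point directly; with the numbers above (my own illustration, not a BenchBox artifact):</p>

```python
# Q2.2's regression is large relative to itself but small relative to
# the suite. Numbers (ms) come from the SSB tables above.
q2_2_before, q2_2_after = 4, 53   # Q2.2 median before/after the jump
ssb_total_v144 = 1071             # v1.4.4 total SSB runtime

ratio = q2_2_after / q2_2_before
added_share = (q2_2_after - q2_2_before) / ssb_total_v144 * 100

print(f"{ratio:.2f}x slower, +{added_share:.1f}% of total runtime")
# -> 13.25x slower, +4.6% of total runtime
```

<p>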
Because <code>Q2.2</code> contributes &lt;50ms to total runtime in every version, the macro story plays out in the Flight 3-4 joins.</p><p>The persistent regression is concentrated in <strong>Flights 3 and 4</strong>, especially <code>Q4.1</code> and <code>Q4.2</code>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|l|l|}\n\\hline\n\\textbf{Query} &amp; \\textbf{v1.2.2 (ms)} &amp; \\textbf{v1.3.2 (ms)} &amp; \\textbf{v1.4.4 (ms)} &amp; \\textbf{v1.5.0-dev (ms)} &amp; \\textbf{Key movement} \\\\\n\\hline\n\\text{Q3.1} &amp; \\text{60} &amp; \\text{64} &amp; \\text{58} &amp; \\text{39} &amp; \\text{Temporary +6.7\\% in v1.3.2, then -32.8\\% in v1.5} \\\\\n\\hline\n\\text{Q3.2} &amp; \\text{60} &amp; \\text{64} &amp; \\text{56} &amp; \\text{37} &amp; \\text{Temporary +6.7\\% in v1.3.2, then -38.3\\% in v1.5} \\\\\n\\hline\n\\text{Q4.1} &amp; \\text{77} &amp; \\text{89} &amp; \\text{78} &amp; \\text{64} &amp; \\text{Persistent regression in v1.3.2 (+15.6\\%), partial recovery in v1.4/v1.5} \\\\\n\\hline\n\\text{Q4.2} &amp; \\text{56} &amp; \\text{58} &amp; \\text{54} &amp; \\text{54} &amp; \\text{Modest regression in v1.3.2 (+3.6\\%), stabilizes from v1.4 onward} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;UFCHRQRTHN&quot;}" data-component-name="LatexBlockToDOM"></div><p>I traced the SSB slowdown to a two-stage execution story, similar to the Q22 analysis approach.</p><ul><li><p><strong>Stage 1, v1.2.2 -&gt; v1.3.2/v1.4.4: regression without plan-shape change</strong></p><p>For representative regressors (<code>Q3.2</code>, <code>Q4.1</code>), the <code>EXPLAIN</code> plans keep the same join skeleton, join predicates, and lineorder full-scan shape across v1.2.2, v1.3.2, and v1.4.4. 
That makes a structural join-plan rewrite an unlikely primary cause for this regression pattern, and it argues against <a href="https://github.com/duckdb/duckdb/pull/16443">PR-16443</a> being the dominant driver here.</p></li><li><p><strong>Stage 2, v1.4.4 -&gt; v1.5.0-dev: targeted recovery from scan-time filtering</strong></p><p>In v1.5.0-dev, <code>EXPLAIN ANALYZE</code> for <code>Q3.2</code> and <code>Q4.1</code> shows <code>Dynamic Filters</code> on lineorder scan keys, consistent with the Bloom/SIP work in <a href="https://github.com/duckdb/duckdb/pull/19502">PR-19502</a>. This aligns with the strong recovery in <code>Q3.1</code> and <code>Q3.2</code>, and partial recovery in <code>Q4.1</code>.</p></li></ul><p>The scan-time filtering recovery is strong enough that total SSB runtime in v1.5.0-dev (938ms) beats the previous best (v1.1.3 at 1,005ms). <code>Q4.2</code> remains slightly slower than v1.2.2, but the gains in Flight 1 and Flight 3 queries more than compensate.</p><div><hr></div><h2>Analysis and Insights</h2><h3>Regression Analysis</h3><p>Suite-level improvements don't tell the whole story. 
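</p><p>Surfacing such cases is mechanical: scan per-query medians for adjacent-version slowdowns above a threshold. A simplified sketch of that comparison (my own helper, not BenchBox code), seeded with the three TPC-H regressions reported below:</p>

```python
# Flag per-query regressions between adjacent versions (>10% slower).
# Medians (ms) are the three TPC-H cases from the regression table.
medians = {
    "Q12": {"v1.3.2": 98, "v1.4.4": 115},
    "Q20": {"v1.3.2": 95, "v1.4.4": 110},
    "Q19": {"v1.2.2": 140, "v1.3.2": 156},
}

def adjacent_regressions(per_query, threshold_pct=10.0):
    """Return (query, older, newer, pct_slower) tuples above threshold."""
    flagged = []
    for query, by_version in per_query.items():
        versions = list(by_version)  # dicts preserve insertion order
        for older, newer in zip(versions, versions[1:]):
            pct = (by_version[newer] / by_version[older] - 1) * 100
            if pct > threshold_pct:
                flagged.append((query, older, newer, round(pct, 1)))
    return flagged

for row in adjacent_regressions(medians):
    print(row)  # e.g. ('Q12', 'v1.3.2', 'v1.4.4', 17.3)
```

<p>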
Optimizations for one pattern can regress another, and the per-query data shows where.</p><p><strong>Adjacent-version TPC-H regressions (&gt;10% slower)</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|l|}\n\\hline\n\\textbf{Version Jump} &amp; \\textbf{Query} &amp; \\textbf{Older (ms)} &amp; \\textbf{Newer (ms)} &amp; \\textbf{Slower by} \\\\\n\\hline\n\\text{v1.3.2 &#8594; v1.4.4} &amp; \\text{Q12} &amp; \\text{98} &amp; \\text{115} &amp; \\text{+17.3\\%} \\\\\n\\hline\n\\text{v1.3.2 &#8594; v1.4.4} &amp; \\text{Q20} &amp; \\text{95} &amp; \\text{110} &amp; \\text{+15.8\\%} \\\\\n\\hline\n\\text{v1.2.2 &#8594; v1.3.2} &amp; \\text{Q19} &amp; \\text{140} &amp; \\text{156} &amp; \\text{+11.4\\%} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;NCHHIDSUEK&quot;}" data-component-name="LatexBlockToDOM"></div><p>All three are adjacent-version regressions; none accumulate to v1.5.0-dev, where every TPC-H query is faster or equal to v1.0.0.</p><h3>How I Ran These Benchmarks</h3><p>All benchmarks used BenchBox for reproducibility.</p><p><strong>Environment details</strong>:</p><ul><li><p>Hardware: Mac Mini (M4, 10 cores, 16 GB unified memory)</p></li><li><p>OS captured in benchmark artifacts: Darwin 25.3.0</p></li><li><p>Python runtime: 3.10.17</p></li><li><p>BenchBox CLI version in this environment: 0.1.3</p></li><li><p>DuckDB config: <code>threads=10</code>, <code>memory_limit='12GB'</code>, <code>enable_progress_bar=false</code>, <code>result_cache_enabled=false</code></p></li></ul><p><strong>Run protocol</strong>:</p><ul><li><p>Data generation ran once per benchmark.</p></li><li><p>Load phase ran once per version and benchmark.</p></li><li><p>Power phase ran 3 times per version and benchmark; median reported in all tables.</p></li><li><p>No explicit OS page-cache flush between power runs, so measurements reflect warm filesystem cache behavior with DuckDB result cache 
disabled.</p></li></ul><p><strong>Aggregation method</strong>: BenchBox computes per-run per-query medians and per-run Power@Size. I report the median of three runs for each published metric (per-query time, total runtime, Power@Size). I do not recompute Power@Size from cross-run per-query medians.</p><p>Representative commands:</p><pre><code># TPC-H SF=10
uv run benchbox run --platform duckdb \
  --benchmark tpch --scale 10 \
  --phases generate,load,power \
  --output results/duckdb-v150dev-tpch-sf10</code></pre><p>Each version was tested using isolated Python environments:</p><pre><code>uv pip install duckdb==1.0.0  # Baseline
uv pip install duckdb==1.1.3  # Last v1.1
uv pip install duckdb==1.2.2  # Last v1.2
uv pip install duckdb==1.3.2  # Last v1.3
uv pip install duckdb==1.4.4  # Current LTS
uv pip install duckdb==1.5.0.dev311  # Pre-GA v1.5 dev build</code></pre><p>Three runs per benchmark per version, median reported.</p><p>Run-to-run spread (max within-version spread across the matrix, non-zero power runtimes): TPC-H 2.7%, TPC-DS 72.7%, ClickBench 54.1%, SSB 13.0%. TPC-DS v1.0.0 is the primary outlier (one run at 734s vs median 441s), and ClickBench v1.1.3 had one anomalous run at 999ms vs median 651ms.</p><p>Interpretation threshold used in this post:</p><ul><li><p><code>&lt;2%</code> runtime deltas are directional unless variance is low for that suite/version.</p></li><li><p><code>&gt;5%</code> shifts are treated as stronger signals when also visible in per-query tables.</p></li></ul><p>Reproducibility artifacts:</p><ul><li><p><a href="https://gist.github.com/joeharris76/179602f9bebf96228717e7e39937615e">BenchBox Test Matrix Runner</a></p></li><li><p><a href="https://gist.github.com/joeharris76/6752d2d9c41a5df8f47da4d6cabd74c4">Test Matrix Result Analysis</a></p></li><li><p><a href="https://gist.github.com/joeharris76/4ad526c9da361aba9baab3a6c40f943c">Version Matrix Summary CSV</a></p></li><li><p><a href="https://gist.github.com/joeharris76/78415595e8aa0702de75642197bf28a7">Version Matrix Metrics JSON</a></p></li></ul><div><hr></div><h2>Conclusions: v1.0.0 to v1.5.0-dev</h2><ol><li><p><strong>Cumulative improvement is real</strong>: 1.67&#215; TPC-H, 1.73&#215; TPC-DS, 1.84&#215; ClickBench, 1.45&#215; SSB</p></li><li><p><strong>Major rewrites delivered</strong>: v1.1 lifted TPC-H Power@Size by 39%; overall I/O and runtime improvements reduced ClickBench runtime 46%</p></li><li><p><strong>Regressions are contained</strong>: Regressions exist but are modest (11-17%) and don't accumulate. 
No TPC-H query is slower in v1.5.0-dev than v1.0.0.</p></li><li><p><strong>Workload differences matter</strong>: SSB regressed in v1.3/v1.4, but v1.5.0-dev fully recovers to the best result overall (1.45&#215;)</p></li><li><p><strong>Open source enables attribution</strong>: specific PRs can be mapped to specific benchmark shifts, which is rare in database benchmarking</p></li></ol><p>The bottom line: upgrade. DuckDB has earned your trust and, if you're on older versions, you're likely leaving meaningful performance on the table. In this matrix, v1.5.0-dev improves runtime by 40.0% on TPC-H, 45.7% on ClickBench, and 30.8% on SSB versus v1.0.0, with a 73.5% higher TPC-DS Power@Size score. v1.4.4 is the safe GA choice today. v1.5.0 ships March 2, 2026; run your critical queries against the pre-release build now so you know what to expect before upgrading.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://oxbowresearch.com/p/how-much-faster-is-duckdb-15-vs-10?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading Oxbow Research! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oxbowresearch.com/p/how-much-faster-is-duckdb-15-vs-10?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oxbowresearch.com/p/how-much-faster-is-duckdb-15-vs-10?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><div><hr></div><h3>Limitations and Caveats</h3><ul><li><p>Direct evidence is cited inline. 
Plausible hypotheses are explicitly noted as such.</p></li><li><p>This uses a pre-GA DuckDB 1.5 build (<code>1.5.0.dev311</code>), so final GA behavior may differ.</p></li><li><p>Results are from one hardware profile and may not extrapolate to x86 server environments.</p></li><li><p>Power-phase timings were collected without forced OS cache eviction, so this is a warm-cache profile.</p></li><li><p>TPC-DS v1.0.0 shows high run-to-run variance (one run at 734s vs median 441s), driven by Q23A/Q23B instability.</p><ul><li><p>Later versions are stable (&lt;5% spread). Median aggregation absorbs these outliers.</p></li></ul></li></ul><div><hr></div><h3>References &amp; Resources</h3><ul><li><p><a href="https://duckdb.org/docs/stable/guides/performance/benchmarks">DuckDB Official Benchmarks Guide</a></p></li><li><p><a href="https://github.com/duckdb/duckdb/releases">DuckDB GitHub Releases</a></p></li><li><p><a href="https://github.com/duckdb/duckdb/pull/17584">New Sorting Implementation PR-17584</a></p></li><li><p><a href="https://duckdb.org/release_calendar">DuckDB Release Calendar</a></p></li><li><p><a href="https://github.com/elementalconductor/benchbox">BenchBox GitHub Repository</a></p></li></ul><div><hr></div><h2>Footnotes</h2><div><hr></div><p><em>This post is part of the DuckDB Performance series at Oxbow Research. I track DuckDB's evolution with systematic benchmarks and technical analysis across each release.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://motherduck.com/blog/faster-ducks/">Faster Ducks</a> - MotherDuck Blog, 2025. 
Performance analysis of DuckDB evolution.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://duckdb.org/2024/09/09/announcing-duckdb-110">Announcing DuckDB 1.1.0 "Eatoni"</a> - DuckDB Blog, September 2024.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://duckdb.org/2025/02/05/announcing-duckdb-120">Announcing DuckDB 1.2.0 "Histrionicus"</a> - DuckDB Blog, February 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://duckdb.org/2025/05/21/announcing-duckdb-130">Announcing DuckDB 1.3.0 "Ossivalis"</a> - DuckDB Blog, May 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p><a href="https://duckdb.org/2025/09/16/announcing-duckdb-140">Announcing DuckDB 1.4.0 "Andium"</a> - DuckDB Blog, September 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://duckdb.org/release_calendar">DuckDB Release Calendar</a> and <a href="https://github.com/duckdb/duckdb/releases">DuckDB GitHub Releases</a> - accessed February 25, 2026. 
Release calendar lists 1.5.0 as upcoming on March 2, 2026; latest GA listed in releases is v1.4.4 (January 26, 2026).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p><a href="https://duckdb.org/2025/09/24/sorting-again">Redesigning DuckDB's Sort, Again</a> - DuckDB Blog, September 2025. Benchmarks on M1 Max MacBook Pro (10 cores, 64 GB RAM): 1.7-2.7&#215; on random data, up to 10&#215; on pre-sorted data, with wide-table sorting 2-3.4&#215; faster at SF10-SF100.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>The column pruning issue under ROLLUP/CUBE was independently documented by GitHub user heldeo, who measured 9.3&#215; more columns scanned on TPC-DS Q36 (another ROLLUP query) when running on S3-backed Parquet. See <a href="https://gist.github.com/heldeo/d11057c83d76e6d4eee7bf9a05136a72">Column pruning disabled for GROUP BY ROLLUP/CUBE/GROUPING SETS</a>. The underlying correctness fix was <a href="https://github.com/duckdb/duckdb/pull/17259">PR-17259</a>; the targeted performance fix is <a href="https://github.com/duckdb/duckdb/pull/20781">PR-20781</a>, shipping in v1.5.0.</p></div></div>]]></content:encoded></item><item><title><![CDATA[The Egress Tax: how cloud providers engineer data gravity]]></title><description><![CDATA[Free ingress, cheap storage, 18x egress markup. 
The asymmetry is intentional.]]></description><link>https://oxbowresearch.com/p/the-egress-tax-how-data-gravity-subsidizes</link><guid isPermaLink="false">https://oxbowresearch.com/p/the-egress-tax-how-data-gravity-subsidizes</guid><dc:creator><![CDATA[Joe Harris]]></dc:creator><pubDate>Wed, 18 Feb 2026 16:46:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RSVi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR</strong>: Cloud egress is priced at 18-24x wholesale cost to create switching costs, not recover bandwidth expenses. But internet egress is only part of the story: the cross-AZ and cross-region charges baked into every HA deployment and pipeline run are continuous and often larger. The hyperscalers' offer to waive egress only applies if you fully close your account. Design for locality, negotiate egress into contracts, and budget for intra-cloud transfer as a line item.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RSVi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RSVi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png 424w, https://substackcdn.com/image/fetch/$s_!RSVi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png 848w, 
https://substackcdn.com/image/fetch/$s_!RSVi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png 1272w, https://substackcdn.com/image/fetch/$s_!RSVi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RSVi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png" width="1024" height="358" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:358,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131074,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://oxbowresearch.com/i/188337021?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RSVi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png 424w, https://substackcdn.com/image/fetch/$s_!RSVi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png 
848w, https://substackcdn.com/image/fetch/$s_!RSVi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png 1272w, https://substackcdn.com/image/fetch/$s_!RSVi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb47e097a-4ec1-4c4b-89a9-2827662e608f_1024x358.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>The economics of data gravity</h2><p>Move 1 TB of data into AWS: free. Move it out: $90. 
The actual cost of that bandwidth to the provider? Roughly $0.005/GB, based on wholesale transit pricing<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><p>That's an 18x markup. Not 2x, not 5x. Eighteen times the actual cost.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|l|}\n\\hline\n\\textbf{Direction} &amp; \\textbf{AWS} &amp; \\textbf{GCP} &amp; \\textbf{Azure} &amp; \\textbf{Wholesale} \\\\\n\\hline\n\\text{Ingress (in)} &amp; \\text{Free} &amp; \\text{Free} &amp; \\text{Free} &amp; \\text{&#8764;\\$0.005/GB} \\\\\n\\hline\n\\text{Egress (out)} &amp; \\text{\\$0.09/GB} &amp; \\text{\\$0.12/GB} &amp; \\text{\\$0.087/GB} &amp; \\text{&#8764;\\$0.005/GB} \\\\\n\\hline\n\\text{Markup} &amp; \\text{&#8764;18x} &amp; \\text{&#8764;24x} &amp; \\text{&#8764;17x} &amp; \\text{n/a} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;EZVMAVDRSD&quot;}" data-component-name="LatexBlockToDOM"></div><p>I don't think this is price gouging in the traditional sense. It's something more deliberate: egress fees are designed to create switching costs. The "roach motel" model of cloud economics: data checks in but doesn't check out.</p><div><hr></div><h3>What data gravity means</h3><p>The concept of "data gravity" was coined by Dave McCrory in 2010<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. The basic insight: data attracts applications, services, and more data. The larger your dataset, the harder it becomes to move, not because of technical limitations but because of the ecosystem that forms around it.</p><p>In my mental model, teams hit a "gravity threshold" somewhere between 10 TB and 100 TB. Below that, migration is annoying but feasible. Above it, the conversation shifts from "should we move?" 
to "can we afford to move?"</p><h3>The math of switching costs</h3><p><strong>Example: 100 TB analytics workload</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Action} &amp; \\textbf{AWS Cost} &amp; \\textbf{Transfer Time (1 Gbps)} \\\\\n\\hline\n\\text{Move data in} &amp; \\text{\\$0} &amp; \\text{&#8764;10 days} \\\\\n\\hline\n\\text{Move data out} &amp; \\text{\\$9,000} &amp; \\text{&#8764;10 days} \\\\\n\\hline\n\\text{Monthly S3 storage} &amp; \\text{&#8764;\\$2,300} &amp; \\text{n/a} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;BRLNHFBLYZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>The egress cost equals roughly 4 months of storage. That's your "tax" for leaving. And that's just the transfer fee: add migration engineering, testing, and risk, and the real switching cost is multiples higher. Even at volume discount tiers ($0.05/GB at 150 TB+ on AWS<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>), the markup over wholesale is still 10x.</p><div><hr></div><h2>The "roach motel" business model</h2><h3>How free ingress creates lock-in</h3><p>The cloud acquisition model works like this:</p><p><strong>Stage 1: Land</strong> (free ingress + cheap storage)</p><ul><li><p>"Try us out! 
Data transfer is free!"</p></li><li><p>Storage is cheap: $0.023/GB/month for S3 Standard<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p></li><li><p>Low barrier to entry, easy first step</p></li></ul><p><strong>Stage 2: Expand</strong> (add compute, services, integrations)</p><ul><li><p>Now you're running queries, building pipelines</p></li><li><p>Add Lambda, Athena, SageMaker</p></li><li><p>Interconnections multiply</p></li></ul><p><strong>Stage 3: Lock</strong> (data gravity + egress costs)</p><ul><li><p>50 TB and growing</p></li><li><p>15 services connected</p></li><li><p>Egress would cost $4,500+</p></li><li><p>Too expensive to leave</p></li></ul><p>Free ingress is a marketing expense. Cheap storage is a retention mechanism. Expensive egress is a switching cost.</p><h3>Example: A migration that won't happen</h3><p><strong>Scenario</strong>: A mid-size company considers migrating 500 TB from Redshift on AWS to BigQuery on GCP.</p><p><strong>The first calculation</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|}\n\\hline\n\\textbf{Item} &amp; \\textbf{Monthly Cost} \\\\\n\\hline\n\\text{Redshift (current)} &amp; \\text{\\$85,000} \\\\\n\\hline\n\\text{BigQuery (estimate)} &amp; \\text{\\$68,000} \\\\\n\\hline\n\\text{Monthly savings} &amp; \\text{\\$17,000} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;MIUDSKXEYG&quot;}" data-component-name="LatexBlockToDOM"></div><p>"A 20% savings! 
Let's migrate!"</p><p><strong>The real calculation</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|}\n\\hline\n\\textbf{Item} &amp; \\textbf{One-Time Cost} \\\\\n\\hline\n\\text{AWS egress (500 TB &#215; \\$0.09/GB)} &amp; \\text{\\$45,000} \\\\\n\\hline\n\\text{GCS ingress} &amp; \\text{\\$0} \\\\\n\\hline\n\\text{Staging/intermediate storage} &amp; \\text{\\$5,000} \\\\\n\\hline\n\\text{Migration engineering (6 weeks)} &amp; \\text{\\$60,000} \\\\\n\\hline\n\\text{Validation and testing} &amp; \\text{\\$15,000} \\\\\n\\hline\n\\textbf{Total migration cost} &amp; \\textbf{\\$125,000} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;NGQRBDMRQC&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Break-even</strong>: $125,000 / $17,000 = 7.4 months, <em>if</em> nothing goes wrong during migration.</p><p>The customer stays. Data gravity wins. And "next year" never comes, because by then the dataset has grown and the math is even worse.</p><p>It's not just the hyperscalers. Analytics vendors inherit or create their own gravity. Snowflake's storage sits on the underlying cloud provider (S3, Azure Blob, GCS)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>, so leaving Snowflake still triggers cloud egress. Databricks avoids direct egress exposure by keeping data in the customer's account, but Unity Catalog creates its own form of gravity: governance metadata is harder to migrate than raw data. 
And as platforms like ClickHouse Cloud mature, they add transfer-based charges that look a lot like the hyperscaler playbook<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><div><hr></div><h2>Multi-cloud multiplies egress</h2><h3>Why multi-cloud costs more than single-cloud</h3><p>The promise of multi-cloud: "Avoid lock-in by spreading across clouds."</p><p>The reality: multi-cloud multiplies egress costs.</p><p><strong>Example: Cross-cloud analytics pipeline</strong></p><pre><code>Data lands in AWS S3
     &#8595; ($0.09/GB egress)
Processing in GCP BigQuery
     &#8595; ($0.12/GB egress)
Visualization in Azure Power BI</code></pre><p>Every stage of the pipeline triggers egress. For a workload processing 10 TB/day:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Path} &amp; \\textbf{Daily Cost} &amp; \\textbf{Monthly Cost} \\\\\n\\hline\n\\text{AWS &#8594; GCP} &amp; \\text{\\$900} &amp; \\text{\\$27,000} \\\\\n\\hline\n\\text{GCP &#8594; Azure} &amp; \\text{\\$1,200} &amp; \\text{\\$36,000} \\\\\n\\hline\n\\textbf{Total egress} &amp; \\textbf{\\$2,100} &amp; \\textbf{\\$63,000} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;JNTYMGOHVZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>That's $63,000/month <em>on top of</em> compute and storage costs, purely from egress.</p><p>Multi-cloud can work if you minimize cross-cloud data movement: separate workloads by region or by function (transactional on Cloud A, analytics on Cloud B) and batch-sync once daily instead of streaming. Active-active replication across clouds is the expensive extreme, justified only for critical availability requirements.</p><div><hr></div><h2>The transfer costs inside your cloud</h2><p>Most egress discussions focus on internet egress: data leaving the cloud entirely. But the transfer costs that actually surprise teams are the ones <em>inside</em> the cloud. Cross-AZ and cross-region transfers are billed quietly, buried in line items that most engineers never see until someone audits the bill.</p><h3>Cross-AZ: the invisible tax on high availability</h3><p>Every cloud provider charges for data that crosses availability zone boundaries. 
Since multi-AZ deployments are the default for production workloads, this cost is effectively baked into any serious architecture.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Provider} &amp; \\textbf{Cross-AZ Cost} &amp; \\textbf{Notes} \\\\\n\\hline\n\\text{AWS} &amp; \\text{\\$0.01/GB each direction} &amp; \\text{\\$0.02/GB round-trip} \\\\\n\\hline\n\\text{GCP} &amp; \\text{\\$0.01/GB} &amp; \\text{Same within a region} \\\\\n\\hline\n\\text{Azure} &amp; \\text{Free within a region} &amp; \\text{But see caveat below} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;UCVRUWLRGH&quot;}" data-component-name="LatexBlockToDOM"></div><p>Sources: <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>, <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>, <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><p>Azure's free cross-AZ transfer looks like a clear win, but there's a significant catch: as of early 2026, roughly a third of Azure's public regions still don't support availability zones at all<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. That includes GA regions like North Central US, Canada East, UK West, West US, and Australia Southeast. If your workload runs in one of these regions, "multi-AZ" isn't an option. Your HA strategy requires cross-region replication, which costs $0.02-0.08/GB<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>, the same range as AWS and GCP. 
Azure's free cross-AZ advantage only applies if you're in a region that actually has AZs.</p><p>On AWS and GCP, that $0.01/GB sounds trivial until you calculate the volume. The catch is that AWS charges $0.01/GB in <em>each direction</em> for cross-AZ traffic between services like EC2, RDS, and Redshift<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>, making the effective cost $0.02/GB round-trip. Notably, S3 transfers within the same region are free regardless of AZ, so services reading from S3 (including Redshift managed storage) don't incur this charge.</p><p>The services that <em>do</em> generate cross-AZ costs are the ones that communicate directly: RDS replication, EC2-to-RDS queries, load balancers distributing across AZs, and NAT gateways.</p><p><strong>Example: Analytics pipeline with cross-AZ overhead</strong></p><p>Consider a typical setup: an RDS instance in multi-AZ with an ETL process running on EC2 in a different AZ.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Component} &amp; \\textbf{Calculation} &amp; \\textbf{Monthly Cost} \\\\\n\\hline\n\\text{RDS multi-AZ replication (500 GB/day writes)} &amp; \\text{15,000 GB/mo &#215; \\$0.02/GB} &amp; \\text{\\$300} \\\\\n\\hline\n\\text{ETL reads 5 TB/mo from RDS (cross-AZ)} &amp; \\text{5,000 GB &#215; \\$0.02/GB} &amp; \\text{\\$100} \\\\\n\\hline\n\\text{ETL writes to S3 (same region)} &amp; \\text{Free} &amp; \\text{\\$0} \\\\\n\\hline\n\\text{Redshift reads from S3 managed storage} &amp; \\text{Free} &amp; \\text{\\$0} \\\\\n\\hline\n\\textbf{Monthly cross-AZ total} &amp; \\text{} &amp; \\textbf{\\$400} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;JINKWLHGWY&quot;}" data-component-name="LatexBlockToDOM"></div><p>RDS multi-AZ replication is the main cost driver: it synchronously replicates every write to the standby in another 
AZ, charged at $0.01/GB in each direction. The downstream S3 and Redshift steps are free because S3 is a regional service.</p><p>That's $4,800/year for a modest workload. Scale the database to 2 TB/day of writes and RDS replication alone hits $1,200/month, or $14,400/year in transfer costs that don't appear in any compute or storage line item.</p><h3>Cross-region: the compliance and DR multiplier</h3><p>Cross-region transfer costs apply to disaster recovery, compliance-driven replication, and serving global users from regional data stores. On AWS, cross-region is double the cross-AZ rate. On GCP and Azure, the range is wider.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Provider} &amp; \\textbf{Cross-Region Cost} &amp; \\textbf{Example Route} \\\\\n\\hline\n\\text{AWS} &amp; \\text{\\$0.02/GB} &amp; \\text{us-east-1 &#8594; eu-west-1} \\\\\n\\hline\n\\text{GCP} &amp; \\text{\\$0.01-0.08/GB} &amp; \\text{Varies by continent pair} \\\\\n\\hline\n\\text{Azure} &amp; \\text{\\$0.02-0.08/GB} &amp; \\text{Varies by region pair} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;KGGZKEJUAK&quot;}" data-component-name="LatexBlockToDOM"></div><p>Sources: <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>, <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>, <a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a></p><p>If you're replicating a 100 TB data lake from US to EU for GDPR compliance, that's a one-time $2,000 transfer cost on AWS, plus ongoing replication costs for new data. 
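<p>The replication math here reduces to a one-line helper. A minimal sketch (the <code>transfer_cost</code> function and constant names are illustrative, not part of any cloud SDK; the flat $0.02/GB is the AWS cross-region rate cited above):</p>

```python
# Back-of-the-envelope cross-region transfer cost at the AWS rate
# cited in this post ($0.02/GB, e.g. us-east-1 -> eu-west-1).
AWS_CROSS_REGION_PER_GB = 0.02  # USD per GB

def transfer_cost(gb: float, rate_per_gb: float = AWS_CROSS_REGION_PER_GB) -> float:
    """USD cost to move `gb` gigabytes across regions at a flat per-GB rate."""
    return gb * rate_per_gb

initial_copy = transfer_cost(100_000)  # 100 TB data lake, one-time copy
daily_sync = transfer_cost(500)        # 500 GB of changes per day
annual_sync = daily_sync * 365

print(f"Initial copy: ${initial_copy:,.0f}")  # $2,000
print(f"Daily sync:   ${daily_sync:,.0f}")    # $10
print(f"Annual sync:  ${annual_sync:,.0f}")   # $3,650
```

<p>The point of writing it down as code: the ongoing sync, not the one-time copy, dominates over a multi-year horizon.</p>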
A daily sync of 500 GB of changes costs $10/day, or $3,650/year, just for the transfer.</p><h3>Why this matters more than internet egress</h3><p>Here's the thing most people miss: internet egress is a tax you pay <em>occasionally</em>, when migrating or serving external users. Cross-AZ and cross-region transfer costs are taxes you pay <em>continuously</em>, on every read, every replication, every pipeline run. They compound. For analytics workloads with multi-AZ HA and cross-region DR, intra-cloud transfer can easily exceed internet egress in aggregate.</p><div><hr></div><h2>Reducing the tax</h2><h3>Strategy 1: Minimize data movement by design</h3><p>The cheapest data transfer is the one that doesn't happen. This applies to cross-AZ and cross-region transfers just as much as internet egress.</p><p><strong>Design principles</strong>:</p><ul><li><p>Process data where it lives</p></li><li><p>Aggregate before moving</p></li><li><p>Cache at edges</p></li><li><p>Batch instead of stream when possible</p></li></ul><p><strong>Example transformation</strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Approach} &amp; \\textbf{Daily Transfer} &amp; \\textbf{Monthly Egress} \\\\\n\\hline\n\\text{Stream 1 TB raw data} &amp; \\text{1,000 GB} &amp; \\text{\\$2,700} \\\\\n\\hline\n\\text{Pre-aggregate to 10 GB} &amp; \\text{10 GB} &amp; \\text{\\$27} \\\\\n\\hline\n\\textbf{Savings} &amp; \\textbf{99\\%} &amp; \\textbf{\\$2,673} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;LRXDFGCNMG&quot;}" data-component-name="LatexBlockToDOM"></div><p>If your analytics pipeline can work with pre-aggregated summaries, do the aggregation where the data lives.</p><h3>Strategy 2: Negotiate egress into contracts</h3><p>If you're spending $1M+ annually, negotiate egress relief into your enterprise agreement: committed egress credits, reduced rates tied to spend commitment, or free egress for specific use 
cases (backup, DR, compliance). GCP's more aggressive egress pricing is a useful leverage point against AWS and Azure.</p><h3>Strategy 3: Use physical transfer for bulk migrations</h3><p>For migrations over 100 TB, physical transfer devices can beat network transfer on both cost and time. AWS Snowball Edge<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a>, GCP Transfer Appliance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a>, and Azure Data Box<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a> all ship storage devices to your datacenter, letting you move data without touching egress pricing at all. Check each provider's current device specs; capacities and product lines change frequently.</p><h3>Strategy 4: Calculate true TCO including transfer costs</h3><p>Most platform comparisons only account for compute and storage. Your TCO model should also include internet egress, cross-AZ transfer (how many AZs does the architecture span?), cross-region transfer (do you replicate for DR or compliance?), and one-time switching costs. 
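<p>One way to keep these line items visible is to model them explicitly. A sketch of such a TCO model (Python; the function shape and all workload numbers are hypothetical placeholders, with the per-GB rates taken from the AWS figures cited in this post):</p>

```python
# Monthly TCO sketch with transfer costs as first-class line items.
# Rates below are the AWS figures cited in this post; workload numbers
# are placeholders to plug your own architecture into.
def monthly_tco(compute: float, storage: float,
                internet_egress_gb: float, cross_az_gb: float,
                cross_region_gb: float,
                egress_rate: float = 0.09,        # internet egress, $/GB
                cross_az_rate: float = 0.02,      # cross-AZ round-trip, $/GB
                cross_region_rate: float = 0.02,  # cross-region, $/GB
                ) -> dict:
    transfer = {
        "internet_egress": internet_egress_gb * egress_rate,
        "cross_az": cross_az_gb * cross_az_rate,
        "cross_region": cross_region_gb * cross_region_rate,
    }
    return {"compute": compute, "storage": storage, **transfer,
            "total": compute + storage + sum(transfer.values())}

# Hypothetical analytics stack: multi-AZ HA plus cross-region DR sync.
bill = monthly_tco(compute=12_000, storage=2_300,
                   internet_egress_gb=1_000, cross_az_gb=20_000,
                   cross_region_gb=15_000)
for line, usd in bill.items():
    print(f"{line:16s} ${usd:,.0f}")
```

<p>In this placeholder scenario, transfer adds roughly $790/month on top of compute and storage, and none of it shows up in a comparison that only prices compute and storage.</p>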
These are often invisible until the first bill arrives.</p><div><hr></div><h2>The pressure on egress pricing</h2><h3>Regulatory pressure</h3><p><strong>EU Digital Markets Act</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a>:</p><ul><li><p>Designates major cloud providers as "gatekeepers"</p></li><li><p>Requires data portability and interoperability</p></li><li><p>Could force egress price reductions in EU</p></li></ul><p><strong>US regulatory interest</strong>:</p><ul><li><p>FTC published a cloud market study in 2024 identifying egress fees and switching costs as competition concerns</p></li><li><p>No enforcement action yet, but the issue is on the regulatory radar</p></li></ul><h3>The "free to leave" offer that almost nobody can use</h3><p>In March 2024, all three hyperscalers announced they would waive egress fees for customers migrating away. Headlines declared the end of egress lock-in.</p><p>The fine print: AWS issues retroactive credits (not a real-time waiver), requires support approval, and gives you 60 days to complete the migration. The offer targets customers who are switching away entirely. Partial repatriation, moving some workloads back on-prem while keeping your account active, doesn't qualify.</p><p>How many companies fully close their cloud accounts? Almost none. The typical enterprise has dozens of services, hundreds of integrations, and compliance dependencies that make a clean break practically impossible.</p><p>The most prominent company to actually do it is 37signals, makers of Basecamp and HEY. Their CTO, DHH, has documented the entire exit publicly: $3.2M/year in cloud spend reduced to well under $1M, with projected savings over $10M across five years. They migrated 6 petabytes of S3 data to on-prem Pure Storage, and AWS honored the waiver, crediting roughly $250,000 in egress fees. 
Even DHH noted that getting the credits approved "took a while."</p><p>But 37signals runs a handful of Rails applications with a simple architecture, a CTO who made cloud exit a personal crusade, and their own data center space. For a typical enterprise, full account closure isn't a realistic option, which is exactly the point. The waiver addresses the nuclear option of complete departure while leaving day-to-day egress pricing unchanged. The announcements were a response to the EU Data Act (which mandates zero switching fees by January 2027), not a genuine change in the economics of data gravity.</p><h3>Competitive pressure from new entrants</h3><p><strong>Cloudflare R2</strong>:</p><ul><li><p>S3-compatible storage</p></li><li><p>Zero egress fees</p></li><li><p>Clear competitive attack on data gravity</p></li></ul><p><strong>Oracle Cloud</strong>:</p><ul><li><p>$0.0085/GB egress after a free 10 TB/month tier</p></li><li><p>Targeting migrations from AWS</p></li></ul><p><strong>Wasabi</strong>:</p><ul><li><p>No egress fees</p></li><li><p>Hot storage priced below the hyperscalers' cold tiers</p></li></ul><p>The incumbents have responded with expanded free tiers and volume discounts, but core pricing remains high. Competition is pressuring the edges, not the center.</p><h3>Open formats don't solve location gravity</h3><p>Open table formats (Parquet, Iceberg, Delta Lake) reduce format lock-in, but your Iceberg tables are still in S3. Moving them to GCS still triggers egress. Format portability is not location portability.</p><div><hr></div><h2>What I'd tell a data team today</h2><p>I don't think cloud providers are villains for charging egress. They're rational actors optimizing for retention in a market with high customer lifetime value. 
Understanding the game lets you play it strategically.</p><p>The key things I'd want any data team to internalize:</p><ol><li><p><strong>Egress pricing is strategic, not cost-based.</strong> An 18-24x markup over wholesale bandwidth tells you this isn't about cost recovery.</p></li><li><p><strong>Data gravity is engineered.</strong> Free ingress, cheap storage, expensive egress. The asymmetry is intentional.</p></li><li><p><strong>Internet egress is only part of the transfer cost story.</strong> Cross-AZ and cross-region charges are continuous and often larger in aggregate for analytics workloads.</p></li><li><p><strong>Multi-cloud multiplies all of these costs.</strong> If you're going multi-cloud to avoid lock-in, model the transfer costs first.</p></li><li><p><strong>Design for locality.</strong> Process data where it lives, aggregate before moving, co-locate compute and storage in the same AZ when possible.</p></li></ol><p>If I were advising a data team making platform decisions today, I'd say: build transfer costs into your TCO model from day one, not as an afterthought. Negotiate egress relief into enterprise agreements. And budget for cross-AZ costs as a line item, because they will surprise you if you don't.</p><p>The egress tax is real. The intra-cloud transfer tax is real and less visible. Plan for both.</p><div><hr></div><h2>Footnotes</h2><div><hr></div><p><em>This post is part of the Business of Analytics series, examining vendor incentives across the data stack to help practitioners make informed technology decisions.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://drpeering.net/white-papers/Internet-Transit-Pricing-Historical-And-Projected.php">2024 Internet Transit Pricing</a> - DrPeering. 
Wholesale bandwidth cost analysis.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://datagravity.org/">Data Gravity: In the Clouds</a> - Dave McCrory, 2010. Original concept definition.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://clickhouse.com/pricing">ClickHouse Cloud Pricing Changes</a> - ClickHouse, 2024-2025. Evolution from zero egress to consumption-based egress.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://www.cloudflare.com/products/r2/">Cloudflare R2 Pricing</a> - Cloudflare. 
Zero egress object storage.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p><a href="https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer">AWS Data Transfer Pricing</a> - AWS, December 2024.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://cloud.google.com/vpc/network-pricing">Google Cloud Network Pricing</a> - Google Cloud, December 2024.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p><a href="https://azure.microsoft.com/en-us/pricing/details/bandwidth/">Azure Bandwidth Pricing</a> - Microsoft Azure, December 2024.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p><a href="https://learn.microsoft.com/en-us/azure/reliability/regions-list">List of Azure regions</a> - Microsoft Learn, February 2026. Of ~57 public regions, roughly 38 support availability zones; the remainder, including several non-restricted GA regions, do not.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p><a href="https://aws.amazon.com/s3/pricing/">Amazon S3 Pricing</a> - AWS. 
$0.023/GB/month for S3 Standard, first 50 TB, US East.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p><a href="https://docs.snowflake.com/en/user-guide/intro-key-concepts">Snowflake Architecture Overview</a> - Snowflake Documentation. Storage layer uses cloud provider object storage (S3, Azure Blob, GCS).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p><a href="https://aws.amazon.com/snowball/">AWS Snowball Edge</a> - AWS. Physical data transfer devices.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p><a href="https://cloud.google.com/transfer-appliance">Transfer Appliance</a> - Google Cloud. Physical data transfer devices.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p><a href="https://azure.microsoft.com/en-us/products/databox">Azure Data Box</a> - Microsoft Azure. Physical data transfer devices.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p><a href="https://commission.europa.eu/strategy-and-policy/priorities-2019-2024/europe-fit-digital-age/digital-markets-act-ensuring-fair-and-open-digital-markets_en">Digital Markets Act</a> - European Commission. 
Regulation (EU) 2022/1925, effective May 2023.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p><a href="https://www.ftc.gov/reports/examining-impact-cloud-computing-competition">Examining the Impact of Cloud Computing on Competition</a> - FTC, October 2024. Identifies egress fees and switching costs as barriers to competition.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p><a href="https://www.oracle.com/cloud/networking/pricing/">Oracle Cloud Networking Pricing</a> - Oracle. $0.0085/GB egress after 10 TB/month free.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p><a href="https://wasabi.com/pricing">Wasabi Pricing</a> - Wasabi. No egress fees on hot storage.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p><a href="https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-internet-when-moving-out-of-aws/">Free Data Transfer Out to Internet When Moving Out of AWS</a> - AWS Blog, March 2024. 
Google Cloud and Azure made similar announcements the same quarter.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p><a href="https://world.hey.com/dhh/our-cloud-exit-savings-will-now-top-ten-million-over-five-years-c7d9b5bd">Our Cloud-Exit Savings Will Now Top Ten Million Over Five Years</a> - DHH, 2024. See also <a href="https://world.hey.com/dhh/it-s-five-grand-a-day-to-miss-our-s3-exit-b8293563">It's Five Grand a Day to Miss Our S3 Exit</a>, March 2025.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Why database benchmarks are broken]]></title><description><![CDATA[A 2018 study examined 16 published database benchmark papers. Not one reported enough information to reproduce the results.]]></description><link>https://oxbowresearch.com/p/why-database-benchmarks-are-broken</link><guid isPermaLink="false">https://oxbowresearch.com/p/why-database-benchmarks-are-broken</guid><dc:creator><![CDATA[Joe Harris]]></dc:creator><pubDate>Mon, 16 Feb 2026 20:42:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HgBU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe00b326-63d2-4da9-bf66-b4a5c03b24d5_1400x538.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>TL;DR: Can you trust database benchmarks? Mostly no. Vendors test their own homework, TPC audits cost $100K, and community benchmarks freeze the day they're published. 
If you're picking a platform this quarter, the data you need barely exists.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!HgBU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe00b326-63d2-4da9-bf66-b4a5c03b24d5_1400x538.heic" width="1400" height="538" alt=""></figure></div><div><hr></div><h2>When the test-taker writes the test</h2><p>Every data platform vendor claims to be the fastest. Browse their benchmark pages: I have yet to find one where the vendor's own product doesn't win. Vendor benchmarks aren't measurement, they're marketing with numbers.</p><p>This matters because platform decisions involve real money. In my experience, a mid-size company spends $500,000 or more annually on their data platform. Enterprises can easily reach $5 million. The cloud data warehouse market is estimated at $12 billion and growing at over 25% annually<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, and finance teams are asking harder questions about every line item. Making the wrong choice based on misleading benchmarks means wasted spend, painful migrations, and missed opportunities. Yet the data available to inform these decisions is terrible.</p><div><hr></div><p>Vendor benchmarks have an obvious problem: the company selling the product is measuring the product. The incentives are misaligned from the start.</p><p>But the problems go deeper than simple bias. A 2018 peer-reviewed study examined 16 published database benchmark papers and found that none reported the parameters necessary to interpret their results<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. One configuration parameter alone changed transaction throughput by a factor of 28. 
Vendor benchmarks systematically distort comparisons in ways that make the numbers essentially meaningless.</p><ul><li><p><strong>Configuration optimization</strong>:</p><p>The most common distortion is asymmetric tuning: benchmarking your own product with expert configuration while running competitors on defaults. A skilled engineer can often double or triple performance on any database through careful configuration: index selection, memory allocation, parallelism settings, query hints. When a vendor shows "10x faster than the competition," the real comparison is often "our best versus their default."</p><p>These optimizations require expertise that typical users don't have. The vendor's benchmark team has deep knowledge of their own product. They know which configuration knobs matter. Competitors' products? They might spend an afternoon reading docs. The resulting comparison tells you more about configuration skill than product capability.</p></li><li><p><strong>Workload selection</strong>:</p><p>Every database architecture has strengths and weaknesses. Columnar databases crush analytical aggregations but struggle with point lookups. Row stores handle transactions beautifully but choke on full-table scans. In-memory systems win on small datasets but hit limits at scale.</p><p>Vendors choose benchmarks that highlight their strengths. A columnar vendor runs TPC-H (analytical queries). A transactional vendor runs TPC-C (OLTP). Each shows impressive numbers on their chosen workload. Neither mentions the workloads where they lose.</p><p>The phrase "TPC-H isn't representative of real workloads" appears suspiciously often in marketing from vendors who perform poorly on TPC-H. Workload representativeness only becomes a concern when the benchmark is unflattering.</p></li><li><p><strong>Hardware asymmetry</strong>:</p><p>Cloud benchmarks add another variable: instance selection. 
Vendors run their product on premium instances, the latest generation, maximum memory, fastest storage. Competitors run on whatever was convenient. The resulting performance gap reflects hardware choices as much as software efficiency.</p><p>Instance selection is particularly insidious because the differences aren't obvious. An r6i.8xlarge and r5.8xlarge sound similar. Both have 32 vCPUs and 256GB RAM. But the r6i has newer processors, faster memory bandwidth, and better storage performance. Small differences in instance generation compound across queries.</p></li><li><p><strong>Methodology opacity</strong>:</p><p>The most damaging practice is simply hiding methodology. "Internal testing" with no details. "Optimized configuration" with no specifics. "Contact us for more information" instead of published scripts.</p><p>When a vendor can't or won't share exactly how they produced their numbers, the numbers are worthless. Without published methodology, you can't reproduce the results, validate the claims, or even understand what scenarios the benchmark represents. Independent verification becomes impossible.</p></li></ul><p>These aren't theoretical concerns. In November 2021, Databricks published an official TPC-DS result at 100TB, audited by the TPC council<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. Snowflake responded with its own unofficial comparison claiming similar price-performance. Databricks challenged Snowflake's methodology. The dispute played out across competing blog posts and Hacker News threads, illustrating exactly the problem: even when one vendor publishes audited results, a competitor can muddy the water with unofficial claims that can't be independently verified.</p><div><hr></div><h2>$100,000 to publish a number</h2><p>TPC (Transaction Processing Performance Council) benchmarks represent the gold standard for rigor. 
TPC-H and TPC-DS define precise specifications, exact queries, data generation procedures, and validation requirements. Published results require independent audits. Full disclosure is mandatory. The methodology is exactly right.</p><p>But almost nobody uses it. Certification costs $100,000+<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, which prices out open-source projects and smaller vendors entirely. And even vendors who can afford it face a perverse incentive: publishing only makes sense if you're confident competitors won't beat your result quickly. Spend six figures on a number that gets topped next quarter, and you've just proved you're second-fastest. Most vendors never publish, or publish once, claim the record, and never update. You can compare Oracle to Microsoft using official TPC numbers, but you can't compare either to DuckDB or ClickHouse.</p><p>The queries themselves remain genuinely useful. TPC-H's 22 queries test joins, aggregations, and sorting, the foundational operations that still dominate analytical workloads. TPC-DS adds 99 queries that stress query optimizers significantly harder, covering subqueries, window functions, and complex join patterns. These are the operations that separate fast platforms from slow ones. No single benchmark covers everything (modern workloads also include JSON processing, ML feature engineering, and time-series analysis), but TPC-H and TPC-DS cover the analytical core well.</p><p>Then there's the lag. TPC results represent a point in time. The audit process takes months. By the time results are published, the tested product version may be outdated. For cloud platforms that update continuously, published results may not reflect current performance.</p><p>The TPC council does important work. 
The problem isn't the benchmarks themselves, it's that the cost and time requirements make them impractical for most comparisons.</p><div><hr></div><h2>Enthusiasm without governance</h2><p>Individual practitioners and community members run their own benchmarks to fill the gap. Mark Litwintschik's tech.marksblogg.com publishes detailed database comparisons. ClickBench, maintained by ClickHouse, provides a single-table benchmark<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. H2O's db-benchmark tests dataframe operations<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. Countless blog posts share individual experiences.</p><p>This work is valuable. Community benchmarks provide data points that would otherwise not exist. But they have structural problems that limit their usefulness.</p><p><strong>Methodology varies wildly.</strong> One person runs DuckDB on a laptop. Another runs Snowflake on a Large warehouse. Someone else tests PostgreSQL with default settings while tuning ClickHouse aggressively. Two people testing the same platforms can produce opposite rankings, and neither is wrong; they just measured different things.</p><p><strong>Results go stale immediately.</strong> Someone runs tests, publishes results, and moves on. The platforms improve. New versions ship. But five-year-old benchmark posts still appear in Google searches, citing platform versions that no longer exist. There's no systematic process for keeping results current.</p><p><strong>Expertise is unevenly distributed.</strong> Properly benchmarking a data platform requires deep expertise: configuration options, query optimization, hardware characteristics, statistical analysis. Few individuals have this across multiple platforms. A PostgreSQL expert benchmarking ClickHouse might miss critical configuration options. 
The results then reflect tester expertise rather than product capability.</p><p><strong>Nobody owns the errors.</strong> When community benchmarks contain errors, there's no correction mechanism. Vendors can point out problems, but doing so looks defensive. Bad data persists because no one is responsible for accuracy.</p><p>The people doing this work are filling a real gap with limited resources. Good intentions can't compensate for missing infrastructure: independent benchmarking without governance produces unreliable results.</p><div><hr></div><h2>Million-dollar decisions, dollar-store data</h2><p>The consequences compound. Data teams facing platform selection have too much information and too little clarity. Vendor benchmarks all claim victory. Community benchmarks contradict each other. Academic benchmarks exclude half the options. The rational response is often to ignore external benchmarks entirely and run your own tests, but proper benchmarking takes weeks or months of engineering time. Most teams can't afford this, so they make decisions based on reputation, sales relationships, or which vendor bought them the nicest dinner.</p><p>When a team picks a platform based on benchmark claims that don't reflect their actual workload, they discover the gap after committing to integration, training, and dependencies. Enterprise migration projects routinely run into the millions<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p><p>The prevalence of misleading benchmarks breeds cynicism. "All benchmarks are marketing" becomes received wisdom, and teams stop paying attention to performance data entirely.</p><p>This cynicism is understandable but harmful. Real performance differences exist between platforms. Some platforms genuinely are 10x faster for certain workloads. Dismissing all benchmarks means missing genuine optimization opportunities. 
New data platform vendors struggle to compete on credibility. Established players have years of TPC results, marketing campaigns, and analyst relationships; a startup with a genuinely better product can't simply prove its superiority, because the benchmarking landscape is too broken to support credible claims. The market rewards marketing spend rather than engineering excellence.</p><div><hr></div><h2>Something better</h2><p>The information available for platform decisions, one of the most consequential technical choices an organization makes, is unreliable. In a previous post, I argued that flawed benchmarks still beat no benchmarks. I stand by that. But "better than nothing" is a low bar, and practitioners deserve actual methodology, not just less-bad marketing. Fixing this requires independence from vendor funding, open methodology anyone can reproduce, and ongoing maintenance instead of one-time publication. I think that's buildable. In fact, I'm building it.</p><div><hr></div><h2>References</h2><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://www.mordorintelligence.com/industry-reports/cloud-data-warehouse-market">Cloud Data Warehouse Market Size &amp; Share Analysis</a> - Mordor Intelligence. Estimated at $11.78B in 2025, growing at 27.64% CAGR to $39.91B by 2030.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>M&#252;hleisen et al., <a href="https://hannes.muehleisen.org/publications/DBTEST2018-performance-testing.pdf">Fair Benchmarking Considered Difficult: Common Pitfalls In Database Performance Testing</a> - DBTEST 2018. Examined 16 benchmark papers; none reported sufficient parameters for reproducibility.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html">Databricks Sets Official Data Warehousing Performance Record</a> - Databricks, November 2021. Official TPC-DS 100TB result: 32.9M QphDS. <a href="https://www.tpc.org/results/fdr/tpcds/databricks~tpcds~100000~databricks_sql_8.3~fdr~2021-11-02~v01.pdf">Full disclosure report</a>. See also Databricks' <a href="https://www.databricks.com/blog/2021/11/15/snowflake-claims-similar-price-performance-to-databricks-but-not-so-fast.html">response to Snowflake's counter-claims</a> and Snowflake's <a href="https://www.snowflake.com/blog/industry-benchmarks-and-competing-with-integrity/">rebuttal</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://www.tpc.org/information/benchmarks5.asp">TPC Policies</a> - Transaction Processing Performance Council. Full benchmark certification requires third-party auditing and TPC membership.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p><a href="https://benchmark.clickhouse.com/">ClickBench</a> - ClickHouse single-table benchmark.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://h2oai.github.io/db-benchmark/">db-benchmark</a> - H2O.ai database-like operations benchmark.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p><a href="https://www.persistent.com/client-success/from-teradata-to-the-cloud-building-a-future-ready-data-foundation-while-saving-140m/">From Teradata to the Cloud: Building a Future-Ready Data Foundation While Saving $140M</a> - Persistent Systems. One healthcare company's Teradata-to-cloud migration involved 5,000+ users across 25 lines of business.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Benchmarking is good, actually]]></title><description><![CDATA[Everyone knows benchmarks are flawed. But when you have two weeks to pick a million-dollar platform, flawed data beats no data.]]></description><link>https://oxbowresearch.com/p/benchmarking-is-good-actually</link><guid isPermaLink="false">https://oxbowresearch.com/p/benchmarking-is-good-actually</guid><dc:creator><![CDATA[Joe Harris]]></dc:creator><pubDate>Wed, 11 Feb 2026 02:21:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!zlb5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59fab8f-a2fa-43a4-90b6-18ad14a14610_1400x538.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>TL;DR: Should you ignore flawed benchmarks? No. Teams get a week or two for platform decisions with multi-year financial consequences, and most migrations fail or exceed their budgets. 
Imperfect data read with skepticism beats no data at all.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!zlb5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff59fab8f-a2fa-43a4-90b6-18ad14a14610_1400x538.png" width="1400" height="538" alt=""></figure></div><h2>The case against is already well-known</h2><p>Database benchmarks have problems, and everyone in the industry knows it. Vendor benchmarks serve marketing. Configuration choices can swing results by 10x. Standard benchmarks like TPC-H test workloads that may look nothing like yours. Community benchmarks vary in methodology and rigor.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://oxbowresearch.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Oxbow Research is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>I'm not going to argue against any of that. It's all true.</p><p>But here's the part that gets left out of the critique: in my experience, most data teams don't have the luxury of ignoring benchmarks just because they're flawed. Someone has to make platform decisions, those decisions have real financial consequences, and the people making them need <em>some</em> kind of comparative data to work with.</p><p>I'll make the case in three steps: why teams can't benchmark everything themselves, why migration mandates make comparative data non-optional, and how to read benchmark results without getting fooled.</p><div><hr></div><h2>The two-week platform decision</h2><p>The advice "just benchmark it yourself" sounds reasonable until you look at how platform decisions actually happen.</p><p>Most organizations get a week or two for platform selection before committing to implementations that span 12-36 months<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. A widely cited Gartner finding from 2009 put migration failure or major-overrun rates at 83%<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. It's old, but still one of the most commonly referenced baseline figures for migration risk.</p><p>The data engineer investigating Snowflake alternatives is usually the same person maintaining the current Snowflake deployment and responding to production incidents. 
"Run comprehensive benchmarks yourself" competes directly with "keep the business running." For most organizations I've worked with, self-benchmarking at production quality is simply not practical.</p><h3>The configuration trap</h3><p>Even when teams find time to evaluate platforms, they run into a subtler problem: properly configuring each platform is genuinely hard. Snowflake configuration decisions differ radically from DuckDB, which differs from BigQuery. Compute sizing, memory allocation, parallelism settings, query patterns that trigger materialization; each platform has its own arcane lore.</p><p>I've seen Postgres benchmarks that forgot to create indexes. ClickHouse tests running with outdated engine choices. Spark memory settings that force unnecessary spills. Snowflake comparisons using XS warehouses against a competitor with 10x more compute. The difference between "20% slower" and "10x faster" can come down to a single configuration choice.</p><p>This is the trap: without external benchmarks, teams unknowingly run suboptimal configurations on the platforms they're less familiar with, which is precisely the platforms they're evaluating. The resulting comparison reflects configuration skill rather than product capability.</p><h3>What this looks like in practice</h3><p>The scale involved is striking. One healthcare organization maintained Teradata for 20+ years, serving 5,000+ users across 25 lines of business. The cloud migration ultimately saved $140M annually, but required migrating every user and workload to validate the new platform could handle production load<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. 
Another Teradata-to-Snowflake migration involved 700+ query translations over 8 months before achieving 60% cost reduction and 85% query efficiency improvement<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><p>These are partner and vendor-adjacent case studies, so I treat them as directional evidence rather than universal outcomes.</p><p>This tracks with every migration I've worked on. Parallel validation, running legacy and new platforms simultaneously, is standard practice for good reason<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. These decisions evolve iteratively based on ongoing measurement, not one-time evaluation. Some data comes from internal testing, some from external sources. But "ignore all benchmarks because they're imperfect" was never a viable option.</p><div><hr></div><h2>When someone else picks your database</h2><p>Sometimes the platform decision isn't yours to make. I've been on the receiving end of these mandates more than once.</p><p>"We're moving from AWS to Google Cloud." "We're consolidating to Databricks." "We're replacing Teradata with a modern cloud warehouse." These decisions get made by executives balancing factors you may not see: vendor relationships, strategic partnerships, existing contracts, real estate costs. The mandate arrives as a fait accompli. Your job isn't to choose the best platform; it's to execute the migration.</p><p>But you still need to understand the cost-performance tradeoffs of what you're moving to. That need doesn't go away just because the choice was made above your pay grade.</p><p>This happens more often than the industry admits. 
63% of IT decision-makers accelerated their cloud migration efforts in 2024, up from 57% the prior year<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>, and the pace shows no sign of slowing. My expectation is that AI coding agents will further compress migration timelines, which increases pressure to make platform decisions faster.</p><h3>The CFO question</h3><p>Without benchmarks, the CFO asks "Why move to GCP?" and you answer: "Lower cost per compute." But I've learned the hard way that configuration matters enormously. BigQuery on-demand pricing differs dramatically from Flex slot pricing, and Snowflake warehouse sizing creates order-of-magnitude cost differences for identical workloads<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>. Without comparative data, you're guessing.</p><p>With benchmarks, the same question gets a better answer: "Platform A is 40% cheaper at equivalent performance for our average query size, based on independent testing. The lock-in cost of switching is justified by $X million savings over 3 years."</p><p>The data may be imperfect, but it's data. And when you're asked to justify a platform that costs millions annually, having external data points, even flawed ones, is the difference between a knowing nod and a very uncomfortable meeting. I know which one I prefer.</p><div><hr></div><h2>Cloud spend demands justification</h2><p>Cloud spending reached $723.4 billion in 2025<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. At that scale, "because we've always used it" doesn't survive a finance review. Every year, finance asks the data team: "Why do we spend $X on Platform Y?" They expect performance and cost data. 
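</p><p>As a toy illustration of the kind of evidence finance expects, here is a sketch in Python; every figure below is invented for illustration, not a measured result:</p>

```python
# Hypothetical cost-performance summary for a finance review.
# Replace these invented figures with your own measurements.
platforms = {
    "current":   {"annual_cost": 2_400_000, "avg_query_s": 14.0},
    "candidate": {"annual_cost": 1_450_000, "avg_query_s": 15.1},
}

cur, cand = platforms["current"], platforms["candidate"]
savings_3yr = (cur["annual_cost"] - cand["annual_cost"]) * 3
perf_delta = cand["avg_query_s"] / cur["avg_query_s"] - 1  # positive means slower

print(f"3-year savings: ${savings_3yr:,}")     # 3-year savings: $2,850,000
print(f"Query-time delta: {perf_delta:+.0%}")  # Query-time delta: +8%
```

<p>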
In many organizations, failing to provide that evidence weakens the data team's budget credibility.</p><p>The commitment decisions keep getting harder. Cloud platforms offer 25-40% savings for 1-year commitments and up to 55-72% for 3-year commitments<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>, but multi-year commitments lock you in. If your data volume grows 50% year-over-year and independent data suggests your workload would run 30% faster on a competitor, that three-year commitment is a fraught decision. Without external data points, you're extrapolating blindly.</p><div><hr></div><h2>Fifty databases and a deadline</h2><p>The vendor landscape is chaotic. Snowflake, BigQuery, Redshift, Databricks, ClickHouse, DuckDB, Firebolt, StarRocks, SingleStore, Yellowbrick; the list grows every year. Each claims superiority. Nobody can realistically evaluate all of them.</p><p>In practice, benchmarks are the most scalable comparative filtering mechanism I've found. "This platform consistently ranks in the top tier for analytical workloads." "This platform excels at real-time ingestion but struggles with complex joins." Pattern recognition across multiple benchmarks, even imperfect ones, helps narrow 50 options to 3-5 candidates worth deeper evaluation.</p><p>You can't run comprehensive tests on 50 platforms. You can read 50 benchmark reports in an afternoon. I do this regularly, and the patterns are surprisingly consistent across independent sources.</p><p>And increasingly, the people making these decisions aren't database specialists. A Gartner projection, as quoted in a Google Cloud summary of the 2024 Magic Quadrant, says 75% of DBMS purchase decisions will be made by business domain leaders by 2027, up from 55% in 2022<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. 
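</p><p>The commitment tradeoff described earlier can be made concrete with toy numbers; the discounts, base spend, and growth rate below are illustrative assumptions, not vendor quotes:</p>

```python
# Toy 3-year spend extrapolation under 50% yearly growth.
# Discount levels are illustrative assumptions, not vendor quotes.
base = 1_000_000                 # year-1 on-demand spend
growth, disc_1yr, disc_3yr = 0.50, 0.30, 0.60

yearly_spend = [base * (1 + growth) ** y for y in range(3)]
on_demand = sum(yearly_spend)                             # no commitment at all
one_year = sum(s * (1 - disc_1yr) for s in yearly_spend)  # renew a 1-yr commit each year
# Assume the 3-yr commitment covers year-1 volume; growth beyond it bills on demand.
three_year = base * (1 - disc_3yr) * 3 + (on_demand - base * 3)

print(f"{on_demand=:,.0f} {one_year=:,.0f} {three_year=:,.0f}")
# on_demand=4,750,000 one_year=3,325,000 three_year=2,950,000
```

<p>Under these assumptions the three-year commitment still wins even with growth billing at on-demand rates, but a competitor that runs the workload 30% cheaper would flip the answer; the point is to make the extrapolation explicit rather than guess. 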
For a domain leader, "Platform A is 40% faster and 20% cheaper for workloads like yours" is actionable in a way that "columnar storage with vectorized execution" simply isn't. Benchmarks, despite their flaws, make cross-platform differences visible to the people who increasingly make the calls.</p><div><hr></div><h2>How to read benchmarks without being fooled</h2><p>So yes, benchmarks have real problems. Vendor benchmarks serve marketing. Configuration choices skew results. Standard workloads may not match yours. All valid.</p><p>But "benchmarks are flawed" and "benchmarks are worthless" are different claims. The first doesn't imply the second. Restaurant reviews are written by people with opinions, biases, and incomplete information. Reviews are still useful. You just read them with appropriate skepticism.</p><p>When I evaluate a benchmark report, I use a four-step filter:</p><ul><li><p><strong>Discard</strong> benchmark reports that hide basic configuration, hardware, or workload details.</p></li><li><p><strong>Discount</strong> vendor-funded claims unless independent reports show similar ranking patterns.</p></li><li><p><strong>Compare</strong> at least three sources and look for directional consistency, not tiny deltas.</p></li><li><p><strong>Validate</strong> finalists on my own workload before making a commitment.</p></li></ul><p>Then I ask four questions:</p><ul><li><p><strong>Who funded this?</strong> Vendor-funded means skeptical. Independent means more trust.</p></li><li><p><strong>What configuration?</strong> Default settings favor some platforms, penalize others. Optimized settings may not reflect your team's capability.</p></li><li><p><strong>What workload?</strong> OLAP benchmarks tell you nothing about OLTP. TPC-H tells you nothing about JSON processing.</p></li><li><p><strong>What hardware?</strong> Cloud instance generation matters. 
On-prem versus cloud matters.</p></li></ul><p>If three independent sources say Platform A is faster for analytical workloads, that's meaningful signal. If only the vendor says it, I discount heavily. Patterns across benchmarks are more reliable than individual results.</p><p>My advice: use benchmarks for initial filtering, reducing 50 options to 3-5 candidates. Then run your own tests on the finalists. Smaller scope, but matched to your reality. And factor in non-performance variables: cost, operational complexity, ecosystem, lock-in risk. Speed isn't everything.</p><div><hr></div><h2>Conclusions</h2><p>I've spent my career watching data practitioners navigate real constraints: a week or two to make platform decisions with multi-year financial implications, accountability for cloud spend in the millions, corporate mandates that remove choice but still require cost-performance analysis, and 50+ platform options that can't all be evaluated in depth.</p><p>Benchmarks are how practitioners navigate these constraints, imperfect as they are. The solution isn't to ignore them. It's to use them with appropriate skepticism, corroborate across sources, and supplement with targeted testing on your actual workload.</p><p>The critique of benchmarks should drive demand for <em>better</em> benchmarks. 
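</p><p>The "compare at least three sources" step above can be sketched as a directional-consistency check; the source names and speedup numbers here are invented:</p>

```python
# Check whether independent benchmark reports agree on direction,
# ignoring small deltas. All data below is invented for illustration.
reports = {
    "source_a": {"x_vs_y_speedup": 1.42},
    "source_b": {"x_vs_y_speedup": 1.18},
    "source_c": {"x_vs_y_speedup": 0.97},
}

MIN_DELTA = 0.10  # treat results within +/-10% as a tie, not a signal

def direction(speedup: float) -> str:
    if speedup > 1 + MIN_DELTA:
        return "x_faster"
    if speedup < 1 - MIN_DELTA:
        return "y_faster"
    return "tie"

votes = [direction(r["x_vs_y_speedup"]) for r in reports.values()]
consistent = votes.count("x_faster") >= 2 and "y_faster" not in votes
print(votes, "-> meaningful signal" if consistent else "-> discount heavily")
# ['x_faster', 'x_faster', 'tie'] -> meaningful signal
```

<p>Two independent sources agreeing, with none contradicting, is the kind of pattern worth acting on; a lone vendor claim would fail this check. 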
That's what I'm building with Oxbow Research: open methodology, published data, no vendor funding.</p><p>If you're evaluating platforms this quarter, here's a practical starting point:</p><ul><li><p>Pick 3-5 candidates using independent benchmark patterns.</p></li><li><p>Run a scoped in-house test on your top 2 workloads.</p></li><li><p>Present a short cost-performance tradeoff memo before committing.</p></li></ul><div><hr></div><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://oxbowresearch.com/p/benchmarking-is-good-actually?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading Oxbow Research! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oxbowresearch.com/p/benchmarking-is-good-actually?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oxbowresearch.com/p/benchmarking-is-good-actually?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><div><hr></div><h2>Footnotes</h2><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://www.scnsoft.com/data/data-warehouse/enterprise">Enterprise Data Warehouse: A Full Guide for 2025</a> - ScienceSoft. 
Decision timeline research: goals elicitation 3-20 days, tech stack selection 2-15 days, business case creation 2-15 days, implementation 6-12 months.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Gartner, "Risks and Challenges in Data Migrations and Conversions" (G00165710, 2009). This widely cited 83% migration failure/overrun figure is nearly two decades old, but remains the most commonly referenced stat in the field. See also <a href="https://www.curiositysoftware.ie/blog/too-many-migration-projects-fail">The Research Is Clear: Too Many Migration Projects Fail</a> - Curiosity Software, which surveys multiple migration failure studies with similar findings.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://www.persistent.com/client-success/from-teradata-to-the-cloud-building-a-future-ready-data-foundation-while-saving-140m/">From Teradata to the Cloud: Building a Future-Ready Data Foundation While Saving $140M</a> - Persistent Systems Client Success Story.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://hakkoda.io/resources/teradata-migration-a-step-by-step-guide/">Teradata Migration: A Step-by-Step Guide</a> - Hakkoda. 
Case study: 700+ queries migrated, 8-month engagement, 60% cost reduction, 85% query efficiency improvement, 16x faster testing.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p><a href="https://docs.databricks.com/en/migration/index.html">Data Warehouse Migration Best Practices</a> - Databricks Documentation. Parallel validation is a recommended practice across major cloud platform migration guides.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://foundryco.com/news/cloud-computing-study-2024/">Cloud Computing Study 2024</a> - Foundry (formerly IDG), August 2024. Survey of 821 global IT decision-makers: 63% accelerated cloud migration in 2024, up from 57% in 2023.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p><a href="https://cloud.google.com/bigquery/pricing">BigQuery Pricing</a> and <a href="https://www.snowflake.com/en/data-cloud/pricing-options/">Snowflake Pricing</a> - Official vendor documentation. Pricing varies by compute model, commitment level, and workload patterns.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p><a href="https://www.cloudzero.com/blog/cloud-computing-statistics/">90+ Cloud Computing Statistics: A 2025 Market Snapshot</a> - CloudZero, citing Gartner. 
Global cloud spending $723.4B in 2025.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>See <a href="https://aws.amazon.com/savingsplans/">AWS Savings Plans</a>, <a href="https://cloud.google.com/compute/docs/instances/signing-up-committed-use-discounts">Google Cloud Committed Use Discounts</a>, and <a href="https://azure.microsoft.com/en-us/pricing/reservations/">Azure Reservations</a>. Discount ranges vary by provider, commitment length, and payment structure.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p><a href="https://cloud.google.com/blog/products/databases/2024-gartner-magic-quadrant-for-cloud-database-management-systems">2024 Gartner Magic Quadrant for Cloud Database Management Systems</a> - Google Cloud Blog, December 2024. 
Gartner projection (as cited): 75% of DBMS decisions by business domain leaders by 2027.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Explore SQL vs DataFrame performance with the BenchBox MCP and Claude Code]]></title><description><![CDATA[Use the BenchBox MCP to execute benchmarks, compare results, and investigate performance outliers - without leaving Claude Code.]]></description><link>https://oxbowresearch.com/p/explore-sql-vs-dataframe-performance</link><guid isPermaLink="false">https://oxbowresearch.com/p/explore-sql-vs-dataframe-performance</guid><dc:creator><![CDATA[Joe Harris]]></dc:creator><pubDate>Thu, 05 Feb 2026 22:47:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!b0WJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>TL;DR: This post shows how the BenchBox MCP server (using Claude Code) automates benchmark discovery, execution, and investigative analysis. Using 5 simple prompts, I run and compare TPC-H at scale factor 1 with DataFusion's SQL and DataFrame support. I discover that TPC-H query 19 is 150% slower in DataFrame mode vs SQL. Root cause: the SQL optimizer extracts common join conditions from OR branches while the DataFrame API's explicit join-then-filter approach prevents the same optimization.</p><h2>What I'm demonstrating</h2><p>BenchBox exposes a Model Context Protocol (MCP) server that lets AI assistants interact with the benchmarking framework conversationally. You ask questions in plain language; the assistant calls BenchBox tools behind the scenes, exploring platforms, running benchmarks, comparing results, investigating anomalies. This post walks through one such session. 
I start with a question, explore the platform catalog, execute benchmarks, and follow an unexpected result down to its root cause.</p><p>Here are my 5 simple prompts for the agent:</p><ol><li><p><em>"Which BenchBox platforms support both SQL and DataFrame execution and can run locally on this computer?"</em></p></li><li><p><em>"Run TPC-H sf1 Power testing with DataFusion using both SQL and DataFrame modes. Compare the runs and flag differences &gt;10%."</em></p></li><li><p><em>"Show me the BenchBox head-to-head visualization comparing the results."</em></p></li><li><p><em>"Identify the query with largest performance regression. Explain what is different in the DataFrame and SQL executions."</em></p></li><li><p><em>"Write a technical analysis of the SQL vs DataFrame executions for this query. Use a diagram to show where they diverge"</em></p></li></ol><div><hr></div><h2>Install BenchBox and setup MCP server</h2><p>Create a virtual environment and install BenchBox with MCP support:</p><pre><code># Create uv virtual environment
uv venv ~/mcp_test/.venv
# Install benchbox with MCP extra
uv pip install --python ~/mcp_test/.venv/bin/python "benchbox[mcp]"
# Create project-level MCP server config
echo '{"mcpServers":{"benchbox":{"command":"~/mcp_test/.venv/bin/benchbox-mcp","args":[]}}}' &gt; ~/mcp_test/.mcp.json</code></pre><p><em>Sidebar</em>: Claude asks for approval to use the BenchBox MCP tool: approve just this call (1), or skip the prompt for all future BenchBox calls in this project (2).</p><blockquote><pre><code>Tool use
benchbox - Read-only discovery tool() (MCP)
List all available database platforms.
       Returns information about each platform including:&#8230;
Do you want to proceed?
&#10095; 1. Yes
2. Yes, and don't ask again for benchbox - Read-only discovery tool commands in ~/mcp_test
3. No</code></pre></blockquote><div><hr></div><h2>Exploring the platform catalog</h2><h5><code>prompt</code>:   </h5><p><em><strong>"Which BenchBox platforms support both SQL and DataFrame execution and can run locally on this computer?"</strong></em></p><h5><code>response</code>:</h5><blockquote><p>Platforms Supporting Both SQL and DataFrame</p><p>Of BenchBox's 37 platforms, 13 support both SQL and DataFrame execution. Filtering to those that can run locally (not cloud/managed services):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|}\n\\hline\n\\textbf{Platform} &amp; \\textbf{Category} &amp; \\textbf{Dependencies} &amp; \\textbf{Status} \\\\\n\\hline\n\\text{**DataFusion**} &amp; \\text{Analytical} &amp; \\text{datafusion, pyarrow} &amp; \\text{**Ready to run**} \\\\\n\\hline\n\\text{**PySpark**} &amp; \\text{DataFrame} &amp; \\text{pyspark} &amp; \\text{Missing pyspark} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;MODTKVGHUB&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>DataFusion is the only one ready today</strong></p><p>It's installed and available on this system with all dependencies met (<code>datafusion + pyarrow 23.0.0</code>). It runs natively on ARM64 and is Arrow-native, so it's a natural fit for your Apple Silicon Mac (10-core, 16GB).</p><p>PySpark can run locally in standalone mode, but would need <code>pyspark</code> installed (<code>uv pip install "benchbox[dataframe-pyspark]"</code>).</p><p><strong>Why not the others?</strong></p><p>The remaining 11 dual-mode platforms (Databricks, Athena Spark, Dataproc, EMR Serverless, Glue, Fabric Spark, Synapse Spark, Snowpark Connect, Quanton, etc.) 
are all cloud/managed services &#8212; they require remote infrastructure and credentials, not local execution.</p><p><strong>Notable local-only platforms (single mode)</strong></p><p>For context, you also have these local platforms ready but they only support one mode:</p><ul><li><p><strong>DuckDB</strong> (<code>v1.4.4</code>) &#8212; SQL only</p></li><li><p><strong>Polars</strong> (<code>v1.38.1</code>) &#8212; DataFrame only</p></li></ul><p>So if you want to compare SQL vs. DataFrame execution on the same engine locally, DataFusion is your only option right now.</p></blockquote><p>I'll use DataFusion for this investigation, keeping the SQL-vs-DataFrame comparison on the same local engine, removing cloud infrastructure variance.</p><div><hr></div><h2>Running the benchmarks</h2><h5><code>prompt</code>:</h5><p><em><strong>"Run TPC-H sf1 Power testing with DataFusion using both SQL and DataFrame modes. Compare the runs and flag differences &gt;10%. These benchmarks will run locally so do not run them in parallel."</strong></em></p><h5><code>response</code>:</h5><blockquote><h3>Results: DataFusion SQL vs DataFrame &#8212; TPC-H SF1</h3><p>Both runs completed all 22 queries with 3 measurement iterations on DataFusion 51.0.0.</p><p><strong>Overall</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|}\n\\hline\n\\textbf{Metric} &amp; \\textbf{SQL} &amp; \\textbf{DataFrame} &amp; \\textbf{Delta} \\\\\n\\hline\n\\text{Geo Mean} &amp; \\text{70.9 ms} &amp; \\text{59.5 ms} &amp; \\text{-16.1\\% (DF faster)} \\\\\n\\hline\n\\text{Total Time} &amp; \\text{1,848 ms} &amp; \\text{1,573 ms} &amp; \\text{-14.9\\%} \\\\\n\\hline\n\\text{Power@Size} &amp; \\text{48,940} &amp; \\text{60,279} &amp; \\text{+23.2\\%} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;UJADJXDPHF&quot;}" data-component-name="LatexBlockToDOM"></div><p>DataFrame mode is ~16% faster overall, with a materially higher TPC-H Power 
metric.</p><p><strong>Queries Exceeding 10% Difference (16 improved, 3 regressed, 3 stable)</strong></p><p><strong>DataFrame wins (16 queries):</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|l|}\n\\hline\n\\textbf{Query} &amp; \\textbf{SQL (ms)} &amp; \\textbf{DF (ms)} &amp; \\textbf{Change} &amp; \\textbf{Notes} \\\\\n\\hline\n\\text{Q18} &amp; \\text{332} &amp; \\text{94} &amp; \\text{-71.7%} &amp; \\text{Largest improvement} \\\\\n\\hline\n\\text{Q22} &amp; \\text{56} &amp; \\text{21} &amp; \\text{-62.5%} &amp; \\text{} \\\\\n\\hline\n\\text{Q5} &amp; \\text{114} &amp; \\text{43} &amp; \\text{-62.3%} &amp; \\text{} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;QPXBLVKFBO&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>SQL wins (3 queries):</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|l|}\n\\hline\n\\textbf{Query} &amp; \\textbf{SQL (ms)} &amp; \\textbf{DF (ms)} &amp; \\textbf{Change} &amp; \\textbf{Notes} \\\\\n\\hline\n\\text{Q19} &amp; \\text{55} &amp; \\text{140} &amp; \\text{+154.5%} &amp; \\text{Largest regression} \\\\\n\\hline\n\\text{Q12} &amp; \\text{52} &amp; \\text{114} &amp; \\text{+119.2%} &amp; \\text{} \\\\\n\\hline\n\\text{Q21} &amp; \\text{106} &amp; \\text{206} &amp; \\text{+94.3%} &amp; \\text{} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;AXNNZHOIWQ&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Key Takeaways</strong></p><ol><li><p><strong>DataFrame mode is broadly faster</strong> &#8212; 16 of 22 queries improved, many substantially. The DataFrame API bypasses SQL parsing and can express optimizations more directly to the execution engine.</p></li><li><p><strong>Three significant regressions stand out</strong> &#8212; Q19, Q12, and Q21 are all dramatically slower in DataFrame mode (2-2.5x). 
These likely involve query patterns where the hand-written DataFrame code generates a less optimal plan than what the SQL optimizer produces. Q19 involves complex disjunctive predicates (OR conditions), Q12 has ordered aggregation, and Q21 involves multi-level correlated subqueries &#8212; all patterns where SQL optimization heuristics can outperform manual DataFrame construction.</p></li><li><p><strong>Q18 is the standout improvement</strong> &#8212; 332ms (SQL) vs 94ms (DF), a 3.5x improvement. This large-order query with HAVING clause and nested aggregation appears to benefit significantly from the DataFrame API's more direct expression of the computation.</p></li></ol></blockquote><p>The biggest percentage regression is Q19: DataFrame mode is 2.55x slower (+154.5%). I will investigate it first.</p><div><hr></div><h2>Visual comparison of SQL vs DataFrame</h2><h5><code>prompt</code>:</h5><p><em><strong>"Show me the BenchBox head-to-head visualization comparing the results."</strong></em></p><h5><code>response</code>:</h5><blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b0WJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b0WJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic 424w, https://substackcdn.com/image/fetch/$s_!b0WJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic 848w, 
https://substackcdn.com/image/fetch/$s_!b0WJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic 1272w, https://substackcdn.com/image/fetch/$s_!b0WJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b0WJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic" width="1456" height="1902" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1902,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:421527,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://oxbowresearch.com/i/187032176?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b0WJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic 424w, 
https://substackcdn.com/image/fetch/$s_!b0WJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic 848w, https://substackcdn.com/image/fetch/$s_!b0WJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic 1272w, https://substackcdn.com/image/fetch/$s_!b0WJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2551562d-cc32-49f4-a0ce-d450138f45c2_2292x2994.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Embedded image used to show colors; Substack code blocks do not support ANSI color codes</em></p></blockquote><p>Most queries improved in DataFrame mode, but Q19, Q12, and Q21 are large regressions that dominate the risk profile.</p><div><hr></div><h2>Investigating the largest regression (Q19)</h2><h5><code>prompt</code>:</h5><p><em><strong>"Identify the query with largest performance regression. Explain what is different in the DataFrame and SQL executions."</strong></em></p><h5><code>response</code>:</h5><blockquote><h3>Q19: Discounted Revenue &#8212; Root Cause Analysis</h3><p><strong>The Query</strong></p><p>Q19 computes revenue for parts matching three independent sets of conditions joined by OR. Each branch requires a join between lineitem (6M rows) and part (200K rows), plus brand/container/quantity/size filters.</p><p><strong>What the SQL Optimizer Does</strong></p><p>The SQL version uses an implicit comma-join with the join predicate inside the OR:</p><pre><code>FROM lineitem, part
WHERE (p_partkey = l_partkey AND p_brand = 'Brand#12' AND ...)
  OR (p_partkey = l_partkey AND p_brand = 'Brand#23' AND ...)
  OR (p_partkey = l_partkey AND p_brand = 'Brand#34' AND ...)</code></pre><p>DataFusion's SQL optimizer can analyze this structure and:</p><ol><li><p><strong>Extract the common join predicate</strong> &#8212; <code>p_partkey = l_partkey</code> appears in all three OR branches, so it's factored out as the hash join condition</p></li><li><p><strong>Extract common filter predicates</strong> &#8212; <code>l_shipmode IN ('AIR', 'AIR REG')</code> and <code>l_shipinstruct = 'DELIVER IN PERSON'</code> are identical across all branches and get pushed down to a lineitem scan filter</p></li><li><p><strong>Push down part-side predicates</strong> &#8212; <code>p_brand IN ('Brand#12','Brand#23','Brand#34')</code> and <code>p_size BETWEEN 1 AND 15</code> (the union of all size ranges) can be used to filter the part scan</p></li></ol><p>This dramatically shrinks both sides of the join before it executes.</p><p><strong>What the DataFrame Code Does</strong></p><pre><code>lineitem.join(part, left_on="l_partkey", right_on="p_partkey")
   .filter(
       col("l_shipmode").is_in(ship_modes)
       &amp; (col("l_shipinstruct") == lit("DELIVER IN PERSON"))
       &amp; (  (branch1) | (branch2) | (branch3)  )
   )</code></pre><p>The join is specified correctly, but the entire filter is applied post-join as a single compound expression. The optimizer sees:</p><pre><code>HashJoin(lineitem, part, on=l_partkey=p_partkey)
 &#8594; Filter(complex_disjunction)</code></pre><p>The problem is that predicate pushdown through disjunctions is hard. While the top-level l_shipmode and l_shipinstruct predicates sit outside the OR (the DataFrame code does structure this well), the OR branches themselves contain predicates on both sides of the join (p_brand from part, l_quantity from lineitem). The optimizer can't cleanly decompose and push these mixed-table predicates down through the OR.</p><p><strong>The Result</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{What} &amp; \\textbf{SQL} &amp; \\textbf{DataFrame} \\\\\n\\hline\n\\text{lineitem rows entering join} &amp; \\text{Pre-filtered (\\textasciitilde 908K)} &amp; \\text{Full 6M} \\\\\n\\hline\n\\text{part rows entering join} &amp; \\text{Pre-filtered (\\textasciitilde small)} &amp; \\text{Full 200K} \\\\\n\\hline\n\\text{Join output to filter} &amp; \\text{Small} &amp; \\text{Large} \\\\\n\\hline\n\\text{Runtime} &amp; \\text{55ms} &amp; \\text{140ms} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;TAYTGXFOLO&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>The Fix</strong></p><p>The DataFrame code could be restructured to match what the SQL optimizer does automatically &#8212; pre-filter both tables, then join, then apply the residual OR:</p><pre><code># Pre-filter common predicates BEFORE the join
li = lineitem.filter(
   col("l_shipmode").is_in(ship_modes)
   &amp; (col("l_shipinstruct") == lit("DELIVER IN PERSON"))
)
pt = part.filter(
   col("p_brand").is_in([brand1, brand2, brand3])
   &amp; (col("p_size") &gt;= lit(1))
   &amp; (col("p_size") &lt;= lit(15))
)
# Join the reduced inputs, then apply branch-specific filters
result = (
   li.join(pt, left_on="l_partkey", right_on="p_partkey")
   .filter(branch1 | branch2 | branch3)
   ...
)</code></pre><p>This is the fundamental tradeoff: SQL optimizers have decades of research on predicate extraction and pushdown through disjunctions. DataFrame APIs express computation imperatively, putting the optimization burden on the developer. Q19 is a textbook case &#8212; its deeply nested OR-of-AND structure is exactly the pattern where SQL optimizers shine and naive DataFrame construction falls short.</p></blockquote><p>The SQL optimizer extracts the common join condition from OR branches and pushes filters down. In this plan shape, DataFusion's DataFrame path cannot do the same rewrite once the join is explicitly fixed before the filter.</p><h3>The predicate extraction gap</h3><h5><code>prompt</code>: </h5><p><em><strong>"Write a technical analysis of the SQL vs DataFrame executions for this query. Use a diagram to show where they diverge."</strong></em></p><h5><code>response</code>:</h5><blockquote><h3>Compilation Path Divergence</h3><pre><code>        SQL Path                        DataFrame Path
        &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;                        &#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
   FROM lineitem, part             lineitem.join(part, partkey)
   WHERE (key=key AND ...)         .filter(common &amp; (b1|b2|b3))
      OR (key=key AND ...)                    &#9474;
      OR (key=key AND ...)                    &#9474;
             &#9660;                                &#9660;
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;      &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; Parser &#8594; CrossJoin       &#9474;      &#9474; Builder &#8594; InnerJoin      &#9474;
&#9474; Join key INSIDE the OR   &#9474;      &#9474; Join key ALREADY SET     &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;      &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
            &#9660;                                 &#9474;
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;        (rule does not fire:
&#9474; ExtractEquijoinPredicate &#9474;         join already resolved)
&#9474;                          &#9474;                   &#9474;
&#9474; Finds partkey = partkey  &#9474;                   &#9474;
&#9474; in ALL branches &#8594; hoist  &#9474;                   &#9474;
&#9474; as join key              &#9474;                   &#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;                   &#9474;
            &#9660;                                 &#9660;
&#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;      &#9484;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9488;
&#9474; PushDownFilter           &#9474;      &#9474; PushDownFilter           &#9474;
&#9474;                          &#9474;      &#9474;                          &#9474;
&#9474; Clean single-table preds &#9474;      &#9474; OR branches mix both     &#9474;
&#9474; after join key removed:  &#9474;      &#9474; tables (p_brand AND      &#9474;
&#9474; &#8226; shipmode &#8594; lineitem    &#9474;      &#9474; l_quantity) so cannot    &#9474;
&#9474; &#8226; shipinstruct &#8594; lineitem&#9474;      &#9474; push to either side      &#9474;
&#9474; &#8226; p_size &gt;= 1 &#8594; part     &#9474;      &#9474;                          &#9474;
&#9474; &#8226; p_brand IN &#8594; part      &#9474;      &#9474; Only top-level ANDs push:&#9474;
&#9474;                          &#9474;      &#9474; &#8226; shipmode &#8594; lineitem    &#9474;
&#9474; Both scans filtered      &#9474;      &#9474; &#8226; shipinstruct &#8594; lineitem&#9474;
&#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;      &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
            &#9660;                                 &#9660;
  HashJoin(partkey)                 Filter(OR(b1,b2,b3))
  &#9500;&#9472; Scan lineitem [2 preds]       &#9492;&#9472; HashJoin(partkey)
  &#9474;    ~908K rows                     &#9500;&#9472; Scan lineitem [2 preds]
  &#9492;&#9472; Scan part [2 preds]              &#9474;    ~908K rows
       ~small subset                  &#9492;&#9472; Scan part [no filter]
                                            200K rows
            &#9660;                                 &#9660;
          55 ms                            140 ms</code></pre><p><strong>Where the gap opens</strong></p><p>The critical moment is <strong>ExtractEquijoinPredicate</strong>. This optimizer rule scans the WHERE clause for equijoin conditions. In SQL mode, it finds <code>p_partkey = l_partkey</code> buried inside all three OR branches and hoists it out as the join key. The remaining predicate, now free of the join condition, becomes a simpler OR-of-filters that PushDownFilter can decompose more aggressively.</p><p>In DataFrame mode, this rule never fires because the join is already explicit. The developer did the right thing by specifying the join key directly, but in doing so, they prevented the optimizer from ever seeing the predicate structure that SQL mode exploits.</p><p>The second gap is in PushDownFilter. DataFusion's filter pushdown can extract common factors from OR branches using the identity <code>(A&#8743;B) &#8744; (A&#8743;C) &#8594; A &#8743; (B&#8744;C)</code>. In SQL mode, after the join condition is extracted, the remaining branches are cleaner, containing single-table predicates that factor neatly. In DataFrame mode, each OR branch still contains predicates from both tables (<code>p_brand</code> from part, <code>l_quantity</code> from lineitem), which prevents the optimizer from attributing them to one side of the join.</p><p>The net effect: SQL mode filters the part table down to a small subset (3 brands, size 1-15) before joining. 
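</p><p>The factoring identity stated above, <code>(A&#8743;B) &#8744; (A&#8743;C) &#8594; A &#8743; (B&#8744;C)</code>, can be verified exhaustively over all truth values; a minimal check in plain Python (illustrative only, not DataFusion code):</p><pre><code>from itertools import product

# (A AND B) OR (A AND C) is equivalent to A AND (B OR C):
# the common-factor extraction that PushDownFilter applies to OR branches
def lhs(a, b, c):
    return (a and b) or (a and c)

def rhs(a, b, c):
    return a and (b or c)

# check every combination of truth values
assert all(lhs(*v) == rhs(*v) for v in product([False, True], repeat=3))</code></pre><p>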
DataFrame mode joins all 200K part rows against the filtered lineitem, producing a much larger intermediate result that the OR filter must then evaluate row by row.</p><p><strong>External confirmation</strong></p><p>From DataFusion's <a href="https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/push_down_filter.rs">push_down_filter.rs</a>:</p><pre><code><em>"For expressions like (A AND B) OR (C AND D), the optimizer transforms to: ((A AND B) OR (C AND D)) AND (A OR C)"</em></code></pre><p>The optimizer can only extract common factors across OR branches. It cannot recognize that a join condition appears in every branch, hoist it out, and convert the pattern to JOIN ON key WHERE (filters1 OR filters2 OR filters3).</p><p>From the DataFusion blog post <a href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one">optimizing-sql-dataframes-part-one</a>:</p><pre><code><em>Both [SQL and DataFrame] are translated into the same LogicalPlan... the DataFrame API builds the LogicalPlan directly using LogicalPlanBuilder.</em></code></pre><p>The same optimizer processes both paths, but the initial plan structure determines what optimizations are possible. SQL's implicit join gives the optimizer more room to restructure; DataFrame's explicit join is a constraint the optimizer respects.</p></blockquote><p>This gap is caused by the difference in how the two query APIs are interpreted before execution; it is not caused by any specific DataFrame API overhead.</p><div><hr></div><h2>Learnings</h2><p>The BenchBox MCP enables Claude Code to quickly compare DataFusion's SQL vs DataFrame modes, identify Q19 as a 2.55x regression (+154.5%), and trace the source to predicate pushdown during query planning. A future post will contrast this with letting Claude Code try this <em>without</em> the MCP (spoiler: slow, confident, and wrong).</p><p>This is a DataFusion-specific result, not a blanket statement about DataFrame APIs. 
Polars, PySpark, etc. all have different optimization capabilities and can make different planning choices on the same logical query shape.</p><p>For Q19 on DataFusion, SQL is faster because the optimizer extracts shared predicates from <code>OR</code> branches and pushes filters earlier. DataFusion's DataFrame path starts from the user-expressed plan shape and does not get the SQL-only rewrites.</p><p>However, DataFrame mode is actually faster for relatively straightforward queries. 16 of the 22 TPC-H queries ran more quickly, including a 3.5x improvement on Q18. The impact of the optimization difference is query-specific.</p><p>My recommendation today: on DataFusion, prefer SQL for OR-heavy multi-table predicates like Q19, Q12, and Q21. Use DataFrame mode when query construction ergonomics matter and your workload resembles the 16 queries where DataFrame won.</p><p>Note that this is not a permanent DataFusion limitation. DataFusion is moving forward quickly and the optimizer keeps evolving. 
This specific gap could close in future releases.</p><div><hr></div><h2>BenchBox test environment</h2><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|}\n\\hline\n\\textbf{Component} &amp; \\textbf{Detail} \\\\\n\\hline\n\\textbf{Hardware} &amp; \\text{Apple Mac mini M4 (10-core, 16GB unified memory)} \\\\\n\\hline\n\\textbf{DataFusion} &amp; \\text{51.0.0} \\\\\n\\hline\n\\textbf{BenchBox} &amp; \\text{0.1.2} \\\\\n\\hline\n\\textbf{Python} &amp; \\text{3.12} \\\\\n\\hline\n\\textbf{OS} &amp; \\text{macOS Tahoe 26.2} \\\\\n\\hline\n\\textbf{Benchmark} &amp; \\text{TPC-H SF1 (\\textasciitilde 1GB, 8.6M rows)} \\\\\n\\hline\n\\textbf{Phases} &amp; \\text{load, power (3 iterations with warmup)} \\\\\n\\hline\n\\textbf{Tuning} &amp; \\text{None} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;FXCYOGXZLH&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>BenchBox CLI equivalent</strong>:</p><pre><code>$ benchbox run --platform datafusion --benchmark tpch --scale 1 --phases load,power
$ benchbox run --platform datafusion-df --benchmark tpch --scale 1 --phases load,power</code></pre><p><strong>BenchBox raw results</strong> (<a href="https://gist.github.com/joeharris76/be6f7069b758163fc111b1f4a6c888a1">gist</a>):</p><ul><li><p><a href="https://gist.github.com/joeharris76/be6f7069b758163fc111b1f4a6c888a1#file-tpch_sf1_datafusion_sql_20260212_172405_mcp_6e9f667a-json">DataFusion SQL</a></p></li><li><p><a href="https://gist.github.com/joeharris76/be6f7069b758163fc111b1f4a6c888a1#file-tpch_sf1_datafusion_df_20260212_172419_mcp_b5db4a85-json">DataFusion DataFrame</a></p></li></ul><p><strong>Test Limitations</strong>:</p><ul><li><p>Single-node, Apple Silicon, default DataFusion configuration, TPC-H Power test only</p></li><li><p>The TPC-H DataFrame queries were created for BenchBox; the TPC specification defines only SQL queries.</p></li><li><p>BenchBox's DataFrame Q19 may not represent the best possible translation of the SQL query.</p></li><li><p>BenchBox's DataFusion integration reads from parquet files and operates in memory.</p></li></ul><div><hr></div><h2>Try it yourself</h2><p>The full investigation, from platform discovery to query plan analysis, took one session. Connect the BenchBox MCP server to your AI assistant and start with a question like "Which platforms support both SQL and DataFrame execution?"</p><p>Or run directly via CLI:</p><pre><code>$ uv run benchbox run --platform datafusion --benchmark tpch --scale 1
$ uv run benchbox run --platform datafusion-df --benchmark tpch --scale 1
$ uv run benchbox compare --head-to-head --runs {run_id_sql} {run_id_df}</code></pre><p>If you find that you cannot reproduce this, <a href="https://github.com/joeharris76/benchbox/issues">please open an issue</a> with your run result JSON files attached. The key signal is whether Q19 remains more than 2x slower in DataFrame mode on your hardware and DataFusion version.</p><div><hr></div><h2>Resources</h2><ul><li><p><a href="https://github.com/joeharris76/benchbox">BenchBox GitHub Repository</a>, the benchmarking framework used for this analysis</p></li><li><p><a href="https://datafusion.apache.org/">Apache DataFusion</a>, an extensible query engine written in Rust that uses Apache Arrow as its in-memory format</p></li><li><p><a href="https://www.tpc.org/tpch/">TPC-H Benchmark Specification</a>, official documentation for the TPC-H decision support benchmark</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Does your database allow benchmarks? A 2026 DeWitt clause survey]]></title><description><![CDATA[Since 1983 many database vendors forbade independent benchmarks. That started to change in 2021. Here's the current status of benchmark publication rights in 2026.]]></description><link>https://oxbowresearch.com/p/does-your-database-allow-benchmarks</link><guid isPermaLink="false">https://oxbowresearch.com/p/does-your-database-allow-benchmarks</guid><dc:creator><![CDATA[Joe Harris]]></dc:creator><pubDate>Thu, 05 Feb 2026 15:15:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TC5N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TL;DR:</h2><p>For 40 years, "DeWitt clauses" let vendors legally block benchmark publication. 
Since 2021, that's changed: open source databases have no restrictions, major cloud vendors (AWS, Azure, Google, Databricks, Snowflake) now allow benchmarks with methodology disclosure, and only legacy holdouts (Oracle, SQL Server) still require permission. The pattern is clear: vendors confident in their performance welcome scrutiny.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TC5N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TC5N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png 424w, https://substackcdn.com/image/fetch/$s_!TC5N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png 848w, https://substackcdn.com/image/fetch/$s_!TC5N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png 1272w, https://substackcdn.com/image/fetch/$s_!TC5N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TC5N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png" width="1400" height="538" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc6c2920-6605-451c-91b9-70714c09368a_1400x538.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:538,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:259438,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://oxbowresearch.com/i/186985096?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TC5N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png 424w, https://substackcdn.com/image/fetch/$s_!TC5N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png 848w, https://substackcdn.com/image/fetch/$s_!TC5N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png 1272w, https://substackcdn.com/image/fetch/$s_!TC5N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc6c2920-6605-451c-91b9-70714c09368a_1400x538.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>What is the "DeWitt clause"?</h2><p>The DeWitt Clause is part of a software license agreement that prohibits users from publishing benchmark results without the vendor's approval. 
From Oracle's current license:</p><blockquote><p><em>"You may not disclose results of any program benchmark tests without our prior consent."</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p></blockquote><p>Microsoft SQL Server's license contains a similar restriction:</p><blockquote><p><em>"You must obtain Microsoft's prior written approval to disclose to a third party the results of any benchmark test of the software."</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p></blockquote><p>These clauses create a chilling effect on database evaluation. Researchers can't publish comparative studies. Consultants can't share findings with the broader community. Customers have to make purchasing decisions based on vendor marketing rather than independent verification. The only databases that can be rigorously critiqued in public are open source. That asymmetry shields proprietary vendors when their performance falls short and denies them independent validation when it is actually competitive.</p><p>The clauses also have a corrosive, self-reinforcing quality. Once one vendor adopts a DeWitt clause, competitors feel disadvantaged without one. After all, if Vendor A can't be publicly critiqued but Vendor B can, Vendor B faces asymmetric scrutiny regardless of actual performance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><h2>The origin of the clause</h2><p>The DeWitt clause story starts in 1983, when David DeWitt and his colleagues created the Wisconsin Benchmark to measure relational database performance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. When they published their findings, Oracle's performance stood out as particularly poor. 
According to DeWitt, Oracle CEO Larry Ellison was furious. He called the department chair and demanded: "You have to fire this guy."<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>Oracle didn't succeed in getting DeWitt fired. Instead, they did something with longer-lasting consequences: they added a clause to their license agreement prohibiting customers from publishing benchmark results without Oracle's prior written consent<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p>This provision became known as the "DeWitt Clause", somewhat ironic given that DeWitt championed benchmarking. The clause spread throughout the database industry like a virus, adopted by nearly every major commercial vendor. For 40 years, DeWitt's name has been synonymous with preventing the very transparency he fought for<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>.</p><p>But things are changing. Since 2021, there has been a shift in vendor attitudes toward benchmark publication. This post surveys the current landscape as of early 2026, documenting which vendors restrict benchmarks, which have opened up, and what it means for anyone trying to make informed database decisions.</p><div><hr></div><h2>The 2021 turning point</h2><p>For nearly four decades, DeWitt clauses were just how the database industry worked. But cloud hyperscalers were the quiet exception. AWS and Microsoft Azure never adopted traditional DeWitt clauses. Instead, they used reciprocal terms from the start: you could publish benchmarks as long as you shared your methodology and granted them the same rights to benchmark your products. 
They didn't make a big deal about it and mostly used it as a shield when competing with each other.</p><p>Then Databricks turned benchmarking transparency into a competitive weapon. In November 2021, Databricks announced that Databricks SQL had set a new world record on the 100TB TPC-DS benchmark, outperforming the previous record by 2.2x<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. This was significant not just for the result itself, but because it was the first official TPC-audited benchmark from a cloud data warehouse vendor. The results were verified by the Transaction Processing Performance Council in a 37-page disclosure report<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>.</p><p>Six days later, Databricks announced they were eliminating the DeWitt Clause from their service terms entirely<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>. But they went further, introducing what they called a "DeWitt Embrace Clause":</p><blockquote><p><em>"If a competitor or vendor benchmarks Databricks or instructs a third party to do so, this new provision invalidates the vendor's own DeWitt Clause to allow reciprocal benchmarking."</em></p></blockquote><p>In other words: benchmark us, and we can benchmark you back, regardless of what your license says. The move signaled that Databricks was confident in their performance and wanted to force competitors into the open. It worked. Within weeks, Snowflake responded with their own benchmarks and removed their DeWitt clause<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. 
What followed was a brief benchmark war between the two companies, with competing claims, counter-benchmarks, and accusations of unfair methodology.</p><p>The specifics of who "won" that battle matter less than the outcome: two major cloud data warehouse vendors had permanently abandoned benchmark restrictions. Others followed. By the end of 2023, SingleStore had eliminated their clause<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>. AWS and Azure, which had long maintained reciprocal (rather than restrictive) benchmark terms, saw their approach validated.</p><p>The dam hasn't broken entirely. Oracle, Microsoft SQL Server, and several cloud-only services still maintain restrictions. But when I look at the industry now versus 2020, the shift is unmistakable.</p><div><hr></div><h2>Current status by vendor category</h2><p>I surveyed over 25 database vendors and cloud services to document current benchmark publication policies as of January 2026. 
Here's what I found, and what the patterns tell us about vendor confidence.</p><h3>Open source: no restrictions, of course</h3><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Database} &amp; \\textbf{License} &amp; \\textbf{Benchmark Status} \\\\\n\\hline\n\\text{PostgreSQL} &amp; \\text{PostgreSQL License (BSD-like)} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{DuckDB} &amp; \\text{MIT} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{SQLite} &amp; \\text{Public Domain} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{ClickHouse} &amp; \\text{Apache 2.0} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{Apache Spark} &amp; \\text{Apache 2.0} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{Trino} &amp; \\text{Apache 2.0} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{PrestoDB} &amp; \\text{Apache 2.0} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{Apache DataFusion} &amp; \\text{Apache 2.0} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{Polars} &amp; \\text{MIT} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{NVIDIA RAPIDS cuDF} &amp; \\text{Apache 2.0} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;EFDWBCCTCI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Open source licenses cannot contain DeWitt clauses by definition. The Apache 2.0 license grants users the right to "use, reproduce, and distribute" the software for any purpose<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>. The MIT license permits use "without restriction"<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>. 
The PostgreSQL license explicitly allows use "for any purpose, without fee, and without a written agreement"<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>.</p><p>This isn't a loophole, it's fundamental to what open source means. You can benchmark PostgreSQL, publish the results, and PostgreSQL Global Development Group has no legal recourse (nor would they want any).</p><p>DuckDB goes further, actively encouraging benchmarks. Their FAQ recommends using preview releases for fairness, references their academic paper "Fair Benchmarking Considered Difficult" on methodology pitfalls, and asks only that you report version numbers<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a>.</p><p>If you're benchmarking open source databases, you have nothing to worry about legally. Publish away.</p><h3>Cloud vendors with reciprocal rights (DeWitt embrace)</h3><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|l|}\n\\hline\n\\textbf{Vendor} &amp; \\textbf{Services} &amp; \\textbf{Policy Summary} &amp; \\textbf{When Adopted} \\\\\n\\hline\n\\text{AWS} &amp; \\text{Redshift, Athena, Aurora} &amp; \\text{May benchmark; must provide replication details; AWS may benchmark you} &amp; \\text{Pre-2021} \\\\\n\\hline\n\\text{Microsoft Azure} &amp; \\text{Synapse Analytics} &amp; \\text{May benchmark; competitors must share methodology and waive own restrictions} &amp; \\text{Pre-2021} \\\\\n\\hline\n\\text{Google Cloud} &amp; \\text{BigQuery, Spanner, AlloyDB} &amp; \\text{May benchmark; reciprocal rights; hyperscaler exclusion} &amp; \\text{Updated 2022-2024} \\\\\n\\hline\n\\text{Databricks} &amp; \\text{Databricks SQL, Delta Lake} &amp; \\text{May benchmark; must share methodology; grants reciprocal rights} &amp; \\text{Nov 2021} \\\\\n\\hline\n\\text{Snowflake} &amp; 
\\text{Snowflake} &amp; \\text{May benchmark; reciprocal rights apply} &amp; \\text{Nov 2021} \\\\\n\\hline\n\\text{SingleStore} &amp; \\text{SingleStore} &amp; \\text{May benchmark; must be reproducible; reciprocal rights} &amp; \\text{Nov 2023} \\\\\n\\hline\n\\text{Firebolt} &amp; \\text{Firebolt} &amp; \\text{May benchmark (no DeWitt clause per terms review)} &amp; \\text{Unknown} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;VQVRWWRSSQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>These vendors permit benchmark publication with reciprocal rights provisions, what Databricks branded the "DeWitt Embrace" approach. Notably, AWS and Azure had these terms in place <em>before</em> Databricks coined the phrase; the hyperscalers never adopted traditional restrictive clauses. The key elements are consistent across vendors:</p><ol><li><p><strong>You may benchmark and publish results</strong></p></li><li><p><strong>You must provide methodology sufficient for reproduction</strong></p></li><li><p><strong>By publishing, you grant the vendor reciprocal benchmarking rights</strong></p></li></ol><p>The AWS Service Terms<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a> state that you may benchmark and disclose results, but you must include "all information necessary to replicate such Benchmark," and AWS gains the right to benchmark your products in return. Microsoft's Online Services Terms<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a> use nearly identical language.</p><p>The reciprocal element is clever: if you're a competitor and you publish benchmarks of their service, you've just waived your own benchmark restrictions. 
This creates mutually assured transparency, at least among those who choose to engage.</p><p>You can benchmark these services freely, provided you document your methodology thoroughly. For most evaluators (who aren't competing database vendors), the reciprocal obligation is irrelevant.</p><h3>Google Cloud: a recent convert</h3><p><strong>Google Cloud quietly removed their DeWitt clause.</strong> Sometime between 2022 and 2024, they updated their benchmark terms, a change that hasn't been widely commented on.</p><p>The <em>old</em> Google Cloud terms (circa 2022) required customers to "obtain Google's prior written consent" before publishing any benchmark results<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a>. This was a traditional restrictive DeWitt clause, similar to Oracle's.</p><p>The <em>current</em> terms use the same reciprocal approach as AWS and Azure:</p><blockquote><p><em>"Customer may conduct benchmark tests of the Services (each a 'Test'). 
Customer may only publicly disclose the results of such Tests if (a) the public disclosure includes all necessary information to replicate the Tests, and (b) Customer allows Google to conduct benchmark tests of Customer's publicly available products or services and publicly disclose the results of such tests."</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-20" href="#footnote-20" target="_self">20</a></p></blockquote><p>No prior approval required, just reciprocity.</p><p><strong>The hyperscaler exclusion</strong>: Google does maintain one unique restriction:</p><blockquote><p><em>"Customer may not do either of the following on behalf of a hyperscale public cloud provider without Google's prior written consent: (i) conduct (directly or through a third party) any Test or (ii) disclose the results of any such Test."</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-21" href="#footnote-21" target="_self">21</a></p></blockquote><p>This means AWS, Azure, or other hyperscale competitors can't commission benchmarks of Google Cloud services without permission. Independent researchers, enterprises, and consultants are unaffected.</p><p><strong>Service-specific restrictions</strong>: A few Google Cloud services still have full benchmark prohibitions:</p><ul><li><p>Cloud NGFW Enterprise (firewall)</p></li><li><p>Cloud IDS (intrusion detection)</p></li></ul><p>For these security services, customers "will not disclose, publish, or otherwise make publicly-available any benchmark, or performance or comparison tests."</p><p><strong>Note</strong>: Some surveys (including Cube.dev's DeWitt Clause list) still categorize Google Cloud as having a restrictive DeWitt clause. This appears to be based on the older terms or the hyperscaler exclusion. 
For independent evaluators using current terms, Google Cloud is functionally equivalent to AWS and Azure.</p><p>You can benchmark BigQuery, Spanner, AlloyDB, and other Google Cloud database services freely, same rules as AWS and Azure. The hyperscaler exclusion only matters if you're literally acting as an agent for a competing cloud provider.</p><h3>Still restricted: the holdouts</h3><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|}\n\\hline\n\\textbf{Vendor} &amp; \\textbf{Restriction} \\\\\n\\hline\n\\text{Oracle Database} &amp; \\text{Prior written consent required} \\\\\n\\hline\n\\text{Microsoft SQL Server (on-premises)} &amp; \\text{Prior written approval required} \\\\\n\\hline\n\\text{MariaDB SkySQL/Xpand} &amp; \\text{Restrictions in cloud terms} \\\\\n\\hline\n\\text{Elastic Cloud} &amp; \\text{DeWitt clause in DBaaS terms} \\\\\n\\hline\n\\text{InfluxDB Cloud} &amp; \\text{Publication restrictions} \\\\\n\\hline\n\\text{PlanetScale} &amp; \\text{Benchmark restrictions} \\\\\n\\hline\n\\text{Couchbase Capella} &amp; \\text{Restrictions in cloud offering} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;AOMJXGLXMH&quot;}" data-component-name="LatexBlockToDOM"></div><p>Oracle maintains the original DeWitt clause that started it all. Microsoft SQL Server (the on-premises product, distinct from Azure SQL/Synapse) still requires written approval. Several cloud-only database services restrict benchmarks, particularly for their managed offerings even when the underlying technology is open source.</p><p><strong>Why do these vendors hold out?</strong> I see a few patterns:</p><ul><li><p><strong>Legal inertia</strong>: The clause has always been there; removing it requires someone to affirmatively decide to change it</p></li><li><p><strong>Performance concerns</strong>: If you're not confident in your performance, transparency is risky. 
The vendors who embrace open benchmarking tend to be the ones winning benchmarks.</p></li><li><p><strong>Customer lock-in</strong>: Existing customers can't easily compare alternatives when they can't see independent comparisons</p></li><li><p><strong>Different market dynamics</strong>: Enterprise sales happen behind closed doors; public benchmarks matter less when you're selling to procurement committees</p></li></ul><p>The pattern is telling: vendors confident in their performance actively encourage benchmarks. Vendors who restrict them are telling you something about their confidence level.</p><div><hr></div><h2>Practical guidance</h2><h3>For database evaluators</h3><p>Here's what I do before publishing any benchmark, and what I recommend you do too:</p><ol><li><p><strong>Check the license first</strong>. Before running any benchmark you plan to publish, read the terms of service. A few minutes of legal review can save significant headaches. I learned this the hard way.</p></li><li><p><strong>Open source is always safe</strong>. PostgreSQL, DuckDB, ClickHouse, Spark: benchmark freely. This is one reason I favor open source for my work.</p></li><li><p><strong>DeWitt embrace vendors require methodology</strong>. For AWS, Azure, Databricks, Snowflake, and similar services, document everything:</p><ul><li><p>Hardware specifications (instance types, CPU, RAM, storage)</p></li><li><p>Software versions (database version, OS, drivers)</p></li><li><p>Configuration (all non-default settings)</p></li><li><p>Data generation process</p></li><li><p>Query execution methodology</p></li><li><p>Commands to reproduce</p></li></ul><p><strong>Note:</strong> Using BenchBox makes it very simple to meet these requirements.</p></li><li><p><strong>Restricted vendors require permission</strong>. For Oracle, SQL Server, and similar products, either get written approval, anonymize results ("Database A" vs "Database B"), or don't publish.</p></li><li><p><strong>When in doubt, ask</strong>. 
Vendor legal teams can clarify what's permitted. Get it in writing.</p></li></ol><h3>For enterprises</h3><p>Questions to ask during vendor evaluation:</p><ul><li><p>"Can we publish benchmark results comparing your product to alternatives?"</p></li><li><p>"What restrictions apply to sharing performance data with our industry peers?"</p></li><li><p>"Will you provide a benchmarking waiver as part of our contract?"</p></li></ul><p>A vendor's answer tells you something about their confidence in their product.</p><h3>For content creators</h3><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|}\n\\hline\n\\textbf{Vendor Category} &amp; \\textbf{What You Can Publish} \\\\\n\\hline\n\\text{Open source} &amp; \\text{Anything} \\\\\n\\hline\n\\text{DeWitt Embrace} &amp; \\text{Anything, with full methodology} \\\\\n\\hline\n\\text{Conditional (BigQuery)} &amp; \\text{Anything, unless working for a hyperscaler} \\\\\n\\hline\n\\text{Restricted} &amp; \\text{Nothing without permission, or anonymize} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;NOSOGVIAIB&quot;}" data-component-name="LatexBlockToDOM"></div><h2>The vector database exception</h2><p>The rise of AI and vector search has created a new category of databases, and a new set of benchmark restrictions.</p><p>A 2024 survey of vector databases found that several cloud offerings restrict benchmarks even when their open source cores don't:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{array}{|l|l|l|}\n\\hline\n\\textbf{Database} &amp; \\textbf{Open Source Status} &amp; \\textbf{Cloud Benchmark Policy} \\\\\n\\hline\n\\text{Milvus/Zilliz} &amp; \\text{Apache 2.0} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{Qdrant} &amp; \\text{Apache 2.0} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{Weaviate} &amp; \\text{BSD-3} &amp; \\text{Unrestricted} \\\\\n\\hline\n\\text{Pinecone} &amp; \\text{Proprietary} &amp; \\text{Removed 
restrictions May 2024} \\\\\n\\hline\n\\text{Elasticsearch} &amp; \\text{SSPL} &amp; \\text{Elastic Cloud restricted} \\\\\n\\hline\n\\text{Couchbase} &amp; \\text{Apache 2.0 (core)} &amp; \\text{Capella cloud restricted} \\\\\n\\hline\n\\end{array}&quot;,&quot;id&quot;:&quot;LHMEHYEWXQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>The pattern is notable: vendors restrict benchmarks specifically for their managed cloud offerings, even when the underlying database engine is open source. This suggests the restriction is about protecting cloud margins rather than the technology itself.</p><div><hr></div><h2>What should change</h2><p>DeWitt clauses are anti-consumer and anti-competitive. Here's what I think needs to happen:</p><ul><li><p><strong>Academic exemptions should be universal</strong>. Researchers should be able to publish benchmark results without fear of legal action. Some licenses technically permit academic use; this should be explicit and standard across the industry.</p></li><li><p><strong>The market is moving toward transparency, and that's good</strong>. The vendors with the best performance actively encourage benchmarks. The correlation isn't coincidental. Transparency favors the winners. The holdouts should take note.</p></li><li><p><strong>TPC should continue expanding</strong>. The Transaction Processing Performance Council's acceptance of cloud-native benchmarks (starting with Databricks' 2021 TPC-DS submission) has helped legitimize comparative performance testing. More standardized, audited benchmarks across more workloads would benefit everyone.</p></li><li><p><strong>Enterprises should push back</strong>. During your next vendor evaluation, ask: "Can we publish benchmark results comparing your product to alternatives?" A vendor's answer tells you something about their confidence. 
Consider adding benchmarking rights to your contract negotiations.</p></li></ul><div><hr></div><h2>Conclusion</h2><p>The DeWitt Clause, born from Larry Ellison's fury at unflattering benchmark results in 1983, spread throughout the database industry for four decades. But the landscape has fundamentally shifted since 2021.</p><p>Today, you can benchmark most major databases and publish the results freely:</p><ul><li><p><strong>All major open source databases</strong> have no restrictions</p></li><li><p><strong>All major cloud warehouses</strong> (AWS, Azure, Google Cloud, Databricks, Snowflake) permit benchmarks with methodology disclosure</p></li><li><p><strong>Traditional enterprise vendors</strong> (Oracle, SQL Server) maintain restrictions</p></li></ul><p>For anyone evaluating databases in 2026, this is good news. You can conduct independent, reproducible performance testing of most modern data platforms and share your findings with the community.</p><p>Before your next database evaluation, check the vendor's benchmarking terms. If they restrict publication, ask yourself what they're hiding. And if you're negotiating an enterprise contract, push for benchmarking rights. The more customers demand transparency, the faster the holdouts will fold.</p><p>All of the benchmarking research from Oxbow Research is published with complete reproduction instructions via BenchBox.</p><p>The database industry's four-decade experiment with benchmark censorship is ending. 
Not with a legal ruling or regulatory mandate, but with competitive pressure from vendors confident enough in their performance to welcome scrutiny.</p><p>I hope David DeWitt approves.</p><div><hr></div><h2>Methodology</h2><p>This survey was conducted in January 2026 by reviewing:</p><ul><li><p>Vendor terms of service and acceptable use policies</p></li><li><p>License agreements and service-specific terms</p></li><li><p>Historical vendor announcements and blog posts</p></li><li><p>Third-party analyses from Cube.dev, benchANT, and others</p></li></ul><p>Terms of service change frequently. I recommend verifying current terms before publishing any benchmark results. Links to primary sources are provided in the footnotes.</p><p>This post reflects my good-faith understanding of vendor policies and does not constitute legal advice.</p><div><hr></div><h2>Footnotes</h2><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://www.oracle.com/downloads/licenses/standard-license.html">Oracle Technology Network License Agreement</a> - Oracle</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://www.microsoft.com/licensing/terms/">Microsoft SQL Server License Terms</a> - Microsoft</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://www.brentozar.com/archive/2018/05/the-dewitt-clause-why-you-rarely-see-database-benchmarks/">The DeWitt Clause: Why You Rarely See Database Benchmarks</a> - Brent Ozar, May 2018</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Bitton, D., DeWitt, D. J., &amp; Turbyfill, C. (1983). Benchmarking database systems: A systematic approach. <em>VLDB Conference Proceedings</em></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p><a href="https://danluu.com/anon-benchmark/">That time Oracle tried to have a professor fired for benchmarking their database</a> - Dan Luu</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><a href="https://cube.dev/blog/dewitt-clause-or-can-you-benchmark-a-database">DeWitt clause, or Can you benchmark a database and get away with it</a> - Cube Blog</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" 
class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p><a href="https://dwheeler.com/essays/dewitt-clause.html">The DeWitt clause's censorship should be illegal</a> - David A. Wheeler</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p><a href="https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html">Databricks Sets Official Data Warehousing Performance Record</a> - Databricks Blog, November 2021</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p><a href="https://www.tpc.org/results/fdr/tpcds/databricks~tpcds~100000~databricks_sql_8.3~fdr~2021-11-02~v01.pdf">TPC-DS Full Disclosure Report for Databricks</a> - TPC, November 2021</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p><a href="https://www.databricks.com/blog/2021/11/08/eliminating-the-dewitt-clause-for-database-benchmarking.html">Eliminating the DeWitt Clause for Database Benchmarking</a> - Databricks Blog, November 2021</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p><a href="https://www.linkedin.com/pulse/snowflake-vs-databricks-tpcs-ds-benchmark-wars-who-cares-jacobs/">Snowflake vs Databricks: TPC-DS Benchmark Wars</a> - LinkedIn, November 2021</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" 
href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p><a href="https://www.singlestore.com/blog/eliminating-the-dewitt-clause-for-greater-transparency-in-benchmarking/">Eliminating the DeWitt Clause for Greater Transparency in Benchmarking</a> - SingleStore Blog, November 2023</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p><a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a> - Apache Software Foundation</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p><a href="https://opensource.org/licenses/MIT">MIT License</a> - Open Source Initiative</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p><a href="https://www.postgresql.org/about/licence/">PostgreSQL License</a> - PostgreSQL Global Development Group</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p><a href="https://duckdb.org/faq">DuckDB FAQ</a> - DuckDB</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p><a href="https://aws.amazon.com/service-terms/">AWS Service Terms</a> - Amazon Web Services. 
Section on Benchmarking.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p><a href="https://www.microsoft.com/licensing/terms/product/ForOnlineServices/MCA">Microsoft Online Services Terms</a> - Microsoft</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p><a href="https://cloud.google.com/terms/service-terms">Google Cloud Service Specific Terms</a> - Google Cloud. Section 7 (Benchmarking). Note: Terms were updated between 2022-2024 to remove prior written consent requirement.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><p><a href="https://benchant.com/blog/vectordb-de-witt">To Benchmark Vector Databases or to Get Sued for breaching a DeWitt Clause?</a> - benchANT, April 2024</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-21" href="#footnote-anchor-21" class="footnote-number" contenteditable="false" target="_self">21</a><div class="footnote-content"><p><a href="https://discuss.google.dev/t/dewitt-clause/90532">DeWitt Clause discussion - Google Developer forums</a> - August 2022. 
Discusses older Google Cloud terms that required prior written consent.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Introducing Oxbow Research]]></title><description><![CDATA[Independent data platform analysis using open-source methodology.]]></description><link>https://oxbowresearch.com/p/introducing-oxbow-research</link><guid isPermaLink="false">https://oxbowresearch.com/p/introducing-oxbow-research</guid><dc:creator><![CDATA[Joe Harris]]></dc:creator><pubDate>Mon, 02 Feb 2026 21:46:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!iPiJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR</strong>: What is Oxbow Research? Independent analysis of data platform performance and pricing. I also review market trends, vendor strategy, and post deep dives on historical companies. 
Performance analysis is based on benchmarks run with <a href="https://benchbox.dev/">BenchBox</a>, the open-source framework I created for transparency and reproducibility.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iPiJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iPiJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic 424w, https://substackcdn.com/image/fetch/$s_!iPiJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic 848w, https://substackcdn.com/image/fetch/$s_!iPiJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic 1272w, https://substackcdn.com/image/fetch/$s_!iPiJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iPiJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic" width="1308" height="512"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:512,&quot;width&quot;:1308,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27466,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://oxbowresearch.com/i/186633309?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iPiJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic 424w, https://substackcdn.com/image/fetch/$s_!iPiJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic 848w, https://substackcdn.com/image/fetch/$s_!iPiJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic 1272w, https://substackcdn.com/image/fetch/$s_!iPiJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5691fe6-e0f3-4682-96bd-01a979f1de0b_1308x512.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://oxbowresearch.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://oxbowresearch.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h2><strong>The problem with benchmarks</strong></h2><p>No one trusts data platform benchmarks. Data platform vendors don&#8217;t <em>exactly</em> mislead with their benchmarking efforts but, understandably, they only publish benchmarks if they win. Snowflake publishes benchmarks where they look best, Databricks publishes benchmarks where they look best, and data practitioners are left comparing apples to oranges with no easy way to verify either claim.
The industry calls this &#8220;benchmarketing&#8221; and it&#8217;s been the norm for decades.</p><p>The same dynamic explains why there are so few official TPC results. TPC certification costs $100k+<sup>[1]</sup>. It only makes sense to publish a result if you are sure your competitors won&#8217;t beat it quickly. If you <em>know</em> a competitor <em>could</em> publish a better (or even close) result, then publishing your TPC result is shooting yourself in the foot. That&#8217;s why most vendors either never publish at all, or publish once, claim the crown, and never update. The incentives guarantee you won&#8217;t see an apples-to-apples comparison unless someone outside the vendor ecosystem creates one.</p><p>For data practitioners this creates a real problem: you need to justify your data platform choice and budget. You have a few outdated vendor benchmarks (apples to oranges), analyst quadrants (expensive and vague), and your own experience (limited to the platforms you know). You can run your own benchmarks, but it soaks up engineering time, researching configs, debugging drivers, and fighting with cloud permissions, when you should be shipping useful data products for your business.</p><h4><strong>What about independent benchmarks?</strong></h4><p>They often suffer from a few common problems:</p><p><em>Conflict of Interest: </em>Fivetran&#8217;s cloud data warehouse benchmark was very useful, but Fivetran&#8217;s business requires close relationships with cloud data warehouse vendors. The conclusion &#8220;they&#8217;re all pretty good&#8221; might be accurate, but it reads differently when their revenue depends on not offending anyone on the list.</p><p><em>Single Platform Experts:</em> Benchmarks are often run by practitioners who are experts in one platform but have limited experience with the competing platforms. Doing this well requires considerable effort because it&#8217;s hard to create best-case tunings for platforms you don&#8217;t know well.
It&#8217;s all too easy to write off an unfamiliar platform as &#8220;slow&#8221; when it&#8217;s just misconfigured.</p><p><em>TPC &#8220;Inspired&#8221;:</em> A common category of problematic benchmark is the TPC-H or TPC-DS &#8220;inspired&#8221; run. These skip the complex official TPC methodology, which requires data generation, query validation, specific query ordering, concurrent testing, refresh operations, and unique measurement logic. &#8220;Inspired&#8221; results can be directionally useful, but they&#8217;re not directly comparable to other TPC-H or TPC-DS &#8220;inspired&#8221; results because none of them adhere to the spec.</p><p>And finally there&#8217;s governance. Who decides if a benchmark was run fairly? Who handles complaints? Who updates results when new versions ship? Usually nobody. The benchmark gets published, gets shared on Hacker News, and sits there, frozen in time, increasingly outdated, with no process for correction or update.</p><div><hr></div><h2><strong>What I built</strong></h2><p>Vendor Benchmarks vs Oxbow Research:</p><ul><li><p>Funding: Vendor funded <em>vs</em> Subscriber funded</p></li><li><p>Methodology: Custom scripts <em>vs</em> Versioned toolkit</p></li><li><p>Reproducibility: Good luck <em>vs</em> <code>pip install benchbox</code></p></li><li><p>Governance: Trust us <em>vs</em> documented process</p></li><li><p>Analysis: &#8220;We&#8217;re the fastest!&#8221; <em>vs</em> &#8220;Fastest at what, exactly?&#8221;</p></li></ul><p><strong>BenchBox</strong> is the foundation: an easy-to-use, open-source benchmarking toolkit with an MIT license.</p><pre><code><code>uv pip install benchbox
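# Annotated sketch of the run command below. Flag meanings are inferred from
# this post, not from the BenchBox docs - treat them as illustrative:
#   --platform   the system under test (DuckDB here)
#   --benchmark  the workload to run (TPC-H)
#   --scale      TPC-H scale factor; e.g. --scale 10 applies the same
#                methodology to a larger dataset
#   --phase      power = the single-stream TPC-H Power test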
uv run benchbox run --platform duckdb --benchmark tpch --scale 1 --phase power</code></code></pre><p>BenchBox will run a spec-compliant TPC-H Power test: generating data with <code>dbgen</code>, running all 22 queries in the correct order with proper parameterization (1 warmup run + 3 measurement runs), and reporting the geometric-mean performance metric (Power@Size). Scale factor 10? Same methodology, larger dataset, correct parameterization for that scale. Scale factor 100? Depends on your hardware, but you&#8217;ll know exactly what configuration produced those numbers, because you ran it.</p><p>Oxbow Research is independent and self-funded. I have no outside investors or employer to keep happy. Every benchmark I publish uses BenchBox, the same open-source tool anyone can run. If you disagree with my results, reproduce them and show me. Methodology debates happen on GitHub: if something is wrong (or missing), we&#8217;ll fix it for everyone in public.</p><div><hr></div><h2><strong>What I&#8217;ll write about</strong></h2><p>Benchmark results with full methodology: TPC-H power tests, TPC-DS, ClickBench. Industry economics and vendor analysis. Technical deep-dives on data platform internals. Historical perspectives on analytics technology. I have opinions. I&#8217;ll tell you what they are and why.</p><div><hr></div><h2><strong>Why &#8220;Oxbow&#8221;?</strong></h2><p>The data industry has gone through numerous cycles where a technology or approach seems to completely dominate the market (or the mindshare) for a few years and then becomes less relevant as the market moves on to a different trend.
An oxbow lake forms when a river cuts a new channel and leaves an old bend behind. So that&#8217;s the metaphor for Oxbow Research: understanding the speed and course of the current path for data platforms, and thinking about where and why the previous path diverged.</p><p>Here are a few &#8220;oxbows&#8221; that I&#8217;ve seen in my career:</p><ul><li><p>Rowstore + Indexing - Oracle</p></li><li><p>MPP Rowstores - Teradata</p></li><li><p>DW Appliances - Netezza</p></li><li><p>Early Columnstores - Vertica</p></li><li><p>Data Lakes - Hadoop, S3</p></li><li><p>Cloud Data Warehouses - Redshift, Snowflake</p></li><li><p>Lakehouse + Open Table formats - Databricks, Delta Lake, Iceberg</p><ul><li><p><em>The current path</em></p></li></ul></li><li><p>Composable Data Stacks - DuckDB, DataFusion, Polars</p><ul><li><p><em>The next path?</em></p></li></ul></li></ul><div><hr></div><h2><strong>What&#8217;s next</strong></h2><p>Subscribe to the Oxbow Research newsletter to stay informed on upcoming research, analysis, and deep dive posts. <a href="https://benchbox.dev/">BenchBox</a> is freely available today.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://oxbowresearch.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Oxbow Research is reader-supported. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"></form></div></div><div><hr></div><ol><li><p><a href="https://www.tpc.org/information/about/policies">TPC Policies</a> - TPC, accessed January 2026.
Full benchmark certification requires third-party auditing and TPC membership. &#8617;</p></li></ol>]]></content:encoded></item></channel></rss>