Health Samurai published an open-source FHIR server performance benchmark on June 29, 2026, testing Aidbox, HAPI FHIR, Medplum, and the Microsoft FHIR Server on the same hardware and the same Synthea dataset. The numbers are getting most of the coverage, but the repo behind them is the part worth reading for any team that wants to rerun the harness or audit how the numbers are produced. This walkthrough covers what is in the repository and where to look first.
The benchmark is vendor-run: Health Samurai builds Aidbox. Because the repo is open, the results can be rerun and checked independently.
The Repo Layout in Practice
The repository is roughly 55 percent TypeScript by line count; the harness, the k6 scripts, and the orchestration logic are all written in it. CI runs on Drone CI, which also drives the daily rerun. Metrics are emitted in Prometheus format and rendered through Grafana, the same pipeline that feeds the public dashboard.
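To make the Prometheus step concrete, here is a minimal Python sketch of how one benchmark sample could be rendered as a line in the Prometheus text exposition format. It is not taken from the repo: the metric name `fhir_bench_requests_per_second`, the label names, and the value are all placeholders chosen for illustration.

```python
def escape_label_value(value: str) -> str:
    # The text exposition format escapes backslash, double quote,
    # and newline inside label values.
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_sample(name: str, labels: dict[str, str], value: float) -> str:
    # Sort labels so the rendered line is stable across runs.
    label_text = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    body = f"{name}{{{label_text}}}" if labels else name
    return f"{body} {value}"


# Hypothetical metric name, labels, and value; not real benchmark output.
line = render_sample(
    "fhir_bench_requests_per_second",
    {"server": "hapi", "scenario": "crud"},
    123.0,
)
print(line)
```

Carrying the server and the scenario as labels on one metric, rather than as separate metric names, is what would let a single Grafana panel compare all four servers side by side.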
The repo entry point most readers will want is the top-level runner.sh. That script wires together the three things that matter for reproducibility: the docker-compose files that pin each FHIR server to its 8 vCPU and 24 GB memory allocation, the Synthea-generated 1,000-patient dataset used as the input, and the k6 scenarios that drive load against each server in turn. Anyone wanting to look at it directly can clone the public benchmark repo and follow the README.
For broader context on how teams pick FHIR servers, the FHIR server buyer's guide sets the framing. The FHIR engineering reference holds the rest of the related material.
Three Places to Look First
For a reader trying to understand the harness rather than the numbers, three areas of the repo carry most of the signal.
- The docker-compose definitions per server. Aidbox, HAPI, Medplum, and the Microsoft FHIR Server each get their own compose file with explicit CPU and memory limits. Medplum is configured as eight one-vCPU replicas because that matches how it is designed to scale. Aidbox, HAPI, and Medplum back onto PostgreSQL 18; Microsoft uses SQL Server 2022 Developer Edition.
- The k6 scenarios. CRUD, bundle import, and search workloads each have their own script. Concurrency is set per scenario, and the same script runs against all four servers, which is what keeps the comparison like-for-like.
- The Drone CI pipeline. The pipeline is what drives the daily rerun and posts results into the Grafana dashboard. Reading it is the cleanest way to see exactly what happens between code change and published number.
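The resource parity described in the first bullet can be checked mechanically. The Python sketch below models each server's layout and flags any that does not add up to the 8 vCPU and 24 GB budget. The source only states that Medplum runs as eight one-vCPU replicas; the single-replica layout for the other servers and the 3 GB per Medplum replica are assumptions made for illustration, not values read from the compose files.

```python
from dataclasses import dataclass

BUDGET_CPUS = 8
BUDGET_MEMORY_GB = 24


@dataclass(frozen=True)
class ServerLayout:
    name: str
    replicas: int
    cpus_per_replica: float
    memory_gb_per_replica: float

    @property
    def total_cpus(self) -> float:
        return self.replicas * self.cpus_per_replica

    @property
    def total_memory_gb(self) -> float:
        return self.replicas * self.memory_gb_per_replica


# Assumed layouts: only Medplum's eight one-vCPU replicas come from the article.
LAYOUTS = [
    ServerLayout("aidbox", 1, 8, 24),
    ServerLayout("hapi", 1, 8, 24),
    ServerLayout("medplum", 8, 1, 3),  # 3 GB per replica is an assumption
    ServerLayout("microsoft", 1, 8, 24),
]


def off_budget(layouts: list[ServerLayout]) -> list[str]:
    # Return the names of servers whose totals differ from the shared budget.
    return [
        layout.name
        for layout in layouts
        if layout.total_cpus != BUDGET_CPUS
        or layout.total_memory_gb != BUDGET_MEMORY_GB
    ]


print(off_budget(LAYOUTS))
```

A check like this is most useful after forking: if a compose file is edited to fit smaller local hardware, every server should be scaled by the same factor or the comparison stops being like-for-like.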
What the Repo Is Useful For Beyond Reading the Numbers
The most useful property of the repo is that you can rerun it locally. Your hardware will not match the 64-core bare-metal box Health Samurai uses for the canonical dashboard, but the harness, scenarios, and scoring logic will be the same. Anyone evaluating FHIR servers for a specific workload can fork the repo, swap in their own Synthea profile or their own k6 scenario, and produce numbers that are at least directionally comparable. That is the same operating model the HAPI FHIR vs Microsoft FHIR Service comparison talks about when it discusses reproducible benchmarking.
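One way to read "directionally comparable" is to ignore absolute throughput and compare each server against a baseline within the same run. The Python sketch below does that under stated assumptions: the server names and requests-per-second figures are invented placeholders, not results from the benchmark, and the repo's own scoring logic may work differently.

```python
def relative_to_baseline(throughput: dict[str, float], baseline: str) -> dict[str, float]:
    # Express each server's throughput as a multiple of the baseline server's.
    base = throughput[baseline]
    if base <= 0:
        raise ValueError("baseline throughput must be positive")
    return {server: round(rps / base, 2) for server, rps in throughput.items()}


def same_ranking(run_a: dict[str, float], run_b: dict[str, float]) -> bool:
    # Two runs agree directionally if they order the servers the same way.
    def order(run: dict[str, float]) -> list[str]:
        return sorted(run, key=lambda server: run[server], reverse=True)

    return order(run_a) == order(run_b)


# Placeholder requests-per-second figures for illustration only.
local_run = {"server_a": 400.0, "server_b": 200.0, "server_c": 100.0}
reference_run = {"server_a": 1600.0, "server_b": 1000.0, "server_c": 300.0}

print(relative_to_baseline(local_run, "server_c"))
print(same_ranking(local_run, reference_run))
```

If the ranking holds on local hardware but the ratios shift a lot, that suggests the servers scale differently with core count, which is worth knowing before sizing a deployment.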
The Medplum CTO has already publicly forked the repo, which is the use case the harness is meant to support. The repo being readable and runnable is the part that matters; the leaderboard is a snapshot.



