---
{
  "title": "FHIR TX Benchmark: Round 0",
  "description": "We're publishing the first in a series of open FHIR benchmarks: a performance benchmark for FHIR terminology servers across 20 test cases and 5 servers.",
  "date": "2026-04-13",
  "author": "Orlando Osorio",
  "tags": ["Terminology", "Performance", "Benchmark"]
}
---

![FHIR TX Benchmark — performance scores across terminology servers](hero.avif)

At Health Samurai, we care deeply about performance.

We're working on a series of open FHIR benchmarks. The first is the FHIR TX Benchmark: a terminology server benchmark with 20 test cases and 5 servers, all running on the same data and hardware, in isolation.

Terminology servers are often a critical piece of a system's infrastructure and can easily become a performance bottleneck. They are used in validation, power UI elements, serve as the backend for search and filtering, and are increasingly integrated into AI-powered workflows to ground agents in reality.

We want it all to be as transparent as possible: all the scripts, tests, data, results, and instructions are available on [GitHub](https://github.com/HealthSamurai/tx-benchmark). There's a live website built from the repo with the results at https://healthsamurai.github.io/tx-benchmark.

{% hint style="info" %}
**On Methodology and Bias**

Health Samurai built the initial version of this benchmark and our own server (Termbox, releasing later this month) is one of the five being tested. We know how that looks. Our answer is the methodology itself: every test, every parameter, every formula is documented and open, the data is reproducible, and we've been in conversation with the other implementers. To our knowledge, no benchmark like this exists for FHIR terminology servers. The methodology, test suite, and scoring are open to scrutiny and we welcome suggestions, corrections, and contributions from the community and from other implementers.

{% endhint %}

## Methodology

The [FHIR Terminology Server Spec](https://build.fhir.org/terminology-service.html) spans multiple operations, resource types, advanced features, and use cases. This poses a challenge when deciding what to test and how to test it. Tests were selected for their performance relevance. For example, a full-text search is important for user interfaces and carries a very different implementation cost than a simple code lookup, so we have a test for each.

Conformance was a non-goal; we already have a comprehensive conformance test suite in the [Terminology Ecosystem IG](https://github.com/HL7/fhir-tx-ecosystem-ig).

Server capabilities and objectives vary: some provide partial support for certain operations, others are built for specific use cases or optimized for particular terminologies. We only test each server on the features it supports. The full methodology is documented on the [methodology page](https://healthsamurai.github.io/tx-benchmark/methodology/).

### Tests

For the tests, we selected some of the main terminology operations: Lookup, Validate Code, Expand, Translate, and Subsumes. Each test is identified by a short code made of the operation's initials followed by a sequence number: LK01 for the first Lookup test, EX02 for the second Expand, and so on.

We often have multiple tests for the same operation. This lets us cover different code paths that are relevant for performance, different terminologies, or specific features of an operation that warrant a test of their own.

The full list of tests and their descriptions is documented on the [tests page](https://healthsamurai.github.io/tx-benchmark/tests/).

### Data

We load the same terminology dataset into each server (whenever possible). Instructions and licensing requirements are documented in the GitHub repository.

Terminologies were selected for significance and variety (in structure, hierarchy, properties, and size), along with different types of value sets. The main ones loaded for this round are:

- SNOMED International - 20260201
- SNOMED US Edition - 20260301
- SNOMED UK Edition - 20260211
- LOINC - 2.82
- RxNorm - March 2026
- A set of FHIR packages including R4 core, THO, US Core, VSAC, and others

The full list is on the [data page](https://healthsamurai.github.io/tx-benchmark/data/).

### Scoring

The benchmark produces a composite score that lets you compare overall server performance at a glance. Twenty tests across six operations produce a lot of numbers and the composite distills them into a single comparable value. We were heavily inspired by the [TechEmpower Framework Benchmarks](https://github.com/TechEmpower/FrameworkBenchmarks/wiki/Composite-Scoring-for-Frameworks) and based our algorithms on theirs.

Scoring is based on throughput. Each test is run at three concurrency levels: 1, 10, and 50 virtual users. We keep the highest throughput across the three: if a server peaks at 10 VUs and degrades at higher concurrency, we still use that peak for scoring.
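As a minimal sketch of the peak-selection step, with made-up numbers:

```python
# Hypothetical throughputs (requests/sec) for one test at each VU level.
runs_by_vus = {1: 850.0, 10: 4200.0, 50: 3900.0}

# Scoring keeps the best result across the three runs, so a server that
# peaks at 10 VUs isn't penalized for degrading at 50.
peak = max(runs_by_vus.values())
print(peak)  # 4200.0
```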

**Normalization**: Throughputs are normalized against `LK01`, the baseline lookup test. A lookup will always be faster than a complex expansion; raw RPS would let "easy" tests dominate everything, which isn't fair or useful.

**Bias**: We multiply each test's result by a _bias coefficient_. This lets us scale a test's contribution to the overall score according to its importance. For example, the performance of a lookup on SNOMED should matter more to a terminology server than a CodeSystem search, as the former is more often on a critical path. The coefficients are documented and open to feedback (like everything else).

**Imputation**: Missing support isn't the same as poor performance. Scoring zero for unsupported tests is too harsh; giving a free pass on a hard test isn't fair either. Servers receive an imputed value derived from a percentile of the participating servers' results.

**Overall Score**: The best overall performer scores 100. Every other server's score is expressed as a percentage of that, both per test and in the overall composite.
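The steps above can be sketched end to end. This is an illustrative reading, not the benchmark's implementation: the throughputs, bias coefficients, and 25th-percentile imputation below are made up, and we substitute a TechEmpower-style per-test normalization (each test scaled against its best result) as a simplified stand-in for the documented `LK01`-based one. The exact formulas are on the methodology page.

```python
def percentile(vals: list[float], p: float) -> float:
    """Linear-interpolated percentile of vals, with p in [0, 1]."""
    vals = sorted(vals)
    k = p * (len(vals) - 1)
    lo = int(k)
    hi = min(lo + 1, len(vals) - 1)
    return vals[lo] + (vals[hi] - vals[lo]) * (k - lo)

def composite_scores(results, bias, impute_p=0.25):
    tests = sorted(bias)
    # Imputation: fill unsupported tests (None) from a percentile of the
    # participating servers' results.
    filled = {s: dict(r) for s, r in results.items()}
    for t in tests:
        observed = [r[t] for r in results.values() if r[t] is not None]
        for s in filled:
            if filled[s][t] is None:
                filled[s][t] = percentile(observed, impute_p)
    # Normalization: scale each test against its best result so "easy"
    # high-RPS tests don't dominate the sum.
    best = {t: max(r[t] for r in filled.values()) for t in tests}
    # Bias: weight each normalized result by its importance coefficient.
    raw = {s: sum(bias[t] * r[t] / best[t] for t in tests)
           for s, r in filled.items()}
    # Overall: the best performer scores 100; every other server's score
    # is a percentage of it.
    top = max(raw.values())
    return {s: round(100 * v / top, 1) for s, v in raw.items()}

# Hypothetical peak throughputs (req/s); None = test not supported.
results = {
    "srv_a": {"LK01": 5000.0, "EX01": 400.0, "VC01": 3000.0},
    "srv_b": {"LK01": 2000.0, "EX01": 160.0, "VC01": 1200.0},
    "srv_c": {"LK01": 2500.0, "EX01": None, "VC01": 900.0},
}
bias = {"LK01": 1.0, "EX01": 1.0, "VC01": 0.5}  # made-up coefficients

print(composite_scores(results, bias))
```

Note how `srv_c` isn't zeroed out for missing `EX01`: it receives the 25th percentile of the two observed results instead, which is the "not a free pass, not a death sentence" compromise described above.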

### Hardware and Stack

Load testing is done with [k6](https://k6.io/), with [Prometheus](https://prometheus.io/) and [cAdvisor](https://github.com/google/cadvisor) for metrics collection. All servers run in Docker containers on the same host as the load generator, using loopback only — no network latency.

Round 0 was run on a laptop: Apple M3 8-core, 24 GB RAM, Docker Desktop configured with 8 CPUs and 20 GB. Future rounds will run on a dedicated cloud VM with significantly more resources, and the process will be fully automated.

## Round 0 Results

Round 0 is the initial round and the one currently published on the website. It's a pilot run (we've been working with implementers to ensure servers are correctly configured, and results may still be updated as we incorporate feedback). After Round 1, the process will be automated and run on a regular schedule with the latest versions of each server and an improved test suite as we gather feedback.

Five servers are included in this round: [FHIRsmith](https://github.com/HealthIntersections/FHIRsmith), [Hades](https://github.com/wardle/hades), [Ontoserver](https://www.ontoserver.csiro.au/site/), [Snowstorm](https://github.com/IHTSDO/snowstorm), and [Termbox](https://www.health-samurai.io/termbox). Not every server ran every test; this is reflected in the [capability matrix panel](https://healthsamurai.github.io/tx-benchmark/results/round-0/details/).


<div style="background:#181b2e;border-radius:10px;overflow:hidden;font-family:system-ui,-apple-system,sans-serif;margin:2rem 0;border:1px solid #252840;">
  <table style="width:100%;border-collapse:collapse;font-size:14px;white-space:nowrap;">
    <thead>
      <tr style="border-bottom:1px solid #252840;">
        <th style="text-align:left;padding:8px 16px;color:#6b7280;font-size:11px;text-transform:uppercase;letter-spacing:0.06em;width:40px;">#</th>
        <th style="text-align:left;padding:8px 16px;color:#6b7280;font-size:11px;text-transform:uppercase;letter-spacing:0.06em;width:130px;">Server</th>
        <th style="text-align:left;padding:8px 16px;color:#6b7280;font-size:11px;text-transform:uppercase;letter-spacing:0.06em;width:100%;">Score</th>
      </tr>
    </thead>
    <tbody>
      <tr style="border-bottom:1px solid #252840;">
        <td style="padding:9px 16px;color:#6b7280;font-size:12px;vertical-align:middle;">1</td>
        <td style="padding:9px 16px;font-weight:500;color:#e4e4e4;vertical-align:middle;">termbox</td>
        <td style="padding:9px 16px;min-width:220px;width:100%;vertical-align:middle;">
          <div style="display:flex;align-items:center;gap:10px;">
            <div style="flex:1;height:22px;border-radius:4px;background:#1e2236;overflow:hidden;">
              <div style="--s:100;width:calc(var(--s) * 1%);height:100%;border-radius:4px;background:linear-gradient(90deg,#f2495c,#fade2a,#73bf69);background-size:calc(100% * 100 / var(--s)) 100%;"></div>
            </div>
            <span style="font-weight:600;color:#e4e4e4;min-width:38px;text-align:right;">100%</span>
          </div>
        </td>
      </tr>
      <tr style="border-bottom:1px solid #252840;">
        <td style="padding:9px 16px;color:#6b7280;font-size:12px;vertical-align:middle;">2</td>
        <td style="padding:9px 16px;font-weight:500;color:#e4e4e4;vertical-align:middle;">fhirsmith</td>
        <td style="padding:9px 16px;min-width:220px;width:100%;vertical-align:middle;">
          <div style="display:flex;align-items:center;gap:10px;">
            <div style="flex:1;height:22px;border-radius:4px;background:#1e2236;overflow:hidden;">
              <div style="--s:38;width:calc(var(--s) * 1%);height:100%;border-radius:4px;background:linear-gradient(90deg,#f2495c,#fade2a,#73bf69);background-size:calc(100% * 100 / var(--s)) 100%;"></div>
            </div>
            <span style="font-weight:600;color:#e4e4e4;min-width:38px;text-align:right;">38%</span>
          </div>
        </td>
      </tr>
      <tr style="border-bottom:1px solid #252840;">
        <td style="padding:9px 16px;color:#6b7280;font-size:12px;vertical-align:middle;">3</td>
        <td style="padding:9px 16px;font-weight:500;color:#e4e4e4;vertical-align:middle;">ontoserver</td>
        <td style="padding:9px 16px;min-width:220px;width:100%;vertical-align:middle;">
          <div style="display:flex;align-items:center;gap:10px;">
            <div style="flex:1;height:22px;border-radius:4px;background:#1e2236;overflow:hidden;">
              <div style="--s:21;width:calc(var(--s) * 1%);height:100%;border-radius:4px;background:linear-gradient(90deg,#f2495c,#fade2a,#73bf69);background-size:calc(100% * 100 / var(--s)) 100%;"></div>
            </div>
            <span style="font-weight:600;color:#e4e4e4;min-width:38px;text-align:right;">21%</span>
          </div>
        </td>
      </tr>
      <tr style="border-bottom:1px solid #252840;">
        <td style="padding:9px 16px;color:#6b7280;font-size:12px;vertical-align:middle;">4</td>
        <td style="padding:9px 16px;font-weight:500;color:#e4e4e4;vertical-align:middle;">snowstorm</td>
        <td style="padding:9px 16px;min-width:220px;width:100%;vertical-align:middle;">
          <div style="display:flex;align-items:center;gap:10px;">
            <div style="flex:1;height:22px;border-radius:4px;background:#1e2236;overflow:hidden;">
              <div style="--s:7;width:calc(var(--s) * 1%);height:100%;border-radius:4px;background:linear-gradient(90deg,#f2495c,#fade2a,#73bf69);background-size:calc(100% * 100 / var(--s)) 100%;"></div>
            </div>
            <span style="font-weight:600;color:#e4e4e4;min-width:38px;text-align:right;">7%</span>
          </div>
        </td>
      </tr>
      <tr>
        <td style="padding:9px 16px;color:#6b7280;font-size:12px;vertical-align:middle;">5</td>
        <td style="padding:9px 16px;font-weight:500;color:#e4e4e4;vertical-align:middle;">hades</td>
        <td style="padding:9px 16px;min-width:220px;width:100%;vertical-align:middle;">
          <div style="display:flex;align-items:center;gap:10px;">
            <div style="flex:1;height:22px;border-radius:4px;background:#1e2236;overflow:hidden;">
              <div style="--s:6;width:calc(var(--s) * 1%);height:100%;border-radius:4px;background:linear-gradient(90deg,#f2495c,#fade2a,#73bf69);background-size:calc(100% * 100 / var(--s)) 100%;"></div>
            </div>
            <span style="font-weight:600;color:#e4e4e4;min-width:38px;text-align:right;">6%</span>
          </div>
        </td>
      </tr>
    </tbody>
  </table>
</div>

The full results, per-server drill-downs, raw RPS, and latency percentiles are available on the [website](https://healthsamurai.github.io/tx-benchmark/results/round-0/).

## What's next

Round 1 is coming soon — running on a dedicated cloud VM, automated, and with better configurations based on feedback from this round. After that, we plan to publish results on a quarterly schedule.

We see this as a community project. Health Samurai is the current steward, but the goal is for this to be governed and contributed to by the broader FHIR community — implementers, users, and anyone with a stake in terminology server performance.

If you want to get involved: suggest new tests, report a misconfiguration, propose a server to include, or just open an issue on GitHub. All feedback is welcome.

## Expected Questions

We expect you might have a lot of questions. Here are the ones we anticipate. Please reach out if you have a question we don't deal with here, or just want to tell us we're doing it wrong.

<dl>
  <dt><strong>"Server X is misconfigured — that explains the numbers."</strong></dt>
  <dd>Quite possibly. We've done our best to configure each server correctly, but we're not experts in all of them. If you know how to improve a configuration, please open a pull request or file an issue and we'll update the results.</dd>

  <dt><strong>"Why isn't server X included?"</strong></dt>
  <dd>We might not be aware of it. Open an issue or a PR with setup instructions and we'll work on including it in a future round.</dd>

  <dt><strong>"Why these tests? Why not test operation X?"</strong></dt>
  <dd>Tests were selected for their performance relevance. If you think an important operation or use case is missing, the test suite is meant to grow; we already have a few additions in mind for Round 1. Please help us by opening an issue.</dd>

  <dt><strong>"Health Samurai made this benchmark and Termbox performs well. Isn't that a conflict of interest?"</strong></dt>
  <dd>We addressed this at the top of the post. The short version: the methodology, data, and scoring are fully open and reproducible. It's also worth noting that Termbox was designed specifically with performance in mind — we'd expect it to do well. If you find a flaw in the methodology, please open an issue.</dd>

  <dt><strong>"The bias coefficients and imputation percentile could be tuned to favor certain servers."</strong></dt>
  <dd>They're a judgment call, and we've documented them explicitly for that reason. If you disagree with the weights, open an issue — we're open to adjusting them based on community feedback.</dd>

  <dt><strong>"You ran this on a laptop?"</strong></dt>
  <dd>Yes. Round 0 is a pilot run on an Apple M3. The numbers will change on better hardware, but we expect the relative rankings to be largely consistent. Future rounds will run on a dedicated cloud VM.</dd>

  <dt><strong>"What about caching? Most production deployments put a cache in front of the server."</strong></dt>
  <dd>True, and it changes the picture significantly. But we're benchmarking the server, not the deployment architecture. Real-world deployments are more complex, and we're aware of that — but that's a different thing to measure.</dd>

  <dt><strong>"What about clustering or horizontal scaling?"</strong></dt>
  <dd>This benchmark tests single-node performance only. Horizontal scaling is a valid dimension but depends heavily on deployment architecture, not just the server itself.</dd>
</dl>
