What if the cycle from a bug report to a complete test report with test cases took 10 minutes instead of several hours? We tried this approach on Aidbox and it worked.
In a previous post, we described agentic coding and how LLMs help build applications on top of FHIR. However, coding is only part of a product's lifecycle. We went further and started integrating LLM agents into the testing and support processes for Aidbox.
Our first step was BugBot. It reviews pull requests and detects potential issues in the code. It does a good job evaluating code, and we're happy with it. However, BugBot looks at an isolated PR or commit through the eyes of a programmer. It does not see the full picture. It may miss task context, specification compliance, and end-to-end behavior.
This raised a broader question: can LLMs help not just with writing code, but with supporting and evolving the product as a whole?
Why Testing a Complex Product is Painful
Aidbox is a large product with an extensive codebase and a complex FHIR domain that includes search APIs, validation and terminology processing, security, and performance considerations. Sometimes even reproducing a bug or a customer request takes an enormous amount of time. For example:
- Validating against a specific set of profiles and IGs
- Running FHIR search on specific data
- Testing integration with an external service
Not every bug or request requires such detailed reproduction, but far too many tasks take an unreasonable amount of time.
After implementing a feature or fixing a bug, testing requires checking not just code correctness, but also compliance with the FHIR specification, documentation accuracy, behavior in corner cases, and security (XSS, SQL injection, etc.). Unit tests help prevent regressions, but they are not a substitute for comprehensive validation. End-to-end testing remains necessary and time-consuming.
Our Approach: Agentic Testing
We built a GitHub Bot that handles task evaluation and testing. We call it Agentic Testing.
The bot has two key responsibilities:
- Evaluating incoming requests: reproducing and confirming bugs or customer requests, building a test plan, analyzing code, and recommending fixes.
- Verifying tasks: analyzing PRs and commits in context, generating corner case tests, executing them, and checking documentation accuracy.
If a task lands in Inbox with no linked commits, the bot tries to reproduce the problem on its own. Is it really a bug, or is it expected behavior? It cross-references the Aidbox documentation and the FHIR specification.
If the task already has a linked PR and is in QA status, the bot analyzes code changes, generates corner case tests, and runs them.
In both cases, testing is done as black-box API tests:
- The bot spins up an Aidbox instance based on the task or code changes.
- It sequentially executes requests and compares results against expected outcomes.
- If the task specifies different FHIR versions or multiple configurations and profile sets, the bot spins up several Aidbox instances and runs the tests across all of them.
- It generates a test report.
While reproducing a task, the bot also analyzes the code and suggests where and why the problem might occur. For the developer, this means that by the time they start fixing the issue, they already have reproduction steps and a rough pointer to the root cause.
The skill driving the bot is roughly 500 lines of markdown. It contains no application logic, only instructions: the same instructions you would write in an internal wiki for a new team member, covering how to set up the environment, locate the code, run tests, and format the report. Claude Code handles the rest: reasoning about edge cases, composing requests, and interpreting responses.
How the Bot Decides What Counts as a Bug
While working with the bot, we ran into a key question: what counts as a bug?
Example: a request came in saying that FHIR search in Aidbox doesn't work correctly. The bot reproduced and confirmed the issue. However, further analysis showed that Aidbox had actually worked correctly, in full compliance with the FHIR specification.
The bot uses the Aidbox codebase, documentation, and the FHIR specification as context. When making a decision, it first checks against the documentation and the specification. We established a rule: a bug is Aidbox behavior that deviates from the FHIR specification and is not reflected in the Aidbox documentation.
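The rule is simple enough to state as a predicate. A minimal sketch, with the caveat that both inputs are assumptions: in practice the bot derives these judgments by reading the FHIR specification and the Aidbox documentation, not from booleans handed to it.

```python
def is_bug(deviates_from_fhir_spec: bool, documented_in_aidbox_docs: bool) -> bool:
    """Triage rule: a bug is behavior that deviates from the FHIR spec
    AND is not reflected in the Aidbox documentation."""
    return deviates_from_fhir_spec and not documented_in_aidbox_docs
```

Under this rule, the search request from the example above is closed as "works as specified": the behavior matched the spec, so the first condition is false.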
We also added OWASP guidelines to the bot's context so it can identify potential vulnerabilities and security issues.
LLM Limitations and Hallucinations
LLMs shouldn't be taken at face value: hallucinations happen. The bot can misunderstand a task and start testing something completely different from what the author intended, especially if the task is poorly worded or contains errors.
To address this, we added the ability to re-test with clarification. You simply "call" the bot in a task comment using the `\qa` command and specify what exactly needs to be checked. Sometimes the incoming request lacks details: the FHIR version isn't specified, the Aidbox version is missing, or the behavior needs to be verified across multiple configurations. All of this can be clarified, and the bot will spin up the appropriate installations, create the necessary resources, load the required IG versions, run the tests, and produce a consolidated report.
Not Better, but More and Faster
The bot doesn't do anything extraordinary compared to a QA engineer. It doesn't do the job better; it does more of it, faster.
A recent example illustrates this. In the January release of Aidbox we found a bug with FHIR Profile minimum cardinality. A required field was determined by the condition `element.min == 1`, even though profiles can specify min equal to 2, 3, and so on. The fix was literally a one-character change: replacing `min == 1` with `min >= 1`.
A developer can glance at the diff and conclude that the fix is correct. But we still want to verify thoroughly. Unit tests cover the main cases, but we want to validate it end-to-end across various combinations of minimum and maximum values.
Doing this manually isn't hard, just slow: you have to create multiple profiles and prepare different test data sets. It takes a lot of time, and human mistakes are always possible.
We asked the bot. It understood the task, generated cases with every possible combination, and tested everything in 5 minutes.
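For illustration, one of the generated profiles might look roughly like this minimal StructureDefinition constraining `Observation.category`. This is a sketch, not the bot's actual output; the `url`, `id`, and `name` are made up.

```json
{
  "resourceType": "StructureDefinition",
  "id": "observation-category-min2",
  "url": "http://example.org/StructureDefinition/observation-category-min2",
  "name": "ObservationCategoryMin2",
  "status": "active",
  "kind": "resource",
  "abstract": false,
  "type": "Observation",
  "baseDefinition": "http://hl7.org/fhir/StructureDefinition/Observation",
  "derivation": "constraint",
  "differential": {
    "element": [
      { "path": "Observation.category", "min": 2, "max": "5" }
    ]
  }
}
```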
Here's part of the bot's report.
Test Cases
Group 1: Profile Creation and FHIRSchema Verification
| ID | Status | Description | Expected | Result |
|---|---|---|---|---|
| TC-01 | PASS | Create StructureDefinition with min:2 for Observation.category | 200 | 200 |
| TC-02 | SKIP | Verify FHIRSchema: category in required, min=2 | FHIRSchema contains required and min=2 | FHIRSchema not accessible via API (internal mechanism), validation verified in TC-03..TC-06 |
Group 2: Validation with min:2 Profile
| ID | Status | Description | Expected | Result |
|---|---|---|---|---|
| TC-03 | PASS | Observation without category (0 items) | 422, required error | 422, required-key: "The property 'category' is required" |
| TC-04 | PASS | Observation with 1 category (below min:2) | 422, cardinality error | 422, invalid-cardinality: "Current count is '1', expected between '2' and 'Infinity'" |
| TC-05 | PASS | Observation with 2 categories (exactly min:2) | 201 | 201 |
| TC-06 | PASS | Observation with 3 categories (above min:2) | 201 | 201 |
Group 3: Regression, min:1 Profile
| ID | Status | Description | Expected | Result |
|---|---|---|---|---|
| TC-07 | PASS | Create StructureDefinition with min:1 for Observation.category | 200 | 200 |
| TC-08 | PASS | Observation without category (meta.profile with min:1) | 422, required error | 422, required-key: "The property 'category' is required" |
| TC-09 | PASS | Observation with 1 category (meta.profile with min:1) | 201 | 201 |
Group 4: Profile with min:3, max:5
| ID | Status | Description | Expected | Result |
|---|---|---|---|---|
| TC-10 | PASS | Create StructureDefinition with min:3, max:"5" for Observation.category | 200 | 200 |
| TC-11 | PASS | Observation with 2 categories (below min:3) | 422, cardinality error | 422, invalid-cardinality: "Current count is '2', expected between '3' and '5'" |
| TC-12 | PASS | Observation with 3 categories (exactly min:3) | 201 | 201 |
| TC-13 | PASS | Observation with 6 categories (above max:5) | 422, cardinality error | 422, invalid-cardinality: "Current count is '6', expected between '3' and '5'" |
PASS: The fix works correctly.

- `element.min: 2` correctly makes the field required and checks minimum cardinality
- `element.min: 1` continues to work as before (no regression detected)
- `element.min: 3, max: 5` correctly checks both minimum and maximum cardinality
- Validation error messages are informative and contain correct min/max values
Being able to test a feature before release this way dramatically simplifies the entire development process. We don't blindly trust the bot yet and still verify manually. However, we do it far more efficiently, covering additional corner cases as well as performance and security checks.
Documentation Validation
The bot constantly cross-references the Aidbox documentation. When it encounters ambiguities or obvious documentation errors, it automatically creates a task noting what needs to be fixed and where. This greatly helps maintain documentation accuracy during changes, as well as during new feature development. For example, while implementing the External Secrets feature, the bot repeatedly found discrepancies between documentation and implementation and helped us write good documentation.
How We Cleared Our Backlog with the Bot
An unexpected bonus emerged: with the bot's help, we cleared our massive backlog. We sent around 30 tasks for evaluation and reproduction. Roughly half of them, tasks that had been sitting idle for a long time, turned out to have already been fixed alongside other changes. The remaining tasks confirmed their relevance. Some were prioritized for quick fixes, and we implemented them. The bot is useful not just for day-to-day testing, but also for backlog prioritization and grooming.
Over the last few weeks, the bot has tested over a hundred tasks across around 250 runs. A full testing cycle takes from 5 to 10 minutes, and up to 20 minutes for complex tasks. All of this runs on a virtual machine that costs $8 per month. We are impressed with the results. Tasks that used to take a QA engineer one or two hours are now completed in minutes.
The bot doesn't replace a QA engineer. It sometimes misunderstands a task, and it requires human verification. But it takes on the routine: reproduction, test case generation, running combinations, documentation checks. This frees up the team's time for what truly requires human attention.
Have you already integrated LLMs into your development or maintenance workflows? We'd love to hear about your experience. Join our Zulip chat to continue the discussion.
Marat Surmashev, VP of Engineering