What if the cycle from a bug report to a complete test report with test cases took 10 minutes instead of several hours? We tried this approach on Aidbox and it worked.
In a previous post, we described agentic coding and how LLMs help build applications on top of FHIR. However, coding is only part of a product's lifecycle. We went further and started integrating LLM agents into the testing and support processes for Aidbox.
Our first step was BugBot. It reviews pull requests and detects potential issues in the code. It does a good job evaluating code, and we're happy with it. However, BugBot looks at an isolated PR or commit through the eyes of a programmer. It does not see the full picture. It may miss task context, specification compliance, and end-to-end behavior.
This raised a broader question: can LLMs help not just with writing code, but with supporting and evolving the product as a whole?
Aidbox is a large product with an extensive codebase and a complex FHIR domain that includes search APIs, validation and terminology processing, security, and performance considerations. Sometimes even reproducing a bug or a customer request takes an enormous amount of time. For example:
Not every bug or request requires such detailed reproduction, but far too many tasks take an unreasonable amount of time.
After implementing a feature or fixing a bug, testing requires checking not just code correctness, but also compliance with the FHIR specification, documentation accuracy, behavior in corner cases, and security (XSS, SQL injection, etc.). Unit tests help prevent regressions, but they are not a substitute for comprehensive validation. End-to-end testing remains necessary and time-consuming.
We built a GitHub Bot that handles task evaluation and testing. We call it `Agentic Testing`.
The bot has two key responsibilities:

- If a task lands in `Inbox` with no linked commits, the bot tries to reproduce the problem on its own: is it really a bug, or is it expected behavior? It cross-references the Aidbox documentation and the FHIR specification.
- If the task already has a linked PR and is in `QA` status, the bot analyzes the code changes, generates corner-case tests, and runs them.
In both cases, testing is done as black-box API tests: the bot exercises Aidbox only through its HTTP API, the way a user would, never against internal code paths.
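A black-box check of this kind can be sketched in a few lines. The base URL, endpoint shape, and helper names below are assumptions for illustration, not the bot's actual code:

```python
import json
from urllib import request

AIDBOX_URL = "http://localhost:8888"  # hypothetical local Aidbox instance


def fhir_search(resource_type: str, params: dict) -> dict:
    """Run a black-box FHIR search over the HTTP API, as a user would."""
    query = "&".join(f"{k}={v}" for k, v in params.items())
    req = request.Request(
        f"{AIDBOX_URL}/fhir/{resource_type}?{query}",
        headers={"Accept": "application/fhir+json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


def is_searchset_bundle(body: dict) -> bool:
    """Check that a response looks like a well-formed FHIR searchset Bundle."""
    return body.get("resourceType") == "Bundle" and body.get("type") == "searchset"
```

Because the bot only talks to the API surface, its verdicts reflect what an end user would actually observe rather than internal state.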
While reproducing a task, the bot also analyzes the code and suggests where and why the problem might occur. For the developer, this means that by the time they start fixing the issue, they already have reproduction steps and a rough pointer to the root cause.
The bot's skill, the instruction file that Claude Code follows, is roughly 500 lines of markdown. It contains no application logic, only instructions: the same instructions you would write in an internal wiki for a new team member, covering how to set up the environment, locate the code, run tests, and format the report. Claude Code handles the rest: reasoning about edge cases, composing requests, and interpreting responses.
While working with the bot, we ran into a key question: what counts as a bug?
Example: a request came in saying that FHIR search in Aidbox doesn't work correctly. The bot reproduced and confirmed the issue. Further analysis, however, showed that Aidbox was behaving correctly, in full compliance with the FHIR specification.
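The post doesn't name the specific search behavior, but a classic case of "looks like a bug, is actually the spec" is FHIR's default string-search semantics: a case-insensitive "starts with" match rather than an exact match. A rough sketch of that rule (the spec is also accent-insensitive, which this sketch omits):

```python
def fhir_string_match(stored: str, query: str) -> bool:
    # Default FHIR string search: the stored value matches if it starts
    # with the query, ignoring case. (The spec additionally ignores
    # accents; this sketch does not.)
    return stored.lower().startswith(query.lower())


# A search like GET /fhir/Patient?name=eve matches both "Eve" and
# "Evelyn" -- spec-compliant behavior that users often report as a bug.
```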
The bot uses the Aidbox codebase, documentation, and the FHIR specification as context. When making a decision, it first checks against the documentation and the specification. We established a rule: a bug is Aidbox behavior that deviates from the FHIR specification and is not reflected in the Aidbox documentation.
We also added OWASP guidelines to the bot's context so it can identify potential vulnerabilities and security issues.
LLMs shouldn't be taken at face value. Hallucinations happen. The bot can misunderstand a task and start testing something completely different from what the author intended, especially if the task is poorly worded or contains errors.
To address this, we added the ability to re-test with clarification. You simply "call" the bot in a task comment using the `\qa` command and specify what exactly needs to be checked. Sometimes the incoming request lacks details: the FHIR version isn't specified, the Aidbox version is missing, or the behavior needs to be verified across multiple configurations. All of this can be clarified, and the bot will spin up the appropriate installations, create the necessary resources, load the required IG versions, run the tests, and produce a consolidated report.
The bot doesn't do anything extraordinary compared to a QA engineer. It doesn't do the job better; it does more of it, faster.
A recent example illustrates this. In the January release of Aidbox we found a bug with FHIR profile minimum cardinality. A required field was determined by the condition `element.min == 1`, even though profiles can specify `min` of 2, 3, and so on. The fix was literally a one-character change: replacing `min == 1` with `min >= 1`.
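In pseudocode terms (the actual Aidbox code isn't shown in the post), the check changed roughly like this:

```python
def is_required(element_min: int) -> bool:
    # Buggy check: only min == 1 counted as required, so a profile
    # declaring a minimum cardinality of 2 or 3 was silently treated
    # as optional:
    #     return element_min == 1
    # Fixed check: any minimum cardinality of at least 1 is required.
    return element_min >= 1
```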
A developer can glance at the diff and conclude that the fix is correct. But we still want to verify thoroughly. Unit tests cover the main cases, but we want to validate it end-to-end across various combinations of minimum and maximum values.
Doing this manually isn't hard, just slow: you have to create multiple profiles and prepare a different test data set for each. It takes a lot of time, and human mistakes are always possible.
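Enumerating those combinations is exactly the kind of mechanical work a short script or an LLM does well. A sketch, with illustrative value ranges:

```python
from itertools import product


def cardinality_cases(mins=(0, 1, 2, 3), maxes=("1", "2", "3", "*")):
    """Enumerate valid (min, max) cardinality pairs for test profiles.

    FHIR encodes max cardinality as a string: a number or "*" for
    unbounded. Pairs where min exceeds a numeric max would describe an
    invalid profile, so they are skipped.
    """
    return [(lo, hi) for lo, hi in product(mins, maxes)
            if hi == "*" or lo <= int(hi)]
```

Each pair then becomes one profile plus data sets just below, at, and above the bounds.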
We asked the bot. It understood the task, generated cases with every possible combination, and tested everything in 5 minutes.
Here's a part of the bot's report.
Being able to test a feature before release this way dramatically simplifies the entire development process. We don't blindly trust the bot yet and still verify manually. However, we do it far more efficiently, covering additional corner cases as well as performance and security checks.
The bot constantly cross-references the Aidbox documentation. When it encounters ambiguities or obvious documentation errors, it automatically creates a task noting what needs to be fixed and where. This greatly helps maintain documentation accuracy during changes, as well as during new feature development. For example, while implementing the External Secrets feature, the bot repeatedly found discrepancies between documentation and implementation and helped us write good documentation.
An unexpected bonus emerged: with the bot's help, we cleared our massive backlog. We sent around 30 tasks for evaluation and reproduction. Roughly half of them, tasks that had been sitting idle for a long time, turned out to have already been fixed alongside other changes. The remaining tasks confirmed their relevance. Some were prioritized for quick fixes, and we implemented them. The bot is useful not just for day-to-day testing, but also for backlog prioritization and grooming.
Over the last few weeks, the bot has tested over a hundred tasks across around 250 runs. A full testing cycle takes 5 to 10 minutes, and up to 20 for complex tasks. All of this runs on a virtual machine that costs $8 per month. We are impressed with the results. Tasks that used to take a QA engineer one or two hours are now completed in minutes.
The bot doesn't replace a QA engineer. It sometimes misunderstands a task, and it requires human verification. But it takes on the routine: reproduction, test case generation, running combinations, documentation checks. This frees up the team's time for what truly requires human attention.
Have you already integrated LLMs into your development or maintenance workflows? We'd love to hear about your experience. Join our Zulip chat to continue the discussion.
Marat Surmashev, VP of Engineering
Get in touch with us today!
