How to load 1 billion FHIR resources into Aidbox in 5 hours

Big data usually means big problems and a lot of work. One of those problems is simply loading the data: there is typically a large data source, and the task is framed as migrating a legacy system or importing historical data into a FHIR server.

The typical solution for such problems is to build an ETL (extract, transform, load) pipeline:

Key points for this architecture:

  • Transform and upload data to storage. This allows us to split the whole process into two steps: data preparation and data upload.
  • Compress the stored data: FHIR data contains lots of similar structures, e.g. {:resourceType "Patient" …}, so zipping such data is quite efficient, and the network load can be reduced tenfold as a result.

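FHIR data's repetitive structure is what makes zipping it so effective. A quick self-contained check (a sketch: the NDJSON packaging and the sample Patient shape are assumptions, not the actual Synthea output):

```python
import gzip
import json

# 10,000 near-identical Patient resources as NDJSON, mimicking one batch.
patients = [
    json.dumps({"resourceType": "Patient", "id": str(i),
                "name": [{"family": "Smith", "given": ["Pat"]}]})
    for i in range(10_000)
]
raw = "\n".join(patients).encode("utf-8")
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)  # repetitive resources often compress far better than 10x
```
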
The transformation and storage of incoming data depends heavily on the system the import is performed for, so that part is out of scope for this article. We still needed a data source for our test, so we used the Synthea tool to generate 40 batches, each containing around 25,000 Patient resources along with their related resources. The generation script is available via the following link: gen_synthea.sh.

Best practices

The task of loading a large amount of data is not exactly new and there are a number of well-established practices for this task that proved to be efficient.

From the perspective of HTTP data transmission over the network, loading data pieces one by one is clearly not the best approach, as it incurs significant overhead for connection initialization, waiting for responses, and so on. Transmitting data in batches dramatically reduces this overhead and speeds up the transfer. Streaming improves things further, eliminating the per-request overhead almost entirely.

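Batching can be illustrated by grouping individual resources into FHIR transaction Bundles, so each HTTP request carries many resources instead of one (a sketch; the batch size of 1,000 is arbitrary):

```python
def make_batches(resources, batch_size=1000):
    """Group individual resources into FHIR transaction Bundles so each
    HTTP request carries many resources instead of one."""
    for i in range(0, len(resources), batch_size):
        chunk = resources[i:i + batch_size]
        yield {
            "resourceType": "Bundle",
            "type": "transaction",
            "entry": [
                {"resource": r,
                 "request": {"method": "PUT",
                             "url": f"{r['resourceType']}/{r['id']}"}}
                for r in chunk
            ],
        }

# 2,500 patients become three bundles: 1,000 + 1,000 + 500 entries
patients = [{"resourceType": "Patient", "id": str(i)} for i in range(2500)]
bundles = list(make_batches(patients, 1000))
```
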
Use the PostgreSQL COPY command to load all the rows in one command, instead of using a series of INSERT commands. The COPY command is optimized for loading large numbers of rows. It is less flexible than INSERT, but incurs significantly less overhead for large data loads.

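A sketch of driving COPY from Python (psycopg2's copy_expert is one common way to do this; the patient table layout and the escaping helper are assumptions for illustration):

```python
import io
import json

def to_copy_buffer(resources):
    """Serialize resources into the text format COPY ... FROM STDIN expects:
    one tab-separated row per resource, JSON document in the jsonb column.
    Backslashes, tabs, and newlines must be escaped for the COPY text format."""
    buf = io.StringIO()
    for r in resources:
        doc = (json.dumps(r)
               .replace("\\", "\\\\")
               .replace("\t", "\\t")
               .replace("\n", "\\n"))
        buf.write(f"{r['id']}\t{doc}\n")
    buf.seek(0)
    return buf

# With psycopg2, the whole buffer streams through a single COPY command:
#   cur.copy_expert("COPY patient (id, resource) FROM STDIN", to_copy_buffer(rows))
buf = to_copy_buffer([{"id": "p1", "resourceType": "Patient"}])
```
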
Database indexes slow down insertion, since each inserted row requires updating every index on the table. Building indexes over existing data is much faster than updating them incrementally on each insert. For this reason, it is good practice to drop (or disable) all existing table indexes before loading data, perform the insert via the COPY command, and then recreate (or re-enable) the indexes.

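One way to script the drop-then-recreate pattern (a sketch; the helper is illustrative — in PostgreSQL the definitions could be read from pg_indexes.indexdef before the load and replayed afterwards):

```python
def index_maintenance_plan(index_defs):
    """index_defs maps index name -> CREATE INDEX statement (e.g. as read
    from pg_indexes.indexdef). Returns the statements to run before and
    after the bulk COPY: drop everything first, recreate from saved DDL."""
    drops = [f"DROP INDEX IF EXISTS {name};" for name in index_defs]
    recreates = [f"{ddl};" for ddl in index_defs.values()]
    return drops, recreates

drops, recreates = index_maintenance_plan({
    "patient_resource_idx":
        "CREATE INDEX patient_resource_idx ON patient USING gin (resource)",
})
```
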
Aidbox stores resources as JSONB, and very large FHIR resources can run into difficulties with TOAST. In essence, the problem arises when a row does not fit into a PostgreSQL page and the engine moves part of the data into the table's TOAST relation. Every subsequent read or write then internally touches two tables: the TOAST relation and the main one.

There are a number of methods to reduce toasting:

  • explicit storage configuration:
ALTER TABLE <target_table> ALTER COLUMN resource SET STORAGE MAIN;
  • compression limit increase:
ALTER TABLE <target_table> SET (toast_tuple_target = 6000);

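A rough way to gauge which resources are affected (a sketch: ~2 kB is PostgreSQL's default toast_tuple_target, and the JSON text length only approximates the on-disk JSONB size):

```python
import json

TOAST_TUPLE_TARGET = 2048  # roughly 2 kB, PostgreSQL's default threshold

def likely_toasted(resource, target=TOAST_TUPLE_TARGET):
    """Rows whose serialized size exceeds toast_tuple_target become
    candidates for compression and out-of-line TOAST storage."""
    return len(json.dumps(resource).encode("utf-8")) > target

small = {"resourceType": "Patient", "id": "p1"}
big = {"resourceType": "Provenance", "id": "pr1",
       "target": [{"reference": f"Observation/{i}"} for i in range(2000)]}
```
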
Load from bucket operation

Aidbox has a load-from-bucket operation that permits asynchronous upload of FHIR resources from an S3 bucket. This operation has been specially designed for a parallel batch load of tons of FHIR data. 

How it works:

  • Take the list of files from the bucket
  • Disable and temporarily save all indexes for the target resource tables
  • Run n parallel loading threads
  • In each thread, open a read stream from one of the S3 files and, on the other side, open a PostgreSQL COPY stream
  • Convert resources in parallel within the stream
  • After loading all resources, restore all indexes

Note: indexes are disabled only when the disable-idx? flag is specified. Running load-from-bucket with disable-idx? is not recommended on a live production system, because it disables all indexes on the target tables.

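The parallel per-thread loop can be sketched with a thread pool (purely illustrative: load_prefix stands in for the real S3-to-COPY streaming work that the operation performs):

```python
from concurrent.futures import ThreadPoolExecutor

def load_prefix(prefix):
    # Placeholder for the real work: stream files under this S3 prefix
    # and feed the converted resources into a PostgreSQL COPY stream.
    return f"loaded {prefix}"

prefixes = [f"fhir/{i}/" for i in range(1, 5)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(load_prefix, prefixes))
```
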
Environment

Infrastructure for this test:

  • Aidbox installed on an AWS m5.4xlarge instance with 16 vCPU
  • Database: AWS RDS Aurora on a db.r5.4xlarge instance with 16 vCPU

Experiment

First of all, let's try to load around 10% of the data and check how Aidbox responds:

  • Total load size (zipped): 17 GB
  • Total resource count: 93 million
  • Resulting database size: 171 GB

Aidbox load-from-bucket RPC operation call example:

POST /rpc
content-type: text/yaml
accept: text/yaml

method: aidbox.bulk/load-from-bucket
params:
  bucket: s3://<fhir-data-bucket>
  prefixes: ["fhir/1/", "fhir/2/", "fhir/3/", "fhir/4/"]
  thread-num: 16
  disable-idx?: true
  account:
    access-key-id: <access-key-id>
    secret-access-key: <secret-access-key>
    region: <region>

Let's run imports with different numbers of threads (1, 2, 4, …, 16) and see how the execution time and average throughput change.

As seen from the graph, overall loading performance scales almost linearly as the thread count grows from 1 to 4. The gains from 4 to 12 threads are much smaller, and going from 12 to 16 threads brings no further improvement. Such limits can come from network bandwidth, database throughput, and the like; in our experiment neither the database nor the network was saturated, and Java performance was the main bottleneck.

Import throughput per second by resource type running in 16 threads for 10% of data

From our observations we conclude that data loading performance is strongly affected by how the data volume is distributed across resource types. From the chart above, small resources like Procedure or Observation load at up to 40,000 resources per second, while huge resources like ExplanationOfBenefit load at only up to 5,300 resources per second despite accounting for 20% of the entire dataset's volume. Such irregularity ultimately drives the overall throughput and total import time, so results may vary greatly between projects.

Load 1 billion FHIR resources

Based on the experiments above, we chose 12 threads for the final data import. The import completed in 4 hours and 48 minutes. About 1 billion resources were created, with a total database size of 1649 GB.

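The reported average throughput is consistent with these totals (a quick sanity check):

```python
resources = 1_000_000_000
seconds = 4 * 3600 + 48 * 60   # 4 h 48 min
rate = resources / seconds     # about 57,870 resources per second
```
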
Below is a summary of the load results:

  • 4 h 48 min execution time
  • 183 GB total zipped data size
  • 1649 GB resulting database size
  • 1 billion total resource count
  • 57,000 resources per second on average
  • 64 patients' worth of data per second

Import throughput per second by resource type running in 12 threads for 100% data

This graph illustrates the general import throughput of the data upload. It can be seen that the speed of the process is not constant and can vary from 8,000 to 80,000 resources per second at its peak. The average throughput was 57,000 resources per second, 10% higher than in the previous tests.

The most significant speed drops were detected for the Provenance resource. The average record size for this resource was 37 KB, which is very large. Records of this size are processed slowly and run into the TOAST complications in Postgres described earlier. The final throughput stats are highly dependent on the structure and volume of the data: small resources load almost instantly, while big ones load very slowly.

Resource count and total size distribution on top 15 biggest tables

The graph of the distribution looks particularly interesting. We can observe that there are about 300 Observations per Patient resource on average, and the resources that take up the most disk space are Claim, DiagnosticReport, DocumentReference, and ExplanationOfBenefit. The Provenance resource deserves special attention: with only 1 million entries, it takes up 40 GB of disk space, at an average resource size of 37 KB.

Aidbox and Aurora CPU utilization

The Aidbox and Aurora CPU load graph shows that for almost the entire import, all 12 Aidbox loading threads kept their cores fully utilized, while average CPU consumption on the Aurora database was 54%. The graph also shows a sharp rise in database CPU consumption with a matching drop for Aidbox. This happened during the transfer of the large Provenance resources: the Aidbox threads were waiting on the database while it was busy compressing the data for TOAST storage.

Conclusion

The most obvious solution is not always the optimal one. Sometimes, in order to find an effective solution, one should pose the problem differently or make some technical compromises. In this post we offer our vision of the best practices to upload huge datasets into Aidbox.


Authors: Marat Surmashev, Georgii Ivannikov

