BigQuery AidboxTopicDestination
This functionality is available starting from version 2603.
Overview
The BigQuery Topic Destination module exports FHIR resources from Aidbox to Google BigQuery in a flattened format using ViewDefinitions and SQL-on-FHIR technology. Data is written using the Storage Write API (gRPC) for high throughput.
Delivery guarantee: The module uses an at-least-once delivery queue internally — messages are persisted in PostgreSQL and retried on failure. The BigQuery Storage Write API provides exactly-once semantics at the API level, so duplicate delivery attempts do not result in duplicate rows.
Key Features
- Real-time data export: Automatically exports FHIR resources to BigQuery as they are created, updated, or deleted
- Data flattening: Uses ViewDefinitions to transform complex FHIR resources into flat, analytical-friendly tables
- At-least-once delivery: Persistent message queue with guaranteed delivery and batch processing
- Initial export: Automatically exports existing data when setting up a new destination
- Monitoring: Built-in metrics and status reporting via
$statusendpoint
Before you begin
- Make sure your Aidbox version is 2603 or newer
- Set up a local Aidbox instance using the getting started guide
- Have a Google Cloud project with BigQuery enabled, or use the BigQuery Emulator for local testing
Installation
Docker Compose
-
Download the BigQuery module JAR file and place it next to your docker-compose.yaml:
curl -O https://storage.googleapis.com/aidbox-modules/topic-destination-bigquery/topic-destination-bigquery-2603.2.jar -
Edit your docker-compose.yaml and add these lines to the Aidbox service:
aidbox: volumes: - ./topic-destination-bigquery-2603.2.jar:/topic-destination-bigquery.jar # ... other volumes ... environment: BOX_MODULE_LOAD: io.healthsamurai.topic-destination.bigquery.core BOX_MODULE_JAR: "/topic-destination-bigquery.jar" # ... other environment variables ... -
Start Aidbox:
docker compose up -
Verify the module is loaded. In Aidbox UI, go to FHIR Packages and check that the BigQuery profile is present:
http://aidbox.app/StructureDefinition/aidboxtopicdestination-bigquery-at-least-once
The profile URL above is a FHIR canonical identifier, not an HTTP endpoint. You can find it in the Aidbox UI under FHIR Packages.
Kubernetes
For Kubernetes deployments, the module is downloaded automatically using an init container:
apiVersion: apps/v1
kind: Deployment
metadata:
name: aidbox
spec:
template:
spec:
initContainers:
- name: download-bigquery-module
image: debian:bookworm-slim
command:
- sh
- -c
- |
apt-get -y update && apt-get -y install curl
curl -L -o /modules/topic-destination-bigquery.jar \
https://storage.googleapis.com/aidbox-modules/topic-destination-bigquery/topic-destination-bigquery-2603.2.jar
chmod 644 /modules/topic-destination-bigquery.jar
volumeMounts:
- mountPath: /modules
name: modules-volume
containers:
- name: aidbox
image: healthsamurai/aidboxone:edge
env:
- name: BOX_MODULE_LOAD
value: "io.healthsamurai.topic-destination.bigquery.core"
- name: BOX_MODULE_JAR
value: "/modules/topic-destination-bigquery.jar"
- name: BOX_FHIR_SCHEMA_VALIDATION
value: "true"
# ... other environment variables ...
volumeMounts:
- name: modules-volume
mountPath: /modules
volumes:
- name: modules-volume
emptyDir: {}
This is a partial Deployment manifest showing only the module-related configuration. You still need your existing Aidbox environment variables, Service, and other Kubernetes resources. Use a pinned Aidbox version (e.g., healthsamurai/aidboxone:2603) for production instead of edge.
Updating the module
When a new version is released, update the JAR URL/filename in your deployment configuration and restart Aidbox. Available versions are listed in gs://aidbox-modules/topic-destination-bigquery/.
Configuration
Required Parameters
All requests in this tutorial use Content-Type: application/json.
| Parameter | Type | Description |
|---|---|---|
projectId | string | Google Cloud project ID |
dataset | string | BigQuery dataset name |
destinationTable | string | Target table name in BigQuery |
viewDefinition | string | The name field of the ViewDefinition resource (not id) |
batchSize | unsignedInt | Number of messages per batch |
sendIntervalMs | unsignedInt | Maximum time between sends (ms) |
Choosing batch parameters: For low-latency dashboards, use small batches (e.g., batchSize: 10, sendIntervalMs: 1000). For high-throughput bulk workloads, use larger batches (e.g., batchSize: 500, sendIntervalMs: 5000). Start with batchSize: 50 and sendIntervalMs: 5000 as a reasonable default.
Optional Parameters
| Parameter | Type | Description |
|---|---|---|
serviceAccountKey | string | Google Service Account JSON key (omit when using Workload Identity or ADC) |
skipInitialExport | boolean | Skip initial export of existing data (default: false) |
cloudSqlConnectionId | string | BigQuery Connection ID for Cloud SQL federated query (initial export optimization) |
location | string | GCP location for the BigQuery Connection (default: us) |
emulatorUrl | string | BigQuery emulator REST URL, e.g., http://bigquery:9050 (skips authentication) |
emulatorGrpcHost | string | BigQuery emulator gRPC host:port, e.g., bigquery:9060 (uses plaintext gRPC) |
Authentication
The module supports three authentication methods:
- Service Account JSON key — pass the full JSON key content as the
serviceAccountKeyparameter. Suitable for Docker Compose and environments without Workload Identity. - Application Default Credentials (ADC) — omit
serviceAccountKey. The module automatically uses the attached service account credentials. Recommended for Cloud Run and GKE with Workload Identity. - Emulator mode — set
emulatorUrlandemulatorGrpcHost. No authentication required.
Avoid hardcoding the Service Account JSON key directly in resource definitions. Use environment variables or a secrets manager to inject it at deployment time.
Required IAM Roles
The service account (whether explicit key or ADC) needs these roles:
| Role | Purpose |
|---|---|
roles/bigquery.user | Run queries, create jobs (healthcheck, federated queries) |
roles/bigquery.dataEditor | Insert data, create/update tables via Storage Write API |
Usage Example: Patient Data Export
Step 1: Create Subscription Topic
POST /fhir/AidboxSubscriptionTopic
{
"resourceType": "AidboxSubscriptionTopic",
"url": "http://example.org/subscriptions/patient-updates",
"status": "active",
"trigger": [
{
"resource": "Patient",
"supportedInteraction": ["create", "update", "delete"],
"fhirPathCriteria": "name.exists()"
}
]
}
Step 2: Create ViewDefinition
A ViewDefinition defines how to transform a complex FHIR resource into a flat table structure suitable for analytics. Each column maps a FHIRPath expression to a named column in the output table.
In this example, we flatten Patient into 5 columns: id, gender, birth_date from top-level fields, and family_name, given_name from the first official name (using forEach to navigate into the nested name array).
POST /fhir/ViewDefinition
{
"resourceType": "ViewDefinition",
"id": "patient_flat",
"name": "patient_flat",
"resource": "Patient",
"status": "active",
"select": [
{
"column": [
{"name": "id", "path": "id"},
{"name": "gender", "path": "gender"},
{"name": "birth_date", "path": "birthDate"}
]
},
{
"forEach": "name.where(use = 'official').first()",
"column": [
{"name": "family_name", "path": "family"},
{"name": "given_name", "path": "given.join(' ')"}
]
}
]
}
The column names you define here must match the columns in the BigQuery table (Step 4). See ViewDefinition documentation for the full syntax including where filters, unionAll, and type casting.
Step 3: Materialize ViewDefinition
The ViewDefinition must be materialized as a database view before the BigQuery module can use it to transform data. Materialization creates a SQL view in the sof schema that maps FHIR resources to the flat column structure you defined.
POST /fhir/ViewDefinition/patient_flat/$materialize
{
"resourceType": "Parameters",
"parameter": [
{
"name": "type",
"valueCode": "view"
}
]
}
The ViewDefinition must be materialized as a view (not a table). See the $materialize operation documentation for details.
Step 4: Create BigQuery Table
Create a table in BigQuery that matches the ViewDefinition output. You can do this via the BigQuery Console or using SQL:
CREATE TABLE your_project.your_dataset.patients (
id STRING NOT NULL,
gender STRING,
birth_date DATE,
family_name STRING,
given_name STRING,
is_deleted INT64 NOT NULL
);
The table must include an is_deleted column (INT64 NOT NULL). The module sets this to 0 for create/update operations and 1 for delete operations.
Step 5: Configure Authentication
Option A: Service Account key (Docker Compose)
- Go to Google Cloud IAM
- Create a new Service Account
- Grant it
roles/bigquery.userandroles/bigquery.dataEditoron your project - Create a JSON key and download it
Option B: Application Default Credentials (Cloud Run / GKE)
Attach a service account with the required BigQuery roles to your Cloud Run service or GKE workload. No key file needed — omit serviceAccountKey from the destination configuration.
Step 6: Configure BigQuery Destination
POST /fhir/AidboxTopicDestination
{
"resourceType": "AidboxTopicDestination",
"id": "patient-bigquery",
"topic": "http://example.org/subscriptions/patient-updates",
"kind": "bigquery-at-least-once",
"meta": {
"profile": [
"http://aidbox.app/StructureDefinition/aidboxtopicdestination-bigquery-at-least-once"
]
},
"parameter": [
{"name": "projectId", "valueString": "your-gcp-project-id"},
{"name": "dataset", "valueString": "your_dataset"},
{"name": "destinationTable", "valueString": "patients"},
{"name": "viewDefinition", "valueString": "patient_flat"},
{"name": "serviceAccountKey", "valueString": "<contents of your service account JSON key>"},
{"name": "batchSize", "valueUnsignedInt": 50},
{"name": "sendIntervalMs", "valueUnsignedInt": 5000}
]
}
For Cloud Run / GKE with Workload Identity (ADC), omit the serviceAccountKey parameter — the module will use the attached service account automatically.
Step 7: Verify
Create a test patient:
POST /fhir/Patient
{
"name": [{"use": "official", "family": "Smith", "given": ["John"]}],
"gender": "male",
"birthDate": "1990-01-15"
}
Then query your BigQuery table to confirm the data arrived:
SELECT * FROM your_project.your_dataset.patients;
Stopping the export
To stop exporting data, delete the AidboxTopicDestination resource:
DELETE /fhir/AidboxTopicDestination/patient-bigquery
This stops the export and cleans up the internal message queue. Data already written to BigQuery is not affected.
Initial Export
When a new destination is created, the module automatically exports all existing data that matches the subscription topic. This ensures your BigQuery table has complete historical data.
To skip the initial export (e.g., the table is already populated or you only need real-time data), add skipInitialExport:
{"name": "skipInitialExport", "valueString": "true"}
How it works
- Reads existing data from PostgreSQL via the materialized ViewDefinition using a streaming JDBC cursor
- Sends data to BigQuery in internal batches of 500 rows (hardcoded, separate from the
batchSizeparameter which controls real-time delivery) using the Storage Write API pending stream - After all rows are sent, finalizes and commits the stream — data becomes visible atomically
- Reports progress via the
$statusendpoint (initialExportProgress_rowsSent)
The export runs in a single thread. For small to medium datasets (up to ~100K rows) this completes in seconds to minutes. For larger datasets (1M+ rows), consider using the Cloud SQL optimization below.
The pending stream commit is atomic — if the export fails partway through (e.g., on row 999,999 of 1,000,000), no data is committed and the export is retried from scratch (up to 3 attempts). This guarantees no partial data in BigQuery, but means large exports may take multiple attempts on transient failures.
Cloud SQL Optimization
If your Aidbox PostgreSQL database runs on Google Cloud SQL, you can significantly speed up initial export using BigQuery's federated query feature. Instead of streaming data row-by-row through the module (PG → JDBC → JVM → gRPC → BigQuery), BigQuery reads directly from Cloud SQL over Google's internal network.
The module executes the following query via the BigQuery Jobs API:
INSERT INTO `{project}.{dataset}.{table}`
SELECT *, 0 as is_deleted
FROM EXTERNAL_QUERY(
'projects/{project}/locations/{location}/connections/{connectionId}',
'SELECT * FROM sof.{viewDefinitionName}'
)
BigQuery connects to Cloud SQL, runs the SELECT against the materialized ViewDefinition view in the sof schema, adds the is_deleted column, and inserts the results directly into the destination table.
Setup:
- In the BigQuery console, go to Add data > External data sources > Cloud SQL
- Create a BigQuery Connection to your Cloud SQL instance
- Note the connection ID and the region where you created it
- Add these parameters to your destination:
{"name": "cloudSqlConnectionId", "valueString": "your-connection-id"},
{"name": "location", "valueString": "us-central1"}
The location must match the region where the BigQuery Connection was created.
When to use this: The default Storage Write API path works well for datasets up to ~100K rows. For larger datasets (1M+ rows), federated query is significantly faster because it eliminates the single-threaded JVM bottleneck. The trade-off is the one-time setup of a BigQuery Connection.
Limitations: Federated queries are subject to BigQuery quotas: results are limited to approximately 20 GB of uncompressed data, and queries have a 6-hour timeout. This only applies to the initial export — real-time streaming is not affected. If your dataset exceeds these limits, omit cloudSqlConnectionId to use the default Storage Write API path instead.
Monitoring
Status Endpoint
GET /fhir/AidboxTopicDestination/patient-bigquery/$status
Returns a FHIR Parameters resource:
{
"resourceType": "Parameters",
"parameter": [
{"name": "status", "valueString": "active"},
{"name": "messagesDelivered", "valueDecimal": 100},
{"name": "messagesQueued", "valueDecimal": 0},
{"name": "messagesInProcess", "valueDecimal": 0},
{"name": "messagesDeliveryAttempts", "valueDecimal": 100},
{"name": "initialExportStatus", "valueString": "completed"},
{"name": "initialExportProgress_rowsSent", "valueDecimal": 100}
]
}
messagesDelivered— total messages sent to BigQuerymessagesQueued— messages waiting in the PG queuemessagesInProcess— messages currently being sentmessagesDeliveryAttempts— total delivery attempts (including retries)initialExportStatus—not_started,export-in-progress,completed,skipped, orfailedinitialExportProgress_rowsSent— number of rows sent during initial export
Data Transformation
The module automatically:
- Applies ViewDefinition: Transforms each FHIR resource using the specified ViewDefinition SQL
- Adds deletion flag: Sets
is_deleted = 0for create/update,is_deleted = 1for delete operations - Batches messages: Groups messages according to
batchSizeandsendIntervalMsparameters
Soft Deletes and Updates
BigQuery is an append-only store — you cannot delete or update individual rows in place. Every change to a FHIR resource (create, update, or delete) appends a new row to BigQuery:
- Create: new row with
is_deleted = 0 - Update: new row with
is_deleted = 0(old row remains unchanged) - Delete: new row with
is_deleted = 1
This means a resource that was created and then updated 3 times will have 4 rows in BigQuery, all with the same id. The is_deleted column uses INT64 with values 0 and 1.
To query only non-deleted resources (ignoring history):
SELECT * FROM your_dataset.patients WHERE is_deleted = 0;
To get the latest version of each resource (handling both updates and deletes), add a timestamp column to your table and ViewDefinition, then use a window function:
-- Requires a timestamp column (e.g., ts from meta.lastUpdated) in the table
SELECT * EXCEPT(rn) FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) as rn
FROM your_dataset.patients
)
WHERE rn = 1 AND is_deleted = 0;
To track versions, add meta.lastUpdated to your ViewDefinition as a ts column (type TIMESTAMP in BigQuery). Each update appends a new row with a newer ts, so you can always find the latest state.
Local Testing with BigQuery Emulator
You can test the BigQuery integration locally without a GCP account using the BigQuery Emulator.
Start the Emulator
Add the BigQuery emulator to your existing docker-compose.yaml:
services:
bigquery:
image: ghcr.io/goccy/bigquery-emulator:latest
platform: linux/amd64 # required for Apple Silicon
ports:
- "9050:9050"
- "9060:9060"
command: --project=test-project --dataset=test
docker compose up -d bigquery
Create a Test Table
curl -X POST http://localhost:9050/bigquery/v2/projects/test-project/datasets/test/tables \
-H 'Content-Type: application/json' \
-d '{
"tableReference": {"projectId": "test-project", "datasetId": "test", "tableId": "patients"},
"schema": {"fields": [
{"name": "id", "type": "STRING", "mode": "REQUIRED"},
{"name": "gender", "type": "STRING", "mode": "NULLABLE"},
{"name": "birth_date", "type": "DATE", "mode": "NULLABLE"},
{"name": "family_name", "type": "STRING", "mode": "NULLABLE"},
{"name": "given_name", "type": "STRING", "mode": "NULLABLE"},
{"name": "is_deleted", "type": "INTEGER", "mode": "REQUIRED"}
]}
}'
Configure Destination for Emulator
When using the emulator, omit serviceAccountKey and add emulator endpoints instead:
POST /fhir/AidboxTopicDestination
{
"resourceType": "AidboxTopicDestination",
"id": "patient-bigquery-local",
"topic": "http://example.org/subscriptions/patient-updates",
"kind": "bigquery-at-least-once",
"meta": {
"profile": [
"http://aidbox.app/StructureDefinition/aidboxtopicdestination-bigquery-at-least-once"
]
},
"parameter": [
{"name": "projectId", "valueString": "test-project"},
{"name": "dataset", "valueString": "test"},
{"name": "destinationTable", "valueString": "patients"},
{"name": "viewDefinition", "valueString": "patient_flat"},
{"name": "batchSize", "valueUnsignedInt": 10},
{"name": "sendIntervalMs", "valueUnsignedInt": 100},
{"name": "emulatorUrl", "valueString": "http://bigquery:9050"},
{"name": "emulatorGrpcHost", "valueString": "bigquery:9060"}
]
}
Use the Docker service name (bigquery) as the emulator host — both Aidbox and the emulator run in the same Docker network.
Query the Emulator
curl -s -X POST 'http://localhost:9050/bigquery/v2/projects/test-project/queries' \
-H 'Content-Type: application/json' \
-d '{"query": "SELECT * FROM test.patients", "useLegacySql": false}' | python3 -m json.tool
The emulator has a known limitation: DATE columns may return null when data is written via the Storage Write API (gRPC). STRING, INTEGER, and TIMESTAMP columns work correctly. This does not affect real BigQuery.
Delivery Guarantees and Retry
The module provides at-least-once delivery. Messages are persisted in a PostgreSQL queue before being sent to BigQuery. If delivery fails, the message remains in the queue and is retried on the next batch cycle (every sendIntervalMs). There is a 1-second backoff between failed delivery attempts to prevent log storms.
If the gRPC connection to BigQuery drops (network issue, server maintenance), the writer is automatically reconnected with exponential backoff. Messages are not lost during reconnection — they stay in the PG queue.
Initial export retries up to 3 times with exponential backoff (1s, 2s, 4s) on failure.
Multiple Destinations
You can create multiple destinations for the same topic, e.g., to export the same data to different BigQuery tables with different ViewDefinitions. Each destination operates independently with its own queue, writer, and status.
Schema Evolution
If you need to add a column to the BigQuery table:
- Add the column to your BigQuery table (
ALTER TABLE ... ADD COLUMN ...) - Update the ViewDefinition with the new column
- Re-materialize the ViewDefinition (
POST /fhir/ViewDefinition/{id}/$materialize) - Delete and recreate the destination to pick up the new schema
Existing rows will have NULL in the new column. New rows will include the new data.
Troubleshooting
Common Issues
- Authentication errors: Verify the Service Account JSON key is valid and has the required IAM roles, or check that ADC is properly configured
- Table not found: Ensure the BigQuery table exists and the project/dataset/table names are correct
- Schema mismatch: The BigQuery table columns must match the ViewDefinition output columns plus
is_deleted - Initial export timeout: For large datasets, the initial export may take time. Monitor progress via
$status
Debug Tips
- Check the
$statusendpoint for error details - Verify ViewDefinition works correctly:
GET /fhir/ViewDefinition/patient_flat - Test BigQuery access independently using the same Service Account
- Check Aidbox logs for detailed error messages