Aidbox Docs

Data Lakehouse AidboxTopicDestination

This functionality is available starting from Aidbox version 2605.

This page sets up an AidboxTopicDestination that streams FHIR resource changes into a Databricks Unity Catalog managed Delta table. Rows are flattened by a ViewDefinition so analytics consumers see columns, not nested FHIR JSON.

Background

"Data Lakehouse" is the generic name for the destination category: a hybrid of object-storage data lake and warehouse, built on top of the Delta Lake table format. The module writes a Delta-formatted Unity Catalog managed table.

If you're already comfortable with Databricks, Unity Catalog, and Delta Lake, skip to Overview.

Databricks

Databricks is a managed analytics platform. For this tutorial, think of it as three components:

  1. Unity Catalog is the metadata and governance layer. Unity Catalog knows about every catalog, schema, table, column, and grant in your workspace. It also issues short-lived cloud-storage credentials on demand ("vending") so external clients can write data without holding long-lived bucket keys.
  2. SQL warehouse is a compute cluster that runs SQL queries against tables in your Unity Catalog. You usually query it from the Databricks UI's SQL Editor; the module can drive it via an API.
  3. Zerobus Ingest is a push-based ingestion service that writes data into Unity Catalog Delta tables. Databricks exposes Zerobus via two transports: gRPC and REST. The Aidbox module uses the REST endpoint. The sender POSTs batches as JSON arrays and Zerobus commits them to the managed Delta table on the Databricks side.

Data lakehouse, and Delta Lake as its implementation

A data lakehouse is a hybrid of two older patterns:

  • A data lake stores raw files (Parquet, JSON, CSV) on cheap object storage (S3, GCS, ADLS). Scalable and cheap, but lacks schema enforcement, ACID transactions, and time travel.
  • A data warehouse (Snowflake, Redshift, BigQuery) gives you ACID, schema, and indexes, at the cost of a proprietary storage format you don't own.

A lakehouse takes the lake side and bolts on the warehouse's guarantees: ACID, schema, and time travel on plain Parquet files in your own bucket. An open table format does that bolting, and this module uses Delta Lake. A Delta table is a directory on object storage:

s3://bucket/prefix/my_table/
├── _delta_log/
│   ├── 00000000000000000000.json       ← transaction log: each commit is one JSON file
│   ├── 00000000000000000001.json
│   └── ...
├── part-00000-xxx.snappy.parquet       ← row data
├── part-00001-xxx.snappy.parquet
└── ...

The _delta_log/ directory is the source of truth: readers replay it; writers append a new commit. Concurrent writers race on the next log filename, and that race gives Delta its ACID guarantees.

External vs managed tables

Unity Catalog tables come in two flavours:

ManagedExternal
StatusDatabricks' default and recommended table typeUse when you need files in your own bucket
Storage locationDatabricks-managed cloud storage (path picked by Unity Catalog)Your bucket — declared with LOCATION 's3://...' / 'gs://...' / 'abfss://...' at CREATE TABLE
Who owns the filesUnity Catalog — manages read, write, storage, and optimizationYou — Unity Catalog manages metadata only
DROP TABLEDeletes the dataDrops metadata only — files stay in your bucket
Supported write paths from AidboxZerobus REST ingest (Aidbox managed-zerobus), or SQL warehouse INSERT (Aidbox managed-sql)Not supported by this module — use a managed table
External STS credential vendingNot available for managed targets (EXTERNAL USE SCHEMA is only grantable on external schemas)Allowed if the principal has EXTERNAL USE SCHEMA on the schema
Predictive OptimizationEnabled by default for accounts created on or after 2024-11-11; runs OPTIMIZE / VACUUM / ANALYZE automatically. Billed under the Jobs Serverless SKU.Not supported — Predictive Optimization runs only on managed tables
Liquid ClusteringOpt-in per table (automatic liquid clustering requires Predictive Optimization and is also opt-in)Opt-in per table

Overview

The module exports FHIR resources from Aidbox to a Delta Lake table in a flattened format using ViewDefinitions (SQL-on-FHIR).

Live (streaming) flow. Aidbox enqueues every resource change in its topics PG queue. The sender batches and pushes them to Databricks via the chosen writeMode:

Live streaming flow: FHIR changes from clients → Aidbox topics PG queue → Databricks via REST ingest or SQL INSERT.

The flow:

  1. A FHIR API client (a user, an integration, a backfill script) sends a POST / PUT / DELETE to Aidbox.
  2. Aidbox persists the resource and enqueues a topic event for the destination in PostgreSQL.
  3. The Data Lakehouse module polls the destination's batch from the same PostgreSQL queue.
  4. The module sends the batch to Databricks via one of the two paths picked by writeMode. See Write modes below.

The module may also run an initial export of pre-existing resources at first start. See Initial export for when this runs and how to skip it.

Write modes

The module supports two write modes, picked per-destination via the writeMode parameter. Both target the same Unity Catalog managed Delta table and share the initial-bulk staging + MERGE INTO flow, the auto-ALTER schema-drift handling, and the Databricks-side Predictive Optimization. They differ in how live batches reach Databricks.

Default to managed-zerobus. The sender JSON-encodes each batch and POSTs it to the Zerobus REST ingest endpoint with an OAuth M2M bearer. No SQL parsing. No warm warehouse. The warehouse is hit once at sender bootstrap for INFORMATION_SCHEMA.COLUMNS and optional ALTER TABLE; live writes don't touch it.

Use managed-sql when Zerobus isn't available on your Databricks SKU. Each batch becomes a single INSERT INTO managed (cols) VALUES (...) against a SQL warehouse. Same target, same idempotent staging-MERGE init, but a warm warehouse on the hot path. The cost difference is the warehouse uptime billing.

Authentication

Both modes authenticate to Databricks via OAuth Machine-to-Machine (M2M) with a service principal. The module exchanges client_id and client_secret at the workspace token endpoint for a ~1h bearer token, caches it, and re-issues a fresh one when fewer than 5 minutes remain.

The module sends the bearer on every Databricks call. The modes differ in which Databricks surfaces see it:

ModeUnity Catalog RESTSQL warehouseOther transportWho talks to storage
managed-zerobus (default)only during initial-export (staging vending)bootstrap + initial-export onlyZerobus REST (every batch)Zerobus ingest service, Databricks-side
managed-sqlonly during initial-export (staging vending)every batch (INSERT / ALTER / DESCRIBE)SQL warehouse compute

The service principal and the grants it needs are set up in Installation and exercised in the Usage example below.

Output semantics

How writes show up in your Delta table, and how to query the result.

Append-only

The module writes every change to a FHIR resource as a new row. There are no in-place UPDATEs or DELETEs:

  • Create → new row with is_deleted = 0
  • Update → new row with is_deleted = 0 (old row remains)
  • Delete → new row with is_deleted = 1

Example: a single patient created, updated twice, then deleted produces four rows with the same id:

idts (meta.lastUpdated)genderfamily_nameis_deleted
p-12026-04-01T10:00:00ZmaleSmith0
p-12026-04-02T08:00:00ZmaleSmith-Jones0
p-12026-04-03T14:00:00ZotherSmith-Jones0
p-12026-04-04T09:00:00ZotherSmith-Jones1

Use the read-time projection below to collapse history to "latest row per id, excluding deleted".

At-least-once delivery

Aidbox persists messages in a PostgreSQL queue and retries them on failure. Both write modes have the same crash-between-commit-and-ack semantics:

  • Initial export is idempotent for both: the module stages rows in an external Delta, then MERGE INTO target USING staging ON t.id = s.id WHEN NOT MATCHED THEN INSERT *. A replay finds the existing rows in the target and inserts zero new ones.
  • Live writes are at-least-once for both: the Zerobus REST endpoint has no offset or transaction id; the SQL INSERT path has the same constraint. Use the read-time dedup pattern below to collapse duplicates.

Querying the table

Because the module writes every change as a new row (and managed-* modes can deliver duplicates on crash-replay), querying the table directly returns full history plus possible dupes. Most analytics workloads (cohort builds, longitudinal queries, time-windowed aggregates) want this: full event history is the point.

If your query needs "latest state per resource", one common SQL pattern is window-function dedup. Add a timestamp column to your ViewDefinition (e.g. meta.lastUpdated as ts) and:

SELECT * EXCEPT(rn) FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
  FROM aidbox_export.fhir.patients
)
WHERE rn = 1 AND is_deleted = 0;

This is one example, not the only approach. Wrap it in a Databricks SQL view if you reuse it, or skip it if your queries aggregate over history.

Data transformation

The module:

  1. Applies the ViewDefinition to transform each FHIR resource using the specified ViewDefinition SQL.
  2. Adds a deletion flag: is_deleted = 0 for create/update, is_deleted = 1 for delete operations.
  3. Batches messages by batchSize and sendIntervalMs parameters.
  4. Coerces types per write path:
    • managed-sql: converts Java SQL dates and timestamps to ISO-8601 strings; the SQL warehouse parses them into DATE / TIMESTAMP columns.
    • managed-zerobus: encodes dates as int32 epoch-days and timestamps as int64 epoch-microseconds, as the Zerobus REST wire format requires. The endpoint rejects ISO strings with a 400.

See Output semantics for append-only behaviour, at-least-once delivery, and the recommended read-time dedup query.

Schema evolution

The module maps FHIR / ViewDefinition types to Databricks SQL types as follows:

FHIR / ViewDefinition typeDatabricks SQL type
id, string, codeSTRING
dateDATE
dateTime, instantTIMESTAMP
integer, positiveIntINT
decimalDOUBLE
booleanBOOLEAN

Both managed-zerobus and managed-sql run ALTER TABLE ADD COLUMNS when the ViewDefinition has new columns. The module triggers this at sender start and on per-batch schema-mismatch (retried once).

To add a column:

  1. Add the column to your ViewDefinition.
  2. Re-materialize: POST /fhir/ViewDefinition/{id}/$materialize.
  3. Either delete and recreate the destination, or wait for the next write. Auto-heal catches it on the first batch.

Existing rows will have NULL in the new column.

The module only adds columns on its own. Column drops, renames, or narrowing type changes (e.g., BIGINTINT) are not auto-applied. Run the corresponding ALTER TABLE yourself.

Initial export

Cloud support: AWS S3 only (today) for the initial-export staging bucket (s3://... / s3a://...). GCS and Azure ADLS Gen2 are not supported for the staging path.

The same flow described below is also exposed standalone as a FHIR operation: $viewdefinition-export. Use that operation when you want a one-shot snapshot of a ViewDefinition without standing up a continuous AidboxTopicDestination.

When you create a new destination without skipInitialExport=true, the module exports the current state of every row in sof.<view>.

  • Updates after destination creation append a new row each (POST / PUT / DELETE), accumulating a full audit trail.
  • Pre-existing history is not exported. Initial export reads each resource's current row from sof.<view>, not from Aidbox's _history table. Run a one-off ETL from _history before you create the destination if you need older versions.

How it works

Bulk initial-export flow: Aidbox writes per-chunk Delta stagings on S3, then issues one MERGE INTO target via the Databricks SQL warehouse, then drops the stagings.

The destination's initial export and the standalone $viewdefinition-export operation use the same data-lakehouse bulk-export engine.

Timing and monitoring

The kick-off and the export are decoupled. POST /fhir/AidboxTopicDestination does not hold the HTTP connection open while billions of rows stream into Databricks.

POST /fhir/AidboxTopicDestination returns 201 Created after bootstrap (1-2 minutes worst-case).

Poll progress via the destination's $status endpoint:

curl -u <client-name>:<client-secret> "$AIDBOX_URL/fhir/AidboxTopicDestination/patient-databricks/\$status" | jq

Relevant fields during initial export:

  • initialExportStatus: not_started, export-in-progress, completed, skipped, or failed.
  • initialExportProgress_rowsSent: running row count.
  • initialExportError: error message when initialExportStatus=failed.

On failure the module retries up to 3 times with exponential backoff (1s → 2s → 4s). The MERGE INTO ... ON t.id = s.id WHEN NOT MATCHED THEN INSERT * is idempotent on id, so a retry after a lost ack inserts zero new rows.

The continuous worker starts polling the PG queue right after destination creation, in parallel with initial export. Initial-export and live writes are not serialized. The MERGE keys on id, so a continuous-stream row inserted before initial-export catches up gets skipped by the eventual MERGE pass (idempotent).

Large-scale initial export

For large tables, set initialExportChunkCount to split the initial export into parallel chunks. Initial export uses the same bulk engine as $viewdefinition-export; use its large-scale execution section for sizing, substituting initialExportChunkCount for chunkCount.

Retry behavior

  • Failed batch: the message stays in the PostgreSQL queue and retries on the next sendIntervalMs tick. 1-second backoff between failed attempts.
  • OAuth bearer token: cached; refreshed when fewer than 5 minutes remain.
  • Worker thread crash: auto-restarts with exponential backoff (1s initial, 60s max). The queue prevents message loss.
  • Initial export failure: retries up to 3 times with 1s → 2s → 4s backoff. After 3 failures, initialExportStatus = failed, the error surfaces via $status, live delivery continues unaffected, and recreating the destination kicks off a fresh attempt.

Installation

Prerequisites

  • Aidbox 2605 or newer (install guide)
  • A Databricks workspace (Free Edition works for evaluation, paid for production)
  • The Databricks CLI installed and authenticated locally. Every Databricks-side operation in the tutorial uses it.
  • An AWS account and AWS CLI authenticated locally with permission to create S3 buckets and IAM roles. Initial-export staging lives in an S3 bucket you own, and the IAM trust policy patch needs iam:UpdateAssumeRolePolicy.
  • jq
  • A SQL warehouse
  • For managed-zerobus: Zerobus enabled on your SKU (Databricks Free Edition supports it; for paid plans confirm with Databricks support)

Aidbox-side setup

Five steps to get Aidbox up with the module loaded. The Databricks-side workspace setup follows in the next sub-section.

1

Enable metastore External Data Access

Account Console → Catalog → Metastores → <your-metastore> → toggle External Data Access on (Databricks docs).

Aidbox is "non-Databricks compute" from Unity Catalog's perspective. Requires metastore admin.

Databricks Catalog Explorer → metastore details → External data access toggle.

2

Export the service-principal credentials

The module reads Databricks OAuth M2M credentials from box settings. Create the SP in the Databricks UI (Settings → Identity and access → Service principals → Add, then Secrets → Generate secret).

Databricks workspace settings → Identity and access → Service principals.

Save the SP's client ID into a box settings env var:

export BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID=<sp-client-id>

And the client secret:

export BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_SECRET=<sp-client-secret>
3

Download the module JAR

Download the published JAR to your project directory:

curl -O https://storage.googleapis.com/aidbox-modules/topic-destination-deltalake/databricks-module-2605.0.jar
4

Wire the module into docker-compose.yaml

Mount the JAR and point Aidbox at it via env vars:

aidbox:
  volumes:
    - ./databricks-module-2605.0.jar:/databricks-module.jar
    # ... other volumes ...
  environment:
    BOX_MODULE_LOAD: io.healthsamurai.databricks.core
    BOX_MODULE_JAR: "/databricks-module.jar"
    BOX_FHIR_SCHEMA_VALIDATION: "true"
    BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID: ${BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID}
    BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_SECRET: ${BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_SECRET}
    # ... other environment variables ...
5

Start Aidbox

Start the stack:

docker compose up

Databricks-side setup

Ten steps to prepare your Databricks workspace: the catalog, schemas, target table, S3 staging bucket, IAM trust chain, and grants. The Service Principal you created in the Aidbox-side setup is the principal we grant throughout.

1

Authenticate the Databricks CLI and export environment variables

The remaining Setup steps (and the Usage example below) run Databricks CLI commands as your own user. Authenticate once via PAT or databricks auth login.

export DATABRICKS_HOST=https://<your-workspace>.cloud.databricks.com
# Use either a PAT for your own user, or run `databricks auth login`
# for browser-based SSO.
export DATABRICKS_TOKEN=<your-pat>

The rest of the example references the names below via environment variables. Override any of them before sourcing, and the commands stay copy-pasteable:

# Aidbox endpoint that will receive the destination POST.
export AIDBOX_URL=http://localhost:8080

# Numeric workspace ID — find it in Settings → Workspace → Workspace ID,
# or in the `o=…` query parameter of the workspace URL:
# https://<dbc-id>.cloud.databricks.com/?o=<workspace-id>
export WORKSPACE_ID=<your-workspace-id>

# Identifiers the example creates — pick your own.
export CATALOG=aidbox_export
export TARGET_SCHEMA=fhir
export STAGING_SCHEMA=fhir_staging
export TARGET_TABLE=patients

# Region where the Databricks workspace lives
export DATABRICKS_REGION=us-east-1

# Region of your S3 bucket. Usually the same as DATABRICKS_REGION
# (workspace and bucket created in the same region), but they're
# separate concepts — keep them distinct if your bucket is elsewhere.
export AWS_REGION=us-east-1
export STAGING_BUCKET=<your-bucket-name>
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# Databricks' own AWS account. Hardcoded for commercial AWS regions —
# Databricks publishes this account ID and uses it for every commercial
# workspace. **GovCloud customers use a different ID**: see Databricks'
# AWS GovCloud setup docs for the value to substitute here.
export DATABRICKS_AWS_ACCOUNT_ID=414351767826

# Unity Catalog resource names created in later steps.
export STORAGE_CRED_NAME=aidbox_staging_cred
export EXTERNAL_LOCATION_NAME=aidbox_staging_loc
export IAM_ROLE_NAME=aidbox-staging-role
2

Pick the SQL warehouse

You already created the Service Principal and exported BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID/_SECRET in the earlier Aidbox-side steps above.

Now in the Databricks UI under SQL Warehouses pick or create a Serverless warehouse, grab its ID:

Databricks SQL Warehouse Overview tab showing the warehouse ID.

export WAREHOUSE_ID=<warehouse-id>

Grant the SP Can use on the warehouse:

databricks warehouses update-permissions "$WAREHOUSE_ID" --json '{
  "access_control_list": [
    {"service_principal_name": "'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'", "permission_level": "CAN_USE"}
  ]
}'
3

Create the S3 bucket

Use the same region as your Databricks workspace. The same bucket holds both the managed target's storage root and the initial-export staging area, under separate prefixes.

Create the bucket:

aws s3api create-bucket --bucket "$STAGING_BUCKET" --region "$AWS_REGION"
4

Create the IAM role Databricks will assume

The Storage Credential step below patches PLACEHOLDER with the real external ID.

Create the role with a temporary trust policy:

aws iam create-role --role-name "$IAM_ROLE_NAME" \
  --assume-role-policy-document "$(cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::${DATABRICKS_AWS_ACCOUNT_ID}:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL" },
    "Action": "sts:AssumeRole",
    "Condition": { "StringEquals": { "sts:ExternalId": "PLACEHOLDER" } }
  }]
}
EOF
)"

Attach the S3 access policy so the role can read/write the staging bucket:

aws iam put-role-policy --role-name "$IAM_ROLE_NAME" --policy-name s3-access \
  --policy-document "$(cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject","s3:PutObject","s3:DeleteObject","s3:ListBucket","s3:GetBucketLocation","s3:GetLifecycleConfiguration","s3:PutLifecycleConfiguration"],
    "Resource": ["arn:aws:s3:::${STAGING_BUCKET}","arn:aws:s3:::${STAGING_BUCKET}/*"]
  }]
}
EOF
)"

Capture the role ARN. Unity Catalog references it when we register the Storage Credential:

export STAGING_ROLE_ARN=$(aws iam get-role --role-name "$IAM_ROLE_NAME" \
  --query 'Role.Arn' --output text)
5

Register the Storage Credential in Unity Catalog

Register the Storage Credential pointing at the IAM role:

databricks storage-credentials create \
  --json '{"name":"'"$STORAGE_CRED_NAME"'","aws_iam_role":{"role_arn":"'"$STAGING_ROLE_ARN"'"}}' \
  --skip-validation

Read back the external-id Unity Catalog generated for the IAM trust relationship:

export EXTERNAL_ID=$(databricks storage-credentials get "$STORAGE_CRED_NAME" \
  | jq -r '.aws_iam_role.external_id')

Patch the IAM role's trust policy with the real external-id (replacing PLACEHOLDER):

aws iam update-assume-role-policy --role-name "$IAM_ROLE_NAME" \
  --policy-document "$(cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": [
      "arn:aws:iam::${DATABRICKS_AWS_ACCOUNT_ID}:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL",
      "arn:aws:iam::${AWS_ACCOUNT_ID}:role/${IAM_ROLE_NAME}"
    ]},
    "Action": "sts:AssumeRole",
    "Condition": { "StringEquals": { "sts:ExternalId": "${EXTERNAL_ID}" } }
  }]
}
EOF
)"

Validate Databricks can actually assume the role end-to-end (10s wait lets the IAM update propagate):

sleep 10 \
  && databricks storage-credentials validate --storage-credential-name "$STORAGE_CRED_NAME"

Empty results means success.

6

Register the External Location

The External Location combines the Storage Credential with the bucket prefix Databricks can write into. Register the bucket root so the same External Location backs both the managed-catalog storage root and the staging-schema prefix:

databricks external-locations create "$EXTERNAL_LOCATION_NAME" \
  "s3://$STAGING_BUCKET/" "$STORAGE_CRED_NAME"
7

Create the catalog and target schema

The catalog's --storage-root must sit inside the External Location you just registered.

databricks catalogs create "$CATALOG" \
  --storage-root "s3://$STAGING_BUCKET/managed/"

Create the target schema inside the new catalog:

databricks api post /api/2.0/sql/statements --json "$(jq -n \
  --arg wh "$WAREHOUSE_ID" \
  --arg stmt "CREATE SCHEMA $CATALOG.$TARGET_SCHEMA" \
  '{warehouse_id: $wh, wait_timeout: "30s", statement: $stmt}')"
8

Create the managed Delta target table

Columns must match the ViewDefinition, plus a mandatory is_deleted INT (see Output semantics). For the FHIR → SQL type mapping, see Schema evolution.

databricks api post /api/2.0/sql/statements --json "$(jq -n \
  --arg wh "$WAREHOUSE_ID" \
  --arg stmt "CREATE TABLE $CATALOG.$TARGET_SCHEMA.$TARGET_TABLE (id STRING, ts TIMESTAMP, cts TIMESTAMP, gender STRING, birth_date DATE, family_name STRING, given_name STRING, is_deleted INT) USING DELTA" \
  '{warehouse_id: $wh, wait_timeout: "30s", statement: $stmt}')"
9

Create the sibling staging schema

The module convention places initial-export staging tables in <catalog>.<target-schema>_staging.<…>, a sibling schema next to the target. For target $CATALOG.$TARGET_SCHEMA.$TARGET_TABLE that's $CATALOG.$STAGING_SCHEMA:

databricks schemas create "$STAGING_SCHEMA" "$CATALOG" \
  --storage-root "s3://$STAGING_BUCKET/staging/"

--storage-root is not optional. Omitting it creates a managed schema that later rejects the EXTERNAL_USE_SCHEMA grant the module needs, without surfacing an error. The CLI form is the only one that works; Databricks rejects CREATE SCHEMA … LOCATION '…' via SQL.

Run this as the catalog owner. It needs CREATE_SCHEMA on the catalog and CREATE_MANAGED_STORAGE on the External Location. Don't grant either to the runtime SP.

10

Grant the service principal

Grant only the set matching your writeMode. The SP runs the module at request time; nothing in this set lets it create or destroy catalog-level resources.

PrivilegeGranted onPurpose
USE_CATALOGthe catalognavigate the catalog
USE_SCHEMAthe target schemaresolve the target table
SELECT, MODIFYthe target tableDESCRIBE + initial-bulk MERGE INTO
USE_SCHEMA, EXTERNAL_USE_SCHEMA, CREATE_TABLEthe staging schemaresolve the sibling schema, vend STS for the staging table, and let the sender register it (initial-export only)
READ_FILES, WRITE_FILES, CREATE_EXTERNAL_TABLEthe External Locationwrite bulk Parquet via vended STS (initial-export only)
EXTERNAL_USE_LOCATIONthe External Locationvend path-credentials so the module can recursive-delete orphan Parquet / _delta_log/ left from prior runs before the next CREATE TABLE — without this grant cleanup is skipped and files accumulate, but init-export still works
CAN_USEthe SQL warehousebootstrap schema-sync statements + initial-bulk MERGE (no warehouse traffic during live writes) — already granted in the SP/warehouse step

Catalog-level navigate grant:

databricks grants update catalog "$CATALOG" --json '{
  "changes":[{"principal":"'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'","add":["USE_CATALOG"]}]}'

Target-schema resolve grant:

databricks grants update schema "$CATALOG.$TARGET_SCHEMA" --json '{
  "changes":[{"principal":"'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'","add":["USE_SCHEMA"]}]}'

Target-table read + merge grants:

databricks grants update table "$CATALOG.$TARGET_SCHEMA.$TARGET_TABLE" --json '{
  "changes":[{"principal":"'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'","add":["SELECT","MODIFY"]}]}'

Initial-export only — staging schema + external location grants:

Staging-schema grants (STS vending + table create) — initial-export only:

databricks grants update schema "$CATALOG.$STAGING_SCHEMA" --json '{
  "changes":[{"principal":"'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'","add":["EXTERNAL_USE_SCHEMA","USE_SCHEMA","CREATE_TABLE"]}]}'

External-location grants (Parquet writes + path-credentials cleanup) — initial-export only:

databricks grants update external-location "$EXTERNAL_LOCATION_NAME" --json '{
  "changes":[{"principal":"'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'","add":["READ_FILES","WRITE_FILES","CREATE_EXTERNAL_TABLE","EXTERNAL_USE_LOCATION"]}]}'

Identical privilege set to managed-zerobus. The SQL warehouse handles every batch instead of only the bootstrap and initial-bulk phases:

PrivilegeGranted onPurpose
USE_CATALOGthe catalognavigate the catalog
USE_SCHEMAthe target schemaresolve the target table
SELECT, MODIFYthe target tableDESCRIBE + every-batch INSERT + initial-bulk MERGE INTO
USE_SCHEMA, EXTERNAL_USE_SCHEMA, CREATE_TABLEthe staging schemaresolve the sibling schema, vend STS for the staging table, and let the sender register it (initial-export only)
READ_FILES, WRITE_FILES, CREATE_EXTERNAL_TABLEthe External Locationwrite bulk Parquet via vended STS (initial-export only)
EXTERNAL_USE_LOCATIONthe External Locationvend path-credentials so the module can recursive-delete orphan Parquet / _delta_log/ left from prior runs before the next CREATE TABLE — without this grant cleanup is skipped and files accumulate, but init-export still works
CAN_USEthe SQL warehouseevery-batch INSERT + bootstrap + initial-bulk — already granted in the SP/warehouse step

Catalog-level navigate grant:

databricks grants update catalog "$CATALOG" --json '{
  "changes":[{"principal":"'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'","add":["USE_CATALOG"]}]}'

Target-schema resolve grant:

databricks grants update schema "$CATALOG.$TARGET_SCHEMA" --json '{
  "changes":[{"principal":"'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'","add":["USE_SCHEMA"]}]}'

Target-table read + merge grants:

databricks grants update table "$CATALOG.$TARGET_SCHEMA.$TARGET_TABLE" --json '{
  "changes":[{"principal":"'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'","add":["SELECT","MODIFY"]}]}'

Initial-export only — staging schema + external location grants:

Staging-schema grants (STS vending + table create) — initial-export only:

databricks grants update schema "$CATALOG.$STAGING_SCHEMA" --json '{
  "changes":[{"principal":"'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'","add":["EXTERNAL_USE_SCHEMA","USE_SCHEMA","CREATE_TABLE"]}]}'

External-location grants (Parquet writes + path-credentials cleanup) — initial-export only:

databricks grants update external-location "$EXTERNAL_LOCATION_NAME" --json '{
  "changes":[{"principal":"'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'","add":["READ_FILES","WRITE_FILES","CREATE_EXTERNAL_TABLE","EXTERNAL_USE_LOCATION"]}]}'

AidboxTopicDestination parameters

Required:

ParameterTypeDescription
viewDefinitionstringThe name field of the ViewDefinition resource (not id)
batchSizeunsignedIntRows per worker tick / batch commit
sendIntervalMsunsignedIntMax time between batched commits, in ms
databricksWorkspaceUrlstringhttps://<workspace>.cloud.databricks.com
databricksWorkspaceIdstringNumeric workspace ID (e.g. 1234567890123456). Composes the Zerobus REST endpoint host
databricksRegionstringWorkspace AWS region (e.g. us-east-1). Composes the Zerobus REST endpoint host
tableNamestringManaged table full name: catalog.schema.table
databricksWarehouseIdstringSQL warehouse ID — used at bootstrap for schema sync + (if initial-export runs) the final MERGE INTO. No warm-warehouse traffic during live writes.
awsRegionstringAWS region of the staging bucket
stagingTablePathstrings3://bucket/path/ for the staging Delta table created during initial export (S3 only today). Required when skipInitialExport is not true

Not required:

ParameterTypeDescription
writeModestringmanaged-zerobus (default) or managed-sql. Omit to get managed-zerobus
skipInitialExportbooleanSkip initial export of existing data (default: false)
targetFileSizeMbunsignedIntParquet target size during initial export (default: 128)
initialExportChunkCountunsignedIntCluster-wide number of parallel chunks for hash-partitioned initial export (default 1 — sequential). See Large-scale initial export for sizing.

Required:

ParameterTypeDescription
writeModestringMust be managed-sql (otherwise the default managed-zerobus path is used)
viewDefinitionstringThe name field of the ViewDefinition resource (not id)
batchSizeunsignedIntRows per worker tick / batch commit
sendIntervalMsunsignedIntMax time between batched commits, in ms
databricksWorkspaceUrlstringhttps://<workspace>.cloud.databricks.com
tableNamestringManaged table full name: catalog.schema.table
databricksWarehouseIdstringSQL warehouse ID
awsRegionstringAWS region of the staging bucket
stagingTablePathstrings3://bucket/path/ for the staging Delta table created during initial export (S3 only today). Required when skipInitialExport is not true

Not required:

ParameterTypeDescription
skipInitialExportbooleanSkip initial export of existing data (default: false)
targetFileSizeMbunsignedIntParquet target size during initial export (default: 128)
initialExportChunkCountunsignedIntCluster-wide number of parallel chunks for hash-partitioned initial export (default 1 — sequential). See Large-scale initial export for sizing.

Usage example: configure your first export

Installation above covers the workspace setup (SP creds, Databricks CLI auth, S3 bucket, UC catalog, schemas, target and staging tables, grants). Complete it before running these steps.

The per-export commands below need only one environment variable: the Aidbox endpoint that will receive the destination POST:

export AIDBOX_URL=http://localhost:8080

The example below uses managed-zerobus (the default).

1

Create the subscription topic

Databricks side is done. Back to Aidbox. The subscription topic declares which FHIR resource changes trigger the export; the destination resource (next step) references this topic by URL.

POST /fhir/AidboxSubscriptionTopic

{
  "resourceType": "AidboxSubscriptionTopic",
  "url": "http://example.org/subscriptions/patient-updates",
  "status": "active",
  "trigger": [
    {
      "resource": "Patient",
      "supportedInteraction": ["create", "update", "delete"],
      "fhirPathCriteria": "name.exists()"
    }
  ]
}
2

Create + materialize the ViewDefinition

A ViewDefinition flattens each FHIR resource into a row using FHIRPath expressions. The column shape here must match the Databricks target table you created above.

POST /fhir/ViewDefinition

{
  "resourceType": "ViewDefinition",
  "id": "patient_flat",
  "name": "patient_flat",
  "resource": "Patient",
  "status": "active",
  "select": [
    {
      "column": [
        {"name": "id", "path": "id"},
        {"name": "ts", "path": "getAidboxTs()"},
        {"name": "cts", "path": "getAidboxCts()"},
        {"name": "gender", "path": "gender"},
        {"name": "birth_date", "path": "birthDate"}
      ]
    },
    {
      "forEach": "name.where(use = 'official').first()",
      "column": [
        {"name": "family_name", "path": "family"},
        {"name": "given_name", "path": "given.join(' ')"}
      ]
    }
  ]
}

getAidboxTs() exposes meta.lastUpdated (the FHIR resource version timestamp); getAidboxCts() exposes the row's row-creation timestamp in Aidbox's storage. Both are Aidbox FHIRPath extensions and help with read-time dedup of the append-only history (see Querying the table).

Then materialize it as a database view in the sof schema. The module reads rows from sof.patient_flat:

POST /fhir/ViewDefinition/patient_flat/$materialize

{
  "resourceType": "Parameters",
  "parameter": [
    {"name": "type", "valueCode": "view"}
  ]
}
3

Configure the destination

The request may take one or two minutes. Aidbox runs schema sync against the warehouse (which may wake it from idle) and, if skipInitialExport is not set, kicks off the initial bulk export before returning.

Pick the writeMode tab matching your Databricks SKU. The bodies are identical except for the writeMode value:

Default. Live per-batch writes use the Zerobus REST row-insert endpoint, so no SQL warehouse stays warm for the hot path:

curl -u <client-name>:<client-secret> -X POST "$AIDBOX_URL/fhir/AidboxTopicDestination" \
  -H 'Content-Type: application/json' \
  --data-binary @- <<EOF | jq
{
  "resourceType": "AidboxTopicDestination",
  "id": "patient-databricks",
  "topic": "http://example.org/subscriptions/patient-updates",
  "kind": "data-lakehouse-at-least-once",
  "meta": {
    "profile": [
      "http://health-samurai.io/fhir/core/StructureDefinition/aidboxtopicdestination-dataLakehouseAtLeastOnceProfile"
    ]
  },
  "parameter": [
    {"name": "writeMode", "valueString": "managed-zerobus"},
    {"name": "databricksWorkspaceUrl", "valueString": "${DATABRICKS_HOST}"},
    {"name": "databricksWorkspaceId", "valueString": "${WORKSPACE_ID}"},
    {"name": "databricksRegion", "valueString": "${DATABRICKS_REGION}"},
    {"name": "tableName", "valueString": "${CATALOG}.${TARGET_SCHEMA}.${TARGET_TABLE}"},
    {"name": "databricksWarehouseId", "valueString": "${WAREHOUSE_ID}"},
    {"name": "awsRegion", "valueString": "${AWS_REGION}"},
    {"name": "stagingTablePath", "valueString": "s3://${STAGING_BUCKET}/staging/${TARGET_TABLE}/"},
    {"name": "viewDefinition", "valueString": "patient_flat"},
    {"name": "batchSize", "valueUnsignedInt": 50},
    {"name": "sendIntervalMs", "valueUnsignedInt": 5000},
    {"name": "initialExportChunkCount", "valueUnsignedInt": 1}
  ]
}
EOF

Use this when Zerobus isn't available on your Databricks SKU (older paid plans, some regions). Same managed Unity Catalog target, same staging-MERGE initial export. Live per-batch writes go through the SQL warehouse, which stays warm and accrues uptime billing:

curl -u <client-name>:<client-secret> -X POST "$AIDBOX_URL/fhir/AidboxTopicDestination" \
  -H 'Content-Type: application/json' \
  --data-binary @- <<EOF | jq
{
  "resourceType": "AidboxTopicDestination",
  "id": "patient-databricks",
  "topic": "http://example.org/subscriptions/patient-updates",
  "kind": "data-lakehouse-at-least-once",
  "meta": {
    "profile": [
      "http://health-samurai.io/fhir/core/StructureDefinition/aidboxtopicdestination-dataLakehouseAtLeastOnceProfile"
    ]
  },
  "parameter": [
    {"name": "writeMode", "valueString": "managed-sql"},
    {"name": "databricksWorkspaceUrl", "valueString": "${DATABRICKS_HOST}"},
    {"name": "databricksWorkspaceId", "valueString": "${WORKSPACE_ID}"},
    {"name": "databricksRegion", "valueString": "${DATABRICKS_REGION}"},
    {"name": "tableName", "valueString": "${CATALOG}.${TARGET_SCHEMA}.${TARGET_TABLE}"},
    {"name": "databricksWarehouseId", "valueString": "${WAREHOUSE_ID}"},
    {"name": "awsRegion", "valueString": "${AWS_REGION}"},
    {"name": "stagingTablePath", "valueString": "s3://${STAGING_BUCKET}/staging/${TARGET_TABLE}/"},
    {"name": "viewDefinition", "valueString": "patient_flat"},
    {"name": "batchSize", "valueUnsignedInt": 50},
    {"name": "sendIntervalMs", "valueUnsignedInt": 5000},
    {"name": "initialExportChunkCount", "valueUnsignedInt": 1}
  ]
}
EOF

stagingTablePath must be a sub-prefix of the External Location you registered (here s3://$STAGING_BUCKET/staging/), not the root itself. Setting it equal to the External Location root or to the staging schema's storage_root makes Databricks refuse with LOCATION_OVERLAP. Use a per-destination subdirectory like staging/patient_flat/ or staging/<destination-id>/.

4

Verify

Create a test patient:

POST /fhir/Patient

{
  "name": [{"use": "official", "family": "Smith", "given": ["John"]}],
  "gender": "male",
  "birthDate": "1990-01-15"
}

Then query your Databricks table to confirm the data arrived. In the Databricks UI open SQL Editor (left sidebar) and run:

SELECT * FROM aidbox_export.fhir.patients;

You should see one row for John Smith. If you created the destination before any patients existed in Aidbox, that one is the only row; initial export had nothing to copy. POST a few more patients and re-run the query to see live writes accumulate.

Stopping the export

To stop exporting data, delete the AidboxTopicDestination resource:

DELETE /fhir/AidboxTopicDestination/patient-databricks

This stops the export and cleans up the internal message queue. Data the module has already written to Databricks stays put.

Monitoring

Status endpoint

GET /fhir/AidboxTopicDestination/patient-databricks/$status

Returns a FHIR Parameters resource:

{
  "resourceType": "Parameters",
  "parameter": [
    { "name": "status", "valueString": "active" },
    { "name": "messagesDelivered", "valueDecimal": 100 },
    { "name": "messagesQueued", "valueDecimal": 0 },
    { "name": "messagesInProcess", "valueDecimal": 0 },
    { "name": "messagesDeliveryAttempts", "valueDecimal": 100 },
    { "name": "initialExportStatus", "valueString": "completed" },
    { "name": "initialExportProgress_rowsSent", "valueDecimal": 100 }
  ]
}
  • messagesDelivered: total messages sent to Databricks
  • messagesQueued: messages waiting in the PG queue
  • messagesInProcess: messages in flight
  • messagesDeliveryAttempts: total delivery attempts, including retries
  • initialExportStatus: not_started, export-in-progress, completed, skipped, or failed
  • initialExportProgress_rowsSent: number of rows sent during initial export

Troubleshooting

Common issues

  1. Privilege EXTERNAL USE SCHEMA is not applicable to this entity: you're trying to grant EXTERNAL USE SCHEMA on a managed schema. The staging schema (<target-schema>_staging) must be external. Recreate it with an explicit storage_root pointed at your staging External Location.
  2. Databricks denied access at POST /temporary-table-credentials [PERMISSION_DENIED]: External Data Access from non Databricks Compute environment is disabled for metastore … (EXTERNAL_ACCESS_DISABLED_ON_METASTORE): the metastore-level toggle is off. See Setup step 1.
  3. Databricks denied access at POST /temporary-table-credentials: principal lacks privileges (need EXTERNAL USE SCHEMA …) (no error_code shown in message): the grant looked successful but didn't apply because the staging schema is managed, not external. Recreate the staging schema via databricks schemas create … --storage-root s3://… and re-grant.
  4. INSUFFICIENT_PRIVILEGES on table or warehouse: verify all grants in Grant the service principal. Don't forget CAN_USE on the warehouse.
  5. DELTA_INSERT_COLUMN_ARITY_MISMATCH: the module should auto-heal this once. If it persists, check that the schema diff is column-add only (drops and renames are not auto-applied).
  6. Slow first write: Serverless warehouses cold-start in 30-90s on first use after idle. The module's HTTP timeout is 120s for SQL Statement Execution and uses wait_timeout=50s polling, so cold starts succeed without user-visible failure, but the first batch's latency is high. Keep the warehouse warm with a periodic ping if first-batch latency matters.
  7. Duplicate rows after recreating destination: deleting and recreating a destination triggers initial export again. Set skipInitialExport: true when recreating a destination that already has its data exported.
  8. LOCATION_OVERLAP during initial export: stagingTablePath either equals the staging schema's storage_root (which UC treats as the schema's own managed location) or doesn't sit under your External Location. Set it to a sub-prefix of the External Location, e.g. s3://<bucket>/staging/patient_flat/, not the External Location root itself.
  9. Unsupported table kind. Tables created in default storage are not supported (Zerobus error 4024): the catalog backing your target table was created without --storage-root, so Unity Catalog placed it in the workspace's default-storage prefix. managed-zerobus refuses to write into default storage. Recreate the catalog with databricks catalogs create <name> --storage-root s3://<bucket>/managed/ pointing inside a registered External Location (see Create the catalog and target schema in the usage example).
  10. DELTA_CREATE_TABLE_SCHEME_MISMATCH on initial export retry: your stagingTablePath contains a _delta_log/ from a previous initial-export run, and the new run has a different ViewDefinition schema (e.g. you added ts/cts columns). The module drops the UC staging table metadata on cleanup but does NOT delete S3 files, so the old _delta_log/ survives and conflicts. Fix: either point stagingTablePath at a fresh sub-prefix (e.g. append a nonce: s3://<bucket>/staging/<table>-v2/), or run aws s3 rm --recursive on the old prefix. Granting EXTERNAL_USE_LOCATION on the External Location (see item 11) enables the module's auto-cleanup so this stops biting on column-add re-runs.
  11. staging-s3-cleanup-skipped log with :reason :no-external-use-location-grant: the SP lacks EXTERNAL_USE_LOCATION on the External Location covering the staging path. Auto-cleanup of stale staging Parquet and _delta_log/ between runs is disabled and files accumulate. Init-export itself keeps working. Fix by granting the privilege:
    databricks grants update external-location "$EXTERNAL_LOCATION_NAME" --json '{
      "changes":[{"principal":"'"$BOX_DATABRICKS_DATA_LAKEHOUSE_CLIENT_ID"'","add":["EXTERNAL_USE_LOCATION"]}]}'
    
    Or via SQL:
    GRANT EXTERNAL USE LOCATION ON EXTERNAL LOCATION `<name>` TO `<sp-application-id>`;
    
    Verify with:
    SHOW GRANTS ON EXTERNAL LOCATION `<name>`;
    

Last updated: