Bulk import from an S3 bucket

If you want to import S3 objects with a presigned url, refer to aidbox.bulk data import.

`aidbox.bulk/load-from-bucket`

It allows loading data from a bunch of .ndjson.gz files on an AWS bucket directly to the Aidbox database with maximum performance.

Be careful You should run only one replica of Aidbox to use aidbox.bulk/load-from-bucket operation.

Files content and naming requirement

The file must consist of Resources of the same type.
The file name should start with a name of the Resource type, then some postfix is possible, and extension .ndjson is required. Files can be placed in subdirectories of any level. Files with the wrong path structure will be ignored.
Every resource in .ndjson files MUST contain id property.

Resource requirements for `aidbox.bulk/load-from-bucket`:

Operation	id	resourceType
`aidbox.bulk/load-from-bucket`	Required	Not required

Valid file structure example:

fhir/1/Patient.ndjson.gz
fhir/1/patient-01.ndjson.gz
Observation.ndjson.gz

Invalid file structure example:

import.ndjson
01-patient.ndjson.gz
fhir/Patient

Parameters

Object with the following structure:

bucket * defines your bucket connection string in formats3://<bucket-name>
thread-num defines how many threads will process the import. The default is 4.
account credential:
- access-key-id * AWS key ID
- secret-access-key * AWS secret key
- region * AWS Bucket region
- host use you need to override the default amazonaws.com, for example, us-gov-east-1.amazonaws.com
disable-idx? the default is false. Allows to drop all indexes for resources, which data are going to be loaded. Indexes will be restored at the end of successful import. All information about dropped indexes is stored at DisabledIndex resources.
drop-primary-key? the default is false. The same as the previous parameter, but drops primary key constraint for resources tables. This parameter disables all checks for duplicates for imported resources.
upsert? the default is false. If upsert? is false, import for files with id uniqueness constraint violation will fail with an error, if true - records in the database will be overridden with records from import. Even when upsert? is true, it's still not allowed to have more than one record with the same id in one import file. Setting this option to true will cause a decrease in performance.
scheduler possible values: optimal , by-last-modified, the default is optimal . Establishes the order in which the files are processed. The optimal value provides the best performance. by-last-modified should be used with thread-num = 1 to guarantee a stable order of file processing.
prefixes array of prefixes to specify which files should be processed. Example: with value ["fhir/1/", "fhir/2/Patient"] only files from directory "fhir/1" and Patient files from directory "fhir/2" will be processed. When provided, Aidbox scopes the S3 listing call to each prefix instead of listing the entire bucket, so import start-up time on large buckets depends on the size of the requested prefixes rather than the whole bucket.
connect-timeout the default is 0. Specifies the number of milliseconds after which the file is considered as failed if connection to the resource could not be established. (e.g. in case of network issues). Zero is interpreted as an infinite timeout.
read-timeout the default is 0. Specifies the number of milliseconds after which the file is considered as failed if there is no data available to read (e.g. in case of network issues). Zero is interpreted as an infinite timeout.

Returns the string "Upload started"

Returns error message

Example

POST /rpc
content-type: text/yaml
accept: text/yaml

method: aidbox.bulk/load-from-bucket
params:
  bucket: s3://your-bucket-id
  thread-num: 4
  account:
    access-key-id: your-key-id
    secret-access-key: your-secret-access-key
    region: us-east-1

Status: 200

result:
  message: Upload from bucket <s3://your-bucket-id> started. 6 new files added.
  progress:
    total: 6
  new-files-count: 6

Loader File

For each file being imported via load-from-bucket method, Aidbox creates LoaderFile resource. To find out how many resources were imported from a file, check the loaded field.

Loader File Example

{
  "end": "2022-04-11T14:50:27.893Z",
  "file": "/tmp/patient.ndjson.gz",
  "size": 100,
  "type": "Patient",
  "bucket": "local",
  "loaded": 20,
  "status": "done"
}

{
  "end": "2022-04-11T14:50:27.893Z",
  "file": "/tmp/patient.ndjson.gz",
  "size": 100,
  "type": "Patient",
  "bucket": "local",
  "status": "error",
  "error": {
    "code": "23505",
    "source": "postgres"
  },
  "message": "23505: ERROR: duplicate key value violates unique constraint \"patient_pkey\""
}

Sources of Error

There are the following sources of error for this request.

AWS Error
PostgreSQL Error
Aidbox Error\

AWS Error

Code	Description
InvalidAccount	The AWS access key ID or AWS secret access key that you provided is not valid.
NoSuchKey	The specified S3 bucket or S3 object key does not exist.

PostgreSQL Error

See Documentation of PostgreSQL.\

Aidbox Error

Any other errors than the above can be caught as Aidbox Error. An error message will be provided if available.\

How to reload a file one more time

On launch aidbox.bulk/load-from-bucket checks if files from the bucket were planned to import and decides what to do:

If ndjson.gz file has it's related LoaderFile resource, the loader skips this file from import
If there is no related LoaderFile resource, Aidbox puts this file to the queue creating a LoaderFile resource

In order to import a file one more time you should delete related LoaderFile resource and relaunch aidbox.bulk/load-from-bucket.

Files are processed completely. The loader doesn't support partial re-import.\

AWS User Policy: Minimal Example

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "MinimalUserPolicyForBulkImport",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<your-bucket-name>",
        "arn:aws:s3:::<your-bucket-name>/*"
      ]
    }
  ]
}

`aidbox.bulk/load-from-bucket-status`

Returns status and progress of import for specified bucket. Possible states are: in-progress, completed, interrupted.

State interrupted means that aidbox was restarted during the loading process. If you run aidbox.bulk/load-from-bucket operation again on the same bucket, it will be continued.

Example

POST /rpc
content-type: text/yaml
accept: text/yaml

method: aidbox.bulk/load-from-bucket-status
params:
  bucket: s3://your-bucket-id

Status: 200

result:
  state: in-progress
  progress:
    total: 6
    pending: 2
    done: 4

Bulk import from an S3 bucket

aidbox.bulk/load-from-bucket

Files content and naming requirement

Resource requirements for aidbox.bulk/load-from-bucket:

Valid file structure example:

Invalid file structure example:

Parameters

Example

Loader File

Loader File Example

How to reload a file one more time

AWS User Policy: Minimal Example

aidbox.bulk/load-from-bucket-status

Example

`aidbox.bulk/load-from-bucket`

Resource requirements for `aidbox.bulk/load-from-bucket`:

`aidbox.bulk/load-from-bucket-status`