Stream: bulk data
Topic: Breakout notes
Josh Mandel (Sep 14 2019 at 19:08):
We're looking at https://github.com/smart-on-fhir/bulk-import/blob/master/import.md and holding a roundtable to share experience and directions. I'll keep some live notes here...
Josh Mandel (Sep 14 2019 at 19:10):
inputSource
Currently required, but may not be relevant/known in real life (e.g., for synthetic data). Today, we use this to track lightweight provenance from inputs -> imported resources.
Josh Mandel (Sep 14 2019 at 19:11):
What's it really trying to track? Goal is to be able to recognize data from the same source, over time.
Josh Mandel (Sep 14 2019 at 19:13):
We can't assume that all resources in an import have the same real-world origin.
Josh Mandel (Sep 14 2019 at 19:15):
Proposal: keep inputSource as required, but update the docs to note that it's a URI, specified by the client, meant to track data imported through its lifecycle. Use "a FHIR server base URL" as an example but not the definition of this field.
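As a rough sketch of what this might look like on the wire, assuming the draft import.md's Parameters-based kick-off request (the parameter names come from the draft; the value[x] choices, URLs, and endpoint below are assumptions):
```python
import requests

# Hypothetical kick-off payload for $import. inputSource is a client-chosen
# URI identifying where the data came from (a FHIR server base URL is one
# example, not the definition of the field).
kickoff = {
    "resourceType": "Parameters",
    "parameter": [
        {"name": "inputFormat", "valueString": "application/fhir+ndjson"},
        {"name": "inputSource", "valueUri": "https://ehr.example.org/source-system"},
        {
            "name": "input",
            "part": [
                {"name": "type", "valueString": "Patient"},
                {"name": "url", "valueUri": "https://files.example.org/Patient.ndjson"},
            ],
        },
    ],
}

resp = requests.post(
    "https://fhir.example.org/$import",  # assumed server endpoint
    json=kickoff,
    headers={"Content-Type": "application/fhir+json", "Prefer": "respond-async"},
)
print(resp.status_code, resp.headers.get("Content-Location"))
```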
Josh Mandel (Sep 14 2019 at 19:16):
We can also note that clients may provide more detailed Provenance along the way.
Josh Mandel (Sep 14 2019 at 19:19):
Note that Resource.id in an import should be unique (no distinct resources that share an ID).
Josh Mandel (Sep 14 2019 at 19:19):
inputFormat
In GCP, there are use cases for importing Bundles (i.e., unwrap-bundle-and-import-the-entries) as well as piles of resources.
Josh Mandel (Sep 14 2019 at 19:20):
Can there be something like an "unwrap" mode, or even deeper Bundle semantics?
Josh Mandel (Sep 14 2019 at 19:21):
1. Resources in NDJSON
2. Resources in JSON (one per file)
3. Bundles in NDJSON
4. Bundles in JSON (one per file)
Josh Mandel (Sep 14 2019 at 19:21):
Synthea is an example of (4)
Josh Mandel (Sep 14 2019 at 19:21):
Current bulk data export is an example of (1)
Josh Mandel (Sep 14 2019 at 19:22):
No real organic use for (2); a little for (3).
Josh Mandel (Sep 14 2019 at 19:26):
Which do we want to enable in a standard way, and which do we want to require?
Josh Mandel (Sep 14 2019 at 19:29):
Proposal: require support for (1), and define support for all four. With respect to bundle types: build in a parameter that requests something like processBundleAsCollection (so servers can introduce support for Bundles without being expected to manage transaction semantics).
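To make case (4) concrete, a sketch of a kick-off payload for "Bundles in JSON, one per file"; both the media-type string and the processBundleAsCollection name are placeholders from this discussion, not spec values:
```python
def bundle_import_parameters(bundle_urls):
    """Sketch of a kick-off payload for 'Bundles in JSON, one per file' (case 4).

    processBundleAsCollection is the flag proposed in these notes: the server
    unwraps each Bundle and imports its entries without transaction semantics.
    """
    parameter = [
        {"name": "inputFormat", "valueString": "application/fhir+json"},  # placeholder
        {"name": "processBundleAsCollection", "valueBoolean": True},       # proposed, not spec
    ]
    for url in bundle_urls:
        parameter.append({
            "name": "input",
            "part": [
                {"name": "type", "valueString": "Bundle"},
                {"name": "url", "valueUri": url},
            ],
        })
    return {"resourceType": "Parameters", "parameter": parameter}
```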
Josh Mandel (Sep 14 2019 at 19:31):
input.type
Josh Mandel (Sep 14 2019 at 19:33):
Do servers need uniform types per file?
Josh Mandel (Sep 14 2019 at 19:34):
- use case for servers that are going to just provide analytics
- maintain type-based dependencies in an import (e.g., for a CMS API that wants a heuristic like "load organizations, then practitioners, then patients").
Josh Mandel (Sep 14 2019 at 19:37):
Today, most heterogeneous sets of resources are in the context of bundles (GCP import experience)
Josh Mandel (Sep 14 2019 at 19:47):
Proposal: Enable support for "mixed-type" payloads, achieved by a client omitting input.type. Some servers might refuse to process mixed-type data.
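The difference on the wire would be small, assuming the draft's input parts; the only change for a mixed-type file is that the type part is left out (URLs here are made up):
```python
# Homogeneous file: the client states the resource type up front.
typed_input = {
    "name": "input",
    "part": [
        {"name": "type", "valueString": "Observation"},
        {"name": "url", "valueUri": "https://files.example.org/Observation.ndjson"},
    ],
}

# Mixed-type file: per the proposal, the client simply omits 'type'.
# A server that can't handle heterogeneous files would reject this entry.
mixed_input = {
    "name": "input",
    "part": [
        {"name": "url", "valueUri": "https://files.example.org/mixed-resources.ndjson"},
    ],
}
```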
Josh Mandel (Sep 14 2019 at 19:48):
input.url
Josh Mandel (Sep 14 2019 at 19:51):
In GCP, cloud-bucket-backed imports provide the ability to list file paths with wildcards, like gcp://my-bucket/path-to/resources/Observation-*.json. Is there something similar that could be done with https-based storage?
Josh Mandel (Sep 14 2019 at 19:52):
How does success/error reporting work in the status response?
Josh Mandel (Sep 14 2019 at 19:53):
In GCP, logging is in Stackdriver, with individual resource-level errors logged, but no file-level summaries.
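For reference, a minimal polling loop against the async status endpoint, assuming the bulk-data async pattern (202 + Retry-After while in progress, then a final JSON body whose error-reporting shape is exactly what's open here):
```python
import time
import requests

def poll_import_status(status_url: str, timeout_s: int = 3600) -> dict:
    """Poll the Content-Location returned by the $import kick-off.

    Assumes the FHIR async pattern: 202 while in progress (honor Retry-After,
    assumed here to be in delta-seconds form), 200 with a JSON body when done.
    How per-file / per-resource errors are reported in that body is an open
    question in these notes.
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(status_url, headers={"Accept": "application/json"})
        if resp.status_code == 200:
            return resp.json()          # final outcome manifest
        if resp.status_code == 202:
            time.sleep(int(resp.headers.get("Retry-After", "10")))
            continue
        resp.raise_for_status()         # 4xx/5xx: the import failed outright
    raise TimeoutError("import did not complete in time")
```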
Josh Mandel (Sep 14 2019 at 19:54):
Path expansion happens at the beginning of the import (probably ;-))
Josh Mandel (Sep 14 2019 at 19:54):
Also supports ** for recursing into subdirectories (wildcard path components).
Josh Mandel (Sep 14 2019 at 19:55):
Wildcard support doesn't exist in HTTP requests, and capabilities/semantics vary among different cloud bucket storage services.
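Since HTTP itself has no wildcard semantics, one way a server could support Observation-*.json-style patterns against a bucket is to expand them itself over a listing of object keys. A minimal sketch using Python's fnmatch; the key list is a hypothetical stand-in for whatever listing API the storage service offers:
```python
from fnmatch import fnmatch

def expand_wildcards(pattern: str, keys: list) -> list:
    """Expand a wildcard path against a flat list of object keys.

    Sketch only: '*' is kept within a single path segment by comparing '/'
    counts, while '**' (as mentioned in the notes) matches across segments.
    """
    if "**" in pattern:
        loose = pattern.replace("**", "*")  # let '*' cross '/' boundaries
        return [k for k in keys if fnmatch(k, loose)]
    return [
        k for k in keys
        if fnmatch(k, pattern) and k.count("/") == pattern.count("/")
    ]

# Example against hypothetical keys from a bucket listing:
keys = [
    "path-to/resources/Observation-001.json",
    "path-to/resources/Observation-002.json",
    "path-to/resources/sub/Observation-003.json",
    "path-to/resources/Patient-001.json",
]
print(expand_wildcards("path-to/resources/Observation-*.json", keys))
print(expand_wildcards("path-to/**/Observation-*.json", keys))
```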
Josh Mandel (Sep 14 2019 at 19:56):
Conformance
- How does a server advertise support for bundle processing?
- For mixed-type input files?
- For data storage modes (storageDetail.type)?
Josh Mandel (Sep 14 2019 at 19:58):
We can fill out placeholders for cloud bucket services and sftp.
Josh Mandel (Sep 14 2019 at 19:58):
Proposal: define a conformance discovery process (.well-known/???) for clients to learn this stuff.
Brian Postlethwaite (Sep 14 2019 at 19:59):
(and local folder/network folder for internal organization processing/sharing)
Josh Mandel (Sep 14 2019 at 19:59):
And we should allow a server to declare in its CapabilityStatement that it supports $import.
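A hedged sketch of how a client could check for that, assuming the server lists the operation under CapabilityStatement.rest[].operation with the name "import" (the exact name and canonical definition a server should use are not settled):
```python
import requests

def server_supports_import(fhir_base: str) -> bool:
    """Check a server's CapabilityStatement for a system-level $import operation.

    Assumes the operation is advertised under rest[].operation with name
    'import'; the definition URL to match against is still to be decided.
    """
    resp = requests.get(
        f"{fhir_base}/metadata", headers={"Accept": "application/fhir+json"}
    )
    resp.raise_for_status()
    capability = resp.json()
    for rest in capability.get("rest", []):
        for op in rest.get("operation", []):
            if op.get("name") == "import":
                return True
    return False
```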
Josh Mandel (Sep 14 2019 at 20:10):
Semantics of batch, transaction, etc.
Do we want to define these models more formally, so a client can request, e.g., transactional import semantics? (And any given server might support some modes and not others.)
Allow clients to choose among the following three transactional modes:
- Best-effort across the import (<-- This is where we are today)
- A set of smaller files, with all-or-nothing behavior per-file (<-- This is really practical).
- All-or-nothing across the import
Josh Mandel (Sep 14 2019 at 20:10):
Do we have use cases that would help us make better decisions about requirements, for imports beyond the simple "best effort" stuff?
Josh Mandel (Sep 14 2019 at 20:25):
Proposal: processBundleAs=collection|transaction to allow clients to request finer-grained transaction boundaries (i.e., via individual Bundle files or ndjson files full of Bundles).
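Under this proposal, the choice would surface as one more kick-off parameter; a sketch, with processBundleAs as a proposed name rather than anything standardized:
```python
def with_bundle_mode(parameters: dict, mode: str) -> dict:
    """Attach the proposed processBundleAs parameter to a kick-off payload.

    mode is 'collection' (unwrap entries, best effort) or 'transaction'
    (each Bundle is an all-or-nothing unit). Both the name and the allowed
    values come from this proposal, not from a published spec.
    """
    if mode not in ("collection", "transaction"):
        raise ValueError("mode must be 'collection' or 'transaction'")
    parameters.setdefault("parameter", []).append(
        {"name": "processBundleAs", "valueCode": mode}
    )
    return parameters
```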
Josh Mandel (Sep 14 2019 at 20:27):
For referential integrity, should servers rewrite dangling links with fullUrls back to source materials? Leave them dangling? Fail the import altogether? We'll plan to leave this set of decisions until later.
Josh Mandel (Sep 14 2019 at 20:35):
Do clients need to call delete to clean things up / indicate finished?
@Brian Postlethwaite can you submit a PR for this as a suggestion on $export?
Brian Postlethwaite (Sep 14 2019 at 20:41):
(It's actually on the core spec too in the async docco - so I'll log a gforge)
Nagesh Bashyam (Sep 17 2019 at 14:15):
@Josh Mandel @Dan Gottlieb
Has anything been finalized w.r.t. the four different options and processBundleAs=collection|transaction?
I saw the proposal, but wanted to track it for some of our work in DaVinci.
The use case that I am working on in DaVinci is as follows:
Payers create a list of patients who are attributed to different individual or organizational providers based on their coverage, claims and other information. They will be giving this list of patients to provider organizations via a Group Resource. Once the provider organization receives this Group resource, we are thinking that the provider organizations can invoke the Group/[id]/$export to get the relevant data for all the patients in the list. We are thinking it may be good to process the data associated with each patient as a transaction or a collection where a bundle may be useful.
Feel free to correct us on the thinking or the usage.
Josh Mandel (Sep 17 2019 at 14:19):
Right now, just captured in notes. But to be clear, the discussion was on $import; is the provider org in your example invoking an $import operation somewhere?
Nagesh Bashyam (Sep 17 2019 at 14:26):
No, the provider org and the payers are not planning on an $import. There are many other reconciliation processes they go through before finally using/persisting the received data. Since the discussion was interesting and may help other needs where processing using bundles (all related resources for a patient) instead of ndjson files (organized by resource type) may be advantageous, we wanted to let the group know about the use case. We will follow the discussion as it proceeds.
Michele Mottini (Sep 17 2019 at 15:52):
I'd rather not have two different ways to represent the same data in bulk export - as individual resources and as Bundles. The client importing the data has to resolve references regardless.
Josh Mandel (Sep 17 2019 at 16:02):
There are pretty strong use cases for each, though; this comes from the Google team's production import experience (e.g., folks want to be able to import patient $everything bundles).
Paul Church (Sep 17 2019 at 16:12):
But whether $export should offer bundled output is a separate question. I haven't formed a strong opinion about that yet - the use case in import is driven by data that didn't come from $export.
If the use case is to sync data between servers while protecting referential and transactional integrity, I could see some utility in having the exporting server bundle up resources according to what it knows about their dependencies.