Stream: bulk data
Topic: storage location parameter
Michael Hansen (Feb 20 2019 at 22:27):
We are working on implementing the bulk FHIR spec for the Microsoft FHIR server, and one request that we have had is the ability to pass a parameter to $export which would specify a storage location and credentials for accessing it. For Azure, this could be in the form of a SAS token. AWS has a similar concept, although I forget the name. The use case is that often you would want the exported data to land somewhere where it could be consumed immediately by analytics, etc. In the current spec, we would have to move the data after export as opposed to just sending it directly to a storage location where it is needed. It also makes it easier to work with the $export operation for servers that may not have large disks attached, e.g. running in a container without attached disk storage, etc.
I know we are a bit late in the game, but would there be any appetite for a _destination parameter, which could be a URL that includes some access token?
@Josh Mandel , @Dan Gottlieb , @Kurt Ericson I know you have expressed some interest in this topic in the past and I was wondering what you think?
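To make the shape of the request concrete, here is a minimal sketch of a kickoff call with such a parameter; the _destination name is only the proposal above, and the server URL and SAS-token format are hypothetical:

```python
import requests

# Hypothetical: a container URL with an Azure SAS token appended, granting
# the FHIR server write access to client-owned storage.
destination = (
    "https://myaccount.blob.core.windows.net/exports"
    "?sv=2018-03-28&ss=b&sp=w&sig=..."
)

# Bulk export kickoff as in the draft spec, plus the proposed _destination.
response = requests.get(
    "https://fhir.example.com/$export",
    params={"_destination": destination},
    headers={
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",
    },
)

# The server would acknowledge with 202 Accepted and a Content-Location
# header pointing at the status endpoint, as in the existing $export flow.
print(response.status_code, response.headers.get("Content-Location"))
```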
Josh Mandel (Feb 20 2019 at 22:49):
Thanks @Michael Hansen! One of the key issues we've discussed in the past with this kind of thing is how to make it interoperable. If a client comes with a certain _destination in mind (e.g. "please submit the files to my AWS S3 bucket"), then:
1. what context is required to make a destination work robustly, supporting retries, credential rotation, storage / filesystem limitations, etc.
2. is the context supplied all in-band, or is some out of band?
3. Do servers need to support all destinations compatible with the spec, or can they pick/choose?
4. If the server can pick and choose destinations, what common denominator can we rely on to ensure it's possible to write a generic works-anywhere client?
Josh Mandel (Feb 20 2019 at 22:52):
The introduction of a client-supplied destination also raises some questions about how to communicate completion status to the client; the status payloads we've defined so far are based on the idea that we list files and URLs; would we impose the expectation that any valid destination would need to support some kind of URL-based scheme, or redefine the status payload?
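For context, the status-complete payload defined so far amounts to a list of server-hosted file URLs; a rough sketch (field names approximate to the draft spec of the time):

```python
# Approximate shape of the draft bulk-export status-complete response:
# the server lists URLs it hosts, which the client then downloads.
status_complete = {
    "transactionTime": "2019-02-20T22:49:00Z",
    "request": "https://fhir.example.com/$export",
    "requiresAccessToken": True,
    "output": [
        {"type": "Patient", "url": "https://fhir.example.com/files/patient_1.ndjson"},
        {"type": "Observation", "url": "https://fhir.example.com/files/observation_1.ndjson"},
    ],
    "error": [],
}
```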
Paul Church (Feb 20 2019 at 23:02):
See also https://github.com/smart-on-fhir/fhir-bulk-data-docs/issues/20 for past discussion
Grahame Grieve (Feb 20 2019 at 23:03):
it's not necessary to propose that every server has to support everything - there can be a discovery process to find out what is supported so that the spec isn't limited to lowest common denominator
Paul Church (Feb 20 2019 at 23:29):
TL;DR for Google Cloud it's pretty convenient if the client tells us where to put the output - but at the end of that previous discussion there was a suggestion of having the server configuration specify a default cloud storage location on whatever cloud the server is on, so the existing pattern (without _destination) still works.
Josh Mandel (Feb 20 2019 at 23:54):
Re: who supports what, the goal is not to ensure that every server supports everything, but to ensure that it is possible to build a generic client that works across the board. In other words, to ensure that there is some properly functional subset of the API that is consistently supported.
Michael Hansen (Feb 21 2019 at 00:36):
Again, my apologies for coming in late to the discussion; there has clearly been a lot of work on this topic already. It seems that most of the concerns are around being able to provide a single parameter that would work generally, and then how to figure out what a server is supposed to support. So one possible option could be that a client supplies two parameters:
_storageProvider: would default to internal, in which case the _storageParameters below would be ignored. A server could also allow others, e.g. GCP, S3, Azure, and there would be a property bag for each of them.
_storageParameters: a list of parameters for the given storage provider.
If a client tries to use a provider that is not implemented on a given server, then it is an error.
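As a minimal sketch of what such a kickoff request could look like, assuming an Azure provider is registered on the server; the parameter names, the flat key=value encoding of the property bag, and the SAS token are all hypothetical:

```python
import requests

# Hypothetical kickoff: _storageProvider selects a server-known provider,
# _storageParameters carries the provider-specific property bag.
response = requests.get(
    "https://fhir.example.com/$export",
    params={
        "_storageProvider": "azure",
        # The encoding of the property bag is unspecified; a repeated
        # key=value query parameter is assumed purely for illustration.
        "_storageParameters": [
            "container=https://myaccount.blob.core.windows.net/exports",
            "sasToken=sv=2018-03-28&ss=b&sp=w&sig=...",
        ],
    },
    headers={"Accept": "application/fhir+json", "Prefer": "respond-async"},
)

# A 4xx error would be expected if the server does not implement
# the requested provider.
print(response.status_code)
```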
1. what context is required to make a destination work robustly, supporting retries, credential rotation, storage / filesystem limitations, etc.
I think those problems are there with the internal storage too. As a practical matter, most production servers would have some attached storage for this sort of thing, and that storage would have limitations. One of the problems in a cloud service scenario is exactly that of credential rotation, etc. If the user provides the storage out of band and attaches it somehow to the service, then what if credentials rotate? Being able to pass it in would allow the FHIR server to write somewhere else and not have to have such privileged access as being able to retrieve keys, etc.
2. is the context supplied all in-band, or is some out of band?
So my proposal would be to supply the storage provider as one argument, have it default to internal, and supply the other parameters in a parameter bag separately.
3. Do servers need to support all destinations compatible with the spec, or can they pick/choose?
I would say they can choose. A discovery mechanism would be nice. Could it be in the capability statement?
4. If the server can pick and choose destinations, what common denominator can we rely on to ensure it's possible to write a generic works-anywhere client?
I think that the client would be able to choose destinations, but the generic works-anywhere client would be the one where _storageProvider defaults to internal, and servers SHOULD implement that.
The discussion also becomes important for $import, where uploading ndjson files through the REST API does not seem that practical.
Michele Mottini (Feb 21 2019 at 01:04):
How would that work with S3, for example... the client creates a bucket, then passes the address of that + some kind of credentials to the server, which uses them to write the files (which must be accessible to the client, though...), and on completion the server returns the S3 URLs of the files that it created?
Josh Mandel (Feb 21 2019 at 01:19):
This is the kind of detail we'll want to work through, yes. I'd like to see if we have some common understanding about whether all the parameters would need to be passed in band during this request, or whether there might be an assumption that certain things have been pre-configured for a specific client out of band.
Josh Mandel (Feb 21 2019 at 01:21):
Regarding retries, etc., when a server is responsible for its own temporary storage, it can take an approach that makes sense for its environment. If we just said "here's a fully authenticated URL, and you can post to it", we would be introducing a lot of issues about upload behavior, retries, etc.
Josh Mandel (Feb 21 2019 at 01:21):
It would be great to sketch out a complete end-to-end example of the kind of thing you have in mind for Azure blob storage, and we can think about how the status-complete payload would look, @Michael Hansen .
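Until then, as a strawman: a hypothetical sketch of a status-complete payload for a server that wrote directly to client-supplied Azure blob storage. Reusing the existing output array and setting requiresAccessToken to false are assumptions, not settled design:

```python
# Hypothetical status-complete payload when output was written to
# client-owned storage: the URLs point at the client's container rather
# than at files hosted by the FHIR server.
status_complete = {
    "transactionTime": "2019-02-21T01:21:00Z",
    "request": "https://fhir.example.com/$export?_storageProvider=azure",
    # The client already holds credentials for its own storage, so no
    # server-issued access token would be needed.
    "requiresAccessToken": False,
    "output": [
        {"type": "Patient", "url": "https://myaccount.blob.core.windows.net/exports/patient_1.ndjson"},
        {"type": "Observation", "url": "https://myaccount.blob.core.windows.net/exports/observation_1.ndjson"},
    ],
    "error": [],
}
```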
Josh Mandel (Feb 21 2019 at 01:24):
Certainly we can (and probably should) convey support for finer grained capabilities in some structured fashion (e.g. in the capability statement or smart-configuration json doc) -- but at the end of the day we need common agreement about what will be widely supported (a discovery protocol that allows you to discover that you can't connect is not particularly better than just trying and getting an error).
Paul Church (Feb 21 2019 at 03:07):
Re: token-based URLs to "send my files to that other cloud over there", I am nervous about rushing ahead with this idea just because it is possible. Does this have value to clients? Do FHIR servers not owned by cloud providers want to implement it? If I want to query a diverse bunch of FHIR servers and collect the results in S3, would I be better off just making the queries and pulling the data down through a VM in AWS where I can manage all of my own credentials?
Josh Mandel (Feb 21 2019 at 03:11):
@Paul Church that has certainly been our assessment so far.
Ricky Sahu (Feb 21 2019 at 03:37):
@Paul Church I also agree that we shouldn't rush the _destination param, because there are quite a few considerations to take into account if we want something that works securely cross-platform. I definitely think there is value for clients, but sending to a destination like S3 is something that you can already manage to do yourself without too much overhead at this point.
Brian Postlethwaite (Feb 21 2019 at 04:10):
With my implementation I'd certainly be streaming out of my database to somewhere local before I make it available elsewhere.
(Depending on my server's config, it will either be a managed local folder or, while in Azure, a blob store within that data centre, to ensure that the latency doesn't suck and slow the process of extracting the content.)
So the stream to another location just sounds like another step.
This would likely then affect the status reporting. Do I report that the file is available at my local URL (existing report) and hasn't been copied to the _destination yet?
Michael Hansen (Feb 21 2019 at 04:48):
A few questions to unpack. Overall there seem to be a few concerns:
1. Is it possible to provide options to specify this in a way that would still allow a straightforward option for clients that want to just do it the default way? I think absolutely. If one has the concept of selecting a storage provider and then providing parameters for it, the default would be internal and no parameters would be required. We are simply talking about providing the option.
2. Since this storage would not be managed by the FHIR server, what would be expected around retries, etc.? I think this would be up to the server. It could decide to fail immediately if remote storage is unavailable or have some retry policy. As a practical matter, a FHIR server running in a cluster or other managed infrastructure would not have physical drives attached directly and would have to manage remote storage that may be unavailable in any case. Just because we assume that the server will be using internal storage does not mean that it cannot fail or that the server does not need the capability to report back to the client that it failed because storage was unavailable.
3. How does a client discover available storage options? I agree that trying a specific option and failing does not seem appealing, but one could have a section of the capability statement. There could be a list of storage providers with their expected parameters. I am sure the major cloud vendors could supply those parameters, but it could also be flexible enough that the capability statement lists the required (and optional) parameters for each storage provider (see the sketch below).
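As a sketch of the discovery idea in point 3, a server could advertise its providers and their parameters in some machine-readable form; the structure below is entirely hypothetical:

```python
# Hypothetical fragment of a server's discovery document (e.g. a
# CapabilityStatement extension) listing supported storage providers
# and the parameters each one expects.
storage_capabilities = {
    "storageProviders": [
        {"name": "internal", "parameters": []},
        {
            "name": "azure",
            "parameters": [
                {"name": "container", "required": True},
                {"name": "sasToken", "required": True},
            ],
        },
        {
            "name": "s3",
            "parameters": [
                {"name": "bucket", "required": True},
                {"name": "region", "required": False},
            ],
        },
    ]
}
```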
@Michele Mottini . Yes, I did imagine something like that. Upon completion of the request the URLs for the files would refer to storage that a client application would have access to.
@Josh Mandel . We can try to put something more concrete together on our team to show what it might look like in the case of Azure. If a couple of folks from GCP, AWS, etc. would chime in, maybe a common pattern would emerge. I think it is worth exploring.
@Paul Church I would think this sort of capability has value for both clients and servers. Specifically, for servers, the value would be in not being responsible for maintaining a potentially very large internal storage capability, or for maintaining credentials for it as they may need to be rotated, etc. It could be problematic if a lot of $export requests were issued and data started piling up on the server; it could run out of space. For the client, I understand your point that one could simply spin up a VM, execute the query, and pull it down. However, that involves spinning up a VM, managing credentials, pulling down the data, etc. It can all be done, but to use your own phrase, I am not sure that is appealing "just because it is possible". In any case, I can certainly see situations where either model is an advantage. I can say from my perspective as both a server implementer and frequent client, I would prefer the option of passing in a storage location of my choice so that I can link easily with my downstream tooling. That doesn't mean that the situation could not be reversed for others.
@Brian Postlethwaite Again, I can certainly understand that some intermediary storage could come into play. But I think that would be a server implementation detail. If the only model is that files are first written to the server and then retrieved by the client, then it is guaranteed that the data will be copied at least one additional time. If the client can provide the storage, then the server could potentially choose to write it directly (or not) and avoid a copy.
An additional consideration for all of this is that once we start considering $import as well, it will come up again. First uploading large amounts of data and then issuing a $import request seems like an awkward workflow, but there are probably nice ways to work around that too. We actually started thinking about it in that scenario: one would add a URL (with some access token) to a storage location along with the $import operation. If that is the workflow, then mirroring it for $export seemed like a logical step.
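A rough sketch of that $import shape, with the operation's input parameter purely illustrative since no $import spec exists yet:

```python
import requests

# Hypothetical $import kickoff: instead of uploading ndjson through the
# REST API, the client points the server at pre-staged files.
import_request = {
    "resourceType": "Parameters",
    "parameter": [
        {
            "name": "input",
            # URL with an embedded access token (e.g. an Azure SAS token)
            # granting the server read access to the staged file.
            "valueUri": "https://myaccount.blob.core.windows.net/staging/patient_1.ndjson?sv=...&sp=r&sig=...",
        }
    ],
}

response = requests.post(
    "https://fhir.example.com/$import",
    json=import_request,
    headers={"Content-Type": "application/fhir+json", "Prefer": "respond-async"},
)
print(response.status_code)
```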
In any case, if there is interest, we can put some more details together and we can iterate. I do sense some significant push back against the idea, so we can table the discussion for now.
Brian Postlethwaite (Feb 21 2019 at 05:06):
You might also want to consider that storage and bandwidth in the cloud have a cost, and these are large things that will add up.
Passing an authenticated link to my cloud storage for someone to fill with an undefined amount of content... hmm.
Grahame Grieve (Feb 21 2019 at 05:07):
I think the idea is that the client passes a link to their own storage, and trusts the server not to fill it up with stuff they don't want
Brian Postlethwaite (Feb 21 2019 at 05:11):
Internally, yes I planned to share storage in exactly this way... and never download it, but that's where I am both the caller and sender, and I have full control over the environment that is being used.
Brian Postlethwaite (Feb 21 2019 at 05:13):
Maybe another option here is to have alternate representations of where the content is.
That way you can see where you can download it from as a web client, or the file share where it is on SMB, or the shared storage where it's located.
(And the sender/receiver have a way of negotiating that.) I'm fairly sure that most of these files won't just be sitting in a web server folder.
Michael Hansen (Feb 21 2019 at 05:33):
I think the idea is that the client passes a link to their own storage, and trusts the server not to fill it up with stuff they don't want
I guess as a server, one would have to trust that a client doesn't repeatedly call $export and bring down the server. Again, pros and cons. These problems exist regardless of the model.
Michael Hansen (Feb 21 2019 at 05:35):
You might also want to consider that storage and bandwidth in the cloud have a cost, and these are large things that will add up.
Passing an authenticated link to my cloud storage for someone to fill with an undefined amount of content... hmm.
Again, for a server, storage and bandwidth are also not free. And yes, these are large things, so expecting that the server will hold them for clients may be unreasonable as well.
John Stairs (Feb 21 2019 at 05:47):
We're going to prototype this with Azure blob storage (and maybe one or two others) to see what these parameters could look like.
Brian Postlethwaite (Feb 21 2019 at 10:09):
I'm happy to take a peek at it too if you want some extra review.
(I'm implementing at the moment - directory space)