Stream: bulk data
Topic: Gzip compression on Cerner staging server
Mikhail Lapshin (Sep 11 2018 at 16:19):
@Jenni Syed, are you considering enabling gzip HTTP compression on the Cerner staging Bulk API server? It would drastically reduce the amount of data being downloaded, from hundreds of megabytes to just tens.
Dennis Patterson (Sep 11 2018 at 19:28):
The current testing implementation is returning static links to direct downloads from S3. We could put those links behind CloudFront and enable gzip compression. I think it's also possible to just upload gzipped data but that'd limit access to uncompressed data for anybody who'd want it
nicola (RIO/SS) (Sep 11 2018 at 19:30):
I think most HTTP clients (including browsers) understand gzip :)
Jenni Syed (Sep 11 2018 at 19:34):
Part of this limitation is just because this is our beta implementation, which is not production ready. Gzip and several other HTTP-level "large file handling" considerations would be beneficial to add to the spec
Jenni Syed (Sep 11 2018 at 19:34):
eg: streaming :)
Jenni Syed (Sep 11 2018 at 19:34):
this sounds like a good topic for the connectathon :)
Jenni Syed (Sep 11 2018 at 19:36):
I know we've talked about chunking as well in the past, which is something we support on some of our other APIs for large file transfers (each chunk would then be gzip as well)
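As a rough illustration of the "each chunk gzipped" idea, here is a minimal Python sketch. It assumes the export is NDJSON and that chunks are cut on line boundaries; the function name `gzip_chunks` and the `lines_per_chunk` parameter are hypothetical and do not come from any Cerner API or from the spec.

```python
import gzip


def gzip_chunks(ndjson_bytes, lines_per_chunk):
    """Yield independently gzipped chunks, split on NDJSON line boundaries.

    Hypothetical helper: chunk size is counted in lines so that no
    resource is ever cut in half.
    """
    lines = ndjson_bytes.splitlines(keepends=True)
    for start in range(0, len(lines), lines_per_chunk):
        yield gzip.compress(b"".join(lines[start:start + lines_per_chunk]))


if __name__ == "__main__":
    data = b'{"id":"1"}\n{"id":"2"}\n{"id":"3"}\n'
    chunks = list(gzip_chunks(data, lines_per_chunk=2))
    # Each chunk is a complete gzip member and can be decoded on its own.
    print(len(chunks), gzip.decompress(chunks[0]))
```

Because every chunk is a complete gzip member, a client can decompress chunks one at a time as they arrive, and Python's `gzip.decompress` also accepts the members concatenated back to back.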
Jenni Syed (Sep 11 2018 at 19:44):
Also, I know you mentioned megabytes because that's likely the amount of data we have here. What we've seen in other, more realistic production settings gets into the gigabyte range
Jenni Syed (Sep 11 2018 at 19:45):
(with other bulk APIs we have that do similar types of operations)
Mikhail Lapshin (Sep 12 2018 at 10:35):
> The current testing implementation is returning static links to direct downloads from S3. We could put those links behind CloudFront and enable gzip compression. I think it's also possible to just upload gzipped data but that'd limit access to uncompressed data for anybody who'd want it
S3 can properly serve pre-gzipped files; you just need to set the 'Content-Type' and 'Content-Encoding' metadata on the S3 objects, as described in this post: https://medium.com/@graysonhicks/how-to-serve-gzipped-js-and-css-from-aws-s3-211b1e86d1cd
It took me almost a day to download all 5 GB of this dataset, that's why I'm so concerned :)
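For illustration, a minimal Python sketch of the pre-gzipped upload described above. The helper names, bucket and key are hypothetical, and the `application/fhir+ndjson` content type is an assumption about what the server would want to advertise. `Bucket`, `Key`, `Body`, `ContentType` and `ContentEncoding` are parameters of boto3's S3 `put_object` call; boto3 is a third-party package and is imported only inside the upload function.

```python
import gzip


def build_gzip_put_kwargs(bucket, key, ndjson_bytes):
    """Build keyword arguments for an S3 put_object call with gzipped content."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": gzip.compress(ndjson_bytes),
        # Assumed media type for the export files; adjust as needed.
        "ContentType": "application/fhir+ndjson",
        # Tells S3 to send "Content-Encoding: gzip" when the object is served.
        "ContentEncoding": "gzip",
    }


def upload_gzipped(bucket, key, ndjson_bytes):
    """Upload the object; needs boto3 installed and AWS credentials configured."""
    import boto3  # third-party, imported lazily so the helper above works without it

    boto3.client("s3").put_object(**build_gzip_put_kwargs(bucket, key, ndjson_bytes))


if __name__ == "__main__":
    kwargs = build_gzip_put_kwargs(
        "example-bucket", "Patient.ndjson", b'{"resourceType":"Patient"}\n'
    )
    print(kwargs["ContentEncoding"], len(kwargs["Body"]))
```

Note that S3 stores and returns exactly the bytes uploaded, so an object stored this way is always served gzipped, whatever the client sends in `Accept-Encoding`.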
John Moehrke (Sep 12 2018 at 14:51):
Is there use of HTTP/2, which includes multiplexing and automatic header compression?
Jenni Syed (Sep 12 2018 at 15:32):
We haven't talked about needing that yet in the spec (much like gzip, streaming, etc.), and it's a good discussion to have. I will say that HTTP/2 support is still a bit spotty. I think under 30% support it as of the last stat I heard? So it may not be a 100% win here.
John Moehrke (Sep 12 2018 at 15:44):
understood, just wanting it on the stack of things to consider... especially when the group is considering gzip.
Dennis Patterson (Sep 13 2018 at 14:02):
Set up a CloudFront distribution in front of our S3 bucket to auto-compress. Turns out they'll only do this for files smaller than 10MB, so it's not going to help :). We'll have to look at uploading pre-compressed contents
Josh Mandel (Sep 13 2018 at 15:05):
Interesting -- and that would be an API change (i.e., it changes what's returned). I was assuming Accept-Encoding: gzip would get us where we needed to be, but evidently not with S3 http hosting?
Dennis Patterson (Sep 13 2018 at 15:10):
Noting that this is all with pre-generated, mock data... Per @Mikhail Lapshin 's comments above, I think it'd be uploading the gzipped data, telling S3 to return Content-Encoding: gzip, and then when we return the list of files, they could be retrieved by requesting with Accept-Encoding: gzip. I think AWS's approach is more elaborate if you want to return various compressions (i.e. store them all pre-compressed in S3 and use Lambda@Edge to vary what gets returned according to Accept-Encoding...blah)
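To make the "vary what gets returned according to Accept-Encoding" part concrete, here is a minimal Python sketch of the selection logic only. It is not Lambda@Edge code and not AWS's implementation; the function name `pick_variant` is hypothetical, and it assumes only two stored variants exist (gzip and uncompressed). It falls back to the uncompressed variant when the header is missing, which is a conservative choice rather than something HTTP requires.

```python
def pick_variant(accept_encoding):
    """Return "gzip" or "identity" for a given Accept-Encoding header value."""
    if not accept_encoding:
        # No header: serve uncompressed, so clients that never asked for
        # gzip are not surprised by it.
        return "identity"

    weights = {}
    for item in accept_encoding.split(","):
        parts = item.strip().split(";")
        coding = parts[0].strip().lower()
        q = 1.0  # default quality value when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        if coding:
            weights[coding] = q

    # An explicit gzip entry wins; otherwise fall back to the "*" wildcard.
    gzip_q = weights.get("gzip", weights.get("*", 0.0))
    return "gzip" if gzip_q > 0 else "identity"


if __name__ == "__main__":
    for header in ("gzip, deflate", None, "gzip;q=0", "br, *;q=0.5"):
        print(repr(header), "->", pick_variant(header))
```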
Josh Mandel (Sep 13 2018 at 15:22):
In this scenario I think things would fail in the absence of Accept-Encoding: gzip -- because a client would get gzipped content regardless of what they requested.
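A client can guard against this case. The Python sketch below is one possible approach, not something the spec prescribes: it decompresses when the response declares `Content-Encoding: gzip`, and, as an extra assumption, also when no encoding is declared but the body starts with the gzip magic bytes. The function name `decode_body` is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(body, content_encoding=None):
    """Return the uncompressed payload of an HTTP response body."""
    encoding = (content_encoding or "").strip().lower()
    # Trust the header first; fall back to sniffing only when it is absent.
    if encoding == "gzip" or (not encoding and body[:2] == GZIP_MAGIC):
        return gzip.decompress(body)
    return body


if __name__ == "__main__":
    raw = b'{"resourceType":"Patient"}\n'
    print(decode_body(gzip.compress(raw), "gzip") == raw)
    print(decode_body(raw) == raw)
```

Many HTTP libraries already decompress transparently when the header is present, so a helper like this mostly matters for clients that read raw bytes.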
Dennis Patterson (Sep 13 2018 at 15:26):
Right, unless we did the work to support both, that's correct. This would be a connectathon-only limitation for our server, but from the very presence of this thread, I'm guessing that's what most clients want :)
Josh Mandel (Sep 13 2018 at 15:27):
It's definitely what most clients should want :-)
Last updated: Apr 12 2022 at 19:14 UTC