Stream: bulk data
Topic: Size limit
dsh (Jul 26 2021 at 02:24):
Hi, I'm new to bulk data export so this may be a naive question. Our server has close to 500K Condition resources and I want to fetch all of them to figure out the population distribution across Condition.code.
so when I issue this bulk data request
/$export?_outputFormat=ndjson&_type=Condition&_since=2020-01-01T00:00:00.000Z
I get only 200 Condition resources with no ability to paginate to the next 200, so is there a way to get all Condition resources in one bulk data export?
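For context, the Bulk Data Access spec has no "next page" link at all: the kick-off call is asynchronous, and the full result comes back as one or more NDJSON file links in a status manifest. A minimal sketch of that flow, assuming a hypothetical BASE_URL and no auth (the Prefer: respond-async header is required by the spec):

```python
# Minimal sketch of the FHIR Bulk Data export flow: kick-off, poll, download.
# BASE_URL is hypothetical; add whatever auth headers your server requires.
import time
import requests

BASE_URL = "http://localhost:8080/fhir"  # hypothetical HAPI endpoint

# 1. Kick-off: the spec requires an asynchronous request.
kickoff = requests.get(
    f"{BASE_URL}/$export",
    params={
        "_outputFormat": "ndjson",
        "_type": "Condition",
        "_since": "2020-01-01T00:00:00.000Z",
    },
    headers={
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",
    },
)
kickoff.raise_for_status()
status_url = kickoff.headers["Content-Location"]  # polling location

# 2. Poll until the job completes: 202 while in progress,
#    200 with a JSON manifest when done.
while True:
    status = requests.get(status_url, headers={"Accept": "application/json"})
    if status.status_code == 200:
        manifest = status.json()
        break
    time.sleep(5)

# 3. The manifest's "output" array lists one or more NDJSON files;
#    large exports are split across many files, not paginated.
for entry in manifest["output"]:
    ndjson = requests.get(entry["url"], headers={"Accept": "application/fhir+ndjson"})
    print(entry["type"], len(ndjson.text.splitlines()), "resources")
```

Counting lines across all output files should match the expected total; a manifest that only ever contains a single 200-row file suggests the job stopped after one page, which matches the behavior described below.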
dsh (Jul 26 2021 at 02:27):
I don't think it makes sense to change max_page_size: 200 in the server's application.yml just to get bulk data export to export everything :anguished:
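For reference, in the HAPI JPA starter that setting lives in the server's YAML config; the exact nesting below is an assumption and may differ between versions:

```yaml
# Hypothetical excerpt from the HAPI JPA starter's application.yaml;
# key nesting may vary by version.
hapi:
  fhir:
    max_page_size: 200   # caps the number of resources returned per page
```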
dsh (Jul 26 2021 at 08:47):
Any ideas?
Vassil Peytchev (Jul 26 2021 at 12:09):
Which server implementation are you using?
dsh (Jul 26 2021 at 16:02):
Vassil Peytchev said:
Which server implementation are you using?
JPA server 5.3.0
Vladimir Ignatov (Jul 26 2021 at 16:16):
That is probably a HAPI server then? Assuming your page size is 200, you should get multiple files with 200 conditions in each of them.
Or perhaps only 200 have been modified since 2020-01-01? Anyhow, if you want to get all the conditions, then try removing the _since parameter.
dsh (Jul 26 2021 at 16:19):
First I tried without _since but it didn't work, and there are about 500K Condition resources modified since 2020-01-01 ... then, out of frustration, I increased the max_page_size parameter to 1 million to fetch all Conditions.
dsh (Jul 26 2021 at 16:21):
I am not sure if this is a bug, but if others can confirm and/or tell me what I might be doing wrong, that would help.
Vladimir Ignatov (Jul 26 2021 at 16:22):
Were you only getting a single file link in your export manifest (before increasing the limit)?
dsh (Jul 26 2021 at 16:25):
Vladimir Ignatov said:
Were you only getting a single file link in your export manifest (before increasing the limit)?
Yes, and that was the weird part.
dsh (Jul 26 2021 at 16:27):
After I increased the limit I got about 418 links.
Vladimir Ignatov (Jul 26 2021 at 16:34):
- With 500K records and a limit of 200 you should have gotten 2.5K links (if they fit into the manifest response size limit)
- With 500K records and a limit of 1M you should have gotten 1 link to a file with 500K rows
That is just the simple math, without the internal implementation details that could affect how pagination really works. I would suggest going to the HAPI GitHub and searching for this issue (and maybe posting a new one if you don't find anything).
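For reference, a completed export's status response is a JSON manifest shaped roughly like this (the URLs are hypothetical); with a 200-row page size and 500K records, the output array should hold about 2,500 entries rather than one:

```json
{
  "transactionTime": "2021-07-26T16:30:00Z",
  "request": "http://localhost:8080/fhir/$export?_type=Condition",
  "requiresAccessToken": false,
  "output": [
    { "type": "Condition", "url": "http://localhost:8080/fhir/binary/conditions_1.ndjson" },
    { "type": "Condition", "url": "http://localhost:8080/fhir/binary/conditions_2.ndjson" }
  ],
  "error": []
}
```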
dsh (Jul 26 2021 at 16:36):
Vladimir Ignatov said:
- With 500K records and a limit of 200 you should have gotten 2.5K links (if they fit into the manifest response size limit)
- With 500K records and a limit of 1M you should have gotten 1 link to a file with 500K rows
That is just the simple math, without the internal implementation details that could affect how pagination really works. I would suggest going to the HAPI GitHub and searching for this issue (and maybe posting a new one if you don't find anything).
Your logic makes sense ... but that's not how the server behaved ... this may be a bug ... I will search on GitHub.
Robert Scanlon (Jul 26 2021 at 16:56):
You could also try asking this question over in the #hapi stream (if that is what you are using), which may catch the attention of someone who knows the specifics of its bulk data implementation.