Stream: hapi
Topic: 200% CPU usage
Ye Wang (Mar 01 2018 at 18:59):
We are using the HAPI 3.1.0 JPA server in Docker. Without any queries running, we saw more than 200% CPU usage on an 8-core server. There are two lines in the log file that look suspicious:
2018-03-01 18:50:53.351 [scheduledExecutorService-3] INFO ca.uhn.fhir.jpa.dao.SearchBuilder [SearchBuilder.java:2201] Query found 0 matches in 141937ms for query 29762bdf-b4d0-4e4c-a9f8-8b6ee336d2fb
2018-03-01 18:50:53.359 [scheduledExecutorService-5] INFO ca.uhn.fhir.jpa.dao.SearchBuilder [SearchBuilder.java:2196] Initial query result returned in 141944ms for query 3c80150d-1ca1-4c6f-a4f1-568b03c6c974
We emptied the data and still saw the same messages and similar CPU usage numbers.
I wonder if anyone has experienced similar problems. If this is normal behavior for HAPI, is there a recommended configuration to improve the performance? Thanks.
Ye
James Agnew (Mar 04 2018 at 02:12):
Hi @Ye Wang that is definitely not normal behaviour.
Could you grab a thread dump while it's in that state?
James Agnew (Mar 04 2018 at 02:13):
BTW- It might be worth upgrading to 3.2.0. There are no bugs I know of in 3.1.0 that would cause this but it never hurts to be on the latest release.
Ye Wang (Mar 05 2018 at 20:13):
@James Agnew We haven't had a chance to upgrade to 3.2.0 yet. Attached is the thread dump file: hapi-fhir-500patients-threaddump.log
Ye Wang (Mar 05 2018 at 20:24):
@James Agnew What we are trying to do here is to load data for 10,000 patients, with about 26 million Observation resources in total. The loading rate is only a few resources per second; from my understanding, it is probably the validations that slow the process down. So far we have got 1 million Observations in, and at the current rate it would take weeks to get all the data in. Do you have suggestions on how to speed things up? Querying these 1 million resources is also somewhat slow: it takes 200 ms to return the first 10 resources, while on another similar server with the 77k sample Observation dataset it only takes 50 ms. We then discovered the unusually high CPU usage that led to this post. Thanks for your help!
Joel Schneider (Mar 06 2018 at 06:44):
There has been at least one previous case where, after careful analysis, a JPA-related CPU utilization problem in HAPI was resolved by replacing <dao>.saveAndFlush() with <dao>.save().
(see also: https://github.com/jamesagnew/hapi-fhir/pull/575/files)
If the JPA cache contains a large number of objects, repeated calls to <dao>.saveAndFlush() can be very CPU intensive.
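Joel's point can be illustrated with a minimal sketch. The repository below is a stub defined here purely for illustration - it is not HAPI's actual DAO code, and the class and method names are invented - but it mirrors the Spring-Data-style contract: saveAndFlush() forces a synchronous flush on every call, while save() lets the persistence context defer and batch the work into a single flush.

```java
import java.util.ArrayList;
import java.util.List;

// Stubbed Spring-Data-style repository that counts flushes, to show the
// cost difference between save() and saveAndFlush() inside a loop.
class StubRepository<T> {
    final List<T> pending = new ArrayList<>();
    int flushCount = 0;

    // save() only queues the entity; the flush happens later, in bulk.
    T save(T entity) { pending.add(entity); return entity; }

    // saveAndFlush() forces a flush immediately; with a large persistence
    // context, each flush re-dirty-checks every cached object.
    T saveAndFlush(T entity) { save(entity); flush(); return entity; }

    void flush() { flushCount++; pending.clear(); }
}

class FlushDemo {
    static int flushesUsingSaveAndFlush(int n) {
        StubRepository<String> repo = new StubRepository<>();
        for (int i = 0; i < n; i++) repo.saveAndFlush("resource-" + i);
        return repo.flushCount; // one flush per entity
    }

    static int flushesUsingSave(int n) {
        StubRepository<String> repo = new StubRepository<>();
        for (int i = 0; i < n; i++) repo.save("resource-" + i);
        repo.flush(); // single flush at commit time
        return repo.flushCount;
    }
}
```

With a real provider the per-flush cost grows with the size of the persistence context, which is why the difference becomes dramatic only when many objects are cached.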
James Agnew (Mar 07 2018 at 22:24):
Hi, sorry, one thing I did not realize from your initial message is that you are using Derby. If you want to store multiple millions of records, you should definitely replace Derby with something more scalable (e.g. Postgres).
Derby gets really flaky over a certain size, and that's probably what's happening here.
Ye Wang (Mar 07 2018 at 23:27):
@Joel Schneider @James Agnew Thanks! We will take a look and report back if we make any progress.
Xiaocheng Luan (Mar 09 2018 at 18:38):
Hi @James Agnew, I'm on the same team as @Ye Wang. We switched to postgres (and also switched from jetty to tomcat, just a matter of personal preference). Here are some preliminary observations (and questions):
(1) The problem of 200% CPU usage (equivalent to fully using 2 cores) when idle is gone. I'm not sure whether that's due to the postgres switch or the servlet container switch, or both, or neither. The old instances using derby/jetty are still doing it, though.
(2) On the "load" (indexing) side, we are not observing significant differences in performance (yet)
(3) On the search side, when called using "curl", the postgres version does appear to be faster (question to follow below)
(4) When pointing the browser (Chrome) at the server and clicking between the resource types in the left nav, it takes at least 2-3 seconds, and the postgres process consumes a significant amount of CPU time during that whole period.
Questions:
(1) On the load side, we ran two tests:
Generated sample dataset: roughly 70 bundles, totalling 30MB and 12,000 resources; took 10 minutes, which translates to 3MB/min, or 1,200 resources/min.
Real (more or less) dataset (subset): 1.2GB or 2.5M resources, roughly 200 resources per bundle/file; took 12 hours, which translates to 1.6MB/min or 3,500 resources/min. Roughly one or two references (to other resources) per resource.
These performance numbers aren't bad for an experimental product, but my question is: do they reasonably reflect the capacity of the product? That 1.2GB of data is from 455 patients; we have much more data, and it would take a week or more to load, if it can get there at all.
(2) On the search side, is search fully handled by Lucene, or does the database have a role there too? Any reason why it's so slow to click around the main page navigating between resources? The interface is helpful for exploring the data, but it's just painfully slow, even for <500 patients.
(3) Limited profiling shows most of the (loading) time was spent in the Hibernate layer, which isn't that surprising. Do you agree with that observation, and if so, do you think it's reasonably simple (for us) to remove that layer? We would lose the beauty of the JPA layer, of course.
(4) Do you see any (easy) way to improve the performance significantly - by "significant" I mean by a factor of X, if not an order of magnitude? Along those lines, do you have any info on using Elasticsearch as the persistence layer?
Thanks! Xiaocheng
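For scale, a back-of-the-envelope extrapolation from the figures in question (1), assuming the ~3,500 resources/min rate holds for the ~26 million Observations mentioned earlier in the thread (both numbers are quoted from this conversation, not measured here):

```java
// Rough load-time extrapolation from the throughput quoted in the thread.
class LoadEstimate {
    static double daysToLoad(long totalResources, double resourcesPerMin) {
        double minutes = totalResources / resourcesPerMin;
        return minutes / (60.0 * 24.0); // minutes -> days
    }
}
```

At ~3,500 resources/min, 26 million resources works out to a bit over five days of continuous loading, consistent with Xiaocheng's "a week or more" estimate once overhead and slowdowns are accounted for.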
James Agnew (Mar 10 2018 at 12:03):
Hi Xiaocheng,
A few points:
1. The default configuration of the FHIR server includes many, many search parameters and therefore lots of indexes, so writing is naturally not going to be lightning fast. The JPA server isn't a "big data" solution intended to ingest massive amounts of data at high speed; the optimization is definitely towards making that data easily usable and discoverable once it has been ingested. There are all kinds of things you can do to speed up ingestion depending on your use case, including changing the indexes/search parameters, multithreading the ingestion, and even potentially using ETL tools.
2. Lucene is used for terminology and fulltext searching. All other searching purely uses the DB.
3. If you don't use the JPA layer, you're using HAPI FHIR as a "plain server". HAPI was originally designed around this use case, actually, and lots of people use it this way. You need to supply your own storage logic in that case, of course, but for many use cases this is a good option.
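The "multithreading the ingestion" suggestion in point 1 can be sketched as below. This is a self-contained illustration, not HAPI code: uploadBundle is a stand-in callback for whatever actually POSTs a transaction bundle to the server (for example via the HAPI client), so the sketch runs without a server.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Submits bundles to a fixed thread pool so several uploads run in parallel,
// and returns how many bundles were processed.
class ParallelLoader {
    static int loadAll(List<String> bundles, Consumer<String> uploadBundle, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger loaded = new AtomicInteger();
        for (String bundle : bundles) {
            pool.submit(() -> {
                uploadBundle.accept(bundle); // POST the bundle (stubbed here)
                loaded.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return loaded.get();
    }
}
```

One caveat: with a JPA backend, concurrent writers can contend on database locks and indexes, so the speedup from adding threads is workload-dependent and worth measuring rather than assuming.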
Xiaocheng Luan (Mar 12 2018 at 18:43):
Thanks for the response, @James Agnew . Those are in line with my thoughts, or guesses, more accurately. I just started playing around with FHIR and/or HAPI. We may have more questions for you down the road. Thanks again.
Xiaocheng Luan (Mar 12 2018 at 22:31):
Hi @James Agnew , a separate question - does the HAPI search API support _revInclude, that is, search/get "Patient" but also want to (reverse) include the observations that refer to this patient? I tried but wasn't able to get any results. Also, the "capabilities list" as returned by the service does not list any _revInclude entries. The _include seems to work fine, I was able to include Patient in the Observations results. Thanks!
James Agnew (Mar 13 2018 at 20:18):
Yup, _revinclude is fully supported. Can you post a sample query you're using against the public server that isn't working the way you expect?
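For reference, a sample query of the kind James asks for can be built as a plain URL; note the parameter name is all lowercase, _revinclude. The base URL below is just an example (the public test server); substitute your own deployment.

```java
// Builds a Patient search URL that reverse-includes the Observations whose
// "subject" element references each matched patient.
class RevIncludeQuery {
    static String patientWithObservations(String base, String patientId) {
        return base + "/Patient?_id=" + patientId
                + "&_revinclude=Observation:subject";
    }
}
```

A GET on the resulting URL returns a Bundle containing the matched Patient plus any Observation resources that reference it.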
Xiaocheng Luan (Mar 14 2018 at 15:25):
Thanks @James Agnew ! I just double-checked my request spelling, and reverse include does work! Funny enough, there were a couple of misspellings in the HL7 docs (revInclude vs. revinclude, that is, camel case vs. all lowercase) and I happened to have copy/pasted the wrong one!
Igor Sirkovich (Mar 14 2018 at 16:04):
Good catch, I've submitted a change request to fix 2 instances with a capital "I" in the Search documentation.
Xiaocheng Luan (Mar 26 2018 at 15:24):
Thanks, @Igor Sirkovich
Xiaocheng Luan (Mar 26 2018 at 15:30):
Hi @James Agnew, in the documentation for version 3.2.0, under "JPA Server", there is an "Important Note" that says: "with a single table being used to hold resource bodies (which are stored as CLOBs, optionally GZipped to save space)". While looking inside the postgres database used by the HAPI 3.2.0 server, I'm not seeing a table that contains the resource bodies. Could you please help clarify? Thanks!
Last updated: Apr 12 2022 at 19:14 UTC