FHIR Chat · FHIR Performance: Concerns · implementers

Stream: implementers

Topic: FHIR Performance: Concerns


view this post on Zulip Karl M. Davis (Aug 31 2020 at 18:05):

I've been spending a lot of time this year thinking about FHIR performance. It's something we keep running into (painfully) at work and, from a survey of the topics here, I don't think we're the only ones.

I'd be curious to hear from anyone else out there who's running into performance problems with FHIR servers, clients, etc. If you, too, are (or have been) stressed about scale, performance, cost, etc. for your FHIR deployments, could you reply here or reach out via private message?

I imagine there are a lot of lessons-learned and strategies that can be shared to help each other and the community at large.

view this post on Zulip Karl M. Davis (Aug 31 2020 at 18:20):

I'll go first: we've seen performance-/scale-related problems with all of the following:

  • Storage size (and cost! oh my goodness, the cost...).
    • Initially mitigated this by moving away from HAPI's JPA storage engine, which was hilariously excessive for our needs.
    • More recently mitigated this by moving from AWS' RDS for PostgreSQL to AWS' RDS Aurora for PostgreSQL, which saves money if you have a lot of read replicas.
  • Very spiky demand, especially for the batch/bulk use cases we support.
    • Relatively simple to (kinda') fix: just start auto-scaling.
    • Two outstanding issues, though:
      • Scaling up is slow, particularly when accounting for JVM warm-up.
      • Latency starts to climb rapidly with CPU utilization, which leads to us wanting to leave boxes very under-utilized.
  • Serialization and deserialization of JSON FHIR bundles is expensive: really slows things down dramatically.
    • Our best solution for this by far is just to avoid it entirely: generate the JSON once such that downstream services won't have to modify it.
    • That still leaves the initial JSON serialization, though, which drives the vast majority of our services' latency.
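A rough way to see why latency climbs so quickly with CPU utilization is a textbook single-server queueing model (M/M/1). The model is an assumption made for illustration, not a measurement of any system in this thread, and the 20 ms service time below is a made-up figure.

```python
def mm1_mean_latency_ms(service_time_ms: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: S / (1 - rho).

    service_time_ms: average time to serve one request with no queueing.
    utilization:     fraction of capacity in use (rho), in [0, 1).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


if __name__ == "__main__":
    service_time_ms = 20.0  # hypothetical per-request service time
    for utilization in (0.10, 0.50, 0.80, 0.90, 0.95):
        latency = mm1_mean_latency_ms(service_time_ms, utilization)
        print(f"utilization {utilization:.0%}: mean latency ~{latency:.0f} ms")
```

Under this simplified model, going from 50% to 90% utilization multiplies mean latency by five, which fits the instinct to leave boxes under-utilized. Real services with multiple cores and JVM pauses will behave differently in detail.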

view this post on Zulip Karl M. Davis (Aug 31 2020 at 18:20):

From the search I mentioned earlier, it sounds like a lot of folks are also running into performance problems with terminology services and validation?

view this post on Zulip Vassil Peytchev (Aug 31 2020 at 18:34):

Is it possible to categorize the performance concerns that are related to a particular implementation (e.g. hapi), a particular platform (e.g. JVM), and general?

view this post on Zulip Karl M. Davis (Aug 31 2020 at 18:38):

A lot of them are likely related to HAPI, since we're using it. But part of my reason for asking here is to figure out what performance problems other folks are seeing with what they're using. Nothing's perfect; these things are always tradeoffs. :smile:

view this post on Zulip Michele Mottini (Aug 31 2020 at 18:44):

The only really FHIR-related performance issue we had was serialization - we rewrote the serialization code of the .NET library we use and now it is OK (not super-fast for sure, but negligible compared with everything else)

view this post on Zulip Michele Mottini (Aug 31 2020 at 18:45):

We had / have plenty of other perf challenges, but they are not really anything FHIR-specific

view this post on Zulip Karl M. Davis (Aug 31 2020 at 18:47):

Michele Mottini said:

We had / have plenty of other perf challenges, but they are not really anything FHIR-specific

I'd still be curious to hear about some of those! Even if they're not FHIR-specific, I bet a lot of other FHIR projects are running into them.

view this post on Zulip Michele Mottini (Aug 31 2020 at 18:52):

They are always 'I do this FHIR search and it is slow' or 'I do this bulk export and it is slow' - and then the underlying cause is missing indexes, or database not using them, or our code going to the database for each single resource instead of once per search page or.... - variety of system-specific things that have nothing particular to do with FHIR
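The "going to the database for each single resource instead of once per search page" problem can be sketched as follows. This is a minimal illustration using SQLite; the `resource` table, its columns, and the function names are hypothetical and not taken from any server discussed here.

```python
import sqlite3


def fetch_one_by_one(conn, ids):
    """Anti-pattern: one query per resource on the search page."""
    rows = []
    for resource_id in ids:
        row = conn.execute(
            "SELECT id, body FROM resource WHERE id = ?", (resource_id,)
        ).fetchone()
        if row is not None:
            rows.append(row)
    return rows


def fetch_page(conn, ids):
    """One query for the whole page, returned in the requested order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    found = dict(
        conn.execute(
            f"SELECT id, body FROM resource WHERE id IN ({placeholders})",
            list(ids),
        ).fetchall()
    )
    return [(i, found[i]) for i in ids if i in found]


def make_demo_db():
    """Build a small in-memory table of hypothetical Patient resources."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resource (id TEXT PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO resource VALUES (?, ?)",
        [(f"p{n}", f'{{"resourceType":"Patient","id":"p{n}"}}') for n in range(5)],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    page_ids = ["p3", "p1", "missing"]
    print(fetch_page(conn, page_ids))
```

Both functions return the same rows, but the second makes a single round trip per page. Databases cap the number of bound parameters per statement, so very large pages would need to be chunked.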

view this post on Zulip Karl M. Davis (Aug 31 2020 at 18:56):

Michele Mottini said:

They are always 'I do this FHIR search and it is slow' or 'I do this bulk export and it is slow' - and then the underlying cause is missing indexes, or database not using them, or our code going to the database for each single resource instead of once per search page or.... - variety of system-specific things that have nothing particular to do with FHIR

Interesting! Do you support a wide variety of searches/operations, then? (We don't, mostly to avoid exactly these kinds of problems.)

view this post on Zulip Michele Mottini (Aug 31 2020 at 19:05):

Yes, lots of resources with lots of search parameters: https://fhir.careevolution.com/Master.Adapter1.WebClient/fhir?prefix=fhir-r4

view this post on Zulip Michele Mottini (Aug 31 2020 at 19:05):

We have both financial and clinical data, that is probably rare

view this post on Zulip Karl M. Davis (Aug 31 2020 at 19:08):

Oh, those search parameter lists are giving me a real big case of the sads. Good on you for supporting all that! (Glad it's not me!)

view this post on Zulip Lloyd McKenzie (Aug 31 2020 at 19:16):

@Iryna Roy

view this post on Zulip Paul Church (Aug 31 2020 at 22:32):

The Google implementation has tough constraints in terms of performance. It's a fully managed service where customers have no access to the underlying storage (Spanner), so we have to anticipate everyone's use cases and load test ahead of them. At the same time we can't look at customers' data, so I never know exactly what people are doing with the service unless they tell me. It is multi-tenant and basically zero-configuration for scalability (except for quota limits) so a customer can show up and start dumping terabytes of FHIR into a store without warning. It's actually really cheap too. Billed entirely on usage.

We try to support all use cases, to the point where our conformance documentation lists the things we don't support instead of the things we do support. All DSTU2, STU3, and R4 resources. Almost all search parameters with a few exceptions that we haven't gotten to yet.

The most challenging use case is from large provider and/or payor entities that want to construct a secondary FHIR data layer that unifies numerous facilities, diverse EHRs, and many years of historical data in a single monolithic store that is updated in near-real-time and feeds an entire ecosystem of apps and analytics making thousands of queries per second. The scale is on the order of single-digit billions of resources today and double-digit billions very soon.

So we had, and continue to have, some performance challenges.

  • Indexing. The search index is a proprietary document index. We made the indexing asynchronous which is a bit unfortunate but at scale you have to hide the tail latency somewhere. Because it's a document index it doesn't really do joins, so chained search is limited.
  • Import cold-start. Importing a lot of resources into an empty store really needs some smart preallocation or it spends hours getting throttled while the storage layer incrementally adjusts without being aware of the full data size.
  • Index hotspotting. Any index on primary storage needs to be carefully looked at for what pattern it would be vulnerable to. Someone will eventually come up with a dataset that hits that vulnerability.
  • Auto-scaling of every layer of the system. All it takes to slow down everything is one layer not scaling rapidly enough.
  • Spiky demand in a multi-tenant system. Maintaining enough isolation to keep tenants from impacting each other while taking advantage of averaging out each tenant's spikes in traffic over a set of shared backends.
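To make the index hotspotting point concrete: in range-partitioned stores, keys that increase monotonically (timestamps, sequential ids) tend to send every write to the same range. One common mitigation is to prefix the key with a small hash-derived shard number. The sketch below is generic; it does not describe how the Google service or Spanner actually lays out its indexes, and the shard count and key format are assumptions.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 16  # hypothetical; real systems tune this to their partitioning


def sharded_key(resource_id: str, last_updated: str) -> str:
    """Build an index key whose prefix spreads writes across NUM_SHARDS ranges.

    Without the prefix, keys ordered by last_updated would all land at the
    "end" of the index while data is being ingested in time order.
    """
    digest = hashlib.sha256(resource_id.encode("utf-8")).digest()
    shard = digest[0] % NUM_SHARDS
    return f"{shard:02d}/{last_updated}/{resource_id}"


if __name__ == "__main__":
    keys = [sharded_key(f"obs-{n}", "2020-08-31T22:32:00Z") for n in range(1000)]
    per_shard = Counter(key.split("/")[0] for key in keys)
    print(sorted(per_shard.items()))
```

The cost of this design is on the read side: a time-range scan now has to fan out across every shard prefix and merge the results.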

view this post on Zulip Iryna Roy (Aug 31 2020 at 23:25):

Hello @Karl M. Davis, yes, we recently completed an analysis of various FHIR implementations and have some lessons learned re: performance and size of HAPI FHIR and implementations with similar architectures. I will try to summarize, and we are also working on a white paper related to the topic. Interesting to hear about Google's experience!

view this post on Zulip Grahame Grieve (Sep 01 2020 at 00:54):

on my server, I trade storage for turn around time, and pre-store ready to go json and xml for the common views of a resource. For most searches, I just stream the raw bytes into the bundle. The other obvious issue for me is speed of processing PUT/POST - the time taken comes from validation + indexing. I can turn validation off, but indexing... it's costly. Doing it later is something I've considered but not yet done.
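The idea of streaming pre-stored bytes into a bundle can be sketched like this: each resource's JSON is serialized once at write time, and search responses splice those bytes into a Bundle envelope without parsing them again. This is an illustration under stated assumptions, not the code of the server described above; the function name and the `(type, id, bytes)` tuple shape are hypothetical.

```python
import json


def searchset_bundle_bytes(stored_resources, base_url):
    """Assemble a searchset Bundle from already-serialized resource JSON.

    stored_resources: iterable of (resource_type, resource_id, json_bytes),
    where json_bytes was produced once when the resource was stored.
    """
    entries = []
    for resource_type, resource_id, raw in stored_resources:
        # Only the small fullUrl string is serialized per request.
        full_url = json.dumps(f"{base_url}/{resource_type}/{resource_id}")
        entries.append(
            b'{"fullUrl":' + full_url.encode("utf-8") + b',"resource":' + raw + b"}"
        )
    head = b'{"resourceType":"Bundle","type":"searchset"'
    if not entries:
        # FHIR JSON does not allow empty arrays, so omit "entry" entirely.
        return head + b"}"
    return head + b',"entry":[' + b",".join(entries) + b"]}"


if __name__ == "__main__":
    stored = [("Patient", "p1", b'{"resourceType":"Patient","id":"p1"}')]
    print(searchset_bundle_bytes(stored, "https://example.org/fhir").decode("utf-8"))
```

The trade-off is the one named in the message: extra storage for each pre-rendered form, in exchange for search responses that skip per-resource serialization.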

view this post on Zulip Karl M. Davis (Sep 01 2020 at 02:34):

Iryna Roy said:

Hello Karl M. Davis, yes, we recently completed an analysis of various FHIR implementations and have some lessons learned re: performance and size of HAPI FHIR and implementations with similar architectures. I will try to summarize, and we are also working on a white paper related to the topic. Interesting to hear about Google's experience!

I'd love to see the white paper when it's available! Have you published anything ahead of that, e.g. blog posts or podcast appearances or such?

view this post on Zulip Karl M. Davis (Sep 01 2020 at 02:39):

Grahame Grieve said:

on my server, I trade storage for turn around time, and pre-store ready to go json and xml for the common views of a resource.
For most searches, I just stream the raw bytes into the bundle.

I understand the tradeoff: that's the plan we started with for BB2, but had to abandon eventually as it became unworkable. Our experience with HAPI's JPA layer was about a... 25x storage increase over a relational representation. It got to the point where, even aside from storage costs, it was just going to take us months to write out that much data to the disk for our initial load.

The other obvious issue for me is speed of processing PUT/POST - the time taken comes from validation + indexing. I can turn validation off, but indexing... it's costly. Doing it later is something I've considered but not yet done.

Do you have any latency stats? I've found that folks have wildly divergent views on what "fast" is, so I'm always curious to anchor those opinions to real numbers a bit.
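One way to anchor "fast" to real numbers is to report percentiles rather than a single average. The following is a minimal sketch using only the standard library; the sample values are invented for the demo and do not come from any server in this thread.

```python
import statistics


def latency_summary(samples_ms):
    """Summarize request latencies (in milliseconds) as percentiles."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples_ms),
    }


if __name__ == "__main__":
    # Invented sample: mostly quick requests plus a slow tail.
    samples = [12, 14, 15, 15, 16, 18, 20, 22, 25, 30, 45, 80, 250, 900]
    print(latency_summary(samples))
```

Reporting p95 and p99 alongside the median makes tail latency visible, which an average tends to hide.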

view this post on Zulip Karl M. Davis (Sep 01 2020 at 02:44):

Paul Church said:

The Google implementation has tough constraints in terms of performance. ...

That's really fascinating! Like I was mentioning with Grahame, I'm super glad we were able to avoid having to deploy and support (and pay for the storage for!) a document store, so hats off to you for managing that successfully at scale! To say nothing of the nightmare that managing multi-tenant load spikes at that scale must be! (I know it's kinda' Google's whole schtick, but still.)

Do y'all have any published benchmarks or latency/performance SLAs?

I'm also wondering if you've found folks unable to cope with the constraints imposed by the delayed indexing? Eventual consistency is... fraught, for a lot of applications.

view this post on Zulip Paul Church (Sep 01 2020 at 17:35):

@Karl M. Davis The delayed indexing is primarily a problem for conditional operations. In our beta API you can do conditional operations with any search criteria, but the search isn't inside the transaction. This isn't very useful, especially for conditional create, so it will probably go away eventually. Instead we're working on an implementation where the only condition allowed is 'identifier' but it's transactional and performant. Identifier seems to address 99% of the use cases that we've run into.
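A conditional create restricted to an identifier can be made transactional by backing it with a unique constraint in primary storage rather than the search index. The sketch below uses SQLite to show the general idea only; it is not the Google implementation, and the table layout and function names are hypothetical.

```python
import sqlite3
import uuid


def open_store():
    """Create an in-memory store with a uniqueness rule on the identifier."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patient ("
        " id TEXT PRIMARY KEY,"
        " ident_system TEXT NOT NULL,"
        " ident_value TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " UNIQUE (ident_system, ident_value))"
    )
    return conn


def conditional_create(conn, system, value, body):
    """Create the resource only if no resource has this identifier.

    Returns (id, created). The lookup reads primary storage, so it does not
    depend on an asynchronously updated search index.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT id FROM patient WHERE ident_system = ? AND ident_value = ?",
            (system, value),
        ).fetchone()
        if row is not None:
            return row[0], False
        new_id = str(uuid.uuid4())
        # The UNIQUE constraint is the backstop: a racing duplicate insert
        # fails instead of silently creating a second resource.
        conn.execute(
            "INSERT INTO patient VALUES (?, ?, ?, ?)",
            (new_id, system, value, body),
        )
        return new_id, True


if __name__ == "__main__":
    conn = open_store()
    print(conditional_create(conn, "urn:example:mrn", "123", "{}"))
    print(conditional_create(conn, "urn:example:mrn", "123", "{}"))
```

Limiting the condition to one indexed key is what keeps the check cheap enough to run inside the write transaction; arbitrary search criteria would not have that property.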

Other than that, delayed indexing hasn't been too bad of a pain point. The median latency usually sits well under 1 second, so it's not like you have to wait a long time for search consistency.

We have internal latency targets but nothing published yet. FHIR has a lot of dimensions to a request that make it tricky to publish numbers with a reproducible methodology. If we say that a search takes 100ms, that's not going to hold once you bulk up the query with a bunch of chains and includes and put the system under heavy load.

view this post on Zulip Karl M. Davis (Sep 02 2020 at 02:08):

Paul Church said:

Karl M. Davis Instead we're working on an implementation where the only condition allowed is 'identifier' but it's transactional and performant. Identifier seems to address 99% of the use cases that we've run into.

Other than that, delayed indexing hasn't been too bad of a pain point. The median latency usually sits well under 1 second, so it's not like you have to wait a long time for search consistency.

Interesting!

We have internal latency targets but nothing published yet. FHIR has a lot of dimensions to a request that make it tricky to publish numbers with a reproducible methodology. If we say that a search takes 100ms, that's not going to hold once you bulk up the query with a bunch of chains and includes and put the system under heavy load.

Do you happen to know if there's anything in the GCP ToS that prevents benchmarking? I'm wondering if anyone else has published anything about it, even if it's just anecdotal.

view this post on Zulip Paul Church (Sep 02 2020 at 16:50):

My product counsel gets a migraine any time I advise someone on the contents of the ToS. I am certainly aware of customers doing performance testing but not of anything being published. Most customers are doing load tests tailored to their anticipated production workloads so they aren't likely to disclose the details.

view this post on Zulip ℭ𝔞𝔭⥠⦿𝔟𝔦𝔩𝔩 (Oct 19 2021 at 19:18):

What are people using to estimate the load from patients expected to use FHIR APIs via apps after adoption in Oct 2022?

What would the expected usage and call rates be?

view this post on Zulip Grahame Grieve (Oct 19 2021 at 19:31):

that depends ....


Last updated: Apr 12 2022 at 19:14 UTC