FHIR Chat · Error Status for Transient errors · bulk data

Stream: bulk data

Topic: Error Status for Transient errors


Cooper Thompson (Oct 05 2020 at 19:55):

One of the recent updates (section 5.3.3) was to let a server communicate whether an error response to a job status request indicated that the job failed, or that the error was transient and the client should check again later. As specified, if the error is transient, the server should return an OperationOutcome with a transient code. However, this can be a problem if the transient error is due to network issues in front of the FHIR server, for example if there is a reverse proxy and it can't talk to the backend FHIR server. If we flip the model, and instead specify that the FHIR server should return a specific OperationOutcome if a job has failed, and that any other error indicates a transient issue, then the client can also detect cases where the transient error prevents the FHIR server from responding at all.

Dan Gottlieb (Oct 06 2020 at 15:16):

That makes sense to me, but since errors where a server is unable to generate an OperationOutcome (OO) aren't necessarily transient, and ideally transient errors would return an OO, perhaps we should loosen the requirement in a more general way? For example, in 5.3.3 we could change "The server SHALL return a FHIR OperationOutcome resource in JSON format" to something like "When possible, the server SHOULD return a FHIR OperationOutcome resource in JSON format. If this is not possible, the server MAY return a plain text error message".
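To make the loosened wording concrete, here is a minimal client-side sketch of reading an error body that may be either an OperationOutcome in JSON or a plain text message. The function name `describe_error_body` and the output format are hypothetical and not taken from the IG; only `resourceType`, `issue`, `code`, and `diagnostics` are standard OperationOutcome fields.

```python
import json


def describe_error_body(body: str) -> str:
    """Summarize an error body that is either an OperationOutcome or plain text.

    Hypothetical helper: the name and the "code: diagnostics" output format
    are assumptions made for illustration only.
    """
    fallback = body.strip()
    try:
        parsed = json.loads(body)
    except ValueError:
        # Not JSON at all, e.g. a plain text message from a proxy.
        return fallback
    if not isinstance(parsed, dict) or parsed.get("resourceType") != "OperationOutcome":
        return fallback
    issues = parsed.get("issue")
    if not isinstance(issues, list):
        return fallback

    messages = []
    for issue in issues:
        if not isinstance(issue, dict):
            continue
        code = issue.get("code", "unknown")
        diagnostics = issue.get("diagnostics")
        # Include the free-text diagnostics only when the server supplied them.
        messages.append(f"{code}: {diagnostics}" if diagnostics else str(code))
    return "; ".join(messages) if messages else fallback
```

A client could log the returned string either way, without needing to know in advance which layer produced the error.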

Cooper Thompson (Oct 06 2020 at 17:35):

I'm mostly coming at this from the perspective of my day job, which deals with failures in the REST API that occur between the FHIR client and FHIR server. Consider a network topology something like this (considering only HTTP-aware actors): FHIR Server <-> API gateway <-> Reverse Proxy <-> Client. My day job often involves dealing with failures in the API Gateway or Reverse proxy layers, and those are not FHIR-aware, and thus don't generate OOs. Sometimes the issues are transient, but often not. If our solution focuses on communicating job failures explicitly, where the "else" case is treated as a transient failure, then our solution is resilient to issues with non-FHIR actors. If we explicitly communicate transient errors, but the "else" case is a job failure, then the client can't tell the difference between a job failure and a transient issue with a non-FHIR intermediary.

I'm totally aware that this sort of problem is not normally addressed by the FHIR spec or IGs.
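As a rough illustration of the flipped model described above, a client could treat a status error as a job failure only when it receives an explicit failure OperationOutcome, and treat everything else as transient. This is a sketch under assumptions: the thread does not define what the "specific OperationOutcome" for a failed job looks like, so here any OperationOutcome with at least one issue and no `transient` issue code counts as a failure. The function name `classify_status_error` is hypothetical.

```python
import json


def classify_status_error(body: str) -> str:
    """Return "failed" only for an explicit failure OperationOutcome.

    Assumption (not from the IG): an OperationOutcome with at least one issue
    and no issue coded "transient" signals job failure. Anything else,
    including HTML or plain text from a non-FHIR intermediary, is "transient".
    """
    try:
        parsed = json.loads(body)
    except ValueError:
        # Body is not JSON, e.g. an error page from a gateway or reverse proxy.
        return "transient"
    if not isinstance(parsed, dict) or parsed.get("resourceType") != "OperationOutcome":
        return "transient"
    issues = parsed.get("issue")
    if not isinstance(issues, list) or not issues:
        return "transient"
    for issue in issues:
        if isinstance(issue, dict) and issue.get("code") == "transient":
            # The FHIR server itself says the client should check again later.
            return "transient"
    return "failed"
```

With this rule, a `502` page generated by an API gateway is classified as transient and the client keeps polling, while only a FHIR-aware response can end the job.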

Dan Gottlieb (Oct 06 2020 at 19:07):

Yup, agree that supporting non-OO errors for non-FHIR-aware layers makes sense. Since these errors may be transient or reflect job failure though, wouldn't we want to relax the current OO requirement in the IG across all status errors?

Dan Gottlieb (Oct 06 2020 at 19:07):

I think the only distinction between what you're suggesting and the language I proposed above is whether or not a transient failure that is able to return an OO should do so, but I don't see what we gain by not encouraging this.

Karl M. Davis (Dec 07 2020 at 12:57):

Maybe just be explicit? "When possible, the server and its intermediary/infrastructure SHOULD..."


Last updated: Apr 12 2022 at 19:14 UTC