Stream: analytics on FHIR
Topic: Apache Drill
Fahim Shariar (Apr 28 2017 at 22:12):
Apache Drill – Direct Clinical BI on FHIR , seems like an interesting project .
At the end of this page http:/amia.analystseim.com/
There is a query for Patient's family name, gender, birth date:
SELECT P.'name'[0].family AS family_name, P.gender, P.birthDate FROM dfs.'/synthea/json/patient' P LIMIT 10
So will it be used just to implement FHIR search parameter ? Confused about the Analytics part on that ...
natus (Apr 30 2017 at 07:08):
Use case with SQL tools such apache drill are:
- search in the dataset, say build a cohort
- BI, allow explore the data aggregate result, tables, plots
- apply any FHIR tool over the ressources, data visualization...
all would benefit from a common data model
natus (Apr 30 2017 at 07:18):
The project you pointed (http://amia.analystseim.com/guide.html) does not satisfy FHIR specification:
References and JOIN
FHIR reference are formatted as ResourceType/id, e.g. Patient/1234. This means that >to link an Encounter to the associated patient, one must join:
Encounter.patient.reference to 'Patient/' || Patient.id
Possible, but slow. To speed things, in the parquet files we've added:
Patient.patient_reference
Encounter.encounter_reference
Then, they added encounter_reference element in the Encounter resource and so on
To me, this is a problem. Any better idea "FHIR compatible" to deal with that ?
Grahame Grieve (May 01 2017 at 01:19):
@Chris Grenz is best to comment about this
Chris Grenz (May 01 2017 at 03:18):
There's no requirement to optimize the references - as I mentioned one could add the concatenation in the query's join condition directly.
Chris Grenz (May 01 2017 at 03:21):
In practice, we generally use Drill to create optimized intermediate sets from the native FHIR datasets. If that's not acceptable in your use case, additional processing power could be brought to bear.
Chris Grenz (May 01 2017 at 03:22):
@Fahim Shariar - "analytics" comes with significantly more complex queries. Baby steps!
natus (May 01 2017 at 09:21):
There's no requirement to optimize the references - as I mentioned one could add the concatenation in the query's join condition directly.
Our requirement is to help drill users to join tables. Concatenation leads to errors when the reference is subject (Patient or Group) such encounter. Moreover this leads to overhead of scripting, and hardware consumption.
The workaround of using new elements such "patient.patient_refereence" is great but sadly resources are not FHIR compatible anymore.
What about adding extension for this purpose? eg :
"extension": [
{
"url": "http://example/patient_id/",
"valueid": 16624
}
]
Chris Grenz (May 01 2017 at 12:58):
Unfortunately that just moves the complexity around. Most resources won't include that extension, so you'd have to pre-process the data to ensure consistent inclusion, a much more complex (though very possible) process on a large set.
Concatenation can work without error - the concatenation happens on the "target" side, not the "referencing" side:
X.ResourceName || '/' || X.id AS ResourceRelativeURI
If multiple resources are possible in the reference, either you must make two (or more) outer joins, or you access all resources in the same scope.
natus (May 01 2017 at 14:31):
You are right, using the concatenation just works. But it decreases DRILL performances (I am talking about FHIR datasets with multiple billion of observations ) a lot comparing to both methods:
1) adding a new element as you(?) have made and described there http://amia.analystseim.com/guide.html . But this is not FHIR compatible.
2) adding an extension as I described in this thread. This is FHIR compatible.
By FHIR compatible I mean resources can be pushed in any FHIR tool, such vizualisation, validation and more. Both methods need a pre-processing step. At least moving json files into parquet is a pre-processing step. So while transforming into parquet format one can add those extensions. And the dataset is still FHIR compatible. Does this make sense ?
Fahim Shariar (May 01 2017 at 21:03):
SELECT COUNT(*) FROM dfs.pq.'Condition' WHERE 'Condition'.'code'.coding[0].'code' = '44054006'
that just caught my attention . Are you assuming there will only be just one Code in the whole CodableConcept array . There could be multiple , right ? what about that ?
@natus does Drill SQL query supports quering JSON arrays ?
Grahame Grieve (May 01 2017 at 21:03):
in RDF, there's a special relationship for this purpose
natus (May 01 2017 at 22:05):
@Fahim Shariar I guess so : the drill function flatten does the job in case of array (https://drill.apache.org/docs/flatten/)
SELECT d.id FROM (SELECT *, flatten(C.
code
.coding
) f FROMcondition.json
C) d WHERE d.f
.code
= '44054006';
would match a condition that would have many code (tested and approved)
Chris Grenz (May 06 2017 at 08:29):
Anyone interested in playing with Drill can use the demo environment for the FHIR Connectathon for the next 48 hours. Setup instructions here: http://amia.analystseim.com/setup.html
natus (May 06 2017 at 18:47):
Hi Chris,
THis works out of the box. What is your drillbit platform configuration ? 5 nodes cluster ?
Thanks
Chris Grenz (May 07 2017 at 11:46):
6 nodes, m4.4xlarge on AWS EC2.
natus (May 07 2017 at 14:22):
I have made some tests about the references.
SELECT count(1)
FROM dfs.pq.Patient
P
JOIN dfs.pq.Encounter
E ON (P.resourceType || '/' || P.id = E.patient.reference);
First of all, there is no performance impact when using the P.patient_reference = E.patient.reference
or ON (P.resourceType || '/' || P.ide = E.patient.reference)
. Result is between 16 and 20 seconds in each way.
Then the only advantage to add the patient_reference = 'Patient/0001' element is a syntactic sugar.
BTW, sadly the id regex definition is [A-Za-z0-9\-\.]{1,64} (from https://www.hl7.org/fhir/datatypes.html#id) . It does not allows assigning Patient.id = 'Patient/001' . This would be great to extend id definition to [A-Za-z0-9\-\.\/]{1,64}
Allowing slash in id regex would allow native easy join writing : P.id = E.patient.reference
nicola (RIO/SS) (May 07 2017 at 15:06):
No performance penalty for Drill until it will get some indexes functionality :)
natus (May 07 2017 at 17:31):
This is sadly true
In the meantime, if one want to join resources in any system, integers (1999) is the most effective way. (compared to string version "Patient/1999"). So your FHIR not friendly method is better (I mean adding to Encounter a patient element such `P.id = E.patient.id ̀ is the join method).
Fahim Shariar (May 13 2017 at 12:50):
@natus Can you write in sentence what are you trying to achieve that you wrote via Query . I can guess something like , Count number of encounter that every patient has ? and you said that "Result is between 16 and 20 seconds in each way." is it the query runtime ? and what the total number of records that you are joining and searching ?
It will be helpful if you provide some information about the dataset and your progress :)
natus (May 13 2017 at 20:53):
I was testing the dataset provided by the fhir hackaton analytics. The query was a simple test of join performances. I cannot remember exactly the resulting number of rows, but more than 10 milions. Conclusion was performances are definitively good with FHIR native model.
Last updated: Apr 12 2022 at 19:14 UTC