Better Living Through Better Prediction

It’s Minority Report again – but pre-health, instead of pre-crime.

This is work from Sinai, where they are applying unsupervised machine learning techniques to generate intelligence from the electronic health record.  In less technical terms, they’re building SkyNet – for good.

These authors selected patients with at least five records in their data warehouse in the years leading up to 2013 as their modeling cohort.  For these ~700,000 patients, they abstracted all encounters, diagnoses, clinical notes, medications, and structured order data.  These various data types were then pared down to approximately 41,000 features by excluding any that appeared in more than 80% of the records or in fewer than five records, and the remaining features were normalized for analysis.  The novelty in their approach was their specific unsupervised data abstraction, reducing each patient to a dense vector of 500 features.  They then selected patients with at least one new ICD-9 diagnosis recorded in their EHR in 2014, and divided them into validation and test cohorts for disease prediction.
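The frequency filter described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not the authors' code: the function name `filter_features`, its parameters, and the toy data are all hypothetical, and each patient is assumed to be represented as a set of feature names.

```python
from collections import Counter


def filter_features(records, max_fraction=0.8, min_count=5):
    """Return the features that are neither too common nor too rare.

    records: list of sets, one set of feature names per patient
             (an assumed representation, not the authors' format).
    """
    total = len(records)
    counts = Counter()
    for features in records:
        # Count each feature at most once per record.
        counts.update(set(features))
    return {
        name
        for name, count in counts.items()
        if count >= min_count and count / total <= max_fraction
    }


# Toy data: "common" appears in 9 of 10 records, "useful" in 5, "rare" in 1.
toy_records = (
    [{"common", "useful"}] * 5
    + [{"common"}] * 4
    + [{"rare"}]
)
kept = filter_features(toy_records)
print(kept)
```

In this toy example only "useful" survives: "common" exceeds the 80% ceiling and "rare" falls below the five-record floor.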

The results varied by diagnosis, but, most importantly, demonstrated their method appears superior to several other methods of abstraction – a non-abstracted “raw features” analysis, principal component analysis, Gaussian mixture model, k-means, and independent component analysis.  Using a random forest model for prediction, their abstraction method – “DeepPatient” – provided the best substrate for predicting future diagnoses.  For example, their method worked best on “diabetes mellitus with complications”, providing an AUC for this diagnosis of 0.907.  Other high-scoring disease predictions included various cancers, cardiovascular disorders, and mental health issues.
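For readers less familiar with the metric, an AUC can be read as the probability that a randomly chosen patient who goes on to receive the diagnosis is scored higher by the model than a randomly chosen patient who does not.  The sketch below computes it directly from that definition; the function name `auc` and the labels and scores are made up for illustration and have nothing to do with the paper's data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 if the patient later received the diagnosis, else 0.
    scores: the model's predicted risk for each patient.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case correctly ranked higher
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and risk scores for six patients.
example_labels = [1, 1, 0, 0, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
example_auc = auc(example_labels, example_scores)
print(round(example_auc, 3))
```

Here eight of the nine positive-negative pairs are ranked correctly, giving an AUC of 8/9.  The pairwise loop is quadratic in the number of patients, so a library routine would be preferable at any realistic scale.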

Much work remains to be completed before similar technology is applicable in a practical clinical context.  This application does not even account for the actual values of lab tests; it predicts outcomes only from the co-occurrence of other clinical features with the presence of a lab test result.  Prediction strength also varied greatly by disease process; it is likely a more restricted or lightly supervised model will outperform their generic unsupervised model with regard to specific short-term outcomes relating to emergency care.  And, of course, even once such models are developed, they will still require testing and practice refinement regarding the traditional challenges of balancing accuracy, risk tolerance, and resource utilization.

“Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records”