Oh, The Things We Can Predict!

Philip K. Dick presented us with a short story about the “precogs”, three mutants who foresaw all crime before it could occur. “The Minority Report” was written in 1956 – and now, 60 years later, we do indeed have all manner of digital tools to predict outcomes. However, I doubt Steven Spielberg will be adapting a predictive model for hospitalization for cinema.

This is a rather simple article looking at a single-center experience using multivariate logistic regression to predict hospitalization. It differs somewhat from the existing art in that it uses data available at 10, 60, and 120 minutes from arrival at the Emergency Department as the basis for its “progressive” modeling.

Based on 58,179 visits ending in discharge and 22,683 resulting in hospitalization, the specificity of their prediction method was 90%, with a sensitivity of 96%, for an AUC of 0.97. Their work exceeds prior studies mostly on account of improved specificity; the AUCs of a sample of other predictive models generally fell between 0.85 and 0.89.
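
For illustration only, here is a minimal sketch of this sort of “progressive” modeling in Python: a logistic regression refit as more information accumulates after arrival, scored by AUC. The feature names, prevalence, and data below are invented stand-ins, not the authors’ actual variables or code.

```python
# Illustrative sketch of "progressive" admission prediction - not the authors' model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000

# Hypothetical features available at 10, 60, and 120 minutes after ED arrival
X_10 = rng.normal(size=(n, 4))                        # e.g., age, triage acuity, vitals
X_60 = np.hstack([X_10, rng.normal(size=(n, 3))])     # + early orders placed
X_120 = np.hstack([X_60, rng.normal(size=(n, 3))])    # + later orders and entries
y = rng.binomial(1, 0.28, size=n)                     # admission outcome (toy prevalence)

for label, X in [("10 min", X_10), ("60 min", X_60), ("120 min", X_120)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{label}: AUC = {auc:.2f}")   # on real data, AUC should rise as time passes
```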

Of course, their model is of zero value to other institutions, as it overfits not only to this subset of data but also to the specific practice patterns of physicians in their hospital. Their results also could conceivably be improved, as they do not actually take into account any test results – only the presence of the order for such. That said, I think it is reasonable to expect similar performance from other temporal models for predicting admission built on these earliest orders and entries in the electronic health record.

For hospitals interested in improving patient flow and anticipating disposition, there may be efficiencies to be developed from this sort of informatics solution.

“Progressive prediction of hospitalisation in the emergency department: uncovering hidden patterns to improve patient flow”
http://emj.bmj.com/content/early/2017/02/10/emermed-2014-203819

Can We Trust Our Computer ECG Overlords?

If your practice is like my practice, you see a lot of ECGs from triage. ECGs obtained for abdominal pain, dizziness, numbness, fatigue, rectal pain … and some, I assume, are for chest pain. Every one of these ECGs turns into an interruption for review to ensure no concerning evolving syndrome is missed.

But a great number of these ECGs are read as “Normal” by the computer – and, anecdotally, are nearly universally correct. This raises a very reasonable question as to whether a human need be involved at all.

This simple study tries to examine the real-world performance of computer ECG reading, specifically, the Marquette 12SL software. Over a 16-week convenience sample period, 855 triage ECGs were performed, 222 of which were reported as “Normal” by the computer software. These 222 ECGs were all reviewed by a cardiologist, and 13 were ultimately assigned some pathology – of which all were mild, non-specific abnormalities. Two Emergency Physicians also then reviewed these 13 ECGs to determine what, if any, actions might be taken if presented to them in a real-world context. One of these ECGs was determined by one EP to be sufficient to put the patient in the next available bed from triage, while the remainder required no acute triage intervention. Retrospectively, the patient judged to have an actionable ECG was discharged from the ED and had a normal stress test the next day.

The authors conclude the negative predictive value of a “Normal” computer read approaches 99%, which could potentially lead to changes in practice regarding immediate review of triage ECGs. While these findings have some limitations in generalizability regarding the specific ECG software and a relatively small sample, I think they’re on the right track. Interruptions in a multi-tasking setting lead to errors of task resumption, while the likelihood of significant time-sensitive pathology being missed is quite low. I tend to agree this could be a reasonable quality improvement intervention with prospective monitoring.
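
As a back-of-the-envelope check on that figure, if we assume the single “actionable” ECG is the only clinically meaningful miss among the 222 computer-read “Normal” tracings (an assumption on my part, not the authors’ exact analysis), the negative predictive value and a confidence interval work out as follows:

```python
# Rough NPV check under the stated assumption - not the authors' analysis.
from statsmodels.stats.proportion import proportion_confint

normal_reads = 222
clinically_significant_misses = 1   # the single "actionable" ECG

npv = (normal_reads - clinically_significant_misses) / normal_reads
lo, hi = proportion_confint(normal_reads - clinically_significant_misses,
                            normal_reads, method="wilson")
print(f"NPV = {npv:.3f} (95% CI {lo:.3f}-{hi:.3f})")   # ~0.995
```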

“Safety of Computer Interpretation of Normal Triage Electrocardiograms”
https://www.ncbi.nlm.nih.gov/pubmed/27519772

The Machine Can Learn

A couple of weeks ago I covered computerized diagnosis via symptom checkers, noting their imperfect accuracy – and how grossly they underperform crowd-sourced physician knowledge. However, one area that continues to progress is the use of machine learning for outcomes prediction.

This paper describes advances in the use of “big data” for prediction of 30-day and 180-day readmissions for heart failure. The authors used an existing data set from the Telemonitoring to Improve Heart Failure Outcomes trial as substrate, and then applied several statistical and machine-learning models to the data with varying inputs.

There were 236 variables available in the data set for use in prediction, weighted and cleaned to account for missing data. Compared with the C statistic from logistic regression as their baseline comparator, the winner was pretty clearly Random Forests. With a baseline 30-day readmission rate of 17.1% and 180-day readmission of 48.9%, the C statistic for the logistic regression model predicting 30-day readmission was 0.533 – basically no predictive skill. The Random Forest model, however, achieved a C statistic of 0.628 by training on the 180-day data set.
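
To make the comparison concrete, here is a minimal scikit-learn sketch of the kind of head-to-head the authors ran – logistic regression versus a random forest, both scored by C statistic (AUC) on held-out data. The data here are synthetic stand-ins for the trial’s 236 candidate variables and ~17% 30-day readmission rate.

```python
# Illustrative comparison of C statistics - synthetic data, not the trial data set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=236, n_informative=30,
                           weights=[0.83, 0.17], random_state=0)  # ~17% "readmissions"
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=2000)),
                    ("random forest", RandomForestClassifier(n_estimators=500,
                                                             random_state=0))]:
    model.fit(X_tr, y_tr)
    c_stat = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: C statistic = {c_stat:.3f}")
```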

So, it’s reasonable to suggest there are complex and heterogeneous data for which machine learning methods are superior to traditional models. These are, unfortunately, pretty terrible C statistics, and almost certainly of very limited use for informing clinical care. As with most decision-support algorithms, I would also be curious to see a comparison with a hypothetical C statistic for clinician gestalt. However, for some clinical problems with a wide variety of influential factors, these sorts of models will likely become increasingly prevalent.

“Analysis of Machine Learning Techniques for Heart Failure Readmissions”
http://circoutcomes.ahajournals.org/content/early/2016/11/08/CIRCOUTCOMES.116.003039

The Mechanical Doctor Turk

Automated diagnostic machines consisting of symptom checklists have been evaluated in medicine before. The results were bleak:  symptom-checkers put the correct diagnosis first only 34% of the time, and had the correct diagnosis in the top three only 51% of the time.

However, when these authors published their prior study, they presented these findings in a vacuum – despite the poor performance, how did it compare against human operators? In this short research letter, then, these authors compare the symptom-checker performance against clinicians contributing to a sort of crowdsourced medical diagnosis system.

And, at least for a while longer, the human-machine is superior to the machine-machine. Humans reading the same vignettes placed the correct diagnosis first 72.1% of the time, and in the top three 84.3% of the time.

With time and further natural language processing and machine learning methods, I expect automated diagnosis engines to catch up with humans – but we’re not there yet!

“Comparison of Physician and Computer Diagnostic Accuracy.”
https://www.ncbi.nlm.nih.gov/pubmed/27723877

Finding the Holes in CPOE

Our digital overlords are increasingly pervasive in medicine. In many respects, the advances of computerized provider order-entry are profoundly useful: some otherwise complex orders are facilitated, serious drug-interactions can be checked, along with a small cadre of other benefits. But, we’ve all encountered its limitations, as well.

This is a qualitative descriptive study of medication errors occurring despite the presence of CPOE. This prospective FDA-sponsored project identified 2,522 medication errors across six hospitals, 1,308 of which were related to CPOE. These errors fell into two main categories: CPOE failed to prevent the error (86.9%) and CPOE facilitated the error (13.1%).

CPOE-facilitated errors are the most obvious. For example, these include instances in which an order set was out-of-date and a non-formulary medication order resulted in delayed care for a patient; interface issues resulting in mis-clicks or misreads; or instances in which CPOE content was simply erroneous.

More interesting, however, are the “failed to prevent the error” issues – which are things like dose-checking and interaction-checking failures. The issue here is not specifically the CPOE, but that providers have become so dependent upon the CPOE to be a reliable safety mechanism that we’ve given up agency to the machine. We are bombarded by so many nonsensical alerts, we’ve begun to operate under an assumption that any order failing to anger our digital nannies must be accurate. These will undoubtedly prove to be the most challenging errors to stamp out, particularly as further cognitive processes are offloaded to automated systems.

“Computerized prescriber order entry–related patient safety reports: analysis of 2522 medication errors”
http://jamia.oxfordjournals.org/content/early/2016/09/27/jamia.ocw125

Stumbling Around Risks and Benefits

Practicing clinicians contain multitudes: the vastness of critical medical knowledge applicable to the nearly infinite permutations of individual patients. However, lost in the shuffle, apparently, is a grasp of the basic fundamentals necessary for shared decision-making: the risks, benefits, and harms of many common treatments.

This simple research letter describes a survey distributed to a convenience sample of residents and attending physicians at two academic medical centers. Physicians were asked to estimate the incidence of a variety of effects from common treatments, both positive and negative. A sample question and result:

[Figure: treatment effect estimates]
The green responses are those that fell into the correct range for the question. As you can see, in these two questions, hardly any physician surveyed guessed correctly. This same pattern is repeated for the remaining questions – involving peptic ulcer prevention, cancer screening, and bleeding complications on aspirin and anticoagulants.

Admittedly, only a quarter of participants were attending physicians – though no gross differences in performance were observed across levels of experience. Additionally, some of the ranges are narrow, with small magnitudes of effect separating the “correct” and “incorrect” answers. Regardless, the general conclusion of this survey – that we’re not well-equipped to communicate many of the most common treatment effects – is probably valid.

“Physician Understanding and Ability to Communicate Harms and Benefits of Common Medical Treatments”
http://www.ncbi.nlm.nih.gov/pubmed/27571226

Better Living Through Better Prediction

It’s Minority Report again – but pre-health, instead of pre-crime.

This is work from Sinai, where they are applying unsupervised machine learning techniques to generate intelligence from the electronic health record.  In less technical terms, they’re building SkyNet – for good.

These authors selected patients with at least five records in their data warehouse in the years leading up to 2013 as their modeling cohort. Based on these ~700,000 patients, they abstracted all encounters, diagnoses, clinical notes, medications, and structured order data. These various data types were then pared down to approximately 41,000 features that neither appeared in >80% of the records nor in fewer than five records, and then these features were normalized for analysis. The novelty in their approach was their specific unsupervised data abstraction, reducing each patient to a dense vector of 500 features. They then selected patients with at least one new ICD-9 diagnosis recorded in their EHR in 2014, and divided them into validation and test cohorts for disease prediction.
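
The general flavor of that abstraction step – learning a dense patient vector by reconstructing noise-corrupted EHR features – can be sketched as below. The paper used stacked denoising autoencoders; this single-layer scikit-learn stand-in, with toy dimensions rather than the study’s ~41,000 inputs and 500-feature output, is purely illustrative.

```python
# Denoising-autoencoder-flavored abstraction, toy-sized - not the DeepPatient code.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_patients, n_raw, n_dense = 1000, 500, 50

X = (rng.random((n_patients, n_raw)) < 0.02).astype(float)  # sparse binary EHR features
X_noisy = X * (rng.random(X.shape) > 0.1)                    # mask 10% of inputs (denoising)

ae = MLPRegressor(hidden_layer_sizes=(n_dense,), activation="relu",
                  max_iter=100, random_state=0)
ae.fit(X_noisy, X)   # learn to reconstruct the clean features from the corrupted ones

# The hidden-layer activations become the dense patient representation
dense = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])
print(dense.shape)   # (1000, 50): one compact vector per patient
```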

The results varied by diagnosis but, most importantly, demonstrated their method appears superior to several other methods of abstraction – a non-abstracted “raw features” analysis, principal component analysis, Gaussian mixture modeling, k-means, and independent component analysis. Using a random forest model for prediction, their abstraction method – “DeepPatient” – provided the best substrate for predicting future diagnoses. For example, their method worked best on “diabetes mellitus with complications”, providing an AUC for this diagnosis of 0.907. Other high-scoring disease predictions included various cancers, cardiovascular disorders, and mental health issues.
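
Their evaluation scheme – the same downstream classifier scored on top of competing patient representations – looks roughly like the sketch below, here comparing only raw features against a PCA reduction on simulated data. It is a conceptual stand-in, not the DeepPatient pipeline itself.

```python
# Comparing representations as substrate for the same random forest - simulated data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy "future diagnosis" task standing in for one of the paper's ICD-9 targets
X, y = make_classification(n_samples=3000, n_features=400, n_informative=40,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pca = PCA(n_components=100, random_state=0).fit(X_tr)
representations = {
    "raw features": (X_tr, X_te),
    "PCA (100 components)": (pca.transform(X_tr), pca.transform(X_te)),
}

for name, (tr, te) in representations.items():
    rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(tr, y_tr)
    auc = roc_auc_score(y_te, rf.predict_proba(te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```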

Much work remains to be completed before similar technology is applicable in a practical clinical context. This application does not even specifically account for the actual values of lab tests, only the co-occurrence of other clinical features with the presence of a lab test result. Prediction strength also varied greatly by disease process; it is likely a more restricted or lightly supervised model will outperform their generic unsupervised model with regard to specific short-term outcomes relating to emergency care. And, of course, even as such models are developed, they will still require testing and refinement in practice regarding the traditional challenges of balancing accuracy, risk tolerance, and resource utilization.

“Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records”

Changing Clinician Behavior For Low-Value Care

I’ve reported in general terms several times regarding, essentially, the shameful rate of inappropriate antibiotic prescribing for upper respiratory infections.  Choosing Wisely says: stop!  However, aggregated data seems to indicate the effect of Choosing Wisely has been minimal.

This study, from JAMA, is a prospective, cluster-randomized trial of multiple interventions in primary care practices aimed at decreasing inappropriate antibiotic use.  All clinicians received education on inappropriate antibiotic prescribing.  Then, practices and participating clinicians were randomized either to electronic health record interventions of “alternative suggestion” or “accountable justification”, to peer comparisons, or combinations of all three.

The short answer: it all works. The complicated answer: so did the control intervention. The baseline rate of inappropriate antibiotic prescribing in the control practices was estimated at 37.1%. This dropped to 24.0% in the post-intervention period, reflecting a roughly constant linear downward trend throughout the study period. However, each intervention, singly and in combination, resulted in a much more pronounced drop in inappropriate prescribing. While inappropriate prescribing in the control practices had reached the mid-teens by the end of the study period, each intervention group was approaching a floor in the single digits. Regarding safety, only one of the seven intervention practice clusters had a significantly higher 30-day revisit rate than control.

While this study describes an intervention for antibiotic prescribing, the basic principles are sound regarding all manner of change management.  Education, as a foundation, paired with decision-support and performance feedback, as shown here, is an effective strategy to influence behavioral change.  These findings are of critical importance as our new healthcare economy continues to mature from a fee-for-service free-for-all to a value-based care collaboration.

“Effect of Behavioral Interventions on Inappropriate Antibiotic Prescribing Among Primary Care Practices”
http://www.ncbi.nlm.nih.gov/pubmed/26864410

Informatics Trek III: The Search For Sepsis

Big data! It’s all the rage with tweens these days. Hoverboards, Yik Yak, and predictive analytics are all the kids talk about now.

This “big data” application, more specifically, involves the use of an institutional database to derive predictors for mortality in sepsis.  Many decision instruments for various sepsis syndromes already exist – CART, MEDS, mREMS, CURB-65, to name a few – but all suffer from the same flaw: how reliable can a rule with just a handful of predictors be when applied to the complex heterogeneity of humanity?

Machine-learning applications of predictive analytics attempt to create, essentially, Decision Instruments 2.0.  Rather than using linear statistical methods to simply weight a small handful of different predictors, most of these applications utilize the entire data set and some form of clustering.  Most generally, these models replace typical variable weighted scoring with, essentially, a weighted neighborhood scheme, in which similarity to other points helps predict outcomes.
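
A toy illustration of that “weighted neighborhood” idea: the predicted risk for a new patient is a similarity-weighted vote of the most similar prior patients, rather than a fixed linear score. This distance-weighted k-nearest-neighbors stand-in is just to show the concept; the study discussed next used a random forest, whose terminal nodes behave like adaptive neighborhoods.

```python
# Conceptual "weighted neighborhood" predictor - toy data, not the study's model.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))        # prior septic patients, 12 toy features
y = rng.binomial(1, 0.2, size=500)    # in-hospital mortality (toy labels)

model = KNeighborsClassifier(n_neighbors=25, weights="distance").fit(X, y)
new_patient = rng.normal(size=(1, 12))
print(model.predict_proba(new_patient))   # risk estimate from similarity-weighted neighbors
```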

Long story short, this study out of Yale utilized 5,278 visits for acute sepsis, split into training and validation sets, to build a random forest model. The random forest model included all available data points from the electronic health record, while the comparator models used up to 20 predictors based on expert input and prior literature. For their primary outcome of predicting in-hospital death, the AUC for the random forest model was 0.86 (CI 0.82-0.90), while none of the other models exceeded an AUC of 0.76.
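
For reference, a confidence interval like that 0.86 (0.82-0.90) is commonly obtained by bootstrapping the held-out predictions; here is a quick sketch with simulated scores, not the authors’ data or code:

```python
# Bootstrap confidence interval for AUC - simulated outcomes and risk scores.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.15, size=1000)                            # observed outcomes
y_score = np.clip(y_true * 0.3 + rng.normal(0.3, 0.2, 1000), 0, 1)   # model risk scores

aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
    if len(np.unique(y_true[idx])) < 2:               # AUC needs both classes present
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

print(f"AUC = {roc_auc_score(y_true, y_score):.2f} "
      f"(95% CI {np.percentile(aucs, 2.5):.2f}-{np.percentile(aucs, 97.5):.2f})")
```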

This is still simply at the technology-demonstration phase, and requires further development to become actionable clinical information. However, I believe models and techniques like this are our next best paradigm for guiding diagnostic and treatment decisions in our heterogeneous patient population. Many challenges yet remain, particularly in the realm of data quality, but I am excited to see more teams engaged in the development of similar tools.

“Prediction of In-hospital Mortality in Emergency Department Patients with Sepsis: A Local Big Data Driven, Machine Learning Approach”
http://www.ncbi.nlm.nih.gov/pubmed/26679719

More Futile “Quality”, vis-à-vis, Alert Fatigue

The electronic health record can be a wonderful tool.  As a single application for orders, results review, and integrated documentation storehouse, it holds massive potential.

Unfortunately, much of the currently realized potential is that of unintended harms and inefficiencies.

Even the most seemingly innocuous of checks – those meant to ensure safe medication ordering – have gone rogue, and no one seems capable of restraining them.  These authors report on the real-world effectiveness of adverse drug alerts related to opiates.  These were not public health-related educational interventions, but, simply, duplicate therapy, drug allergy, drug interaction, and pregnancy/lactation safety alerts.  These commonly used medications frequently generate medication safety alerts, and are reasonable targets for study in the Emergency Department.

In just a 4-month study period, these authors retrospectively identified 826 patients for whom an opiate-related medication safety alert was triggered, and these 4,742 alerts constituted the cohort for analysis. Of these insightful, timely, and important contextual interruptions, 96.3% were overridden. And, had physicians only listened, these overridden alerts would have prevented: zero adverse drug events.

In fact, none of the 8 opiate-related adverse drug events could have been prevented by alerts – and most of them were just itching, anyway. The authors do attribute 38 potentially prevented adverse drug events to the 3.7% of alerts that were accepted – although, again, these would probably mostly just have been itching.

Thousands of alerts.  A handful of serious events not preventable.  A few episodes of itching averted.  This is the “quality” universe we live in – one in which these alerts paradoxically make our patients less safe due to sheer volume and the phenomenon of “alert fatigue”.

“Clinically Inconsequential Alerts: The Characteristics of Opioid Drug Alerts and Their Utility in Preventing Adverse Drug Events in the Emergency Department”
http://www.ncbi.nlm.nih.gov/pubmed/26553282