When EHR Interventions Succeed … and Fail

This is a bit of a fascinating article with a great deal to unpack – and rightly published in a prominent journal.

The brief summary – this is a “pragmatic”, open-label, cluster-randomized trial in which a set of interventions designed to increase guideline-concordant care were rolled out via electronic health record tools. These interventions were further supported by “facilitators”, persons assigned to each practice in the intervention cohort to support uptake of the EHR tools. In this specific study, the underlying disease state was the triad of chronic kidney disease, hypertension, and type II diabetes. Each of these disease states has well-defined pathways for “optimal” therapy and escalation.

The most notable feature of this trial is the simple, negative topline result – rollout of this intervention had no reliably measurable effect on patient-oriented outcomes relating to disease progression or acute clinical deterioration. Delving below the surface provides a number of insights worthy of comment:

  • The authors could easily have made this a positive trial by making the primary outcome the change in guideline-concordant care, as many other trials have done. This is a lovely example of how surrogates for patient-oriented outcomes must always be critically appraised for the strength of their association.
  • The entire concept of this trial is likely passively traumatizing to many clinicians – being bludgeoned by electronic health record reminders and administrative nannying to increase compliance with some sort of “quality” standard. Despite all these investments, alerts, and nagging – patients did no better. As above, since many of these trials simply measure changes in behavior as their endpoints, it likely leaves many clinicians feeling sour seeing results like these where patients are no better off.
  • The care “bundle” and its lack of effect size is notable, although it ought to be noted the patient-oriented outcomes here for these chronic, life-long diseases are quite short-term. The external validity of findings demonstrated in clinical trials frequently falls short when generalized to the “real world”. The scope of the investment here and its lack of patient-oriented improvement is a reminder of the challenges in medicine regarding evidence of sufficient strength to reliably inform practice.

Not an Emergency Medicine article, per se, but certainly describes the sorts of pressures on clinical practice pervasive across specialties.

“Pragmatic Trial of Hospitalization Rate in Chronic Kidney Disease”
https://www.nejm.org/doi/full/10.1056/NEJMoa2311708

Quick Hit: Elders Risk Assessment

A few words regarding an article highlighted in one of my daily e-mails – a report regarding the Elders Risk Assessment tool (ERA) from the Mayo Clinic.

The key to the highlight is the assertion this score can be easily calculated and presented in-context to clinicians during primary care visits, allowing patients with higher scores to be easily identified for preventive interventions. With an AUC of 0.84, the authors are rather chuffed about the overall performance. In fact, they close their discussion with this rosy outlook:

The adoption of a proactive approach in primary care, along with the implementation of a predictive clinical score, could play a pivotal role in preventing critical illnesses, benefiting patients and optimizing healthcare resource allocation.

Completely missing from their limitations is the fact that prognostic scores are not prescriptive. The ERA is based on age, recent hospitalizations, and chronic illness. The extent to which any of these issues can be addressed “proactively” in the current primary care environment – and whether doing so improves patient-oriented outcomes – remains to be demonstrated.

To claim a scoring system is going to better the world, it is necessary to compare decisions made with formal prompting by the score to decisions made without – several steps removed from performing a retrospective evaluation to generate an AUC. It ought also to be appreciated that some decisions based on high ERA scores will increase resource utilization without a corresponding benefit to health, while lower scores may likewise inappropriately bias clinical judgement.
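
For intuition, here is a minimal sketch of what a retrospective c-statistic like the reported 0.84 actually measures – the scores and outcomes below are invented for illustration and bear no relation to the ERA derivation data. Discrimination of this sort is several steps removed from demonstrating that acting on the score helps anyone.

```python
# Minimal sketch: how a retrospective AUC (c-statistic) for a prognostic score
# is computed. The score values and outcomes below are invented placeholders,
# not the ERA derivation data.

def auc(scores, outcomes):
    """Probability a randomly chosen event outscores a randomly chosen non-event."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical ERA-like scores and observed critical-illness outcomes
scores   = [2, 5, 9, 1, 12, 7, 3, 10]
outcomes = [0, 0, 1, 0, 1,  1, 0, 0]

print(f"c-statistic: {auc(scores, outcomes):.2f}")
# Discrimination like this says nothing about whether acting on the score
# (the "prescriptive" step) actually improves patient-oriented outcomes.
```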

This article has only passing applicability to emergency medicine, but the same issues regarding the disutility of “prognosis” apply widely.

“Individualized prediction of critical illness in older adults: Validation of an elders risk assessment model”
https://agsjournals.onlinelibrary.wiley.com/doi/abs/10.1111/jgs.18861

Everyone’s Got ChatGPT Fever!

And, most importantly, if you put the symptoms related to your fever into ChatGPT, it will generate a reasonable differential diagnosis.

“So?”

This brief report in Annals describes a retrospective experiment in which 30 written case summaries lifted from the electronic documentation system were fed to either clinician teams or ChatGPT. The clinician teams (either an internal medicine or emergency medicine resident, plus a supervising specialist) and ChatGPT were asked to generate a “top 5” of differential diagnoses, and then settle upon one “most likely” diagnosis. Each case was tested both solely on the recorded narrative, as well as with laboratory results added.

The long and short of this brief report is the lists of diagnoses generated contained the correct final diagnosis with similar frequency – about 80-90% of the time. The correct leading diagnosis was chosen from these lists about 60% of the time by each. Overlap between clinicians and ChatGPT in their lists of diagnoses was, likewise, about 50-60%.
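
For the curious, the headline metrics are simple in principle – top-5 inclusion of the final diagnosis and overlap between the two lists. The sketch below uses invented placeholder cases, not the study’s data.

```python
# Hedged sketch of the report's headline metrics: top-5 inclusion of the final
# diagnosis and overlap between clinician and ChatGPT differential lists.
# The cases below are invented placeholders, not the study's case summaries.

cases = [
    {
        "final": "pulmonary embolism",
        "clinician_top5": ["pneumonia", "pulmonary embolism", "acs", "chf", "copd"],
        "llm_top5": ["pulmonary embolism", "pneumonia", "pericarditis", "acs", "anxiety"],
    },
    {
        "final": "appendicitis",
        "clinician_top5": ["appendicitis", "gastroenteritis", "ovarian torsion", "uti", "colitis"],
        "llm_top5": ["gastroenteritis", "appendicitis", "diverticulitis", "uti", "ibd"],
    },
]

def inclusion_rate(cases, key):
    # Fraction of cases whose final diagnosis appears anywhere in the top-5 list
    return sum(c["final"] in c[key] for c in cases) / len(cases)

def mean_overlap(cases):
    # Average proportion of the 5-item lists shared between clinicians and the LLM
    overlaps = [len(set(c["clinician_top5"]) & set(c["llm_top5"])) / 5 for c in cases]
    return sum(overlaps) / len(overlaps)

print("clinician top-5 inclusion:", inclusion_rate(cases, "clinician_top5"))
print("LLM top-5 inclusion:      ", inclusion_rate(cases, "llm_top5"))
print("mean list overlap:        ", mean_overlap(cases))
```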

The common reaction: wow! ChatGPT is every bit as good as a team of clinicians. We ought to use ChatGPT to fill in gaps where clinician resources are scarce, or to generally augment clinicians contemporaneously.

This may indeed be a valid reaction, and, looking at the healthcare funding environment, it is clear billions of dollars are being thrown at the optimistic interpretation of these types of studies. However, what is lacking from these studies is any sort of meaningful comparison. Prior to ChatGPT, clinicians did not operate in an information-resource vacuum, as they are frequently forced to in these contrived situations. When faced with clinical ambiguity, clinicians (and patients) have used general search engines, in addition to medical knowledge-specific resources (e.g., UpToDate), as augments. These ChatGPT studies are generally, much like many decision-support studies, quite light on testing their clinical utility and implementation in real-world contexts.

Medical applications of large language models are certainly interesting, but it is always valuable to remember LLMs are not “intelligent” – they are simply pattern-matching and generation tools. They may, or may not, provide reliable improvement over current information search strategies available to clinicians.

“ChatGPT and Generating a Differential Diagnosis Early in an Emergency Department Presentation”

When ChatGPT Writes a Research Paper

It is safe to say the honeymoon phase of large language models has started to fade a bit. Yes, they can absolutely pass a medical licensing examination when given carefully constructed prompts. The focus now turns to practical applications – like, in this example, using ChatGPT to write an entire scientific paper for you!

There is no reason to go through the details of the paper, the content, the findings, or any aspect of fruit and vegetable consumption. It is linked only to prove that it exists, and was written in its entirety by an LLM. To create the article, the authors used prompts containing the actual data set, prompts for an introduction, summary tables, and a discussion – impressively, as part of an automated prompting engine written by the authors, not just a laborious manual process. The initial output was not, as you might expect, entirely appropriate, requiring substantial re-prompting and revision – but, in the end, as you may see, the final output is basically indistinguishable from undergraduate- or graduate-student-level work.
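
For illustration, an automated prompting engine of this sort might look something like the sketch below – `call_llm` is a placeholder for whatever chat-completion API is in use, and this is emphatically not the authors’ actual code.

```python
# Hypothetical sketch of a section-by-section prompting loop like the one the
# authors describe. `call_llm` is a placeholder for whichever LLM API is
# available; it is NOT the authors' engine or a specific vendor SDK.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM API of choice")

def draft_paper(dataset_summary: str) -> dict:
    sections = {}
    context = f"Data description and summary statistics:\n{dataset_summary}\n"
    for section in ["Introduction", "Methods", "Results", "Discussion"]:
        # Feed the data summary plus everything drafted so far, then ask for
        # the next section.
        prompt = (
            context
            + "".join(f"\n## {name}\n{text}" for name, text in sections.items())
            + f"\n\nWrite the {section} section of a research paper based on the data above."
        )
        sections[section] = call_llm(prompt)
    return sections

# The substantial re-prompting and revision the authors needed would sit
# around this loop: checking references, asking for corrections, and human
# review of every claim.
```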

There were, of course, hallucinations, banal unfounded declarations, and the expected fabricated references. But, considering that a year or two ago no one would have suggested an LLM could write any semblance of a robust research paper, this is still fairly amazing. Considering this sort of writing is close to peak intellectual accomplishment, it’s fair to say similar automated techniques may replace a great deal of lesser content generation.

“The Impact of Fruit and Vegetable Consumption and Physical Activity on Diabetes Risk among Adults”
https://www.nature.com/articles/d41586-023-02218-z

Sepsis Alerts Save Lives!

Not doctors, of course – the alerts.

This is one of those “we had to do it, so we studied it” sorts of evaluations because, as most of us have experienced, the decision to implement the sepsis alerts is not always driven by pent-up clinician demand.

The authors describe this as a sort of “natural experiment”, where a phased or stepped roll-out allows for some presumption of control over the unmeasured cultural and process confounders limiting pre-/post- studies. In this case, the decision was made to implement the “St John Sepsis Algorithm” developed by Cerner. This algorithm is composed of two alerts – one somewhat SIRS- or inflammation-based for “suspicion of sepsis”, and one incorporating organ dysfunction for “suspicion of severe sepsis”. The “phased” part of the roll-out involved turning on the alerts first in the acute inpatient wards, then the Emergency Department, and then the specialty wards. Prior to being activated, however, the alert algorithm ran “silently” to create the comparison group of patients for whom an alert would have been triggered.
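
As a rough illustration of the flavor of such an algorithm, the toy rule below combines a SIRS-style screen with an organ-dysfunction tier – the thresholds are illustrative only and this is not Cerner’s proprietary St John Sepsis Algorithm.

```python
# Toy two-tier alert in the spirit of a SIRS-based "suspicion of sepsis" flag
# plus an organ-dysfunction tier. Thresholds are illustrative only -- this is
# not the proprietary St John Sepsis Algorithm.

def sirs_count(obs: dict) -> int:
    criteria = [
        obs["temp_c"] > 38.0 or obs["temp_c"] < 36.0,
        obs["heart_rate"] > 90,
        obs["resp_rate"] > 20,
        obs["wbc"] > 12.0 or obs["wbc"] < 4.0,
    ]
    return sum(criteria)

def organ_dysfunction(obs: dict) -> bool:
    return obs["sbp"] < 90 or obs["lactate"] > 2.0 or obs["creatinine"] > 2.0

def sepsis_alert(obs: dict) -> str | None:
    if sirs_count(obs) >= 2:
        if organ_dysfunction(obs):
            return "suspicion of severe sepsis"
        return "suspicion of sepsis"
    return None

obs = {"temp_c": 38.6, "heart_rate": 112, "resp_rate": 24,
       "wbc": 14.2, "sbp": 84, "lactate": 3.1, "creatinine": 1.1}
print(sepsis_alert(obs))  # -> "suspicion of severe sepsis"
```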

The short summary:

  • In their inpatient wards, mortality among patients meeting alert criteria decreased from 6.4% to 5.1%.
  • In their Emergency Department, admitted patients meeting alert criteria were less likely to have a ≥7 day inpatient length-of-stay.
  • In their Emergency Departments, antibiotic administration to patients meeting alert criteria within 1 hour of the alert firing increased from 36.9% to 44.7%.

There are major problems here, of course, both intrinsic to their study design and otherwise. While it is a “multisite” study, there are only two hospitals involved. The “phased” implementation was not the typical different-hospitals-at-different-times design, but was instead phased within each hospital. They report inpatient mortality changes without actually reporting any changes in clinician behavior between the pre- and post- phases – i.e., what did clinicians actually do in response to the alerts? Then, they look at timely antibiotic administration, but they do not look at overall antibiotic volume or the various unintended consequences potentially associated with this alert. Did admission rates increase? Did the percentage of discharged patients receiving intravenous antibiotics increase? Did Clostridium difficile infection rates increase?

Absent the funding and infrastructure to better prospectively study these sorts of interventions, these “natural experiments” can be useful evidence. However, these authors do not seem to have taken an expansive enough view of their data with which to fully support an unquestioned conclusion of benefit to the alert intervention.

“Evaluating a digital sepsis alert in a London multisite hospital network: a natural experiment using electronic health record data”
https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocz186/5607431

Saving Lives with Lifesaving Devices

Automated external defibrillators are quite useful in many cases of out-of-hospital cardiac arrest – specifically, those so-called “shockable rhythms” in which defibrillation is indicated. Ventricular fibrillation, if treated with only bystander conventional cardiopulmonary resuscitation, has dismal survival. However, when no AED is available, the point is moot.

This is an interesting simulation exercise looking to optimize AED placement such that availability might be improved in cases of cardiac arrest. These authors pulled every AED location in Denmark, along with the locations of all OHCAs between 2007 and 2016. Then, they used all OHCAs from 1994 until 2007 as their “training set” to derive optimal locations for AED placement with which to simulate. Optimal AED placements were dichotomized into “intervention #1” and “intervention #2” based on whether the building housing the AED provided business-hours access or 24/7 access.
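
Conceptually, this is a maximum-coverage problem. The sketch below shows a greedy version with made-up coordinates – it illustrates the general idea only, not the authors’ actual optimization model.

```python
# Sketch of AED placement as a maximum-coverage problem: greedily pick the
# candidate locations covering the most historical arrests within some radius.
# Illustrative only -- not the authors' optimization model or the Danish data.
from math import hypot

def covers(aed, arrest, radius_m=100.0):
    # Planar distance stand-in; real work would use geodesic distances.
    return hypot(aed[0] - arrest[0], aed[1] - arrest[1]) <= radius_m

def greedy_placement(candidates, arrests, n_aeds):
    remaining = list(candidates)
    placed, uncovered = [], set(range(len(arrests)))
    for _ in range(min(n_aeds, len(remaining))):
        # Pick the candidate covering the most still-uncovered arrests
        best = max(remaining, key=lambda c: sum(covers(c, arrests[i]) for i in uncovered))
        placed.append(best)
        remaining.remove(best)
        uncovered -= {i for i in uncovered if covers(best, arrests[i])}
    return placed, uncovered

# Hypothetical coordinates (metres on a local grid)
arrests = [(10, 10), (12, 15), (200, 210), (205, 220), (400, 50)]
candidates = [(11, 12), (202, 215), (390, 60), (300, 300)]
placed, missed = greedy_placement(candidates, arrests, n_aeds=2)
print(placed, f"{len(arrests) - len(missed)}/{len(arrests)} arrests covered")
```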

In the “real world” of 2007 to 2016, AED coverage of OHCA was 22.0%, leading to 14.6% bystander defibrillation. Based on their simulations and optimization, these authors propose potential coverage of 33.4% and 43.1%, depending on whether placements assumed business-hours or 24/7 access, leading to increases in bystander defibrillation of 22.5% and 26.9%. This improved coverage and bystander defibrillation would give an absolute increase in survival, based on the observed rate, of 3.4% and 4.1% over the study period.

This is obviously a simulation, meaning all these projected numbers are fictitious and subject to the imprecision of the inputs, along with the extrapolated outcomes. However, the underlying principle of trying to intelligently match AED access to OHCA volume is certainly reasonable. It is hard to argue against distributing a limited resource in some data-driven fashion.

“In Silico Trial of Optimized Versus Actual Public Defibrillator Locations”
https://www.sciencedirect.com/science/article/pii/S0735109719361649

The United Colors of Sepsis

Here it is: sepsis writ Big Data.

And, considering it’s Big Data, it’s also a big publication: a 15-page primary publication, plus 90+ pages of online supplement – dense with figures, raw data, and methods both routine and novel for the evaluation of large data sets.

At the minimum, to put a general handle on it, this work primarily demonstrates the heterogeneity of sepsis. As any clinician knows, “sepsis” – with its ever-morphing definition – ranges widely from those generally well in the Emergency Department to those critically ill in the Intensive Care Unit. In an academic sense, this means the patients enrolled and evaluated in various trials for the treatment of sepsis may be quite different from one another, and results seen in one trial or setting may generalize poorly to another. This has obvious implications when trying to determine a general set of care guidelines from these disparate bits of data, and results in further issues down the road when said guidelines become enshrined in quality measures.

Overall, these authors ultimately define four phenotypes of sepsis, helpfully assigned descriptive labels using letters of the Greek alphabet. These four phenotypes were derived from retrospective administrative data, then validated on additional retrospective administrative data, and finally on the raw data from several prominent clinical trials in sepsis, including ACCESS, PROWESS, and ProCESS. The four phenotypes were derived by clustering and refinement, and are described by the authors as, effectively: a mild type with low mortality; a cohort of those with chronic illness; a cohort with systemic inflammation and pulmonary disease; and a final cohort with liver dysfunction, shock, and high mortality.

We are quite far, however, from needing to apply these phenotypes in a clinical fashion. Any classification model is highly dependent upon its inputs, and in this study the inputs are the sorts of routine clinical data available from the electronic health record: vital signs, demographics, and basic labs. Missing data were common – lactate levels, for example, were not obtained in 80% of patients in their model. These inputs then dictate how many different clusters you obtain, how the relative accuracy of classification diminishes with greater numbers of clusters, as well as whether the model begins to overfit the derivation data set.
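
For a sense of the machinery involved, the sketch below standardizes a handful of routine clinical variables and examines how cluster quality shifts with the number of clusters – on synthetic data, and not the authors’ actual consensus clustering pipeline.

```python
# Hedged sketch of the general approach: standardize routine clinical
# variables and cluster, watching how cluster quality changes with k.
# Synthetic data stands in for the EHR extract; this is not the authors'
# pipeline or their refinement steps.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# columns standing in for: age, heart rate, SBP, creatinine, bilirubin, lactate
X = rng.normal(size=(1000, 6))

X_std = StandardScaler().fit_transform(X)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_std)
    print(k, round(silhouette_score(X_std, labels), 3))
# The "right" number of phenotypes depends heavily on which inputs are fed in
# and how missing values (e.g., unmeasured lactate) are handled.
```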

Then, the application is a bit fuzzy in the sense that these data represent different types of patients with sepsis as much as they represent different types of sepsis. Consider the varying etiologies of sepsis, including influenza pneumonia, streptococcal toxic shock, or gram-negative bacteremia. These different etiologies would obviously result in different host responses depending on individual patient features. The phenotypes derived here effectively mash up the causative agent with the underlying host, muddying clinical application.

If clinical utility is limited, then what might be the best use of this work? Well, this goes back to the idea above regarding translating work from clinical trials to different settings. A community Emergency Department might primarily see alpha-sepsis, a community ICU might see a lot of beta-sepsis, while an academic ICU might see predominantly delta-sepsis. These are important concepts to consider – and potentially subgroup-analyses to perform – when evaluating the outcomes of clinical trials. These authors run several simulations of clinical trials while varying the composition of sepsis phenotypes, and note potentially important effects on primary outcomes. Pathways of care or resuscitation protocols could potentially be more readily compared between trial populations if these phenotypes were calculated.
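
A toy version of that exercise is sketched below – the per-phenotype mortality rates and treatment effects are invented for illustration, whereas the authors’ simulations drew on the actual trial data.

```python
# Toy simulation of how phenotype mix can shift a trial's apparent treatment
# effect. Per-phenotype mortality and treatment effects are invented for
# illustration; they are not estimates from the paper.
import random

MORTALITY = {"alpha": 0.05, "beta": 0.15, "gamma": 0.25, "delta": 0.40}
EFFECT    = {"alpha": 0.00, "beta": -0.02, "gamma": -0.05, "delta": +0.03}  # absolute risk change

def simulate_trial(mix, n=5000, seed=0):
    rng = random.Random(seed)
    phenos = list(mix)
    weights = [mix[p] for p in phenos]
    deaths = {"control": 0, "treatment": 0}
    for arm in deaths:
        for _ in range(n):
            p = rng.choices(phenos, weights)[0]
            risk = MORTALITY[p] + (EFFECT[p] if arm == "treatment" else 0.0)
            deaths[arm] += rng.random() < risk
    return {arm: d / n for arm, d in deaths.items()}

ed_mix  = {"alpha": 0.6, "beta": 0.2, "gamma": 0.15, "delta": 0.05}
icu_mix = {"alpha": 0.1, "beta": 0.2, "gamma": 0.3,  "delta": 0.4}
print("ED-heavy enrolment:  ", simulate_trial(ed_mix))
print("ICU-heavy enrolment: ", simulate_trial(icu_mix))
```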

This is a challenging work to process – but an important first step in better recognizing the heterogeneity in potential benefits and harms resulting from various interventions. The accompanying editorial also does an excellent job of describing the methods, outcomes, and utility.

“Derivation, Validation, and Potential Treatment Implications of Novel Clinical Phenotypes for Sepsis”
https://jamanetwork.com/journals/jama/fullarticle/2733996

“New Phenotypes for Sepsis”
https://jamanetwork.com/journals/jama/fullarticle/2733994

Please Click Here to Reset Your Password

Please enter your social security number and the last four digits of your credit card to complete the process.

Your credentials could not be confirmed. Please enter your mother’s maiden name, the first car you drove, and the number of dollars remaining on the mortgage to your house.

… when this happens to you, it’s a big problem. When it provides malicious attackers a backdoor into your healthcare delivery system, it’s a much, much bigger problem. Our institution, like many others, implemented a “phishing” training program, complete with online modules and test e-mails sent periodically to our institutional accounts.

But does it work?

Considering we’re starting from the bottom, the answer is a qualified “yes”.

In these authors’ report, they detail their experience with 5,416 unique employees at a single institution undergoing a campaign aimed at education about phishing. Their intervention and program consisted of 20 fake malicious e-mails sent periodically at 2- to 3-month intervals. Only 975 (17.9%) clicked on zero malicious links in e-mails during their educational campaign. An almost equal number, 772, clicked on five or more malicious links. Generally, over the course of the intervention, rates of click-through gradually decreased from highs in the 70% range to well below 10%.

Additionally, after 15 e-mails, those who had clicked on enough e-mails to be labelled “offenders” underwent a mandatory training program. Unfortunately, this training program had no subsequent effect on click-through rates. Those who had been offenders before, remained offenders – with click rates on malicious links of 10-25%, depending on the fake example.

Grim news for security consultants trying to prevent massive data breaches.

“Evaluation of a mandatory phishing training program for high-risk employees at a US healthcare system”
https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocz005/5376646

OK, Google: Discharge My Patient

Within my electronic health record, I have standardized discharge instructions in many languages. Some of these I can read or edit with some fluency – such as Spanish – while in others I have no facility whatsoever – such as Vietnamese. These function adequately as general reading material regarding any specific diagnosis made in the Emergency Department.

However, frequently, additional free text clarification is necessary regarding a treatment plan – whether it be time until suture removal, specifics about follow-up, or clarifications relevant to an individual patient. This level of language art is beyond my capacity in Spanish, let alone any sort of logographic or morphographic writing.

These authors performed a simple study in which they processed 100 free-text Emergency Department discharge instructions through the Google Translate blender to produce Spanish- and Chinese-language editions. The accuracy of the Spanish translation was 92%, as measured by the proportion of sentences preserving meaning and readability. Chinese fared less well, at 81%. Finally, the authors assessed the errors for clinical relevance and potential harm – and found 2% of Spanish instructions and 8% of Chinese instructions met their criteria.

Of course, there are a couple of potential strategies to mitigate these issues – including back-translating the text from the foreign language back into English, as these authors did as part of their methods, or spending time verbally confirming the clarity of the written instructions with the patient. Instructions can also be improved prior to translation by avoiding abbreviations and using simple sentence structures.
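
A back-translation check is simple to automate in principle, as in the sketch below – `translate` is a placeholder for whatever translation service is in use, and the similarity measure here is deliberately crude.

```python
# Sketch of a back-translation spot check: round-trip the instruction back to
# English and flag large divergence for human review. `translate` is a
# placeholder for whichever translation API is in use.
from difflib import SequenceMatcher

def translate(text: str, source: str, target: str) -> str:
    raise NotImplementedError("wire this to a translation service")

def needs_review(instruction_en: str, target_lang: str, threshold: float = 0.8) -> bool:
    forward = translate(instruction_en, "en", target_lang)
    back = translate(forward, target_lang, "en")
    # Crude string similarity between the original and round-tripped English
    similarity = SequenceMatcher(None, instruction_en.lower(), back.lower()).ratio()
    return similarity < threshold

# e.g. needs_review("Return in 7 days for suture removal.", "es")
```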

Imperfect as they may be, using a translation tool is still likely better than giving no written instruction at all.

“Assessing the Use of Google Translate for Spanish and Chinese Translations of Emergency Department Discharge Instructions”
https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2725080

Precog for Medical Errors

Medical errors are grossly under-reported, with only an estimated 10% of safety events ever identified via voluntary reporting systems. There’s an entire line of academic inquiry simply targeted at increasing the proportion of safety events detected, with the overall goal of informing subsequent practice change. This study – mentioned in the daily ACEP News briefing – takes it one step further, attempting to predict future safety events in real time.

These authors used a continuous stream of data from the electronic health record to create a “patient safety active management” system. They created an initial model based on four years’ worth of data from 2009 to 2013, and subsequently validated it via pilot implementation at two hospitals between 2014 and 2017. During these pilot phases, each nursing unit was provided with a dashboard for every patient indicating whether a trigger event had occurred, along with a twice-daily updated score of their overall risk for an event. A nurse reviewer followed all the automated positive triggers and evaluated their downstream harms, as well as the harm severity.
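
To make the idea concrete, a trigger-surveillance loop might look something like the sketch below – the triggers shown are generic examples in the spirit of classic trigger tools, not the study’s actual trigger definitions or risk model.

```python
# Toy trigger-surveillance loop over a stream of EHR events. The triggers
# (naloxone given, INR > 5) are generic illustrations in the spirit of classic
# trigger tools -- not the study's trigger definitions or risk-score model.
from dataclasses import dataclass

@dataclass
class Event:
    patient_id: str
    kind: str          # e.g. "med_admin", "lab_result"
    name: str
    value: float = 0.0

TRIGGERS = [
    ("naloxone administered", lambda e: e.kind == "med_admin" and e.name == "naloxone"),
    ("INR > 5",               lambda e: e.kind == "lab_result" and e.name == "INR" and e.value > 5),
]

def surveil(stream):
    """Yield (patient, trigger) pairs for a nurse reviewer to adjudicate."""
    for event in stream:
        for label, rule in TRIGGERS:
            if rule(event):
                yield event.patient_id, label

stream = [
    Event("A", "med_admin", "naloxone"),
    Event("B", "lab_result", "INR", 6.2),
    Event("B", "lab_result", "INR", 1.1),
]
for patient, label in surveil(stream):
    print(patient, label)
```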

We’re a long way from prime time. There were 775,415 trigger events in 147,503 inpatient admissions, resulting in 3,896 clinically validated safety events. The vast majority of events were “temporary harm” or “increased length-of-stay”, although there were a few serious safety events as well. Worse still, these authors don’t specifically delve into “preventable” harms, and their list of the most common adverse events does not clearly offer clues as to whether the harms could specifically be mitigated or avoided. For example, many of their harms were medication-related bleeding or medication-related Clostridium difficile infection – unintended harms, to be sure, but frankly known risks of likely medically appropriate treatment pathways.

Every project has to start somewhere, of course, and these early steps will hopefully further inform more specific tools. Unfortunately, I’m skeptical – I primarily expect more low-value, alert fatigue-inducing hiccups along the way.

“An Electronic Health Record–Based Real-Time Analytics Program For Patient Safety Surveillance And Improvement”
https://www.healthaffairs.org/doi/abs/10.1377/hlthaff.2018.0728