Quick Hit: Elders Risk Assessment

A few words on an article highlighted in one of my daily e-mails – a report on the Elders Risk Assessment (ERA) tool from the Mayo Clinic.

The key to the highlight is the assertion that this score can be easily calculated and presented in-context to clinicians during primary care visits, allowing patients with higher scores to be readily flagged for preventive interventions. With an AUC of 0.84, the authors are rather chuffed about the overall performance. In fact, they close their discussion with this rosy outlook:

The adoption of a proactive approach in primary care, along with the implementation of a predictive clinical score, could play a pivotal role in preventing critical illnesses, benefiting patients and optimizing healthcare resource allocation.

Completely missed in their limitations is the fact that prognostic scores are not prescriptive. The ERA is based on age, recent hospitalizations, and chronic illness. The extent to which any of these issues can be managed “proactively” in the current primary care environment – and with a demonstrable positive impact on patient-oriented outcomes – remains to be seen.

To claim a scoring system is going to better the world, it is necessary to compare decisions made with formal prompting by the score to decisions made without – several steps removed from performing a retrospective evaluation to generate an AUC. It ought also to be appreciated that some decisions based on high ERA scores will increase resource utilization without a corresponding beneficial effect on health, while lower scores may likewise inappropriately bias clinical judgement.
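
For the concretely minded: an AUC is nothing more than a discrimination statistic computed over a retrospective cohort. A minimal sketch, using scikit-learn and entirely invented scores and outcomes (not the ERA model or the Mayo cohort):

```python
# Minimal sketch: an AUC is a retrospective discrimination statistic.
# The scores and outcomes below are invented for illustration; they are
# not the ERA model or the Mayo Clinic cohort.
from sklearn.metrics import roc_auc_score

# Hypothetical risk scores assigned to ten patients at a primary care visit
scores = [2, 9, 4, 16, 7, 3, 12, 5, 10, 1]
# Whether each patient actually developed critical illness during follow-up
outcomes = [0, 1, 0, 1, 0, 0, 1, 1, 0, 0]

# Probability a randomly chosen case scores higher than a randomly chosen control
print(roc_auc_score(outcomes, scores))  # 0.875 for these invented data
```

A respectable number of this sort says nothing about whether acting on the score changes anyone’s outcome – that requires the prospective comparison described above.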

This article has only passing applicability to emergency medicine, but the same issues regarding the disutility of “prognosis” apply widely.

“Individualized prediction of critical illness in older adults: Validation of an elders risk assessment model”
https://agsjournals.onlinelibrary.wiley.com/doi/abs/10.1111/jgs.18861

Update to Start 2024

A brief post collating a few bits of my various work published across the interwebs …

The Annals of Emergency Medicine Podcast continues to summarise the meatiest articles from each month, featuring a cycle of new co-hosts, as well:

Naturally, there are continuing Journal Club features, covering the following articles:

I should also point out a couple additional new publications with two very different and amazing teams:

Lastly, in ACEPNow, we have:

Enjoy!

Everyone’s Got ChatGPT Fever!

And, most importantly, if you put the symptoms related to your fever into ChatGPT, it will generate a reasonable differential diagnosis.

“So?”

This brief report in Annals describes a retrospective experiment in which 30 written case summaries lifted from the electronic documentation system were fed to either clinician teams or ChatGPT. The clinician teams (either an internal medicine or emergency medicine resident, plus a supervising specialist) and ChatGPT were asked to generate a “top 5” of differential diagnoses, and then settle upon one “most likely” diagnosis. Each case was tested both with the recorded narrative alone and with laboratory results added.
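
The setup is trivially easy to reproduce in principle. Here is a minimal sketch of this sort of prompt using the OpenAI Python client – the model name, prompt wording, and case vignette are illustrative assumptions, not the authors’ actual protocol:

```python
# Minimal sketch of prompting an LLM for a "top 5" differential.
# Model name, prompt wording, and the case vignette are illustrative only;
# they are not the prompts or cases used in the Annals study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

case_summary = (
    "58-year-old with two hours of pleuritic chest pain and dyspnea after "
    "a long-haul flight; tachycardic, normotensive, afebrile."
)

response = client.chat.completions.create(
    model="gpt-4o",  # hypothetical choice; the study used ChatGPT
    messages=[
        {"role": "system",
         "content": "You are assisting with emergency department triage."},
        {"role": "user",
         "content": f"Case: {case_summary}\n\n"
                    "List the five most likely diagnoses, then state the "
                    "single most likely diagnosis."},
    ],
)

print(response.choices[0].message.content)
```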

The long and short of this brief report is that the lists of diagnoses generated contained the correct final diagnosis with similar frequency – about 80-90% of the time. The correct leading diagnosis was chosen from these lists about 60% of the time by each. Overlap between clinicians and ChatGPT in their lists of diagnoses was, likewise, about 50-60%.

The common reaction: wow! ChatGPT is every bit as good as a team of clinicians. We ought to use ChatGPT to fill in gaps where clinician resources are scarce, or to generally augment clinicians contemporaneously.

This may indeed be a valid reaction, and, looking at the healthcare funding environment, it is clear billions of dollars are being thrown at the optimistic interpretation of these types of studies. However, what is lacking from these studies is any sort of comparison. Prior to ChatGPT, clinicians did not operate in the information-resource vacuum frequently imposed by these contrived situations. When faced with clinical ambiguity, clinicians (and patients) have used general search engines, in addition to medical knowledge-specific resources (e.g., UpToDate), as augments. These ChatGPT studies are generally, much like many decision-support studies, quite light on testing their clinical utility and implementation in real-world contexts.

Medical applications of large language models are certainly interesting, but it is always valuable to remember LLMs are not “intelligent” – they are simply pattern-matching and generation tools. They may, or may not, provide reliable improvement over current information search strategies available to clinicians.

ChatGPT and Generating a Differential Diagnosis Early in an Emergency Department Presentation

Don’t Use Lytics in Mild Stroke, Part 3

Well, PRISMS demonstrated unfavorable results.

MARISS tried to ascertain predictors of poor outcome in mild stroke, and intravenous thrombolysis was not associated with an effect on the primary outcome.

Now, again, we examine thrombolysis in “mild” stroke – in this case, NIHSS ≤3 – and, again, it comes up short.

Like MARISS, this is a retrospective dredge of patients selected by the treating clinicians to receive either intravenous thrombolysis or, in this case, dual-antiplatelet therapy with clopidogrel and aspirin. The population included for analysis was drawn from the Austrian Stroke Unit Registry from 2018 to 2019, an original cohort of 53,899 patients. Of these, 29,252 were NIHSS ≤3, but exclusions left nearly 25,000 out – primarily those whose strokes were the result of atrial fibrillation, or whose treating clinicians chose platelet monotherapy instead of dual antiplatelet therapy.

The remaining ~4,000 were analyzed both as unadjusted cohorts and as propensity-matched cohorts comprising roughly 20% of the original. In the unadjusted cohorts, efficacy and safety outcomes were universally worse in those selected for thrombolysis – but these were, of course, generally more severe stroke syndromes. After propensity score matching, these differences generally disappeared – except for a preponderance of sICH in the thrombolysis cohort.
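
For those unfamiliar with the technique, propensity matching here means modeling each patient’s probability of receiving thrombolysis from baseline covariates, then pairing treated and untreated patients with similar predicted probabilities. A minimal sketch of one common approach – logistic regression plus 1:1 nearest-neighbor matching – with invented covariates and data, not the authors’ actual specification:

```python
# Minimal sketch of 1:1 nearest-neighbor propensity score matching.
# Invented data and covariates; not the Austrian registry analysis itself.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
# Hypothetical baseline covariates: age, NIHSS (0-3), onset-to-door time (hours)
X = np.column_stack([
    rng.normal(70, 10, n),
    rng.integers(0, 4, n),
    rng.uniform(0, 4, n),
])
treated = (rng.random(n) < 0.3).astype(int)   # 1 = thrombolysis, 0 = DAPT

# Step 1: model the probability of receiving thrombolysis given covariates
ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

# Step 2: match each treated patient to the closest untreated propensity score,
# without replacement
treated_idx = np.flatnonzero(treated == 1)
available = set(np.flatnonzero(treated == 0))
pairs = []
for i in treated_idx:
    candidates = np.array(sorted(available))
    j = candidates[np.argmin(np.abs(ps[candidates] - ps[i]))]
    pairs.append((i, j))
    available.remove(j)

# Outcomes are then compared within the matched pairs
print(f"{len(pairs)} matched pairs from {n} patients")
```

Matching can only balance the covariates that were measured and modeled – which is precisely why these analyses remain observational and confounded by indication.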

The authors here conclude there’s no evidence of superiority for thrombolysis in mild stroke, and their results fit broadly with those from other cohorts. It’s observational and unreliable, but it ought to be a very reasonable stance to withhold thrombolysis for mild strokes pending trials conclusively demonstrating which, if any, mild strokes do improve with thrombolysis.

IV Thrombolysis vs Early Dual Antiplatelet Therapy in Patients With Mild Noncardioembolic Ischemic Stroke

Which Sepsis Alert is the Biggest Loser?

It’s a trick question – in the end, all of us have already lost.

This is a short retrospective report evaluating, primarily, the Epic Sepsis Prediction Model, and the mode in which it is deployed. The Epic SPM generates a “prediction of sepsis score” (PSS), calculated at 15-minute intervals, providing a continuous risk score for the development of sepsis. Of course, in modern medicine, this is usually reduced to a trigger threshold at which point an alert is fired. Alerts, alerts, alerts – what are they good for?

In this study, the Epic SPM was evaluated at several different PSS thresholds ranging from ≥5 to ≥10 – and compared, as well, with SIRS, qSOFA, and SOFA. There were two goals for the evaluation: accuracy and timeliness. All prediction tools exhibited the same age-old tradeoff between sensitivity and specificity, with a PSS of ≥5 being 95% sensitive, but merely 53% specific. Likewise, a more specific cut-off sacrificed sensitivity. SIRS, qSOFA, and SOFA suffered from the same limitations.
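
The tradeoff is purely mechanical – raising the alert threshold buys specificity at the cost of sensitivity. A minimal sketch with invented scores and labels, not the Epic SPM itself:

```python
# Minimal sketch of the threshold tradeoff; invented data, not the Epic SPM.
import numpy as np

rng = np.random.default_rng(1)
sepsis = rng.random(5000) < 0.07                     # ~7% true sepsis prevalence
# Hypothetical continuous risk scores: septic patients tend to score higher
score = np.where(sepsis, rng.normal(9, 3, 5000), rng.normal(4, 3, 5000))

for threshold in (5, 8, 10):
    alert = score >= threshold
    sensitivity = (alert & sepsis).sum() / sepsis.sum()
    specificity = (~alert & ~sepsis).sum() / (~sepsis).sum()
    print(f"threshold >= {threshold}: "
          f"sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```

Every tool in the comparison, from SIRS to the proprietary score, rides this same curve; the chosen threshold merely picks a point along it.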

The “time to detection” was a bit more interesting, but conclusions are limited by the methods used to determine it. The PSS is calculated at 15-minute intervals, while their calculations of SIRS, qSOFA, and SOFA all happened at hourly intervals. Then, “time zero” for their calculations was actually determined by the time of clinician action – the time at which a clinician suspected sepsis and ordered either antimicrobials or blood cultures. With respect to timeliness, only a minority of patients met threshold scores at “time zero” – except for SIRS, where nearly half were at threshold.

So, it’s hard to conclude much from these data – other than, as previously alluded, we are all losers. These alerts are clearly useless, yet they, and the Surviving Sepsis bundle gestapo, have trained clinicians to leap at the earliest opportunity to (over)diagnose sepsis and administer broad-spectrum antibiotics. Multiple specialty societies have asked for the SEP-1 measures to be rolled back due to these obvious harms, let alone the administrative costs, and eliminating that “quality” measure would go a long way toward putting these useless alerts to bed.

Sepsis Prediction Model for “Determining Sepsis vs SIRS, qSOFA, and SOFA”

End Nail Dogma

In a world of doors, truck beds, furniture, and other finger-crushing nuisances, emergency department visits for injuries involving the distal digits are common. Injuries range from tuft fractures, to degloving injuries, to all manner of nail and nailbed derangement.

Perusing any textbook or online resource will typically turn up advice for some manner of repair, including, but not limited to, replacing an avulsed nail back into the proximal nail fold and securing it in place. If the avulsed nail is not available, recommendations include placing a bit of foil into the proximal nail fold. The general idea is that failure to do so will irretrievably scar the germinal matrix, resulting in some disfigured and mutant nail growth.

The NINJA trial tests whether this dogma is valid – and, rather unsurprisingly, finds it is not.

In this trial, children with fingernail and nailbed injuries requiring surgical repair were randomized, at the conclusion of the injury repair, either to replacement of the nail (or foil) into the nail fold, or to discarding the nail and simply leaving on a non-adherent dressing. The co-primary outcomes were cosmetic appearance of the nail (using the Oxford Fingernail Appearance Score) and surgical-site infection at 1-week follow-up.

The majority of the 451 children involved were aged younger than 6, and most injuries were crush injuries resulting in avulsion of the nail plate. The primary outcomes were no different between groups – 5 and 2 surgical-site infections in the “nail replacement” and “nail discarded” groups, respectively, and the median OFNAS score was 5 (the highest score) in each group. Lest the trial be accused of merely failing to demonstrate a difference favoring the “nail replacement” group, it was actually the “nail discarded” group that had a non-significantly more favorable distribution of cosmetic scores.

When suggesting these results are unsurprising, it’s rather just a perspective that many clinical encounters in the emergency department are “over-medicalized”, receiving unnecessary tests or treatment simply due to the spectrum bias associated with acute care. Most healthy human substrate is capable of healing from minor injury in a satisfactory fashion; hopefully, these results further inform the care of children with fingernail injuries, and may be reasonably generalized to other nails and healthy adults.

Effectiveness of nail bed repair in children with or without replacing the fingernail: NINJA multicentre randomized clinical trial

The Opiates in Back Pain Conundrum

We do love to give out opiates in the emergency department. Kidney stone? Opiates. Broken arm? Opiates. Gunshot wound? Opiates. Sore throat? Dexamethasone. And opiates.

So of course we’re here with opiates for your back pain.

In this modern day, we are far, far more judicious than in times of yore, back when pharma had lobbied for pain to become the “fifth vital sign”. But, nonetheless, those patients who are struggling to manage despite non-opiate analgesia frequently end up with some sort of small supply to try and resolve an acutely painful condition.

The OPAL trial, published in The Lancet, is yet another in a series of trials decrying the disutility of virtually anything for back pain – in the context of prior work diminishing the efficacy of skeletal muscle relaxants, as well as even acetaminophen added to ibuprofen. In this trial, patients with “acute” low back pain were prescribed an oxycodone-based opiate or matching placebo, and their functional recovery was assessed in follow up. Unfortunately, no advantage was seen for patients randomized to oxycodone, while there were small, but likely real, risks for opiate misuse at later intervals.

However, does this trial apply to the emergency department?

  • Patients were eligible if they had low back pain for up to 3 months. This is not exactly “acute” – especially since early versions of the protocol excluded patients whose back pain had been ongoing for less than 2 weeks.
  • Modified-release oxycodone-naloxone was the opiate of choice in this Australian trial. The naloxone itself does not exert much influence on the analgesic effect, but the preparation differs from those commonly used in the emergency department.
  • The follow-up interval was at six weeks, a good patient-oriented timeframe for long-term clinical resolution. However, emergency department treatment tends to choose opiate analgesia with the goal of short-term mobilization and return to activity, so 48- or 72-hour relief or functioning may be more relevant.

The most notable problem with this trial is not, in fact, the trial itself. Rather, the issue remains the paucity of true short-term data regarding any added benefit for the minimally effective quantity of opiates usually dispensed from the emergency department. Spring into action, team!

“Opioid analgesia for acute low back pain and neck pain (the OPAL trial): a randomised placebo-controlled trial”

The Cost of “Quality”

In case you missed this beautiful little article, it’s worth re-highlighting the paradoxical “cost” of “quality”.

In theory, high-quality care is its own reward. Timely actions and interventions, thoughtful and thorough evaluations, and appropriate guideline adherence when applicable are all goals with reasonable face validity for healthcare delivery. Competing incentives, however, coupled with time pressures, erode some of the natural inclination towards ideal care. Thus the rise of “quality” metrics and goals, created with the best of intentions to nudge clinicians and health systems towards better care.

Unfortunately, the siren song of “quality” has begotten a locust horde of metrics from all manner of organizations. Health care expenditures in the U.S. have grown from 9% of GDP to 20% of GDP, and administrative costs are estimated to comprise up to 30% of total national health care spending. To add context to these larger estimates, this little article simply looks within the authors’ own institution to evaluate the potential contribution of “quality” measures to those larger sums.

The authors identified, by surveying personnel across their institution, 162 quality metrics reported to 7 measuring organizations, totaling 271 reports (as some required reporting to multiple organizations). The bulk (70%) were publicly reported “quality” measures, while another 27% were related to pay-for-performance programs.

Overall, across surveyed personnel, the authors determined approximately 108,000 person-hours were consumed annually on these reports. Based on the annual salaries of the individuals involved and their time commitment, the total annual cost to the institution was estimated at over US$5 million. The most expensive metrics were those requiring individual chart abstraction, while those relying merely on electronic data capture cost a fraction as much.

Multiplied by the 4,000+ hospitals in the U.S., we are suddenly talking about tens of billions of dollars of added administrative overhead. Interestingly enough, and relevant to emergency medicine, one of the worst offenders as far as cost is SEP-1 – the CMS sepsis core measure. Not only is this measure onerous and costly to administer on the institutional side, it results in substantial unmeasured additional work for clinical staff – and I suspect many of these “quality” measures have their costs similarly underestimated.
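
The back-of-envelope arithmetic, for the curious – a naive extrapolation of this single institution’s estimate, nothing more:

```python
# Naive extrapolation of the single-institution estimate; illustration only.
per_hospital_cost = 5_000_000   # > US$5 million per year at this one center
us_hospitals = 4000             # rough count of U.S. hospitals
total = per_hospital_cost * us_hospitals
print(f"${total / 1e9:.0f}+ billion per year")   # ~$20+ billion
```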

Administrative costs aside, it is as important to consider whether “quality” metrics actually reflect higher-quality care, or whether the changes in care driven by metrics improve value. What is certain, however, is their proliferation has been clearly nightmarish.

“The Volume and Cost of Quality Metric Reporting”
https://jamanetwork.com/journals/jama/article-abstract/2805705

Toss Up: A Little Bleeding, or A Lot of Platelets

Platelets are the good little minions of hemostasis. In their absence, invasive procedures carry additional risk, ranging from minimal to clinically important, and the mitigation strategy ranges from avoidance, to alternative procedural techniques, to prophylactic platelet transfusions. Platelets, like any blood product, are associated with significant risks, including acute lung injury, transfusion-associated circulatory overload, and allergic reactions, among others.

This prospective, randomized trial evaluated whether, in patients with thrombocytopenia, a platelet transfusion was necessary before central venous catheter placement. Enrolled patients included those undergoing in-hospital, ultrasound-guided CVC placement, primarily “regular” CVCs, placed into the internal jugular and subclavian veins. Patients randomized to transfusion received one unit of platelet concentrate roughly one hour prior to the procedure. The primary outcome was CVC-related bleeding, graded on a scale of 0 to 4, where Grade 3 and 4 bleeding was associated with significant intervention.

In those receiving a platelet transfusion prior to CVC placement, grade 3 and 4 bleeding was seen in 4 of 188 (2.1%) patients, compared with 9 of 185 (4.9%) of those who did not receive a transfusion. There was also an excess of grade 1 and 2 bleeding in those who did not receive a transfusion prior to the procedure. Secondary subgroup analyses were underpowered to determine whether any specific subgroups were at higher risk, but it is reasonable to suggest the risk may increase as the initial platelet count decreases, while internal jugular placement appeared the safest site.

The cost of this initial protection was, obviously, quite a number of platelet transfusions. Owing to observed bleeding complications, the mean number of units of platelets transfused following CVC placement was much higher in the group that did not receive a prophylactic transfusion. However, when the initial prophylaxis is taken into account, the transfusion cohort received more than double the platelets within 24 hours of the procedure. Red blood cell transfusion was not different between groups, and other observed length-of-stay and mortality outcomes are probably not reliably different.

The trial is presented as “negative”, as the difference in serious bleeding failed to meet the pre-specified margin for non-inferiority. The authors, however, make an appropriately nuanced interpretation regarding the sliding scale of risk for bleeding, and suggest patients with lower platelet counts, and those likely to require a platelet transfusion in the near future for other clinical indications, represent the most judicious population in which to consider prophylactic transfusion.
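
To make the “negative” framing concrete, here is a minimal sketch of a non-inferiority comparison using the point estimates above – the margin and the use of an absolute risk difference are illustrative choices for this sketch, not the trial’s pre-specified analysis:

```python
# Minimal sketch of a non-inferiority check on the absolute risk difference.
# The 3.5-percentage-point margin is a hypothetical choice for illustration;
# it is not the margin or scale pre-specified in the trial.
import math

events_no_tx, n_no_tx = 9, 185   # grade 3-4 bleeding without prophylaxis
events_tx, n_tx = 4, 188         # grade 3-4 bleeding with prophylaxis

p1, p2 = events_no_tx / n_no_tx, events_tx / n_tx
diff = p1 - p2                                       # ~2.7 percentage points
se = math.sqrt(p1 * (1 - p1) / n_no_tx + p2 * (1 - p2) / n_tx)
upper = diff + 1.96 * se                             # upper bound of 95% CI

margin = 0.035                                       # hypothetical margin
print(f"risk difference {diff:.3f}, 95% CI upper bound {upper:.3f}")
print("non-inferior" if upper < margin else "non-inferiority not shown")
```

The point estimate may look small, but the upper bound of the confidence interval is what gets compared against the margin – which is how a modest absolute difference becomes a “failed” non-inferiority trial.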

“Platelet Transfusion before CVC Placement in Patients with Thrombocytopenia”

https://www.nejm.org/doi/full/10.1056/NEJMoa2214322

When ChatGPT Writes a Research Paper

It is safe to say the honeymoon phase of large language models has started to fade a bit. Yes, they can absolutely pass a medical licensing examination when given carefully constructed prompts. The focus now turns to practical applications – like, in this example, using ChatGPT to write an entire scientific paper for you!

There is no reason to go through the details of the paper, the content, the findings, or any aspect of fruit and vegetable consumption. It is linked only to prove that it exists, and was written in its entirety by an LLM. To create the article, the authors used prompts containing the actual data set, prompts for an introduction, summary tables, and a discussion – impressively, as part of an automated prompting engine written by the authors, not just a laborious manual process. The initial output was not, as you might expect, entirely appropriate, requiring substantial re-prompting and revision – but, in the end, as you may see, the result is a paper basically indistinguishable from undergraduate- or graduate-student-level work.
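
As a rough illustration of what such a prompt chain might look like – again using the OpenAI Python client, with the steps, wording, model, and file name as assumptions for illustration rather than the authors’ actual “prompting engine”:

```python
# Minimal sketch of chaining prompts to draft paper sections from a dataset.
# Section order, wording, model, and the data file are illustrative only;
# this is not the authors' automated prompting engine.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system",
            "content": "You are drafting sections of a scientific manuscript."}]

def ask(prompt: str) -> str:
    """Send a prompt and keep the reply in the running conversation context."""
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model="gpt-4o", messages=history)
    text = reply.choices[0].message.content
    history.append({"role": "assistant", "content": text})
    return text

dataset_csv = open("study_data.csv").read()   # hypothetical data file

sections = {}
sections["results"] = ask(f"Here is the dataset:\n{dataset_csv}\n"
                          "Summarize the key findings as a results section.")
sections["tables"] = ask("Produce summary tables for those findings.")
sections["introduction"] = ask("Write an introduction motivating this analysis.")
sections["discussion"] = ask("Write a discussion, noting limitations.")

print("\n\n".join(sections.values()))
```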

There were, of course, hallucinations, banal unfounded declarations, and the expected fabricated references. But, considering that a year or two ago no one would have seriously suggested an LLM could write any semblance of a robust research paper, this is still fairly amazing. Considering this sort of writing is close to peak intellectual accomplishment, it’s fair to say similar automated techniques may replace a great deal of lesser content generation.

“The Impact of Fruit and Vegetable Consumption and Physical Activity on Diabetes Risk among Adults”
https://www.nature.com/articles/d41586-023-02218-z