Another Taste of the Future

Putting my Emergency Informatics hat back on for a day, I’d like to highlight another piece of work that brings us, yet again, another step closer to being replaced by computers.

Or, at the minimum, being highly augmented by computers.

There are multitudinous clinical decision instruments available to supplement physician decision-making.  The unifying element of most instruments, however, is the requirement for manual physician input.  This interruption of clinical flow reduces their acceptability in practice, and impedes the knowledge translation these tools are meant to accomplish.

However, since most clinicians are now documenting in Electronic Health Records, we’re already entering the information required for most decision instruments into the patient record.  Usually, this is a combination of structured (click click click) and unstructured (type type type) data.  Structured data is easy for clinical calculators to work with, but has none of the richness communicated by freely typed narrative.  Therefore, clinicians much prefer typed narrative, at the expense of EHR data quality.

This small experiment out of Cincinnati implemented a natural-language processing and machine-learning automated method to collect information from the EHR.  Structured and unstructured data from 2,100 pediatric patients with abdominal pain were analyzed to extract the elements needed to calculate the Pediatric Appendicitis Score.  Appropriateness of the Pediatric Appendicitis Score aside, their method performed reasonably well.  It picked up about 87% of the elements of the Score from the record and, when it did so, was correct about 86% of the time.  However, this was performed retrospectively – and the authors state this processing would still be delayed by hours following the initial encounter.
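
For reference, once the elements are extracted, the downstream arithmetic is trivial – the Pediatric Appendicitis Score is just a weighted sum of eight findings.  A minimal sketch in Python (weights per the original Samuel derivation; the NLP extraction, which is the actual hard part, is not shown):

```python
# Pediatric Appendicitis Score (Samuel, 2002): a weighted sum of
# eight clinical elements, for a maximum of 10 points.
PAS_WEIGHTS = {
    "cough_percussion_hopping_tenderness": 2,
    "rlq_tenderness": 2,
    "pain_migration": 1,
    "anorexia": 1,
    "nausea_or_vomiting": 1,
    "fever": 1,
    "leukocytosis": 1,    # WBC > 10,000/uL
    "neutrophilia": 1,    # ANC > 7,500/uL
}

def pediatric_appendicitis_score(elements: dict) -> int:
    """Sum the weights of the elements flagged present.

    `elements` maps element name -> True/False, e.g. as emitted by an
    NLP pipeline over the EHR narrative; missing keys count as absent.
    """
    return sum(w for name, w in PAS_WEIGHTS.items() if elements.get(name))

# Example: RLQ tenderness, anorexia, and leukocytosis present.
print(pediatric_appendicitis_score(
    {"rlq_tenderness": True, "anorexia": True, "leukocytosis": True}
))  # -> 4
```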

So, we’re not quite yet at the point where a parallel process monitors system input and provides real-time diagnostic guidance – but, clearly, this is a window into the future.  The theory:  if an automated process could extract the data required to calculate the score, physicians might be more likely to integrate the score into their practice – thusly leading to higher-quality care through more accurate risk-stratification.

I, for one, welcome our new computer overlords.

“Developing and evaluating an automated appendicitis risk stratification algorithm for pediatric patients in the emergency department”

Replace Us With Computers!

In a preview to the future – who performs better at predicting outcomes, a physician, or a computer?

Unsurprisingly, it’s the computer – and the unfortunate bit is we’re not exactly going up against Watson or the hologram doctor from the U.S.S. Voyager here.

This is Jeff Kline, showing off his rather old, not terribly sophisticated “attribute matching” software.  This software, created back in 2005-ish, is based on a database he created of acute coronary syndrome and pulmonary embolism patients.  He determined a handful of most-predictive variables from this set, and then created a tool that allows physicians to input those specific variables from a newly evaluated patient.  The tool then finds the exact matches in the database and spits back a probability estimate based on the historical reference set.
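
The “attribute matching” concept itself is almost embarrassingly simple – no regression, no machine learning, just lookup and counting.  A rough sketch of the idea (the variable names and reference data here are hypothetical; Kline’s actual software and feature set are not reproduced):

```python
from typing import Iterable, Optional

def attribute_match_probability(
    patient: dict,
    reference_db: Iterable[dict],
    attributes: list,
    outcome_key: str = "had_pe",
) -> Optional[float]:
    """Pretest probability by exact attribute matching.

    Finds every reference patient whose values on the chosen attributes
    exactly match the new patient, and returns the fraction of those
    matches with the outcome.  Returns None if no exact match exists.
    """
    matches = [
        r for r in reference_db
        if all(r[a] == patient[a] for a in attributes)
    ]
    if not matches:
        return None
    return sum(1 for r in matches if r[outcome_key]) / len(matches)

# Hypothetical use: match on a handful of most-predictive variables.
db = [
    {"age_band": "40-49", "hr_over_100": False, "prior_vte": False, "had_pe": False},
    {"age_band": "40-49", "hr_over_100": False, "prior_vte": False, "had_pe": True},
    {"age_band": "40-49", "hr_over_100": False, "prior_vte": False, "had_pe": False},
]
new_patient = {"age_band": "40-49", "hr_over_100": False, "prior_vte": False}
print(attribute_match_probability(new_patient, db,
                                  ["age_band", "hr_over_100", "prior_vte"]))
# -> 0.333... (1 of 3 exact historical matches had PE)
```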

He sells software based on the algorithm and probably would like to see it perform well.  Sadly, it only performs “okay”.  But, it beats physician gestalt, which is probably better ranked as “poor”.  In their prospective evaluation of 840 cases of acute dyspnea or chest pain of uncertain immediate etiology, physicians (mostly attendings, then residents and midlevels) grossly over-estimated the prevalence of ACS and PE.  Physicians had a mean and median pretest estimate for ACS of 17% and 9%, respectively, and the software guessed 4% and 2%.  Actual retail price:  2.7%.  For PE, physicians were at mean 12% and median 6%, with the software at 6% and 5%.  True prevalence: 1.8%.

I don’t choose this article to highlight Kline’s algorithm, nor the comparison between the two.  Mostly, it’s a fascinating observational study of how poor physician estimates are – far over-stating risk.  Certainly, with this foundation, it’s no wonder we’re over-testing folks in nearly every situation.  The future of medicine involves the next generation of similar decision-support instruments – and we will all benefit.

“Clinician Gestalt Estimate of Pretest Probability for Acute Coronary Syndrome and Pulmonary Embolism in Patients With Chest Pain and Dyspnea.”
http://www.ncbi.nlm.nih.gov/pubmed/24070658

The “Ottawa SAH Rule”

This is a rather dangerous article for many reasons.  Firstly, it’s published in a high-impact journal and received a fair bit of coverage in the news media.  Secondly, it concludes its discussion by suggesting this ought to be adopted as a standardized rule for the evaluation of acute headache – this isn’t just a descriptive study on features of subarachnoid hemorrhage, it’s been given an official-sounding title, the “Ottawa SAH Rule”.  Because of this, there’s significant potential for the rule described here to be adopted as widespread practice.

Therefore – it better be nearly perfect.

This is a prospective cohort from 10 university-affiliated Canadian hospitals.  They looked at non-traumatic headaches reaching maximal intensity within 1 hour, not part of a recurrent headache syndrome, and found 132 patients with SAH out of 2,131 assessed.  They specifically gathered information on three previously-derived prediction rules and found none of them were 100% sensitive – so they chose the required elements from each to reach 100% (95% CI 97.2-100) sensitivity.  The cost of this 100% sensitivity?  Degeneration of specificity from 28-35% in the three individual rules down to 15% (95% CI 13.8-16.9) in the final rule.  The authors observed application of the derived rule would have decreased investigations for SAH from 84% of the enrolled cohort down to 74%, and thusly conclude their rule is superior to routine clinical practice, maintaining 100% sensitivity while decreasing resource utilization.

I think their inclusion criteria are fine – a rapid-onset, severe, atraumatic headache is the classical population of interest.  Patients without this feature have such a low incidence of SAH that it’s unreasonable to evaluate for it.  Their outcome measures, unfortunately, were a little softer.  The positive diagnoses are reasonable – CT-proven SAH, or a positive lumbar puncture with a corresponding source identified on cerebral angiography.  However, only 82% underwent CT and 39% underwent LP, with six-month telephone follow-up – and a small number were lost to follow-up.  Many would argue that CT alone is not sufficient for ruling out SAH, and six-month survival is a limited proxy.  This weakens the claim of 100% sensitivity.

Then, of course, a 15% specificity is awful.  This isn’t necessarily a criticism of the authors, but more a recognition of the limitation of distilling diverse clinical data into concise decision instruments.  Nineteen different patient features were significantly different between the SAH and no-SAH groups; reducing these to just six discards so much information that an instrument designed for a complex clinical prediction is bound to fail.  There were 1,694 false positives by the rule, compared with 132 true positives.  If this rule is applied without the strict exclusion criteria specified in the publication, there may be a huge number of inappropriate investigations.
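
The specificity arithmetic is worth making explicit, using only the figures quoted above:

```python
# Confusion-matrix arithmetic from the reported Ottawa SAH Rule figures.
n_total = 2131            # headache patients assessed
n_sah = 132               # subarachnoid hemorrhages, all captured by the rule
rule_pos_no_sah = 1694    # false positives

n_no_sah = n_total - n_sah               # 1,999 patients without SAH
true_neg = n_no_sah - rule_pos_no_sah    # 305 correctly ruled out

print(f"sensitivity: {n_sah / n_sah:.1%}")        # 100.0%
print(f"specificity: {true_neg / n_no_sah:.1%}")  # 15.3%
```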

Then, the rate-of-investigation comparison is also probably invalid.  These institutions underwent a specific 1-hour orientation to the study being performed and were actively involved in gathering clinical data for the study.  I’m certain the 84% rate of investigation observed was inflated by the ongoing research at hand.  The previous Perry study from 2011 had an evaluation rate of 57%, so it’s hard for me to believe the statistics from the current publication.

Finally, the kappa values for inter-observer agreement were rather mixed.  Based on only sixty cases where two physicians evaluated the same patient, four of the six final rule elements had kappas between 0.44 and 0.59, representing only moderate agreement.  This is a significant threat to the internal validity of the underlying data in support of their rule.
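
For those rusty on kappa – it is simply observed agreement corrected for the agreement expected by chance.  A quick sketch for a dichotomous rule element rated by two physicians (the counts below are hypothetical, chosen to land in the reported range):

```python
def cohens_kappa(both_yes: int, both_no: int, only_a: int, only_b: int) -> float:
    """Cohen's kappa for two raters on a yes/no item.

    kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_expected is the chance agreement implied by each rater's marginals.
    """
    n = both_yes + both_no + only_a + only_b
    p_obs = (both_yes + both_no) / n
    a_yes = (both_yes + only_a) / n   # rater A's "yes" rate
    b_yes = (both_yes + only_b) / n   # rater B's "yes" rate
    p_exp = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical 60-patient double assessment of one rule element:
# 48/60 raw agreement, but much of that is chance.
print(round(cohens_kappa(both_yes=15, both_no=33, only_a=6, only_b=6), 2))
# -> 0.56, i.e. only "moderate" agreement
```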

Overall, yes – the elements they identify through their observational cohort are likely to capture most cases of SAH.  However, the limitations of this study and the poor specificity make me reluctant to buy in completely – and certainly not adopt it as a standardized “rule”.

“Clinical Decision Rules to Rule Out Subarachnoid Hemorrhage for Acute Headache”
http://www.ncbi.nlm.nih.gov/pubmed/24065011

“Distracting”, But Not Distracting

Cervical spine clearance is always a fun topic.  Once upon a time, it was plain radiography, clinical re-assessment, and functional testing with dynamic radiography.  Now, a zero-miss culture has turned us mostly to CT – and, beyond that, some even advocate for MRI.

But, as far as clinical clearance of the cervical spine goes, we usually use the NEXUS criteria or the Canadian C-Spine criteria.  One of the elements of the NEXUS criteria that is, essentially, subjectively defined is the presence of “distracting injury”.  Many have questioned the inclusion of this element.
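
As a reminder of the structure in play, NEXUS is five criteria, all of which must be negative to forgo imaging.  A minimal sketch (field names mine):

```python
def nexus_low_risk(p: dict) -> bool:
    """True if all five NEXUS low-risk criteria are met,
    i.e. the cervical spine may be cleared clinically."""
    return not any([
        p["midline_cervical_tenderness"],
        p["focal_neurologic_deficit"],
        p["altered_alertness"],
        p["intoxication"],
        p["painful_distracting_injury"],   # the subjective element at issue
    ])

patient = {
    "midline_cervical_tenderness": False,
    "focal_neurologic_deficit": False,
    "altered_alertness": False,
    "intoxication": False,
    "painful_distracting_injury": True,    # e.g. femur fracture
}
print(nexus_low_risk(patient))  # False – imaging, per the rule as written
```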

These authors looked at cervical spine clearance in the presence of “distracting injury”, which, for the purpose of research protocols, was essentially a fracture somewhere, an intracranial injury, or an intra-abdominal organ injury.  They found, when assessing a GCS 14 or 15 trauma patient, even in the presence of these other injuries, clinical examination picked up 85 of 86 cervical spine injuries.  The one miss did not report midline cervical spine tenderness – with humerus and mandible fractures, as well as a frontal ICH – and had a lateral mass fracture of the second cervical vertebra.

So, clinical examination is mostly reliable in the presence of a “distracting injury”.  I think the best interpretation of this study is that “distracting injury” has to be determined on a case-by-case basis – one patient might be a reliable reporter in the presence of a long-bone fracture, while another might need such a high level of pain control for initial management that they are no longer aware of their cervical spine injury.  It’s fairly clear it is reasonable to remove the cervical collar and forgo imaging for most patients who can be adequately clinically assessed.

“Clinical clearance of the cervical spine in patients with distracting injuries: It is time to dispel the myth”
http://www.ncbi.nlm.nih.gov/pubmed/23019677

The EM Lit of Note PE Decision Tree

A couple weeks back I posted regarding a study where even intermediate- and high-risk patients with suspected PE had negative CTA in the presence of low d-Dimers.  Based on that post, I’ve put together a rough decision tree encapsulating how I (currently) prefer to approach the diagnosis of pulmonary embolism.

Note that “Scan for PE” really means “offer patient scan for PE”, considering relevant diagnostic uncertainty and risks in a shared decision-making process.  “Other reason why d-Dimer would be elevated” takes into account clinical judgement regarding the uselessness of d-Dimer as an acute-phase reactant or inflammatory marker; many “sick” patients will have elevated d-Dimers, obviating its value as a one-way screen-out.  Also, this chart does not account for any medicolegal liability risks – a wonderful perk of practicing in Texas.

Follow-up:  Seth Trueger and John Greenwood pointed out on the original that there are some specific moderate- and high-risk cases that satisfy PERC criteria, and perhaps the risk-stratification step should occur before application of PERC, as is traditionally done.  Fair enough!  They also note the EMCrit flow-chart begins with an exhortation of “Did you really care about PE?” – which, I’d say, is approximated by my value judgement of “Bad miss?” after starting to consider PE.  Finally, Scott Weingart chimed in to suggest, for patients in whom you’re playing the minimal-harm game for unexciting pulmonary emboli, a bedside ultrasound to quickly check for an occult DVT that might cause them to come back with clinically significant venous thromboembolism.
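
For the curious, folding those amendments in, the flow approximates something like the following.  This is a sketch only, with names, branch order, and inputs mine, not a validated pathway:

```python
def pe_workup(risk: str, perc_negative: bool, ddimer_low: bool,
              other_ddimer_cause: bool, bad_miss: bool) -> str:
    """Rough sketch of the post's PE approach, amended per the follow-up:
    ask whether PE matters at all, risk-stratify, apply PERC only in
    low-risk patients, then use the d-Dimer.  'Scan' always means *offer*
    CTA via shared decision-making."""
    if not bad_miss:
        return "No testing – PE not a clinically important consideration"
    if risk == "low" and perc_negative:
        return "No testing – PERC negative in a low-risk patient"
    if other_ddimer_cause:
        return "d-Dimer unhelpful – scan (it will be uninterpretably elevated)"
    if ddimer_low:
        return "No scan – low d-Dimer"
    return "Scan (offer CTA, shared decision-making)"

print(pe_workup(risk="intermediate", perc_negative=False, ddimer_low=True,
                other_ddimer_cause=False, bad_miss=True))
# -> "No scan – low d-Dimer"
```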

Simple SBI Prediction – Hopeless

It remains a noble endeavour to attempt to identify the risk of serious bacterial infections in children.  That said, many have tried, and many have failed.

These authors from the Netherlands and the United Kingdom try, yet again.  They note the best-performing decision instrument incorporates 26 variables – which they feel is unworkably unwieldy in a clinical setting – and attempt to derive their own, tighter instrument.  Unfortunately, the clinical variables that shake out of their prediction methodology all have odds ratios less than 6 – leading to a prediction model that can be calibrated for either horrible sensitivity or horrible specificity.  The sensitive model will lead to over-testing of an otherwise well population, and the specific model will essentially pick up only the cases that were clinically obvious.
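
The underlying problem is easy to demonstrate: with odds ratios this small, the score distributions in the SBI and non-SBI groups overlap so heavily that no threshold performs well.  A toy simulation (entirely synthetic numbers, for illustration only):

```python
import random

random.seed(0)

# Toy model: five independent binary predictors, each with an odds
# ratio of 6 for SBI (prevalence 0.4 in SBI cases vs 0.1 in well
# children) – about as strong as the predictors in question.
def scores(n_patients: int, sbi: bool) -> list:
    p = 0.4 if sbi else 0.1
    return [sum(random.random() < p for _ in range(5))
            for _ in range(n_patients)]

sbi_scores = scores(2000, sbi=True)
well_scores = scores(2000, sbi=False)

# Sweep the cut-off: sensitivity and specificity trade off steeply,
# and no threshold delivers both.
for cutoff in range(1, 5):
    sens = sum(s >= cutoff for s in sbi_scores) / len(sbi_scores)
    spec = sum(s < cutoff for s in well_scores) / len(well_scores)
    print(f">={cutoff} predictors: sensitivity {sens:.0%}, specificity {spec:.0%}")
# Roughly: >=1 gives ~92%/~59%; >=2 gives ~66%/~92% – over-testing the
# well, or catching only the obvious.
```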

It’s become pretty clear over the years that attempting to reduce the number of discrete clinical variables in the febrile SBI decision instrument is a dead-end strategy.  Complex clinical problems simply defy dimension reduction.  Furthermore, the true test of a decision instrument ought not to be just statistical evaluation in a vacuum, but comparison with clinical judgement.

“Clinical prediction model to aid emergency doctors managing febrile children at risk of serious bacterial infections: diagnostic study”
www.bmj.com/content/346/bmj.f1706

Falling Short on Pneumonia Prediction

These authors address a real problem: which coughing adults have pneumonia?  Unfortunately, after evaluating 2,820 of them – they still don’t really know.

This is an interesting article because it pulls together a symptom profile along with two of the other non-specific inflammatory markers being touted as important diagnostic tools: CRP and procalcitonin.  Primary care physicians enrolled adults presenting with acute cough, and used plain radiography as their gold standard for the diagnosis of pneumonia.

In short:

  • “Symptoms and signs” suggestive of pneumonia (fever, tachycardia, abnormal lung exam) all had positive ORs between 2.0 and 5.3, and combined offered an AUC of 0.70.
  • Adding CRP as a continuous variable to symptoms and signs gave an OR of 1.2 and increased the AUC to 0.78.
  • Adding procalcitonin as a continuous variable to symptoms and signs gave an OR of 1.1 and increased the AUC to 0.72.

Using CRP as a dichotomous cut-off at 30 mg/L, in addition to the independent symptom predictors, gave them the discriminating ability to produce low-, intermediate-, and high-risk groups: 0.7%, 3.8%, and 18.2% chance of pneumonia, respectively.  A high-risk group where fewer than one in five have the disease?  The authors recommend consideration of empiric antibiotic therapy in this group, but I prefer their other recommendation to consider radiography as confirmation in this subset.  The remainder ought to be candidates for observation, as false positives and harms from additional testing are likely to outweigh true positives.
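
For illustration, here is how such a three-tier stratification might be operationalized.  The sign-count thresholds below are hypothetical – the published model’s actual coefficients are not reproduced here:

```python
def pneumonia_risk_group(n_signs: int, crp_mg_per_l: float) -> str:
    """Illustrative three-tier stratification combining the independent
    symptom/sign predictors with the dichotomous CRP cut-off at 30 mg/L.
    Reported risk of pneumonia by tier: low 0.7%, intermediate 3.8%,
    high 18.2%."""
    crp_high = crp_mg_per_l >= 30
    if n_signs == 0 and not crp_high:
        return "low risk – observe"
    if n_signs >= 2 and crp_high:
        return "high risk – confirm with radiography"
    return "intermediate risk"

print(pneumonia_risk_group(n_signs=2, crp_mg_per_l=58))
# -> "high risk – confirm with radiography"
```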


Again, refuting the terrible JAMA distortion, procalcitonin had no useful discriminatory diagnostic value.


“Use of serum C reactive protein and procalcitonin concentrations in addition to symptoms and signs to predict pneumonia in patients presenting to primary care with acute cough: diagnostic study”

How To Evaluate Decision Instruments

This lovely editorial by Steven Green from Loma Linda succinctly summarizes the limitations of clinical decision instruments.  Decision instruments, referred to in this article as decision “rules”, are potentially valuable distillations of data from large research cohorts meant to concisely address vital clinical concerns.  These include such well-known instruments as NEXUS, PERC, Centor, Alvarado, Wells, and Geneva.

He describes rigorous derivation, external validation, and ease of application as important criteria.  However, the most important topics he addresses are the related issues of “1-way” versus “2-way” application, and whether the rule improves upon pre-existing clinical practice.  A “1-way” decision instrument informs clinicians only when its criteria are all met – such as the PERC rule.  A patient who fails the PERC rule does not necessarily need any additional testing, due to its low specificity.  The NEXUS criteria, on the other hand, constitute a “2-way” decision rule – where use in appropriately selected patients typically leads to radiography if the criteria are not met.
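
The “1-way” idea is clearest in code.  A sketch using PERC (the eight criteria are standard; the asymmetric return is the point – a failed 1-way rule is silent, not a mandate to test):

```python
from typing import Optional

PERC_CRITERIA = [
    "age_50_or_older", "hr_100_or_higher", "sao2_below_95",
    "hemoptysis", "estrogen_use", "prior_vte",
    "unilateral_leg_swelling", "recent_surgery_or_trauma",
]

def perc_one_way(patient: dict) -> Optional[str]:
    """1-way application: only an all-negative result carries meaning."""
    if not any(patient[c] for c in PERC_CRITERIA):
        return "PERC negative: no further testing for PE"
    return None   # rule is silent – positivity does NOT mandate testing

patient = {c: False for c in PERC_CRITERIA}
patient["hr_100_or_higher"] = True
print(perc_one_way(patient))   # None – back to clinical judgement
```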

The danger, however, is the natural propensity to use a “1-way” rule like a “2-way” rule.  His example of this error is the PECARN blunt abdominal trauma article, for which I previously expressed concerns.  In the PECARN blunt trauma instrument, the specificity of the derivation was actually lower than the performance of the clinical gestalt of the physicians involved.  This means the authors recommend its use only as a “1-way” rule, based on sensitivity.  However, if the cognitive error is made to apply it as a “2-way” rule, CT scanning will increase by 13%.  Then, unfortunately, even used as a “1-way” rule, the PECARN instrument has only 97% sensitivity, compared with clinician gestalt at 99% sensitivity.  This means that, if implemented as routine practice, the PECARN instrument may produce a non-trivial number of misses while potentially increasing scanning.  This illustrates his point regarding a poorly-designed decision rule, despite the statistical power of the cohort evaluated.

Overall, a lovely read regarding how to properly evaluate and apply decision instruments.

“When Do Clinical Decision Rules Improve Patient Care?”
www.ncbi.nlm.nih.gov/pubmed/23548403

The NICE Traffic Light Fails

Teasing out serious infection in children – while minimizing testing and unnecessary interventions – remains a challenge.  To this end, the National Institute for Health and Clinical Excellence in the United Kingdom created a “Traffic Light” clinical assessment tool, which uses colour, activity, respiratory, hydration, and other features to give a low-, intermediate-, or high-risk assessment.

These authors attempted to validate the tool by retrospectively applying it to a prospective registry of over 15,000 febrile children aged less than 5 years.  The primary outcome was correctly classifying a serious bacterial infection as intermediate- or high-risk.  And the answer: 85.8% sensitivity and 28.5% specificity.  Meh.

108 of the 157 missed cases of SBI were urinary tract infections – for which the authors suggest perhaps urinalysis could be added to the NICE traffic light.  This would increase sensitivity to 92.1%, but drop specificity to 22.3% – if you agree with the blanket categorization of UTI as SBI.

Regardless, the AUC for SBI was 0.64 without the UA and 0.61 with the UA – not good at all.

“Accuracy of the ‘traffic light’ clinical decision rule for serious bacterial infections in young children with fever: a retrospective cohort study”
www.ncbi.nlm.nih.gov/pubmed/23407730

Pediatric Blunt Trauma Remains Confounding

The latest output from the Pediatric Emergency Care Applied Research Network is a clinical decision instrument intended to assist clinicians in managing pediatric blunt abdominal trauma.

Like previous PECARN studies, this is a multi-center, prospective, observational study conducted in tertiary pediatric emergency departments.  This study enrolled 12,044 children with blunt trauma and prospectively collected structured data regarding their mechanism, external injuries, and physiologic variables.  Using the magic of statistical partitioning, the authors derived a decision instrument for risk-stratifying a patient as “very low risk for intra-abdominal injury requiring acute intervention.”  The resulting prediction rule is 97.0% sensitive, missing 6 of the 203 intervention-requiring injuries.
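
“Statistical partitioning” here is binary recursive partitioning – repeatedly splitting the cohort on whichever variable best separates injured from uninjured, until the terminal nodes are acceptably pure.  A generic sketch of the technique using scikit-learn, with synthetic stand-in data (this is not the authors’ actual model or variables):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)

# Synthetic stand-ins for binary predictors (think: abdominal wall
# trauma, GCS < 14, abdominal tenderness) and a rare outcome.
n = 5000
X = rng.integers(0, 2, size=(n, 3))
risk = 0.002 + 0.03 * X[:, 0] + 0.02 * X[:, 1]   # toy risk model
y = rng.random(n) < risk

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=200)
tree.fit(X, y)

# The printed tree *is* the "decision instrument": a patient negative
# on every splitting variable lands in the lowest-risk terminal node.
print(export_text(tree, feature_names=["wall_trauma", "gcs_lt_14", "tenderness"]))
```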

This is critically valuable data – but, as a decision instrument in a zero-miss environment, I’m not sure it helps.  The authors note that use of their CT decision instrument would actually have increased resource utilization if retrospectively applied to the derivation cohort, if the requirement is held that a patient be negative for every variable.  If the threshold is raised to 1 or 2 variables present, sensitivity drops to 82% and 77%, respectively.  Only about half received a CT scan, and a small percentage were lost to follow-up – though, given the outcome of “injuries requiring intervention”, the methodology is reasonable.  However, because intervention-requiring injuries represented only 26% of all radiographically-identified intra-abdominal injuries, this study is still going to be dismissed out of hand by folks who want to identify all injuries, not just intervention-requiring ones.  After all, a grade 1 splenic laceration may be intervention-free, but remains important regarding activity restrictions to prevent future morbidity.

The authors also note these findings require external validation – though where they’re going to find another pediatric emergency care network to enroll 12,000 patients is anyone’s guess!

“Identifying Children at Very Low Risk of Clinically Important Blunt Abdominal Injuries”
http://www.ncbi.nlm.nih.gov/pubmed/23375510