Nihilism, Emergency Medicine and the art of doing nothing at emnerd.com
Category: Clinical Decision Rules
Another Taste of the Future
Putting my Emergency Informatics hat back on for a day, I’d like to highlight another piece of work that brings us, yet again, another step closer to being replaced by computers.
Or, at the minimum, being highly augmented by computers.
There are multitudinous clinical decision instruments available to supplement physician decision-making. The unifying element of most, however, is that they require manual physician input. This interruption of clinical flow reduces their acceptability, and impedes the knowledge translation these tools are meant to provide.
However, since most clinicians now use Electronic Health Records, we are already entering the information required for most decision instruments into the patient record. Usually, this is a combination of structured (click click click) and unstructured (type type type) data. Structured data is easy for clinical calculators to work with, but has none of the richness communicated by freely typed narrative. Clinicians therefore much prefer typed narrative, at the expense of EHR data quality.
This small experiment out of Cincinnati implemented an automated natural-language processing and machine-learning method to collect information from the EHR. Structured and unstructured data from 2,100 pediatric patients with abdominal pain were analyzed to extract the elements needed to calculate the Pediatric Appendicitis Score. Appropriateness of the Pediatric Appendicitis Score aside, their method performed reasonably well: it picked up about 87% of the elements of the Score from the record, and was correct about 86% of the time when it did so. However, this was performed retrospectively, and the authors state the processing would still lag hours behind the initial encounter.
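As a concrete anchor for what such a pipeline is ultimately computing, here is a minimal sketch of the scoring step itself. The element names and weights follow the published Pediatric Appendicitis Score as generally described; the `findings` dictionary is a hypothetical stand-in for whatever the NLP extraction actually emits.

```python
# A minimal sketch of the scoring step an automated extraction pipeline
# would feed. Weights follow the published Pediatric Appendicitis Score;
# the findings dict is a hypothetical stand-in for the NLP output.

PAS_WEIGHTS = {
    "migration_of_pain": 1,
    "anorexia": 1,
    "nausea_or_vomiting": 1,
    "rlq_tenderness": 2,
    "pain_with_cough_percussion_hopping": 2,
    "fever": 1,
    "leukocytosis": 1,   # WBC >= 10,000/uL
    "neutrophilia": 1,
}

def pediatric_appendicitis_score(findings):
    """Sum the weights of the elements recorded as present (maximum 10)."""
    return sum(weight for element, weight in PAS_WEIGHTS.items()
               if findings.get(element))

example = {"rlq_tenderness": True, "fever": True, "leukocytosis": True}
print(pediatric_appendicitis_score(example))  # 4
```

The hard part, of course, is not this arithmetic but reliably populating the dictionary from free-text narrative, which is precisely where the 87%/86% figures above come in.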
So, we’re not quite at the point where a parallel process monitors system input and provides real-time diagnostic guidance, but this is clearly a window into the future. The theory: if an automated process can extract the data required to calculate the score, physicians might be more likely to integrate the score into their practice, leading to higher-quality care through more accurate risk-stratification.
I, for one, welcome our new computer overlords.
“Developing and evaluating an automated appendicitis risk stratification algorithm for pediatric patients in the emergency department”
Replace Us With Computers!
In a preview to the future – who performs better at predicting outcomes, a physician, or a computer?
Unsurprisingly, it’s the computer – and the unfortunate bit is we’re not exactly going up against Watson or the hologram doctor from the U.S.S. Voyager here.
This is Jeff Kline, showing off his rather old, not terribly sophisticated “attribute matching” software. This software, created back in 2005 or so, is based on a database he built of acute coronary syndrome and pulmonary embolism patients. From this set, he determined a handful of the most predictive variables, and then created a tool that allows physicians to input those specific variables for a newly evaluated patient. The tool then finds the exact matches in the database and spits back a probability estimate based on the historical reference set.
He sells software based on the algorithm and probably would like to see it perform well. Sadly, it only performs “okay”. But, it beats physician gestalt, which is probably better ranked as “poor”. In their prospective evaluation of 840 cases of acute dyspnea or chest pain of uncertain immediate etiology, physicians (mostly attendings, then residents and midlevels) grossly over-estimated the prevalence of ACS and PE. Physicians had a mean and median pretest estimate for ACS of 17% and 9%, respectively, and the software guessed 4% and 2%. Actual retail price: 2.7%. For PE, physicians were at mean 12% and median 6%, with the software at 6% and 5%. True prevalence: 1.8%.
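For the curious, the core of “attribute matching” as described is simple enough to sketch in a few lines: find the exact matches on the chosen variables in a historical reference set, and report the outcome rate among those matches as the pretest probability. The variable names and reference records below are invented for illustration; they are not Kline’s actual database.

```python
# A sketch of "attribute matching": exact-match lookup in a historical
# reference set, returning the outcome rate among matches as the pretest
# probability. Variables and records are invented for illustration.

def attribute_match_probability(patient, reference_set, variables):
    matches = [record for record in reference_set
               if all(record[v] == patient[v] for v in variables)]
    if not matches:
        return None  # no exact match in the database
    return sum(record["outcome"] for record in matches) / len(matches)

reference_set = [
    {"age_band": "40-49", "typical_pain": False, "risk_factors": 0, "outcome": 0},
    {"age_band": "40-49", "typical_pain": False, "risk_factors": 0, "outcome": 0},
    {"age_band": "40-49", "typical_pain": False, "risk_factors": 0, "outcome": 1},
    {"age_band": "60-69", "typical_pain": True,  "risk_factors": 2, "outcome": 1},
]
patient = {"age_band": "40-49", "typical_pain": False, "risk_factors": 0}
print(attribute_match_probability(
    patient, reference_set, ["age_band", "typical_pain", "risk_factors"]))
```

The obvious limitation of any exact-match method is that sparse regions of the database return estimates based on very few patients, or none at all.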
I didn’t choose this article to highlight Kline’s algorithm, or even the comparison between the two. Mostly, it’s a fascinating observational study of just how poor physician estimates are, grossly overstating risk. Certainly, with this foundation, it’s no wonder we’re over-testing folks in nearly every situation. The future of medicine involves the next generation of similar decision-support instruments, and we will all benefit.
“Clinician Gestalt Estimate of Pretest Probability for Acute Coronary Syndrome and Pulmonary Embolism in Patients With Chest Pain and Dyspnea.”
http://www.ncbi.nlm.nih.gov/pubmed/24070658
The “Ottawa SAH Rule”
This is a rather dangerous article for many reasons. Firstly, it’s published in a high-impact journal and received a fair bit of coverage in the news media. Secondly, it concludes its discussion by suggesting this ought to be adopted as a standardized rule for the evaluation of acute headache – this isn’t just a descriptive study on features of subarachnoid hemorrhage, it’s been given an official-sounding title, the “Ottawa SAH Rule”. Because of this, there’s significant potential for the rule described here to be adopted as widespread practice.
Therefore – it better be nearly perfect.
This is a prospective cohort from 10 university-affiliated Canadian hospitals. The authors looked at non-traumatic headaches reaching maximal intensity within 1 hour and not part of a recurrent headache syndrome, and found 132 patients with SAH out of 2,131 assessed. They specifically gathered information on three previously-derived prediction rules, found none of them were 100% sensitive, and so chose the required elements from each to reach 100% (95% CI 97.2-100) sensitivity. The cost of this 100% sensitivity? Degeneration of specificity, from 28-35% in the three individual rules down to 15% (95% CI 13.8-16.9) in the final rule. The authors observed that application of the derived rule would have decreased investigations for SAH from 84% to 74% of the enrolled cohort, and thus conclude their rule is superior to routine clinical practice, maintaining 100% sensitivity while decreasing resource utilization.
I think their inclusion criteria are fine; a rapid-onset, severe, atraumatic headache is the classical population of interest, and patients without this feature have such a low incidence of SAH that it’s unreasonable to evaluate for it. Their outcome measures, unfortunately, were a little softer. The positive diagnoses are reasonable: CT-proven SAH, or a positive lumbar puncture with a source identified on cerebral angiography. However, only 82% underwent CT and 39% underwent LP, with six-month telephone follow-up, and a small number were lost to follow-up. Many would argue that CT alone is not sufficient to rule out SAH, and six-month survival is a limited proxy. This weakens the claim of 100% sensitivity.
Then, of course, a 15% specificity is awful. This isn’t necessarily a criticism of the authors, but rather a recognition of the limitations of distilling diverse clinical data into concise decision instruments. Nineteen different patient features were significantly different between the SAH and no-SAH groups; reducing these to just six features discards so much information that an instrument designed for a complex clinical prediction is bound to fail. There were 1,694 false positives by the rule, compared with 132 true positives. If this rule is applied without the strict exclusion criteria specified in the publication, there may be a huge number of inappropriate investigations.
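A quick back-of-envelope check, using only the counts reported in the paper, shows how the 1,694 false positives and the quoted 15% specificity fit together:

```python
# Back-of-envelope check of the reported operating characteristics,
# using the cohort counts quoted in the paper.

total_cohort, sah_cases = 2131, 132
true_positives, false_positives = 132, 1694

sensitivity = true_positives / sah_cases                        # 132 / 132
specificity = 1 - false_positives / (total_cohort - sah_cases)  # 1 - 1694/1999
print(round(sensitivity, 3), round(specificity, 3))  # 1.0 0.153
```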
Then, the rate-of-investigation comparison is also probably invalid. These institutions underwent a specific 1-hour orientation to the study being performed, and were actively involved in gathering clinical data for it. I’m certain the 84% rate of investigation observed was inflated by the ongoing research at hand. The previous Perry study from 2011 had an evaluation rate of 57%, so it’s hard for me to believe the statistics from the current publication.
Finally, the kappa values for inter-observer agreement were rather mixed. Based on only sixty cases where two physicians evaluated the same patient, four of the six final rule elements had kappas between 0.44 and 0.59, representing only moderate agreement. This is a significant threat to the internal validity of the underlying data in support of their rule.
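For readers who want the statistic behind those numbers: Cohen’s kappa compares observed agreement against the agreement expected by chance alone. A minimal sketch, using an invented 2×2 agreement table rather than the study’s actual data:

```python
# Cohen's kappa for two raters making yes/no assessments. The 2x2 table
# counts below are invented for illustration, not the study's data.

def cohens_kappa(both_yes, yes_no, no_yes, both_no):
    n = both_yes + yes_no + no_yes + both_no
    p_observed = (both_yes + both_no) / n
    # Chance agreement: product of each rater's marginal "yes" rates,
    # plus the product of their marginal "no" rates.
    p_chance = ((both_yes + yes_no) * (both_yes + no_yes)
                + (no_yes + both_no) * (yes_no + both_no)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

print(round(cohens_kappa(20, 5, 10, 25), 2))  # 0.5 -- "moderate" agreement
```

Values in the 0.41-0.60 band are conventionally labeled “moderate” agreement, which is why kappas of 0.44-0.59 on rule elements are a genuine reliability concern.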
Overall, yes – the elements they identify through their observational cohort are likely to capture most cases of SAH. However, the limitations of this study and the poor specificity make me reluctant to buy in completely – and certainly not adopt it as a standardized “rule”.
“Clinical Decision Rules to Rule Out Subarachnoid Hemorrhage for Acute Headache”
http://www.ncbi.nlm.nih.gov/pubmed/24065011
“Distracting”, But Not Distracting
Cervical spine clearance is always a fun topic. Once upon a time, it was plain radiography, clinical re-assessment, and functional testing with dynamic radiography. Now, a zero miss culture has turned us mostly to CT – and, beyond that, even some advocate for MRI.
But, as far as clinical clearance of the cervical spine goes, we usually use the NEXUS criteria or the Canadian C-Spine criteria. One of the elements of the NEXUS criteria that is, essentially, subjectively defined is the presence of “distracting injury”. Many have questioned the inclusion of this element.
These authors looked at cervical spine clearance in the presence of “distracting injury”, which, for the purpose of research protocols, was essentially a fracture somewhere, an intracranial injury, or an intra-abdominal organ injury. They found that, when assessing a GCS 14 or 15 trauma patient, even in the presence of these other injuries, clinical examination picked up 85 of 86 cervical spine injuries. The one miss was a patient who did not report midline cervical spine tenderness, in the setting of humerus and mandible fractures as well as a frontal ICH, and who had a lateral mass fracture of the second cervical vertebra.
So, clinical examination is mostly reliable in the presence of a “distracting injury”. I think the best interpretation of this study is “distracting injury” has to be determined on a case-by-case basis – one patient might be a reliable reporter in the presence of long-bone fracture, while another might need such a high level of pain control for initial management they are no longer aware of their cervical spine injury. It’s fairly clear it is reasonable to remove the cervical collar and forgo imaging for most patients who can be adequately clinically assessed.
“Clinical clearance of the cervical spine in patients with distracting injuries: It is time to dispel the myth”
http://www.ncbi.nlm.nih.gov/pubmed/23019677
The EM Lit of Note PE Decision Tree
Simple SBI Prediction – Hopeless
It remains a noble endeavour to attempt to identify the risk of serious bacterial infections in children. That said, many have tried, and many have failed.
These authors from the Netherlands and the United Kingdom try, yet again. They note the best-performing decision instrument incorporates 26 variables, which they feel is too unwieldy for a clinical setting, and attempt to derive their own, tighter instrument. Unfortunately, the clinical variables that shake out of their prediction methodology all have odds ratios less than 6, leading to a prediction model that can be calibrated for either horrible sensitivity or horrible specificity. The sensitive model will lead to over-testing of an otherwise well population, and the specific model will essentially pick up only the cases that were clinically obvious.
It’s becoming pretty clear over the years that attempting to reduce the number of discrete clinical variables in the febrile SBI decision instrument is a dead-end strategy; complex clinical problems simply defy dimension reduction. Furthermore, the true test of a decision instrument ought not be just statistical evaluation in a vacuum, but comparison with clinical judgement.
“Clinical prediction model to aid emergency doctors managing febrile children at risk of serious bacterial infections: diagnostic study”
www.bmj.com/content/346/bmj.f1706
Falling Short on Pneumonia Prediction
These authors address a real problem: which coughing adults have pneumonia? Unfortunately, after evaluating 2,820 of them – they still don’t really know.
This is an interesting article because it pulls together a symptom profile along with two of the other non-specific inflammatory markers being touted as important diagnostic tools: CRP and procalcitonin. Primary care physicians enrolled adults presenting with acute cough, and used plain radiography as their gold standard for diagnosis of pneumonia.
In short:
- “Symptoms and signs” suggestive of pneumonia (fever, tachycardia, abnormal lung exam) all had positive OR between 2.0 and 5.3, and combined offered an AUC of 0.70.
- Adding CRP as a continuous variable to symptoms and signs gave an OR of 1.2 and increased the AUC to 0.78.
- Adding procalcitonin as a continuous variable to symptoms and signs gave an OR of 1.1 and increased the AUC to 0.72.
Using CRP as a dichotomous cut-off at 30 mg/L, in addition to the independent symptom predictors, gave them enough discrimination to produce low-, intermediate-, and high-risk groups, with 0.7%, 3.8%, and 18.2% chances of pneumonia, respectively. A high-risk group in which fewer than one in five have the disease? The authors recommend consideration of empiric antibiotic therapy in this group, but I prefer their other recommendation: to consider radiography as confirmation in this subset. The remainder ought to be candidates for observation, as false positives and harms from additional testing are likely to outweigh true positives.
Again, refuting the terrible JAMA distortion, procalcitonin had no useful discriminatory diagnostic value.
How To Evaluate Decision Instruments
This lovely editorial by Steven Green from Loma Linda succinctly summarizes the limitations of clinical decision instruments. Decision instruments, referred to in this article as decision “rules”, are potentially valuable distillations of data from large research cohorts meant to concisely address vital clinical concerns. These include such well-known instruments as NEXUS, PERC, Centor, Alvarado, Wells, and Geneva.
He describes rigorous derivation, external validation, and ease of application as important criteria. However, the most important topics he addresses are the related issues of “1-way” versus “2-way” application, and whether the rule improves upon pre-existing clinical practice. A “1-way” decision instrument informs clinicians only when its criteria are all met, such as the PERC rule: a patient who fails the PERC rule does not necessarily need any additional testing, owing to its low specificity. The NEXUS criteria, on the other hand, are a 2-way decision rule, where use in appropriately selected patients typically leads to radiography if the criteria are not met.
The danger, however, is the natural propensity to use a “1-way” rule like a “2-way” rule. His example for this error is the PECARN blunt abdominal trauma article about which I previously expressed concerns. In the PECARN blunt trauma instrument, the specificity of the derivation was actually lower than that of the clinical gestalt of the physicians involved, so the authors recommend its use only as a “1-way” rule, based on sensitivity. If the cognitive error is made to apply it as a “2-way” rule, however, CT scanning will increase by 13%. Then, unfortunately, even used as a “1-way” rule, the PECARN instrument has only 97% sensitivity, compared with clinician gestalt at 99%. If implemented as routine practice, then, the PECARN instrument may have a non-trivial number of misses while potentially increasing scanning, illustrating his point about a “poorly-designed” decision rule despite the statistical power of the cohort evaluated.
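The asymmetry Green describes can be made concrete with a short sketch; the messages below are illustrative wording, not any published instrument’s actual output.

```python
# The distinction between "1-way" and "2-way" application of a decision
# rule. The key asymmetry: a 1-way rule is silent when its criteria are
# not met, rather than mandating testing. Wording is illustrative.

def apply_one_way(all_criteria_negative: bool) -> str:
    if all_criteria_negative:
        return "low risk: no further testing"
    return "rule silent: fall back on clinical judgement"

def apply_two_way(all_criteria_negative: bool) -> str:
    if all_criteria_negative:
        return "low risk: no imaging"
    return "criteria not met: obtain imaging"

print(apply_one_way(False))
print(apply_two_way(False))
```

The cognitive error in question is exactly treating the “rule silent” branch of a 1-way rule as if it read “obtain imaging”, which is how a sensitivity-only instrument ends up driving testing rates upward.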
Overall, a lovely read regarding how to properly evaluate and apply decision instruments.
“When Do Clinical Decision Rules Improve Patient Care?”
www.ncbi.nlm.nih.gov/pubmed/23548403
The NICE Traffic Light Fails
Teasing out serious infection in children, while minimizing testing and unnecessary interventions, remains a challenge. To this end, the National Institute for Health and Clinical Excellence in the United Kingdom created a “Traffic Light” clinical assessment tool, which uses colour, activity, respiratory, hydration, and other features to give a low-, intermediate-, or high-risk assessment.
These authors attempted to validate the tool by retrospectively applying it to a prospective registry of over 15,000 febrile children aged less than 5 years. The primary outcome was correctly classifying a serious bacterial infection as intermediate- or high-risk. And the answer: 85.8% sensitivity and 28.5% specificity. Meh.
108 of the 157 missed cases of SBI were urinary tract infections – for which the authors suggest perhaps urinalysis could be added to the NICE traffic light. This would increase sensitivity to 92.1%, but drop specificity to 22.3% – if you agree with the blanket categorization of UTI as SBI.
Regardless, the AUC for SBI was 0.64 without the UA and 0.61 with the UA – not good at all.
“Accuracy of the “traffic light” clinical decision rule for serious bacterial infections in young children with fever: a retrospective cohort study”
www.ncbi.nlm.nih.gov/pubmed/23407730
