Article Text

Why science is less scientific than we think (and what to do about it): The 2022 Gaston Labat Award Lecture
Brian M Ilfeld1,2
1Department of Anesthesiology, University of California San Diego, La Jolla, California, USA
2Outcomes Research, Cleveland Clinic, Cleveland, Ohio, USA
Correspondence to Dr Brian M Ilfeld, Department of Anesthesiology, University of California San Diego, La Jolla, California, USA; bilfeld{at}health.ucsd.edu

Fun (pain medicine-related) facts: the English word ‘narcotic’ derives from the ancient Greek ναρκῶ (narkō), ‘to make numb’, a term used by Hippocrates, Aristotle, and Plato.1 Contemporaries of these medical luminaries treated the pain of operations, headaches, and childbirth with electricity from the fish they described as narke (figure 1), which we now call torpedo fish (‘torpedo’ is itself derived from the Latin torpere, ‘to be numb or stunned’). But I digress…

Figure 1

An example of a torpedo fish, which can produce electrical discharges of up to 220 volts.

Background

Gout was a significant health issue in Ancient Rome lacking an effective treatment.

Methods

In the first century, 20 Roman citizens with severe gout were randomized to touch either a living or a dead torpedo fish (following written, informed consent, of course).2

Results

The day following treatment, pain scores of participants who had contact with a living fish (n=10) improved an average of 1.6 points on a 0–10 Numeric Rating Scale, compared with no improvement at all in the control group (p=0.001).

Discussion

This difference is statistically significant, but is it clinically relevant—should touching a living torpedo fish be offered to all patients with severe gout?

Individual versus group differences

There are two distinct steps in the critical process of applying study results to clinical practice: (1) determining the degree of improvement important to individuals and (2) establishing the degree of improvement between groups that is relevant. An example of the former: identifying that a majority of Romans with gout consider a decrease of 2 points on the 0–10 scale to be meaningful. In other words, if we ask any individual patient whether touching the fish was worth doing, those with at least a 2-point improvement would answer in the affirmative: we have defined the minimum clinically important difference (MCID) as 2 for individuals. In our study, of the 10 patients who touched a living torpedo fish, 8 had an improvement of 2 points, while the remaining 2 experienced no improvement, yielding a group average of 1.6. The mean improvement for the control group was 0 (no change), and therefore the difference between the two treatment groups was 1.6 points.
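To make the arithmetic concrete, here is a minimal sketch in Python, assuming the illustrative scores above (8 treated patients improving by exactly the 2-point individual MCID and 2 not improving at all); the data, like the trial itself, are hypothetical.

```python
# Minimal sketch of the hypothetical torpedo-fish trial arithmetic.
# Improvements on the 0-10 Numeric Rating Scale; values are illustrative.
treated = [2, 2, 2, 2, 2, 2, 2, 2, 0, 0]  # living-fish group
control = [0] * 10                         # dead-fish (control) group

mean_treated = sum(treated) / len(treated)   # 1.6 points
mean_control = sum(control) / len(control)   # 0.0 points
group_difference = mean_treated - mean_control

# Proportion of treated patients reaching the 2-point individual MCID.
responders = sum(score >= 2 for score in treated) / len(treated)

print(f"Mean improvement, treated: {mean_treated:.1f}")
print(f"Mean improvement, control: {mean_control:.1f}")
print(f"Between-group difference:  {group_difference:.1f}")
print(f"Treated patients reaching the individual MCID: {responders:.0%}")
```

Note how the group mean (1.6) falls below the individual MCID (2.0) even though 80% of treated patients met it.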

Measuring the smallest values individuals consider important for patient-reported outcomes such as pain has been well described and can be relatively straightforward.3 However, as with our study of Ancient Romans, controlled clinical trials usually produce aggregated results for groups, and the smallest difference that is considered important between groups is far less well described. Unfortunately, the smallest change that is important to individual patients cannot be extrapolated to the evaluation of differences between groups.4 This is a critical distinction that is frequently unrecognized by investigators, consumers of research, and grant and manuscript reviewers, often with significant negative consequences.5 6 Considering that this issue applies to nearly every controlled study that informs patient care, it is not an obscure matter relegated to academic methodological researchers—it is of immense consequence to everyone involved in healthcare delivery.

Individual differences

Multiple methods are available to discern the smallest difference for patient-reported outcomes (eg, pain) that is important to individual patients. For example, postoperative patients can rate their pain on a numeric scale, consume an analgesic, and then rate their subsequent pain level as ‘worse’, ‘the same’, ‘slightly better’, or ‘much better’ (figure 2). Patients who rated their improvement as ‘much better’ can be combined, and the difference between their preintervention and postintervention scores defines a ‘benchmark’ of what improvement is deemed clinically important to each individual. The MCID can be significantly influenced by a myriad of patient characteristics such as age,7 gender,8 body mass index,7 culture,9 ethnicity,10 geographic locality,9 educational level,11 perceived general health,11 disease status,12 and socioeconomic status; intervention tolerability, adverse effects, and safety; as well as pain baseline,8 duration, intensity, frequency, location, etiology… well, you get the picture.13 As just one simplistic example, a small benefit from a treatment that is well tolerated and safe will usually be considered more acceptable than a larger benefit from a treatment that kills 50% of patients. There are a near-infinite number of combinations of characteristics and, therefore, unique groups of patients (‘populations’).

Figure 2

An example of an ‘anchor-based’ method to determine the smallest improvement in analgesia that individual patients consider relevant or important.
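To illustrate, here is a minimal sketch of this anchor-based approach, assuming entirely hypothetical patient records; the benchmark is simply the mean preintervention-to-postintervention change among patients who rated themselves ‘much better’.

```python
# Sketch of an anchor-based MCID estimate using hypothetical data.
# Each record: (pre-treatment pain, post-treatment pain, anchor rating).
records = [
    (7, 5, "much better"),
    (8, 5, "much better"),
    (6, 5, "slightly better"),
    (7, 7, "the same"),
    (5, 3, "much better"),
    (6, 7, "worse"),
]

# The benchmark is the mean pre-to-post change among the patients
# who rated their improvement as 'much better'.
changes = [pre - post for pre, post, anchor in records if anchor == "much better"]
mcid_estimate = sum(changes) / len(changes)
print(f"Anchor-based individual MCID estimate: {mcid_estimate:.1f} points")
```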

Crucially, the smallest important difference for one population cannot be assumed valid for others. Nevertheless, it is generally accepted that while we cannot account for all variables, the MCID for a specific population is discernible and functions as an estimate of the true MCID for similar individuals. For instance, in a widely respected and cited investigation, Farrar et al used an anchor-based method to determine that decreases of at least 1.7 points (or ≥28%) on an 11-point pain intensity numerical rating scale identified patients who considered themselves at least ‘much improved’ after treatment with pregabalin for chronic pain of various etiologies.8 There has been a good deal of research involving chronic pain states, resulting in a widely accepted consensus statement on the analgesic MCID for individual patients.3 Regrettably, no similar consensus exists for acute or postoperative pain states.14 This predicament has led researchers investigating acute pain to either simply use the MCID from chronic pain states3 8—a practice lacking validation—or rely on the individual MCID from the few studies involving acute pain.15–18

Group differences

In contrast, discerning the importance of differences between group means requires accounting for clinical and societal contexts in addition to the smallest important change among individuals. Returning to Ancient Rome, of the 10 patients who touched a living torpedo fish, 80% experienced an improvement of 2.0 points on the 0–10 scale, while the remaining two experienced no change (figure 3). Since we identified a 2-point improvement as what each individual considers important, 80% of our experimental group considers the treatment beneficial (this is simplified for demonstration purposes—reality is more complex). However, when all 10 patients’ pain scores are combined for analysis, the group’s average improvement is only 1.6 points (p=0.001). This is less than the 2.0-point improvement required by individuals to describe the change as important. If, when planning the study, we had defined an ‘important difference’ between the two treatment groups as 2.0 based exclusively on what individuals consider important, then we could not conclude that touching a living torpedo fish is clinically relevant, since the difference between our experimental and control groups is only 1.6 points.

Figure 3

An example of the problem of extrapolating the smallest change that is important to individual patients to the evaluation of group differences. This example is simplified to convey the general concept—reality is (unfortunately) more complex.

Should we now conclude that torpedo fish neuromodulation is an ineffective analgesic undeserving of future use? Is this a situation in which we detect a statistically significant difference without concurrent clinical significance? Put another way, did our study reveal an improvement that exists, but is so small that it is clinically irrelevant? That appears improbable since 80% of the treated patients experienced what they consider a clinically relevant effect (2-point or greater improvement), and so one could reasonably and logically conclude that the treatment is useful.

The error was to apply the MCID of individuals to group differences.3 4 19 The high response rate (80% of treated patients benefiting) is often lost when authors report only aggregated summary statistics for groups. In addition, to determine whether a group difference is clinically meaningful, we must take into account many additional variables such as onset speed, treatment effect magnitude, durability of benefits, convenience, patient adherence, results for secondary efficacy endpoints, safety, tolerability, and cost—to name but a few—relative to other available treatments.13 19 If torpedo fish, compared with alternative analgesics of Ancient Rome, provide a rapid analgesic onset and long duration, are plentiful and easily stored, induce no adverse side effects, and are inexpensive, then perhaps we can conclude that the average group improvement of 1.6 points is clinically meaningful and the fish should be considered a possible analgesic option (even though individuals do not consider an improvement of less than 2.0 points personally meaningful). In contrast, if torpedo fish require storage in expensive aquariums and frequently induce severe electrical burns, then perhaps the proper conclusion is that the average group improvement of 1.6 points does not warrant future use. And this is a dramatic oversimplification of the issue—completely ignored are the placebo effect, baseline pain score differences, spontaneous symptom resolution, subgroup analyses, and regression to the mean, among others.19

In general, the smallest relevant improvement for groups will be smaller than that for individuals.20 Consequently, if the MCID for individuals is erroneously applied to groups, there is a good chance that a potentially useful intervention will be incorrectly declared clinically unimportant. The use of oral analgesics to treat osteoarthritis pain is a good illustration of this concept. Farrar et al previously determined that individual patients with chronic pain of multiple etiologies (including osteoarthritis) require an analgesic to improve their pain score by at least 1.7 points for the change to be considered important.8 However, as noted previously by Dworkin et al, when treating osteoarthritis the improvement in average group pain scores is less than 1.1 points for opioids, non-steroidal anti-inflammatory drugs, paracetamol (acetaminophen), glucosamine, and chondroitin compared with placebo (standardized to an 11-point scale).19 If the widely respected and cited smallest improvement of 1.7 points deemed important for individuals were applied to these studies of group differences, we would not have a single oral analgesic considered effective for osteoarthritis!8

Investigator conundrum

The smallest improvement deemed important for groups is not easily calculated, and there is a lack of consensus regarding appropriate techniques and their application to various scenarios. Unfortunately, investigators cannot simply avoid defining the minimal difference between groups, since it is usually required for hypothesis testing: in estimating an appropriate sample size, for example. To complicate matters—as if they needed further complication—much of the information required to determine the group MCID, such as the incidence of adverse events, is frequently unknown and requires the trial results to provide the information required to… design the trial in the first place. This is why authors frequently use the phrase sample size ‘estimation’ or ‘justification’: the sample size must be inferred from imperfect and incomplete information. Regrettably, investigators are frequently asked to pretend they can make a definitive ‘determination’ based on published data. To illustrate the impact that misunderstanding these topics can have on clinical research, I offer an example from my own experience.

My coinvestigators and I submitted a grant proposal to investigate treating established postamputation phantom limb pain with a continuous peripheral nerve block. The MCID for our specific patient population had not been discerned, so we proposed using Farrar’s finding that decreases of at least 1.7 points on an 11-point pain scale identified individuals who considered themselves at least ‘much improved’.8 Granted, this was an extrapolation of data from patients with different pain etiologies treated with a different analgesic (pregabalin), but we felt it was the best available data for our purposes. Our estimated sample size was based on an approximation of a group difference of 1.1 points, partially informed by the 1.7-point MCID for individuals. The proposal was rejected, with one reviewer explaining that using 1.1 points as our MCID—when published data demonstrated 1.7 points to be the true MCID for patients with chronic pain—would leave results falling between the two values uninterpretable.
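To show how strongly the assumed between-group difference drives the required enrollment, here is a minimal sketch of a standard normal-approximation sample-size calculation for comparing two group means; the 2.0-point SD is a purely hypothetical planning assumption for illustration, not a value from the actual proposal.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size per group for comparing two means.

    delta: smallest between-group difference to detect.
    sd: assumed common standard deviation of the outcome.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = z.inv_cdf(power)           # corresponds to the desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# The two deltas are the values debated in the grant review above;
# the SD of 2.0 points is a hypothetical planning assumption.
for delta in (1.1, 1.7):
    print(f"delta = {delta}: n = {n_per_group(delta, sd=2.0)} per group")
# delta = 1.1 -> ~52 per group; delta = 1.7 -> ~22 per group
```

Under these assumptions, moving the ‘important difference’ from 1.1 to 1.7 points cuts the required enrollment by more than half, which is precisely why the choice is so consequential.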

We subsequently submitted a revised proposal using the reviewer’s preferred smallest meaningful clinical difference of 1.7 points between the treatment and placebo groups, and it was funded. On completing the study, we submitted a manuscript reporting that after 4 weeks, phantom limb pain intensity was a mean of 3.0 in patients given local anesthetic vs 4.5 in those given placebo (p=0.003).21 The manuscript was rejected, with a reviewer explaining that while our primary endpoint was statistically significant, we had prospectively defined the minimal clinically important difference between treatment groups as 1.7 points yet found a difference of only 1.5 points: demonstrating that while the intervention did, in fact, decrease phantom limb pain, it failed to do so to a clinically relevant degree.

Investigators

As previously noted, investigators have little choice but to define the smallest important difference between treatment groups to permit designing, analyzing, and reporting controlled studies. While a full discussion of the possible solutions to this challenge is outside the scope of the current article, some relatively simple steps warrant mention. Joint hypothesis testing allows multiple primary end points to be evaluated with a high degree of confidence, so that there are more variables—and more information—on which to base conclusions.22 This is particularly helpful when two outcomes of importance are closely related and drawing conclusions based on just one decreases confidence in the ultimate conclusions. For example, if a new medication improves pain scores yet has no effect on supplemental analgesic consumption, making either of these end points the sole primary outcome measure will result in differing conclusions. However, by prospectively defining both as primary outcomes and requiring one to be superior and the other to be at least non-inferior, the study results are more easily interpreted with a high degree of confidence.
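A minimal sketch of such a joint decision rule follows, assuming hypothetical results: superiority is claimed on pain scores while non-inferiority is required on supplemental analgesic consumption. The numbers and the 5 mg margin are invented for illustration.

```python
# Sketch of a joint hypothesis test: declare the treatment effective only
# if it is superior on pain scores AND non-inferior on supplemental
# analgesic consumption. All values are hypothetical.
alpha = 0.05
p_pain_superiority = 0.01      # p-value for lower pain scores vs control
opioid_ci_upper = 3.0          # upper confidence bound on excess opioid use (mg)
noninferiority_margin = 5.0    # prespecified acceptable excess (mg)

superior_on_pain = p_pain_superiority < alpha
noninferior_on_opioids = opioid_ci_upper < noninferiority_margin

if superior_on_pain and noninferior_on_opioids:
    print("Joint hypothesis met: superior analgesia without excess opioid use.")
else:
    print("Joint hypothesis not met: no overall claim of benefit.")
```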

Another relatively common technique is statistical ‘gatekeeping’: multiple outcomes are combined into sets, and these are listed in order of importance (figure 4).22 23 If at least one outcome in the first (primary) set is found to be statistically significant, the next set of end points is tested (although with a higher ‘bar’ for concluding a difference, depending on the number of variables found significant in the previous set). This continues through each of the gates until the overall type I error is exhausted. Gatekeeping permits statistical comparisons of many variables while retaining a study-wide type I error of 0.05, increasing confidence in all comparisons and aiding interpretation and generalization of the results. In the phantom limb pain study described above, we prospectively specified a statistical gatekeeping strategy that elevated confidence in various secondary outcomes, the first of which was a scale measuring pain’s interference with emotional and physical functioning: a mean of 11 for treated participants vs 28 for the placebo group (lower scores=less interference; p=0.027).21 Given that the results of multiple published investigations suggest that a mean group difference of 2 points on this scale identifies patients who are satisfied or improved with treatment,3 our finding of a 17-point difference suggested a clinically significant improvement of the experimental over the placebo treatment; and the results were subsequently published in different journals.21 24

Figure 4

Visual representation of joint hypothesis testing and a parallel gatekeeping procedure. Gatekeeping permits statistical comparisons of many variables while retaining a study-wide type I error of 0.05, increasing confidence in all comparisons and aiding interpretation and generalization of the results. In this example, the study outcomes are prospectively prioritized into three ordered sets. On study conclusion, statistical testing proceeds through each ‘gate’ to the next set if and only if at least one outcome in the current set reached significance. The significance level for each set is 0.05 times a cumulative penalty for non-significant results in previous sets (ie, a ‘rejection gain factor’ equal to the cumulative product of the proportion of significant tests across the preceding sets). Within a set, a multiple comparison procedure (Bonferroni correction) is used to control the type I error at the appropriate level, if needed. Adapted from a grant proposal coauthored with Edward Mascha, PhD (Departments of Quantitative Health Sciences and Outcomes Research, Cleveland Clinic, Cleveland, Ohio, USA).
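The caption’s rule can be expressed compactly in code. Below is a minimal sketch of the parallel gatekeeping procedure it describes, with invented p-values: each set is tested at 0.05 times the rejection gain factor, Bonferroni-corrected within the set, and testing stops at the first set with no significant outcome.

```python
# Sketch of the parallel gatekeeping procedure described in figure 4.
# The p-values are invented for illustration.
ordered_sets = [
    [0.003],                 # set 1: primary outcome
    [0.010, 0.200],          # set 2: key secondary outcomes
    [0.005, 0.030, 0.400],   # set 3: remaining outcomes
]

alpha, gain = 0.05, 1.0  # gain = 'rejection gain factor'
for i, p_values in enumerate(ordered_sets, start=1):
    level = alpha * gain / len(p_values)  # Bonferroni within the set
    n_significant = sum(p < level for p in p_values)
    print(f"Set {i}: per-test level {level:.4f}, {n_significant} significant")
    if n_significant == 0:  # the gate closes; later sets are not tested
        break
    gain *= n_significant / len(p_values)  # penalty for non-significant tests
```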

Stakeholders

While comparing two group means facilitates hypothesis testing and is thus often preferred by investigators, it also complicates interpretation of the results and extrapolation to differing clinical scenarios. For other stakeholders, including clinicians, patients, administrators, and even other researchers, a more useful measure of treatment effect can be the percentage of each study group that reached various outcomes (related to the ‘number needed to treat’ (NNT)).25 For example, it is probably difficult for individual patients with phantom limb pain to evaluate how our reported 1.5-point improvement in group means applies specifically to them. Potentially more useful information is that 56% of study participants treated with local anesthetic experienced at least a 1.7-point improvement in their pain scores at 4 weeks, compared with only 25% of those in the control group, suggesting that a 6-day continuous peripheral nerve block decreases pain a clinically relevant amount in approximately 31% of treated patients. Increasing the ways in which data are conveyed assists stakeholders in interpreting and applying the study results to their own specific situation/population (figure 5).
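Expressed as code, the relationship between responder rates and the NNT is straightforward; the sketch below uses the proportions quoted above.

```python
# Responder rates and number needed to treat (NNT), using the
# phantom-limb-pain proportions quoted in the text.
p_treated = 0.56   # reached a >= 1.7-point improvement with local anesthetic
p_control = 0.25   # reached the same threshold with placebo

absolute_benefit = p_treated - p_control  # ~0.31 (31% of patients)
nnt = 1 / absolute_benefit                # ~3.2 patients treated per responder

print(f"Absolute benefit attributable to treatment: {absolute_benefit:.0%}")
print(f"Number needed to treat: {nnt:.1f}")
```

That is, roughly one additional patient benefits for every three treated, a framing many stakeholders find more intuitive than a 1.5-point difference in group means.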

Figure 5

Examples of various presentation formats of the same dataset. Study participants included adult patients with postamputation phantom limb pain who all received a single-injection peripheral nerve block with ropivacaine and a perineural catheter(s). Participants were randomized to receive 6 days of either perineural ropivacaine or normal saline, and the primary outcome measure was the improvement in the average pain score queried 4 weeks following infusion initiation as measured using a 0–10 Numeric Rating Scale. Note that the combination of information improves interpretation of the results, and different formats may be valuable to differing stakeholders such as clinicians, patients, administrators, and other researchers. Adapted from Ilfeld et al.21 Some results are preliminary and should be used for demonstration purposes only (final analysis currently in preparation).

Importantly, different stakeholders often have differing priorities and therefore frequently draw dissimilar—or even opposite—conclusions from the same investigation. For instance, when presented with these results, a patient with severe phantom limb pain unchanged by any standard treatment may conclude that a clinically relevant benefit in 31% of patients represents an excellent alternative analgesic. Conversely, a chronic pain physician may consider these same results unacceptable for a patient with osteoporosis, given the theoretically increased risk of falling with a perineural infusion and the available alternative of percutaneous peripheral nerve stimulation, which does not induce a proprioceptive, sensory, or motor block.26 In contrast, an insurance administrator may conclude the analgesic improvement demonstrates a favorable cost–benefit ratio relative to nerve stimulation and warrants a trial of the perineural infusion.27 Applicability of study conclusions (positive or negative) to the innumerable populations and scenarios ‘must be conducted on a case-by-case basis, and are ideally informed by patients and their significant others, clinicians, researchers, statisticians, and representatives of society at large’.19

Grant and manuscript reviewers

There will often be differences of opinion regarding the clinical significance of a group difference, even for the same population, due to divergent priorities among the many consumers of research results, including patients, healthcare providers, administrators, insurance carriers, medication and device manufacturers, and journal editors. Although thoughtful sample size estimation based on objective criteria is required for prospective clinical research, it is essential that grant and manuscript reviewers acknowledge that while investigators can provide sound scientific justification of their reasoning, it is not possible to estimate a precise sample size that accounts for all conceivable variables and will produce results applicable to every plausible population and scenario.

Related to this issue is an appeal to manuscript reviewers and editors to permit some degree of outcomes-based editorializing within the abstract conclusion section. Frequently reviewers require authors to simply restate the results: ‘This study found that contact with a living torpedo fish decreased gout-related pain.’ Perhaps a more useful conclusion for stakeholders places the results—which they just read in the results section—within some context: ‘While contact with a torpedo fish decreased gout-related pain, the significant number of severe electrical burns and poor survival rate when stored in a Pyxis Medstation may limit the use of torpedo fish as a first-line analgesic.’

Conclusions

It is not my intention to imply that there are no (partial) solutions for assessing the clinical relevance of group differences,28 nor to provide a review of available statistical analysis strategies. Rather, I hope to (1) better elucidate for readers the distinction between the smallest improvement that is important to individual patients and the smallest difference considered important between groups; (2) draw attention to the erroneous assumption that the two are interchangeable; (3) encourage investigators to use joint hypothesis testing and gatekeeping, as well as to provide data in multiple formats that will assist stakeholder interpretation and generalization of the results; (4) appeal to reviewers to appreciate the imprecise nature of sample size estimation; and (5) advocate that stakeholders—be they clinicians, patients, administrators, or other researchers—take an active role in applying study results to their own specific situation/population. For readers seeking additional information, I highly recommend outstanding publications by Dworkin et al3 19 29 and Muñoz-Leyva et al,14 among many others.13 28 30–35 Even with beneficial methods such as the NNT and 50% of the SD (½ SD),28 the nearly infinite combinations of diverse populations and clinical scenarios essentially guarantee that there will be a good deal of art in addition to science when basing clinical care and administrative decisions on trial data. A p<0.05 is often viewed as the ‘conclusion’ of the scientific process, when—in many respects—it is only the beginning.

Ethics statements

Patient consent for publication

Ethics approval

This study does not involve human participants.

Acknowledgments

The author would like to thank all of his instructors, mentors, and colleagues, who—if listed—would require more pages than this article itself; and without whom this manuscript would never have been conceived and could not have been written.

References

Footnotes

  • Presented at American Society of Regional Anesthesia and Pain Medicine annual meeting, April 2, 2022, Las Vegas, Nevada

  • Correction notice This article has been corrected since it was first published. The open access licence has been updated to CC BY.

  • Contributors BMI conceived and wrote this manuscript. This author is responsible for the overall content as the guarantor and accepts full responsibility for the work and controlled the decision to publish.

  • Funding The author has not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests The University of California has received funding and product for Dr. Ilfeld’s research from Epimed International (Farmers Branch, TX); Avanos (Irvine, CA); Infutronics (Natick, MA); and SPR Therapeutics (Cleveland, OH).

  • Provenance and peer review Not commissioned; externally peer reviewed.