Abstract
Background The objective of this study is to develop predictive models for persistent opioid use following lower extremity joint arthroplasty and to determine whether ensemble learning and an oversampling technique improve model performance.
Methods We compared predictive models to identify patients at risk of persistent postoperative opioid use using preoperative, intraoperative, and postoperative data, including surgical procedure, patient demographics/characteristics, past surgical history, opioid use history, comorbidities, lifestyle habits, anesthesia details, and postoperative hospital course. Six classification models were evaluated: logistic regression, random forest classifier, simple feed-forward neural network, balanced random forest classifier, balanced bagging classifier, and support vector classifier. Performance with the Synthetic Minority Oversampling Technique (SMOTE) was also evaluated. Repeated stratified k-fold cross-validation was implemented to calculate F1 scores and areas under the receiver operating characteristic curve (AUC).
Results There were 1042 patients undergoing elective knee or hip arthroplasty, of whom 242 (23.2%) reported persistent opioid use. Without SMOTE, the logistic regression model had an F1 score of 0.47 and an AUC of 0.79. All ensemble methods performed better, with the balanced bagging classifier achieving an F1 score of 0.80 and an AUC of 0.94. SMOTE improved the performance of all models based on F1 score. Specifically, the performance of the balanced bagging classifier improved to an F1 score of 0.84 and an AUC of 0.96. The features with the highest importance in the balanced bagging model were postoperative day 1 opioid use, body mass index, age, preoperative opioid use, prescribed opioids at discharge, and hospital length of stay.
Conclusions Ensemble learning can dramatically improve predictive models for persistent opioid use. Accurate and early identification of high-risk patients can play a role in clinical decision making and early optimization with personalized interventions.
Keywords: pain; postoperative; chronic pain; pain management
This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 International (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the license is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/.
Introduction
The USA is in the midst of an opioid epidemic, with the number of opioid-related deaths having risen sixfold since 1999.1 Surgery is a risk factor for chronic opioid use, with up to a 3% incidence of previously opioid naïve patients continuing to use opioids for more than 90 days after a major elective surgery.2 Joint arthroplasty surgery is one of the most common surgical procedures performed in the USA, with over 1 million knee and hip replacements occurring annually.3 Despite improvements in multimodal analgesia, 10%–40% of patients undergoing lower extremity joint arthroplasty still develop chronic postsurgical pain.4
Several studies have investigated risk factors for persistent opioid use following total knee and hip arthroplasty.5 6 Preoperative opioid use has consistently been found to increase the risk of postoperative chronic opioid use.7–11 Other patient characteristics associated with increased risk of chronic opioid use include a history of depression, higher baseline pain scores, younger age, and female sex.7–10 Additional research is needed to develop tools that more accurately identify the patients at highest risk postoperatively. Identification of at-risk patients prior to hospital discharge allows time for the formulation of a pre-emptive, individualized pain management plan. Novel modalities exist that may help reduce persistent opioid use, including peripheral neuromodulation, cryoneurolysis, and transitional pain clinics; however, due to limited resources, it may not be realistic to offer such modalities to every patient.12 Developing accurate predictive models will help better allocate these resources.
Limited research exists on the utility of machine learning in predicting persistent opioid use—defined as continued opioid use more than 3 months after surgery. The primary objective of this study is to develop machine learning-based models to predict persistent opioid use. We incorporate data from the entire acute perioperative period (preoperative, intraoperative, and postoperative variables) so that identification of high-risk patients can occur prior to hospital discharge. Furthermore, machine learning algorithms are typically evaluated by their predictive accuracy; however, when data are imbalanced (ie, there is a large difference in the rates of positive vs negative outcomes), performance estimates can be biased and misleading. Given that most patients undergoing joint arthroplasty do not develop persistent opioid use, such datasets are inherently imbalanced. Therefore, to optimize our machine learning algorithms, we also applied an oversampling technique, the Synthetic Minority Oversampling Technique (SMOTE), to balance the dataset; SMOTE has been shown to improve model accuracy without biasing the study outcome.13
Methods
Study sample
The informed consent requirement was waived. Data from all patients who underwent elective hip or knee arthroplasty from 2016 to 2019 were extracted from the electronic medical record (EMR) database. Emergent cases, bilateral joint arthroplasty, hemiarthroplasty, and unicompartmental knee arthroplasty were excluded from the analysis. The manuscript adheres to the applicable EQUATOR guidelines for observational studies.
Primary objective and data collection
The primary outcome measurement was persistent opioid use, defined as patient-reported opioid use beyond a 3-month postoperative cut-off, up to 6 months after surgery. The outcome data were extracted from the EMR system from: (1) the orthopedic surgery postoperative follow-up note (at 3–6 months following surgery), in which the surgeons routinely describe continued opioid use or pain; (2) primary care physician or pain specialist notes (at 3–6 months postoperatively); (3) scanned documents from providers outside of our healthcare system (at 3–6 months postoperatively); and/or (4) an active opioid prescription filled during this time period (captured in the EMR). We evaluated six different classification models: logistic regression, random forest classifier, simple feed-forward neural network, balanced random forest classifier, balanced bagging classifier, and support vector classifier. In addition, we evaluated model performance with SMOTE.
The covariates included in the models were: surgical procedure (total hip arthroplasty (posterolateral approach vs anterior approach), total knee arthroplasty, revision total hip arthroplasty, and revision total knee arthroplasty), age, sex, body mass index, English as a primary language, preoperative opioid use, previous joint replacement surgery, osteoarthritis severity in the operative limb, hypertension, coronary artery disease, chronic obstructive pulmonary disease, asthma, obstructive sleep apnea, diabetes mellitus (non-insulin vs insulin dependence), psychiatric history (anxiety and/or depression), active alcohol use (defined as ≥2 drinks per day), active smoking history, active marijuana use, use of perioperative regional nerve block, primary anesthesia type (neuraxial vs general anesthesia), intraoperative ketamine use (yes or no), opioid use on postoperative day 1 (measured in intravenous morphine equivalents (MEQ)), amount of prescription opioids given at discharge (MEQ), and hospital length of stay (days) (online supplemental table 1). The amount of prescription opioids given at discharge was defined as the number of pills multiplied by each pill's MEQ, as illustrated in the sketch below. No data were missing.
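A minimal sketch of this discharge calculation; the drug names and conversion factors below are illustrative assumptions, not values from the study:

```python
# Illustrative sketch: discharge opioids = number of pills x per-pill MEQ.
# Conversion factors are assumed for illustration (approximate oral
# morphine-equivalent factors), not taken from the study.
MEQ_PER_PILL = {
    "oxycodone_5mg": 5 * 1.5,    # ~7.5 MEQ per pill
    "hydrocodone_5mg": 5 * 1.0,  # ~5.0 MEQ per pill
}

def discharge_meq(pill_counts: dict) -> float:
    """Total discharge prescription in MEQ."""
    return sum(n * MEQ_PER_PILL[drug] for drug, n in pill_counts.items())

# e.g., 30 oxycodone 5 mg pills -> 30 * 7.5 = 225 MEQ
print(discharge_meq({"oxycodone_5mg": 30}))
```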
Statistical analysis
Python (V.3.7.5) was used for all statistical analyses. First, the cohort was divided into training and test datasets with an 80:20 split using a randomized splitter—the ‘train_test_split’ method from the sci-kit learn library12—so that any patient present in the test set was automatically excluded from the training set. We developed each machine learning model using the same training set (with or without SMOTE) and tested its performance on the same test set, measuring F1 score, accuracy, recall, precision, and the area under the receiver operating characteristic (ROC) curve (AUC). To perform a more robust evaluation of the models, we then calculated the average F1 score, accuracy, recall, precision, and AUC using stratified k-fold cross-validation (described below) (figure 1).
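A minimal sketch of this split, using a synthetic stand-in for the study data (the array shapes, outcome rate, and random seed are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the cohort: 1042 patients, 26 features, ~23% positive
# class (persistent opioid use). Purely illustrative, not the study data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1042, 26))
y = (rng.random(1042) < 0.232).astype(int)

# 80:20 randomized split; patients in the test set are excluded from training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
```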
Data balancing
The SMOTE for Nominal and Continuous (SMOTE-NC) algorithm—implemented using the ‘imblearn’ library (https://imbalanced-learn.org/stable/)—was used to create a balanced class distribution; SMOTE was first described by Chawla et al.13 Imbalanced data can be particularly difficult for predictive modeling because of the uneven class distribution. A balanced dataset has minimal difference in the rates of positive and negative outcomes; when the difference is large, the dataset is considered imbalanced. SMOTE is a statistical technique that increases the number of cases in a dataset to balance it: it generates new instances from the minority class while leaving the number of majority cases unchanged. The algorithm samples the feature space of each minority case and its five nearest neighbors, and then generates new cases that combine features of the target case with features of its nearest neighbors, increasing the percentage of minority cases in the dataset. SMOTE was applied only to our training sets; we did not oversample the testing set, thus maintaining the natural outcome frequency.
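A minimal sketch of this step, continuing the toy split above. Plain SMOTE is shown because the toy features are all continuous; the study used the SMOTE-NC variant, which additionally handles nominal features (imblearn's SMOTENC). The random seed is an illustrative assumption:

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training set only, using the
# default five nearest neighbors; the test set keeps its natural
# outcome frequency.
smote = SMOTE(k_neighbors=5, random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print(np.bincount(y_train))      # imbalanced, ~23% positive
print(np.bincount(y_train_bal))  # balanced 1:1 after SMOTE
```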
Machine learning models
We evaluated six different classification models: logistic regression, random forest classifier, simple feed-forward neural network, balanced random forest classifier, balanced bagging classifier, and support vector classifier. For each, we also compared oversampling the training set via SMOTE vs no SMOTE. For each model, all features were included as inputs.

Multivariable logistic regression—This is a statistical model that estimates a binary outcome based on a weighted combination of the underlying independent variables. We tested an L2-penalty-based regression model without specifying individual class weights. This model provides a baseline score against which improvement in the evaluation metrics can be judged.

Random forest classifier—We developed a random forest classifier with 1000 estimators, with the split criterion set to the Gini impurity. The Gini impurity is calculated as

$$G = 1 - \sum_{i=1}^{C} p(i)^2$$

in which C is the total number of classes and p(i) is the probability of selecting a data point of class i.
Random forest is an ensemble technique that combines the predictions from multiple decision trees to make more accurate predictions than any individual model.14 The random forest is a robust and reliable non-parametric supervised learning algorithm that serves as a means to test further improvement in the metrics and provides the feature importance of the dataset.

Balanced random forest classifier—This is an implementation of the random forest that randomly under-samples each bootstrap sample to balance it. The model was built using 1000 estimators and the default values provided in the imblearn package. The sampling strategy was set to ‘auto’, which is equivalent to ‘not minority’. Other parameters were kept the same as in our random forest classifier.

Balanced bagging classifier—Another way to ensemble models is bagging, or bootstrap aggregating. Bagging methods build several estimators on different randomly selected subsets of data, which makes them less sensitive to the idiosyncrasies of any particular training sample and generally more resistant to overfitting. We built a balanced bagging classifier using the imblearn package, with the number of estimators set to 1000, sampling with replacement allowed, and the sampling strategy set to ‘auto’ (equivalent to ‘not minority’, which does not resample the minority class).

Multilayer perceptron neural network—Using sci-kit learn’s ‘MLPClassifier’, we built a shallow feed-forward network with two hidden layers of ten neurons each. The activation function was the rectified linear unit, and the network was trained for a maximum of 700 iterations. The other parameters remained at the sci-kit learn defaults.

Support vector classifier—A support vector classifier seeks a hyperplane decision boundary that best splits the data into the required number of classes. It represents each data item as a point in an n-dimensional space (n being the number of features) and then finds a hyperplane that separates the classes. We developed a cost-sensitive modification of the support vector classifier that weighs the margin in proportion to class importance, by setting gamma to ‘scale’ and assigning ‘balanced’ to the class-weight parameter.
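A minimal sketch of how these six models might be instantiated with the stated hyperparameters; unspecified parameters keep library defaults, and max_iter for logistic regression and probability=True for the support vector classifier are assumptions added here for convergence and AUC computation:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from imblearn.ensemble import (BalancedRandomForestClassifier,
                               BalancedBaggingClassifier)

models = {
    # L2-penalized baseline, no class weights specified
    "logistic_regression": LogisticRegression(penalty="l2", max_iter=1000),
    # 1000 trees, Gini impurity split criterion
    "random_forest": RandomForestClassifier(n_estimators=1000,
                                            criterion="gini"),
    # random forest that under-samples each bootstrap ('auto' = 'not minority')
    "balanced_random_forest": BalancedRandomForestClassifier(
        n_estimators=1000, sampling_strategy="auto"),
    # 1000 estimators, sampling with replacement, minority class untouched
    "balanced_bagging": BalancedBaggingClassifier(
        n_estimators=1000, replacement=True, sampling_strategy="auto"),
    # two hidden layers of ten ReLU neurons, up to 700 iterations
    "mlp": MLPClassifier(hidden_layer_sizes=(10, 10), activation="relu",
                         max_iter=700),
    # cost-sensitive SVC: margin weighted by class importance
    "svc": SVC(gamma="scale", class_weight="balanced", probability=True),
}
```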
K-folds cross-validation
To perform a more robust evaluation of our models, we implemented repeated stratified k-fold cross-validation to observe the accuracy, precision, recall, F1 score, and AUC, with 10 splits and 3 repeats. For each iteration, the dataset was split into 10 folds, where one fold served as the test set and the remaining nine folds served as the training set. The model was built on the training set; when SMOTE was used, only the training set was oversampled. This was repeated until every fold had served as the test set, and the entire procedure was then repeated three times. For each iteration, our performance metrics were calculated on the test set, and the average of each metric was calculated thereafter.
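A minimal sketch of this evaluation loop, continuing the earlier sketches (toy data and the models dictionary); the random seeds and choice of model are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
f1s, aucs = [], []
for train_idx, test_idx in cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]
    # SMOTE is fit on the training folds only; test fold keeps its
    # natural outcome frequency.
    X_tr, y_tr = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
    model = models["balanced_bagging"].fit(X_tr, y_tr)
    f1s.append(f1_score(y_te, model.predict(X_te)))
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

print(f"mean F1 {np.mean(f1s):.2f}, mean AUC {np.mean(aucs):.2f}")
```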
Performance metrics
Accuracy
Accuracy is the ratio of correct predictions to the total number of data points in the test set. It summarizes how many points the model predicted correctly, but it does not tell the full extent of a model’s performance, particularly when classes are imbalanced.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
Precision
Precision quantifies how many of the model’s positive predictions are correct; in effect, it measures the accuracy of predictions for the minority class. Formally, it is defined as the ratio of true positive samples to the sum of true positive and false positive samples.
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall
Recall quantifies the number of correct positive predictions out of all the positive predictions that could have been made, and thus serves as an indication of missed positive cases. It is formally defined as the ratio of true positives to the sum of true positives and false negatives.
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1-Score
This is a version of the Fβ metric in which precision and recall are weighted equally. The F1 score is the harmonic mean of precision and recall, combining both into a single metric. Because it balances precision and recall, it is especially informative for imbalanced classification tasks and is therefore the primary metric of our analysis.15 16
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
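For example, a model with precision 0.75 and recall 0.60 (illustrative values, not results from the study) would score:

$$F_1 = \frac{2 \times 0.75 \times 0.60}{0.75 + 0.60} = \frac{0.90}{1.35} \approx 0.67$$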
Area under the curve
We calculated the AUC for the ROC curve. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (one minus specificity) across classification thresholds. The AUC summarizes the whole curve in a single number ranging from 0 to 1, with 1 being the best possible score.
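Continuing the earlier sketches, these metrics might be computed on the held-out test set as follows (the model choice is illustrative):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

clf = models["balanced_bagging"].fit(X_train_bal, y_train_bal)
y_pred = clf.predict(X_test)               # hard 0/1 predictions
y_prob = clf.predict_proba(X_test)[:, 1]   # probability of persistent use

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))
```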
Results
Study population
Initially, there were 1094 patients; after exclusions, 1042 patients remained in the final analysis, of whom 242 (23.2%) reported persistent opioid use at 3–6 months following surgery. The cohort of patients who did not have persistent opioid use had a higher proportion of hip arthroplasties and of severe osteoarthritis of the surgical joint. Those who did have persistent opioid use tended to be younger, to have received intraoperative ketamine, to have consumed more opioids on postoperative day 1, to have had longer hospital lengths of stay, and to have had higher proportions of substance abuse, congestive heart failure, chronic obstructive pulmonary disease, and anxiety/depression (table 1).
We initially split the data into a training and a test set and used SMOTE to oversample the training set for each model. When SMOTE was applied to the training set, there was a 1:1 ratio of positive and negative classes (table 2), reflecting a ~3.2-fold increase in positive classes (persistent opioid use).
The AUC of the ROC for the logistic regression model was 0.72, while each ensemble learning approach (ie, random forest, balanced bagging, and balanced random forest) had an AUC of 0.95 (figure 2). We generated box plots (figure 3) illustrating the quantiles of the probability scores generated by each machine learning model when validated on the test set. In the test set (n=219), there were 49 (22.4%) patients who developed persistent opioid use. When modeling development of persistent opioid use, the ensemble learning approaches more often assigned probability scores ≥0.5 to patients who developed persistent opioid use, compared with the other models (figure 3A).
Based on the balanced random forest model, we reported the features contributing most to the model (figure 4). The most important features contributing to prediction with the balanced bagging approach were (in descending order) postoperative day 1 opioid consumption, body mass index, age, preoperative opioid use, opioids prescribed at discharge, hospital length of stay, and the surgical procedure.
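A minimal sketch of how such feature importances might be extracted from the fitted balanced random forest (continuing the earlier sketches; the placeholder feature names are an assumption):

```python
import numpy as np
from imblearn.ensemble import BalancedRandomForestClassifier

brf = BalancedRandomForestClassifier(n_estimators=1000).fit(
    X_train_bal, y_train_bal)

feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholders
order = np.argsort(brf.feature_importances_)[::-1]
for i in order[:6]:  # six highest-importance features
    print(feature_names[i], round(float(brf.feature_importances_[i]), 3))
```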
Performance metrics calculated from K-folds cross-validation
Without SMOTE: We then calculated the average F1 score, accuracy, precision, recall, and AUC from k-fold cross-validation for all models, first without SMOTE (table 3). With logistic regression, the F1 score and AUC were 0.47 and 0.79, respectively. In comparison, the best performing model was the balanced bagging classifier, with an F1 score of 0.80 and an AUC of 0.94.
With SMOTE: When SMOTE was used to oversample the training sets, the F1 score for most models improved (table 3; values in green and red font signify improved or decreased performance, respectively, for the given metric when SMOTE was applied). For example, for the balanced bagging classifier, the F1 score improved from 0.80 (no SMOTE) to 0.84 (with SMOTE), and the AUC improved from 0.94 (no SMOTE) to 0.96. However, there were cases where accuracy, precision, recall, or AUC decreased when SMOTE was applied; for example, the AUC of the multilayer perceptron went from 0.78 (no SMOTE) to 0.76 (with SMOTE). In addition, we report performance metrics of each machine learning model when different ratios of positive to negative classes were applied to SMOTE (online supplemental table 2).
Discussion
We demonstrated that an ensemble machine learning approach, combined with oversampling of the training set, can improve the prediction of persistent opioid use following joint replacement. In our study population, we found the prevalence of persistent opioid use to be 23.2%. While there are several interventions that may potentially reduce this incidence—such as transitional pain clinics, peripheral neuromodulation, and cryoneurolysis—these additional therapies may not realistically be applied to every patient. Thus, improving the ability to risk stratify and identify the at-risk population at the time of hospital discharge is of utmost importance.
With expanding surveillance and access to electronic health data, methodologies in artificial intelligence are becoming pertinent to prediction analysis. Using the same dataset, we can improve our ability to predict outcomes by applying different types of machine learning approaches. Such practice should be applied more often in the healthcare setting, given the exponential increase in EMR data acquisition and availability. A basic approach to identifying associations between patient characteristics and outcomes is regression. In our study, we showed that an ensemble learning approach improves the prediction of persistent opioid use compared with regression, which has notable limitations: most importantly, it captures only linear relationships between the features and the outcome.
Ensemble learning is beneficial because it leverages multiple learning algorithms and techniques to better predict an outcome. However, it requires diversity within the sample and between the models. To accomplish this, methods such as bagging, the use of different classifiers, and oversampling can be used to generate diversity and class balance within a given dataset. Undersampling discards data; therefore, we can instead apply techniques that oversample the minority class to match the majority class. Electronic health data are well known for class imbalance when a given outcome is not prevalent within a population.17 Two methods to overcome this problem are random oversampling and synthetic generation of minority class data by SMOTE. Instead of replicating random data from the minority class, SMOTE uses a nearest-neighbor approach to generate synthetic data that reduce the class imbalance. We demonstrated that applying SMOTE to our training set can improve the performance of our ensemble learning techniques, likely because the generation of synthetic points in the training dataset allows for better validation on the test dataset. However, SMOTE has some limitations; namely, it may struggle with high-dimensional data (and thus introduce additional noise when oversampling the minority class), as in study populations with increased heterogeneity and large numbers of features.18 As we fine-tune our predictive ability further, we can optimize our use of healthcare resources to manage patient care via a personalized medicine approach.
Using the balanced random forest model’s feature importance plot, we identified six variables to be the most important predictors in our models for persistent postoperative opioid use following joint replacement. These factors include: postoperative day 1 opioid use, body mass index, age, preoperative opioid use, prescribed opioids at discharge, and hospital length of stay. The increased postoperative day 1 opioid use may be reflective of poorly controlled acute postoperative pain, a known risk factor for developing persistent pain after various surgical procedures.19–22 For this reason, literature strongly supports the use of multimodal analgesia, early counseling, peripheral nerve blocks as well as neuraxial anesthesia as strategies to minimize the transition from acute to prolonged opioid requirement after surgery.23–25
Similar to our findings, preoperative opioid use is consistently reported to increase the risk of chronic opioid use after surgery.7–10 Goesling et al identified that, among patients taking opioids preoperatively, an average daily dose greater than 60 mg oral MEQ was independently associated with persistent opioid use after lower extremity arthroplasty.26 Opioid prescription during or after surgery may trigger long-term use in opioid-naïve patients.26–29 Patients who were prescribed greater quantities of opioids at discharge were more likely to request opioid refills postoperatively. Hernandez et al found that the rate of refills did not vary significantly between patients with smaller versus larger opioid prescriptions,30 and refills were often prescribed postoperatively by providers other than the surgeon.31 Excess opioid prescription may also pose the risk of diversion and subsequent abuse.
While we did not identify sex as an important predictive risk factor, age was highly predictive of increased risk. The literature on sex and age as risk factors is variable, with some studies finding younger age and female sex associated with persistent opioid need post arthroplasty32–34 and others suggesting that male sex and older age increase the risk of prolonged opioid use.8–10 35 Similarly, we found that increased body mass index was an important feature predictive of persistent opioid use after lower extremity arthroplasty. This may be secondary to limitations in the progression of inpatient rehabilitation postoperatively36 or to altered pharmacokinetics; however, outcomes and complications in obese patients are comparable to those in non-obese patients.37–39
By leveraging patient datasets from the EMR, machine learning may be used to offer valuable clinical insight, in this case the prediction of persistent postoperative opioid use after arthroplasty. Such models could then be integrated into a clinical decision support system in the EMR to alert healthcare providers. This predictive model can serve as a foundation for a multidisciplinary transitional pain program, which supports longitudinal care from outpatient postoperative follow-up through long-term analgesic interventions—such as cryoanalgesia or percutaneous neuromodulation—to potentially reduce the likelihood of chronic opioid use.40 In addition, more studies are needed to validate the efficacy of transitional pain clinics, cryoanalgesia, and peripheral neuromodulation in reducing the incidence of persistent opioid use. Furthermore, these types of predictive models can also help identify potential subjects for clinical trials designed to enroll only high-risk patients. With the rising interest in early intervention for persistent postsurgical pain, and in anticipation of the emergence of multidisciplinary transitional pain clinics across the country, accurate and reliable predictive analytics for identifying at-risk patients become a cornerstone of this practice of precision medicine.
There are several limitations to our study. Importantly, this is a retrospective study, and thus the collection and accuracy of the data were only as good as what was recorded. We may therefore have missed some patients who did indeed require persistent opioids because of missing information in the charts on our review. Future studies would need to develop models from prospectively collected data to ensure the accuracy of the features and outcomes. Furthermore, while SMOTE is effective at generating synthetic data to reduce class imbalance, it does not take into consideration that neighbors may be from other classes, which can increase noise at the boundary between classes. SMOTE can also be problematic for high-dimensional data; however, if the number of variables can be reduced, the bias introduced by the k-nearest-neighbor process can be mitigated.41 Finally, our predictive models would need to be externally validated in separate datasets to assess their generalizability.
Accurate predictive modeling can provide perioperative physicians with clinical insight into the most vulnerable patient population. Integration of risk factors into an evidence-based perioperative screening tool may allow for early identification of at-risk patients, in turn allowing early intervention through a targeted, patient-centered, systematic approach via a transitional pain program. This approach can achieve the fine balance of addressing acute postoperative pain management while minimizing the risk of persistent postoperative opioid need.
Data availability statement
Data are available on reasonable request.
Ethics statements
Patient consent for publication
Ethics approval
This retrospective study was approved by the University of California San Diego’s Human Research Protections Program for the collection of data from our electronic medical record system (EMR).
References
Supplementary materials
Supplementary Data
This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.
Footnotes
Correction notice This article has been corrected since it was first published. The open access licence has been updated to CC BY.
Contributors RG, BH, SS and ETS were involved in study design. RG, RSP and ETS collected the data. RG, BH and SS were involved in the statistical analysis. RG, BH, RSP, SS, IC and KF were involved in the interpretation of results. RG, BH, RSP, SS, IC, KF and ETS were involved in the preparation and finalization of the manuscript. RG serves as the guarantor.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.