Delta radiomic features improve prediction for lung cancer incidence: A nested case–control analysis of the National Lung Screening Trial

Abstract

Background: Current guidelines for lung cancer screening increased the threshold for a positive scan to a longest diameter of 6 mm. We extracted radiomic features from baseline and follow‐up screens and performed size‐specific analyses to predict lung cancer incidence using three nodule size classes (<6 mm [small], 6‐16 mm [intermediate], and ≥16 mm [large]).

Methods: We extracted 219 features from baseline (T0) nodules and 219 delta features, defined as the change from T0 to the first follow‐up (T1). Nodules were identified for 160 incidence cases diagnosed with lung cancer at T1 or the second follow‐up screen (T2) and for 307 nodule‐positive controls that had three consecutive positive screens not diagnosed as lung cancer. The cases and controls were split into training and test cohorts; classifier models were used to identify the most predictive features.

Results: The final models revealed modest improvements for baseline and delta features when compared to only baseline features. The AUROCs for small‐ and intermediate‐sized nodules were 0.83 (95% CI 0.76‐0.90) and 0.76 (95% CI 0.71‐0.81) for baseline‐only radiomic features, respectively, and 0.84 (95% CI 0.77‐0.90) and 0.84 (95% CI 0.80‐0.88) for baseline and delta features, respectively. When intermediate and large nodules were combined, the AUROC for baseline‐only features was 0.80 (95% CI 0.76‐0.84) compared with 0.86 (95% CI 0.83‐0.89) for baseline and delta features.

Conclusions: We found modest improvements in predicting lung cancer incidence by combining baseline and delta radiomics. Radiomics could be used to improve current size‐based screening guidelines.


| INTRODUCTION
The National Lung Screening Trial (NLST) compared low-dose helical computed tomography (LDCT) vs standard chest radiography for three annual screens and revealed a 20% relative reduction in lung cancer mortality among participants screened with LDCT. [1][2][3] In the LDCT arm, screen-detected incident lung cancers were found 2.7-fold higher associated with a stage shift from late stage to more early-stage lung cancers and exhibited improved 5-year survival compared with prevalence cancers diagnosed at baseline. 3,4 Despite the benefits of lung cancer screening, LDCT imaging is associated with a high rate of detection of indeterminate pulmonary nodules (IPNs), of which only a fraction are diagnosed as lung cancer. In the NLST, 96.4% of the positive LDCT screens were false positives/IPNs. Though clinical guidelines [5][6][7] provide for the evaluation and follow-up of nodules, there are no validated clinical decision tools to predict lung cancer risk and the probability of cancer development. Ideally, an efficient and accurate noninvasive approach should be developed as a clinical decision tool for radiologists and pulmonologists to better manage nodules, especially IPNs, in the lung cancer screening setting.
Radiomics is the process of converting standard-of-care digital medical images into quantitative image-based feature data that can be subsequently analyzed using conventional biostatistics and machine learning methods. 6 With high-throughput computing, it is now possible to rapidly extract radiomic features that quantify the size, shape, intensity, and texture of a region of interest. As radiomic features are likely capturing biological and pathophysiological information about the region of interest, 6 radiomics has the potential to provide a rapid and accurate noninvasive approach to better manage pulmonary nodules detected by LDCT in the lung cancer screening setting.
In this study we conducted a nested case-control analysis of the NLST, using training and test sets, to identify radiomic features that are predictive of lung cancer incidence. We analyzed robust and reproducible radiomic features 8 from baseline (T0)-positive screens in the LDCT arm of the NLST to identify radiomic models that predict lung cancer incidence in the first (T1) and second (T2) follow-up screening intervals. Moreover, we also included delta radiomic features to determine whether changes in the nodules over time from T0 to T1 improve predicting lung cancer incidence. Current guideline algorithms for managing LDCT-detected solid and subsolid nodules are largely based on size, specifically longest diameter. As recommended by the National Comprehensive Cancer Network (NCCN) 5 and the American College of Radiology (ACR), 6,7 the current cutoff size for assessing lung nodules increased to 6 mm rather than the 4 mm originally used in the NLST. 2,3 Although this increase in threshold positivity has been reported to decrease false-positive results, 7,9,10 decision support tools and lung cancer risk prediction are still lacking for IPNs ≥6 mm. As such, we also performed size-specific analysis based on three size classes of the nodules: <6 mm [small nodules], 6-16 mm [intermediate-sized nodules], and ≥16 mm [large nodules]. To our knowledge, this is one of the first radiomic analyses in lung cancer screening to utilize delta radiomic features (changes in radiomics over time) by nodule size class to predict lung cancer incidence.

| NLST study population
This research was approved by the Institutional Review Board (Advarra, Inc, Columbia, MD, USA). Deidentified data and LDCT images were obtained through the National Cancer Institute (NCI) Cancer Data Access System (CDAS). 9 The NLST study design and main findings have been described previously. 2,3 Briefly, the NLST was a randomized multicenter trial comparing screening with LDCT to CXR in high-risk individuals. Eligibility criteria included current or former smokers aged 55-74 years with a minimum 30 pack-years smoking history; former smokers had to have quit within the past 15 years.

| NLST CT screening results
The NLST protocol defined a positive screening result as one or more noncalcified nodules or masses measuring ≥4 mm in axial diameter or, less commonly, other abnormalities such as adenopathy or pleural effusion. 2,3 Positive screens were defined in the setting of abnormalities on baseline screens or abnormalities on follow-up screens that were new, stable, or that evolved with the latter demonstrating an increase in nodule size, consistency, or other characteristic potentially related to lung cancer. Participants with positive screening results received follow-up recommendations; trial-wide guidelines for the management of positive screens were developed, but were not mandated by protocol.
Negative screens were defined as CT scans with no abnormalities, minor abnormalities not suspicious for lung cancer, or significant abnormalities not suspicious for lung cancer. In this analysis, we did not include any participants who had a negative screening result.

| Nested case-control study design
We performed a nested case-control study comprising screen-detected incident lung cancers and matched nodule-positive controls from the LDCT arm of the NLST. Based on the schema originally described in Schabath et al, 4 the screen-detected incident lung cancers and nodule-positive controls are depicted in Figure 1A.

| Lung cancer cases
We identified 196 screen-detected incident lung cancers who had a baseline-positive screen (T0) that was not diagnosed as lung cancer and then were diagnosed at either the first (T1, N = 104) or second follow-up (T2, N = 92).

| Nodule-positive controls
Using a 2:1 nested case-control study design, we identified 392 LDCT screening participants who had three consecutive positive screens (T0 to T2) that were not diagnosed as lung cancer. These NLST participants were designated as nodule-positive controls in the current analysis. The nodule-positive controls were frequency matched to the lung cancer cases on age at enrollment (±5 years), sex, race/ethnicity, and smoking status. This study design minimizes the influence of confounders between the cases and the controls. As such, radiomic image features that differentiate cases and nodule-positive controls are not likely to be attributable to external risk factors.

| Training and test sets
After excluding participants without complete LDCTs or whose nodule/abnormality could not be verified, the 196 lung cancer cases were reduced to 160. Likewise, the original set of 392 nodule-positive controls was reduced to 307. The lung cancer cases in Cohort 1 were diagnosed at T1 and the lung cancer cases in Cohort 2 were diagnosed at T2. All of the nodule-positive controls had a positive scan from T0 to T2 and never developed lung cancer through T7 based on the available NLST data. Cohort 1 was used as a training set and Cohort 2 as a test set.

| Target lung nodule identification
The identification of target lung nodules has been previously described. 11 Briefly, two radiologists (YL and QL) reviewed all LDCT images at both the lung window setting (width, 1500 HU; level, −600 HU) and the mediastinal window setting (width, 350 HU; level, 40 HU). The identification of cancerous nodules among the screen-detected incident lung cancers was based on data provided by the NLST (ie, location and size). As nodule location was not always available, the senior radiologist (YL) 11 identified the nodules and manually mapped each nodule from T0 to T1. The locations of all nodules in this analysis are publicly available in the TCIA database (www.cancerimagingarchive.net). For NLST participants with multiple lung nodules, the largest nodule at baseline (T0) and its corresponding follow-up nodule were used for radiomic feature extraction.

| CT segmentation, feature extraction, and feature selection
The workflow of our radiomic pipeline 12 and analyses is depicted in Figure 1B. As previously described, 11 a single-click segmentation ensemble and subsequent feature extraction were performed using Definiens software (Definiens, Inc, AG Cambridge, MA, USA). There were 219 features extracted to quantify size, shape, location, and texture information of the pulmonary nodules. 6 The complete list of features used in our analyses has been previously described 8 and was reduced to the most consistent features based on our previous test/retest analyses (RIDER stable features). Additionally, we applied the same filter to Cohort 1 to identify features that were "stable" over time (denoted as C1 stable). C1 stable features were filtered using an approach analogous to that used for identifying RIDER stable features. For the RIDER stable features, two LDCT scans were performed within a 15-minute interval. For the C1 stable features, we used the T0 and T1 features of the NLST subjects as the test/retest set. For each feature, we computed the concordance correlation coefficient 13 and the dynamic range, and we selected as C1 stable those features for which both parameters were greater than 0.95. Although we used a test/retest filter for initial feature selection, we then built models that classify the data using a small number of the most predictive features. For that purpose, we used the feature selectors ReliefF (RfF) and Correlation-based Feature Selector (CFS). In each analysis, we selected the top 5 and top 10 ranked features. Tables 2 and 4 present the performance statistics based on the models with the best AUROC.
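The test/retest filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study (which relied on Definiens and Weka). The concordance correlation coefficient follows Lin's standard formula. The dynamic range definition used here (one minus the mean absolute test/retest difference divided by the observed range) is an assumption, as are the feature names and values.

```python
from statistics import fmean


def concordance_cc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    mx, my = fmean(x), fmean(y)
    n = len(x)
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


def dynamic_range(x, y):
    """Assumed definition: 1 - mean|x - y| / (max - min over both scans)."""
    spread = max(max(x), max(y)) - min(min(x), min(y))
    mean_abs_diff = fmean(abs(a - b) for a, b in zip(x, y))
    return 1 - mean_abs_diff / spread


def stable_features(test, retest, threshold=0.95):
    """Keep features whose CCC and dynamic range both exceed the threshold."""
    return [
        name for name in test
        if concordance_cc(test[name], retest[name]) > threshold
        and dynamic_range(test[name], retest[name]) > threshold
    ]


# Hypothetical feature values for five nodules at T0 (test) and T1 (retest).
t0 = {"volume": [10.0, 20.0, 30.0, 40.0, 50.0],
      "entropy": [1.0, 2.0, 3.0, 4.0, 5.0]}
t1 = {"volume": [10.5, 19.5, 30.5, 39.5, 50.5],
      "entropy": [3.0, 1.0, 5.0, 2.0, 4.0]}

print(stable_features(t0, t1))
```

In this toy input the hypothetical "volume" feature changes little between scans and passes both thresholds, whereas "entropy" is reshuffled between scans and is filtered out.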

| Baseline and delta features
For all available cases and controls, we extracted radiomic features from the T0 baseline screen and the T1 follow-up screen. To assess changes in nodules after an approximately one-year interval, we subtracted the T0 and T1 features to generate delta features. For all patients in our analysis, the median time from randomization to the T1 screen was 375 days (interquartile range = 360-400 days). As such, the time interval to the T1 screen is relatively consistent for all subjects and eliminates the need to normalize the delta features with respect to time. In Tables 2 and 4, delta features are denoted with a "∆" and baseline features are denoted with "T0".
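As a small illustration of how a combined baseline and delta feature vector could be assembled, the sketch below computes delta features as the T1 value minus the T0 value. The sign convention, the feature names, and the values are assumptions for illustration; the text above states only that the T0 and T1 features were subtracted.

```python
def delta_features(t0, t1):
    """Return the change (assumed here to be T1 - T0) for each shared feature."""
    return {f"delta_{name}": t1[name] - t0[name] for name in t0 if name in t1}


def combined_vector(t0, t1):
    """Concatenate baseline (T0) features with their delta features."""
    combined = {f"T0_{name}": value for name, value in t0.items()}
    combined.update(delta_features(t0, t1))
    return combined


# Hypothetical features for one nodule at the baseline and first follow-up screens.
baseline = {"longest_diameter_mm": 7.0, "mean_hu": -350.0}
follow_up = {"longest_diameter_mm": 9.5, "mean_hu": -300.0}

print(combined_vector(baseline, follow_up))
```

Because the screening interval was relatively consistent across subjects, no division by elapsed time is applied in this sketch.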

| Size-specific analyses: Splitting the training and test sets on nodule size
Size-specific analyses were performed based on the longest diameter (LD) of the T0 nodules. Current recommendations by the NCCN and the American College of Radiology (ACR) have increased the threshold for a positive scan to a nodule with a longest diameter of 6 mm 5 rather than the 4 mm originally used in the NLST. 3 As such, we performed size-specific analyses using three nodule size classes: <6 mm (small), 6-16 mm (intermediate), and ≥16 mm (large). To compute the overall accuracy, sensitivity, and specificity, we summed the confusion matrices of the size groups and derived the statistics from the summed matrix. Computation of the area under the receiver operating characteristic curve (AUROC) uses a list of predicted probabilities that each instance belongs to a class. To compute the "overall" AUROC, we merged the probability lists of the size groups and computed the AUROC on the merged list.
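The pooling of size-specific results into "overall" statistics can be sketched as follows. This is an illustrative Python version under stated assumptions, not the Weka computation used in the study: the AUROC is computed with the rank-based (Mann-Whitney) formulation, and the scores, labels, and confusion matrix counts are hypothetical.

```python
def pooled_metrics(matrices):
    """Sum per-size-class confusion matrices, then derive overall statistics."""
    tp = sum(m["tp"] for m in matrices)
    fp = sum(m["fp"] for m in matrices)
    tn = sum(m["tn"] for m in matrices)
    fn = sum(m["fn"] for m in matrices)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auroc(scores, labels):
    """Probability that a random case outscores a random control (ties = 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def overall_auroc(groups):
    """Merge the (scores, labels) lists of each size class into one AUROC."""
    scores = [s for group_scores, _ in groups for s in group_scores]
    labels = [y for _, group_labels in groups for y in group_labels]
    return auroc(scores, labels)


# Hypothetical predicted probabilities (1 = lung cancer case, 0 = control).
small = ([0.9, 0.4, 0.3, 0.2], [1, 0, 1, 0])
large = ([0.8, 0.7, 0.1], [1, 1, 0])
# Confusion matrices for the same predictions at a 0.5 cut-off.
matrices = [{"tp": 1, "fp": 0, "tn": 2, "fn": 1},
            {"tp": 2, "fp": 0, "tn": 1, "fn": 0}]

print(pooled_metrics(matrices))
print(overall_auroc([small, large]))
```

Note that merging probability lists assumes the scores of the different size-specific classifiers are on comparable scales; the overall AUROC can therefore differ from any of the size-specific AUROCs.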

| Classifiers
Of the 219 features, there were 23 RIDER stable features and 37 C1 stable features. The C1 stable features are provided in Table S1; features marked with an asterisk in Table S1 are also in the RIDER stable feature set. Although we used a test/retest filter for initial selection, our goal was to identify a model that is able to classify the data with a small number of features. Size-specific nodules from Cohort 1 were used to create the training datasets. For each training dataset, we applied a feature selector to simplify the resulting model and remove noisy features. The selected features were used to train a classifier, which was then tested on the corresponding size-specific subset of Cohort 2. From the multiple candidate models, we selected the one that produced the highest AUROC. For the feature selectors, we used ReliefF (RfF) [14][15][16] and Correlation-based Feature Selector (CFS). For each feature selector, we selected the top 5 and top 10 ranked features to identify highly predictive parsimonious models. One benefit of splitting the datasets by size is that a classifier can be chosen independently for each subset. For each subset, we applied the following classifiers:
• Decision tree-J48 17
• Rule-based Classifier-JRIP 18
• Naive Bayes 19
• Support Vector Machine (SVM) 19
• Random Forests 20
For the SVM classifier, we used both a radial basis function kernel and a linear kernel. C and Gamma were tuned on the training set using grid search. Performance statistics and 95% confidence intervals (CIs) were calculated for each model, including AUROC, accuracy, sensitivity, and specificity. All the experiments were performed in Weka version 3.6.13. 21

TABLE 1 Study population characteristics of incident lung cancer cases and nodule-positive controls by three nodule size classes

| Oversampling Technique
Because of the imbalance of cases and controls across the various size classes, we also applied the Synthetic Minority Oversampling Technique (SMOTE) 22 in the analyses. SMOTE is an oversampling approach in which the minority class is over-sampled by creating "synthetic" examples rather than by oversampling with replacement. To create a synthetic instance, one example (nodule feature vector) is randomly picked from the minority class. For that example, the five nearest neighbors in the same class are identified, and one of these neighbors is randomly chosen. For each numeric feature, the example and its chosen neighbor define a line segment between the two feature values. The new synthetic instance is a randomly chosen point on that line segment for each feature. The process repeats with a new randomly chosen example until the desired number of instances is produced.
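The procedure described above can be sketched in Python as follows. This is a minimal illustration and not the Weka implementation used in the study. It assumes Euclidean distance for the nearest-neighbor search and a single interpolation factor per synthetic instance (so the new point lies on the segment joining the two feature vectors); the feature vectors are hypothetical.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority-class vectors by interpolating between neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        # Randomly pick one minority-class example.
        base = rng.choice(minority)
        # Find its k nearest neighbors within the minority class.
        others = [v for v in minority if v is not base]
        neighbors = sorted(others, key=lambda v: math.dist(base, v))[:k]
        # Randomly choose one neighbor and a point on the connecting segment.
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # assumption: one factor shared by all features
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical two-feature vectors for four minority-class nodules.
minority_class = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [4.0, 15.0]]
new_examples = smote(minority_class, n_synthetic=6)
print(new_examples)
```

Every synthetic vector falls between two existing minority-class vectors, so no value outside the observed range of a feature is generated. With fewer than k + 1 minority examples, the sketch simply uses all remaining examples as neighbors.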

| RESULTS
The study population characteristics for the three size classes by the training and test sets of the lung cancer cases and nodule-positive controls are presented in Table 1. None of the study population characteristics were significantly different between the training cohort and test cohort (Table S2) and, as previously reported (Table 1 in We also computed the overall AUROC (Table 2), which included all nodule sizes, for baseline-only features (AUROC = 0.83; 95% CI 0.82-0.86) and baseline and delta features (AUROC = 0.86; 95% CI 0.83-0.89). As such, we had a higher AUROC and accuracy for the large-sized nodule model (0.86) compared with the overall model (0.83). When comparing the overall model to the intermediate-sized nodule model, the overall model had higher AUROC, but the intermediate-sized model had higher accuracy (0.76 vs 0.74) and specificity (0.92 vs 0.90). When comparing the overall model to the small-sized nodule model, the AUROCs and specificities were identical. The overall AUROC for three size classes for baseline and delta features was 0.86 (0.83-0.89), which was higher than the AUROCs for the three size-specific models. However, the large-sized nodule model had a higher accuracy than the overall model (0.88 vs 0.78). Likewise, the intermediate-sized nodule model had a higher accuracy than the overall model (0.80 vs 0.78).
We also found that some of the performance statistics improved when we applied the SMOTE method, which over-samples the minority class by creating synthetic minority class examples (Table 2).
Because there were only 16 lung cancer cases and 7 nodule-positive controls with large nodules (≥16 mm), we combined the intermediate and large groups and repeated the analyses (Tables 3 and 4 and Figure 2A-D). When the intermediate-sized nodules and large nodules were combined into a single group (≥6 mm), the AUROC for baseline-only features was 0.80 (95% CI 0.76-0.84) compared with an AUROC of 0.86 (95% CI 0.83-0.89) for baseline and delta features. The AUROC for the overall model was identical to that of the large-sized nodule model; however, the large-sized nodule model had higher accuracy and specificity. Figure 2Aa-C presents the AUROC plots for the final models for the small nodules and large nodules with and without SMOTE.

| DISCUSSION
While lung cancer screening with LDCT for high-risk individuals has unequivocally demonstrated that early detection saves lives, the current screening strategy comes at the cost of identifying large numbers of indeterminate nodules, with limited clinical decision tools to manage them. 23 As such, we conducted a nested case-control analysis of the NLST to identify radiomic-based models that predict lung cancer incidence. We utilized training and test sets of incident lung cancer cases and nodule-positive controls to generate performance statistics of baseline-only radiomic features vs the combination of time-varying delta radiomic features and baseline features. Additionally, analyses were conducted across three nodule size classes. Overall, we found that combining delta radiomics with baseline radiomics generally improved the performance statistics to predict lung cancer incidence when compared to using only baseline radiomic features. However, we note inconsistent results in some of the performance statistics when comparing the overall models, which were not size-specific, to the size-specific models. As such, our findings suggest there is a trade-off in terms of performance between using nodule size-specific models and an overall model. Previous studies have shown the utility of delta radiomics in lung cancer prognostication and therapy response, 24,25 and to the best of our knowledge, this is the first analysis to consider delta radiomics in the lung cancer screening setting. The modest improvements from including delta features with the baseline features suggest there were not substantial time-varying differences from the baseline screen (T0) to the first follow-up screen (T1), which occurred 12 months later. In our previous work 4 that evaluated the screening histories and outcomes from T0 to T2 of the entire CT arm of the NLST, there were 6921 nodule-positive controls at T0, then 4951 positive screens at T1 of which only 104 were diagnosed as lung cancer. As such, the majority of the nodules were either stable at T1 (N = 4951 nodule-positive controls) or they resolved and were scored as a negative screen at T1 (N = 1488 negative screens). So, the observed modest improvements in performance statistics of delta radiomics in the NLST warrant their further evaluation in other screening settings.
In our previous work using baseline-only features in the NLST, 11 a random forest classifier identified a model of 23 features that could predict nodules that would be diagnosed as lung cancer 1 year after baseline with an AUROC of 0.83 and 2 years after baseline with an AUROC of 0.75. Our current analysis differed from the previous work 11 in several ways. First, the prior work identified a single model based on the best accuracy using only baseline features. In the current analysis, we included delta radiomics, generated radiomic models by nodule size class, trained our models to identify the features that achieved the best AUROCs, and applied a SMOTE approach since there was an imbalance of cases and controls across the various size classes. Additionally, to identify highly predictive parsimonious models with fewer features than were previously identified (23 features), we chose to identify models containing the top 5 and top 10 features. We focused on AUROC because prior work 26 demonstrated that AUROC is statistically consistent and more discriminating than accuracy, and is therefore a better measure for evaluating learning algorithms.
A novel and important aspect of our analyses was the radiomic models by nodule size class. Nodule size is a key characteristic of malignancy whereby larger nodules have a higher probability of being diagnosed as lung cancer. 27 screening guidelines is largely based on size and shape of the nodule. [5][6][7] Certainly, reductions in false-positive rates have been reported 7,9,10 Because of the distribution of nodule sizes among the cases and controls (Figure 3), we selected different nodule size cut-points. Importantly, we note that each size class yielded different final models of radiomic features, suggesting the potential importance of size-specific biomarkers to improve nodule management. Another novel approach and subsequent finding in our analysis was the improvement in sensitivity and specificity when we applied SMOTE. 29 Classification analyses using class-imbalanced data are biased in favor of the majority class, and the bias is even larger for high-dimensional data where the number of variables greatly exceeds the number of samples. 29 To address potential bias and imbalance, we applied SMOTE, a popular oversampling method that was originally proposed as an improvement over random oversampling. In our analyses, we found that SMOTE tended to have marginal influence on the AUROCs; however, we observed consistent modest improvements in sensitivity and specificity when SMOTE was used compared with the same size class without SMOTE. This suggests SMOTE is not beneficial for improving the discrimination of classifiers, which has been previously reported by Blagus and Lusa, 29 but improves the performance of the classifier in terms of sensitivity and specificity.
There are some limitations and some strengths of this analysis. Although Lung-RADS™ categories 10 are commonly used in lung cancer screening, we opted to use categories based on longest diameter because, with this nested case-control approach, we did not have adequate representation across Lung-RADS™ categories, 10 as the majority of the nodules were between 6 and 16 mm. Nonetheless, our analyses did demonstrate that nodule size-specific models may have utility in improving some performance statistics compared with an overall model. Another potential limitation is the modest sample size resulting from the nested case-control design. The nested design was used because it is not feasible to segment and extract radiomic features on >4,000 T0- and T1-positive scans. Although our radiomic pipeline is well-established 12 and is efficient for studies on lung cancer screening, lung cancer outcomes, and radiogenomics, 11,[30][31][32][33][34][35][36] nodule identification and segmentation is still a time-consuming bottleneck. However, we are actively pursuing approaches for automated segmentation, which will allow us to segment and extract radiomic features on large numbers of LDCT scans. We acknowledge there were fewer lung cancer cases in the training set and there was an imbalance across size classes; however, training on a subset improved accuracy and AUROC to predict lung cancer incidence. Another possible limitation is that unmeasured/unknown confounders may exist between the lung cancer cases and nodule-positive controls. However, we attempted to reduce confounding between the lung cancer cases and nodule-positive controls by matching on key demographic features. Despite the aforementioned limitations, we applied rigorous training and testing analyses to identify informative, parsimonious models that predict lung cancer incidence in the lung cancer screening setting.
In conclusion, we demonstrated that the inclusion of delta radiomic features improves the ability to classify which lung nodules will be diagnosed as an incident lung cancer more accurately than previous reports. [37][38][39][40][41] At present, adjunct biomarkers are not used for lung cancer screening, largely because of their early stage of development. 42 Published reports have found that blood-based and circulating biomarkers exhibited sensitivity values ranging from 40% to 91% and specificity values from 75% to 84%, [43][44][45] with possible cancer detection capability as early as 12-29 months prior to a lung cancer diagnosis. 46 However, a critical goal of biomarker research is to add value to existing risk assessment standards, and the biomarker should be designed to supplement the current diagnostic/management tools. 47 As such, radiomic-based biomarkers are attractive because they can be incorporated into the current radiology workflow, are noninvasive, and can be generated from standard-of-care images, negating the requirement for additional laboratory-based biomarkers.