A 5‐gene prognostic nomogram predicting survival probability of glioblastoma patients

Abstract Background Glioblastoma (GBM) remains the most biologically aggressive subtype of gliomas with an average survival of 10 to 12 months. Considering that the overall survival (OS) of each GBM patient is a key factor in the treatment of individuals, it is meaningful to predict the survival probability for GBM patients newly diagnosed in clinical practice. Material and Methods Using the TCGA dataset and two independent GEO datasets, we identified genes that are associated with the OS and differentially expressed between GBM tissues and the adjacent normal tissues. A robust likelihood‐based survival modeling approach was applied to select the best genes for modeling. After the prognostic nomogram was generated, an independent dataset on different platform was used to evaluate its effectiveness. Results We identified 168 differentially expressed genes associated with the OS. Five of these genes were selected to generate a gene prognostic nomogram. The external validation demonstrated that 5‐gene prognostic nomogram has the capability of predicting the OS of GBM patients. Conclusion We developed a novel and convenient prognostic tool based on five genes that exhibited clinical value in predicting the survival probability for newly diagnosed GBM patients, and all of these five genes could represent potential target genes for the treatment of GBM. The development of this model will provide a good reference for cancer researchers.


| INTRODUC TI ON
Malignant gliomas are the most frequent and lethal brain tumors worldwide (Louis et al., 2007), and they account for approximately 80% of primary malignant brain tumors (Omuro & Deangelis, 2013).
These tumors grow rapidly, recur easily (Meir et al., 2010) and represent a leading cause of cancer-related deaths in adults and children (Natesh et al., 2015). Gliomas are categorized as low grade glioma (LGG) and high grade glioma (HGG) (Wang et al., 2009;Wang & Jiang, 2013). HGG includes grade IV glioblastoma (GBM) (Brennan et al., 2013), the most biologically aggressive subtype of glioma with an average survival of 10 to 12 months (Ning, Hao, Feng, & Zou, 2017). Invasion and neo-angiogenesis are the hallmarks of GBM (Cooper et al., 2012). The current standard of care for GBM patients is surgical resection followed by adjuvant radiation therapy and chemotherapy with the oral alkylating agent temozolomide (Parsons et al., 2008), which minimally contributes to the prognosis of GBM by only prolonging the median survival to 15 months (Stupp et al., 2005). Unfortunately, life-threatening tumor recurrence is inevitable in the vast majority of patients given the best available treatments (Wang et al., 2017). Thus, it is necessary to develop novel treatments to improve the prognosis of GBM. Gene targeting provides new hope for GBM patients. And in recent decades, considerable effort has been placed on the identification of genetic alterations in GBMs that might help to define GBM patients with varied prognoses or responses to specific therapies (Mellinghoff et al., 2005). However, very few tumor-specific targets have been identified, tested, and validated for clinical development . Hence, it is urgent to develop methods to identify reliable therapeutic gene targets that could enable earlier prognostic evaluation and better therapeutic strategies.
The overall survival (OS) of each GBM patient is a critical factor to devise a personal treatment plan. Therefore, it is important to develop a reliable tool to predict the survival probability for newly diagnosed GBM patients in clinical practice. Given the remarkable development of high-throughput technologies for the profiling of genome-wide methylation and expression, such as methylation microarray and MeDip-seq, and RNA-seq, and the publicly available datasets around the world (He, Zhang, Shi, & Lu, 2018;Ning et al., 2017), we were able to analyze world-wide data. In previous cancer studies, a risk score system based on genes was applied to identify cancer patients with a high risk of mortality (Chen et al., 2017). Here, we developed a new prognostic tool to make the prediction method more convenient and intuitive.
We aimed to generate an easy and effective prognostic tool based on several genes and other factors that may affect OS. Using the TCGA dataset and two independent GEO datasets, we identified 168 genes that were associated with OS and differentially expressed between GBM tissues and adjacent normal tissues. Furthermore, five of these genes were selected by a robust likelihood-based survival modeling approach to generate a gene prognostic nomogram.
We used an independent dataset to validate the effectiveness of the nomogram, demonstrating its clinical value for predicting the survival probability for newly diagnosed GBM patients.

| GBM dataset from TCGA and survival analyses
We first downloaded the gene expression dataset and the clinical information of GBM patients including 168 samples from TCGA (The Cancer Genome Atlas, RRID:SCR_003193) cohort using the Illumina RNA Sequencing method. The normalization was performed using the "calcNormFactors" function from the R package "edgeR" (edgeR, RRID:SCR_012802). This function normalizes for RNA composition by finding a set of scaling factors for the library sizes that minimize the log-fold changes between the samples for most genes using a trimmed mean of Mvalues (TMM) (Robinson & Oshlack, 2010). After normalization, the gene expression data of each sample were matched with the OS information. Statistical analyses were performed using R (R Project for Statistical Computing, RRID:SCR_001905). Kaplan-Meier curves for high and low expression groups of each gene were plotted using the "survfit" function from the R package "survival". Calculated using the "survdiff" function, a p-value of log rank test less than 0.05 was considered statistically significant.

| Analysis of differentially expressed genes
We identified the differentially expressed genes (DEGs) within the TCGA cohort, using the "exactTest" function from the R package "edgeR" which is based on the quantile-adjusted conditional maximum likelihood (qCML) method (Robinson & Smyth, 2008), via the thresholds of fold change greater than 1 or less than −1 and adjusted p < 0.05. This dataset included 168 samples from GBM tissues and five samples from adjacent normal tissues. Two independent microarray datasets from the GEO (Gene Expression Omnibus, RRID:SCR_005012) of the National Center for Biotechnology Information (NCBI, RRID:SCR_006472) were used to further narrow down these DEGs. GSE68848 (Madhavan et al., 2009) included the gene expression data of 228 GBM tissue samples and 28 samples from adjacent normal tissues, whereas GSE4290 (Sun et al., 2006) included 77 GBM tissue samples and 23 adjacent normal tissue samples (Table 1). All these samples from GEO were included on the Affymetrix Human Genome U133 Plus 2.0 Array platform. Since nontransformed gene expression values usually are substantially skewed in linear scale, we performed data normalization of the two datasets to obtain equal distributions of the probe signal intensities suitable for the analysis using the R package "limma" (LIMMA, RRID:SCR_010943). Expression values for all genes were transformed to the log base 2. DEGs were selected via the same standard as the TCGA dataset. The final set of DEGs was obtained via overlapping the three datasets above.

| Selection of best genes for modeling
After obtaining the DEGs associated with OS, we used a robust likelihood-based survival approach to select the best genes for modeling.
The analysis was performed in R environment using the R package "rbsurv" (rbsurv, RRID:SCR_001175). Our detailed algorithm is summarized as follows: 1. All the TCGA GBM samples were randomly divided into the training set with N*(1 − p) samples and the validation set with N*p samples (p = 1/3). Next, a gene was fitted to the training set of samples using the Cox proportional hazards model and the parameter estimate for this gene was obtained. Log likelihood was evaluated with the parameter estimate and the validation set of samples. This evaluation was performed for each gene.

2.
The above procedure was repeated 10 times, thus 10 log likelihoods were obtained for each gene. Next, the best gene g (1) with the largest mean log likelihood was selected. All the best survivalassociated genes were selected by the robust likelihood-based approach.
3. Let g (1) be the selected best gene in the previous step. Adjusting for g (1) , the next best gene was found by repeating the above two steps. In other words, g (1) + g (j) was evaluated for every j and an optimal two-gene model, g (1) + g (2) , was selected. This forward gene selection procedure was continued until fitting was impossible because of the lack of samples. Thus, a series of K models were generated:

4.
To avoid over-fitting, Akaike information criterion (AICs) for all the candidate models were computed, and an optimal model with the smallest AIC was selected. The model that is best according to AIC is the one that minimizes prediction error (Akaike, 1981;Cho, Yu, Kim, & Kang, 2008).

| Development of the gene prognostic nomogram
We used the R package "rms" to generate the prognostic nomogram based on the expression level of the genes selected by the previous step, using the 168 training samples from TCGA. In the package, the "cph" function was used to build the COX model. Based on the model, the "nomogram" function was used to generate the prognostic nomogram. The length of line segments corresponding to each variable in the prognostic nomogram reflects the contribution of predictors to the patient outcome.

| External validation of the gene prognostic nomogram
After the nomogram was generated, an independent dataset GSE43378 (Kawaguchi et al., 2013) (Affymetrix Human Genome U133 Plus 2.0 Array; N = 32) with complete OS information was used as the external validation dataset. For this cohort, we calculated the C-index to test the ability of the gene prognostic nomogram to discriminate the outcome of patients. To evaluate the calibration of the gene prognostic nomogram, we also generated calibration curves for the 12-, 15-and 18-month survival predictions, as well as survival curves for the high-risk group and low-risk group. To discover whether our gene nomogram could be deemed as an independent prognostic factor, multivariate Cox regression was performed with the nomogram, patient age, and gender.

| Genes associated with OS
The TCGA dataset included 168 effective samples with expression values for 29,846 genes. Additionally, all datasets contain information on sample observation time and censoring status. We preliminarily identified 2,154 survival associated genes with a p < 0.05, using log-rank test.

| DEGs in GBM and adjacent normal tissues
We compared the expression values of these 2,154 genes among 168 samples from GBM tissues and five samples from adjacent normal tissues in TCGA datasets. DEGs(11,972) were initially identified with the thresholds of fold change greater than 1 or less than −1 and adjusted p < 0.05. To further narrow down to a list of more effective DEGs, we screened DEGs in two independent datasets (GSE68848 included 228 samples from GBM tissues and 28 from adjacent normal tissues; GSE4290 included 77 samples from GBM tissues and 23 from adjacent normal tissues) using the same criteria. In the GSE68848 cohort, 2,753 DEGs were identified. In the

| Best genes for modeling
Of the overlapping 2,154 genes associated with OS and 2,005 DEGs, 168 genes were identified for the final analysis. Using a partial likelihood of the Cox proportional hazard regression model, we next selected the best survival-associated genes. We implemented a cross-validation technique considering the large data variability. A forward selection was employed to generate a series of gene models, and the optimal model was then selected using the minimal AIC. Finally, five genes (OSMR, BICDL1, SH3BP2, MSTN, and RGS14) were selected that can optimally predict the OS of GBM patients (Table 4).

| The development of a prognostic nomogram
We used the R package "rms" to generate the prognostic nomogram based on the expression level of the five genes (OSMR, BICDL1, SH3BP2, MSTN, and RGS14). As shown in Figure 3, "1" represents the high expression level of each gene, whereas "0" represents the low expression level of each gene. "Points" is the score corresponding to the expression level of a single gene. "Total points" is the sum of the "Points" that five genes get, which corresponds to the accurate survival probability of each sample. An increased "Total points" indicates greater mortality risk for GBM patients.

| External validation of the prognostic nomogram
An independent cohort of GBM patients with OS and gene expres- that it could work as a prognostic factor independent of age or gender. Therefore, the prognostic nomogram based on five genes could effectively predict the survival probability of patients with GBM.
The process of this study is represented in Figure 6.

| D ISCUSS I ON
In this study, we developed a 5-gene prognostic nomogram that exhibits the ability of predicting the survival probability for patients within GBM. Using this tool, we could predict patients with higher risk of mortality, and thus a need for them to get more immediate attention and treatment. Within the survival analysis of the GBM samples from TCGA, a total of 2,154 genes were identified to be associated with the OS of GBM patients. Meanwhile, 2,005 genes were found differentially expressed between GBM tissues and adjacent normal tissues by analyzing the TCGA dataset and two other independent datasets. By KEGG analysis, we found that these DEGs were enriched in the signaling pathways such as cell cycle, p53 signaling pathway, retrograde endocannabinoid signaling and dopaminergic synapse. A previous study showed that the cell cycle and p53 signaling pathways co-mutated in GBM (Wei, Wang, & Zhao, 2014). Notably, the p53 signaling pathway exerts an important role in glioma pathogenesis (Stegh & DePinho, 2011). After identifying F I G U R E 1 The 2,005 differentially expressed genes (DEGs) were identified by overlapping the TCGA cohort, GSE68848 cohort and GSE4290 cohort using Venny 2.1.0. The criteria of DEGs was set as the thresholds of fold change greater than 1 or less than −1 and adjusted p < 0.05 F I G U R E 2 GOSlim summary for the differentially expressed genes (DEGs). Each Biological Process, Cellular Component and Molecular Function category is represented by a red, blue and green bar, respectively. The height of the bar represents the number of DEGs observed in the category. Figure 2a shows the GO terms of upregulated genes. Figure 2b shows the GO terms of downregulated genes  Recently, a study demonstrated that OSMR is required for GBM tumor growth. The study discovered that the cytokine receptor OSMR is an essential co-receptor of EGFRvIII that plays a prominent role in GBM tumorigenesis, and it is also a highly upregulated direct transcriptional target gene of STAT3 in GBM (Jahaniasl, Yin, & Soleimani, 2016). Similarly, another study identified OSMR as a novel key regulator of brain tumor stem cell proliferation and GBM tumorigenesis (Mohan, Bonni, & Jahani-Asl, 2017). Their findings highlight the significance of OSMR as a potential druggable target in GBM therapy. Furthermore, our study not only validated the relationship between OSMR and GBM but also demonstrated the effectiveness of using OSMR on the prognosis of GBM.
SH3BP2, a c-Abl binding protein in mice and humans (Bell, Shaw, Jou, Myers, & Knowles, 1997;Ren, Mayer, Cicchetti, & Baltimore, 1993), is expressed in most cell types (Reichenberger, Levine, Olsen, Papadaki, & Lietman, 2012). SH3BP2 acts as an adapter protein to control intracellular signaling by interacting and forming complexes with binding proteins (Deckert & Rottapel, 2006;Le et al., 2004) and scaffolding proteins (Le, Moon, Foucault, Breittmayer, & Deckert, 2007). MSTN, a potent negative regulator of muscle growth, is indispensable in regulating neuronal and muscle function (Augustin et al., 2017). Inhibition of MSTN is a promising method to alleviate muscle wasting, especially considering that loss of functional myostatin in humans is not related with apparent deleterious effects other than muscle hypertrophy (Schuelke et al., 2004). Currently in clinical trials, applying monoclonal antibodies that bind to myostatin to inhibit its function is a potential therapy to treat cachexia syndrome in cancer patients (Nicole, Lisa, Wolfram, Anker, & Stephan, 2014). RGS14, a multifunctional scaffolding protein, integrates heterotrimeric G protein and H-Ras signaling pathways (Vellano, Brown, Blumer, & Hepler, 2012), and plays a key role in regulating synaptic plasticity (Branch & Hepler, 2017). However, the function of BICDL1 remains unknown.
Except OSMR, the other four genes have not been reported in previous GBM studies, indicating that these genes could represent potential target genes for GBM treatments, and their biological roles in the development of GBM would be of great interest in further studies.
The 5-gene prognostic nomogram was validated using an independent dataset on different platform. The C-index revealed the

F I G U R E 3
The 5-gene prognostic nomogram based on the expression level of OSMR, BICDL1, SH3BP2, MSTN, and RGS14. The high and low expression levels of each gene were represented by "1" and "0" respectively. "Points" is the score corresponding to the expression level of a single gene. "Total points" is the sum of the "Points" which five genes get. The higher "Total points" value means the lower survival probability of GBM patients nomogram's ability to discriminate the outcome of patients; The calibration curve presented the calibration between the probability of the actual outcome and the probability of prediction; The result of multivariate Cox regression showed that the nomogram could work as an independent prognostic factor. All these above findings demonstrated that this prognostic nomogram based on five genes has the capability of predicting the survival probability of GBM patients. On the other hand, we have considered trying to build a new model including patients' age. However, after adjustment for age, we obtained another 9-gene model whose prediction power is not as good as the 5-gene model. Finally, the 5-gene model was chosen.
F I G U R E 5 The survival curves of the high risk group and low risk group of GSE43378 cohort divided by 5-gene prognostic nomogram. The high risk group exhibited a poorer prognosis compared with the low risk group (p = 0.018) The process of developing the 5-gene prognostic nomogram. First, 168 differentially expressed genes associated with the OS in GBM patients were identified by univariate survival analysis and differential expression analysis. Next, a robust likelihood-based survival modeling approach was applied to identify the best genes for prognosis prediction. Then, the gene prognostic nomogram was generated based on five genes (OSMR, BICDL1, SH3BP2, MSTN, and RGS14). Finally, the 5-gene prognostic nomogram was validated in an independent cohort on different platforms F I G U R E 4 Performance of 5-gene prognostic nomogram in predicting survival probability of GBM patients from GSE43378 cohort. (a) The calibration curve was generated for 12-month survival predictions of the 5-gene prognostic nomogram. (b) The calibration curve was generated for 15-month survival predictions of the 5-gene prognostic nomogram. (c) The calibration curve was generated for 18-month survival predictions of the 5-gene prognostic nomogram In prior cancer studies, the risk score system was often constructed to predict the mortality risk for patients. However, in clinical practice, risk scores occasionally do not accurately reflect the survival probability. It is important to build a convenient predicting tool for clinical practice. Our study developed a prognostic nomogram with five genes that could effectively predict the survival probability for GBM patients. Moreover, the method is flexible and convenient.
The result is presented by relative risk combined with absolute survival probability, which is more visual and intuitive. Based on the relative risk ratio, the specific survival probabilities of the individuals can be queried according to the level of the five genes, and patients predicted to be in severe condition will very likely need more attention and care. The prognostic nomogram can be easily generated using R software and can serve as a powerful tool in model prediction. It is important to make Real-time Quantitative PCR (QPCR) assay more popular in clinical practice. The expression level of genes could be obtained with QPCR, which could make our gene nomogram implemented into routine clinical setting conveniently. This model will provide a good reference for cancer researchers. In our future studies, we will collect more clinical tumor tissues which are used to determine the cut-off point of risk. Meanwhile, other wellknown clinical prognostic factors that could not be obtained from the database, like IDH mutation and pre-operative KPS, should be the key point in our further study. With the patient's clinical information collected more comprehensively, we will try to build a model with adjustment for these well-known risk factors or find a way to combine our nomogram with these clinical characteristics.

CO N FLI C T O F I NTE R E S T
The authors declare no competing interests.