Prediction of neurologic deterioration based on support vector machine algorithms and serum osmolarity equations

Abstract Objective Dehydration on admission is correlated with neurological deterioration (ND). The primary objective of our study was to use support vector machine (SVM) algorithms to identify an ND prognostic model based on dehydration equations. Methods This study included a total of 382 patients hospitalized with acute ischemic stroke. The following parameters were recorded: age, sex, laboratory values (serum sodium, potassium, chloride, glucose, and urea), and vascular risk factor data. Receiver operating characteristic (ROC) curve analysis was used to evaluate the discriminative performance of the blood urea nitrogen (BUN)/creatinine (Cr) ratio as well as each of 38 equations for predicting ND. We used the Boruta algorithm for feature selection. After optimizing the SVM kernel parameters, we built an SVM model to predict ND and used the test set to obtain predictive values for assessing model accuracy. Results In total, 102 of 382 patients (26.7%) with acute ischemic stroke developed ND. In all patients, the BUN/Cr ratio and each of the 38 equations were significant predictors of ND. Equation 20 [1.86 × Na+ + glucose + urea + 9] yielded the maximum area under the ROC curve (AUC) and fared best in terms of prognostic performance (a cutoff value of 284.49 mM yielded a sensitivity of 94.12% and specificity of 61.43%). Equation 32 predicted poststroke ND across population groups and worked well in older as well as young adults (a cutoff value of 297.08 mM yielded a sensitivity of 93.14% and specificity of 60.00%). Feature selection by the Boruta algorithm reduced the number of variables from 18 to 5. With feature selection, the specificity of the SVM prediction model on test samples increased from 44.1% to 89.4%, and the AUC increased from 0.700 to 0.927. Conclusions SVM algorithms can be used to establish a prediction model for dehydration‐associated ND, with good classification results.


| INTRODUCTION
Dehydration in patients with acute ischemic stroke has been observed in many clinical experimental studies (Crary et al., 2013).
Furthermore, dehydration on admission is a strong predictor of clinical outcome and is correlated with the volume of the ischemic lesion and with an increased risk of mortality (Rowat, Graham, & Dennis, 2012; Schrock, Glasenapp, & Drogell, 2012). However, in the early assessment and treatment of cerebral infarction, serum osmolality is not regularly recommended as a diagnostic test (Adams et al., 2007). Therefore, the blood urea nitrogen (BUN)/creatinine (Cr) ratio, as a routinely available indicator of hydration, has been used in numerous studies (C. J. Lin et al., 2016; L. C. Lin, Lee, Hung, Chang, & Yang, 2014; Rowat et al., 2012; Schrock et al., 2012). Although use of the BUN/Cr ratio for evaluating dehydration has some value, it still has certain drawbacks. Defining the BUN/Cr ratio >15/1 as indicating dehydration is based on a priori knowledge. This prior knowledge is not entirely reliable and could affect results. Hence, it is necessary to determine a cutoff value of the BUN/Cr ratio that would yield the best diagnostic accuracy.
Serum osmolarity is associated with serum electrolyte and glucose levels (Siervo, Bunn, Prado, & Hooper, 2014). Consequently, electrolyte disturbances or diabetes status are likely to influence the accuracy of diagnosing dehydration (Stookey, Pieper, & Cohen, 2005). Therefore, a more precise marker for dehydration is necessary, one that builds on the BUN/Cr ratio while also accounting for other factors, including electrolyte disturbances and glucose metabolism disorders. Furthermore, osmotically active determinants (serum sodium, potassium, urea, and glucose) should be used to derive a valid equation for the calculation of serum osmolarity.
Many equations have been used to calculate osmolarity (Fazekas et al., 2013; Siervo et al., 2014), but it remains unclear which equation performs best. It is also possible that a new formula could be devised that would better predict the functional outcome after an ischemic stroke.
Previously, logistic regression models have typically been used to analyze stroke outcome data. However, machine learning algorithms, which potentially have more powerful high-level prediction performance, have been proposed as an alternative for analyzing large-scale multivariate data (Bastanlar & Ozuysal, 2014). One of the most popular machine learning methods used for recognition or classification is the support vector machine (SVM) (Noble, 2006).
With the SVM, data are divided into a training set and a test set. The training set is used to build a classification model, which is then used to assign each test set instance to one category or another. The SVM algorithm has been widely applied in the biological sciences.
We aimed to design a dehydration equation that is not prone to the differential bias associated with the above-mentioned factors, to improve diagnostic accuracy. The primary objective of our study was to use SVM algorithms to identify an ND prognostic model based on dehydration equations.

| Patients
This study included acute ischemic stroke patients who were ad-  (Weimar et al., 2005). ND was determined and validated by participating neurologists.

| Data collection
We used a standardized data form to collect the participants' data from a database. The following data were collected: age, sex, arterial blood pressure, and laboratory studies (serum sodium, potassium, chloride, glucose, and urea) on admission within the first 7 days.
Information about vascular risk factors, such as hypertension, diabetes, hypercholesterolemia, ischemic heart disease, and smoking, was included in the collected database. We used the BUN/Cr ratio and 38 different equations to calculate serum osmolarity (Fazekas et al., 2013). The equations mostly involved summing multiples of serum sodium, potassium, glucose, and urea. The BUN/Cr ratio and 38 equations analyzed in this study are shown in Table 1.
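As an illustration of how such indicators are computed, the following Python sketch evaluates Equation 20 (1.86 × Na+ + glucose + urea + 9, with all components in mmol/L, as given in this paper) and a BUN/Cr ratio. The function names and example values are hypothetical, and the code is not the software used in the study. The units of the BUN/Cr ratio are not stated here, so the sketch assumes both inputs are in mg/dL.

```python
def osmolarity_eq20(sodium, glucose, urea):
    """Equation 20: 1.86 x Na+ + glucose + urea + 9 (all inputs in mmol/L)."""
    return 1.86 * sodium + glucose + urea + 9.0


def bun_cr_ratio(bun, creatinine):
    """BUN/Cr ratio; assumes both values are in mg/dL (an assumption)."""
    if creatinine <= 0:
        raise ValueError("creatinine must be positive")
    return bun / creatinine


# Hypothetical admission values, not taken from the study data.
example = osmolarity_eq20(sodium=140.0, glucose=5.5, urea=5.0)
print(round(example, 2))
# Compare with the cutoff reported for Equation 20 (284.49 mM).
print(example >= 284.49)
```

The remaining equations in Table 1 could be written the same way, as weighted sums of the measured concentrations.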

| Statistical analysis
Continuous data were summarized as the mean (standard deviation).
Normally distributed data and non-normally distributed data were compared by the two-sample t test and the Mann-Whitney U test, respectively. Categorical data were presented as number (

| Feature selection
Feature selection is a crucial step in predictive modeling (Weimar et al., 2005). Some features available in the study were relevant to ND but may still hinder the predictive model from achieving higher accuracy. Therefore, removing these variables from the model would increase prognostic accuracy. The Boruta algorithm is available as an R package and is useful for feature selection (https://CRAN.R-project.org/package=Boruta). We used the Boruta algorithm to identify the most sensitive features and eliminate redundant features, biases, and unwanted noise.
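The core idea behind Boruta can be sketched in a few lines: each real feature competes against "shadow" features, which are permuted copies that have lost any link to the outcome. The Python sketch below is a deliberate simplification and not the Boruta package itself. It uses the absolute Pearson correlation with the label as a stand-in importance measure (Boruta uses random-forest importance, recomputed in every iteration, plus a statistical test on the hit counts), and all names and data are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def shadow_hits(features, labels, n_iter=20, seed=0):
    """Count how often each feature beats the best permuted ("shadow") feature.

    features: dict mapping feature name -> list of values.
    Importance is |Pearson r| with the label, a simple stand-in for the
    random-forest importance that Boruta uses.
    """
    rng = random.Random(seed)
    importance = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        best_shadow = 0.0
        for vals in features.values():
            shadow = vals[:]      # copy, then permute to break any link to the label
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, abs(pearson(shadow, labels)))
        for name in features:
            if importance[name] > best_shadow:
                hits[name] += 1
    return hits


# Synthetic data: "signal" tracks the label, "noise" does not.
rng = random.Random(1)
labels = [i % 2 for i in range(200)]
features = {
    "signal": [y + rng.gauss(0, 0.1) for y in labels],
    "noise": [rng.gauss(0, 1) for _ in labels],
}
print(shadow_hits(features, labels))
```

A feature that beats the best shadow in nearly every iteration would be kept, while one that rarely does would be dropped.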
TABLE 3 AUC, cutoff value, specificity, and sensitivity of prognostic accuracy of assessment of ND

| SVM model
R (https://www.r-project.org/) is a language and environment for statistical computing and graphics. In this study, SVM classifiers were implemented using the e1071 package (https://CRAN.R-project.org/package=e1071) in R. We applied SVM to the dataset to determine ND prognosis. The SVM model should be able to discriminate between patients with and without ND.
The two datasets included 382 instances with 18 attributes for each instance. Of these, 56 attributes or 18 attributes were real-valued input features. There were 102 patients with ND and 280 patients without ND in the dataset.
First, we randomly divided the dataset into two subsets: about 80% of the instances for training and the remaining 20% for testing. Second, the two SVM parameters (cost and gamma) cannot be set intuitively, yet their values affect the accuracy of an SVM model. We therefore optimized both parameters by cross-validation on the training data using the tune() function. Then, we implemented SVM classifiers with a radial basis function (RBF) kernel, exp(-gamma*|u-v|^2).
The convergence epsilon, which is an optimizer parameter used to specify the stop point for iteration, was set to 0.1. Finally, we built an SVM model and used the test set to obtain predictive values to assess the model accuracy. We also constructed an ROC curve and determined sensitivities and specificities for particular cutoff values.
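Two of the steps described in this section, the random 80/20 split and the RBF kernel, can be sketched in plain Python. This is an illustration only and not the e1071 implementation used in the study. The function names, the seed, and the gamma value in the example are assumptions.

```python
import math
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle row indices and split them into training and test index lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * len(rows))
    return indices[:n_train], indices[n_train:]


def rbf_kernel(u, v, gamma):
    """RBF kernel exp(-gamma * |u - v|^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# 382 placeholder rows, matching the number of instances in the dataset.
train_idx, test_idx = train_test_split(list(range(382)))
print(len(train_idx), len(test_idx))

# Identical vectors give a kernel value of 1; the value decays with distance.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))
print(rbf_kernel([1.0, 2.0], [3.0, 2.0], gamma=0.5))
```

A larger gamma makes the kernel value fall off faster with distance, which is why gamma has to be tuned together with the cost parameter.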

| ROC curve of prediction of ND with various indicators of serum osmolarity
The prognostic accuracy for ND based on the BUN/Cr ratio and 38 equations in all patients is presented in Table 3. In all patients, the BUN/Cr ratio and each of the 38 equations were significant predictors of ND. An ROC curve can be used to illustrate and evaluate the accuracy of the classifier system. Sensitivity represents the probability that a true judgment is made for patients with ND, while specificity represents the probability that a true judgment is made for patients without ND. The AUC is the most commonly used   (Hooper et al., 2015; Siervo et al., 2014) and young adults (Heavens, Kenefick, Caruso, Spitz, & Cheuvront, 2014) and predicted poststroke ND across population groups, with a cutoff value of 297.08 mM (sensitivity = 93.14%, specificity = 60.00%).
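The quantities used in this analysis can be made concrete with a short Python sketch. It computes sensitivity and specificity at a cutoff, the AUC as the probability that a randomly chosen ND patient scores higher than a randomly chosen non-ND patient, and a cutoff chosen by Youden's J statistic. The use of Youden's J is an assumption, since the rule used to select cutoffs is not stated here, and the data are hypothetical.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Classify score >= cutoff as ND (label 1); return (sensitivity, specificity)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """AUC as the fraction of (ND, non-ND) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_cutoff(scores, labels):
    """Cutoff maximizing Youden's J = sensitivity + specificity - 1 (assumed rule)."""
    def youden(cutoff):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        return sens + spec - 1
    return max(sorted(set(scores)), key=youden)


# Hypothetical osmolarity values (mmol/L) and ND labels, not study data.
scores = [278.0, 280.0, 286.0, 288.0, 290.0, 295.0, 300.0]
labels = [0, 0, 1, 0, 1, 1, 1]
cutoff = best_cutoff(scores, labels)
print(auc(scores, labels))
print(cutoff, sensitivity_specificity(scores, labels, cutoff))
```

The sketch assumes that higher values indicate a higher risk of ND, which matches the direction of the association reported for the dehydration indicators.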

| Feature selection
There were 18 input variables, which included personal information (gender, age, and vascular risk factors) and laboratory studies (serum sodium, potassium, chloride, glucose, and urea). The Boruta algorithm was applied to select the most characteristic structural patterns. Five variables were confirmed to be important: creatinine (Cr), serum urea (BUN), chloride (Cl), glucose (Glu), and sodium (Na). Superfluous parameters were removed; the 13 variables confirmed to be unimportant were atrial fibrillation (Af), age, coronary heart disease (CHD), diabetes (DM), hypertension (HPT), and 8 others. These results are shown in Figure 1.

| Optimized SVM parameters
Before processing the SVM model, the penalty parameter "cost" and kernel function parameter "gamma" were optimized using the tune() function. The results listed the sampling method (10-fold cross-validation), the best parameters (gamma = 0.001, cost = 64), the best performance (0.166), and the details of the tested parameter values.
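The search that such a tuning step performs can be sketched as a grid search with k-fold cross-validation. The Python code below is a generic illustration and not the e1071 tune() implementation. The `fold_error` callback is a hypothetical hook where an SVM would be fitted, and the candidate grids and the toy error function are made up so the sketch runs without an SVM library. It treats the reported performance as an error to be minimized, which is an assumption about how the value 0.166 should be read.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search(gammas, costs, folds, fold_error):
    """Return (best_gamma, best_cost, best_error) by mean cross-validation error.

    fold_error(gamma, cost, train_idx, valid_idx) is a caller-supplied
    (hypothetical) function that fits a model on train_idx and returns the
    error rate on valid_idx.
    """
    best = None
    for gamma in gammas:
        for cost in costs:
            errors = []
            for i, valid_idx in enumerate(folds):
                train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
                errors.append(fold_error(gamma, cost, train_idx, valid_idx))
            mean_error = sum(errors) / len(errors)
            if best is None or mean_error < best[2]:
                best = (gamma, cost, mean_error)
    return best


# Stand-in error function so the sketch runs without an SVM library:
# it ignores the data and is smallest at gamma = 0.001, cost = 64.
def toy_error(gamma, cost, train_idx, valid_idx):
    return abs(gamma - 0.001) + abs(cost - 64) / 1000


# 306 rows is a hypothetical training-set size (about 80% of 382).
folds = k_fold_indices(306, k=10)
print(grid_search([0.0001, 0.001, 0.01], [16, 64, 256], folds, toy_error))
```

In practice the callback would train an RBF-kernel SVM on the training folds and return its misclassification rate on the held-out fold.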

| ROC curve for ND prediction with an SVM model
We trained two SVM classifiers, one on the raw data and one on the feature-selected data obtained after applying the Boruta algorithm.
The results are shown in Figure 1. Sensitivity, specificity, and AUC based on ROC analysis were used to assess the performance of the classifiers. Figures 2 and 3 show the ROC curves for the classifiers applied to the different datasets. The AUC values were 0.700 for the raw data and 0.927 for the feature-selected data. The classifier applied to feature-selected data was effective in distinguishing the two categories of patients. Compared with the classifier applied to the raw data, the classifier applied to feature-selected data had markedly higher specificity at a similar sensitivity. The cutoff values with the highest sensitivity and specificity were 0.272 for feature-selected data (sensitivity = 92.0%, specificity = 89.4%) and 0.155 for raw data (sensitivity = 93.8%, specificity = 44.1%).

| DISCUSSION
In this study, we ranked 38 equations with indicators of dehydration according to the AUC values for predicting ND in poststroke patients; Equation 20 (1.86 × Na+ + glucose + urea + 9, with all components measured in mmol/L) had the highest performance for prognosis of ND among all indicators of dehydration. There are similarities across these three studies (Heavens et al., 2014; Hooper et al., 2015; Siervo et al., 2014), with some equations appearing in more than one study. Equation 32 appeared to work well for predicting serum osmolarity in older as well as younger adults and for predicting poststroke ND. Given that this single equation works well across all situations and population groups, this may be the appropriate equation to use for serum osmolarity. However, the SVM model based on feature-selected data yielded even better predictive accuracy than Equations 20 and 32.
Determining the pathophysiological mechanism of ND occurrence was beyond the scope of this study. However, there are some possible explanations. Increased serum osmolarity can be attributed to increased serum glucose levels (hyperglycemia) in patients with acute ischemic stroke (Bhalla et al., 2000). The evolution of hyperosmolality and hypernatremia has been observed in ischemic stroke patients, but the level of potassium appears to remain stable (Natochin Iu et al., 1996). Those findings were in keeping with our results.
In our study, hypertension, diabetes, hypercholesterolemia, ischemic heart disease, and smoking were considered as vascular risk factors in the analysis. The baseline characteristics between the patients with ND and without ND were not significantly different.
In laboratory tests, participants with ND had higher BUN, creatinine, and fasting plasma glucose concentrations. However, these indicators were not abnormal. This means that these indicators could potentially play a role in dehydration. These findings were similar to those of previously published studies (Bhatia, Mohanty, Tripathi, Gupta, & Mittal, 2015). L. C. Lin et al., 2014;Rowat et al., 2012;Schrock et al., 2012). The mean BUN/Cr ratios in patients with ND and without ND in our study were both higher than 15. These differences in cutoff values between our study and previous studies may be due to differences in reagent and measurement methods. Nevertheless, it is clear that an increase in the BUN/Cr ratio is associated with ND.
In addition, before training the SVM model, we used the tune() function to select appropriate penalty parameter (cost) and kernel function parameter (gamma) values. If these values were inappropriate, overfitting or underfitting may have occurred (Reid et al., 2010). Parameter optimization is important to improve the precision of classification. Because the RBF has been proven to make the kernel function more compatible, we used RBF as a kernel function for training (Cawley & Talbot, 2004). RBF also reduced the computational complexity and improved the generalization performance. The Boruta algorithm could capture all features that were either strongly or weakly relevant to the outcome variable.
As compared to the traditional feature selection algorithm, the Boruta algorithm more appropriately identifies variables of importance (Blog, 2016). Consequently, it is well suited to biomedical applications. We adopted the Boruta algorithm to reduce the number of variables from 18 to 5. The specificity of allocating test samples after this process was increased from 44.1% to 89.4%, and the AUC was increased from 0.700 to 0.927. Thus, the accuracy of classification was increased by implementing the Boruta algorithm.
Although the sensitivity of Equation 20 was higher than that of the SVM model, it still had the limitation of a low specificity. Low specificity could lead to many false-positive results and make clinical prediction uncertain. Nevertheless, the AUC of the SVM model was significantly higher than that of Equation 20.
As the AUC should be maximized to obtain the best possible prediction method, the SVM model represented the best predictive accuracy.
Several limitations of this study need to be considered. First, the generalizability of the results may be limited, as it was a single-center study. Second, this study lacked a serum osmolarity measure of dehydration to serve as a gold standard for comparison. Third, our data suggested that an SVM model incorporating dehydration measures would be able to predict ND, but it is necessary to assess its effect on different stroke subtypes. Fourth, patients with reperfusion therapy were not included in this study. We also did not assess outcomes beyond 3 months and did not consider mortality. A follow-up of longer than 3 months would allow evaluation of the long-term predictive accuracy of this approach.
In conclusion, dehydration, which is common in hospitalized stroke patients, was associated with in-hospital ND. High-level machine learning techniques (SVM) were used to establish a prediction model for ND associated with dehydration and achieved good classification results. Feature selection by the Boruta algorithm can eliminate redundant features, biases, and unwanted noise and can increase prognostic accuracy. The tune() function can be used for SVM model parameter optimization to improve the results. This approach could be helpful for detecting patients with dehydration in order to predict and prevent ND. The application of SVM in this study provides a basis for designing more efficient data analyses. In future studies, other dehydration information can be integrated into this process to improve the precision of classification.

CONFLICT OF INTEREST
None declared.