Model selection for partial least squares calibration and implications for analysis of atmospheric organic aerosol samples with mid‐infrared spectroscopy

In developing partial least squares calibration models, selecting the number of latent variables used for their construction to minimize both model bias and model variance remains a challenge. Several metrics exist for incorporating these trade‐offs, but the cost of model parsimony and the potential for underfitting on achievable prediction errors are difficult to anticipate. We propose a metric that penalizes growing model variance against decreasing bias as additional latent variables are added. The magnitude of the penalty is scaled by a user‐defined parameter that is formulated to provide a constraint on the fractional increase in root mean square error of cross‐validation (RMSECV) when selecting a parsimonious model over the conventional minimum RMSECV solution. We evaluate this approach for quantification of four organic functional groups using 238 laboratory standards and 750 complex atmospheric organic aerosol mixtures with mid‐infrared spectroscopy. Parametric variation of this penalty demonstrates that increase in prediction errors due to underfitting is bounded by the magnitude of the penalty for samples similar to laboratory standards used for model training and validation. Imposing an ensemble of penalties corresponding to a 0–30% allowable increase in RMSECV through sum of ranking differences leads to the selection of a model that increases the actual RMSECV up to 20% for laboratory standards but achieves an 85% reduction in the mean error in predicted concentrations for environmental mixtures. Partial least squares models developed with laboratory mixtures can provide useful predictions in complex environmental samples, but may benefit from protection against overfitting. © 2015 The Authors. Journal of Chemometrics published by John Wiley & Sons Ltd.


INTRODUCTION
When the relationship between measured instrumental signals and response variables becomes difficult to decode, statistical methods of multivariate calibration can be used to develop models for quantitative prediction [1]. This challenge is particularly salient for characterization of organic functional group abundances in atmospheric aerosol samples (containing an ensemble of particles within a designated size range) by mid-infrared (MIR) spectroscopy [2]. These complex mixtures consisting of tens to hundreds of thousands of different types of organic molecules originate from the combination of directly emitted compounds and products of atmospheric photooxidation in the gas and condensed phases and pose challenges for characterization by any measurement technique [e.g., Hallquist et al. 3]. For Fourier transform infrared spectroscopy (FTIR) analysis, the challenge is manifested in broad absorption profiles originating from strong overlap of contributions from various functional groups present in the sample.
Despite the complex structure of the spectroscopic signal, MIR spectra of such samples contain a wealth of information and have been used to infer contributions from various source classes and extent of atmospheric processing of organic aerosol composition [4][5][6]. Estimation of functional group abundances in these samples is the first step in reconstructing estimates of the total organic aerosol burden and is relevant for interpreting the contribution from emission sources and atmospheric chemistry to a geographical location or region. In such applications, partial least squares (PLS) regression [1,7] has been used to develop models for quantitative calibration [8][9][10][11]. A necessary assumption is that such models developed from simpler standards prepared in the laboratory, where reference values are known and can be extrapolated to these complex samples that differ significantly in composition (i.e., number and types of molecules) and spectral structure.
In cases where laboratory standards used for calibration and environmental samples differ significantly (illustrated in Figure 1), it is conceivable that the effects of model misspecification can be amplified. Minimizing the prediction error against reference samples in the form of root mean square error (RMSE) is commonly used as an objective for model selection [12]. However, the true predictive performance of a model for new samples can be overestimated in that the RMSE metrics capture variations in bias but do not adequately account for the growth of variance as a function of model complexity, leading to selection of overly fitted PLS models containing more than the necessary number of latent variables (LVs) [13,14].
Researchers have previously proposed attainment of parsimonious models through consideration of the bias-variance trade-off in various forms. Alongside bias measures (e.g., RMSE calculated against various validation samples), model complexity or variance has been characterized in the form of effective rank [15,16], pseudo degrees of freedom [17,18], or some property of the regression vector (coefficients), for example, its two-norm magnitude [14,19] or 'jaggedness' introduced by oscillations [20,21]. These opposing measures have been evaluated along a Pareto curve [22,23] or combined together in a single metric [e.g., 14,18,21,24].
In this manuscript, we address the implications of model misspecification on prediction errors of laboratory standards and complex environmental mixtures, and methods for its prevention. We revisit the development of PLS calibration models for the quantification of functional groups in ambient aerosol samples previously described by [11] and explore the sensitivity of model selection on the formulation of metrics targeting parsimony. In this objective, we propose a modification of a metric described by [21] and [18]. We introduce a parameter to scale an arbitrary indicator of variance and penalize its growth against a measure of decreasing bias. The magnitude of this penalty is defined relative to the minimum RMSE of cross-validation (CV), such that the cost of parsimony can be anticipated against the available estimate of achievable prediction error. We vary the penalty on the variance measure (magnitude computed from vector of regression coefficients is used in this work) to generate a large set of model performance curves and report on the resulting complexity and prediction errors for the selected models. Prediction errors and sensitivity of predicted concentrations to penalization are evaluated for functional groups in laboratory standards and in ambient samples with compositions lying outside of the mixture space of the calibration model. For ambient samples in which we lack true reference values, we additionally compare an aggregate estimate of organic carbon (OC) estimated by the sum of FTIR functional groups with the measurements of OC obtained by a different but widely used analytical technique. A set of models selected from an ensemble of model performance curves generated by our proposed metric and combined by sum of ranking differences (SRD) [25,26] are validated against an independent randomization test [27], and we further evaluate their suitability for application to laboratory and ambient samples.

Multivariate calibration and model selection
2.1.0.1. Partial least squares. We use PLS implemented by the pls library [28] in the R statistical package [29] to estimate regression vectors O b for predicting univariate responses of functional group concentrations O y from a set of spectra arranged in a row-wise matrix X: We use the NIPALS algorithm with 10-fold venetian blinds CV on calibration samples (Section 2.2) sorted according to the known concentration of the response variable. In this way, the distribution of concentrations in validation and test sets are arranged to be similar to each other in each permutation during CV.
Model selection for PLS 2.1.0.2. Prediction error. Root mean square error is a common metric used to evaluate the difference between estimated (O y) and observed values (y) in the response variable: k k is used to denote the two-norm of a vector (also written as k k 2 , but we omit the subscript by convention), and the subscript k indicates the value computed using k LVs. RMSE is often defined more specifically in terms of RMSEC, the RMSE of calibration defined by the overall fit of models to samples in the training set; RMSEV, the RMSE with respect to samples in a validation set; RMSECV, the RMSE of CV evaluated within the calibration set (comprising the training and validation sets); and RMSEP, the RMSE of prediction reserved for new samples used for evaluation (the 'test set'). In the conventional method of model selection, RMSECV is computed for models generated with 1, 2, : : : , Ä LVs, and the optimal number of LVs (k * ) is selected by a criterion of minimum RMSECV to establish the base case. Alternatively, RMSEV can be used in place of RMSECV to determine k * ; we use RMSECV and a value of Ä = 120 in this work. We denote the solution corresponding to the minimum RMSECV as RMSECV k * = min k {RMSECV k }. The mean error is another metric used to assess the prediction error and is defined as k k 1 is used to denote the one-norm of a vector. While we do not use the mean error for model selection, we report its value alongside the RMSEP for model evaluation as it is a common metric used in the atmospheric modeling community [30].
2.1.0.3. Metrics. The metric outlined by Gowen et al. [21] and Kalivas and Palmer [18] weighs a scaled measure of bias (i.e., RMSE) against a scaled 'regression vector measure' (RVM) that characterizes the magnitude of variance according to some property of the regression vector (as we refer to the vector of regression coefficients in this work). This metric is denoted M1 in this manuscript and computed over k = 1, 2, : : : , Ä models: where {x k } denotes the set of values for all models {x k : k = 1, 2, : : : , Ä}. M1 is dependent on max k {RVM k }, which effectively determines the largest number of LVs considered in the set of solutions because RVM generally increases with the number of LVs. Therefore, Ä is considered to be a free parameter on which the metric is defined. Model selection according to M1 is more sensitive to Ä than for the minimum RMSECV criterion (where the determination of k * is insensitive to Ä given that it is sufficiently high) , and yet what value of Ä to be used is not clear a priori. Furthermore, the effect on prediction error due to underfitting by the newly selected model is uncertain.
We therefore propose a reformulation of M1 that parameterizes the RVM penalty relative to the minimum RMSE value: For this metric, the minima and maxima are defined for the set {x k : 1, 2, : : : , Ä = k * }, where k * is determined a priori by RMSECV or RMSEV. Using this metric, we define the number of LVs selected by this metric as k = arg min k {M2 k }. This formulation shares some similarities with the objective function to be minimized by ridge regression or canonical Tikhonov regularization [31,32], which can be written with regularization parameter Á as The essential commonality between M2 and L is the penalization of a fidelity term by an RVM scaled with a weighting coefficient. While b is found directly through min b L in ridge regression (without projection onto LVs) according to the magnitude of Á, we propose for PLS that b should be selected from a set of candidate solutions constructed from the k LVs determined by the magnitude of prescribed in M2.
The appealing property of the scaled formulation of M2 is that fixes the allowed RVM penalty as a fraction or factor of min k {RMSE k } and bound the increase on prediction error due to underfitting when a more parsimonious model is chosen. Furthermore, by defining * = max k {RMSE k }/min k {RMSE k } -1, we can bound the anticipated increase in RMSE k with respect to the estimated magnitude of reduction in prediction errors achievable through incorporation of additional LVs (Section S1). Written in the notation defined earlier, Also, by specification of = * , we obtain the same solution k that would be selected by using M1 when Ä = k * (Section S1), while we obtain the conventional solution k = k * when = 0. The primary difference between M1 and M2 is that the latter defines the RVM penalty by scaling the structure of growth in RVM from 1 to k * , rather than 1 to Ä k * . It is unclear whether additional information contained in the RVM from k * to Ä is relevant for selecting a more parsimonious model containing fewer than k * LVs; it is hoped that the benefits of defining a new metric that can bound the increase in RMSE due to underfitting will outweigh the cost of this omission. Model selection by M2 in this way closely follows another heuristic of accepting an alternate solution corresponding to a fixed increase in RMSE (e.g., 10%, [33]) but additionally considers the dependence of model complexity and increased prediction variance on the number of LVs. Physically plausible values of may possibly be reasoned out based on an estimated uncertainty in RMSECV or RMSEV. In this work, we vary parametrically from 0 to * and examine predictions from the models selected.

Specification of root mean square error and regression vector measure.
For both M1 and M2, any of RMSECV, RMSEV, or RMSEP can be used. Using M1, Gowen et al. [21] found that using RMSECV for a fixed value of Ä = 20 indicated instances of underfitting for their data set; [18] suggests that similar results are obtained with either metric. Using RMSECV can result in a value of k that is systematically less than or equal to the value selected in the scenario where RMSEC is used, but can be compensated by the selection of a larger Ä (for M1) or smaller (for M2). To facilitate comparisons across all metrics, in this work we specify RMSECV as the bias measures of M1 and M2 to be consistent with the determination of k * = arg min k {RMSECV k } as stated earlier.
As introduced in Section 1, the properties of a vector as embodied by RVM can be defined by magnitude (k O bk), jaggedness (k O bk 1 ), or Durbin-Watson statistic [18,20,21], among other characteristics. We choose to use k O bk as this magnitude has been shown to be proportional to the prediction variance [19] and related to the sensitivity and detection limit of the calibration model [14]. In our models, k bk 1 increases monotonically with k O bk ( Figure S4), so the main conclusions for this work are expected to be insensitive to this choice.

Ensemble and randomization approach to model selection.
We further employ two techniques to assess the number of LVs for each of our calibration models. Given the uncertainty in selection of for M2, we adapt an ensemble scoring approach to consider the most suitable model according to a range of values together. SRD described by Héberger and co-workers [25,34,35] is a method recently applied in the context of model or parameter selection for PLS and Tikhonov regularization [26]. In this work, we use the modern MATLAB implementation of SRD provided by Kalivas et al. [26]. In SRD, each metric or 'merit' is rescored according to its respective ranking against the target solution, which we designate as the minimum value of M2 [26]. M2 is calculated for each fold of PLS CV (number of folds is V = 10 in this work), for a common number of LVs determined as the maximum value of k * computed across all folds. This leads to a block or matrix with dimensions of V k * for a single value of , and is varied to generate multiple merit blocks. For validating SRD ranking results, we use V-fold CV with V = 10. We select as our solution the minimum value of the mean normalized SRD and designate the number of LVs as k to differentiate from k used to denote the solution obtained with a single realization of used with M2. A paired Wilcoxon signed rank test between the k = k solution and all others can be performed to find an alternate number of LVs for which the normalized SRD scores are not statistically significantly different [26]. However, we find that this approach can undermine the constraint on the growth of RMSECV imposed by , so it is not used in this work. A randomization test for PLS has been described by Wiklund et al. [27] and used in similar contexts of model selection [e.g., Gowen et al. 21]. We adapt the 'randtest' function provided by the mdatools package in R [36] for use with the centered NIPALS algorithm provided by the pls library [28] and calculate our test statistic from P = 1000 permutations for each component. In this test, the order of samples in the response or residual vector y is permuted P number of times, and the conditional test statistic (covariance between PLS scores and y vector obtained after extraction fitted components) is compared with its corresponding value obtained for the original response or residual vector without permutations [27]. The exceedances of the reference value over the P permutations are referred to as the 'overfitting risk' in this work and are expressed as a percentage. The value of this percentage is compared with a significance level by close analogy to p-values used in standard statistical tests [36]. The overfitting risk is estimated by the empirical cumulative probability distribution in mdatools, and the cumulative probability distribution is additionally calculated using the inverse Gaussian function fitted to the test statistic using the statmod [37] and fitdistrplus [38] libraries in R. As the overfitting risk estimated by the two methods were practically identical for our interpretation, we only present one value of the risk that corresponds to the empirical cumulative probability distribution estimate. Wiklund et al. [27] report that randomization tests on spectra without pre-treatment can result in extraction of model components that are not in order of decreasing relevance, effectively leading to observations of erratic estimates of the overfitting risk. As discussed in Section 2.2, the signal contribution from the substrate in comparison with analyte is substantial in our samples and is not removed with pre-treatment for this analysis. Therefore, we interpret the results from this randomization test only with a qualitative appreciation; to guard against erratic significance values, we determine the maximum number of LVs by finding a consecutive sequence of length two or greater for which the significance of the test statistic is less than or equal to 5% and 10% significance levels, and take the largest number of components from this sequence.

Data set
2.2.0.6. Mid-infrared spectra. The set of infrared spectra used in this work (examples shown in Figure 1) is particulate matter samples collected and analyzed on polytetrafluoroethylene filter substrates as previously described by Ruthenburg et al. [11] and Dillner and Takahama [39]. The set consists of 238 laboratory standards (single component to ternary mixture samples of sugars, dicarboxylic acids, an ester, and ketone compounds) and 744 ambient samples (collected from seven US Interagency Monitoring of PROtected Visual Environments (IMPROVE) monitoring network sites in 2011). Out of the 794 ambient samples originally available, 50 are excluded from this evaluation as Ruthenburg et al. [11] identified them as spectrally anomalous. MIR spectra consisting of 2784 wavenumbers spanning the range between 4000 and 420 cm -1 without background correction is used for this work [39]. Four functional groups are considered for quantification: alcohol COH (aCOH), carboxylic COH (cCOH), alkane CH (aCH), and carbonyl C=O (CO).
Ruthenburg et al. [11] used 2/3 of the laboratory samples (n = 158) for model development and the remaining 1/3 (n = 80) for model selection according to the minimum RMSEV criterion. In this work, we use the same 2/3 of laboratory standards for calibration with CV and leave the remaining 1/3 for evaluation of our capability to predict concentrations in similar laboratory samples (to contrast with predictions for ambient samples). The similarity in relative functional group abundances and the concentrations between calibration and test set samples are shown in Figure  S5. The maximum number of LVs considered by Ruthenburg et al. [11] was less than that used for this work (Ä = 30 instead of Ä = 120) and resulted in the selection of different models (Table S1).

Thermal optical reflectance organic carbon.
Organic carbon concentrations measured by thermal optical reflectance (TOR) analysis [40] are taken from the IMPROVE network database (http://views.cira.colostate.edu/fed/). These samples are collected on quartz fiber filters collocated with the polytetrafluoroethylene filter samples used for FTIR analysis. This technique analyzes the total carbon vaporized when filters are subjected to a temperature ramp under an inert and then oxidizing environment. The OC fraction of the total carbon (the balance being elemental carbon) is operationally defined according to monitored optical properties of the filter during the vaporization process [40].

Calculation of organic carbon from functional groups.
We can estimate the OC content from infrared analysis by taking the inner product of the molar functional group concentrations n = [n aCOH , n cCOH , n aCH , n CO ] with the average number of moles of carbon associated with each bond c = [n C /n aCOH , n C /n cCOH , n C /n aCH , n C /n CO ], such that the mass of carbon is estimated as c T n 12.01 g/mol. The values of c are not known precisely for ambient samples but are estimated based on typical molecular structures assumed to be present in the mixture and can be fractional quantities to prevent potential double counting of atoms [2,9,11,41,42]. For this work, we use values of c = [0, 0, .5, 1] as assumed by Ruthenburg et al. [11].

RESULTS AND DISCUSSION
First, we describe the similarity of U-shaped curves generated by M1 and M2 by varying their respective tuning parameters ( Figure 2). Curves of RMSECV and k O bk used for calculation of M1 and M2 are also shown. k O bk is observed to increase monotonically for our models, while strict monotonicity is not observed for RMSECV. U-shaped curves are visible for M1 over the domain of k = {1, 2, : : : , Ä = 2k * }, which we have specified for illustration. Unambiguous U-shaped curves within the domain of k = {1, 2, : : : , k * } are generally observed to emerge for M2 when the value of is approximately greater than one ( = 0.2 and = 5.0 are illustrated). One consequence of this pattern is that for low values of , 1+ is a close approximation of RMSECV k /RMSECV k * . As is increased, the actual increase in RMSECV k /RMSECV k * will become increasingly small compared with that of 1 + . For this data set, the solution obtained with M1 for Ä Ä 2k * corresponds to a penalty for M2 of > 0.2. Further comparisons of the two metrics are shown in Section S2, in which we also summarize that the selected model is sensitive to the specification of Ä in M1, but how the variation in Ä influences the increase in RMSECV of the selected solution is difficult to anticipate. Only solutions obtained using M2 and will be discussed in the following sections.

Evaluation of latent variable reduction on laboratory standards
The variation in k and the corresponding fit metrics as a function of as formulated by M2 is shown in Figure 3. We can see the clear trend of monotonic decrease in the number of LVs selected with increasing . In all cases, RMSECV k Ä (1 + )RMSECV k * , which is a property of the metric (Section 2.1). We do note that 1 + becomes an increasingly conservative upper bound of the actual increase in RMSECV over RMSECV k * when is large.
We can see that a structure and magnitude similar to the variation in RMSECV with respect to are preserved in the   corresponding RMSEP and mean error estimated for test set laboratory standards for most functional groups (Figure 3). The RMSEP of the test set is approximately an order of magnitude higher than RMSECV (a factor of 9 on average), and the RMSEP ranges between 0% and 305%. The difference in the mean error, however, remains small (28-34%) between the two sets of laboratory standards (Figures 3 and S2).
Alkane CH does not follow the pattern of variation in fitting statistics observed for the rest of the functional groups: The change in RMSEP and mean error is not anticipated by the variation in RMSECV. An RVM penalty of = 0.1 results in a reduction in the selected number of factors from 54 to 18, an increase in RMSECV of 8%, and a decrease in RMSEP of 40%. Over the range of s explored, RMSECV changed the least for aCH compared with all other variables. While RMSECV increases by 20% for = 23, RMSEP does not increase above the value of RMSEP k * . The same conclusion is reached when random 10-fold CV is used (not shown), which possibly implicates the influence of structural differences between calibration and evaluation samples with respect to the absorption bands of aCH. Structural differences in absorption bands may be further reduced by rigorous statistical design of the calibration set when mixture composition of new samples can be anticipated, but model sensitivity to such differences can be problematic when predicting concentrations in more complex environmental samples (Section 3.3). The absorption bands of aCH (approximately 3000-2800 cm -1 [43]; Figure 1) in atmospheric aerosols are particularly feature rich, with absorbance patterns differing by hydrocarbon source type, for example, vegetative detritus [44] or various forms of fossil fuel combustion [6].

Model selection
The appropriate value of for this real data set is unknown, so we apply an ensemble modeling approach implemented with SRD. Normalized  LVs, latent variables; aCOH, alcohol COH; cCOH, carboxylic COH; aCH, alkane CH; CO, carbonyl C=O; RMSECV, root mean square error of cross-validation; SRD, sum of ranking differences.
CV, the relative magnitudes are rescored as rankings with respect to the RMSECV prior to aggregation. As a result, we note that the number of LVs selected does not monotonically decrease with the maximum of each ensemble and the bound of 1 + in the growth of RMSECV is not strictly obeyed when ensemble scoring is used (illustrated by the fact that the increase in RMSECV is 1.23 for aCOH when the maximum 1 + is 1.2). However, the ensemble scoring approach indicates that the k s selected are same or fewer than k * s across functional groups (especially for aCH), and reasonable agreement can be found among the k s for different ensembles of .
We independently confirm by a randomization test at the 5% significance level that the number of relevant components is also less than or equal to the those calculated by the minimum RMSE criterion and is also similar to values calculated by SRD-especially for aCH (Table I). Figure 4 illustrates how the overfitting risk changes according to the number of LVs, and solutions selected at the 5% significance levels are also indicated. As shown in the same figure, the first component for aCOH exceeds 5%, but we attribute this anomaly to the presence of the substrate interference in the infrared spectra, and it is not considered for estimation of LVs. This is anticipated from the fact that pretreatment was not applied to remove the substrate interferences to the signal as described by Wiklund et al. [27] (Section 2.1). The solution for the 10% significance level does show a noticeable departure in the selected number of LVs for aCH compared with the other estimates, however indicating sensitivity to the choice of significance level combined with erratic variations in the overfitting risk.
For further analysis, we choose as our reference solution the number of LVs (k = k ) determined by M2 with SRD for = {0.0, 0.1, 0.2, 0.3}. Figure 5 and increases in RMSECVs found in Table I show that the predicted concentrations in laboratory standards are insensitive to the selection of these solutions, but we discuss their appropriateness for application to samples outside of the domain of the calibration set in Section 3.3.

Implications for extrapolation
To evaluate implications for overfitting in environmental samples in which reference functional group concentrations are not available, we examine the changes in predicted concentrations of each variable for various values of and compare the aggregated estimates of FTIR OC with the measured TOR OC ( Figure 6). We first note the similarity (r > 0.8) in predicted concentrations for each functional group between the k * and k solutions for modest values of < 0.2, except for aCH. At = 0.1, the number of LVs   selected for aCH is reduced from 54 to 18, and this coincides with large differences in predicted concentrations ( Figure 6, top two panels). For the selected solutions corresponding to k = k , we also observe major differences (r = -0.62) with respect to the k = k * solution for aCH (Figure 7) when the number of LVs is reduced from 54 to 10. This change is highlighted in contrast to the insensitivity of predicted abundances for laboratory standards to the selected number of LVs (Section 3.2 and Figure 5).
Aggregating the OC content from a combination of the estimated functional groups and comparing with TOR OC, we find that a very large increase in the Pearson's correlation coefficient (from r = -0.42 to 0.86) and a reduction in the mean error (from 70 to 12 g of OC mass on the filter) are observed for = 0.1, with continued agreement between the two estimates of OC with increasing ( Figure 6, bottom two panels). For our selected value of k s (Section 3.2), the error is further reduced to 10 g (Figure 8). The agreement of FTIR OC with TOR OC is largely dictated by the variation in aCH, as aCH is estimated to compose 60-78% of OC mass in these samples. While separate sample collection and analytical artifacts exist for OC quantification by FTIR and TOR [45], we expect a general agreement between the two measurements and conclude that the k solution is more appropriate than the k * solution.
A small difference in the correlation (from r = 0.86 to 0.93) between the two estimates of OC is observed between = 0.1 and greater values of (Figures 6) primarily because of the change in the number of factors selected for aCH, particularly for a certain class of spectra. While beyond the scope of this manuscript, focused study of such differences in sensitivity across ambient samples may further provide characterization of mixture composition for specific samples.

CONCLUSIONS
We propose a reformulation of a metric weighing bias and variance measures for model selection. The defining parameter, which frames the trade-off between model parsimony and the lowest prediction error, is changed from the total number of LVs considered to a penalization parameter interpreted as the permissible increase relative to the minimum RMSECV, for which we have better intuitive sense. We explore the impact of the parameter on model selection and estimation of organic functional group concentrations from infrared spectra. We build a calibration model from 158 laboratory samples and evaluate predictions for 80 laboratory samples similar to those in the calibration set and for 750 complex environmental mixtures in which true references are not available for calibration.
We find, expectedly, that the number of LVs selected for PLS generally decreases according to increasing penalization for larger RVM (used as an indication of model variance). For a number of models, we can predict concentrations in laboratory standards with modest variations in RMSECV (Ä20%), but extension of these models to more complex mixtures leads to larger differences (greater than 100% in predicted concentrations). In comparison with an independent estimate of OC, we find that the model with a higher number of LVs as selected by the minimum RMSECV criterion is unable to estimate the mass of organic material in the samples.
As the appropriate choice of penalization parameter in our metric is not known, we use an ensemble scoring approach (SRD) to aggregate solutions for various penalty values. When using SRD, the actual increase in RMSECV for the selected solution can increase modestly above the maximum penalty specified but provides a consistent selection of LVs that is robust with respect to a range of penalty values considered and consistent with an independent randomization test. In our work, we demonstrate that the model selected by an ensemble of penalties corresponding to maximum allowable increase in RMSECV of 0%, 10%, 20%, and 30% yields an actual increase in RMSECV of 0-20% across functional groups. The same model drastically improves predictions in environmental samples, however reducing the mean error with respect to an independent metric of OC mass from 70 to 10 g (an 85% reduction) and increasing the correlation between predictions and observations from r = -0.42 to r = 0.93.
As previously reported, PLS models selected on the basis of bias measure may be susceptible to overfitting, which becomes apparent when applying calibration models to more complex mixtures. In such an application, the conventional minimum RMSE metric may yield models that lead to gross errors. A reformulated metric combined with an ensemble scoring approach can provide some additional guidance for selecting a model that considers the cost of parsimony on increased prediction error, while guarding against larger errors incurred by overfitting.