Modelling the distribution of rare and invasive species often occurs in situations where reliable absences for evaluating model performance are unavailable. However, predictions at randomly located sites, or ‘background’ sites, can stand in for true absences. The maximum value of the area under the receiver operator characteristic curve, AUC, calculated with background sites is believed to be 1 − a/2, where a is the typically unknown prevalence of the species on the landscape. Using a simple example of a species' range, I show how AUC can achieve values > 1 − a/2 when test presences do not represent each inhabited region of a species' range in proportion to its area. Values of AUC that surpass 1 − a/2 are associated with higher model predictions in areas overrepresented in the test data set, even if they are less environmentally suitable than other regions the species occupies. Pursuit of high AUC values can encourage inclusion of spurious predictors in the final model if they help to differentiate areas with disproportionate representation in the test data. Choices made during modelling to increase AUC calculated with background sites on the assumption that higher scores connote more accurate models can decrease actual accuracy when test presences disproportionately represent inhabited areas.
Species distribution models (SDMs) are commonly used in conservation to identify core areas of species' ranges for protection (Carroll et al., 2010), predict species' responses to global change (Matthews et al., 2011), forecast the spread of non-native species (Beaumont et al., 2009), locate prospective sites for reintroduction (Bartel & Sexton, 2009) and find previously unknown populations of rare species (Le Lay et al., 2010). In each of these cases, variation in model predictions across sites determines the allocation of effort devoted to different areas. Thus, incorrect ranking of sites by their suitability to a species can lead to misappropriation of limited conservation resources.
High-confidence absence data are often unavailable for assessing model performance. Although not as informative as true absences, randomly located sites (or ‘background’ sites) can be used in their stead, with the caveat that the performance metrics will then reflect the ability of a model to differentiate between presences and average conditions across the study region, not between presence and absence (Phillips et al., 2006). An increasing number of studies use background sites to assess the performance of SDMs (Beaumont et al., 2009; Marini et al., 2010; Richmond et al., 2010; Gogol-Prokurat, 2011; Tuanmu et al., 2011), and this is the default option in the popular Maxent software package (Phillips et al., 2006). Background sites are inappropriate for evaluating models of species' potential distributions because projections into suitable but unoccupied areas are penalized (Jiménez-Valverde, 2012). This study extends this observation by showing that background sites are also not suitable for evaluating models of realized distributions when test presences are biased, meaning they misrepresent the distribution of the species.
As test sites are commonly a subset of all possible sites, measures of model performance calculated using test sites only indicate apparent model accuracy. In contrast, actual accuracy is a model's performance against the true distribution of a species. As this is typically unknown, a modeller can only assume apparent accuracy correlates positively with actual accuracy. Modelling generally requires numerous choices regarding inclusion or exclusion of predictors, study region extent, model parameterization and so on. Decisions that increase apparent accuracy are typically preferred as these are assumed to increase actual accuracy. However, if test sites misrepresent habitat occupied by a species, then a modeller may retain decisions that result in a high test metric but have low performance in reality, leading to a trade-off between apparent and actual model accuracy.
Here, I show that a particular type of bias in test presences, the disproportionate representation of suitable areas, combined with the use of background sites in place of absences, can lead to incorrect ranking of site suitability by a model. This tends to occur when AUC is relatively high, causing a trade-off between apparent and actual model accuracy. Disproportionate representation of suitable sites is likely one of the most prevalent forms of bias in test data as only well-designed surveys can obviate such bias.
Throughout, I mostly ignore mention of training presences, which can be identical to test presences, drawn from the same region and time but mutually exclusive, or from a different region or era. The conclusions of this study are independent of their relationship because AUC, by definition, is only calculated on test presences.
Limits of AUC
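AUC measures a model's discriminatory ability, or propensity to assign higher model predictions to presence sites over absence (or background) sites regardless of the absolute difference between them. Provided a set of np test presences with values indexed by i and na test absences with values indexed by j, AUC is given by
$$\mathrm{AUC} = \frac{1}{n_p n_a} \sum_{i=1}^{n_p} \sum_{j=1}^{n_a} I(x_i, y_j) \tag{1a}$$
where $x_i$ is the model prediction at test presence i, $y_j$ is the model prediction at test absence j, and
$$I(x_i, y_j) = \begin{cases} 1 & \text{if } x_i > y_j \\ 1/2 & \text{if } x_i = y_j \\ 0 & \text{if } x_i < y_j \end{cases} \tag{1b}$$
(Mason & Graham, 2002).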
AUC calculated with true presences and true absences, AUCpa, ranges from 0 to 1, with values above, equal to and below 0.5 indicating that a model discriminates between presences and absences better than a random guess, no better than random and worse than random, respectively. AUCpa also equals the probability that a randomly chosen presence will have a higher model prediction value than a randomly chosen absence (Mason & Graham, 2002).
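The pairwise definition of AUC can be illustrated with a minimal Python sketch. It assumes the usual convention of a weight of 1 when a presence outscores an absence, ½ for a tie and 0 otherwise; the function name `auc` and the prediction values are hypothetical and chosen only for illustration.

```python
from itertools import product


def auc(presence_scores, absence_scores):
    """Mean pairwise weight over all (presence, absence) pairs."""
    total = 0.0
    for p, q in product(presence_scores, absence_scores):
        if p > q:
            total += 1.0   # presence ranked above absence
        elif p == q:
            total += 0.5   # tie
        # p < q contributes 0
    return total / (len(presence_scores) * len(absence_scores))


# Hypothetical predictions, for illustration only.
perfect = auc([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])  # every presence outranks every absence
mixed = auc([0.9, 0.4, 0.7], [0.5, 0.2, 0.1])    # one presence falls below one absence
print(perfect, mixed)
```

In the second call, eight of the nine presence/absence pairs are ordered correctly, so the score is 8/9 rather than 1.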
AUC calculated with background sites, AUCbg, is bounded at values < 1. Phillips et al. (2006) state that this upper limit is 1 − a/2, where a is the proportion of the study region occupied by the species, a fraction typically unknown. The unspecified limits placed on AUCbg disallow comparison across species with different prevalence (Jiménez-Valverde, 2012). Nevertheless, AUCbg is assumed to correlate positively with AUCpa (Phillips et al., 2006), and among a set of models for the same species, the ones with higher AUCbg would seem to indicate greater actual accuracy.
AUCpa achieves its maximum value when the model has maximum discriminatory ability, meaning all test presences have higher model predictions than all test absences. Consider the output from a model with maximum discriminatory power shown in Fig. 1(a), where the model predictions for each site across a landscape are plotted by their rank. As all presence sites are ranked higher than all absence sites, AUCpa equals 1.
In contrast, the maximum value of AUCbg depends on how representative the test presence sites are of all presence sites. Consider again the case where the model has maximum true discriminatory ability (AUCpa = 1) but true absence data are unavailable so the modeller uses background sites to calculate AUCbg (Fig. 1a). Also assume that test presence sites are representatively (randomly) sampled from among all presence sites. A proportion equal to 1 − a of background sites will have model predictions less than any test presence. These cases fall under the first line of equation (1b), so receive a weight of 1. The remaining background sites, a proportion equal to a, will fall on presence sites. On average, half of these (a proportion a/2 of all background sites) will fall below test presence sites (but on a presence), and the other half (a/2) will fall above test presence sites, so they receive weights of 1 and 0, respectively. As a result,
$$\mathrm{AUC_{bg}} = (1 - a)(1) + \frac{a}{2}(1) + \frac{a}{2}(0) = 1 - \frac{a}{2} \tag{2}$$
which is the expression given by Phillips et al. (2006) for maximum AUCbg.
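The 1 − a/2 value can be checked numerically with a small sketch. It assumes a hypothetical landscape of 200 sites in which a perfectly discriminating model gives the 40 occupied sites the highest predictions, background sites cover every site in the landscape, and the test presences are perfectly representative (here, all occupied sites). All names and numbers are illustrative.

```python
def auc(presence_scores, background_scores):
    """Pairwise AUC with weights 1, 1/2 and 0 for higher, tied and lower presences."""
    total = 0.0
    for p in presence_scores:
        for q in background_scores:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(presence_scores) * len(background_scores))


n_sites = 200       # hypothetical landscape size
n_occupied = 40     # hypothetical number of occupied sites
a = n_occupied / n_sites  # prevalence, 0.2

# The site index doubles as the model prediction (a rank).
predictions = list(range(n_sites))
occupied = predictions[-n_occupied:]  # perfect model: occupied sites hold the top ranks

background = predictions   # background sites fall on occupied and unoccupied sites alike
test_presences = occupied  # perfectly representative test set

auc_bg = auc(test_presences, background)
print(auc_bg, 1 - a / 2)
```

Under these assumptions the computed AUCbg and 1 − a/2 coincide, even though the same model would score AUCpa = 1 against true absences.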
It is possible for AUCbg to achieve values > 1 − a/2, however. Consider a second case where test presence sites are biased so that they have maximum model predictions (Fig. 1b). AUCbg achieves its actual maximum value in this extreme case. Note that presences not located at test sites may not necessarily have greater model predictions than absence sites, meaning that apparent model accuracy could be greater than actual model accuracy. This is a risk of tuning models to achieve high performance scores calculated for a particular set of test sites.
The upper limits of AUCbg are calculable from the number of background sites expected to fall on and off a test presence site (Wiley et al., 2003). Consider again the situation in Fig. 1(b) for a landscape with n0 total sites. When test presences have maximum predicted values, a proportion of background sites equal to 1 − np/n0 will fall in areas with model predictions lower than the test sites, so will receive a weight of 1 in equation (1b). Likewise, np/n0 will fall on a test presence, half of which will fall above and half of which will fall below any given test presence, so receive weights of 0 and 1, respectively. Hence, maximum AUCbg is
$$\max(\mathrm{AUC_{bg}}) = \left(1 - \frac{n_p}{n_0}\right) + \frac{n_p}{2 n_0} = 1 - \frac{n_p}{2 n_0} \tag{3}$$
This value will always be > 1 − a/2 except when the species only occupies the test presence sites (a = np/n0), a very unlikely situation. AUCbg reaches its minimum achievable value when all test presence sites receive the lowest possible model predictions. For reasons analogous to the rationale for the upper limit of AUCbg, the lower limit is
$$\min(\mathrm{AUC_{bg}}) = \frac{n_p}{2 n_0} \tag{4}$$
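The two limits can be reproduced with a short sketch. Following the weighting argument above, the expected values are 1 − np/(2 n0) when the test presences hold the highest predictions and np/(2 n0) when they hold the lowest. The landscape size, the number of test presences and the assumption that background sites cover every site are hypothetical choices made for illustration.

```python
def auc(presence_scores, background_scores):
    """Pairwise AUC with weights 1, 1/2 and 0 for higher, tied and lower presences."""
    total = 0.0
    for p in presence_scores:
        for q in background_scores:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(presence_scores) * len(background_scores))


n0 = 200    # hypothetical number of sites in the landscape
n_p = 10    # hypothetical number of test presences

predictions = list(range(n0))  # the site index doubles as its prediction rank
background = predictions       # background sites cover the whole landscape

top_presences = predictions[-n_p:]    # test presences with the highest predictions
bottom_presences = predictions[:n_p]  # test presences with the lowest predictions

auc_max = auc(top_presences, background)
auc_min = auc(bottom_presences, background)
print(auc_max, 1 - n_p / (2 * n0))
print(auc_min, n_p / (2 * n0))
```

Neither extreme says anything about occupied sites outside the test set, which is why a high AUCbg by itself does not guarantee high actual accuracy.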
Bias in Model Predictions
Having established the upper and lower limits on the range of AUCbg, I now use a simple example to show that disproportionate representation of the area of occupied habitat by test presence sites can produce values of AUCbg > 1 − a/2 and in the process encourage biased predictions. As a is typically unknown, it is not possible to ascertain when a model surpasses this critical value.
Consider a simple case in which the range of a species can be divided into two regions, a1 and a2 which sum to a, the proportion of the landscape occupied by the species (Fig. 2). Assume that a SDM correctly assigns absences (the areas outside a) lower model predictions than areas inside the species' range, meaning that AUCpa = 1. Assume also that an environmental factor (e.g. summer rainfall) differentiates area a1 from a2 but is unimportant to the actual suitability of the habitat, meaning that a model including this factor could assign different model predictions to a1 and a2. Although this factor could be used as a predictor, using AUCpa as the test metric provides no incentive to do so as increased differences in model predictions across a1 and a2 will not increase AUCpa any further.
In contrast, if AUCbg is used as the metric of model performance, then there is an incentive to include the spurious factor and predict higher suitability in areas overrepresented in the test presence set even if these areas are of lower environmental quality. Consider the case in which the spurious factor varying across a1 and a2 is used by the model to assign region a1 lower model predictions than a2. Regions a1 and a2 contain n1 and n2 test presences, respectively, which together sum to n. Equation 1 for this situation can be written by replacing the summation operators with the proportions of test presences in a1 and a2 times the proportion of background sites that fall outside the species' range (weight = 1 from equation (1b)), inside a1 (weight = ½ or 1, depending on whether test presences are in a1 or a2, respectively) or inside a2 (weight = 0 or ½, depending on whether test presences are in a1 or a2):
$$\mathrm{AUC_{bg}} = \frac{n_1}{n}\left[(1 - a) + \frac{a_1}{2}\right] + \frac{n_2}{n}\left[(1 - a) + a_1 + \frac{a_2}{2}\right] \tag{5}$$
which exceeds 1 − a/2 when
$$\frac{n_1}{a_1} < \frac{n_2}{a_2} \tag{6}$$
assuming predictions in a1 are less than in a2. In other words, if a1 is underrepresented in the test presence set relative to its area in comparison with a2, AUCbg will be greater when a2 receives higher model predictions regardless of its environmental suitability (compare Fig. 2a vs. b). This occurs because comparisons between test sites in an overrepresented area (a2 in this case) with background sites in the opposing area (a1) count more towards AUCbg than comparisons between test sites in an underrepresented area and background sites in the overrepresented area.
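The weighting scheme described above can be written as a small function to see when AUCbg passes 1 − a/2. This is a sketch under the stated assumptions (AUCpa = 1, predictions in a1 lower than in a2, uniform predictions within each region); the function name and the example areas and sample sizes are hypothetical.

```python
def auc_bg_two_regions(a1, a2, n1, n2):
    """Expected AUCbg when region a1 receives lower predictions than region a2."""
    a = a1 + a2   # proportion of the landscape occupied
    n = n1 + n2   # total number of test presences
    # Test presences in a1: weight 1 outside the range, 1/2 inside a1, 0 inside a2.
    from_a1 = (n1 / n) * ((1 - a) * 1.0 + a1 * 0.5 + a2 * 0.0)
    # Test presences in a2: weight 1 outside the range, 1 inside a1, 1/2 inside a2.
    from_a2 = (n2 / n) * ((1 - a) * 1.0 + a1 * 1.0 + a2 * 0.5)
    return from_a1 + from_a2


a1, a2 = 0.3, 0.1             # hypothetical areas of the two regions
ceiling = 1 - (a1 + a2) / 2   # assumed maximum for unbiased test presences

proportional = auc_bg_two_regions(a1, a2, n1=30, n2=10)  # sampling proportional to area
biased = auc_bg_two_regions(a1, a2, n1=5, n2=5)          # a2 overrepresented
print(ceiling, proportional, biased)
```

With test presences proportional to area the result sits at 1 − a/2; when a2 is overrepresented and also receives the higher predictions, the result rises above it.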
Writing the equivalent of equation (5) for the case where predictions in a1 are > a2, setting this equal to equation (5) and simplifying yields the condition in which a modeller will favour higher predictions for a2 than a1:
$$\frac{n_1}{a_1} < \frac{n_2}{a_2} \tag{7a}$$
and higher predictions for a1 than a2:
$$\frac{n_1}{a_1} > \frac{n_2}{a_2} \tag{7b}$$
Note that these inequalities have the same form as equation (6) (although the inequality sign in equation (6) should be reversed when comparing to equation (7b)). This means that when the overrepresented portions of a species' range are assigned greater model scores, AUCbg increases above the maximum value in the unbiased case, 1 − a/2. Equal predictions in a1 and a2 will only be favoured when
$$\frac{n_1}{a_1} = \frac{n_2}{a_2} \tag{7c}$$
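The incentive can be made concrete by scoring three candidate orderings of the same two regions against one biased test set. The sketch below assumes a hypothetical landscape of 100 sites (60 unoccupied, 30 in a1, 10 in a2), background sites covering every site, uniform predictions within each region, and five test presences in each region, so that a2 holds more test presences per unit area than a1.

```python
def auc(presence_scores, background_scores):
    """Pairwise AUC with weights 1, 1/2 and 0 for higher, tied and lower presences."""
    total = 0.0
    for p in presence_scores:
        for q in background_scores:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(presence_scores) * len(background_scores))


def landscape_auc(score_a1, score_a2, sites_absent=60, sites_a1=30, sites_a2=10,
                  n1=5, n2=5):
    """AUCbg for a hypothetical landscape; unoccupied sites always score 0."""
    background = [0] * sites_absent + [score_a1] * sites_a1 + [score_a2] * sites_a2
    test_presences = [score_a1] * n1 + [score_a2] * n2
    return auc(test_presences, background)


a2_higher = landscape_auc(score_a1=1, score_a2=2)  # overrepresented region ranked higher
a1_higher = landscape_auc(score_a1=2, score_a2=1)  # underrepresented region ranked higher
equal = landscape_auc(score_a1=1, score_a2=1)      # no distinction within the range
print(a2_higher, a1_higher, equal)
```

All three models separate occupied from unoccupied sites perfectly, yet the one ranking the overrepresented region higher scores best, and only the model with equal predictions lands on 1 − a/2 (0.8 here).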
In other words, when using AUCbg as a measure of model performance, there is an incentive to reduce predictions of habitat suitability in undersampled areas regardless of their actual suitability, and this occurs at AUCbg values > 1 − a/2. This leads to a perilous trade-off between apparent accuracy (AUCbg) and accuracy of predictions of relative suitability. As a is typically unknown, it is generally not possible to determine whether a modelling decision increases AUCbg above 1 − a/2 and imparts a biased model or whether it reflects an actual increase in model performance.
This scenario generalizes to any number of divisions of the range (a1, a2, a3, …, aN), which can be differentiated by a model on the basis of environmental differences among them. If each subdivision of the range is represented in the test presence set proportionate to its area, then equal predictions across sites will most increase AUCbg (equation (7c)), regardless of actual differences in suitability between them. However, if the areas represented per test presence are unequal for at least some subdivisions (ai/ni ≠ aj/nj for some i and j), then pursuing high AUCbg will lead to an ordering of sites with overrepresented areas receiving higher rankings irrespective of their actual suitability.
For simplicity, the examples discussed above have assumed that the model in question has maximum true discriminatory ability (predictions in a1 and a2 are always > predictions outside a). However, these results are not strictly dependent on predictions within the range being truly separable from predictions outside of it (AUCpa need not always equal 1). The effect of increasing predictions in a1 over a2 or vice versa on AUCbg will depend on the particular distribution of predictions in these areas and outside the range, but incorrect rankings of sites can still occur even if AUCbg is < 1 − a/2.
These observations compound the general issues with AUC, regardless of whether absences or background sites are used in the calculations (Lobo et al., 2007; Jiménez-Valverde, 2012). AUCbg penalizes predictions of high favourability in unoccupied areas, thereby favouring models that predict the actual, vs. potential, distribution of a species (Jiménez-Valverde, 2012). Here, I have shown that higher values of AUCbg are associated with biased predictions within the realized range of the species, not just outside it.
Unless care is taken in sampling, the density of test presences is unlikely to be truly proportionate to the representation of suitable habitat across the landscape. In many cases, species' records represent sample effort targeting sites where a collector believes a species has a high probability of being present (Schulman et al., 2007; Loiselle et al., 2008; Sheth et al., 2012). Likewise, if detectability varies across the species' range, then areas with lower probabilities of detection will likely be underrepresented even if the species occurs there (MacKenzie et al., 2006). The problems with AUCbg discussed here are compounded if training and test presences are drawn from the same data set. When this occurs, an overly represented area in the test set is likely to be overly represented in the training set, meaning the model could interpret overrepresentation in the training set as indicative of greater suitability. However, bias in training data can be corrected using ‘targeted’ background sites (Phillips et al., 2009) and weighting of training presences or absences (Elith et al., 2010). Unfortunately, no techniques exist to correct for bias in test sites. Presumably, test presences could be weighted in proportion to their representation of suitable habitat, but this would require a priori knowledge about the habitat's suitability, obviating the need for a model. Careful design of sampling effort, at least for test sites, should be practised following stratified designs (see discussion in Guisan & Zimmermann, 2000).
Modellers should be aware that ‘pursuing’ increasing values of AUCbg past the typically unknown threshold set by 1 − a/2 can lead to a trade-off between correct ranking of habitable sites and AUCbg. Biased predictions have serious consequences for allocation of conservation effort. For example, if SDMs are used to identify core areas of a species' range for targeted management or acquisition, sites that are otherwise highly favourable may receive lower rankings than sites that are poorer but overly represented in the test set.
These results also indicate that pursuing higher AUCbg scores may encourage inclusion of predictors that have no real relationship with species' distributions but nonetheless differentiate areas with unequal representation in the test set. This can have serious consequences for models projected to different regions or time periods if the spurious factor changes with time, thereby incorrectly influencing predictions.
I have directed these observations specifically towards AUC as it remains the most widely used measure of model performance. However, other metrics such as the true skill statistic (TSS; Allouche et al., 2006) or Cohen's Kappa (Fielding & Bell, 1997) calculated with background sites can also have undesirable relationships between actual and apparent model performance. For example, TSS is calculated as the sum of the proportion of test presences and test absences (or background sites) correctly predicted minus 1. Assuming the same situation used in Fig. 2, if a1 is undersampled relative to its area, higher TSS can be achieved by reducing predictions in a1 equal to or below true absence sites because the number of ‘absences’ in a1 counts more towards TSS than the number of test presences. Likewise, if test absences are available but biased, there will likely be discrepancies between apparent and actual accuracy.
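The TSS argument can be sketched in the same way. The example assumes a hypothetical landscape of 100 sites (60 unoccupied, 30 in a1, 10 in a2), background sites covering every site and standing in for absences, and a test set with one presence in a1 and nine in a2, so that a1 is undersampled relative to its area. Predictions are already thresholded to present (True) or absent (False).

```python
def tss(presence_predictions, background_predictions):
    """True skill statistic: sensitivity + specificity - 1, from thresholded predictions."""
    sensitivity = sum(presence_predictions) / len(presence_predictions)
    specificity = (sum(1 for predicted in background_predictions if not predicted)
                   / len(background_predictions))
    return sensitivity + specificity - 1


# Model A predicts the species present throughout a1 and a2.
presences_a = [True] * 1 + [True] * 9                       # 1 test presence in a1, 9 in a2
background_a = [False] * 60 + [True] * 30 + [True] * 10     # unoccupied, a1, a2

# Model B predicts absence in the undersampled region a1.
presences_b = [False] * 1 + [True] * 9
background_b = [False] * 60 + [False] * 30 + [True] * 10

tss_a = tss(presences_a, background_a)
tss_b = tss(presences_b, background_b)
print(tss_a, tss_b)
```

Under these assumptions the model that wrongly predicts absence across a1 scores higher, because the 30 background sites gained as correct ‘absences’ outweigh the single test presence lost.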
Modellers are advised to search for alternative measures of performance when true absence data are unavailable. One option is to test AUCbg against a null distribution of values generated from models trained on random sampling of sites across a region of interest (Raes & ter Steege, 2007). However, this framework assumes higher values of AUCbg are better, whereas I have shown that there can be a positive relationship between AUCbg and model bias. Alternatives include use of presence-only measures of performance (Algar et al., 2009) or model selection based on some measure of parsimony (e.g. number of predictors selected by the model or smallest area predicted occupied given a set proportion of test presences are correctly predicted). Unfortunately, a model's ability to differentiate presence from absence can only be evaluated with representative presence and absence data. Many species of conservation concern are so poorly characterized that absence data are unavailable, but the exigency of their situations demands action backed by model predictions, even if they are imperfect (Wiens et al., 2009). Hence, evaluation of SDMs when absences are unavailable remains an open and needed avenue of research.
I wish to thank the two anonymous reviewers of the manuscript whose comments were instrumental to the direction and tone of this article. This project is made possible by a grant from the US Institute of Museum and Library Services.
Adam B. Smith is a Postdoctoral Researcher at the Missouri Botanical Garden who desires to make species distribution models more useful to the conservation community. His interests include biogeography, macroecology and international conservation policy. Currently, he is evaluating the vulnerability of threatened plants in the North American Central Highlands to climate change, a project made all the more challenging by the lack of true absences.