C. F. Dormann (email@example.com), Dept of Computational Landscape Ecology, UFZ Helmholtz Centre for Environmental Research, Permoserstr. 15, DE-04318 Leipzig, Germany. – J. M. McPherson, Dept of Biology, Dalhousie Univ., 1355 Oxford Street, Halifax NS, B3H 4J1 Canada. – M. B. Araújo, Dept de Biodiversidad y Biología Evolutiva, Museo Nacional de Ciencias Naturales, CSIC, C/ Gutiérrez Abascal, 2, ES-28006 Madrid, Spain, and Centre for Macroecology, Inst. of Biology, Universitetsparken 15, DK-2100 Copenhagen Ø, Denmark. – R. Bivand, Economic Geography Section, Dept of Economics, Norwegian School of Economics and Business Administration, Helleveien 30, NO-5045 Bergen, Norway. – J. Bolliger, Swiss Federal Research Inst. WSL, Zürcherstrasse 111, CH-8903 Birmensdorf, Switzerland. – G. Carl and I. Kühn, Dept of Community Ecology (BZF), UFZ Helmholtz Centre for Environmental Research, Theodor-Lieser-Strasse 4, DE-06120 Halle, Germany, and Virtual Inst. Macroecology, Theodor-Lieser-Strasse 4, DE-06120 Halle, Germany. – R. G. Davies, Biodiversity and Macroecology Group, Dept of Animal and Plant Sciences, Univ. of Sheffield, Sheffield S10 2TN, U.K. – A. Hirzel, Ecology and Evolution Dept, Univ. de Lausanne, Biophore Building, CH-1015 Lausanne, Switzerland. – W. Jetz, Ecology Behavior and Evolution Section, Div. of Biological Sciences, Univ. of California, San Diego, 9500 Gilman Drive, MC 0116, La Jolla, CA 92093-0116, USA. – W. D. Kissling, Community and Macroecology Group, Inst. of Zoology, Dept of Ecology, Johannes Gutenberg Univ. of Mainz, DE-55099 Mainz, Germany, and Virtual Inst. Macroecology, Theodor-Lieser-Strasse 4, DE-06120 Halle, Germany. – R. Ohlemüller, Dept of Biology, Univ. of York, PO Box 373, York YO10 5YW, U.K. – P. R. Peres-Neto, Dept of Biology, Univ. of Regina, SK, S4S 0A2 Canada, present address: Dept of Biological Sciences, Univ. of Quebec at Montreal, CP 8888, Succ. Centre Ville, Montreal, QC, H3C 3P8, Canada. – B. Reineking, Forest Ecology, ETH Zurich CHN G 75.3, Universitätstr. 16, CH-8092 Zürich, Switzerland. – B. Schröder, Inst. for Geoecology, Univ. of Potsdam, Karl-Liebknecht-Strasse 24-25, DE-14476 Potsdam, Germany. – F. M. Schurr, Plant Ecology and Nature Conservation, Inst. of Biochemistry and Biology, Univ. of Potsdam, Maulbeerallee 2, DE-14469 Potsdam, Germany. – R. Wilson, Área de Biodiversidad y Conservación, Escuela Superior de Ciencias Experimentales y Tecnología, Univ. Rey Juan Carlos, Tulipán s/n, Móstoles, ES-28933 Madrid, Spain.
Species distributional or trait data based on range map (extent-of-occurrence) or atlas survey data often display spatial autocorrelation, i.e. locations close to each other exhibit more similar values than those further apart. If this pattern remains present in the residuals of a statistical model based on such data, one of the key assumptions of standard statistical analyses, that residuals are independent and identically distributed (i.i.d.), is violated. The violation of the assumption of i.i.d. residuals may bias parameter estimates and can increase type I error rates (falsely rejecting the null hypothesis of no effect). While this is increasingly recognised by researchers analysing species distribution data, there is, to our knowledge, no comprehensive overview of the many available spatial statistical methods to take spatial autocorrelation into account in tests of statistical significance. Here, we describe six different statistical approaches to infer correlates of species’ distributions, for both presence/absence (binary response) and species abundance data (Poisson or normally distributed response), while accounting for spatial autocorrelation in model residuals: autocovariate regression; spatial eigenvector mapping; generalised least squares; (conditional and simultaneous) autoregressive models and generalised estimating equations. A comprehensive comparison of the relative merits of these methods is beyond the scope of this paper. To demonstrate each method's implementation, however, we undertook preliminary tests based on simulated data. These preliminary tests verified that most of the spatial modelling techniques we examined showed good type I error control and precise parameter estimates, at least when confronted with simplistic simulated data containing spatial autocorrelation in the errors. However, we found that for presence/absence data the results and conclusions were very variable between the different methods.
This is likely due to the low information content of binary maps. Also, in contrast with previous studies, we found that autocovariate methods consistently underestimated the effects of environmental controls of species distributions. Given their widespread use, in particular for the modelling of species presence/absence data (e.g. climate envelope models), we argue that this warrants further study and caution in their use. To aid other ecologists in making use of the methods described, code to implement them in freely available software is provided in an electronic appendix.
Species distributional data such as species range maps (extent-of-occurrence), breeding bird surveys and biodiversity atlases are a common source for analyses of species-environment relationships. These, in turn, form the basis for conservation and management plans for endangered species, for calculating distributions under future climate and land-use scenarios and other forms of environmental risk assessment.
The analysis of spatial data is complicated by a phenomenon known as spatial autocorrelation. Spatial autocorrelation (SAC) occurs when the values of variables sampled at nearby locations are not independent from each other (Tobler 1970). The causes of spatial autocorrelation are manifold, but three factors are particularly common (Legendre and Fortin 1989, Legendre 1993, Legendre and Legendre 1998): 1) biological processes such as speciation, extinction, dispersal or species interactions are distance-related; 2) non-linear relationships between environment and species are modelled erroneously as linear; 3) the statistical model fails to account for an important environmental determinant that in itself is spatially structured and thus causes spatial structuring in the response (Besag 1974). The second and third points are not always referred to as spatial autocorrelation, but rather spatial dependency (Legendre et al. 2002). Since they also lead to autocorrelated residuals, these are equally problematic. A fourth source of spatial autocorrelation relates to spatial resolution, because coarser grains lead to a spatial smoothing of data. In all of these cases, SAC may confound the analysis of species distribution data.
Spatial autocorrelation may be seen as both an opportunity and a challenge for spatial analysis. It is an opportunity when it provides useful information for inference of process from pattern (Palma et al. 1999) by, for example, increasing our understanding of contagious biotic processes such as population growth, geographic dispersal, differential mortality, social organization or competition dynamics (Griffith and Peres-Neto 2006). In most cases, however, the presence of spatial autocorrelation is seen as posing a serious shortcoming for hypothesis testing and prediction (Lennon 2000, Dormann 2007b), because it violates the assumption of independently and identically distributed (i.i.d.) errors of most standard statistical procedures (Anselin 2002) and hence inflates type I errors, occasionally even inverting the slope of relationships from non-spatial analysis (Kühn 2007).
A variety of methods have consequently been developed to correct for the effects of spatial autocorrelation (partially reviewed in Keitt et al. 2002, Miller et al. 2007, see below), but only a few have made it into the ecological literature. The aims of this paper are to 1) present and explain methods that account for spatial autocorrelation in analyses of spatial data; the approaches considered are: autocovariate regression, spatial eigenvector mapping (SEVM), generalised least squares (GLS), conditional autoregressive models (CAR), simultaneous autoregressive models (SAR), generalised linear mixed models (GLMM) and generalised estimating equations (GEE); 2) describe which of these methods can be used for which error distribution, and discuss potential problems with implementation; 3) illustrate how to implement these methods using simulated data sets and by providing computing code (Anon. 2005).
Methods for dealing with spatial autocorrelation
Detecting and quantifying spatial autocorrelation
Before considering the use of modelling methods that account for spatial autocorrelation, it is a sensible first step to check whether spatial autocorrelation is in fact likely to impact the planned analyses, i.e. if model residuals indeed display spatial autocorrelation. Checking for spatial autocorrelation (SAC) has become a commonplace exercise in geography and ecology (Sokal and Oden 1978a, b, Fortin and Dale 2005). Established procedures include (Isaaks and Shrivastava 1989, Perry et al. 2002): Moran's I plots (also termed Moran's I correlogram by Legendre and Legendre 1998), Geary's c correlograms and semi-variograms. In all three cases a measure of similarity (Moran's I, Geary's c) or variance (variogram) of data points (i and j) is plotted as a function of the distance between them (dij). Distances are usually grouped into bins. Moran's I-based correlograms typically show a decrease from some level of SAC to a value of 0 (or below; expected value in the absence of SAC: E(I)=−1/(n–1), where n=sample size), indicating no SAC at some distance between locations. Variograms depict the opposite, with the variance between pairs of points increasing up to a certain distance, where variance levels off. Variograms are more commonly employed in descriptive geostatistics, while correlograms are the prevalent graphical presentation in ecology (Fortin and Dale 2005).
Values of Moran's I are assessed by a test statistic (the Moran's I standard deviate) which indicates the statistical significance of SAC in e.g. model residuals. Additionally, model residuals may be plotted as a map that more explicitly reveals particular patterns of spatial autocorrelation (e.g. anisotropy or non-stationarity of spatial autocorrelation). For further details and formulae see e.g. Isaaks and Shrivastava (1989) or Fortin and Dale (2005).
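As an illustration of the quantity plotted in a correlogram, the following minimal sketch computes global Moran's I for a single distance class. It assumes binary weights (wij=1 if two distinct points lie within `max_dist` of each other, 0 otherwise); the function names, the toy 4×4 grid and the two test patterns are hypothetical and serve only to show the calculation, not the significance test.

```python
import math


def morans_i(values, coords, max_dist):
    """Global Moran's I with binary weights: w_ij = 1 if 0 < d_ij <= max_dist."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0   # sum over i != j of w_ij * dev_i * dev_j
    w_sum = 0.0   # sum of all weights
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if math.dist(coords[i], coords[j]) <= max_dist:
                cross += dev[i] * dev[j]
                w_sum += 1.0
    total_ss = sum(x * x for x in dev)
    return (n / w_sum) * (cross / total_ss)


def expected_i(n):
    """Expected value of Moran's I in the absence of SAC: -1/(n-1)."""
    return -1.0 / (n - 1)


# Hypothetical 4 x 4 lattice with two contrasting patterns
coords = [(x, y) for x in range(4) for y in range(4)]
gradient = [float(x) for x, y in coords]            # smooth trend
checker = [float((x + y) % 2) for x, y in coords]   # alternating cells

i_gradient = morans_i(gradient, coords, max_dist=1.0)
i_checker = morans_i(checker, coords, max_dist=1.0)
print(i_gradient, i_checker, expected_i(len(coords)))
```

On this toy lattice the smooth gradient yields a value well above E(I), and the checkerboard a value well below it. Repeating the calculation for successive distance bins would give the points of a correlogram.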
Assumptions common to all modelling approaches considered
All methods assume spatial stationarity, i.e. spatial autocorrelation and effects of environmental correlates to be constant across the region, and there are very few methods to deal with non-stationarity (Osborne et al. 2007). Stationarity may or may not be a reasonable assumption, depending, among other things, on the spatial extent of the study. If the main cause of spatial autocorrelation is dispersal (for example in research on animal distributions), stationarity is likely to be violated, for example when moving from a floodplain to the mountains, where movement may be more restricted. One method able to accommodate spatial variation in autocorrelation is geographically weighted regression (Fotheringham et al. 2002), a method not considered here because of its limited use for hypothesis testing (coefficient estimates depend on spatial position) and because it was not designed to remove spatial autocorrelation (see e.g. Kupfer and Farris 2007, for a GWR correlogram).
Another assumption is that of isotropic spatial autocorrelation. This means that the process causing the spatial autocorrelation acts in the same way in all directions. Environmental factors that may cause anisotropic spatial autocorrelation are wind (giving a wind-dispersed organism a preferential direction), water currents (e.g. carrying plankton), or directionality in soil transport (carrying seeds) from mountains to plains. He et al. (2003) as well as Worm et al. (2005) provide examples of analyses accounting for anisotropy in ecological data, and several of the methods described below can be adapted for such circumstances.
Description of spatial statistical modelling methods
The methods we describe in the following fall broadly into three groups. 1) Autocovariate regression and spatial eigenvector mapping seek to capture the spatial configuration in additional covariates, which are then added into a generalised linear model (GLM). 2) Generalised least squares (GLS) methods fit a variance-covariance matrix based on the non-independence of spatial observations. Simultaneous autoregressive models (SAR) and conditional autoregressive models (CAR) do the same but in different ways to GLS, and the generalised linear mixed models (GLMM) we employ for non-normal data are a generalisation of GLS. 3) Generalised estimating equations (GEE) split the data into smaller clusters before also modelling the variance-covariance relationship. For comparison, the following non-spatial models were also employed: simple GLM and trend-surface generalised additive models (GAM: Hastie and Tibshirani 1990, Wood 2006), in which geographical location was fitted using splines as a trend-surface (as a two-dimensional spline on geographical coordinates). Trend surface GAM does not address the problem of spatial autocorrelation, but merely accounts for trends in the data across larger geographical distances (Cressie 1993). A promising tool which became available only recently is the use of wavelets to remove spatial autocorrelation (Carl and Kühn 2007b). However, the method was published too recently to be included here and hence awaits further testing.
We also did not include Bayesian spatial models in this review. Several recent publications have employed this method and provide a good coverage of its implementation (Osborne et al. 2001, Hooten et al. 2003, Thogmartin et al. 2004, Gelfand et al. 2005, Kühn et al. 2006, Latimer et al. 2006). The Bayesian approach to spatial models used in these studies is based either on a CAR or an autologistic implementation similar to the one we used as a frequentist method. The Bayesian framework allows for a more flexible incorporation of other complications (observer bias, missing data, different error distributions) but is much more computer-intensive than any of the methods presented here.
Beyond the methods mentioned above, there are also those which correct test statistics for spatial autocorrelation. These include Dutilleul's modified t-test (Dutilleul 1993) or the CRH-correction for correlations (Clifford et al. 1989), randomisation tests such as partial Mantel tests (Legendre and Legendre 1998), or strategies employed by Lennon (2000), Liebhold and Gurevitch (2002) and Segurado et al. (2006) which are all useful as a robust assessment of correlation between environmental and response variables. As these methods do not allow a correction of the parameter estimates, however, they are not considered further in this study. In the following sections we present a detailed description of all methods employed here.
1. Autocovariate models
Autocovariate models address spatial autocorrelation by estimating how much the response variable at any one site reflects response values at surrounding sites. This is achieved through a simple extension of generalised linear models by adding a distance-weighted function of neighbouring response values to the model's explanatory variables. This extra parameter is known as the autocovariate. The autocovariate is intended to capture spatial autocorrelation originating from endogenous processes such as conspecific attraction, limited dispersal, contagious population growth, and movement of censused individuals between sampling sites (Smith 1994, Keitt et al. 2002, Yamaguchi et al. 2003).
Adding the autocovariate transforms the linear predictor of a generalised linear model from its usual form, y=Xβ+ɛ, to y=Xβ+ρA+ɛ, where β is a vector of coefficients for intercept and explanatory variables X; and ρ is the coefficient of the autocovariate A.
Where spatial autocorrelation is thought to be anisotropic (e.g. because seed dispersal follows prevailing winds or downstream run-off), multiple autocovariates can be used to capture spatial autocorrelation in different geographic directions (He et al. 2003).
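The construction of the autocovariate A can be sketched as follows. The weighting scheme is an assumption of this sketch: each site receives the inverse-distance-weighted mean of the response values at all other sites within a chosen radius. The function name, the radius and the toy presence/absence pattern are hypothetical.

```python
import math


def autocovariate(y, coords, radius):
    """Inverse-distance-weighted mean of neighbouring responses for each site.

    Assumption of this sketch: weights are 1/d_ij for 0 < d_ij <= radius.
    """
    ac = []
    for i, ci in enumerate(coords):
        num = 0.0
        den = 0.0
        for j, cj in enumerate(coords):
            if i == j:
                continue
            d = math.dist(ci, cj)
            if d <= radius:
                w = 1.0 / d
                num += w * y[j]
                den += w
        # Sites without any neighbour inside the radius get 0
        ac.append(num / den if den > 0 else 0.0)
    return ac


# Hypothetical 3 x 3 lattice; presences clustered in one corner
coords = [(x, y) for x in range(3) for y in range(3)]
presence = [1, 1, 0,
            1, 1, 0,
            0, 0, 0]
A = autocovariate(presence, coords, radius=1.5)
print([round(a, 3) for a in A])
```

The resulting vector A would then be supplied to the GLM as one additional explanatory variable alongside X, so that ρ is estimated together with β.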
2. Spatial eigenvector mapping (SEVM)
Spatial eigenvector mapping is based on the idea that the spatial arrangement of data points can be translated into explanatory variables, which capture spatial effects at different spatial resolutions. During the analysis, those eigenvectors that reduce spatial autocorrelation in the residuals best are chosen explicitly as spatial predictors. Since each eigenvector represents a particular spatial patterning, SAC is effectively allowed to vary in space, relaxing the assumption of both spatial isotropy and stationarity. Plotting these eigenvectors reveals the spatial patterning of the spatial autocorrelation (see Diniz-Filho and Bini 2005, for an example). This method could thus be very useful for data with SAC stemming from larger scale observation bias.
The method is based on the eigenfunction decomposition of spatial connectivity matrices, a relatively new and still unfamiliar method for describing spatial patterns in complex data (Griffith 2000b, Borcard and Legendre 2002, Griffith and Peres-Neto 2006, Dray et al. 2006). A very similar approach, called eigenvector filtering, was presented by Diniz-Filho and Bini (2005) based on their method to account for phylogenetic non-independence in biological data (Diniz-Filho et al. 1998). Eigenvectors from these matrices represent the decompositions of Moran's I statistic into all mutually orthogonal maps that can be generated from a given connectivity matrix (Griffith and Peres-Neto 2006). Either binary or distance-based connectivity matrices can be decomposed, offering a great deal of flexibility regarding topology and transformations. Because the spatial connectivity matrices are non-Euclidean (i.e. only certain connections among sampling units, and not all, are considered), both positive and negative eigenvalues are produced. Eigenvectors with positive eigenvalues represent positive autocorrelation, whereas eigenvectors with negative eigenvalues represent negative autocorrelation. For the sake of presenting a general method that will work for either binary or distance matrices, we used a distance-based eigenvector procedure (after Dray et al. 2006) which can be summarized as follows: 1) compute a pairwise Euclidean (geographic) distance matrix among sampling units: D=[dij]; 2) choose a threshold value t and construct a connectivity matrix W=[wij] using the following rule:
$$w_{ij} = \begin{cases} 0 & \text{if } i = j \\ 0 & \text{if } d_{ij} > t \\ 1 - (d_{ij}/4t)^2 & \text{if } d_{ij} \le t \end{cases}$$
where t is chosen as the maximum distance that maintains connections among all sampling units when they are connected using a minimum spanning tree algorithm (Legendre and Legendre 1998). Because the example data we use represent a regular grid (see below), t=1 and thus wij is either 0 or 1 − (1/4)² = 0.9375 in our analysis. Note that we can change 0.9375 to 1 without affecting eigenvector extraction. This would make the matrix fully compatible with a binary matrix which is the case for a regular grid. 3) Compute the eigenvectors of the centred similarity matrix: (I − 11ᵀ/n)W(I − 11ᵀ/n), where I is the identity matrix and 1 is a vector of ones. Due to numerical precision regarding the eigenvector extraction of large matrices (Bai et al. 1996) the method is limited to ca 7000 observations depending on platform and software (but see Griffith 2000a, for solutions based on large binary connectivity matrices). 4) Select eigenvectors to be included as spatial predictors in a linear or generalised linear model. Here, a model selection procedure that minimizes the amount of spatial autocorrelation in residuals was used (see Griffith and Peres-Neto 2006 and Appendix for computational details). In this approach, eigenvectors are added to a model until the spatial autocorrelation in the residuals, measured by Moran's I, is non-significant. Our selection algorithm considered global Moran's I (i.e. autocorrelation across all residuals), but could be easily amended to target spatial autocorrelation within certain distance classes. The significance of Moran's I was tested using a permutation test as implemented in Lichstein et al. (2002). This potentially renders the selection procedure computationally intensive for large data sets (200 or more observations), because a permutation test must be performed for each new eigenvector entered into the model.
Once the location-dependent, but data-independent eigenvectors are selected, they are incorporated into the ordinary regression model (i.e. linear or generalized linear model) as covariates. Since their relevance has been assessed during the filtering process, model simplification is not indicated (although some eigenvectors will not be significant).
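Steps 1 to 3 of the eigenvector procedure can be sketched with NumPy as below. This is a minimal illustration under stated assumptions, not the code of the Appendix: the truncated weight 1 − (dij/4t)² for dij ≤ t is assumed, the 5×5 lattice is hypothetical, and the eigenvector selection of step 4 (adding eigenvectors until residual Moran's I is non-significant) is deliberately left out.

```python
import numpy as np


def sevm_eigenvectors(coords, t):
    """Eigen-decomposition of the doubly centred connectivity matrix.

    Assumed weighting: w_ij = 1 - (d_ij / (4 t))**2 for 0 < d_ij <= t, else 0.
    Returns eigenvalues (descending) and the matching eigenvectors as columns.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    # Step 1: pairwise Euclidean distance matrix
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Step 2: truncated connectivity matrix with zero diagonal
    w = np.where((d > 0) & (d <= t), 1.0 - (d / (4.0 * t)) ** 2, 0.0)
    # Step 3: centre W and extract eigenvectors (matrix is symmetric)
    centring = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(centring @ w @ centring)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]


# Hypothetical 5 x 5 regular lattice with unit spacing, so t = 1
coords = [(x, y) for x in range(5) for y in range(5)]
eigvals, eigvecs = sevm_eigenvectors(coords, t=1.0)
print(eigvals[:3], eigvals[-3:])
```

Columns with large positive eigenvalues describe broad-scale positive autocorrelation and are the natural candidates to offer to the selection step; columns with negative eigenvalues describe negative autocorrelation.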
3. Spatial models based on generalised least squares regression
In linear models of normally distributed data, spatial autocorrelation can be addressed by the related approaches of generalised least squares (GLS) and autoregressive models (conditional autoregressive models (CAR) and simultaneous autoregressive models (SAR)). GLS directly models the spatial covariance structure in the variance-covariance matrix ∑, using parametric functions. CAR and SAR, on the other hand, model the error generating process and operate with weight matrices that specify the strength of interaction between neighbouring sites.
As before, the underlying model is Y=Xβ+ɛ, with the error vector ɛ ~ N(0, ∑), where ∑ is the variance-covariance matrix. Instead of fitting individual values for ∑, a parametric correlation function is assumed. Correlation functions are isotropic, i.e. they depend only on the distance sij between locations i and j, but not on the direction. Three frequently used examples of correlation functions C(s) also used in this study are exponential (C(s)=σ² exp(−r/s)), Gaussian (C(s)=σ² exp(−(r/s)²)) and spherical, where r is a scaling factor that is estimated from the data.
Some restrictions are placed upon the resulting variance-covariance matrix ∑: a) it must be symmetric, and b) it must be positive definite. This guarantees that the matrix is invertible, which is necessary for the fitting process (see below). The choice of correlation function is commonly based on a visual investigation of the semi-variogram or correlogram of the residuals.
Parameter estimation is a two-step process. First, the parameters of the correlation function (i.e. scaling factor r in the examples used here) are found by optimizing the so-called profiled log-likelihood, which is the log-likelihood where the unknown values for β and σ² are replaced by their algebraic maximum likelihood estimators. Secondly, given the parameterization of the variance-covariance matrix, the values for β and σ² are found by solving a weighted ordinary least squares problem:
where the error term is now normally distributed with mean 0 and variance σ2I.
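The second estimation step can be sketched as a whitening transformation followed by ordinary least squares. The sketch below makes several assumptions that are not taken from the text: the correlation matrix is treated as known rather than estimated by profiled likelihood, an exponential decay exp(−d/range) with a hypothetical range of 2 is used, and the data are simulated on a hypothetical 6×6 lattice.

```python
import numpy as np


def gls_fit(X, y, sigma):
    """GLS coefficients for a known covariance matrix sigma.

    Both sides are pre-multiplied by the inverse Cholesky factor of sigma,
    which makes the transformed errors uncorrelated, and OLS is then applied.
    """
    L = np.linalg.cholesky(sigma)      # sigma = L @ L.T
    Xw = np.linalg.solve(L, X)         # whitened predictors
    yw = np.linalg.solve(L, y)         # whitened response
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta


rng = np.random.default_rng(1)

# Hypothetical 6 x 6 lattice and an assumed exponential correlation function
coords = np.array([(x, y) for x in range(6) for y in range(6)], dtype=float)
d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))
sigma = np.exp(-d / 2.0)               # assumed range parameter of 2

# Simulate a response with spatially autocorrelated errors
X = np.column_stack([np.ones(len(coords)), rng.normal(size=len(coords))])
beta_true = np.array([1.0, 0.5])
errors = np.linalg.cholesky(sigma) @ rng.normal(size=len(coords))
y = X @ beta_true + errors

beta_hat = gls_fit(X, y, sigma)
print(beta_hat)
```

The whitened fit is algebraically the same as the closed form (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹y; working with the Cholesky factor simply avoids forming the inverse explicitly.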
Both CAR and SAR incorporate spatial autocorrelation using neighbourhood matrices which specify the relationship between the response values (in the case of CAR) or residuals (in the case of SAR) at each location (i) and those at neighbouring locations (j) (Cressie 1993, Lichstein et al. 2002, Haining 2003). The neighbourhood relationship is formally expressed in an n×n matrix of spatial weights (W) with elements (wij) representing a measure of the connection between locations i and j. The specification of the spatial weights matrix starts by identifying the neighbourhood structure of each cell. Usually, a binary neighbourhood matrix N is formed where nij=1 when observation j is a neighbour to observation i. This neighbourhood can be identified by the adjacency of cells on a grid map, or by Euclidean or great circle distance (i.e. the distance along earth's surface), or predefined according to a specific number of neighbours (e.g. a neighbourhood distance of 1.5 in our case includes the 8 adjacent neighbours). The elements of N can further be weighted to give closer neighbours higher weights and more distant neighbours lower weights. The matrix of spatial weights W consists of zeros on the diagonal, and weights for the neighbouring locations (wij) in the off-diagonal positions. A good introduction to the CAR and SAR methodology is given by Wall (2004).
with ɛ ~ N(0, VC). If σi² = σ² for all locations i, the covariance matrix is VC = σ²(I − ρW)⁻¹, where W has to be symmetric. Consequently, CAR is unsuitable when directional processes such as stream flow effects or prevalent wind directions are coded as non-Euclidean distances, resulting in an asymmetric covariance matrix. In such situations, the closely related simultaneous autoregressive models (SAR) are a better option, as their W need not be symmetric (see below). For our analysis, we used a row-standardised binary weights matrix for a neighbour-distance of 2 (Appendix).
Simultaneous autoregressive models (SAR)
SAR models can take three different forms (we use the notation presented in Anselin 1988), depending on where the spatial autoregressive process is believed to occur (see Cliff and Ord 1981, Anselin 1988, Haining 2003, for details). The first SAR model assumes that the autoregressive process occurs only in the response variable (“lagged-response model”), and thus includes a term (ρW) for the spatial autocorrelation in the response variable Y, but also the standard term for the predictors and errors (Xβ+ɛ) as used in an ordinary least squares (OLS) regression. Spatial autocorrelation in the response may occur, for example, where propagules disperse passively with river flow, leading to a directional spatial effect. The SAR lagged-response model (SAR lag) takes the form
$$Y = \rho W Y + X\beta + \varepsilon$$
(which is equivalent to Y = (I − ρW)⁻¹Xβ + (I − ρW)⁻¹ɛ), where ρ is the autoregression parameter, W the spatial weights matrix, and β a vector representing the slopes associated with the predictors in the original predictor matrix X.
Second, spatial autocorrelation can affect both response and predictor variables (“lagged-mixed model”, SAR mix). Ecologically, this adds a local aggregation component to the spatial effect in the lag-model above. In this case, another term (WXγ) must also appear in the model, which describes the regression coefficients (γ) of the spatially lagged predictors (WX). The SAR lagged-mixed model takes the form
$$Y = \rho W Y + X\beta + W X\gamma + \varepsilon$$
Finally, the “spatial error model” (SAR err) assumes that the autoregressive process occurs only in the error term and neither in response nor in predictor variables. The model is most similar to the CAR, with no directionality in the error. In this case, the usual OLS regression model (Y=Xβ+ɛ) is complemented by a term (λWμ) which represents the spatial structure (λW) in the spatially dependent error term (μ). The SAR spatial error model thus takes the form
$$Y = X\beta + \lambda W \mu + \varepsilon$$
where λ is the spatial autoregression coefficient, and the rest as above. SAR and CAR are related to each other, but the terms ρW used in both CAR and SAR are not identical. As noted above, in CAR, W must be symmetrical, whereas in SAR it need not be. Let ρW of the CAR be called K and ρW of the SAR be called S. Then any SAR is a CAR with K = S + Sᵀ − SᵀS (Haining 2003). Assuming constant variance σ², the formal relationship between the error variance-covariance matrices in GLS, SAR, and CAR is as follows: VGLS = σ²C(s); VCAR = σ²(I − K)⁻¹ and VSAR = σ²(I − S)⁻¹(I − Sᵀ)⁻¹, with K and S as defined above. Thus CAR and SAR models are equivalent if VCAR=VSAR. The relationship between specific values in correlation matrix C and weight matrix W is not straightforward, however. In particular, spatial dependence parameters that decrease monotonically with distance do not necessarily correspond to spatial covariances that decrease monotonically with distance (Waller and Gotway 2004). An extensive comparison of the impact of different model formulations on parameter estimation and type I error control is given by Kissling and Carl (2007) using simulated datasets with different spatial autocorrelation structures.
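The stated relationship between K and S can be checked numerically. Since (I − Sᵀ)(I − S) = I − S − Sᵀ + SᵀS = I − K, inverting both sides gives (I − K)⁻¹ = (I − S)⁻¹(I − Sᵀ)⁻¹, i.e. VCAR = VSAR. The sketch below verifies the first identity for a small asymmetric example; the three-site weights matrix and the value of ρ are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
def transpose(a):
    """Matrix transpose for a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def combine(a, b, sign):
    """Element-wise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical row-standardised weights for three sites on a line;
# row-standardisation makes W (and hence S) asymmetric.
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
rho = 0.4

S = [[rho * w for w in row] for row in W]
St = transpose(S)
K = combine(combine(S, St, +1), matmul(St, S), -1)   # K = S + S^T - S^T S

I3 = identity(3)
lhs = combine(I3, K, -1)                                   # I - K
rhs = matmul(combine(I3, St, -1), combine(I3, S, -1))      # (I - S^T)(I - S)

max_diff = max(abs(l - r) for rl, rr in zip(lhs, rhs) for l, r in zip(rl, rr))
k_symmetric = all(abs(K[i][j] - K[j][i]) < 1e-12
                  for i in range(3) for j in range(3))
print(max_diff, k_symmetric)
```

Note that K is symmetric even though S is not, which is consistent with the symmetry requirement placed on the CAR weights.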
Spatial generalised linear mixed models (GLMM)
Spatial generalised linear mixed models are generalised linear models (GLMs) in which the linear predictor may contain random effects and within-group errors may be spatially autocorrelated (Breslow and Clayton 1993, Venables and Ripley 2002). Formally, if Yij is the j-th observation of the response variable in group i,
where g is the link function, η is the linear predictor, β and ζ are coefficients for fixed and random effects, respectively, and x and z are the explanatory variables associated with these effects. Conditional on the random effects ζ, the standard GLM applies and the within-group distribution of Y can be described using the same error distributions as in GLM.
Since the GLMM is often implemented based on so-called penalized quasi-likelihood (PQL) methods (Breslow and Clayton 1993, Venables and Ripley 2002) around the GLS-algorithm (McCullagh and Nelder 1989), we can use it in a similar way, i.e. fitting the structure of the variance-covariance matrix to the data (see GLS above), albeit with a different error distribution. In cases where spatial data are available from several disjunct regions, GLMMs can thus be used to fit overall fixed effects while spatial correlation structures are nested within regions, allowing the accommodation of regional differences in e.g. autocorrelation distances, and assuming autocorrelation only between observations within the same region (Orme et al. 2005, Davies et al. 2006, Stephenson et al. 2006).
4. Spatial generalised estimating equations (GEE)
Liang and Zeger (1986) developed the generalised estimating equation (GEE) approach which is an extension of generalised linear models (GLMs). When responses are measured repeatedly through time or space, the GEE method takes correlations within clusters of sampling units into account by means of a parameterised correlation matrix, while correlations between clusters are assumed to be zero. In a spatial context such clusters can be interpreted as geographical regions, if distances between different regions are large enough (Albert and McShane 1995). We modified the approach of Liang and Zeger to use these GEE models for spatial, two-dimensional datasets sampled in rectangular grids (see Carl and Kühn 2007a, for more details). Fortunately, estimates of regression parameters are fairly robust against misspecification of the correlation matrix (Dobson 2002). The GEE approach is especially suited for parameter estimation rather than prediction (Augustin et al. 2005).
Firstly, consider the generalised linear model E(y)=μ, μ=g⁻¹(Xβ), where y is a vector of response variables, μ the expected value, g⁻¹ the inverse of the link function, X the matrix of predictors, and β the vector of regression parameters. Minimization of a quadratic form leads to the GLM score equation (Diggle et al. 1995, Dobson 2002, Myers et al. 2002)
where Dᵀ is the transpose of the matrix D of partial derivatives, D=∂μ/∂β. Secondly, note that the variance of the response can be replaced by a variance-covariance matrix V which takes into account that observations are not independent. In GEEs, the sample is split up into m clusters and the complete dataset is ordered so that the data are arranged in the same sequence in all clusters. The variance-covariance matrix then has block diagonal form, since responses of different clusters are assumed to be uncorrelated. One can consequently transform the score equation into the following form
which sums over all clusters j. This equation is called the generalised estimating equation or the quasi-score equation.
For spatial dependence the following correlation structures for V are important: 1) Fixed. The correlation structure is completely specified by the user and will not change during an iterative procedure. Referred to here as GEE. 2) User defined. Correlation parameters are to be estimated, but one can specify that certain parameters must be equal, e.g. that the strength of correlation is always the same at a certain distance. Referred to here as geese.
First, we consider the GEE model with fixed correlation structure. To predetermine the correlation structure, we have good reason to assume that, in ecological applications, the correlation decreases exponentially with increasing spatial distance. Therefore, we use the function
$$\alpha_{d_{ij}} = \alpha_1^{d_{ij}}$$
for computation of correlation parameters α. Here dij is the distance between centre points of grid cells i and j and α1 is the correlation parameter for nearest neighbours. The parameter α1 is estimated by Moran's I of GLM residuals. In this way we obtain a full n×n correlation matrix with known parameters. Thus clustering is not necessary.
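As an illustration of the fixed structure, the following minimal Python sketch builds a full n×n correlation matrix from cell-centre coordinates. It assumes that the exponential decay takes the power form α1 raised to the distance dij; the function name, the 3×3 lattice and the value α1=0.5 are hypothetical choices for the example and are not taken from the analysis (where α1 is estimated from Moran's I of GLM residuals).

```python
import numpy as np

def fixed_correlation_matrix(coords, alpha1):
    """Return the n x n matrix with entries alpha1 ** d_ij.

    The power form of the decay is an assumption made for this sketch.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean distances d_ij
    return alpha1 ** dist                     # diagonal is 1 because d_ii = 0

# Hypothetical 3 x 3 lattice with unit spacing between cell centres.
xy = [(x, y) for y in range(3) for x in range(3)]
R = fixed_correlation_matrix(xy, alpha1=0.5)  # 0.5 is an illustrative value only
```

Because every pair of cells receives a known correlation, the matrix grows with the square of the sample size, which is the storage cost of avoiding clusters.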
In the user defined case we build a specific variance-covariance matrix in block diagonal form with 5 unknown correlation parameters (corresponding to the five different distance classes in a 3×3 grid) which have to be calculated iteratively. The dispersion parameter as a correction of overdispersion can be calculated as well.
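The count of five distance classes can be checked by enumerating all pairs of cells in a 3×3 cluster. The short Python sketch below does this for unit cell spacing; the variable names are illustrative only.

```python
from itertools import combinations
from math import sqrt

# Cell centres of a 3 x 3 cluster with unit spacing.
cells = [(x, y) for y in range(3) for x in range(3)]

# Distinct squared distances between all pairs of cells in the cluster.
squared = sorted({(ax - bx) ** 2 + (ay - by) ** 2
                  for (ax, ay), (bx, by) in combinations(cells, 2)})
distance_classes = [sqrt(s) for s in squared]

# Each distance class would share one correlation parameter (numbered from 1).
param_index = {s: k + 1 for k, s in enumerate(squared)}
```

The distinct distances are 1, √2, 2, √5 and √8, so all 36 cell pairs in a cluster fall into five classes, each sharing one correlation parameter.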
Example analysis using simulated data
To illustrate and compare the various approaches that are available to incorporate SAC into the analysis of species distribution data, we constructed artificial datasets with known properties. The datasets represent virtual species distribution data (for example species atlases) and environmental (such as climatic) covariates, available on a lattice of 1108 square cells imposed on the surface of a virtual island (Fig. 3).
Generation of artificial distribution data
The basis for the virtual island is a subset of the volcano data set in R, which consists of a digital elevation model for Auckland's Maunga Whau Volcano in New Zealand (Anon. 2005). Two uncorrelated (Pearson's r=0.013, p=0.668) environmental variables were created based on the altitude-component of this data set: “rain” and “djungle”. These data are available as an electronic appendix and are depicted in Fig. 3. While “rain” is a rather deterministic function of altitude (including a rain-shadow in the east), “djungle” is dominated by a high noise component.
On this lattice the species distribution data, yi (with i an indicator for cell (i=1, 2 …, 1108)), were simulated as a function of one of the two artificial environmental predictors, raini. Onto this functional relationship, we added a spatially correlated noise component we refer to as error ɛi. The covariate raini can for example be thought of as an estimate of the total annual amount of rainfall in cell i. We simulated the three most commonly available types of species distribution data: continuous, binary and count data, using the normal distribution and approximations of the binomial and Poisson distributions, respectively. The following models were used to simulate the artificial data: 1) normally distributed data: yi=80−0.015×raini+10×ɛi. 2) Binary data: yi=0 if pi<0.5, and yi=1 if pi≥0.5, where , and 3) Poisson data:, where, and round is an operator used to round values to the nearest integer. This led to simulated data with no over- or underdispersion.
A weights matrix W was used to simulate the spatially correlated errors ɛi, using weights according to the distance between data points. Let D=(dij) be the (Euclidean) distance matrix for the distances between cells i and j (dij=0 if i=j). On our lattice, the distance between the mid-points of neighbouring cells is dij=1. Then, Ω=(ωij) is a matrix defined as ωij=exp(−ρ×dij), where ρ is a parameter that determines the decline of inter-cell correlation in errors with inter-cell distance. The strength of spatial autocorrelation increases with increasing values of ρ (there is no spatial autocorrelation if ρ=0). Here, we used a value of ρ=0.3, which resulted in strongly correlated errors in neighbouring cells (ωij=0.74, if dij=1), but a steep decline of autocorrelation with increasing distance. The weights matrix W was calculated (by Choleski decomposition) using Ω=W^T W. Finally, the spatially correlated errors are given by ɛ=W^T ξ, with ξ drawn from the standard normal distribution.
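The error simulation can be sketched in a few lines of Python. The sketch assumes that Ω has the exponential form ωij=exp(−ρ×dij), which reproduces the stated value of 0.74 for neighbouring cells at ρ=0.3. The 10×10 lattice, the random seed and the uniform “rain” values are hypothetical stand-ins, not the appendix data.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility

# Hypothetical 10 x 10 lattice with unit spacing (the study used 1108 cells).
side = 10
xy = np.array([(x, y) for y in range(side) for x in range(side)], dtype=float)
D = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1))

rho = 0.3
Omega = np.exp(-rho * D)        # assumed form of the correlation matrix

# NumPy returns the lower-triangular factor L with Omega = L @ L.T,
# so W = L.T satisfies Omega = W.T @ W as in the text.
L = np.linalg.cholesky(Omega)
W = L.T

xi = rng.standard_normal(len(xy))   # independent standard normal draws
eps = W.T @ xi                      # spatially correlated errors

# Normal response model from the text, with hypothetical covariate values.
rain = rng.uniform(0.0, 1000.0, size=len(xy))
y_normal = 80.0 - 0.015 * rain + 10.0 * eps
```

Since ξ has identity covariance, the covariance of ɛ=W^T ξ is W^T W=Ω, which is why the Choleski factor is the appropriate weights matrix.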
Analysis of simulated data
For each error distribution, ten data sets were created, each using a random realisation of the spatially autocorrelated errors, using random draws of ξi. These data sets were then submitted to statistical analyses in which the response variables were modelled using a number of different linear models for the normally distributed data, and generalized linear models with the binomial distribution and logit-link for the binary data, and Poisson distribution and log-link for the count data: E(yi)=g−1(α+β×raini+γ×djunglei), where g are the corresponding link functions (identity for the normal distribution). The variable “djungle” was entered into all of the statistical models as an additional predictor of the response. This was done to be able to assess the models’ ability to distinguish random noise from meaningful variables.
Simulations and analyses were primarily carried out (see Appendix for implementation details and R-code) using the statistical programming software R (Anon. 2005), with packages gee (Carey 2002), geepack (Yan 2002, 2004), spdep (Bivand 2005), ncf (Bjørnstad and Falck 2000) and MASS (Venables and Ripley 2002). Calculations for the spatial eigenvector mapping were originally performed in Matlab using routines later ported to R (spdep) by Roger Bivand and Pedro Peres-Neto. Additional functions (Appendix) to work generalised estimating equations on a 2-D lattice were written by Gudrun Carl (Carl and Kühn 2007a). See also Table 1 for alternative software.
Table 1. Methods correcting for spatial autocorrelation and their software implementations. This list is not exhaustive but represents the major software developments in use.
1 for most R-packages (<http://www.r-project.org>) an equivalent for S-plus is available.
2 low, medium, high and very high refer roughly to a few seconds, several minutes, a few hours and several hours of CPU-time per model (1108 data points on a Pentium 4 dual core, 3.8GHz, 2GB RAM).
3 GeoDa: freeware: <http://www.geoda.uiuc.edu>.
4 for normally distributed error only.
Matlab: <http://www.mathworks.com>, with EigMapSel – a Matlab-compiled program to perform the eigenvector selection procedure for generalised linear models (normal, logistic and Poisson) – available in ESA's Electronic Data Archive (Griffith and Peres-Neto 2006).
SAM: spatial analysis for macroecology; freeware under: <http://www.ecoevol.ufg.br/sam/>.
*requires the free “Spatial Econometric toolbox”: <http://www.spatial-econometrics.com>.
†requires additional module “spatial”.
Autoregressive models4 (CAR, SAR)
GeoDa, Matlab*, SAM, SpaceStat, S-plus†
Generalised linear mixed model
Generalised estimating equations
Generalised least squares4
Spatial eigenvector mapping
As most of the statistical methods tested allow for some flexibility in the precise structure of their spatial component, several models per method were calculated for each simulated dataset. This allowed us to identify the model configuration that most successfully accounted for spatial autocorrelation in the data at hand, by, for example, varying the distance over which spatial autocorrelation was assumed to occur, or its functional form. Inferior models were discarded, so that the results section below reports only on the best configuration for each approach. Residuals were based on fitted values and were therefore calculated from both the spatial and the non-spatial model components. For each best configuration, we report the following details: 1) model coefficients (and their standard errors); since the true parameters are known, we can directly judge the quality of coefficient estimation; 2) removal of SAC (global Moran's I, i.e. Moran's I computed across a neighbourhood of up to a distance of 20, and correlograms, which plot Moran's I for different distance classes); 3) spatial distribution of residuals (map).
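For readers unfamiliar with the statistic, a minimal Python sketch of global Moran's I follows. It assumes binary weights (1 for all pairs of distinct cells no further apart than a chosen distance, 0 otherwise); the function name and the two test patterns are illustrative, and the weighting used in the actual analyses may differ.

```python
import numpy as np

def morans_i(values, coords, max_dist):
    """Global Moran's I with binary weights for pairs within max_dist."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()                                   # centre the values
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    w = ((d > 0) & (d <= max_dist)).astype(float)      # exclude self-pairs
    return (len(z) / w.sum()) * (z @ w @ z) / (z @ z)

# Two hypothetical patterns on a 4 x 4 lattice, nearest neighbours only.
xy = [(x, y) for y in range(4) for x in range(4)]
gradient = [x for x, y in xy]            # smooth west-east trend
chequer = [(x + y) % 2 for x, y in xy]   # alternating cells
i_gradient = morans_i(gradient, xy, max_dist=1.0)
i_chequer = morans_i(chequer, xy, max_dist=1.0)
```

The smooth gradient gives a positive value and the chequerboard gives −1, the extreme of negative autocorrelation for this neighbourhood. Applied to model residuals with increasing distance bands, the same function yields the points of a correlogram.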
Results of simulations
It is worth pointing out that the main aim of this study is to illustrate the different methods by applying them to the same data sets. The ten realisations of one type of spatial autocorrelation do not allow us to provide a comprehensive evaluation of the relative merits of each of the methods considered. Such evaluation is beyond the scope of this review paper, and will depend on the data set and question under study. Nonetheless, some interesting results emerged from our simulations.
Spatial and non-spatial models differed considerably in terms of the spatial signature in their residuals (Table 2; Figs. 2 and 3). Residual maps for OLS/GLM and GAM exhibit clusters of large residuals of the same sign (Fig. 3), indicating that these models were not able to remove all spatial autocorrelation from the data. In our case we know that this is due neither to the omission of an important variable nor an incorrect functional relationship, but a simulated aggregation mechanism in the errors. In comparison, all spatial models managed to decrease spatial autocorrelation in the residuals (Fig. 2), although not all were able to completely eliminate it. Geese performed worst in this regard. Our simulations are not comprehensive enough, however, to allow us to deduce what the influence of this incomplete removal of SAC might be on parameter estimation or hypothesis testing.
Table 2. Model quality: spatial autocorrelation in the model residuals (given as global Moran's I) and mean estimates for the coefficients “rain” and “djungle” (±1 SE across the 10 simulations). True coefficient values are given in the first row for each distribution in italics. ***, ** and ns refer to median significance levels of p<0.001, <0.01 and >0.1, respectively, across the 10 realisations. See Fig. 1 for abbreviations.
Another, though inconsistent, difference between the spatial and non-spatial models, especially with binary data, was that standard errors of the coefficient estimates for “rain” and “djungle” were often larger for the spatial models (Fig. 1). For normal and Poisson data, differences in coefficient estimates between spatial and non-spatial models were relatively small, and statistical inference was not affected. Only the autocovariate model and the SAR lag model provided consistently incorrect estimates of the coefficient for the spatially autocorrelated predictor “rain”.
Most model approaches performed well with respect to type I and II error rates for the normal and Poisson data, correctly identifying “rain” as a significant effect (Table 2). An exception was autocovariate regression, which severely and consistently underestimated the effects of rain (Table 2, Fig. 1). Model performance was worse for data with a binomial error structure than for models with normal or Poisson error structure. When applied to such data, autocovariate regression (9 false negatives) and GAM (3 false negatives) were rather prone to type II errors (results not shown). Moreover, the spurious effect of djungle would have been retained in the model in several cases (based on a significance level of α=0.05: 6 normal, 2 binomial and 1 Poisson model of those presented in Table 2), resulting in type I errors (rejecting a null hypothesis although it was true).
The ability of simultaneous autoregressive models (SAR) to correctly estimate parameters depended heavily on SAR model structure. For instance, using a lagged response model in our artificial dataset yielded much poorer coefficient estimates for “rain” than using an error model (Fig. 1). This was to be expected, since our artificial distribution data were created such that their spatial structure most closely resembled that of the SAR error model.
We used an exponential distance decay function to generate the spatial error (see above). Hence, we would also expect those methods to perform best in which a correlation function can be defined accordingly (i.e. GLS, GLMM and GEE). While indeed the exponential GLS yielded better coefficient estimates than the spherical model, the Gaussian model and the GEE using a different exponential function were equivalent, as were methods that did not specify the correlation structure in such a way (e.g. SAR, Fig. 2). However, parameterisation for GEE resulted from the Moran's I correlogram, mimicking the distance decay function, though not using the original correlation function.
Limitations of our simulations
Our example analysis above was meant to illustrate the application of the presented methods to species distribution data. As such, it remained a cartoon of the complexity and difficulties posed by real data. Among the potential factors that may influence the analysis of species distribution data with respect to spatial autocorrelation, we would particularly like to mention the following.
Missing environmental variables. As mentioned in the introduction, SAC can be caused by omitting an important variable from the model or misspecifying its functional relationship with the response (Legendre 1993). This is certainly often a problem in real data, where the ecological determinants of a species’ niche are not necessarily known and good spatial coverage may not be available for all the important factors. Also, moderate collinearity among environmental variables may lead models to exclude one or more variables which would be important in explaining the species’ spatial patterning.
Biased spatial error. The autocorrelated error we added in our simulated data had no bias in geographical (stationarity of the error) or parameter space. Hence our non-spatial models performed similarly to the spatial ones with regard to parameter estimation, as opposed to removal of SAC. This may or may not be very different in real data, where both non-stationarity (Ver Hoef et al. 1993, Brunsdon et al. 1996, Foody 2004, Osborne et al. 2007) and bias in parameter space (e.g. less complete data coverage in warmer regions) can be found (Lennon 2000, but see Hawkins et al. 2007 for an opposite view).
Mapping bias or mapping heterogeneity can cause spatial autocorrelation in real data. If real data resulted from several different regional mapping schemes with different protocols, or from different people performing the mapping, data can differ systematically across a grid, being more similar within a region than between regions.
Spatial autocorrelation at different spatial scales. Several of the methods presented build a “correction structure” across all spatial scales (i.e. the variance-covariance matrices in GLS-based models as well as the spatial eigenvectors), but others do not (the autocovariate and the cluster in geese have one specific spatial scale). Even the former may be dominated by patterns at one spatial scale, underestimating effects of another.
Finally, small sample sizes make the estimation of model parameters unstable. Adding the additional parameters for spatial models will further destabilise model parameterisation. Also, patterns of SAC in small data sets will hinge on very few data points which may distort the spatial correction.
The analysis of species distribution data has reached high statistical sophistication in recent years (Elith et al. 2006). However, even the most advanced and computer-intensive statistical procedures are no guarantee for improving our understanding of the determinants of species distributions, nor of our ability to predict species distributions under altered environmental conditions (Araújo and Rahbek 2006, Dormann 2007c). One critical step in statistical modelling is the identification of the correct model structure. As pointed out for experimental ecology in 1984 by Hurlbert, designs analysed without consideration of the nested nature of subsampling are fundamentally flawed. Spatial autocorrelation is a subtle, less obvious form of subsampling (Fortin and Dale 2005): samples from within the range of spatial autocorrelation around a data point will add little independent information (depending on the strength of autocorrelation), but unduly inflate sample size, and thus degrees of freedom of model residuals, thereby influencing statistical inference.
We have presented an overview of different modelling approaches for the analysis of species distribution data in which environmental correlates of the distribution are inferred. All these methods can be implemented in freely available software packages (Table 1). In choosing between the methods, the type of error distribution in the response variable will be an important criterion. For normal data, GLS-based methods (GLS, SAR, CAR) can be used efficiently. The most flexible methods, addressing SAC for different error distributions, are spatial GLMMs, GEEs and SEVM. The autocovariate method, too, is flexible, but performed very poorly with regard to coefficient estimation in our analyses. We encourage users to try a number of methods, since there is often not enough mechanistic information to choose one specific method a priori. One can use AIC or similar criteria to compare models (Link and Barker 2006). Note that a “proper” (perfectly correctly specified) model would not require the kind of correction the above methods undertake (Ripley in comments to Besag 1974). In the absence of a perfect model, however, doing something is better than doing nothing (Keitt et al. 2002).
With the exception of autocovariate regression, differences in parameter estimates and inference between spatial and non-spatial models were small for our simulated data. This was possibly a result of the type of spatial autocorrelation in, and the simplistic nature of, these data (see section “Limitations of our simulations”). However, spatial autocorrelation can also reflect failure to include an important environmental driver in the analysis or inadequate capture of its non-linear effect, so that its spatial autocorrelation cannot be accounted for by non-spatial models (Besag et al. 1991, Legendre et al. 2002). In either case, spatial autocorrelation can make a large difference for statistical inference based on spatial data (for review see Dormann 2007a, b, c; for drastic cases of this effect see Tognelli and Kelt 2004 and Kühn 2007). How to interpret these differences, especially the shifts in parameter estimates between spatial and non-spatial models commonly observed in real data, remains controversial. While Lennon (2000) and others (Tognelli and Kelt 2004, Jetz et al. 2005, Dormann 2007b, Kühn 2007) argue that spatial autocorrelation in species distribution models may well bias coefficient estimation, Diniz-Filho et al. (2003) and Hawkins et al. (2007) found non-spatial models to be robust and unbiased for several data sets. So far, no extensive simulation study has been carried out to investigate how spatial versus non-spatial methods perform under different forms and causes of SAC. Implementing a lagged autocorrelation structure in simplistic data did not reveal a bias in parameter estimation in OLS (Kissling and Carl 2007), consistent with the results of Hawkins et al. (2007).
One of the two most striking findings of our analyses is the high error rate of the autocovariate method. Most methods for normally distributed data yielded coefficient estimates for “rain” that were acceptable, including the non-spatial ordinary least squares regression (Fig. 1). However, two models performed poorly: both the autocovariate regression and the lag version of the simultaneous autoregressive model showed a very consistent and strong bias, leading to severe underestimation (in absolute terms) of model coefficients. A similar pattern was also found for the non-normally distributed errors, identifying autocovariate regression as a consistently worse performer than the other approaches. The poor performance of the autocovariate regression approach in our study with regard to parameter estimation contrasts with earlier evaluations of this method (Augustin et al. 1996, Huffer and Wu 1998, Hoeting et al. 2000, He et al. 2003), but is in line with more recent ones (Dormann 2007a, Carl and Kühn 2007a). These earlier studies used more sophisticated parameter estimation techniques, suggesting that the inferiority of autocovariate models in our simulation may partly result from our simplistic (but not unusual) implementation of the method. Moreover, two of the earlier studies were undertaken in the context of many missing values: Augustin et al. (1996) used only 20% of sites in their study area for model training; Hoeting et al. (2000) used between 3.8 and 5.8%. This may have diminished the influence of any autocovariate and perhaps explains why in these studies the autocovariate did not overwhelm other model coefficients (as it did in ours). A final reason for the discrepancy in findings may be that our artificial data simulated spatial autocorrelation in the error structure, whereas other simulations created spatial structure directly in the response values, which more closely reflects the assumptions underlying autocovariate models.
The second interesting finding is the overall higher variability of results for binary data. While for normal- and Poisson-distributed residuals all model approaches (apart from autocovariate regression) yielded similar results and little variance across the ten realisations (Fig. 1), a different pattern emerged for binary (binomial) data. We attribute this to the relatively low information content of binary data (Breslow and Clayton 1993, Venables and Ripley 2002), making parameterisation of the model very dependent on those data points that determine the point of inflexion of the logistic curve (McCullagh and Nelder 1989). This phenomenon has been noted before (McCullagh and Nelder 1989), and remains relevant for species distribution models, where the majority of studies are based on the analysis of presence-absence data (Guisan and Zimmermann 2000, Guisan and Thuiller 2005).
Tricks and tips
Each of the above methods has its quirks and some require fine-tuning by the analyst. Without attempting to cover these comprehensively, we here hint at some areas for each method type which require attention.
In autocovariate regression, neighbourhood size and type of weighting function are potentially sensitive parameters, which can be optimised through trial and error. It seems, however, that small neighbourhood sizes (such as the next one to two cells) often turn out best, and that the type of weighting function has relatively little effect. This was the case in our analysis as well as in published studies investigating different neighbourhood sizes (for review see Dormann 2007b). Another important aspect of autocovariate models is the approach chosen for dealing with missing data, which may lead to cells without neighbours (“islands”). Since the issue arises for all modelling methods, we shall briefly discuss it here. Missing data can be overcome by a) omission (Klute et al. 2002, Moore and Swihart 2005); b) strategic choice of neighbourhood structure (Smith 1994); c) estimating missing response values by initially ignoring spatial autocorrelation and regressing known response values against explanatory variables other than the autocovariate (Augustin et al. 1996, Teterukovskiy and Edenius 2003, Segurado and Araújo 2004); and d) as in c), but then refining it through an iterative procedure known as the Gibbs sampler (Casella and George 1992). This procedure is computationally intensive, but has been found to yield the best results (Augustin et al. 1996, Wu and Huffer 1997, Osborne et al. 2001, Teterukovskiy and Edenius 2003, Brownstein et al. 2003, He et al. 2003). Simulation studies further suggest that a) parameter estimation is poor when the autocovariate effect is strong relative to the effect of other explanatory variables (Wu and Huffer 1997, Huffer and Wu 1998); b) the precision of parameter estimates varies with species prevalence, i.e. the number of presence records relative to the total sample size (Hoeting et al. 2000); and c) autocovariate models adequately distinguish between meaningful explanatory variables and random covariates (Hoeting et al. 2000) (but not in our study). Both simulation and empirical studies also indicate that autocovariate models achieve better fit than equivalent models lacking the autocovariate term (Augustin et al. 1996, Hoeting et al. 2000, Osborne et al. 2001, He et al. 2003, McPherson and Jetz 2007).
For spatial eigenvector mapping, computational speed becomes an issue for large datasets. Although the calculation of eigenvectors itself is rapid, optimising the model by permutation-based testing of combinations of spatial eigenvectors is computer-intensive. Diniz-Filho and Bini (2005) argue that the identity of the selected eigenvectors is indicative of the spatial scales at which spatial autocorrelation takes effect, making this method potentially very interesting for ecologists. The implementation used in our analysis requires little arbitration and hence should be explored more widely. Note that SEVM, in the way that it was applied here, is based on a different modelling philosophy. Its declared aim is to remove residual spatial autocorrelation, unlike all other methods described above, which simply provide a mathematical way to incorporate SAC into the analysis.
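The eigenvector step itself can be sketched briefly in Python. The sketch assumes one common construction, in which a symmetric binary connectivity matrix is doubly centred and then eigen-decomposed; the function name and the 4×4 rook-neighbour lattice are hypothetical, and the permutation-based selection of eigenvectors, which is the expensive part, is not shown.

```python
import numpy as np

def moran_eigenvectors(C):
    """Eigen-decompose the doubly centred connectivity matrix M C M.

    M = I - 11'/n removes the mean; C is assumed symmetric.
    Returns eigenvalues (descending) and the matching eigenvectors.
    """
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    M = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(M @ C @ M)
    order = np.argsort(vals)[::-1]          # largest eigenvalue first
    return vals[order], vecs[:, order]

# Hypothetical 4 x 4 lattice; cells are connected if their centres are 1 apart.
xy = np.array([(x, y) for y in range(4) for x in range(4)], dtype=float)
d = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1))
C = np.isclose(d, 1.0).astype(float)

vals, vecs = moran_eigenvectors(C)
```

Eigenvectors with large positive eigenvalues describe broad-scale, positively autocorrelated patterns and are the usual candidates for inclusion as additional predictors.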
For the GLS-based methods (GLS and the spatial GLMM), estimation of the correlation structure functions (i.e. the parameter r) can be rather unstable. As a consequence some models yield r=0 (i.e. no spatial autocorrelation incorporated) or r≈∞, with the GLS model returning what is in fact a non-spatial GLM or nonsensical results, respectively. This problem can be overcome by inclusion of a “nugget” term that reduces the correlation at infinitesimally small distances to a value below 1, or, even better, a specification of r based on a semi-variogram of the residuals (Littell et al. 1996, Kaluzny et al. 1998). The common justification for a nugget term is measurement error (on top of the spatially correlated error); including a nugget effect can stabilize the estimation of the correlation function (Venables and Ripley 2002).
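The effect of a nugget on an exponential correlation function can be illustrated with a short Python sketch. It assumes the common parameterisation in which correlation at distance d equals (1 − nugget)×exp(−d/r); the function name and the values of r and the nugget are illustrative and not taken from our fitted models.

```python
import numpy as np

def exp_correlation(d, r, nugget=0.0):
    """Exponential correlation (assumed form) with an optional nugget.

    For d > 0 the correlation is (1 - nugget) * exp(-d / r);
    at d = 0 it is exactly 1.
    """
    d = np.asarray(d, dtype=float)
    corr = (1.0 - nugget) * np.exp(-d / r)
    return np.where(d == 0.0, 1.0, corr)

distances = np.array([0.0, 1e-9, 1.0, 5.0])
no_nugget = exp_correlation(distances, r=2.0)
with_nugget = exp_correlation(distances, r=2.0, nugget=0.2)
```

With a nugget of 0.2, the correlation at an infinitesimally small distance is close to 0.8 rather than 1. Under this parameterisation, r→0 removes all spatial correlation and r→∞ makes all observations perfectly correlated, matching the two failure modes described above.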
Autoregressive models (SAR and CAR) require a decision on the weighting scheme for the weights matrix, for which there is not always an a priori reason. The main options are row standardised coding (each row sums to unity, so that sums over all links add up to N), globally standardised coding (sums over all links add up to N), dividing globally standardised neighbours by their number (sums over all links add up to unity), or the variance-stabilising coding scheme proposed by Tiefelsdorf et al. (1999, pp. 167–168) (sums over all links add up to N). In our analysis, the row standardised coding was most often the superior choice, which is in line with other studies (Kissling and Carl 2007), but the binary and the variance-stabilising coding schemes also resulted in good models. SAR and CAR models did not differ much in our analysis. According to Cressie (1993), CAR models should be preferred in terms of estimation and interpretation, although SAR models are preferred in the econometric context (Anselin 1988). Either approach can be relatively slow for large data sets (sample size>10 000) due to the estimation of the determinant of (I–ρW) for each step of the iteration. Note that Bayesian CAR models do not require the computation of such a determinant and can therefore be particularly suitable for data on large lattices (Gelfand and Vounatsou 2003). For SAR models, identification of the correct model structure is recommended and model selection procedures can help to reduce bias (Kissling and Carl 2007). The Lagrange multiplier test (see supplementary material) can also help here. However, SAR error models generally perform better than SAR lag or even SAR mix models when tackling simulated data containing autocorrelation in lagged predictors (or response and predictors), as recently demonstrated in a more comprehensive assessment of SAR models using different spatially autocorrelated datasets (Kissling and Carl 2007).
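Three of the coding schemes can be sketched in Python, starting from a binary neighbour matrix. The labels "W", "C" and "U" are used here on the assumption that they match the style codes of the spdep package; the function name and the 3×3 rook-neighbour lattice are hypothetical, and the variance-stabilising scheme is omitted.

```python
import numpy as np

def code_weights(B, style):
    """Rescale a binary neighbour matrix B (no cells without neighbours).

    "W": row standardised       (each row sums to 1, total = n)
    "C": globally standardised  (total = n)
    "U": C divided by n         (total = 1)
    """
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    if style == "W":
        row_sums = B.sum(axis=1, keepdims=True)
        if np.any(row_sums == 0):
            raise ValueError("row standardisation is undefined for cells without neighbours")
        return B / row_sums
    if style == "C":
        return B * n / B.sum()
    if style == "U":
        return B / B.sum()
    raise ValueError("unknown style: " + repr(style))

# Hypothetical 3 x 3 lattice; cells are neighbours if their centres are 1 apart.
xy = np.array([(x, y) for y in range(3) for x in range(3)], dtype=float)
d = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1))
B = np.isclose(d, 1.0).astype(float)

W_row = code_weights(B, "W")
W_glob = code_weights(B, "C")
W_unit = code_weights(B, "U")
```

Row standardisation gives each link of a cell with few neighbours (e.g. at the edge of the lattice) more weight than a link of a well-connected cell, whereas the global schemes give every link the same weight.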
Generalised estimating equations require high storage capacity for solving the GEE score equation without clustering, as we used it in our fixed model. Application of the fixed model will therefore be limited for data with larger sample sizes, but the method is very suitable for missing data and non-lattice data. Storage requirements are considerably reduced by cluster models, such as our user-defined model. But clustering requires attention to three steps in the analysis: cluster size, within-cluster correlation structure and allocation of cells to clusters. To find the best cluster size for the analysis, we recommend investigating clusters of 2×2, 3×3 and 4×4. In real data, these cluster sizes have been sufficient to remove spatial autocorrelation (Carl and Kühn 2007a). Several different correlation structures should be computed initially, e.g. to allow for anisotropy. Finally, allocation of cells to clusters can start in different places. Depending on the starting point (e.g. top right or north west), cells will be placed in different clusters. Choosing different starting points will give the analyst an idea of the (in our experience limited) importance of this issue. Computing time is short.
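The dependence of cluster membership on the starting point can be made concrete with a small Python sketch. The function and its offset arguments are hypothetical and serve only to illustrate the allocation step; they are not the routines used in our analysis.

```python
import numpy as np

def assign_clusters(n_rows, n_cols, size, row_offset=0, col_offset=0):
    """Return a grid of cluster ids for square clusters of side `size`.

    The offsets shift the origin of the tiling, mimicking a different
    starting point; clusters cut off at the lattice edge are smaller.
    """
    rows = (np.arange(n_rows)[:, None] + row_offset) // size
    cols = (np.arange(n_cols)[None, :] + col_offset) // size
    n_col_blocks = (n_cols + col_offset + size - 1) // size
    return rows * n_col_blocks + cols

# Hypothetical 6 x 6 lattice tiled with 3 x 3 clusters from two starting points.
aligned = assign_clusters(6, 6, size=3)
shifted = assign_clusters(6, 6, size=3, row_offset=1, col_offset=1)
```

On this lattice the aligned tiling gives four complete clusters of nine cells, whereas the shifted tiling gives nine clusters, most of them incomplete. Refitting the model under several such tilings shows how sensitive the estimates are to the allocation.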
Autocorrelation in a predictive setting
Spatial autocorrelation may arise for a number of ecological reasons, including external environmental and historical factors limiting the mobility of organisms, intrinsic organism-specific dispersal mechanisms and other behavioural factors causing the spatial aggregation of populations and species in the landscapes. In addition to these factors, spatial autocorrelation can also be caused by observer bias and differences in sampling schemes and sampling effort. Overall, spatial autocorrelation occurs at all spatial scales from the micrometre to hundreds of kilometres (Dormann 2007b), possibly for a whole suite of reasons. Since these reasons are mostly unknown, one cannot readily derive a spatial correlation structure for an entirely new, unobserved region. Augustin et al. (1996) and others (Hoeting et al. 2000, Teterukovskiy and Edenius 2003, Reich et al. 2004) have, however, successfully used the Gibbs sampler (Casella and George 1992) to derive predictions for unobserved areas within the study region (interpolation), and He et al. (2003) extrapolated autologistic predictions through time to examine possible effects of climate change.
Interpolation, i.e. the prediction of values within the parameter and spatial range, can be achieved by several of the presented methods. An advantage of GLS is that the spatially correlated error can be predicted for sites where no observations are available, based on the values of observed sites (e.g. kriging). The same holds true for the spatial GLMM. For autocovariate regression and spatial eigenvector mapping, in contrast, interpolation is more complicated, requiring use of the aforementioned Gibbs sampler.
When models are projected into new geographic areas or time periods the handling of spatial autocorrelation becomes more problematic (if not impossible). Extrapolation in time, for example, is necessarily uncertain, particularly if biotic interactions (and with them spatial autocorrelation patterns) could change as each species responds differentially to climate change. However, most of the statistical methods used for prediction in time neglect important processes such as migration, dispersal, competition, predation (Pearson and Dawson 2003, Dormann 2007c), or at least assume many of them to remain constant. One might therefore argue that, while taking the autocorrelation structure as constant adds one more assumption, the use of spatial parameters at least helps to derive better models. Extrapolation in space, in contrast, is not recommended: the variance-covariance matrix parameterised in GLS approaches, for example, may look very different in other regions, even for the same organism. Hence, extrapolation can only be based on the coefficient estimates, not on the spatial component of the model. Extrapolation is further complicated by model complexity. The use of non-linear predictors and interactions between environmental variables will increase model fit, but compromises transferability of models in time and space (Beerling et al. 1995, Sykes 2001, Gavin and Hu 2006). Our study therefore did not compare methods’ abilities to either make predictions to new geographic areas or extrapolate beyond the range of environmental parameters.
Our review focused on frequentist methods. Bayesian methods, which allow prior beliefs about data to be incorporated in the calculation of expected values, offer an alternative. Experience and a good understanding of the influence of prior distributions and convergence assessment of Markov chains are crucial in Bayesian analyses. Thus, if the question of interest can be addressed using more robust, less computationally intensive methods, there is no real need to apply the “Bayesian machinery” (Brooks 2003). The spatial analyses as presented in this paper can be done straightforwardly using non-Bayesian methods. However, Bayesian methods for the analyses of species distribution data are more flexible; they can be more easily extended to include more complex structures (Latimer et al. 2006). Models can for example be extended to a multivariate setting when several (correlated) counts of different species in each grid cell are to be modelled, or when both count and normally distributed data are to be modelled within the same framework (Thogmartin et al. 2004, Kühn et al. 2006). Bayesian methods are also a generally more suitable tool for inference in data sets with many missing values, or when accounting for detection probabilities (Gelfand et al. 2005, Kühn et al. 2006).
In this study, we introduced a wide range of statistical tools to deal with spatial autocorrelation in species distribution data. Unfortunately, none of these tools directly represents dynamic aspects of ecological reality (e.g. dispersal, species interaction): all the methods examined remain phenomenological rather than mechanistic. Therefore they are unable to disentangle stochastic and process-introduced spatial autocorrelation. Disentangling these sources of spatial autocorrelation in the data would be particularly important for the analysis of species that are not at equilibrium with their environmental drivers (e.g. newly introduced species expanding in range or species that have undergone population declines due to overexploitation). Moreover, it would be desirable to extend the statistical approaches used here to model multivariate response variables, such as species composition (see Kühn et al. 2006, for an example). Similarly, presence-only data, as commonly found for museum specimens, cannot be analysed with the above methods, nor are we aware of any method suitable for such data. While in principle it is possible to incorporate temporal and/or phylogenetic components into species distribution models (e.g. into GEEs, GLMMs and Bayes), this has not yet been attempted. It also would be desirable to have methods available that allow for the strengths of spatial autocorrelation to vary in space (non-stationarity), since stationarity is a basic and strong assumption of all the methods used here (except perhaps SEVM). Finally, the issue of variable selection under spatial autocorrelation has received virtually no coverage in the statistical literature, and hence the effect of spatial autocorrelation on the identification of the best-fitting model, or candidate set of most likely models, still remains unclear.
Data were created by CFD, GC and FMS. Analyses and manuscript sections describing each method were carried out as follows: autocovariate regression: JMM; SEVM: PRPN and RB; GAM: JB, RO and CFD; GLS: BR and WJ; CAR: BS; SAR: WDK; GLMM: FMS and RGD; GEE: GC and IK. Further analyses, figure and table preparation and initial drafting were carried out by CFD. All authors contributed to writing the final manuscript.
We also would like to thank Pierre Legendre, Carsten Rahbek, Alexandre Diniz-Filho, Jack Lennon and Thiago Rangel for comments on an earlier version. This contribution is based on the international workshop “Analysing Spatial Distribution Data: Principles, Applications and Software” (GZ 4850/191/05) funded by the German Science Foundation (DFG), awarded to CFD. CFD acknowledges funding by the Helmholtz Association (VH-NG-247). JMM's work is supported by the Lenfest Ocean Program. MBA, GC, IK, RO & BR acknowledge funding by the European Union within the FP 6 Integrated Project “ALARM” (GOCE-CT-2003-506675). GC acknowledges a stipend from the federal state “Sachsen-Anhalt”, Ministry of Education and Cultural Affairs. WDK & IK acknowledge support from the “Virtual Inst. for Macroecology”, funded by the Helmholtz Association (VH-VI-153 Macroecology). RGD was supported by NERC (grant no. NER/O/S/2001/01257). PRPN's research was supported by NSERC.