
Spatial statistics in practice

Lecture #2: quantitative regionalization and cluster detection, with special reference to local statistics. Spatial statistics in practice Center for Tropical Ecology and Biodiversity, Tunghai University & Fushan Botanical Garden. Topics for today’s lecture.

amadis

Presentation Transcript


  1. Lecture #2: quantitative regionalization and cluster detection, with special reference to local statistics. Spatial statistics in practice. Center for Tropical Ecology and Biodiversity, Tunghai University & Fushan Botanical Garden

  2. Topics for today’s lecture • Multivariate grouping, and location-allocation modeling. • Going from the global to the local: variability and heterogeneity. • Impacts of spatial autocorrelation on histograms. • The LISA and Getis-Ord statistics. • Cluster analysis: multivariate analysis, cluster detection, and spider diagrams. • An overview of geographic and space-time clusters. • Regression diagnostics and geographic clusters

  3. Multivariate grouping goals • If groups are unknown, to identify the latent natural groups of areal units • If groups are known, to assess similarities and differences among the groups • To determine the group centroids and groups of geographical points that result from minimizing some function of standard distance

  4. Conventional cluster analysis distances to minimize • Single linkage – distances are measured between pairs of closest (nearest neighbor) areal units, one from each of two clusters, in attribute space • Complete linkage – distances are measured between pairs of most distant (furthest neighbor) areal units, one from each of two clusters, in attribute space • This criterion often gives the best grouping results

  5. Average linkage – distances are measured between all possible pairs of areal units, one from each of two clusters, in attribute space, and then averaged • Centroid method – squared distances are measured between each areal unit and all cluster means, in attribute space • Ward’s algorithm – based upon ANOVA, areal units are allocated to clusters in order to minimize within cluster variances, and maximize between cluster variance • This criterion relates to location-allocation
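The single, complete, and average linkage criteria on slides 4 and 5 can be sketched in a few lines of plain Python. This is a minimal, brute-force illustration, not the procedure used in the lecture: the function names and the toy one-attribute data are assumptions made here, and the loop recomputes every inter-cluster distance at each step rather than using an efficient update.

```python
import math
from itertools import combinations

def cluster_distance(c1, c2, points, method):
    """Distance between two clusters (lists of areal-unit indices) in attribute space."""
    pairwise = [math.dist(points[i], points[j]) for i in c1 for j in c2]
    if method == "single":      # closest pair, one unit from each cluster
        return min(pairwise)
    if method == "complete":    # most distant pair, one unit from each cluster
        return max(pairwise)
    if method == "average":     # mean over all between-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown method: {method}")

def agglomerate(points, k, method="complete"):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(
                clusters[ab[0]], clusters[ab[1]], points, method
            ),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Hypothetical areal units with one attribute each and steadily widening gaps.
chain = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]
single_groups = agglomerate(chain, 2, "single")
complete_groups = agglomerate(chain, 2, "complete")
```

On this toy chain, single linkage keeps absorbing the next nearest unit and leaves the last unit on its own, whereas complete linkage yields more compact groups. That contrast is one reason the complete linkage criterion is often preferred.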

  6. Contemporary cluster analysis criteria • One- or two-stage density – areal unit groupings are based upon nonparametric probability density estimation (kth nearest neighbor, uniform kernel, Wong’s hybrid); utilizes single linkage • EML (equal variance maximum likelihood) – areal unit groupings are based upon maximizing the likelihood of mixtures of identical spherical multivariate normal distributions, possibly with unequal mixing proportions (i.e., sampling probabilities) • Flexible-beta – areal unit groupings are based upon a weighting involving scalar beta, which usually falls between 0 and -1 (a common default value is -0.25, with -0.5 appearing to be more suitable for data with many outliers) • McQuitty’s method – areal unit groupings are based upon weighted average linkage, the weighted pair-group arithmetic averages • Gower’s median method – areal unit groupings are based upon weighted pair-group centroids, where distance may or may not be squared
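Three of the criteria on slide 6 (flexible-beta, McQuitty's method, and Gower's median method) are commonly written as special cases of the Lance-Williams distance update. The slide does not show this formula; the notation below (clusters i and j merged, k any other cluster, coefficients α, β, γ) is an assumption introduced here for illustration.

```latex
% Lance-Williams update: distance from cluster k to the merged cluster (i \cup j)
d(k,\, i \cup j) = \alpha_i\, d(k,i) + \alpha_j\, d(k,j) + \beta\, d(i,j)
                   + \gamma\, \lvert d(k,i) - d(k,j) \rvert

% Flexible-beta:          \alpha_i = \alpha_j = (1-\beta)/2, \quad \gamma = 0
%                         (e.g. \beta = -0.25 gives \alpha_i = \alpha_j = 0.625)
% McQuitty (WPGMA):       \alpha_i = \alpha_j = 1/2, \quad \beta = 0,    \quad \gamma = 0
% Gower's median (WPGMC): \alpha_i = \alpha_j = 1/2, \quad \beta = -1/4, \quad \gamma = 0
%                         (conventionally applied to squared distances)
```

Read this way, McQuitty's method is the flexible-beta update with β = 0, and more negative β values space the clusters further apart.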

  7. Ward’s algorithm and location-allocation

  8. Clustering with PCA/FA • Although PCA & FA are used most frequently to deal with multicollinearity across attribute variables (R-mode), these techniques also can be used to handle redundant information across areal units (Q-mode; e.g., the eigenfunctions of geographic weight matrix C) • Linear combinations extracted from matrix $(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)\,\mathbf{C}\,(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)$ or $(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)\,\mathbf{D}^{*}\,(\mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n)$ identify the range of possible distinct map patterns (i.e., uncorrelated and orthogonal)

  9. Legendre et al. method • A comparison of the two procedures is in • Links directly to the semivariogram plot • D* is a truncated distance-based matrix, where the truncation is determined by the length of a minimum spanning tree articulating the set of locations

  10. Properties • $E_j$ is the map pattern with spatial autocorrelation level $MC_j$ • The extreme eigenvalues define $MC_{max}$ and $MC_{min}$ (not necessarily 1, -1) • As eigenvalues go from the largest positive to the largest negative value, map patterns become more fragmented • Positive eigenvalues denote: • Global trends with relatively large values • Regional trends with intermediate values • Local trends with relatively small values
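The link between an eigenvalue and the MC of its eigenvector map pattern can be written out in one step. This is a sketch under stated assumptions: M, λ_j, and x are symbols introduced here, E_j is taken to be a unit-length eigenvector of MCM with a nonzero eigenvalue, and C is the binary geographic weight matrix from slide 8.

```latex
% Centering projector and Moran coefficient of a map x
\mathbf{M} = \mathbf{I} - \mathbf{1}\mathbf{1}^{T}/n, \qquad
MC(\mathbf{x}) = \frac{n}{\mathbf{1}^{T}\mathbf{C}\mathbf{1}}\,
                 \frac{\mathbf{x}^{T}\mathbf{M}\mathbf{C}\mathbf{M}\mathbf{x}}
                      {\mathbf{x}^{T}\mathbf{M}\mathbf{x}}

% For an eigenvector: MCM E_j = \lambda_j E_j, with E_j^T E_j = 1 and M E_j = E_j
MC_j = MC(\mathbf{E}_j)
     = \frac{n}{\mathbf{1}^{T}\mathbf{C}\mathbf{1}}\,
       \frac{\lambda_j\, \mathbf{E}_j^{T}\mathbf{E}_j}{\mathbf{E}_j^{T}\mathbf{E}_j}
     = \frac{n\,\lambda_j}{\mathbf{1}^{T}\mathbf{C}\mathbf{1}}
```

Because the scaling factor n / 1ᵀC1 depends on the tessellation, the extreme values MC_max and MC_min need not equal 1 and -1.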

  11. Selected ideal map patterns [Figure: six map panels, labeled in order: global, regional, regional, local; MC ~ 1, MC = 0.9, MC = 0.7, MC = -0.6, MC = 0.5, MC = 0.25]

  12. SA impacts on Gaussian RVs • Principal impact: variance inflation • Heavier tails, increased kurtosis [Figure: standard normal curves for SA map patterns with MC = 1.12, GR = 0.08; MC = 1.00, GR = 0.18; MC = 0.28, GR = 0.77; MC = 0.00, GR = 1.00; MCmax = 1.18]

  13. Unstandardized normal curve [Figure panels: map pattern generated; autoregressive generated] • Kurtosis increases from 0.01 (roughly 0) to 0.73. The variance of kurtosis is 24/n. Therefore, here spatial autocorrelation has induced increased relative peakedness (from the sign of the kurtosis statistic) whose z = 7.3. • Kurtosis increases from 0.04 (roughly 0) to 2.79. The variance of kurtosis is 24/n. Therefore, here spatial autocorrelation has induced increased relative peakedness (from the sign of the kurtosis statistic) whose z = 27.8.
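The z-scores quoted on slide 13 follow from dividing the kurtosis statistic by the square root of its variance, 24/n. The slide does not state n; the value n = 2400 used below is an inference made here (it reproduces z = 7.3 from a kurtosis of 0.73), not a figure given in the lecture, and the function name is hypothetical.

```python
import math

def kurtosis_z(excess_kurtosis: float, n: int) -> float:
    """z-score of an excess kurtosis statistic, using the large-sample variance 24/n."""
    return excess_kurtosis / math.sqrt(24.0 / n)

# ASSUMPTION: n is not given on the slide; 2400 is inferred from 0.73 / 7.3 = sqrt(24/n).
n_assumed = 2400

z_first = kurtosis_z(0.73, n_assumed)   # first case on the slide
z_second = kurtosis_z(2.79, n_assumed)  # second case on the slide
```

With this assumed n the second case works out to about 27.9, close to the 27.8 shown on the slide; the small gap is presumably rounding in the reported kurtosis.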

  14. Typical case: MC/MCmax = 0.6 map pattern MC = 0.61 GR = 0.50 map pattern MC = 0.80 GR = 0.34 E(MC) = -0.00042 E(GR) = 1

  15. Transformations to normal approximations Torturing the data – conforming to a bell-shaped curve • Box-Cox power transformations • Manly’s exponential transformation • Percentage adjustments (also arcsine)
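The three families of transformations named on slide 15 have simple closed forms. The sketch below uses the textbook definitions; the function and parameter names (`lam`, `gamma`) are chosen here for illustration, and in practice the exponent would be estimated from the data rather than fixed by hand.

```python
import math

def box_cox(y: float, lam: float) -> float:
    """Box-Cox power transformation; defined for y > 0, with the log as the lam = 0 limit."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values; add a shift first")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def manly(y: float, gamma: float) -> float:
    """Manly's exponential transformation; unlike Box-Cox it accepts negative y."""
    if gamma == 0:
        return y
    return math.expm1(gamma * y) / gamma  # (exp(gamma * y) - 1) / gamma

def arcsine_sqrt(p: float) -> float:
    """Arcsine square-root transformation for a proportion p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.asin(math.sqrt(p))

# Hypothetical usage: a log transform with an additive shift is the lam = 0 case.
shifted_log = box_cox(100.0 + 17.5, 0.0)
```

A shifted logarithm such as LN(elevation + 17.5), which appears later in the lecture, is the λ = 0 member of the Box-Cox family applied after adding a constant.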

  16. China data example: births/females

  17. China data example: pop/area

  18. China data example: births/deaths

  19. A China example: % F15-44

  20. Constant variance • Attribute: variable transformations often stabilize the variance of a variable across its measurement range • Mean/median split gives a heuristic assessment of constant variance (equal variability of high and low values)

  21. Constant variance • Geographic: variable transformations often stabilize the variance of a variable across the geographic landscape over which it is distributed • Quadrants of the plane/established areal unit groupings give a heuristic assessment of constant variance across a geographic landscape Plane quadrants provinces
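Both heuristic checks on slides 20 and 21 amount to splitting the observations into groups and comparing group variances. The sketch below is one way to do that, with function names and toy data invented here; it splits at the median rather than the mean and defines quadrants around the mean coordinate, which are choices made for this illustration.

```python
import math
import statistics

def median_split_variances(values):
    """Variance of the low half and of the high half of a variable (attribute check)."""
    med = statistics.median(values)
    low = [v for v in values if v <= med]
    high = [v for v in values if v > med]
    return statistics.variance(low), statistics.variance(high)

def quadrant_variances(units):
    """Variance of a variable within each quadrant of the plane (geographic check).

    units: list of (x, y, value); quadrants are defined around the mean coordinate.
    """
    cx = statistics.mean([u[0] for u in units])
    cy = statistics.mean([u[1] for u in units])
    groups = {}
    for x, y, value in units:
        key = ("N" if y >= cy else "S") + ("E" if x >= cx else "W")
        groups.setdefault(key, []).append(value)
    # A variance needs at least two units in the quadrant.
    return {k: statistics.variance(v) for k, v in groups.items() if len(v) > 1}

# Hypothetical data in which the spread grows with the level of the variable.
raw = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
raw_low, raw_high = median_split_variances(raw)
log_low, log_high = median_split_variances([math.log(v) for v in raw])

toy_units = [(-1, -1, 1.0), (-2, -2, 3.0), (1, 1, 10.0), (2, 2, 10.0),
             (-1, 1, 5.0), (-2, 2, 7.0), (1, -1, 0.0), (2, -2, 4.0)]
by_quadrant = quadrant_variances(toy_units)
```

In the toy data the high half is ten times the low half, so the raw variances differ by a factor of 100 while the log-transformed variances coincide, which is the stabilizing effect the slides describe.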

  22. Non-normal random variables (RVs) • Poisson: the mean equals the variance (built-in heterogeneity) • overdispersion: the variance is greater than the mean • assuming a gamma-distributed mean results in a negative binomial random variable • binomial: variance equals (1-p) times the mean [i.e., Np(1-p)] • overdispersion: the variance is greater than Np(1-p) • employ a quasi-likelihood estimation
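The step from a gamma-distributed Poisson mean to negative binomial overdispersion can be shown with the law of total variance. The symbols μ (overall mean) and k (gamma shape parameter) are notation assumed here; the slide does not name them.

```latex
% Poisson counts with a gamma-distributed mean
Y \mid \lambda \sim \mathrm{Poisson}(\lambda), \qquad
\lambda \sim \mathrm{Gamma}(\text{shape } k,\ \text{mean } \mu)
\;\Rightarrow\; \mathrm{Var}(\lambda) = \mu^{2}/k

% Law of total variance
\mathrm{E}(Y) = \mu, \qquad
\mathrm{Var}(Y) = \mathrm{E}[\mathrm{Var}(Y \mid \lambda)] + \mathrm{Var}[\mathrm{E}(Y \mid \lambda)]
               = \mu + \mu^{2}/k \;>\; \mu

% Binomial benchmark from the slide
\mathrm{E}(Y) = Np, \qquad \mathrm{Var}(Y) = Np(1-p) = (1-p)\,\mathrm{E}(Y)
```

As k grows, the extra term μ²/k vanishes and the equidispersed Poisson case (variance equal to the mean) is recovered.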

  23. Spatial autocorrelation impacts on Poisson RVs weak positive spatial autocorrelation iid overdispersion occurs when: var(Y) > strong positive spatial autocorrelation

  24. Impacts of typical spatial autocorrelation levels hexagonal tessellation Poissonness plots irregular tessellation

  25. Spatial autocorrelation impacts on binomial RVs • variance increases • shape goes to uniform, then to sinusoidal [Figure panels: global; autoregressive; global & regional; global & regional & local]

  26. Going from the global to the local Paralleling statistics concerning data outliers, and leverage and influential points, spatial heterogeneity in georeferenced data is addressed by focusing on individual areal units. The emphasis shift is from global trends to local exceptions, to better understand local deviations from global model descriptions by exploiting tensions between global trends and informative local details latent in empirical data: • adaptation of conventional diagnostic statistics (e.g., Unwin and Wrigley, 1987) • spatializing existing statistical techniques (e.g., Fotheringham et al., 2002) • Anselin’s (1995) seminal paper about indices of spatial association (i.e., LISA statistics) • Getis and Ord’s (1992, 1995) Gi and Gi* statistics

  27. Goals of global versus local analysis • Identify clustering • Identify particular clusters (significant local clusters in the absence of global autocorrelation) • Distinguish between homogeneity and heterogeneity (e.g., spatial outliers - highs surrounded by lows, and vice versa) • Identify hot/cold spots • Analyze local instability (local deviations from global pattern of spatial autocorrelation)

  28. LISA: local indicators of spatial autocorrelation selevation area competition
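A local Moran statistic of the kind Anselin (1995) proposed can be sketched as follows. This is a minimal illustration assuming binary contiguity weights and the common scaling by the second moment of the deviations; the function names, the neighbor-list representation, and the toy chain of six areal units are all invented here, and no significance test is included.

```python
def local_moran(values, neighbors):
    """Local Moran I_i and Moran-scatterplot quadrant for each areal unit.

    neighbors[i] lists the indices of units contiguous to unit i (binary weights).
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n  # second moment of the deviations
    local, quadrant = [], []
    for i in range(n):
        lag = sum(z[j] for j in neighbors[i])  # sum of neighboring deviations
        local.append(z[i] * lag / m2)
        quadrant.append(("H" if z[i] > 0 else "L") + ("H" if lag > 0 else "L"))
    return local, quadrant

def global_moran(values, neighbors):
    """Global Moran coefficient for the same binary weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(len(nb) for nb in neighbors)  # sum of all weights
    cross = sum(z[i] * z[j] for i in range(n) for j in neighbors[i])
    return (n / s0) * cross / sum(d * d for d in z)

# Hypothetical chain of six units: high values at one end, low values at the other.
values = [10.0, 9.0, 8.0, 1.0, 2.0, 1.0]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
local_i, quads = local_moran(values, neighbors)
s0 = sum(len(nb) for nb in neighbors)
mc = global_moran(values, neighbors)
```

With this scaling the local values sum to the global coefficient times the sum of the weights, which is the sense in which the statistics decompose the global measure. Both the high-high end and the low-low end of the chain produce positive local values, so the sign alone cannot separate the two kinds of cluster.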

  29. Goal: to assess spatial correlation heterogeneity ANOVA F = 2845 Pr(>X2) = 0.4 LISA z-score

  30. The randomization perspective

  31. Conditional randomization

  32. LISA for PR LN(elevation + 17.5) slope is unstandardized MC Pr(LISA) Bonferroni Sidak MC = 0.51; GR = 0.49
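Because one LISA test is run per areal unit, slide 32 lists Bonferroni and Sidak adjustments to the per-test significance level. The sketch below shows the two standard formulas; the function names and the unit count in the usage lines are hypothetical and are not taken from the lecture's data.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level that bounds the overall error rate by alpha."""
    return alpha / n_tests

def sidak_alpha(alpha: float, n_tests: int) -> float:
    """Per-test level giving overall error rate alpha if the tests were independent."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)

# Hypothetical usage: one local test per areal unit on a map of 50 units.
n_units = 50
per_test_bonferroni = bonferroni_alpha(0.05, n_units)
per_test_sidak = sidak_alpha(0.05, n_units)
```

The Sidak level is always slightly less strict than the Bonferroni level. Neither accounts for the overlap in neighbors between adjacent local statistics, so both are best read as conservative guides.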

  33. LISA maps • Cannot distinguish between H-H and L-L clusters • Conventional clustering fails to preserve contiguity significant LISA

  34. … with contiguity proclivity Clustering geocoding coordinate pair coupled with zLISA values Clustering geocoding coordinate pair with frequencies proportional to zLISA values

  35. Getis-Ord Gi [ Gi* includes i (i.e., j = i)] • contiguity based upon distance band defined by dr • dr may be obtained from a semivariogram plot • one statistic for each areal unit • Gi(dr) > 0 signifies clustering of high values • Gi(dr) < 0 signifies clustering of low values • LISA fails to make this particular distinction
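Since the slide interprets the sign of Gi, it presumably refers to the standardized (z-score) form of the statistic. The sketch below follows that form as it is usually written for binary distance-band weights; the function name, the `include_self` flag, and the toy coordinates are assumptions made here, and the exact variant used in the lecture is not shown in the transcript.

```python
import math

def getis_ord(values, coords, d, include_self=False):
    """Standardized Getis-Ord statistic per areal unit with a binary distance band d.

    include_self=False gives Gi (unit i excluded); include_self=True gives Gi*.
    """
    n = len(values)
    scores = []
    for i in range(n):
        # Units entering the mean and variance: all n for Gi*, all but i for Gi.
        pool = [values[j] for j in range(n) if include_self or j != i]
        m = len(pool)
        mean = sum(pool) / m
        s = math.sqrt(sum(v * v for v in pool) / m - mean ** 2)
        nbrs = [j for j in range(n)
                if (include_self or j != i)
                and math.dist(coords[i], coords[j]) <= d]
        w = len(nbrs)  # binary weights: sum of weights equals sum of squared weights
        numerator = sum(values[j] for j in nbrs) - w * mean
        denominator = s * math.sqrt((m * w - w * w) / (m - 1))
        scores.append(numerator / denominator if denominator > 0 else float("nan"))
    return scores

# Hypothetical units on a line, one distance unit apart; band d = 1.
values = [10.0, 9.0, 8.0, 1.0, 2.0, 1.0]
coords = [(float(k), 0.0) for k in range(6)]
gi = getis_ord(values, coords, d=1.0)
gi_star = getis_ord(values, coords, d=1.0, include_self=True)
```

In this toy example the unit at the high end of the line gets a positive score and the unit at the low end a negative one, which is the high versus low distinction the slide attributes to Gi.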

  36. A Gi-based analysis: complete linkage Gi Gi clusters

  37. A relationship between LISA and Gi for the same geographic connectivity matrix C The quadratic trend is why LISA cannot distinguish between HH and LL clusters, while Gi can.

  38. Geographic & space-time clusters: an overview • Global cluster tests search for spatial clusters anywhere in a study area but do not necessarily identify where the clusters occur, and are used to identify departures from spatial randomness when overall spatial pattern is considered. • Local cluster tests identify locations at which there is some excess/deficit—a hot/cold spot—anywhere within a study area. • Focused cluster tests determine whether there is an excess near a pre-specified location, called a focus, and are used to detect clustering near, say, putative hazards (e.g., a toxic waste dump).

  39. Cluster detection techniques

  40. Spider diagrams • allocation to AAR centroids • allocation to cluster (U, V, z-LISA) centroids

  41. Regression diagnostics: each observation’s influence on parameter estimates and predicted values • PRESS – global measure that should roughly equal the mean squared error (MSE) for a trend line (equivalent to cross-validation) • Leverage – measures degree of influence of areal unit CzY,i value on an MC trend line (marked: > 2/n) • Studentized residual – measures whether ith areal unit causes a significant shift in its corresponding regression intercept (i.e., is an outlier; marked: > 2) • Cook’s D – measures influence of ith areal unit on an MC estimate (analogous to DFFITS; marked: > 2)
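For a bivariate trend line such as the one in a Moran scatterplot, the four diagnostics on slide 41 have closed forms that need no matrix algebra. The sketch below uses the standard simple-regression formulas; the function name, the returned dictionary keys, and the toy data are assumptions made here, and the marking thresholds quoted on the slide are deliberately not applied.

```python
import math

def simple_regression_diagnostics(x, y):
    """Leverage, studentized residuals, Cook's D, and PRESS for y = a + b*x."""
    n, p = len(x), 2  # p = number of estimated coefficients (intercept and slope)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in resid)
    mse = sse / (n - p)
    leverage = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    # PRESS: sum of squared leave-one-out prediction errors.
    press = sum((e / (1.0 - h)) ** 2 for e, h in zip(resid, leverage))
    internal = [e / math.sqrt(mse * (1.0 - h)) for e, h in zip(resid, leverage)]
    # Externally studentized residual: error variance re-estimated without unit i.
    external = [r * math.sqrt((n - p - 1) / (n - p - r * r)) for r in internal]
    cooks_d = [r * r * h / (p * (1.0 - h)) for r, h in zip(internal, leverage)]
    return {"slope": slope, "intercept": intercept, "sse": sse, "press": press,
            "leverage": leverage, "studentized": external, "cooks_d": cooks_d}

# Hypothetical data: five nearly collinear points plus one outlying unit at the end.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 12.0]
diag = simple_regression_diagnostics(x, y)
most_influential = max(range(len(x)), key=lambda i: diag["cooks_d"][i])
```

For a Moran scatterplot, x would hold the standardized attribute values and y their spatial lags, so the fitted slope plays the role of the MC trend line and a unit with a large Cook's D is a candidate spatial outlier.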

  42. Moran scatterplot for LN(elevation + 17.5) marked values Barranquitas is a spatial outlier, again! mean of C1

  43. Spatial autocorrelation in diagnostic statistics: eigenvector covariates MC(E2) = 1.04926 Dark red: very high Light red: high Gray: medium Light green: low Dark green: very low MC = 1.04926 R2

  44. What have we learned today?
