A statistical perspective on quality in GEOSS: challenges and opportunities

Presentation Transcript


  1. A statistical perspective on quality in GEOSS: challenges and opportunities Dan Cornford d.cornford@aston.ac.uk Aston University Birmingham, UK

  2. What is quality?
  • Quality has several facets, but the key one in my view is a quantitative statement of the relation of a value to reality
  • Quality is challenging to define:
    • ISO 9000: “degree to which a set of inherent characteristics fulfils requirements”
    • International Association for Information and Data Quality: “Information Quality is not just ‘fitness for purpose;’ it must be fit for all purposes”
    • Oxford English Dictionary: “the standard of something as measured against other things of a similar kind; the degree of excellence of something”

  3. The key aspects of data quality
  An exhaustive list might include: Accuracy, Integrity, Precision, Objectivity, Completeness, Conciseness, Redundancy, Validity, Consistency, Timeliness, Accessibility, Utility, Usability, Flexibility, Traceability.
  There is no universal agreement, but we propose:
  • accuracy: the value correctly represents the real world
  • completeness: degree of data coverage for a given region and time
  • consistency: are the rules to which the data should conform met?
  • usability: how easy is it to access and use the data?
  • traceability: can one see how the results have arisen?
  • utility: what is the user's view of the data's value to their use case?

  4. Accuracy, and what is reality for GEOSS?
  • Accuracy is the most important quality aspect
  • This is not a talk about philosophy ...
  • However, we must define objects in the real world using mental concepts
  • I view reality as a set of continuous space-time fields of discrete- or continuous-valued variables
  • The variables represent different properties of the system, e.g. temperature, land cover
  • A big challenge is that reality varies over almost all space and time scales, so we need to be precise about these scales when defining reality

  5. Relating observations to reality – accuracy
  • Assume we can define reality precisely
  • I argue that the most useful information I can have about an observation is the relation of this observation to reality
  • Express this relation mathematically as y = h(x), where:
    • y is my observation
    • x is reality (not known)
    • h() is my sensor / forward / observation model that maps reality to what I can observe
  • I can almost never write this exactly, because of various sources of uncertainty in my observation and in h, and perhaps variations in x, so I must write y = h(x) + ε(x)
  • ε(x) is the (irreducible) observation uncertainty (a minimal sketch follows below)
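
  To make the notation concrete, here is a minimal sketch (not from the talk): a hypothetical averaging-kernel observation operator h and additive zero-mean Gaussian ε; both the operator and the noise level are illustrative assumptions.

  ```python
  import numpy as np

  rng = np.random.default_rng(42)

  def h(x):
      """Illustrative observation operator: the sensor sees a weighted average
      of the true field rather than a point value (weights are assumed)."""
      weights = np.array([0.2, 0.5, 0.3])
      return weights @ x

  x_true = np.array([287.1, 288.4, 289.0])   # hypothetical 'reality' (e.g. temperature, K)

  sigma = 0.5                                # assumed observation error std. dev.
  y = h(x_true) + rng.normal(0.0, sigma)     # y = h(x) + eps, eps ~ N(0, sigma^2)
  print(f"h(x) = {h(x_true):.2f}, observed y = {y:.2f}")
  ```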

  6. Dealing with the unknown
  • Uncertainty is a fundamental part of science
  • Managing uncertainty is at the heart of data accuracy
  • There are several frameworks for handling uncertainty:
    • frequentist probability (requires repeatability)
    • subjective Bayesian probability (personal, belief-based)
    • fuzzy methods (more relevant to semantics)
    • imprecise probabilities (when you don't have full distributions)
    • belief theory and other multi-valued representations
  • Choosing one is a challenge
  • I believe that subjective Bayesian approaches are a good starting point

  7. What I would really, really want
  • You supply an observation y
  • I would want to know:
    • h(x), the observation function
    • ε(x), the uncertainty about y, defining p(y|x)
  • because then I can work out: p(x|y) ∝ p(y|x) p(x)
  • This is the essence of Bayesian (probabilistic) logic – updating my beliefs (a worked sketch follows below)
  • I also need to know how you define x – at what spatial and temporal scales
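
  As an illustration of the update p(x|y) ∝ p(y|x) p(x), here is a minimal sketch assuming the simplest case: a scalar x, a Gaussian prior, and a direct observation y = x + ε with Gaussian ε (all numbers are invented).

  ```python
  def gaussian_update(mu_prior, var_prior, y, var_obs):
      """Conjugate Bayesian update for scalar x: prior N(mu_prior, var_prior),
      likelihood p(y|x) = N(x, var_obs) from y = x + eps. Returns the
      posterior mean and variance of p(x|y)."""
      var_post = 1.0 / (1.0 / var_prior + 1.0 / var_obs)   # precisions add
      mu_post = var_post * (mu_prior / var_prior + y / var_obs)
      return mu_post, var_post

  # Prior belief about x, and one observation with a stated uncertainty:
  mu_post, var_post = gaussian_update(mu_prior=288.0, var_prior=4.0,
                                      y=286.5, var_obs=0.25)
  # The posterior is pulled strongly towards the (more precise) observation.
  ```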

  8. Why does p(y|x) matter?
  • Imagine I have other observations of x (reality)
  • How do I combine these observations rationally and optimally?
  • The solution is to use Bayesian updates – but I need to know the joint structure of all the errors!
  • This is optimistic (unknowable?); however, conditioned on reality, it is likely that for many observations the errors will be uncorrelated (a minimal sketch of combining independent observations follows below)
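
  A minimal sketch of the "combine rationally and optimally" step, under the assumptions of a flat prior and independent Gaussian errors with known variances: the Bayesian answer then reduces to inverse-variance (precision) weighting.

  ```python
  import numpy as np

  def combine_independent(ys, var_obs):
      """Posterior for x given observations y_i = x + eps_i with independent
      zero-mean Gaussian errors and a flat prior: precision-weighted mean."""
      ys, var_obs = np.asarray(ys, float), np.asarray(var_obs, float)
      prec = 1.0 / var_obs
      var_post = 1.0 / prec.sum()
      mu_post = var_post * (prec * ys).sum()
      return mu_post, var_post

  # Three sensors observing the same quantity with very different stated accuracies:
  mu, var = combine_independent(ys=[286.9, 287.4, 288.2], var_obs=[0.25, 1.0, 4.0])
  # Correlated errors would require the full joint covariance instead.
  ```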

  9. What about a practical example?
  • Consider weather forecasting:
    • essentially an initial value problem – we need to know about reality x at an initial time
    • thus if we can define p(x|y) at the start of our forecast we are good to go
    • this is what is called data assimilation; combining different observations requires a good uncertainty characterisation, p(y|x), for each observation
  • In data assimilation we also need to know about the observation model y = h(x) + ε(x) and its uncertainty (a minimal analysis-step sketch follows below)
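
  A minimal sketch of one data assimilation analysis step, assuming a linear observation operator H and Gaussian errors (the textbook best-linear-unbiased / Kalman update); the state, covariances and numbers are invented for illustration.

  ```python
  import numpy as np

  # Background (prior) state and covariance, one observation, linear H:
  x_b = np.array([288.0, 290.0])              # prior state estimate
  B   = np.array([[1.0, 0.3],
                  [0.3, 1.0]])                # background error covariance
  H   = np.array([[0.5, 0.5]])                # observe the mean of the two components
  R   = np.array([[0.2]])                     # observation error covariance, from p(y|x)
  y   = np.array([287.0])

  K   = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)   # Kalman gain
  x_a = x_b + K @ (y - H @ x_b)                    # analysis (posterior mean)
  A   = (np.eye(2) - K @ H) @ B                    # analysis (posterior) covariance
  ```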

  10. How do we get p(y|x)?
  • This is where QA4EO comes in – it has a strong metrology emphasis in which all sources of uncertainty are identified
    • this is very challenging – defining a complete probability distribution requires many assumptions
  • So we had better check our model p(y|x) against reference validation data, and assess the reliability of the density
    • reliability is used here in a technical sense – it measures whether the estimated probabilities match the frequencies actually observed (a minimal reliability check is sketched below)
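
  A minimal sketch of such a reliability check, assuming collocated reference values are available and the claimed p(y|x) is Gaussian: probability integral transform (PIT) values should be uniform if the stated uncertainties are reliable. The data here are simulated, with a deliberately over-confident claimed sigma.

  ```python
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)

  x_ref = rng.normal(15.0, 3.0, size=500)            # hypothetical reference ('truth') values
  sigma_true, sigma_claimed = 1.0, 0.5               # claimed uncertainty is over-confident
  y = x_ref + rng.normal(0.0, sigma_true, size=x_ref.size)

  # PIT: evaluate the claimed CDF of p(y|x) at each observed y. Reliable
  # uncertainties give uniform PIT values; over-confidence piles mass at 0 and 1.
  pit = stats.norm.cdf(y, loc=x_ref, scale=sigma_claimed)
  counts, _ = np.histogram(pit, bins=10, range=(0.0, 1.0))
  print(counts)   # far from flat, so the claimed p(y|x) is not reliable
  ```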

  11. Practicalities of obtaining p(y|x)
  • This is not easy – I imagine a two-pronged approach:
    • lab-based “forward” assessment of instrument characteristics to build an initial uncertainty model
    • field-based validation campaigns to update our beliefs about the uncertainty, using a data-assimilation-like approach
    • continual refinement based on ongoing validation (a minimal sketch of such an update follows below)
  • This requires new statistical methods, new systems and new software, and is a challenge!
    • ideally it should integrate into data assimilation ...
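
  A minimal sketch of the "update our beliefs about the uncertainty" step, assuming zero-mean Gaussian errors with unknown variance and an inverse-gamma prior built from the lab assessment; the prior parameters and residuals are invented for illustration.

  ```python
  import numpy as np

  def update_noise_variance(alpha0, beta0, residuals):
      """Conjugate inverse-gamma update for an unknown observation-error variance.
      Prior: sigma^2 ~ Inv-Gamma(alpha0, beta0), e.g. from lab characterisation.
      Data: validation residuals r_i = y_i - h(x_ref_i), assumed ~ N(0, sigma^2)."""
      r = np.asarray(residuals, float)
      return alpha0 + r.size / 2.0, beta0 + 0.5 * np.sum(r ** 2)

  # Lab work suggests sigma ~ 0.5 (variance ~ 0.25); a small field campaign refines it:
  alpha_n, beta_n = update_noise_variance(alpha0=5.0, beta0=1.0,
                                          residuals=[0.7, -0.4, 0.9, -0.6])
  posterior_mean_variance = beta_n / (alpha_n - 1.0)   # E[sigma^2 | residuals] ~ 0.32
  ```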

  12. How does this relate to current approaches?
  • Current approaches, e.g. ISO 19115 / 19115-2 (metadata), recognise quality as something important, but:
    • they do not give strong enough guidance on using useful quality indicators (QA4EO addresses this to a greater degree)
    • many of the quality indicators are esoteric and not statistically well motivated (still true in ISO 19157)
  • We have built UncertML specifically to describe uncertainty information flexibly but precisely, in a usable form that allows probabilistic solutions

  13. Why probabilistic quality modelling?
  • We need a precise, context-independent definition of observation accuracy as a key part of quality
  • Probabilistic approaches:
    • work for all uses of the observation – they are not context specific
    • provide a coherent, principled framework for using observations
    • allow integration of data (information interoperability) and data reuse
    • can extract information from even noisy data – assuming reliable probabilities, data of any ‘quality’ can be used

  14. Quality, metadata and the GEO label
  • Quantitative probabilistic accuracy information is the single most useful aspect
  • Other aspects remain relevant:
    • traceability: provenance / lineage
    • usability: ease / cost of access
    • completeness: coverage
    • validity: conformance to internal and external rules
    • utility: user rating
  • I think the GEO label concept must put a well-defined probabilistic notion of accuracy at its heart, but also consider these other quality aspects

  15. Summary
  • Quality has many facets – accuracy is key
  • Accuracy should be well defined, requiring a rigorous statistical framework and a definition of reality
  • Quality should be at the heart of a GEO label
  • QA4EO is starting to show the way
  • We also need to show how to implement this
  • GeoViQua will develop some of the necessary tools, but this is a long road ...
