Bayesian Methods and Subjective Probability



  1. Bayesian Methods and Subjective Probability Daniel Thorburn Stockholm University 2011-01-10

  2. Outline • Background to Bayesian statistics • Two simple rules • Why not design-based? • Bayes, public statistics and sampling • De Finetti's theorem, Bayesian Bootstrap • Comparisons between paradigms • Preposterior analysis • Statistics in science • Complementary Bayesian methods

  3. 1. Background • Mathematically: • Probability is a positive, finite, normed, σ-additive measure defined on a σ-algebra • But what does that correspond to in real life?

  4. What is the probability of heads in the following sequence? Does it change? And when? • This is a fair coin • I am now going to toss it in the corner • I have tossed it but no one has seen the result • I have got a glimpse of it but you have not • I know the result but you don't • I tell you the result

  5. Laplace definition: "All outcomes are equally probable if there is no information to the contrary". (Number of favourable elementary events / number of possible elementary events.) • Choose heads and bet on it with your neighbour. You get one krona if you are right and lose one if you are wrong. When should you change from indifference? • Frequency interpretation (LLN): if there is an infinite sequence of independent experiments, then the relative frequency converges a.s. towards the true value. This cannot be used as a definition, for two reasons: • It is a vicious circle: independence is defined in terms of probability • It is logically impossible to define uncountably many different quantities by a countable procedure.

  6. Probabilities do not exist (de Finetti) • They only describe your lack of knowledge • If there is a God almighty, he knows everything now, in the past and in the future. ("God does not play dice", Einstein) • But lack of knowledge is personal, thus probability is subjective • Kolmogorov's axioms alone do not say anything about the relation to reality

  7. Probability is the language which describes uncertainty • If you do not know a quantity you should describe your opinion in terms of probability • Probability is subjective and varies between persons and over time, depending on the background information.

  8. Rational behaviour – one person • Axiomatic foundation of probability. Type: • For any two events A and B exactly one of the following must hold: A < B, A > B or A ~ B (A < B: B is more likely; A > B: A is more likely; A ~ B: equally likely) • If A1, A2, B1 and B2 are four events such that A1∩A2 = B1∩B2 = Ø and A1 ≥ B1 and A2 ≥ B2, then A1∪A2 ≥ B1∪B2. If further either A1 > B1 or A2 > B2, then A1∪A2 > B1∪B2 • … • If these axioms hold, all events can be assigned probabilities which obey Kolmogorov's axioms (Villegas, Annals Math Stat, 1964) • Axioms for behaviour. Type: • If you prefer A to B, and B to C, then you must also prefer A to C • … • If you want to behave rationally, then you must behave as if all events were assigned probabilities (Anscombe and Aumann, Annals Math Stat, 1963)

  9. Axioms for probability (these six are enough to prove that a probability following Kolmogorov's axioms can be defined, plus the definition of conditional probability) • For any two events A and B exactly one of the following must hold: A < B, A > B or A ~ B (A < B: B is more likely; A > B: A is more likely; A ~ B: equally likely) • If A1, A2, B1 and B2 are four events such that A1∩A2 = B1∩B2 = Ø and A1 ≥ B1 and A2 ≥ B2, then A1∪A2 ≥ B1∪B2. If further either A1 > B1 or A2 > B2, then A1∪A2 > B1∪B2 • If A is any event, then A ≥ (the impossible (empty) event) • If Ai is a strictly decreasing sequence of events and B a fixed event such that Ai > B for all i, then (the intersection of all Ai) > B • There exists one random variable which has a uniform distribution • For any events A, B and D, (A|D) < (B|D) if and only if A∩D < B∩D • Then one needs some axioms about comparing outcomes (utilities) in order to be able to prove rationality …

  10. Further one needs some axioms about comparing outcomes (utilities) in order to be able to prove rationality • For any two outcomes A and B, one either prefers A to B, or B to A, or is indifferent • If you prefer A to B, and B to C, then you must also prefer A to C • If P1 and P2 are two distributions over outcomes they may be compared, and you are indifferent between A and the distribution with P(A) = 1 • Two measurability axioms, like: • If A is any outcome and P a distribution, then the event that P gives an outcome preferred to A can be compared to other events (more likely …) • If P1 is preferred to P2 and A is an event, A > 0, then the game giving P1 if A occurs is preferred to the game giving P2 under A, if the results under not-A are the same • If you prefer P1 to P and P to P2, then there exist numbers a > 0 and b > 0 such that P1 with probability 1−a and P2 with probability a is preferred to P, which is preferred to P1 with probability b and P2 with probability 1−b.

  11. There is only one type of numbers, which may be known or unknown. • Classical inference has a mess of different types of numbers, e.g. • Parameters • Latent variables, like in factor analysis • Random variables • Observations • Independent (explaining) variables • Dependent variables • Constants • and so on • Superstition!

  12. 2. Two simple requirements for rational inference

  13. Rule 1 • What you know/believe in advance + The information in the data = What you know/believe afterwards

  14. Rule 1 • What you know/believe in advance + The information in the data = What you know/believe afterwards • This is described by Bayes' Formula: • P(θ|K) × P(X|θ,K) ∝ P(θ|X,K)

  15. Rule 1 • What you know/believe in advance + The information in the data = What you know/believe afterwards • This is described by Bayes' Formula: • P(θ|K) × P(X|θ,K) ∝ P(θ|X,K) • or in terms of the likelihood • P(θ|K) × L(θ|X) ∝ P(θ|X,K)
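
As a concrete illustration of Rule 1, here is a minimal Python sketch of Bayes' formula on a discrete parameter grid (the flat prior, the binomial likelihood and the data are invented for illustration):

```python
from math import comb

theta = [i / 100 for i in range(101)]      # grid of parameter values
prior = [1 / len(theta)] * len(theta)      # P(theta|K): flat, for illustration
n, x = 10, 7                               # invented data: 7 successes in 10 trials

# Bayes' formula on the grid: prior * likelihood, then normalize
unnorm = [p * comb(n, x) * t**x * (1 - t)**(n - x) for p, t in zip(prior, theta)]
posterior = [u / sum(unnorm) for u in unnorm]

print(max(zip(posterior, theta)))          # posterior mode is at theta = 0.7
```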

  16. Rule 1 corollary • What you believe afterwards + the information in a new study = What you believe after both studies

  17. Rule 1 corollary • What you believe afterwards + the information in a new study = What you believe after both studies • The result of the inference should be possible to use as an input to the next study • It should thus be of the same form! • Note that hypothesis tests and confidence intervals can never appear on the left-hand side, so they do not follow Rule 1
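
The corollary is easy to verify numerically: updating in two steps, with the first posterior as the second prior, gives exactly the same answer as one update on the pooled data. A sketch reusing the grid setup above (all numbers invented):

```python
from math import comb

def update(prior, theta, n, x):
    """One application of Bayes' formula with a binomial likelihood."""
    unnorm = [p * comb(n, x) * t**x * (1 - t)**(n - x) for p, t in zip(prior, theta)]
    s = sum(unnorm)
    return [u / s for u in unnorm]

theta = [i / 100 for i in range(101)]
flat = [1 / len(theta)] * len(theta)

# Study 1 then study 2, with the first posterior as the second prior ...
two_step = update(update(flat, theta, 10, 7), theta, 20, 12)
# ... equals a single analysis of the pooled data (30 trials, 19 successes)
one_step = update(flat, theta, 30, 19)

assert all(abs(a - b) < 1e-9 for a, b in zip(two_step, one_step))
```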

  18. Rule 2 • Your knowledge must be given in a form that can be used for deciding actions. (At least in a well-formulated problem with well-defined losses/utility).

  19. Rule 2 • Your knowledge must be given in a form that can be used for deciding actions. (At least in a well-formulated problem with well-defined losses/utility.) • If you are rational, you must use the rule which minimizes expected "losses" (maximizes utility) • D_opt = argmin_D E(Loss(D,θ)|X,K) = argmin_D ∫ Loss(D,θ) P(θ|X,K) dθ

  20. Rule 2 • Your knowledge must be given in a form that can be used. (At least in a well-formulated problem with well-defined losses/utility.) • If you are rational, you must use the rule which minimizes expected "losses" (maximizes utility) • D_opt = argmin_D E(Loss(D,θ)|X,K) = argmin_D ∫ Loss(D,θ) P(θ|X,K) dθ • Note that classical design-based inference has no interface with decisions.
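
A minimal sketch of Rule 2 under squared-error loss (the posterior draws below are stand-ins; in a real problem they would come from your model). As expected under squared loss, the optimal decision lands at the posterior mean:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_post = rng.beta(8, 4, size=100_000)   # stand-in posterior P(theta|X,K)

actions = np.linspace(0, 1, 201)            # candidate decisions D
exp_loss = [np.mean((d - theta_post) ** 2) for d in actions]
d_opt = actions[int(np.argmin(exp_loss))]
print(d_opt, theta_post.mean())             # squared loss => posterior mean (~0.667)
```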

  21. Statistical tests are useless • They cannot be used to combine with new data. • They cannot be used even in simple decision problems. • They can be compared to the blunt plastic knife given to a three-year-old child • He cannot do much sensible with it • But he cannot harm himself either

  22. 3. An example of the stupidity of frequency-based (design-based) methods • N = 4, n = 2, SRS. Dichotomous data, black or white. The variable is known to come in pairs, i.e. the total is T = 0, 2 or 4. [The slide shows a table of the sampling probabilities.] • If you observe 1 white you know for sure that the population contains 2 white. • If you observe 0 or 2 white, the only unbiased estimate is T* = 0 resp. 4. • The variance of this estimator is 4/3 if T = 2 (= 1/6·4 + 4/6·0 + 1/6·4) and 0 if T = 0 or 4. • So if you know the true value the design-based variance is 4/3, and if you are uncertain the design-based variance is 0. (The standard unbiased variance estimates are 2 resp. 0.)
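
A quick check of this arithmetic: a small Python sketch that enumerates all six SRS samples of size 2 from the population with T = 2.

```python
from fractions import Fraction
from itertools import combinations

# Population with T = 2 white (the case where an observer of t = 1 knows T exactly)
pop = [1, 1, 0, 0]
samples = list(combinations(range(4), 2))   # all 6 SRS samples of size n = 2

# Unbiased expansion estimator T* = (N/n) * t = 2t
ests = [2 * (pop[i] + pop[j]) for i, j in samples]
mean = Fraction(sum(ests), len(ests))
var = sum((e - mean) ** 2 for e in ests) / len(ests)
print(mean, var)                            # 2 and 4/3, as on the slide
```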

  23. Bayesian analysis works OK • We saw the Bayesian analysis when t = 1 (T* = 2). • If all possibilities are equally likely a priori, the posterior estimate of T when t = 0 (2) is T* = 2/7 (26/7) and the posterior variance is 24/49.
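
These posterior numbers can be verified by direct enumeration: a hypergeometric likelihood times the flat prior over T ∈ {0, 2, 4}.

```python
from fractions import Fraction
from math import comb

# Prior: T in {0, 2, 4} equally likely; data: t white in an SRS of n = 2 from N = 4.
def posterior(t):
    # P(t | T) is hypergeometric; the flat prior 1/3 cancels in the normalization
    unnorm = {T: Fraction(comb(T, t) * comb(4 - T, 2 - t), comb(4, 2))
              for T in (0, 2, 4) if T >= t and 4 - T >= 2 - t}
    s = sum(unnorm.values())
    return {T: p / s for T, p in unnorm.items()}

post = posterior(0)                         # observed 0 white
mean = sum(T * p for T, p in post.items())
var = sum((T - mean) ** 2 * p for T, p in post.items())
print(mean, var)                            # 2/7 and 24/49
```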

  24. Always stupid? • It is always stupid to believe that the variance of an estimator is a measure of precision in one particular case. (It is defined as a long-run property over many repetitions.) • But it is not always as obvious and as stupid as in this example. • Is this a consequence of the unusual prior, that T must be even?

  25. Example without the prior info. Still stupid, but not quite as much

  26. Example without the prior info. Still stupid, but not quite as much • If you observe 1, the true error is never larger than 1, but the standard deviation is always larger than 1 for all possible parameter values.

  27. Always stupid? • It is always stupid to assume that the variance of an estimator is a measure of precision in one particular case. (It is defined as a long-run property over many repetitions.) • But it is not always as obvious and stupid as in these examples. • Under suitable regularity conditions design-based methods are asymptotically as efficient as Bayesian methods

  28. Many people say that one should choose the approach that is best for the problem at hand. Classical or Bayesian.

  29. Many people say that one should choose the approach that is best for the problem at hand. Classical or Bayesian. • So do Bayesians. • But they also draw the conclusion:

  30. Many people say that one should choose the approach that is best for the problem at hand. Classical or Bayesian. • So do Bayesians. • But they also draw the conclusion: • Always use Bayesian methods!

  31. Many people say that one should choose the approach that is best for the problem at hand. Classical or Bayesian. • So do Bayesians. • But they also draw the conclusion: • Always use Bayesian methods! • Classical methods can sometimes be seen as quick and dirty approximations to Bayesian methods. • Then you may use them.

  32. 4. What is special for many statistical surveys, e.g. public statistics?

  33. 4. What is special for many statistical surveys, e.g. public statistics? • Answer 1: The producer of the survey is not the user. • Often many readers and many users. • The producer has no interest in the figures per se • P(θ|K_user) is not known to the producer and not unique: P(θ|K_user) × L(θ|X) ∝ P(θ|X,K_user) • Solution?

  34. 4. What is special for many statistical surveys, e.g. public statistics? • Answer 1: The producer of the survey is not the user. • Often many readers and many users. • The producer has no interest in the figures per se • P(θ|K_user) is not known to the producer and not unique: P(θ|K_user) × L(θ|X) ∝ P(θ|X,K_user) • Publish L(θ|X) so that any reader can plug in his prior • Usually given in the form of the posterior with a vague, uninformative (often = constant) prior: L(θ|X) ∝ P(θ|K0) × L(θ|X) ∝ P(θ|X,K0)

  35. Describing the likelihood • Estimates are often asymptotically normal. Then it is enough to give the posterior mean and variance, or a (symmetric) 95% prediction interval (for large samples) • When the maximum likelihood estimator is approximately efficient and normal, the ML estimate and the inverse Fisher information are enough (→ standard confidence interval) • Asymptotically efficient → for large samples almost as good as Bayesian estimates, which are known to be admissible also for finite samples
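
A tiny illustration of this summary for a binomial likelihood (the counts are invented): the ML estimate plus the inverse Fisher information give the familiar interval, which a reader with a flat prior can read as an approximate posterior interval.

```python
from math import sqrt

n, x = 400, 256                             # invented survey counts
p_hat = x / n                               # ML estimate
se = sqrt(p_hat * (1 - p_hat) / n)          # sqrt of the inverse Fisher information
print(f"{p_hat:.3f} +/- {1.96 * se:.3f}")   # ~ flat-prior 95% interval
```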

  36. What is special for many statistical surveys, e.g. public statistics? • Answer 2: There is no parameter, or more exactly: the parameter consists of all the N values of all the units in the population.

  37. What is special for many statistical surveys, e.g. public statistics? • Answer 2: There is no parameter, or more exactly: the parameter consists of all the N values of all the units in the population. • Use this vector as the parameter θ in Bayes' formula. • If you are interested in a certain function, e.g. the total Y_T, integrate out all nuisance parameters in the posterior to get the marginal of interest: P(Y_T|X,K) = ∫…∫_{Σθi = Y_T} P(θ1, …, θN |X,K) ∏i=1..N−1 dθi

  38. 5. De Finetti's theorem • Random variables are said to be exchangeable if there is no information in the ordering. This is for instance the case with SRS • If a sequence of random variables is infinitely exchangeable, then they can be described as independent variables given θ, where θ is a latent random variable. (The proof is simple but needs some knowledge of random processes. Formally θ is defined on the tail σ-algebra) • Latent means in this case that it does not exist but can be useful when describing the distribution.

  39. This imaginary random variable can take the place of a parameter • But note that it does not exist (is not defined) until the full infinite sequence has been defined and the full sequence will never be observed. • Note also that most sequences in the real world are not independent but only exchangeable. If you toss a coin 1000 times and get 800 heads it is more likely that the next toss will be heads (compared to the case with 200 heads). • So obviously there is a dependence between the first 1000 tosses and the 1001st

  40. Dichotomous variables or The Polya Urn scheme • In an urn there is one white and one black ball. • Draw one ball at random. Note its colour. • Put it back together with one more ball of the same colour • Draw one at random … • This sequence can be shown to be exchangeable, and by de Finetti's theorem it can be described as • Take θ ~ U(0,1) = Beta(1,1) • Draw balls independently with this probability of being white • There is no way to see the difference between a Bernoulli sequence (binomial distribution) with an unknown p and a Polya urn scheme. Since the outcomes follow the same distribution there cannot exist any test to differentiate between them.
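
A simulation sketch of this equivalence (assuming nothing beyond the slide's setup): drawing from the Polya urn and drawing Bernoulli variables with θ ~ Beta(1,1) give the same distribution for the number of white balls.

```python
import numpy as np

rng = np.random.default_rng(0)

def polya(n_draws):
    white, total = 1, 2                     # urn starts with 1 white, 1 black
    out = []
    for _ in range(n_draws):
        ball = rng.random() < white / total
        out.append(ball)
        white += ball                       # replace + one ball of the same colour
        total += 1
    return out

def beta_bernoulli(n_draws):
    theta = rng.beta(1, 1)                  # de Finetti: theta ~ Beta(1,1)
    return list(rng.random(n_draws) < theta)

# The number of white in 10 draws has the same (uniform) distribution under both
a = [sum(polya(10)) for _ in range(20_000)]
b = [sum(beta_bernoulli(10)) for _ in range(20_000)]
print(np.bincount(a, minlength=11) / 20_000)
print(np.bincount(b, minlength=11) / 20_000)  # both approx 1/11 everywhere
```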

  41. Dichotomous variables or The Polya Urn scheme • We could have started with another number of balls. This would have given other parameters in the prior Beta distribution • Beta(a,b) ↔ a white balls and b black balls • E(Beta) = a/(a+b) • Var(Beta) = ab/((a+b)²(a+b+1))

  42. Dichotomous variables or The Polya Urn scheme • This can be used to derive the posterior distribution of the number (yT) of balls/persons with a certain property (white) in a population, given an observed SRS-sample of size n with yS white balls/persons. • Use a prior with parameters so that the expected value is your best guess of the unknown proportion and the standard deviation describes your uncertainty about it.
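
A sketch of that moment matching, solving the Beta mean and variance formulas from slide 41 for a and b (the best guess m and uncertainty s below are invented):

```python
# Moment-match a Beta(a, b) prior to a best guess m and standard deviation s
m, s = 0.3, 0.1                             # invented guess and uncertainty
nu = m * (1 - m) / s**2 - 1                 # a + b, from Var = m(1-m)/(a+b+1)
a, b = m * nu, (1 - m) * nu
print(a, b)                                 # Beta(6.0, 14.0) in this case
```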

  43. Properties • The posterior distribution can be shown to be a beta-binomial (the exact formula appeared on the slide). • With both parameters set to 0, the expected value is Np* and the variance p*(1−p*)N(N−n)/(n+1). • The design-based estimate and variance estimator are good approximations to this (equal apart from n in place of n+1)

  44. Simulation • It is often easier to simulate from the posterior than to give its exact form. • In this case the urn scheme gives a simple way to simulate the full population. Just continue with the Polya sampling scheme starting from the sample • If you repeat this 1000 times, say, and plot the 1000 simulated population totals in a histogram, you will get a good description of the distribution of the unknown quantity
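
A minimal version of this simulation in Python (population size, sample size and repetition count are invented; the prior parameters are set to 0 as on the previous slides):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, y_s = 100, 10, 4                      # invented: population, SRS size, observed white

def polya_complete():
    # Continue the Polya scheme from the sample (prior parameters = 0)
    white, total = y_s, n
    for _ in range(N - n):
        white += rng.random() < white / total
        total += 1
    return white

totals = np.array([polya_complete() for _ in range(1000)])
# Mean approx N*p* = 40; sd approx sqrt(p*(1-p*)N(N-n)/(n+1)) approx 14
print(totals.mean(), totals.std())
```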

  45. Dirichlet-multinomial • If the distribution is discrete with a finite number of categories, a similar procedure is possible • Just draw from the set of all observations and put it back together with a similar observation. Continue until N • Repeat and you get a number of populations which are drawn from the posterior distribution. • For each population compute the parameter of interest, e.g. the mean or median, and plot the values in a histogram • If this is described as in de Finetti’s Theorem, the parameter comes from a Dirichlet distribution and the observations are conditionally independent multinomial.

  46. The Bayesian Bootstrap • This procedure is called the Bayesian bootstrap (if an uninformative prior, i.e. all parameters = 0, is used) • This can be generalised to variables measured on a continuous scale • The design-based estimate gives the same mean estimate as this (for polytomous populations). • The design-based variance estimator is also close to the true variance, apart from a factor n/(n+1) • Note that if the distribution is skewed, this method does not work well, since it does not use the prior information about skewness (nor do the design-based methods) • Note also that with many categories it may be better to use even smaller parameters, e.g. −0.9.
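
For a continuous variable, one common way to realise the Bayesian bootstrap is Rubin's Dirichlet-weights form, which can be read as the large-N limit of the Polya scheme above. A sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(3)
sample = np.array([2.1, 3.4, 1.7, 5.0, 2.8, 4.2, 3.9, 2.2])  # invented measurements
n = len(sample)

def bb_mean():
    # One Bayesian bootstrap draw of the mean: Dirichlet(1,...,1) weights
    # on the observed values (uninformative limit, population >> sample)
    w = rng.dirichlet(np.ones(n))
    return float(w @ sample)

draws = [bb_mean() for _ in range(1000)]
print(np.mean(draws), np.std(draws))        # posterior mean and spread of the mean
```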

  47. Other Bayesian models • There are many other models/methods within Bayesian survey sampling than the Bayesian Bootstrap • Another approach starts with a normal-gamma model • Given μ, σ², data come from an iid Normal(μ, σ²) model • The variance σ² follows a priori an inverse gamma • The mean μ follows a priori a normal model with mean m and variance kσ² • and later relaxes the normality assumption • but I have not enough time here.
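
A sketch of the standard conjugate update for this normal-gamma model (m and k follow the slide's notation; the data, the inverse-gamma hyperparameters alpha and beta, and the vague prior values are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(10.0, 2.0, size=30)          # invented data
n = len(x)

# Prior: mu | sigma^2 ~ N(m, k*sigma^2), sigma^2 ~ InvGamma(alpha, beta)
m, k, alpha, beta = 0.0, 1000.0, 1.0, 1.0   # vague, illustrative values

# Conjugate posterior update
k_n = 1 / (1 / k + n)
m_n = k_n * (m / k + n * x.mean())
alpha_n = alpha + n / 2
beta_n = (beta + 0.5 * ((x - x.mean()) ** 2).sum()
          + 0.5 * (x.mean() - m) ** 2 / (k + 1 / n))

# Draw from the posterior: sigma^2 first, then mu given sigma^2
sig2 = 1 / rng.gamma(alpha_n, 1 / beta_n, size=5000)
mu = rng.normal(m_n, np.sqrt(k_n * sig2))
print(mu.mean(), np.sqrt(sig2).mean())      # close to the true 10.0 and 2.0
```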

  48. 6. Properties of some different paradigms within survey sampling

  49. 7. Preposterior analysis: Study/experimental design • In the design of a survey one must take the posterior distribution into account. • You may e.g. want to • Get a small posterior variance • Get a short 95% prediction interval • Make a good decision • This analysis of the possible posterior consequences, before the experiment is carried out, is called preposterior analysis

  50. Preposterior analysis with public statistics • Usually, when you make a survey for your own benefit, you should use your own prior both in the preposterior and the posterior analysis • With public statistics you should have a flat prior in the posterior analysis, • e.g. the posterior variance is Var(θ|X, K0). • But the design decision is yours and you should use all your information for that decision • e.g. find the design which minimizes E(Var(θ|X, K0) | K_You)
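
A preposterior sketch in this spirit: the design (here, the sample size) is evaluated under your own prior, while each imagined analysis uses the flat prior K0. The Beta(8,2) "your prior" and the candidate sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def expected_post_var(n, sims=2000):
    """E(Var(theta | X, K0) | K_you) for a binomial survey of size n:
    flat prior K0 in the analysis, your prior Beta(8, 2) in the design."""
    total = 0.0
    for _ in range(sims):
        theta = rng.beta(8, 2)              # your prior belief (assumed)
        x = rng.binomial(n, theta)          # imagined survey data
        a, b = x + 1, n - x + 1             # flat-prior posterior is Beta(a, b)
        total += a * b / ((a + b) ** 2 * (a + b + 1))
    return total / sims

for n in (10, 50, 200):                     # compare candidate designs
    print(n, expected_post_var(n))
```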
