Statistical Data Analysis and Simulation. João R. T. de Mello Neto. Jorge Andre Swieca School, Campos do Jordão, January 2003.
Questions
• What is probability? How to quantify it?
• What is the probability that something happens?
• What is the value of a given parameter?
• What is the uncertainty in a given parameter?
• Is this fit acceptable?
• What is the likelihood that a given signal is physics and not background?
• How does one separate signal from background?
Chance
"The conception of chance enters into the very first steps of scientific activity, in virtue of the fact that no observation is absolutely correct." Max Born, Natural Philosophy of Cause and Chance, p. 47
"Chance is a devil and a god at the same time." Machado de Assis
Lectures
• Basics: random variables, probability, distributions
• Random numbers, minimization techniques
• Maximum likelihood and chi-square methods
• Goodness of fit, limits
• Applications: pattern recognition in the LHCb muon system, sigma particle fitting in E791, Bayesian coin, …
First lecture. Basics: random variables, probabilities and distributions
References
• G. Cowan, Statistical Data Analysis, Oxford, 1998
• R. Barlow, Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences, J. Wiley & Sons, 1989
• W. L. Martinez and A. R. Martinez, Computational Statistics Handbook with MATLAB, Chapman & Hall, 2002
Random Variables
• Random experiment: the outcome cannot be predicted with certainty
• Statistics: model and analyze the outcomes
• Sample space S = set of all possible outcomes
• Die: X = {1, 2, 3, 4, 5, 6} (discrete random variable)
• Period of a pendulum (continuous random variable)
• Randomness comes from errors in the measuring process or from fundamental unpredictability
Probability
• Quantifies the degree of randomness
• Definition in terms of set theory: S is a set with subsets A, B, … (the events)
• P(A) is a real number that satisfies three axioms:
• for every A, P(A) ≥ 0
• if A∩B = Ø (disjoint), P(A∪B) = P(A) + P(B)
• P(S) = 1
Consequences: P(Ā) = 1 – P(A); P(Ø) = 0; P(A∪Ā) = 1; if A ⊂ B, P(A) ≤ P(B); 0 ≤ P(A) ≤ 1; P(A∪B) = P(A) + P(B) – P(A∩B)
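Conditional probability (intuitive approach)
[Figure: Venn diagram of the sample space S containing 10 events, with overlapping sets A and B]
P(A|B): probability of event A given B
P(A|B) = P(A∩B) / P(B) = (events in A and B / total) / (events in B / total) = (events in A and B) / (events in B) = 2/3
P(B|A) = P(B∩A) / P(A) = (events in A and B) / (events in A) = 2/3
P(A∩B) = P(B|A) P(A) = 2/3 × 3/10 = 2/10 = P(A|B) P(B)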
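The counting argument on this slide can be checked with a few lines of Python. This is only a sketch: the labelling of the 10 events (which integers belong to A and to B) is a hypothetical choice that reproduces the counts on the slide, not something taken from the original figure.

```python
from fractions import Fraction

# Hypothetical labelling of the 10 equally likely events in S.
# A and B each hold 3 events and share 2, matching the slide's numbers.
S = set(range(10))
A = {0, 1, 2}
B = {1, 2, 3}

def prob(event):
    """Probability of an event as (events in the set) / (total events)."""
    return Fraction(len(event & S), len(S))

def cond_prob(a, b):
    """P(a|b) = P(a ∩ b) / P(b)."""
    return prob(a & b) / prob(b)

p_a_given_b = cond_prob(A, B)
p_b_given_a = cond_prob(B, A)
p_joint = prob(A & B)

print(p_a_given_b, p_b_given_a, p_joint)
```

Exact fractions are used so that the product rule P(A∩B) = P(B|A) P(A) can be compared without rounding.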
Independent probabilities (intuitive approach)
A and B are independent if P(A|B) = P(A), that is, P(A∩B) = P(A) P(B)
[Figure: two Venn diagrams of S with sets A and B, one labelled "not independent" and one labelled "independent"]
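Bayes' theorem
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$
If S is divided into disjoint subsets $A_i$, then $P(B) = \sum_i P(B|A_i)\,P(A_i)$ and
$$P(A|B) = \frac{P(B|A)\,P(A)}{\sum_i P(B|A_i)\,P(A_i)}$$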
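Example: Cherenkov counter
Beam: 90% π, 10% K
Cherenkov counter: 95% efficiency for giving a signal on a π, 6% false signals on a K
P(π|signal) = 99.3%, P(K|signal) = 0.7%
P(K|no signal) = 67.6%, P(π|no signal) = 32.4%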
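The four percentages on this slide follow from Bayes' theorem with two disjoint hypotheses. The sketch below assumes that the counter fires on a π with 95% efficiency and fires falsely on a K 6% of the time; that reading is an assumption, chosen because it reproduces the quoted numbers. The function and variable names are illustrative.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) from Bayes' theorem, with A and not-A as the disjoint hypotheses."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence

p_pi = 0.90            # beam fraction of pions (kaons: 1 - p_pi)
p_sig_given_pi = 0.95  # assumed: counter efficiency for a pion
p_sig_given_k = 0.06   # assumed: false-signal rate for a kaon

p_pi_given_sig = posterior(p_pi, p_sig_given_pi, p_sig_given_k)
p_k_given_sig = 1.0 - p_pi_given_sig
p_k_given_nosig = posterior(1.0 - p_pi, 1.0 - p_sig_given_k, 1.0 - p_sig_given_pi)
p_pi_given_nosig = 1.0 - p_k_given_nosig

for name, value in [("P(pi|signal)", p_pi_given_sig), ("P(K|signal)", p_k_given_sig),
                    ("P(K|no signal)", p_k_given_nosig), ("P(pi|no signal)", p_pi_given_nosig)]:
    print(f"{name} = {100 * value:.1f}%")
```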
AIDS positive
"About 0.01 percent of men with no known risk behaviour are infected with HIV (base rate). If such a man has the virus, there is a 99.9 percent chance that the test result will be positive (sensitivity). If a man is not infected, there is a 99.99 percent chance that the test result will be negative (specificity)."
What is the chance that a man who tests positive actually has the virus? P(HIV|positive) = 0.5
Reckoning with Risk, G. Gigerenzer, 2002
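AIDS positive: natural frequencies (men with no known risk behaviour)
10000 men
• 1 with HIV: 1 positive, 0 negative
• 9999 with no HIV: 1 positive, 9998 negative
Of the 2 men who test positive, 1 has the virus: P(HIV|positive) = 1/2
Many examples, e.g. mammography screening: 1 out of 10 positives!
Gigerenzer, 2002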
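Both routes to the answer, Bayes' theorem and natural frequencies, can be compared numerically. The sketch below uses only the base rate, sensitivity and specificity quoted from Gigerenzer; the variable names are illustrative.

```python
base_rate = 0.0001    # 0.01% of men with no known risk behaviour are infected
sensitivity = 0.999   # P(positive | HIV)
specificity = 0.9999  # P(negative | no HIV)

# Route 1: Bayes' theorem.
true_pos = base_rate * sensitivity
false_pos = (1.0 - base_rate) * (1.0 - specificity)
p_hiv_given_pos = true_pos / (true_pos + false_pos)

# Route 2: natural frequencies for 10000 men, rounded to whole people.
n_men = 10_000
infected = round(n_men * base_rate)
infected_positive = round(infected * sensitivity)
healthy = n_men - infected
healthy_positive = round(healthy * (1.0 - specificity))
freq_estimate = infected_positive / (infected_positive + healthy_positive)

print(f"Bayes: {p_hiv_given_pos:.4f}, natural frequencies: {freq_estimate:.1f}")
```

The natural-frequency count gives exactly 1/2 because of the rounding to whole people; the unrounded Bayes result is very slightly below 0.5.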
Probability
What is the meaning of P(A)?
Frequentist: limit of relative frequencies
S: possible outcomes of an experiment (repeatable)
A: occurrence of a given outcome (event)
$$P(A) = \lim_{n\to\infty} \frac{\text{number of occurrences of } A \text{ in } n \text{ measurements}}{n}$$
• consistent with the probability axioms
• usual interpretation in standard textbooks
• appropriate to particle physics (many repeatable events)
• more problematic for unique phenomena: the Big Bang, rain tomorrow
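The frequentist definition can be illustrated by simulation: the relative frequency of an outcome settles near its probability as the number of trials grows. The sketch below uses a fair die and the event "a six is rolled"; the seed and the sample sizes are arbitrary choices made for reproducibility.

```python
import random

def relative_frequency(n_trials, seed=2003):
    """Fraction of n_trials simulated die rolls that show a six."""
    rng = random.Random(seed)  # fixed seed: arbitrary, for reproducibility only
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) == 6)
    return hits / n_trials

# Relative frequency for increasing numbers of rolls; P(six) = 1/6 for a fair die.
freqs = {n: relative_frequency(n) for n in (100, 10_000, 100_000)}
for n, f in freqs.items():
    print(f"n = {n:>7}: relative frequency = {f:.4f}")
```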
Probability
Bayesian (subjective) interpretation
Elements of S: hypotheses or propositions (true or false)
P(A) = degree of belief that hypothesis A is true
Hypothesis: a measurement will yield a given outcome a certain fraction of the time, so subjective probabilities include the frequentist interpretation
A statement such as "m1 ≤ m_e ≤ m2 with P = 95%" is a Bayesian interpretation!
Bayesian statistics: interpretation of Bayes' theorem
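Probability
A: a given theory is correct; B: the data will yield a particular result
$$P(\text{theory}|\text{data}) = \frac{P(\text{data}|\text{theory})\,P(\text{theory})}{P(\text{data})}$$
P(theory|data): posterior probability; P(data|theory): likelihood; P(theory): prior probability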
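As a sketch of how the prior is updated by the likelihood, consider two competing hypotheses about a coin. Everything here is a hypothetical illustration and not taken from the lectures: the two hypotheses ("fair" with P(head) = 0.5, "biased" with P(head) = 0.8), the equal priors, and the data (8 heads in 10 tosses).

```python
from math import comb

def binomial_likelihood(n_heads, n_tosses, p_head):
    """P(data | theory): probability of n_heads in n_tosses for a given P(head)."""
    return comb(n_tosses, n_heads) * p_head**n_heads * (1 - p_head)**(n_tosses - n_heads)

# Hypothetical theories, priors and data.
priors = {"fair": 0.5, "biased": 0.5}
p_head = {"fair": 0.5, "biased": 0.8}
n_heads, n_tosses = 8, 10

# Numerator of Bayes' theorem for each theory, then normalise by P(data).
unnormalized = {h: binomial_likelihood(n_heads, n_tosses, p_head[h]) * priors[h]
                for h in priors}
p_data = sum(unnormalized.values())
posterior = {h: u / p_data for h, u in unnormalized.items()}

print(posterior)
```

P(data) is obtained by summing over the disjoint hypotheses, which is why the posteriors add up to one.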
Distributions
x: continuous random variable; f(x): probability density function (p.d.f.)
Probability to observe x in the interval [x, x + dx] = f(x) dx
Cumulative distribution function: $F(x) = \int_{-\infty}^{x} f(x')\,dx'$
Distributions: joint p.d.f. f(x, y)
P(A∩B) = probability of x in [x, x + dx] and y in [y, y + dy] = f(x, y) dx dy
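Distributions
Expectation value: $E[x] = \int x\,f(x)\,dx = \mu$
Population variance: $V[x] = E[(x-\mu)^2] = E[x^2] - \mu^2 = \sigma^2$
Covariance: $\mathrm{cov}[x,y] = E[(x-\mu_x)(y-\mu_y)] = E[xy] - \mu_x\mu_y$
Correlation coefficient: $\rho_{xy} = \dfrac{\mathrm{cov}[x,y]}{\sigma_x\,\sigma_y}$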
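For a discrete distribution these quantities become sums over outcomes, which makes them easy to evaluate exactly. The sketch below uses a hypothetical example that is not from the slides: x is the value of one fair die and y is the sum of that die and a second one, so x and y are correlated.

```python
from fractions import Fraction
from itertools import product
import math

# Hypothetical joint distribution: x = first die, y = sum of two fair dice.
outcomes = [(d1, d1 + d2) for d1, d2 in product(range(1, 7), repeat=2)]
p = Fraction(1, len(outcomes))  # each of the 36 outcomes is equally likely

def expect(g):
    """Expectation value E[g(x, y)] as a probability-weighted sum."""
    return sum(g(x, y) * p for x, y in outcomes)

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))
rho = float(cov_xy) / math.sqrt(float(var_x) * float(var_y))

print(mu_x, mu_y, var_x, var_y, cov_xy, rho)
```

For a continuous p.d.f. the sums would be replaced by integrals over f(x, y).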
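Binomial
• process with a given number N of identical trials, each with two possible outcomes: success (probability p) or failure (probability 1 − p)
• what is the probability of n successes (and N − n failures)?
Probability for a particular sequence: $p^n (1-p)^{N-n}$
Order does not matter; number of sequences: $\dfrac{N!}{n!\,(N-n)!}$
$$f(n; N, p) = \frac{N!}{n!\,(N-n)!}\, p^n (1-p)^{N-n}$$
This is a probability, not a probability density.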
Binomial: example
[Figure: five tracking chambers C1 to C5]
Individual chamber efficiency: 0.95; a track requires at least 3 points
3 chambers: f(3;3,0.95) = 0.95³ = 0.857
4 chambers: f(3;4,0.95) + f(4;4,0.95) = 0.171 + 0.815 = 0.986
5 chambers: f(3;5,0.95) + f(4;5,0.95) + f(5;5,0.95) = 0.021 + 0.204 + 0.774 = 0.999
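The chamber numbers can be reproduced by summing binomial probabilities over the allowed numbers of hits. A minimal sketch; the function names are illustrative.

```python
from math import comb

def binomial_pmf(n, N, p):
    """f(n; N, p): probability of n successes in N trials."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

def track_efficiency(n_chambers, eff=0.95, min_hits=3):
    """Probability that at least min_hits of n_chambers chambers fire."""
    return sum(binomial_pmf(n, n_chambers, eff)
               for n in range(min_hits, n_chambers + 1))

for n_chambers in (3, 4, 5):
    print(f"{n_chambers} chambers: {track_efficiency(n_chambers):.3f}")
```

Changing `eff` or `min_hits` shows how quickly redundancy pays off: two extra chambers raise the track efficiency from about 86% to about 99.9%.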
Poisson
Limit of the binomial: N large, p very small, Np → ν
Applies when particular events are counted but the number of trials is unknown: sharp events occurring in a continuum
Examples: a Geiger counter near a radioactive source; the number of flashes of lightning in a storm
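Poisson
Proof: ν events are expected in some interval; split the interval into N sections
Probability that a given section contains an event: $p = \nu/N$
Probability of n events in N sections: $\dfrac{N!}{n!\,(N-n)!}\left(\dfrac{\nu}{N}\right)^n\left(1-\dfrac{\nu}{N}\right)^{N-n}$
For N → ∞ with n finite: $\dfrac{N!}{(N-n)!} \to N^n$ and $\left(1-\dfrac{\nu}{N}\right)^{N-n} \to e^{-\nu}$, so
$$P(n; \nu) = \frac{\nu^n}{n!}\,e^{-\nu}$$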
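The limit can also be checked numerically: with p = ν/N, the binomial probability approaches the Poisson value as N grows. The values ν = 2 and n = 3 below are arbitrary illustrative choices.

```python
from math import comb, exp, factorial

def binomial_pmf(n, N, p):
    """Probability of n successes in N trials with success probability p."""
    return comb(N, n) * p**n * (1 - p)**(N - n)

def poisson_pmf(n, nu):
    """P(n; nu) = nu^n exp(-nu) / n!"""
    return nu**n * exp(-nu) / factorial(n)

nu, n = 2.0, 3  # arbitrary illustrative values
section_counts = (10, 100, 1000, 10_000)
diffs = [abs(binomial_pmf(n, N, nu / N) - poisson_pmf(n, nu)) for N in section_counts]

for N, d in zip(section_counts, diffs):
    print(f"N = {N:>6}: |binomial - Poisson| = {d:.2e}")
```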
Poisson
Fatal horse kicks: number of Prussian soldiers kicked to death by horses. In ten different army corps, over 20 years, there were 122 deaths.
Mean number of deaths in one corps in one year: ν = 122/200 = 0.610
No deaths: P(0; 0.61) = 0.5434; number of corps-years with no deaths: 200 × 0.5434 = 108.7
One death: P(1; 0.61) = 0.3315; number of corps-years with one death: 200 × 0.3315 = 66.3
Deaths per corps-year    Actual number of corps-years    Poisson prediction
0                        109                             108.7
1                        65                              66.3
2                        22                              20.2
3                        3                               4.1
4                        1                               0.6
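The Poisson column of the table can be recomputed from the observed counts alone. A minimal sketch, with illustrative variable names:

```python
from math import exp, factorial

# Observed number of corps-years for each number of deaths (from the table).
observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}

n_corps_years = sum(observed.values())
total_deaths = sum(k * count for k, count in observed.items())
nu = total_deaths / n_corps_years  # mean deaths per corps-year

def poisson_pmf(n, nu):
    """P(n; nu) = nu^n exp(-nu) / n!"""
    return nu**n * exp(-nu) / factorial(n)

expected = {k: n_corps_years * poisson_pmf(k, nu) for k in observed}

for k in observed:
    print(f"{k} deaths: observed {observed[k]:>3}, Poisson {expected[k]:6.1f}")
```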
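Gaussian
$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Standard Gaussian (μ = 0, σ = 1): $\varphi(x) = \dfrac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$
Cumulative distribution: $\Phi(x) = \int_{-\infty}^{x} \varphi(x')\,dx'$, evaluated numerically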
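In practice the cumulative Gaussian is available through the error function, Φ(x) = ½ [1 + erf(x/√2)]. The sketch below uses Python's `math.erf`; the function names are illustrative.

```python
from math import erf, exp, pi, sqrt

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian p.d.f. with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2.0 * pi))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative Gaussian, written in terms of the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Probability content of the +-1 sigma and +-2 sigma intervals.
within_1sigma = gaussian_cdf(1.0) - gaussian_cdf(-1.0)
within_2sigma = gaussian_cdf(2.0) - gaussian_cdf(-2.0)

print(f"+-1 sigma: {within_1sigma:.4f}, +-2 sigma: {within_2sigma:.4f}")
```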
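Gaussian
In N dimensions:
$$f(\mathbf{x}; \boldsymbol{\mu}, V) = \frac{1}{(2\pi)^{N/2}\,|V|^{1/2}} \exp\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T V^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]$$
x, μ: column vectors; V: symmetric N × N matrix (the covariance matrix)
In 2 dimensions:
$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 - 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}\right]\right\}$$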
Central limit theorem
The sum of N independent continuous random variables $x_i$ with means $\mu_i$ and variances $\sigma_i^2$ becomes, for N → ∞, a Gaussian random variable with mean $\mu = \sum_i \mu_i$ and variance $\sigma^2 = \sum_i \sigma_i^2$, regardless of the form of the individual p.d.f.s of the $x_i$.
This is the formal justification for treating measurement errors as Gaussian random variables: the total error is the sum of a large number of small contributions.
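A quick simulation shows the theorem at work. The sketch below sums 12 uniform random numbers on [0, 1), for which the predicted mean is 12 × 1/2 = 6 and the predicted variance is 12 × 1/12 = 1. The choice of 12 terms, the seed and the sample size are illustrative assumptions, not taken from the lecture.

```python
import random
import statistics

def sum_of_uniforms(n_terms, rng):
    """Sum of n_terms independent uniform random numbers on [0, 1)."""
    return sum(rng.random() for _ in range(n_terms))

rng = random.Random(2003)  # fixed seed: arbitrary, for reproducibility only
n_terms = 12
samples = [sum_of_uniforms(n_terms, rng) for _ in range(50_000)]

sample_mean = statistics.fmean(samples)
sample_var = statistics.pvariance(samples)
# For a Gaussian with mean 6 and sigma 1, about 68.3% lies within +-1 sigma.
frac_within_1sigma = sum(abs(s - 6.0) < 1.0 for s in samples) / len(samples)

print(f"mean = {sample_mean:.3f}, variance = {sample_var:.3f}, "
      f"fraction within 1 sigma = {frac_within_1sigma:.3f}")
```

Although each term is flat, the sum already has nearly Gaussian probability content within one standard deviation.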
Central limit theorem
Actually used: algorithm R632, CERN library