
Primer on Probability

Primer on Probability. Sushmita Roy (sroy@biostat.wisc.edu), BMI/CS 576, www.biostat.wisc.edu/bmi576/, Sep 25th, 2012.



  1. Primer on Probability • Sushmita Roy, sroy@biostat.wisc.edu • BMI/CS 576, www.biostat.wisc.edu/bmi576/ • Sep 25th, 2012

  2. Definition of probability • frequentist interpretation: the probability of an event from a random experiment is the proportion of the time events of the same kind will occur in the long run, when the experiment is repeated • examples • the probability my flight to Chicago will be on time • the probability this ticket will win the lottery • the probability it will rain tomorrow • always a number in the interval [0,1] • 0 means “never occurs” • 1 means “always occurs”

  3. Sample spaces • sample space: the set of possible outcomes of a random experiment • event: a subset of the sample space • examples • flight to Chicago: {on time, late} • lottery: {ticket 1 wins, ticket 2 wins, …, ticket n wins} • weather tomorrow: {rain, not rain} or {sun, rain, snow} or {sun, clouds, rain, snow, sleet} or…

  4. Random variables • random variable: a function associating a value with an attribute of the outcome of an experiment • example • X represents the outcome of my flight to Chicago • we write the probability of my flight being on time as P(X = on-time) • or when it’s clear which variable we’re referring to, we may use the shorthand P(on-time)

  5. Notation • uppercase letters and capitalized words denote random variables • lowercase letters and uncapitalized words denote values • we’ll denote a particular value for a variable as follows: P(X = x), e.g. P(Fever = true) • we’ll also use the shorthand form P(x) for P(X = x) • for Boolean random variables, we’ll use the shorthand P(fever) for P(Fever = true) and P(¬fever) for P(Fever = false)

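  6. Probability distributions • if X is a random variable, the function given by P(X = x) for each x is the probability distribution of X • requirements: $P(x) \ge 0$ for every x, and $\sum_x P(x) = 1$ • [figure: bar chart of a distribution over sun, rain, sleet, snow, clouds; vertical axis ticks at 0.1, 0.2, 0.3]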

  7. Joint distributions • joint probability distribution: the function given by P(X = x, Y = y) • read “X equals x and Y equals y” • example: P(X = sun, Y = on-time), the probability that it’s sunny and my flight is on time

  8. Marginal distributions • the marginal distribution of X is defined by $P(x) = \sum_y P(x, y)$ • “the distribution of X ignoring other variables” • this definition generalizes to more than two variables, e.g. $P(x) = \sum_y \sum_z P(x, y, z)$
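
As a minimal sketch of marginalization, the snippet below sums a joint table over the variable being ignored. The joint distribution over weather and flight status is a made-up illustration, not the table from the slides, and the helper name `marginal` is hypothetical.

```python
from collections import defaultdict

# Hypothetical joint distribution P(Weather, Flight).
# These numbers are illustrative assumptions, not taken from the slides.
joint = {
    ("sun", "on-time"): 0.20, ("sun", "late"): 0.05,
    ("rain", "on-time"): 0.20, ("rain", "late"): 0.15,
    ("snow", "on-time"): 0.10, ("snow", "late"): 0.30,
}


def marginal(joint, axis):
    """Sum the joint over every variable except the one at position `axis`."""
    result = defaultdict(float)
    for outcome, p in joint.items():
        result[outcome[axis]] += p
    return dict(result)


weather = marginal(joint, 0)  # distribution of Weather, ignoring Flight
flight = marginal(joint, 1)   # distribution of Flight, ignoring Weather
print(weather)
print(flight)
```

Because `marginal` only looks at one position of each outcome tuple, the same function works unchanged for joints over three or more variables.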

  9. Marginal distribution example • [tables not preserved in this transcript: a joint distribution, and the marginal distribution for X derived from it]

  10. Conditional distributions • the conditional distribution of X given Y is defined as: $P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}$ • “the distribution of X given that we know the value of Y”
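
A small sketch of conditioning: keep the joint entries consistent with the observed value of Y, then renormalize by P(Y = y). The joint table and the helper name `conditional` are illustrative assumptions, not taken from the slides.

```python
# Hypothetical joint distribution P(X = weather, Y = flight status).
# These numbers are illustrative assumptions, not taken from the slides.
joint = {
    ("sun", "on-time"): 0.20, ("sun", "late"): 0.05,
    ("rain", "on-time"): 0.20, ("rain", "late"): 0.15,
    ("snow", "on-time"): 0.10, ("snow", "late"): 0.30,
}


def conditional(joint, y):
    """Return P(X | Y = y) for a joint table keyed by (x, y)."""
    # P(Y = y) is the marginal: sum of all entries whose second element is y.
    p_y = sum(p for (_, y_val), p in joint.items() if y_val == y)
    if p_y == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return {x: p / p_y for (x, y_val), p in joint.items() if y_val == y}


given_on_time = conditional(joint, "on-time")
print(given_on_time)
```

The division by P(Y = y) is what makes the result a proper distribution: the returned values sum to 1 even though the selected joint entries do not.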

  11. Conditional distribution example • [tables not preserved in this transcript: a joint distribution, and the conditional distribution for X given Y = on-time]

  12. Independence • two random variables, X and Y, are independent if $P(x, y) = P(x)\,P(y)$ for all x and y

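  13. Independence example #1 • [tables not preserved in this transcript: a joint distribution and its marginal distributions] • Are X and Y independent here? NO.

  14. Independence example #2 • [tables not preserved in this transcript: a joint distribution and its marginal distributions] • Are X and Y independent here? YES.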
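
The check behind both examples can be sketched as follows: compute the two marginals and test whether every joint entry equals their product. Both tables below are made-up illustrations (the slide tables are not available here), and `is_independent` is a hypothetical helper name.

```python
import math


def is_independent(joint, tol=1e-9):
    """True if P(x, y) == P(x) * P(y) for every pair (x, y), within `tol`."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(
        math.isclose(joint.get((x, y), 0.0), p_x[x] * p_y[y], abs_tol=tol)
        for x in xs
        for y in ys
    )


# Illustrative table in which weather affects the flight (dependent).
dependent = {
    ("sun", "on-time"): 0.20, ("sun", "late"): 0.05,
    ("rain", "on-time"): 0.20, ("rain", "late"): 0.15,
    ("snow", "on-time"): 0.10, ("snow", "late"): 0.30,
}

# Illustrative table built as a product of marginals (independent).
independent = {
    ("sun", "on-time"): 0.40, ("sun", "late"): 0.10,
    ("rain", "on-time"): 0.40, ("rain", "late"): 0.10,
}

print(is_independent(dependent))
print(is_independent(independent))
```

A tolerance is used because the marginals are floating-point sums; an exact equality test could reject a table that is independent on paper.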

  15. Conditional independence • two random variables X and Y are conditionally independent given Z if $P(X \mid Y, Z) = P(X \mid Z)$ • “once you know the value of Z, knowing Y doesn’t tell you anything about X” • alternatively $P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$

  16. Conditional independence example • [table not preserved in this transcript] • Are Fever and Headache independent? NO.

  17. Conditional independence example • [table not preserved in this transcript] • Are Fever and Vomit conditionally independent given Flu? YES.
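
To make the distinction concrete, here is a sketch that builds a joint distribution over Flu, Fever and Vomit in which the two symptoms depend only on Flu, then checks both kinds of independence numerically. All probabilities are invented for illustration, and the helper names are hypothetical.

```python
import itertools
import math

# Illustrative assumptions (not from the slides):
p_flu = {True: 0.10, False: 0.90}
p_fever_given_flu = {True: 0.90, False: 0.05}  # P(Fever = true | Flu)
p_vomit_given_flu = {True: 0.60, False: 0.02}  # P(Vomit = true | Flu)


def bern(p_true, value):
    """Probability of a Boolean value under a Bernoulli(p_true)."""
    return p_true if value else 1.0 - p_true


# Joint built so that Fever and Vomit each depend only on Flu.
joint = {
    (flu, fever, vomit): p_flu[flu]
    * bern(p_fever_given_flu[flu], fever)
    * bern(p_vomit_given_flu[flu], vomit)
    for flu, fever, vomit in itertools.product([True, False], repeat=3)
}


def prob(flu=None, fever=None, vomit=None):
    """Sum joint entries matching the given values; None means any value."""
    return sum(
        p
        for (f, fe, v), p in joint.items()
        if (flu is None or f == flu)
        and (fever is None or fe == fever)
        and (vomit is None or v == vomit)
    )


def conditionally_independent_given_flu():
    """Check P(fever, vomit | flu) == P(fever | flu) * P(vomit | flu)."""
    for flu, fever, vomit in itertools.product([True, False], repeat=3):
        p_z = prob(flu=flu)
        lhs = prob(flu, fever, vomit) / p_z
        rhs = (prob(flu=flu, fever=fever) / p_z) * (prob(flu=flu, vomit=vomit) / p_z)
        if not math.isclose(lhs, rhs, abs_tol=1e-12):
            return False
    return True


def marginally_independent():
    """Check P(fever, vomit) == P(fever) * P(vomit)."""
    return all(
        math.isclose(
            prob(fever=fe, vomit=v), prob(fever=fe) * prob(vomit=v), abs_tol=1e-12
        )
        for fe, v in itertools.product([True, False], repeat=2)
    )


print(conditionally_independent_given_flu())
print(marginally_independent())
```

With these numbers the symptoms are conditionally independent given Flu but not marginally independent: seeing a fever raises the probability of flu, which in turn raises the probability of vomiting.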

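  18. Chain rule of probability • for two variables: $P(X, Y) = P(X \mid Y)\,P(Y)$ • for three variables: $P(X, Y, Z) = P(X \mid Y, Z)\,P(Y \mid Z)\,P(Z)$ • etc. • to see that this is true, note that $P(X \mid Y, Z)\,P(Y \mid Z)\,P(Z) = \frac{P(X, Y, Z)}{P(Y, Z)} \cdot \frac{P(Y, Z)}{P(Z)} \cdot P(Z) = P(X, Y, Z)$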

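  19. Bayes theorem • $P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)} = \frac{P(y \mid x)\,P(x)}{\sum_{x'} P(y \mid x')\,P(x')}$ • this theorem is extremely useful • there are many cases when it is hard to estimate P(x | y) directly, but it’s not too hard to estimate P(y | x) and P(x)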

  20. Bayes theorem example • MDs usually aren’t good at estimating P(Disorder| Symptom) • they’re usually better at estimating P(Symptom| Disorder) • if we can estimate P(Fever| Flu) and P(Flu) we can use Bayes’ Theorem to do diagnosis
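
A minimal sketch of the diagnosis calculation follows. Besides P(Fever | Flu) and P(Flu), the denominator P(Fever) needs the fever rate among people without flu; that quantity and all three numbers below are illustrative assumptions, not values from the slides.

```python
def posterior(prior, likelihood, likelihood_if_absent):
    """P(disorder | symptom) by Bayes' theorem for a Boolean disorder.

    prior                -- P(disorder)
    likelihood           -- P(symptom | disorder)
    likelihood_if_absent -- P(symptom | no disorder)
    """
    # P(symptom), obtained by summing over both values of the disorder.
    evidence = likelihood * prior + likelihood_if_absent * (1.0 - prior)
    return likelihood * prior / evidence


# Illustrative assumptions (not from the slides):
p_flu = 0.05                 # P(Flu)
p_fever_given_flu = 0.90     # P(Fever | Flu)
p_fever_given_no_flu = 0.10  # P(Fever | not Flu)

p_flu_given_fever = posterior(p_flu, p_fever_given_flu, p_fever_given_no_flu)
print(round(p_flu_given_fever, 3))
```

Even with a symptom that is very likely under the disorder, the posterior stays well below P(Fever | Flu) here because the prior P(Flu) is small.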

  21. Expected values • the expected value of a random variable that takes on numerical values is defined as: $E[X] = \sum_x x\,P(x)$ • this is the same thing as the mean • we can also talk about the expected value of a function of a random variable: $E[g(X)] = \sum_x g(x)\,P(x)$

  22. Expected value examples • Suppose each lottery ticket costs $1 and the winning ticket pays out $100. The probability that a particular ticket is the winning ticket is 0.001.
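
The expected gain for this lottery can be worked out as a probability-weighted sum. The sketch below assumes "gain" means net of the ticket price, so a win nets $99 and a loss nets −$1; the helper name `expected_value` is hypothetical.

```python
def expected_value(distribution):
    """Expected value of a mapping {numeric value: probability}."""
    return sum(value * prob for value, prob in distribution.items())


ticket_cost = 1.0
payout = 100.0
p_win = 0.001

# Net gain: payout minus ticket price on a win, minus the ticket price otherwise.
net_gain = {
    payout - ticket_cost: p_win,
    -ticket_cost: 1.0 - p_win,
}

expected_gain = expected_value(net_gain)
print(f"{expected_gain:.2f}")
```

Under that reading the expected gain is 99 × 0.001 − 1 × 0.999 = −0.90 dollars per ticket, so on average each ticket loses 90 cents.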

  23. The binomial distribution • distribution over the number of successes in a fixed number n of independent trials (with the same probability of success p in each): $P(x) = \binom{n}{x} p^x (1 - p)^{n - x}$ • e.g. the probability of x heads in n coin flips • [figure: two plots of P(X = x) against x, one for p = 0.5 and one for p = 0.1]
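
As a sketch, the binomial probability of x successes in n trials is C(n, x) p^x (1 − p)^(n − x), which can be computed directly with the standard library. The function name `binomial_pmf` and the choice n = 10 are illustrative.

```python
import math


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    if not 0 <= x <= n:
        return 0.0
    return math.comb(n, x) * p**x * (1.0 - p) ** (n - x)


n = 10
fair = [binomial_pmf(x, n, 0.5) for x in range(n + 1)]    # symmetric around n/2
biased = [binomial_pmf(x, n, 0.1) for x in range(n + 1)]  # mass piled near 0

for x in range(n + 1):
    print(x, round(fair[x], 4), round(biased[x], 4))
```

Printing the two columns side by side shows the same contrast as plots for p = 0.5 and p = 0.1: a symmetric shape in the first case and a heavily skewed one in the second.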

  24. The multinomial distribution • k possible outcomes on each trial • probability $p_i$ for outcome $x_i$ in each trial • distribution over the number of occurrences $x_i$ for each outcome in a fixed number n of independent trials: $P(x_1, \ldots, x_k) = \frac{n!}{\prod_i x_i!} \prod_i p_i^{x_i}$ • e.g. with k=6 (a six-sided die) and n=30 • [worked example not preserved in this transcript: a vector of outcome occurrences]
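
A short sketch of the multinomial probability n! / (x_1! … x_k!) × p_1^x_1 … p_k^x_k for a vector of outcome counts. The count vector used for the die is an arbitrary illustration, since the one on the slide is not available, and `multinomial_pmf` is a hypothetical helper name.

```python
import math


def multinomial_pmf(counts, probs):
    """Probability of observing `counts[i]` occurrences of outcome i."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    n = sum(counts)
    # Multinomial coefficient n! / (x_1! * ... * x_k!), kept as an exact integer.
    coeff = math.factorial(n)
    for c in counts:
        coeff //= math.factorial(c)
    prob = float(coeff)
    for c, p in zip(counts, probs):
        prob *= p**c
    return prob


# Fair six-sided die rolled n = 30 times; the counts are an arbitrary example.
die = [1 / 6] * 6
counts = [5, 5, 5, 5, 5, 5]
print(multinomial_pmf(counts, die))
```

With k = 2 the function reduces to the binomial distribution, which gives a convenient sanity check.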

  25. Statistics of alignment scores Q: How do we assess whether an alignment provides good evidence for homology? A: determine how likely it is that such an alignment score would result from chance. What is “chance”? • real but non-homologous sequences • real sequences shuffled to preserve compositional properties • sequences generated randomly based upon a DNA/protein sequence model

  26. Model for unrelated sequences • we’ll assume that each position in the alignment is sampled randomly from some distribution of amino acids • let $q_a$ be the probability of amino acid a • the probability of an n-character alignment of x and y is given by $P(x, y \mid U) = \prod_{i=1}^{n} q_{x_i} \prod_{i=1}^{n} q_{y_i}$

  27. Model for related sequences • we’ll assume that each pair of aligned amino acids evolved from a common ancestor • let $p_{ab}$ be the probability that evolution gave rise to amino acid a in one sequence and b in another sequence • the probability of an alignment of x and y is given by $P(x, y \mid R) = \prod_{i=1}^{n} p_{x_i y_i}$

  28. Probabilistic model of alignments • How can we decide which possibility (U or R) is more likely? • one principled way is to consider the relative likelihood of the two possibilities: $\frac{P(x, y \mid R)}{P(x, y \mid U)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$ • taking the log, we get $\log \frac{P(x, y \mid R)}{P(x, y \mid U)} = \sum_i \log \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$

  29. Probabilistic model of alignments • the score for an alignment is thus given by: $S = \sum_i s(x_i, y_i)$ • the substitution matrix score for the pair a, b should thus be given by: $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$
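
The log-odds scoring idea can be sketched on a toy two-letter alphabet: each substitution score is the log of the pair probability under the related model divided by the product of the background frequencies, and an ungapped alignment score is the sum over positions. The alphabet, all probabilities, and the choice of log base 2 are illustrative assumptions, not values from the slides.

```python
import math

# Toy background frequencies (unrelated model); illustrative assumption.
q = {"A": 0.5, "B": 0.5}

# Toy aligned-pair probabilities (related model); illustrative assumption.
p = {
    ("A", "A"): 0.4, ("B", "B"): 0.4,
    ("A", "B"): 0.1, ("B", "A"): 0.1,
}


def substitution_score(a, b):
    """Log-odds score: log2 of P(a, b | related) / (q_a * q_b)."""
    return math.log2(p[(a, b)] / (q[a] * q[b]))


def alignment_score(x, y):
    """Sum of per-position log-odds scores for an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(substitution_score(a, b) for a, b in zip(x, y))


print(round(substitution_score("A", "A"), 3))  # positive: match favoured under R
print(round(substitution_score("A", "B"), 3))  # negative: mismatch favoured under U
print(round(alignment_score("AAB", "ABB"), 3))
```

A positive total means the related model explains the alignment better than the unrelated model, and a negative total means the opposite.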

  30. Scores from random alignments • suppose we assume • sequence lengths m and n • a particular substitution matrix and amino-acid frequencies • and we consider generating random sequences of lengths m and n and finding the best alignment of these sequences • this will give us a distribution over alignment scores for random pairs of sequences

  31. The extreme value distribution • but we’re picking the best alignments, so we want to know what the distribution of max scores for alignments against a random set of sequences looks like • this is given by an extreme value distribution

  32. Distribution of scores • the expected number of alignments, E, with score at least S is given by: $E = K m n\, e^{-\lambda S}$ • S is a given score threshold • m and n are the lengths of the sequences under consideration • K and λ are constants that can be calculated from • the substitution matrix • the frequencies of the individual amino acids

  33. Statistics of alignment scores • to generalize this to searching a database, have n represent the summed length of the sequences in the DB (adjusting for edge effects) • the NCBI BLAST server does just this • theory for gapped alignments not as well developed • computational experiments suggest this analysis holds for gapped alignments (but K and λ must be estimated from data)
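
As a rough sketch, the expected number of chance alignments scoring at least S can be computed as E = K · m · n · exp(−λ · S). The values of K and λ below are placeholders chosen for illustration, not constants derived from any real substitution matrix, and the sequence lengths are likewise arbitrary.

```python
import math


def expected_hits(score, m, n, K, lam):
    """Expected number of chance alignments with score >= `score`.

    m, n -- lengths of the two sequences (or query length and database size)
    K, lam -- constants that depend on the scoring system
    """
    return K * m * n * math.exp(-lam * score)


# Placeholder constants for illustration only (assumptions, not real values).
K = 0.1
LAM = 0.3

m, n = 250, 1_000_000  # arbitrary query length and summed database length
for score in (20, 40, 60, 80):
    print(score, expected_hits(score, m, n, K, LAM))
```

The output falls off exponentially as the score threshold rises and grows linearly with m and n, which is why the same score is less significant against a larger database.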
