Lecture 3: Bayesian Reasoning • 3.1 Bayes' Rule • 3.2 Naïve Bayes Model • 3.3 Bayesian Networks
Sources of Uncertainty • Information is partial • Information is not fully reliable. • Representation language is inherently imprecise. • Information comes from multiple sources and it is conflicting. • Information is approximate • Non-absolute cause-effect relationships exist
Sources of Uncertainty • Uncertain data (noise) • Uncertain knowledge (e.g., causal relations) • A disorder may cause any and all possible manifestations in a specific case • A manifestation can be caused by more than one possible disorder • Uncertain reasoning results • Abduction and induction are inherently uncertain • Default reasoning, even in deductive fashion, is uncertain • Incomplete deductive inference may be uncertain • Incomplete knowledge and data
Probabilistic Reasoning • Evidence • What we know about a situation. • Hypothesis • What we want to conclude. • Compute • P( Hypothesis | Evidence )
Bayes Theorem • P(H | E) = P(H) · P(E | H) / P(E) • This can be derived from the definition of conditional probability. • Posterior = (Prior · Likelihood) / Evidence
Bayes' Formula • P(H | E) = P(H, E) / P(E) • P(H | E) = P(E | H) · P(H) / P(E)
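As a minimal illustration of the formula (the function name and the example numbers below are mine, not from the lecture):

```python
def bayes_posterior(prior_h, likelihood_e_given_h, evidence_e):
    """Compute P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood_e_given_h * prior_h / evidence_e

# Illustrative numbers: P(H) = 0.3, P(E|H) = 0.8, P(E) = 0.5  ->  P(H|E) = 0.48
print(bayes_posterior(0.3, 0.8, 0.5))
```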
A classic application of Bayes' Theorem: the rule of succession. A shooter fires at a target n times and hits T times. How should the shooter's hit probability θ be estimated? Is T/n the right estimate?
Prior distribution: take θ to be uniform on [0, 1]. Likelihood: given hit probability θ, the probability of exactly T hits in n shots is P(T | θ) = C(n, T) θ^T (1 − θ)^(n−T).
Estimating θ by the expected value of the posterior distribution gives the rule of succession: E[θ | T] = (T + 1) / (n + 2).
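A quick numerical check of the rule of succession (an illustrative sketch; the grid-integration approach and the example n = 10, T = 7 are mine):

```python
from math import comb

def posterior_mean(n, T, grid=10_000):
    """Posterior mean of theta under a uniform prior and Binomial(n, theta) likelihood,
    approximated by numerical integration on a grid."""
    thetas = [(i + 0.5) / grid for i in range(grid)]
    post = [comb(n, T) * th**T * (1 - th)**(n - T) for th in thetas]  # unnormalized posterior
    z = sum(post)
    return sum(th * p for th, p in zip(thetas, post)) / z

n, T = 10, 7
print(posterior_mean(n, T))   # ~ 0.6667
print((T + 1) / (n + 2))      # rule of succession: 0.6667
print(T / n)                  # maximum-likelihood estimate: 0.7
```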
Outline • Independence and Conditional Independence • Naïve Bayes Model • Application: Spam Detection
Probability of Events • Sample space and events • Sample space S (e.g., all people in an area) • Events E1 ⊆ S (e.g., all people having cough), E2 ⊆ S (e.g., all people having cold) • Prior (marginal) probabilities of events • P(E) = |E| / |S| (frequency interpretation) • P(E) = 0.1 (subjective probability) • 0 <= P(E) <= 1 for all events • Two special events, ∅ and S: P(∅) = 0 and P(S) = 1.0 • Boolean operators between events (to form compound events) • Conjunction (intersection): E1 ^ E2 (E1 ∩ E2) • Disjunction (union): E1 v E2 (E1 ∪ E2) • Negation (complement): ~E (S − E)
Probabilities of compound events • P(~E) = 1 − P(E), because P(~E) + P(E) = 1 • P(E1 v E2) = P(E1) + P(E2) − P(E1 ^ E2) • But how do we compute the joint probability P(E1 ^ E2)? • Conditional probability (of E1, given E2): P(E1 | E2) = P(E1 ^ E2) / P(E2) • How likely E1 occurs within the subspace of E2
Independence assumption • Two events E1 and E2 are said to be independent of each other if P(E1 | E2) = P(E1) (knowing E2 does not change the likelihood of E1) • This simplifies the computation of joint probabilities: P(E1 ^ E2) = P(E1) P(E2) • Mutually exclusive (ME) and exhaustive (EXH) set of events • ME: Ei ^ Ej = ∅ for all i ≠ j • EXH: E1 v E2 v … v En = S
Independence: Intuition • Events are independent if one has nothing whatever to do with the others. Therefore, for two independent events, knowing that one happens does not change the probability that the other happens. • One toss of a coin is independent of another toss (assuming a fair coin). • The price of tea in England is independent of the result of a general election in Canada.
Independence: Definition • Events A and B are independent iff P(A, B) = P(A) · P(B), which is equivalent to P(A | B) = P(A) and P(B | A) = P(B) when P(A) > 0 and P(B) > 0. • Example: T1: the first toss is a head; T2: the second toss is a tail. Then P(T2 | T1) = P(T2).
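A small simulation of the two-toss example (illustrative only; assumes a fair coin):

```python
import random

random.seed(0)
N = 100_000
# (first toss is a head, second toss is a tail) for N independent trials
tosses = [(random.random() < 0.5, random.random() < 0.5) for _ in range(N)]

p_t2 = sum(second for _, second in tosses) / N
p_t2_given_t1 = (sum(first and second for first, second in tosses)
                 / sum(first for first, _ in tosses))
print(p_t2, p_t2_given_t1)   # both close to 0.5: P(T2 | T1) = P(T2)
```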
Conditional Independence • Dependent events can become independent given certain other events. • Example: shoe size and vocabulary size are dependent across children (both grow with age), but given a child's age they are roughly independent. • Two events A, B are conditionally independent given a third event C iff P(A | B, C) = P(A | C).
Conditional Independence: Definition • Let E1 and E2 be two events; they are conditionally independent given E iff P(E1 | E, E2) = P(E1 | E), that is, once E is known to be true, learning E2 does not change the probability of E1. • Equivalent formulations: P(E1, E2 | E) = P(E1 | E) P(E2 | E) and P(E2 | E, E1) = P(E2 | E).
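A sketch of the shoe-size/vocabulary example (the generating model below is invented purely for illustration): shoe size and vocabulary size are dependent overall, but approximately independent once age is fixed.

```python
import random

random.seed(1)
samples = []
for _ in range(50_000):
    age = random.randint(3, 12)                  # child's age in years
    shoe = 20 + 1.5 * age + random.gauss(0, 1)   # shoe size driven by age
    vocab = 500 * age + random.gauss(0, 300)     # vocabulary size driven by age
    samples.append((age, shoe, vocab))

def frac_big_vocab(rows, threshold=4000):
    """Fraction of children with vocabulary above a fixed threshold."""
    return sum(v > threshold for _, _, v in rows) / len(rows)

big_shoes = [r for r in samples if r[1] > 32]                # "large shoe size" event
print(frac_big_vocab(samples), frac_big_vocab(big_shoes))    # differ: dependent

age8 = [r for r in samples if r[0] == 8]
age8_big = [r for r in age8 if r[1] > 32]
print(frac_big_vocab(age8), frac_big_vocab(age8_big))        # close: independent given age
```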
Naïve Bayes Method • Knowledge base contains • A set of hypotheses • A set of evidences • The probability of each evidence given each hypothesis • Given • A subset of the evidences known to be present in a situation • Find • The hypothesis with the highest posterior probability P(H | E1, E2, …, Ek)
Naïve Bayes Method • Assumptions • Hypotheses are exhaustive and mutually exclusive • H1 v H2 v … v Ht • ¬ (Hi ^ Hj) for any i≠j • Evidences are conditionally independent given a hypothesis • P(E1, E2,…, Ek|H) = P(E1|H)…P(Ek|H) • P(H | E1, E2,…, Ek) = P(E1, E2,…, Ek, H)/P(E1, E2,…, Ek) = P(E1, E2,…, Ek|H)P(H)/P(E1, E2,…, Ek)
Naïve Bayes Method • The goal is to find the H that maximizes P(H | E1, E2, …, Ek) (the maximum a posteriori, MAP, hypothesis) • Since P(H | E1, E2, …, Ek) = P(E1, E2, …, Ek | H) P(H) / P(E1, E2, …, Ek), and P(E1, E2, …, Ek) is the same for all hypotheses, • Maximizing P(H | E1, E2, …, Ek) is equivalent to maximizing P(E1, E2, …, Ek | H) P(H) = P(E1 | H) … P(Ek | H) P(H) • Naïve Bayes method: find the hypothesis that maximizes P(E1 | H) … P(Ek | H) P(H)
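As an illustration of this decision rule (a sketch only; the dictionary representation of the knowledge base, the function name map_hypothesis, and the example numbers are mine):

```python
from math import prod

def map_hypothesis(priors, cond_probs, evidences):
    """Return the hypothesis H maximizing P(E1|H)...P(Ek|H)P(H).

    priors:     {hypothesis: P(H)}
    cond_probs: {hypothesis: {evidence: P(E|H)}}
    evidences:  list of observed evidences E1..Ek
    """
    def score(h):
        return priors[h] * prod(cond_probs[h].get(e, 0.0) for e in evidences)
    return max(priors, key=score)

# Two hypotheses, two observed evidences (illustrative numbers):
priors = {"H1": 0.6, "H2": 0.4}
cond = {"H1": {"E1": 0.9, "E2": 0.2}, "H2": {"E1": 0.5, "E2": 0.7}}
print(map_hypothesis(priors, cond, ["E1", "E2"]))   # H2: 0.4*0.5*0.7 = 0.14 > 0.6*0.9*0.2 = 0.108
```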
Example: Play Tennis? • Training set: 14 examples described by Outlook, Temperature, Humidity and Windy; 9 are labeled + (play) and 5 are labeled − (don't play), so H = {+, −}. • Predict playing tennis for a new day <sunny, cool, high, true>. • What probability should be used to make the prediction, and how do we compute it?
Probabilities of Individual Attributes • Given the training set, we can estimate each conditional probability by counting; for example, 2 of the 9 positive examples have Outlook = sunny, so P(sunny | +) = 2/9.
P(+ | sunny, cool, high, true)
= P(sunny, cool, high, true | +) · P(+) / P(sunny, cool, high, true)
= P(sunny | +) · P(cool | +) · P(high | +) · P(true | +) · P(+) / P(sunny, cool, high, true)
= (2/9) · (3/9) · (3/9) · (3/9) · (9/14) / P(sunny, cool, high, true)
where P(sunny, cool, high, true) = P(sunny, cool, high, true | +) · P(+) + P(sunny, cool, high, true | −) · P(−), so
P(+ | sunny, cool, high, true) = (2/9)(3/9)(3/9)(3/9)(9/14) / [ (2/9)(3/9)(3/9)(3/9)(9/14) + (3/5)(1/5)(4/5)(3/5)(5/14) ] ≈ 0.0053 / (0.0053 + 0.0206) ≈ 0.205
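The same calculation as a quick Python sketch (the script and variable names are mine; the probabilities are the ones read off the training set above):

```python
# Numerators of Bayes' rule for each class, using the counts from the training set
p_pos = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # P(sunny,cool,high,true | +) * P(+)
p_neg = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # P(sunny,cool,high,true | -) * P(-)

evidence = p_pos + p_neg                          # P(sunny, cool, high, true)
print(p_pos / evidence)                           # P(+ | sunny, cool, high, true) ~ 0.205
# The exercise on the next slide, P(- | sunny, cool, high, true), follows analogously.
```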
Exercise: compute P(− | sunny, cool, high, true).
Example 2 • Suppose we have a set of data on credit authorization with 10 training instances, each classified into one of 4 classes: • C1 =authorize • C2 =authorize after identification • C3 =do not authorize • C4 =do not authorize; call the police
Example 2 • Training data: 10 tuples, each described by two attributes — x1 (taking values 1–4) and x2 (taking values "Excellent", "Good" or "Bad") — and labeled with one of the classes C1–C4.
Example 2 • P(C1) = 6 / 10 = 0.6 • P(C2) = 2 / 10 = 0.2 • P(C3) = 1 / 10 = 0.1 • P(C4) = 1 / 10 = 0.1
Example 2 • P(x1 = 4 | C1) = P(x1 = 4 and C1) / P(C1) = 2/6 = 0.33 • P(x1 = 3 | C1) = 2/6 = 0.33 • P(x1 = 2 | C1) = 2/6 = 0.33 • P(x1 = 1 | C1) = 0/6 = 0 • Similarly, we have • P(x1 = 2 | C2) = 1/2 = 0.5 • P(x1 = 1 | C2) = 1/2 = 0.5 • P(x1 = 3 | C3) = 1/1 = 1 • P(x1 = 1 | C4) = 1/1 = 1 • All other probabilities P(x1 | Cj) = 0
Example 2 • P(x2 = “Excellent” | C1) = 3/6 = 0.5 • P(x2 = “Good” | C1) = 3/6 = 0.5 • P(x2 = “Bad” | C1) = 0/6 = 0 • Similarly, we have • P(x2 = “Good” | C2) = 1/2 = 0.5 • P(x2 = “Bad” | C2) = 1/2 = 0.5 • P(x2 = “Bad” | C3) = 1/1 = 1 • P(x2 = “Bad” | C4) = 1/1 = 1 • All other probabilities P(x2 | Cj) = 0
Example 2 • Suppose now we want to classify a tuple t = {3, "Excellent"}. We have: P(t | C1) = ∏k P(xk | C1) = P(x1 = 3 | C1) P(x2 = "Excellent" | C1) = 0.33 * 0.5 = 0.17 P(t | C2) = 0 * 0 = 0 P(t | C3) = 1 * 0 = 0 P(t | C4) = 0 * 0 = 0 • P(t) = Σj P(t | Cj) P(Cj) = 0.17 * 0.6 + 0 + 0 + 0 = 0.1
Example 2 • Then we can calculate P(Cj | t) for each class. • P(C1 | t) = P(t |C1) P(C1) / P(t) = 0.17 * 0.6 / 0.1 = 1 • P(C2 | t) =0 • P(C3 | t) =0 • P(C4 | t) =0 • Therefore, tuple t is classified as class C1 because it has the highest probability.
Example 2 • Suppose now we want to classify another tuple t = {2, "Good"}. We have: P(t | C1) = ∏k P(xk | C1) = P(x1 = 2 | C1) P(x2 = "Good" | C1) = 0.33 * 0.5 = 0.17 P(t | C2) = 0.5 * 0.5 = 0.25 P(t | C3) = 0 * 0 = 0 P(t | C4) = 0 * 0 = 0 • P(t) = Σj P(t | Cj) P(Cj) = 0.17 * 0.6 + 0.25 * 0.2 + 0 + 0 = 0.15
Example 2 • Then we can calculate P(Cj | t) for each class. • P(C1 | t) = P(t |C1) P(C1) / P(t) = 0.17 * 0.6 / 0.15 = 0.67 • P(C2 | t) = P(t |C2) P(C2) / P(t) = 0.25 * 0.2 / 0.15 = 0.33 • P(C3 | t) =0 • P(C4 | t) =0 • Therefore, tuple t is classified as class C1 because it has the highest probability.
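For reference, the whole Example 2 computation can be expressed as a short sketch using the probability tables estimated above (the dictionaries and the classify helper are illustrative, not part of the original slides):

```python
priors = {"C1": 0.6, "C2": 0.2, "C3": 0.1, "C4": 0.1}
# P(x1 | Cj) and P(x2 | Cj) as estimated from the 10 training tuples
p_x1 = {"C1": {4: 2/6, 3: 2/6, 2: 2/6}, "C2": {2: 0.5, 1: 0.5}, "C3": {3: 1.0}, "C4": {1: 1.0}}
p_x2 = {"C1": {"Excellent": 0.5, "Good": 0.5}, "C2": {"Good": 0.5, "Bad": 0.5},
        "C3": {"Bad": 1.0}, "C4": {"Bad": 1.0}}

def classify(x1, x2):
    """Return P(Cj | t) for the tuple t = {x1, x2} (assumes some class has nonzero probability)."""
    joint = {c: priors[c] * p_x1[c].get(x1, 0.0) * p_x2[c].get(x2, 0.0) for c in priors}
    total = sum(joint.values())                      # P(t)
    return {c: v / total for c, v in joint.items()}  # P(Cj | t)

print(classify(3, "Excellent"))   # C1 gets probability 1
print(classify(2, "Good"))        # C1 ~ 0.67, C2 ~ 0.33
```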
Application: Spam Detection • Spam example: "Dear sir, We want to transfer to overseas ($ 126,000.000.00 USD) One hundred and Twenty six million United States Dollars) from a Bank in Africa, I want to ask you to quietly look for a reliable and honest person who will be capable and fit to provide either an existing ……" • Legitimate email • Ham: for lack of a better name.
Example 3 • Hypotheses: {Spam, Ham} • Evidence: a document • The document is treated as a set (or bag) of words • Knowledge • P(Spam): the prior probability that an e-mail message is spam. How can we estimate this probability? • P(w | Spam): the probability that a word drawn from a spam message is w. How can we estimate this probability?
Training: Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
    Let Di be the subset of documents in D in category ci
    P(ci) = |Di| / |D|
    Let Ti be the concatenation of all the documents in Di
    Let ni be the total number of word occurrences in Ti
    For each word wj ∈ V
        Let nij be the number of occurrences of wj in Ti
        Let P(wj | ci) = (nij + 1) / (ni + |V|)
Classification: Given a test document X
    Let n be the number of word occurrences in X
    Return the category c = argmax over ci ∈ C of P(ci) ∏ (i = 1..n) P(ai | ci)
    where ai is the word occurring in the ith position in X
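A runnable sketch of this pseudocode, assuming the multinomial bag-of-words model with Laplace smoothing described above; the toy corpus and function names are invented for illustration:

```python
from collections import Counter
from math import log

def train(docs):
    """docs: list of (list_of_words, category). Returns priors and smoothed word probabilities."""
    vocab = {w for words, _ in docs for w in words}
    cats = {c for _, c in docs}
    priors = {c: sum(cat == c for _, cat in docs) / len(docs) for c in cats}
    word_probs = {}
    for c in cats:
        counts = Counter(w for words, cat in docs if cat == c for w in words)
        n_c = sum(counts.values())
        word_probs[c] = {w: (counts[w] + 1) / (n_c + len(vocab)) for w in vocab}  # Laplace smoothing
    return priors, word_probs

def classify(priors, word_probs, words):
    """Return the category maximizing log P(c) + sum_i log P(a_i | c); out-of-vocabulary words are skipped."""
    def score(c):
        return log(priors[c]) + sum(log(word_probs[c][w]) for w in words if w in word_probs[c])
    return max(priors, key=score)

docs = [("win money now".split(), "spam"),
        ("meeting agenda attached".split(), "ham"),
        ("win a free prize now".split(), "spam"),
        ("project meeting tomorrow".split(), "ham")]
priors, word_probs = train(docs)
print(classify(priors, word_probs, "free money".split()))   # spam
```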
Minimum Description Length (MDL) • Occam's razor: "prefer the shortest (simplest) hypothesis" • MDL: prefer the hypothesis h that minimizes L_C1(h) + L_C2(D | h)
Minimum Description Length • h_MDL = argmin over h of L_C1(h) + L_C2(D | h), where C1 is the optimal encoding for the hypothesis h and C2 is the optimal encoding for the data D given h.
To explain this, we first need a basic result from information theory: • Imagine designing a code for messages transmitted at random, where message i occurs with probability pi. • We want the most compact code, i.e., the code that minimizes the expected number of bits per transmitted message. • To minimize the expected code length, more probable messages must be assigned shorter codes. • Shannon & Weaver (1949) showed that the optimal code (the one minimizing the expected message length) assigns −log2 pi bits to message i. • Example: in the string "cbabcccd" the symbol frequencies are c: 4/8, b: 2/8, a: 1/8, d: 1/8, so the optimal code lengths are 1, 2, 3 and 3 bits respectively.
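A quick check of the "cbabcccd" example (a sketch; the script is mine):

```python
from collections import Counter
from math import log2

msg = "cbabcccd"
counts = Counter(msg)
n = len(msg)
for sym, c in sorted(counts.items()):
    p = c / n
    print(sym, p, -log2(p))   # optimal code length in bits: a 3, b 2, c 1, d 3
```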
Summary • Conditional independence • Bayes' rule • Rule of succession • Naïve Bayes method • Maximum a posteriori (MAP) hypothesis • Minimum description length • Next lecture: Bayesian networks
Bayesian networks • Also known as probabilistic networks, causal networks, or belief networks (different names for the same model).
Probabilistic Belief • There are several possible worlds that are indistinguishable to an agent given some prior evidence. • The agent believes that a logic sentence B is True with probability p and False with probability 1 − p. B is called a belief. • In the frequency interpretation of probabilities, this means that the agent believes that the fraction of possible worlds that satisfy B is p. • The distribution (p, 1 − p) is the strength of B.
Bayesian Networks: Definition • Bayesian networks are directed acyclic graphs (DAGs). • Nodes in Bayesian networks represent random variables, which are normally assumed to take on discrete values. • The links of the network represent direct probabilistic influence. • The structure of the network represents the probabilistic dependence/independence relationships between the random variables represented by the nodes.
Bayesian Network: Probabilities • The nodes and links are quantified with probability distributions. • The root nodes (those with no parents) are assigned prior probability distributions. • Each other node is assigned the conditional probability distribution of the node given its parents.
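As a preview sketch (the two-node Rain → WetGrass network and its numbers are made up for illustration), this shows how a prior at the root and a conditional probability table at the child define the joint distribution:

```python
# Root node Rain has a prior; WetGrass has a CPT conditioned on its parent Rain.
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {True: {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}   # P(WetGrass | Rain)

# The network defines the joint distribution P(Rain, WetGrass) = P(Rain) * P(WetGrass | Rain)
joint = {(r, w): p_rain[r] * p_wet_given_rain[r][w]
         for r in (True, False) for w in (True, False)}
print(sum(joint.values()))   # 1.0: a proper joint distribution
print(joint[(True, True)])   # P(Rain = T, WetGrass = T) = 0.18
```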