Beyond MARLAP: New Statistical Tests For Method Validation. NAREL – ORIA – US EPA Laboratory Incident Response Workshop At the 53rd Annual RRMC. Outline. The method validation problem MARLAP’s test And its peculiar features New approach – testing mean squared error (MSE)
Beyond MARLAP: New Statistical Tests For Method Validation NAREL – ORIA – US EPA Laboratory Incident Response Workshop At the 53rd Annual RRMC
Outline • The method validation problem • MARLAP’s test • And its peculiar features • New approach – testing mean squared error (MSE) • Two possible tests of MSE • Chi-squared test • Likelihood ratio test • Power comparisons • Recommendations and implications for MARLAP
The Problem • We’ve prepared spiked samples at one or more activity levels • A lab has performed one or more analyses of the samples at each level • Our task: Evaluate the results to see whether the lab and method can achieve the required uncertainty (uReq) at each level
MARLAP’s Test • In 2003 the MARLAP work group developed a simple test for MARLAP Chapter 6 • Chose a very simple criterion • Original criterion was whether every result was within ±3uReq of the target • Modified slightly to keep false rejection rate ≤ 5 % in all cases
Equations • Acceptance range is TV ± k·uReq where • TV = target value (true value) • uReq = required uncertainty at TV, and • $k = z_{(1 + (1-\alpha)^{1/n})/2}$, a quantile of the standard normal distribution, with n the total number of measurements • E.g., for n = 21 measurements (7 reps at each of 3 levels), with α = 0.05, we get k = z0.99878 = 3.03 • For smaller n we get slightly smaller k
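The multiplier k can be checked numerically. The following is a minimal sketch, assuming k is the standard normal quantile at probability (1 + (1 − α)^(1/n))/2, which reproduces the z0.99878 = 3.03 quoted for n = 21. The function name `marlap_k` is illustrative and does not come from MARLAP.

```python
from statistics import NormalDist

def marlap_k(n, alpha=0.05):
    """Multiplier k such that n independent, unbiased, normally distributed
    results with standard deviation uReq all fall within TV +/- k*uReq
    with probability 1 - alpha (assumed form of the MARLAP multiplier)."""
    p = (1.0 + (1.0 - alpha) ** (1.0 / n)) / 2.0
    return NormalDist().inv_cdf(p)

# k grows slowly with the number of measurements.
for n in (1, 7, 21):
    print(n, round(marlap_k(n), 2))
```

Printing k for several n shows the acceptance range widening as more measurements are made.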
Required Uncertainty • The required uncertainty, uReq, is a function of the target value: $u_{Req} = u_{MR}$ if TV ≤ UBGR, and $u_{Req} = \varphi_{MR} \cdot TV$ if TV > UBGR • Where uMR is the required method uncertainty at the upper bound of the gray region (UBGR) • φMR = uMR / UBGR is the corresponding relative method uncertainty
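A small sketch of this piecewise rule, assuming uReq is the absolute value uMR at or below the UBGR and the relative value φMR × TV above it (consistent with the 10 % of 300 pCi/L used in the worked example). The function name `required_uncertainty` is hypothetical.

```python
def required_uncertainty(tv, u_mr, ubgr):
    """Required uncertainty uReq at target value tv, in the units of u_mr.

    Assumed rule: absolute uncertainty u_mr up to the UBGR, and the relative
    uncertainty phi_mr = u_mr / ubgr applied to tv above it."""
    if tv <= ubgr:
        return u_mr
    return u_mr * tv / ubgr  # same as phi_mr * tv

# Levels from the level D example: UBGR = 100 pCi/L, uMR = 10 pCi/L.
for tv in (50.0, 100.0, 300.0):
    print(tv, required_uncertainty(tv, u_mr=10.0, ubgr=100.0))
```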
Alternatives • We considered a chi-squared (χ2) test as an alternative in 2003 • Accounted for uncertainty of target values using “effective degrees of freedom” • Rejected at the time because of complexity and lack of evidence for performance • Kept the simple test that now appears in MARLAP Chapter 6 • But we didn’t forget about the χ2 test
Peculiarity of MARLAP’s Test • Power to reject a biased but precise method decreases with number of analyses performed (n) • Because we adjusted the acceptance limits to keep false rejection rates low • Acceptance range gets wider as n gets larger
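The loss of power can be illustrated with a short calculation. This is a sketch under stated assumptions: results are independent and normal, every result shares the same bias and standard deviation (both expressed in units of uReq), and k is the normal quantile at (1 + (1 − α)^(1/n))/2. The names `marlap_k` and `marlap_power` are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def marlap_k(n, alpha=0.05):
    """Assumed MARLAP multiplier for n measurements."""
    return STD_NORMAL.inv_cdf((1.0 + (1.0 - alpha) ** (1.0 / n)) / 2.0)

def marlap_power(n, bias, sigma, alpha=0.05):
    """Probability that at least one of n results falls outside TV +/- k*uReq.
    bias and sigma are in units of uReq; sigma must be positive."""
    k = marlap_k(n, alpha)
    p_inside = STD_NORMAL.cdf((k - bias) / sigma) - STD_NORMAL.cdf((-k - bias) / sigma)
    return 1.0 - p_inside ** n

# A precise method (sigma = 0.1 uReq) with a large bias (2.5 uReq).
for n in (1, 3, 7, 21):
    print(n, round(marlap_k(n), 2), marlap_power(n, bias=2.5, sigma=0.1))
```

In this scenario the bias sits between the acceptance limits for small n and large n, so rejection is nearly certain with one measurement and nearly impossible with 21.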
Biased but Precise This graphic image was borrowed and edited for the RRMC workshop presentation. Please view the original now at despair.com. http://www.despair.com/consistency.html
Best Use of Data? • It isn’t just about bias • MARLAP’s test uses data inefficiently – even to evaluate precision alone (its original purpose) • The statistic – in effect – is just the worst normalized deviation from the target value • Wastes a lot of useful information
Example: The MARLAP Test • Suppose we perform a level D method validation experiment • UBGR = AL = 100 pCi/L • uMR = 10 pCi/L • φMR= 10/100 = 0.10, or 10 % • Three activity levels (L = 3) • 50 pCi/L, 100 pCi/L, and 300 pCi/L • Seven replicates per level (N = 7) • Allow 5 % false rejections (α = 0.05)
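Example (continued) • For 21 measurements, calculate $k = z_{(1 + 0.95^{1/21})/2} = z_{0.99878} \approx 3.0$ • When evaluating measurement results for target value TV, require for each result Xj: $TV - k\,u_{Req} \le X_j \le TV + k\,u_{Req}$ • Equivalently, require $|X_j - TV| / u_{Req} \le k$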
Example (continued) • We’ll work through calculations at just one target value • Say TV = 300 pCi/L • This value is greater than UBGR (100 pCi/L) • So, the required uncertainty is 10 % of 300 pCi/L • uReq = 30 pCi/L
Example (continued) • Suppose the lab produces 7 results Xj shown at the right • For each result, calculate the “Z score” $Z_j = (X_j - TV) / u_{Req}$ • We require |Zj| ≤ 3.0 for each j
Example (continued) • Every |Zj| is less than 3.0 • The method is obviously biased (~15 % low) • But it passes the MARLAP test
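The check in this example can be sketched in a few lines. The seven results below are hypothetical values chosen for illustration (they average 15 % below the 300 pCi/L target); they are not the data from the slide, and `marlap_test` is an illustrative name.

```python
def marlap_test(results, tv, u_req, k):
    """Return (accepted, z_scores); accept only if every |Z_j| <= k."""
    z_scores = [(x - tv) / u_req for x in results]
    return all(abs(z) <= k for z in z_scores), z_scores

# Hypothetical results in pCi/L (illustrative only, not the slide's data).
results = [240.0, 275.0, 250.0, 265.0, 235.0, 270.0, 250.0]
accepted, z = marlap_test(results, tv=300.0, u_req=30.0, k=3.0)
mean_bias = sum(results) / len(results) - 300.0

print(accepted)                    # every |Z_j| is within the limit
print(mean_bias)                   # average offset from the target, pCi/L
print([round(v, 2) for v in z])
```

Only the single largest |Zj| decides the outcome, so a consistent offset well inside the limits goes undetected.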
2007 • In early 2007 we were developing the new method validation guide • Applying MARLAP guidance, including the simple test of Chapter 6 • Someone suggested presenting power curves in the context of bias • Time had come to reconsider MARLAP’s simple test
Bias and Imprecision • Which is worse: bias or imprecision? • Either leads to inaccuracy • Both are tolerable if not too large • When we talk about uncertainty (à la GUM), we don’t distinguish between the two
Mean Squared Error • When characterizing a method, we often consider bias and imprecision separately • Uncertainty estimates combine them • There is a concept in statistics that also combines them: mean squared error
Definition of MSE • If X is an estimator for a parameter θ, the mean squared error of X is • $\mathrm{MSE}(X) = E((X - \theta)^2)$ by definition • It also equals • $\mathrm{MSE}(X) = V(X) + \mathrm{Bias}(X)^2 = \sigma^2 + \delta^2$ • If X is unbiased, $\mathrm{MSE}(X) = V(X) = \sigma^2$ • We tend to think in terms of the root MSE, which is the square root of MSE
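The decomposition follows in one step by adding and subtracting the mean of X. In the sketch below, μ = E(X) is a symbol introduced here for the derivation, so that δ = μ − θ and σ² = E((X − μ)²); the cross term vanishes because E(X − μ) = 0.

```latex
\begin{aligned}
\mathrm{MSE}(X) &= E\big((X-\theta)^2\big)
                 = E\big(((X-\mu) + (\mu-\theta))^2\big) \\
                &= E\big((X-\mu)^2\big) + 2(\mu-\theta)\,E(X-\mu) + (\mu-\theta)^2 \\
                &= \sigma^2 + \delta^2
\end{aligned}
```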
New Approach • For the method validation guide we chose a new conceptual approach: • A method is adequate if its root MSE at each activity level does not exceed the required uncertainty at that level • We don’t care whether the MSE is dominated by bias or imprecision
Root MSE v. Standard Uncertainty • Are root MSE and standard uncertainty really the same thing? • Not exactly, but one can interpret the GUM’s treatment of uncertainty in such a way that the two are closely related • We think our approach – testing uncertainty by testing MSE – is reasonable
Chi-squared Test Revisited • For the new method validation document we simplified the χ2 test proposed (and rejected) in 2003 • Ignore uncertainties of target values, which should be small • Just use a straightforward χ2 test • Presented as an alternative in App. E • But the document still uses MARLAP’s simple test
The Two Hypotheses • We’re now explicitly testing the MSE • Null hypothesis (H0): $\delta^2 + \sigma^2 \le u_{Req}^2$ • Alternative hypothesis (H1): $\delta^2 + \sigma^2 > u_{Req}^2$ • In MARLAP the two hypotheses were not clearly stated • Assumed any bias (δ) would be small • We were mainly testing variance (σ2)
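A χ2 Test for Variance • Imagine we really tested variance only • H0: $\sigma^2 \le u_{Req}^2$ • H1: $\sigma^2 > u_{Req}^2$ • We could calculate a χ2 statistic: $\chi^2 = \sum_{j=1}^{N} (X_j - \bar{X})^2 / u_{Req}^2$, where $\bar{X}$ is the mean of the N results • Chi-squared with N − 1 degrees of freedom • Presumes there may be bias but doesn’t test for it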
MLE for Variance • The maximum-likelihood estimator (MLE) for σ2 when the mean is unknown is: $\hat{\sigma}^2 = \frac{1}{N} \sum_{j=1}^{N} (X_j - \bar{X})^2$ • Notice similarity to χ2 from preceding slide
Another χ2 Test for Variance • We could calculate a different χ2 statistic: $\chi^2 = \sum_{j=1}^{N} (X_j - TV)^2 / u_{Req}^2$ • N degrees of freedom • Can be used to test variance if there is no bias • Any bias increases the rejection rate
MLE for MSE • The MLE for the MSE is: $\widehat{\mathrm{MSE}} = \frac{1}{N} \sum_{j=1}^{N} (X_j - TV)^2$ • Notice similarity to χ2 from preceding slide • In the context of biased measurements, χ2 seems to assess MSE rather than variance
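A standard algebraic identity makes the link explicit. With X̄ denoting the mean of the N results, the sum of squared deviations from the target value splits into a scatter term and an offset term, so a statistic built on deviations from TV responds to both imprecision and bias:

```latex
\sum_{j=1}^{N} (X_j - TV)^2
  = \sum_{j=1}^{N} (X_j - \bar{X})^2 + N\,(\bar{X} - TV)^2
```

Dividing both sides by N gives the estimated MSE as the estimated variance plus the squared estimated bias.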
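Our Proposed χ2 Test for MSE • For a given activity level (TV), calculate a χ2 statistic W: $W = \sum_{j=1}^{N} (X_j - TV)^2 / u_{Req}^2$ • Calculate the critical value of W as follows: $w_C = \chi^2_{1-\alpha}(N)$, the (1 − α)-quantile of the χ2 distribution with N degrees of freedom • N = number of replicate measurements • α = max false rejection rate at this level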
Multiple Activity Levels • When testing at more than one activity level, calculate the critical value as follows: $w_C = \chi^2_{(1-\alpha)^{1/L}}(N)$ • Where L is the number of levels and N is the number of measurements at each level • Now α is the maximum overall false rejection rate
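The critical value can be computed with the standard library alone. This is a sketch under an assumption: wC is taken to be the (1 − α)^(1/L) quantile of the χ2 distribution with N degrees of freedom, which gives about 17.1 for L = 3, N = 7, α = 0.05, the value used later in the deck. The helper names are illustrative; where SciPy is available, `scipy.stats.chi2.ppf` would be the usual choice.

```python
import math

def chi2_cdf(x, dof):
    """Chi-squared CDF via the power series for the regularized lower
    incomplete gamma function P(dof/2, x/2)."""
    if x <= 0.0:
        return 0.0
    a, z = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def chi2_ppf(p, dof):
    """Chi-squared quantile for 0 < p < 1, found by bisection on the CDF."""
    lo, hi = 0.0, float(dof)
    while chi2_cdf(hi, dof) < p:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def critical_value(alpha, n_reps, n_levels=1):
    """Assumed form of w_C: per-level quantile that keeps the overall
    false rejection rate at alpha across n_levels independent levels."""
    return chi2_ppf((1.0 - alpha) ** (1.0 / n_levels), n_reps)

print(round(critical_value(0.05, n_reps=7, n_levels=3), 1))
```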
Evaluation Criteria • To perform the test, calculate Wi at each activity level TVi • Compare each Wi to wC • If Wi > wC for any i, reject the method • The method must pass the test at each spike activity level • Don’t allow bad performance at one level just because of good performance at another
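The evaluation rule can be sketched as follows, assuming W at each level is the sum of squared deviations from the target value divided by uReq². All measurement values are hypothetical and chosen only for illustration, the critical value 17.1 is the one quoted in the deck for L = 3 and N = 7, and the function names are illustrative.

```python
def chi2_statistic(results, tv, u_req):
    """W = sum of squared deviations from the target value, in units of uReq^2."""
    return sum(((x - tv) / u_req) ** 2 for x in results)

def evaluate_method(levels, w_c):
    """levels maps TV -> (uReq, results).
    Return (accepted, W per level); reject if any W_i exceeds w_c."""
    w = {tv: chi2_statistic(res, tv, u_req) for tv, (u_req, res) in levels.items()}
    return all(w_i <= w_c for w_i in w.values()), w

# Hypothetical data in pCi/L (illustrative only).
levels = {
    50.0: (10.0, [48.0, 55.0, 41.0, 52.0, 60.0, 45.0, 50.0]),
    100.0: (10.0, [95.0, 108.0, 102.0, 91.0, 110.0, 99.0, 104.0]),
    300.0: (30.0, [240.0, 275.0, 250.0, 265.0, 235.0, 270.0, 250.0]),
}
accepted, w = evaluate_method(levels, w_c=17.1)
print(accepted, {tv: round(w_i, 2) for tv, w_i in w.items()})
```

In this illustration the two lower levels pass easily, but the biased results at the highest level push W above the critical value, so the method as a whole is rejected.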
Lesson Learned • Don’t test at too many levels • Otherwise you must choose: • High false acceptance rate at each level, • High overall false rejection rate, or • Complicated evaluation criteria • Prefer to keep error rates low • Need a low level and a high level • But probably not more than three levels (L=3)
Better Use of Same Data • The χ2 test makes better use of the measurement data than the MARLAP test • The statistic W is calculated from all the data at a given level – not just the most extreme value
Caveat • The distribution of W is not completely determined by the MSE • Depends on how MSE is partitioned into variance and bias components • Our test looks like a test of variance • As if we know δ = 0 and we’re testing σ2 only • But we’re actually using it to test MSE
False Rejections • If wC < N, the maximum false rejection rate (100 %) occurs when δ = ±uReq and σ = 0 • But you’ll never have this situation in practice • If wC ≥ N + 2, the maximum false rejection rate occurs when σ = uReq and δ = 0 • This is the usual situation • Why we can assume the null distribution is χ2 • Otherwise the maximum false rejection rate occurs when both δ and σ are nonzero • This situation is unlikely in practice
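A small Monte Carlo sketch can illustrate how the rejection rate depends on the split between bias and variance when the root MSE sits exactly at uReq. The assumptions are labeled here and in the code: results are independent and normal, uReq is scaled to 1, N = 7, and the critical value 17.1 is the one quoted in the deck. The function name and the `bias_frac` parameter are hypothetical.

```python
import math
import random

def rejection_rate(bias_frac, n_reps=7, w_c=17.1, trials=20000, seed=1):
    """Fraction of simulated experiments with W > w_c when MSE = uReq^2 = 1.
    bias_frac = delta^2 / MSE is the share of the MSE due to bias."""
    rng = random.Random(seed)
    delta = math.sqrt(bias_frac)
    sigma = math.sqrt(1.0 - bias_frac)
    rejects = 0
    for _ in range(trials):
        # Each deviation from TV is bias plus normal noise, in units of uReq.
        w = sum((delta + sigma * rng.gauss(0.0, 1.0)) ** 2 for _ in range(n_reps))
        rejects += w > w_c
    return rejects / trials

for frac in (0.0, 0.5, 1.0):
    print(frac, rejection_rate(frac))
```

With all of the MSE in bias, W equals N exactly and the method is never rejected; with all of it in variance, the simulated rate should be close to the per-level significance level.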
To Avoid High Rejection Rates • We must have wC ≥ N + 2 • This will always be true if α < 0.08, even if L = N = 1 • Ensures the maximum false rejection rate occurs when δ = 0 and the MSE is just σ2 • Not stated explicitly in App. E, because: • We didn’t have a proof at the time • Not an issue if you follow the procedure • Now we have a proof
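The α < 0.08 threshold can be checked for the smallest case, L = N = 1. The sketch relies on the fact that a χ2 quantile with one degree of freedom is the square of a standard normal quantile; it only checks this one case and is not the proof mentioned above. The function name is illustrative.

```python
from statistics import NormalDist

def critical_value_one_measurement(alpha):
    """w_C for L = N = 1: the (1 - alpha) quantile of chi-squared with one
    degree of freedom, i.e. the squared normal quantile at 1 - alpha/2."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2

n = 1
for alpha in (0.05, 0.08, 0.09):
    w_c = critical_value_one_measurement(alpha)
    print(alpha, round(w_c, 3), w_c >= n + 2)
```

The condition wC ≥ N + 2 holds at α = 0.08 and fails at α = 0.09 in this case.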
Example: Critical Value • Suppose L = 3 and N = 7 • Let α = 0.05 • Then the critical value for W is $w_C = \chi^2_{0.95^{1/3}}(7) = \chi^2_{0.9830}(7) \approx 17.1$ • Since wC ≥ N + 2 = 9, we won’t have unexpectedly high false rejection rates • Since α < 0.08, we didn’t really have to check
Some Facts about the Power • The power always increases with |δ| • The power increases with σ if or if • For a given bias δ with , there is a positive value of σ that minimizes the power • If , even this minimum power exceeds 50 % • Power increases with N
Power Comparisons • We compared the tests for power • Power to reject a biased method • Power to reject an imprecise method • The χ2 test outperforms the simple MARLAP test on both counts • Results of comparisons at end of this presentation
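False Rejection Rates [Figure: regions labeled H0 and H1, with annotations “Rejection rate = α”, “Rejection rate < α”, and “Rejection rate = 0”]
Region of Low Power [Figure: regions labeled H0 and H1, with annotation “Rejection rate = α”]
Region of Low Power (MARLAP) [Figure: regions labeled H0 and H1, with annotation “Rejection rate = α”]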
Example: Applying the χ2 Test • Return to the scenario used earlier for the MARLAP example • Three levels (L = 3) • Seven measurements per level (N = 7) • 5 % overall false rejection rate (α = 0.05) • Consider results at just one level, TV = 300 pCi/L, where uReq = 30 pCi/L
Example (continued) • Reuse the data from our earlier example • Calculate the χ2 statistic: $W = \sum_{j=1}^{7} (X_j - 300)^2 / 30^2 = 17.4$ • Since W > wC (17.4 > 17.1), the method is rejected • We’re using all the data now – not just the worst result
Likelihood Ratio Test for MSE • We also discovered a statistical test published in 1999, which directly addressed MSE for analytical methods • By Danish authors Erik Holst and Poul Thyregod • It’s a “likelihood ratio” test, which is a common, well accepted approach to hypothesis testing
Likelihood Ratio Tests • To test a hypothesis about a parameter θ, such as the MSE • First find a likelihood function L(θ), which tells how “likely” a value of θ is, given the observed experimental data • Based on the probability mass function or probability density function for the data
Test Statistic • Maximize L(θ) on all possible values of θ and again on all values of θ that satisfy the null hypothesis H0 • Can use the ratio of these two maxima, Λ, as a test statistic • The authors actually use λ = −2 ln(Λ) as the statistic for testing MSE
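In generic form, and assuming the usual convention that the maximum under the null hypothesis goes in the numerator, the two quantities can be written as below. This is the textbook definition, not the specific formula Holst and Thyregod derive for MSE.

```latex
\Lambda = \frac{\max_{\theta \in H_0} L(\theta)}{\max_{\theta} L(\theta)} \le 1,
\qquad
\lambda = -2 \ln \Lambda \ge 0
```

Under this convention a small Λ, or equivalently a large λ, means the data are much better explained outside H0 and counts as evidence against the null hypothesis.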
Critical Values • It isn’t simple to derive equations for λ, or to calculate percentiles of its distribution, but Holst and Thyregod did both • They used numerical integration to approximate percentiles of λ, which serve as critical values
Equations • For the two-sided test statistic, λ: • Where is the unique real root of the cubic polynomial • See Holst & Thyregod for details
One-Sided Test • We actually need the one-sided test statistic: • This is equivalent to:
Issues • The distribution of either λ or λ* is not completely determined by the MSE • Under H0, the percentiles λ1−α and λ*1−α are maximized when σ → 0 and |δ| → uReq • To ensure the false rejection rate never exceeds α, use the maximum value of the percentile as the critical value • Apparently we improved on the authors’ method of calculating this maximum