In the first part of the series, I provided three illustrative examples of choosing the level of significance. In this second part, I demonstrate how the problem can be approached analytically by minimizing the expected loss of hypothesis testing. The method follows the proposal of Edward Leamer, as detailed in his 1978 book. I also provide an example using the R package OptSig.
Expected Loss of Hypothesis Testing
The problem is based on the following factors of hypothesis testing for the null hypothesis H0:
α: the probability of Type I error (the level of significance);
β: the probability of Type II error;
L1: loss from Type I error;
L2: loss from Type II error;
P: the probability that H0 is true (prior).
For the precise definitions of Type I and II errors, please refer to a basic statistics book or the first part of the series.
The prior P is determined by prior information (evidence documented in previous studies and/or relevant theories) or by subjective assessment (including expert opinion or consensus in the profession). It should not be influenced by sample information, as it must be determined before the researcher observes the data. If the researcher a priori believes that the null and alternative hypotheses are equally likely, then the value of P is set at 0.5. The losses L1 and L2 are determined based on the context of hypothesis testing, free from sample information.
The expected loss (EL) of hypothesis testing is defined as
EL(α; P, L1, L2) = PαL1 + (1-P)β(α)L2.
It is a function of α, given the values of P, L1, and L2. Here, the probability of Type II error β(α) is expressed as a function of α, because it is determined when the researcher chooses the level of significance α.
To simplify further, we can express the expected loss in terms of the relative loss k ≡ L2/L1. Without loss of generality, we can set L1 = 1 so that k = L2, and the expected loss becomes
EL(α; P, k) = Pα + (1-P)β(α)k.
The value of k greater (less) than 1 indicates that the loss from Type II (I) error is larger than that of Type I (II) error, while k = 1 means that the two losses are identical or the researcher is indifferent between the two.
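The reduction to the relative loss k can be checked numerically. The sketch below is illustrative only: the function names and the sample values of α, β, L1 and L2 are assumptions made for this example, not part of any package. It shows that the expected loss written with L1 and L2 equals L1 times the version written with k, so both are minimized at the same α.

```r
# Expected loss with both losses spelled out: EL = P*alpha*L1 + (1 - P)*beta*L2
expected_loss_full <- function(alpha, beta, P, L1, L2) {
  P * alpha * L1 + (1 - P) * beta * L2
}

# Expected loss after normalising L1 = 1, so that k = L2 / L1
expected_loss_rel <- function(alpha, beta, P, k) {
  P * alpha + (1 - P) * beta * k
}

# Hypothetical inputs chosen only for illustration
alpha <- 0.05; beta <- 0.50; P <- 0.5
L1 <- 2; L2 <- 6
k <- L2 / L1

full <- expected_loss_full(alpha, beta, P, L1, L2)
rel  <- expected_loss_rel(alpha, beta, P, k)

# The ratio full / rel equals L1, whatever alpha and beta are
c(full = full, rel = rel, ratio = full / rel)
```

Because the two versions differ only by the positive constant L1, nothing is lost by working with k alone.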
Line of Enlightened Judgement
The two error probabilities α and β are linked by a trade-off. Given the sample size and the value of the parameter under H1, a higher (lower) value of α leads to a lower (higher) value of β. Consider, as a simple example, the test for the population mean μ, under the assumption of random sampling from a normal population with known variance 1:
H0: μ = 0; H1: μ > 0.
In the plot of the two sampling distributions, the black curve is the sampling distribution under H0: μ = 0 and the red curve is the sampling distribution under H1 where μ = 0.5. The grey-shaded area under the black curve represents α, while the red-shaded area under the red curve represents β. A lower value of α (a higher critical value) leads to a higher value of β, and vice versa. This trade-off can be represented as the following plot:
Figure 1: The Line of Enlightened Judgement
The black line is called the line of enlightened judgement, and it represents all possible combinations of (α,β), given the sample size and the value of μ under H1. If the sample size is larger, the value of β gets smaller (higher power), for all values of α; and the black line shifts towards the origin. In the next subsection, the points corresponding to the four red dots on the line of enlightened judgement in Figure 1 will be used as examples.
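The trade-off can be traced numerically for the test above. The following is a minimal sketch under stated assumptions: the function name `beta_of_alpha` is made up for this example, and the sample size n = 10 is assumed because the text does not report the one used for Figure 1, so the values need not match the figure.

```r
# Type II error probability for the one-sided z-test of H0: mu = 0 against
# H1: mu = mu1 > 0, with known variance 1.
# mu1 = 0.5 is taken from the text; n = 10 is an assumed sample size.
beta_of_alpha <- function(alpha, mu1 = 0.5, n = 10) {
  crit <- qnorm(alpha, lower.tail = FALSE)  # critical value on the z scale
  pnorm(crit - mu1 * sqrt(n))               # P(fail to reject H0 | mu = mu1)
}

# Points on the line of enlightened judgement
alpha_grid <- c(0.01, 0.05, 0.10, 0.20, 0.50)
data.frame(alpha = alpha_grid, beta = beta_of_alpha(alpha_grid))

# A larger sample lowers beta at the same alpha
c(n10 = beta_of_alpha(0.05, n = 10), n40 = beta_of_alpha(0.05, n = 40))
```

Each row of the data frame is one (α, β) combination available to the researcher, and the second call shows the line moving towards the origin as n grows.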
Optimal Level of Significance
The key point is that there is no reason to believe that setting α = 0.05 (or 0.01 or 0.10) is optimal by any criterion, among all possible combinations of (α,β). An optimal solution can be achieved by finding the value of α, which minimizes the expected loss of hypothesis testing. That is, find the value of α such that the function
EL(α; P, k) = Pα + (1-P)β(α)k
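is minimized. Differentiating with respect to α and setting the derivative to zero gives P + (1-P)k(dβ/dα) = 0, so the first-order condition of minimization yields
dβ/dα = -P / [(1-P)k], or equivalently dα/dβ = -(1-P)k / P,
which indicates the slope of the line of enlightened judgement at the point of minimization. That is, the optimal values (α*, β*) are determined where the slope of the line is equal to the above. The slopes quoted below for k = 10 and k = 0.01 correspond to the second form, dα/dβ.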
Suppose (P = 0.5, k = 1), which represents the case where the researcher believes H0 and H1 are equally likely, and the losses from Type I and II errors are identical. The optimal point corresponds to the slope of -1, where (α*, β*) = (0.20, 0.23) in Figure 1.
Suppose (P = 0.5, k = 10), which represents the case where the researcher believes H0 and H1 are equally likely, but the loss from Type II error is ten times higher than that of Type I error. The optimal point corresponds to the slope dα/dβ = -(1-P)k/P = -10 (equivalently, dβ/dα = -0.1), where (α*, β*) = (0.60, 0.01) in Figure 1. Since Type II error is much more costly than Type I error, the researcher should set the level of significance so that the probability of Type II error is low.
Suppose (P = 0.9, k = 1), which represents the case where the researcher strongly believes that H0 is true, while the losses from Type I and II errors are equal. The optimal point corresponds to the slope dα/dβ = -(1-P)k/P ≈ -0.11, where (α*, β*) = (0.05, 0.58) in Figure 1. Since the researcher believes that H1 is highly unlikely, Type I error carries most of the weight in the expected loss, so she sets a low value for the probability of Type I error. That is, with equal losses, a conventional level such as 0.05 is optimal only when the researcher strongly favours H0 in her prior belief.
Suppose (P = 0.5, k = 0.01), which yields the slope dα/dβ = -(1-P)k/P = -0.01 (equivalently, dβ/dα = -100). This may be the situation in a criminal court of law, where the loss from Type I error is substantially larger than that from Type II error, as discussed in the first part of the series. The optimal level of significance (α*) is 0.001 in Figure 1, which corresponds to “beyond reasonable doubt” as the burden of proof.
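The minimization can be sketched with base R for the normal-mean test used in this section. Everything named here (`beta_of_alpha`, `optimal_alpha`, the assumed sample size n = 10) is hypothetical and chosen for illustration; since the sample size behind Figure 1 is not stated, the optimal values printed by this code need not coincide with the red dots in the figure. The slope is reported as dβ/dα, whose reciprocal is dα/dβ.

```r
# Type II error for the one-sided z-test of H0: mu = 0 against mu = mu1 (variance 1).
# mu1 = 0.5 comes from the text; n = 10 is an assumed sample size.
beta_of_alpha <- function(alpha, mu1 = 0.5, n = 10) {
  pnorm(qnorm(alpha, lower.tail = FALSE) - mu1 * sqrt(n))
}

# Minimise EL(alpha) = P*alpha + (1 - P)*k*beta(alpha) over alpha in (0, 1)
optimal_alpha <- function(P, k, mu1 = 0.5, n = 10) {
  el <- function(alpha) P * alpha + (1 - P) * k * beta_of_alpha(alpha, mu1, n)
  a <- optimize(el, interval = c(1e-8, 1 - 1e-8), tol = 1e-10)$minimum
  # Central finite-difference slope of the line at the optimum
  h <- 1e-6
  slope <- (beta_of_alpha(a + h, mu1, n) - beta_of_alpha(a - h, mu1, n)) / (2 * h)
  c(P = P, k = k, alpha = a, beta = beta_of_alpha(a, mu1, n),
    dbeta_dalpha = slope, target = -P / ((1 - P) * k))
}

scenarios <- rbind(
  optimal_alpha(P = 0.5, k = 1),     # equal priors, equal losses
  optimal_alpha(P = 0.5, k = 10),    # Type II error ten times as costly
  optimal_alpha(P = 0.5, k = 0.01)   # Type I error a hundred times as costly
)
round(scenarios, 4)
```

In each row the numerical slope `dbeta_dalpha` should agree with `target`, the value implied by the first-order condition, and α* moves up as k increases and down as k decreases.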
An Example
A production manager wishes to examine the relationship between aptitude test scores given prior to hiring and performance ratings of employees three months after commencing work. The aptitude test results range from 0 to 100 and the performance ratings from 1 to 5 (1: well below average; 2: somewhat below average; 3: average; 4: somewhat above average; 5: well above average). The data from 20 workers are plotted below:
Figure 2: Scatter plot of performance ratings against test scores
The manager calculates the sample correlation coefficient r = 0.379 and conducts the following test: H0: ρ = 0; H1: ρ ≠ 0, where ρ denotes the population correlation coefficient. The test statistic is
t = r√(n-2) / √(1-r²),
which follows the t-distribution with n-2 degrees of freedom, where n is the sample size. The t-statistic is 1.738 with a two-tailed p-value of 0.099, which means that H0 cannot be rejected at the 5% level of significance. Based on this, the manager may conclude that the aptitude test is not effective.
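The reported figures can be reproduced with a few lines of base R, assuming the usual t-statistic for a correlation coefficient, t = r√(n-2)/√(1-r²). The variable names are arbitrary; only r = 0.379 and n = 20 come from the text.

```r
r <- 0.379   # sample correlation reported in the text
n <- 20      # number of workers

# t-statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-tailed p-value from the t-distribution
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

round(c(t = t_stat, p = p_value), 3)
```

The rounded values should agree with the t-statistic of 1.738 and the p-value of 0.099 quoted above.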
However, the question is whether the 5% level of significance is appropriate. If the manager chooses the 10% level instead, it is unclear whether she should reject or accept H0, because the p-value is virtually identical to 0.10. Or should she choose the 1% level of significance and accept H0? Given the clear linear relationship observed in Figure 2, is the decision to accept H0 reasonable? The answer is not clear, and the decision is ambiguous under the conventional levels of significance.
Let us see how the decision can be made under the optimal significance level. Suppose that the manager believes that the value of ρ should be at least 0.50 for the correlation to be practically important. Using this as the value of practical importance under H1, the manager tests H0: ρ = 0; H1: ρ = 0.5. Then, the optimal level of significance can be obtained by minimizing the expected loss as above (assuming P = 0.5, k = 1). Using the R package OptSig, the optimal level of significance is calculated as
library(OptSig)
OptSig.r(r = 0.5, n = 20, p = 0.5, k = 1, alternative = "two.sided")
with (α*, β*) = (0.145, 0.192). At the optimal level of significance, H0 is rejected in favour of H1: ρ ≠ 0, since the p-value of 0.099 is less than α* = 0.145. This example illustrates that an unambiguous and possibly more sensible decision can be made by adopting the optimal level of significance.
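For readers who want to see the mechanics without the package, the optimum can be approximated in base R. This is a rough sketch, not the OptSig implementation: it swaps in the Fisher z-transformation as an approximation to the power of the correlation test, and the function names are made up for this example. Its output should land in the neighbourhood of the values above but need not reproduce them exactly.

```r
# Approximate Type II error for a two-sided test of H0: rho = 0 when the true
# correlation is rho1, using the Fisher z-transformation (an assumption of
# this sketch, not necessarily the method used by OptSig).
beta_corr <- function(alpha, rho1 = 0.5, n = 20) {
  delta <- atanh(rho1) * sqrt(n - 3)             # approximate mean of the z-statistic under H1
  crit  <- qnorm(alpha / 2, lower.tail = FALSE)  # two-sided critical value
  pnorm(crit - delta) - pnorm(-crit - delta)     # P(|Z| < crit) under H1
}

# Expected loss with L1 = 1 and relative loss k
expected_loss <- function(alpha, P = 0.5, k = 1) {
  P * alpha + (1 - P) * k * beta_corr(alpha)
}

opt <- optimize(expected_loss, interval = c(1e-6, 0.999), tol = 1e-10)
alpha_star <- opt$minimum
beta_star  <- beta_corr(alpha_star)
c(alpha_star = alpha_star, beta_star = beta_star)

# Decision for the two-tailed p-value reported in the text
p_value <- 0.099
p_value < alpha_star   # TRUE means H0 is rejected at the approximate optimum
```

Changing `P` or `k` in `expected_loss` shows how the approximate optimum responds to a different prior or relative loss.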
Conclusion
In conclusion, as emphasized in the first part, a conventional level of significance is an arbitrary benchmark with no scientific justification. Researchers should choose the level of significance with careful consideration of the key factors of hypothesis testing, and this article provides a method for finding the optimal choice. Further details of the method and practical examples can be found in Kim (2021).