
Statistical independence for beginners

Intuitive interpretations with R and Excel functions


Statistical independence is a fundamental concept in statistics. In this post, I explain its definitions with intuitive interpretations, examples, and resources for statistical testing of independence (R code and Excel function).


 

Probabilities

For simplicity, consider two events A and B and define the following probabilities:

  • Prob(A): (marginal) probability of event A

  • Prob(B): (marginal) probability of event B

  • Prob(A ∩ B): joint probability of events A and B, the probability that A and B occur at the same time;

  • Prob(A|B): conditional probability of event A given B, the probability of A given that the event B has already occurred;

  • Prob(B|A): conditional probability of event B given A.

These probabilities are related as

Prob(A|B) = Prob(A ∩ B)/Prob(B) and Prob(B|A) = Prob(A ∩ B)/Prob(A).
In terms of a Venn diagram, Prob(A|B) is the proportion of event B's probability that comes from its overlap with event A (the region A ∩ B).




For example,

A: married, B: male

Prob(A|B) = the probability of marriage among males;

A: unemployed, B: university graduates

Prob(A|B) = the probability of unemployment among university graduates.

Statistical Independence

There are two equivalent conditions for statistical independence. First, the events A and B are statistically independent if


Prob(A ∩ B) = Prob(A) × Prob(B).


The probability of A and B occurring at the same time is the product of the probabilities. This means that if they occur at the same time, it is purely by chance and there is no systematic association between the two.


Second, the events A and B are statistically independent if


Prob(A|B) = Prob(A).


This condition follows from the relationship given above: under independence,

Prob(A|B) = Prob(A ∩ B)/Prob(B) = [Prob(A) × Prob(B)]/Prob(B) = Prob(A).

The probability of A conditional on the event B is the same as the probability of A. That is, knowing that event B has occurred tells you nothing about the probability of event A.

Similarly for Prob(B|A) = Prob(B).

Simple example

You toss two (fair) coins successively, each coin showing either H (Head) or T (Tail) with Prob(H) = Prob(T) = 0.5. The four possible outcomes are:


(H, H), (H, T), (T, H), (T, T).


For example,

Prob(H ∩ T) = 0.25, and this is equal to Prob(H) × Prob(T) = 0.5 × 0.5.


That is, if you have the outcome (H, T), it is purely by chance with no systematic association. Alternatively,


Prob(T | H) = Prob(T ∩ H)/Prob(H) = 0.25/0.5 = 0.5 = Prob(T).


If you have H from the first coin, that has no bearing on the probability of having T or H from the second coin.
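
These probabilities are easy to check by simulation. Below is a minimal R sketch (the object names and the number of tosses are arbitrary choices, not part of the original example):

# simulate many pairs of fair coin tosses
set.seed(123)                                # for reproducibility
n <- 100000                                  # number of paired tosses
coin1 <- sample(c("H", "T"), n, replace = TRUE)
coin2 <- sample(c("H", "T"), n, replace = TRUE)

# relative frequencies of the four joint outcomes; each should be near 0.25
table(coin1, coin2) / n

# estimate of Prob(T | H): T on the second coin given H on the first;
# it should be near Prob(T) = 0.5
mean(coin2[coin1 == "H"] == "T")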


Real-world examples

A: married, B: male

Prob(A ∩ B) = Prob(A) × Prob(B); if a randomly selected person is both male and married, this joint occurrence is purely by chance.


Prob(A|B) = Prob(A); the probability of marriage among males is the same as the overall probability of marriage. Being male has no bearing on the probability of marriage.

Testing for Independence: chi-square test

A survey is conducted to examine whether there is any association between the marriage status of individuals and their gender. Out of 100 randomly selected individuals, 40 are males and 60 are females. Among them, 75 are married and 25 are unmarried. The (contingency) table below presents the joint frequencies of marriage status (Y: married, N: unmarried) and gender (M: male, F: female).

             Married (Y)   Unmarried (N)   Total
Male (M)          25             15          40
Female (F)        50             10          60
Total             75             25         100

For example,

Prob(Y ∩ M) = 25/100; Prob(M) = 40/100

Prob(Y|M) = Prob(Y ∩ M)/Prob(M) = 25/40
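
These calculations are easy to reproduce in R from the joint frequencies (a small sketch; the dimnames are added here only for readability):

# joint frequencies: rows = gender (M, F), columns = marriage status (Y, N)
freq <- matrix(c(25, 50, 15, 10), nrow = 2,
               dimnames = list(c("M", "F"), c("Y", "N")))

freq["M", "Y"] / sum(freq)         # Prob(Y and M) = 25/100
sum(freq["M", ]) / sum(freq)       # Prob(M) = 40/100
freq["M", "Y"] / sum(freq["M", ])  # Prob(Y | M) = 25/40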

These actual frequencies are compared with the expected frequencies under statistical independence. The expected joint probabilities under independence are listed below:

             Married (Y)   Unmarried (N)
Male (M)         0.30           0.10
Female (F)       0.45           0.15

For example,

Prob(Y ∩ M) = Prob(Y) × Prob(M) = 75/100 × 40/100 = 0.3; Prob(Y ∩ F) = Prob(Y) × Prob(F) = 75/100 × 60/100 = 0.45.
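
In R, the expected frequencies under independence can be computed from the marginal totals in one step, e.g. with outer() (a sketch; the freq matrix repeats the joint frequencies above):

# expected counts under independence: (row total × column total) / N
freq <- matrix(c(25, 50, 15, 10), nrow = 2,
               dimnames = list(c("M", "F"), c("Y", "N")))
N <- sum(freq)                                    # total responses: 100
expected <- outer(rowSums(freq), colSums(freq)) / N

expected       # counts: M-Y = 30, M-N = 10, F-Y = 45, F-N = 15
expected / N   # probabilities: 0.30, 0.10, 0.45, 0.15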

The actual frequencies are similar to the expected values, but are they close enough to be consistent with statistical independence? To answer this question, we test the null hypothesis of statistical independence.

The chi-square test is widely used for this purpose. It compares the actual frequencies (Oi) with the expected frequencies (Ei):

χ² = Σ (Oi − Ei)²/Ei,  with the sum taken over i = 1, …, n and Ei = N × pi,

where n is the number of cells in the table, N is the total number of responses, and pi is the expected probability (or relative frequency) of cell i under independence. Under the null hypothesis, this statistic follows a chi-square distribution with degrees of freedom df = (Rows − 1) × (Cols − 1), where Rows and Cols are the number of rows and columns of the table.
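
As a quick check of the formula, the statistic and its p-value can be computed directly (a minimal R sketch using the observed and expected frequencies from the tables above):

O <- c(25, 15, 50, 10)        # observed frequencies
E <- c(30, 10, 45, 15)        # expected frequencies under independence

chi2 <- sum((O - E)^2 / E)    # test statistic: 5.5556
df <- (2 - 1) * (2 - 1)       # (Rows - 1) × (Cols - 1) = 1
pchisq(chi2, df, lower.tail = FALSE)  # p-value: 0.01842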

The R code below shows the test results with the test statistic and p-value. The object table is defined as the 2 × 2 matrix of the actual frequencies above and passed to the function chisq.test. At the 5% level of significance, the null hypothesis of independence between gender and marriage status is rejected, with a p-value of 0.018 and a test statistic of 5.56. The option correct = FALSE turns off the continuity correction, which is not used here to be consistent with the Excel function.


> table=matrix(c(25,50,15,10),nrow=2)
> table
     [,1] [,2]
[1,]   25   15
[2,]   50   10
> chisq.test(table,correct = FALSE)

 Pearson's Chi-squared test

data:  table
X-squared = 5.5556, df = 1, p-value = 0.01842
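
The expected frequencies used by the test can also be extracted from the returned object via its expected component, a handy check against the table above:

> chisq.test(table, correct = FALSE)$expected
     [,1] [,2]
[1,]   30   10
[2,]   45   15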

The Excel function CHISQ.TEST returns the p-value of the test. The function requires the actual range and the expected range as inputs:
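
For example, with the actual 2 × 2 frequencies stored in cells B2:C3 and the expected frequencies in cells F2:G3 (illustrative cell ranges, not from the original screenshot), the formula is:

=CHISQ.TEST(B2:C3, F2:G3)

This should return the same p-value of about 0.018, since CHISQ.TEST applies no continuity correction.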



 

Statistical independence means that two random variables have no systematic association; as a result, the correlation between them is zero (although zero correlation does not, in general, imply independence). This post has provided intuitive explanations of statistical independence, with examples and computational resources.










