Null Hypothesis Significance Testing, part 2
```r
# r import
library(tidyverse)
library(knitr)
library(reticulate)
library(corrplot)

options(digits = 4)
```
```python
# python import
import pandas as pd
from scipy import stats
```
```r
# custom functions
nhst_result <- function(pval, alpha) {
  if (pval < alpha) {
    print(paste0("p-value is less than alpha (",
                 alpha, "). Reject the null hypothesis."))
  } else {
    print(paste0("p-value is greater than alpha (",
                 alpha, "). Fail to reject the null hypothesis."))
  }
}
```
Introduction
In the previous part we looked at the basic idea of statistical testing and the inference for a population mean ($\mu$). Here we are going to look at the inference for a population proportion ($p$ or $\pi$).
Data Set
We are going to look at the Student Alcohol Consumption dataset (Math sample) from Kaggle.
Context of the problem: the data were obtained in a survey of students in math and Portuguese language courses in secondary school. It contains a lot of interesting social, gender, and study information about the students.
The data set consists of 30 explanatory variables, such as weekly study time or parents' education level, and three target variables that show the grades for a math exam:
- `G1` - first period grade (numeric: from 0 to 20)
- `G2` - second period grade (numeric: from 0 to 20)
- `G3` - final grade (numeric: from 0 to 20, output target)
Original paper: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
```r
students_data <- read_csv("data/student-mat.csv")
sample_n(students_data, 5) %>% kable()
```
school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GP | F | 15 | U | GT3 | T | 3 | 2 | health | services | home | father | 1 | 2 | 3 | no | yes | no | no | yes | yes | yes | no | 3 | 3 | 2 | 1 | 1 | 3 | 0 | 6 | 7 | 0 |
MS | F | 18 | R | LE3 | A | 1 | 4 | at_home | other | course | mother | 3 | 2 | 0 | no | no | no | no | yes | yes | no | yes | 4 | 3 | 4 | 1 | 4 | 5 | 0 | 13 | 13 | 13 |
MS | M | 19 | R | GT3 | T | 1 | 1 | other | services | home | other | 3 | 2 | 3 | no | no | no | no | yes | yes | yes | no | 5 | 4 | 4 | 3 | 3 | 2 | 8 | 8 | 7 | 8 |
GP | M | 18 | U | GT3 | T | 4 | 4 | teacher | services | home | mother | 2 | 1 | 0 | no | no | yes | yes | yes | yes | yes | no | 3 | 2 | 4 | 1 | 4 | 3 | 22 | 9 | 9 | 9 |
GP | F | 18 | U | GT3 | T | 4 | 3 | other | other | reputation | father | 1 | 4 | 0 | no | yes | yes | no | yes | yes | yes | no | 4 | 3 | 3 | 1 | 1 | 3 | 0 | 14 | 13 | 14 |
Inference for a Proportion
We have already discussed how you can run a test for a continuous random variable (such as weight, distance, glucose level, etc.). But what if your variable of interest is not continuous but rather discrete (for example, the success rate for a new drug, pass/fail exam results, etc.)?
Single Proportion
Imagine that previous research suggests that, on average, 60% of students pass the final math exam. Can we say that this rate has increased now that new data have come in?
- $H_0$: The student success rate for the final math exam is 60%, $\pi=0.6$.
- $H_A$: The student success rate for the final math exam is greater than 60%, $\pi>0.6$.
- $\alpha = 0.05$
According to the paper, the exam is passed if the score is greater than 9, so we are going to create a new variable with a binary outcome `pass`/`fail`:
```r
students_data <- students_data %>%
  mutate(final_score = if_else(G3 > 9, "pass", "fail"))

p_sample <- mean(students_data$final_score == "pass")
print(paste0("p = ", round(p_sample, 4)))
```
```
## [1] "p = 0.6709"
```
```r
ggplot(data = students_data, aes(x = final_score, fill = final_score)) +
  geom_bar() +
  labs(title = "Final Math Exam Results",
       x = "Result")
```
Now, the variable `final_score` actually follows a Binomial distribution. We have already (kind of) looked at how to deal with binomially distributed variables under the null hypothesis testing framework in an earlier intuitive example, but let's dig into more details.
We are going to use the binomial test in order to find the p-value.
The idea stays the same: we want to know the probability of observing data at least as extreme as what we got in the experiment ($p=0.67$) under the assumption that the null hypothesis is true ($\pi=0.6$). The Binomial distribution is defined as:
$$P(X=k) = C_n^k p^k (1-p)^{n-k}$$

- $n$ - number of trials;
- $p$ - success probability for each trial;
- $k$ - number of successes.
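Before moving on, here is a quick sanity check (with purely illustrative numbers, not taken from our data) that the formula above matches R's built-in `dbinom()`:

```r
# Sanity check of the binomial PMF formula against dbinom()
# (illustrative numbers only: 7 successes in 10 trials with p = 0.6)
n_check <- 10; k_check <- 7; p_check <- 0.6
manual <- choose(n_check, k_check) * p_check^k_check * (1 - p_check)^(n_check - k_check)
builtin <- dbinom(k_check, size = n_check, prob = p_check)
c(manual = manual, builtin = builtin)  # both ~0.215
```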
Let’s draw the null distribution:
```r
n <- dim(students_data)[1] # total number of students
p_null <- 0.6
n_passed <- sum(students_data$final_score == "pass")

x <- 200:275
b_dist <- dbinom(x, size = n, prob = p_null)

# highlight the p-value region (x >= 265) in light blue
fill <- rep("red", length(x))
fill[x >= n_passed] <- "lightblue"

ggplot() +
  geom_col(aes(x = x, y = b_dist), fill = fill) +
  labs(title = "Null Distribution: Binomial(n = 395, p = 0.6)",
       x = "Number of students passing the exam",
       y = "Probability")
```
We would expect 237 students of 395 to pass the math exam under the assumption that the null hypothesis is true ($n \times \pi = 395 \times 0.6 = 237$). That is the expected value of the null distribution. We observed 265 students who passed the test, so the p-value is the sum of the probabilities for $x$ greater than or equal to 265:
$$\scriptsize \text{p-value} = P(X=265) + P(X=266) + ... + P(X=395)$$
$$\scriptsize P(X=265) = C_{395}^{265} \times 0.6^{265} \times (1-0.6)^{395-265}$$
$$\scriptsize P(X=266) = C_{395}^{266} \times 0.6^{266} \times (1-0.6)^{395-266}$$
$$...$$
$$\scriptsize P(X=395) = C_{395}^{395} \times 0.6^{395} \times (1-0.6)^{395-395}$$

As we can see, that is a lot of calculations to do by hand. There is a way to use the Normal approximation, which allows calculating the p-value with less effort (a sketch follows below), but we are going to rely on R:
```r
# P(X >= n_passed) under the null distribution
pval <- 1 - pbinom(q = n_passed - 1, size = n, prob = p_null)
print(paste0("p-value is: ", round(pval, 3)))
```
```
## [1] "p-value is: 0.002"
```
```r
alpha <- 0.05
nhst_result(pval, alpha)
```
```
## [1] "p-value is less than alpha (0.05). Reject the null hypothesis."
```
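As a side note, here is a minimal sketch of the Normal approximation mentioned above (assuming the usual continuity correction of 0.5, and reusing `n`, `p_null`, and `n_passed` from the code above): under $H_0$, $X$ is approximately normal with mean $n\pi$ and standard deviation $\sqrt{n\pi(1-\pi)}$.

```r
# Normal approximation to the binomial p-value (sketch)
mu <- n * p_null                          # mean under H0: 395 * 0.6 = 237
sigma <- sqrt(n * p_null * (1 - p_null))  # sd under H0: ~9.74
z <- (n_passed - 0.5 - mu) / sigma        # continuity-corrected z-score
pnorm(z, lower.tail = FALSE)              # ~0.0024, close to the exact 0.0022
```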
And of course, there are implementations for a binomial test that allow skipping most of the calculations:
**R**

Built-in `binom.test` function:
```r
binom.test(n_passed, n, p_null, "greater")
```
```
## 
##  Exact binomial test
## 
## data:  n_passed and n
## number of successes = 265, number of trials = 395, p-value = 0.002
## alternative hypothesis: true probability of success is greater than 0.6
## 95 percent confidence interval:
##  0.6299 1.0000
## sample estimates:
## probability of success 
##                 0.6709
```
**Python**

`binom_test` function from the `scipy.stats` module (note that in recent SciPy releases it has been replaced by `stats.binomtest`):
```python
students_data = pd.read_csv("data/student-mat.csv")
students_data["final_score"] = students_data["G3"].apply(lambda x: "pass" if x > 9 else "fail")

p_null = 0.6
n = students_data.shape[0]
n_passed = sum(students_data["final_score"] == "pass")

p_val = stats.binom_test(x=n_passed, n=n, p=p_null, alternative="greater")
print(f"p-value: {p_val: .4f}")
```
```
## p-value:  0.0022
```
Multiple Proportions
Now say we want to compare proportions across multiple groups rather than just one. For this purpose one can use the $\chi^2$ test for independence. In its general form, the chi-square test has the following hypotheses:
- $H_0$: There is no association between groups.
- $H_A$: There is an association between the groups (one-sided test).
Conditions for $\chi^2$ test:
- Independence (observations must be independent of each other)
- Sample size (each "cell" must have at least 5 expected cases)
The test statistic $\chi^2$ (which follows a $\chi^2$ distribution) can be found as:
$$\chi^2 = \sum_{i=1}^k \frac{(O_i-E_i)^2}{E_i}$$

- $O_i$: observed count in a "cell"
- $E_i$: expected count in a "cell"
- $k$: number of "cells"
Example
Does the level of weekend alcohol consumption affect students' exam results?
- $H_0$: Exam results and alcohol consumption are independent.
- $H_A$: Exam results and alcohol consumption are dependent (exam results vary by alcohol consumption level).
- $\alpha = 0.05$
There are 5 levels of alcohol consumption (from 1 - very low to 5 - very high). First, we can take a look at the cross tab to see the number of observations in each group:
```r
ct <- table(students_data$final_score, students_data$Walc)
ct %>% kable()
```
  | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
fail | 50 | 25 | 25 | 20 | 10 |
pass | 101 | 60 | 55 | 31 | 18 |
Each cell has at least 5 expected cases (see the expected counts below), so we can say that the sample size condition is met.
We are going to rewrite the previous cross table in the following way:
Weekend alcohol consumption level | 1 | 2 | 3 | 4 | 5 | Total |
---|---|---|---|---|---|---|
Failed | 50 (50) | 25 (28) | 25 (26) | 20 (17) | 10 (9) | 130 |
Passed | 101 (101) | 60 (57) | 55 (54) | 31 (34) | 18 (19) | 265 |
Total | 151 | 85 | 80 | 51 | 28 | 395 |
The numbers in parentheses are the expected counts for each cell. Assuming that there is no association between the groups, we expect 67.09% of students to pass the exam ($p =\frac{\text{total passed}}{\text{total}} = \frac{265}{395}=0.6709$) in each group. Let's take a look at alcohol consumption level 1. Given that assumption, we expect to observe 101 students who passed the exam ($\text{total for level 1} \times p = 151 \times 0.6709 = 101$). Hence we expect 50 students to fail the exam ($151-101 = 50$).
For the second (2) level of alcohol consumption we expect to observe 57 students who passed the exam ($\text{total for level 2} \times p = 85 \times 0.6709 = 57$) and 28 who failed ($85-57 = 28$), and so on for each group. After we have found the expected value for each cell we can calculate the $\chi^2$ value:
$$\scriptsize \chi^2 = \frac{(50-50)^2}{50} + \frac{(25-28)^2}{28} + ... + \frac{(18-19)^2}{19}$$

This may also be a tough task to calculate by hand, which is why we usually rely on software.
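Still, the by-hand recipe fits in a few lines of base R. A sanity-check sketch (reusing the contingency table `ct` from above): the expected counts are the outer product of the row and column totals divided by the grand total, and the statistic is the sum of the squared scaled deviations.

```r
# Expected counts under independence: outer product of row and column
# totals divided by the grand total
expected <- outer(rowSums(ct), colSums(ct)) / sum(ct)
round(expected, 1)

# Chi-square statistic: sum of (observed - expected)^2 / expected
sum((ct - expected)^2 / expected)  # ~1.59, matching chisq.test() below
```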
**R**

Built-in `chisq.test` function:
```r
results <- chisq.test(ct, correct = FALSE)
results
```
```
## 
##  Pearson's Chi-squared test
## 
## data:  ct
## X-squared = 1.6, df = 4, p-value = 0.8
```
**Python**

`chi2_contingency` function from the `scipy.stats` module:
```python
ct = pd.crosstab(students_data.final_score, students_data.Walc).to_numpy()
chisq_stat, p_val, dof, expctd = stats.chi2_contingency(ct, correction=False)
print(f"Calculated test statistic: {chisq_stat: .4f}\np-value: {p_val: .4f}")
```
```
## Calculated test statistic:  1.5919
## p-value:  0.8102
```
A couple of notes:
- Degrees of freedom can be found as $df=(C-1)(R-1)$, where $C$ is the number of columns and $R$ is the number of rows; here $df=(5-1)(2-1)=4$.
- The `correction` argument in the chi-square test functions applies Yates's correction for continuity.
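As a quick complementary check (a sketch, assuming the same $\alpha = 0.05$), we can compare the statistic to the critical value of the $\chi^2$ distribution with 4 degrees of freedom, which is also what the plot below visualizes:

```r
# Critical value of the chi-square distribution at alpha = 0.05, df = 4
qchisq(0.95, df = 4)  # ~9.49; our statistic (1.59) is far below it,
                      # consistent with the large p-value (0.81)
```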
```r
x <- seq(0, 10, 0.1)
df <- 4
chisq_dist <- dchisq(x, df)
chi_sq <- results$statistic

ggplot() +
  geom_line(
    mapping = aes(x = x, y = chisq_dist),
    color = "black", size = 1.5) +
  geom_vline(xintercept = chi_sq) +
  geom_area(
    mapping = aes(x = x[x >= chi_sq], y = chisq_dist[x >= chi_sq]),
    fill = "red", alpha = 0.6) +
  labs(title = "Chi-square Distribution",
       y = "Density") +
  theme_classic()
```
R also lets us explore the expected and observed counts:
```r
results$observed %>% kable()
```
  | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
fail | 50 | 25 | 25 | 20 | 10 |
pass | 101 | 60 | 55 | 31 | 18 |
```r
results$expected %>% kable()
```
  | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
fail | 49.7 | 27.97 | 26.33 | 16.78 | 9.215 |
pass | 101.3 | 57.03 | 53.67 | 34.22 | 18.785 |
We can also check the residual value of each cell, which is defined as:
$$r = \frac{O-E}{\sqrt{E}}$$

```r
residuals_table <- results$residuals
kable(residuals_table)
```
  | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
fail | 0.0431 | -0.5624 | -0.2590 | 0.7848 | 0.2585 |
pass | -0.0302 | 0.3939 | 0.1814 | -0.5497 | -0.1811 |
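As a sketch, we can verify this definition directly against the by-hand formula, using the observed and expected counts stored in the test object:

```r
# Sanity check: residuals equal (O - E) / sqrt(E)
r_manual <- (results$observed - results$expected) / sqrt(results$expected)
max(abs(r_manual - results$residuals))  # ~0: identical up to floating point
```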
```r
# residuals are not correlations, so turn off the default [-1, 1] color scale
corrplot(residuals_table, is.corr = FALSE)
```
A high absolute residual value means that the cell has a large influence on the $\chi^2$ score. Another approach is to find the percentage contribution of each cell using the formula:
$$\text{Cell Contribution} = \frac{r^2}{\chi^2} \times 100\%$$

```r
contrib_table <- 100 * residuals_table^2 / results$statistic
kable(contrib_table)
```
  | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
fail | 0.1167 | 19.870 | 4.215 | 38.69 | 4.199 |
pass | 0.0572 | 9.748 | 2.068 | 18.98 | 2.060 |
As we can see, the pairs `fail & 2`, `fail & 4`, and `pass & 4` have the highest percentage of contribution to the $\chi^2$ score; in other words, these cells deviate the most from what we would expect under independence (even though the overall test fails to reject the null hypothesis).
Summary
This was a brief overview of how we can perform hypothesis testing when we deal with discrete variables. In the next (and, I hope, final) part I will finally introduce the concepts of test power and effect size and discuss why the p-value alone is not sufficient for decision making.
References
Cortez, P., & Silva, A. (2008). Using data mining to predict secondary school student performance. In Proceedings of the 5th Future Business Technology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal. EUROSIS.