Hypothesis testing: Two sample tests for proportions

This post covers the most commonly used statistical tests for comparing a binary (success/failure) metric in two independent samples. For example, imagine we run an A/B experiment where our primary goal is to increase the conversion rate. Visitors can either make a purchase (convert), or not, so our metric is 1 if they convert, else 0. To test whether there was a significant effect on conversion, we compare the proportion of visitors who converted in each sample.

Other examples of proportion metrics: the proportion of visitors who click on something, the proportion who cancel, or the proportion who contact customer service.

Hypothesis

Much like with continuous metrics, our two-sided hypothesis looks something like:

\[H_0: p_1 = p_2\] \[H_A: p_1 \neq p_2\]

With the following alternates for a one-sided test:

\[H_A: p_1 > p_2 \quad \text{or} \quad H_A: p_1 < p_2\]

General notation:

\[p_1, p_2 = \text{population proportions in each group}\] \[x_1, x_2 = \text{number of successes in each group}\] \[n_1, n_2 = \text{number of observations in each group}\] \[\hat{p_1} = \frac{x_1}{n_1}, \quad \hat{p_2} = \frac{x_2}{n_2} = \text{observed sample proportions}\]

Statistical tests:

- z-test (normal approximation)
- Chi-square test of homogeneity
- G-test

z-test (normal approximation)

If the sampling distribution for the difference in proportions is normal, we can use a z-test:

\[z = \frac{(\hat{p_1} - \hat{p_2})}{\sqrt{p(1-p)(\frac{1}{n_1} + \frac{1}{n_2})}}\] \[p = \frac{x_1+x_2}{n_1+n_2}\]

Where x1 and x2 are the number of successes in each group. Like the pooled estimate used in the two-sample t-test, the result for p will be between p̂1 and p̂2.

import numpy as np
import scipy.stats.distributions as dist

def z_test_2_sample_proportions(x1, x2, n1, n2, two_tailed=True):
    '''
    Calculate the test statistic for a z-test on 2 proportions from independent samples
    x1, x2: number of successes in group 1 and 2
    n1, n2: total number of observations in group 1 and 2
    Returns: test statistic (z), and p-value 
    '''
    avg_p = (x1 + x2) / (n1 + n2)  # pooled proportion under the null
    z_val = (x1/n1 - x2/n2) / np.sqrt(avg_p * (1-avg_p) * (1/n1 + 1/n2))
    z_prob = dist.norm.cdf(-np.abs(z_val))  # one-tailed probability of a z value at least this extreme

    if two_tailed:
        return z_val, 2*z_prob

    else:
        return z_val, z_prob
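
For example, calling it with some made-up counts (1,000 visitors per group with 100 vs 120 conversions, purely for illustration):

z_stat, p_val = z_test_2_sample_proportions(x1=100, x2=120, n1=1000, n2=1000)
print('z = {:.4f}, p value = {:.4f}'.format(z_stat, p_val))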

Using statsmodels:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

success_cnts = np.array([x1, x2])
total_cnts = np.array([n1, n2])
test_stat, pval = proportions_ztest(count=success_cnts, nobs=total_cnts, alternative='two-sided')
print('Two sided z-test: z = {:.4f}, p value = {:.4f}'.format(test_stat, pval))

The Wald statistic is an alternative test statistic which uses the unpooled standard error, making it useful for calculating a confidence interval for the difference:

\[\text{SE} = \sqrt{\frac{\hat{p_1}(1-\hat{p_1})}{n_1} + \frac{\hat{p_2}(1-\hat{p_2})}{n_2}}\] \[\text{Wald test statistic} = \frac{(\hat{p_1} - \hat{p_2})}{\text{SE}}\] \[\text{approximate CI} = \hat{p_1} - \hat{p_2} \pm Z_\text{critical} * \text{SE}\]
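
A minimal sketch of this calculation (the function name, the 95% confidence level, and the example counts are my own choices for illustration):

import numpy as np
from scipy.stats import norm

def wald_ci_two_proportions(x1, n1, x2, n2, alpha=0.05):
    """Wald test statistic and approximate CI for the difference in two proportions."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    diff = p1_hat - p2_hat
    # unpooled standard error of the difference
    se = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    wald_stat = diff / se
    z_crit = norm.ppf(1 - alpha / 2)  # ~1.96 for a 95% interval
    return wald_stat, (diff - z_crit * se, diff + z_crit * se)

# made-up example counts
wald, ci = wald_ci_two_proportions(x1=100, n1=1000, x2=120, n2=1000)
print('Wald statistic = {:.4f}, 95% CI = ({:.4f}, {:.4f})'.format(wald, *ci))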

Assumptions

- The two samples are independent, and the observations within each sample are independent of one another.
- The samples are large enough for the normal approximation to hold (a common rule of thumb is at least 5-10 successes and 5-10 failures in each group).

Chi-square test of homogeneity

Another option is the chi-square test of homogeneity. The chi-square test takes the observed frequencies in a contingency table (a crosstab of the outcome levels vs the sample groups), and compares them to what we would expect to see if there was no difference (the null hypothesis). In the case of an experiment, if the treatment had no effect we expect the proportion in each outcome level to be approximately the same for both samples. Note, this is just one of the three types of comparison for the chi-square test: 1. the goodness of fit test, for comparing an observed and theoretical distribution; 2. the test of homogeneity (this one), to compare two or more groups on the same categorical variable; and 3. the test of independence, to compare two variables. However, they all use the same math for the test statistic.

\[\chi^2 = \sum{\frac{(O_i - E_i)^2}{E_i}}\]

Note, it’s been shown that the 2x2 chi-square homogeneity test is equivalent to the z-test of two proportions (see here). However, the chi-square test can also handle cases with more than 2 levels in the outcome, or more than two populations.

Expected counts:

The expected count for a cell is the cell's joint probability under the null (row probability * column probability) multiplied by the total count, which simplifies to row total * column total / grand total. E.g. if the joint probability is 0.3 and we have a sample of 100, then we expect 0.3*100=30 for that cell. Worked example:

| Group   | Yes | No  | Total |
|---------|-----|-----|-------|
| control | C_Y | C_N | C     |
| variant | V_Y | V_N | V     |
| totals  | Y   | N   | T     |

Based on this we can calculate the expected counts for each cell:

\[E_{\text{control, yes}} = \frac{C \times Y}{T} \qquad E_{\text{control, no}} = \frac{C \times N}{T}\] \[E_{\text{variant, yes}} = \frac{V \times Y}{T} \qquad E_{\text{variant, no}} = \frac{V \times N}{T}\]
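
As a quick sketch of this calculation with numpy (the observed counts are made-up example values):

import numpy as np

# made-up observed 2x2 table: rows = control/variant, columns = yes/no
observed = np.array([[30, 70],
                     [45, 55]])

row_totals = observed.sum(axis=1)  # C, V
col_totals = observed.sum(axis=0)  # Y, N
total = observed.sum()             # T

# expected count for each cell = row total * column total / grand total
expected = np.outer(row_totals, col_totals) / total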

Chi-square statistic:

For each cell (combination of group and outcome level), calculate the squared difference between the observed and expected counts, divided by the expected count to standardise: (observed-expected)^2/expected. Sum these values across all cells to get the chi-square statistic. The larger the difference between the observed and expected counts, the larger the test statistic.

The test statistic approximately follows the chi-square distribution under the null hypothesis. Therefore, we can compare our test value to the chi-square distribution (adjusting the shape based on the degrees of freedom calculation below), and estimate the probability of observing such a value (or larger) when the null is true (p-value).

\[\text{degrees of freedom} = (\text{nr rows} - 1) * (\text{nr columns} - 1)\]

For a 2x2 table, the degrees of freedom = (2-1) * (2-1) = 1.
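
Putting the pieces together, here's a minimal sketch of the full calculation (reusing the same made-up observed counts as above):

import numpy as np
from scipy.stats import chi2

# made-up observed 2x2 table: rows = control/variant, columns = yes/no
observed = np.array([[30, 70],
                     [45, 55]])

# expected counts under the null: row total * column total / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# sum of (observed - expected)^2 / expected over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()

dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)  # (2-1)*(2-1) = 1
p_value = chi2.sf(chi2_stat, dof)  # probability of a value this large or larger under the null
print('Chi-square statistic = {:.2f}, p-value = {:.4f}'.format(chi2_stat, p_value))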

Python implementation:

import numpy as np
from statsmodels.stats.proportion import proportions_chisquare

success_cnts = np.array([x1, x2])
total_cnts = np.array([n1, n2])
chi2, p_val, cont_table = proportions_chisquare(count=success_cnts, nobs=total_cnts)

print('Chi-square statistic = {:.2f}'.format(chi2))
print('p-value = {:.4f}'.format(p_val))

Assumptions

- Independent samples, and independent observations within each sample.
- Counts (not proportions or percentages) in each cell of the contingency table.
- Expected counts are not too small (a common rule of thumb is at least 5 in each cell), otherwise the chi-square approximation is poor.

G-test

(based on Tilly, 2009, and McDonald, 2014):

Finally, the g-test of independence is another alternative. The g-test is very similar to the chi-square test; in fact, the chi-square statistic is an approximation of the g statistic. As with the chi-square test, we take a contingency table and calculate by how much the observed and expected counts differ, with the null hypothesis being that the proportions of the outcome levels are the same for both samples.

Expected counts:

The expected counts are calculated the same way as the chi-squared test above, but the test statistic is different.

G test statistic:

For each cell we calculate a g-value (the observed count multiplied by the log of the ratio of observed to expected counts), and the g statistic is twice the sum of these values:

\[g = 2 * \sum{O_i * \log{\frac{O_i}{E_i}}}\]

Python implementation:

from scipy.stats import chi2
import numpy as np

def g_test(n1, n2, successes_1, successes_2):
    """
    Performs a g-test comparing the proportion of successes in two groups.
    n1, n2: total number of observations in group 1 and group 2
    successes_1, successes_2: number of successes in group 1 and group 2
    Returns: test statistic, p value
    """

    failures_1 = n1 - successes_1
    failures_2 = n2 - successes_2
    total = n1 + n2

    # expected count for each cell = row total * column total / grand total
    expected_control_success = n1 * (successes_1 + successes_2) / total
    expected_control_failure = n1 * (failures_1 + failures_2) / total
    expected_variant_success = n2 * (successes_1 + successes_2) / total
    expected_variant_failure = n2 * (failures_1 + failures_2) / total

    # g-value for each cell: observed * ln(observed / expected)
    g1 = successes_1 * np.log(successes_1/expected_control_success)
    g2 = failures_1 * np.log(failures_1/expected_control_failure)
    g3 = successes_2 * np.log(successes_2/expected_variant_success)
    g4 = failures_2 * np.log(failures_2/expected_variant_failure)

    # twice the sum, compared to a chi-square distribution with 1 degree of freedom
    g_test_stat = 2 * (g1+g2+g3+g4)
    pvalue = 1 - chi2.cdf(g_test_stat, 1)

    return g_test_stat, pvalue
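
For example, with the same made-up counts used earlier:

g_stat, p_val = g_test(n1=1000, n2=1000, successes_1=100, successes_2=120)
print('G-test: g = {:.4f}, p value = {:.4f}'.format(g_stat, p_val))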

Alternatively, there’s a scipy function:

import numpy as np
from scipy.stats import chi2_contingency

# 2x2 contingency table: one row per group, columns = [successes, failures]
contingency_table = np.array([[successes_1, n1 - successes_1], [successes_2, n2 - successes_2]])

g, p, dof, expctd = chi2_contingency(contingency_table, lambda_="log-likelihood", correction=False)

print("g test statistic={:.4f}, dof={}, p-value={:.4f}".format(g, dof, p))

Assumptions:

- Independent samples, and independent observations within each sample.
- Counts in each cell of the contingency table, with expected counts not too small (the same rule of thumb as the chi-square test applies).

Yates Continuity Correction

Both the chi-square test and the g-test tend to err on the small side (the p-values can come out a little too small when counts are low), so Yates suggested adjusting all observed values by 0.5 towards the expected values. This improves accuracy slightly. Set correction=True in the scipy function above to apply this correction.

Final thoughts: