Hypothesis Testing – Module 6 | Applied Statistics Course

Applied Statistics · Module 6 of 10 · Intermediate Level

Hypothesis Testing

Q: When should I use a t-test?

Use a t-test when the population standard deviation is unknown and must be estimated from the sample — which covers the vast majority of real research situations. The t-test is appropriate for any sample size, though it is particularly important for small samples (n < 30).

Q: What is a chi-square test?

The chi-square test is a non-parametric statistical test used for categorical (count) data. It comes in two forms: the goodness-of-fit test (does observed data match an expected distribution?) and the test of independence (are two categorical variables related to each other?).

Q: What is the difference between Type I and Type II errors?

A Type I error (false positive) occurs when we reject a true null hypothesis. Its probability equals α. A Type II error (false negative) occurs when we fail to reject a false null hypothesis. Its probability equals β. Power = 1 − β. Reducing one error increases the other for a fixed sample size.

How statisticians, researchers, and business analysts use data to make confident, evidence-based decisions — step by step, test by test.

📋 Quick Learning Navigation

1. Introduction
2. What is Hypothesis Testing?
3. Null Hypothesis (H₀)
4. Alternative Hypothesis (H₁)
5. Significance Level (α)
6. p-value
7. Type I & Type II Errors
8. Testing Process (7 Steps)
9. Z-Test
10. t-Test
11. Chi-Square Test
12. Choosing the Right Test
13. Finance, Audit & Business
14. Common Mistakes
15. Case Study
16. Exercises
17. 30-Question MCQ Quiz
18. Frequently Asked Questions

Introduction

Every day, decisions are made using data. A pharmaceutical company wants to know whether a new drug lowers blood pressure better than the existing one. A marketing team needs to know whether a redesigned homepage increased sign-ups. An auditor needs to determine whether a client's reported figures are materially misstated. A finance professional wants to know whether a portfolio strategy beats the market.

These are not questions that can be answered by gut feeling or by looking at a single number. They require a structured, mathematical framework for separating genuine effects from random variation. That framework is hypothesis testing.

Hypothesis testing is the backbone of inferential statistics — the branch of statistics concerned with drawing conclusions about a population from a sample. Because we can almost never observe an entire population, we collect a sample, measure it, and then ask: "Is what I'm observing real, or could it simply be the result of chance?"

Why This Module Matters

Without hypothesis testing, data-driven decisions would be guesswork dressed up as analysis. With it, we have a rigorous, reproducible method for deciding what the data actually tells us — and how confident we should be in that conclusion.

In this module, you will master every core concept in hypothesis testing: formulating hypotheses, choosing a significance level, computing and interpreting p-values, understanding the two types of errors, and applying the three most common statistical tests — the Z-test, the t-test, and the chi-square test. By the end, you will be equipped to apply hypothesis testing in research, business, finance, and auditing contexts.

What is Hypothesis Testing?

Simple Definition

Hypothesis testing is a statistical procedure used to decide, based on sample data, whether there is enough evidence to support or reject a specific claim about a population.

Statistical Definition

Formally, hypothesis testing is a method of statistical inference that evaluates two competing statements — the null hypothesis and the alternative hypothesis — by quantifying how likely the observed data would be if the null hypothesis were true. The outcome is a decision: either reject the null hypothesis or fail to reject it, based on a pre-specified threshold of evidence called the significance level.

Practical Meaning

Think of hypothesis testing as a courtroom trial. The null hypothesis is the assumption of innocence: the default position that nothing unusual is happening, no effect exists, no difference is present. The data is the evidence. The significance level is the standard of proof required. If the evidence is strong enough — if the data is too unusual to be explained by chance alone — we reject the null hypothesis. Otherwise, we withhold judgment.

Real-World Examples

Business: A company claims its new packaging increases sales. A hypothesis test on sales data will reveal whether the increase is statistically significant or just random fluctuation.

Finance: An analyst claims a stock-picking algorithm generates above-market returns. A hypothesis test on historical returns will evaluate that claim.

Healthcare: Researchers claim a vaccine reduces infection rates. A hypothesis test on clinical trial data will determine whether the reduction is statistically meaningful.

Auditing: An auditor suspects expense claims are inflated. A hypothesis test on a random sample of transactions can support or contradict that suspicion.

Marketing: A team runs an A/B test on two email subject lines. A hypothesis test on open rates tells them which one genuinely performs better.

The Null Hypothesis (H₀)

The null hypothesis (H₀) is the default assumption — the statement that there is no effect, no difference, or no relationship in the population being studied. It is the baseline that the researcher attempts to disprove.

Key Principle

The null hypothesis is never proven true. It is either rejected (when evidence is strong enough) or failed to be rejected (when evidence is insufficient). You cannot "accept" the null hypothesis — absence of evidence is not evidence of absence.

Purpose of the Null Hypothesis

The null hypothesis serves as a benchmark for comparison. It specifies what we would observe if nothing interesting is happening — if the treatment has no effect, the groups are identical, or the variables are unrelated. Statistical tests calculate how surprising the observed data is under this assumption. If the data is very unlikely given H₀, we have grounds to reject it.

Practical Examples

Context	Research Question	Null Hypothesis (H₀)
Medicine	Does Drug A reduce blood pressure more than Drug B?	There is no difference in blood pressure reduction between Drug A and Drug B.
Business	Does the new website design increase conversion rates?	The new design has no effect on conversion rates.
Finance	Does the trading algorithm outperform a buy-and-hold strategy?	The algorithm's average return equals the market's average return.
Auditing	Are the reported expenses consistent with historical patterns?	The mean expense amount equals the historical mean.
Marketing	Do customers in different regions have different brand preferences?	Brand preference is independent of region.
Education	Does the new teaching method improve test scores?	The new method has no effect on mean test scores.

In mathematical notation, null hypotheses are typically stated in terms of population parameters. For example: H₀: μ = 50 (the population mean equals 50), or H₀: μ₁ = μ₂ (two population means are equal), or H₀: p = 0.40 (a population proportion equals 40%).

The Alternative Hypothesis (H₁)

The alternative hypothesis (H₁), also written as Hₐ, is the statement that challenges the null hypothesis. It is the claim that there IS an effect, a difference, or a relationship — the statement the researcher is trying to provide evidence for.

Types of Alternative Hypotheses

Alternative hypotheses come in three forms, each corresponding to a different type of statistical test:

Type	Symbol	Description	Test Type
Two-tailed	H₁: μ ≠ μ₀	The parameter is different from the null value (either direction)	Two-tailed test
Right-tailed	H₁: μ > μ₀	The parameter is greater than the null value	Right-tailed (upper) test
Left-tailed	H₁: μ < μ₀	The parameter is less than the null value	Left-tailed (lower) test

Choosing the Right Form

The choice between one-tailed and two-tailed tests should be made before collecting data, based on the research question:

Use a two-tailed test when you have no prior reason to expect the difference to go in a particular direction (most common in exploratory research).
Use a one-tailed test when theory or prior evidence strongly predicts the direction of the effect (e.g., a new drug should lower, not raise, blood pressure).

Warning: One-Tailed Tests

One-tailed tests are more powerful (easier to reject H₀) but are only valid when the direction is specified in advance. Choosing the direction after seeing the data to get a significant result is a form of statistical misconduct called "p-hacking."

Null vs Alternative Hypothesis — Comparison

Dimension	Null Hypothesis (H₀)	Alternative Hypothesis (H₁)
Definition	Default claim; no effect or difference	Competing claim; asserts an effect or difference
Notation	H₀	H₁ or Hₐ
Content	Includes equality (=, ≤, ≥)	Never includes equality (≠, >, <)
Role	Statement we test against	Statement we seek evidence for
Result	Rejected or failed to be rejected	Supported if H₀ is rejected
Burden of proof	Assumed true until proven otherwise	Must be supported by evidence
Example	H₀: μ = 100	H₁: μ ≠ 100

Significance Level (α)

The significance level, denoted α (alpha), is the probability threshold below which a p-value is considered statistically significant — the maximum probability of rejecting a true null hypothesis that the researcher is willing to accept.

In practical terms, α defines how much evidence is required to conclude that the null hypothesis is false. A smaller α demands stronger evidence.

Common Significance Levels

Significance Level	Confidence Level	Typical Use
α = 0.10 (10%)	90%	Exploratory research, preliminary studies; lower standards acceptable
α = 0.05 (5%)	95%	Standard in most academic, business, and social science research
α = 0.01 (1%)	99%	Medical research, safety-critical applications, financial regulations
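α = 0.001 (0.1%)	99.9%	Very high-stakes decisions and settings with many simultaneous tests; some fields go much further (the particle-physics "five sigma" discovery standard corresponds to p ≈ 3 × 10⁻⁷)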

The 0.05 Rule — Explained

The most widely used significance level, α = 0.05, means: "I will accept a 5% chance of concluding there is an effect when in fact there is none." This is known as the Type I error rate (covered later in this module, in the section on Type I and Type II errors).

The 0.05 threshold was popularized by statistician Ronald Fisher in the 1920s and has since become the de facto standard. However, it is important to understand that this threshold is a convention, not a law of nature. Depending on the stakes of the decision, a different α may be more appropriate.

Critical Point

The significance level must be set before collecting or analyzing data. Setting it after seeing the results (to achieve significance) is a serious statistical error and undermines the integrity of the entire analysis.

The p-value

Definition

The p-value (probability value) is the probability of observing a test statistic as extreme as — or more extreme than — the one actually observed, assuming the null hypothesis is true.

Put simply: a p-value answers the question, "If H₀ were true, how often would we see data this unusual just by chance?"
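
As a rough illustration of this definition, the sketch below converts a standardized test statistic into a p-value using only Python's standard library. It assumes the statistic follows a standard normal distribution under H₀; the function name `p_value_from_z` and its `tail` argument are illustrative choices, not part of this module.

```python
from statistics import NormalDist

def p_value_from_z(z, tail="two"):
    """Return the p-value for a standard-normal test statistic z.

    tail: "two" (H1: not equal), "right" (H1: greater), or "left" (H1: less).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        # Probability of a value at least as far from 0 in either direction
        return 2 * (1 - std_normal.cdf(abs(z)))
    if tail == "right":
        return 1 - std_normal.cdf(z)
    if tail == "left":
        return std_normal.cdf(z)
    raise ValueError("tail must be 'two', 'right', or 'left'")

p = p_value_from_z(1.96)  # close to 0.05 for a two-tailed test
print(f"p = {p:.4f}")
```

The same statistic gives a different p-value depending on the tail, which is one reason the form of H₁ has to be fixed before looking at the data.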

Interpreting the p-value

p-value	Interpretation	Decision (α = 0.05)
p < 0.001	Extremely strong evidence against H₀	Reject H₀ — highly significant
0.001 ≤ p < 0.01	Very strong evidence against H₀	Reject H₀ — very significant
0.01 ≤ p < 0.05	Moderate evidence against H₀	Reject H₀ — significant
0.05 ≤ p < 0.10	Weak evidence against H₀	Fail to reject H₀ — marginal
p ≥ 0.10	Little or no evidence against H₀	Fail to reject H₀ — not significant

p < 0.05 — The Decision Rule

At the standard α = 0.05 level:

If p < 0.05: Reject H₀. The result is statistically significant. The data provide sufficient evidence against H₀ in favor of the alternative hypothesis (this is not the same as a 95% probability that H₁ is true).
If p ≥ 0.05: Fail to reject H₀. The result is not statistically significant. There is insufficient evidence to support the alternative hypothesis.

Common Misconceptions About p-values

A p-value is NOT the probability that H₀ is true.
A p-value is NOT the probability that the result occurred by chance.
A statistically significant result is NOT necessarily practically significant (effect size matters).
p = 0.049 is NOT meaningfully different from p = 0.051.

Practical Example

A company tests whether a new training program increases average employee productivity scores. The old program yielded a mean score of 72. After the new program, a sample of 40 employees shows a mean score of 76. After running the appropriate test, the p-value is 0.021.

Since 0.021 < 0.05, we reject H₀. The improvement from 72 to 76 is statistically significant. There is evidence that the new training program improves productivity scores — the observed difference is unlikely to be due to chance alone.

Type I and Type II Errors

Because hypothesis testing is based on probability, it is never 100% certain. Two kinds of mistakes are possible: rejecting a null hypothesis that is actually true, or failing to reject a null hypothesis that is actually false.

Type I Error (False Positive)

A Type I error occurs when we reject a true null hypothesis. We conclude that an effect exists when in reality it does not.

The probability of a Type I error is equal to the significance level: P(Type I Error) = α
Also called a false positive or false alarm

Type I Error — Real-World Examples

Medicine: A clinical trial concludes a drug is effective when it has no real effect. Patients receive an ineffective treatment.

Auditing: An auditor concludes fraud has occurred when the figures are actually accurate. The client faces unnecessary scrutiny and reputational damage.

Business: A company launches an expensive marketing campaign based on a test that falsely showed the campaign significantly increases sales.

Type II Error (False Negative)

A Type II error occurs when we fail to reject a false null hypothesis. We conclude there is no effect when in reality there is one.

The probability of a Type II error is denoted β (beta)
Also called a false negative or missed detection
Statistical power = 1 − β (the probability of correctly rejecting a false H₀)

Type II Error — Real-World Examples

Medicine: A clinical trial fails to detect that a drug is effective. A beneficial treatment never reaches patients.

Auditing: An auditor fails to detect actual fraud or material misstatement in the financial statements.

Business: A company abandons a marketing campaign that actually works because the sample test failed to detect the real improvement.

Type I vs Type II Error — Comparison Table

Dimension	Type I Error	Type II Error
Definition	Rejecting a true H₀	Failing to reject a false H₀
Also called	False positive	False negative
Probability	α (significance level)	β (beta)
Control	Reduced by lowering α	Reduced by increasing power (larger sample, better design)
The cost	Acting on a false signal	Missing a real signal
Trade-off	Lowering α increases β	Lowering β requires larger sample or higher α
Medical analogy	Diagnosing a disease the patient doesn't have	Missing a disease the patient does have

The Error Trade-Off

There is an inherent trade-off between Type I and Type II errors: for a fixed sample size and study design, reducing one tends to increase the other. The most direct way to reduce both simultaneously is to increase the sample size (reducing measurement variability through better design also helps). This is why adequate sample size planning (called power analysis) is a critical step in research design.
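
The trade-off can be made concrete with a small power calculation. The sketch below assumes a right-tailed Z-test with known σ; the scenario (a true effect of 3 units with σ = 12) and the function name `power_right_tailed_z` are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_right_tailed_z(effect, sigma, n, alpha):
    """Power of a right-tailed Z-test when the true mean exceeds mu0 by `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)     # rejection cutoff under H0
    shift = effect / (sigma / sqrt(n))         # how far the true mean shifts Z
    return 1 - std_normal.cdf(z_crit - shift)  # P(reject H0 | H1 true)

# Hypothetical scenario: true effect of 3 units, sigma = 12
for n, alpha in [(50, 0.05), (50, 0.01), (200, 0.01)]:
    power = power_right_tailed_z(effect=3, sigma=12, n=n, alpha=alpha)
    print(f"n={n}, alpha={alpha}: power={power:.3f}, beta={1 - power:.3f}")
```

In this scenario, tightening α at n = 50 lowers power (raises β), while moving to n = 200 permits the stricter α together with a smaller β.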

The 7-Step Hypothesis Testing Process

Every hypothesis test, regardless of the specific statistical method used, follows this standard seven-step procedure:

1. State the Hypotheses: Clearly define both H₀ and H₁ in terms of a specific population parameter (mean, proportion, variance, etc.). The hypotheses must be mutually exclusive and collectively exhaustive. Example: H₀: μ = 500; H₁: μ ≠ 500.
2. Select the Significance Level (α): Choose α before seeing the data. The standard is α = 0.05 unless the context demands a stricter or more lenient threshold. Document your choice and rationale.
3. Identify the Test and Check Assumptions: Choose the appropriate statistical test (Z-test, t-test, chi-square, etc.) based on the data type, sample size, and knowledge of population parameters. Verify that the test's assumptions are met (normality, independence, equal variance, etc.).
4. Collect Data: Gather a representative, random sample from the population. Ensure data quality, because errors in data collection undermine even the most sophisticated analysis.
5. Calculate the Test Statistic: Compute the numerical value that measures how far the observed sample data is from what H₀ would predict. This is standardized (for Z and t statistics, expressed in standard-error units) so it can be compared to a known distribution (Z, t, χ², F, etc.).
6. Find the p-value and Make a Decision: Calculate the p-value corresponding to the test statistic. Compare it to α. If p < α, reject H₀. If p ≥ α, fail to reject H₀. Alternatively, compare the test statistic to the critical value from the appropriate distribution table.
7. State the Conclusion in Context: Interpret the statistical result in plain language, in the context of the original research question. Do not just say "rejected"; say what this means for the business decision, the research finding, or the audit conclusion.

The Z-Test

Definition and Purpose

The Z-test is a parametric hypothesis test used to determine whether the mean of a sample is significantly different from a known or hypothesized population mean, when the population standard deviation (σ) is known and either the sample size is large (typically n ≥ 30) or the population is normally distributed.

Assumptions

The population standard deviation (σ) is known
The sample size is large (n ≥ 30) OR the population is normally distributed
Observations are independent
Data is continuous (interval or ratio scale)

Z-Test Formula

Z = (x̄ − μ₀) / (σ / √n)
Where:
x̄ = sample mean
μ₀ = hypothesized population mean (from H₀)
σ = population standard deviation (known)
n = sample size

Decision Rule (Z-Test)

Test Type	Reject H₀ if...	Critical Values (α = 0.05)
Two-tailed	|Z| > Z_(α/2)	Z < −1.96 or Z > +1.96
Right-tailed	Z > Z_α	Z > +1.645
Left-tailed	Z < −Z_α	Z < −1.645
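
The critical values in this table can be reproduced from the inverse of the standard normal CDF. A minimal sketch using Python's standard library (the variable names are illustrative):

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()

# Two-tailed: alpha is split evenly between both tails
z_two_tailed = std_normal.inv_cdf(1 - alpha / 2)
# One-tailed: all of alpha sits in one tail (use the negative for a left-tailed test)
z_one_tailed = std_normal.inv_cdf(1 - alpha)

print(round(z_two_tailed, 2), round(z_one_tailed, 3))
```

Changing `alpha` gives the cutoffs for other significance levels without a lookup table.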

Worked Example — Z-Test

Quality Control in Manufacturing

Problem: A bottling plant claims that each bottle contains a mean of 500 ml of beverage. A quality control inspector takes a random sample of 50 bottles and finds a sample mean of 497 ml. The population standard deviation is known to be 12 ml. Test at the α = 0.05 significance level whether the mean fill volume has changed.

Step 1 — Hypotheses:
H₀: μ = 500 ml (the mean fill is 500 ml)
H₁: μ ≠ 500 ml (the mean fill has changed — two-tailed test)

Step 2 — Significance Level: α = 0.05 (two-tailed, so critical values are ±1.96)

Step 3 — Test Statistic:

Z = (497 − 500) / (12 / √50)
Z = (−3) / (12 / 7.071)
Z = (−3) / 1.697
Z = −1.768

Step 4 — Decision:
The calculated Z = −1.768. The critical value for a two-tailed test at α = 0.05 is ±1.96.
Since |−1.768| = 1.768 < 1.96, we fail to reject H₀.

Step 5 — Conclusion:
At the 5% significance level, there is insufficient evidence to conclude that the mean fill volume has changed from 500 ml. The difference of 3 ml could plausibly be due to random sampling variation. The quality inspector cannot conclude a systematic underfilling problem based on this sample alone.
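
The arithmetic in this worked example can be checked with a few lines of Python. This is a sketch using only the standard library; the two-tailed p-value it computes is an addition that the worked example itself does not report.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, sigma, n, alpha = 497, 500, 12, 50, 0.05

standard_error = sigma / sqrt(n)
z = (x_bar - mu0) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value
reject_h0 = p_value < alpha

print(f"Z = {z:.3f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

The p-value approach and the critical-value approach lead to the same decision here: fail to reject H₀.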

The t-Test

Definition and Purpose

The t-test is a parametric hypothesis test used to compare means when the population standard deviation is unknown and must be estimated from the sample data. It is also appropriate for small samples (n < 30). The t-test uses the t-distribution, which has heavier tails than the normal distribution to account for the additional uncertainty in estimating σ.

The Three Types of t-Test

1. One-Sample t-Test

Tests whether the mean of a single sample differs significantly from a hypothesized population mean when σ is unknown.

t = (x̄ − μ₀) / (s / √n)
Degrees of freedom: df = n − 1
Where:
s = sample standard deviation (estimate of σ)
n = sample size
df = degrees of freedom

Example: A bank's loan department believes the average loan processing time is 3 days. A sample of 15 loans shows a mean of 3.8 days with a sample standard deviation of 1.2 days. Is the mean processing time significantly different from 3 days? (This requires a one-sample t-test with df = 14.)
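
One way to work through the loan example is sketched below. It assumes a two-tailed test at α = 0.05 (the example does not state a level) and takes the critical value 2.145 for df = 14 from a standard t-table rather than computing it.

```python
from math import sqrt

x_bar, mu0, s, n = 3.8, 3.0, 1.2, 15

t_stat = (x_bar - mu0) / (s / sqrt(n))
df = n - 1

# Two-tailed critical value for alpha = 0.05, df = 14 (t-table value, assumed level)
t_critical = 2.145
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

Under these assumptions the statistic exceeds the critical value, so the sample would count as evidence that mean processing time differs from 3 days.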

2. Independent Samples t-Test (Two-Sample t-Test)

Compares the means of two independent groups to determine whether they differ significantly from each other.

t = (x̄₁ − x̄₂) / √[s²_p × (1/n₁ + 1/n₂)]
Where:
s²_p (pooled variance) = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)
Degrees of freedom: df = n₁ + n₂ − 2

Example: A company trains two groups of salespeople using different methods. Group A (n=20) achieved a mean monthly sales of $48,000 (s = $5,200). Group B (n=20) achieved $52,000 (s = $4,800). Is there a statistically significant difference in mean sales between the two methods?
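
The pooled-variance formula can be applied to the sales example as follows. This sketch assumes a two-tailed test at α = 0.05 and equal population variances; the critical value 2.024 for df = 38 is a t-table value, not computed here.

```python
from math import sqrt

n1, mean1, s1 = 20, 48_000, 5_200  # Group A
n2, mean2, s2 = 20, 52_000, 4_800  # Group B

# Pooled variance weights each group's variance by its degrees of freedom
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t_stat = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

t_critical = 2.024  # two-tailed, alpha = 0.05, df = 38 (t-table value, assumed level)
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

Under these assumptions the difference in mean sales between the two methods would be judged statistically significant.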

3. Paired Samples t-Test

Tests whether the mean difference between two paired (matched) observations is significantly different from zero. Typically used in before-and-after studies.

t = d̄ / (s_d / √n)
Where:
d̄ = mean of the differences (after − before for each subject)
s_d = standard deviation of the differences
n = number of pairs
df = n − 1

Example: Ten employees' productivity scores were measured before and after a wellness program. The paired t-test tests whether the mean score change is significantly different from zero.
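
The example above gives no raw scores, so the sketch below uses made-up before and after values purely to show the mechanics of the paired formula. All numbers in the two lists are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical productivity scores for ten employees (not from the source)
before = [70, 68, 75, 80, 65, 72, 78, 74, 69, 71]
after = [74, 70, 76, 85, 66, 75, 80, 79, 70, 73]

# The paired test works on per-subject differences, not on the two groups
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
d_bar = mean(diffs)
s_d = stdev(diffs)  # sample standard deviation (n - 1 denominator)
t_stat = d_bar / (s_d / sqrt(n))
df = n - 1

print(f"mean difference = {d_bar:.2f}, t = {t_stat:.2f}, df = {df}")
```

Because each employee serves as their own control, only the spread of the differences enters the standard error, which is what gives the paired design its power.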

Assumptions of t-Tests

The data is approximately normally distributed (or n is large)
Observations are independent (within each group)
For independent-samples t-test: homogeneity of variance (can be tested with Levene's test)
Data is continuous (interval or ratio scale)

Z-Test vs t-Test — When to Use Each

Criterion	Z-Test	t-Test
Population σ	Known	Unknown (estimated from sample)
Sample Size	Large (n ≥ 30)	Any size (especially n < 30)
Distribution Used	Standard normal (Z)	t-distribution
Degrees of Freedom	Not applicable	Required (df = n−1 or n₁+n₂−2)
Tail heaviness	Normal tails	Heavier tails (more conservative)
When n → ∞	Distribution unchanged (already standard normal)	t-distribution converges to the standard normal, so results match the Z-test
Common Use	Quality control, proportion tests	Most practical research with unknown σ

Practical Rule

In most real-world applications, σ is unknown. Default to the t-test unless you have a clear, documented reason to use the Z-test. When the sample is large (n ≥ 30), both tests give nearly identical results.

The Chi-Square Test (χ²)

Definition and Purpose

The chi-square test is a non-parametric hypothesis test used to analyze categorical data. Unlike the Z-test and t-test which compare means, the chi-square test examines whether observed frequencies (counts) in categories differ significantly from expected frequencies.

The Two Types of Chi-Square Tests

1. Chi-Square Goodness-of-Fit Test

Tests whether the distribution of a single categorical variable matches a hypothesized distribution.

χ² = Σ [(O − E)² / E]
Where:
O = observed frequency in each category
E = expected frequency in each category
Σ = sum across all categories
df = k − 1 (where k = number of categories)

Example: A die is rolled 120 times. If fair, each face should appear 20 times. The observed frequencies are: 1→18, 2→22, 3→25, 4→17, 5→19, 6→19. Is the die fair?

χ² = (18−20)²/20 + (22−20)²/20 + (25−20)²/20 + (17−20)²/20 + (19−20)²/20 + (19−20)²/20
χ² = 4/20 + 4/20 + 25/20 + 9/20 + 1/20 + 1/20
χ² = 0.2 + 0.2 + 1.25 + 0.45 + 0.05 + 0.05 = 2.20
df = 6 − 1 = 5
Critical value at α = 0.05, df = 5: χ²_critical = 11.07
Since 2.20 < 11.07, fail to reject H₀. There is no significant evidence that the die is unfair.
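
The die calculation can be verified programmatically. A minimal sketch, with the critical value 11.07 taken from the text rather than computed:

```python
observed = [18, 22, 25, 17, 19, 19]  # counts for faces 1 to 6
expected = [sum(observed) / len(observed)] * len(observed)  # 20 per face if fair

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
chi_critical = 11.07  # alpha = 0.05, df = 5 (from the text)

reject_h0 = chi_square > chi_critical
print(f"chi-square = {chi_square:.2f}, df = {df}, reject H0: {reject_h0}")
```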

2. Chi-Square Test of Independence

Tests whether two categorical variables are independent of each other, or whether there is a significant association between them.

Setup: Data is arranged in a contingency table. Expected frequencies are computed assuming independence:

E_ij = (Row Total_i × Column Total_j) / Grand Total
χ² = Σ [(O_ij − E_ij)² / E_ij]
df = (r − 1)(c − 1) where r = rows, c = columns

Example: A marketing team surveys 200 customers and records their gender (Male/Female) and product preference (Product A/B/C). The chi-square test of independence will determine whether product preference is associated with gender, or whether the two variables are independent.
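
The survey example does not list the observed counts, so the sketch below fills a 2 × 3 contingency table with hypothetical numbers that sum to 200, only to show how expected frequencies and the statistic are built.

```python
# Hypothetical counts for 200 customers (not from the source)
# Rows: Male, Female; columns: Product A, B, C
observed = [
    [30, 40, 30],
    [40, 30, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi-square = {chi_square:.3f}, df = {df}")
```

The resulting statistic would then be compared with the χ² critical value for df = 2 at the chosen α.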

Assumptions of the Chi-Square Test

Data consists of counts (frequencies), not means or proportions
Observations are independent
Expected frequency in each cell should be at least 5 (rule of thumb)
Categories are mutually exclusive

Key Applications of Chi-Square Tests

Market Research: Is customer satisfaction related to age group?
HR Analytics: Is promotion rate independent of gender?
Auditing: Do the leading digits of transaction amounts follow the distribution expected under Benford's Law?
Healthcare: Is disease incidence related to geographic region?
Finance: Are defaults distributed uniformly across credit rating categories?

Choosing the Right Test

Research Question	Data Type	Sample Size	σ Known?	Test to Use
Compare sample mean to known value	Continuous	n ≥ 30	Yes	Z-Test
Compare sample mean to known value	Continuous	Any	No	One-sample t-test
Compare means of two independent groups	Continuous	Any	No	Independent t-test
Compare before/after in same subjects	Continuous	Any	No	Paired t-test
Compare three or more group means	Continuous	Any	No	ANOVA (F-test)
Test if distribution fits expected	Categorical (counts)	Any	N/A	Chi-square goodness-of-fit
Test if two categorical variables are related	Categorical (counts)	Any	N/A	Chi-square independence
Test a proportion against a known value	Proportion	Large	N/A	Z-test for proportions

Applications: Finance, Auditing, and Business

Hypothesis Testing in Finance

Hypothesis testing is a cornerstone of quantitative finance and investment analysis. Financial professionals use it to make objective, evidence-based claims about asset performance, risk, and market behavior.

Investment Performance: Testing whether a fund's mean return significantly exceeds the benchmark (one-sample t-test).
Risk Assessment: Testing whether portfolio volatility has changed after a strategy adjustment (F-test or Levene's test on variances).
Event Studies: Testing whether stock prices react significantly around earnings announcements.
Credit Risk: Testing whether default rates differ significantly across credit rating categories (chi-square test).
Market Efficiency: Testing whether abnormal returns are consistently zero (the Efficient Market Hypothesis).

Hypothesis Testing in Auditing

Auditors use hypothesis testing as a formal framework for evaluating financial statements, internal controls, and compliance without reviewing every transaction.

Audit Sampling: Testing whether the mean error in a sample of invoices is significantly different from zero (one-sample t-test).
Compliance Testing: Testing whether the proportion of transactions that comply with policy meets a minimum threshold (Z-test for proportions).
Benford's Law Analysis: Testing whether the leading digit distribution of reported figures follows Benford's Law (chi-square goodness-of-fit). Significant deviation can be an indicator of manipulation.
Fraud Investigation: Testing whether expense amounts are significantly higher than in comparable periods (two-sample t-test).
Material Misstatement: Testing whether projected total misstatement in the population exceeds the materiality threshold.

Benford's Law in Auditing

Benford's Law states that in many real-world datasets, the first significant digit follows a specific distribution: "1" appears about 30.1% of the time, "2" about 17.6%, etc. Auditors apply the chi-square goodness-of-fit test to compare actual digit distributions against Benford's expected distribution. A significant result (p < 0.05) may indicate manipulated or fabricated figures.
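
The expected proportions quoted above come from the formula P(d) = log₁₀(1 + 1/d) for leading digits d = 1 to 9. The sketch below computes them and wraps the goodness-of-fit statistic in a helper; the function name `benford_chi_square` and its input format are illustrative assumptions, not an established audit tool.

```python
from math import log10

# Benford's expected proportion for each leading digit d = 1..9
benford = {d: log10(1 + 1 / d) for d in range(1, 10)}

def benford_chi_square(digit_counts):
    """Chi-square goodness-of-fit statistic against Benford's distribution.

    digit_counts maps each leading digit (1-9) to its observed count.
    """
    total = sum(digit_counts.values())
    return sum(
        (digit_counts.get(d, 0) - total * p) ** 2 / (total * p)
        for d, p in benford.items()
    )

for d, p in benford.items():
    print(f"digit {d}: {p:.1%}")
```

With nine digit categories the test has df = 9 − 1 = 8, and the statistic is compared with the corresponding χ² critical value.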

Hypothesis Testing in Business

A/B Testing: Testing whether a redesigned webpage, email, or ad generates a significantly higher conversion rate than the control version (Z-test for proportions or independent t-test).
Product Quality: Testing whether mean product dimensions or weights meet specifications (one-sample t-test or Z-test).
Customer Segmentation: Testing whether customer satisfaction scores differ significantly across demographic groups (independent t-test or ANOVA).
Pricing Strategy: Testing whether price changes significantly affect sales volume (paired t-test for before/after comparisons).
Supply Chain: Testing whether supplier delivery times meet contractual standards (one-sample t-test).

15 Common Hypothesis Testing Mistakes (and How to Avoid Them)

Misinterpreting p-values as the probability that H₀ is true.
The p-value is the probability of the observed data (or more extreme) given H₀ — not the probability that H₀ is true. Fix: always state the correct interpretation.
Treating "fail to reject" as "accept H₀."
Insufficient evidence to reject H₀ does not mean H₀ is confirmed. Fix: use precise language — "fail to reject," never "accept."
Setting α after seeing the data (p-hacking).
Choosing α = 0.05 because your p-value is 0.04, after you've already analyzed the data, inflates Type I errors. Fix: always pre-register your significance level.
Confusing statistical significance with practical significance.
A tiny, meaningless difference can be statistically significant with a large enough sample. Fix: always compute and report effect sizes (Cohen's d, η², etc.).
Using the wrong statistical test.
Applying a t-test when data is categorical, or a chi-square test when data is continuous. Fix: match the test to your data type and research question (see the decision table above).
Ignoring test assumptions.
Running a t-test on severely non-normal data with a tiny sample, or a chi-square test with expected cells below 5. Fix: always verify assumptions before running any test.
Using too small a sample.
Underpowered studies frequently fail to detect real effects (Type II errors). Fix: conduct a power analysis before data collection to determine the required sample size.
Multiple testing without correction.
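Running 20 independent tests at α = 0.05 when every null hypothesis is true produces, on average, 1 significant result by chance alone, and the probability of at least one false positive is about 64%. Fix: apply Bonferroni correction or other multiple-comparison adjustments when running multiple tests.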
Choosing one-tailed tests opportunistically.
Switching to a one-tailed test after data collection because it gives a smaller p-value. Fix: specify the direction of H₁ before collecting data, based on theory alone.
Using a parametric test on ordinal data.
T-tests require interval or ratio data. Applying them to Likert scale data (strongly disagree–strongly agree) is technically incorrect. Fix: use non-parametric alternatives (Mann-Whitney U, Wilcoxon signed-rank) for ordinal data.
Ignoring outliers.
A single extreme observation can dramatically alter the test statistic. Fix: identify and investigate outliers before analysis; consider robust methods if outliers are genuine.
Non-random sampling.
Hypothesis tests assume random sampling. Convenience samples produce biased results that cannot be generalized. Fix: design rigorous sampling protocols appropriate to the population of interest.
Reporting only significant results (publication bias).
Cherry-picking significant findings while suppressing null results distorts the scientific record. Fix: pre-register studies and commit to reporting all results.
Treating correlated observations as independent.
Using an independent t-test on paired data (before/after) ignores the correlation and reduces statistical power. Fix: use the paired t-test when the same subjects are measured twice.
Forgetting to report confidence intervals.
A p-value alone tells you whether an effect is significant but not how large it is or how precisely it's estimated. Fix: always accompany hypothesis test results with confidence intervals for the parameter of interest.
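
The multiple-testing mistake in the list above is easy to quantify. The sketch below assumes the 20 tests are independent and that every null hypothesis is true, which is a simplification; the variable names are illustrative.

```python
alpha, m = 0.05, 20  # per-test significance level and number of independent tests

expected_false_positives = alpha * m          # average count if every H0 is true
familywise_error_rate = 1 - (1 - alpha) ** m  # P(at least one false positive)

bonferroni_alpha = alpha / m                  # stricter per-test threshold
familywise_after = 1 - (1 - bonferroni_alpha) ** m

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"family-wise error rate: {familywise_error_rate:.3f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha}")
print(f"family-wise error rate after correction: {familywise_after:.3f}")
```

The Bonferroni threshold keeps the family-wise error rate at or below the original α, at the cost of lower power for each individual test.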

Practical Case Study

Full Case Study — Retail Banking

Did the New Loan Processing System Reduce Processing Time?

Background: Metro Commercial Bank implemented a new digital loan processing system in Q3. The Head of Operations claims the new system has significantly reduced average loan processing time from the historical average of 8.5 days. She asks the Analytics Team to test this claim using data from the first month after implementation.

Step 1 — State the Hypotheses

H₀: μ = 8.5 days (the mean processing time has not changed)
H₁: μ < 8.5 days (the mean processing time has decreased — one-tailed, left-tailed test)

Step 2 — Significance Level

The team selects α = 0.05. A one-tailed test is appropriate because the claim is directional — the system is supposed to reduce (not change in either direction) processing time.

Step 3 — Select Test and Check Assumptions

Since the population standard deviation is unknown, and the sample is drawn from the first month's processed loans, a one-sample t-test is appropriate.

Assumptions: Loan processing times are approximately normally distributed (confirmed via histogram). Loans are processed independently.

Step 4 — Collect Data

A random sample of 25 loans processed under the new system yields:
Sample size: n = 25
Sample mean: x̄ = 7.8 days
Sample standard deviation: s = 2.1 days

Step 5 — Calculate the Test Statistic

t = (x̄ − μ₀) / (s / √n)
t = (7.8 − 8.5) / (2.1 / √25)
t = (−0.7) / (2.1 / 5)
t = (−0.7) / 0.42
t = −1.667

Degrees of freedom: df = 25 − 1 = 24

Step 6 — Make the Decision

Critical t-value for a one-tailed left test at α = 0.05 with df = 24: t_critical = −1.711

Since our calculated t = −1.667 is greater than (to the right of) the critical value −1.711, we fail to reject H₀ at the 0.05 level.

The corresponding one-tailed p-value is approximately 0.054 (slightly above 0.05).
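
Steps 5 and 6 can be reproduced with a short script. The sketch below uses only the Python standard library, so the t-distribution CDF is approximated numerically with Simpson's rule rather than taken from a statistics package. The helper names t_pdf and t_cdf are illustrative and not part of the course material; where SciPy is installed, scipy.stats.t.cdf would serve the same purpose.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|] plus symmetry (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

# Case-study inputs from Steps 1 to 4
n, x_bar, s, mu_0, alpha = 25, 7.8, 2.1, 8.5, 0.05

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1
p_value = t_cdf(t_stat, df)       # left-tailed test: P(T <= t_stat)
reject_h0 = p_value < alpha

print(f"t = {t_stat:.3f}, df = {df}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

The decision from the p-value matches the critical-value comparison in Step 6: the p-value lands just above α, so H₀ is not rejected.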

Step 7 — Conclusion

At the 5% significance level, there is insufficient statistical evidence to conclude that the new digital loan processing system has significantly reduced mean processing time from 8.5 days. While the sample mean of 7.8 days is lower than the historical mean, the difference is not statistically significant — it could plausibly be due to random sampling variation.

Management Recommendation

The analytics team recommends: (1) gathering a larger sample over the next two months, as the test was likely underpowered with only n = 25; (2) conducting a power analysis to determine the sample size needed to detect a reduction of 0.7 days with 80% power; (3) not drawing definitive conclusions about system effectiveness until stronger evidence is available.
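
Recommendation (2) can be roughed out before any new data are collected. The sketch below is a planning approximation, not the team's actual method: it assumes a one-tailed test, treats the sample standard deviation s = 2.1 as if it were the known σ, and uses the normal approximation n = ((z_α + z_β)·σ / δ)². The function name required_n is hypothetical.

```python
import math
from statistics import NormalDist

def required_n(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size for a one-tailed, one-sample test
    (normal approximation; sigma is assumed known)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # cut-off for the rejection region
    z_beta = NormalDist().inv_cdf(power)        # quantile matching the target power
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Detect a 0.7-day reduction with 80% power at alpha = 0.05
n_needed = required_n(delta=0.7, sigma=2.1)
print(n_needed)
```

Under these assumptions the formula suggests about 56 loans, roughly double the 25 sampled. A calculation based on the t-distribution would give a slightly larger figure.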

Business Lesson from this Case Study

Failing to reject H₀ is not failure — it is honest reporting. The correct response is to acknowledge the limitation (small sample), plan for more data, and avoid making expensive operational decisions based on inconclusive evidence.

Key Takeaways

Hypotheses

H₀ is the default claim of no effect; H₁ is the alternative. We test H₀ and either reject or fail to reject it — never "accept" either.

Significance Level

α sets the maximum acceptable Type I error rate. The standard is α = 0.05, but context determines what is appropriate. Always set α before seeing the data.

p-value

The p-value measures how surprising the observed data is under H₀. Small p-values (p < α) are evidence against H₀. The p-value is NOT the probability that H₀ is true.

Error Types

Type I error (false positive): rejecting a true H₀. Type II error (false negative): failing to reject a false H₀. For a fixed sample size, reducing one increases the other; a larger sample is the main way to reduce both at once.

Z-Test

Use when σ is known and either n ≥ 30 or the population is normally distributed. Based on the standard normal distribution. Critical value ±1.96 for two-tailed test at α = 0.05.

t-Test

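Use when σ is unknown (most practical situations). Three types: one-sample, independent, and paired. Uses the t-distribution with n − 1 degrees of freedom for one-sample and paired tests (n = number of pairs) and n₁ + n₂ − 2 for the pooled independent test.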

Chi-Square Test

Use for categorical (count) data. Goodness-of-fit tests one variable's distribution; independence test examines the relationship between two categorical variables.

Effect Size Matters

Statistical significance ≠ practical significance. Always report effect sizes alongside p-values so readers understand the magnitude of findings, not just their statistical certainty.

Practice Exercises

Part A — 20 Conceptual Questions

Q1.

A researcher concludes "the null hypothesis is true" because p = 0.42. What statistical error has the researcher made in their interpretation?

Answer: The researcher has made an error of language. A large p-value means we "fail to reject" H₀ — it does not confirm H₀ is true. Absence of significant evidence is not evidence of absence. The correct statement is: "There is insufficient evidence to reject H₀."

Q2.

What is the relationship between α, β, and statistical power? If α is reduced from 0.05 to 0.01, what happens to β (for a fixed sample size)?

Answer: Power = 1 − β. Reducing α makes the rejection threshold more stringent, which increases β (the probability of a Type II error) for a fixed sample size. To reduce both α and β simultaneously, the sample size must be increased.
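
The trade-off in this answer can be made concrete with the case-study numbers (true reduction of 0.7 days, n = 25). The sketch below is an approximation under stated assumptions: a one-tailed test, σ taken to be 2.1 as if known, and the normal-approximation formula power = Φ(δ√n/σ − z_α). The function name power_one_tailed is illustrative.

```python
from statistics import NormalDist

def power_one_tailed(delta, sigma, n, alpha):
    """Approximate power of a one-tailed, one-sample z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)
    return z.cdf(delta * n ** 0.5 / sigma - z_alpha)

for alpha in (0.05, 0.01):
    power = power_one_tailed(delta=0.7, sigma=2.1, n=25, alpha=alpha)
    print(f"alpha = {alpha}: power = {power:.3f}, beta = {1 - power:.3f}")
```

With the sample size held at 25, tightening α from 0.05 to 0.01 lowers the power and therefore raises β, which is the pattern the answer describes.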

Q3.

An auditor runs a chi-square test of independence on a contingency table and gets p = 0.002. What does this mean in audit terms?

Answer: A p-value of 0.002 < 0.05 means we reject H₀ of independence. There is a statistically significant association between the two categorical variables tested. In audit terms, the observed frequencies differ significantly from expected, which may warrant further investigation of the underlying transactions.

Q4.

Explain why a statistically significant result is not necessarily practically significant. Give a business example.

Answer: With a very large sample, even a tiny, economically meaningless difference can yield p < 0.05. For example, an e-commerce A/B test with 2 million visitors might find that version B increases conversion rate by 0.01% with p = 0.001. While statistically significant, a 0.01% improvement translates to negligible revenue, making the result practically irrelevant.

Q5.

Why should the significance level be set before data collection? What happens if it is set after?

Answer: Setting α before data collection ensures the decision criterion is objective and not influenced by the data. Setting it after allows a researcher to choose α to match a desired outcome (e.g., setting α = 0.06 after observing p = 0.055), which constitutes p-hacking and inflates Type I error rates beyond the nominal level.

Q6.

A finance manager runs 20 separate hypothesis tests at α = 0.05 to screen for market anomalies. How many significant results would be expected by chance alone?

Answer: At α = 0.05, the expected number of false positives = 20 × 0.05 = 1. So approximately 1 "significant" result should be expected purely by chance. The manager should apply a multiple-testing correction such as Bonferroni (α* = 0.05/20 = 0.0025) to control the family-wise error rate.
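
The arithmetic in this answer, plus the family-wise error rate it alludes to, can be checked in a few lines. The sketch assumes the 20 tests are independent and that every null hypothesis is true; the variable names are illustrative.

```python
alpha, k = 0.05, 20

# Expected number of false positives across k tests
expected_false_positives = k * alpha

# Probability of at least one false positive (family-wise error rate),
# assuming independent tests and all nulls true
fwer = 1 - (1 - alpha) ** k

# Bonferroni-adjusted per-test threshold and the resulting family-wise rate
bonferroni_alpha = alpha / k
fwer_bonferroni = 1 - (1 - bonferroni_alpha) ** k

print(expected_false_positives, round(fwer, 3), bonferroni_alpha, round(fwer_bonferroni, 3))
```

Without correction, the chance of at least one spurious "anomaly" is well above one half; with the Bonferroni threshold it stays just under the nominal 5%.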

Q7.

When is a paired t-test more appropriate than an independent samples t-test? Give an example.

Answer: A paired t-test is more appropriate when the same subjects are measured twice (before and after a treatment), or when subjects are matched in pairs. Example: measuring employee satisfaction scores before and after a workplace reform program for the same employees. Using an independent t-test would ignore the within-subject correlation and reduce statistical power.

Q8.

Describe the conditions under which a chi-square goodness-of-fit test would be invalid. How would you address each condition?

Answer: The chi-square test is invalid when: (1) expected cell frequencies are below 5 — fix by collapsing categories or collecting more data; (2) observations are not independent — fix by redesigning the sampling procedure; (3) the test is applied to non-count data (means instead of frequencies) — fix by using the appropriate parametric test.

Q9.

What is the difference between a one-tailed and a two-tailed hypothesis test? How does the choice affect the critical region and p-value?

Answer: A two-tailed test splits the critical region equally between both tails (2.5% per tail at α = 0.05), testing for differences in either direction. A one-tailed test places the entire critical region in one tail (5%), making it easier to reject H₀ in that direction. For a symmetric test statistic (Z or t) whose observed value falls in the hypothesized direction, the one-tailed p-value is half the two-tailed p-value. The choice must be made based on the research question, before data collection.

Q10.

A pharmaceutical company's Type I error in a drug efficacy trial means the drug is approved despite being ineffective. A Type II error means an effective drug is rejected. Which error is more costly in this context? Explain.

Answer: This depends on the drug's risk profile. For a drug treating a life-threatening disease with no alternatives, a Type II error (rejecting an effective treatment) could be catastrophic, because it denies patients a treatment that works. For a drug with significant side effects, a Type I error (approving an ineffective one) exposes patients to harm without benefit. Regulators typically prioritize controlling Type I errors, which is why confirmatory trials conventionally fix α at 0.05 two-sided (0.025 one-sided) and approval often requires the result to be replicated in more than one trial.

Q11–Q20.

(Additional conceptual questions — complete these independently as revision.)

Q11. Why does the t-distribution have heavier tails than the normal distribution?
Q12. What does degrees of freedom represent in a t-test?
Q13. Explain Benford's Law and its relevance to auditing.
Q14. What is the connection between a 95% confidence interval and a two-tailed hypothesis test at α = 0.05?
Q15. How does sample size affect statistical power?
Q16. What is the null hypothesis in an independence chi-square test?
Q17. Why can't the null hypothesis ever be "proven" true?
Q18. What is a p-value threshold of "p < 0.001" communicating?
Q19. Give three examples of Type I errors in a business context.
Q20. What is meant by "statistically significant at the 1% level"?

Part B — 10 Numerical Problems

Problem 1 — Z-Test

A factory produces bolts with a target diameter of 10 mm. The production process has a known standard deviation of 0.5 mm. A quality engineer takes a sample of 100 bolts and finds a mean diameter of 10.08 mm. Test at α = 0.05 whether the mean diameter has shifted from the target.

Solution:
H₀: μ = 10; H₁: μ ≠ 10 (two-tailed)
Z = (10.08 − 10) / (0.5/√100) = 0.08 / 0.05 = 1.60
Critical values at α = 0.05: ±1.96
|1.60| < 1.96 → Fail to reject H₀
Conclusion: Insufficient evidence to conclude the mean diameter has shifted from 10 mm.
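
As a cross-check, the same calculation can be scripted with Python's standard library (statistics.NormalDist). This is a minimal sketch of the solution above; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu_0, sigma, n, x_bar, alpha = 10.0, 0.5, 100, 10.08, 0.05

z_stat = (x_bar - mu_0) / (sigma / math.sqrt(n))
z_crit = NormalDist().inv_cdf(1 - alpha / 2)          # two-tailed critical value
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))     # two-tailed p-value
reject_h0 = abs(z_stat) > z_crit

print(f"Z = {z_stat:.2f}, critical = ±{z_crit:.2f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

The p-value route and the critical-value route give the same decision: fail to reject H₀.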

Problem 2 — One-Sample t-Test

A logistics company claims average delivery time is 3 days. A sample of 16 deliveries shows mean = 3.5 days, s = 0.8 days. Test at α = 0.05 (two-tailed) whether mean delivery time differs from 3 days.

Solution:
H₀: μ = 3; H₁: μ ≠ 3; df = 15
t = (3.5 − 3) / (0.8/√16) = 0.5 / 0.2 = 2.50
Critical t at α = 0.05, df = 15 (two-tailed): ±2.131
|2.50| > 2.131 → Reject H₀
Conclusion: The mean delivery time is significantly different from 3 days. The company's claim is not supported by the data.

Problem 3 — Chi-Square Goodness-of-Fit

A market researcher expects customers to prefer four product variants equally (25% each). A survey of 200 customers shows: Variant A = 62, B = 48, C = 54, D = 36. Test at α = 0.05.

Solution:
Expected each: 200 × 0.25 = 50
χ² = (62−50)²/50 + (48−50)²/50 + (54−50)²/50 + (36−50)²/50
χ² = 144/50 + 4/50 + 16/50 + 196/50 = 2.88 + 0.08 + 0.32 + 3.92 = 7.20
df = 4 − 1 = 3; Critical χ² = 7.815
7.20 < 7.815 → Fail to reject H₀
Conclusion: Customer preference is not significantly different from equal distribution at α = 0.05.
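
The statistic can be verified programmatically. The sketch below also computes a p-value using a closed-form upper-tail formula that holds only for a chi-square distribution with exactly 3 degrees of freedom; the helper name chi2_sf_df3 is hypothetical, and a general-purpose library routine would be needed for other df.

```python
import math

observed = {"A": 62, "B": 48, "C": 54, "D": 36}
n = sum(observed.values())
expected = n / len(observed)          # equal preference: 50 per variant

chi2_stat = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

def chi2_sf_df3(x):
    """Upper-tail probability P(X > x) for chi-square with df = 3 only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_value = chi2_sf_df3(chi2_stat)
print(f"chi2 = {chi2_stat:.2f}, df = {df}, p = {p_value:.3f}")
```

The p-value comes out a little above 0.05, consistent with the statistic falling just short of the critical value 7.815.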

Problems 4–10

Solve the following independently. Worked solutions available in the downloadable companion workbook.

P4. Independent t-test: Compare mean sales of two stores (n₁=20, x̄₁=45K, s₁=8K; n₂=18, x̄₂=51K, s₂=9K) at α=0.05.
P5. Paired t-test: 8 employees' performance scores before (72,68,75,80,65,70,74,78) and after training (78,71,79,85,68,76,79,82). Test improvement at α=0.05.
P6. Z-test for proportions: A coin is flipped 400 times and lands heads 218 times. Test if the coin is fair at α=0.05.
P7. Chi-square independence: Test if gender (M/F) and product preference (A/B/C) are independent, given a 2×3 contingency table.
P8. Left-tailed t-test: A bank claims ATM cash refill time ≥ 5 minutes. A sample of 20 refills shows mean=4.6 min, s=0.9 min. Test at α=0.05.
P9. Power analysis: How large a sample is needed to detect a mean difference of 5 units (σ=12) with 80% power at α=0.05?
P10. Business application: A retailer tests whether a loyalty program increases mean basket size. Pre-program mean=BDT 1,200, n=35, post-program mean=BDT 1,340, s=BDT 280. Test at α=0.01.

Module 6 Quiz — 30 Multiple Choice Questions

1. What does it mean to "reject the null hypothesis"?

H₀ has been proven false
There is sufficient evidence that H₀ is unlikely to be true
H₁ is proven true
The sample data is unreliable

✓ Correct: B

Rejecting H₀ means the observed data would be sufficiently unlikely if H₀ were true that we reject it at the chosen significance level. It does not "prove" anything with certainty.

2. Which of the following is the correct interpretation of a p-value?

The probability that H₀ is true
The probability of a Type II error
The probability of observing data as extreme as, or more extreme than, the sample, assuming H₀ is true
The probability that the study's conclusion is correct

✓ Correct: C

A p-value is a conditional probability: P(data as extreme or more extreme | H₀ is true). It is not the probability that H₀ itself is true.

3. A Type I error occurs when:

H₀ is false and we fail to reject it
H₀ is true and we reject it
H₁ is true and we reject it
The sample size is too small

✓ Correct: B

Type I error = false positive. We reject a true null hypothesis. Its probability equals α.

4. The power of a statistical test equals:

α
1 − α
β
1 − β

✓ Correct: D

Power = 1 − β = the probability of correctly rejecting a false H₀. Higher power means a greater chance of detecting a real effect.

5. When should a Z-test be used instead of a t-test?

When the sample size is small
When the population standard deviation is unknown
When the population standard deviation is known and n ≥ 30
When data is categorical

✓ Correct: C

The Z-test requires that σ is known. In most practical scenarios σ is unknown, making the t-test more appropriate.

6. For a two-tailed Z-test at α = 0.05, the critical values are:

±1.28
±1.645
±1.96
±2.576

✓ Correct: C

At α = 0.05 two-tailed, each tail has 2.5%. The Z value that cuts off 2.5% in the upper tail is 1.96.

7. Which test is most appropriate to compare the means of two independent groups when the population standard deviation is unknown?

Chi-square test
Z-test
Independent samples t-test
Paired t-test

✓ Correct: C

When comparing two independent group means with unknown σ, the independent samples t-test is the standard choice.

8. In a chi-square goodness-of-fit test with 5 categories, the degrees of freedom are:

✓ Correct: B

df = k − 1 = 5 − 1 = 4, where k is the number of categories.

9. An auditor tests whether a firm's expense distribution follows Benford's Law. This is an example of:

Independent t-test
Paired t-test
Chi-square goodness-of-fit test
Z-test for means

✓ Correct: C

Comparing observed digit frequencies to Benford's expected frequencies is a chi-square goodness-of-fit test — testing whether observed counts match a theoretical distribution.

10. If a p-value is 0.032 and α = 0.05, the correct decision is:

Fail to reject H₀ — not significant
Reject H₀ — statistically significant
Accept H₁ — H₁ is proven true
The result is inconclusive

✓ Correct: B

Since 0.032 < 0.05, we reject H₀. The result is statistically significant at the 5% level.

11. A researcher uses the same data to check 30 hypotheses at α = 0.05. Approximately how many false positives are expected?

0
0.05
1.5
5

✓ Correct: C

Expected false positives = 30 × 0.05 = 1.5. This is the multiple testing problem, addressed by Bonferroni correction.

12. The null hypothesis for an independent chi-square test (test of independence) states that:

The two variables are perfectly correlated
The two variables are independent of each other
Both variables have equal means
The data follows a normal distribution

✓ Correct: B

H₀ in an independence test always asserts that the two categorical variables are statistically independent — knowing the value of one provides no information about the other.

13. Which statement about the significance level is correct?

It equals 1 minus the p-value
It must be set after seeing the data
It represents the maximum acceptable probability of a Type I error
It represents the probability that H₁ is true

✓ Correct: C

α is the maximum Type I error rate the researcher is willing to accept, set before data collection.

14. Which test is appropriate for analyzing paired before-and-after measurements from the same subjects?

Z-test
Chi-square test
Independent samples t-test
Paired samples t-test

✓ Correct: D

Paired measurements from the same subjects require the paired t-test, which accounts for within-subject correlation.

15. A test statistic of Z = 2.89 in a right-tailed test at α = 0.01 (critical value = 2.33) means:

Fail to reject H₀
Reject H₀ — statistically significant
The test is invalid
More data is needed

✓ Correct: B

2.89 > 2.33 (the critical value), so the test statistic falls in the rejection region. Reject H₀.

16. The chi-square test requires that expected cell frequencies be at least:

✓ Correct: B

The standard rule of thumb is that each expected frequency should be at least 5 for the chi-square approximation to be reliable.

17. Reducing α from 0.05 to 0.01 while keeping sample size fixed will:

Decrease both Type I and Type II error rates
Decrease Type I error and increase Type II error
Increase Type I error and decrease Type II error
Have no effect on error rates

✓ Correct: B

Lowering α makes rejection harder, reducing Type I errors but increasing the chance of missing a real effect (Type II error) for a fixed sample.

18. Which of the following represents a correct null hypothesis for an independent t-test?

H₀: μ₁ > μ₂
H₀: μ₁ ≠ μ₂
H₀: μ₁ = μ₂
H₀: μ₁ < μ₂

✓ Correct: C

The null hypothesis always contains equality (=, ≤, or ≥). The most common form for a two-sample test is H₀: μ₁ = μ₂ (the two means are equal).

19. What does statistical power measure?

The probability of a Type I error
How large the sample is
The probability of correctly detecting a real effect
The significance level

✓ Correct: C

Power = 1 − β. It measures the test's ability to correctly reject H₀ when H₀ is false. High power (>80%) is generally desired.

20. A one-sample t-test uses degrees of freedom equal to:

n
n − 1
n − 2
n + 1

✓ Correct: B

For a one-sample t-test, df = n − 1, because one degree of freedom is used up when the sample mean is computed to estimate the standard deviation.

21. An analyst finds p = 0.048. A colleague argues the result is "borderline" and recommends re-running with more data to get p < 0.01. This approach is problematic because:

The sample is too large
It constitutes p-hacking — collecting data until significance is achieved
p = 0.048 is already highly significant
The t-distribution should be used instead

✓ Correct: B

Continuing data collection until a desired p-value is achieved inflates the true Type I error rate beyond α. This practice, called "optional stopping," is a form of p-hacking.

22. The degrees of freedom in a chi-square test of independence with a 3×4 contingency table are:

✓ Correct: A

df = (r − 1)(c − 1) = (3 − 1)(4 − 1) = 2 × 3 = 6.

23. In auditing, which hypothesis test is most directly relevant to testing whether financial figures conform to Benford's Law?

Two-sample t-test
Z-test for proportions
Chi-square goodness-of-fit test
Paired t-test

✓ Correct: C

Chi-square goodness-of-fit compares observed digit frequencies to the theoretical Benford distribution — a direct test of whether data conforms to the expected pattern.

24. Which of the following would increase the statistical power of a test the most?

Decreasing the sample size
Increasing the significance level (α)
Increasing both sample size and α
Using a two-tailed test instead of one-tailed

✓ Correct: C

Power increases when sample size increases (more precise estimate) or when α increases (easier to reject H₀). Increasing both has the greatest effect.

25. A fund manager claims her portfolio outperforms the market index (mean excess return > 0). The appropriate alternative hypothesis is:

H₁: μ = 0
H₁: μ ≠ 0
H₁: μ > 0
H₁: μ < 0

✓ Correct: C

The claim is directional — the portfolio generates positive excess returns. So H₁: μ > 0 and a right-tailed test is appropriate.

26. Which of the following is NOT an assumption of the t-test?

Approximate normality of the data
Independence of observations
The population standard deviation must be known
Continuous data (interval or ratio scale)

✓ Correct: C

The t-test is designed for situations where σ is UNKNOWN (that's precisely when to use it instead of the Z-test). Known σ is an assumption of the Z-test, not the t-test.

27. If a test has p = 0.003, this is best described as:

Not significant
Marginally significant
Significant at the 5% level only
Statistically significant at the 1% (and 5%) level

✓ Correct: D

p = 0.003 < 0.01 < 0.05, so the result is significant at both the 1% and 5% significance levels. This is very strong evidence against H₀.

28. A retail bank wants to know if average customer wait time is more than 10 minutes. The correct null hypothesis is:

H₀: μ > 10
H₀: μ = 10
H₀: μ < 10
H₀: μ ≠ 10

✓ Correct: B

The null hypothesis always includes equality. H₀: μ = 10 (or μ ≤ 10) represents the status quo; the directional alternative H₁: μ > 10 is what the test evaluates.

29. Effect size is important in hypothesis testing because:

It determines the p-value directly
It tells you whether the result is statistically significant
It measures the magnitude of the difference, not just whether it is significant
It replaces the need for a significance level

✓ Correct: C

Effect size (e.g., Cohen's d) measures how large the detected difference actually is in practical terms, which is essential context that a p-value alone does not provide.

30. Which of the following scenarios best justifies a chi-square test of independence?

Comparing mean salaries between two departments
Testing if a sample mean equals a hypothesized population mean
Testing whether voting preference is related to income bracket
Comparing before-and-after blood pressure readings

✓ Correct: C

Testing the relationship between two categorical variables (voting preference and income bracket) is precisely the purpose of a chi-square independence test.

Frequently Asked Questions

What is hypothesis testing in statistics?

Hypothesis testing is a formal statistical procedure that uses sample data to evaluate two competing claims about a population — the null hypothesis (no effect) and the alternative hypothesis (an effect exists). The procedure determines whether the observed data provides sufficient evidence to reject the null hypothesis, based on a pre-set probability threshold called the significance level.

What is the null hypothesis (H₀)?

The null hypothesis is the default claim that no effect, difference, or relationship exists in the population. It represents the status quo and is assumed true until the data provides sufficient evidence to reject it. Example: H₀: the new drug has no effect on blood pressure.

What is the alternative hypothesis (H₁)?

The alternative hypothesis is the competing claim that challenges the null hypothesis — it asserts that an effect, difference, or relationship does exist. Researchers seek evidence to support H₁ by rejecting H₀. It can be directional (one-tailed) or non-directional (two-tailed).

What is a p-value in hypothesis testing?

A p-value is the probability of observing data as extreme as (or more extreme than) the sample result, assuming the null hypothesis is true. A small p-value (below the significance level α) indicates that the observed result would be very unlikely under H₀, providing grounds to reject it. It does not measure the probability that H₀ is true.

What does p < 0.05 mean?

When p < 0.05, the observed result would occur less than 5% of the time if H₀ were true. At a 5% significance level, this is considered statistically significant — there is sufficient evidence to reject the null hypothesis. However, this does not mean the result is practically important or that H₁ is proven true.

What is statistical significance?

A result is statistically significant when the p-value falls below the pre-set significance level (α), indicating the observed data is unlikely to have occurred by chance under the null hypothesis. Importantly, statistical significance is not the same as practical significance — a result can be statistically significant yet trivially small in magnitude.

What is a Type I error?

A Type I error (false positive) occurs when the null hypothesis is rejected even though it is actually true. The probability of a Type I error equals the significance level α. For example, at α = 0.05, there is a 5% chance of incorrectly concluding an effect exists when none does.

What is a Type II error?

A Type II error (false negative) occurs when we fail to reject a false null hypothesis — we miss a real effect. The probability of a Type II error is denoted β. Power (1 − β) is the probability of correctly detecting a real effect. Larger sample sizes reduce β and increase power.

What is the significance level (alpha) in hypothesis testing?

The significance level (α) is the probability threshold below which a p-value is considered statistically significant — and the maximum acceptable probability of making a Type I error. It is set before data collection. The most common value is α = 0.05 (5%), though stricter levels (0.01, 0.001) are used in high-stakes research.

When should I use a Z-test?

Use a Z-test when: (1) the population standard deviation (σ) is known, and (2) the sample is large (n ≥ 30) or the population is normally distributed. In practice, σ is rarely known, making the t-test the default choice for most applications.

When should I use a t-test?

Use a t-test when the population standard deviation is unknown and must be estimated from the sample — which covers the vast majority of real research situations. The t-test is appropriate for any sample size, though it is particularly important for small samples (n < 30) where the Z approximation is less reliable.

What is the difference between a one-sample and a two-sample t-test?

A one-sample t-test compares a single sample mean to a known or hypothesized population value. A two-sample (independent) t-test compares the means of two separate, independent groups to determine whether they differ significantly from each other.

What is a paired t-test used for?

A paired t-test is used when the same subjects are measured twice — in before/after studies, matched pairs, or repeated-measures designs. It tests whether the mean difference between paired observations is significantly different from zero, accounting for within-subject correlation.

What is a chi-square test?

The chi-square (χ²) test is a non-parametric statistical test used for categorical (count) data. It comes in two forms: the goodness-of-fit test (does observed data match an expected distribution?) and the test of independence (are two categorical variables related to each other?).

What is degrees of freedom in hypothesis testing?

Degrees of freedom (df) represents the number of independent values that can vary in a statistical calculation. It determines the exact shape of the t or chi-square distribution used to find critical values and p-values. For a one-sample t-test, df = n − 1; for an independent t-test, df = n₁ + n₂ − 2; for chi-square independence, df = (r−1)(c−1).

What is a one-tailed vs a two-tailed test?

A two-tailed test checks for an effect in either direction (H₁: μ ≠ μ₀) and places the rejection region in both tails of the distribution. A one-tailed test checks for an effect in only one direction (H₁: μ > μ₀ or H₁: μ < μ₀) and is more powerful for detecting effects in the specified direction, but only valid when the direction is specified before data collection.

How is hypothesis testing used in finance?

In finance, hypothesis testing is used to evaluate investment performance (does a portfolio beat its benchmark?), test market efficiency, analyze risk (has volatility changed?), assess credit risk (do default rates differ across rating categories?), and conduct event studies (do stock prices react significantly to earnings announcements?).

How is hypothesis testing used in auditing?

Auditors use hypothesis testing in audit sampling (testing whether mean errors in a sample indicate material misstatement), compliance testing (whether a proportion of transactions meet policy), Benford's Law analysis (chi-square test to detect unusual digit distributions that may indicate fraud), and fraud investigation (testing whether expense amounts deviate significantly from historical norms).

What is the difference between statistical significance and practical significance?

Statistical significance indicates that an observed effect is unlikely to be due to chance (p < α). Practical significance indicates that the effect is large enough to be meaningful in real-world terms. A statistically significant result may have a negligible practical effect if the sample is very large. Always report effect sizes alongside p-values.

What is p-hacking?

P-hacking (data dredging or data fishing) refers to the practice of analyzing data in multiple ways or collecting data until a significant p-value is found, then reporting only the significant result. This exploits the fact that even with no real effect, about 5% of tests at α = 0.05 will produce significant results by chance. P-hacking inflates false positive rates and undermines scientific integrity.

What is the Bonferroni correction?

The Bonferroni correction is an adjustment applied when conducting multiple hypothesis tests simultaneously, to control the family-wise Type I error rate. The adjusted significance level is α* = α/k, where k is the number of tests. For example, running 10 tests at α = 0.05 requires p < 0.005 for each individual test to be considered significant.

What is statistical power and how can it be improved?

Statistical power (1 − β) is the probability that a test correctly rejects H₀ when H₀ is false. It can be improved by: increasing sample size (the most effective method), increasing the effect size (not always controllable), raising α (accepting more Type I risk), using a one-tailed test (when appropriate), using more precise measurement instruments, and reducing data variability through better experimental control.

What is an effect size in statistics?

Effect size measures the practical magnitude of a statistical finding, independent of sample size. Common measures include Cohen's d for mean differences (small = 0.2, medium = 0.5, large = 0.8), r or η² for correlation/proportion of variance explained, and odds ratios for categorical outcomes. Effect sizes allow comparison across studies and communicate practical importance.
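
As an illustration, Cohen's d for a one-sample comparison is the mean difference divided by the sample standard deviation. The sketch below applies it to the loan-processing case study (x̄ = 7.8, μ₀ = 8.5, s = 2.1) and labels the result using the thresholds quoted above; the function names are illustrative.

```python
def cohens_d_one_sample(x_bar, mu_0, s):
    """Standardized difference between a sample mean and a reference value."""
    return (x_bar - mu_0) / s

def label(d):
    """Conventional size label based on the 0.2 / 0.5 / 0.8 thresholds."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"

d = cohens_d_one_sample(x_bar=7.8, mu_0=8.5, s=2.1)
print(f"d = {d:.2f} ({label(d)})")
```

The sign shows the direction (processing time went down); the magnitude, about one third of a standard deviation, falls in the small-to-medium range.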

What is Benford's Law in the context of auditing?

Benford's Law states that in many naturally occurring datasets, the leading digit is not uniformly distributed but follows a specific logarithmic distribution (1 appears ~30%, 2 appears ~18%, etc.). Auditors apply a chi-square goodness-of-fit test to compare the actual leading digit distribution of financial figures against Benford's expected distribution. Significant deviations may indicate data manipulation or fraud.
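
A minimal sketch of the check follows. Benford's expected proportion for leading digit d is log₁₀(1 + 1/d). The observed counts below are made up purely for illustration and do not come from any real ledger; the function name benford_chi2 is hypothetical, and 15.507 is the standard χ² critical value for df = 8 at α = 0.05.

```python
import math

# Expected leading-digit proportions under Benford's Law
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def benford_chi2(counts):
    """Chi-square goodness-of-fit statistic for leading-digit counts (dict: digit -> count)."""
    n = sum(counts.values())
    return sum((counts[d] - n * p) ** 2 / (n * p) for d, p in benford.items())

# Hypothetical counts of leading digits in 1,000 invoice amounts (illustrative only)
hypothetical_counts = {1: 280, 2: 170, 3: 120, 4: 100, 5: 90, 6: 70, 7: 65, 8: 55, 9: 50}

CRITICAL_DF8_ALPHA05 = 15.507
stat = benford_chi2(hypothetical_counts)
flag_for_review = stat > CRITICAL_DF8_ALPHA05

print(f"P(1) = {benford[1]:.3f}, P(2) = {benford[2]:.3f}, chi2 = {stat:.2f}, flag: {flag_for_review}")
```

A flagged result is a prompt for further audit work, not proof of manipulation, since legitimate data can also depart from Benford's pattern.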

How is a confidence interval related to hypothesis testing?

A 95% confidence interval and a two-tailed hypothesis test at α = 0.05 are mathematically equivalent. If the null hypothesis value falls outside the 95% confidence interval, H₀ would be rejected at α = 0.05. Confidence intervals provide richer information than hypothesis tests alone — they communicate both the direction and magnitude of the effect and the precision of the estimate.
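
The equivalence can be seen with the delivery-time data from Problem 2 (n = 16, x̄ = 3.5, s = 0.8, H₀: μ = 3). The sketch below takes the critical value 2.131 from that problem rather than computing it; variable names are illustrative.

```python
import math

n, x_bar, s, mu_0 = 16, 3.5, 0.8, 3.0
t_crit = 2.131                       # two-tailed, df = 15, alpha = 0.05 (from Problem 2)

se = s / math.sqrt(n)                # standard error of the mean
ci_low, ci_high = x_bar - t_crit * se, x_bar + t_crit * se

t_stat = (x_bar - mu_0) / se
reject_by_test = abs(t_stat) > t_crit
reject_by_ci = not (ci_low <= mu_0 <= ci_high)

print(f"95% CI = ({ci_low:.3f}, {ci_high:.3f}); test rejects: {reject_by_test}; CI excludes mu_0: {reject_by_ci}")
```

Both routes reach the same decision, and the interval additionally shows the range of mean delivery times that remain compatible with the data.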

Can hypothesis testing prove a hypothesis is true?

No. Hypothesis testing can only provide evidence to reject the null hypothesis — it cannot prove any hypothesis is definitively true. Failing to reject H₀ means there is insufficient evidence against it, not that it has been confirmed. Similarly, rejecting H₀ means the alternative is supported by the evidence, not proven with certainty.

What is the minimum expected frequency for a chi-square test?

The standard rule of thumb is that every expected cell frequency should be at least 5. If this condition is violated, the chi-square approximation becomes unreliable. Solutions include: collapsing categories with low frequencies, gathering more data, or using an exact test (such as Fisher's Exact Test for 2×2 tables).
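
Checking this condition is mechanical: each expected count in a contingency table is (row total × column total) / grand total. The sketch below uses hypothetical function names and a made-up 2×2 table.

```python
def expected_frequencies(table):
    """Expected counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def cells_below(table, minimum=5.0):
    """List (row, column, expected) for every cell whose expected count is too small."""
    expected = expected_frequencies(table)
    return [
        (i, j, e)
        for i, row in enumerate(expected)
        for j, e in enumerate(row)
        if e < minimum
    ]


# Hypothetical observed counts.
observed = [[12, 3], [20, 5]]
print(expected_frequencies(observed))
print(cells_below(observed))
```

Any cell returned by `cells_below` signals that one of the remedies above (collapsing categories, more data, or an exact test) should be considered.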

What is the difference between parametric and non-parametric tests?

Parametric tests (Z-test, t-test, ANOVA) assume the data follow a specific distribution (usually normal) and test hypotheses about population parameters (mean, variance). Non-parametric tests (chi-square, Mann-Whitney U, Kruskal-Wallis) make fewer distributional assumptions and work with ranks or frequencies. Non-parametric tests are generally less powerful when the parametric assumptions hold, but more robust when those assumptions are violated.

What is the appropriate hypothesis test for A/B testing in digital marketing?

For A/B testing conversion rates (proportions) with large samples, a Z-test for proportions is standard. For comparing mean purchase values or session durations between groups, an independent samples t-test is appropriate. Chi-square tests can be used for categorical outcomes (e.g., which product category was purchased). The choice depends on the metric being tested.
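
For the conversion-rate case, a two-proportion Z-test with a pooled standard error is one common formulation. The sketch below is illustrative only: the function name and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-tailed Z-test for H0: p_a = p_b, using the pooled proportion."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: variant A converts 200 of 2000, variant B 250 of 2000.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

With these made-up counts the difference of 2.5 percentage points is significant at α = 0.05; whether it is large enough to matter commercially is a separate, practical question.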

How do I choose between a one-tailed and two-tailed test?

Choose a two-tailed test (H₁: μ ≠ μ₀) when you have no prior reason to predict the direction of the effect — this is appropriate for most exploratory research. Choose a one-tailed test (H₁: μ > μ₀ or H₁: μ < μ₀) only when theory or prior evidence strongly predicts a specific direction, and this direction must be specified before data collection. Never choose the direction based on what the data shows.
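
The practical stakes are easy to see numerically: for the same test statistic, the one-tailed p-value in the predicted direction is half the two-tailed one. In this sketch the function name, the `tail` labels, and z = 1.80 are assumptions for illustration.

```python
from statistics import NormalDist


def p_value_from_z(z, tail="two"):
    """p-value for a Z statistic under a two-tailed or one-tailed alternative."""
    std_normal = NormalDist()
    if tail == "two":        # H1: mu != mu0
        return 2 * (1 - std_normal.cdf(abs(z)))
    if tail == "greater":    # H1: mu > mu0
        return 1 - std_normal.cdf(z)
    if tail == "less":       # H1: mu < mu0
        return std_normal.cdf(z)
    raise ValueError("tail must be 'two', 'greater', or 'less'")


z = 1.80  # hypothetical test statistic
for tail in ("two", "greater", "less"):
    print(tail, round(p_value_from_z(z, tail), 4))
```

Here the two-tailed test does not reach p < 0.05 while the upper-tailed test does, which is exactly why the direction has to be fixed before the data are seen.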

What are the steps in hypothesis testing?

The 7 steps of hypothesis testing are: (1) State the null and alternative hypotheses; (2) Set the significance level α; (3) Select the appropriate statistical test and verify assumptions; (4) Collect data; (5) Calculate the test statistic; (6) Find the p-value and compare to α to make a decision; (7) State the conclusion in context, relating the statistical result back to the original research question.
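
The seven steps map naturally onto code. The sketch below walks through them for a two-tailed one-sample Z-test; the function name, the fill-weight sample, μ₀ = 500, and σ = 6 are all hypothetical, and a t-test would replace the normal distribution if σ were unknown.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    # Steps 1-2: H0: mu = mu0 versus H1: mu != mu0, with alpha fixed in advance.
    # Step 3: a Z-test is chosen, assuming sigma is known and the sample is random.
    n = len(sample)                                  # Step 4: data already collected
    z = (mean(sample) - mu0) / (sigma / sqrt(n))     # Step 5: test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # Step 6: p-value and decision
    reject = p_value < alpha
    conclusion = (                                   # Step 7: conclusion in context
        f"Evidence that the mean differs from {mu0}"
        if reject
        else f"Insufficient evidence that the mean differs from {mu0}"
    )
    return z, p_value, reject, conclusion


# Hypothetical fill weights in grams.
weights = [503, 498, 507, 510, 495, 505, 509, 501, 506]
z, p_value, reject, conclusion = one_sample_z_test(weights, mu0=500, sigma=6)
print(f"z = {z:.2f}, p = {p_value:.4f}: {conclusion}")
```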

Module 6 — Final Summary

Hypothesis testing is one of the most powerful and widely applied tools in statistics. In this module, you have learned how to transform a research question into a testable hypothesis, how to evaluate that hypothesis using sample data, and how to draw principled, evidence-based conclusions.

The core framework is elegant: formulate competing hypotheses (H₀ and H₁), choose a standard of evidence (α), collect data, compute a test statistic, and compare the resulting p-value to that threshold. If p < α, reject H₀. If not, fail to reject H₀, which means withholding judgment rather than accepting H₀ as true.

You have mastered the three most fundamental hypothesis tests:

The Z-test for large samples with known population variance
The t-test (one-sample, independent, and paired) for the more common scenario of unknown variance
The chi-square test (goodness-of-fit and independence) for categorical frequency data

You have also learned about the critical limitations and risks: Type I and Type II errors, p-hacking, the multiple testing problem, the difference between statistical and practical significance, and the 15 most common mistakes that compromise the validity of hypothesis tests.

These skills apply directly to research, business analytics, financial analysis, auditing, and virtually any data-driven decision-making context. In the next module, you will build on this foundation by exploring regression analysis — the tool that moves beyond comparing means to modelling relationships between variables.

Continue Learning — Module 7

Regression Analysis and Applications

Learn simple and multiple regression, interpret coefficients, assess model fit, and apply regression to business forecasting, financial modeling, and data analytics.


