Hypothesis Testing
How statisticians, researchers, and business analysts use data to make confident, evidence-based decisions — step by step, test by test.
📋 Quick Learning Navigation
- 1. Introduction
- 2. What is Hypothesis Testing?
- 3. Null Hypothesis (H₀)
- 4. Alternative Hypothesis (H₁)
- 5. Significance Level (α)
- 6. p-value
- 7. Type I & Type II Errors
- 8. Testing Process (7 Steps)
- 9. Z-Test
- 10. t-Test
- 11. Chi-Square Test
- 12. Choosing the Right Test
- 13. Finance, Audit & Business
- 14. Common Mistakes
- 15. Case Study
- 16. Exercises
- 17. 30-Question MCQ Quiz
- 18. Frequently Asked Questions
Introduction
Every day, decisions are made using data. A pharmaceutical company wants to know whether a new drug lowers blood pressure better than the existing one. A marketing team needs to know whether a redesigned homepage increased sign-ups. An auditor needs to determine whether a client's reported figures are materially misstated. A finance professional wants to know whether a portfolio strategy beats the market.
These are not questions that can be answered by gut feeling or by looking at a single number. They require a structured, mathematical framework for separating genuine effects from random variation. That framework is hypothesis testing.
Hypothesis testing is the backbone of inferential statistics — the branch of statistics concerned with drawing conclusions about a population from a sample. Because we can almost never observe an entire population, we collect a sample, measure it, and then ask: "Is what I'm observing real, or could it simply be the result of chance?"
In this module, you will master every core concept in hypothesis testing: formulating hypotheses, choosing a significance level, computing and interpreting p-values, understanding the two types of errors, and applying the three most common statistical tests — the Z-test, the t-test, and the chi-square test. By the end, you will be equipped to apply hypothesis testing in research, business, finance, and auditing contexts.
What is Hypothesis Testing?
Simple Definition
Hypothesis testing is a statistical procedure used to decide, based on sample data, whether there is enough evidence to support or reject a specific claim about a population.
Statistical Definition
Formally, hypothesis testing is a method of statistical inference that evaluates two competing statements — the null hypothesis and the alternative hypothesis — by quantifying how likely the observed data would be if the null hypothesis were true. The outcome is a decision: either reject the null hypothesis or fail to reject it, based on a pre-specified threshold of evidence called the significance level.
Practical Meaning
Think of hypothesis testing as a courtroom trial. The null hypothesis is the assumption of innocence: the default position that nothing unusual is happening, no effect exists, no difference is present. The data is the evidence. The significance level is the standard of proof required. If the evidence is strong enough — if the data is too unusual to be explained by chance alone — we reject the null hypothesis. Otherwise, we withhold judgment.
Finance: An analyst claims a stock-picking algorithm generates above-market returns. A hypothesis test on historical returns will evaluate that claim.
Healthcare: Researchers claim a vaccine reduces infection rates. A hypothesis test on clinical trial data will determine whether the reduction is statistically meaningful.
Auditing: An auditor suspects expense claims are inflated. A hypothesis test on a random sample of transactions can support or contradict that suspicion.
Marketing: A team runs an A/B test on two email subject lines. A hypothesis test on open rates tells them which one genuinely performs better.
The Null Hypothesis (H₀)
The null hypothesis (H₀) is the default assumption — the statement that there is no effect, no difference, or no relationship in the population being studied. It is the baseline that the researcher attempts to disprove.
Purpose of the Null Hypothesis
The null hypothesis serves as a benchmark for comparison. It specifies what we would observe if nothing interesting is happening — if the treatment has no effect, the groups are identical, or the variables are unrelated. Statistical tests calculate how surprising the observed data is under this assumption. If the data is very unlikely given H₀, we have grounds to reject it.
Practical Examples
| Context | Research Question | Null Hypothesis (H₀) |
|---|---|---|
| Medicine | Does Drug A reduce blood pressure more than Drug B? | There is no difference in blood pressure reduction between Drug A and Drug B. |
| Business | Does the new website design increase conversion rates? | The new design has no effect on conversion rates. |
| Finance | Does the trading algorithm outperform a buy-and-hold strategy? | The algorithm's average return equals the market's average return. |
| Auditing | Are the reported expenses consistent with historical patterns? | The mean expense amount equals the historical mean. |
| Marketing | Do customers in different regions have different brand preferences? | Brand preference is independent of region. |
| Education | Does the new teaching method improve test scores? | The new method has no effect on mean test scores. |
In mathematical notation, null hypotheses are typically stated in terms of population parameters. For example: H₀: μ = 50 (the population mean equals 50), or H₀: μ₁ = μ₂ (two population means are equal), or H₀: p = 0.40 (a population proportion equals 40%).
The Alternative Hypothesis (H₁)
The alternative hypothesis (H₁), also written as Hₐ, is the statement that challenges the null hypothesis. It is the claim that there IS an effect, a difference, or a relationship — the statement the researcher is trying to provide evidence for.
Types of Alternative Hypotheses
Alternative hypotheses come in three forms, each corresponding to a different type of statistical test:
| Type | Symbol | Description | Test Type |
|---|---|---|---|
| Two-tailed | H₁: μ ≠ μ₀ | The parameter is different from the null value (either direction) | Two-tailed test |
| Right-tailed | H₁: μ > μ₀ | The parameter is greater than the null value | Right-tailed (upper) test |
| Left-tailed | H₁: μ < μ₀ | The parameter is less than the null value | Left-tailed (lower) test |
Choosing the Right Form
The choice between one-tailed and two-tailed tests should be made before collecting data, based on the research question:
- Use a two-tailed test when you have no prior reason to expect the difference to go in a particular direction (most common in exploratory research).
- Use a one-tailed test when theory or prior evidence strongly predicts the direction of the effect (e.g., a new drug should lower, not raise, blood pressure).
Null vs Alternative Hypothesis — Comparison
| Dimension | Null Hypothesis (H₀) | Alternative Hypothesis (H₁) |
|---|---|---|
| Definition | Default claim; no effect or difference | Competing claim; asserts an effect or difference |
| Notation | H₀ | H₁ or Hₐ |
| Content | Includes equality (=, ≤, ≥) | Never includes equality (≠, >, <) |
| Role | Statement we test against | Statement we seek evidence for |
| Result | Rejected or failed to be rejected | Supported if H₀ is rejected |
| Burden of proof | Assumed true until proven otherwise | Must be supported by evidence |
| Example | H₀: μ = 100 | H₁: μ ≠ 100 |
Significance Level (α)
The significance level, denoted α (alpha), is the probability threshold below which a p-value is considered statistically significant — the maximum probability of rejecting a true null hypothesis that the researcher is willing to accept.
In practical terms, α defines how much evidence is required to conclude that the null hypothesis is false. A smaller α demands stronger evidence.
Common Significance Levels
| Significance Level | Confidence Level | Typical Use |
|---|---|---|
| α = 0.10 (10%) | 90% | Exploratory research, preliminary studies; lower standards acceptable |
| α = 0.05 (5%) | 95% | Standard in most academic, business, and social science research |
| α = 0.01 (1%) | 99% | Medical research, safety-critical applications, financial regulations |
| α = 0.001 (0.1%) | 99.9% | Settings that demand very strong evidence. Particle physics is stricter still: its "five sigma" discovery standard corresponds to p ≈ 3 × 10⁻⁷ |
The 0.05 Rule — Explained
The most widely used significance level, α = 0.05, means: "I will accept a 5% chance of concluding there is an effect when in fact there is none." This is known as the Type I error rate (covered in the next section).
The 0.05 threshold was popularized by statistician Ronald Fisher in the 1920s and has since become the de facto standard. However, it is important to understand that this threshold is a convention, not a law of nature. Depending on the stakes of the decision, a different α may be more appropriate.
The p-value
Definition
The p-value (probability value) is the probability of observing a test statistic as extreme as — or more extreme than — the one actually observed, assuming the null hypothesis is true.
Put simply: a p-value answers the question, "If H₀ were true, how often would we see data this unusual just by chance?"
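The definition above can be made concrete for a Z statistic. The following is a minimal sketch, assuming the test statistic follows a standard normal distribution under H₀; the function name `z_p_value` and its `alternative` argument are illustrative choices, not part of the original module.

```python
from statistics import NormalDist

def z_p_value(z: float, alternative: str = "two-sided") -> float:
    """Probability, under H0, of a Z statistic at least as extreme as z."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "two-sided":
        # Extreme in either direction: double the upper-tail area of |z|
        return 2 * (1 - std_normal.cdf(abs(z)))
    if alternative == "greater":
        # Right-tailed: only large positive values count as extreme
        return 1 - std_normal.cdf(z)
    if alternative == "less":
        # Left-tailed: only large negative values count as extreme
        return std_normal.cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

print(f"two-sided p for z = 1.96:  {z_p_value(1.96):.4f}")
print(f"right-tail p for z = 1.645: {z_p_value(1.645, 'greater'):.4f}")
```

Both calls return values very close to 0.05, which is why 1.96 and 1.645 appear as the usual critical values at that level.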
Interpreting the p-value
| p-value | Interpretation | Decision (α = 0.05) |
|---|---|---|
| p < 0.001 | Extremely strong evidence against H₀ | Reject H₀ — highly significant |
| 0.001 ≤ p < 0.01 | Very strong evidence against H₀ | Reject H₀ — very significant |
| 0.01 ≤ p < 0.05 | Moderate evidence against H₀ | Reject H₀ — significant |
| 0.05 ≤ p < 0.10 | Weak evidence against H₀ | Fail to reject H₀ — marginal |
| p ≥ 0.10 | Little or no evidence against H₀ | Fail to reject H₀ — not significant |
p < 0.05 — The Decision Rule
At the standard α = 0.05 level:
- If p < 0.05: Reject H₀. The result is statistically significant: the data provide sufficient evidence against H₀ in favor of the alternative hypothesis.
- If p ≥ 0.05: Fail to reject H₀. The result is not statistically significant. There is insufficient evidence to support the alternative hypothesis.
Common misinterpretations to avoid:
- A p-value is NOT the probability that H₀ is true.
- A p-value is NOT the probability that the result occurred by chance.
- A statistically significant result is NOT necessarily practically significant (effect size matters).
- p = 0.049 is NOT meaningfully different from p = 0.051.
Practical Example
A company tests whether a new training program increases average employee productivity scores. The old program yielded a mean score of 72. After the new program, a sample of 40 employees shows a mean score of 76. After running the appropriate test, the p-value is 0.021.
Since 0.021 < 0.05, we reject H₀. The improvement from 72 to 76 is statistically significant. There is evidence that the new training program improves productivity scores — the observed difference is unlikely to be due to chance alone.
Type I and Type II Errors
Because hypothesis testing is based on probability, it is never 100% certain. Two kinds of mistakes are possible: rejecting a null hypothesis that is actually true, or failing to reject a null hypothesis that is actually false.
Type I Error (False Positive)
A Type I error occurs when we reject a true null hypothesis. We conclude that an effect exists when in reality it does not.
- The probability of a Type I error is equal to the significance level: P(Type I Error) = α
- Also called a false positive or false alarm
Auditing: An auditor concludes fraud has occurred when the figures are actually accurate. The client faces unnecessary scrutiny and reputational damage.
Business: A company launches an expensive marketing campaign based on a test that falsely showed the campaign significantly increases sales.
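The claim that P(Type I Error) = α can be checked by simulation. This is a rough sketch under assumed conditions: the population values (mean 100, σ = 15), the sample size, and the number of trials are all hypothetical, and H₀ is true by construction in every trial.

```python
import random
from statistics import NormalDist, mean

random.seed(42)  # fixed seed so the run is reproducible

ALPHA = 0.05
MU_0, SIGMA, N = 100.0, 15.0, 30  # hypothetical population and sample size
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-tailed critical value

def rejects_h0() -> bool:
    """Draw one sample from a population where H0 is true and run a Z-test."""
    sample = [random.gauss(MU_0, SIGMA) for _ in range(N)]
    z = (mean(sample) - MU_0) / (SIGMA / N ** 0.5)
    return abs(z) > z_crit

trials = 5000
false_positive_rate = sum(rejects_h0() for _ in range(trials)) / trials
print(f"Observed Type I error rate: {false_positive_rate:.3f}")
```

Every rejection here is a false positive, so the printed rate should land near 0.05, up to simulation noise.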
Type II Error (False Negative)
A Type II error occurs when we fail to reject a false null hypothesis. We conclude there is no effect when in reality there is one.
- The probability of a Type II error is denoted β (beta)
- Also called a false negative or missed detection
- Statistical power = 1 − β (the probability of correctly rejecting a false H₀)
Auditing: An auditor fails to detect actual fraud or material misstatement in the financial statements.
Business: A company abandons a marketing campaign that actually works because the sample test failed to detect the real improvement.
Type I vs Type II Error — Comparison Table
| Dimension | Type I Error | Type II Error |
|---|---|---|
| Definition | Rejecting a true H₀ | Failing to reject a false H₀ |
| Also called | False positive | False negative |
| Probability | α (significance level) | β (beta) |
| Control | Reduced by lowering α | Reduced by increasing power (larger sample, better design) |
| The cost | Acting on a false signal | Missing a real signal |
| Trade-off | Lowering α increases β | Lowering β requires larger sample or higher α |
| Medical analogy | Diagnosing a disease the patient doesn't have | Missing a disease the patient does have |
The Error Trade-Off
There is an inherent trade-off between Type I and Type II errors: reducing one tends to increase the other, for a fixed sample size. The only way to reduce both simultaneously is to increase the sample size. This is why adequate sample size planning (called power analysis) is a critical step in research design.
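The trade-off can be illustrated numerically. The sketch below computes the power of a right-tailed Z-test, 1 − β = 1 − Φ(z_α − δ√n/σ), for a hypothetical true effect δ = 2 and σ = 10; these numbers and the function name are assumptions chosen only for illustration.

```python
from statistics import NormalDist

def power_right_tailed_z(effect: float, sigma: float, n: int, alpha: float) -> float:
    """Power of a right-tailed Z-test when the true mean exceeds mu0 by `effect`."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)   # critical value under H0
    shift = effect / (sigma / n ** 0.5)       # true effect in standard-error units
    return 1 - std_normal.cdf(z_alpha - shift)

# Hypothetical scenario: true effect of 2 units, sigma = 10
for n in (25, 100):
    for alpha in (0.05, 0.01):
        power = power_right_tailed_z(2.0, 10.0, n, alpha)
        print(f"n = {n:3d}, alpha = {alpha:.2f}: power = {power:.3f}, beta = {1 - power:.3f}")
```

At a fixed n, moving from α = 0.05 to α = 0.01 lowers power (raises β); raising n from 25 to 100 raises power at either α.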
The 7-Step Hypothesis Testing Process
Every hypothesis test, regardless of the specific statistical method used, follows this standard seven-step procedure:
1. State the Hypotheses. Clearly define both H₀ and H₁ in terms of a specific population parameter (mean, proportion, variance, etc.). The hypotheses must be mutually exclusive and collectively exhaustive. Example: H₀: μ = 500; H₁: μ ≠ 500.
2. Select the Significance Level (α). Choose α before seeing the data. The standard is α = 0.05 unless the context demands a stricter or more lenient threshold. Document your choice and rationale.
3. Identify the Test and Check Assumptions. Choose the appropriate statistical test (Z-test, t-test, chi-square, etc.) based on the data type, sample size, and knowledge of population parameters. Verify that the test's assumptions are met (normality, independence, equal variance, etc.).
4. Collect Data. Gather a representative, random sample from the population. Ensure data quality: errors in data collection undermine even the most sophisticated analysis.
5. Calculate the Test Statistic. Compute the numerical value that measures how far the observed sample data is from what H₀ would predict. It is scaled (for Z and t, expressed in standard-error units) so it can be compared to a known distribution (Z, t, χ², F, etc.).
6. Find the p-value and Make a Decision. Calculate the p-value corresponding to the test statistic and compare it to α. If p < α, reject H₀. If p ≥ α, fail to reject H₀. Alternatively, compare the test statistic to the critical value from the appropriate distribution table.
7. State the Conclusion in Context. Interpret the statistical result in plain language, in the context of the original research question. Do not just say "rejected"; say what this means for the business decision, the research finding, or the audit conclusion.
The Z-Test
Definition and Purpose
The Z-test is a parametric hypothesis test used to determine whether the mean of a sample is significantly different from a known or hypothesized population mean. It applies when the population standard deviation (σ) is known and either the sample is large (typically n ≥ 30) or the population is normally distributed.
Assumptions
- The population standard deviation (σ) is known
- The sample size is large (n ≥ 30) OR the population is normally distributed
- Observations are independent
- Data is continuous (interval or ratio scale)
Z-Test Formula
Z = (x̄ − μ₀) / (σ / √n)
where x̄ is the sample mean, μ₀ is the hypothesized population mean, σ is the known population standard deviation, and n is the sample size.
Decision Rule (Z-Test)
| Test Type | Reject H₀ if... | Critical Values (α = 0.05) |
|---|---|---|
| Two-tailed | |Z| > Z_α/2 | Z < −1.96 or Z > +1.96 |
| Right-tailed | Z > Z_α | Z > +1.645 |
| Left-tailed | Z < −Z_α | Z < −1.645 |
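The critical values in the table can be recovered from the standard normal quantile function rather than looked up. A short sketch using Python's standard library follows; the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()

# Two-tailed: alpha is split evenly between the two tails
z_two_tailed = std_normal.inv_cdf(1 - alpha / 2)
# One-tailed: all of alpha sits in a single tail
z_one_tailed = std_normal.inv_cdf(1 - alpha)

print(f"two-tailed critical values: ±{z_two_tailed:.3f}")
print(f"right-tailed critical value: +{z_one_tailed:.3f}")
print(f"left-tailed critical value:  -{z_one_tailed:.3f}")
```

Changing `alpha` to 0.01 or 0.10 gives the corresponding cut-offs for the other significance levels discussed earlier.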
Worked Example — Z-Test
Quality Control in Manufacturing
Problem: A bottling plant claims that each bottle contains a mean of 500 ml of beverage. A quality control inspector takes a random sample of 50 bottles and finds a sample mean of 497 ml. The population standard deviation is known to be 12 ml. Test at the α = 0.05 significance level whether the mean fill volume has changed.
Step 1 — Hypotheses:
H₀: μ = 500 ml (the mean fill is 500 ml)
H₁: μ ≠ 500 ml (the mean fill has changed — two-tailed test)
Step 2 — Significance Level: α = 0.05 (two-tailed, so critical values are ±1.96)
Step 3 — Test Statistic:
Z = (x̄ − μ₀) / (σ / √n) = (497 − 500) / (12 / √50) = −3 / 1.697 ≈ −1.768
Step 4 — Decision:
The calculated Z = −1.768. The critical value for a two-tailed test at α = 0.05 is ±1.96.
Since |−1.768| = 1.768 < 1.96, we fail to reject H₀.
Step 5 — Conclusion:
At the 5% significance level, there is insufficient evidence to conclude that the mean fill volume has changed from 500 ml. The difference of 3 ml could plausibly be due to random sampling variation. The quality inspector cannot conclude a systematic underfilling problem based on this sample alone.
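The bottling example can be reproduced in a few lines. This is a sketch using only the figures stated in the problem (μ₀ = 500, x̄ = 497, σ = 12, n = 50); it also computes the two-tailed p-value, which the worked example does not report.

```python
from math import sqrt
from statistics import NormalDist

mu_0, x_bar, sigma, n, alpha = 500.0, 497.0, 12.0, 50, 0.05

standard_error = sigma / sqrt(n)          # sigma / sqrt(n)
z = (x_bar - mu_0) / standard_error       # distance from H0 in standard errors
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value
reject_h0 = p_value < alpha

print(f"Z = {z:.3f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

The p-value approach and the critical-value approach agree: the p-value exceeds 0.05 exactly when |Z| is below 1.96.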
The t-Test
Definition and Purpose
The t-test is a parametric hypothesis test used to compare means when the population standard deviation is unknown and must be estimated from the sample data. It is also appropriate for small samples (n < 30). The t-test uses the t-distribution, which has heavier tails than the normal distribution to account for the additional uncertainty in estimating σ.
The Three Types of t-Test
1. One-Sample t-Test
Tests whether the mean of a single sample differs significantly from a hypothesized population mean when σ is unknown.
Example: A bank's loan department believes the average loan processing time is 3 days. A sample of 15 loans shows a mean of 3.8 days with a sample standard deviation of 1.2 days. Is the mean processing time significantly different from 3 days? (This requires a one-sample t-test with df = 14.)
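A sketch of the calculation for the loan example follows. Python's standard library has no t-distribution, so the two-tailed critical value for df = 14 at α = 0.05 (2.145, from a standard t-table) is hard-coded; the choice of α = 0.05 is an assumption, since the example does not state one.

```python
from math import sqrt

mu_0, x_bar, s, n = 3.0, 3.8, 1.2, 15

df = n - 1                              # degrees of freedom for a one-sample t-test
t = (x_bar - mu_0) / (s / sqrt(n))      # same form as Z, with s in place of sigma

T_CRIT_TWO_TAILED = 2.145               # t-table value for df = 14, alpha = 0.05 (assumed alpha)
reject_h0 = abs(t) > T_CRIT_TWO_TAILED

print(f"t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```

Under these assumptions the statistic exceeds the critical value, so the sample suggests the mean processing time differs from 3 days.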
2. Independent Samples t-Test (Two-Sample t-Test)
Compares the means of two independent groups to determine whether they differ significantly from each other.
Example: A company trains two groups of salespeople using different methods. Group A (n=20) achieved a mean monthly sales of $48,000 (s = $5,200). Group B (n=20) achieved $52,000 (s = $4,800). Is there a statistically significant difference in mean sales between the two methods?
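The sales example can be worked through with the pooled-variance form of the test. This sketch assumes equal population variances and α = 0.05 (neither is stated in the example), and hard-codes the two-tailed t-table critical value for df = 38 (about 2.024).

```python
from math import sqrt

n_a, mean_a, s_a = 20, 48_000.0, 5_200.0   # Group A
n_b, mean_b, s_b = 20, 52_000.0, 4_800.0   # Group B

df = n_a + n_b - 2
# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * s_a ** 2 + (n_b - 1) * s_b ** 2) / df
standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
t = (mean_a - mean_b) / standard_error

T_CRIT_TWO_TAILED = 2.024                  # t-table value for df = 38, alpha = 0.05 (assumed alpha)
reject_h0 = abs(t) > T_CRIT_TWO_TAILED

print(f"t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```

If the equal-variance assumption is doubtful, Welch's version of the test (which does not pool the variances) is the usual alternative.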
3. Paired Samples t-Test
Tests whether the mean difference between two paired (matched) observations is significantly different from zero. Typically used in before-and-after studies.
Example: Ten employees' productivity scores were measured before and after a wellness program. The paired t-test tests whether the mean score change is significantly different from zero.
Assumptions of t-Tests
- The data is approximately normally distributed (or n is large)
- Observations are independent (within each group)
- For independent-samples t-test: homogeneity of variance (can be tested with Levene's test)
- Data is continuous (interval or ratio scale)
Z-Test vs t-Test — When to Use Each
| Criterion | Z-Test | t-Test |
|---|---|---|
| Population σ | Known | Unknown (estimated from sample) |
| Sample Size | Large (n ≥ 30) | Any size (especially n < 30) |
| Distribution Used | Standard normal (Z) | t-distribution |
| Degrees of Freedom | Not applicable | Required (df = n−1 or n₁+n₂−2) |
| Tail heaviness | Normal tails | Heavier tails (more conservative) |
| When n → ∞ | Reference case (standard normal) | Converges to the Z-test result |
| Common Use | Quality control, proportion tests | Most practical research with unknown σ |
The Chi-Square Test (χ²)
Definition and Purpose
The chi-square test is a non-parametric hypothesis test used to analyze categorical data. Unlike the Z-test and t-test which compare means, the chi-square test examines whether observed frequencies (counts) in categories differ significantly from expected frequencies.
The Two Types of Chi-Square Tests
1. Chi-Square Goodness-of-Fit Test
Tests whether the distribution of a single categorical variable matches a hypothesized distribution.
Example: A die is rolled 120 times. If fair, each face should appear 20 times. The observed frequencies are: 1→18, 2→22, 3→25, 4→17, 5→19, 6→19. Is the die fair?
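The die example can be computed directly from the goodness-of-fit statistic χ² = Σ (O − E)² / E. In this sketch the critical value for df = 5 at α = 0.05 (11.070, from a standard chi-square table) is hard-coded, and α = 0.05 is an assumption because the example does not specify a level.

```python
observed = [18, 22, 25, 17, 19, 19]        # counts for faces 1 to 6
total_rolls = sum(observed)                # 120
expected = [total_rolls / 6] * 6           # 20 per face if the die is fair

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                     # categories minus one

CHI_SQ_CRIT = 11.070                       # table value for df = 5, alpha = 0.05 (assumed alpha)
reject_h0 = chi_sq > CHI_SQ_CRIT

print(f"chi-square = {chi_sq:.2f}, df = {df}, reject H0: {reject_h0}")
```

The statistic is far below the critical value, so these rolls give no evidence that the die is unfair.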
2. Chi-Square Test of Independence
Tests whether two categorical variables are independent of each other, or whether there is a significant association between them.
Setup: Data is arranged in a contingency table. Expected frequencies are computed assuming independence:
Expected frequency for a cell = (row total × column total) / grand total
Example: A marketing team surveys 200 customers and records their gender (Male/Female) and product preference (Product A/B/C). The chi-square test of independence will determine whether product preference is associated with gender, or whether the two variables are independent.
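A sketch of the mechanics follows. The survey counts are not given in the example, so the 2×3 table below is entirely hypothetical (it merely sums to 200); the critical value for df = 2 at α = 0.05 (5.991) is taken from a standard chi-square table.

```python
# Hypothetical counts: rows = Male, Female; columns = Product A, B, C
observed = [
    [30, 40, 30],
    [40, 30, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

CHI_SQ_CRIT = 5.991                        # table value for df = 2, alpha = 0.05
reject_h0 = chi_sq > CHI_SQ_CRIT

print(f"chi-square = {chi_sq:.3f}, df = {df}, reject H0: {reject_h0}")
```

With these made-up counts the statistic stays below the critical value, so independence would not be rejected; real survey data could of course lead to the opposite conclusion.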
Assumptions of the Chi-Square Test
- Data consists of counts (frequencies), not means or proportions
- Observations are independent
- Expected frequency in each cell should be at least 5 (rule of thumb)
- Categories are mutually exclusive
Typical applications of the chi-square test:
- Market Research: Is customer satisfaction related to age group?
- HR Analytics: Is promotion rate independent of gender?
- Auditing: Are transaction sizes distributed as expected under Benford's Law?
- Healthcare: Is disease incidence related to geographic region?
- Finance: Are defaults distributed uniformly across credit rating categories?
Choosing the Right Test
| Research Question | Data Type | Sample Size | σ Known? | Test to Use |
|---|---|---|---|---|
| Compare sample mean to known value | Continuous | n ≥ 30 | Yes | Z-Test |
| Compare sample mean to known value | Continuous | Any | No | One-sample t-test |
| Compare means of two independent groups | Continuous | Any | No | Independent t-test |
| Compare before/after in same subjects | Continuous | Any | No | Paired t-test |
| Compare three or more group means | Continuous | Any | No | ANOVA (F-test) |
| Test if distribution fits expected | Categorical (counts) | Any | N/A | Chi-square goodness-of-fit |
| Test if two categorical variables are related | Categorical (counts) | Any | N/A | Chi-square independence |
| Test a proportion against a known value | Proportion | Large | N/A | Z-test for proportions |
Applications: Finance, Auditing, and Business
Hypothesis Testing in Finance
Hypothesis testing is a cornerstone of quantitative finance and investment analysis. Financial professionals use it to make objective, evidence-based claims about asset performance, risk, and market behavior.
- Investment Performance: Testing whether a fund's mean return significantly exceeds the benchmark (one-sample t-test).
- Risk Assessment: Testing whether portfolio volatility has changed after a strategy adjustment (F-test or Levene's test on variances).
- Event Studies: Testing whether stock prices react significantly around earnings announcements.
- Credit Risk: Testing whether default rates differ significantly across credit rating categories (chi-square test).
- Market Efficiency: Testing whether abnormal returns are consistently zero (the Efficient Market Hypothesis).
Hypothesis Testing in Auditing
Auditors use hypothesis testing as a formal framework for evaluating financial statements, internal controls, and compliance without reviewing every transaction.
- Audit Sampling: Testing whether the mean error in a sample of invoices is significantly different from zero (one-sample t-test).
- Compliance Testing: Testing whether the proportion of transactions that comply with policy meets a minimum threshold (Z-test for proportions).
- Benford's Law Analysis: Testing whether the leading digit distribution of reported figures follows Benford's Law (chi-square goodness-of-fit). Significant deviation can be an indicator of manipulation.
- Fraud Investigation: Testing whether expense amounts are significantly higher than in comparable periods (two-sample t-test).
- Material Misstatement: Testing whether projected total misstatement in the population exceeds the materiality threshold.
Hypothesis Testing in Business
- A/B Testing: Testing whether a redesigned webpage, email, or ad generates a significantly higher conversion rate than the control version (Z-test for proportions or independent t-test).
- Product Quality: Testing whether mean product dimensions or weights meet specifications (one-sample t-test or Z-test).
- Customer Segmentation: Testing whether customer satisfaction scores differ significantly across demographic groups (independent t-test or ANOVA).
- Pricing Strategy: Testing whether price changes significantly affect sales volume (paired t-test for before/after comparisons).
- Supply Chain: Testing whether supplier delivery times meet contractual standards (one-sample t-test).
15 Common Hypothesis Testing Mistakes (and How to Avoid Them)
1. Misinterpreting p-values as the probability that H₀ is true. The p-value is the probability of the observed data (or more extreme) given H₀, not the probability that H₀ is true. Fix: always state the correct interpretation.
2. Treating "fail to reject" as "accept H₀." Insufficient evidence to reject H₀ does not mean H₀ is confirmed. Fix: use precise language: "fail to reject," never "accept."
3. Setting α after seeing the data (p-hacking). Choosing α = 0.05 because your p-value is 0.04, after you've already analyzed the data, inflates Type I errors. Fix: always pre-register your significance level.
4. Confusing statistical significance with practical significance. A tiny, meaningless difference can be statistically significant with a large enough sample. Fix: always compute and report effect sizes (Cohen's d, η², etc.).
5. Using the wrong statistical test. Applying a t-test when data is categorical, or a chi-square test when data is continuous. Fix: match the test to your data type and research question (see the decision table above).
6. Ignoring test assumptions. Running a t-test on severely non-normal data with a tiny sample, or a chi-square test with expected cell counts below 5. Fix: always verify assumptions before running any test.
7. Using too small a sample. Underpowered studies frequently fail to detect real effects (Type II errors). Fix: conduct a power analysis before data collection to determine the required sample size.
8. Multiple testing without correction. Running 20 independent tests at α = 0.05 when every null hypothesis is true yields about 1 false positive on average, and roughly a 64% chance of at least one. Fix: apply Bonferroni correction or other multiple-comparison adjustments when running multiple tests.
9. Choosing one-tailed tests opportunistically. Switching to a one-tailed test after data collection because it gives a smaller p-value. Fix: specify the direction of H₁ before collecting data, based on theory alone.
10. Using a parametric test on ordinal data. t-tests assume interval or ratio data. Applying them to individual Likert items (strongly disagree to strongly agree) is hard to justify, because the gaps between response categories are not known to be equal. Fix: use non-parametric alternatives (Mann-Whitney U, Wilcoxon signed-rank) for ordinal data.
11. Ignoring outliers. A single extreme observation can dramatically alter the test statistic. Fix: identify and investigate outliers before analysis; consider robust methods if outliers are genuine.
12. Non-random sampling. Hypothesis tests assume random sampling. Convenience samples produce biased results that cannot be generalized. Fix: design rigorous sampling protocols appropriate to the population of interest.
13. Reporting only significant results (publication bias). Cherry-picking significant findings while suppressing null results distorts the scientific record. Fix: pre-register studies and commit to reporting all results.
14. Treating correlated observations as independent. Using an independent t-test on paired data (before/after) ignores the correlation and typically reduces statistical power. Fix: use the paired t-test when the same subjects are measured twice.
15. Forgetting to report confidence intervals. A p-value alone tells you whether an effect is significant but not how large it is or how precisely it's estimated. Fix: always accompany hypothesis test results with confidence intervals for the parameter of interest.
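The multiple-testing mistake in the list above is easy to quantify. The sketch below assumes 20 independent tests in which every null hypothesis is true; under those assumptions it computes the expected number of false positives, the chance of at least one, and the effect of a Bonferroni adjustment.

```python
alpha = 0.05
num_tests = 20

# Expected number of false positives when all nulls are true
expected_false_positives = alpha * num_tests

# Probability of at least one false positive across independent tests
familywise_error = 1 - (1 - alpha) ** num_tests

# Bonferroni: test each hypothesis at alpha / m instead of alpha
bonferroni_alpha = alpha / num_tests
familywise_error_bonferroni = 1 - (1 - bonferroni_alpha) ** num_tests

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one false positive), uncorrected: {familywise_error:.3f}")
print(f"Per-test alpha after Bonferroni: {bonferroni_alpha:.4f}")
print(f"P(at least one false positive), corrected: {familywise_error_bonferroni:.3f}")
```

The correction brings the family-wise error rate back under 0.05, at the cost of lower power for each individual test.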
Practical Case Study
Did the New Loan Processing System Reduce Processing Time?
Background: Metro Commercial Bank implemented a new digital loan processing system in Q3. The Head of Operations claims the new system has significantly reduced average loan processing time from the historical average of 8.5 days. She asks the Analytics Team to test this claim using data from the first month after implementation.
Step 1 — State the Hypotheses
H₀: μ = 8.5 days (the mean processing time has not changed)
H₁: μ < 8.5 days (the mean processing time has decreased — one-tailed, left-tailed test)
Step 2 — Significance Level
The team selects α = 0.05. A one-tailed test is appropriate because the claim is directional — the system is supposed to reduce (not change in either direction) processing time.
Step 3 — Select Test and Check Assumptions
Since the population standard deviation is unknown, and the sample is drawn from the first month's processed loans, a one-sample t-test is appropriate.
Assumptions: Loan processing times are approximately normally distributed (confirmed via histogram). Loans are processed independently.
Step 4 — Collect Data
A random sample of 25 loans processed under the new system yields:
Sample size: n = 25
Sample mean: x̄ = 7.8 days
Sample standard deviation: s = 2.1 days
Step 5 — Calculate the Test Statistic
t = (x̄ − μ₀) / (s / √n) = (7.8 − 8.5) / (2.1 / √25) = −0.7 / 0.42 ≈ −1.667
Step 6 — Make the Decision
Critical t-value for a one-tailed left test at α = 0.05 with df = 24: t_critical = −1.711
Since our calculated t = −1.667 is greater than (to the right of) the critical value −1.711, we fail to reject H₀ at the 0.05 level.
The corresponding p-value is approximately 0.055 (slightly above 0.05).
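The case-study calculation can be checked with a short script. This sketch uses the figures given above and hard-codes the left-tailed critical value quoted in Step 6, since Python's standard library does not include the t-distribution.

```python
from math import sqrt

mu_0, x_bar, s, n = 8.5, 7.8, 2.1, 25

df = n - 1
standard_error = s / sqrt(n)               # 2.1 / 5 = 0.42
t = (x_bar - mu_0) / standard_error

T_CRIT_LEFT = -1.711                       # one-tailed critical value, df = 24, alpha = 0.05
reject_h0 = t < T_CRIT_LEFT                # left-tailed: reject only for very negative t

print(f"t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```

The statistic falls just short of the rejection region, which matches a p-value slightly above 0.05.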
Step 7 — Conclusion
At the 5% significance level, there is insufficient statistical evidence to conclude that the new digital loan processing system has significantly reduced mean processing time from 8.5 days. While the sample mean of 7.8 days is lower than the historical mean, the difference is not statistically significant — it could plausibly be due to random sampling variation.
Management Recommendation
The analytics team recommends: (1) gathering a larger sample over the next two months, as the test was likely underpowered with only n = 25; (2) conducting a power analysis to determine the sample size needed to detect a reduction of 0.7 days with 80% power; (3) not drawing definitive conclusions about system effectiveness until stronger evidence is available.
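A rough version of the recommended power analysis is sketched below. It uses the normal-approximation formula n ≈ ((z_α + z_β) · σ / δ)² for a one-tailed test and treats the sample standard deviation (2.1 days) as a planning value for σ; both are simplifying assumptions, and a t-based calculation would give a slightly larger n.

```python
from math import ceil
from statistics import NormalDist

alpha, target_power = 0.05, 0.80
sigma_guess = 2.1      # planning value: sample SD from the case study
effect = 0.7           # reduction (in days) the bank wants to be able to detect

std_normal = NormalDist()
z_alpha = std_normal.inv_cdf(1 - alpha)        # one-tailed critical value
z_beta = std_normal.inv_cdf(target_power)      # quantile matching the desired power

required_n = ceil(((z_alpha + z_beta) * sigma_guess / effect) ** 2)
print(f"Approximate sample size needed: {required_n}")
```

Under these assumptions the required sample is roughly double the 25 loans actually tested, which supports the team's view that the original test was underpowered.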
Key Takeaways
Hypotheses
H₀ is the default claim of no effect; H₁ is the alternative. We test H₀ and either reject or fail to reject it — never "accept" either.
Significance Level
α sets the maximum acceptable Type I error rate. The standard is α = 0.05, but context determines what is appropriate. Always set α before seeing the data.
p-value
The p-value measures how surprising the observed data is under H₀. Small p-values (p < α) are evidence against H₀. The p-value is NOT the probability that H₀ is true.
Error Types
Type I error (false positive): rejecting a true H₀. Type II error (false negative): failing to reject a false H₀. For a fixed sample size, reducing one increases the other; a larger sample is the way to reduce both.
Z-Test
Use when σ is known and n ≥ 30. Based on the standard normal distribution. Critical value ±1.96 for two-tailed test at α = 0.05.
t-Test
Use when σ is unknown (most practical situations). Three types: one-sample, independent, and paired. Uses the t-distribution with n−1 degrees of freedom for one-sample and paired tests, and n₁+n₂−2 for the independent-samples test.
Chi-Square Test
Use for categorical (count) data. Goodness-of-fit tests one variable's distribution; independence test examines the relationship between two categorical variables.
Effect Size Matters
Statistical significance ≠ practical significance. Always report effect sizes alongside p-values so readers understand the magnitude of findings, not just their statistical certainty.
Practice Exercises
Part A — 20 Conceptual Questions
Q1.
A researcher concludes "the null hypothesis is true" because p = 0.42. What statistical error has the researcher made in their interpretation?
Q2.
What is the relationship between α, β, and statistical power? If α is reduced from 0.05 to 0.01, what happens to β (for a fixed sample size)?
Q3.
An auditor uses a chi-square test with the following contingency table and gets p = 0.002. What does this mean in audit terms?
Q4.
Explain why a statistically significant result is not necessarily practically significant. Give a business example.
Q5.
Why should the significance level be set before data collection? What happens if it is set after?
Q6.
A finance manager runs 20 separate hypothesis tests at α = 0.05 to screen for market anomalies. How many significant results would be expected by chance alone?
Q7.
When is a paired t-test more appropriate than an independent samples t-test? Give an example.
Q8.
Describe the conditions under which a chi-square goodness-of-fit test would be invalid. How would you address each condition?
Q9.
What is the difference between a one-tailed and a two-tailed hypothesis test? How does the choice affect the critical region and p-value?
Q10.
In a pharmaceutical company's drug efficacy trial, a Type I error means the drug is approved despite being ineffective. A Type II error means an effective drug is rejected. Which error is more costly in this context? Explain.
Q11–Q20.
(Additional conceptual questions — complete these independently as revision.)
Q11. Why does the t-distribution have heavier tails than the normal distribution?
Q12. What does degrees of freedom represent in a t-test?
Q13. Explain Benford's Law and its relevance to auditing.
Q14. What is the connection between a 95% confidence interval and a two-tailed hypothesis test at α = 0.05?
Q15. How does sample size affect statistical power?
Q16. What is the null hypothesis in an independence chi-square test?
Q17. Why can't the null hypothesis ever be "proven" true?
Q18. What is a p-value threshold of "p < 0.001" communicating?
Q19. Give three examples of Type I errors in a business context.
Q20. What is meant by "statistically significant at the 1% level"?
Part B — 10 Numerical Problems
Problem 1 — Z-Test
A factory produces bolts with a target diameter of 10 mm. The production process has a known standard deviation of 0.5 mm. A quality engineer takes a sample of 100 bolts and finds a mean diameter of 10.08 mm. Test at α = 0.05 whether the mean diameter has shifted from the target.
H₀: μ = 10; H₁: μ ≠ 10 (two-tailed)
Z = (10.08 − 10) / (0.5/√100) = 0.08 / 0.05 = 1.60
Critical values at α = 0.05: ±1.96
|1.60| < 1.96 → Fail to reject H₀
Conclusion: Insufficient evidence to conclude the mean diameter has shifted from 10 mm.
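The arithmetic in Problem 1 can be checked with a few lines of Python. This is a minimal sketch using only the standard library's `statistics.NormalDist`; the variable names are illustrative, and the two-tailed p-value is an addition not shown in the worked solution.

```python
import math
from statistics import NormalDist

mu_0, sigma = 10.0, 0.5      # target diameter and known sigma (mm)
n, sample_mean = 100, 10.08  # sample size and observed mean (mm)
alpha = 0.05

standard_error = sigma / math.sqrt(n)
z = (sample_mean - mu_0) / standard_error
z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
p_value = 2 * NormalDist().cdf(-abs(z))           # two-tailed p-value
reject_h0 = abs(z) > z_critical

print(f"z = {z:.2f}, critical = ±{z_critical:.3f}, p = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

The critical-value comparison and the p-value comparison always lead to the same decision, since |Z| exceeds the critical value exactly when p falls below α.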
Problem 2 — One-Sample t-Test
A logistics company claims average delivery time is 3 days. A sample of 16 deliveries shows mean = 3.5 days, s = 0.8 days. Test at α = 0.05 (two-tailed) whether mean delivery time differs from 3 days.
H₀: μ = 3; H₁: μ ≠ 3; df = 15
t = (3.5 − 3) / (0.8/√16) = 0.5 / 0.2 = 2.50
Critical t at α = 0.05, df = 15 (two-tailed): ±2.131
|2.50| > 2.131 → Reject H₀
Conclusion: The mean delivery time is significantly different from 3 days. The company's claim is not supported by the data.
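The same check for Problem 2 is sketched below. Python's standard library has no t-distribution, so the critical value 2.131 is taken directly from the worked solution rather than computed. Cohen's d is an addition, included because effect sizes should be reported alongside the test result.

```python
import math

mu_0 = 3.0                        # claimed mean delivery time (days)
n, sample_mean, s = 16, 3.5, 0.8  # sample size, sample mean, sample std dev
t_critical = 2.131                # two-tailed, df = 15, alpha = 0.05 (from the solution)

df = n - 1
t_stat = (sample_mean - mu_0) / (s / math.sqrt(n))
cohens_d = (sample_mean - mu_0) / s  # standardized effect size
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.2f}, df = {df}, Cohen's d = {cohens_d:.3f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

With a statistics package available, the critical value and an exact p-value would normally be computed from the t-distribution instead of being read from a table.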
Problem 3 — Chi-Square Goodness-of-Fit
A market researcher expects customers to prefer four product variants equally (25% each). A survey of 200 customers shows: Variant A = 62, B = 48, C = 54, D = 36. Test at α = 0.05.
Expected each: 200 × 0.25 = 50
χ² = (62−50)²/50 + (48−50)²/50 + (54−50)²/50 + (36−50)²/50
χ² = 144/50 + 4/50 + 16/50 + 196/50 = 2.88 + 0.08 + 0.32 + 3.92 = 7.20
df = 4 − 1 = 3; Critical χ² = 7.815
7.20 < 7.815 → Fail to reject H₀
Conclusion: Customer preference is not significantly different from equal distribution at α = 0.05.
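Problem 3 can be verified the same way. In this sketch the critical value 7.815 comes from the worked solution, and the helper `chi2_sf_df3` is a hypothetical name for a closed-form upper-tail probability that is valid only for exactly 3 degrees of freedom. It is used here to avoid a third-party dependency and is not a general chi-square routine.

```python
import math

observed = {"A": 62, "B": 48, "C": 54, "D": 36}
total = sum(observed.values())
expected = total / len(observed)  # equal preference: 200 * 0.25 = 50

chi_sq = sum((count - expected) ** 2 / expected for count in observed.values())
df = len(observed) - 1
chi_sq_critical = 7.815           # df = 3, alpha = 0.05 (from the solution)


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(chi_sq)
reject_h0 = chi_sq > chi_sq_critical

print(f"chi-square = {chi_sq:.2f}, df = {df}, p = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

The statistic falls just short of the critical value, so the p-value sits only slightly above 0.05. This is a borderline result that a larger survey could resolve in either direction.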
Problems 4–10
Solve the following independently. Worked solutions available in the downloadable companion workbook.
P4. Independent t-test: Compare mean sales of two stores (n₁=20, x̄₁=45K, s₁=8K; n₂=18, x̄₂=51K, s₂=9K) at α=0.05.
P5. Paired t-test: 8 employees' performance scores before (72,68,75,80,65,70,74,78) and after training (78,71,79,85,68,76,79,82). Test improvement at α=0.05.
P6. Z-test for proportions: A coin is flipped 400 times and lands heads 218 times. Test if the coin is fair at α=0.05.
P7. Chi-square independence: Test if gender (M/F) and product preference (A/B/C) are independent, given a 2×3 contingency table.
P8. Left-tailed t-test: A bank claims ATM cash refill time ≥ 5 minutes. A sample of 20 refills shows mean=4.6 min, s=0.9 min. Test at α=0.05.
P9. Power analysis: How large a sample is needed to detect a mean difference of 5 units (σ=12) with 80% power at α=0.05?
P10. Business application: A retailer tests whether a loyalty program increases mean basket size. Pre-program mean=BDT 1,200, n=35, post-program mean=BDT 1,340, s=BDT 280. Test at α=0.01.
Module 6 Quiz — 30 Multiple Choice Questions
1. What does it mean to "reject the null hypothesis"?
- H₀ has been proven false
- The sample provides sufficient evidence against H₀ at the chosen significance level
- H₁ is proven true
- The sample data is unreliable
2. Which of the following is the correct interpretation of a p-value?
- The probability that H₀ is true
- The probability of a Type II error
- The probability of observing data at least as extreme as the sample, assuming H₀ is true
- The probability that the study's conclusion is correct
3. A Type I error occurs when:
- H₀ is false and we fail to reject it
- H₀ is true and we reject it
- H₁ is true and we reject it
- The sample size is too small
4. The power of a statistical test equals:
- α
- 1 − α
- β
- 1 − β
5. When should a Z-test be used instead of a t-test?
- When the sample size is small
- When the population standard deviation is unknown
- When the population standard deviation is known and n ≥ 30
- When data is categorical
6. For a two-tailed Z-test at α = 0.05, the critical values are:
- ±1.28
- ±1.645
- ±1.96
- ±2.576
7. Which test is most appropriate to compare the means of two independent groups when the population standard deviation is unknown?
- Chi-square test
- Z-test
- Independent samples t-test
- Paired t-test
8. In a chi-square goodness-of-fit test with 5 categories, the degrees of freedom are:
- 5
- 4
- 3
- 6
9. An auditor tests whether a firm's expense distribution follows Benford's Law. This is an example of:
- Independent t-test
- Paired t-test
- Chi-square goodness-of-fit test
- Z-test for means
10. If a p-value is 0.032 and α = 0.05, the correct decision is:
- Fail to reject H₀ — not significant
- Reject H₀ — statistically significant
- Accept H₁ — H₁ is proven true
- The result is inconclusive
11. A researcher uses the same data to check 30 hypotheses at α = 0.05. If every null hypothesis is actually true, approximately how many false positives are expected?
- 0
- 0.05
- 1.5
- 5
12. The null hypothesis for a chi-square test of independence states that:
- The two variables are perfectly correlated
- The two variables are independent of each other
- Both variables have equal means
- The data follows a normal distribution
13. Which statement about the significance level is correct?
- It equals 1 minus the p-value
- It must be set after seeing the data
- It represents the maximum acceptable probability of a Type I error
- It represents the probability that H₁ is true
14. Which test is appropriate for analyzing paired before-and-after measurements from the same subjects?
- Z-test
- Chi-square test
- Independent samples t-test
- Paired samples t-test
15. A test statistic of Z = 2.89 in a right-tailed test at α = 0.01 (critical value = 2.33) means:
- Fail to reject H₀
- Reject H₀ — statistically significant
- The test is invalid
- More data is needed
16. The chi-square test requires that expected cell frequencies be at least:
- 1
- 5
- 10
- 30
17. Reducing α from 0.05 to 0.01 while keeping sample size fixed will:
- Decrease both Type I and Type II error rates
- Decrease Type I error and increase Type II error
- Increase Type I error and decrease Type II error
- Have no effect on error rates
18. Which of the following represents a correct null hypothesis for an independent t-test?
- H₀: μ₁ > μ₂
- H₀: μ₁ ≠ μ₂
- H₀: μ₁ = μ₂
- H₀: μ₁ < μ₂
19. What does statistical power measure?
- The probability of a Type I error
- How large the sample is
- The probability of correctly detecting a real effect
- The significance level
20. A one-sample t-test uses degrees of freedom equal to:
- n
- n − 1
- n − 2
- n + 1
21. An analyst finds p = 0.048. A colleague argues the result is "borderline" and recommends re-running with more data to get p < 0.01. This approach is problematic because:
- The sample is too large
- It constitutes p-hacking — collecting data until significance is achieved
- p = 0.048 is already highly significant
- The t-distribution should be used instead
22. The degrees of freedom in a chi-square test of independence with a 3×4 contingency table are:
- 6
- 12
- 7
- 11
23. In auditing, which hypothesis test is most directly relevant to testing whether financial figures conform to Benford's Law?
- Two-sample t-test
- Z-test for proportions
- Chi-square goodness-of-fit test
- Paired t-test
24. Which of the following would increase the statistical power of a test the most?
- Decreasing the sample size
- Increasing the significance level (α)
- Increasing both sample size and α
- Using a two-tailed test instead of one-tailed
25. A fund manager claims her portfolio outperforms the market index (mean excess return > 0). The appropriate alternative hypothesis is:
- H₁: μ = 0
- H₁: μ ≠ 0
- H₁: μ > 0
- H₁: μ < 0
26. Which of the following is NOT an assumption of the t-test?
- Approximate normality of the data
- Independence of observations
- The population standard deviation must be known
- Continuous data (interval or ratio scale)
27. If a test has p = 0.003, this is best described as:
- Not significant
- Marginally significant
- Significant at the 5% level only
- Statistically significant at the 1% (and 5%) level
28. A retail bank wants to know if average customer wait time is more than 10 minutes. The correct null hypothesis is:
- H₀: μ > 10
- H₀: μ = 10
- H₀: μ < 10
- H₀: μ ≠ 10
29. Effect size is important in hypothesis testing because:
- It determines the p-value directly
- It tells you whether the result is statistically significant
- It measures the magnitude of the difference, not just whether it is significant
- It replaces the need for a significance level
30. Which of the following scenarios best justifies a chi-square test of independence?
- Comparing mean salaries between two departments
- Testing if a sample mean equals a hypothesized population mean
- Testing whether voting preference is related to income bracket
- Comparing before-and-after blood pressure readings
Frequently Asked Questions
Module 6 — Final Summary
Hypothesis testing is one of the most powerful and widely applied tools in statistics. In this module, you have learned how to transform a research question into a testable hypothesis, how to evaluate that hypothesis using sample data, and how to draw principled, evidence-based conclusions.
The core framework is elegant: formulate competing hypotheses (H₀ and H₁), choose a standard of evidence (α), collect data, compute a test statistic, and compare its p-value to α. If p < α, reject H₀. If not, fail to reject H₀ and withhold judgment.
You have mastered the three most fundamental hypothesis tests:
- The Z-test for large samples with known population variance
- The t-test (one-sample, independent, and paired) for the more common scenario of unknown variance
- The chi-square test (goodness-of-fit and independence) for categorical frequency data
You have also learned about the critical limitations and risks: Type I and Type II errors, p-hacking, the multiple testing problem, the difference between statistical and practical significance, and the 15 most common mistakes that compromise the validity of hypothesis tests.
These skills apply directly to research, business analytics, financial analysis, auditing, and virtually any data-driven decision-making context. In the next module, you will build on this foundation by exploring regression analysis — the tool that moves beyond comparing means to modelling relationships between variables.