Data Summarization
Learn how to transform raw data into meaningful numbers using descriptive statistics — the foundation of every quantitative decision in business, finance, and research.
🎯 Learning Objectives
- Define data summarization and explain its role in statistical analysis
- Calculate and interpret the mean, median, and mode for any dataset
- Compute range, variance, standard deviation, and IQR step by step
- Distinguish between measures of central tendency and measures of dispersion
- Apply descriptive statistics to real business, finance, and audit scenarios
- Identify outliers using the IQR method and understand their effect on summary measures
- Select the most appropriate summary measure for any given situation
📋 Quick Navigation
- Introduction to Data Summarization
- Descriptive Statistics Overview
- Mean (Arithmetic Average)
- Median
- Mode
- Mean vs Median vs Mode Comparison
- Range
- Variance
- Standard Deviation
- Interquartile Range (IQR)
- Central Tendency vs Dispersion
- Real-World Applications
- Practical Case Study
- Common Mistakes Beginners Make
- Practice Exercises
- Multiple Choice Quiz
- Frequently Asked Questions
- Module Summary
Introduction to Data Summarization
Imagine you are a financial analyst at a retail company. Your manager hands you a spreadsheet containing 12,000 individual customer transaction records from the past year and asks: "How are our sales performing?" You cannot read 12,000 rows and give a meaningful answer in seconds. But you can calculate five key numbers — an average transaction value, the middle-most sale, the most common purchase amount, how widely values spread, and the range from smallest to largest — and suddenly a complex dataset becomes a clear, actionable story.
This is the power of data summarization: the science of distilling large volumes of raw data into a small set of numbers that preserve the essential characteristics of the data and enable informed decision-making.
Data summarization is the process of reducing a large collection of observations into a compact set of descriptive numerical measures — such as averages, spread indicators, and position measures — that capture the key features of the dataset without listing every individual value.
Why Data Summarization Matters
In the modern data-driven world, every organization collects enormous quantities of data. Without summarization techniques, this data is virtually useless. Here is why data summarization is fundamental across every professional domain:
- Comprehensibility: A single average transforms thousands of values into one interpretable number.
- Communication: Managers, investors, and auditors cannot read raw datasets — they rely on summary statistics.
- Comparison: Summary measures allow you to compare departments, time periods, competitors, or investment options.
- Decision-making: Budgets, pricing strategies, risk assessments, and resource allocation all depend on summarized data.
- Error detection: Outliers and anomalies become visible when you examine the spread and distribution of data.
- Foundation for inference: Descriptive statistics form the foundation upon which inferential statistics (hypothesis testing, regression) are built.
From Raw Data to Useful Information
Data summarization works by applying mathematical operations to convert raw observations into meaningful summary measures. This process follows a consistent logical flow:
- Raw data: individual observations, survey responses, transaction records, measurements
- Organize: sort values, group into classes, remove duplicates, check for errors
- Summarize: central tendency, dispersion measures, positional measures, growth rates
- Interpret: compare with benchmarks, identify trends, detect anomalies, make decisions
Real-World Examples of Data Summarization in Action
| Profession | Raw Data | Summary Measure Used | Purpose |
|---|---|---|---|
| Finance Manager | 500 daily stock prices | Mean, Standard Deviation | Evaluate average return and volatility |
| Auditor | 10,000 invoice amounts | Mean, Median, IQR | Identify unusual transactions for testing |
| HR Director | 800 employee salaries | Median, Range | Analyze pay equity and compensation bands |
| Marketing Analyst | 5,000 survey responses | Mode, Frequency | Identify most common customer preferences |
| Quality Controller | 2,000 product measurements | Mean, Std Dev | Monitor production consistency |
| Researcher | 1,200 experimental results | Mean, Variance | Assess effectiveness of treatment |
Descriptive Statistics Overview
Descriptive statistics is the branch of statistics that focuses on summarizing, organizing, and presenting data in a meaningful way. Unlike inferential statistics (which draws conclusions about populations from samples), descriptive statistics simply describes what the data shows — no assumptions, no predictions, just clear numerical summaries.
Descriptive statistics comprises numerical and graphical methods used to organize, summarize, and describe a collection of data. The resulting summary measures — such as the mean, standard deviation, and percentiles — provide a complete numerical portrait of the dataset's key characteristics.
The Two Main Categories of Descriptive Statistics
Every descriptive statistical method falls into one of two fundamental categories, each answering a different question about your data:
Measures of Central Tendency
- Question: Where is the center of my data?
- Mean: Arithmetic average
- Median: Middle value
- Mode: Most frequent value
Measures of Dispersion
- Question: How spread out is my data?
- Range: Max minus Min
- Variance: Average squared deviation
- Standard Deviation: √Variance
- IQR: Q3 minus Q1
| Category | Measure | What It Tells You | Formula (simplified) |
|---|---|---|---|
| Central Tendency | Mean | Arithmetic center of all values | Σx / n |
| Central Tendency | Median | Middle value of ordered data | Middle observation |
| Central Tendency | Mode | Most frequently occurring value | Highest frequency |
| Dispersion | Range | Distance from min to max | Max − Min |
| Dispersion | Variance | Average of squared deviations | Σ(x−x̄)² / (n−1) |
| Dispersion | Std Dev | Typical distance from the mean | √Variance |
| Dispersion | IQR | Spread of the middle 50% | Q3 − Q1 |
Mean (Arithmetic Average)
The mean is the arithmetic average of a dataset — calculated by summing all values and dividing by the number of observations. It represents the "balance point" of the distribution.
A retail store recorded the following monthly sales revenues ($000s) over 8 months:
Dataset: 42, 55, 38, 61, 49, 53, 47, 44
Sum: 42 + 55 + 38 + 61 + 49 + 53 + 47 + 44 = 389
Mean: 389 ÷ 8 = 48.625, or about $48,625 per month
Business Application: Finance Example
An investment portfolio holds 6 stocks with annual returns of: 8.2%, 12.5%, −3.4%, 15.1%, 6.8%, 9.7%. The mean return is: (8.2 + 12.5 − 3.4 + 15.1 + 6.8 + 9.7) ÷ 6 = 48.9 ÷ 6 = 8.15%. An investor knows the portfolio's average annual return is 8.15%, a crucial figure for performance comparison.
Audit Application
An auditor examines a sample of 200 invoices totaling $1,840,000. The mean invoice value is $1,840,000 ÷ 200 = $9,200. If the population contains 5,000 invoices, the estimated total population value is 5,000 × $9,200 = $46,000,000. This is the basis of mean-per-unit estimation in audit sampling.
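The three mean calculations above can be reproduced in a few lines of Python. This is a minimal sketch using the standard library; the helper name `mean_per_unit_estimate` is a hypothetical label chosen for illustration, not a term from an auditing standard or library.

```python
from statistics import mean

# Monthly sales revenues ($000s) from the retail example
sales = [42, 55, 38, 61, 49, 53, 47, 44]
mean_sales = mean(sales)  # 389 / 8

# Annual returns (%) of the six-stock portfolio
returns = [8.2, 12.5, -3.4, 15.1, 6.8, 9.7]
mean_return = sum(returns) / len(returns)


def mean_per_unit_estimate(sample_total, sample_size, population_size):
    """Project a population total from a sample mean (hypothetical helper)."""
    sample_mean = sample_total / sample_size
    return sample_mean * population_size


# Audit example: 200 sampled invoices totaling $1,840,000, population of 5,000
estimated_total = mean_per_unit_estimate(1_840_000, 200, 5_000)

print(mean_sales)             # 48.625
print(round(mean_return, 2))  # 8.15
print(estimated_total)        # 46000000.0
```

Note that the projection is only as good as the sample: it assumes the 200 sampled invoices are representative of all 5,000.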
Advantages of the mean:
- Uses every observation in the dataset
- Mathematically precise and stable
- Easy to calculate and interpret
- Essential for further statistical calculations
- Best used with symmetric, bell-shaped data
Limitations of the mean:
- Highly sensitive to extreme values (outliers)
- Can be misleading with skewed distributions
- May not represent any actual data point
- Not appropriate for categorical data
Median
The median is the middle value of a dataset when all observations are arranged in ascending or descending order. About half of the values fall at or below the median and about half fall at or above it. Because it depends only on the order of the values, not on their magnitude, it is far more resistant to extreme values than the mean.
Calculation Rules
The calculation method depends on whether the dataset has an odd or even number of observations:
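Odd number of observations (n = 7). Raw data: 55, 72, 48, 95, 63, 51, 68
Sorted: 48, 51, 55, 63, 68, 72, 95
Median = 4th value = 63
Even number of observations (n = 8). Raw data: 32, 45, 28, 61, 39, 52, 44, 37
Sorted: 28, 32, 37, 39, 44, 45, 52, 61
Median = average of the 4th and 5th values = (39 + 44) / 2 = 83 / 2 = 41.5
When to Use the Median: Real Finance Example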
Consider income data for a neighborhood: $35K, $38K, $42K, $45K, $41K, $39K, $2,400K (a billionaire). The mean is about $377K, which is completely misleading: six of the seven residents earn $45K or less. The median is $41K, which accurately represents the typical resident. Real estate analysts, economists, and policy makers generally prefer median income and median home prices over means for exactly this reason.
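The effect of the single extreme income can be checked directly. The sketch below uses only Python's standard `statistics` module and recomputes both measures with and without the extreme value; the variable names are illustrative.

```python
from statistics import mean, median

# Neighborhood incomes in $K; the last value is the extreme earner
incomes = [35, 38, 42, 45, 41, 39, 2400]

mean_income = mean(incomes)      # pulled far upward by one value
median_income = median(incomes)  # middle value of the sorted list

# Same data without the extreme value
typical = incomes[:-1]
mean_without = mean(typical)
median_without = median(typical)

print(f"with outlier:    mean={mean_income:.1f}  median={median_income}")
print(f"without outlier: mean={mean_without:.1f}  median={median_without}")
```

Dropping one observation moves the mean from roughly 377 to 40, while the median moves only from 41 to 40.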
Advantages of the median:
- Resistant to outliers and skewed data
- Always represents an actual position in data
- Ideal for income, house prices, survey ratings
- Better for skewed distributions
Limitations of the median:
- Does not use all values in its calculation
- Less mathematically convenient for further calculations
- Ignores the magnitude of extreme values
Mode
The mode is the value that appears most frequently in a dataset. A dataset may have one mode (unimodal), two modes (bimodal), multiple modes (multimodal), or no mode at all if all values occur with equal frequency.
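A retail store records 15 customer purchases by category: Electronics, Clothing, Clothing, Shoes, Clothing, Accessories, Shoes, Clothing, Electronics, Clothing, Shoes, Clothing, Clothing, Electronics, Clothing
Clothing appears 8 times, Electronics 3 times, Shoes 3 times, and Accessories once, so the mode is Clothing.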
Types of Mode
| Type | Description | Example |
|---|---|---|
| Unimodal | One value appears most frequently | 2, 3, 3, 4, 5 → Mode = 3 |
| Bimodal | Two values tied for highest frequency | 2, 3, 3, 4, 4, 5 → Mode = 3 and 4 |
| Multimodal | Three or more values tied for highest frequency | 1, 1, 2, 2, 3, 3, 4 → Mode = 1, 2, and 3 |
| No Mode | All values appear exactly once | 2, 4, 6, 8, 10 → No mode |
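The mode rules in the table can be sketched with `collections.Counter`. The helper `modes` below is a hypothetical function written for this module's convention that a dataset has no mode when every value occurs equally often. Python's built-in `statistics.multimode` would instead return all values in that case.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] if all values tie (no mode)."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # Several distinct values, all equally frequent: treat as "no mode"
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == top]


purchases = [
    "Electronics", "Clothing", "Clothing", "Shoes", "Clothing",
    "Accessories", "Shoes", "Clothing", "Electronics", "Clothing",
    "Shoes", "Clothing", "Clothing", "Electronics", "Clothing",
]
counts = Counter(purchases)

print(counts.most_common())         # frequency table, highest first
print(modes(purchases))             # ['Clothing']
print(modes([2, 3, 3, 4, 5]))       # [3]       (unimodal)
print(modes([2, 3, 3, 4, 4, 5]))    # [3, 4]    (bimodal)
print(modes([2, 4, 6, 8, 10]))      # []        (no mode)
```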
Business Applications of Mode
- Retail: Most frequently purchased product size, color, or category → stock management
- Banking: Most common loan default amount → credit risk profiling
- HR: Most common employee age group → workforce planning
- Survey Research: Most popular response on a Likert scale
- Healthcare: Most common diagnosis code → resource allocation
Mean vs Median vs Mode: Complete Comparison
| Characteristic | Mean | Median | Mode |
|---|---|---|---|
| Definition | Sum ÷ count | Middle value (sorted) | Most frequent value |
| Data Type | Numerical only | Numerical, ordinal | Any (including nominal) |
| Uses all data? | Yes | No | No |
| Outlier effect | High — severely distorted | Low — resistant | None |
| Unique? | Always unique | Always unique | May have multiple modes |
| Best for | Symmetric data, quality control, financial ratios | Skewed data, income, home prices | Categorical data, fashion, voting |
| Finance example | Average portfolio return | Median household income | Most traded stock sector |
| Audit example | Mean invoice value for estimation | Median transaction for anomaly detection | Most common transaction type |
| Formula | Σx / n | (n+1)/2 th position | Highest frequency |
| Weaknesses | Distorted by extreme values | Ignores extreme values entirely | May not be meaningful; can be multiple |
Range
The range is the simplest measure of dispersion. It is the difference between the maximum and minimum values in a dataset. It provides a quick snapshot of the total spread of the data.
Annual returns for 5 different mutual funds: 4.2%, 9.8%, −2.1%, 15.3%, 7.6%
Range = 15.3% − (−2.1%) = 17.4 percentage points
Advantages of the range:
- Extremely simple to calculate
- Gives immediate sense of data spread
- Useful for quality control limits
- Easy to communicate to non-statisticians
Limitations of the range:
- Uses only two data points (max and min)
- Extremely sensitive to outliers
- Ignores all values between extremes
- Cannot compare datasets of different sizes reliably
Variance
Variance measures how far the data points in a dataset lie from the mean, on average, using squared distances. It quantifies the overall degree of spread in a distribution. A higher variance means values are more scattered around the mean; a lower variance means they cluster tightly.
The key concept behind variance is the squared deviation. For each data point, we calculate how far it is from the mean (the deviation), then square it (to eliminate negative signs and penalize large deviations more heavily), and finally average all squared deviations.
A factory measures the diameter (mm) of 6 ball bearings: 10.1, 9.8, 10.3, 10.0, 9.9, 10.2
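Step 1, mean: (10.1 + 9.8 + 10.3 + 10.0 + 9.9 + 10.2) / 6 = 60.3 / 6 = 10.05 mm
Step 2, squared deviations from the mean:
- (10.1 − 10.05)² = 0.0025
- (9.8 − 10.05)² = 0.0625
- (10.3 − 10.05)² = 0.0625
- (10.0 − 10.05)² = 0.0025
- (9.9 − 10.05)² = 0.0225
- (10.2 − 10.05)² = 0.0225
Step 3, sum of squared deviations: 0.0025 + 0.0625 + 0.0625 + 0.0025 + 0.0225 + 0.0225 = 0.175
Step 4, sample variance: s² = 0.175 / (6 − 1) = 0.175 / 5 = 0.035 mm²
Why Does Variance Matter?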
| Field | How Variance is Used |
|---|---|
| Finance / Investments | Portfolio variance measures investment risk. Higher variance = higher risk but potentially higher return. |
| Quality Control | Low variance in product measurements indicates consistent manufacturing processes. |
| Insurance Pricing | Higher claim variance → higher premiums to cover unexpected large losses. |
| Academic Testing | High variance in scores indicates diverse skill levels; low variance indicates that students performed at similar levels. |
| Auditing | High variance in transaction amounts triggers enhanced audit scrutiny. |
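The ball-bearing variance calculation in this section can be traced step by step in Python. This is a sketch that mirrors the manual method (mean, squared deviations, division by n − 1) and then cross-checks the result against the standard library's `statistics.variance`, which also uses the n − 1 denominator.

```python
from statistics import variance

diameters = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]  # mm

n = len(diameters)
x_bar = sum(diameters) / n                                  # mean
squared_deviations = [(x - x_bar) ** 2 for x in diameters]  # (x - mean)^2
sum_sq = sum(squared_deviations)
sample_variance = sum_sq / (n - 1)                          # Bessel's correction

print(round(x_bar, 2))            # 10.05
print(round(sum_sq, 3))           # 0.175
print(round(sample_variance, 3))  # 0.035

# Cross-check against the library implementation
print(abs(sample_variance - variance(diameters)) < 1e-9)  # True
```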
Standard Deviation
Standard deviation is the square root of variance. It measures the typical distance of data points from the mean, expressed in the same units as the original data. It is the most widely used measure of dispersion in statistics, finance, and scientific research.
We calculated s² = 0.035 mm² for ball bearing diameters.
s = √0.035 ≈ 0.187 mm
Interpreting Standard Deviation
Low standard deviation:
- Data points cluster tightly around the mean
- High consistency, predictability, reliability
- Low volatility in financial context
- Example: Blue-chip stock with σ = 2%
High standard deviation:
- Data points are widely scattered from the mean
- High variability, unpredictability, risk
- High volatility in financial context
- Example: Speculative crypto asset with σ = 40%
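For the ball-bearing diameters used earlier, the standard deviation can be obtained directly from Python's `statistics` module. The sketch below contrasts the sample version (`stdev`, divides by n − 1) with the population version (`pstdev`, divides by n), since the two give different answers on small datasets.

```python
from statistics import pstdev, stdev

diameters = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]  # mm

sample_sd = stdev(diameters)       # square root of sum_sq / (n - 1)
population_sd = pstdev(diameters)  # square root of sum_sq / n

print(round(sample_sd, 3))      # 0.187
print(round(population_sd, 3))  # 0.171
```

With only six observations the choice of denominator changes the result by almost 10%; with thousands of observations the difference becomes negligible.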
The Empirical Rule (68-95-99.7 Rule)
For data that follows a normal (bell-shaped) distribution, standard deviation has a powerful interpretive property:
| Range | Percentage of Data Included | Example (Mean=100, σ=15) |
|---|---|---|
| Mean ± 1σ | ≈ 68% of all observations | 85 to 115 |
| Mean ± 2σ | ≈ 95% of all observations | 70 to 130 |
| Mean ± 3σ | ≈ 99.7% of all observations | 55 to 145 |
Finance Application: Stock Volatility
Stock A has a mean annual return of 12% with σ = 5%. Stock B has the same 12% mean return but σ = 22%. If annual returns are approximately normally distributed, an investor can expect about 68% of years to fall between 7% and 17% for Stock A, but between −10% and 34% for Stock B. Standard deviation is the foundation of modern portfolio theory and Value-at-Risk (VaR) calculations.
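The empirical-rule bands for the two stocks can be computed with `statistics.NormalDist` (available in Python 3.8+). This is a sketch that assumes returns are exactly normal, which real stock returns are not; the function name `empirical_bands` is hypothetical.

```python
from statistics import NormalDist


def empirical_bands(mean, sigma):
    """Return {k: (low, high, share)} for mean ± k·sigma, k = 1, 2, 3."""
    dist = NormalDist(mu=mean, sigma=sigma)
    bands = {}
    for k in (1, 2, 3):
        low, high = mean - k * sigma, mean + k * sigma
        share = dist.cdf(high) - dist.cdf(low)  # probability inside the band
        bands[k] = (low, high, share)
    return bands


stock_a = empirical_bands(12, 5)   # mean 12%, sigma 5%
stock_b = empirical_bands(12, 22)  # mean 12%, sigma 22%

for k in (1, 2, 3):
    low_a, high_a, share = stock_a[k]
    low_b, high_b, _ = stock_b[k]
    print(f"±{k}σ ({share:.1%}): A {low_a}% to {high_a}%, B {low_b}% to {high_b}%")
```

The computed shares (about 68.3%, 95.4%, and 99.7%) are the exact normal probabilities that the 68-95-99.7 rule rounds.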
Interquartile Range (IQR)
The Interquartile Range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1). It measures the spread of the middle 50% of data, completely ignoring the top and bottom 25%, making it highly resistant to outliers.
Understanding Quartiles
Quartiles divide an ordered dataset into four equal parts:
| Quartile | Symbol | Position | Meaning |
|---|---|---|---|
| First Quartile | Q1 | 25th percentile | 25% of data falls below this value |
| Second Quartile | Q2 | 50th percentile | The Median — 50% of data falls below |
| Third Quartile | Q3 | 75th percentile | 75% of data falls below this value |
| IQR | Q3 − Q1 | Middle 50% | Spread of the central portion of data |
Dataset (12 invoices): 8, 12, 15, 18, 21, 24, 27, 30, 35, 42, 55, 280
(Note: 280 is a suspected outlier — possibly a fraudulent or erroneous invoice)
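Q1 (median of the lower six values): (15 + 18) / 2 = 16.5
Q3 (median of the upper six values): (35 + 42) / 2 = 38.5
IQR: 38.5 − 16.5 = 22.0
Lower fence: Q1 − 1.5 × IQR = 16.5 − 33 = −16.5 (no lower outlier)
Upper fence: Q3 + 1.5 × IQR = 38.5 + 33 = 71.5
Because 280 exceeds the upper fence of 71.5, it is flagged as an outlier.
IQR in Box Plots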
The IQR is the foundation of the box plot (box-and-whisker diagram), one of the most powerful data visualization tools in statistics. The box spans from Q1 to Q3, with a line at Q2 (median). Whiskers extend to the outermost values within 1.5×IQR of the quartiles. Points beyond the whiskers are plotted individually as potential outliers.
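The invoice example can be reproduced with a short script. This sketch computes quartiles by the median-of-halves method used in the worked example; library routines such as `statistics.quantiles` or `numpy.percentile` interpolate differently and may return slightly different quartiles. The function names are hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + (n % 2):]  # skip the middle value when n is odd
    return median(lower), median(data), median(upper)


def iqr_outliers(values, k=1.5):
    """Return Q1, Q3, IQR, the (lower, upper) fences, and flagged values."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    fences = (q1 - k * iqr, q3 + k * iqr)
    flagged = [x for x in values if x < fences[0] or x > fences[1]]
    return q1, q3, iqr, fences, flagged


invoices = [8, 12, 15, 18, 21, 24, 27, 30, 35, 42, 55, 280]
q1, q3, iqr, fences, flagged = iqr_outliers(invoices)

print(q1, q3, iqr)  # 16.5 38.5 22.0
print(fences)       # (-16.5, 71.5)
print(flagged)      # [280]
```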
Measures of Central Tendency vs Measures of Dispersion
| Feature | Measures of Central Tendency | Measures of Dispersion |
|---|---|---|
| Core Question | Where is the center of data? | How spread out is the data? |
| Purpose | Represent the typical value | Quantify variability and risk |
| Methods | Mean, Median, Mode | Range, Variance, Std Dev, IQR |
| Used alone? | Insufficient — can be misleading | Meaningless without central tendency |
| Together | Provide a complete picture: "The average return is 8% with a standard deviation of 3%" | |
| Finance use | Expected return (mean) | Risk / Volatility (σ) |
| Quality use | Target specification (mean) | Process consistency (σ) |
| Outlier effect | Mean is heavily affected; median resistant | Range and σ heavily affected; IQR resistant |
| Best pairing | Mean + Standard Deviation | Median + IQR |
Real-World Applications of Data Summarization
Finance and Investment Analysis
- Mean return: Portfolio managers calculate mean historical returns to estimate future expected returns and benchmark against indices.
- Standard deviation (volatility): The σ of a stock's returns is its risk measure in Modern Portfolio Theory. Lower σ = lower risk.
- Sharpe Ratio: (Mean Return − Risk-Free Rate) ÷ Standard Deviation — directly uses two summary measures.
- Median price targets: Analysts report median price targets (not mean) to avoid distortion from extreme outlier estimates.
- Quartile analysis: Fund performance is ranked by quartiles; top-quartile funds are those whose returns exceed Q3 (the 75th percentile) of their peer group.
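The Sharpe Ratio formula in the list above combines a mean and a standard deviation. The sketch below applies it to the six-stock returns from the mean section, treating them as a return series purely for illustration. The 2% risk-free rate is an assumed value, not a figure from this module.

```python
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free_rate):
    """(mean return - risk-free rate) / sample standard deviation."""
    return (mean(returns) - risk_free_rate) / stdev(returns)


returns = [8.2, 12.5, -3.4, 15.1, 6.8, 9.7]  # percent, from the mean example
RISK_FREE = 2.0                              # percent, assumed for illustration

ratio = sharpe_ratio(returns, RISK_FREE)
print(round(mean(returns), 2))   # 8.15
print(round(stdev(returns), 2))  # 6.4
print(round(ratio, 2))           # 0.96
```

A higher ratio means more excess return per unit of volatility, so the same mean return with a larger σ produces a lower Sharpe Ratio.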
Accounting and Financial Reporting
- Expense analysis: Accountants calculate mean and median operating expenses across departments to identify outlier cost centers.
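- Revenue variance analysis: Comparing actual vs. budgeted revenue as a performance metric. In budgeting, "variance" means the difference between actual and budgeted amounts, not the statistical variance defined in this module.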
- Receivables aging: Mean and median days sales outstanding (DSO) measure collection efficiency.
- Budget preparation: Prior-year mean expenses are the starting point for next-year budget projections.
Auditing
- Audit sampling: Mean-per-unit estimation projects total population value from a sample's mean.
- Analytical procedures: Comparing current-year mean transaction values to prior-year baselines to detect material misstatements.
- Outlier detection: IQR-based outlier flagging prioritizes transactions for detailed testing.
- Benford's Law analysis: Examining the frequency distribution (mode analysis) of leading digits in financial data to detect manipulation.
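A leading-digit check of the kind mentioned in the last bullet can be sketched as follows. Benford's Law gives the expected share of leading digit d as log10(1 + 1/d). The data here are the 12 invoice amounts from the IQR example, which is far too small a sample for a meaningful test and serves only to show the mechanics. The function names are hypothetical.

```python
from collections import Counter
from math import log10


def leading_digit(amount):
    """First significant digit of a nonzero amount."""
    # Scientific notation always starts with the first significant digit
    return int(f"{abs(amount):e}"[0])


def benford_expected(digit):
    """Expected share of a leading digit (1-9) under Benford's Law."""
    return log10(1 + 1 / digit)


invoices = [8, 12, 15, 18, 21, 24, 27, 30, 35, 42, 55, 280]
observed = Counter(leading_digit(x) for x in invoices)

for d in range(1, 10):
    share = observed[d] / len(invoices)
    print(f"digit {d}: observed {share:.2f}, expected {benford_expected(d):.3f}")
```

In practice an auditor would run this on thousands of transactions and compare observed and expected frequencies with a formal goodness-of-fit test.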
Business Operations
- Sales analysis: Mean and median weekly sales by territory, product line, or salesperson.
- Customer behavior: Mode of purchase frequency identifies the most common buying pattern.
- Supply chain: Standard deviation of delivery times measures supplier reliability.
- Pricing strategy: Median market prices inform competitive positioning without distortion from luxury outlier products.
Research and Academia
- Survey analysis: Mode of Likert-scale responses; mean satisfaction scores across groups.
- Clinical trials: Mean treatment effect with standard deviation quantifies both efficacy and variability.
- Education: Median test scores prevent a few high performers from inflating the apparent class performance.
Practical Case Study: Meridian Retail Group
Scenario: You are a data analyst at Meridian Retail Group, a mid-sized clothing retailer with 8 regional stores. The CEO asks you to analyze monthly sales revenue for Q4 (October, November, December) across all 8 stores to support strategic decisions about store investment, bonus allocation, and expansion planning.
Raw Data: Monthly Store Revenue ($000s)
| Store | October | November | December | Q4 Total |
|---|---|---|---|---|
| Store 1 — Downtown | 142 | 168 | 225 | 535 |
| Store 2 — Suburb North | 88 | 102 | 139 | 329 |
| Store 3 — Suburb South | 95 | 108 | 147 | 350 |
| Store 4 — Mall West | 178 | 215 | 298 | 691 |
| Store 5 — Mall East | 156 | 189 | 261 | 606 |
| Store 6 — Airport | 62 | 71 | 94 | 227 |
| Store 7 — University | 75 | 88 | 118 | 281 |
| Store 8 — Flagship City | 312 | 387 | 524 | 1,223 |
Step 1: Calculate Q4 Total Revenue Summary Measures
Q4 Totals ($000s): 535, 329, 350, 691, 606, 227, 281, 1,223
Mean: (535 + 329 + 350 + 691 + 606 + 227 + 281 + 1,223) ÷ 8 = 4,242 ÷ 8 = $530.25K
Sorted: 227, 281, 329, 350, 535, 606, 691, 1,223
Median (avg of 4th & 5th): (350 + 535) ÷ 2 = $442.5K
Range: 1,223 − 227 = $996K
Q1 (avg of 2nd & 3rd): (281 + 329) ÷ 2 = $305K
Q3 (avg of 6th & 7th): (606 + 691) ÷ 2 = $648.5K
IQR: 648.5 − 305 = $343.5K
Deviations from mean (530.25): 4.75, −201.25, −180.25, 160.75, 75.75, −303.25, −249.25, 692.75
Squared: 22.6, 40,501.6, 32,490.1, 25,840.6, 5,738.1, 91,960.6, 62,125.6, 479,902.6
Sum of squared deviations = 738,581.5; s² = 738,581.5 ÷ 7 ≈ 105,511.6
s = √105,511.6 ≈ $324.8K
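The Step 1 figures can be verified with a short script. This sketch uses the standard `statistics` module and the same median-of-halves quartile method as the manual calculation; store names are taken from the revenue table.

```python
from statistics import mean, median, stdev

# Q4 totals ($000s) from the revenue table
q4_totals = {
    "Downtown": 535, "Suburb North": 329, "Suburb South": 350,
    "Mall West": 691, "Mall East": 606, "Airport": 227,
    "University": 281, "Flagship City": 1223,
}

values = sorted(q4_totals.values())
half = len(values) // 2          # n = 8 is even, so the halves do not overlap

avg = mean(values)
mid = median(values)
spread = max(values) - min(values)
q1 = median(values[:half])
q3 = median(values[half:])
iqr = q3 - q1
sd = stdev(values)               # sample standard deviation (n - 1)
upper_fence = q3 + 1.5 * iqr
outliers = [store for store, v in q4_totals.items() if v > upper_fence]

print(avg, mid, spread)          # 530.25 442.5 996
print(q1, q3, iqr)               # 305.0 648.5 343.5
print(round(sd, 1))              # 324.8
print(upper_fence, outliers)     # 1163.75 ['Flagship City']
```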
Step 2: Interpret Results and Make Business Decisions
| Measure | Value | Business Interpretation |
|---|---|---|
| Mean | $530.25K | Average Q4 revenue per store. Use as performance benchmark. |
| Median | $442.5K | Typical store revenue ($87.75K below the mean, which is pulled upward by the flagship). Four of the eight stores fall below the mean, and a fifth (Downtown, $535K) barely exceeds it. |
| Std Dev | $324.8K | High variability — stores are very different in revenue. Not appropriate to apply uniform investment across all stores. |
| IQR | $343.5K | Middle 50% of stores range from $305K to $648.5K. A healthy performance band for the core group. |
| Outlier Test | $1,223K | Upper fence = $648.5 + 1.5×343.5 = $1,163.8K. The Flagship City store ($1,223K) is technically an outlier. |
Step 3: CEO Recommendations Based on Statistical Analysis
- Use median ($442.5K), not mean ($530.25K), as the "typical store" benchmark because the Flagship outlier inflates the mean significantly.
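- Prioritize investment in Mall West ($691K) and Mall East ($606K): they are the strongest performers after the flagship, sitting at or near Q3 ($648.5K), which makes them the most promising candidates for growth investment.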
- Review underperformers: Airport Store ($227K) and University Store ($281K) are below Q1 = $305K — consider performance improvement plans or strategic repositioning.
- Bonus structure: Set the bonus threshold at the median ($442.5K) rather than the mean to ensure 50% of stores qualify — a fairer incentive design.
- December seasonality: The mean December revenue ($225.75K) is 63% higher than October ($138.5K), so plan for significantly higher inventory and staffing in December.
10 Common Mistakes Beginners Make
1. Mistake: Using the mean to summarize skewed data. Solution: Always visualize your data distribution first. When data is skewed (e.g., income, house prices, insurance claims), use the median instead of the mean.
2. Mistake: Finding the median or quartiles from unsorted data. Solution: Always sort data in ascending or descending order before identifying quartiles or the median.
3. Mistake: Reporting the variance as if it were in the original units. Solution: Always take the square root to get the standard deviation (for example, a variance of 144 gives s = 12), which is in the same units as the original data and is interpretable.
4. Mistake: Dividing by n when computing a sample variance. Solution: When working with a sample (which is almost always), use n−1 in the denominator (Bessel's correction). Use N only when you have complete population data.
5. Mistake: Deleting outliers automatically. Solution: Investigate outliers before removing them. An outlier might be a data entry error (remove it), a genuinely exceptional observation (keep it and note it), or evidence of fraud (escalate it).
6. Mistake: Reporting a measure of center without a measure of spread. Solution: Always pair a central tendency measure with a dispersion measure. Mean + Standard Deviation, or Median + IQR.
7. Mistake: Looking for a mode in raw continuous data. Solution: Mode is most useful for categorical data or discrete data with limited unique values. For continuous data, group into class intervals and find the modal class.
8. Mistake: Judging a standard deviation as "large" or "small" in isolation. Solution: Context determines interpretation. Compare σ to the mean (coefficient of variation = σ/mean × 100%) for relative variability. Benchmark against industry standards.
9. Mistake: Relying on the range alone to describe spread. Solution: Use range as an initial quick check only. Always follow up with standard deviation and IQR for a complete picture of variability.
10. Mistake: Calculating numerical summaries for categorical data. Solution: For nominal (categorical) data, use only the mode and frequency distributions. Never apply mean, median, variance, or standard deviation to non-numerical categories.
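The coefficient of variation mentioned among the solutions above makes spreads comparable across datasets with very different scales. The sketch below applies it to two results computed earlier in this module: the ball-bearing diameters (mean 10.05 mm, s = 0.187 mm) and the Meridian store revenues (mean $530.25K, s = $324.8K). The function name is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative variability: standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev / mean * 100


cv_bearings = coefficient_of_variation(0.187, 10.05)  # diameters in mm
cv_stores = coefficient_of_variation(324.8, 530.25)   # Q4 revenue in $K

print(f"ball bearings: {cv_bearings:.1f}%")  # about 1.9%
print(f"store revenue: {cv_stores:.1f}%")    # about 61.3%
```

The raw standard deviations (0.187 and 324.8) cannot be compared because the units differ, but the percentages show that store revenue is far more variable relative to its mean.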
⭐ Key Takeaways — Module 3
- Data summarization converts raw observations into a small set of meaningful numbers that capture the essential characteristics of a dataset.
- The three measures of central tendency — mean, median, mode — each answer "where is the center?" but in different ways suited to different data types and distributions.
- The mean uses all observations but is highly sensitive to outliers. The median is resistant to outliers but ignores extreme values. The mode identifies the most common value and works for categorical data.
- Measures of dispersion — range, variance, standard deviation, IQR — quantify how spread out data is. They are meaningless without a corresponding measure of central tendency.
- Standard deviation is the most commonly used dispersion measure. It is expressed in the same units as the data and forms the foundation of finance's risk measurement framework.
- The IQR (Q3 − Q1) measures the spread of the middle 50% of data and is used to detect outliers via the 1.5×IQR rule, making it essential in auditing and quality control.
- Always use mean + standard deviation together for symmetric data, and median + IQR together for skewed data or data with outliers.
- The empirical rule states that in normally distributed data, 68% falls within ±1σ, 95% within ±2σ, and 99.7% within ±3σ of the mean.
- Context determines which summary measure is most appropriate. Finance uses σ for risk; income analysis uses median; categorical surveys use mode.
- Descriptive statistics are the foundation upon which inferential statistics, hypothesis testing, and predictive modeling are built.
Practice Exercises
Part A: Conceptual Questions (15 Questions)
(1) Explain in your own words why the median is preferred over the mean when analyzing household income data. Provide a numerical example to support your answer.
(2) A dataset has a mean of 50 and a standard deviation of 2, while another has a mean of 50 and a standard deviation of 18. What can you conclude about these two datasets without seeing the data?
(3) A retail store sells shoes in sizes: 7, 8, 8, 9, 9, 9, 10, 10, 11. Which measure of central tendency should the store manager focus on when deciding which sizes to stock most heavily, and why?
(4) Why do we divide by (n−1) instead of n when calculating sample variance? What problem does this correction solve?
(5) Describe two different business scenarios where the IQR would be more appropriate than the standard deviation for measuring spread. Explain your reasoning.
(6) What does the empirical rule state, and when does it apply?
(7) Can the mean, median, and mode all be equal? When?
(8) What does a variance of zero tell you about a dataset?
(9) How does data summarization support the audit process?
(10) Why is range considered an incomplete measure of dispersion?
(11) Describe a situation where a bimodal distribution would occur in business data.
(12) What is the coefficient of variation, and why is it useful for comparing two datasets?
(13) How does standard deviation relate to risk in finance?
(14) Why is it incorrect to calculate a mean for nominal data?
(15) Explain what "resistant to outliers" means in the context of the median and IQR.
Part B: Calculation Problems (10 Problems)
(1) A sales team recorded weekly units sold: 42, 55, 42, 68, 73, 42, 61, 55, 78, 66. Calculate the mean, median, and mode.
Sorted: 42, 42, 42, 55, 55, 61, 66, 68, 73, 78
Mean: (42+42+42+55+55+61+66+68+73+78) ÷ 10 = 582 ÷ 10 = 58.2 units
Median: (55+61) ÷ 2 = 58 units
Mode: 42 units (appears 3 times)
(2) Five quarterly profits ($M): 12, 15, 10, 18, 14. Calculate sample variance and standard deviation.
Mean: (12+15+10+18+14) ÷ 5 = 69 ÷ 5 = 13.8
Squared deviations: (12−13.8)²=3.24, (15−13.8)²=1.44, (10−13.8)²=14.44, (18−13.8)²=17.64, (14−13.8)²=0.04
Sum = 36.8. s² = 36.8 ÷ 4 = 9.2. s = √9.2 ≈ $3.03M
(3) Audit sample of 10 transaction values ($000s): 5, 8, 12, 14, 16, 19, 22, 28, 35, 180. Calculate IQR and identify any outliers.
Using the median-of-halves method from the IQR section: Q1 (median of 5, 8, 12, 14, 16) = 12. Q3 (median of 19, 22, 28, 35, 180) = 28. IQR = 28 − 12 = 16.
Upper fence = 28 + 1.5×16 = 28 + 24 = 52. Lower fence = 12 − 24 = −12.
$180K exceeds $52K → Flagged as outlier for audit investigation.
(4) Stock returns over 6 months: 3%, −1%, 5%, 8%, 2%, 4%. Find mean and standard deviation.
(5) 9 employee ages: 23, 28, 32, 35, 35, 41, 47, 52, 61. Find Q1, Q2, Q3, and IQR.
(6) Monthly expenses for 7 months: $8,200, $7,500, $9,100, $8,800, $7,200, $8,600, $9,400. Find range, mean, and median.
(7) A factory produces bolts with diameters: 9.9, 10.0, 10.1, 9.8, 10.2, 10.0, 9.9, 10.1 mm. Calculate variance and std dev.
(8) Test scores: 72, 85, 91, 68, 77, 85, 94, 85, 79, 88. Find all three measures of central tendency.
(9) Portfolio annual returns for 5 years: 12%, 8%, −3%, 15%, 11%. Calculate mean return, variance, and standard deviation.
(10) Revenue CAGR: A business grew revenue from $2.4M to $3.8M over 4 years. Calculate the CAGR. [Hint: CAGR = (End/Start)^(1/n) − 1]
Module 3 Quiz — 20 Multiple Choice Questions
A dataset contains values: 4, 7, 7, 9, 13. What is the mean?
For the dataset 3, 8, 8, 12, 15, 22, 28, what is the median?
Which measure of central tendency is MOST resistant to extreme values (outliers)?
The standard deviation of a dataset is always:
A dataset has values: 10, 10, 10, 10, 10. What is its standard deviation?
IQR = Q3 − Q1. If Q1 = 20 and Q3 = 50, what is the upper outlier fence using the 1.5×IQR rule?
Which measure is MOST appropriate for summarizing categorical data such as "most popular payment method"?
A company's 5 sales reps achieved revenues of $120K, $135K, $118K, $142K, $125K. The sample variance requires dividing the sum of squared deviations by:
The empirical rule states that approximately what percentage of data falls within ±2 standard deviations of the mean in a normal distribution?
For the dataset 5, 10, 15, 20, 25, 100, which pairing gives the most informative summary?
A dataset has Range = 0. What does this tell you?
An auditor uses mean-per-unit estimation. A sample of 100 invoices has a mean value of $850. If the population contains 4,000 invoices, what is the estimated total population value?
Which measure of central tendency is used in the Sharpe Ratio for financial performance evaluation?
A dataset is bimodal. This means:
In Modern Portfolio Theory, standard deviation of returns is used as a measure of:
What is the Q2 quartile equivalent to?
If a dataset's mean is significantly higher than its median, the distribution is likely:
A factory sets a quality control specification of 100mm ± 3mm. If the production process has mean = 100mm and σ = 1mm, approximately what percentage of parts meet specification (using the empirical rule)?
Why is the range considered an incomplete measure of dispersion?
A researcher collects data on customer satisfaction on a 5-point scale (1=Very Dissatisfied to 5=Very Satisfied) with results: 3, 4, 5, 4, 4, 3, 5, 4, 4, 2. Which measure best summarizes the most common satisfaction level?
Frequently Asked Questions
Module 3 Final Summary
This module introduced you to the core tools of data summarization and descriptive statistics — the essential first step in any quantitative analysis. You are now equipped to transform any raw dataset into a compact, meaningful set of numbers that tell a clear story.
What You Mastered
| Measure | Type | Formula | Outlier Resistant? | Best Use Case |
|---|---|---|---|---|
| Mean | Central Tendency | Σx / n | No | Symmetric data, financial ratios |
| Median | Central Tendency | Middle value | Yes | Skewed data, income, prices |
| Mode | Central Tendency | Most frequent | Yes | Categorical data, inventory |
| Range | Dispersion | Max − Min | No | Quick spread check |
| Variance | Dispersion | Σ(x−x̄)² / (n−1) | No | Mathematical analysis, finance |
| Std Dev | Dispersion | √Variance | No | Risk, quality control, research |
| IQR | Dispersion | Q3 − Q1 | Yes | Outlier detection, skewed data |
The ability to select the right measure — knowing when to use mean vs median, or standard deviation vs IQR — separates a competent analyst from a great one. These tools will underpin every module that follows, from probability distributions through regression analysis.
Continue Your Learning
Data summarization connects to every other area of statistics. Explore related modules to deepen your understanding:
- Module 2: Data and Measurement Scales
- Nominal, ordinal, interval, ratio scales
- Why measurement type determines which statistics to use
- Module 4: Probability Fundamentals
- Classical, empirical, subjective probability
- Probability rules and expected value