Statistical Inference: Sample Proportions
1. Population Parameter vs. Sample Statistic
- Population Parameter: A numerical value that describes a characteristic of the entire population. It is usually unknown and is often denoted by a Greek letter (e.g., \(\mu\) for a population mean), although the population proportion is conventionally written \(p\).
- Sample Statistic: A numerical value that describes a characteristic of a sample taken from the population. It is used to estimate the population parameter and is usually denoted by a Roman letter, often with a hat or bar (e.g., \(\hat{p}\) for the sample proportion).
| Feature | Population Parameter | Sample Statistic |
| --- | --- | --- |
| Definition | Describes population | Describes sample |
| Notation | \(p\) for a proportion (Greek letters for many other parameters) | \(\hat{p}\) for a proportion |
| Known/Unknown | Usually unknown | Known once the sample is taken |
| Variability | Constant | Varies between samples |
| Purpose | True value | Estimate of population parameter |
KEY TAKEAWAY: The sample statistic is our best guess for the population parameter, but it’s inherently variable due to random sampling.
2. Simulation of Random Sampling
- Purpose: To visualize the distribution of sample proportions (\(\hat{P}\)) and understand how confidence intervals vary from sample to sample.
- Process:
  - Define the population proportion (\(p\)).
  - Choose a sample size (\(n\)).
  - Generate many random samples of size \(n\) from the population.
  - Calculate the sample proportion (\(\hat{p}\)) for each sample.
  - Plot the distribution of the sample proportions (\(\hat{P}\)).
  - Calculate confidence intervals for each sample.
- Observations:
  - The distribution of \(\hat{P}\) becomes approximately normal as \(n\) increases.
  - The mean of the distribution of \(\hat{P}\) is close to \(p\).
  - The standard deviation of the distribution of \(\hat{P}\) decreases as \(n\) increases.
  - Confidence intervals vary in width and position from sample to sample.
  - Increasing the sample size \(n\) reduces the width of the confidence interval.
STUDY HINT: Use statistical software or online simulators to experiment with different values of \(p\) and \(n\) to observe the effects on the distribution of \(\hat{P}\) and confidence intervals.
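The simulation process can be sketched in a few lines of Python. This is a minimal illustration only: the function name `simulate_sample_proportions`, the population proportion 0.3, the sample sizes and the number of samples are all assumed values chosen for the example, not part of the notes.

```python
import random
import statistics

def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Return sample proportions from repeated random samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(num_samples):
        # Count the "successes" in one random sample of size n
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

p = 0.3  # assumed population proportion
for n in (20, 100, 500):
    p_hats = simulate_sample_proportions(p, n, num_samples=2000)
    mean = statistics.mean(p_hats)
    sd = statistics.stdev(p_hats)
    theoretical_sd = (p * (1 - p) / n) ** 0.5
    print(f"n={n}: mean={mean:.4f}, sd={sd:.4f}, theoretical sd={theoretical_sd:.4f}")
```

The printed means should sit close to \(p\), and the simulated standard deviations should shrink as \(n\) grows, in line with the observations listed above.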
3. Sample Proportion as a Random Variable
- Definition: The sample proportion, denoted by \(\hat{P}\), is the proportion of items in a sample that have a particular characteristic.
- Formula: \(\hat{P} = \frac{X}{n}\), where:
  - \(X\) is the number of items with the characteristic in the sample. \(X\) follows a binomial distribution: \(X \sim \mathrm{Bin}(n, p)\).
  - \(n\) is the sample size.
- Random Variable: \(\hat{P}\) is a random variable because its value varies from sample to sample due to random sampling.
- Relationship to Binomial Distribution: Since \(X \sim \mathrm{Bin}(n, p)\), \(\hat{P}\) takes the values \(0, \frac{1}{n}, \frac{2}{n}, \ldots, 1\), with \(\Pr\left(\hat{P} = \frac{x}{n}\right) = \Pr(X = x)\).
EXAM TIP: Be clear about the difference between \(X\) (number of successes) and \(\hat{P}\) (proportion of successes).
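To make the link between \(X\) and \(\hat{P}\) concrete, the exact distribution of \(\hat{P}\) can be tabulated from the binomial probabilities. The sketch below assumes \(n = 10\) and \(p = 0.4\) purely for illustration; `p_hat_distribution` is a hypothetical helper name.

```python
from math import comb

def p_hat_distribution(n, p):
    """Map each possible value x/n of P-hat to Pr(X = x), where X ~ Bin(n, p)."""
    return {x / n: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}

n, p = 10, 0.4  # assumed values for illustration
dist = p_hat_distribution(n, p)
for value, prob in dist.items():
    print(f"Pr(P-hat = {value:.1f}) = {prob:.4f}")

# Mean and variance computed directly from the tabulated distribution
mean = sum(v * pr for v, pr in dist.items())
variance = sum((v - mean) ** 2 * pr for v, pr in dist.items())
print(f"mean = {mean:.4f}, standard deviation = {variance ** 0.5:.4f}")
```

Each probability belongs to a count \(x\) and to the corresponding proportion \(\frac{x}{n}\) at the same time, which is why the two are easy to confuse.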
4. Approximate Normality of the Distribution of \(\hat{P}\)
- Condition: For large samples, the distribution of \(\hat{P}\) is approximately normal. A common rule of thumb is that \(np \geq 5\) and \(n(1-p) \geq 5\).
- Mean: The mean of the distribution of \(\hat{P}\) is the population proportion, \(p\).
\[ E(\hat{P}) = p \]
- Standard Deviation: The standard deviation of the distribution of \(\hat{P}\) is:
\[ SD(\hat{P}) = \sqrt{\frac{p(1-p)}{n}} \]
- Implication: When the distribution of \(\hat{P}\) is approximately normal, we can use the standard normal distribution to calculate probabilities and confidence intervals.
COMMON MISTAKE: Forgetting to check the condition for approximate normality before using the normal distribution to analyze \(\hat{P}\).
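As a rough sketch of how the condition check and the normal approximation fit together, the code below compares an approximate probability with the exact binomial one. The values \(n = 100\), \(p = 0.6\) and the range 0.5 to 0.7 are assumed for illustration, and the function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

def normal_condition_met(n, p, threshold=5):
    """Rule of thumb from the notes: np >= 5 and n(1 - p) >= 5."""
    return n * p >= threshold and n * (1 - p) >= threshold

def approx_prob_between(n, p, low, high):
    """Normal approximation to Pr(low <= P-hat <= high)."""
    dist = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n))
    return dist.cdf(high) - dist.cdf(low)

def exact_prob_between(n, p, low, high):
    """Exact Pr(low <= P-hat <= high) summed from Bin(n, p) probabilities."""
    return sum(
        comb(n, x) * p**x * (1 - p) ** (n - x)
        for x in range(n + 1)
        if low <= x / n <= high
    )

n, p = 100, 0.6  # assumed values for illustration
if normal_condition_met(n, p):
    print(f"approximate: {approx_prob_between(n, p, 0.5, 0.7):.4f}")
    print(f"exact:       {exact_prob_between(n, p, 0.5, 0.7):.4f}")
else:
    print("Condition not met: the normal approximation may be poor.")
```

The two results should be close but not identical, since the normal curve only approximates the discrete binomial distribution.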
5. Confidence Intervals for a Population Proportion
- Definition: A confidence interval is a range of values that is likely to contain the true population proportion, \(p\).
- Formula: The approximate confidence interval for a population proportion is:
\[ \left(\hat{p}-z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+z \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \]
where:
- \(\hat{p}\) is the sample proportion.
- \(n\) is the sample size.
- \(z\) is the z-score corresponding to the desired level of confidence (quantile for the standard normal distribution).
- Standard Error: The term \(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\) is sometimes referred to as the standard error of the sample proportion.
- 95% Confidence Interval: For a 95% confidence interval, \(z \approx 1.96\). The interval is:
\[ \left(\hat{p}-1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \hat{p}+1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \]
- Interpretation: We are 95% confident that the true population proportion, \(p\), lies within the calculated interval. This does not mean that there is a 95% chance that \(p\) is in this particular interval: \(p\) is a fixed value, so once an interval has been calculated it either contains \(p\) or it does not. Rather, if we were to repeat the sampling process many times, approximately 95% of the intervals constructed would contain the true population proportion.
VCAA FOCUS: VCAA often requires you to interpret the meaning of a confidence interval in context.
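The repeated-sampling interpretation can be checked with a short simulation. This is a sketch under assumed values (\(p = 0.4\), \(n = 200\), 2000 samples); `confidence_interval` and `coverage` are hypothetical helper names, and the interval is the approximate one given in the formula above.

```python
import random
from math import sqrt
from statistics import NormalDist

def confidence_interval(p_hat, n, level=0.95):
    """Approximate confidence interval for a proportion, as in the notes."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def coverage(p, n, num_samples, level=0.95, seed=1):
    """Fraction of simulated intervals that contain the true proportion p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        low, high = confidence_interval(successes / n, n, level)
        if low <= p <= high:
            hits += 1
    return hits / num_samples

print(f"coverage = {coverage(p=0.4, n=200, num_samples=2000):.3f}")
```

The printed fraction should be near 0.95, though not exactly equal to it, because the interval is approximate and the simulation itself is random.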
6. Factors Affecting Confidence Interval Width
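- Sample Size (\(n\)): As the sample size increases, the width of the confidence interval decreases. Larger samples provide more information about the population. Because the width is proportional to \(\frac{1}{\sqrt{n}}\), quadrupling the sample size halves the width (for the same \(\hat{p}\) and confidence level).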
- Confidence Level: As the confidence level increases (e.g., from 95% to 99%), the width of the confidence interval increases. A higher confidence level requires a wider interval to capture the true population proportion with greater certainty.
- Sample Proportion (\(\hat{p}\)): The width of the interval is also affected by \(\hat{p}\). The product \(\hat{p}(1-\hat{p})\) is largest when \(\hat{p} = 0.5\), so the further \(\hat{p}\) is from 0.5, the smaller the standard error becomes, and hence the narrower the confidence interval.
REMEMBER: Larger sample size = narrower interval; Higher confidence level = wider interval.
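A quick way to see all three effects is to compute the width \(2z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\) for a few cases. The values below are assumed for illustration, and `interval_width` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def interval_width(p_hat, n, level=0.95):
    """Width of the approximate confidence interval: 2 * z * standard error."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * z * sqrt(p_hat * (1 - p_hat) / n)

# Effect of sample size (p-hat and confidence level held fixed)
for n in (100, 400, 1600):
    print(f"n={n}: width={interval_width(0.5, n):.4f}")

# Effect of confidence level (p-hat and n held fixed)
for level in (0.90, 0.95, 0.99):
    print(f"level={level}: width={interval_width(0.5, 100, level):.4f}")

# Effect of the sample proportion (n and confidence level held fixed)
for p_hat in (0.1, 0.3, 0.5):
    print(f"p_hat={p_hat}: width={interval_width(p_hat, 100):.4f}")
```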
7. Example
Suppose a survey of 500 randomly selected voters finds that 55% support a particular candidate. Calculate a 95% confidence interval for the proportion of all voters who support the candidate.
- \(\hat{p} = 0.55\)
- \(n = 500\)
- \(z = 1.96\)
The 95% confidence interval is:
\[ \left(0.55 - 1.96 \sqrt{\frac{0.55(1-0.55)}{500}}, 0.55 + 1.96 \sqrt{\frac{0.55(1-0.55)}{500}}\right) \]
\[ \left(0.55 - 1.96 \sqrt{\frac{0.2475}{500}}, 0.55 + 1.96 \sqrt{\frac{0.2475}{500}}\right) \]
\[ \left(0.55 - 1.96(0.02225), 0.55 + 1.96(0.02225)\right) \]
\[ (0.5064, 0.5936) \]
Therefore, we are 95% confident that the true proportion of all voters who support the candidate is between 50.64% and 59.36%.
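The arithmetic in this example can be checked with a few lines of Python, using the same values of \(\hat{p}\), \(n\) and \(z\) (the variable names are chosen here for illustration):

```python
from math import sqrt

p_hat, n, z = 0.55, 500, 1.96

standard_error = sqrt(p_hat * (1 - p_hat) / n)
margin = z * standard_error
lower, upper = p_hat - margin, p_hat + margin

print(f"standard error = {standard_error:.5f}")  # standard error = 0.02225
print(f"interval = ({lower:.4f}, {upper:.4f})")  # interval = (0.5064, 0.5936)
```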
APPLICATION: Confidence intervals are widely used in surveys, opinion polls, and scientific research to estimate population parameters.