Statistical Inference: Estimation of the Mean and Proportion
The most fundamental point and interval estimation process involves the estimation of a population mean. When the sample mean is used as a point estimate of the population mean, some error is to be expected, because a sample, a subset of the population, is used to compute the point estimate.
A point estimate is a single value given as the estimate of a population parameter that is of interest, for example, the mean of some quantity. An interval estimate specifies instead a range within which the parameter is estimated to lie. Interval estimates can be contrasted with point estimates. Confidence intervals are commonly reported in tables or graphs along with point estimates of the same parameters, to show the reliability of the estimates.
Suppose we want to estimate an actual population mean μ. As you know, we can only obtain \( \overline{x} , \) the mean of a sample randomly selected from the population of interest. We can use \( \overline{x} \) to find a range of values:
\[ \mbox{Lower value} < \mbox{population mean } \mu < \mbox{Upper value} \] that we can be really confident contains the population mean μ. This range of values is called a confidence interval. The general form of most confidence intervals is \[ \mbox{Sample estimate} \pm \mbox{margin of error} . \] That is, \[ \mbox{the lower limit } L \mbox{ of the interval } = \mbox{estimate} - \mbox{margin of error} , \] and \[ \mbox{the upper limit } U \mbox{ of the interval } = \mbox{estimate} + \mbox{margin of error} . \] Once we have obtained the interval, we can claim to be really confident that the value of the population parameter lies somewhere between the value of L and the value of U. The number we add to and subtract from the point estimate is called the margin of error. The question arises: what number should we subtract from and add to a point estimate to obtain an interval estimate? The answer depends on two considerations:
- The standard deviation \( \sigma_{\overline{x}} \) of the sample mean, \( \overline{x} . \)
- The level of confidence to be attached to the interval.
First, the larger the standard deviation of \( \overline{x} , \) the greater the number that must be subtracted and added. Second, the quantity subtracted and added must be larger if we want higher confidence in our interval. It is customary to attach a probabilistic statement to the interval estimate. This probabilistic statement is given by the confidence level, and an interval constructed at this confidence level is called a confidence interval. The confidence interval is given as
\[ \mbox{point estimate } \pm \mbox{margin of error} . \] The confidence level associated with a confidence interval states how much confidence we have that this interval contains the true population parameter. The confidence level is denoted by \( (1- \alpha )\,100\% , \) where α is the Greek letter alpha. When expressed as a probability, it is called the confidence coefficient and is denoted by 1 - α. The quantity α itself is called the significance level. More generally and more precisely, we can say that \( 100(1-\alpha)\% \) of all samples of size n have means within the interval:
\[ \left[ \overline{x} - z_{\alpha /2} \cdot \frac{\sigma}{\sqrt{n}} , \ \overline{x} + z_{\alpha /2} \cdot \frac{\sigma}{\sqrt{n}} \right] , \] where \( z_{\alpha /2} \) is the value of the standard normal distribution with probability α/2 to its right, that is, \[ \frac{1}{\sqrt{2\pi}} \, \int_{z_{\alpha /2}}^{\infty} e^{-t^2 /2} \,{\text d}t = \frac{\alpha}{2} . \] R has a special command to calculate z-values, shown below. Note that we assume that the standard deviation σ of the total population is known. The above z-interval procedure works reasonably well even when the variable is not normally distributed and the sample size is small or moderate, provided the variable is not too far from being normally distributed. Thus, we say that the z-interval procedure is robust to moderate violations of the normality assumption.
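The z-values can be obtained with the base R function qnorm, which returns the quantile with a given probability to its left; for the upper-tail definition above we pass \( 1 - \alpha/2 \):

```r
# z-value for a 95% confidence level: alpha = 0.05
alpha <- 0.05
qnorm(1 - alpha/2)   # 1.959964
# z-value for a 99% confidence level
qnorm(1 - 0.01/2)    # 2.575829
```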
Example: Consider the weights of hockey players in the NHL during the 2017-2018 season, which have mean 173.5 lbs with a standard deviation of 13.39 lbs (according to the official NHL web data). Now we take a sample of five players from the Washington Capitals:
| Player | Weight (lbs) |
|---|---|
| Alexander Ovechkin | 236 |
| Nicklas Backstrom | 214 |
| Jay Beagle | 216 |
| Brooks Orpik | 220 |
| Dmitry Orlov | 209 |
Therefore, this sample has a mean of 219 with a standard deviation of 10.29563. We know that the sample mean \( \overline{x} = 219 \) and the sample variance \( s^2 \) are unbiased estimators of the population mean μ = 173.5 and the population variance \( \sigma^2 = 13.39^2 \approx 179.2921 . \) However, the sample standard deviation s is a biased estimator of the corresponding population parameter (in our case, the standard deviation of the population).
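These figures are easy to reproduce in R; the following sketch uses the weights from the table above together with the known population standard deviation σ = 13.39 to build a 95% z-interval:

```r
capitals <- c(236, 214, 216, 220, 209)  # weights from the table above
mean(capitals)                           # 219
sd(capitals)                             # 10.29563
# 95% z-interval with known population sigma
sigma <- 13.39
n <- length(capitals)
mean(capitals) + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)
# approximately 207.26 and 230.74
```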
Now we take another sample from Boston Bruins:
| Player | Weight (lbs) |
|---|---|
| Brad Marchand | 181 |
| Patrice Bergeron | 195 |
| David Pastrňák | 188 |
| Torey Krug | 186 |
| Brandon Carlo | 208 |
You can find the confidence interval using R. However, you first need to install a few packages (the last one will be used for proportions).
install.packages("Rmisc", lib= "/data/Rpackages/") install.packages("lattice", lib= "/data/Rpackages/") install(plyr) install.packages("PropCIs", lib= "/data/Rpackages/")
```r
lizard = c(6.2, 6.6, 7.1, 7.4, 7.6, 7.9, 8, 8.3, 8.4, 8.5, 8.6,
           8.8, 8.8, 9.1, 9.2, 9.4, 9.4, 9.7, 9.9, 10.2, 10.4, 10.8,
           11.3, 11.9)
```

If we use the t.test command listing only the data name, we get a 95% confidence interval for the mean along with the significance test. The following simulation draws 100 samples of size 24 from a normal population with mean 9, computes a 95% confidence interval for each, counts how many intervals cover the true mean, and plots them:

```r
n.draw = 100
mu = 9
n = 24
SD = sd(lizard)
# each column of draws is one simulated sample of size n
draws = matrix(rnorm(n.draw * n, mu, SD), n)
get.conf.int = function(x) t.test(x)$conf.int
conf.int = apply(draws, 2, get.conf.int)
# number of intervals that contain the true mean mu
sum(conf.int[1, ] <= mu & conf.int[2, ] >= mu)
plot(range(conf.int), c(0, 1 + n.draw), type = "n",
     xlab = "mean tail length", ylab = "sample run")
for (i in 1:n.draw) lines(conf.int[, i], rep(i, 2), lwd = 2)
abline(v = 9, lwd = 2, lty = 2)
```
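Returning to the hockey samples, a minimal sketch of the same computation uses the CI function from the Rmisc package installed above (it produces a t-based interval, since σ is treated as unknown for this small sample); base R's t.test gives the same limits:

```r
library(Rmisc)
bruins <- c(181, 195, 188, 186, 208)  # Bruins weights from the table above
CI(bruins, ci = 0.95)                 # upper, mean, lower of the 95% t-interval
t.test(bruins)$conf.int               # same interval via base R
```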
The sample variance is calculated according to the formula
\[ s^2 = \frac{1}{n-1} \, \sum_i \left( x_i - \overline{x} \right)^2 . \] The above formula can be slightly modified: \[ s^2 = \frac{1}{n-1} \left( \sum_i x_i^2 - \frac{1}{n} \left( \sum_i x_i \right)^2 \right) , \] where n is the sample size. When using this formula, do not perform any rounding until the computation is complete; otherwise, substantial roundoff error can result. Upon taking the square root of the right-hand side, we obtain the sample standard deviation, which is a biased estimator of the population standard deviation. On the other hand, \( s^2 \) is an unbiased estimator of the variance of an infinite population. However, it is not an unbiased estimator of the variance of a finite population. Recall that a statistic \( \hat{p} \) is an unbiased estimator of the parameter p if and only if its expected value equals the parameter: \( E \left[ \hat{p} \right] = p . \)
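The equivalence of the two formulas can be checked numerically; a short sketch using the Capitals weights from the earlier example:

```r
x <- c(236, 214, 216, 220, 209)
n <- length(x)
sum((x - mean(x))^2) / (n - 1)       # defining formula: 106
(sum(x^2) - sum(x)^2 / n) / (n - 1)  # shortcut formula: 106, identical
var(x)                               # built-in sample variance, same value
```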
Theorem: If \( s^2 \) is the variance of a random sample from an infinite population with the finite variance σ2, then its expected value is equal to the variance of the population, that is, \( E\left[ s^2 \right] = \sigma^2 . \)
Proof: According to the definition of the expected value, we have \begin{align*} E\left[ s^2 \right] &= E \left[ \frac{1}{n-1} \cdot \sum_{i=1}^n \left( x_i - \overline{x} \right)^2 \right] \\ &= \frac{1}{n-1} \cdot E \left[ \sum_{i=1}^n \left\{ \left( x_i - \mu \right) - \left( \overline{x} - \mu \right) \right\}^2 \right] \\ &= \frac{1}{n-1} \cdot \left[ \sum_{i=1}^n E \left[ \left( x_i - \mu \right)^2 \right] - n \cdot E \left[ \left( \overline{x} - \mu \right)^2 \right] \right] . \end{align*} Then, since \( E \left[ \left( x_i - \mu \right)^2 \right] = \sigma^2 \) and \( E \left[ \left( \overline{x} - \mu \right)^2 \right] = \frac{\sigma^2}{n} , \) it follows that \[ E\left[ s^2 \right] = \frac{1}{n-1} \cdot \left[ \sum_{i=1}^n \sigma^2 - n \cdot \frac{\sigma^2}{n} \right] = \sigma^2 . \qquad ■ \] The standard deviation is a measure of variation: the more variation there is in a data set, the larger its standard deviation. Almost all the observations in any data set lie within three standard deviations to either side of the mean. A more precise version of the three-standard-deviations rule can be obtained from Chebyshev's rule:
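The theorem is easy to illustrate by simulation: averaging the sample variances of many random samples drawn from a population with known variance (here σ² = 4, an arbitrary choice for this sketch) gives a value close to σ²:

```r
set.seed(1)
sigma2 <- 4   # true population variance
# 10,000 samples of size 10; compute the sample variance of each
s2 <- replicate(10000, var(rnorm(10, mean = 0, sd = sqrt(sigma2))))
mean(s2)      # close to 4, illustrating E[s^2] = sigma^2
```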
For any quantitative data set and any real number k greater than or equal to 1, at least \( 1 - 1/k^2 \) of the observations lie within k standard deviations to either side of the mean, that is, between \( \overline{x} - k\,s \) and \( \overline{x} + k\,s . \) ■

Example: We return to the sample of five hockey players taken from the Washington Capitals.
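Taking k = 2 with the sample values \( \overline{x} = 219 \) and \( s \approx 10.2956 \) computed earlier, Chebyshev's rule guarantees that at least \( 1 - 1/2^2 = 75\% \) of the observations lie within two standard deviations of the mean: \[ \overline{x} \pm 2s = 219 \pm 2 \left( 10.2956 \right) \approx \left[ 198.41 , \ 239.59 \right] . \] Indeed, all five sampled weights fall inside this interval.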