A point estimate is a single value given as the estimate of a population parameter of interest, for example, the mean of some quantity. An interval estimate instead specifies a range within which the parameter is estimated to lie. Confidence intervals are commonly reported in tables or graphs along with point estimates of the same parameters, to show the reliability of the estimates.
Suppose we want to estimate an actual population mean μ. As you know, we can only obtain \( \overline{x} , \) the mean of a sample randomly selected from the population of interest. We can use \( \overline{x} \) to find a range of values:
\[ \mbox{Lower value} < \mbox{population mean } \mu < \mbox{Upper value} \] that we can be really confident contains the population mean μ. This range of values is called a confidence interval. The general form of most confidence intervals is \[ \mbox{Sample estimate} \pm \mbox{margin of error} . \] That is, \[ \mbox{the lower limit } L \mbox{ of the interval } = \mbox{estimate} - \mbox{margin of error} , \] and \[ \mbox{the upper limit } U \mbox{ of the interval } = \mbox{estimate} + \mbox{margin of error} . \] Once we have obtained the interval, we can claim to be really confident that the value of the population parameter lies somewhere between the value of L and the value of U. The number we add to and subtract from the point estimate is called the margin of error. The question arises: what number should we subtract from and add to a point estimate to obtain an interval estimate? The answer depends on two considerations. First, the margin of error depends on how variable the point estimate is, that is, on the standard deviation of its sampling distribution: the larger that standard deviation, the larger the quantity must be.
Second, the quantity subtracted and added must be larger if we want to have higher confidence in our interval. It is customary to attach a probabilistic statement to the interval estimate; this probabilistic statement is given by the confidence level. An interval constructed based on this confidence level is called a confidence interval. The confidence interval is given as
\[ \mbox{point estimate } \pm \mbox{margin of error} . \] The confidence level associated with a confidence interval states how much confidence we have that this interval contains the true population parameter. The confidence level is denoted by \( (1- \alpha )\,100\% , \) where α is the Greek letter alpha. When expressed as a probability, it is called the confidence coefficient and is denoted by 1 − α. The quantity α is called the significance level. More generally and more precisely, we can say that for 100(1−α)% of all samples of size n, the following interval contains the population mean μ:
\[ \left[ \overline{x} - z_{\alpha /2} \cdot \frac{\sigma}{\sqrt{n}} , \ \overline{x} + z_{\alpha /2} \cdot \frac{\sigma}{\sqrt{n}} \right] , \] where \( z_{\alpha /2} \) is the value of the standard normal distribution that cuts off an upper-tail area of α/2, that is, \[ \frac{1}{\sqrt{2\pi}} \, \int_{z_{\alpha /2}}^{\infty} e^{-t^2 /2} \,{\text d}t = \frac{\alpha}{2} . \] R has a special command, qnorm(), to calculate z-values. Note that we assume that the standard deviation σ of the total population is known. The above z-interval procedure works reasonably well even when the variable is not normally distributed and the sample size is small or moderate, provided the variable is not too far from being normally distributed. Thus, we say that the z-interval procedure is robust to moderate violations of the normality assumption.
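For example, qnorm() returns quantiles of the standard normal distribution, so the two-sided critical value for a given α is obtained as follows (shown here for a 95% confidence level, i.e., α = 0.05):

```r
alpha <- 0.05               # significance level for a 95% confidence level
z <- qnorm(1 - alpha / 2)   # upper-tail critical value z_{alpha/2}
z                           # 1.959964
```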
Example: Consider the weights of hockey players in the NHL during the 2017–2018 season, which have mean 173.5 lbs with standard deviation 13.39 lbs (according to the official NHL web data). Now we take a sample of five players from the Washington Capitals:
Player | Weight |
---|---|
Alexander Ovechkin | 236 |
Nicklas Backstrom | 214 |
Jay Beagle | 216 |
Brooks Orpik | 220 |
Dmitry Orlov | 209 |
Therefore, this sample has mean 219 with standard deviation 10.29563. We know that the sample mean \( \overline{x} = 219 \) and the sample variance s2 are unbiased estimators of the population mean μ = 173.5 and the population variance \( \sigma^2 = 13.39^2 \approx 179.2921 , \) respectively. However, the sample standard deviation s is a biased estimator of the corresponding population parameter (in our case, the standard deviation σ of the population).
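These summary statistics are easy to reproduce in R from the weights listed in the table above:

```r
capitals <- c(236, 214, 216, 220, 209)  # Washington Capitals sample
mean(capitals)   # 219
sd(capitals)     # 10.29563
```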
Now we take another sample from the Boston Bruins:
Player | Weight |
---|---|
Brad Marchand | 181 |
Patrice Bergeron | 195 |
David Pastrňák | 188 |
Torey Krug | 186 |
Brandon Carlo | 208 |
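As a sketch of the z-interval formula above, we can compute a 95% confidence interval for this second sample by hand, using the known population standard deviation σ = 13.39 (the interval bounds below are computed here for illustration, not quoted from the NHL data):

```r
bruins <- c(181, 195, 188, 186, 208)  # Boston Bruins sample
sigma  <- 13.39                       # known population standard deviation
n      <- length(bruins)
z      <- qnorm(0.975)                # critical value for 95% confidence
xbar   <- mean(bruins)                # sample mean: 191.6
moe    <- z * sigma / sqrt(n)         # margin of error
c(lower = xbar - moe, upper = xbar + moe)  # approximately (179.86, 203.34)
```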
You can find the confidence interval using R. However, you first need to install a few packages (the last one will be used for proportions).
```r
install.packages("Rmisc", lib = "/data/Rpackages/")
install.packages("lattice", lib = "/data/Rpackages/")
install.packages("plyr", lib = "/data/Rpackages/")
install.packages("PropCIs", lib = "/data/Rpackages/")
```
```r
lizard = c(6.2, 6.6, 7.1, 7.4, 7.6, 7.9, 8, 8.3, 8.4, 8.5, 8.6,
           8.8, 8.8, 9.1, 9.2, 9.4, 9.4, 9.7, 9.9, 10.2, 10.4, 10.8,
           11.3, 11.9)
```

If we use the t.test command listing only the data name, we get a 95% confidence interval for the mean along with the significance test. The simulation below draws 100 samples of size 24 from a normal distribution with mean 9, computes a 95% confidence interval from each, counts how many of the intervals cover the true mean, and plots all the intervals:

```r
n.draw = 100                    # number of simulated samples
mu = 9                          # true population mean
n = 24                          # sample size
SD = sd(lizard)                 # use the lizard data's standard deviation
draws = matrix(rnorm(n.draw * n, mu, SD), n)
get.conf.int = function(x) t.test(x)$conf.int
conf.int = apply(draws, 2, get.conf.int)
# how many of the 100 intervals contain the true mean?
sum(conf.int[1, ] <= mu & conf.int[2, ] >= mu)
plot(range(conf.int), c(0, 1 + n.draw), type = "n",
     xlab = "mean tail length", ylab = "sample run")
for (i in 1:n.draw) lines(conf.int[, i], rep(i, 2), lwd = 2)
abline(v = 9, lwd = 2, lty = 2)
```
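For the lizard data itself, listing only the data name in t.test() prints the test output together with the 95% confidence interval for the mean, which can also be extracted directly from the result:

```r
lizard <- c(6.2, 6.6, 7.1, 7.4, 7.6, 7.9, 8, 8.3, 8.4, 8.5, 8.6,
            8.8, 8.8, 9.1, 9.2, 9.4, 9.4, 9.7, 9.9, 10.2, 10.4, 10.8,
            11.3, 11.9)
ci <- t.test(lizard)$conf.int   # 95% confidence interval for the mean
ci
```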