Linear Regression
The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples from the same population of a given sample size.
Formally, if we are given the joint distribution of two random variables X and Y, and X is known to take on the value x, the basic problem of bivariate regression is that of determining the conditional mean \( \mu_{Y|x} , \) that is, the "average" value of Y for the given value of X.
If f(x, y) is the value of the joint density of two variables X and Y at (x, y), the problem of bivariate regression is simply that of determining the conditional density of Y given X = x and then evaluating the integral
\[ \mu_{Y|x} = E\left[ Y\, |\,x \right] = \int_{-\infty}^{\infty} y\cdot f(y|x)\, {\text d}y . \] The resulting equation is called the regression equation of Y on X. Alternatively, we might be interested in the regression equation \[ \mu_{X|y} = E\left[ X\, |\,y \right] = \int_{-\infty}^{\infty} x\cdot f(x|y) \, {\text d}x . \] In the discrete case, when we are dealing with probability distributions instead of probability densities, the integrals in the two regression equations given above are simply replaced by sums.

The term regression was coined by the English statistician Francis Galton (1822--1911) in the nineteenth century to describe a biological phenomenon: the heights of descendants of tall ancestors tend to regress down towards the overall average (a phenomenon also known as regression toward the mean).
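As an illustration not drawn from the text above: if (X, Y) is bivariate normal with means μX, μY, standard deviations σX, σY, and correlation ρ, evaluating this integral with the corresponding conditional density yields a linear regression function,
\[ \mu_{Y|x} = \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\left( x - \mu_X \right) . \]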
Galton produced over 340 papers and books. He also created the statistical concept of correlation and widely promoted regression toward the mean. He was the first to apply statistical methods to the study of human differences and the inheritance of intelligence, and introduced the use of questionnaires and surveys for collecting data on human communities, which he needed for genealogical and biographical works and for his anthropometric studies. As an investigator of the human mind, he founded psychometrics (the science of measuring mental faculties) and differential psychology, and proposed the lexical hypothesis of personality. He devised a method for classifying fingerprints that proved useful in forensic science.
Here are some examples of statistical relationships involving regressions:
- Distance driven against driving time.
- Household income and food spending.
- Students' IQ scores and their future salaries.
Example: We will use the cars dataset that comes with R by default. cars is a standard built-in dataset, which makes it convenient to demonstrate linear regression in a simple and easy to understand fashion. You can access this dataset simply by typing cars in your R console. You will find that it consists of 50 observations (rows) and 2 variables (columns), dist and speed. Let's print out the first six observations here.
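A minimal way to do this, using only base R:

```r
head(cars)   # first six observations of the built-in cars data frame
str(cars)    # structure: 50 observations of 2 variables, speed and dist
```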
The aim is to build a simple regression model that we can use to predict distance (dist) by establishing a statistically significant linear relationship with speed (speed). But before jumping into the syntax, let's try to understand these variables graphically. Typically, for each of the independent variables (predictors), the following plots are drawn to visualize their behavior:
- Scatter plot: To visualize the linear relationship between the predictor and the response.
- Box plot: To spot any outlier observations in the variable. Having outliers in your predictor can drastically affect the predictions, as they can easily affect the direction/slope of the line of best fit.
- Density plot: To see the distribution of the predictor variable. Ideally, a distribution close to normal (a bell-shaped curve), not skewed to the left or right, is preferred.
Let us see how to make each one of them.
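Here is one possible sketch of these three plots for the cars data, using base R graphics (the plot titles and panel layout are arbitrary choices):

```r
# Scatter plot with a smoothed trend line: dist against speed
scatter.smooth(x = cars$speed, y = cars$dist, main = "dist ~ speed")

# Box plots to spot outliers in each variable
par(mfrow = c(1, 2))                 # two panels side by side
boxplot(cars$speed, main = "speed")
boxplot(cars$dist,  main = "dist")

# Density plots to check the shape of each distribution
plot(density(cars$speed), main = "Density of speed")
plot(density(cars$dist),  main = "Density of dist")

par(mfrow = c(1, 1))                 # restore the default single-panel layout
```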
Generally, any data point that lies more than 1.5 times the interquartile range (1.5 * IQR) below the 25th percentile or above the 75th percentile is considered an outlier, where the IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.
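A quick sketch of this rule in R (boxplot.stats uses the same 1.5 * IQR convention):

```r
# Flag outliers in dist using the 1.5 * IQR rule
q     <- quantile(cars$dist, probs = c(0.25, 0.75))
iqr   <- IQR(cars$dist)
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
cars$dist[cars$dist < lower | cars$dist > upper]   # values flagged as outliers

# Equivalent shortcut: the points a box plot would draw separately
boxplot.stats(cars$dist)$out
```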
Before executing the R code below, please install the package "e1071" (which provides the skewness function) from https://cran.r-project.org/web/packages/e1071/index.html
The correlation coefficient measures the strength of the linear relationship between the predictor and the response; it ranges from -1 to +1. A value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of the variation of the response variable (Y) is unexplained by the predictor (X), in which case we should probably look for better explanatory variables.
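A sketch of these checks for the cars data; skewness comes from the e1071 package mentioned above:

```r
# install.packages("e1071")   # uncomment if the package is not installed yet
library(e1071)

skewness(cars$speed)          # values near 0 suggest a roughly symmetric distribution
skewness(cars$dist)           # positive values indicate right skew

cor(cars$speed, cars$dist)    # sample correlation between speed and dist
```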
The earliest form of regression was the method of least squares, which was published by the French mathematician Adrien-Marie Legendre (1752--1833) in 1805, and by the German mathematician Johann Carl Friedrich Gauss (1777--1855) in 1809. Gauss applied the method to the problem of determining, from astronomical observations, orbits of bodies about the Sun, while Legendre came to the same method from geodesy and metrology.
Adrien-Marie Legendre was a French mathematician. Legendre made numerous contributions to mathematics. Well-known and important concepts such as the Legendre polynomials and Legendre transformation are named after him. Adrien-Marie Legendre was born in Paris to a wealthy family. Adrien-Marie lost his private fortune in 1793 during the French Revolution.
Johann Carl Friedrich Gauss was a German mathematician who made significant contributions to many fields, including number theory, algebra, statistics, analysis, differential geometry, geodesy, geophysics, mechanics, electrostatics, magnetic fields, astronomy, matrix theory, and optics. Carl was born to poor, working-class parents from Lower Saxony (Germany). Gauss was a child prodigy. A contested story relates that, when he was eight, he figured out how to add up all the numbers from 1 to 100.
A random error term ε is included in the model to represent the following two phenomena.
- Missing or omitted variables. The random error term ε is included to capture the effect of all those missing or omitted variables that have not been included in the model.
- Random variation. Human behavior is unpredictable.
In the model \( y = \alpha + \beta \,x + \varepsilon ,\) α and β are the population parameters. The regression line obtained for this model by using the population data is called the population regression line. The values of α and β are called the true values of the y-intercept and slope.
However, population data are difficult to obtain. As a result, we almost always use sample data to estimate the linear regression model \( y = \alpha + \beta \,x + \varepsilon . \) The values of the y-intercept and slope calculated from sample data on x and y are called the estimated values of α and β and are denoted by a and b. The estimated regression line is then
\[ \hat{y} = a+b\,x , \] where ŷ (read as "y hat") is the estimated or predicted value of y for a given value of x. The above equation is called the estimated regression model; it gives the regression of y on x.

When n observations have been made, we get a table of independent xi and dependent yi values that can be used for estimation of the regression model. The least squares method uses the sample data to determine the coefficients of the linear regression line in such a way that it minimizes the sum of the squares of the deviations between the observed values and the predicted values. Therefore, the least squares method minimizes the following sum
\[ \min \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 = \min \sum_{i=1}^n \left( y_i - a - b\,x_i \right)^2 , \] where
- yi = observed value of the dependent variable for the i-th observation;
- ŷi = predicted value of the dependent variable for the i-th observation;
- n = total number of observations.
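Minimizing this sum (by setting the partial derivatives with respect to a and b equal to zero) gives the standard closed-form least squares estimates, recorded here for reference:
\[ b = \frac{\sum_{i=1}^n \left( x_i - \overline{x} \right)\left( y_i - \overline{y} \right)}{\sum_{i=1}^n \left( x_i - \overline{x} \right)^2} , \qquad a = \overline{y} - b\,\overline{x} , \] where \( \overline{x} \) and \( \overline{y} \) are the sample means of the xi and yi.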
Example: we return to our cars data.
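A minimal sketch of the least squares fit with R's built-in lm function (the formula dist ~ speed regresses stopping distance on speed):

```r
# Fit the simple linear regression dist = a + b * speed by least squares
fit <- lm(dist ~ speed, data = cars)

coef(fit)      # a (intercept) and b (slope)
summary(fit)   # coefficients, standard errors, R-squared, etc.
```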
We start with the important quantity, known as the sum of squares due to error, which is denoted by SSE:
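Assuming fit is the lm object from the sketch above, SSE is simply the sum of squared residuals:

```r
# SSE: sum of squared deviations of observed from predicted values
SSE <- sum(residuals(fit)^2)
SSE

# Equivalent computation from the fitted values
sum((cars$dist - fitted(fit))^2)
```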
Example: Suppose there are only eight students in an advanced statistics class and the midterm scores of these students are
\[ 70\quad 76 \quad 80 \quad 82 \quad 88 \quad 88 \quad 91 \quad 95 \] Let x denote the score of a student in this class. Each score except 88 has relative frequency 1/8, while the score 88 (which appears twice) has relative frequency 1/4. These frequencies give the population probability distribution. Now we use R to calculate the population parameters.

Now we take a sample of four scores: 80, 82, 88, 95, and calculate sample statistics:
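A sketch of both computations in R (the population standard deviation divides by N = 8 rather than N - 1, so it is computed directly instead of with sd):

```r
scores <- c(70, 76, 80, 82, 88, 88, 91, 95)   # the whole population

# Population parameters
mu    <- mean(scores)                         # population mean
sigma <- sqrt(mean((scores - mu)^2))          # population standard deviation (divide by N)
mu; sigma

# Statistics for the sample 80, 82, 88, 95
smpl <- c(80, 82, 88, 95)
mean(smpl)   # sample mean
sd(smpl)     # sample standard deviation (divides by n - 1)
```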
If we take another sample: 70, 80, 91, 95, the results will be completely different: \( \overline{x} = 84 \) and \( s \approx 11.28421 . \) So we see that the sample mean is a random variable that depends on the sample chosen. The total number of samples of size four at our disposal is \[ \binom{8}{4} = \frac{8^{\underline{4}}}{4!} = \frac{8 \cdot 7 \cdot 6 \cdot 5}{1 \cdot 2 \cdot 3 \cdot 4} = 70. \]
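We check the answer with R; choose is the built-in binomial-coefficient function:

```r
choose(8, 4)   # number of ways to pick 4 scores out of 8; should equal 70
```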
From statistical theory we know that, for normal samples, s2 has the smallest variance among all unbiased estimators of σ2, and so it is natural to wonder how much precision of estimation is lost by basing an estimate of σ2 on the sample range R instead of on s2.
Example: Let \( X_1, X_2 , \ldots , X_8 \) be a random sample from NORM(100, 8). The R script below simulates the sample range R for m = 100,000 such samples in order to learn about the distribution of R.
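A minimal simulation sketch; the sample size 8, the NORM(100, 8) parameters, and m = 100,000 come from the statement above, while the seed and the summaries shown are arbitrary choices:

```r
set.seed(1)                     # for reproducibility (an arbitrary choice)
m <- 100000                     # number of simulated samples
ranges <- replicate(m, diff(range(rnorm(8, mean = 100, sd = 8))))

mean(ranges)                    # average sample range
sd(ranges)                      # spread of the sample range
hist(ranges, breaks = 50, main = "Simulated distribution of the sample range R")
```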
Let Z be a standard normal random variable (with mean 0 and standard deviation 1). Then Z2 has a special distribution, which is usually referred to as the chi-square distribution. It is often denoted by χ2, where χ is the lowercase Greek letter chi. A random variable X has the chi-square distribution with ν degrees of freedom if its probability density is given by
\[ f(x) = \begin{cases} \dfrac{1}{2^{\nu /2} \Gamma (\nu /2)} \, x^{\nu /2 -1} \, e^{-x/2} , & \quad\mbox{for } x>0, \\ 0, & \quad\mbox{elsewhere}. \end{cases} \]

The mean and the variance of the chi-square distribution with ν degrees of freedom are ν and 2ν, respectively.
We build an example of a chi-square distribution from three standard normal random variables.
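A sketch of this construction (the number of simulated values, 10,000, is an arbitrary choice):

```r
set.seed(1)
n  <- 10000
z1 <- rnorm(n); z2 <- rnorm(n); z3 <- rnorm(n)   # three independent standard normals
chisq3 <- z1^2 + z2^2 + z3^2                     # behaves as chi-square with 3 degrees of freedom

hist(chisq3, breaks = 60, freq = FALSE,
     main = "Sum of three squared standard normals")
curve(dchisq(x, df = 3), add = TRUE, lwd = 2)    # theoretical chi-square density, 3 df
```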
You can compare chi-square distributions with three and four degrees of freedom:
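One possible way to overlay the two theoretical densities (the plotting range is an arbitrary choice):

```r
curve(dchisq(x, df = 3), from = 0, to = 15, lwd = 2,
      ylab = "density", main = "Chi-square densities: 3 vs 4 degrees of freedom")
curve(dchisq(x, df = 4), add = TRUE, lwd = 2, lty = 2)
legend("topright", legend = c("3 df", "4 df"), lwd = 2, lty = c(1, 2))
```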
We can generate a chi-square distribution with, say, 20 degrees of freedom directly using the following R script:
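A sketch using the built-in generator rchisq (the number of draws is an arbitrary choice):

```r
set.seed(1)
y <- rchisq(10000, df = 20)                      # draw directly from chi-square(20)

hist(y, breaks = 60, freq = FALSE, main = "Chi-square with 20 degrees of freedom")
curve(dchisq(x, df = 20), add = TRUE, lwd = 2)   # theoretical density for comparison
```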