Linear Regression
The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples from the same population of a given sample size.
Formally, if we are given the joint distribution of two random variables X and Y, and X is known to take on the value x, the basic problem of bivariate regression is that of determining the conditional mean \( \mu_{Y|x} , \) that is, the "average" value of Y for the given value of X.
If f(x, y) is the value of the joint density of two variables X and Y at (x, y), the problem of bivariate regression is simply that of determining the conditional density of Y given X = x and then evaluating the integral
\[ \mu_{Y|x} = E\left[ Y\, |\,x \right] = \int_{-\infty}^{\infty} y\cdot f(y|x)\, {\text d}y . \] The resulting equation is called the regression equation of Y on X. Alternatively, we might be interested in the regression equation \[ \mu_{X|y} = E\left[ X\, |\,y \right] = \int_{-\infty}^{\infty} x\cdot f(x|y) \, {\text d}x . \] In the discrete case, when we are dealing with probability distributions instead of probability densities, the integrals in the two regression equations given above are simply replaced by sums.

The term regression was coined by the English statistician Francis Galton (1822--1911) in the nineteenth century to describe a biological phenomenon: the heights of descendants of tall ancestors tend to regress down towards the overall average (a phenomenon also known as regression toward the mean).
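As an illustration not drawn from the text above: if (X, Y) is bivariate normal with means μX, μY, standard deviations σX, σY, and correlation ρ, evaluating this integral with the corresponding conditional density yields a linear regression function,
\[ \mu_{Y|x} = \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\left( x - \mu_X \right) . \]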
Galton produced over 340 papers and books. He also created the statistical concept of correlation and widely promoted regression toward the mean. He was the first to apply statistical methods to the study of human differences and the inheritance of intelligence, and introduced the use of questionnaires and surveys for collecting data on human communities, which he needed for genealogical and biographical works and for his anthropometric studies. As an investigator of the human mind, he founded psychometrics (the science of measuring mental faculties) and differential psychology, and proposed the lexical hypothesis of personality. He devised a method for classifying fingerprints that proved useful in forensic science.
Here are some examples of statistical relationships involving regressions:
- Distance driven against driving time.
- Household income and food spending.
- Students' IQ scores and their future salaries.
Example: We will use the cars dataset that comes with R by default. cars is a standard built-in dataset, which makes it convenient to demonstrate linear regression in a simple and easy to understand fashion. You can access this dataset simply by typing cars in your R console. You will find that it consists of 50 observations (rows) and 2 variables (columns), dist and speed. Let's print out the first six observations here.
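A minimal way to do this, using only base R:

```r
head(cars)   # first six observations of the built-in cars data frame
str(cars)    # structure: 50 observations of 2 variables, speed and dist
```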
The aim is to build a simple regression model that we can use to predict distance (dist) by establishing a statistically significant linear relationship with speed (speed). But before jumping into the syntax, let's try to understand these variables graphically. Typically, for each of the independent variables (predictors), the following plots are drawn to visualize their behavior:
- Scatter plot: To visualize the linear relationship between the predictor and the response.
- Box plot: To spot any outlier observations in the variable. Having outliers in your predictor can drastically affect the predictions, as they can easily affect the direction/slope of the line of best fit.
- Density plot: To see the distribution of the predictor variable. Ideally, a distribution close to normal (a bell-shaped curve), not skewed to the left or right, is preferred.
Let us see how to make each one of them.
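Here is one possible sketch of these three plots for the cars data, using base R graphics (the plot titles and panel layout are arbitrary choices):

```r
# Scatter plot with a smoothed trend line: dist against speed
scatter.smooth(x = cars$speed, y = cars$dist, main = "dist ~ speed")

# Box plots to spot outliers in each variable
par(mfrow = c(1, 2))                 # two panels side by side
boxplot(cars$speed, main = "speed")
boxplot(cars$dist,  main = "dist")

# Density plots to check the shape of each distribution
plot(density(cars$speed), main = "Density of speed")
plot(density(cars$dist),  main = "Density of dist")

par(mfrow = c(1, 1))                 # restore the default single-panel layout
```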
Generally, any data point that lies more than 1.5 times the interquartile range (1.5 * IQR) below the 25th percentile or above the 75th percentile is considered an outlier, where the IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.
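A quick sketch of this rule in R (boxplot.stats uses the same 1.5 * IQR convention):

```r
# Flag outliers in dist using the 1.5 * IQR rule
q     <- quantile(cars$dist, probs = c(0.25, 0.75))
iqr   <- IQR(cars$dist)
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
cars$dist[cars$dist < lower | cars$dist > upper]   # values flagged as outliers

# Equivalent shortcut: the points a box plot would draw separately
boxplot.stats(cars$dist)$out
```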
Before executing the R code below, please install the package "e1071" (which provides the skewness function) from https://cran.r-project.org/web/packages/e1071/index.html
The correlation coefficient measures the strength of the linear relationship between the predictor and the response; it ranges from -1 to +1. A value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of the variation of the response variable (Y) is unexplained by the predictor (X), in which case we should probably look for better explanatory variables.
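A sketch of these checks for the cars data; skewness comes from the e1071 package mentioned above:

```r
# install.packages("e1071")   # uncomment if the package is not installed yet
library(e1071)

skewness(cars$speed)          # values near 0 suggest a roughly symmetric distribution
skewness(cars$dist)           # positive values indicate right skew

cor(cars$speed, cars$dist)    # sample correlation between speed and dist
```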
The earliest form of regression was the method of least squares, which was published by the French mathematician Adrien-Marie Legendre (1752--1833) in 1805, and by the German mathematician Johann Carl Friedrich Gauss (1777--1855) in 1809. Gauss applied the method to the problem of determining, from astronomical observations, orbits of bodies about the Sun, while Legendre came to the same method from geodesy and metrology.
Adrien-Marie Legendre was a French mathematician. Legendre made numerous contributions to mathematics. Well-known and important concepts such as the Legendre polynomials and Legendre transformation are named after him. Adrien-Marie Legendre was born in Paris to a wealthy family. Adrien-Marie lost his private fortune in 1793 during the French Revolution.
Johann Carl Friedrich Gauss was a German mathematician who made significant contributions to many fields, including number theory, algebra, statistics, analysis, differential geometry, geodesy, geophysics, mechanics, electrostatics, magnetic fields, astronomy, matrix theory, and optics. Carl was born to poor, working-class parents from Lower Saxony (Germany). Gauss was a child prodigy. A contested story relates that, when he was eight, he figured out how to add up all the numbers from 1 to 100.
A random error term ε is included in the model to represent the following two phenomena.
- Missing or omitted variables. The random error term ε is included to capture the effect of all those missing or omitted variables that have not been included in the model.
- Random variation. Human behavior is unpredictable.
In the model \( y = \alpha + \beta \,x + \varepsilon ,\) α and β are the population parameters. The regression line obtained for this model by using the population data is called the population regression line. The values of α and β are called the true values of the y-intercept and slope.
However, population data are difficult to obtain. As a result, we almost always use sample data to estimate the linear regression model \( y = \alpha + \beta \,x + \varepsilon . \) The values of the y-intercept and slope calculated from sample data on x and y are called the estimated values of α and β and are denoted by a and b. The estimated regression line is then
\[ \hat{y} = a+b\,x , \] where ŷ (read as "y hat") is the estimated or predicted value of y for a given value of x. The above equation is called the estimated regression model; it gives the regression of y on x.

When n observations have been made, we get a table of independent xi and dependent yi values that can be used for estimation of the regression model. The least squares method uses the sample data to determine the coefficients of the linear regression line in such a way that it minimizes the sum of the squares of the deviations between the observed values and the predicted values. Therefore, the least squares method minimizes the following sum
\[ \min \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 = \min \sum_{i=1}^n \left( y_i - a - b\,x_i \right)^2 , \] where
- yi = observed value of the dependent variable for the i-th observation;
- ŷi = predicted value of the dependent variable for the i-th observation;
- n = total number of observations.
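Minimizing this sum (by setting the partial derivatives with respect to a and b equal to zero) gives the standard closed-form least squares estimates, recorded here for reference:
\[ b = \frac{\sum_{i=1}^n \left( x_i - \overline{x} \right)\left( y_i - \overline{y} \right)}{\sum_{i=1}^n \left( x_i - \overline{x} \right)^2} , \qquad a = \overline{y} - b\,\overline{x} , \] where \( \overline{x} \) and \( \overline{y} \) are the sample means of the xi and yi.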
Example: we return to our cars data.
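A minimal sketch of the least squares fit with R's built-in lm function (the formula dist ~ speed regresses stopping distance on speed):

```r
# Fit the simple linear regression dist = a + b * speed by least squares
fit <- lm(dist ~ speed, data = cars)

coef(fit)      # a (intercept) and b (slope)
summary(fit)   # coefficients, standard errors, R-squared, etc.
```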
We start with the important quantity, known as the sum of squares due to error, which is denoted by SSE:
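Assuming fit is the lm object from the sketch above, SSE is simply the sum of squared residuals:

```r
# SSE: sum of squared deviations of observed from predicted values
SSE <- sum(residuals(fit)^2)
SSE

# Equivalent computation from the fitted values
sum((cars$dist - fitted(fit))^2)
```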
Example: Suppose there are only eight students in an advanced statistics class and the midterm scores of these students are
\[ 70\quad 76 \quad 80 \quad 82 \quad 88 \quad 88 \quad 91 \quad 95 \] Let x denote the score of a student in this class. Each score except 88 has relative frequency 1/8, while the score 88 (which appears twice) has relative frequency 1/4. These frequencies give the population probability distribution. Now we use R to calculate the population parameters.

Now we take a sample of four scores: 80, 82, 88, 95, and calculate sample statistics:
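A sketch of both computations in R (the population standard deviation divides by N = 8 rather than N - 1, so it is computed directly instead of with sd):

```r
scores <- c(70, 76, 80, 82, 88, 88, 91, 95)   # the whole population

# Population parameters
mu    <- mean(scores)                         # population mean
sigma <- sqrt(mean((scores - mu)^2))          # population standard deviation (divide by N)
mu; sigma

# Statistics for the sample 80, 82, 88, 95
smpl <- c(80, 82, 88, 95)
mean(smpl)   # sample mean
sd(smpl)     # sample standard deviation (divides by n - 1)
```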
If we take another sample: 70, 80, 91, 95, the results will be completely different: \( \overline{x} = 84 \) and \( s \approx 11.28421 . \) So we see that the sample mean is a random variable that depends on the sample chosen. The total number of samples of size four at our disposal is \[ \binom{8}{4} = \frac{8^{\underline{4}}}{4!} = \frac{8 \cdot 7 \cdot 6 \cdot 5}{1 \cdot 2 \cdot 3 \cdot 4} = 70. \]
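We check the answer with R; choose is the built-in binomial-coefficient function:

```r
choose(8, 4)   # number of ways to pick 4 scores out of 8; should equal 70
```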
From statistical theory we know that, for normal samples, s2 has the smallest variance among all unbiased estimators of σ2, and so it is natural to wonder how much precision of estimation is lost by basing an estimate of σ2 on the sample range R instead of on s2.
Example: Let \( X_1, X_2 , \ldots , X_8 \) be a random sample from NORM(100, 8). The R script below simulates the sample range R for m = 100,000 such samples in order to learn about the distribution of R.
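A minimal simulation sketch; the sample size 8, the NORM(100, 8) parameters, and m = 100,000 come from the statement above, while the seed and the summaries shown are arbitrary choices:

```r
set.seed(1)                     # for reproducibility (an arbitrary choice)
m <- 100000                     # number of simulated samples
ranges <- replicate(m, diff(range(rnorm(8, mean = 100, sd = 8))))

mean(ranges)                    # average sample range
sd(ranges)                      # spread of the sample range
hist(ranges, breaks = 50, main = "Simulated distribution of the sample range R")
```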
Let Z be a standard normal random variable (with mean 0 and standard deviation 1). Then Z2 has a special distribution, which is usually referred to as the chi-square distribution. It is often denoted by χ2, where χ is the lowercase Greek letter chi. A random variable X has the chi-square distribution with ν degrees of freedom if its probability density is given by
\[ f(x) = \begin{cases} \dfrac{1}{2^{\nu /2} \Gamma (\nu /2)} \, x^{\nu /2 -1} \, e^{-x/2} , & \quad\mbox{for } x>0, \\ 0, & \quad\mbox{elsewhere}. \end{cases} \]

The mean and the variance of the chi-square distribution with ν degrees of freedom are ν and 2ν, respectively.
We build an example of a chi-square distribution from three standard normal random variables.
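A sketch of this construction (the number of simulated values, 10,000, is an arbitrary choice):

```r
set.seed(1)
n  <- 10000
z1 <- rnorm(n); z2 <- rnorm(n); z3 <- rnorm(n)   # three independent standard normals
chisq3 <- z1^2 + z2^2 + z3^2                     # behaves as chi-square with 3 degrees of freedom

hist(chisq3, breaks = 60, freq = FALSE,
     main = "Sum of three squared standard normals")
curve(dchisq(x, df = 3), add = TRUE, lwd = 2)    # theoretical chi-square density, 3 df
```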
You can compare chi-square distributions with three and four degrees of freedom:
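One possible way to overlay the two theoretical densities (the plotting range is an arbitrary choice):

```r
curve(dchisq(x, df = 3), from = 0, to = 15, lwd = 2,
      ylab = "density", main = "Chi-square densities: 3 vs 4 degrees of freedom")
curve(dchisq(x, df = 4), add = TRUE, lwd = 2, lty = 2)
legend("topright", legend = c("3 df", "4 df"), lwd = 2, lty = c(1, 2))
```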
We can generate a chi-square distribution with, say, 20 degrees of freedom directly using the following R script:
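A sketch using the built-in generator rchisq (the number of draws is an arbitrary choice):

```r
set.seed(1)
y <- rchisq(10000, df = 20)                      # draw directly from chi-square(20)

hist(y, breaks = 60, freq = FALSE, main = "Chi-square with 20 degrees of freedom")
curve(dchisq(x, df = 20), add = TRUE, lwd = 2)   # theoretical density for comparison
```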