An R TUTORIAL for Statistics Applications

Part 1 - Section 3: Measures of Location

This chapter covers basic information regarding data visualisation using R.

Section 3: Measures of Location

The most important aspect of studying the distribution of a sample of measurements is locating the position of a central value about which the measurements are distributed. The arithmetic mean (average) of a set of n measurements \( X_1 , X_2 , \ldots , X_n \) is given by the formula

\[ \overline{X} = \frac{1}{n}\, \sum_{i=1}^n X_i = \frac{X_1 + X_2 + \cdots + X_n}{n} . \]

The mean provides a measure of central location for the data. If the data are for a sample (typically the case), the mean is denoted by \( \overline{X} . \) The sample mean is a point estimate of the (typically unknown) population mean for the variable of interest. If the data for the entire population are available, the population mean is computed in the same manner, but denoted either by \( E[X] , \) or by the Greek letter μ.
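
As a minimal sketch of the definition, the following R code computes the mean of a small sample both directly from the formula and with the base R function `mean`. The sample values are hypothetical and serve only as an illustration.

```r
# Hypothetical sample of n = 5 measurements (illustrative values only)
x <- c(2, 4, 4, 5, 10)
n <- length(x)

mean_by_formula <- sum(x) / n   # (X_1 + ... + X_n) / n
mean_builtin    <- mean(x)      # base R equivalent

print(c(formula = mean_by_formula, builtin = mean_builtin))
```

Both values should agree, since `mean` implements the same arithmetic average.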

If the data are organized in a frequency distribution table, the mean can be calculated by the formula

\[ \overline{X} = \frac{1}{n}\, \sum_{i=1}^k n_i X_i , \qquad n = n_1 + n_2 + \cdots + n_k , \]

where \( n_1 , n_2 , \ldots , n_k \) are the frequencies of the distinct values \( X_1 , X_2 , \ldots , X_k \) of the variable.
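
The following sketch illustrates the mean of data given as a frequency table: the weighted sum of the distinct values is divided by the total number of observations \( n = n_1 + \cdots + n_k . \) The vectors `values` and `freq` are hypothetical; `weighted.mean` is the base R function for the same calculation.

```r
# Hypothetical frequency table: distinct values and their frequencies
values <- c(1, 2, 3, 4)   # X_1, ..., X_k
freq   <- c(3, 5, 1, 1)   # n_1, ..., n_k

n <- sum(freq)                                  # total number of observations
mean_freq     <- sum(freq * values) / n         # formula for tabulated data
mean_raw      <- mean(rep(values, times = freq))  # mean of the expanded raw data
mean_weighted <- weighted.mean(values, freq)    # base R shortcut

print(c(mean_freq, mean_raw, mean_weighted))
```

All three results should coincide, because expanding the table back into raw data gives the same sample.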

Elementary properties of the arithmetic mean:

the sum of deviations between the values and the mean is equal to zero:
\[ \sum_{i=1}^n \left( X_i - \overline{X} \right) =0 ; \]
if the variable is constant then the mean is equal to this constant:
\[ \frac{1}{n}\, \sum_{i=1}^n c = c; \]
if we add a constant c to the values of the variable, then
\[ \frac{1}{n}\, \sum_{i=1}^n \left( X_i + c \right) = c + \overline{X} ; \]
if we multiply the values of the variable by a constant c, then
\[ \frac{1}{n}\, \sum_{i=1}^n c\cdot X_i = c \cdot \overline{X} . \]
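
The four properties listed above can be checked numerically. This is only a sketch on a hypothetical sample `x` and a hypothetical constant `c0`; it illustrates the identities and does not prove them.

```r
# Hypothetical sample and constant (illustrative values only)
x  <- c(3, 7, 8, 12)
c0 <- 5
xbar <- mean(x)

dev_sum    <- sum(x - xbar)               # deviations from the mean sum to zero
const_mean <- mean(rep(c0, length(x)))    # mean of a constant is that constant
shift_mean <- mean(x + c0)                # adding c0 shifts the mean by c0
scale_mean <- mean(c0 * x)                # multiplying by c0 scales the mean by c0

print(c(dev_sum, const_mean, shift_mean - xbar, scale_mean / xbar))
```

Because of floating-point rounding, the sum of deviations may differ from zero by a tiny amount for other samples.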

The harmonic mean of a set of n measurements \( X_1 , X_2 , \ldots , X_n \) is defined by the formula

\[ \overline{X}_H = \frac{n}{\sum_{i=1}^n X_i^{-1}} . \]

In many situations involving rates and ratios, for example the average speed over several stretches of equal length, the harmonic mean is the appropriate average.
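
A small sketch of such a rate problem: a car covers two stretches of equal length at 40 km/h and 60 km/h. The speeds are hypothetical, and `harmonic_mean` is a helper defined here for illustration (base R has no function of that name).

```r
# Helper defined for this example; not part of base R
harmonic_mean <- function(x) {
  stopifnot(all(x > 0))        # reciprocals require nonzero (here: positive) values
  length(x) / sum(1 / x)       # n / (1/X_1 + ... + 1/X_n)
}

speeds <- c(40, 60)            # km/h on two stretches of equal length
h <- harmonic_mean(speeds)     # average speed over the whole trip
a <- mean(speeds)              # arithmetic mean, for comparison

print(c(harmonic = h, arithmetic = a))
```

For a stretch length d, the total time is d/40 + d/60 = d/24, so the average speed over the distance 2d is 48 km/h. This is the harmonic mean, whereas the arithmetic mean gives 50 km/h.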

The geometric mean of a set of n measurements \( X_1 , X_2 , \ldots , X_n \) is defined by the formula

\[ \overline{X}_G = \left( X_1 \cdot X_2 \cdot \cdots \cdot X_n \right)^{1/n} = \sqrt[n]{X_1 \cdot X_2 \cdot \cdots \cdot X_n} . \]

The geometric mean may be more appropriate than the arithmetic mean for describing percentage growth.

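Suppose an apple tree yields 50 apples one year, then 60, 80 and 95 in the following years, so the year-over-year growth is 20 %, 33.33 % and 18.75 %. Using the arithmetic mean, we obtain an average growth of about 24.03 % (20 % + 33.33 % + 18.75 % divided by 3). However, if we start with 50 apples and let the yield grow by 24.03 % for three years, the result is about 95.4 apples, not 95. The geometric mean of the growth factors 1.2, 1.3333 and 1.1875 is \( \sqrt[3]{1.9} \approx 1.2386 , \) that is, an average growth of about 23.86 %, which reproduces the final yield of 95 apples exactly.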
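
The yields 50, 60, 80 and 95 from the example can be examined in R. The sketch below derives the year-over-year growth factors from the yields and compares the final yield implied by the arithmetic and by the geometric mean of these factors. All variable names are chosen here for illustration.

```r
yield <- c(50, 60, 80, 95)                       # yields from the example

# Year-over-year growth factors: 60/50, 80/60, 95/80
factors <- yield[-1] / yield[-length(yield)]
k <- length(factors)

arith_rate <- mean(factors) - 1                  # arithmetic mean of the growth rates
geom_rate  <- prod(factors)^(1 / k) - 1          # geometric mean of the growth factors

# Final yield after k years of constant growth at each average rate
final_arith <- yield[1] * (1 + arith_rate)^k
final_geom  <- yield[1] * (1 + geom_rate)^k

print(round(c(arith_rate, geom_rate, final_arith, final_geom), 4))
```

Only the geometric mean returns the observed final yield, because the product of the growth factors equals 95/50.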

Example: Calculate the arithmetic, harmonic and geometric mean of the first 10 Fibonacci numbers, \( F_{n+2} = F_{n+1} + F_n , \quad F_0 =0, \ F_1 =1 . \)
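
One possible solution is sketched below. It assumes that the "first 10 Fibonacci numbers" are \( F_1 , \ldots , F_{10} , \) that is, 1, 1, 2, ..., 55; this is an assumption of the sketch, made because including \( F_0 = 0 \) would leave the harmonic mean undefined and force the geometric mean to zero.

```r
# Assumption: start at F_1 = 1 so that all values are positive
fib <- numeric(10)
fib[1:2] <- 1
for (i in 3:10) fib[i] <- fib[i - 1] + fib[i - 2]

arith <- mean(fib)                    # arithmetic mean
harm  <- length(fib) / sum(1 / fib)   # harmonic mean
geom  <- exp(mean(log(fib)))          # geometric mean, computed via logarithms

print(fib)
print(c(arithmetic = arith, harmonic = harm, geometric = geom))
```

Computing the geometric mean through logarithms avoids overflow of the product for longer sequences. For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean.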

The quantile \( x_p \) is the value of the variable such that 100p % of the values of the ordered sample (or population) are smaller than or equal to \( x_p \) and 100(1−p) % of the values of the ordered sample (or population) are larger than or equal to \( x_p . \)
The quantile is not uniquely defined.

There are three possible methods of calculating quantiles.

Sort the data in ascending order and denote by \( \langle x_i \rangle \) the value with the sequential index i. Find the sequential index \( i_p \) of the quantile \( x_p \) that satisfies the inequalities
\[ n\, p < i_p < n\,p +1 . \]
The quantile \( x_p \) is then equal to the value of the variable with the sequential index \( i_p : \ x_p = \langle x_{i_p} \rangle . \) If np and np+1 are integers, we calculate the quantile as the arithmetic mean of \( \langle x_{np} \rangle \) and \( \langle x_{np+1} \rangle : \)
\[ x_p = \frac{1}{2} \left( \langle x_{np} \rangle + \langle x_{np+1} \rangle \right) . \]
According to MATLAB, we calculate
\[ \overline{i_p} = \frac{np+np+1}{2} = \frac{2np+1}{2} , \]
which determines the location of the quantile. Using linear interpolation we get
\[ x_p = \langle x_{\lfloor \overline{i_p} \rfloor} \rangle + \left( \langle x_{\lfloor \overline{i_p} \rfloor + 1} \rangle - \langle x_{\lfloor \overline{i_p} \rfloor} \rangle \right) \left( \overline{i_p} - \lfloor \overline{i_p} \rfloor \right) , \]
where \( \lfloor \cdot \rfloor \) denotes the integer part of the number, called the floor. If \( \overline{i_p} < 1 , \) then \( x_p = \langle x_{1} \rangle ; \) if \( \overline{i_p} > n, \) then \( x_p = \langle x_{n} \rangle . \)
According to Excel, we assign the values
\[ 0, \frac{1}{n-1} , \frac{2}{n-1} , \ldots , \frac{n-2}{n-1} , 1 \]
to the data sorted in ascending order. If p is equal to a multiple of \( \frac{1}{n-1} , \) the quantile \( x_p \) is equal to the value corresponding to the given multiple. If p is not a multiple of \( \frac{1}{n-1} , \) linear interpolation between the two neighbouring values is used.

The k-th percentile of an observation variable is the value that cuts off the first k percent of the data values when they are sorted in ascending order.

The median of an observation variable is the value at the middle when the data is sorted in ascending order. It is an ordinal measure of the central location of the data values.

We apply the median function to compute the median value of eruptions.
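
A minimal sketch of this step, assuming that "eruptions" refers to the `eruptions` column of the `faithful` data set that ships with R (the text does not say where the variable comes from):

```r
# Assumption: eruptions is the eruption duration column of the built-in faithful data
duration <- faithful$eruptions

med <- median(duration)   # value in the middle of the sorted durations
print(med)
```

The median coincides with the 0.5 quantile, so `quantile(duration, 0.5)` with the default settings should return the same value.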

The mode \( \hat{X} \) is the value of the variable with the highest frequency. In the case of a continuous variable, the mode is the value where the histogram reaches its peak.
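
Base R has no function for the statistical mode (the function `mode` returns the storage type of an object), so the sketch below finds it from a frequency table. The sample `x` is hypothetical, and the code covers only the discrete case.

```r
x <- c(3, 5, 5, 2, 5, 3, 8)            # hypothetical discrete sample

counts <- table(x)                     # frequency of each distinct value
# Keep every value that attains the highest frequency (there may be ties)
mode_values <- as.numeric(names(counts)[counts == max(counts)])

print(counts)
print(mode_values)
```

If several values share the highest frequency, `mode_values` contains all of them.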