Part 1 - Section 3: Measures of Location
This chapter covers basic information regarding data visualisation using R.
The most important aspect of studying the distribution of a sample of measurements is locating a central value about which the measurements are distributed. The arithmetic mean (average) of a set of n measurements \( X_1 , X_2 , \ldots , X_n \) is given by the formula
\[ \overline{X} = \frac{1}{n}\, \sum_{i=1}^n X_i . \]
The mean provides a measure of central location for the data. If the data are for a sample (typically the case), the mean is denoted by \( \overline{X} . \) The sample mean is a point estimate of the (typically unknown) population mean for the variable of interest. If the data for the entire population are available, the population mean is computed in the same manner, but denoted either by \( E[X] , \) or by the Greek letter μ.
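If the data are organized in a frequency distribution table, in which the distinct values \( X_1 , X_2 , \ldots , X_k \) occur with frequencies \( n_1 , n_2 , \ldots , n_k , \) then we can calculate the mean by the formula
\[ \overline{X} = \frac{1}{n}\, \sum_{i=1}^k n_i\, X_i , \qquad n = \sum_{i=1}^k n_i . \]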
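The frequency-table calculation can be sketched in Python as follows. This is a minimal illustration that assumes the usual weighted form, in which each distinct value is multiplied by its frequency and the total is divided by the number of observations. The sample and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical sample, used only for illustration.
sample = [2, 3, 3, 5, 5, 5, 7]

# Frequency table: distinct value -> frequency.
freq = Counter(sample)

# Total number of observations is the sum of the frequencies.
n = sum(freq.values())

# Mean from the frequency table: sum of (frequency * value) divided by n.
mean_from_table = sum(count * value for value, count in freq.items()) / n

# Mean computed directly from the raw data, for comparison.
mean_direct = sum(sample) / len(sample)

print(mean_from_table, mean_direct)
```

Both routes give the same number, because the table only groups equal terms of the same sum.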
Elementary properties of the arithmetic mean:
- the sum of the deviations of the values from the mean is equal to zero:
\[ \sum_{i=1}^n \left( X_i - \overline{X} \right) =0 ; \]
- if the variable is constant, then the mean is equal to this constant:
\[ \frac{1}{n}\, \sum_{i=1}^n c = c; \]
- if we add a constant c to the values of the variable, then the mean increases by c:
\[ \frac{1}{n}\, \sum_{i=1}^n \left( X_i + c \right) = c + \overline{X} ; \]
- if we multiply the values of the variable by a constant c, then the mean is multiplied by c:
\[ \frac{1}{n}\, \sum_{i=1}^n c\, X_i = c\, \overline{X} . \]
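The properties listed above can be checked numerically. The sketch below uses exact fractions so that the comparisons are not affected by rounding. The data, the constant and the helper `mean` are hypothetical, and the scaling property is taken in its standard form (the mean of c·X equals c times the mean of X).

```python
from fractions import Fraction

def mean(values):
    """Arithmetic mean as an exact fraction."""
    values = list(values)
    return sum(values, Fraction(0)) / len(values)

# Hypothetical data and constant, chosen only for illustration.
xs = [Fraction(v) for v in (4, 7, 1, 9, 4)]
c = Fraction(3)
m = mean(xs)

deviation_sum = sum(x - m for x in xs)   # deviations from the mean sum to zero
constant_mean = mean([c] * len(xs))      # mean of a constant is the constant
shifted_mean = mean(x + c for x in xs)   # adding c shifts the mean by c
scaled_mean = mean(c * x for x in xs)    # multiplying by c scales the mean by c

print(deviation_sum, constant_mean, shifted_mean, scaled_mean)
```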
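The harmonic mean of a set of n positive measurements \( X_1 , X_2 , \ldots , X_n \) is defined by the formula
\[ \overline{X}_H = \frac{n}{\sum_{i=1}^n \frac{1}{X_i}} . \]
The geometric mean of a set of n positive measurements \( X_1 , X_2 , \ldots , X_n \) is defined by the formula
\[ \overline{X}_G = \sqrt[n]{X_1\, X_2 \cdots X_n} . \]
The geometric mean may be more appropriate than the arithmetic mean for describing percentage growth.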
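Suppose an apple tree yields 50 apples one year, then 60, 80 and 95 in the following years, so the year-on-year growth is 20 %, 33.33 % and 18.75 %. Using the arithmetic mean, we calculate an average growth of 24.03 % (20 % + 33.33 % + 18.75 % divided by 3). However, if we start with 50 apples and let the yield grow by 24.03 % for three years, the result is about 95.4 apples, not 95. The geometric mean of the growth factors 1.2, 1.3333 and 1.1875 is \( \sqrt[3]{1.9} \approx 1.2386 , \) that is, an average growth of 23.86 % per year, which reproduces the final yield of 95 apples exactly.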
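The comparison can be recomputed directly from the yields 50, 60, 80 and 95. The sketch below derives the year-on-year growth factors from consecutive yields, then compounds the starting yield with the arithmetic and with the geometric mean of those factors. The variable names are illustrative only.

```python
import math

yields = [50, 60, 80, 95]  # yields taken from the example

# Year-on-year growth factors: 60/50, 80/60, 95/80.
factors = [later / earlier for earlier, later in zip(yields, yields[1:])]
years = len(factors)

# Average growth factor computed two ways.
arithmetic_factor = sum(factors) / years
geometric_factor = math.prod(factors) ** (1 / years)

# Compound the starting yield with each average factor.
final_arithmetic = yields[0] * arithmetic_factor ** years
final_geometric = yields[0] * geometric_factor ** years

print(f"arithmetic mean growth: {100 * (arithmetic_factor - 1):.2f} % -> {final_arithmetic:.2f}")
print(f"geometric mean growth:  {100 * (geometric_factor - 1):.2f} % -> {final_geometric:.2f}")
```

Only the geometric mean returns the observed final yield, because growth factors multiply rather than add.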
Example: Calculate the arithmetic, harmonic and geometric means of the first 10 positive Fibonacci numbers \( F_1 , F_2 , \ldots , F_{10} , \) where \( F_{n+2} = F_{n+1} + F_n , \quad F_0 =0, \ F_1 =1 . \)
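One way to work this example is with Python's standard `statistics` module. The sketch assumes that the ten numbers are \( F_1 , \ldots , F_{10} \) (that is, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55), since including \( F_0 = 0 \) would leave the harmonic mean undefined. The helper `fibonacci` is a hypothetical name introduced here.

```python
import statistics

def fibonacci(count):
    """Return the list F_1, ..., F_count with F_1 = F_2 = 1."""
    numbers = []
    a, b = 1, 1
    for _ in range(count):
        numbers.append(a)
        a, b = b, a + b
    return numbers

fib = fibonacci(10)

arithmetic = statistics.mean(fib)            # sum divided by the count
harmonic = statistics.harmonic_mean(fib)     # count divided by the sum of reciprocals
geometric = statistics.geometric_mean(fib)   # 10th root of the product

print(f"arithmetic = {arithmetic:.4f}")
print(f"harmonic   = {harmonic:.4f}")
print(f"geometric  = {geometric:.4f}")
```

For positive data that are not all equal, the three values are ordered as harmonic < geometric < arithmetic, which this sample illustrates.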
The quantile \( x_p \) is the value of the variable such that 100p % of the values of the ordered sample (or population) are smaller than or equal to \( x_p \) and 100(1−p) % of the values of the ordered sample (or population) are larger than or equal to \( x_p . \)
The quantile is not uniquely defined.
There are three possible methods of calculating quantiles.
- Sort the data in ascending order. Find the sequential index \( i_p \) of the quantile \( x_p \) that satisfies the inequalities
\[ n\, p < i_p < n\,p +1 . \]
The quantile \( x_p \) is then equal to the value of the variable with the sequential index \( i_p : \)
\[ x_p = \langle x_{i_p} \rangle , \]
where \( \langle x_i \rangle \) denotes the i-th value of the sorted data. If np is an integer, no integer lies strictly between np and np+1, and we calculate the quantile as the arithmetic mean of \( \langle x_{np} \rangle \) and \( \langle x_{np+1} \rangle : \)
\[ x_p = \frac{1}{2} \left( \langle x_{np} \rangle + \langle x_{np+1} \rangle \right) . \]
- According to MATLAB, we calculate
\[ \overline{i_p} = \frac{np+(np+1)}{2} = \frac{2np+1}{2} , \]
which determines the location of the quantile. Using linear interpolation, we get
\[ x_p = \langle x_{\lfloor \overline{i_p} \rfloor} \rangle + \left( \langle x_{\lfloor \overline{i_p} \rfloor +1} \rangle - \langle x_{\lfloor \overline{i_p} \rfloor} \rangle \right) \left( \overline{i_p} - \lfloor \overline{i_p} \rfloor \right) , \]
where \( \lfloor \cdot \rfloor \) denotes the integer part of the number, called the floor. If \( \overline{i_p} < 1 , \) then \( x_p = \langle x_{1} \rangle ; \) if \( \overline{i_p} > n, \) then \( x_p = \langle x_{n} \rangle . \)
- According to Excel, we assign the values
\[ 0, \frac{1}{n-1} , \frac{2}{n-1} , \ldots , \frac{n-2}{n-1} , 1 \]
to the data sorted in ascending order. If p is equal to a multiple of \( \frac{1}{n-1} , \) the quantile \( x_p \) is equal to the data value corresponding to that multiple. If p is not a multiple of \( \frac{1}{n-1} , \) linear interpolation between the two neighbouring values is used.
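The three rules above can be compared side by side. The following Python sketch implements each rule as described in the text. It does not call MATLAB or Excel, and the function names and the sample are hypothetical. A small tolerance decides whether np is an integer, because np is computed in floating point.

```python
import math

def quantile_order_statistic(data, p):
    """Rule 1: order statistic with n*p < i_p < n*p + 1; average when n*p is an integer."""
    xs = sorted(data)
    n = len(xs)
    position = n * p
    nearest = round(position)
    if math.isclose(position, nearest) and 1 <= nearest <= n - 1:
        # n*p is an integer: average the two neighbouring order statistics.
        return (xs[nearest - 1] + xs[nearest]) / 2
    index = min(max(math.ceil(position), 1), n)  # clamp for p = 0 or p = 1
    return xs[index - 1]

def quantile_midpoint(data, p):
    """Rule 2: linear interpolation at position n*p + 1/2 (1-based)."""
    xs = sorted(data)
    n = len(xs)
    position = n * p + 0.5
    if position < 1:
        return xs[0]
    if position >= n:
        return xs[-1]
    lower = math.floor(position)
    return xs[lower - 1] + (xs[lower] - xs[lower - 1]) * (position - lower)

def quantile_inclusive(data, p):
    """Rule 3: sorted values sit at 0, 1/(n-1), ..., 1; interpolate linearly in between."""
    xs = sorted(data)
    n = len(xs)
    position = (n - 1) * p  # 0-based fractional index
    lower = math.floor(position)
    if lower >= n - 1:
        return xs[-1]
    return xs[lower] + (xs[lower + 1] - xs[lower]) * (position - lower)

# Hypothetical sample, used only for illustration.
data = [1, 3, 4, 8, 10, 15, 21, 30]
for p in (0.25, 0.3, 0.5):
    print(p, quantile_order_statistic(data, p),
          quantile_midpoint(data, p), quantile_inclusive(data, p))
```

On this sample the rules agree for the median but differ for other values of p, which illustrates that the quantile is not uniquely defined.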
The k-th percentile of an observation variable is the value that cuts off the first k percent of the data values when they are sorted in ascending order; it is the quantile \( x_p \) with p = k/100.
The median of an observation variable is the value at the middle when the data are sorted in ascending order. It is an ordinal measure of the central location of the data values. We apply the median function to compute the median value of eruptions.
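As an illustration of the definition, the sketch below computes a median with `statistics.median` from Python's standard library. This is a stand-in chosen for this example, not the median function the text refers to, and the two samples are hypothetical values rather than the eruptions data.

```python
import statistics

# Hypothetical samples, not the eruptions data referred to in the text.
odd_sample = [3.6, 1.8, 3.3, 2.3, 4.5]
even_sample = [3.6, 1.8, 3.3, 2.3, 4.5, 2.9]

# Odd number of values: the middle value of the sorted data.
median_odd = statistics.median(odd_sample)

# Even number of values: the mean of the two middle values.
median_even = statistics.median(even_sample)

print(median_odd, median_even)
```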
The mode \( \hat{X} \) is the value of the variable with the highest frequency. For a continuous variable, the mode is the value at which the histogram reaches its peak.
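For discrete data the mode can be read off a frequency count. The sketch below uses `statistics.mode` and `statistics.multimode` from Python's standard library on hypothetical samples. The second call shows that a sample may have more than one mode.

```python
import statistics

# Hypothetical discrete samples, used only for illustration.
sample = [2, 3, 3, 5, 5, 5, 7]
tied_sample = [1, 1, 2, 2, 3]

# Single most frequent value.
mode_value = statistics.mode(sample)

# Every value tied for the highest frequency, in order of first appearance.
all_modes = statistics.multimode(tied_sample)

print(mode_value, all_modes)
```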