An R TUTORIAL for Statistics Applications. Part 0: Introduction to R

Part 0: Introduction to R

This chapter covers the basic commands in R.

Section 1: R as a calculator

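There are many built-in functions in R. The trigonometric functions listed below compute the cosine, sine, tangent, arc-cosine, arc-sine, arc-tangent, and the two-argument arc-tangent. Base R has no cot function, so the cotangent is written as 1/tan(x). We list these functions together with a few rounding and logarithm functions.

cos(x)
sin(x)
tan(x)
1/tan(x)    # cotangent; base R has no cot()

acos(x)
asin(x)
atan(x)
atan2(y, x)

cospi(x)    # cos(pi*x) 
sinpi(x)    # sin(pi*x) 
tanpi(x)    # tan(pi*x) 

floor(x)    # largest integer not greater than x
ceiling(x)  # smallest integer not less than x
log(x)      # natural logarithm 
log2(x)     # logarithm with base 2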

In the above commands, x and y are numeric or complex vectors. Angles are in radians, not degrees, for the standard versions (i.e., a right angle is π/2), and in ‘half-rotations’ for cospi etc. The arc-tangent of two arguments atan2(y, x) returns the angle between the x-axis and the vector from the origin to (x, y), i.e., for positive arguments atan2(y, x) == atan(y/x). So it returns the arctangent (inverse tangent) of two numeric variables.

We give some examples.
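As a minimal sketch of the functions listed above, the following lines evaluate a few of them at familiar angles. The variable names are chosen only for illustration.

```r
# Angles are in radians; pi is a built-in constant.
cos_zero  <- cos(0)           # 1
sin_right <- sin(pi / 2)      # 1
tan_45    <- tan(pi / 4)      # approximately 1 (floating-point rounding)
cot_45    <- 1 / tan(pi / 4)  # cotangent written as 1/tan(x)
cospi_one <- cospi(1)         # -1, i.e. cos(pi * 1)
log_val   <- log(exp(2))      # 2
log2_val  <- log2(8)          # 3
floor(2.7)                    # 2
ceiling(2.1)                  # 3
```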

Sometimes the result of a calculation is dependent on multiple values in a vector. One example is the sum of a vector; when any value changes in the vector, the outcome is different. This complete set of functions and operators is called the vector operations.

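The single-argument arc-tangent function, atan(x), cannot return the correct quadrant of an angle specified by x- and y-coordinates, because the ratio y/x alone does not contain enough information: the points (x, y) and (-x, -y) give the same ratio. The two-argument function atan2(y, x) uses the signs of both coordinates and so resolves the quadrant.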
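The quadrant issue can be seen with a point in the third quadrant. This is only an illustrative sketch; the coordinates are arbitrary.

```r
x <- -1
y <- -1                      # the point (-1, -1) lies in the third quadrant
angle_ratio <- atan(y / x)   # pi/4: the signs cancel, so the quadrant is lost
angle_full  <- atan2(y, x)   # -3*pi/4: the signs of both coordinates are used
```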


Section 2: Vectors and Arrays in R

An array is a basic object in mathematics and its applications. The simplest way to build a one-dimensional array (a vector) is to use the combine function, c(). All you need to do is type the numbers you want to store as a comma-separated list inside the parentheses.
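For instance, a short vector can be built as follows (the name x and the values are arbitrary):

```r
x <- c(1, 3, 5, 7)   # combine four numbers into one vector
length(x)            # 4
x[3]                 # 5, the third element
```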

Sometimes you would like to change the values stored in a vector.
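A sketch of two common ways to do this, by position and by condition, using an illustrative vector:

```r
x <- c(10, 20, 30, 40)
x[2] <- 25           # replace the second element: 10 25 30 40
x[x > 28] <- 0       # replace every element greater than 28: 10 25 0 0
```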

It was mentioned earlier that all the elements of a vector must be of the same mode. To see the mode of an object, you can use the mode function. What happens if we try to combine objects of different modes using the c function? The answer is that R will find a common mode that can accommodate all the objects, resulting in the mode of some of the objects changing. For example, let's try combining some numbers and some character strings:
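A minimal illustration, with arbitrary values and the hypothetical name mixed:

```r
mixed <- c(1, 2, "cat", 3)
mixed          # "1" "2" "cat" "3"
mode(mixed)    # "character"
```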

You can see that the numbers have been changed to characters because they are now displayed surrounded by quotes. They also will no longer behave like numbers:
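The failure can be reproduced safely by catching the error. In this sketch tryCatch is used only so that the script keeps running; typing the addition directly at the console would print the error instead.

```r
mixed <- c(1, 2, "cat", 3)
# Adding two character values is an error; capture its message as text.
result <- tryCatch(mixed[1] + mixed[2],
                   error = function(e) conditionMessage(e))
result   # the text of the error message
```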

The error message means that the two values can no longer be added. If you really need to treat character strings like numbers, you can use the as.numeric function:
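A short sketch with illustrative values; note that strings which do not look like numbers would be converted to NA with a warning.

```r
digits <- c("1", "2", "3")        # numbers stored as character strings
nums <- as.numeric(digits)        # converted back to numeric
total <- nums[1] + nums[2]        # 3
```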

Of course, the best thing is to avoid combining objects of different modes with the c function. We'll see later that R provides an object known as a list that can store different types of objects without having to change their modes.

Once you start working with larger amounts of data, it becomes very tedious to enter data into the c function, especially considering the need to put quotes around character values and commas between values. To read data from a file or from the terminal without the need for quotes and commas, you can use the scan function. To read from a file (or a URL), pass it a quoted string with the name of the file or URL you wish to read; to read from the terminal, call scan() with no arguments, and enter a completely blank line when you're done entering your data. Additionally, on Windows or Mac OS X, you can substitute a call to the file.choose() function for the quoted string with the file name, and you'll be presented with the familiar file chooser used by most programs on those platforms.
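The file case can be sketched with a temporary file so that the example is self-contained. The file contents are made up, and the text argument is used here in place of interactive terminal input.

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("12 7 3", "4.5 8"), tmp)      # no quotes or commas needed
values <- scan(tmp, quiet = TRUE)          # numeric by default
words  <- scan(text = "red green blue",    # read from a string instead of a file
               what = "", quiet = TRUE)    # what = "" requests character data
unlink(tmp)                                # remove the temporary file
```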

No matter how carefully we collect our data, there will always be situations where we don't know the value of a particular variable. For example, we might conduct a survey where we ask people 10 questions, and occasionally we forget to ask one, or people don't know the proper answer. We don't want values like this to enter into calculations, but we can't just eliminate them because then observations that have missing values won't "fit in" with the rest of the data.

In R, missing values are represented by the symbol NA. For example, suppose we have a vector of 10 values, but the fourth one is missing. We can enter a missing value by passing NA to the c function just as if it were a number (no quotes needed):
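A sketch with ten made-up values, the fourth of which is missing:

```r
x <- c(3, 8, 5, NA, 2, 9, 4, 7, 1, 6)
length(x)      # 10: the missing value still occupies a position
mode(x)        # "numeric": NA does not change the mode
```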

R will also recognize the unquoted string NA as a missing value when data is read from a file or URL. Missing values are different from other values in R in two ways:

Any computation involving a missing value will return a missing value.
Unlike other quantities in R, we can't directly test to see if something is equal to a missing value with the equality operator (==). We must use the built-in is.na function, which will return TRUE if a value is missing and FALSE otherwise.

Here are some simple R statements that illustrate these points:
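The two points can be sketched with a small illustrative vector:

```r
x <- c(1, NA, 3)
plus_one <- x + 1        # 2 NA 4: the missing value propagates
total    <- sum(x)       # NA
eq_test  <- x == NA      # NA NA NA: == cannot detect missing values
na_test  <- is.na(x)     # FALSE TRUE FALSE
sum(x, na.rm = TRUE)     # 4: many functions can skip missing values on request
```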

In the second case, we just need to remember to always use is.na whenever we are testing to see if a value is a missing value.

Section 3: Matrices and data

A matrix is a two-dimensional data structure in R. A matrix is similar to a vector but additionally carries a dimension attribute. All attributes of an object can be checked with the attributes() function (the dimension can be checked directly with the dim() function). There are several ways to create and operate on matrices. The basic command, matrix, creates a matrix from a given set of values or vectors. The default is to fill the matrix by columns. To fill the entries of a matrix from left to right, top to bottom, add the parameter byrow = TRUE.

matrix has two companion functions:
as.matrix, which attempts to turn its argument into a matrix;
is.matrix, which tests whether its argument is a (strict) matrix.

The basic usage of matrix is


matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
       dimnames = NULL)

as.matrix(x, …)
# S3 method for data.frame
as.matrix(x, rownames.force = NA, …)

is.matrix(x)

Here data is an optional data vector (including a list or expression vector). Non-atomic classed R objects are coerced by as.vector and all attributes discarded.
nrow is the desired number of rows.
ncol is the desired number of columns.
byrow is logical. If FALSE (the default) the matrix is filled by columns, otherwise the matrix is filled by rows.
dimnames A dimnames attribute for the matrix: NULL or a list of length 2 giving the row and column names respectively. An empty list is treated as NULL, and a list of length one as row names. The list can be named, and the list names will be used as names for the dimensions.

If one of nrow or ncol is not given, an attempt is made to infer it from the length of data and the other parameter. If neither is given, a one-column matrix is returned.
If there are too few elements in data to fill the matrix, then the elements in data are recycled. If data has length zero, NA of an appropriate type is used for atomic vectors (0 for raw vectors) and NULL for lists.
is.matrix returns TRUE if x is a vector and has a "dim" attribute of length 2 and FALSE otherwise. Note that a data.frame is not a matrix by this test. The function is generic: you can write methods to handle specific classes of objects.

as.matrix is a generic function. The method for data frames will return a character matrix if there are only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise, the usual coercion hierarchy (logical < integer < double < complex) will be used, e.g., all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give an integer matrix, etc.
The default method for as.matrix calls as.vector(x), and hence, e.g., coerces factors to character vectors.

Actually, you don't need to specify both the number of rows and the number of columns. You can specify one, and matrix() will automatically guess the other using the length of the vector.
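For example, with six values and only nrow given, the number of columns is inferred (names and values are illustrative):

```r
m <- matrix(1:6, nrow = 2)   # ncol is inferred as 6 / 2 = 3
dim(m)                       # 2 3
```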

The elements are added to the matrix one column at a time. You can change this so that they are added one row at a time by using byrow = TRUE. For comparison, see the transpose function, t().
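A small sketch comparing the two fill orders and the transpose, using illustrative names:

```r
by_col <- matrix(1:6, nrow = 2)                 # rows: (1 3 5), (2 4 6)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # rows: (1 2 3), (4 5 6)
flipped <- t(by_col)                            # 3 x 2; rows: (1 2), (3 4), (5 6)
```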

In the same way that a vector can have named elements, a matrix can have row and column names. dimnames must be specified as a list containing two character vectors. The first character vector contains the row names, and the second contains the column names.

You can also give names to the dimensions by making dimnames a named list.
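Both forms of dimnames can be sketched as follows. All row, column, and dimension names here are made up for illustration.

```r
# Unnamed list: first vector is row names, second is column names.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Named list: the list names become the names of the dimensions.
named <- matrix(1:6, nrow = 2,
                dimnames = list(rows = c("r1", "r2"),
                                cols = c("a", "b", "c")))
names(dimnames(named))   # "rows" "cols"
```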

We can extract individual elements, rows, or columns from a matrix.

An element at the mth row, nth column of A can be accessed by the expression A[m, n].

The entire mth row of A can be extracted as A[m, ].

Similarly, the entire nth column of A can be extracted as A[, n].
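The three indexing forms can be sketched on a small illustrative matrix:

```r
A <- matrix(1:6, nrow = 2, byrow = TRUE)   # rows: (1 2 3), (4 5 6)
elem <- A[2, 3]    # 6: second row, third column
row1 <- A[1, ]     # 1 2 3: the entire first row
col2 <- A[, 2]     # 2 5: the entire second column
```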

If we assign names to the rows and columns of the matrix, then we can access the elements by names.
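A sketch of name-based access; the row and column names are hypothetical.

```r
A <- matrix(1:6, nrow = 2, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
elem  <- A["r2", "c"]   # 6
row_1 <- A["r1", ]      # named vector: a = 1, b = 2, c = 3
col_b <- A[, "b"]       # named vector: r1 = 2, r2 = 5
```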

Also

Another way:

These names can be accessed or changed with two helpful functions colnames() and rownames().
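For example (the names assigned here are arbitrary):

```r
A <- matrix(1:4, nrow = 2)
colnames(A) <- c("x", "y")            # set the column names
rownames(A) <- c("first", "second")   # set the row names
rownames(A)[2] <- "last"              # change a single name
colnames(A)                           # "x" "y"
```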

Another way of creating a matrix is by using the functions cbind() and rbind(), as in column bind and row bind.
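A brief sketch with two illustrative vectors:

```r
by_columns <- cbind(1:3, 4:6)   # 3 x 2: each vector becomes a column
by_rows    <- rbind(1:3, 4:6)   # 2 x 3: each vector becomes a row
```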

Finally, you can also create a matrix from a vector by setting its dimension using dim().
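For instance, with an illustrative vector of six values:

```r
v <- 1:6
dim(v) <- c(2, 3)   # v is now a 2 x 3 matrix, filled by columns
is.matrix(v)        # TRUE
```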

One shortcoming of vectors and matrices is that they can only hold one mode of data; they don't allow us to mix, say, numbers and character strings. If we try to do so, it will change the mode of the other elements in the vector to conform. For example:
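A sketch of the conversion in a matrix, with made-up values:

```r
m <- cbind(c(1, 2), c("a", "b"))   # one numeric column, one character column
mode(m)                            # "character": the numbers were converted
m[1, 1]                            # "1"
```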

For small datasets, you can enter each of the columns (variables) of your data frame using the data.frame function. For example, let's extend our temperature example by creating a data frame that has the day of the month, the minimum temperature and the maximum temperature:
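A minimal sketch follows. The name temps, the column names, and the temperature values are all hypothetical and chosen only for illustration.

```r
temps <- data.frame(day = 1:3,
                    min = c(50, 52, 48),   # made-up minimum temperatures
                    max = c(70, 75, 68))   # made-up maximum temperatures
dim(temps)      # 3 3
names(temps)    # "day" "min" "max"
```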

Probably the easiest way to refer to this column is to use a special notation that eliminates the need to put quotes around the variable names (unless they contain blanks or other special characters). Separate the data frame name from the variable name with a dollar sign ($):
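The dollar-sign notation can be sketched on a small data frame. The data frame and its values are hypothetical.

```r
temps <- data.frame(day = 1:3,
                    min = c(50, 52, 48),
                    max = c(70, 75, 68))
max_col  <- temps$max          # 70 75 68
same_col <- temps[["max"]]     # equivalent form that uses a quoted name
mean(temps$max)                # 71
```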

Section 4: Operations with Matrices and data

Multiply matrices:
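Matrix multiplication uses the %*% operator, whereas * multiplies element by element. A sketch with two illustrative 2 x 2 matrices:

```r
A <- matrix(1:4, nrow = 2)             # rows: (1 3), (2 4)
B <- matrix(c(0, 1, 1, 0), nrow = 2)   # rows: (0 1), (1 0)
product     <- A %*% B                 # matrix product; rows: (3 1), (4 2)
elementwise <- A * B                   # element by element; rows: (0 3), (2 0)
```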

We can add a row or a column using the rbind() and cbind() functions, respectively. Similarly, a row or column can be removed through reassignment.
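A sketch of adding and then removing a row and a column, using illustrative values:

```r
A <- matrix(1:6, nrow = 2)     # 2 x 3
A <- rbind(A, 7:9)             # add a row: now 3 x 3
A <- cbind(A, c(10, 11, 12))   # add a column: now 3 x 4
A <- A[-1, ]                   # remove the first row by reassignment: 2 x 4
A <- A[, -4]                   # remove the fourth column: 2 x 3
```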

Section 5: RStudio

R is a great place to start your data science journey because it is an environment designed from the ground up to support data science. R is not just a programming language, but it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many of its peers. This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process. These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.

There are four things you need to run the code: R, RStudio, a collection of R packages called the tidyverse, and a handful of other packages. Packages are the fundamental units of reproducible R code. They include reusable functions, the documentation that describes how to use them, and sample data.

A new major version of R comes out once a year, and there are 2-3 minor releases each year. It’s a good idea to update regularly. Upgrading can be a bit of a hassle, especially for major versions, which require you to reinstall all your packages, but putting it off only makes it worse.

RStudio is an integrated development environment, or IDE, for R programming. Download and install it from http://www.rstudio.com/download. RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know. It’s a good idea to upgrade regularly so you can take advantage of the latest and greatest features.

Once you have installed RStudio, it is time to learn how to use it. When you open RStudio, you will see a window with up to four panes (source, console, environment, and files), a layout similar to MATLAB. You can resize and rearrange these panes as you like. The source pane appears only when you open an R file or start a new R script.

The console window is your interface to R, with some added features. You can type any R command and press ENTER, and R will display the answer below the command line. Some pointers about the console:

use the UP arrow keys to retrieve most recent R commands that you have used;
use CONTROL + L to clear the console;
use TAB to complete the commands, get arguments for functions, etc.

R has various built-in functions that are very useful. Some of them are loaded by default, and some of them need to be loaded separately. Let us try some default functions.
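A few functions that are available without loading anything, typed as they would be at the console. The input values are arbitrary, and the package named in the final comment is only an example of one that must be loaded first.

```r
root    <- sqrt(16)           # 4
avg     <- mean(c(2, 4, 6))   # 4
spread  <- sd(c(2, 4, 6))     # 2
rounded <- round(3.14159, 2)  # 3.14
# Functions from other packages need library() first, for example:
# library(MASS)   # assumes the MASS package is installed
```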

Now that you have run a few R commands in the console, let us try to understand the difference between typing code in the console pane and typing code in the source pane. The source pane is merely a text editor. You can write R commands the same way you did in the console, but pressing ENTER will not execute them. However, unlike in the console, you can save your R commands for later (in what is referred to as an R script file), build multiple programs, etc.

To open a new script file, use FILE > NEW FILE > R SCRIPT. This will open a new UNTITLED source pane. You can have multiple script files open at the same time; they appear as different tabs in the source pane.

To run a command in a script file, you have two options. If you want to run line by line, move the CURSOR to the line you want to execute and press COMMAND+ENTER on a Mac, or CONTROL+ENTER on a PC or Linux. Observe the changes in the console pane. Otherwise, you can highlight the command(s) you want to run and use the RUN button at the top to execute them. Either way, the commands will be executed in the console pane.

One important practice when you start building multi-line code is to add comments throughout your program so that you or someone else can understand it later. To add a comment in an R script file, start it with "#". Notice that your comments are shown in GREEN.
Saving an R script: use FILE > SAVE AS, select the location where you want to save the file, then click SAVE.
Opening an existing R script: use FILE > OPEN, and select the file.
Sourcing an R script file: call source() with the path to the file to run every command in it.
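Running a whole script file from the console can be sketched with the source() function. The script here is written to a temporary file purely so the example is self-contained; in practice the path would point to a script you saved yourself.

```r
script <- tempfile(fileext = ".R")
writeLines(c("# compute a total",      # a comment line inside the script
             "total <- sum(1:10)"),    # the command the script runs
           script)
source(script)    # runs every line of the file
total             # 55
unlink(script)    # remove the temporary file
```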

Section 6: Functions

There are many built-in functions in R. We list some of them in alphabetical order. The names and descriptions in this section follow FedSQL, the SQL dialect of SAS, so several of them differ from the names R itself uses (for example, R computes an average with mean() rather than avg()). A FedSQL function performs a computation on FedSQL expressions and returns either a single value or a set of values if the FedSQL expression is a column. Functions that perform a computation on a column are aggregate functions. In other SQL environments, aggregate functions are also known as set functions. If the value of an argument is invalid, FedSQL sets the result to a null or missing value. Here are some common restrictions on function arguments:

Some functions require that their arguments be restricted within a certain range. For example, the argument of the LOG function must be greater than 0.
Most functions do not permit nulls or missing values as arguments. Exceptions include some of the descriptive statistics functions and the IFNULL function.
In general, the allowed range of the arguments is platform-dependent, such as with the EXP function.

FedSQL aggregate functions operate on all values for an expression in a table column and return a single result. If the aggregate function is processed in a GROUP BY statement, the aggregate function returns a single result for each group. Null values and SAS missing values are not considered in the operation, except for the COUNT(*) syntax of the COUNT function. The table column that you specify in the function can be any FedSQL expression that evaluates to a column name. The following are FedSQL aggregate functions:


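avg(population)        # Returns the average of all values in a column.
count(population)      # Returns the number of rows retrieved by a SELECT statement for a specified table.
css(density)           # Returns the corrected sum of squares of all values in an expression.
kurtosis(average)      # Returns the kurtosis of all values in an expression.
max(population)        # Returns the maximum value in a column.
min(density)           # Returns the minimum value in a column.
nmiss(AvgHigh)         # Returns the number of null or SAS missing values.
probt(population)      # Returns a probability from a Student's t distribution.
range(AvgHigh)         # Returns the difference between the largest and smallest values.
skewness(AvgHigh)      # Returns the skewness of all values in an expression.
std(2,4,6,3,1);        # Returns the standard deviation of its arguments.
stddev(AvgHigh)        # Returns the standard deviation of all values in an expression.
stderr(AvgHigh)        # Returns the standard error of all values in an expression.
students_t(population) # Returns the Student's t distribution.
sum(population)        # Returns the sum of all values in an expression.
uss(population)        # Returns the uncorrected sum of squares.
variance(population)   # Returns the variance of all values in an expression.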

The AVG function adds the values of all the rows in the specified column and divides the result by the number of rows. Null values and SAS missing values are ignored and are not included in the computation.
The count function returns the number of non-null values of an expression; with the DISTINCT keyword it returns the number of unique values, excluding null values. With the COUNT(*) syntax it returns a count of all rows from the table, including rows that contain null values or SAS missing values. You use the COUNT function in a SELECT statement to return the requested number of rows in a table.
css: The corrected sum of squares is the sum of squared deviations (differences) from the mean. Null values and SAS missing values are ignored and are not included in the computation.
Kurtosis is primarily a measure of the heaviness of the tails of a distribution. Large values of kurtosis indicate that the distribution has heavy tails. Null values and SAS missing values are ignored by kurtosis and are not included in the computation.
The max and min functions ignore null values and SAS missing values. The max function returns the maximum value in a column. The min function returns the minimum value in a column.
nmiss indicates the total number of null or SAS missing values. You can use an aggregate function to produce a statistical summary of data in the entire table that is listed in the FROM clause or for each group that is specified in a GROUP BY clause.
The probt function returns the probability that an observation from a Student's t distribution, with degrees of freedom n-1 and noncentrality parameter nc equal to 0, is less than or equal to expression. The significance level of a two-tailed t test is given by this code line.


p=(1-probt(abs(x),df))*2;
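In R itself, the cumulative distribution function of Student's t distribution is pt(), so the same two-tailed significance level can be sketched as follows. The test statistic x and the degrees of freedom df are hypothetical values.

```r
x  <- 2.1                        # hypothetical t statistic
df <- 15                         # hypothetical degrees of freedom
p  <- (1 - pt(abs(x), df)) * 2   # two-tailed significance level
p_alt <- 2 * pt(-abs(x), df)     # equivalent form using the lower tail
```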

You can use an aggregate function to produce a statistical summary of data in the entire table that is listed in the FROM clause or for each group that is specified in a GROUP BY clause. The GROUP BY clause groups data by a specified column or columns. When you use a GROUP BY clause, the aggregate function in the SELECT clause or in a HAVING clause instructs FedSQL in how to summarize the data for each group. FedSQL calculates the aggregate function separately for each group. If GROUP BY is omitted, then all the rows in the table or view are considered to be a single group.
The range function returns the difference between the largest and the smallest values in the specified expression. The RANGE function ignores null values and SAS missing values.
Skewness is a measure of the tendency of the deviations from the mean to be larger in one direction than in the other. A positive value for skewness indicates that the data is skewed to the right. A negative value indicates that the data is skewed to the left. Null values and SAS missing values are ignored and are not included in the computation.
The std function returns the standard deviation of its arguments, each of which can be any valid expression that evaluates to a numeric value.
The stddev function returns the standard deviation, calculated as the square root of the variance. Null values and SAS missing values are ignored and are not included in the computation.
The stderr function returns the statistical standard error of all values in an expression.
The students_t function returns the Student's t distribution. The PROBT function returns the probability that the Student's t distribution is less than or equal to a given value.
The sum function returns the sum of all the values in an expression, which can be any valid SQL expression.
The uss function returns the uncorrected sum of squares of all the values in an expression. Null values and SAS missing values are ignored and are not included in the computation.
The variance function returns the measure of the dispersion of all values in an expression. Null values and SAS missing values are ignored and are not included in the computation.
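Several of the summaries described above can be reproduced in base R. The following is only a sketch of rough counterparts, not a statement of how FedSQL computes them; the vector x and all variable names are made up, and na.rm = TRUE plays the role of ignoring missing values. Base R has no built-in skewness or kurtosis function, so those are omitted.

```r
x <- c(2, 4, 6, 3, 1, NA)             # illustrative data with one missing value
n      <- sum(!is.na(x))              # count of non-missing values: 5
n_miss <- sum(is.na(x))               # counterpart of nmiss: 1
avg    <- mean(x, na.rm = TRUE)       # counterpart of avg: 3.2
css    <- sum((x - avg)^2, na.rm = TRUE)   # corrected sum of squares: 14.8
uss    <- sum(x^2, na.rm = TRUE)      # uncorrected sum of squares: 66
var_x  <- var(x, na.rm = TRUE)        # variance: css / (n - 1) = 3.7
sd_x   <- sd(x, na.rm = TRUE)         # standard deviation: sqrt(var_x)
se_x   <- sd_x / sqrt(n)              # standard error of the mean
rng    <- diff(range(x, na.rm = TRUE))     # largest minus smallest: 5
```

Note that range() in R returns the minimum and maximum as a pair, so diff() is needed to obtain the single difference described above.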

We list other functions.


abs(-513); 
beta(5,3);
ceil(-1+1.e-11);
ceilz(2.1);
character_length('December');
current_date;
current_time;
current_timestamp;
cv(5,8,9,6,.);
degrees(2*pi());
floor(2.95);
floorz(-2.4);
gamma(6);
gcd(5,25);
geomean(1,2,2,4);
harmean(1,2,4,4);
iqr(2,4,1,3,999999);
largest(1, 456, 789, .Q, 123);

The abs function returns the absolute value of a numeric value expression.
The beta function returns the value of the beta function: $ B(a,b) = \int_0^1 x^{a-1} \left( 1 -x \right)^{b-1} {\text d}x . $
The ceil function returns the smallest integer greater than or equal to a numeric value expression. The ceilz function returns the smallest integer that is greater than or equal to the argument, using zero fuzzing.
The character_length function returns the number of characters in a string of any data type.
The current_date function returns the current date for the time zone.
The current_time function returns the current time for your local zone.
The current_timestamp function returns the date and time for your time zone.
The cv function returns the coefficient of variation.
The floor function returns the largest integer less than or equal to a numeric value expression. The floorz function returns the largest integer that is less than or equal to the argument, using zero fuzzing.
The gamma function returns the value of the gamma function: $ \Gamma (x) = \int_0^{\infty} t^{x-1} e^{-t} {\text d} t . $
The gcd function returns the greatest common divisor for a set of integers.
The geomean function returns the geometric mean of n numbers: $ \sqrt[n]{x_1 \cdot x_2 \cdots x_n} . $
The harmean function returns the harmonic mean of n numbers: $ \frac{n}{x_1^{-1} + x_2^{-1} + \cdots + x_n^{-1}} . $
The iqr function returns the interquartile range.
The largest function returns the kth largest non-null or nonmissing value.
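Many of the functions in this list have close counterparts in base R, although some names differ. The following sketch shows a few of them; it is an informal mapping rather than a statement of how the listed functions are implemented, and the geometric and harmonic means are written out from their formulas because base R has no dedicated functions for them. The variable names are illustrative.

```r
abs(-513)                             # 513
beta_val  <- beta(5, 3)               # 1/105
ceiling(2.1)                          # 3 (R spells out "ceiling")
n_chars   <- nchar("December")        # 8, counterpart of character_length
today     <- Sys.Date()               # current date
now       <- Sys.time()               # current date and time
floor(2.95)                           # 2
gamma_val <- gamma(6)                 # 120, since gamma(n) = (n - 1)!
x <- c(1, 2, 2, 4)
geo_mean  <- exp(mean(log(x)))        # geometric mean: 2
y <- c(1, 2, 4, 4)
har_mean  <- length(y) / sum(1 / y)   # harmonic mean: 2
iqr_val   <- IQR(c(2, 4, 1, 3, 999999))   # interquartile range: 2
```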

Section 7: Logical Operations

R has the following basic logical operations.

Operations	Operator	Example	Output
less than	<	3 < 5	TRUE
less than or equal to	<=	5 <= 7	TRUE
greater than	>	3 > 5	FALSE
greater than or equal to	>=	5 >= 5	TRUE
equal to	==	3 == 5	FALSE
not equal to	!=	3 != 5	TRUE
not	!	!(2==2)	FALSE
and	&	(2==2) & (2==5)	FALSE
or	|	(2==2) | (2==5)	TRUE

Logical functions:


band(11,3); 
blshift(5,3); 
bxor(128,64);
ifnull(AvgHigh, 0)

The band function returns bitwise logical AND of two arguments.
The blshift function returns the bitwise logical left shift of two arguments.
The bxor function returns the bitwise logical EXCLUSIVE OR of two arguments.
The ifnull function checks the value of the first expression and, if it is null or a SAS missing value, returns the second expression.
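In base R, the corresponding bitwise operations are named bitwAnd, bitwShiftL, and bitwXor, and a missing value can be replaced with ifelse() and is.na(). The sketch below uses the same arguments as the calls listed above; the vector avg_high is made up.

```r
and_val   <- bitwAnd(11L, 3L)       # 3:   1011 AND 0011 = 0011
shift_val <- bitwShiftL(5L, 3L)     # 40:  5 shifted left by 3 bits = 5 * 8
xor_val   <- bitwXor(128L, 64L)     # 192: the two bits do not overlap
avg_high  <- c(71, NA, 68)          # hypothetical column with a missing value
filled    <- ifelse(is.na(avg_high), 0, avg_high)   # 71 0 68
```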

You can store vectors of logical values in exactly the same way that you can store vectors of numbers.
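For example, with an illustrative vector named flags:

```r
flags <- c(TRUE, FALSE, TRUE)
mode(flags)        # "logical"
negated <- !flags  # FALSE TRUE FALSE
sum(flags)         # 2: TRUE counts as 1 and FALSE as 0
any(flags)         # TRUE
all(flags)         # FALSE
```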

You can apply a logical operator to a vector. This might not make a lot of sense to you, so let us unpack it slowly. First, we define a vector of numbers "sales by month" and then check whether a profit was made in each month.
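A sketch of this idea follows. The twelve monthly figures are invented for illustration, and a month is treated as profitable when its value is greater than zero.

```r
sales.by.month <- c(0, 100, 200, 50, 0, 0, 0, 0, 0, 0, 0, 0)
profitable <- sales.by.month > 0   # the comparison is applied to every element
profitable[1:4]                    # FALSE TRUE TRUE TRUE
which(profitable)                  # 2 3 4: the months with a profit
sum(profitable)                    # 3 profitable months
```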