This chapter covers basic commands in R.
R has many built-in functions. The first ones listed below compute the cosine, sine, tangent, cotangent, arc-cosine, arc-sine, arc-tangent, and the two-argument arc-tangent, respectively; the remaining ones cover rounding and logarithms. We list these functions:
cos(x)
sin(x)
tan(x)
1/tan(x) # cotangent; base R has no cot() function
acos(x)
asin(x)
atan(x)
atan2(y, x)
cospi(x) # cos(pi*x)
sinpi(x) # sin(pi*x)
tanpi(x) # tan(pi*x)
floor(x) # largest integer not greater than x
ceiling(x) # smallest integer not less than x (R has no ceil())
log(x) # natural logarithm
log2(x) # logarithm with base 2
In the above commands, x and y are numeric or complex vectors.
Angles are in radians, not degrees, for the standard versions (i.e., a right angle is π/2), and in ‘half-rotations’ for cospi etc.
The two-argument arc-tangent atan2(y, x) returns the angle between the x-axis and the vector from the origin to (x, y); that is, for positive arguments, atan2(y, x) == atan(y/x). So it returns the arctangent (inverse tangent) of two numeric variables.
We give some examples.
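A few of the functions listed above in use. This is only a small sketch; the input values are arbitrary and chosen so that the results are easy to check by hand.

```r
# Angles are in radians for cos() and sin().
cos(pi)          # -1
sin(pi / 2)      # 1

# cospi(x) and sinpi(x) take the angle in half-rotations: cos(pi * x), sin(pi * x).
cospi(1)         # -1
sinpi(0.5)       # 1

# Rounding and logarithms.
floor(2.7)       # 2
ceiling(2.1)     # 3
log(exp(1))      # 1
log2(8)          # 3
```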
Sometimes the result of a calculation depends on several values in a vector. One example is the sum of a vector: when any value in the vector changes, the result changes. Functions and operators of this kind are called vector operations.
The built-in single-argument arc-tangent function, atan(x), cannot return the correct quadrant of an angle specified by x- and y-coordinates, because its one argument does not contain enough information. The two-argument atan2(y, x) receives both coordinates for this reason.
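The quadrant problem can be seen with a single point. The sketch below uses the point (-1, -1), chosen only for illustration: the ratio y/x loses both signs, so atan() reports a first-quadrant angle while atan2() reports the third-quadrant one.

```r
# The point (-1, -1) lies in the third quadrant.
x <- -1
y <- -1

a1 <- atan(y / x)    # pi/4: the signs cancel in y/x, so the quadrant is lost
a2 <- atan2(y, x)    # -3*pi/4: the angle of the vector from the origin to (x, y)
```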
The array is a basic object in mathematics and its applications. The simplest way to build one in R is to use the combine function, c(). All you need to do is type the numbers you want to store as a comma-separated list.
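A minimal sketch of c() together with a vector operation such as sum(). The variable names and values are made up for illustration.

```r
v <- c(3, 1, 4, 1, 5)    # combine five numbers into one numeric vector
length(v)                # 5
total_before <- sum(v)   # 14

v[2] <- 10               # change one value ...
total_after <- sum(v)    # ... and the sum changes to 23

words <- c("a", "b")     # character values need quotes
```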
Once you start working with larger amounts of data, it becomes very tedious to enter data into the c function, especially considering the need to put quotes around character values and commas between values. To read data from a file or from the terminal without the need for quotes and commas, you can use the scan function. To read from a file (or a URL), pass it a quoted string with the name of the file or URL you wish to read; to read from the terminal, call scan() with no arguments, and enter a completely blank line when you're done entering your data. Additionally, on Windows or Mac OS X, you can substitute a call to the file.choose() function for the quoted string with the file name, and you'll be presented with the familiar file chooser used by most programs on those platforms.
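As a sketch of reading with scan(), the example below first writes a small temporary file so that it is self-contained; the file contents are invented. The interactive forms are shown only as comments because they wait for user input.

```r
# Write a small whitespace-separated data file (temporary, made-up contents).
tmp <- tempfile(fileext = ".txt")
writeLines(c("12 7 3", "4.5 8"), tmp)

# scan() reads numbers by default; quiet = TRUE suppresses the "Read n items" message.
values <- scan(tmp, quiet = TRUE)        # 12, 7, 3, 4.5, 8

# For character data, pass what = ""; the file needs no quotes or commas.
writeLines("red green blue", tmp)
colours <- scan(tmp, what = "", quiet = TRUE)

# Interactive alternatives (not run here):
# x <- scan()                 # type values, finish with a blank line
# x <- scan(file.choose())    # pick the file in a dialog
```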
No matter how carefully we collect our data, there will always be situations where we don't know the value of a particular variable. For example, we might conduct a survey where we ask people 10 questions, and occasionally we forget to ask one, or people don't know the proper answer. We don't want values like this to enter into calculations, but we can't just eliminate them because then observations that have missing values won't "fit in" with the rest of the data.
In R, missing values are represented by the symbol NA. For example, suppose we have a vector of 10 values, but the fourth one is missing. We can enter a missing value by passing NA to the c function just as if it were a number (no quotes needed). Two points about missing values are worth remembering:
- Any computation involving a missing value will return a missing value.
- Unlike other quantities in R, we can't directly test to see if something is equal to a missing value with the equality operator (==). We must use the builtin is.na function, which will return TRUE if a value is missing and FALSE otherwise.
We should therefore use is.na whenever we are testing to see if a value is a missing value.
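The points above can be sketched as follows. The ten values are invented; only the position of the missing value (the fourth) follows the text.

```r
x <- c(5, 3, 8, NA, 2, 7, 1, 9, 4, 6)   # ten values, the fourth is missing

mean(x)                  # NA: a computation involving NA returns NA
x == NA                  # NA in every position: == cannot detect missing values
missing <- is.na(x)      # TRUE only in position 4
n_missing <- sum(missing)            # 1
m_observed <- mean(x, na.rm = TRUE)  # 5: mean of the nine observed values
```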
There are several options for creating and operating on matrices. The basic command, matrix, creates a matrix from a given set of values or vectors. The default is to fill the matrix by columns. To fill the entries of a matrix from left to right, top to bottom, add the argument byrow = TRUE.
matrix has two companion functions: as.matrix, which attempts to turn its argument into a matrix, and is.matrix, which tests whether its argument is a (strict) matrix.
The basic usage of matrix is
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
as.matrix(x, ...)
# S3 method for data.frame
as.matrix(x, rownames.force = NA, ...)
is.matrix(x)
Here data is an optional data vector (including a list or expression vector). Non-atomic classed R objects are coerced by as.vector and all attributes are discarded.
nrow is the desired number of rows.
ncol is the desired number of columns.
byrow is logical. If FALSE (the default), the matrix is filled by columns; otherwise, it is filled by rows.
dimnames is a dimnames attribute for the matrix: NULL or a list of length 2 giving the row and column names, respectively. An empty list is treated as NULL, and a list of length one as row names. The list can be named, and the list names will be used as names for the dimensions.
If one of nrow or ncol is not given, an attempt is made to infer it from the length of data and the other parameter. If neither is given, a one-column matrix is returned.
If there are too few elements in data to fill the matrix, then the elements in data are recycled. If data has length zero, NA of an appropriate type is used for atomic vectors (0 for raw vectors) and NULL for lists.
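The filling rules described above can be checked on a small example. This is a sketch with arbitrary data (the integers 1 to 6); the object names are made up.

```r
m_col <- matrix(1:6, nrow = 2)                # filled by columns; ncol is inferred as 3
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

m_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # filled by rows
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

m_rec <- matrix(1:2, nrow = 2, ncol = 3)      # too few elements: 1, 2 are recycled
d_default <- dim(matrix(1:6))                 # 6 1: neither nrow nor ncol given
```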
is.matrix returns TRUE if x is a vector and has a "dim" attribute of length 2, and FALSE otherwise. Note that a data.frame is not a matrix by this test. The function is generic: you can write methods to handle specific classes of objects.
as.matrix is a generic function. The method for data frames will return a character matrix if there are only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns. Otherwise, the usual coercion hierarchy (logical < integer < double < complex) will be used; e.g., all-logical data frames will be coerced to a logical matrix, mixed logical-integer will give an integer matrix, etc.
The default method for as.matrix calls as.vector(x), and hence, e.g., coerces factors to character vectors.
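A short sketch of these coercion rules on two made-up data frames, one with logical and integer columns and one with a character column.

```r
df_num <- data.frame(a = c(TRUE, FALSE), b = c(1L, 2L))
df_mix <- data.frame(a = 1:2, b = c("x", "y"))

is.matrix(df_num)            # FALSE: a data frame is not a matrix by this test

m_num <- as.matrix(df_num)   # logical + integer columns give an integer matrix
m_mix <- as.matrix(df_mix)   # a character column forces a character matrix

typeof(m_num)                # "integer"
typeof(m_mix)                # "character"
is.matrix(m_num)             # TRUE
```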
dimnames must be specified as a list containing two character vectors. The first character vector contains the row names, and the second contains the column names. dimnames can also be given as a named list, in which case the list names label the dimensions.
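Both ways of supplying dimnames can be sketched as follows; the row and column labels are arbitrary.

```r
# Unnamed list: first vector gives row names, second gives column names.
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))
m["r2", "c1"]                # 2

# Named list: the list names label the dimensions themselves.
m_named <- matrix(1:4, nrow = 2,
                  dimnames = list(row = c("r1", "r2"), col = c("c1", "c2")))
names(dimnames(m_named))     # "row" "col"
```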
We can extract elements or vectors from a matrix.
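A minimal sketch of extraction by index, using a small made-up 2 x 3 matrix.

```r
m <- matrix(1:6, nrow = 2)     # 2 x 3, filled by columns

el  <- m[2, 3]                 # 6: a single element (row 2, column 3)
r1  <- m[1, ]                  # 1 3 5: the first row as a vector
c2  <- m[, 2]                  # 3 4: the second column as a vector
c2m <- m[, 2, drop = FALSE]    # the second column kept as a 2 x 1 matrix
```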
Row and column names can be read or set with colnames() and rownames(). Matrices can be combined with cbind() and rbind(), as in column bind and row bind.
The dimensions of a matrix are returned by dim().
We can add a row or a column using the rbind() and cbind() functions, respectively. Similarly, a row or column can be removed through reassignment.
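Adding and removing rows and columns might look like the sketch below; the numbers are arbitrary.

```r
m <- matrix(1:4, nrow = 2)     # 2 x 2
m <- rbind(m, c(9, 9))         # add a third row
m <- cbind(m, c(7, 8, 9))      # add a third column
dim(m)                         # 3 3

m <- m[-1, ]                   # remove the first row by reassignment
m <- m[, -2]                   # remove the second column
dim(m)                         # 2 2
```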
R is a great place to start your data science journey because it is an environment designed from the ground up to support data science. R is not just a programming language, but it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many of its peers. This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process. These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.
There are four things you need to run the code: R, RStudio, a collection of R packages called the tidyverse, and a handful of other packages. Packages are the fundamental units of reproducible R code. They include reusable functions, the documentation that describes how to use them, and sample data.
A new major version of R comes out once a year, and there are 2-3 minor releases each year. It’s a good idea to update regularly. Upgrading can be a bit of a hassle, especially for major versions, which require you to reinstall all your packages, but putting it off only makes it worse.
RStudio is an integrated development environment, or IDE, for R programming. Download and install it from http://www.rstudio.com/download. RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know. It’s a good idea to upgrade regularly so you can take advantage of the latest and greatest features.
Once you have installed RStudio, it is the right time to learn how to use it. When you open RStudio, you will see a window with four panes (source, console, environment, and files), somewhat similar to MATLAB. You can resize and rearrange these panes as you like. The fourth pane (source) appears only when you open an R file or start a new R script.
The console window is your interface to R, with some added features. You can type any R command and press ENTER, and R will display the answer below the command line. Some pointers about the console:
- use the UP arrow keys to retrieve most recent R commands that you have used;
- use CONTROL + L to clear the console;
- use TAB to complete the commands, get arguments for functions, etc.
R has various built-in functions that are very useful. Some of them are loaded by default, and some need to be loaded separately. Let us try some default functions.
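A few functions that are available as soon as R starts, followed by one that must be loaded first. The choice of the tools package here is only an illustration: it ships with R but is not attached by default.

```r
# Available by default.
sqrt(16)               # 4
round(3.14159, 2)      # 3.14
mean(c(2, 4, 6))       # 4

# file_ext() lives in the 'tools' package, so attach the package first.
library(tools)
ext <- file_ext("analysis.R")   # "R"
```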
Now that you have run a few R commands in the console, let us look at the difference between typing code in the console pane and typing code in the source pane. The source pane is merely a text editor. You can write R commands the same way you did in the console, but pressing ENTER will not execute them. However, unlike the console, the source pane lets you save your R commands for later (in what is referred to as an R script file), build multiple programs, etc.
To open a new script file, use FILE > NEW FILE > R SCRIPT. This will open a new UNTITLED source pane. You can have multiple script files open at the same time; they appear as different tabs in the source pane.
To run a command in a script file, you have two options. To run line by line, move the CURSOR to the line you want to execute and press COMMAND+ENTER on a Mac, or CONTROL+ENTER on a PC or Linux. Observe the changes in the console pane. Otherwise, you can highlight the command(s) you want to run and use the RUN button at the top to execute them. Either way, the commands will be executed in the console pane.
- One important practice when you start building multi-line code is to add comments throughout your program so that you or someone else can understand it later. To add comments in R script files, use "#". Notice that your comments will be shown in GREEN.
- Saving an R script: use FILE > SAVE AS, select the location where you want to save, then click SAVE.
- Opening an existing R script: use FILE > OPEN, and select the file.
- Sourcing an R script file: call source() with the path to the file to run all of its commands.
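Sourcing can also be tried without RStudio. The sketch below writes a tiny, made-up script (with a comment line) to a temporary file and then runs it with source().

```r
# Create a small script file; its contents are invented for this example.
script <- tempfile(fileext = ".R")
writeLines(c("# compute the area of a circle",
             "radius <- 2",
             "area <- pi * radius^2"), script)

source(script)    # runs every command in the file
area              # 12.56637
```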
There are many built-in functions in R. We list some of them in alphabetical order. The names and descriptions in this list are those of SAS FedSQL functions. A FedSQL function performs a computation on FedSQL expressions and returns either a single value or a set of values if the FedSQL expression is a column. Functions that perform a computation on a column are aggregate functions. In other SQL environments, aggregate functions are also known as set functions. If the value of an argument is invalid, FedSQL sets the result to a null or missing value. Here are some common restrictions on function arguments:
- Some functions require that their arguments be restricted within a certain range. For example, the argument of the LOG function must be greater than 0.
- Most functions do not permit nulls or missing values as arguments. Exceptions include some of the descriptive statistics functions and the IFNULL function.
- In general, the allowed range of the arguments is platform-dependent, such as with the EXP function.
avg(population) # average function
count(population) # Returns the number of rows retrieved by a SELECT statement for a specified table.
css(density) # Returns the corrected sum of squares of all values in an expression.
kurtosis(average) # Returns the kurtosis of all values in an expression.
max(population) # Returns the maximum value in a column.
min(density)
nmiss(AvgHigh)
probt(population)
range(AvgHigh)
skewness(AvgHigh)
std(2,4,6,3,1);
stddev(AvgHigh)
stderr(AvgHigh)
students_t(population)
sum(population)
uss(population)
variance(population)
The avg function adds the values of all the rows in the specified column and divides the result by the number of rows. Null values and SAS missing values are ignored and are not included in the computation.
Depending on how it is called, the count function returns the number of unique values, excluding null values, or a count of all rows from the table, including rows that contain null values or SAS missing values. You use the COUNT function in a SELECT statement to return the requested number of rows in a table.
The css function returns the corrected sum of squares, which is the sum of squared deviations (differences) from the mean. Null values and SAS missing values are ignored and are not included in the computation.
Kurtosis is primarily a measure of the heaviness of the tails of a distribution. Large values of kurtosis indicate that the distribution has heavy tails. Null values and SAS missing values are ignored by kurtosis and are not included in the computation.
The max and min functions ignore null values and SAS missing values. The max function returns the maximum value in a column. The min function returns the minimum value in a column.
The nmiss function returns the total number of null or SAS missing values.
You can use an aggregate function to produce a statistical summary of data in the entire table that is listed in the FROM clause or for each group that is specified in a GROUP BY clause.
The probt function returns the probability that an observation from a Student's t distribution, with degrees of freedom n-1 and noncentrality parameter nc equal to 0, is less than or equal to expression.
The significance level of a two-tailed t test is given by this code line.
p=(1-probt(abs(x),df))*2;
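In R itself, the cumulative distribution function of Student's t is pt(), so the same two-tailed significance level could be computed as in the sketch below. The test statistic and degrees of freedom are hypothetical values, not taken from any data set.

```r
# Hypothetical test statistic and degrees of freedom.
x  <- 2.1
df <- 15

# pt(q, df) gives P(T <= q) for a central Student's t distribution.
p <- 2 * (1 - pt(abs(x), df))

# Equivalent form that asks for the upper tail directly.
p_alt <- 2 * pt(abs(x), df, lower.tail = FALSE)
```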
The GROUP BY clause groups data by a specified column or columns. When you use a GROUP BY clause, the aggregate function in the SELECT clause or in a HAVING clause instructs FedSQL how to summarize the data for each group. FedSQL calculates the aggregate function separately for each group. If GROUP BY is omitted, then all the rows in the table or view are considered to be a single group.
The range function returns the difference between the largest and the smallest values in the specified expression. The range function ignores null values and SAS missing values.
Skewness is a measure of the tendency of the deviations from the mean to be larger in one direction than in the other. A positive value for skewness indicates that the data is skewed to the right. A negative value indicates that the data is skewed to the left. Null values and SAS missing values are ignored and are not included in the computation.
The std function returns the standard deviation. Its argument is any valid expression that evaluates to a numeric value.
The stddev function returns the standard deviation, calculated as the square root of the variance. Null values and SAS missing values are ignored and are not included in the computation.
The stderr function returns the statistical standard error of all values in an expression.
The students_t function returns the Student's t distribution. The PROBT function returns the probability that the Student's t distribution is less than or equal to a given value.
The sum function returns the sum of all the values in an expression. Its argument is any valid SQL expression.
The uss function returns the uncorrected sum of squares of all the values in an expression. Null values and SAS missing values are ignored and are not included in the computation.
The variance function returns a measure of the dispersion of all values in an expression. Null values and SAS missing values are ignored and are not included in the computation.
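Several names in the list above (avg, css, nmiss, std, uss) are not base R functions, and in R stderr() refers to the error output stream while range() returns the minimum and maximum rather than their difference. A rough sketch of how the same quantities might be computed in base R follows; the data vector is invented, and the sample (n - 1) denominator for the variance is an assumption about the intended definition.

```r
x  <- c(2, 4, 6, 3, 1, NA)         # made-up data with one missing value
ok <- x[!is.na(x)]                 # drop the missing value explicitly
n  <- length(ok)

avg_x    <- mean(ok)                   # avg: 3.2
nmiss_x  <- sum(is.na(x))              # nmiss: 1
sum_x    <- sum(ok)                    # sum: 16
uss_x    <- sum(ok^2)                  # uss: 66
css_x    <- sum((ok - mean(ok))^2)     # css: 14.8
var_x    <- var(ok)                    # variance: 3.7 (n - 1 denominator)
sd_x     <- sd(ok)                     # std / stddev: sqrt(3.7)
se_x     <- sd(ok) / sqrt(n)           # standard error of the mean
range_x  <- diff(range(ok))            # range as a single number: 5
```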
We list other functions.
abs(-513);
beta(5,3);
ceil(-1+1.e-11);
ceilz(2.1);
character_length('December');
current_date;
current_time;
current_timestamp;
cv(5,8,9,6,.);
degrees(2*pi());
floor(2.95);
floorz(-2.4);
gamma(6);
gcd(5,25);
geomean(1,2,2,4);
harmean(1,2,4,4);
iqr(2,4,1,3,999999);
largest(1, 456, 789, .Q, 123);
The abs function returns the absolute value of a numeric value expression.
The beta function returns the value of the beta function: \( B(a,b) = \int_0^1 x^{a-1} \left( 1 -x \right)^{b-1} {\text d}x . \)
The ceil function returns the smallest integer greater than or equal to a numeric value expression.
The ceilz function returns the smallest integer that is greater than or equal to the argument, using zero fuzzing.
The character_length function returns the number of characters in a string of any data type.
The current_date function returns the current date for the time zone.
The current_time function returns the current time for your local zone.
The current_timestamp function returns the date and time for your time zone.
The cv function returns the coefficient of variation.
The floor function returns the largest integer less than or equal to a numeric value expression.
The floorz function returns the largest integer that is less than or equal to the argument, using zero fuzzing.
The gamma function returns the value of the gamma function: \( \Gamma (x) = \int_0^{\infty} t^{x-1} e^{-t} {\text d} t . \)
The gcd function returns the greatest common divisor for a set of integers.
The geomean function returns the geometric mean of n numbers: \( \sqrt[n]{x_1 \cdot x_2 \cdots x_n} . \)
The harmean function returns the harmonic mean of n numbers: \( \frac{n}{x_1^{-1} + x_2^{-1} + \cdots + x_n^{-1}} . \)
The iqr function returns the interquartile range.
The largest function returns the kth largest non-null or nonmissing value.
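Some of these have direct base R counterparts (abs, beta, floor, gamma), while others go by different names (ceiling, nchar, IQR, Sys.Date) or, like the geometric and harmonic means, can be computed from the formulas above. The following is a sketch under those assumptions, reusing the arguments from the list.

```r
abs(-513)                          # 513
b53 <- beta(5, 3)                  # 1/105
ceiling(2.1)                       # 3
floor(2.95)                        # 2
g6 <- gamma(6)                     # 120, i.e. 5!
nchar("December")                  # 8
today <- Sys.Date()                # current date
now   <- Sys.time()                # current date and time
deg <- 2 * pi * 180 / pi           # 360: radians to degrees by hand

x <- c(1, 2, 2, 4)
geo <- exp(mean(log(x)))           # geometric mean: 2
y <- c(1, 2, 4, 4)
har <- length(y) / sum(1 / y)      # harmonic mean: 2
iqr_val <- IQR(c(2, 4, 1, 3, 999999))   # 2 with R's default quantile method
```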
R has the following basic logical operations.
Operation | Operator | Example | Output |
---|---|---|---|
less than | < | 3 < 5 | TRUE |
less than or equal to | <= | 5 <= 7 | TRUE |
greater than | > | 3 > 5 | FALSE |
greater than or equal to | >= | 5 >= 5 | TRUE |
equal to | == | 3 == 5 | FALSE |
not equal to | != | 3 != 5 | TRUE |
not | ! | !(2==2) | FALSE |
and | & | (2==2) & (2==5) | FALSE |
or | \| | (2==2) \| (2==5) | TRUE |
Logical functions:
band(11,3);
blshift(5,3);
bxor(128,64);
ifnull(AvgHigh, 0)
The band function returns the bitwise logical AND of two arguments.
The blshift function returns the bitwise logical left shift of two arguments.
The bxor function returns the bitwise logical EXCLUSIVE OR of two arguments.
The ifnull function checks the value of the first expression and, if it is null or a SAS missing value, returns the second expression.
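The four names above are not base R functions. In base R, the corresponding bitwise operations are bitwAnd(), bitwShiftL() and bitwXor(), and a null-replacement like ifnull can be written with is.na() and ifelse(). A sketch, where the avg_high vector is a hypothetical stand-in for the AvgHigh column:

```r
and_val   <- bitwAnd(11L, 3L)      # 3   (1011 AND 0011 = 0011)
shift_val <- bitwShiftL(5L, 3L)    # 40  (5 * 2^3)
xor_val   <- bitwXor(128L, 64L)    # 192

avg_high <- c(71, NA, 65)                          # hypothetical column with a missing value
filled   <- ifelse(is.na(avg_high), 0, avg_high)   # 71 0 65
```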
You can store vectors of logical values in exactly the same way that you can store vectors of numbers.
You can apply a logical operator to a vector. This might not make a lot of sense to you, so let us unpack it slowly. First, we define a vector of numbers, "sales by month", and we check whether a profit was made in each month.
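A sketch of this idea, with invented monthly figures and "profit" taken to mean a value greater than zero (an assumption made only for this example).

```r
# Hypothetical sales figures for six months.
sales_by_month <- c(0, 100, 200, 50, 0, 0)

# The comparison is applied to every element and returns a logical vector.
profitable <- sales_by_month > 0       # FALSE TRUE TRUE TRUE FALSE FALSE

# Logical vectors are stored and used like any other vector.
n_profitable <- sum(profitable)        # 3: TRUE counts as 1
good_months  <- which(profitable)      # 2 3 4
sales_by_month[profitable]             # 100 200 50
```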