Part 1, Section 1: Modifying Data
This chapter covers basic information regarding the methods used by R for organizing and graphing data.
Projects often involve so much data that it is difficult to analyze all of the data at once. We present some methods to manipulate data in order to make the data more manageable.
Subsection: Types of Data
Data can be classified in several ways, based on how they are collected and on the type of data collected.
Subsection: Modifying Data in Excel
Subsection: Sorting and Filtering Data
R contains useful features for sorting and filtering data so that one can more easily identify patterns.
For example, suppose we want to sort a data frame by column z (descending) and then by column b (ascending).
You can use the order() function directly, without resorting to add-on tools, by using a trick from the top of the example(order) code.
The same approach works for a matrix, except that you cannot use with(). For example, create a matrix M with M <- matrix(c(1,2,2,2,3,6,4,5), 4, 2, byrow=FALSE, dimnames=list(NULL, c("a","b"))), then use M[order(M[,"a"],-M[,"b"]),] to order it on two columns.
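A minimal sketch of this approach follows. The data frame dd and all of its values are assumptions made for illustration, because the example data is not shown here; only the column names b and z come from the text. Negating a numeric column inside order() sorts that column in descending order.

```r
# Hypothetical data frame: column names b and z follow the text, values are invented.
dd <- data.frame(
  b = factor(c("Hi", "Med", "Hi", "Low"),
             levels = c("Low", "Med", "Hi"), ordered = TRUE),
  x = c("A", "D", "A", "C"),
  y = c(8, 3, 9, 9),
  z = c(1, 1, 1, 2)
)

# with() lets order() refer to the columns by name.
# -z sorts z in descending order; ties are broken by b in ascending order.
sorted_dd <- dd[with(dd, order(-z, b)), ]
print(sorted_dd)
```

The row with the largest z comes first, and rows that share a z value are arranged by the level order of b (Low, then Med, then Hi).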
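The matrix case can be run as a short script. This sketch only spells out the two expressions quoted above; the name M_sorted is introduced here for illustration.

```r
# Build the 4 x 2 matrix with named columns a and b.
M <- matrix(c(1, 2, 2, 2, 3, 6, 4, 5), 4, 2, byrow = FALSE,
            dimnames = list(NULL, c("a", "b")))

# Sort by column a (ascending), breaking ties by column b (descending).
# with() is not available for matrices, so columns are indexed explicitly.
M_sorted <- M[order(M[, "a"], -M[, "b"]), ]
print(M_sorted)
```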
To begin understanding how to properly sort data frames in R, we of course must first generate a data frame to manipulate.
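The following sketch shows one way such a script might look. The column names x and y and all of the values are assumptions chosen for illustration; only the object name dataframe and the column z are taken from the text.

```r
# run.R
# Generate a small data frame to sort (illustrative values).
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a", "d", "b", "c"),
  z = c(2, 1, 2, 1)
)

# Print the data frame in the order the rows were entered.
print(dataframe)
```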
Executing our run.R script outputs the list of vectors in our data frame as expected, in the order they were entered.
The Order Function
The order function is perhaps not the easiest sorting method to type out in terms of syntax, but it is the one most readily available: it is part of the base package, so every installation of R includes it.
The order function accepts a number of arguments, but at the simplest level its first arguments are one or more vectors (numeric, complex, character, or logical) of the same length. It returns the permutation of indices that puts those values in sorted order.
RStudio includes a data viewer that allows you to look inside data frames and other rectangular data structures. The viewer also includes some simple exploratory data analysis (EDA) features that can help you understand the data as you manipulate it with R. You can invoke the viewer from the console by calling the View function on the data frame you want to look at. For instance, to view the built-in iris dataset, run View(iris).
Sorting a Data Frame by Vector Name
As you might expect, you can sort by any column in the viewer just by clicking on the column. Click on a column that’s already sorted to reverse the sort direction.
For example, we can use order() to simply sort a vector of five randomly ordered numbers with this script:
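A minimal sketch of such a script, with five arbitrarily chosen numbers standing in for the random values:

```r
# Five numbers in no particular order (illustrative values).
x <- c(3, 1, 5, 2, 4)
print(x)

# order() returns the positions of the elements in ascending order,
# not the sorted values themselves.
idx <- order(x)
print(idx)

# Indexing the vector with those positions yields the sorted vector.
x_sorted <- x[idx]
print(x_sorted)
```

Note that order(x) returns indices, so x[order(x)] gives the same result as sort(x).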
Executing the script, we see the initial output of the unordered vector, followed by the now ordered list afterward.
With the order() function in our tool belt, we’ll start sorting our data frame by passing in the vector names within the data frame.
For example, using our previously generated dataframe object, we can sort by the vector z by adding the following code to our script:
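A sketch of that step is shown here. The data frame is rebuilt with illustrative values so that the snippet runs on its own; the columns x and y and every value are assumptions, and sorted_by_z is a name introduced for the example.

```r
# Illustrative data frame (values are assumptions).
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a", "d", "b", "c"),
  z = c(2, 1, 2, 1)
)
print(dataframe)

# Sort the rows by the z column, referring to z by name through with().
sorted_by_z <- dataframe[with(dataframe, order(z)), ]
print(sorted_by_z)
```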
What we’re effectively doing is indexing our original dataframe object with the new row order that we’d like to have. This row order is generated using the with() function, which evaluates an expression (the second argument) in an environment constructed from the data passed as the first argument.
Thus, we’re evaluating order() on the z vector within the data frame. This returns a new index order for the rows, which is then used within the brackets of dataframe[] to output the sorted result.
Consequently, we see our original unordered output, followed by a second output with the data sorted by column z.
Sorting by Column Index
Similar to the above method, it’s also possible to sort based on the numeric index of a column in the data frame, rather than the specific name.
Instead of using the with() function, we can pass the result of order() directly as the row index of our dataframe. We indicate that we want to sort by the column of index 1 by using the dataframe[,1] syntax, which returns the values of that first column as a vector. In other words, similar to when we passed in the z vector name above, order is sorting based on the vector values that are within the column of index 1:
As expected, we get our normal output, followed by the output sorted by the first column.
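A sketch of sorting by column index, again with an illustrative data frame (the columns x and y, all values, and the name sorted_by_first are assumptions):

```r
# Illustrative data frame (values are assumptions).
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a", "d", "b", "c"),
  z = c(2, 1, 2, 1)
)
print(dataframe)

# dataframe[, 1] extracts the first column as a vector;
# order() then gives the row positions that sort that column.
sorted_by_first <- dataframe[order(dataframe[, 1]), ]
print(sorted_by_first)
```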
Sorting by Multiple Columns
In some cases, it may be desired to sort by multiple columns. Thankfully, doing so is very simple with the previously described methods.
To sort multiple columns using vector names, simply add additional arguments to the order() function call as before:
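For instance, the sketch below sorts by z and breaks ties with y. The data frame values are assumptions, chosen so that z contains ties and the second sort key has a visible effect.

```r
# Illustrative data frame with repeated values in z (values are assumptions).
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a", "d", "b", "c"),
  z = c(2, 1, 2, 1)
)

# Sort by z first; rows with equal z are then ordered by y.
sorted_by_z_y <- dataframe[with(dataframe, order(z, y)), ]
print(sorted_by_z_y)
```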
Similarly, to sort by multiple columns based on column index, add additional arguments to order() with differing indices:
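The index-based equivalent might look like the following sketch, which sorts by column 3 and then column 2 of an illustrative data frame (all values are assumptions):

```r
# Illustrative data frame with repeated values in column 3 (values are assumptions).
dataframe <- data.frame(
  x = c("apple", "orange", "banana", "strawberry"),
  y = c("a", "d", "b", "c"),
  z = c(2, 1, 2, 1)
)

# Sort by column 3 first; ties are broken by column 2.
sorted_by_index <- dataframe[order(dataframe[, 3], dataframe[, 2]), ]
print(sorted_by_index)
```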
How to sort in decreasing order
Just like sort(), the order() function also takes an argument called decreasing. For example, to sort some.states in decreasing order of population:
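A sketch of this call follows. How some.states is built is not shown in the text, so constructing it from the first ten rows of the built-in state.x77 dataset is an assumption made purely for illustration.

```r
# Hypothetical construction of some.states from the built-in state.x77 matrix.
some.states <- as.data.frame(state.x77[1:10, c("Population", "Income")])

# decreasing = TRUE reverses the sort, so the most populous state comes first.
by_population <- some.states[order(some.states$Population, decreasing = TRUE), ]
print(by_population)
```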
Suppose we want to sort a vector, matrix, or data frame.
You’ll see the age of the first tree change from 118 to 120 in the viewer.
This auto-refreshing feature has some prerequisites, so if it doesn’t seem to be working:
You must call View() on a variable directly. If, for instance, you call View(as.data.frame(foo)) or View(rbind(foo, bar)), you’re invoking View() on a new object created by evaluating your expression, and while that object contains data, it’s just a copy and won’t update when foo and bar do.
The number of rows the viewer can display is effectively unbounded, and large numbers of rows won’t slow down the interface. It uses the DataTables JavaScript library to virtualize scrolling, so only a few hundred rows are actually loaded at a time.
While rows are unbounded, columns are capped at 100. It’s not currently possible to virtualize columns in the same way as rows, and large numbers of columns cause the interface to slow significantly.
Finally, while we’ve made every effort to keep things speedy, very large amounts of data may cause sluggishness, especially when a sort or filter is applied, as this requires R to fully scan the frame. If you’re working with a large frame, try applying filters to reduce it to the subset you’re interested in to improve performance.
Data frames
To sort a data frame on one or more columns, you can use the arrange function from the plyr package, or use R’s built-in functions. The arrange function is much easier to use, but does require the external package to be installed.
Note that the size column is a factor and is sorted by the order of the factor levels. In this case, the levels were automatically assigned alphabetically (when creating the data frame), so large is first and small is last. In R 4.0.0 and later, data.frame() converts strings to factors only when stringsAsFactors = TRUE is specified.
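The factor behaviour can be seen with R’s built-in functions alone. In this sketch the data frame df and its values are assumptions made for illustration; only the column names weight and size come from the text. stringsAsFactors = TRUE is set explicitly so that size becomes a factor regardless of the R version.

```r
# Hypothetical data frame; size is converted to a factor on creation.
df <- data.frame(
  id     = 1:4,
  weight = c(20, 27, 24, 22),
  size   = c("small", "large", "medium", "large"),
  stringsAsFactors = TRUE
)

# Levels are assigned alphabetically: large, medium, small.
print(levels(df$size))

# Sort by size (factor level order), then by weight within each size.
by_size <- df[order(df$size, df$weight), ]
print(by_size)
```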
Reverse sort
The overall order of the sort can be reversed with the argument decreasing=TRUE.
To reverse the direction of a particular column, the method depends on the data type:
- Numbers: put a - in front of the variable name, e.g. df[ order(-df$weight), ].
- Factors: convert to integer and put a - in front of the variable name, e.g. df[ order(-xtfrm(df$size)), ].
- Characters: there isn’t a simple way to do this. One method is to convert to a factor first and then sort as above.
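The three cases above can be sketched as follows. The data frame df, its values, and the character column owner are hypothetical and exist only to make the example runnable.

```r
# Hypothetical data frame with a numeric column and a factor column.
df <- data.frame(
  id     = 1:4,
  weight = c(20, 27, 24, 22),
  size   = c("small", "large", "medium", "large"),
  stringsAsFactors = TRUE
)
# Add a character column afterwards so it is not converted to a factor.
df$owner <- c("dan", "amy", "cat", "bob")

# Numbers: negate the column to sort it in descending order.
by_weight_desc <- df[order(-df$weight), ]

# Factors: xtfrm() gives the integer codes, which can then be negated.
by_size_desc <- df[order(-xtfrm(df$size)), ]

# Characters: convert to a factor first, then proceed as for factors.
by_owner_desc <- df[order(-xtfrm(factor(df$owner))), ]

print(by_weight_desc)
print(by_size_desc)
print(by_owner_desc)
```

Negating individual keys is useful when several columns are passed to order() and only some of them should be reversed; decreasing=TRUE alone reverses the whole sort.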
-------------------------------------------------------
http://sites.stat.psu.edu/~drh20/R/html/base/html/sort.html
Conditional Formatting in Data Frames
https://www.rdocumentation.org/packages/condformat/versions/0.7.0
The example is properly formatted at http://zeehio.github.io/condformat.
R-script:
---------------------------------------------------
data(iris)
library(condformat)
condformat(iris[c(1:5, 70:75, 120:125), ]) %>%
  rule_fill_discrete(Species) %>%
  rule_fill_discrete(c(Sepal.Width, Sepal.Length),
                     expression = Sepal.Width > Sepal.Length - 2.25,
                     colours = c("TRUE" = "#7D00FF")) %>%
  rule_fill_gradient2(Petal.Length) %>%
  rule_css(Sepal.Length,
           expression = ifelse(Species == "setosa", "bold", "regular"),
           css_field = "font-weight") %>%
  rule_css(Sepal.Length,
           expression = ifelse(Species == "setosa", "yellow", "black"),
           css_field = "color")
--------------------------------------------------------
More Examples on Styling Cells, Rows, and Tables:
https://rstudio.github.io/DT/010-style.html
Filtering data:
https://stackoverflow.com/questions/1686569/filter-data-frame-rows-by-a-logical-condition
http://www.rexamples.com/11/Filtering%20data
https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/filter
https://blog.exploratory.io/filter-data-with-dplyr-76cf5f1a258e
https://stats.stackexchange.com/questions/6187/filtering-a-dataframe