Set Operations
A set is a basic concept in mathematics and has no direct definition.
Informally, a set is a collection of items, called elements. We often use capital letters to denote a set. To define a set we can simply list all its elements in curly brackets; for example, to define a set A that consists of the two elements ♣ and ♦, we write A = { ♣, ♦ }. To say that ♦ belongs to A, we write ♦ ∈ A, where "∈" is pronounced "belongs to." To say that an element does not belong to a set, we use ∉. For example, we may write ♥ ∉ A.
Note that ordering does not matter, so the two sets { ♣, ♦ } and { ♦, ♣ } are equal. We often work with sets of numbers. Some important sets are given in the following example.
- The set of real numbers ℝ.
- The set of natural numbers, ℕ = { 1, 2, 3, ... }.
- The set of rational numbers, ℚ.
- The set of integers, ℤ = { ..., -3, -2, -1, 0, 1, 2, 3, ... }.
- Closed intervals on the real line. For example, [1,7] is the set of all real numbers x such that 1 ≤ x ≤ 7.
- Open intervals on the real line. For example (1,7) is the set of all real numbers x such that 1 < x < 7.
- Similarly, [1,7) is the set of all real numbers x such that 1 ≤ x < 7.
- An interval on the real line whose endpoints may or may not be included. For example, |1,7| denotes any one of the following intervals: [1,7], (1,7), (1,7], or [1,7).
- The set of complex numbers ℂ is the set of numbers of the form a + bj, where a, b are real numbers, and j is the unit vector in the positive vertical direction (along the ordinate), so j² = -1.
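Membership, equality of sets, and interval conditions can be tried out in R, the language used later on this page. The following is only a sketch: the strings "club", "diamond" and "heart" are stand-ins chosen here for the card symbols, and in_interval is a hypothetical helper name.

```r
# Stand-in names for the card symbols (an assumption made for this sketch)
A <- c("club", "diamond")

is_member     <- "diamond" %in% A      # diamond belongs to A
is_not_member <- !("heart" %in% A)     # heart does not belong to A

# Order does not matter: both vectors describe the same set
same_set <- setequal(c("club", "diamond"), c("diamond", "club"))

# Half-open interval [1, 7): all real x with 1 <= x < 7
in_interval <- function(x) x >= 1 & x < 7
inside <- in_interval(c(1, 6.9, 7))

print(c(is_member, is_not_member, same_set))
print(inside)
```

Note that an R vector is not a set: it keeps order and duplicates, which is why setequal() is used instead of identical().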
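We can also define a set by mathematically stating the properties satisfied by the elements in the set. In particular, we may write A = { x | x satisfies some property } or A = { x : x satisfies some property }, where the symbols "|" and ":" are read "such that."
- If the set A is defined as A = {x | x∈ℤ, -2 ≤ x ≤ 3}, then A = {-2, -1, 0, 1, 2, 3}.
- The set of rational numbers can be defined as ℚ = { a/b : a,b ∈ ℤ, b ≠ 0}.
A set A is a subset of a set B, written A ⊂ B, if every element of A is also an element of B. For example:
- If A = { 3, 4 } and B = { 3, 4, 7 }, then A ⊂ B;
- ℕ ⊂ ℤ;
- ℚ ⊂ ℝ.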
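The two ideas in the lists above, defining a set by a property and checking that one set is contained in another, can be sketched in R. The window -10:10 stands in for the integers, which cannot be enumerated in full, and is_subset is a hypothetical helper written for this illustration.

```r
# Set-builder notation: A = {x | x is an integer, -2 <= x <= 3}
candidates <- -10:10    # finite window of integers (assumed for illustration)
A <- Filter(function(x) x >= -2 && x <= 3, candidates)
print(A)

# Subset test: every element of X is also an element of Y
is_subset <- function(X, Y) all(X %in% Y)

small_in_large <- is_subset(c(3, 4), c(3, 4, 7))
large_in_small <- is_subset(c(3, 4, 7), c(3, 4))
print(c(small_in_large, large_in_small))
```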
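In axiomatic set theory, Russell's paradox prevents the existence of a universal set (a set containing all objects) in Zermelo-Fraenkel set theory and in other set theories that include Zermelo's axiom of comprehension. We will use the term in a more restrictive sense: the universal set is the set of all elements under consideration in a given context. In probability theory, the universal set is called the sample space and is usually denoted by Ω.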
Similarly, we can define the union of three or more sets. In particular, if A1, A2, ..., An are n sets, their union A1 ∪ A2 ∪ ... ∪ An is the set containing all elements that are in at least one of the sets. We can write this union more compactly as \( \bigcup_{i=1}^n A_i . \)
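In R, the union of several sets can be obtained by folding the two-set union() over a list. This is a minimal sketch with three made-up sets; the names A1, A2, A3 and their elements are chosen only for illustration.

```r
# Three example sets (made-up values)
sets <- list(A1 = c(1, 2), A2 = c(2, 3), A3 = c(5, 6))

# A1 U A2 U A3: apply the pairwise union() repeatedly
big_union <- Reduce(union, sets)
print(big_union)

# Same elements as "everything that appears in at least one set"
in_at_least_one <- unique(unlist(sets))
print(setequal(big_union, in_at_least_one))
```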
Theorem: (De Morgan's law) For two sets A and B
- \( \left( A \cup B \right)^c = A^c \cap B^c ; \)
- \( \left( A \cap B \right)^c = A^c \cup B^c . \)
Theorem: (Distributive law) For any three sets A, B and C
- \( A \cap \left( B \cup C \right) = \left( A \cap B \right) \cup \left( A \cap C \right) ; \)
- \( A \cup \left( B \cap C \right) = \left( A \cup B \right) \cap \left( A \cup C \right) . \)
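De Morgan's laws and the distributive laws can be checked numerically on small sets. The sketch below assumes a universal set of the integers 1 to 9 and three example sets; the set C and the helper complement are hypothetical choices made here. A check on one example is of course not a proof.

```r
U <- 1:9                   # universal set (assumed for this sketch)
A <- c(1, 2, 3, 4, 5)
B <- c(2, 4, 6, 8)
C <- c(2, 3, 7)            # hypothetical third set

# Complement relative to the universal set U
complement <- function(X) setdiff(U, X)

de_morgan_1 <- setequal(complement(union(A, B)),
                        intersect(complement(A), complement(B)))
de_morgan_2 <- setequal(complement(intersect(A, B)),
                        union(complement(A), complement(B)))
distributive_1 <- setequal(intersect(A, union(B, C)),
                           union(intersect(A, B), intersect(A, C)))
distributive_2 <- setequal(union(A, intersect(B, C)),
                           intersect(union(A, B), union(A, C)))

print(c(de_morgan_1, de_morgan_2, distributive_1, distributive_2))
```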
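Example: If the universal set is Ω = { 1, 2, ..., 9 }, A = { 1, 2, 3, 4, 5 }, B = { 2, 4, 6, 8 }, and C is a set whose only elements in common with A are 2 and 3, then
- \( A \cup B = \left\{ 1, 2, 3, 4, 5, 6, 8 \right\} ; \)
- \( A \cap C = \left\{ 2, 3\right\} ; \)
- \( A^c = \left\{ 6, 7, 8, 9 \right\} ; \)
- \( B^c = \left\{ 1, 3, 5, 7, 9 \right\} . \)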
Theorem: (Inclusion-exclusion principle) For finite sets A, B and C, where |A| denotes the number of elements of A,
- \( \left\vert A \cup B \right\vert = |A| + |B| - \left\vert A \cap B \right\vert ; \)
- \( \left\vert A \cup B \cup C \right\vert = |A| + |B| +|C| - \left\vert A \cap B \right\vert - \left\vert A \cap C \right\vert - \left\vert C \cap B \right\vert + \left\vert A \cap B \cap C \right\vert \)
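Both counting formulas can be verified on small examples in R, where length() plays the role of |·| as long as the vectors contain no repeated elements. The sets below are made up for this sketch.

```r
# Example sets without repeated elements (made-up values)
A <- c(1, 2, 3, 4, 5)
B <- c(2, 4, 6, 8)
C <- c(2, 3, 7)

# Two sets: |A U B| = |A| + |B| - |A n B|
lhs2 <- length(union(A, B))
rhs2 <- length(A) + length(B) - length(intersect(A, B))

# Three sets: add back the triple intersection
lhs3 <- length(Reduce(union, list(A, B, C)))
rhs3 <- length(A) + length(B) + length(C) -
  length(intersect(A, B)) - length(intersect(A, C)) - length(intersect(C, B)) +
  length(Reduce(intersect, list(A, B, C)))

print(c(lhs2 = lhs2, rhs2 = rhs2, lhs3 = lhs3, rhs3 = rhs3))
```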
It is very convenient to illustrate the interrelation of sets using Venn diagrams. These diagrams depict elements as points in the plane, and sets as regions inside closed curves. A Venn diagram consists of multiple overlapping closed curves, usually circles, each representing a set. Venn diagrams were conceived around 1880 by the English mathematician, logician and philosopher John Venn (1834--1923). Venn himself did not use the term "Venn diagram" and referred to his invention as "Eulerian Circles." The term "Venn diagram" was later used by the American academic philosopher Clarence Irving Lewis in 1918, in his book "A Survey of Symbolic Logic."
Venn diagrams can be created in R using the limma package, written as part of the Bioconductor Project. The biocLite.R script used in older instructions has been superseded by the BiocManager package, so to install limma from the R command line, type
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("limma", "statmod"))
The output from these calls indicates the installation of the limma package. Once it is installed, load it:
library(limma)
We can now use the commands in this package for generating Venn diagrams. The data needed for a Venn diagram consists of a set of binary variables indicating membership. We will be using the hsb2 (https://stats.idre.ucla.edu/hsb2-3.csv) dataset from the Institute for Digital Research and Education consisting of data from 200 students including scores from writing, reading, and math tests. We will create indicators for “high” values in each of these variables and generate Venn diagrams that tell us about the degree of overlap in high math, writing, and reading scores.
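The indicator variables described above can be built with simple comparisons. The sketch below is an assumption-laden illustration: the cutoff of 60 for a "high" score is chosen here and is not stated in this section, and the small scores data frame is a made-up stand-in for hsb2 (with the real data, replace scores by the data frame read from the CSV file). The names hw, hm, hr are hypothetical; c3 is the matrix name used in the calls that follow.

```r
# Made-up stand-in for the hsb2 data (hypothetical values, 6 students)
scores <- data.frame(
  write = c(65, 52, 59, 62, 44, 67),
  math  = c(61, 45, 63, 58, 40, 70),
  read  = c(63, 47, 55, 68, 42, 57)
)

cutoff <- 60    # assumed threshold for a "high" score

# One logical indicator per test: TRUE if the score is high
hw <- scores$write >= cutoff
hm <- scores$math  >= cutoff
hr <- scores$read  >= cutoff

# Binary membership matrix: one row per student, one column per set
c3 <- cbind(hw, hm, hr)
print(c3)
```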
Next, we use the vennCounts command to impose the structure needed to generate the Venn diagram.
a <- vennCounts(c3)
a
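To see what kind of structure is involved, the same sort of table (one row per combination of memberships plus a count) can be built with base R alone. This is a sketch of the idea, not limma's implementation; the indicator matrix below is made up, and the column name Counts is chosen to resemble the vennCounts output.

```r
# Made-up indicator matrix with the same shape as c3 (6 students, 3 sets)
c3 <- cbind(
  hw = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE),
  hm = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE),
  hr = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# All 2^3 = 8 membership patterns
patterns <- expand.grid(hw = c(FALSE, TRUE),
                        hm = c(FALSE, TRUE),
                        hr = c(FALSE, TRUE))

# Count the rows of c3 that match each pattern exactly
patterns$Counts <- apply(patterns[, 1:3], 1, function(p) {
  sum(apply(c3, 1, function(row) all(row == p)))
})
print(patterns)
```

Each count corresponds to one region of a three-set Venn diagram, including the outside region where all indicators are FALSE.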
We can now generate our Venn diagram with the vennDiagram command:
vennDiagram(a)
While some of the options for the vennDiagram command are specific to tests run on microarray data, we can change some of the formatting. Below, we add names to the groups, we change the relative size of the labels and counts, and we opt for the counts to appear in red.
vennDiagram(a, include = "both",
names = c("High Writing", "High Math", "High Reading"),
cex = 1, counts.col = "red")
One can make other plots:
# read the data; the plotting calls below take hsb2 through their data argument, so attach() is not needed
hsb2 <- read.table('https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hsb2-2.csv', header=TRUE, sep=",")
library(lattice)
#defining ses.f to be a factor variable
hsb2$ses.f = factor(hsb2$ses, labels=c("low", "middle", "high"))
#histograms
histogram(~write, hsb2)
#conditional plot
histogram(~write | ses.f, hsb2)
Density plot:
densityplot(~socst, hsb2)
#conditional plot
densityplot(~socst | ses.f, hsb2)
Quantile-quantile plots
qqmath(~write, hsb2)
#conditional plot
qqmath(~write | ses.f, hsb2)
Box and whiskers plots
bwplot(~math, hsb2)
There is also a dedicated package, VennDiagram, by Hanbo Chen. After installing it with install.packages("VennDiagram"), load it and start by drawing a simple circle or ellipse:
library(VennDiagram)
grid.newpage()
draw.single.venn(22, category = "Dog People", lty = "blank", fill = "cornflower blue",
alpha = 0.5)
Creating a Venn Diagram with two circles
grid.newpage()
draw.pairwise.venn(area1 = 22, area2 = 20, cross.area = 11, category = c("Dogs",
"Cats"))
Adding colour & moving titles
grid.newpage()
draw.pairwise.venn(22, 20, 11, category = c("Dogs", "Cats"), lty = rep("blank",
2), fill = c("light blue", "pink"), alpha = rep(0.5, 2), cat.pos = c(0,
0), cat.dist = rep(0.025, 2))
or
grid.newpage()
venn.plot <- draw.pairwise.venn(area1 = 100,
area2 = 70,
cross.area = 68,
category = c("First", "Second"),
fill = c("blue", "red"),
lty = "blank",
cex = 2,
cat.cex = 2,
cat.pos = c(285, 105),
cat.dist = 0.09,
cat.just = list(c(-1, -1), c(1, 1)),
ext.pos = 30,
ext.dist = -0.05,
ext.length = 0.85,
ext.line.lwd = 2,
ext.line.lty = "dashed"
)
We remove scaling
grid.newpage()
draw.pairwise.venn(22, 20, 11, category = c("Dog People", "Cat People"), lty = rep("blank",
2), fill = c("light blue", "pink"), alpha = rep(0.5, 2), cat.pos = c(0,
0), cat.dist = rep(0.025, 2), scaled = FALSE)
We make two non-overlapping circles
grid.newpage()
draw.pairwise.venn(area1 = 22, area2 = 6, cross.area = 0, category = c("Dog People",
"Snake People"), lty = rep("blank", 2), fill = c("light blue", "green"),
alpha = rep(0.5, 2), cat.pos = c(0, 180), euler.d = TRUE, sep.dist = 0.03,
rotation.degree = 45)
Creating a Venn Diagram with three circles
grid.newpage()
draw.triple.venn(area1 = 22, area2 = 20, area3 = 13, n12 = 11, n23 = 4, n13 = 5,
n123 = 1, category = c("Dogs", "Cats", "Lizards"), lty = "blank",
fill = c("skyblue", "pink1", "mediumorchid"))
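The area arguments of draw.triple.venn are total sizes of each set and of each overlap, so they can be computed from the sets themselves with length() and intersect(). The sketch below uses made-up sets of person IDs (the names dogs, cats, lizards and their values are hypothetical) and only draws the diagram if the VennDiagram package is installed.

```r
# Made-up sets of person IDs (hypothetical data)
dogs    <- c(1, 2, 3, 4, 5, 6, 10)
cats    <- c(4, 5, 6, 7, 8)
lizards <- c(6, 8, 9, 10)

# Sizes of the sets and of their overlaps, named as draw.triple.venn expects
areas <- list(
  area1 = length(dogs),
  area2 = length(cats),
  area3 = length(lizards),
  n12   = length(intersect(dogs, cats)),
  n23   = length(intersect(cats, lizards)),
  n13   = length(intersect(dogs, lizards)),
  n123  = length(Reduce(intersect, list(dogs, cats, lizards)))
)
str(areas)

# Draw only when the package is available
if (requireNamespace("VennDiagram", quietly = TRUE)) {
  grid::grid.newpage()
  do.call(VennDiagram::draw.triple.venn,
          c(areas, list(category = c("Dogs", "Cats", "Lizards"))))
}
```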
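Let's speed up the nrow(subset(…)) process for the area counts.
The likes function below returns the count for a circle or an overlap, that is, how many people like all of the given animals. It takes the first letter(s) of the animal(s) in lower case, e.g. c("d", "c").
likes <- function(animals) {
# d is assumed to be the survey data frame: a person column followed by
# one logical column per animal (dogs, cats, snakes, lizards)
ppl <- d
names(ppl) <- c("p", "d", "c", "s", "l")
# keep only the people who like every requested animal
for (animal in animals) {
ppl <- ppl[ppl[[animal]] == TRUE, , drop = FALSE]
}
nrow(ppl)
}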
# How many people like dogs?
likes("d")
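The data frame d used by likes is not shown in this section. As a minimal sketch of how such a count works, the block below builds a made-up survey (the names and answers are hypothetical) and uses likes_in, a variant written here that receives the data as an argument instead of reading the global d.

```r
# Made-up survey: one row per person, TRUE where the person likes the animal
d <- data.frame(
  person = c("Ann", "Bob", "Cy", "Di"),
  dog    = c(TRUE, TRUE, FALSE, TRUE),
  cat    = c(TRUE, FALSE, TRUE, TRUE),
  snake  = c(FALSE, FALSE, TRUE, FALSE),
  lizard = c(FALSE, TRUE, FALSE, TRUE)
)

# Hypothetical variant of likes(): data passed explicitly
likes_in <- function(data, animals) {
  names(data) <- c("p", "d", "c", "s", "l")
  keep <- rep(TRUE, nrow(data))
  # a person is kept only if they like every requested animal
  for (animal in animals) keep <- keep & data[[animal]]
  sum(keep)
}

dog_people     <- likes_in(d, "d")            # size of the dog circle
dog_and_cat    <- likes_in(d, c("d", "c"))    # size of the dog/cat overlap
print(c(dog_people, dog_and_cat))
```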
grid.newpage()
venn.plot <- draw.triple.venn(area1 = 4,
area2 = 3,
area3 = 4,
n12 = 2,
n23 = 2,
n13 = 2,
n123 = 1,
category = c('A', 'B', 'C'),
fill = c('red', 'blue', 'green'),
cat.col = c('red', 'blue', 'green'),
cex = c(1/2,2/2,3/2,4/2,5/2,6/2,7/2),
cat.cex = c(1,2,3),
euler = TRUE,
scaled = FALSE
)
A different package, venneuler, is a lot cleaner to code: it can take the actual dataset and work out the areas itself. However, you have to install this package first, along with rJava:
install.packages("venneuler")
install.packages("rJava")
rJava needs a Java installation. Download Java (choose the 64-bit version) from this page: https://www.java.com/en/download/manual.jsp
The package eulerr plots Venn diagrams. It is quite similar to venneuler but without its inconsistencies.
library(eulerr)
fit <- euler(c(A = 450, B = 1800, "A&B" = 230))
plot(fit)
One can also use the venn() function from the gplots package: http://www.inside-r.org/packages/cran/gplots/docs/venn
require(gplots)
## construct some fake gene names..
oneName <- function() paste(sample(LETTERS,5,replace=TRUE),collapse="")
geneNames <- replicate(1000, oneName())
##
GroupA <- sample(geneNames, 400, replace=FALSE)
GroupB <- sample(geneNames, 750, replace=FALSE)
GroupC <- sample(geneNames, 250, replace=FALSE)
GroupD <- sample(geneNames, 300, replace=FALSE)
venn(list(GrpA=GroupA,GrpB=GroupB,GrpC=GroupC,GrpD=GroupD))