R has a working directory. This should be the main location to store external data files, as well as other code files.
getwd()
[1] "/Users/jiaqiguo/Dropbox/Math189Winter2017/lab2"
One can also set the working directory to a desired location.
setwd("path/to/working/directory") # placeholder path: pass the desired directory as the argument
getwd()
Please make sure the external data file is in the working directory. R has several functions for reading external data, depending on the file format. After the file is read in, the data is usually stored in an R data frame. One can peek at the first few rows of the data with the head() function.
data <- read.table("babies.txt", header=TRUE)
head(data)
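As a minimal sketch of how the reading function depends on the file format, the snippet below writes a tiny comma-separated file and reads it back with both read.csv() and read.table(). The file contents and the names csv.path, toy.csv and toy.tab are made up for illustration and are not part of the lab data.

```r
# Write a tiny, made-up comma-separated file to a temporary location
# (hypothetical values, not the lab's babies.txt).
csv.path <- tempfile(fileext = ".csv")
writeLines(c("bwt,smoke", "120,0", "113,1", "128,0"), csv.path)

# read.csv() assumes a header row and comma separators by default.
toy.csv <- read.csv(csv.path)

# read.table() reads the same file once the header and separator are stated.
toy.tab <- read.table(csv.path, header = TRUE, sep = ",")

str(toy.csv)                # column names and types
all.equal(toy.csv, toy.tab) # compare the two results
```

For whitespace-separated files such as the one used in this lab, read.table() with header=TRUE is usually enough, since whitespace is its default separator.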
One of the most efficient ways to learn about the data is the summary() function, which reports the min, 1st quartile, median, mean, 3rd quartile, and max of each column.
summary(data)
bwt gestation parity age height
Min. : 55.0 Min. :148.0 Min. :0.0000 Min. :15.00 Min. :53.00
1st Qu.:108.8 1st Qu.:272.0 1st Qu.:0.0000 1st Qu.:23.00 1st Qu.:62.00
Median :120.0 Median :280.0 Median :0.0000 Median :26.00 Median :64.00
Mean :119.6 Mean :286.9 Mean :0.2549 Mean :27.37 Mean :64.67
3rd Qu.:131.0 3rd Qu.:288.0 3rd Qu.:1.0000 3rd Qu.:31.00 3rd Qu.:66.00
Max. :176.0 Max. :999.0 Max. :1.0000 Max. :99.00 Max. :99.00
weight smoke
Min. : 87 Min. :0.0000
1st Qu.:115 1st Qu.:0.0000
Median :126 Median :0.0000
Mean :154 Mean :0.4644
3rd Qu.:140 3rd Qu.:1.0000
Max. :999 Max. :9.0000
The mean and standard deviation can be computed with the base R functions mean() and sd(), respectively.
mean(data$bwt)
[1] 119.5769
sd(data$bwt)
[1] 18.23645
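To make explicit what these two functions compute, here is a small sketch that reproduces them by hand. The vector x holds made-up values, not values from babies.txt. Note that sd() uses the n - 1 denominator.

```r
# Made-up birth weights (hypothetical values, not from babies.txt).
x <- c(120, 113, 128, 108, 136)
n <- length(x)

x.bar <- sum(x) / n                      # sample mean by hand
s <- sqrt(sum((x - x.bar)^2) / (n - 1))  # sample sd with the n - 1 denominator

all.equal(x.bar, mean(x))  # TRUE
all.equal(s, sd(x))        # TRUE

# Both functions return NA when a value is missing unless na.rm = TRUE is set.
mean(c(x, NA))                # NA
mean(c(x, NA), na.rm = TRUE)  # 121
```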
To access a column of the data, one can use [...] to select it. A data frame works much like a matrix, but it offers additional flexibility and information in handling data. For example, one can select a column of a data frame by column number or by column name. Both forms below return the same values. However, data[,7] (matrix-style indexing, with a comma) returns the raw values as a vector, whereas data['smoke'] (list-style indexing, without a comma) returns a one-column data frame that keeps the column name. The difference comes from the form of the brackets rather than from using a number or a name: data[7] also returns a data frame, and data[,'smoke'] also returns a vector. Quotation marks are necessary when a column name is used for selection.
data[,7]
data['smoke']
Also, $ may be used to access a column by name, without quotation marks. data$smoke returns the same vector as data[,7].
data$smoke
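The following sketch checks what each indexing form returns on a small made-up data frame. The name toy and its values are hypothetical and only serve the illustration.

```r
# Small made-up data frame (hypothetical values).
toy <- data.frame(bwt = c(120, 113, 128), smoke = c(0, 1, 0))

class(toy[, 2])                # "numeric": the comma form drops to a vector
class(toy[, "smoke"])          # "numeric": same, selected by name
class(toy[2])                  # "data.frame": no comma keeps a one-column data frame
class(toy["smoke"])            # "data.frame"
class(toy$smoke)               # "numeric"
class(toy[["smoke"]])          # "numeric"
class(toy[, 2, drop = FALSE])  # "data.frame": drop = FALSE keeps the data frame
```

When a function expects a plain vector, such as mean() or sd(), the vector forms are the ones to use.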
Accessing certain rows of the data is usually straightforward using the row numbers.
data[100,]
In addition, we introduce selecting rows based on a condition, which is useful in applications. To see which rows contain observations of smokers, we use the which() function, which returns the indices of the observations from smokers.
smoker.ind <- which(data['smoke'] == 1)
smoker.ind
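To select those rows, pass the vector of indices as the row index. The setdiff() function computes a set difference, which gives the remaining rows here. Note that this complement contains every row whose smoke value is not 1, so the irregular code 9 (see the maximum of smoke in the summary) ends up in data.nonsmoker.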
data.smoker <- data[smoker.ind,]
nonsmoker.ind <- setdiff(rownames(data), smoker.ind)
data.nonsmoker <- data[nonsmoker.ind,]
# compare summary statistics between smokers and nonsmokers
summary(data.smoker)
bwt gestation parity age height
Min. : 58.0 Min. :223.0 Min. :0.00 Min. :15.00 Min. :53.00
1st Qu.:102.0 1st Qu.:271.0 1st Qu.:0.00 1st Qu.:22.00 1st Qu.:63.00
Median :115.0 Median :279.0 Median :0.00 Median :26.00 Median :64.00
Mean :114.1 Mean :283.9 Mean :0.25 Mean :26.88 Mean :65.03
3rd Qu.:126.0 3rd Qu.:286.0 3rd Qu.:0.25 3rd Qu.:30.00 3rd Qu.:66.00
Max. :163.0 Max. :999.0 Max. :1.00 Max. :99.00 Max. :99.00
weight smoke
Min. : 87.0 Min. :1
1st Qu.:112.0 1st Qu.:1
Median :125.0 Median :1
Mean :161.1 Mean :1
3rd Qu.:140.0 3rd Qu.:1
Max. :999.0 Max. :1
summary(data.nonsmoker)
bwt gestation parity age height
Min. : 55.0 Min. :148.0 Min. :0.000 Min. :17.00 Min. :56.00
1st Qu.:113.0 1st Qu.:273.0 1st Qu.:0.000 1st Qu.:23.00 1st Qu.:62.00
Median :123.5 Median :281.0 Median :0.000 Median :27.00 Median :64.00
Mean :123.1 Mean :288.8 Mean :0.258 Mean :27.69 Mean :64.44
3rd Qu.:134.0 3rd Qu.:289.0 3rd Qu.:1.000 3rd Qu.:31.00 3rd Qu.:66.00
Max. :176.0 Max. :999.0 Max. :1.000 Max. :99.00 Max. :99.00
weight smoke
Min. : 89.0 Min. :0.0000
1st Qu.:115.0 1st Qu.:0.0000
Median :127.0 Median :0.0000
Mean :149.4 Mean :0.1197
3rd Qu.:140.0 3rd Qu.:0.0000
Max. :999.0 Max. :9.0000
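The same kind of row selection can also be written with a logical condition directly. The sketch below uses a made-up data frame toy (hypothetical values, including one irregular smoke code of 9) to compare a few equivalent forms and to show how the complement differs from an explicit condition.

```r
# Made-up data with one irregular smoke code (hypothetical values).
toy <- data.frame(bwt = c(120, 113, 128, 101), smoke = c(0, 1, 9, 1))

smoker.ind <- which(toy$smoke == 1)  # integer indices of smokers

by.index   <- toy[smoker.ind, ]      # select rows by index vector
by.logical <- toy[toy$smoke == 1, ]  # select rows by logical vector

# Explicit condition: only rows coded 0.
nonsmoker.explicit <- subset(toy, smoke == 0)

# Complement of the smoker indices: every row that is not coded 1,
# which also keeps the row coded 9.
# (Negative indexing assumes smoker.ind is not empty.)
nonsmoker.complement <- toy[-smoker.ind, ]

nrow(nonsmoker.explicit)    # 1
nrow(nonsmoker.complement)  # 2
```

Which of the two is appropriate depends on how the irregular codes should be treated.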
In many cases, graphs are easier to read than numeric values. We first introduce histograms. A histogram plots the frequency of the values falling in each bin. Below is an application of hist(), a base function provided in R.
hist(data.smoker$bwt)
hist(data.nonsmoker$bwt)
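When two histograms are meant to be compared, it can help to draw them side by side with the same bins. This is one possible sketch using simulated values; the vectors bwt.a and bwt.b, their parameters, and the group labels are made up and are not the lab data.

```r
# Simulated stand-ins for two groups (hypothetical parameters).
set.seed(1)
bwt.a <- rnorm(200, mean = 114, sd = 18)
bwt.b <- rnorm(300, mean = 123, sd = 17)

# Common bin edges covering both samples, so the panels are comparable.
breaks <- pretty(range(c(bwt.a, bwt.b)), n = 15)

old.par <- par(mfrow = c(1, 2))  # one row, two panels
h.a <- hist(bwt.a, breaks = breaks, main = "Group A", xlab = "bwt")
h.b <- hist(bwt.b, breaks = breaks, main = "Group B", xlab = "bwt")
par(old.par)                     # restore the previous layout
```

hist() also returns the bin counts, so h.a$counts and h.b$counts can be inspected numerically.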
However, there are multiple ways to plot histograms in R. Before showing another, we first introduce how to download and import open-source packages. We take ggplot2 as an example, which is a nice plotting package in R. To download a package, use the install.packages() function; to import it, use the library() function.
install.packages('ggplot2',repos="http://cran.rstudio.com/")
library(ggplot2)
The ggplot2 package provides powerful plotting tools, which you can explore later on your own. Here a crude histogram example is provided.
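The example itself does not appear in this text, so the following is only a sketch of what a basic ggplot2 histogram could look like. It assumes ggplot2 is installed, and it uses simulated values in a made-up data frame toy rather than the lab data. The bin width of 5 is an arbitrary choice.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical values and group labels).
set.seed(1)
toy <- data.frame(
  bwt   = c(rnorm(100, mean = 114, sd = 18), rnorm(100, mean = 123, sd = 17)),
  smoke = rep(c("smoker", "nonsmoker"), each = 100)
)

# Map bwt to the x axis, draw a histogram, and split it into one panel per group.
p <- ggplot(toy, aes(x = bwt)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ smoke)

print(p)
```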
In addition to histograms, boxplots are useful for comparing the distribution of one column across several categories. The formula argument of boxplot() works much like an equation: it tells R how two variables are related. Here bwt~smoke asks for bwt grouped by the values of smoke. Formulas will come up again when we talk about regressions.
boxplot(bwt~smoke, data)
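The same formula notation is accepted by other base functions. As a sketch on a made-up data frame toy (hypothetical values), aggregate() computes a per-group mean from the formula, and boxplot() draws the matching plot.

```r
# Made-up data (hypothetical values).
toy <- data.frame(
  bwt   = c(120, 113, 128, 101, 125, 109),
  smoke = c(0, 1, 0, 1, 0, 1)
)

# Left side of ~ is the response, right side is the grouping variable.
group.means <- aggregate(bwt ~ smoke, data = toy, FUN = mean)
group.means

# One box per value of smoke, labelled for readability.
boxplot(bwt ~ smoke, data = toy,
        names = c("non-smoker", "smoker"), ylab = "bwt")
```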
Note that from the plots and summary statistics, one can often detect irregular values in the data, which require additional inspection and cleaning.
irreg.index <- which(data$smoke == 9)
data.irregular <- data[irreg.index,]
data.irregular
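One common cleaning step is to recode such values as NA. The sketch below assumes that 999, 99 and 9 are placeholder codes for missing values in the respective columns. That is an assumption based on the maxima in the summary, not something the data file states. The data frame toy and the list na.codes are made up for illustration.

```r
# Made-up rows containing suspicious codes (hypothetical values).
toy <- data.frame(
  gestation = c(284, 999, 279),
  age       = c(27, 33, 99),
  smoke     = c(0, 9, 1)
)

# ASSUMPTION: these codes mark missing values in each column.
na.codes <- list(gestation = 999, age = 99, smoke = 9)

clean <- toy
for (col in names(na.codes)) {
  # Replace the assumed code with NA in that column only.
  clean[[col]][clean[[col]] %in% na.codes[[col]]] <- NA
}

colSums(is.na(clean))        # number of recoded values per column
complete <- na.omit(clean)   # keep only rows without any NA
nrow(complete)               # 1
```

Recoding per column matters here, since 99 could be a legitimate value in one column and a placeholder in another.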
The \(\alpha\)-quantile of a random variable \(X\), denoted \(q_\alpha\), is defined by \(P(X \leq q_\alpha) = \alpha\). Computing it requires knowledge of the distribution that \(X\) follows. The quantile of a normal distribution, for example, can be found using the qnorm() function. For the standard normal distribution, this number is also known as the \(z\)-score. The sample quantile is available from the function quantile().
qnorm(0.95)
[1] 1.644854
quantile(data$bwt, 0.1)
10%
97
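As a quick check of the definition, the sketch below verifies that pnorm() inverts qnorm(), shows the quantile of a non-standard normal distribution, and computes a sample quantile on a made-up vector x. The mean of 120 and standard deviation of 18 are arbitrary illustration values.

```r
# Theoretical quantile: q satisfies P(X <= q) = 0.95 for a standard normal X.
q <- qnorm(0.95)
pnorm(q)  # 0.95

# For a normal distribution with mean mu and sd sigma, the quantile is mu + sigma * q.
mu <- 120; sigma <- 18  # arbitrary illustration values
all.equal(qnorm(0.95, mean = mu, sd = sigma), mu + sigma * q)  # TRUE

# Sample quantile on made-up data: 11 evenly spaced values.
x <- seq(100, 150, by = 5)
quantile(x, 0.1)  # 105 (default method interpolates between order statistics)
```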
A quantile-quantile (Q-Q) plot is useful for examining whether data closely follow a normal distribution. It is also used to judge whether two samples come from a common distribution.
qqnorm(data$bwt) # q-q plot against normal distribution
qqline(data$bwt)
For a two-sample comparison, if the plotted points lie mostly above or mostly below the \(y=x\) line, the two samples most likely have different means.
qqplot(data.smoker$bwt, data.nonsmoker$bwt) # q-q plot of smoker and non-smoker samples
abline(0, 1) # reference line y = x (intercept 0, slope 1)
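To see what a shift in mean looks like on such a plot, here is a sketch with two simulated samples whose means differ by a known amount. The sample sizes, means and standard deviations are made up. qqplot() returns the sorted values it plots, which makes it possible to count how many points lie above the reference line.

```r
# Two simulated samples with the same spread but different means
# (hypothetical parameters).
set.seed(189)
x <- rnorm(300, mean = 115, sd = 5)
y <- rnorm(300, mean = 125, sd = 5)

# qqplot() draws the plot and invisibly returns the sorted x and y values.
qq <- qqplot(x, y, xlab = "sample x", ylab = "sample y")
abline(0, 1)  # reference line y = x

# Fraction of plotted points above the reference line.
frac.above <- mean(qq$y > qq$x)
frac.above
```

With the second sample shifted upward, nearly all points should sit above the line, which is the pattern described in the text.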