Get and Set Working Directory

R has a working directory. This should be the main location to store external data files, as well as other code files.

getwd()
[1] "/Users/jiaqiguo/Dropbox/Math189Winter2017/lab2"

One can also set the working directory to a desired location.

setwd() # desired directory as argument
getwd()

Read in Data

Please make sure the external data file is in the working directory. There are multiple functions to read in external data into R, depending on the file format. After reading in the external file, data is usually stored in an R dataframe. One can also peak into the data by using head() function.

data <- read.table("babies.txt", header=TRUE)
head(data)

Summary of Data

One of the most efficient way to learn about the data is through summary() function, which contains min, 1st quartile, median, mean, 3rd quartile, and max of each column.

summary(data)
      bwt          gestation         parity            age            height     
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00   Min.   :53.00  
 1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00   1st Qu.:62.00  
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00   Median :64.00  
 Mean   :119.6   Mean   :286.9   Mean   :0.2549   Mean   :27.37   Mean   :64.67  
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00   3rd Qu.:66.00  
 Max.   :176.0   Max.   :999.0   Max.   :1.0000   Max.   :99.00   Max.   :99.00  
     weight        smoke       
 Min.   : 87   Min.   :0.0000  
 1st Qu.:115   1st Qu.:0.0000  
 Median :126   Median :0.0000  
 Mean   :154   Mean   :0.4644  
 3rd Qu.:140   3rd Qu.:1.0000  
 Max.   :999   Max.   :9.0000  

Mean and standard deviations can be found using base functions mean() and sd() in R respectively.

mean(data$bwt)
[1] 119.5769
sd(data$bwt)
[1] 18.23645

Acessing Data

To access a column of the data, one can simply use [...] to select. A dataframe work similar as a matrix; however, it does offer additional flexibility and information in handling data. For example, one can select a column of a dataframe by column number or column name. Both returns the same entry values. However, selecting by column number returns the raw data, whereas selecting by column name returns a dataframe with column information. Quotation marks are necessary, if column name is used for selection.

data[,7]
data['smoke']

Also, $ may be used to access a certain data field, which is equivalent to selecting by column number.

data$smoke

Acessing certain rows of the data is usually straightforward using the row numbers.

data[100,]

In addition, we introduce selecting rows based on a certain condition here, which is useful in application. To see which rows contain observations of smokers, we use which() function, which returns indices of observations from smokers.

smoker.ind <- which(data['smoke'] == 1)
smoker.ind

To use the indices, we can pass in the vector of indices. Also, the setdiff() function is useful in set difference operations.

data.smoker <- data[smoker.ind,]
nonsmoker.ind <- setdiff(rownames(data), smoker.ind)
data.nonsmoker <- data[nonsmoker.ind,]
# compare summary statistics between smokers and nonsmokers
summary(data.smoker) 
      bwt          gestation         parity          age            height     
 Min.   : 58.0   Min.   :223.0   Min.   :0.00   Min.   :15.00   Min.   :53.00  
 1st Qu.:102.0   1st Qu.:271.0   1st Qu.:0.00   1st Qu.:22.00   1st Qu.:63.00  
 Median :115.0   Median :279.0   Median :0.00   Median :26.00   Median :64.00  
 Mean   :114.1   Mean   :283.9   Mean   :0.25   Mean   :26.88   Mean   :65.03  
 3rd Qu.:126.0   3rd Qu.:286.0   3rd Qu.:0.25   3rd Qu.:30.00   3rd Qu.:66.00  
 Max.   :163.0   Max.   :999.0   Max.   :1.00   Max.   :99.00   Max.   :99.00  
     weight          smoke  
 Min.   : 87.0   Min.   :1  
 1st Qu.:112.0   1st Qu.:1  
 Median :125.0   Median :1  
 Mean   :161.1   Mean   :1  
 3rd Qu.:140.0   3rd Qu.:1  
 Max.   :999.0   Max.   :1  
summary(data.nonsmoker)
      bwt          gestation         parity           age            height     
 Min.   : 55.0   Min.   :148.0   Min.   :0.000   Min.   :17.00   Min.   :56.00  
 1st Qu.:113.0   1st Qu.:273.0   1st Qu.:0.000   1st Qu.:23.00   1st Qu.:62.00  
 Median :123.5   Median :281.0   Median :0.000   Median :27.00   Median :64.00  
 Mean   :123.1   Mean   :288.8   Mean   :0.258   Mean   :27.69   Mean   :64.44  
 3rd Qu.:134.0   3rd Qu.:289.0   3rd Qu.:1.000   3rd Qu.:31.00   3rd Qu.:66.00  
 Max.   :176.0   Max.   :999.0   Max.   :1.000   Max.   :99.00   Max.   :99.00  
     weight          smoke       
 Min.   : 89.0   Min.   :0.0000  
 1st Qu.:115.0   1st Qu.:0.0000  
 Median :127.0   Median :0.0000  
 Mean   :149.4   Mean   :0.1197  
 3rd Qu.:140.0   3rd Qu.:0.0000  
 Max.   :999.0   Max.   :9.0000  

Graphs from Data

In many cases, graphs may be more straightforward than numeric values. We first introduce historgrams. Histrograms plot frequencies versus the values. Below is an application of the function hist(), a base function provided in R.

hist(data.smoker$bwt)

hist(data.nonsmoker$bwt)

However, there are multiple ways for plotting histograms in R. Before that, we will first introduce how to download and import open-source packages. We will take ggplot2 for example, which is a nice plotting package in R. To import packages, use library() function.

install.packages('ggplot2',repos="http://cran.rstudio.com/")
library(ggplot2)

The ggplot2 package provides great tools in plotting, which you can explore later on your own. Here a crude histogram example is provided.

In addition to histograms, boxplots provide visualization when comparing a column of multiple categories. The argument ‘formula’ of boxplot works much like an equation. It tells R the variable relation between two variables. This will come up again when we talk about regressions.

boxplot(bwt~smoke, data)

Note that from the plots and summary statistics, one can often detect irregular values in data, which requires additional inspection and cleaning.

irreg.index <- which(data$smoke == 9)
data.irregular <- data[irreg.index,]
data.irregular

Quantiles

\(\alpha\)-quantile of a random variable \(X\), denoted as \(q_\alpha\), is define as \(P(X \leq q_\alpha) = \alpha\). This will require knowledge of the distribution \(X\) follows. The quantile for normal distribution, for example, can be found using qnorm() function. This number for standard normal distribution is also known as the \(z\)-score. The sample quantile is available from the function quantile().

qnorm(0.95)
[1] 1.644854
quantile(data$bwt, 0.1)
10% 
 97 

Q-Q Plot (Quantile-Quantile Plot)

Quantile-Quantile plot is useful for examining whether data follows a normal distribution closely or not. It is also used when trying to determine whether two samples come from a common distribution.

qqnorm(data$bwt) # q-q plot against normal distribution
qqline(data$bwt)

For two sample comparison, if the plotted points are mostly above or below the \(y=x\) line, then one can tell that the two samples most likely have a different mean.

qqplot(data.smoker$bwt, data.nonsmoker$bwt) # q-q plot of smoker and non-smoker samples
abline(c(0,1)) # reference line

LS0tCnRpdGxlOiAiTWF0aCAxODkvMjg5IExhYiAyIgphdXRob3I6ICJKaWFxaSBHdW8iCm91dHB1dDoKICBodG1sX25vdGVib29rOiBkZWZhdWx0CiAgcGRmX2RvY3VtZW50OiBkZWZhdWx0Ci0tLQoKIyBHZXQgYW5kIFNldCBXb3JraW5nIERpcmVjdG9yeQpSIGhhcyBhIHdvcmtpbmcgZGlyZWN0b3J5LiBUaGlzIHNob3VsZCBiZSB0aGUgbWFpbiBsb2NhdGlvbiB0byBzdG9yZSBleHRlcm5hbCBkYXRhIGZpbGVzLCBhcyB3ZWxsIGFzIG90aGVyIGNvZGUgZmlsZXMuCmBgYHtyfQpnZXR3ZCgpCmBgYApPbmUgY2FuIGFsc28gc2V0IHRoZSB3b3JraW5nIGRpcmVjdG9yeSB0byBhIGRlc2lyZWQgbG9jYXRpb24uCmBgYHtyfQpzZXR3ZCgpICMgZGVzaXJlZCBkaXJlY3RvcnkgYXMgYXJndW1lbnQKZ2V0d2QoKQpgYGAKCiMgUmVhZCBpbiBEYXRhClBsZWFzZSBtYWtlIHN1cmUgdGhlIGV4dGVybmFsIGRhdGEgZmlsZSBpcyBpbiB0aGUgd29ya2luZyBkaXJlY3RvcnkuIFRoZXJlIGFyZSBtdWx0aXBsZSBmdW5jdGlvbnMgdG8gcmVhZCBpbiBleHRlcm5hbCBkYXRhIGludG8gUiwgZGVwZW5kaW5nIG9uIHRoZSBmaWxlIGZvcm1hdC4gQWZ0ZXIgcmVhZGluZyBpbiB0aGUgZXh0ZXJuYWwgZmlsZSwgZGF0YSBpcyB1c3VhbGx5IHN0b3JlZCBpbiBhbiBSIGRhdGFmcmFtZS4gT25lIGNhbiBhbHNvIHBlYWsgaW50byB0aGUgZGF0YSBieSB1c2luZyBgaGVhZCgpYCBmdW5jdGlvbi4KYGBge3J9CmRhdGEgPC0gcmVhZC50YWJsZSgiYmFiaWVzLnR4dCIsIGhlYWRlcj1UUlVFKQpoZWFkKGRhdGEpCmBgYAoKIyBTdW1tYXJ5IG9mIERhdGEKT25lIG9mIHRoZSBtb3N0IGVmZmljaWVudCB3YXkgdG8gbGVhcm4gYWJvdXQgdGhlIGRhdGEgaXMgdGhyb3VnaCBgc3VtbWFyeSgpYCBmdW5jdGlvbiwgd2hpY2ggY29udGFpbnMgbWluLCAxc3QgcXVhcnRpbGUsIG1lZGlhbiwgbWVhbiwgM3JkIHF1YXJ0aWxlLCBhbmQgbWF4IG9mIGVhY2ggY29sdW1uLgpgYGB7cn0Kc3VtbWFyeShkYXRhKQpgYGAKTWVhbiBhbmQgc3RhbmRhcmQgZGV2aWF0aW9ucyBjYW4gYmUgZm91bmQgdXNpbmcgYmFzZSBmdW5jdGlvbnMgYG1lYW4oKWAgYW5kIGBzZCgpYCBpbiBSIHJlc3BlY3RpdmVseS4KYGBge3J9Cm1lYW4oZGF0YSRid3QpCnNkKGRhdGEkYnd0KQpgYGAKCiMgQWNlc3NpbmcgRGF0YQpUbyBhY2Nlc3MgYSBjb2x1bW4gb2YgdGhlIGRhdGEsIG9uZSBjYW4gc2ltcGx5IHVzZSBgWy4uLl1gIHRvIHNlbGVjdC4gQSBkYXRhZnJhbWUgd29yayBzaW1pbGFyIGFzIGEgbWF0cml4OyBob3dldmVyLCBpdCBkb2VzIG9mZmVyIGFkZGl0aW9uYWwgZmxleGliaWxpdHkgYW5kIGluZm9ybWF0aW9uIGluIGhhbmRsaW5nIGRhdGEuIApGb3IgZXhhbXBsZSwgb25lIGNhbiBzZWxlY3QgYSBjb2x1bW4gb2YgYSBkYXRhZnJhbWUgYnkgY29sdW1uIG51bWJlciBvciBjb2x1bW4gbmFtZS4gQm90aCByZXR1cm5zIHRoZSBzYW1lIGVudHJ5IHZhbHVlcy4gSG93ZXZlciwgc2VsZWN0aW5nIGJ5IGNvbHVtbiBudW1iZXIgcmV0dXJucyB0aGUgcmF3IGRhdGEsIHdoZXJlYXMgc2VsZWN0aW5nIGJ5IGNvbHVtbiBuYW1lIHJldHVybnMgYSBkYXRhZnJhbWUgd2l0aCBjb2x1bW4gaW5mb3JtYXRpb24uIFF1b3RhdGlvbiBtYXJrcyBhcmUgbmVjZXNzYXJ5LCBpZiBjb2x1bW4gbmFtZSBpcyB1c2VkIGZvciBzZWxlY3Rpb24uCmBgYHtyfQpkYXRhWyw3XQpgYGAKYGBge3J9CmRhdGFbJ3Ntb2tlJ10KYGBgCkFsc28sIGAkYCBtYXkgYmUgdXNlZCB0byBhY2Nlc3MgYSBjZXJ0YWluIGRhdGEgZmllbGQsIHdoaWNoIGlzIGVxdWl2YWxlbnQgdG8gc2VsZWN0aW5nIGJ5IGNvbHVtbiBudW1iZXIuCmBgYHtyfQpkYXRhJHNtb2tlCmBgYAoKQWNlc3NpbmcgY2VydGFpbiByb3dzIG9mIHRoZSBkYXRhIGlzIHVzdWFsbHkgc3RyYWlnaHRmb3J3YXJkIHVzaW5nIHRoZSByb3cgbnVtYmVycy4gCmBgYHtyfQpkYXRhWzEwMCxdCmBgYApJbiBhZGRpdGlvbiwgd2UgaW50cm9kdWNlIHNlbGVjdGluZyByb3dzIGJhc2VkIG9uIGEgY2VydGFpbiBjb25kaXRpb24gaGVyZSwgd2hpY2ggaXMgdXNlZnVsIGluIGFwcGxpY2F0aW9uLiBUbyBzZWUgd2hpY2ggcm93cyBjb250YWluIG9ic2VydmF0aW9ucyBvZiBzbW9rZXJzLCB3ZSB1c2UgYHdoaWNoKClgIGZ1bmN0aW9uLCB3aGljaCByZXR1cm5zIGluZGljZXMgb2Ygb2JzZXJ2YXRpb25zIGZyb20gc21va2Vycy4KYGBge3J9CnNtb2tlci5pbmQgPC0gd2hpY2goZGF0YVsnc21va2UnXSA9PSAxKQpzbW9rZXIuaW5kCmBgYApUbyB1c2UgdGhlIGluZGljZXMsIHdlIGNhbiBwYXNzIGluIHRoZSB2ZWN0b3Igb2YgaW5kaWNlcy4gQWxzbywgdGhlIGBzZXRkaWZmKClgIGZ1bmN0aW9uIGlzIHVzZWZ1bCBpbiBzZXQgZGlmZmVyZW5jZSBvcGVyYXRpb25zLgpgYGB7cn0KZGF0YS5zbW9rZXIgPC0gZGF0YVtzbW9rZXIuaW5kLF0Kbm9uc21va2VyLmluZCA8LSBzZXRkaWZmKHJvd25hbWVzKGRhdGEpLCBzbW9rZXIuaW5kKQpkYXRhLm5vbnNtb2tlciA8LSBkYXRhW25vbnNtb2tlci5pbmQsXQojIGNvbXBhcmUgc3VtbWFyeSBzdGF0aXN0aWNzIGJldHdlZW4gc21va2VycyBhbmQgbm9uc21va2VycwpzdW1tYXJ5KGRhdGEuc21va2VyKSAKc3VtbWFyeShkYXRhLm5vbnNtb2tlcikKYGBgCgojIEdyYXBocyBmcm9tIERhdGEKSW4gbWFueSBjYXNlcywgZ3JhcGhzIG1heSBiZSBtb3JlIHN0cmFpZ2h0Zm9yd2FyZCB0aGFuIG51bWVyaWMgdmFsdWVzLiBXZSBmaXJzdCBpbnRyb2R1Y2UgaGlzdG9yZ3JhbXMuIEhpc3Ryb2dyYW1zIHBsb3QgZnJlcXVlbmNpZXMgdmVyc3VzIHRoZSB2YWx1ZXMuIEJlbG93IGlzIGFuIGFwcGxpY2F0aW9uIG9mIHRoZSBmdW5jdGlvbiBgaGlzdCgpYCwgYSBiYXNlIGZ1bmN0aW9uIHByb3ZpZGVkIGluIFIuIApgYGB7cn0KaGlzdChkYXRhLnNtb2tlciRid3QpCmBgYApgYGB7cn0KaGlzdChkYXRhLm5vbnNtb2tlciRid3QpCmBgYApIb3dldmVyLCB0aGVyZSBhcmUgbXVsdGlwbGUgd2F5cyBmb3IgcGxvdHRpbmcgaGlzdG9ncmFtcyBpbiBSLiBCZWZvcmUgdGhhdCwgd2Ugd2lsbCBmaXJzdCBpbnRyb2R1Y2UgaG93IHRvIGRvd25sb2FkIGFuZCBpbXBvcnQgb3Blbi1zb3VyY2UgcGFja2FnZXMuIFdlIHdpbGwgdGFrZSBnZ3Bsb3QyIGZvciBleGFtcGxlLCB3aGljaCBpcyBhIG5pY2UgcGxvdHRpbmcgcGFja2FnZSBpbiBSLiBUbyBpbXBvcnQgcGFja2FnZXMsIHVzZSBgbGlicmFyeSgpYCBmdW5jdGlvbi4KYGBge3J9Cmluc3RhbGwucGFja2FnZXMoJ2dncGxvdDInLHJlcG9zPSJodHRwOi8vY3Jhbi5yc3R1ZGlvLmNvbS8iKQpsaWJyYXJ5KGdncGxvdDIpCmBgYApUaGUgZ2dwbG90MiBwYWNrYWdlIHByb3ZpZGVzIGdyZWF0IHRvb2xzIGluIHBsb3R0aW5nLCB3aGljaCB5b3UgY2FuIGV4cGxvcmUgbGF0ZXIgb24geW91ciBvd24uIEhlcmUgYSBjcnVkZSBoaXN0b2dyYW0gZXhhbXBsZSBpcyBwcm92aWRlZC4KYGBge3J9CmdncGxvdChkYXRhLCBhZXMoYnd0KSkgKyBnZW9tX2hpc3RvZ3JhbSgpCmBgYAoKSW4gYWRkaXRpb24gdG8gaGlzdG9ncmFtcywgYm94cGxvdHMgcHJvdmlkZSB2aXN1YWxpemF0aW9uIHdoZW4gY29tcGFyaW5nIGEgY29sdW1uIG9mIG11bHRpcGxlIGNhdGVnb3JpZXMuIFRoZSBhcmd1bWVudCAnZm9ybXVsYScgb2YgYm94cGxvdCB3b3JrcyBtdWNoIGxpa2UgYW4gZXF1YXRpb24uIEl0IHRlbGxzIFIgdGhlIHZhcmlhYmxlIHJlbGF0aW9uIGJldHdlZW4gdHdvIHZhcmlhYmxlcy4gVGhpcyB3aWxsIGNvbWUgdXAgYWdhaW4gd2hlbiB3ZSB0YWxrIGFib3V0IHJlZ3Jlc3Npb25zLgpgYGB7cn0KYm94cGxvdChid3R+c21va2UsIGRhdGEpCmBgYApOb3RlIHRoYXQgZnJvbSB0aGUgcGxvdHMgYW5kIHN1bW1hcnkgc3RhdGlzdGljcywgb25lIGNhbiBvZnRlbiBkZXRlY3QgaXJyZWd1bGFyIHZhbHVlcyBpbiBkYXRhLCB3aGljaCByZXF1aXJlcyBhZGRpdGlvbmFsIGluc3BlY3Rpb24gYW5kIGNsZWFuaW5nLgpgYGB7cn0KaXJyZWcuaW5kZXggPC0gd2hpY2goZGF0YSRzbW9rZSA9PSA5KQpkYXRhLmlycmVndWxhciA8LSBkYXRhW2lycmVnLmluZGV4LF0KZGF0YS5pcnJlZ3VsYXIKYGBgCgojIFF1YW50aWxlcwokXGFscGhhJC1xdWFudGlsZSBvZiBhIHJhbmRvbSB2YXJpYWJsZSAkWCQsIGRlbm90ZWQgYXMgJHFfXGFscGhhJCwgaXMgZGVmaW5lIGFzICRQKFggXGxlcSBxX1xhbHBoYSkgPSBcYWxwaGEkLiBUaGlzIHdpbGwgcmVxdWlyZSBrbm93bGVkZ2Ugb2YgdGhlIGRpc3RyaWJ1dGlvbiAkWCQgZm9sbG93cy4gVGhlIHF1YW50aWxlIGZvciBub3JtYWwgZGlzdHJpYnV0aW9uLCBmb3IgZXhhbXBsZSwgY2FuIGJlIGZvdW5kIHVzaW5nIGBxbm9ybSgpYCBmdW5jdGlvbi4gVGhpcyBudW1iZXIgZm9yIHN0YW5kYXJkIG5vcm1hbCBkaXN0cmlidXRpb24gaXMgYWxzbyBrbm93biBhcyB0aGUgJHokLXNjb3JlLiBUaGUgc2FtcGxlIHF1YW50aWxlIGlzIGF2YWlsYWJsZSBmcm9tIHRoZSBmdW5jdGlvbiBgcXVhbnRpbGUoKWAuCmBgYHtyfQpxbm9ybSgwLjk1KQpxdWFudGlsZShkYXRhJGJ3dCwgMC4xKQpgYGAKCiMgUS1RIFBsb3QgKFF1YW50aWxlLVF1YW50aWxlIFBsb3QpClF1YW50aWxlLVF1YW50aWxlIHBsb3QgaXMgdXNlZnVsIGZvciBleGFtaW5pbmcgd2hldGhlciBkYXRhIGZvbGxvd3MgYSBub3JtYWwgZGlzdHJpYnV0aW9uIGNsb3NlbHkgb3Igbm90LiBJdCBpcyBhbHNvIHVzZWQgd2hlbiB0cnlpbmcgdG8gZGV0ZXJtaW5lIHdoZXRoZXIgdHdvIHNhbXBsZXMgY29tZSBmcm9tIGEgY29tbW9uIGRpc3RyaWJ1dGlvbi4KYGBge3J9CnFxbm9ybShkYXRhJGJ3dCkgIyBxLXEgcGxvdCBhZ2FpbnN0IG5vcm1hbCBkaXN0cmlidXRpb24KcXFsaW5lKGRhdGEkYnd0KQpgYGAKRm9yIHR3byBzYW1wbGUgY29tcGFyaXNvbiwgaWYgdGhlIHBsb3R0ZWQgcG9pbnRzIGFyZSBtb3N0bHkgYWJvdmUgb3IgYmVsb3cgdGhlICR5PXgkIGxpbmUsIHRoZW4gb25lIGNhbiB0ZWxsIHRoYXQgdGhlIHR3byBzYW1wbGVzIG1vc3QgbGlrZWx5IGhhdmUgYSBkaWZmZXJlbnQgbWVhbi4KYGBge3J9CnFxcGxvdChkYXRhLnNtb2tlciRid3QsIGRhdGEubm9uc21va2VyJGJ3dCkgIyBxLXEgcGxvdCBvZiBzbW9rZXIgYW5kIG5vbi1zbW9rZXIgc2FtcGxlcwphYmxpbmUoYygwLDEpKSAjIHJlZmVyZW5jZSBsaW5lCmBgYA==