Dataframe

R has a working directory. This should be the main location to store external data files, as well as other code files.

getwd()

One can also set the working directory to a desired location.

setwd() # desired directory as argument
getwd()

If you are not sure where the source file is located, you can use “Session –> Set Working Directory –> To Source File Location”.

First, let’s load in the video game data for homework 2.

setwd("~/Dropbox/Math189Winter2017/lab3") # change to your own working directory
data <- read.table("videodata.txt", header=TRUE)
head(data)

If a question was not answered or improperly answered, it was decoded as 99. There are total number of 65 such fields.

data[data == 99] <- NA
sum(is.na(data))
[1] 65

attach() and detach() functions

The $ notation for list components is not always very convenient. Sometimes it is more convenient to temporarily make the components of data frame visible as variables under their component name. We can use attach() function in R to do this.

For example, if we want to access the age variable, before attaching the data frame data, we can only access by:

data$age # returns an array
 [1] 19 18 19 19 19 19 20 19 19 19 20 19 19 18 18 19 21 20 18 19 20 24 19 19 21 20 22 18 19 20 19 19 19 19 19 18 19 20 19
[40] 20 19 19 19 19 19 19 19 19 20 19 18 20 19 18 19 21 18 20 19 19 19 21 20 19 18 19 18 20 19 19 19 20 19 23 19 20 19 19
[79] 25 19 20 19 19 21 18 19 19 20 33 19 19
data['age'] # return a dataframe

If we directly type in age, an error will be returned.

age
Error: object 'age' not found

After attaching the dataframe, we have an individual variable called age.

attach(data)
The following object is masked _by_ .GlobalEnv:

    age
age
 [1] 1000   18   19   19   19   19   20   19   19   19   20   19   19   18   18   19   21   20   18   19   20   24   19
[24]   19   21   20   22   18   19   20   19   19   19   19   19   18   19   20   19   20   19   19   19   19   19   19
[47]   19   19   20   19   18   20   19   18   19   21   18   20   19   19   19   21   20   19   18   19   18   20   19
[70]   19   19   20   19   23   19   20   19   19   25   19   20   19   19   21   18   19   19   20   33   19   19

However, now any change on age will only affect the new variable age in the working directory, not the age attribute of the data frame data.

age[1] <- 1000
age
 [1] 1000   18   19   19   19   19   20   19   19   19   20   19   19   18   18   19   21   20   18   19   20   24   19
[24]   19   21   20   22   18   19   20   19   19   19   19   19   18   19   20   19   20   19   19   19   19   19   19
[47]   19   19   20   19   18   20   19   18   19   21   18   20   19   19   19   21   20   19   18   19   18   20   19
[70]   19   19   20   19   23   19   20   19   19   25   19   20   19   19   21   18   19   19   20   33   19   19
data$age
 [1] 19 18 19 19 19 19 20 19 19 19 20 19 19 18 18 19 21 20 18 19 20 24 19 19 21 20 22 18 19 20 19 19 19 19 19 18 19 20 19
[40] 20 19 19 19 19 19 19 19 19 20 19 18 20 19 18 19 21 18 20 19 19 19 21 20 19 18 19 18 20 19 19 19 20 19 23 19 20 19 19
[79] 25 19 20 19 19 21 18 19 19 20 33 19 19

To detach a data frame, use function detach().

search()
 [1] ".GlobalEnv"        "data"              "tools:rstudio"     "package:stats"     "package:graphics" 
 [6] "package:grDevices" "package:utils"     "package:datasets"  "package:methods"   "Autoloads"        
[11] "package:base"     
ls(2)
 [1] "age"   "busy"  "cdrom" "educ"  "email" "freq"  "grade" "home"  "like"  "math"  "own"   "sex"   "time"  "where"
[15] "work" 
detach(data)
search()
 [1] ".GlobalEnv"        "tools:rstudio"     "package:stats"     "package:graphics"  "package:grDevices"
 [6] "package:utils"     "package:datasets"  "package:methods"   "Autoloads"         "package:base"     

More on estimating the distribution of data

Histogram

summary(data$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   19.00   19.00   19.52   20.00   33.00 
hist(data$age)

hist(data$age, breaks = 25, probability = TRUE, density = 20, col = 3, border = 3)
lines(density(data$age, bw = 0.25), col = 1)
lines(density(data$age, bw = 0.5), col = 2)
rug(data$age)

Empirical cumulative distribution function (ecdf())

plot(ecdf(data$age), do.points = F, verticals = T)
x <- seq(min(data$age)-4, max(data$age), length.out = 1000)
lines(x, pnorm(x, mean=mean(data$age), sd=sd(data$age)), lty=3, col = 2)

Let’s compare it with a Normal distribution with the same sample mean and variance (red dotted line above).

Of course, one would check the Normality more carefully using Q-Q plot. We see a very long right tail.

par(pty = 's') # plot in a square
qqnorm(data$age)
qqline(data$age)

More formally, we can do a Shapiro-Wilk test. In this case, the p-value is significantly small and we can reject the Normality hypothesis \(H_0\).

shapiro.test(data$age)

    Shapiro-Wilk normality test

data:  data$age
W = 0.51412, p-value = 7.666e-16

Sampling and Bootstrap

Sampling

Sampling (with or without replacement) can be done in R using the function sample().

sample(1:10, size = 10, replace = F)
 [1]  8  9  2  7  3  1  6  5  4 10
sample(1:10, size = 10, replace = T)
 [1]  2  1  5 10  8  6  4  2  7  5

Bootstrap

Suppose we want to find the point estimate of the proportion of male in the population and its confidence interval. Note that the sex variable is 1 if the student is male and 0 if female. Then the point estimation of the proportion of male is:

male.percentage <- mean(data$sex)
male.percentage
[1] 0.5824176

Now we also want to have a confidence interval of this estimator. However, clearly the distribution of sex variable is not Normal, it is a Bernoulli random variable. We know our data were drawn from a population with size \(N = 314\). Hence, we first create a bootstrap population of this size by repeating every sample for \(\frac{314}{91} = 3.45\) times. Here, we’ll just specify the parameter length.out to be 314.

boot.population <- rep(data$sex, length.out = 314)
length(boot.population)
[1] 314

Then we will choose \(n = 91\) samples from the Bootstrap population and call this a Bootstrap sample.

sample1 <- sample(boot.population, size = 91, replace = FALSE)

Continue this procedure until we have 400 Bootstrap samples.

set.seed(189289)
B = 400 # the number of bootstrap samples we want
boot.sample <- array(dim = c(B, 91))
for (i in 1:B) {
  boot.sample[i, ] <- sample(boot.population, size = 91, replace = FALSE)
}

Then we can calculate the sample mean for each Bootstrap sample (i.e. each row of the Bootstrap sample matrix).

boot.mean <- apply(X = boot.sample, MARGIN = 1, FUN = mean)
head(boot.mean)
[1] 0.5054945 0.6153846 0.6263736 0.6043956 0.4835165 0.5824176

Let’s see the histogram of these Bootstrap sample means.

hist(boot.mean, breaks = 20, probability = TRUE, density = 20, col = 3, border = 3)
lines(density(boot.mean, adjust = 2), col = 2)

Check Normality by Q-Q plot and Shapiro-Wilk test.

par(pty = 's')
qqnorm(boot.mean)
qqline(boot.mean)

shapiro.test(boot.mean)

    Shapiro-Wilk normality test

data:  boot.mean
W = 0.99244, p-value = 0.04051

So we can accept that the sample mean follows a Normal distribution. Then we can construct \(95\%\) confidence intervals.

boot.sd <- sd(boot.mean)
male.percentage + c(-1, 1)*1.96*boot.sd
[1] 0.4969015 0.6679337
LS0tCnRpdGxlOiAiTWF0aCAxODkvMjg5IExhYiAzIgphdXRob3I6ICJBbGV4IEhhbmJvIExpIgpvdXRwdXQ6CiAgaHRtbF9ub3RlYm9vazogZGVmYXVsdAogIHBkZl9kb2N1bWVudDogZGVmYXVsdAotLS0KCiMgRGF0YWZyYW1lClIgaGFzIGEgd29ya2luZyBkaXJlY3RvcnkuIFRoaXMgc2hvdWxkIGJlIHRoZSBtYWluIGxvY2F0aW9uIHRvIHN0b3JlIGV4dGVybmFsIGRhdGEgZmlsZXMsIGFzIHdlbGwgYXMgb3RoZXIgY29kZSBmaWxlcy4KYGBge3J9CmdldHdkKCkKYGBgCk9uZSBjYW4gYWxzbyBzZXQgdGhlIHdvcmtpbmcgZGlyZWN0b3J5IHRvIGEgZGVzaXJlZCBsb2NhdGlvbi4KYGBge3J9CnNldHdkKCkgIyBkZXNpcmVkIGRpcmVjdG9yeSBhcyBhcmd1bWVudApnZXR3ZCgpCmBgYApJZiB5b3UgYXJlIG5vdCBzdXJlIHdoZXJlIHRoZSBzb3VyY2UgZmlsZSBpcyBsb2NhdGVkLCB5b3UgY2FuIHVzZSAiU2Vzc2lvbiAtLT4gU2V0IFdvcmtpbmcgRGlyZWN0b3J5IC0tPiBUbyBTb3VyY2UgRmlsZSBMb2NhdGlvbiIuCgpGaXJzdCwgbGV0J3MgbG9hZCBpbiB0aGUgdmlkZW8gZ2FtZSBkYXRhIGZvciBob21ld29yayAyLgpgYGB7cn0Kc2V0d2QoIn4vRHJvcGJveC9NYXRoMTg5V2ludGVyMjAxNy9sYWIzIikgIyBjaGFuZ2UgdG8geW91ciBvd24gd29ya2luZyBkaXJlY3RvcnkKZGF0YSA8LSByZWFkLnRhYmxlKCJ2aWRlb2RhdGEudHh0IiwgaGVhZGVyPVRSVUUpCmhlYWQoZGF0YSkKYGBgCgpJZiBhIHF1ZXN0aW9uIHdhcyBub3QgYW5zd2VyZWQgb3IgaW1wcm9wZXJseSBhbnN3ZXJlZCwgaXQgd2FzIGRlY29kZWQgYXMgOTkuIFRoZXJlIGFyZSB0b3RhbCBudW1iZXIgb2YgNjUgc3VjaCBmaWVsZHMuCmBgYHtyfQpkYXRhW2RhdGEgPT0gOTldIDwtIE5BCnN1bShpcy5uYShkYXRhKSkKYGBgCgojIyBgYXR0YWNoKClgIGFuZCBgZGV0YWNoKClgIGZ1bmN0aW9ucwpUaGUgKiokKiogbm90YXRpb24gZm9yIGxpc3QgY29tcG9uZW50cyBpcyBub3QgYWx3YXlzIHZlcnkgY29udmVuaWVudC4KU29tZXRpbWVzIGl0IGlzIG1vcmUgY29udmVuaWVudCB0byB0ZW1wb3JhcmlseSBtYWtlIHRoZSBjb21wb25lbnRzIG9mIGRhdGEgZnJhbWUKdmlzaWJsZSBhcyB2YXJpYWJsZXMgdW5kZXIgdGhlaXIgY29tcG9uZW50IG5hbWUuIFdlIGNhbiB1c2UgYGF0dGFjaCgpYCBmdW5jdGlvbiBpbiBSIHRvIGRvIHRoaXMuCgpGb3IgZXhhbXBsZSwgaWYgd2Ugd2FudCB0byBhY2Nlc3MgdGhlICoqYWdlKiogdmFyaWFibGUsIGJlZm9yZSBhdHRhY2hpbmcgdGhlIGRhdGEgZnJhbWUgYGRhdGFgLCB3ZSBjYW4gb25seSBhY2Nlc3MgYnk6CmBgYHtyfQpkYXRhJGFnZSAjIHJldHVybnMgYW4gYXJyYXkKZGF0YVsnYWdlJ10gIyByZXR1cm4gYSBkYXRhZnJhbWUKYGBgCgpJZiB3ZSBkaXJlY3RseSB0eXBlIGluIGBhZ2VgLCBhbiBlcnJvciB3aWxsIGJlIHJldHVybmVkLgpgYGB7cn0KYWdlCmBgYAoKQWZ0ZXIgYXR0YWNoaW5nIHRoZSBkYXRhZnJhbWUsIHdlIGhhdmUgYW4gaW5kaXZpZHVhbCB2YXJpYWJsZSBjYWxsZWQgYGFnZWAuCmBgYHtyfQphdHRhY2goZGF0YSkKYWdlCmBgYAoKSG93ZXZlciwgbm93IGFueSBjaGFuZ2Ugb24gYGFnZWAgd2lsbCBvbmx5IGFmZmVjdCB0aGUgbmV3IHZhcmlhYmxlIGBhZ2VgIGluIHRoZSB3b3JraW5nIGRpcmVjdG9yeSwgbm90IHRoZSBgYWdlYCBhdHRyaWJ1dGUgb2YgdGhlIGRhdGEgZnJhbWUgYGRhdGFgLiAKYGBge3J9CmFnZVsxXSA8LSAxMDAwCmFnZQpkYXRhJGFnZQpgYGAKClRvIGRldGFjaCBhIGRhdGEgZnJhbWUsIHVzZSBmdW5jdGlvbiBgZGV0YWNoKClgLgpgYGB7cn0Kc2VhcmNoKCkKbHMoMikKZGV0YWNoKGRhdGEpCnNlYXJjaCgpCmBgYAoKIyMgTW9yZSBvbiBlc3RpbWF0aW5nIHRoZSBkaXN0cmlidXRpb24gb2YgZGF0YQojIyMgSGlzdG9ncmFtCmBgYHtyfQpzdW1tYXJ5KGRhdGEkYWdlKQpoaXN0KGRhdGEkYWdlKQpoaXN0KGRhdGEkYWdlLCBicmVha3MgPSAyNSwgcHJvYmFiaWxpdHkgPSBUUlVFLCBkZW5zaXR5ID0gMjAsIGNvbCA9IDMsIGJvcmRlciA9IDMpCmxpbmVzKGRlbnNpdHkoZGF0YSRhZ2UsIGJ3ID0gMC4yNSksIGNvbCA9IDEpCmxpbmVzKGRlbnNpdHkoZGF0YSRhZ2UsIGJ3ID0gMC41KSwgY29sID0gMikKcnVnKGRhdGEkYWdlKQpgYGAKCiMjIyBFbXBpcmljYWwgY3VtdWxhdGl2ZSBkaXN0cmlidXRpb24gZnVuY3Rpb24gKGBlY2RmKClgKQpgYGB7cn0KcGxvdChlY2RmKGRhdGEkYWdlKSwgZG8ucG9pbnRzID0gRiwgdmVydGljYWxzID0gVCkKeCA8LSBzZXEobWluKGRhdGEkYWdlKS00LCBtYXgoZGF0YSRhZ2UpLCBsZW5ndGgub3V0ID0gMTAwMCkKbGluZXMoeCwgcG5vcm0oeCwgbWVhbj1tZWFuKGRhdGEkYWdlKSwgc2Q9c2QoZGF0YSRhZ2UpKSwgbHR5PTMsIGNvbCA9IDIpCmBgYApMZXQncyBjb21wYXJlIGl0IHdpdGggYSBOb3JtYWwgZGlzdHJpYnV0aW9uIHdpdGggdGhlIHNhbWUgc2FtcGxlIG1lYW4gYW5kIHZhcmlhbmNlIChyZWQgZG90dGVkIGxpbmUgYWJvdmUpLgoKT2YgY291cnNlLCBvbmUgd291bGQgY2hlY2sgdGhlIE5vcm1hbGl0eSBtb3JlIGNhcmVmdWxseSB1c2luZyBRLVEgcGxvdC4gV2Ugc2VlIGEgdmVyeSBsb25nIHJpZ2h0IHRhaWwuCmBgYHtyfQpwYXIocHR5ID0gJ3MnKSAjIHBsb3QgaW4gYSBzcXVhcmUKcXFub3JtKGRhdGEkYWdlKQpxcWxpbmUoZGF0YSRhZ2UpCmBgYAoKTW9yZSBmb3JtYWxseSwgd2UgY2FuIGRvIGEgU2hhcGlyby1XaWxrIHRlc3QuIEluIHRoaXMgY2FzZSwgdGhlIHAtdmFsdWUgaXMgc2lnbmlmaWNhbnRseSBzbWFsbCBhbmQgd2UgY2FuIHJlamVjdCB0aGUgTm9ybWFsaXR5IGh5cG90aGVzaXMgJEhfMCQuCmBgYHtyfQpzaGFwaXJvLnRlc3QoZGF0YSRhZ2UpCmBgYAoKIyBTYW1wbGluZyBhbmQgQm9vdHN0cmFwCiMjIyBTYW1wbGluZwpTYW1wbGluZyAod2l0aCBvciB3aXRob3V0IHJlcGxhY2VtZW50KSBjYW4gYmUgZG9uZSBpbiBSIHVzaW5nIHRoZSBmdW5jdGlvbiBgc2FtcGxlKClgLgpgYGB7cn0Kc2FtcGxlKDE6MTAsIHNpemUgPSAxMCwgcmVwbGFjZSA9IEYpCnNhbXBsZSgxOjEwLCBzaXplID0gMTAsIHJlcGxhY2UgPSBUKQpgYGAKCiMjIyBCb290c3RyYXAKU3VwcG9zZSB3ZSB3YW50IHRvIGZpbmQgdGhlIHBvaW50IGVzdGltYXRlIG9mIHRoZSBwcm9wb3J0aW9uIG9mIG1hbGUgaW4gdGhlIHBvcHVsYXRpb24gYW5kIGl0cyBjb25maWRlbmNlIGludGVydmFsLiBOb3RlIHRoYXQgdGhlICpzZXgqIHZhcmlhYmxlIGlzIDEgaWYgdGhlIHN0dWRlbnQgaXMgbWFsZSBhbmQgMCBpZiBmZW1hbGUuIFRoZW4gdGhlIHBvaW50IGVzdGltYXRpb24gb2YgdGhlIHByb3BvcnRpb24gb2YgbWFsZSBpczoKYGBge3J9Cm1hbGUucGVyY2VudGFnZSA8LSBtZWFuKGRhdGEkc2V4KQptYWxlLnBlcmNlbnRhZ2UKYGBgCgpOb3cgd2UgYWxzbyB3YW50IHRvIGhhdmUgYSBjb25maWRlbmNlIGludGVydmFsIG9mIHRoaXMgZXN0aW1hdG9yLiBIb3dldmVyLCBjbGVhcmx5IHRoZSBkaXN0cmlidXRpb24gb2YgKnNleCogdmFyaWFibGUgaXMgbm90IE5vcm1hbCwgaXQgaXMgYSBCZXJub3VsbGkgcmFuZG9tIHZhcmlhYmxlLiBXZSBrbm93IG91ciBkYXRhIHdlcmUgZHJhd24gZnJvbSBhIHBvcHVsYXRpb24gd2l0aCBzaXplICROID0gMzE0JC4gSGVuY2UsIHdlIGZpcnN0IGNyZWF0ZSBhIGJvb3RzdHJhcCBwb3B1bGF0aW9uIG9mIHRoaXMgc2l6ZSBieSByZXBlYXRpbmcgZXZlcnkgc2FtcGxlIGZvciAkXGZyYWN7MzE0fXs5MX0gPSAzLjQ1JCB0aW1lcy4gSGVyZSwgd2UnbGwganVzdCBzcGVjaWZ5IHRoZSBwYXJhbWV0ZXIgKmxlbmd0aC5vdXQqIHRvIGJlIDMxNC4KYGBge3J9CmJvb3QucG9wdWxhdGlvbiA8LSByZXAoZGF0YSRzZXgsIGxlbmd0aC5vdXQgPSAzMTQpCmxlbmd0aChib290LnBvcHVsYXRpb24pCmBgYAoKVGhlbiB3ZSB3aWxsIGNob29zZSAkbiA9IDkxJCBzYW1wbGVzIGZyb20gdGhlIEJvb3RzdHJhcCBwb3B1bGF0aW9uIGFuZCBjYWxsIHRoaXMgYSBCb290c3RyYXAgc2FtcGxlLgpgYGB7cn0Kc2FtcGxlMSA8LSBzYW1wbGUoYm9vdC5wb3B1bGF0aW9uLCBzaXplID0gOTEsIHJlcGxhY2UgPSBGQUxTRSkKYGBgCgpDb250aW51ZSB0aGlzIHByb2NlZHVyZSB1bnRpbCB3ZSBoYXZlIDQwMCBCb290c3RyYXAgc2FtcGxlcy4KYGBge3J9CnNldC5zZWVkKDE4OTI4OSkKQiA9IDQwMCAjIHRoZSBudW1iZXIgb2YgYm9vdHN0cmFwIHNhbXBsZXMgd2Ugd2FudApib290LnNhbXBsZSA8LSBhcnJheShkaW0gPSBjKEIsIDkxKSkKZm9yIChpIGluIDE6QikgewogIGJvb3Quc2FtcGxlW2ksIF0gPC0gc2FtcGxlKGJvb3QucG9wdWxhdGlvbiwgc2l6ZSA9IDkxLCByZXBsYWNlID0gRkFMU0UpCn0KYGBgCgpUaGVuIHdlIGNhbiBjYWxjdWxhdGUgdGhlIHNhbXBsZSBtZWFuIGZvciBlYWNoIEJvb3RzdHJhcCBzYW1wbGUgKGkuZS4gZWFjaCByb3cgb2YgdGhlIEJvb3RzdHJhcCBzYW1wbGUgbWF0cml4KS4KYGBge3J9CmJvb3QubWVhbiA8LSBhcHBseShYID0gYm9vdC5zYW1wbGUsIE1BUkdJTiA9IDEsIEZVTiA9IG1lYW4pCmhlYWQoYm9vdC5tZWFuKQpgYGAKCkxldCdzIHNlZSB0aGUgaGlzdG9ncmFtIG9mIHRoZXNlIEJvb3RzdHJhcCBzYW1wbGUgbWVhbnMuCmBgYHtyfQpoaXN0KGJvb3QubWVhbiwgYnJlYWtzID0gMjAsIHByb2JhYmlsaXR5ID0gVFJVRSwgZGVuc2l0eSA9IDIwLCBjb2wgPSAzLCBib3JkZXIgPSAzKQpsaW5lcyhkZW5zaXR5KGJvb3QubWVhbiwgYWRqdXN0ID0gMiksIGNvbCA9IDIpCmBgYAoKQ2hlY2sgTm9ybWFsaXR5IGJ5IFEtUSBwbG90IGFuZCBTaGFwaXJvLVdpbGsgdGVzdC4KYGBge3J9CnBhcihwdHkgPSAncycpCnFxbm9ybShib290Lm1lYW4pCnFxbGluZShib290Lm1lYW4pCgpzaGFwaXJvLnRlc3QoYm9vdC5tZWFuKQpgYGAKClNvIHdlIGNhbiBhY2NlcHQgdGhhdCB0aGUgc2FtcGxlIG1lYW4gZm9sbG93cyBhIE5vcm1hbCBkaXN0cmlidXRpb24uIFRoZW4gd2UgY2FuIGNvbnN0cnVjdCAkOTVcJSQgY29uZmlkZW5jZSBpbnRlcnZhbHMuCmBgYHtyfQpib290LnNkIDwtIHNkKGJvb3QubWVhbikKbWFsZS5wZXJjZW50YWdlICsgYygtMSwgMSkqMS45Nipib290LnNkCmBgYAoK