Create Data

Let’s say today we are going to work with some simulated data. We generate \(N\) numbers from bernoulli distribution with success rate 0.3, and take them as our population.

N <- 1000
data.population <- rbinom(n=N, size=1, prob=0.3)

One can give a story line to this data. If put in the same context as the video games dataset that we have for homework, for example, this could be the response of the students in the whole school whether they played games in the past week or not. Thus, the school has 1000 students in total. 1 indicates the student played, and 0 indicates did not play.

Then in that sense, we might not have observed every one of these responses as in for the survey data in the video games dataset. Thus, we sample, say \(n\), observations from the population, and take them as our sample units.

n <- 300
ind.sample <- sample.int(n=N, size=n)
data.sample <- data.population[ind.sample]

Sample Statistics

Now we will stick with the story line. Once we have the sample data, we can compute a point estimate, as well as an interval estimate for the fraction of the students who played games in the past week or not.

A point estimate in this case is just the mean of the sample,

mean.sample <- mean(data.sample)
mean.sample
[1] 0.3033333

To get an interval estimate for the fraction, we follow the derivation in the lecture slides, the interval estimate is then given by, \[ \left(\bar x - 1.96 \sqrt{\frac{\bar x (1 - \bar x)}{n-1} \frac{N - n}{N}}, \bar x + 1.96 \sqrt{\frac{\bar x (1 - \bar x)}{n-1} \frac{N - n}{N}} \right), \] where \(\bar x\) indicates the sample mean, \(N\) indicates the population size, and \(n\) indicates the sample size. Thus, an interval estimate is then given as the following.

width <- 1.96 * sqrt(mean.sample*(1-mean.sample)*(N-n)/((n-1)*N))
int.sample <- c(mean.sample - width, mean.sample + width)
int.sample
[1] 0.2597378 0.3469289

Bootstrap Estimate

Another popular method is using the idea of bootstrap. We need a bootstrap population to start with. With the population size of \(N=1000\), and sample size of \(n=300\), approximately each sample occurs about 3 times in the bootstrap population. One can simply duplicate each observation in the sample 3 times, and treat the resulting sample as the bootstrap population.

One can also sample from the sample with replacement.

ind.boot <- sample.int(n, size=N, replace=TRUE)
data.boot <- data.sample[ind.boot]

With the bootstrap population, we are now ready to generate bootstrap sample means. We will take, say 2000, random samples of size \(n\).

B <- 2000
boot.sample.mean <- rep(NA, B)
for(i in 1:B){
  ind <- sample.int(N, size=n, replace=FALSE)
  boot <- data.boot[ind]
  boot.sample.mean[i] <- mean(boot)
}

Let’s take a look at the distribution of bootstrap sample means.

hist(boot.sample.mean)

The point estimate is the mean of bootstrap sample means.

mean.boot <- mean(boot.sample.mean)
mean.boot
[1] 0.300755

There are two approaches one can take from here. We can derive a confidence interval using the bootstrap sample means, as given by \[ \left( \bar x - 1.96 s, \bar x + 1.96s \right) \]

s <- sd(boot.sample.mean)
int.boot <- c(mean.boot - 1.96*s, mean.boot + 1.96*s)
int.boot
[1] 0.2580095 0.3435005

Another approach to derive an interval estimate using the bootstrap sample means, one can simply extract the 0.025-quantile and 0.975 quantile of the bootstrap sample means and arrive at an interval estimate.

int.boot <- c(quantile(boot.sample.mean, 0.025), quantile(boot.sample.mean, 0.975))
int.boot
     2.5%     97.5% 
0.2566667 0.3433333 

Comparing Two Distributions

To compare two distributions, the Kolmogorov-Smirnov (KS) statistics is a helpful measure. For example, say if we take another sample from the population data earlier, and compare it with the earlier sample we took.

ind.sample2 <- sample.int(n=N, size=n)
data.sample2 <- data.population[ind.sample2]
ks.test(data.sample, data.sample2)
p-value will be approximate in the presence of ties

    Two-sample Kolmogorov-Smirnov test

data:  data.sample and data.sample2
D = 0.043333, p-value = 0.9408
alternative hypothesis: two-sided

Now let’s take another sample that is completely different.

data.sample2 <- rbinom(n=n, size=1, prob=0.7)
ks.test(data.sample, data.sample2)
p-value will be approximate in the presence of ties

    Two-sample Kolmogorov-Smirnov test

data:  data.sample and data.sample2
D = 0.39, p-value < 2.2e-16
alternative hypothesis: two-sided

Important Features

Often we woule like to pick out important reasons resulting an outcome. For example, we would like to figure out what are the import reasons why students like/dislike video games.

We first load the data.

data <- read.table("videodata.txt", header=TRUE)

The package we are going to introduce today is named ‘tree’.

install.packages("tree")
Error in install.packages : Updating loaded packages
library(tree)

Since our task only ask for like or dislike, we need to congregate the column ‘like’ into 2 categories from 5, as a new column named ‘dis_like’. We will denote 1=never played, 4=not really, and 5=not at all of the column ‘like’ as 0=dislike in the new column ‘dis/like’, and the rest as 1=like.

data['dis_like'] <- rep(NA, dim(data)[1])
for(i in 1:dim(data)[1]){
  like <- data[i, 'like']
  if(like==0 || like==4 || like==5){
    data[i, 'dis_like'] = 0
  }else{
    data[i, 'dis_like'] = 1
  }
}

We then fit a tree on the data frame with appropriate columns.

data.tree <- tree(dis_like~educ+sex+age+home+math+work+own+cdrom+grade, data=data)
plot(data.tree, type="uniform")
text(data.tree)

LS0tCnRpdGxlOiAiTWF0aCAxODkvMjg5IExhYiA0IgphdXRob3I6ICJKaWFxaSBHdW8iCm91dHB1dDoKICBodG1sX25vdGVib29rOiBkZWZhdWx0CiAgcGRmX2RvY3VtZW50OiBkZWZhdWx0Ci0tLQoKIyBDcmVhdGUgRGF0YQpMZXQncyBzYXkgdG9kYXkgd2UgYXJlIGdvaW5nIHRvIHdvcmsgd2l0aCBzb21lIHNpbXVsYXRlZCBkYXRhLiBXZSBnZW5lcmF0ZSAkTiQgbnVtYmVycyBmcm9tIGJlcm5vdWxsaSBkaXN0cmlidXRpb24gd2l0aCBzdWNjZXNzIHJhdGUgMC4zLCBhbmQgdGFrZSB0aGVtIGFzIG91ciBwb3B1bGF0aW9uLgpgYGB7cn0KTiA8LSAxMDAwCmRhdGEucG9wdWxhdGlvbiA8LSByYmlub20obj1OLCBzaXplPTEsIHByb2I9MC4zKQpgYGAKCk9uZSBjYW4gZ2l2ZSBhIHN0b3J5IGxpbmUgdG8gdGhpcyBkYXRhLiBJZiBwdXQgaW4gdGhlIHNhbWUgY29udGV4dCBhcyB0aGUgdmlkZW8gZ2FtZXMgZGF0YXNldCB0aGF0IHdlIGhhdmUgZm9yIGhvbWV3b3JrLCBmb3IgZXhhbXBsZSwgdGhpcyBjb3VsZCBiZSB0aGUgcmVzcG9uc2Ugb2YgdGhlIHN0dWRlbnRzIGluIHRoZSB3aG9sZSBzY2hvb2wgd2hldGhlciB0aGV5IHBsYXllZCBnYW1lcyBpbiB0aGUgcGFzdCB3ZWVrIG9yIG5vdC4gVGh1cywgdGhlIHNjaG9vbCBoYXMgMTAwMCBzdHVkZW50cyBpbiB0b3RhbC4gMSBpbmRpY2F0ZXMgdGhlIHN0dWRlbnQgcGxheWVkLCBhbmQgMCBpbmRpY2F0ZXMgZGlkIG5vdCBwbGF5LgoKVGhlbiBpbiB0aGF0IHNlbnNlLCB3ZSBtaWdodCBub3QgaGF2ZSBvYnNlcnZlZCBldmVyeSBvbmUgb2YgdGhlc2UgcmVzcG9uc2VzIGFzIGluIGZvciB0aGUgc3VydmV5IGRhdGEgaW4gdGhlIHZpZGVvIGdhbWVzIGRhdGFzZXQuIFRodXMsIHdlIHNhbXBsZSwgc2F5ICRuJCwgb2JzZXJ2YXRpb25zIGZyb20gdGhlIHBvcHVsYXRpb24sIGFuZCB0YWtlIHRoZW0gYXMgb3VyIHNhbXBsZSB1bml0cy4KYGBge3J9Cm4gPC0gMzAwCmluZC5zYW1wbGUgPC0gc2FtcGxlLmludChuPU4sIHNpemU9bikKZGF0YS5zYW1wbGUgPC0gZGF0YS5wb3B1bGF0aW9uW2luZC5zYW1wbGVdCmBgYAoKIyBTYW1wbGUgU3RhdGlzdGljcwpOb3cgd2Ugd2lsbCBzdGljayB3aXRoIHRoZSBzdG9yeSBsaW5lLiBPbmNlIHdlIGhhdmUgdGhlIHNhbXBsZSBkYXRhLCB3ZSBjYW4gY29tcHV0ZSBhIHBvaW50IGVzdGltYXRlLCBhcyB3ZWxsIGFzIGFuIGludGVydmFsIGVzdGltYXRlIGZvciB0aGUgZnJhY3Rpb24gb2YgdGhlIHN0dWRlbnRzIHdobyBwbGF5ZWQgZ2FtZXMgaW4gdGhlIHBhc3Qgd2VlayBvciBub3QuCgpBIHBvaW50IGVzdGltYXRlIGluIHRoaXMgY2FzZSBpcyBqdXN0IHRoZSBtZWFuIG9mIHRoZSBzYW1wbGUsCmBgYHtyfQptZWFuLnNhbXBsZSA8LSBtZWFuKGRhdGEuc2FtcGxlKQptZWFuLnNhbXBsZQpgYGAKClRvIGdldCBhbiBpbnRlcnZhbCBlc3RpbWF0ZSBmb3IgdGhlIGZyYWN0aW9uLCB3ZSBmb2xsb3cgdGhlIGRlcml2YXRpb24gaW4gdGhlIGxlY3R1cmUgc2xpZGVzLCB0aGUgaW50ZXJ2YWwgZXN0aW1hdGUgaXMgdGhlbiBnaXZlbiBieSwKJCQgXGxlZnQoXGJhciB4IC0gMS45NiBcc3FydHtcZnJhY3tcYmFyIHggKDEgLSBcYmFyIHgpfXtuLTF9IFxmcmFje04gLSBufXtOfX0sIFxiYXIgeCArIDEuOTYgXHNxcnR7XGZyYWN7XGJhciB4ICgxIC0gXGJhciB4KX17bi0xfSBcZnJhY3tOIC0gbn17Tn19IFxyaWdodCksICQkCndoZXJlICRcYmFyIHgkIGluZGljYXRlcyB0aGUgc2FtcGxlIG1lYW4sICROJCBpbmRpY2F0ZXMgdGhlIHBvcHVsYXRpb24gc2l6ZSwgYW5kICRuJCBpbmRpY2F0ZXMgdGhlIHNhbXBsZSBzaXplLiBUaHVzLCBhbiBpbnRlcnZhbCBlc3RpbWF0ZSBpcyB0aGVuIGdpdmVuIGFzIHRoZSBmb2xsb3dpbmcuCmBgYHtyfQp3aWR0aCA8LSAxLjk2ICogc3FydChtZWFuLnNhbXBsZSooMS1tZWFuLnNhbXBsZSkqKE4tbikvKChuLTEpKk4pKQppbnQuc2FtcGxlIDwtIGMobWVhbi5zYW1wbGUgLSB3aWR0aCwgbWVhbi5zYW1wbGUgKyB3aWR0aCkKaW50LnNhbXBsZQpgYGAKCiMgQm9vdHN0cmFwIEVzdGltYXRlCkFub3RoZXIgcG9wdWxhciBtZXRob2QgaXMgdXNpbmcgdGhlIGlkZWEgb2YgYm9vdHN0cmFwLiBXZSBuZWVkIGEgYm9vdHN0cmFwIHBvcHVsYXRpb24gdG8gc3RhcnQgd2l0aC4gV2l0aCB0aGUgcG9wdWxhdGlvbiBzaXplIG9mICROPTEwMDAkLCBhbmQgc2FtcGxlIHNpemUgb2YgJG49MzAwJCwgYXBwcm94aW1hdGVseSBlYWNoIHNhbXBsZSBvY2N1cnMgYWJvdXQgMyB0aW1lcyBpbiB0aGUgYm9vdHN0cmFwIHBvcHVsYXRpb24uIE9uZSBjYW4gc2ltcGx5IGR1cGxpY2F0ZSBlYWNoIG9ic2VydmF0aW9uIGluIHRoZSBzYW1wbGUgMyB0aW1lcywgYW5kIHRyZWF0IHRoZSByZXN1bHRpbmcgc2FtcGxlIGFzIHRoZSBib290c3RyYXAgcG9wdWxhdGlvbi4KCk9uZSBjYW4gYWxzbyBzYW1wbGUgZnJvbSB0aGUgc2FtcGxlIHdpdGggcmVwbGFjZW1lbnQuCmBgYHtyfQppbmQuYm9vdCA8LSBzYW1wbGUuaW50KG4sIHNpemU9TiwgcmVwbGFjZT1UUlVFKQpkYXRhLmJvb3QgPC0gZGF0YS5zYW1wbGVbaW5kLmJvb3RdCmBgYAoKV2l0aCB0aGUgYm9vdHN0cmFwIHBvcHVsYXRpb24sIHdlIGFyZSBub3cgcmVhZHkgdG8gZ2VuZXJhdGUgYm9vdHN0cmFwIHNhbXBsZSBtZWFucy4gV2Ugd2lsbCB0YWtlLCBzYXkgMjAwMCwgcmFuZG9tIHNhbXBsZXMgb2Ygc2l6ZSAkbiQuCmBgYHtyfQpCIDwtIDIwMDAKYm9vdC5zYW1wbGUubWVhbiA8LSByZXAoTkEsIEIpCmZvcihpIGluIDE6Qil7CiAgaW5kIDwtIHNhbXBsZS5pbnQoTiwgc2l6ZT1uLCByZXBsYWNlPUZBTFNFKQogIGJvb3QgPC0gZGF0YS5ib290W2luZF0KICBib290LnNhbXBsZS5tZWFuW2ldIDwtIG1lYW4oYm9vdCkKfQpgYGAKCkxldCdzIHRha2UgYSBsb29rIGF0IHRoZSBkaXN0cmlidXRpb24gb2YgYm9vdHN0cmFwIHNhbXBsZSBtZWFucy4gCmBgYHtyfQpoaXN0KGJvb3Quc2FtcGxlLm1lYW4pCmBgYAoKVGhlIHBvaW50IGVzdGltYXRlIGlzIHRoZSBtZWFuIG9mIGJvb3RzdHJhcCBzYW1wbGUgbWVhbnMuCmBgYHtyfQptZWFuLmJvb3QgPC0gbWVhbihib290LnNhbXBsZS5tZWFuKQptZWFuLmJvb3QKYGBgCgpUaGVyZSBhcmUgdHdvIGFwcHJvYWNoZXMgb25lIGNhbiB0YWtlIGZyb20gaGVyZS4gV2UgY2FuIGRlcml2ZSBhIGNvbmZpZGVuY2UgaW50ZXJ2YWwgdXNpbmcgdGhlIGJvb3RzdHJhcCBzYW1wbGUgbWVhbnMsIGFzIGdpdmVuIGJ5CiQkIFxsZWZ0KCBcYmFyIHggLSAxLjk2IHMsIFxiYXIgeCArIDEuOTZzIFxyaWdodCkgJCQKYGBge3J9CnMgPC0gc2QoYm9vdC5zYW1wbGUubWVhbikKaW50LmJvb3QgPC0gYyhtZWFuLmJvb3QgLSAxLjk2KnMsIG1lYW4uYm9vdCArIDEuOTYqcykKaW50LmJvb3QKYGBgCgpBbm90aGVyIGFwcHJvYWNoIHRvIGRlcml2ZSBhbiBpbnRlcnZhbCBlc3RpbWF0ZSB1c2luZyB0aGUgYm9vdHN0cmFwIHNhbXBsZSBtZWFucywgb25lIGNhbiBzaW1wbHkgZXh0cmFjdCB0aGUgMC4wMjUtcXVhbnRpbGUgYW5kIDAuOTc1IHF1YW50aWxlIG9mIHRoZSBib290c3RyYXAgc2FtcGxlIG1lYW5zIGFuZCBhcnJpdmUgYXQgYW4gaW50ZXJ2YWwgZXN0aW1hdGUuCmBgYHtyfQppbnQuYm9vdCA8LSBjKHF1YW50aWxlKGJvb3Quc2FtcGxlLm1lYW4sIDAuMDI1KSwgcXVhbnRpbGUoYm9vdC5zYW1wbGUubWVhbiwgMC45NzUpKQppbnQuYm9vdApgYGAKCiMgQ29tcGFyaW5nIFR3byBEaXN0cmlidXRpb25zClRvIGNvbXBhcmUgdHdvIGRpc3RyaWJ1dGlvbnMsIHRoZSBLb2xtb2dvcm92LVNtaXJub3YgKEtTKSBzdGF0aXN0aWNzIGlzIGEgaGVscGZ1bCBtZWFzdXJlLiBGb3IgZXhhbXBsZSwgc2F5IGlmIHdlIHRha2UgYW5vdGhlciBzYW1wbGUgZnJvbSB0aGUgcG9wdWxhdGlvbiBkYXRhIGVhcmxpZXIsIGFuZCBjb21wYXJlIGl0IHdpdGggdGhlIGVhcmxpZXIgc2FtcGxlIHdlIHRvb2suCmBgYHtyfQppbmQuc2FtcGxlMiA8LSBzYW1wbGUuaW50KG49Tiwgc2l6ZT1uKQpkYXRhLnNhbXBsZTIgPC0gZGF0YS5wb3B1bGF0aW9uW2luZC5zYW1wbGUyXQprcy50ZXN0KGRhdGEuc2FtcGxlLCBkYXRhLnNhbXBsZTIpCmBgYAoKTm93IGxldCdzIHRha2UgYW5vdGhlciBzYW1wbGUgdGhhdCBpcyBjb21wbGV0ZWx5IGRpZmZlcmVudC4KYGBge3J9CmRhdGEuc2FtcGxlMiA8LSByYmlub20obj1uLCBzaXplPTEsIHByb2I9MC43KQprcy50ZXN0KGRhdGEuc2FtcGxlLCBkYXRhLnNhbXBsZTIpCmBgYAoKIyBJbXBvcnRhbnQgRmVhdHVyZXMKT2Z0ZW4gd2Ugd291bGUgbGlrZSB0byBwaWNrIG91dCBpbXBvcnRhbnQgcmVhc29ucyByZXN1bHRpbmcgYW4gb3V0Y29tZS4gRm9yIGV4YW1wbGUsIHdlIHdvdWxkIGxpa2UgdG8gZmlndXJlIG91dCB3aGF0IGFyZSB0aGUgaW1wb3J0IHJlYXNvbnMgd2h5IHN0dWRlbnRzIGxpa2UvZGlzbGlrZSB2aWRlbyBnYW1lcy4KCldlIGZpcnN0IGxvYWQgdGhlIGRhdGEuCmBgYHtyfQpkYXRhIDwtIHJlYWQudGFibGUoInZpZGVvZGF0YS50eHQiLCBoZWFkZXI9VFJVRSkKYGBgCgpUaGUgcGFja2FnZSB3ZSBhcmUgZ29pbmcgdG8gaW50cm9kdWNlIHRvZGF5IGlzIG5hbWVkICd0cmVlJy4KYGBge3J9Cmluc3RhbGwucGFja2FnZXMoInRyZWUiKQpsaWJyYXJ5KHRyZWUpCmBgYAoKU2luY2Ugb3VyIHRhc2sgb25seSBhc2sgZm9yIGxpa2Ugb3IgZGlzbGlrZSwgd2UgbmVlZCB0byBjb25ncmVnYXRlIHRoZSBjb2x1bW4gJ2xpa2UnIGludG8gMiBjYXRlZ29yaWVzIGZyb20gNSwgYXMgYSBuZXcgY29sdW1uIG5hbWVkICdkaXNfbGlrZScuIFdlIHdpbGwgZGVub3RlIDE9bmV2ZXIgcGxheWVkLCA0PW5vdCByZWFsbHksIGFuZCA1PW5vdCBhdCBhbGwgb2YgdGhlIGNvbHVtbiAnbGlrZScgYXMgMD1kaXNsaWtlIGluIHRoZSBuZXcgY29sdW1uICdkaXMvbGlrZScsIGFuZCB0aGUgcmVzdCBhcyAxPWxpa2UuCmBgYHtyfQpkYXRhWydkaXNfbGlrZSddIDwtIHJlcChOQSwgZGltKGRhdGEpWzFdKQpmb3IoaSBpbiAxOmRpbShkYXRhKVsxXSl7CiAgbGlrZSA8LSBkYXRhW2ksICdsaWtlJ10KICBpZihsaWtlPT0wIHx8IGxpa2U9PTQgfHwgbGlrZT09NSl7CiAgICBkYXRhW2ksICdkaXNfbGlrZSddID0gMAogIH1lbHNlewogICAgZGF0YVtpLCAnZGlzX2xpa2UnXSA9IDEKICB9Cn0KYGBgCgpXZSB0aGVuIGZpdCBhIHRyZWUgb24gdGhlIGRhdGEgZnJhbWUgd2l0aCBhcHByb3ByaWF0ZSBjb2x1bW5zLgpgYGB7cn0KZGF0YS50cmVlIDwtIHRyZWUoZGlzX2xpa2V+ZWR1YytzZXgrYWdlK2hvbWUrbWF0aCt3b3JrK293bitjZHJvbStncmFkZSwgZGF0YT1kYXRhKQpwbG90KGRhdGEudHJlZSwgdHlwZT0idW5pZm9ybSIpCnRleHQoZGF0YS50cmVlKQpgYGAKCg==