Uniform and Poisson distribution
Let’s generate 200 samples from uniform distribution in the interval \([0,10]\).
set.seed(2017)
n <- 1000
sample <- runif(n, min = 0, max = 10)
hist(sample, breaks = 100, probability = TRUE, col = 3, main = "Uniform distribution samples")
lines(density(sample, adjust = 2), col = 2)
If we divide the interval \([0,10]\) into \(k\) sub-intervals and count how many data points are inside each of them, say \(n_1, \cdots, n_k\). Then these counts will follow Poisson distribution. That is, \[ \mathbb{P} (m \text{ points in a unit interval}) = \frac{\lambda^m}{m!}e^{-\lambda}, m=0,1,\cdots\]
Therefore, we expect \(\{n_1, \cdots, n_k\}\) to follow a Poisson. Let’s verify it by simulation. We divide the interval into 100 sub-intervals, i.e. \(k = 100\).
k <- 100
tab <- table(cut(sample, breaks = seq(0, 10, length.out = k+1), include.lowest = TRUE))
head(tab, 10)
[0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
13 10 13 8 11 5 11 7 6 7
counts <- as.vector(tab)
head(counts, 10)
[1] 13 10 13 8 11 5 11 7 6 7
So by theory, the counts should follow a Poisson process. Let’s compare it with a generated Poisson sequence. We know \(\lambda\) equals to the expected number of points in an interval, so here, we set \(\lambda = k^{-1}\sum_{i=1}^k n_i\).
hist(counts, breaks = 15, col = rgb(1,0,0,0.5), probability = TRUE, xlab = "number of points inside an interval", ylim = c(0,0.2))
lines(density(counts, adjust = 2), col = rgb(1,0,0,0.5))
Pois <- rpois(1000, lambda = mean(counts))
hist(Pois, breaks = 15, col = rgb(0,0,1,0.5), probability = TRUE, add = TRUE)
lines(density(Pois, adjust = 2), col = rgb(0,0,1,0.5))
legend(x = 14, y = 0.15, legend = c("sample", "Poisson"), lty = c(1,1), col = c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)))
We can see our counts variable is indeed very close to Poisson distribution.
Pearson’s \(\chi^2\) Test (goodness of fit)
A test of goodness of fit establishes whether an observed frequency distribution differs from a theoretical distribution.
Test procedure
Let’s now use the Peason’s \(\chi^2\) test to check whether the sample does follow Uniform distribution (because we know we generated the sample from Uniform, so of course we should expect a good result).
Step 1. Calculate the chi-squared test statistic, \(\chi^2\), which resembles a normalized sum of squared deviations between observed and theoretical frequencies, \(O_i\) and \(E_i\). The formula is \[ \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} = n \sum_{i=1}^k \frac{(O_i/n - p_i)^2}{p_i}, \] where \(k\) is the number of cells or intervals in which you calculated frequencies, \(n\) is the total number of samples, and \(p_i = E_i/n\) is the expected probability in the \(i\)-th interval.
Let’s calculate the observed and theoretical frequencies for our sample and Uniform distribution. We already have our \(O_i\)’s, for Uniform distribution, we know \(E_i = n/k = 1000/100\) in this case. We have \(\chi^2 = 82.2\).
E_i <- n/k
chi_2 <- sum((counts - E_i)^2/E_i)
chi_2
Step 2. Determine the degrees of freedom, df, of that statistic. In the test of goodness of fit, the degree of freedom equals to \(k - p\), where \(k\) is the number of cells or intervals, and \(p = s+1\) is the contraint in that distribution, here, \(s\) is the number of parameters for the distribution we want to test against. For example, a Normal distribution \(\mathcal{N}(\mu, \sigma^2)\) has two parameters, \(\mu\) and \(\sigma\), thus \(s = 2\). In our discrete Uniform case, we have \(p = 1\). So df will be \(100-1=99\).
Step 3. Select a desired level of confidence. We will select \(\alpha = 0.05\).
- Step 4.
- Compare the test statistic \(\chi^2\) to the value from the chi-squared distribution with df degrees of freedom and the selected confidence level, \(1 - \alpha\).
- Or we can calculate the \(p\)-value, \(\mathbb{P}(X > \chi^2)\), where \(X\) is a chi-square random variable with df degree of freedoms.
- Note that the test is one-sided.
chi2_compare <- qchisq(p = 0.95, df = 99)
chi2_compare
p_value <- 1 - pchisq(chi_2, df = 99)
p_value
Built-in R function
We can also use the built-in function chisq.test()
.
p_i <- rep(E_i/n, k)
chisq.test(counts, p = p_i)
Standardized residual plot
We can also plot the standardized residuals, which are defined to be \[ R_i = \frac{|O_i - E_i|}{\sqrt{E_i}}\] in each interval. So recall the definition of \(\chi^2\) statistic, we actually have \[\chi^2 = \sum_{i=1}^k R_i^2.\]
Residuals <- (counts - E_i) / sqrt(E_i)
plot(Residuals, type = 'h', ylab = "standardized residuals", xlab = "interval index")
Two Sample t-test
- Two sample t-test can be used to determine if the means of two sets of data are significantly different. But one of the underlying assumptions is that both group of data follow Normal distribution. So we should first check if Normality assumption is satisfied (Q-Q plot, ecdf, Shapiro–Wilk, Kolmogorov–Smirnov) before we apply this test.
- We should first check if two samples have the same variance. This can be done using Fisher’s F-test (it counts in the effect of different sample sizes).
- Details of two sample t-test are postponed to MATH 181 if you have not touched it before.
set.seed(2017)
sample1 <- rnorm(200, mean = 0, sd = 1)
sample2 <- rnorm(300, mean = 2, sd = 1)
sample3 <- rnorm(250, mean = 0, sd = 10)
Check if two samples have the same variance.
var.test(sample1, sample2)
F test to compare two variances
data: sample1 and sample2
F = 0.96949, num df = 199, denom df = 299, p-value = 0.8179
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.7543491 1.2542128
sample estimates:
ratio of variances
0.9694931
var.test(sample1, sample3)
F test to compare two variances
data: sample1 and sample3
F = 0.0096017, num df = 199, denom df = 249, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.007386628 0.012530430
sample estimates:
ratio of variances
0.009601719
At level \(\alpha = 0.05\), we accept sample1 and sample2 have the same variance, and reject the hypothesis that sample1 and sample3 have the same variance.
t.test(sample1, sample2, paired = FALSE, var.equal = TRUE)
Two Sample t-test
data: sample1 and sample2
t = -22.493, df = 498, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.160604 -1.813477
sample estimates:
mean of x mean of y
0.02008153 2.00712204
t.test(sample1, sample3, paired = FALSE, var.equal = FALSE)
Welch Two Sample t-test
data: sample1 and sample3
t = -1.2912, df = 254.97, p-value = 0.1978
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.0295776 0.4221389
sample estimates:
mean of x mean of y
0.02008153 0.82380085
At level \(\alpha = 0.05\), we can conclude that sample1 and sample2 have different means, while sample1 and sample3 have the same mean.
- So what if the Normality assumption is not valid. Well, when sample size is large enough, by the central limit theorem, the two sample t-test is still a decent way to achieve our goal. In another word, when the sample size is moderate or large, the two sample t-test is robust to non-Normality.
---
title: "Math 189/289 Lab 3"
author: "Alex Hanbo Li"
output: html_notebook
---

# Uniform and Poisson distribution
Let's generate 200 samples from uniform distribution in the interval $[0,10]$.
```{r}
set.seed(2017)
n <- 1000
sample <- runif(n, min = 0, max = 10)
hist(sample, breaks = 100, probability = TRUE, col = 3, main = "Uniform distribution samples")
lines(density(sample, adjust = 2), col = 2)
```

If we divide the interval $[0,10]$ into $k$ sub-intervals and count how many data points are inside each of them, say $n_1, \cdots, n_k$. Then these counts will follow Poisson distribution. That is,
$$ \mathbb{P} (m \text{ points in a unit interval}) = \frac{\lambda^m}{m!}e^{-\lambda}, m=0,1,\cdots$$

Therefore, we expect $\{n_1, \cdots, n_k\}$ to follow a Poisson. Let's verify it by simulation. We divide the interval into 100 sub-intervals, i.e. $k = 100$.
```{r}
k <- 100
tab <- table(cut(sample, breaks = seq(0, 10, length.out = k+1), include.lowest = TRUE))
head(tab, 10)
counts <- as.vector(tab)
head(counts, 10)
```

So by theory, the **counts** should follow a Poisson process. Let's compare it with a generated Poisson sequence. We know $\lambda$ equals to the expected number of points in an interval, so here, we set $\lambda = k^{-1}\sum_{i=1}^k n_i$.
```{r}
hist(counts, breaks = 15, col = rgb(1,0,0,0.5), probability = TRUE, xlab = "number of points inside an interval", ylim = c(0,0.2))
lines(density(counts, adjust = 2), col = rgb(1,0,0,0.5))
Pois <- rpois(1000, lambda = mean(counts))
hist(Pois, breaks = 15, col = rgb(0,0,1,0.5), probability = TRUE, add = TRUE)
lines(density(Pois, adjust = 2), col = rgb(0,0,1,0.5))
legend(x = 14, y = 0.15, legend = c("sample", "Poisson"), lty = c(1,1), col = c(rgb(1,0,0,0.5), rgb(0,0,1,0.5)))
```

We can see our **counts** variable is indeed very close to Poisson distribution.



# Pearson's $\chi^2$ Test (goodness of fit)

A test of goodness of fit establishes whether an observed **frequency distribution** differs from a **theoretical distribution**.

### Test procedure
Let's now use the Peason's $\chi^2$ test to check whether the **sample** does follow Uniform distribution (because we know we generated the sample from Uniform, so of course we should expect a good result).

* Step 1. 
Calculate the chi-squared test statistic, $\chi^2$, which resembles a normalized sum of squared deviations between observed and theoretical frequencies, $O_i$ and $E_i$. The formula is 
$$ \chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} = n \sum_{i=1}^k \frac{(O_i/n - p_i)^2}{p_i}, $$
where $k$ is the number of cells or intervals in which you calculated frequencies, $n$ is the total number of samples, and $p_i = E_i/n$ is the expected probability in the $i$-th interval.

    Let's calculate the observed and theoretical frequencies for our sample and Uniform distribution. 
    We already have our $O_i$'s, for Uniform distribution, we know $E_i = n/k = 1000/100$ in this case. We have $\chi^2 = 82.2$.
    

```{r}
E_i <- n/k
chi_2 <- sum((counts - E_i)^2/E_i)
chi_2
```

* Step 2.
Determine the degrees of freedom, *df*, of that statistic. In the test of goodness of fit, the degree of freedom equals to $k - p$, where $k$ is the number of cells or intervals, and $p = s+1$ is the contraint in that distribution, here, $s$ is the number of parameters for the distribution we want to test against. For example, a Normal distribution $\mathcal{N}(\mu, \sigma^2)$ has two parameters, $\mu$ and $\sigma$, thus $s = 2$. In our discrete Uniform case, we have $p = 1$. So *df* will be $100-1=99$.

* Step 3.
Select a desired level of confidence. We will select $\alpha = 0.05$.

* Step 4.
    + Compare the test statistic $\chi^2$ to the value from the chi-squared distribution with *df* degrees of freedom and the selected confidence level, $1 - \alpha$.
    + Or we can calculate the $p$-value, $\mathbb{P}(X > \chi^2)$, where $X$ is a chi-square random variable with *df* degree of freedoms.
    + Note that the test is one-sided.
```{r}
chi2_compare <- qchisq(p = 0.95, df = 99)
chi2_compare
p_value <- 1 - pchisq(chi_2, df = 99)
p_value
```

### Built-in R function
We can also use the built-in function `chisq.test()`.
```{r}
p_i <- rep(E_i/n, k)
chisq.test(counts, p = p_i)
```

### Standardized residual plot
We can also plot the standardized residuals, which are defined to be
$$ R_i = \frac{|O_i - E_i|}{\sqrt{E_i}}$$
in each interval.
So recall the definition of $\chi^2$ statistic, we actually have 
$$\chi^2 = \sum_{i=1}^k R_i^2.$$
```{r}
Residuals <- (counts - E_i) / sqrt(E_i)
plot(Residuals, type = 'h', ylab = "standardized residuals", xlab = "interval index")
```

# Two Sample t-test
* Two sample t-test can be used to determine if the means of two sets of data are significantly different. But one of the underlying assumptions is that both group of data follow Normal distribution. So we should first check if Normality assumption is satisfied (Q-Q plot, ecdf, Shapiro–Wilk, Kolmogorov–Smirnov) before we apply this test.
* We should first check if two samples have the same variance. This can be done using Fisher's F-test (it counts in the effect of different sample sizes).
* Details of two sample t-test are postponed to MATH 181 if you have not touched it before.
```{r}
set.seed(2017)
sample1 <- rnorm(200, mean = 0, sd = 1)
sample2 <- rnorm(300, mean = 2, sd = 1)
sample3 <- rnorm(250, mean = 0, sd = 10)
```

Check if two samples have the same variance.
```{r}
var.test(sample1, sample2)
var.test(sample1, sample3)
```

At level $\alpha = 0.05$, we accept *sample1* and *sample2* have the same variance, and reject the hypothesis that *sample1* and *sample3* have the same variance.

```{r}
t.test(sample1, sample2, paired = FALSE, var.equal = TRUE)
t.test(sample1, sample3, paired = FALSE, var.equal = FALSE)
```
At level $\alpha = 0.05$, we can conclude that *sample1* and *sample2* have different means, while *sample1* and *sample3* have the same mean.

* So what if the Normality assumption is not valid. Well, when sample size is large enough, by the central limit theorem, the two sample t-test is still a decent way to achieve our goal. In another word, when the sample size is moderate or large, the two sample t-test is robust to non-Normality.

