Math 189

Exploratory Data Analysis and Inference
Winter 2020

 

                               ********** Announcements ************

 

·      Please form groups of 5 students each by the end of week 2; any groups not of size 5 will be subject to re-assignment of group membership.

·      Be sure to attend the week 1 TA sessions, which will cover how to get started with R, a statistical programming language and environment.

·      Student presentations are mostly on Mondays, but could also be on other days if Monday is a holiday, or if we need more time slots.

·      If you have added the course late, please email your TA to obtain access to Gradescope.

·      Suggestions for student presentations:

1)    Bring your laptop and adaptor, or a USB drive to plug into my MacBook Air

2)    Show your data set (perhaps in Excel)

3)    Show/talk through background, description, summary etc.

·      Instructions for presentation on Wed 1/22:

1)    Groups #1 and 2 will present;

2)    Prepare a few slides to describe your data and research question; try not to just show a web link, which might have too much content for the audience to read through in a short amount of time;

3)    Be prepared to have your data ready to show (e.g., in Excel);

4)    Show your Table 1;

5)    Show plots;

6)    Each member of the group should present something; aim for a total of 20 minutes;

·      Important from the grader:

1.    Use your name as appearing in Gradescope

2.    For group projects, only one member of the group should submit the homework and add the other group members on the Gradescope submission page; do NOT make duplicate submissions.

3.    On the first page of each group assignment, list all members' names and email addresses.

·      Groups #3 and 4 will continue the presentations on Monday 1/27 (the membership of groups 5, 8, 17, and 19 has been updated; please check Canvas)

·      Everyone is required to attend student presentations: it shows respect for your fellow students, and it is where you learn to improve your own work. Attendance will be checked at random, and each absence will incur a 2% deduction from the final grade.

·      Groups #5 and 6 will present on Monday 2/3: group 5 will present their data set as the previous groups did; group 6 will present the simulation assignment of week 4.

·      TA sessions have started to introduce R Markdown. You are encouraged to use it in week 5 homework, and will be required to use it starting from week 6 homework and for the rest of the quarter when doing data analysis including the final project.

·      Groups #7 and 8 will present part 3) of the week 5 assignment on Monday 2/10; be sure to give an overview of your data first.

·      Groups #9 and 10 will present on Wednesday 2/19 part 3) of week 6 assignment; group 9 please focus on discussing the use of regression models for prediction, and group 10 on the use for causal inference.

·      Groups #11 and 12 will present on Monday 2/24 part 2) of week 7 assignment (although the written version is due end of week 8); group 11 please focus on the stepwise approach, and group 12 on R-squared and AIC.

·      Groups #13 and 14 will present on Monday 3/2 part 2) of week 8 assignment; please have R on your laptop so we can generate random group numbers to take attendance.

·      Week 10 presentations: each group should give a comprehensive presentation of the analysis of its data set, paying some attention to univariate screening; if you have time to try some tree-based methods, that would be nice too.

 

 

Overview: This course will build upon previous iterations of MATH 189, with emphasis on data sense, intuition, and the skills of data analysis. Speaking personally as the instructor, I have been doing and then overseeing real-world data analysis for over 20 years since my Ph.D. The course is planned to be taught in a more interactive fashion during lecture hours, including discussions, group presentations, and critiques every week. In other words, this will be a partially ‘flipped’ (hybrid) classroom. The main idea is to learn from mistakes. For this purpose, a preliminary version of the homework sometimes needs to be done (and collected) ahead of the presentation, and an improved version can then be handed in. Formal statistical knowledge is necessary for good analysis, but it is not the sole target of this course; it will be covered by a combination of prerequisites, lectures, and TA sessions.

 

After an initial introduction to all things data-related, the content will focus more on categorical outcome data (a gap in our undergraduate curriculum), which leads naturally into classification and related topics.

 

Important Note: You are strongly encouraged to attend lectures where interactions happen.

 

 

Lecture:  MWF 4:00-4:50pm, CSB 001

 

 

Instructor: Ronghui (Lily) Xu

Office:  APM 5856

Phone:  534-6380
Email: rxu@ucsd.edu
 

Office Hours:

 Wed 2-3pm, or by appointment.

 

 

Teaching Assistants:  

Yuqian Zhang, yuz643@ucsd.edu, office hours F 2-4pm, APM 1210

Yuyao Wang, yuw079@ucsd.edu, office hours F 12:30-2:30pm, APM 6446

 

Reference books:

 

1.   OpenIntro Statistics, https://www.openintro.org

2.   James et al., An Introduction to Statistical Learning, Springer, 2013.

3.   Li and Xu (eds), High-Dimensional Data Analysis in Cancer Research, Springer, 2009.

 

 

Topics covered: (future topics are subject to updates)

 

Week 1: active learning and why; importance of communication in data science; random variables, sample and population; data examples; unboxing the data.

Week 2: exploratory data analysis; “Table 1”; principles of visualization (better plots).

Week 3: visualization continued; concept of inference.

Week 4: confidence intervals; hypothesis testing.

Week 5: exact tests; 2x2 contingency table; odds ratio.

Week 6: logistic regression: inference.

Week 7: logistic regression: variable screening, model building (stepwise, generalized R-squared, information criteria).

Week 8: prediction error; cross-validation; classification trees (CART approach).

Week 9: tree-based ensembles; high-dimensional data methods (LASSO).

 

 

 

Homework:  You may discuss the assignments with others, but please write them up independently. Write your solutions, answers, and results in your own words (and in complete sentences). In general, clearly lay out the context (including background and setup as applicable), the solution, and the interpretation of the analysis results for a non-statistical audience in the main part, and append the R code at the end; everything must be turned in. More instructions will be given for specific assignments.  If two students or groups turn in exactly the same solutions, this may be considered plagiarism, in which case 0 points will be given to all parties involved, and any additional action will be determined by the office for academic integrity.

 

Homework is due on Gradescope by 11:59pm on Sunday of the same week (e.g., Week 1 is due on 1/12) unless specified otherwise. No late homework will be accepted. Assignments are individual unless specified as group assignments.

 

  Week 1: 

1) Write a paragraph with at least 5 (and no more than 10) sentences on how you feel about 'comfort zone'. Then write a 2nd paragraph with 2-3 sentences on how it relates to this course (MATH 189).

2) Find your own data set and write a description about: where it came from, what it was collected for, how many observations and how many variables, what are some examples of research questions it can answer. Prepare to discuss your data next Monday.

 

  Week 2: 

1) submit your group membership in a PDF file with 5 names, and you will get a group number assigned in Gradescope;

2) [due 11:59pm on Tuesday 1/21; you may do it as a group or individual assignment] Continue with the data set and description from week 1 (be sure to include the description), or use a different data set, in which case you need to rewrite the description. Think of and state clearly a research question in which you will compare 2-4 groups (if the data do not come with this many groups, you can most likely categorize a variable into groups). Produce a Table 1 similar to the one in the Leflunomide paper, then plot histograms, densities, and boxplots for the continuous variables, and bar plots for the discrete variables. Make the plots by the same groups as in your Table 1, and try to place them side by side. Do these plots for up to 10 variables.

 

  Week 3: 

[group project] Submit an improved and final version of part 2) from Week 2 above. As I said in class, you don’t need to provide p-values in Table 1.

 

  Week 4: 

1)   The New York Times is well known for its data graphics. Find a favorite data presentation of theirs (The Upshot is a good section to try) and submit it with the reasons why you like it.

2)   A) For Y ~ Binomial (n, p), write down the formula for a 95% confidence interval (CI) of p.  B) For n = 100 and p = 0.1, 0.2, 0.3, 0.4, 0.5, respectively, simulate 500 such Y’s. Tabulate over these 500 simulation runs: the average of the estimated p’s (call them p_hat’s), the empirical variance of the p_hat’s, the average of the estimated variances of the p_hat’s, the proportion of the 95% CI’s that contain the true value of p, and the average of the length of the 95% CI’s. Discuss the simulation results. 
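The simulation in part 2) can be laid out roughly as follows. This is only a sketch under stated assumptions: it uses the Wald interval p_hat ± 1.96·sqrt(p_hat(1 − p_hat)/n) as the 95% CI (the assignment does not name a specific interval), the function name `simulate_binomial_ci` and the seed are made up for illustration, and it is written in Python only to show the bookkeeping; the course itself uses R.

```python
import math
import random
import statistics

def simulate_binomial_ci(n, p, runs=500, z=1.96, seed=0):
    """Simulate `runs` draws of Y ~ Binomial(n, p) and summarize the Wald CI.

    Assumption: the 95% CI is the Wald interval p_hat +/- z * sqrt(p_hat(1-p_hat)/n).
    """
    rng = random.Random(seed)
    p_hats, var_hats, lengths = [], [], []
    covered = 0
    for _ in range(runs):
        # One Binomial(n, p) draw as a sum of n Bernoulli(p) trials.
        y = sum(rng.random() < p for _ in range(n))
        p_hat = y / n
        var_hat = p_hat * (1 - p_hat) / n      # estimated variance of p_hat
        half_width = z * math.sqrt(var_hat)
        lower, upper = p_hat - half_width, p_hat + half_width
        p_hats.append(p_hat)
        var_hats.append(var_hat)
        lengths.append(upper - lower)
        covered += lower <= p <= upper         # does the CI contain the true p?
    return {
        "mean_p_hat": statistics.mean(p_hats),
        "empirical_var": statistics.variance(p_hats),
        "mean_est_var": statistics.mean(var_hats),
        "coverage": covered / runs,
        "mean_length": statistics.mean(lengths),
    }

for p in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(p, simulate_binomial_ci(100, p))
```

Comparing "empirical_var" with "mean_est_var", and "coverage" with the nominal 0.95, across the five values of p is one natural starting point for the discussion.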

 

  Week 5 [group project]

1)   Constructive criticism for student presentations.

2)   Find out (you can search online) what reproducible research is and why it is important. Write a paragraph about it together with your thoughts on how it relates to data analysis. Seven sentences minimum.

3)   Continue with your data set from week 3 and keep the description. Reduce to 2 groups if you had more, either by excluding the extra group(s) or by combining groups. Identify an outcome variable that you are interested in comparing between these 2 groups. Make sure that your outcome is binary by dichotomizing it if it is not binary initially. Then do the following:

a)    State the research question of interest;

b)    Discuss why it is reasonable to assume that the observations are i.i.d. (if they are not, for example because they were collected over several years, you might want to reduce your data to include just one year);

c)    Set up the null and alternative hypotheses (introduce the random variables, distribution(s) and parameter(s) first), and use a two-sided significance level of 0.05;

d)    Find the 95% confidence intervals for p1, p2, the risk difference, risk ratio, and odds ratio;

e)    Carry out both the Chi-squared and Fisher’s exact test, and discuss their suitability to your data.
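The quantities in parts d) and e) can be computed from the four cell counts of a 2x2 table. The sketch below is one possible version, not a prescribed method: it assumes Wald-type intervals (on the log scale for the risk ratio and odds ratio), a Pearson chi-squared statistic without continuity correction, and a two-sided Fisher p-value that sums the probabilities of all tables no more likely than the observed one. The function names and the example counts are hypothetical, all four cells must be nonzero, and Python is used only for illustration (the course uses R).

```python
import math

def wald_ci(estimate, se, z=1.96, log_scale=False):
    """Return (estimate, lower, upper); log-scale CIs are exponentiated back."""
    if log_scale:
        center = math.log(estimate)
        return estimate, math.exp(center - z * se), math.exp(center + z * se)
    return estimate, estimate - z * se, estimate + z * se

def two_by_two_summary(a, b, c, d):
    """a, b = events / non-events in group 1; c, d = the same in group 2."""
    n1, n2 = a + b, c + d
    p1, p2 = a / n1, c / n2
    v1, v2 = p1 * (1 - p1) / n1, p2 * (1 - p2) / n2
    return {
        "p1": wald_ci(p1, math.sqrt(v1)),
        "p2": wald_ci(p2, math.sqrt(v2)),
        "risk_difference": wald_ci(p1 - p2, math.sqrt(v1 + v2)),
        "risk_ratio": wald_ci(p1 / p2,
                              math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2),
                              log_scale=True),
        "odds_ratio": wald_ci((a * d) / (b * c),
                              math.sqrt(1 / a + 1 / b + 1 / c + 1 / d),
                              log_scale=True),
    }

def chi_squared_test(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) and p-value, 1 df."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value from the hypergeometric distribution of cell a."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(k):
        return math.comb(row1, k) * math.comb(n - row1, col1 - k) / math.comb(n, col1)

    p_obs = prob(a)
    support = range(max(0, col1 - (n - row1)), min(row1, col1) + 1)
    # Sum over all tables that are no more likely than the observed one.
    return sum(prob(k) for k in support if prob(k) <= p_obs * (1 + 1e-7))

# Hypothetical counts: 12 of 20 events in group 1, 5 of 20 in group 2.
print(two_by_two_summary(12, 8, 5, 15))
print(chi_squared_test(12, 8, 5, 15))
print(fisher_exact_two_sided(12, 8, 5, 15))
```

When some expected cell counts are small, the chi-squared approximation becomes less reliable, which is one point to consider when discussing which test suits your data.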

 

  Week 6 [group project, due 11:59pm on Monday 2/17]

1)   Constructive criticism for student presentations.

2)   Research and write about the use of regression models in the context of a) prediction, b) causal inference on effect of a variable on the outcome.

3)   [preliminary version] With your data set (same as before or a different one that consists of i.i.d. observations suitable for multiple logistic regression with a binary outcome), keep the overall description as before. Then do the following:

a)    Describe the distribution of the outcome variable, and identify a main predictor whose effect on the outcome you are interested in studying (this can be the group variable from week 5 or a different one);

b)    Identify other variables (i.e. predictors, often called covariates) that might be related to the outcome or the main predictor, and discuss these variables in the context of part 2) of this assignment above;

c)    Carry out univariate logistic regression of the outcome on each of the predictors including the main predictor, interpret the results in terms of odds ratio etc.

d)    Fit a multiple logistic regression model that includes more than one predictor, and interpret the results in terms of conditional odds ratios etc.

 

  Week 7 [group project, due 11:59pm on Sunday 3/1]

1)   Constructive criticism for student presentations.

2)   [improved version] Continue with and polish your work in part 3) of week 6. Also do the following:

a)    Instead of 3d) from week 6, use one of the stepwise procedures we talked about, together with computing the generalized R-squared and AIC for each model considered during the process, to arrive at a ‘final’ multiple logistic regression model. Consider interaction terms also.  Interpret the results from your final model in the context of the research question that you are trying to answer.

b)    Towards the end of your report, write a paragraph discussing limitations of your data source, assumptions, approaches etc. as applicable. If the grader commented on the i.i.d. assumption in your week 5 homework, be sure to include a discussion of those comments.

 

 Week 8 [group project]

1)   Constructive criticism for student presentations.

2)   [final version] Continue with your work from the previous two weeks and take your final model from week 7:

a)    Perform prediction on the whole data set, plot the ROC curve and compute the AUC;

b)    Use a randomly chosen 90% of your observations as the training sample to fit the final model (if your data set is too small, you may reduce your final model to a smaller one this week), and use the remaining 10% as the test sample to compute the out-of-sample AUC;

c)    Now, instead of a single training/test split, carry out 10-fold CV with 10 repetitions to estimate the out-of-sample AUC;

d)    Comment on your results.
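The AUC and the repeated 10-fold CV in parts a) to c) can be organized along these lines. This is a minimal sketch under stated assumptions: the AUC is computed as the proportion of (positive, negative) pairs in which the positive case gets the higher score, with ties counting one half; within each repetition the out-of-fold predictions are pooled before computing the AUC, which is one of several conventions; and `fit_predict` is a hypothetical user-supplied function that fits the model on the training part and returns predicted scores for the test part. It is written in Python for illustration only; the course uses R.

```python
import random

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties = 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[fold::k] for fold in range(k)]

def repeated_cv_auc(x, y, fit_predict, k=10, repeats=10, seed=0):
    """Average, over repetitions, of the AUC of pooled out-of-fold predictions.

    fit_predict(train_x, train_y, test_x) -> list of scores (hypothetical callable).
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        out_of_fold = [None] * len(y)
        for test in kfold_indices(len(y), k, rng):
            held_out = set(test)
            train = [i for i in range(len(y)) if i not in held_out]
            scores = fit_predict([x[i] for i in train],
                                 [y[i] for i in train],
                                 [x[i] for i in test])
            for i, s in zip(test, scores):
                out_of_fold[i] = s             # prediction made without observation i
        aucs.append(auc(y, out_of_fold))
    return sum(aucs) / len(aucs)
```

A single 90%/10% split as in part b) corresponds to using just one of the ten folds as the test sample, which is why its AUC estimate tends to be noisier than the repeated CV estimate.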

 

 Week 10: 

Constructive criticism for each of the 4 group presentations: 1) what the data were about and what analyses were done; 2) strengths of the analyses and presentation; 3) room for improvement.

 

 

  Final Group Project (20%, due 11:59pm on Sunday 3/15): 

1)   Compile a collection of tips for best data presentation, including an illustration and the R script for each. (5%)

2)   See Canvas. (15% + 2% bonus)

 

Grading: 70% Homework (20% preliminary + 50% improved) + 10% Presentation + 20% Final Project
 Note: we will drop at least one lowest HW score before computing the final grade. Otherwise, each week’s assignment carries the same weight within the 20% and 50%, respectively.