********** Announcements ************
· Attendance is required of everyone during the final paper presentations. Presentations will be in week 10, starting on 6/3. Each group will have about 20 minutes to present a paper, plus 5 minutes for questions. Please rehearse so that you stay on time and give a polished presentation. You are welcome to discuss the paper with the instructor if you like.
Overview: Survival is often the ‘ultimate’ outcome in many critical areas of disease research, such as cancer, as well as in recently emerging medical AI. This course discusses the concepts, theories, and applications associated with censored and truncated survival data. Topics include likelihood for right-censored and left-truncated data, nonparametric estimation of survival distributions, comparison of survival distributions, proportional hazards regression, semiparametric theory, and, as time permits, extended topics on complex survival data such as competing risks.
Important Note: You are strongly encouraged to attend lectures and take notes. You are also strongly encouraged to take advantage of office hours to discuss any questions or problems that you have. Note that you can make appointments for office hours!
Lecture: MWF 2:00-2:50pm, AP&M 2402
Instructor: Ronghui (Lily) Xu
Office: APM 5856
Teaching Assistant: Denise Rava
References:
1. Cox and Oakes, Analysis of Survival Data, Chapman & Hall, 1984
2. Fleming and Harrington, Counting Processes and Survival Analysis, Wiley, 1991
3. O'Quigley, Proportional Hazards Regression, Springer, 2008
4. Kalbfleisch and Prentice, The Statistical Analysis of Failure Time Data, Wiley, 1st or 2nd ed.
Not a reference, but a fun read: Gladwell, “David and Goliath”, which tells the story of the Freireich (1963) leukemia survival data that D. R. Cox used and that we also use.
Week 1: Brief review of likelihood methods commonly used in practice; right-censored and left-truncated data; Kaplan-Meier estimate of survival.
Week 2: Log-rank test of two-sample survival; weighted log-rank tests and efficiency; counting processes.
Week 3: Parametric survival distributions; likelihood; Cox proportional hazards regression model – partial likelihood.
Week 4: Predicting survival under the Cox model; time-dependent covariates; martingale theory.
Week 5: Profile likelihood; stratified Cox model; goodness-of-fit methods.
Week 6: Case study; model selection - stepwise, explained variation, information criteria, penalized log-likelihood.
Week 7: Design of a survival study; other survival models; additive hazards model.
Week 8: Competing risks; multivariate survival; robust estimation.
Week 9: Semiparametric efficiency.
Papers:
1. [Introduction] Efron, B. and Hinkley, D.V. (1978) Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information. Biometrika, 65, 457-487.
2. Tsiatis AA. A nonidentifiability aspect of the problem of competing risks. Proceedings of the National Academy of Sciences USA, 1975; 72: 20-22.
3. Cox DR. (1969) Some sampling problems in technology. In: New Developments in Survey Sampling, Ed. Johnson and Smith. Wiley.
4. Vardi Y. Multiplicative censoring, renewal processes, deconvolution and decreasing density: Nonparametric estimation. Biometrika, 1989; 76: 751-61.
5. Tsai, Jewell and Wang, A note on the product-limit estimator under right censoring and left truncation. Biometrika, 1987; 74: 883-6.
6. Wang M-C. Nonparametric estimation of cross-sectional survival data. JASA, 1991; 86: 130-143.
7. Wang M-C. A semiparametric model for randomly truncated data. JASA, 1989; 84: 742-748.
8. Struthers and Farewell. A mixture model for time to AIDS data with left truncation and an uncertain origin. Biometrika, 1989; 76: 814-7.
9. Asgharian M, M’Lan CE, Wolfson DB. Length-biased sampling with right censoring: an unconditional approach. J Amer Stat Assoc (JASA) 2002, 97: 201-209.
10. Harrington DP, Fleming TR. A class of rank test procedures for censored survival data. Biometrika, 1982; 69(3): 553-566.
11. Reid N. A conversation with Sir David Cox. Statistical Science, 1994; 9: p449-450 (about the Cox model).
12. Thomsen and Keiding. A note on the calculation of expected survival. Statistics in Medicine, 1991; vol. 10, p. 733-738.
13. Xu R and O’Quigley J. Proportional hazards estimate of the conditional survival function. Journal of the Royal Statistical Society, Series B, 2000; vol.62, p. 667-680.
14. Xu R, Luo Y, Chambers, CD. Assessing the effect of vaccine on spontaneous abortion using time-dependent covariates Cox models. Pharmacoepidemiology and Drug Safety, 2012; 21(8): 844-50; doi: 10.1002/pds.3301.
15. O’Quigley J and Pessione F. The problem of a covariate-time qualitative interaction in a survival study. Biometrics, 1991; 47: 101-115.
16. Xu R, Adak S. Survival analysis with time-varying regression effects using a tree-based approach. Biometrics, 2002; 58: 305-315.
17. Gill R. Understanding Cox’s regression model: a martingale approach. J Amer Stat Assoc (JASA). 1984; 79: 441-447.
18. Andersen PK and Gill RD. Cox’s regression model for counting processes: a large sample theory. The Annals of Statistics, 1982; 10: 1100-1120.
19. Lin et al. Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika, 1993; vol. 80, p. 557-572.
20. Xu R, O’Quigley J. Estimating average regression effect under non-proportional hazards. Biostatistics, 2000; 1: 423-439.
21. Xu R, Harrington DP. A semiparametric estimate of treatment effects with censored data. Biometrics, 2001; 57:875-885.
22. Loftus JR and Taylor JE. A significance test for forward stepwise model selection. http://arxiv.org/pdf/1405.3920.pdf
23. Akaike H (1973). Information theory and an extension of the maximum likelihood principle. In: Breakthroughs in Statistics, 1992, vol.1, p.610-24. Springer, New York.
24. Xu, Vaida and Harrington. Using profile likelihood for semiparametric model selection with application to proportional hazards mixed models. Statistica Sinica, 2009; 19: 819-842.
25. Volinsky, CT and Raftery, AE. Bayesian information criterion for censored survival models. Biometrics, 2000; 56: 256-262.
26. Harezlak et al. Variable selection in regression – estimation, prediction, sparsity, inference. In Li and Xu (ed) ‘High-Dimensional Data Analysis in Cancer Research’. Springer, 2009. (available via elink)
27. Tibshirani, R. The lasso method for variable selection in the Cox model. Statistics in Medicine, 1997; 16(4): 385-395.
28. Huang J and Harrington D. Penalized partial likelihood regression for right-censored data with bootstrap selection of the penalty parameter. Biometrics, 2002; 58: 781-791.
29. Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. Annals of Statistics, 2002; 30(1): 74-99.
30. Bradic J, Fan J, Jiang J. Regularization for Cox's Proportional Hazards Model with NP-Dimensionality. Annals of Statistics, 2011; 39(6): 3092-3120.
31. Kent J. Information gain and a general measure of correlation. Biometrika, 1983; 70: 163-173.
32. O’Quigley J, Xu R, Stare J. Explained randomness in proportional hazards models. Statistics in Medicine, 2005; 24: 479-489.
33. Xu R, Chambers C. A sample size calculation for spontaneous abortion in observational studies. Reproductive Toxicology, 2011; 32: 490-493.
34. Gray RJ. Flexible methods for analyzing survival data using splines, with application to breast cancer prognosis. JASA, 1992; 87: 942-951.
35. Chan P, Xu R, Chambers C. A study of R-squared measure under the accelerated failure time models. Communications in Statistics – Simulation and Computation, 2018, 47(2): 380-391.
36. Struthers CA, Kalbfleisch JD. Misspecified proportional hazards models. Biometrika, 1986; 73: 363-369.
37. Lagakos SW, Schoenfeld DA. Properties of proportional-hazards score tests under misspecified regression models. Biometrics, 1984; 40: 1037-1048.
38. Chastang C, Byar D, Piantadosi S. A quantitative study of the bias in estimating the treatment effect caused by omitting a balanced covariate in survival models. Statistics in Medicine, 1988; 7: 1243-1255.
39. Murphy SA, van der Vaart AW. On profile likelihood (with discussion). JASA. 2000; 95: 449-485.
40. Maples JJ, Murphy SA, Axinn WG. Two-level proportional hazards models. Biometrics, 2002; 58: 754-763.
41. Newey WK. Semiparametric efficiency bounds. J Applied Econometrics, 1990; 5(2): 99-135.
42. Li X, Xu R. Empirical and kernel estimation of covariate distribution conditional on survival time. Computational Statistics and Data Analysis. 2006; 50(12): 3629-3643.
43. Strandberg E, Lin X, Xu R. Estimation of main effect when covariates have non-proportional hazards. Communications in Statistics – Simulation and Computation, 2014, 43(7): 1760-1770.
44. Prentice RL. On non-parametric maximum likelihood estimation of the bivariate survivor function. Statistics in Medicine, 1999; 18: 2517-2527.
45. Wei LJ, Lin DY, Weissfeld L. Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. JASA, 1989; 84: 1065-1073.
46. Morris CN. Parametric empirical Bayes inference: theory and applications (with discussion). JASA, 1983; 78: 47-65.
47. Vaida F, Xu R. Proportional hazards model with random effects. Statistics in Medicine, 2000; 19: 3309-3324.
48. Gamst A, Donohue M, Xu R. Asymptotic properties and empirical evaluation of the NPMLE in the proportional hazards mixed-effects model. Statistica Sinica, 2009; 19: 997-1011.
49. Louis TA. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society B, 1982; 44(2): 226-233.
50. Ripatti S, Palmgren J. Estimation of multivariate frailty models using penalized partial likelihood. Biometrics, 2000; 56: 1016-1022.
51. Murphy SA. Consistency in a proportional hazards model incorporating a random effect. Annals of Statistics, 1994; 22(2): 712-731.
Homework: You may discuss the problems with others, but please write up your solutions independently. Write your solutions, answers, and results in your own words and in complete sentences, clearly laying out your setup, background, etc., in the main part, and append program code at the back; all of it needs to be turned in. Any two students turning in exactly the same solutions may be considered to have committed plagiarism.
Refer ONLY to the updated lecture notes in TED (Triton Ed) for the assignments.
HW1 (due 4/22 in class):
1. a) Explain where, in the derivation of the Kaplan-Meier estimate, the assumption that C is independent of T is used;
b) Write the Kaplan-Meier estimate in counting process notation.
2. a) Simulate a sample of size 200 of T from the standard Exponential(1) distribution. Now generate C from Uniform(0, c), and choose c such that about 20% of the data are right-censored. Plot the Kaplan-Meier curve and the 95% confidence intervals.
b) Now focus on estimating S(0.5) for the above distribution. Repeat the simulation of part a) 1000 times, and summarize the bias, standard error (SE), and standard deviation (SD) of the estimates from the 1000 repeats, as well as the coverage probability (CP) of the 95% confidence intervals.
3. Do the 3 exercises on page 30 of Lecture 3 notes (see updated notes in TED).
4. Do the 3 exercises on page 34 of Lecture 3 notes.
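The simulation setup in Problem 2 can be sketched as follows. This is only an illustrative starting point, not the expected solution: the choice of Python, the function names (censoring_fraction, solve_c, simulate, kaplan_meier, km_at) and the random seed are all assumptions made for this sketch. It uses the fact that, for T ~ Exponential(1) and independent C ~ Uniform(0, c), P(C < T) = E[exp(-C)] = (1 - exp(-c)) / c, so that c can be found numerically for a 20% censoring rate. Confidence intervals (for example, via Greenwood's formula) and the 1000-repeat summary in part b) are left out.

```python
import math
import random


def censoring_fraction(c):
    """P(C < T) for T ~ Exp(1) and independent C ~ Uniform(0, c)."""
    return (1.0 - math.exp(-c)) / c


def solve_c(target=0.2, lo=0.1, hi=50.0, tol=1e-10):
    """Bisection for c; censoring_fraction is decreasing in c."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if censoring_fraction(mid) > target:
            lo = mid  # too much censoring: c must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate(n, c, rng):
    """Return a list of (observed time, event indicator) pairs."""
    data = []
    for _ in range(n):
        t = rng.expovariate(1.0)
        cens = rng.uniform(0.0, c)
        data.append((min(t, cens), 1 if t <= cens else 0))
    return data


def kaplan_meier(data):
    """Product-limit estimate as a list of (event time, S(t)) pairs."""
    data = sorted(data)
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def km_at(curve, t0):
    """Evaluate the right-continuous Kaplan-Meier step function at t0."""
    s = 1.0
    for t, surv in curve:
        if t > t0:
            break
        s = surv
    return s


rng = random.Random(2019)  # arbitrary seed, for reproducibility only
c = solve_c(0.2)
sample = simulate(200, c, rng)
curve = kaplan_meier(sample)
print("c =", c)
print("KM estimate of S(0.5):", km_at(curve, 0.5), "truth:", math.exp(-0.5))
```

For part b), one would wrap the last few lines in a loop over 1000 independent samples and compare the collected estimates of S(0.5) with the true value exp(-0.5).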
HW2 (due 5/20 in class):
1. Referring to the Lecture 7 notes, simulate a single data set with n = 100, where Z = 0 or 1 with probability 0.5 each; use beta(t) = 1.4 – 8.32t from page 18 and a constant baseline hazard of one. Fit the two models on page 16 and test the PH assumption.
2. Do the exercise on the bottom of page 20 of Lecture 7 notes.
3. Download the PBC data from http://lib.stat.cmu.edu/datasets/ and fit a Cox model with age, bilirubin, protime, albumin and edema. Use the cumulative martingale-based residuals to check 1) the proportional hazards assumption and 2) the functional form of each covariate. Compute one of the R-squared measures that we talked about.
Papers for final presentation:
# 15[6/3], 20[6/3], 21[6/5], 27[6/5], 28[6/7], 31[6/7].
Grading: 70% Homework + 30% Final presentation/project