DSC 155 & Math 182

Winter 2024, Lecture A00 (Dumitriu) MWF 3:00-3:50pm

Hidden Data in Random Matrices


Course Information

  Instructional Staff

Meetings are held on Zoom or in person, as noted.

Ioana Dumitriu, Instructor (idumitriu@ucsd.edu)
  Lecture: MWF 3-3:50pm (WLH 2205)
  Office hours: M 4-5pm (APM 5824); Thu 4-6pm (Zoom, link on Canvas)

Haixiao Wang, Teaching Assistant (h9wang@ucsd.edu)
  Discussion/Lab: Fri 5-5:50pm (A01) and 6-6:50pm (A02), APM 5402
  Office hours: combined sections, Zoom or TBA

Calendar

This is a tentative course outline and might be adjusted during the quarter.

Week 1
  Mon Jan 8:   Lecture 1; OH: Dumitriu; POSTED: HW 1
  Wed Jan 10:  Lecture 2
  Thu Jan 11:  OH: Wang; OH: Dumitriu
  Fri Jan 12:  Lecture 3; Lab/Discussion; POSTED: Lab 1

Week 2
  Mon Jan 15:  MLK Day (no instruction)
  Tue Jan 16:  DUE: HW 1
  Wed Jan 17:  Lecture 4
  Thu Jan 18:  OH: Wang; OH: Dumitriu
  Fri Jan 19:  Lecture 5; Lab/Discussion; DUE: Lab 1; POSTED: HW 2

Week 3
  Mon Jan 22:  Lecture 6; OH: Dumitriu
  Wed Jan 24:  Lecture 7
  Thu Jan 25:  OH: Wang; OH: Dumitriu
  Fri Jan 26:  Lecture 8; Lab/Discussion; DUE: HW 2; POSTED: Lab 2

Week 4
  Mon Jan 29:  Lecture 9; OH: Dumitriu
  Wed Jan 31:  Lecture 10
  Thu Feb 1:   OH: Wang; OH: Dumitriu
  Fri Feb 2:   Lecture 11; Lab/Discussion; DUE: Lab 2; POSTED: Lab 3

Week 5
  Mon Feb 5:   Lecture 12; OH: Dumitriu
  Wed Feb 7:   Lecture 13
  Thu Feb 8:   OH: Wang; OH: Dumitriu
  Fri Feb 9:   Lecture 14; Lab/Discussion; DUE: Lab 3; POSTED: HW 3

Week 6
  Mon Feb 12:  Lecture: REVIEW; OH: Dumitriu; POSTED: Final Project
  Wed Feb 14:  In-lecture Midterm
  Thu Feb 15:  OH: Wang; OH: Dumitriu
  Fri Feb 16:  Lecture 15; Lab/Discussion; DUE: HW 3; POSTED: Lab 4

Week 7
  Mon Feb 19:  Presidents' Day (no instruction)
  Wed Feb 21:  Lecture 16
  Thu Feb 22:  OH: Wang; OH: Dumitriu
  Fri Feb 23:  Lecture 17; Lab/Discussion; DUE: Lab 4; POSTED: HW 4

Week 8
  Mon Feb 26:  Lecture 18; OH: Dumitriu
  Wed Feb 28:  Lecture 19
  Thu Feb 29:  OH: Wang; OH: Dumitriu
  Fri Mar 1:   Lecture 20; Lab/Discussion; DUE: HW 4; POSTED: Lab 5

Week 9
  Mon Mar 4:   Lecture 21; OH: Dumitriu
  Wed Mar 6:   Lecture 22
  Thu Mar 7:   OH: Wang; OH: Dumitriu
  Fri Mar 8:   Lecture 23; Lab/Discussion; DUE: Lab 5; POSTED: HW 5

Week 10
  Mon Mar 11:  Lecture 24; OH: Dumitriu
  Wed Mar 13:  Lecture 25
  Thu Mar 14:  OH: Wang; OH: Dumitriu
  Fri Mar 15:  Lecture: REVIEW; DUE: Final Project; DUE: HW 5

Week 11
  Wed Mar 20:  FINAL EXAM




Lecture Notes


All lecture notes, both "before" and "after" versions for each lecture, will be provided on Canvas. The recorded videos will also be available on Canvas, both the Zoom recordings and the podcast recordings that will be produced once we are back in person.

In addition, Professor Todd Kemp has kindly agreed to allow us to use his notes from the previous iteration of the course. You may find a PDF with his notes on the Canvas "Home" website for the course. While these notes are not required, you might find them helpful.



Syllabus


Please make sure you read the entire website.

DSC 155 & MATH 182 is a one-quarter topics course on the theory of Principal Component Analysis (PCA), a tool from statistics and data analysis that is extremely widespread and effective for understanding large, high-dimensional data sets. PCA projects a high-dimensional data set into a lower-dimensional affine subspace (of chosen dimension) that best fits the data in the least-squares sense, or equivalently maximizes the preserved variance of the original data. It is computationally efficient (relying on effective numerical methods for the singular value decomposition of matrices), and so is often a first stop for advanced analytics of big data sets. The algorithm produces a canonical basis for the projected subspace; these vectors are called the principal components. The question of choosing the dimension for the projection is subtle. In virtually all real-world applications, a "cut-off phenomenon" occurs: a small number (usually 2 or 3) of the singular values of the sample covariance matrix for the data account for a large majority of the variance in the data set.
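To make the algorithm concrete, here is a minimal NumPy sketch of PCA via the singular value decomposition, in the spirit of the description above. The data matrix, the choice d = 2, and all variable names are illustrative inventions for this sketch, not code from the course labs.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))   # illustrative data: 200 samples, 10 features

    # Center the data: PCA projects onto an affine subspace through the sample mean.
    Xc = X - X.mean(axis=0)

    # SVD of the centered data matrix; the rows of Vt are the principal components.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    d = 2                                # chosen projection dimension
    components = Vt[:d]                  # canonical basis for the best-fit subspace
    projected = Xc @ components.T        # coordinates of the data in the top-d components

    # Fraction of the total variance preserved by the d-dimensional projection.
    preserved = (s[:d] ** 2).sum() / (s ** 2).sum()
    print(f"variance preserved by d={d} components: {preserved:.2%}")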

A full (and, honestly, still developing) theoretical understanding of this "cut-off phenomenon" has only arisen in the last 15 years, using the tools of random matrix theory. The result, known as the BBP transition (named after Jinho Baik, Gérard Ben Arous, and Sandrine Péché, who discovered it in 2005), explains the phenomenon through the analysis of outlier singular values in low-rank perturbations of random covariance matrices. It uses a simple model (standard in signal processing and information theory) of a signal in a noisy channel to tease out exactly when an outlier will appear in a PCA analysis of noisy data, and it further predicts subtler effects that new statistical methods are being developed to correct, yielding even more accurate data analysis.
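As a rough, unofficial illustration of this signal-plus-noise picture, here is a small simulation sketch of a rank-one spiked covariance model. The matrix sizes and spike strengths are arbitrary choices for this sketch, and the threshold quoted in the comments assumes the standard normalization for this model.

    import numpy as np

    rng = np.random.default_rng(1)
    p, n = 400, 1600                     # dimension and sample size; gamma = p/n = 0.25
    gamma = p / n

    def top_eigenvalue(theta):
        # Top sample-covariance eigenvalue under a rank-one spike of strength theta.
        X = rng.standard_normal((p, n))  # pure noise: columns ~ N(0, I)
        X[0] *= np.sqrt(1 + theta)       # plant the signal: variance 1 + theta along e_1
        S = X @ X.T / n                  # sample covariance matrix
        return np.linalg.eigvalsh(S)[-1]

    bulk_edge = (1 + np.sqrt(gamma)) ** 2    # right edge of the Marchenko-Pastur bulk
    # BBP prediction: an outlier separates from the bulk exactly when
    # theta exceeds sqrt(gamma), i.e. 0.5 for these sizes.
    for theta in (0.1, 0.5, 1.0, 2.0):
        print(f"theta={theta}: top eig = {top_eigenvalue(theta):.3f}, edge = {bulk_edge:.3f}")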

The goal of this course is to present and understand the PCA algorithm, and then analyze it to understand how (and when) it works. Time permitting, we will then apply these ideas to some current problems of interest in data science and computer science.

There is no textbook that covers this material. This is an experimental course designed by Professor Todd Kemp, whose ideas and materials we will rely on (I have posted a PDF of his lecture notes for this course on Canvas). Here is a tentative list of topics we intend to cover, in the order they will be covered. (Depending on time, we may have to skip some of the more advanced topics at the end.)
  • Introductory material:
    • Review of relevant linear algebra concepts (linear equations; rank and nullity; orthogonal rotations and projections; eigenvalues and eigenvectors), and introduction to the singular value decomposition.
    • Review of relevant topics from probability (distributions and densities; random vectors, covariance, and correlation).
    • Introduction to basic statistical concepts: unbiased estimators, maximum likelihood estimation, sample covariance, (linear) least squares and regression.
  • Principal Component Analysis (PCA):
    • as best approximating d-dimensional affine subspace projection
    • as highest variance d-dimensional affine subspace projection
    • equivalence of these two characterizations
    • finding the principal components: singular value decomposition
    • complexity and correctness
  • Noise without signal: spectral statistics of totally random data sets:
    • Method of Moments
    • Wigner's semicircle law
    • Marchenko-Pastur laws for Gaussian rectangular matrices (see the simulation sketch after this list)
    • Universality of eigenvalue statistics
  • Outlier singular values:
    • Robustness of the bulk singular values
    • Spiked covariance models
    • The BBP transition
  • Application of the BBP transition to understand the generic behavior of PCA
  • Introduction to community detection: the stochastic block model.
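As promised in the topic list above, here is a quick illustrative simulation of "noise without signal": the eigenvalues of a pure-noise sample covariance matrix fill out the bulk predicted by the Marchenko-Pastur law. The matrix sizes are arbitrary choices for this sketch.

    import numpy as np

    rng = np.random.default_rng(2)
    p, n = 500, 2000                     # illustrative sizes; gamma = p/n = 0.25
    gamma = p / n

    X = rng.standard_normal((p, n))      # i.i.d. Gaussian noise, no signal at all
    eigs = np.linalg.eigvalsh(X @ X.T / n)

    # Marchenko-Pastur: the bulk fills [(1 - sqrt(gamma))^2, (1 + sqrt(gamma))^2].
    lo, hi = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
    print(f"observed eigenvalue range:   [{eigs.min():.3f}, {eigs.max():.3f}]")
    print(f"Marchenko-Pastur prediction: [{lo:.3f}, {hi:.3f}]")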

Prerequisites:  The prerequisites are Linear Algebra (MATH 18) and Probability Theory (MATH 180A). MATH 109 (Mathematical Reasoning) is also strongly recommended as a prerequisite or co-requisite. Also: MATH 102 (Applied Linear Algebra) would be beneficial, but is not required. For the lab component of the course, some familiarity with Python and MATLAB is helpful, but not required.

Lecture:  Attending the lecture is a fundamental part of the course; you are responsible for material presented in the lecture whether or not it is discussed in the notes. You should expect questions on the homework and exams that will test your understanding of concepts discussed in the lecture.

Homework:  Homework assignments will be posted on Canvas on the dates indicated in the calendar and will be due at 11:59pm on the indicated due date. You must turn in your homework through Gradescope; if you have written it on paper, you can scan it or simply take clear photos of it to upload. You are allowed (and even encouraged!) to discuss homework problems with your classmates, your instructor, and your TA, but your final write-up of your homework solutions must be your own work. If you collaborate on the homework, you must list the people you collaborated with in your write-up.

Labs:  The data science labs are accessible through DataHub. The turn-in components should be exported as PDF files and submitted through Gradescope; they are due at 11:59pm on the dates indicated on the labs. MAKE SURE YOU ATTEND THE FIRST LAB/DISCUSSION SESSION.

Lab Project:  You will choose a real-world high-dimensional data set and implement the PCA algorithm to analyze it. You will use the tools explored in this class to give a careful analysis of how the PCA algorithm performed, what it discovered about the data, and what structural shortcomings were evident in the analysis. Topics and data sets must be approved by the instructor. The project will be due on Gradescope at 11:59pm on March 15.

Midterm Exam:  There will be a single midterm exam, administered in class on Wednesday, February 14 (contrary to what WebReg might say).

Final Exam:  The final exam will be held on Wednesday, March 20, from 3:00-5:59pm, location TBA. It is your responsibility to ensure that you do not have a schedule conflict involving the final examination; you should not enroll in this class if you cannot take the final examination at its scheduled time.

Administrative Links:    Here are two links regarding UC San Diego policies on exams:

Regrade Policy:  

Grading: Your cumulative average will be the best of the following three weighted averages:

Your course grade will be determined by your cumulative average at the end of the quarter. You will need roughly 90% to get an A- or above, roughly 80% to get a B- or above, and roughly 60% to get a C- or above. These thresholds are guaranteed, meaning that you will not get a worse grade than specified above, and we might lower them a bit depending on the difficulty of the exams. However, to get a C- or a P, you should really aim for more than 60%.

Etiquette

Here are a few of my expectations for etiquette (updated for the era of remote teaching). Naturally, if and when we go back to in-person instruction, these expectations will be modified accordingly.

Accommodations:

Students requesting accommodations for this course due to a disability must provide a current Authorization for Accommodation (AFA) letter issued by the Office for Students with Disabilities (OSD) which is located in University Center 202 behind Center Hall. The AFA letter may be issued by the OSD electronically or in hard-copy; in either case, please make arrangements to discuss your accommodations with me in advance (by the end of Week 2). We will make every effort to arrange for whatever accommodations are stipulated by the OSD. For more information, see here.



Academic Integrity Policies


UC San Diego's code of academic integrity outlines the expected academic honesty of all students and faculty, and details the consequences for academic dishonesty. The main issues are cheating and plagiarism, of course, for which we have a zero-tolerance policy. (Penalties for these offenses always include assignment of a failing grade in the course, and usually involve an administrative penalty, such as suspension or expulsion, as well.) However, academic integrity also includes things like giving credit where credit is due (listing your collaborators on homework assignments, noting books or papers containing information you used in solutions, etc.), and treating your peers respectfully in class.

To ensure that Academic Integrity is upheld during any and all exams (quizzes, midterm, and final), at the beginning of each exam you will be signing a pledge that you will submit alongside your solutions, whereby you will agree to obey all the rules that have been outlined. In addition, you will agree to the following: In case the instructor suspects academic misconduct in completing the exam, you will be invited to defend your solutions during a short Zoom session with the instructor and/or the TA. If you decline, or if you accept but fail to defend your solution(s) to the instructor’s or the TA’s satisfaction, the instructor will refer your case to the Academic Integrity Office for an investigation. If you accept and defend your solution(s) satisfactorily, the case will be closed and you will receive a grade for the exam and for the class.



About Gradescope


We will be using Gradescope for the grading of both homework and exams.
  • Your login is your university email, and your password can be changed here. The same link can be used if you need to set your password for the first time.
  • Please make sure your files are legible before submitting.
  • Most word processors can save files as a PDF.
  • There are many tools to combine PDFs, such as here, and others for turning JPGs into PDFs, such as here.
  • If you have not yet been added to the course, the Gradescope entry code is GPJ5XZ. Use your UCSD email!