DSC 155 & MATH 182

"Winter" 2020, Lecture A00 (Kemp) MWF 1:00-1:50pm

Hidden Data in Random Matrices

Announcements

Course Information

Instructional Staff

Name         Role                Office    E-mail
Todd Kemp    Instructor          APM 5202  tkemp@ucsd.edu
Denise Rava  Teaching Assistant  APM 2220  drava@ucsd.edu

Our office hours can be found in the following calendar.

Calendar



Class Meetings

Meeting                 Date                       Time              Location
Lecture A00 (Kemp)      Monday, Wednesday, Friday  1:00pm - 1:50pm   APM B402A
Lab A01 (Rava)          Wednesday                  5:00pm - 5:50pm   APM B432
Lab A02 (Rava)          Wednesday                  6:00pm - 6:50pm   APM B432
Take-Home Midterm Exam  Monday, Feb 10             2:00pm            (Home)
Take-Home Final Exam    Friday, Mar 20             11:30am - 2:29pm  (Home)



Lecture Notes


Here are lecture notes on topics not covered in any of the recommended preparatory textbooks.

182.Notes.pdf (last updated March 5, 2020).

Here are my lecture notes on Random Matrix Theory. They are intended for a reader who has taken graduate courses in (measure-theoretic) probability theory, complex analysis, and real analysis, and who has some familiarity with (enumerative) combinatorics. Section 4.1 contains material related to the discussion in Section 2.4 of our current course notes.

RMT.Notes.pdf

Lecture Slides


The lectures are typically given via tablet, on notes/slides with some information prepared before lecture and some filled in during lecture. Below, you will find the before and after slides for each lecture (as they are produced).

Lecture  Before  After
1 182-Lec1-Before.pdf 182-Lec1-After.pdf
2 182-Lec2-Before.pdf 182-Lec2-After.pdf
3 182-Lec3-Before.pdf 182-Lec3-After.pdf
4 182-Lec4-Before.pdf 182-Lec4-After.pdf
5 182-Lec5-Before.pdf 182-Lec5-After.pdf
6 182-Lec6-Before.pdf 182-Lec6-After.pdf
7 182-Lec7-Before.pdf 182-Lec7-After.pdf
8 182-Lec8-Before.pdf 182-Lec8-After.pdf
9 182-Lec9-Before.pdf 182-Lec9-After.pdf
10 182-Lec10-Before.pdf 182-Lec10-After.pdf
11 182-Lec11-Before.pdf 182-Lec11-After.pdf
12 182-Lec12-Before.pdf 182-Lec12-After.pdf
13 Not available Not available
14 182-Lec14-Before.pdf 182-Lec14-After.pdf
15 182-Lec15-Before.pdf 182-Lec15-After.pdf
16 182-Lec16-Before.pdf 182-Lec16-After.pdf
17 182-Lec17-Before.pdf 182-Lec17-After.pdf
18 182-Lec18-Before.pdf 182-Lec18-After.pdf
19 Not available Not available
20 Not available Not available
21 Not available 182-Lec21-After.pdf
22 Not available 182-Lec22-After.pdf
23 Not available 182-Lec23-After.pdf
24 Not available 182-Lec24-After.pdf
25 Not available 182-Lec25-After.pdf
26 182-Lec26-Before.pdf 182-Lec26-After.pdf
27 Not available Not available
28 Not available Interactive EVT Video here.

Syllabus


DSC 155 & MATH 182 is a one-quarter topics course on the theory of Principal Component Analysis (PCA), a tool from statistics and data analysis that is extremely widespread and effective for understanding large, high-dimensional data sets. PCA is a method of projecting a high-dimensional data set into a lower-dimensional affine subspace (of chosen dimension) that best fits the data (in the least-squares sense), or, equivalently, that maximizes the preserved variance of the original data. It is a computationally efficient algorithm (utilizing effective numerical methods for the singular value decomposition of matrices), and so is often a first stop for advanced analytics of big data sets. The algorithm produces a canonical basis for the projected subspace; these vectors are called the principal components. The question of choosing the dimension for the projection is subtle. In virtually all real-world applications, a "cut-off phenomenon" occurs: a small number (usually 2 or 3) of the singular values of the sample covariance matrix for the data account for a large majority of the variance in the data set.

A full (and, honestly, still developing) theoretical understanding of this "cut-off phenomenon" has only arisen in the last 15 years, using the tools of random matrix theory. The result, known as the BBP Transition (named after Jinho Baik, Gerard Ben Arous, and Sandrine Peche, who discovered it in 2005), explains the phenomenon through the analysis of outlier singular values in low-rank perturbations of random covariance matrices. It uses a simple model (standard in signal processing and information theory) of a signal in a noisy channel to tease out exactly when an outlier will appear in a PCA analysis of noisy data, and it further predicts more subtle effects, which newer statistical methods are being developed to correct, yielding even more accurate data analysis.

The goal of this course is to present and understand the PCA algorithm, and then analyze it to understand how (and when) it works. Time permitting, we will then apply these ideas to some current problems of interest in data science and computer science, such as community detection in large (random) networks.

There is no textbook that covers this material, so we will sample from different sources (and prepare instructor lecture notes along the way). Because this is a brand new experimental course, which has never been taught at UCSD (or anywhere else), it is hard to say in advance what the schedule of topics will be. Instead, here is a point-form list of topics we intend to cover, in the order they will be covered.


Prerequisite:  The prerequisites are Linear Algebra (MATH 18) and Probability Theory (MATH 180A). MATH 109 (Mathematical Reasoning) is also strongly recommended as a prerequisite or corequisite. Also: MATH 102 (Applied Linear Algebra) would be beneficial, but is not required. For the lab component of the course, some familiarity with Python and MATLAB is helpful, but not required.

Lecture:  Attending the lecture is a fundamental part of the course; you are responsible for material presented in the lecture whether or not it is discussed in the notes. You should expect questions on the homework and exams that will test your understanding of concepts discussed in the lecture.

Homework:  Homework assignments are posted below, and will be due at 11:59pm on the indicated due date.  You must turn in your homework through Gradescope; if you have produced it on paper, you can scan it or simply take clear photos of it to upload. You are allowed, and even encouraged, to discuss homework problems with your classmates, your instructor, and your TA, but your final write-up of your homework solutions must be your own work.

Labs:  The data science labs are accessible through DataHub. The turn-in components should be exported as pdf files and turned in through Gradescope; they are due at 11:59pm on the dates indicated on the labs.

Lab Project:  You will choose a real-world high-dimensional data set, and implement the PCA algorithm to analyze it. You will use the tools explored in this class to give a careful analysis of how the PCA algorithm performed, what it discovered about the data, and what structural shortcomings were evident in the analysis. Topics and data sets are to be approved by the instructor.

Take-Home Midterm Exam:  There will be a single take-home midterm exam, available immediately after the lecture on Monday, February 10, and due the following day before midnight. You are free to use any paper or online resources you like during the exam, but collaboration with other people is not allowed. This will be enforced by the honor system; be warned, we will grade carefully, looking for evidence of collaboration on the exam, and any suspicious cases will be reported as academic integrity violations (with likely severe penalties).

Final Exam:  The final examination will be held at the date and time stated above.

Administrative Links:    Here are two links regarding UC San Diego policies on exams:

Regrade Policy:  

Grading: Your cumulative average will be determined by whichever of the following four weighted averages is higher (for you):

Your course grade will be determined by your cumulative average at the end of the quarter, and will be based on the following scale:

A+  A   A-  B+  B   B-  C+  C   C-
97  93  90  87  83  80  77  73  70

The above scale is guaranteed: for example, if your cumulative average is 80, your final grade will be at least B-. However, your instructor may adjust the above scale to be more generous.

Academic Integrity:  UC San Diego's code of academic integrity outlines the expected academic honesty of all students and faculty, and details the consequences for academic dishonesty. The main issues are cheating and plagiarism, of course, for which we have a zero-tolerance policy. (Penalties for these offenses always include assignment of a failing grade in the course, and usually involve an administrative penalty, such as suspension or expulsion, as well.) However, academic integrity also includes things like giving credit where credit is due (listing your collaborators on homework assignments, noting books or papers containing information you used in solutions, etc.), and treating your peers respectfully in class. In addition, here are a few of our expectations for etiquette in and out of class.

Accommodations:

Students requesting accommodations for this course due to a disability must provide a current Authorization for Accommodation (AFA) letter issued by the Office for Students with Disabilities (OSD), which is located in University Center 202 behind Center Hall. The AFA letter may be issued by the OSD electronically or in hard copy; in either case, please make arrangements to discuss your accommodations with me in advance (by the end of Week 2). We will make every effort to arrange for whatever accommodations are stipulated by the OSD. For more information, see here.



Homework


Weekly homework assignments are posted here. Homework is due by 11:59pm on the posted date, through Gradescope. Late homework will not be accepted.

Due Date  Homework  Solutions
01/17/20 182.HW1.pdf 182.HW1.Solutions.pdf
01/31/20 182.HW2.pdf 182.HW2.Solutions.pdf
03/06/20 182.HW3.pdf 182.HW3.Solutions.pdf
03/16/20 182.HW4.pdf 182.HW4.Solutions.pdf