CSEE4121 - Computer Systems for Data Science
Spring ’20, Columbia University
Data scientists and engineers increasingly have access to a broad range of powerful systems for conducting big data analysis and machine learning at scale, from databases and large-scale analytics engines to distributed machine learning frameworks. The goal of this class is to give data scientists and engineers who work with big data a better understanding of how the systems they will be using are built, and of the real-world performance, availability and scalability challenges that arise when using and deploying these systems at scale. The course covers foundational ideas in designing these systems, while focusing on specific popular systems that students are likely to encounter at work or when doing research. The class will include written homework and programming assignments, some of which will be done in groups.
In this course we will answer the following questions:
- How are popular big data systems designed and architected?
- How should we think about the performance, scale and reliability of big data systems?
- How do they remain available and not lose data despite frequent server and hardware failures?
Office Hours: Mondays 2:30-3:30 PM (by appointment only)
Location and Time
501 Northwest Corner Building.
Mondays 4:10pm - 6:40pm
(All Office Hours Held in the CS TA Room, Mudd 1st Floor)
Hongyi Wang – Mon 10am - 12pm
Qianrui Zhang – Tue 9am - 11am
Yujian Wu – Wed 2pm - 4pm
Junlin Song – Thu 12pm - 2pm
Ke Li – Thu 3pm - 5pm
Mingen Pan – Fri 4pm - 6pm
Students are expected to have solid programming experience in Python or an equivalent programming language. This class is intended to be accessible to data scientists who do not necessarily have a background in databases, operating systems or distributed systems.
Schedule (this is a work in progress, and is likely to change)
|Jan 27||Introduction [pptx][pdf]|
|Feb 3||Infrastructure for Big Data [pptx] [pdf] Relational Model Part I [pptx] [pdf]||Programming Homework 1 [TA-Solution]|
|Rescheduled: Feb 14 (Friday), 1:30-3:40 PM, 501 SCH (Schermerhorn)||SQL and Relational Model [pptx] [pdf]|
|Rescheduled: Feb 17, 8:00 - 10:10 AM, 501 NWC||Transactions [pptx] [pdf]|
|Feb 24||Transactions [pptx] [pdf] Storage and Memory Hierarchy [pptx] [pdf]||Written Homework [TA-Solution]|
|Mar 2||DB Architecture [pptx] [pdf]|
|Mar 16||Spring Break|
|Mar 30||Key-value Stores and Single DB architecture [pptx] [pdf] Partitioning [pptx] [pdf]||Programming Homework 2|
|Apr 6||Distributed File Systems and Transactions [pptx] [pdf]|
|Apr 13||MapReduce and Spark [pptx] [pdf]|
|Apr 20||Spark (continued) and Streaming [pptx] [pdf] Caching [pptx] [pdf]|
|Apr 27||Systems for Machine Learning [pptx] [pdf]||Programming Homework 3|
|May 4||Security and Privacy [pptx] [pdf]|
20% Programming Homework 1
10% Written Homework
20% Programming Homework 2
10% Programming Homework 3
Programming assignment 1 and the written assignment must be completed individually. Programming assignments 2-3 will be done in pairs. You may not copy answers or code. We will enforce this policy when checking the assignments (we use a code similarity system).