Data Engineering Syllabus Spring 2024 (3)
Spring 2024
Class Schedule
● Classroom: Mudd 303
● Meeting time: Mondays and Wednesdays 2:40-3:55 PM
Course Staff
Instructor
● Yi Zhang
○ Email: yz3558@columbia.edu
○ Office: Mudd 340
○ Office Hours: Fridays 2:45-4:15 PM
Course Assistant
● Sahil Bhave
○ Email: sb4865@columbia.edu
○ Office Hours
■ Time: 11 AM to 12:30 PM
■ Location: Table 3 Mudd 301
● Chris Lee
○ Email: csl2183@columbia.edu
○ Office Hours
■ Time: 11 AM to 12:30 PM
■ Location: Table 3 Mudd 301
Course Description
This comprehensive course is designed to equip students with essential knowledge and skills in
effectively working with data. Students will learn how data is organized, stored, and managed in
both programming applications and large-scale databases. This course serves as a crucial
stepping stone for students interested in technology-driven fields that deal with information
handling and data analytics.
Learning Objectives
● Understand fundamental data building blocks, such as lists, tuples, arrays, linked lists,
stacks, queues, deques, priority queues, dictionaries, sets, trees, and graphs, and their
applications.
● Implement and analyze algorithms using data structures for efficient data manipulation
and for solving Operations Research problems.
● Comprehend the principles of database design and develop SQL/NoSQL skills to
retrieve, update, and manage data in databases.
● Perform basic data manipulation and wrangling using Pandas.
Course Materials
Course website
We will use EdStem to post lecture materials and host the discussion board. Please check
the course website periodically for updates.
All questions should be posted on the EdStem discussion board, which we will use for Q&A.
The goal is to create a collaborative space for learning and communicating. Please do not
share solutions in public posts. For personal matters or content that might contain a
solution, please mark your post as "private" so that your peers will not see it.
Textbook
There is no required textbook. All materials will be posted on EdStem. The following two
textbooks might be useful if you are interested in in-depth reading on certain topics.
● Goodrich, Michael T., Roberto Tamassia, and Michael H. Goldwasser. Data structures and
algorithms in Python. John Wiley & Sons Ltd, 2013.
● Silberschatz, Abraham, Henry F. Korth, and Shashank Sudarshan. Database system concepts.
McGraw Hill, 2019.
Software
We will use Python for this course. We expect you have finished ENGI E1006 or have
proficiency in Python, such as knowing how to
Assessments
Assessment of learning objectives will consist of six regular problem sets, a midterm exam, and
a final exam. All assessments will contain both theoretical problems and application
problems in Python. The problems aim to help students practice and assess their skills in
applying methods learned in class to work with various data components and solve real-world
problems.
Grading Policy
Homework (35%), Midterm (20%), Final (40%), Class Participation (5%)
Homework
We will have six homework assignments in total. The lowest assignment grade will be dropped.
The questions appearing on the homework focus on the application component of the course.
You can collaborate on the homework assignments. However, you MUST finish the write-up
independently. You cannot share questions with or solicit help from people not attending this
course.
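The grading arithmetic described above (component weights plus one dropped homework) can be sketched in Python. This is only an illustration, not the official grade calculation: it assumes every homework counts equally, that all scores are on a 0-100 scale, and it ignores the curve used for letter grades. The function names `homework_average` and `weighted_total` are hypothetical.

```python
from typing import Sequence

# Weights taken from the Grading Policy line of this syllabus.
WEIGHTS = {
    "homework": 0.35,
    "midterm": 0.20,
    "final": 0.40,
    "participation": 0.05,
}


def homework_average(scores: Sequence[float]) -> float:
    """Average the homework scores after dropping the single lowest one.

    Assumption: each assignment counts equally and is scored out of 100.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop the lowest")
    kept = sorted(scores)[1:]  # discard the lowest score
    return sum(kept) / len(kept)


def weighted_total(
    homework: Sequence[float],
    midterm: float,
    final: float,
    participation: float,
) -> float:
    """Combine the four components into a raw score out of 100 (before any curve)."""
    components = {
        "homework": homework_average(homework),
        "midterm": midterm,
        "final": final,
        "participation": participation,
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())


if __name__ == "__main__":
    # Made-up scores for illustration only.
    print(weighted_total([95, 88, 70, 100, 92, 85], midterm=84, final=90, participation=100))
```

A missed assignment entered as 0 is the score that gets dropped, so under these assumptions one missed homework does not lower the homework average.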
Exams
The exams will be computer-based. They will be open-book and open-notes. You are required to
finish the questions independently. AI tools are forbidden during the exams.
Class Participation
Students are expected to attend the lectures and contribute to an active learning environment.
Ways to increase your class participation include but are not limited to:
Letter Grade
Letter grades will be assigned on a curve. When assigning letter grades, we will
consider your standing among your peers and the overall class performance.
Assignment Policies
Late Policy
All homework assignments are due at midnight EST. I will give each person a leeway of
1 hour for each assignment; submissions after the leeway will receive a 0. In addition,
you can submit up to 2 homework assignments up to 24 hours late. No permission is needed,
and no questions will be asked.
Re-grading Policy
For the re-grading of homework, please leave a private post on the discussion board on EdStem
within seven days of receiving your grades. Since I will post the solution to each homework
assignment, you are expected to compare the solution with your own write-up before sending
the request. In your request, you should explain the reasoning for any suspected mistakes in
grading.
Tentative Deadlines
Assignments are released 7 days before the deadline.
For the exams, you need to finish the questions by yourself. No discussion or collaboration is
allowed. Your work cannot be copied from another person or any other source. Submissions
that are identical or nearly identical, either among peers or to another source,
will be regarded as cheating. The sanctions may range up to the termination of your enrollment
at Columbia University. All suspected incidents will be recorded with SEAS administration at the
same time the student is notified.
Potential Topics
● Data Building Blocks
○ Numeric, Boolean, and String
○ Lists, Tuples, and Arrays
○ Stacks, Queues, Priority Queues, Deques, and Linked Lists
○ Maps, Hash Tables, Dictionaries, and Sets
○ Trees
○ Graphs
○ Search and Sort
● Data Repository
○ Relational Database Management System
○ SQL
○ NoSQL
● Data Collection
○ Web Scraping using XML, HTML, and JSON
● Data Manipulation with Pandas
Additional Support
Studying at Columbia University can be competitive and stressful. We are here to make sure
everyone stays healthy physically and mentally. If you need any help with your work or life,
please do not hesitate to approach us. We are always here to help. Columbia Counseling and
Psychological Services is also a good option for anonymous consultation.