CX4240: Introduction to Computational Data Analysis

Logistics

  • Lecture time: Mons and Weds, 4:30pm-5:45pm
  • Location: Klaus 2456
  • Instructor: Chao Zhang
  • Teaching Assistants: Yue Yu (yyu414@gatech.edu) and Rui Feng (rfeng44@gatech.edu)
  • Office Hours:
    • Instructor: Weds 3:30-4:20pm @ CODA 1309
    • TAs: Mons 3:30-4:20pm @ open space outside Klaus 1325
  • Piazza: https://piazza.com/class/k51ginmjc0446
    • Piazza will be the main place for course discussions and announcements. If you have questions, please ask them on Piazza first because 1) other students may have the same question; 2) you will get help faster than by email.
    • If there is something you would rather not discuss publicly on Piazza, send an email with CX4240 in the subject line.

Course Content

Q: What will be covered in this course? A: This course will introduce techniques for computational data analysis, with an emphasis on machine learning algorithms and their applications to real-world data. On the technique side, we will cover key supervised methods (linear regression, logistic regression, neural networks, tree-based models) and unsupervised methods (k-means, Gaussian mixture models, expectation-maximization, dimension reduction). On the application side, we will introduce various applications of these techniques, particularly in text data analysis and natural language processing. We will show how to formulate real-world tasks as data analysis problems, present key methods for solving them, and discuss the advantages and disadvantages of those methods.

Q: Who will benefit from this course? A: The learning objective is that, by the end of this course, students will be able to formulate the data analysis problems at hand, choose appropriate computational models to acquire insights from data automatically, and even come up with innovative solutions to open problems in this field. The course will be helpful for students who want to solve practical problems using machine learning and data science techniques. It will also be helpful for students who want to do cutting-edge research in data mining, machine learning, natural language processing, and related fields.

Q: What are the prerequisites? A: This course is demanding in both math and programming. As prerequisites, you are expected to have 1) solid knowledge of probability, statistics, and linear algebra; 2) basic knowledge of machine learning; 3) solid programming skills, preferably in Python.

Schedule

Date | Topic | Notes
1/6/20 | Course Overview | Piazza Signup
1/8/20 | Math Basics I |
1/13/20 | Math Basics II |
1/15/20 | Data Analysis Toolbox |
1/20/20 | No Class (Martin Luther King Day) |
1/22/20 | Linear Regression | HW1 Out
1/27/20 | Linear Regression |
1/29/20 | Example Projects | Start Project Team Formation
2/3/20 | Naïve Bayes | HW1 Due
2/5/20 | Logistic Regression | HW2 Out
2/10/20 | Support Vector Machine |
2/12/20 | Neural Networks |
2/17/20 | Neural Networks |
2/19/20 | Tree-Based Models | HW2 Due
2/24/20 | Tree-Based Models | HW3 Out
2/26/20 | Midterm Review | Project Proposal Due
3/2/20 | Midterm Exam |
3/4/20 | Clustering Analysis and K-Means |
3/9/20 | Hierarchical Clustering | HW3 Due
3/11/20 | Gaussian Mixture Model |
3/16/20 | No Class (Spring Break) |
3/18/20 | No Class (Spring Break) |
3/23/20 | Dimension Reduction | HW4 Out
3/25/20 | Application: Text Embedding |
3/30/20 | Application: Text Classification |
4/1/20 | Project Presentation |
4/6/20 | Project Presentation |
4/8/20 | Project Presentation | Project Video Due
4/13/20 | Project Presentation | HW4 Due
4/15/20 | Review Class | Peer Grading Due
4/20/20 | No Class | Final Exam Out
4/22/20 | No Class, Reading Day |
4/24/20 | | Final Exam Due
4/26/20 | | Project Report Due

Grading

Homework (40%)

  • There will be four assignments. Each is designed to test your understanding of the algorithms taught in class and may be either a programming task or a written analysis.
  • All students are expected to follow the Georgia Tech Academic Honor Code.
  • All assignments follow the "no-late" policy. Assignments received after the due time will receive zero credit.

Project (20%)

  • You are expected to complete a project on computational data analysis. You can choose from one of the two options below for your project:
    • Tackle a real-life data analysis task using computational data analysis. You need to be clear about the data you are using, the problem you are attempting to solve, the method you are using, and the results and conclusions you reach.
    • Conduct a survey on a specific research topic. You need to select a research topic highly related to this course and read state-of-the-art research papers comprehensively. You need to organize existing techniques into different categories, introduce their key ideas, and critically comment on their advantages and disadvantages.
  • You will need to turn in a project report and also give an in-class presentation for your project. The project report and the presentation will each count for 10% of your final grade.
  • Each project needs to be completed in a team of 2-4 people. Team members need to clearly state their individual contributions in the project report.

Class participation (15%)

  • This class will have many in-class activities. Your class participation score will be based mainly on attendance and on your performance in those activities.
  • Participation in class discussions (including asking relevant questions in class and volunteering to answer questions on Piazza) can earn you a bonus toward your final grade. It will be especially useful when you are right on the boundary between two letter grades.

Midterm Exam (10%)

  • The midterm exam will take place on March 2 in lieu of the regular class.
  • The midterm exam will be a written, open-book exam, but no computer or Internet usage will be allowed.
  • There will be no make-up exams. You will get zero credit for your missed midterm exam.

Final Exam (15%)

  • The final exam will be on April 20 in lieu of the regular class.
  • The final exam will be a written and open-book exam, but no Internet usage will be allowed.
  • Again, there will be no make-up exams. You will get zero credit for your missed final exam.

Resources

Other resources, such as machine learning toolboxes and datasets, will be provided throughout the course.