CX4240: Introduction to Computational Data Analysis (2022 Spring)

Logistics

  • Lecture time: Mons and Weds, 3:30pm-4:45pm
  • Location: J. Erskine Love Manufacturing 185 (also streamed at bluejeans.com/5341114422)
  • Instructor: Chao Zhang
  • Teaching Assistants: Brad Baker <bbradt@gatech.edu> and Binghong Chen <binghong@gatech.edu>
  • Office Hours:
    • Instructor: Mons 2-3pm @ bluejeans.com/5341114422
    • TA Office Hour: Weds 2:30-3:30pm @ bluejeans.com/547687219/6944
  • Piazza: https://piazza.com/gatech/spring2022/cx4240

Course Content

Q: What will be covered in this course? A: This course introduces techniques for computational data analysis, with an emphasis on machine learning algorithms and their applications to real-world data. On the technique side, we will cover key supervised machine learning methods (linear regression, logistic regression, neural networks, tree-based models) and unsupervised methods (k-means, Gaussian mixture models, expectation-maximization, dimension reduction). On the application side, we will introduce various applications of these techniques, particularly to text data analysis and natural language processing. The course will show how to formulate real-world tasks as data analysis problems, present key methods for solving these problems, and discuss their advantages and disadvantages.

Q: Who will benefit from this course? A: By the end of this course, students should be able to formulate the data analysis problems at hand, choose appropriate computational models to extract insights from data automatically, and even come up with innovative solutions to open problems in this field. The course will be helpful for students who want to solve practical problems using machine learning and data science techniques. It will also provide useful techniques for students who want to do cutting-edge research in data mining, machine learning, natural language processing, and related areas.

Q: What are the prerequisites? A: Prerequisites for this course include 1) solid knowledge of probability, statistics, calculus, and linear algebra; 2) basic knowledge of machine learning; 3) solid programming skills, preferably in Python.

Schedule

Date       | Topic                                    | Due
01/10/2022 | Course Overview                          |
01/12/2022 | Probability and MLE                      | Piazza Signup
01/17/2022 | No Class (Martin Luther King Day)        |
01/19/2022 | Data Analysis Toolbox                    |
01/24/2022 | Linear Regression                        |
01/26/2022 | Linear Regression                        | HW1 Out
01/31/2022 | Example Projects                         |
02/02/2022 | Naïve Bayes Classifier                   |
02/07/2022 | Logistic Regression                      |
02/09/2022 | Neural Networks                          | HW1 Due
02/14/2022 | Neural Networks                          | Project checkpoint 1 signup
02/16/2022 | Project checkpoint 1 & discussion        | HW2 Out
02/21/2022 | CNNs and RNNs                            |
02/23/2022 | Decision Trees                           |
02/28/2022 | Random Forest                            |
03/02/2022 | Clustering Analysis and K-Means          | HW2 Due
03/07/2022 | Hierarchical Clustering                  | Project checkpoint 2 signup
03/09/2022 | Gaussian Mixture Model                   | HW3 Out
03/14/2022 | Dimension Reduction                      |
03/16/2022 | Project checkpoint 2 & discussion        |
03/21/2022 | No Class (Spring Break)                  |
03/23/2022 | No Class (Spring Break)                  |
03/28/2022 | Application: Text Representation         |
03/30/2022 | Application: Text Embedding & Clustering | HW3 Due
04/04/2022 | Application: Text Classification         | Project presentation signup
04/06/2022 | Review Class                             |
04/11/2022 | Exam                                     |
04/13/2022 | No Class (Project Preparation)           |
04/18/2022 |                                          |
04/20/2022 | Project Presentation                     | Project presentation due
04/25/2022 |                                          | Presentation peer grading due

Grading

Homework (30%)

There will be three assignments, each accounting for 10% of your final score. Each assignment includes written analysis and/or programming to test your understanding of the material taught.

  • Late policy: Assignments are due at 11:59PM on the due date. You will be allowed 2 total late days (48 hours) without penalty for the entire semester (for homework only, not applicable to exams or projects). Once those days are used, you will be penalized according to the following policy:
    • Homework is worth full credit before the due time.
    • It is worth 75% credit for the first 24 hours after the due time.
    • It is worth 50% credit for the following 24 hours (24 to 48 hours late).
    • It is worth zero credit after that.
  • Follow the Georgia Tech Academic Honor Code.
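
As a rough illustration of the late-policy tiers above, the following Python sketch maps how late a homework is to the fraction of credit it can earn. It assumes the two free late days have already been used, and that a submission exactly 24 or 48 hours late still falls in the earlier tier (the policy does not spell out the boundaries). The function name `late_credit` is hypothetical and not part of any course tooling.

```python
def late_credit(hours_late: float) -> float:
    """Return the credit multiplier for a homework submitted `hours_late`
    hours after the due time.

    Assumptions (not stated in the syllabus): the free late days are
    already exhausted, and the tier boundaries are inclusive.
    """
    if hours_late <= 0:
        return 1.0   # on time: full credit
    if hours_late <= 24:
        return 0.75  # first 24 hours after the due time
    if hours_late <= 48:
        return 0.50  # next 24 hours
    return 0.0       # more than 48 hours late
```

For example, under these assumptions a homework submitted 30 hours late would be graded as usual and then scaled by `late_credit(30)`, that is, by 0.5.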

Project (30%)

You need to complete a project on using computational data analysis techniques to tackle a real-life data analysis problem. Each project needs to be completed in a team of 2-4 people. Here are some guidelines and resources for doing your project smoothly.

Exam (40%)

One exam will be held on April 11 in lieu of the regular class:

  • The exam will be open-book. However, no peer communication is allowed: you may not message or collaborate with others.
  • There will be no make-up exams. A missed exam receives zero credit.
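
To see how the three grading components (homework 30%, project 30%, exam 40%) combine, here is a minimal Python sketch. It assumes every component is scored on a 0-100 scale, which the syllabus does not specify, and the function name `final_score` is hypothetical.

```python
def final_score(homeworks, project, exam):
    """Combine component scores (each assumed to be on a 0-100 scale)
    using the weights listed in the Grading section."""
    if len(homeworks) != 3:
        raise ValueError("expected exactly three homework scores")
    homework_part = sum(0.10 * h for h in homeworks)  # 3 x 10% = 30%
    project_part = 0.30 * project                     # 30%
    exam_part = 0.40 * exam                           # 40%
    return homework_part + project_part + exam_part
```

Under these assumptions, homework scores of 80, 90, and 100, a project score of 70, and an exam score of 60 give 27 + 21 + 24 = 72.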

Resources

Other resources, such as machine learning toolboxes and datasets, will be provided throughout the course.