CX4240: Introduction to Computational Data Analysis

Logistics

Lecture time: Mons and Weds, 4:30pm-5:45pm
Location: Klaus 2456
Instructor: Chao Zhang
Teaching Assistants: Yue Yu (yyu414@gatech.edu) and Rui Feng (rfeng44@gatech.edu)
Office Hours:
- Instructor: Weds 3:30-4:20pm @ CODA 1309
- TAs: Mons 3:30-4:20pm @ open space outside Klaus 1325
Piazza: https://piazza.com/class/k51ginmjc0446
- Piazza will be the main place for course discussions and announcements. If you have questions, please ask them on Piazza first because 1) other students may have the same question; 2) you will get help faster than by email.
- If there is something you would rather not discuss publicly on Piazza, send an email with CX4240 in the subject line.

Course Content

Q: What will be covered in this course? A: This course will introduce techniques for computational data analysis, with an emphasis on machine learning algorithms and their applications to real-world data. On the technique side, we will cover key supervised methods (linear regression, logistic regression, neural networks, tree-based models) and unsupervised methods (k-means, Gaussian mixture models, expectation-maximization, dimension reduction). On the application side, we will introduce various applications of these techniques, particularly in text data analysis and natural language processing. We will show how to formulate real-world tasks as data analysis problems, present key methods for solving them, and discuss the advantages and disadvantages of those methods.

Q: Who will benefit from this course? A: The learning objective is that, by the end of this course, students will be able to formulate the data analysis problems at hand, choose appropriate computational models to acquire insights from data automatically, and even come up with innovative solutions to open problems in this field. The course will be helpful for students who want to solve practical problems using machine learning and data science techniques. It will also be helpful for students who want to do cutting-edge research in data mining, machine learning, natural language processing, and related areas.

Q: What are the prerequisites? A: This course is demanding in both math and programming. As prerequisites, you are expected to have 1) solid knowledge of probability, statistics, and linear algebra; 2) basic knowledge of machine learning; 3) solid programming skills, preferably in Python.

Schedule

Date	Topic	Notes
1/6/20	Course Overview	Piazza Signup
1/8/20	Math Basics I
1/13/20	Math Basics II
1/15/20	Data Analysis Toolbox
1/20/20	No Class (Martin Luther King Day)
1/22/20	Linear Regression	HW1 Out
1/27/20	Linear Regression
1/29/20	Example Projects	Start Project Team Formation
2/3/20	Naïve Bayes	HW1 Due
2/5/20	Logistic Regression	HW2 Out
2/10/20	Support Vector Machine
2/12/20	Neural Networks
2/17/20	Neural Networks
2/19/20	Tree-Based Models	HW2 Due
2/24/20	Tree-Based Models	HW3 Out
2/26/20	Midterm Review	Project Proposal Due
3/2/20	Midterm Exam
3/4/20	Clustering Analysis and K-Means
3/9/20	Hierarchical Clustering	HW3 Due
3/11/20	Gaussian Mixture Model
3/16/20	No Class (Spring Break)
3/18/20	No Class (Spring Break)
3/23/20	Dimension Reduction	HW4 Out
3/25/20	Application: Text Embedding
3/30/20	Application: Text Classification
4/1/20	Project Presentation
4/6/20	Project Presentation
4/8/20	Project Presentation	Project Video Due
4/13/20	Project Presentation	HW4 Due
4/15/20	Review Class	Peer Grading Due
4/20/20	No Class	Final Exam Out
4/22/20	No Class, Reading Day
4/24/20		Final Exam Due
4/26/20		Project Report Due

Grading

Homework (40%)

There will be four assignments. Each one is designed to test your understanding of the algorithms taught in class. It could be either programming or written analysis.
All students are expected to follow the Georgia Tech Academic Honor Code.
All assignments follow the "no-late" policy. Assignments received after the deadline will receive zero credit.

Project (20%)

You are expected to complete a project on computational data analysis. You can choose one of the two options below for your project:
- Tackle a real-life data analysis task using computational data analysis. You need to be clear about the data you are using, the problem you are attempting to solve, the method you are using, and the results and conclusions you reach.
- Conduct a survey on a specific research topic. You need to select a research topic highly related to this course and read state-of-the-art research papers comprehensively. You need to organize existing techniques into different categories, introduce their key ideas, and critically comment on their advantages and disadvantages.
You will need to turn in a project report and also give an in-class presentation for your project. The project report and the presentation will each count for 10% of your final grade.
Each project needs to be completed in a team of 2-4 people. Team members need to clearly state their individual contributions in the project report.

Class participation (15%)

This class will have many in-class activities. Your class participation score will be based mainly on attendance and on your performance in those activities.
Participation in class discussions (including asking relevant questions in class and volunteering to answer questions on Piazza) can earn you a bonus toward your final grade. It will be especially useful when you are right on the border between two letter grades.

Midterm Exam (10%)

The midterm exam will take place on March 2 in lieu of the regular class.
The midterm exam will be a written, open-book exam, but no computer or Internet usage will be allowed.
There will be no make-up exams. You will get zero credit for your missed midterm exam.

Final Exam (15%)

The final exam will be on April 20 in lieu of the regular class.
The final exam will be a written, open-book exam, but no Internet usage will be allowed.
Again, there will be no make-up exams. You will get zero credit for your missed final exam.
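
As a rough illustration of how the weights above combine, the sketch below computes a weighted course score in Python. The weights are the ones listed in this section; the function name, the 0-100 scale, and the example scores are hypothetical, and the participation bonus and letter-grade cutoffs are not modeled because they are not specified here.

```python
# Weights taken from the Grading section above; they sum to 100%.
WEIGHTS = {
    "homework": 0.40,
    "project": 0.20,        # report (10%) + presentation (10%)
    "participation": 0.15,
    "midterm": 0.10,
    "final": 0.15,
}


def weighted_score(scores):
    """Combine per-component scores (each assumed to be on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student who missed the midterm, which counts as zero credit.
example = {
    "homework": 90,
    "project": 85,
    "participation": 100,
    "midterm": 0,
    "final": 80,
}
print(round(weighted_score(example), 2))
```

In this hypothetical case, the missed midterm removes the full 10 percentage points that component could have contributed.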

Resources

Machine Learning, by Tom Mitchell
Pattern Recognition and Machine Learning, by Christopher Bishop
Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Dive into Deep Learning, by Aston Zhang, Zack C. Lipton, Mu Li, and Alex Smola

Other resources, such as machine learning toolboxes and datasets, will be provided throughout the course.