CX4240: Introduction to Computational Data Analysis
Logistics
- Lecture time: Mons and Weds, 4:30pm-5:45pm
- Location: Klaus 2456
- Instructor: Chao Zhang
- Teaching Assistants: Yue Yu (yyu414@gatech.edu) and Rui Feng (rfeng44@gatech.edu)
- Office Hours:
  - Instructor: Weds 3:30-4:20pm @ CODA 1309
  - TA Office Hour: Mons 3:30-4:20pm, TBD
- Piazza: https://piazza.com/class/k51ginmjc0446
  - Piazza will be the main place for course discussions and announcements. If you have questions, please ask them on Piazza first, because 1) other students may have the same question, and 2) you will get help faster than by email.
  - If it's something you would rather not discuss publicly on Piazza, send an email with CX4240 in the subject line.
Course Content
Q: What will be covered in this course? A: This course will introduce techniques for computational data analysis, with an emphasis on machine learning algorithms and their applications to real-world data. On the technique side, we will cover key supervised methods (linear regression, logistic regression, neural networks, tree-based models) and unsupervised methods (k-means, Gaussian mixture models, expectation-maximization, dimension reduction). On the application side, we will introduce various applications of these techniques, particularly in text data analysis and natural language processing. We will show how to formulate real-world tasks as data analysis problems, present key methods for solving these problems, and discuss their advantages and disadvantages.
Q: Who will benefit from this course? A: The learning objective is that, by the end of this course, students will be able to formulate the data analysis problems at hand, choose appropriate computational models to acquire insights from data automatically, and even come up with innovative solutions to open problems in this field. The course will be helpful for students who want to solve practical problems using machine learning and data science techniques. It will also be helpful for students who want to do cutting-edge research in data mining, machine learning, natural language processing, and related areas.
Q: What are the prerequisites? A: This course is demanding in both math and programming. As prerequisites, you are expected to have 1) solid knowledge of probability, statistics, and linear algebra; 2) basic knowledge of machine learning; 3) solid programming skills, preferably in Python.
Schedule
Date | Topic | Notes |
---|---|---|
1/6/20 | Course Overview | Piazza Signup |
1/8/20 | Math Basics I | |
1/13/20 | Math Basics II | |
1/15/20 | Data Analysis Toolbox | |
1/20/20 | No Class (Martin Luther King Day) | |
1/22/20 | Linear Regression | HW1 Out |
1/27/20 | Linear Regression | |
1/29/20 | Example Projects | Start Project Team Formation |
2/3/20 | Naïve Bayes and Logistic Regression | HW1 Due |
2/5/20 | Support Vector Machine | HW2 Out |
2/10/20 | Neural Networks | |
2/12/20 | Neural Networks | |
2/17/20 | CNNs and RNNs | HW2 Due |
2/19/20 | Decision Trees | |
2/24/20 | Ensemble Methods and Random Forest | HW3 Out |
2/26/20 | Midterm Review | Project Proposal Due |
3/2/20 | Midterm Exam | |
3/4/20 | Clustering Analysis and K-Means | |
3/9/20 | Hierarchical Clustering | |
3/11/20 | Gaussian Mixture Model | HW4 Out |
3/16/20 | No Class (Spring Break) | |
3/18/20 | No Class (Spring Break) | |
3/23/20 | Dimension Reduction | |
3/25/20 | Application: Text Embedding | |
3/30/20 | Application: Text Classification | HW4 Due |
4/1/20 | Project Presentation | |
4/6/20 | Project Presentation | |
4/8/20 | Project Presentation | |
4/13/20 | Project Presentation | |
4/15/20 | Review Class | |
4/20/20 | Final Exam | |
4/22/20 | No Class, Reading Day | Project Report Due |
Grading
Homework (40%)
- There will be four assignments. Each one is designed to test your understanding of the algorithms taught, and may take the form of either programming or written analysis.
- All students are expected to follow the Georgia Tech Academic Honor Code.
- All assignments follow the "no-late" policy. Assignments received after the due time will receive zero credit.
Project (20%)
- You are expected to complete a project on computational data analysis. You can choose from one of the two options below for your project:
  - Tackle a real-life data analysis task using computational data analysis. You need to be clear about the data you are using, the problem you are attempting to solve, the method you are using, and the results and conclusions you reach.
  - Conduct a survey on a specific research topic. You need to select a research topic closely related to this course and read state-of-the-art research papers comprehensively. You need to organize existing techniques into different categories, introduce their key ideas, and critically comment on their advantages and disadvantages.
- You will need to turn in a project report and also give an in-class presentation for your project. The project report and the presentation will each count for 10% of your final grade.
- Each project needs to be completed in a team of 2-4 people. Team members need to clearly state their individual contributions in the project report.
Class participation (15%)
- This class will have many in-class activities. Your class participation score will be based mainly on attendance and on your performance in those activities.
- Participation in class discussions (including asking relevant questions in class and volunteering to answer questions on Piazza) can earn you a bonus toward your final grade. It will be especially useful when you are right on the edge between two letter grades.
Midterm Exam (10%)
- The midterm exam will take place on March 2 in lieu of the regular class.
- The midterm exam will be a written, open-book exam, but no computer or Internet usage will be allowed.
- There will be no make-up exams. You will get zero credit for your missed midterm exam.
Final Exam (15%)
- The final exam will take place on April 20 in lieu of the regular class.
- The final exam will be a written, open-book exam, but no Internet usage will be allowed.
- Again, there will be no make-up exams. You will get zero credit for your missed final exam.
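The component weights above add up to 100%. As a rough illustration of how they combine, here is a minimal Python sketch. The function name `weighted_score` and the example scores are hypothetical, each component is assumed to be scored on a 0-100 scale, and the participation bonus is not modeled.

```python
# Weights taken from the grading breakdown above; they sum to 1.0 (100%).
WEIGHTS = {
    "homework": 0.40,
    "project": 0.20,        # report (10%) + presentation (10%)
    "participation": 0.15,
    "midterm": 0.10,
    "final": 0.15,
}


def weighted_score(scores):
    """Combine per-component scores (assumed 0-100 each) into one weighted score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "homework": 90,
    "project": 85,
    "participation": 100,
    "midterm": 80,
    "final": 75,
}
print(round(weighted_score(example), 2))
```

Because homework carries the largest weight and follows the no-late policy, a single missed assignment moves the weighted score more than any other single item in this sketch.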
Resources
- Machine Learning, by Tom Mitchell
- Pattern Recognition and Machine Learning, by Christopher Bishop
- Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
- The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- Dive into Deep Learning, by Aston Zhang, Zack C. Lipton, Mu Li, and Alex Smola
Other resources, such as machine learning toolboxes and datasets, will be provided throughout the course.