CS 6665: Data Mining

Spring 2021, 10:30 am to 11:45 am on TR via web broadcast

The information described here has not been finalized yet. This page will be updated frequently.

Course Description

Data mining aims to find useful patterns in large data sets. This course covers data mining algorithms for analyzing large amounts of data, including association rule mining, finding similar items, clustering, data stream mining, recommender systems, how search engines rank pages, and recent techniques for large-scale machine learning. The goal of this class is for students to understand both basic and scalable data mining algorithms.

Prerequisites

Solid programming skills (Python is preferred)
Basic probability and statistics
Basic linear algebra

Course Material

[MMDS] Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. Available online.
[MML] Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press. Available online.

Class Schedule

Date	Topic	Reading
Jan 19	Introduction to Data Mining	MMDS CH.1
Jan 21	Canceled	MMDS CH.2
Jan 26	MapReduce	MMDS CH.2
Jan 28	Matrix Multiplication by MapReduce (Optional)	MMDS CH.2
Feb 2	Spark	MMDS CH.2
Feb 4	Frequent Itemset Mining	MMDS CH.6.1-6.3; CH.6.4 (Optional)
Feb 9	Locality-Sensitive Hashing	MMDS CH.3.1-3.4
Feb 11	Locality-Sensitive Hashing	MMDS CH.3.5-3.6 (Optional)
Feb 16	Clustering	MMDS CH.7.1-7.3
Feb 18	Hierarchical clustering and K-means	MMDS CH.7.1-7.3
Feb 23	BFR and CURE	MMDS CH.7.3-7.4
Feb 25	EM algorithm (Optional)	MML CH.11.1-11.3
Mar 2	Gaussian Mixture Models (Optional)	MML CH.11.1-11.3
Mar 4	k-NN and Naive Bayes	MMDS CH.12.1, 12.4
Mar 9	k-NN and Naive Bayes	MMDS CH.12.1, 12.4
Mar 11	Decision Tree	MMDS CH.12.5
Mar 16	SVM	MML CH.12.1,12.2; MMDS CH.12.3
Mar 18	Course Project Proposal Presentation
Mar 23	SVM + Midterm Review
Mar 25	Midterm
Mar 30	Mining Data Streams	MMDS CH.4.1-4.3
Apr 1	Mining Data Streams	MMDS CH.4.4-4.7
Apr 6	Mining Data Streams	MMDS CH.4.4-4.7
Apr 8	Canceled
Apr 13	PageRank	MMDS CH.5.1-5.2
Apr 20	Course Project Presentation
Apr 22	Course Project Presentation

Grading

Homework (30%)
- Five programming assignments
Course Project (40%)
- Students are required to participate in one Kaggle competition.
- The project will be evaluated on technical soundness, presentation, and the final report.
Midterm Exam (30%)
Class Attendance
- Class attendance is not mandatory but recommended.

Course Topics

Data mining overview
MapReduce and Spark
Frequent itemset mining
Finding similar items
Clustering
Classification
Mining data streams
Dimensionality reduction
Recommender systems
Computational advertising
PageRank
Machine learning
Anomaly detection