CS 6665: Data Mining

Spring 2021, 10:30 am to 11:45 am on TR via web broadcast

The information described here has not been finalized yet. This page will be updated frequently.

Course Description

Data mining aims to find useful patterns in large data sets. This course covers data mining algorithms for analyzing large amounts of data, including association rule mining, finding similar items, clustering, data stream mining, recommender systems, how search engines rank pages, and recent techniques for large-scale machine learning. The goal of this class is for students to understand both basic and scalable data mining algorithms.

Prerequisites

  • Solid programming skills (Python is preferred)
  • Basic probability and statistics
  • Basic linear algebra

Course Material

  • [MMDS] Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. Available online.
  • [MML] Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press. Available online.

Class Schedule

| Date | Topic | Reading |
|---|---|---|
| Jan 19 | Introduction to Data Mining | MMDS CH.1 |
| Jan 21 | Canceled | MMDS CH.2 |
| Jan 26 | MapReduce | MMDS CH.2 |
| Jan 28 | Matrix Multiplication by MapReduce (Optional) | MMDS CH.2 |
| Feb 2 | Spark | MMDS CH.2 |
| Feb 4 | Frequent Itemset Mining | MMDS CH.6.1-6.3; CH.6.4 (Optional) |
| Feb 9 | Locality-Sensitive Hashing | MMDS CH.3.1-3.4 |
| Feb 11 | Locality-Sensitive Hashing | MMDS CH.3.5-3.6 (Optional) |
| Feb 16 | Clustering | MMDS CH.7.1-7.3 |
| Feb 18 | Hierarchical Clustering and K-means | MMDS CH.7.1-7.3 |
| Feb 23 | BFR and CURE | MMDS CH.7.3-7.4 |
| Feb 25 | EM Algorithm (Optional) | MML CH.11.1-11.3 |
| Mar 2 | Gaussian Mixture Models (Optional) | MML CH.11.1-11.3 |
| Mar 4 | k-NN and Naive Bayes | MMDS CH.12.1, 12.4 |
| Mar 9 | k-NN and Naive Bayes | MMDS CH.12.1, 12.4 |
| Mar 11 | Decision Tree | MMDS CH.12.5 |
| Mar 16 | SVM | MML CH.12.1, 12.2; MMDS CH.12.3 |
| Mar 18 | Course Project Proposal Presentation | |
| Mar 23 | SVM + Midterm Review | |
| Mar 25 | Midterm | |
| Mar 30 | Mining Data Streams | MMDS CH.4.1-4.3 |
| Apr 1 | Mining Data Streams | MMDS CH.4.4-4.7 |
| Apr 6 | Mining Data Streams | MMDS CH.4.4-4.7 |
| Apr 8 | Canceled | |
| Apr 13 | PageRank | MMDS CH.5.1-5.2 |
| Apr 20 | Course Project Presentation | |
| Apr 22 | Course Project Presentation | |

Grading

  • Homework (30%)
    • Five programming assignments
  • Course Project (40%)
    • Students are required to participate in one Kaggle competition.
    • The project will be evaluated on technical soundness, presentation, and the final report.
  • Midterm Exam (30%)

  • Class Attendance
    • Attendance is recommended but not mandatory.

Course Topics

  1. Data mining overview
  2. MapReduce and Spark
  3. Frequent itemset mining
  4. Finding similar items
  5. Clustering
  6. Classification
  7. Mining data streams
  8. Dimensionality reduction
  9. Recommender systems
  10. Computational advertising
  11. PageRank
  12. Machine learning
  13. Anomaly detection