CS 6665: Data Mining

Spring 2022, 10:30 am to 11:45 am on TR, Huntsman Hall 160

The information described here has not been finalized yet. This page will be updated frequently.

Course Description

Data mining aims at finding useful patterns in large data sets. This course covers data mining algorithms for analyzing large amounts of data, including association rule mining, finding similar items, clustering, data stream mining, recommender systems, how search engines rank pages, and recent techniques for large-scale machine learning. The goal of this class is for students to understand both basic and scalable data mining algorithms.

Prerequisites

  • Solid programming skills (Python is preferred)
  • Basic probability and statistics
  • Basic linear algebra

Course Material

  • [MMDS] Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. Available online.
  • [MML] Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine Learning. Cambridge University Press. Available online.

Class Schedule

Date | Topic | Reading
Jan 11 | Introduction to Data Mining | MMDS CH.1
Jan 13 | MapReduce | MMDS CH.2
Jan 18 | Matrix Multiplication by MapReduce (Optional) | MMDS CH.2
Jan 20 | Spark | MMDS CH.2
Jan 25 | Frequent Itemset Mining | MMDS CH.6.1-6.3; CH.6.4 (Optional)
Jan 27 | Locality-Sensitive Hashing | MMDS CH.3.1-3.4
Feb 1 | Locality-Sensitive Hashing | MMDS CH.3.5-3.6 (Optional)
Feb 3 | Clustering | MMDS CH.7.1-7.3
Feb 8 | Hierarchical Clustering and K-means | MMDS CH.7.1-7.3
Feb 10 | BFR and CURE | MMDS CH.7.3-7.4
Feb 15 | EM Algorithm (Optional) | MML CH.11.1-11.3
Feb 17 | Gaussian Mixture Models (Optional) | MML CH.11.1-11.3
Feb 22 | k-NN and Naive Bayes | MMDS CH.12.1, 12.4
Feb 24 | k-NN and Naive Bayes | MMDS CH.12.1, 12.4
Mar 1 | SVM | MML CH.12.1, 12.2; MMDS CH.12.3
Mar 3 | SVM | MML CH.12.1, 12.2; MMDS CH.12.3
Mar 8 | Spring Break |
Mar 10 | Spring Break |
Mar 15 | Decision Tree | MMDS CH.12.5
Mar 17 | PageRank | MMDS CH.5.1-5.2
Mar 22 | Midterm |
Mar 24 | Dimensionality Reduction | MMDS CH.11.3
Mar 29 | Recommender Systems | MMDS CH.9.1-9.3
Mar 31 | Mining Data Streams | MMDS CH.4.1-4.3
Apr 5 | Mining Data Streams | MMDS CH.4.4-4.7
Apr 7 | Mining Data Streams | MMDS CH.4.4-4.7
Apr 12 | Trustworthy AI |
Apr 14 | Course Project Presentation |
Apr 19 | Course Project Presentation |
Apr 21 | Course Project Presentation |
Apr 26 | Canceled |

Grading

  • Homework (35%)
    • Six programming assignments
  • Course Project (35%)
    • Students are required to participate in one Kaggle competition.
    • The project will be evaluated on technical soundness, the presentation, and the final report.
  • Midterm Exam (30%)

  • Class Attendance
    • Class attendance is not mandatory but is recommended.

Course Topics

  1. Data mining overview
  2. MapReduce and Spark
  3. Frequent itemset mining
  4. Finding similar items
  5. Clustering
  6. Classification
  7. Mining data streams
  8. Dimensionality reduction
  9. Recommender systems
  10. Computational advertising
  11. PageRank
  12. Anomaly detection