CS 6665: Data Mining

Spring 2021, 10:30 am to 11:45 am on TR via web broadcast

The information described here has not been finalized yet. This page will be updated frequently.

Course Description

Data mining aims at finding useful patterns in large data sets. This course covers data mining algorithms for analyzing large amounts of data, including association rule mining, finding similar items, clustering, data stream mining, recommender systems, how search engines rank pages, and recent techniques for large-scale machine learning. The goal of this class is for students to understand both basic and scalable data mining algorithms.

Prerequisites

  • Solid programming skills (Python is preferred)
  • Basic probability and statistics
  • Basic linear algebra

Course Material

  • [MMDS] Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. Available Online.
  • [MML] Deisenroth, M. P., Faisal, A. A., & Ong, C. S. (2020). Mathematics for Machine Learning. Available Online.

Class Schedule

Date | Topic | Reading
Jan 19 | Introduction to Data Mining | MMDS CH.1
Jan 21 | Canceled | MMDS CH.2
Jan 26 | MapReduce | MMDS CH.2
Jan 28 | Matrix Multiplication by MapReduce (Optional) | MMDS CH.2
Feb 2 | Spark | MMDS CH.2
Feb 4 | Frequent Itemset Mining | MMDS CH.6.1-6.3; CH.6.4 (Optional)
Feb 9 | Locality-Sensitive Hashing | MMDS CH.3.1-3.4
Feb 11 | Locality-Sensitive Hashing | MMDS CH.3.5-3.6 (Optional)
Feb 16 | Clustering | MMDS CH.7.1-7.3
Feb 18 | Hierarchical Clustering and K-means | MMDS CH.7.1-7.3
Feb 23 | BFR and CURE | MMDS CH.7.3-7.4
Feb 25 | EM Algorithm (Optional) | MML CH.11.1-11.3
Mar 2 | Gaussian Mixture Models (Optional) | MML CH.11.1-11.3
Mar 4 | k-NN and Naive Bayes | MMDS CH.12.1, 12.4
Mar 9 | k-NN and Naive Bayes | MMDS CH.12.1, 12.4
Mar 11 | Decision Tree | MMDS CH.12.5
Mar 16 | SVM | MML CH.12.1, 12.2; MMDS CH.12.3
Mar 18 | Course Project Proposal Presentation |
Mar 23 | SVM + Midterm Review |
Mar 25 | Midterm |
Mar 30 | Mining Data Streams | MMDS CH.4.1-4.3
Apr 1 | Mining Data Streams | MMDS CH.4.4-4.7
Apr 6 | Mining Data Streams | MMDS CH.4.4-4.7
Apr 8 | Canceled |
Apr 13 | PageRank | MMDS CH.5.1-5.2
Apr 20 | Course Project Presentation |
Apr 22 | Course Project Presentation |

Grading

  • Homework (30%)
    • Five programming assignments
  • Course Project (40%)
    • Students are required to participate in one Kaggle competition.
    • The project will be evaluated on technical soundness, the presentation, and the final report.
  • Midterm Exam (30%)

  • Class Attendance
    • Class attendance is not mandatory but is recommended.

Course Topics

  1. Data mining overview
  2. MapReduce and Spark
  3. Frequent itemset mining
  4. Finding similar items
  5. Clustering
  6. Classification
  7. Mining data streams
  8. Dimensionality reduction
  9. Recommender systems
  10. Computational advertising
  11. PageRank
  12. Machine learning
  13. Anomaly detection