Unit 1: Scalable Data Mining Frameworks
MapReduce programming model revisited, Beyond MapReduce (Spark, Dask), Data stream mining challenges (concept drift, memory constraints), Sliding window models, Reservoir sampling, Massive data partitioning strategies, Approximate query processing and sampling guarantees, Sketching algorithms fundamentals.
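As an illustration of one topic in this unit, reservoir sampling can be sketched as below. This is a minimal version of the classic single-pass scheme (often called Algorithm R), not course-provided code; the function name `reservoir_sample` and its parameters are chosen here for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Uses O(k) memory and a single pass, so the stream length need not be
    known in advance. `rng` is an optional random.Random instance
    (an illustrative parameter, useful for reproducible runs).
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 10)` returns 10 values while holding only 10 items in memory at any time; a stream shorter than `k` is returned in full.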
Unit 2: Frequent Itemset Mining and Association Rules
Apriori algorithm and candidate generation, FP-growth and FP-tree structure, Eclat and vertical data format, Sampling-based frequent itemset mining, Parallel/distributed FP-growth, Association rule interestingness measures (support, confidence, lift, conviction), Sequential pattern mining (GSP, SPADE, PrefixSpan).
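The four interestingness measures named in this unit can be computed directly from itemset supports. The sketch below uses the standard textbook definitions; the helper names `support` and `rule_measures` are illustrative, and it assumes a small in-memory list of transactions rather than a scalable implementation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def rule_measures(transactions, antecedent, consequent):
    """Support, confidence, lift and conviction for antecedent -> consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    supp_a = support(transactions, a)
    supp_c = support(transactions, c)
    supp_ac = support(transactions, a | c)
    if supp_a == 0 or supp_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = supp_ac / supp_a
    lift = confidence / supp_c
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - supp_c) / (1 - confidence)
    return {
        "support": supp_ac,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
    }
```

On the toy transactions {bread, milk}, {bread, butter}, {bread, milk, butter}, {milk}, the rule bread -> milk has support 1/2, confidence 2/3, lift 8/9 and conviction 3/4. A lift below 1 means bread and milk co-occur slightly less often than independence would predict.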
Unit 3: Graph Mining and Network Analysis
Graph representation (adjacency lists, CSR format), PageRank algorithm and variants (weighted, personalized), HITS algorithm, Triangle counting and clustering coefficients, Community detection (Louvain, spectral clustering, label propagation), Graph sampling and streaming algorithms, Subgraph isomorphism and motif discovery.
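For orientation, basic PageRank on an adjacency-list graph can be sketched with power iteration. This is a minimal, unweighted version under stated assumptions: every node appears as a key of `graph`, rank held by dangling nodes is spread uniformly, and the function name, damping factor and tolerances are illustrative defaults rather than values prescribed by the course.

```python
def pagerank(graph, damping=0.85, tol=1e-9, max_iter=200):
    """Power-iteration PageRank.

    `graph` maps each node to a list of its out-neighbours; every node,
    including those with no out-links, must appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            if graph[v]:
                share = damping * rank[v] / len(graph[v])
                for w in graph[v]:
                    new_rank[w] += share
        # Stop once the L1 change between iterations is small.
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

Calling `pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})` yields ranks that sum to 1, with `c` ranked highest because it receives links from both `a` and `b`. The personalized variant replaces the uniform teleport term with a user-specific distribution.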
Unit 4: Dimensionality Reduction and Clustering at Scale
Mini-batch k-means and scalable EM, Canopy clustering, BIRCH hierarchical clustering, Spectral clustering approximation, t-SNE/UMAP for large datasets, Random projection trees, Streaming PCA and incremental SVD, CUR matrix decomposition for interpretability.
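To make the first topic concrete, mini-batch k-means can be sketched in pure Python as below. It follows the usual scheme of per-centre counts with a learning rate of 1/count; the function name, parameters and random initialisation are assumptions made for illustration, and a practical implementation would use vectorised arrays and a better seeding strategy.

```python
import random


def _nearest(centers, point):
    """Index of the centre closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((c - x) ** 2 for c, x in zip(centers[j], point)),
    )


def minibatch_kmeans(points, k, batch_size=32, n_iter=100, seed=0):
    """Return k centres fitted by mini-batch k-means on a list of tuples."""
    rng = random.Random(seed)
    # Initialise centres from k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(n_iter):
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Cache assignments against the centres as they stand at batch start.
        assignments = [_nearest(centers, p) for p in batch]
        for p, j in zip(batch, assignments):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centre learning rate shrinks over time
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers
```

Each iteration touches only `batch_size` points, so the cost per step is independent of the dataset size, which is the property that makes the method attractive at scale.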
Unit 5: Near-Duplicates and Recommendation Systems
MinHash and Locality Sensitive Hashing (LSH) for Jaccard similarity, SimHash for text documents, Near-duplicate detection at web scale, Matrix factorization (SVD, ALS, NTF), Neighborhood-based collaborative filtering, Content-based recommendation, Hybrid recommenders, Scalable bandit algorithms (LinUCB, Thompson sampling).
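As a small illustration of the first topic, MinHash signatures estimate Jaccard similarity by comparing per-hash minima. The sketch below simulates a family of hash functions by salting BLAKE2b with a seed index; that choice, the function names and the signature length are assumptions for illustration, and token sets are assumed to be non-empty.

```python
import hashlib


def _hash(seed, token):
    """64-bit hash of `token` under the hash function indexed by `seed`."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=128):
    """MinHash signature: the minimum hash value of the set under each function."""
    tokens = set(tokens)
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two equal-length signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

The estimate is unbiased for the true Jaccard similarity, with standard error roughly sqrt(J(1 - J) / num_hashes). LSH then groups signatures into bands so that only likely-similar pairs are compared.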