MapReduce programming model revisited, Beyond MapReduce (Spark, Dask), Data stream mining challenges (concept drift, memory constraints), Sliding window models, Reservoir sampling, Massive data partitioning strategies, Approximate query processing and sampling guarantees, Sketching algorithms fundamentals.
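As a concrete illustration of the reservoir sampling topic listed above, the following is a minimal sketch of the classic single-pass scheme (often called Algorithm R). The function name `reservoir_sample` and the `seed` parameter are illustrative choices, not part of the course material. It keeps a uniform random sample of `k` items from a stream of unknown length using O(k) memory.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream.

    Hypothetical helper for illustration; only the k-item reservoir is
    held in memory, so the stream can be arbitrarily long.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), k=5, seed=42))
```

If the stream has fewer than `k` items, the sketch simply returns all of them.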
Apriori algorithm and candidate generation, FP-growth and FP-tree structure, Eclat and vertical data format, Sampling-based frequent itemset mining, Parallel/distributed FP-growth, Association rule interestingness measures (support, confidence, lift, conviction), Sequential pattern mining (GSP, SPADE, PrefixSpan).
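The interestingness measures named above (support, confidence, lift, conviction) can be sketched for a single rule A -> C as follows. This assumes the standard textbook definitions; the function name `rule_measures` and the toy transactions are made up for illustration.

```python
import math
from typing import Dict, Iterable, List, Set


def rule_measures(transactions: List[Set[str]],
                  antecedent: Iterable[str],
                  consequent: Iterable[str]) -> Dict[str, float]:
    """Compute support, confidence, lift and conviction for the rule A -> C.

    Hypothetical helper; assumes at least one transaction contains A.
    """
    n = len(transactions)
    a, c = frozenset(antecedent), frozenset(consequent)
    supp_a = sum(a <= t for t in transactions) / n
    supp_c = sum(c <= t for t in transactions) / n
    supp_ac = sum((a | c) <= t for t in transactions) / n

    confidence = supp_ac / supp_a
    lift = confidence / supp_c
    # Conviction is unbounded when the rule never fails.
    conviction = math.inf if confidence == 1.0 else (1.0 - supp_c) / (1.0 - confidence)
    return {"support": supp_ac, "confidence": confidence,
            "lift": lift, "conviction": conviction}


if __name__ == "__main__":
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
        {"bread", "milk"},
    ]
    print(rule_measures(baskets, {"bread"}, {"milk"}))
```

A lift below 1 (as in this toy data) indicates that the antecedent and consequent co-occur less often than they would if they were independent.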
Graph representation (adjacency lists, CSR format), PageRank algorithm and variants (weighted, personalized), HITS algorithm, Triangle counting and clustering coefficients, Community detection (Louvain, spectral clustering, label propagation), Graph sampling and streaming algorithms, Subgraph isomorphism and motif discovery.
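To make the PageRank topic above concrete, here is a small power-iteration sketch over an adjacency-list graph. The function name `pagerank`, the default damping factor of 0.85, and the handling of dangling nodes (their rank is spread uniformly over all nodes) are assumptions chosen for illustration; other conventions exist.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             tol: float = 1e-10, max_iter: int = 100) -> Dict[str, float]:
    """Power-iteration PageRank on an adjacency-list graph.

    Illustrative sketch: every node, including pure sinks, must appear
    as a key in `graph`; values are lists of out-neighbours.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if graph[v]:
                share = damping * rank[v] / len(graph[v])
                for w in graph[v]:
                    new_rank[w] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


if __name__ == "__main__":
    toy = {"a": ["c"], "b": ["c"], "c": ["a"]}
    for node, score in sorted(pagerank(toy).items()):
        print(node, round(score, 4))
```

At web scale the same iteration is usually expressed as a sparse matrix-vector product (for example over a CSR representation) rather than Python dictionaries.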
Mini-batch k-means and scalable EM, Canopy clustering, BIRCH hierarchical clustering, Spectral clustering approximation, t-SNE/UMAP for large datasets, Random projection trees, Streaming PCA and incremental SVD, CUR matrix decomposition for interpretability.
MinHash and Locality Sensitive Hashing (LSH) for Jaccard similarity, SimHash for text documents, Near-duplicate detection at web scale, Matrix factorization (SVD, ALS, NTF), Neighborhood-based collaborative filtering, Content-based recommendation, Hybrid recommenders, Scalable bandit algorithms (LinUCB, Thompson sampling).
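As an illustration of the MinHash topic above, the sketch below estimates Jaccard similarity from fixed-length signatures. It assumes seeded BLAKE2b hashes from Python's standard `hashlib` as a stand-in for a family of random hash functions; the names `minhash_signature` and `estimate_jaccard` are hypothetical, and a production system would typically add LSH banding on top of the signatures.

```python
import hashlib
from typing import List, Set


def _seeded_hash(seed: int, token: str) -> int:
    """64-bit hash of a token, salted by the hash-function index."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: Set[str], num_hashes: int = 128) -> List[int]:
    """Signature = minimum hash value of the set under each hash function.

    Assumes a non-empty token set.
    """
    return [min(_seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of signature positions that agree estimates |A & B| / |A | B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = set("the quick brown fox jumps over the lazy dog".split())
    doc_b = set("the quick brown fox leaps over a sleepy dog".split())
    exact = len(doc_a & doc_b) / len(doc_a | doc_b)
    approx = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
    print(f"exact={exact:.3f} estimated={approx:.3f}")
```

The estimate's standard error shrinks roughly as one over the square root of `num_hashes`, so longer signatures trade memory for accuracy.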