DS-EL1 | Semester 7 | 4 Credits (3-1-0) | Elective

High-Dimensional Data Analysis


Syllabus


Unit 1: High-Dimensional Phenomena and Curse of Dimensionality

Distance concentration and emptiness phenomenon, Concentration of measure inequality, Nearest neighbor distances in high dimensions, Sparsity in high-dimensional data, Double descent phenomenon, Blessing of dimensionality vs. curse, Dimension reduction necessity, Johnson-Lindenstrauss lemma for random projections.
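The two phenomena that bracket this unit, distance concentration and the Johnson-Lindenstrauss guarantee, can be sketched in a few lines of NumPy. All names and constants below are illustrative (the JL embedding dimension uses one common form of the bound, k on the order of log(n)/eps^2); this is a demonstration, not a tuned implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(d, n=200):
    # Relative gap between the farthest and nearest neighbor of a random
    # query among n uniform points in [0, 1]^d
    X = rng.uniform(size=(n, d))
    q = rng.uniform(size=d)
    dist = np.linalg.norm(X - q, axis=1)
    return (dist.max() - dist.min()) / dist.min()

# Distance concentration: the nearest/farthest contrast collapses as d grows
contrast_low_d = relative_contrast(2)
contrast_high_d = relative_contrast(1000)

# Johnson-Lindenstrauss: a Gaussian random projection to roughly
# k = O(log n / eps^2) dimensions preserves pairwise distances
n, d, eps = 100, 10_000, 0.3
k = int(np.ceil(8 * np.log(n) / eps**2))
X = rng.normal(size=(n, d))
R = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection matrix
Y = X @ R                                   # (n, k) embedded data

orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
ratio = proj / orig   # within (1 - eps, 1 + eps) with high probability
```

Running this shows `contrast_high_d` orders of magnitude below `contrast_low_d`, while the projected distance ratio stays near 1, illustrating why random projections are a cheap first line of attack.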


Unit 2: Principal Component Analysis and Classical Methods

PCA mathematical formulation (covariance matrix eigendecomposition, SVD), PCA variants (Kernel PCA, Sparse PCA, Robust PCA), Incremental/online PCA, Factor analysis and independent component analysis (ICA), Multidimensional scaling (MDS: classical and non-metric), Isomap and geodesic distances.
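The equivalence between the two PCA formulations listed above (covariance eigendecomposition vs. SVD of the centered data matrix) can be verified directly in NumPy. The toy data below is a hypothetical stand-in for a real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 samples of 5 correlated features (toy data)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)               # PCA always centers first

# Route 1: eigendecomposition of the sample covariance matrix
C = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(C)
eigvals = eigvals[::-1]               # eigh returns ascending order

# Route 2: SVD of the centered data, Xc = U S V^T
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = S**2 / (len(Xc) - 1)       # singular values -> explained variances

# Project onto the top-2 principal components
scores = Xc @ Vt[:2].T
explained = svd_vals[:2].sum() / svd_vals.sum()
```

The SVD route is usually preferred in practice: it avoids forming the covariance matrix explicitly and is numerically better conditioned.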


Unit 3: Nonlinear Dimensionality Reduction

t-SNE algorithm (Student-t similarity kernel, KL-divergence objective, perplexity tuning), UMAP (Uniform Manifold Approximation and Projection), LargeVis and PHATE, Autoencoder architectures (vanilla, variational, denoising), Deep belief networks for representation learning, Self-supervised contrastive learning (SimCLR, MoCo), Manifold learning assumptions and topology preservation.
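The "perplexity tuning" step in t-SNE has a concrete mechanism: for each data point, a per-point Gaussian bandwidth is found by binary search so that the conditional similarity distribution hits a user-chosen perplexity. A minimal sketch of that calibration, with illustrative function names and synthetic squared distances:

```python
import numpy as np

def perplexity(p):
    # Shannon perplexity 2^H(p) of a discrete distribution
    p = p[p > 0]
    return 2 ** (-np.sum(p * np.log2(p)))

def conditional_p(dists_sq, sigma):
    # Gaussian conditional similarities for one point, as in t-SNE's
    # high-dimensional affinities
    p = np.exp(-dists_sq / (2 * sigma**2))
    return p / p.sum()

def find_sigma(dists_sq, target_perp, iters=50):
    # Binary search for the bandwidth whose perplexity matches the target;
    # t-SNE performs this search independently for every data point
    lo, hi = 1e-10, 1e10
    for _ in range(iters):
        mid = (lo + hi) / 2
        if perplexity(conditional_p(dists_sq, mid)) < target_perp:
            lo = mid   # distribution too peaked: widen the kernel
        else:
            hi = mid   # distribution too flat: narrow the kernel
    return mid

rng = np.random.default_rng(0)
d2 = rng.uniform(0.5, 5.0, size=100)      # toy squared distances to neighbors
sigma = find_sigma(d2, target_perp=30.0)
```

Intuitively, perplexity acts as a smooth "effective number of neighbors", which is why it is the main knob users tune.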


Unit 4: High-Dimensional Statistics and Regularization

Multiple testing problem and FDR control (Benjamini-Hochberg procedure), Sparsity-inducing regularization (Lasso, Elastic Net, Group Lasso), Stability selection and knockoffs framework, High-dimensional covariance estimation (graphical models, covariance shrinkage), Robust high-dimensional regression, Sure independence screening (SIS).
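The Benjamini-Hochberg step-up procedure is short enough to implement from scratch. A sketch in NumPy (the p-values are an illustrative set, not real study results):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    # BH step-up: sort p-values, find the largest k with p_(k) <= k*q/m,
    # and reject the k hypotheses with the smallest p-values
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Illustrative p-values from m = 15 hypothetical tests
pvals = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298,
         0.0344, 0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.0000]
rejected = benjamini_hochberg(pvals, q=0.05)
```

Note the "step-up" character: a p-value above its own threshold can still be rejected if a larger-indexed p-value passes, which is what distinguishes BH from naive per-test thresholding.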


Unit 5: Scalable Algorithms and Embeddings

Random projection methods (CountSketch, Johnson-Lindenstrauss transforms), Locality-Sensitive Hashing (LSH) families (random hyperplanes, p-stable distributions), Approximate nearest neighbors (HNSW, FAISS), Word embeddings scaling (GloVe, fastText), Graph embeddings (Node2Vec, DeepWalk), Tensor decomposition for multi-modal data.
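The random-hyperplane LSH family listed above hashes each vector to the sign pattern of its projections onto random hyperplanes, so vectors with small angular distance collide on most bits. A minimal sketch (all names and the 16-bit code length are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash(x, planes):
    # One hash bit per hyperplane: which side of the hyperplane x lies on.
    # P(bits differ) = angle(x, y) / pi for two vectors x, y.
    return tuple((x @ planes.T > 0).astype(int))

d, n_bits = 64, 16
planes = rng.normal(size=(n_bits, d))   # one random hyperplane per bit

a = rng.normal(size=d)
b = a + 0.05 * rng.normal(size=d)       # a near-duplicate of a
c = rng.normal(size=d)                  # an unrelated vector

ha, hb, hc = (simhash(v, planes) for v in (a, b, c))
hamming_ab = sum(x != y for x, y in zip(ha, hb))
hamming_ac = sum(x != y for x, y in zip(ha, hc))
```

In a full LSH index, these codes would be split into bands and used as bucket keys, so that candidate near-neighbors are retrieved in sublinear time and only then checked exactly.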