Focuses on statistics, databases, analytics engineering, big data systems, and predictive modeling.
These required courses define the major specialization.
Review of the hypothesis testing framework (p-values, Type I/II errors). Statistical power and sample size calculation. A/B testing: design, execution, and interpretation. Non-parametric tests for non-normal data: Mann-Whi...
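As an illustration of the power and sample size topic above, the sketch below estimates the per-arm sample size for a two-sided A/B test on conversion rates. It uses the standard normal-approximation formula for comparing two proportions; the function name `sample_size_per_arm` and the default `alpha` and `power` values are assumptions made for this example, not part of the course material.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed in EACH arm of a two-proportion A/B test.

    Normal approximation with unpooled variance (an illustrative choice):
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    if p_control == p_treatment:
        raise ValueError("the two rates must differ to define an effect size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: baseline conversion 10%, hoping to detect a lift to 12%.
n_per_arm = sample_size_per_arm(0.10, 0.12)
print(n_per_arm)
```

Smaller effects or higher target power increase the required sample size, which is why the minimum detectable effect is usually fixed before an experiment starts.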
Window functions fundamentals (ROW_NUMBER, RANK, DENSE_RANK, NTILE), Aggregate window functions (SUM, AVG, COUNT over partitions), Framing clauses and sliding windows, LAG/LEAD functions for time-series analysis, FIRS...
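To make the window function topics above concrete, here is a small sketch that runs `ROW_NUMBER`, `LAG`, and a framed running `SUM` through Python's built-in `sqlite3` module. The `sales` table and its columns are hypothetical, and the example assumes the bundled SQLite is version 3.25 or newer, where window functions are available.

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", 1, 100), ("north", 2, 150), ("north", 3, 120),
        ("south", 1, 80), ("south", 2, 90),
    ],
)

query = """
SELECT region, day, amount,
       -- rank each day within its region by amount, largest first
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
       -- previous day's amount in the same region (NULL on the first day)
       LAG(amount) OVER (PARTITION BY region ORDER BY day) AS prev_amount,
       -- running total using an explicit framing clause
       SUM(amount) OVER (PARTITION BY region ORDER BY day
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales
ORDER BY region, day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each window function keeps every input row, unlike `GROUP BY`, which collapses a partition into a single row.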
Data visualization principles and human perception (pre-attentive processing, Gestalt principles), Data types and appropriate encodings (position, length, angle, area, color, shape), Chart taxonomy (categorical, tempo...
The three Vs of big data (Volume, Velocity, Variety), CAP theorem and BASE consistency, Hadoop Distributed File System (HDFS) architecture (NameNode, DataNode, federation, high availability), HDFS data replication and fau...
Supervised vs. unsupervised learning review, Regression vs. classification frameworks, Model evaluation metrics (MAE, RMSE, R² for regression; precision, recall, F1, AUC for classification), Cross-validation strategie...
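The classification metrics named above can be computed directly from confusion-matrix counts. The following is a minimal sketch for binary labels encoded as 0 and 1; the helper name `precision_recall_f1` and the sample labels are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Convention assumed here: a metric is 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
print(precision_recall_f1(y_true, y_pred))
```

Precision penalizes false positives and recall penalizes false negatives; F1 is their harmonic mean, so it stays low unless both are reasonably high.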
Time series components (trend, seasonality, cycle, irregular), Stationarity concepts (weak vs. strict stationarity), Trend estimation (moving averages, polynomial fitting, LOESS), Seasonal decomposition (classical, ST...
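As a small illustration of trend estimation by moving averages, the sketch below smooths a series with a simple, equally weighted window. The function name `moving_average` and the sample data are assumptions for this example; centered or weighted variants follow the same pattern.

```python
def moving_average(series, window):
    """Simple moving average: the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical series: an upward trend with a repeating high/low pattern.
observations = [10, 14, 11, 15, 12, 16, 13, 17]
print(moving_average(observations, window=2))
```

Choosing the window equal to the seasonal period averages out the seasonal pattern, leaving an estimate of the trend; the output is shorter than the input by `window - 1` points.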
These electives are available within the same major specialization pathway.
Distance concentration and emptiness phenomenon, Concentration of measure inequalities, Nearest neighbor distances in high dimensions, Sparsity in high-dimensional data, Double descent phenomenon, Blessing of dimensiona...
Operational Data Store (ODS) vs. data warehouse, Bill Inmon vs. Ralph Kimball approaches (normalized vs. star schema), Data warehouse architecture (staging, ETL, presentation layer), Conformed dimensions and fact gran...
MapReduce programming model revisited, Beyond MapReduce (Spark, Dask), Data stream mining challenges (concept drift, memory constraints), Sliding window models, Reservoir sampling, Massive data partitioning strategies...
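Reservoir sampling, listed above, keeps a uniform random sample of fixed size from a stream whose length is not known in advance. The sketch below follows the classic single-pass scheme often called Algorithm R; the function name `reservoir_sample` and the optional `rng` parameter are choices made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 5 values from a generator without materializing it.
sample = reservoir_sample((x * x for x in range(10_000)), k=5, rng=random.Random(42))
print(sample)
```

Memory use is O(k) regardless of stream length, which is what makes the technique suitable under the memory constraints typical of stream mining.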
Data governance definition and business value, DAMA-DMBOK framework domains, Data governance maturity models (Gartner, IBM, EDM Council), Roles and responsibilities (data stewards, custodians, owners), Data governance...
Data stream characteristics (infinite, unbounded, out-of-order), Lambda vs. Kappa architectures, Time concepts (event time, processing time, ingestion time, watermarks), Windowing strategies (tumbling, hopping, slidin...
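To illustrate tumbling windows over event time, the sketch below assigns each event to a fixed-size, non-overlapping window based on its own timestamp rather than its arrival order. The event format (a tuple of integer timestamp and payload) and the function name `tumbling_window_counts` are assumptions for this example; it does not model watermarks or late-data handling.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per tumbling window, keyed by window start time.

    events: iterable of (event_time, payload) with integer event times.
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        # Every timestamp belongs to exactly one window: [start, start + window_size).
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical events arriving out of order with respect to event time.
events = [(1, "a"), (12, "b"), (7, "c"), (25, "d"), (3, "e")]
print(tumbling_window_counts(events, window_size=10))
```

Because windows are keyed by event time, out-of-order arrival does not change the result here. A real streaming engine must additionally decide when a window is complete, which is the role of watermarks.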
Graph representations (adjacency matrix, edge list, CSR/CSC), Directed/undirected/multigraphs, Basic metrics (degree, density, diameter, average path length), Connected components and strongly connected components, Gr...
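The representations and basic metrics above can be sketched in a few lines. The example below converts an edge list into an adjacency structure, then computes density and connected components with a breadth-first search. It assumes a simple undirected graph (no self-loops or parallel edges), and all function names are illustrative.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Turn an undirected edge list into a dict mapping node -> set of neighbors."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def density(adj):
    """Fraction of possible edges present: 2m / (n(n-1)) for a simple undirected graph."""
    n = len(adj)
    m = sum(len(neighbors) for neighbors in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def connected_components(adj):
    """Return the connected components as sorted lists of nodes, found by BFS."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# Hypothetical graph with two components.
adj = build_adjacency([(1, 2), (2, 3), (4, 5)])
print({node: len(neighbors) for node, neighbors in adj.items()})  # degrees
print(density(adj))
print(connected_components(adj))
```

An adjacency structure like this suits sparse graphs; an adjacency matrix gives constant-time edge lookups at the cost of memory quadratic in the number of nodes.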