01
Unit 1: Big Data Fundamentals and Hadoop Ecosystem
The three Vs of big data (volume, velocity, variety), CAP theorem and BASE consistency, Hadoop Distributed File System (HDFS) architecture (NameNode, DataNode, federation, high availability), HDFS data replication and fault tolerance, MapReduce programming model (mapper, reducer, combiner, partitioner), YARN resource management and application lifecycle.
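Illustrative sketch for the MapReduce topic: a single-process word count that walks through the four roles named above (mapper, combiner, partitioner, reducer). This is a toy model in plain Python, not Hadoop code. The function names, the `run_job` driver, and the choice of `crc32` for partitioning are assumptions made for illustration, and each input line stands in for one input split.

```python
from collections import defaultdict
from zlib import crc32


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Pre-aggregate counts on the map side to reduce shuffle traffic."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def partitioner(key, num_reducers):
    """Route a key to a reducer index (crc32 is an illustrative choice)."""
    return crc32(key.encode("utf-8")) % num_reducers


def reducer(word, counts):
    """Sum all partial counts for one key."""
    return word, sum(counts)


def run_job(lines, num_reducers=2):
    """Hypothetical driver: map, combine, shuffle by partition, then reduce."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:  # one toy "map task" per line
        for word, count in combiner(mapper(line)):
            # Shuffle: group values by key inside the target partition.
            partitions[partitioner(word, num_reducers)][word].append(count)
    result = {}
    for partition in partitions:  # each partition is reduced independently
        for word, counts in sorted(partition.items()):
            key, total = reducer(word, counts)
            result[key] = total
    return result


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node data"]))
```

Because the combiner applies the same associative sum as the reducer, dropping it would change only the volume of shuffled pairs, not the final counts.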
02
Unit 2: Advanced Hadoop Components
Hive data warehousing (metastore, SerDe, partitioning, bucketing), HiveQL optimization and execution engine, Pig Latin scripting for ETL pipelines, HBase NoSQL database (column-family model, coprocessors), Sqoop data transfer between RDBMS and HDFS, Flume streaming data ingestion, Oozie workflow orchestration.
03
Unit 3: Apache Spark Core and RDDs
Spark architecture (driver, executors, cluster manager), Resilient Distributed Datasets (RDDs) lineage and fault tolerance, Spark transformations (map, flatMap, filter, join) and actions (collect, count, reduce), Spark SQL and DataFrames, Catalyst optimizer and Tungsten execution engine, Delta Lake for ACID transactions on data lakes.
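Illustrative sketch for the RDD topic: a toy, single-process class that mimics how transformations are recorded lazily as lineage and only evaluated when an action runs. `ToyRDD` and its methods are hypothetical names for illustration and are not the Spark API. Real RDDs are partitioned across executors, which this sketch does not model.

```python
from functools import reduce as _reduce


class ToyRDD:
    """A toy stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, parent=None, op=None, name="source"):
        self._source = source  # only set on the root dataset
        self._parent = parent  # upstream ToyRDD, if any
        self._op = op          # function: iterator -> iterator
        self._name = name

    # Transformations are lazy: they only record a new lineage step.
    def map(self, f):
        return ToyRDD(None, self, lambda it: (f(x) for x in it), "map")

    def filter(self, pred):
        return ToyRDD(None, self, lambda it: (x for x in it if pred(x)), "filter")

    def flat_map(self, f):
        return ToyRDD(None, self, lambda it: (y for x in it for y in f(x)), "flatMap")

    def lineage(self):
        """Return the recorded steps from the source to this dataset."""
        steps, node = [], self
        while node is not None:
            steps.append(node._name)
            node = node._parent
        return list(reversed(steps))

    def _compute(self):
        # Replay the lineage from the source. Nothing is cached, so a "lost"
        # result can always be rebuilt the same way.
        if self._parent is None:
            return iter(self._source)
        return self._op(self._parent._compute())

    # Actions trigger evaluation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, f):
        return _reduce(f, self._compute())


if __name__ == "__main__":
    words = ToyRDD(["spark core", "rdd lineage"]).flat_map(str.split).map(str.upper)
    print(words.lineage())
    print(words.collect())
```

Replaying the lineage on every action is the same idea Spark relies on for fault tolerance: a lost partition is recomputed from its parents instead of being restored from a replica.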
04
Unit 4: Spark Streaming, MLlib, and GraphX
Structured Streaming (micro-batch vs. continuous processing), Kafka integration and exactly-once semantics, Spark MLlib pipelines (transformers, estimators, cross-validation), MLflow integration for experiment tracking, GraphX graph-parallel computation (Pregel API, PageRank, connected components), GraphFrames for DataFrame-based graph analytics.
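Illustrative sketch for the PageRank topic: an iterative computation in plain Python, organized as a message "send" phase followed by a vertex "update" phase, loosely in the spirit of a Pregel superstep. This is not GraphX code. The `pagerank` function, the edge-list input, and the normalized textbook formula are assumptions made for illustration, and library implementations may scale ranks differently. The sketch also assumes every vertex has at least one outgoing edge.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank over a list of (src, dst) edges.

    Assumes no dangling vertices (every vertex has an out-edge).
    """
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Send phase: each vertex splits its rank evenly across its out-edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        # Update phase: each vertex combines the messages it received.
        ranks = {v: (1 - damping) / n + damping * incoming[v] for v in vertices}
    return ranks


if __name__ == "__main__":
    graph = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
    for vertex, rank in sorted(pagerank(graph).items()):
        print(vertex, round(rank, 3))
```

In this small graph, vertex `c` receives links from both `a` and `b`, so it ends up with the highest rank, and `a` ranks above `b` because it receives all of `c`'s rank.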
05
Unit 5: Big Data Platforms and Cloud Integration
Cloud data platforms (AWS EMR, Azure HDInsight, Databricks), managed Spark services and autoscaling, data lake architectures (lakehouse pattern, medallion architecture), Apache Kafka fundamentals (topics, partitions, consumer groups, exactly-once semantics), Apache Airflow DAG orchestration, cost optimization strategies and spot instance utilization.