DS401 | Semester 5 | 3 (2-0-2) | Major

Big Data Technologies


Syllabus

01

Unit 1: Big Data Fundamentals and Hadoop Ecosystem

The three Vs of big data (Volume, Velocity, Variety), CAP theorem and BASE consistency, Hadoop Distributed File System (HDFS) architecture (NameNode, DataNode, federation, high availability), HDFS data replication and fault tolerance, MapReduce programming model (mapper, reducer, combiner, partitioner), YARN resource management and application lifecycle.
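
To make the four MapReduce roles named above concrete, here is a minimal single-process sketch of a word count. It is plain Python, not the Hadoop API: the function names, the byte-sum partitioner, and the sample input are all assumptions chosen for illustration.

```python
from collections import defaultdict


def mapper(line):
    # Emit a (word, 1) pair for every word in one input line.
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    # Pre-aggregate a single mapper's output locally to reduce shuffle volume.
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def partitioner(key, num_reducers):
    # Route a key to a reducer. A byte sum stands in for a real hash function
    # so the routing is deterministic across runs (an illustrative choice).
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    # Merge all values that arrived for one key.
    return key, sum(values)


def run_job(lines, num_reducers=2):
    # One shuffle bucket per reducer, mapping key -> list of values.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:  # each line plays the role of one input split
        for key, value in combiner(mapper(line)):
            buckets[partitioner(key, num_reducers)][key].append(value)
    result = {}
    for bucket in buckets:
        for key in sorted(bucket):  # each reducer sees its keys in sorted order
            word, total = reducer(key, bucket[key])
            result[word] = total
    return result


counts = run_job(["big data big ideas", "data moves fast"])
print(counts)
```

The combiner collapses the two occurrences of "big" in the first line into a single pair before the shuffle, which is the saving it is meant to show.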

02

Unit 2: Advanced Hadoop Components

Hive data warehousing (metastore, SerDe, partitioning, bucketing), HiveQL optimization and execution engine, Pig Latin scripting for ETL pipelines, HBase NoSQL database (column-family model, coprocessors), Sqoop data transfer between RDBMS and HDFS, Flume streaming data ingestion, Oozie workflow orchestration.

03

Unit 3: Apache Spark Core and RDDs

Spark architecture (driver, executors, cluster manager), Resilient Distributed Datasets (RDDs) lineage and fault tolerance, Spark transformations (map, flatMap, filter, join) and actions (collect, count, reduce), Spark SQL and DataFrames, Catalyst optimizer and Tungsten execution engine, Delta Lake for ACID transactions on data lakes.
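
The idea behind RDD lineage and lazy transformations can be sketched without a cluster. The toy class below is a hypothetical stand-in, not the Spark API: `ToyRDD` and its methods are invented for illustration, and it assumes a single in-memory partition. Transformations only record a step; actions replay the recorded lineage from the source, which is also how a lost partition would be recomputed.

```python
import functools


class ToyRDD:
    """Hypothetical stand-in for an RDD: it stores lineage, not results."""

    def __init__(self, source, lineage=()):
        self.source = source    # original input partition (a list)
        self.lineage = lineage  # tuple of (step_name, function) pairs

    # Transformations are lazy: they return a new ToyRDD with a longer lineage.
    def map(self, fn):
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self.source, self.lineage + (("filter", predicate),))

    def _compute(self):
        # Replay every recorded step against the source data.
        data = iter(self.source)
        for name, fn in self.lineage:
            if name == "map":
                data = map(fn, data)
            else:
                data = filter(fn, data)
        return data

    # Actions trigger computation and return a value to the caller.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, fn):
        return functools.reduce(fn, self._compute())


numbers = ToyRDD([1, 2, 3, 4, 5, 6])
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

steps = [name for name, _ in even_squares.lineage]
collected = even_squares.collect()
total = even_squares.reduce(lambda a, b: a + b)
print(steps, collected, total)
```

Nothing is evaluated until `collect`, `count`, or `reduce` is called, and calling an action twice recomputes from the source each time because no intermediate result is cached.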

04

Unit 4: Spark Streaming, MLlib, and GraphX

Structured Streaming (micro-batch vs. continuous processing), Kafka integration and exactly-once semantics, Spark MLlib pipelines (transformers, estimators, cross-validation), MLflow integration for experiment tracking, GraphX graph-parallel computation (Pregel API, PageRank, connected components), GraphFrames for DataFrame-based graph analytics.
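
PageRank is a convenient way to see the vertex-centric, superstep-based style that the Pregel API is built around. The sketch below is plain Python under stated assumptions: it uses the normalized textbook formulation (ranks sum to 1, dangling mass spread evenly), which is not necessarily the exact variant GraphX implements, and the `pagerank` function, its parameters, and the sample graph are illustrative.

```python
def pagerank(edges, damping=0.85, supersteps=20):
    # Build the vertex set and outgoing adjacency lists from (src, dst) pairs.
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(supersteps):
        inbox = {v: 0.0 for v in vertices}  # messages received this superstep
        dangling = 0.0                      # rank held by vertices with no out-links
        for v in vertices:
            if out_links[v]:
                # Each vertex sends an equal share of its rank to its neighbours.
                share = rank[v] / len(out_links[v])
                for dst in out_links[v]:
                    inbox[dst] += share
            else:
                dangling += rank[v]
        # Each vertex combines its messages into a new rank.
        rank = {
            v: (1 - damping) / n + damping * (inbox[v] + dangling / n)
            for v in vertices
        }
    return rank


ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
print(ranks)
```

In this sample graph vertex "d" has no incoming edges, so it ends up with the lowest rank, while "a" collects contributions from both "c" and "d".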

05

Unit 5: Big Data Platforms and Cloud Integration

Cloud data platforms (AWS EMR, Azure HDInsight, Databricks), managed Spark services and autoscaling, data lake architectures (lakehouse pattern, medallion architecture), Apache Kafka fundamentals (topics, partitions, consumer groups, exactly-once semantics), Apache Airflow DAG orchestration, cost optimization strategies and spot instance utilization.