The three Vs of big data (Volume, Velocity, Variety), CAP theorem and BASE consistency, Hadoop Distributed File System (HDFS) architecture (NameNode, DataNode, federation, high availability), HDFS data replication and fault tolerance, MapReduce programming model (mapper, reducer, combiner, partitioner), YARN resource management and application lifecycle.
Hive data warehousing (metastore, SerDe, partitioning, bucketing), HiveQL optimization and execution engine, Pig Latin scripting for ETL pipelines, HBase NoSQL database (column-family model, coprocessors), Sqoop data transfer between RDBMS and HDFS, Flume streaming data ingestion, Oozie workflow orchestration.
Spark architecture (driver, executors, cluster manager), Resilient Distributed Datasets (RDDs) lineage and fault tolerance, Spark transformations (map, flatMap, filter, join) and actions (collect, count, reduce), Spark SQL and DataFrames, Catalyst optimizer and Tungsten execution engine, Delta Lake for ACID transactions on data lakes.
Structured Streaming (micro-batch vs. continuous processing), Kafka integration and exactly-once semantics, Spark MLlib pipelines (transformers, estimators, cross-validation), MLflow integration for experiment tracking, GraphX graph-parallel computation (Pregel API, PageRank, connected components), GraphFrames for DataFrame-based graph analytics.
Cloud data platforms (AWS EMR, Azure HDInsight, Databricks), managed Spark services and autoscaling, data lake architectures (lakehouse pattern, medallion architecture), Apache Kafka fundamentals (topics, partitions, consumer groups, exactly-once delivery), Apache Airflow DAG orchestration, cost optimization strategies and spot instance utilization.