DS-EL5 | Semester 7 | 4 (3-0-2) | Elective

Real-Time Analytics & Stream Processing


Syllabus

01

Unit 1: Stream Processing Fundamentals

Data stream characteristics (infinite, unbounded, out-of-order), Lambda vs. Kappa architectures, Time concepts (event time, processing time, ingestion time, watermarks), Windowing strategies (tumbling, hopping, sliding, session), Late data handling and allowed lateness, Exactly-once vs. at-least-once semantics, Fault tolerance in streaming systems.
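
To make the interplay of event time, watermarks, tumbling windows, and allowed lateness concrete, here is a minimal pure-Python sketch. It is not tied to any engine: the function name tumbling_counts, its parameters, and the "highest timestamp seen minus a fixed delay" watermark heuristic are assumptions chosen for illustration.

```python
from collections import defaultdict


def tumbling_counts(events, size, max_delay, allowed_lateness=0):
    """Count events per event-time tumbling window.

    events: iterable of integer event timestamps, in arrival order.
    Returns (emissions, dropped). emissions lists (window_start, count)
    in firing order; dropped lists timestamps that arrived after their
    window's state had already been discarded.
    """
    counts = defaultdict(int)   # window_start -> running count
    fired = set()               # windows that have emitted at least once
    emissions, dropped = [], []
    watermark = float("-inf")

    for ts in events:
        start = ts - ts % size
        end = start + size
        if watermark >= end + allowed_lateness:
            dropped.append(ts)  # too late: window state is gone
        else:
            counts[start] += 1
            if start in fired:
                # Late, but within allowed lateness: emit an updated count.
                emissions.append((start, counts[start]))

        # Assumed heuristic: watermark trails the highest timestamp seen.
        watermark = max(watermark, ts - max_delay)

        # Fire each window whose end the watermark has now reached.
        for w in sorted(counts):
            if w not in fired and watermark >= w + size:
                fired.add(w)
                emissions.append((w, counts[w]))

        # Discard state once the allowed lateness has also expired.
        for w in [w for w in counts if watermark >= w + size + allowed_lateness]:
            del counts[w]

    return emissions, dropped


emissions, dropped = tumbling_counts(
    [1, 4, 12, 3, 25, 14, 2], size=10, max_delay=5, allowed_lateness=10
)
print(emissions)  # [(0, 3), (10, 1), (10, 2)]
print(dropped)    # [2]
```

In this run, timestamp 3 arrives out of order but before the watermark reaches its window's end, so it is counted normally. Timestamp 14 arrives after window [10, 20) has fired and triggers an updated result. Timestamp 2 arrives after window [0, 10) has been purged and is dropped. Window [20, 30) never fires because the input ends before the watermark reaches 30.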

02

Unit 2: Apache Kafka Ecosystem

Kafka architecture (topics, partitions, leaders/followers), Producer/consumer APIs and configurations, Kafka Streams DSL vs. Processor API, KSQL (now ksqlDB) for SQL-based stream processing, Kafka Connect for data integration, MirrorMaker and cluster federation, Schema Registry and Avro/Protobuf serialization, Exactly-once guarantees with transactions.
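
The relationship between topics, partitions, keys, and offsets can be sketched without a broker. The toy model below is an illustration only: TopicSketch and partition_for are hypothetical names, and CRC32 stands in for the murmur2 hash that Kafka's default Java partitioner applies to keyed records.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index.

    Stand-in hash for illustration: real Kafka clients use murmur2,
    so actual partition numbers will differ.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class TopicSketch:
    """In-memory toy topic: a list of append-only partition logs."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key: str, value):
        """Append a record and return (partition, offset)."""
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


topic = TopicSketch(num_partitions=3)
for i in range(3):
    print(topic.send("user-1", f"click-{i}"))
    print(topic.send("user-2", f"view-{i}"))
```

Every record with the same key lands in the same partition, and offsets grow by one per append within that partition. That is why ordering is guaranteed per key (per partition) rather than across the whole topic.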

03

Unit 3: Apache Flink Deep Dive

Flink execution model (DataStream/DataSet APIs), Stateful stream processing and keyed state, Checkpointing and savepoints, Event-time processing with watermarks, Complex Event Processing (CEP), Table/SQL API for unified batch/streaming, Flink ML and Gelly (machine-learning and graph-processing libraries), Side outputs and dynamic scaling.
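
The idea behind keyed state and checkpointing can be shown in a few lines of plain Python. This is a conceptual sketch, not Flink's API: KeyedSum, run, and the list-backed replayable source are assumptions made for illustration. The key point it demonstrates is that operator state and the source read position are snapshotted together.

```python
import copy


class KeyedSum:
    """Running sum per key, plus the offset of the next record to read."""

    def __init__(self):
        self.state = {}   # key -> running sum (the "keyed state")
        self.offset = 0   # position in the replayable source

    def process(self, record):
        key, value = record
        self.state[key] = self.state.get(key, 0) + value
        self.offset += 1

    def snapshot(self):
        # State and source position are captured as one consistent unit.
        return {"state": copy.deepcopy(self.state), "offset": self.offset}

    def restore(self, snap):
        self.state = copy.deepcopy(snap["state"])
        self.offset = snap["offset"]


def run(op, source, stop_at=None):
    """Feed records from op.offset up to stop_at (or the end of source)."""
    end = len(source) if stop_at is None else min(stop_at, len(source))
    while op.offset < end:
        op.process(source[op.offset])


source = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

first = KeyedSum()
run(first, source, stop_at=2)
checkpoint = first.snapshot()
run(first, source, stop_at=4)   # progress made here is lost in the "crash"

recovered = KeyedSum()          # fresh instance after the simulated failure
recovered.restore(checkpoint)
run(recovered, source)          # replays from the checkpointed offset

print(recovered.state)          # {'a': 9, 'b': 6}
```

The recovered operator ends with the same state as an uninterrupted run because records 2 and 3 are replayed against the state as it was before they were first processed. Any side effects emitted between the checkpoint and the crash would still be repeated, which is why end-to-end exactly-once additionally needs transactional or idempotent sinks.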

04

Unit 4: Stream Processing Patterns and Analytics

Pattern detection (SQL MATCH_RECOGNIZE, CEP patterns), Sessionization and funnel analysis, Real-time aggregations and materialized views, Join strategies (stream-stream, stream-table), Time-windowed joins and interval joins, Anomaly detection (isolation forests, statistical methods), Real-time dashboards (Grafana + Prometheus).
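
Sessionization is easiest to grasp on a small finite input. The sketch below groups click timestamps into sessions per user with an inactivity gap. It is a batch-style illustration that sorts a collected list, whereas a streaming engine would merge session windows incrementally. The function name sessionize and the convention that a pause of at least gap starts a new session are assumptions of this sketch.

```python
def sessionize(events, gap):
    """Group (user, timestamp) events into per-user sessions.

    Two consecutive events of a user share a session when they are
    less than `gap` apart. Returns {user: [(start, end, count), ...]}.
    """
    sessions_by_user = {}
    for user, ts in sorted(events):
        sessions = sessions_by_user.setdefault(user, [])
        if sessions and ts - sessions[-1][1] < gap:
            start, _, n = sessions[-1]
            sessions[-1] = (start, ts, n + 1)   # extend the open session
        else:
            sessions.append((ts, ts, 1))        # start a new session
    return sessions_by_user


clicks = [("u1", 0), ("u1", 20), ("u2", 5), ("u1", 100), ("u1", 10)]
print(sessionize(clicks, gap=30))
# {'u1': [(0, 20, 3), (100, 100, 1)], 'u2': [(5, 5, 1)]}
```

The per-session tuples (start, end, count) are the usual starting point for funnel analysis, where each session is then checked for an ordered sequence of steps.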

05

Unit 5: Advanced Topics and Production Systems

Stream-table duality and changelog semantics, Upserts and primary key handling, Change Data Capture (CDC) with Debezium, Streaming ETL/ELT pipelines, Backpressure handling and resource management, Multi-tenancy and resource isolation, Monitoring and observability (metrics, tracing), Deployment strategies (Kubernetes operators).
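
Stream-table duality says that a table is what you get by replaying a changelog, and a changelog is what you get by recording every change to a table. A minimal sketch of the first direction, with upserts and deletes, follows. The function name apply_changelog is hypothetical, and representing a delete as a None value is an assumption modeled on the null-value tombstones used by Kafka's compacted topics.

```python
def apply_changelog(changelog):
    """Materialize a table (dict) from an ordered stream of (key, value).

    A non-None value upserts the row for that primary key.
    A None value is treated as a tombstone and deletes the row.
    """
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)   # delete; ignore unknown keys
        else:
            table[key] = value     # insert or overwrite (upsert)
    return table


changelog = [
    ("k1", 1),
    ("k2", 5),
    ("k1", 2),      # update of k1
    ("k2", None),   # tombstone: delete k2
    ("k3", 7),
]
print(apply_changelog(changelog))  # {'k1': 2, 'k3': 7}
```

Replaying the same changelog always yields the same table, which is the property CDC pipelines rely on when they rebuild a downstream copy of a database from a captured change stream.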