Assembling the Puzzle

Presented at Beam College 2026

When source systems emit only partial updates to conserve network bandwidth, data engineers face the complex task of reconstructing complete entities in real time. In this session, we take a deep dive into a high-performance, SCD-like (slowly changing dimension) streaming pipeline that reconstructs full entities from partial data before writing them to the data warehouse.

The core of our solution is a custom two-tiered state backend. By combining Apache Beam’s native, low-latency state API (Tier 1) with an external third-party data store (Tier 2), we overcome the usual memory and throughput limitations. Join us to explore how this two-tiered design, together with Beam timers, reduces external database lookups, cuts network latency, and improves the performance of stateful streaming pipelines.
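
To make the two-tier idea concrete, here is a minimal plain-Python sketch. It is not the pipeline presented in the session and does not use the Beam API: a dict stands in for Beam's per-key state (Tier 1), the hypothetical `ExternalStore` class stands in for the third-party store (Tier 2), and `fire_timers` imitates a Beam timer that evicts idle keys. The class names, the TTL, and the write-through on every update are all assumptions made for illustration.

```python
from typing import Dict, Optional

Entity = Dict[str, object]


class ExternalStore:
    """Hypothetical stand-in for the Tier 2 data store; counts reads."""

    def __init__(self, rows: Optional[Dict[str, Entity]] = None) -> None:
        self.rows: Dict[str, Entity] = dict(rows or {})
        self.reads = 0

    def get(self, key: str) -> Optional[Entity]:
        self.reads += 1
        row = self.rows.get(key)
        return dict(row) if row is not None else None

    def put(self, key: str, entity: Entity) -> None:
        self.rows[key] = dict(entity)


class TwoTierState:
    """Tier 1 is an in-memory dict (standing in for Beam state);
    Tier 2 is consulted only when Tier 1 has no entry for the key."""

    def __init__(self, store: ExternalStore, ttl_seconds: float) -> None:
        self.store = store
        self.ttl = ttl_seconds
        self.tier1: Dict[str, Entity] = {}
        self.expiry: Dict[str, float] = {}

    def apply_partial(self, key: str, partial: Entity, now: float) -> Entity:
        entity = self.tier1.get(key)
        if entity is None:
            # Tier 1 miss: fall back to one external lookup.
            entity = self.store.get(key) or {}
        # Overlay the partial update on the last known full entity.
        entity = {**entity, **partial}
        self.tier1[key] = entity
        self.expiry[key] = now + self.ttl  # reset the eviction "timer"
        self.store.put(key, entity)        # assumed write-through to Tier 2
        return dict(entity)

    def fire_timers(self, now: float) -> None:
        # Imitates a timer callback: drop keys idle past their TTL.
        for key in [k for k, t in self.expiry.items() if t <= now]:
            del self.tier1[key]
            del self.expiry[key]


store = ExternalStore({"u1": {"id": "u1", "name": "Ada", "city": "London"}})
state = TwoTierState(store, ttl_seconds=60)
first = state.apply_partial("u1", {"city": "Paris"}, now=0)    # Tier 1 miss
second = state.apply_partial("u1", {"name": "Ada L."}, now=10)  # Tier 1 hit
```

In this sketch, the second update for `u1` is served entirely from Tier 1, so the external store is read only once for the two updates. Once `fire_timers` evicts the key, the next update pays one more external read and then the key is warm again.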

Instructor(s):