Apache Beam

Language: Java

Big Data / Streaming / Batch Processing

Apache Beam was created to provide a consistent model for building portable data processing pipelines. It decouples pipeline logic from the execution engine, so developers can write code once and run it on various runners, making it well suited to hybrid cloud and on-premises processing.

Apache Beam is an open-source unified programming model for defining and executing data processing pipelines, both batch and streaming. It provides Java, Python, and Go SDKs and can run on multiple runners like Flink, Spark, and Dataflow.

Installation

maven:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.49.0</version>
</dependency>

gradle:

implementation 'org.apache.beam:beam-sdks-java-core:2.49.0'
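
To execute a pipeline you also need a runner artifact on the classpath. For local development this is typically the Direct Runner, kept at the same version as the SDK:

gradle: implementation 'org.apache.beam:beam-runners-direct-java:2.49.0'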

Usage

Beam pipelines are defined in terms of PCollections (immutable, distributed datasets), PTransforms, and I/O connectors acting as sources and sinks. The model supports windowing, triggers, aggregations, and joins for streaming data, as well as standard batch transformations.

Simple word count

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.values.*;

Pipeline pipeline = Pipeline.create();
pipeline.apply(TextIO.read().from("input.txt"))
        // Split each line into words.
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("\\s+"))))
        // Count occurrences per distinct word, yielding KV<String, Long>.
        .apply(Count.perElement())
        // Format each count as "word:count"; the lambda parameter needs an
        // explicit type so Java can infer the transform's input type.
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ":" + kv.getValue()))
        .apply(TextIO.write().to("output.txt"));
pipeline.run().waitUntilFinish();

Builds a simple batch pipeline that reads text, splits words, counts occurrences, and writes results.
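
Pipeline.create() with no arguments uses the runner configured in PipelineOptions, falling back to the Direct Runner when it is on the classpath. A minimal sketch of selecting a runner explicitly (any other runner class, such as FlinkRunner, works the same way given its artifact):

// PipelineOptionsFactory is in org.apache.beam.sdk.options;
// DirectRunner is in org.apache.beam.runners.direct.
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(DirectRunner.class);
Pipeline pipeline = Pipeline.create(options);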

Windowed stream processing

PCollection<String> stream = pipeline.apply( ... );
stream.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
      .apply(Count.perElement());

Counts elements in fixed one-minute windows for streaming data.
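
By default such windows fire once, when the watermark passes the end of the window. Triggers refine that behavior; a minimal sketch of early and late firings, using classes from org.apache.beam.sdk.transforms.windowing:

stream.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        // Emit a speculative result 30 seconds after the first element in a pane,
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30))))
        // accept data up to five minutes behind the watermark,
        .withAllowedLateness(Duration.standardMinutes(5))
        // and include previously fired elements in each new pane.
        .accumulatingFiredPanes())
      .apply(Count.perElement());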

Using ParDo for custom transformations

stream.apply(ParDo.of(new DoFn<String, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element().toUpperCase());
  }
}));

Applies a custom transformation to convert each element to uppercase.
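
A DoFn can also route elements to multiple outputs using tagged outputs. A minimal sketch splitting a stream into two PCollections (the valid/invalid distinction is a hypothetical example):

final TupleTag<String> validTag = new TupleTag<String>() {};
final TupleTag<String> invalidTag = new TupleTag<String>() {};
PCollectionTuple tagged = stream.apply(ParDo.of(new DoFn<String, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    if (c.element().isEmpty()) {
      c.output(invalidTag, c.element());  // additional (tagged) output
    } else {
      c.output(validTag, c.element());    // main output
    }
  }
}).withOutputTags(validTag, TupleTagList.of(invalidTag)));
PCollection<String> valid = tagged.get(validTag);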

Connecting to external sources

pipeline.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"));

Reads messages from a Google Cloud Pub/Sub topic for streaming processing.
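
PubsubIO is packaged in the GCP IO module (org.apache.beam:beam-sdks-java-io-google-cloud-platform), which must be added alongside the core SDK. Writing is symmetric; the topic name below is a placeholder:

results.apply(PubsubIO.writeStrings().to("projects/my-project/topics/output-topic"));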

Combining multiple PCollections

PCollectionList.of(pc1).and(pc2).apply(Flatten.pCollections());

Merges multiple PCollections into a single PCollection.
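
For the keyed joins mentioned under Usage, CoGroupByKey groups several keyed PCollections by key. A minimal sketch, assuming two PCollection<KV<String, String>> inputs named emails and phones (classes from org.apache.beam.sdk.transforms.join):

final TupleTag<String> emailsTag = new TupleTag<String>();
final TupleTag<String> phonesTag = new TupleTag<String>();
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(emailsTag, emails)
        .and(phonesTag, phones)
        .apply(CoGroupByKey.create());
// Each CoGbkResult exposes the joined values per input:
// result.getAll(emailsTag) and result.getAll(phonesTag).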

Error Handling

Pipeline.PipelineExecutionException: Thrown when a transform fails at runtime; the underlying error is available via getCause(). Check transforms, IO sources, and runner configuration.
IOException: Thrown when reading from or writing to external sources. Verify file paths, network access, and permissions.
IllegalArgumentException: Usually indicates an invalid transform parameter or a missing type descriptor. Check type specifications and transform arguments for correctness.
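
A minimal sketch of surfacing a runtime failure at submission time (with the Direct Runner, exceptions thrown inside a DoFn are rethrown wrapped in Pipeline.PipelineExecutionException):

try {
  pipeline.run().waitUntilFinish();
} catch (Pipeline.PipelineExecutionException e) {
  // getCause() holds the original exception from the failing transform.
  e.getCause().printStackTrace();
}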

Best Practices

Choose the appropriate runner (Flink, Spark, Dataflow) based on use case.

Use windowing and triggers for accurate streaming computations.

Minimize state and side inputs to reduce resource consumption.

Monitor pipeline metrics and logs for optimization.

Write unit tests using Beam’s TestPipeline for correctness, as sketched below.
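
A minimal sketch of such a test, using TestPipeline and PAssert from org.apache.beam.sdk.testing (assumes beam-runners-direct-java on the test classpath; enableAbandonedNodeEnforcement(false) allows use outside a JUnit @Rule):

TestPipeline p = TestPipeline.create().enableAbandonedNodeEnforcement(false);
PCollection<String> output = p
    .apply(Create.of("a", "b"))
    .apply(MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));
// PAssert registers an assertion that is checked when the pipeline runs.
PAssert.that(output).containsInAnyOrder("A", "B");
p.run().waitUntilFinish();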