Language: Java
Big Data / Streaming / Batch Processing
Apache Beam was created to provide a consistent model for building portable data processing pipelines. It decouples pipeline logic from the execution engine, allowing developers to write code once and run it on various runners, which makes it well suited to hybrid cloud and on-premises processing.
Apache Beam is an open-source, unified programming model for defining and executing both batch and streaming data processing pipelines. It provides Java, Python, and Go SDKs and runs on multiple runners, such as Apache Flink, Apache Spark, and Google Cloud Dataflow.
Maven:
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-core</artifactId>
    <version>2.49.0</version>
</dependency>

Gradle:
implementation 'org.apache.beam:beam-sdks-java-core:2.49.0'

Beam allows defining pipelines using PCollections, transforms, and sinks. It supports windowing, triggers, aggregations, and joins for streaming data, as well as standard batch transformations.
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.values.*;

Pipeline pipeline = Pipeline.create();
pipeline.apply(TextIO.read().from("input.txt"))
    // Split each line into individual words.
    .apply(FlatMapElements.into(TypeDescriptors.strings())
        .via((String line) -> Arrays.asList(line.split(" "))))
    // Count occurrences of each word.
    .apply(Count.perElement())
    // Format each KV<String, Long> as "word:count"; the explicit
    // lambda parameter type is required for Java type inference.
    .apply(MapElements.into(TypeDescriptors.strings())
        .via((KV<String, Long> kv) -> kv.getKey() + ":" + kv.getValue()))
    // Writes sharded output files with the prefix "output.txt".
    .apply(TextIO.write().to("output.txt"));
pipeline.run().waitUntilFinish();

Builds a simple batch pipeline that reads text, splits it into words, counts occurrences, and writes the results.
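The pipeline above falls back to the default DirectRunner because no options are supplied. To target a different runner you would typically construct PipelineOptions from command-line arguments; a minimal sketch, assuming args comes from a main method and the matching runner artifact is on the classpath:

import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Parses flags such as --runner=FlinkRunner from the command line;
// defaults to the DirectRunner when no runner flag is given.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline pipeline = Pipeline.create(options);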
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

PCollection<String> stream = pipeline.apply( ... );
stream.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply(Count.perElement());

Counts elements in fixed one-minute windows for streaming data.
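Windowing can be combined with triggers to control when results are emitted. A minimal sketch, reusing the same stream; the firing delay and lateness values here are arbitrary example choices:

import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;

stream.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        // Emit a speculative result 10s after the first element arrives,
        // a final result when the watermark passes the end of the window,
        // and updated results for late data within the allowed lateness.
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(10)))
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.standardMinutes(5))
        // Each firing includes all elements seen so far in the window.
        .accumulatingFiredPanes())
    .apply(Count.perElement());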
stream.apply(ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(c.element().toUpperCase());
    }
}));

Applies a custom transformation to convert each element to uppercase.
pipeline.apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"));

Reads messages from a Google Cloud Pub/Sub topic for streaming processing. PubsubIO (org.apache.beam.sdk.io.gcp.pubsub.PubsubIO) requires the additional beam-sdks-java-io-google-cloud-platform module.
PCollectionList.of(pc1).and(pc2).apply(Flatten.pCollections());

Merges multiple PCollections into a single PCollection.
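Beam also supports joins over keyed collections via CoGroupByKey. A minimal sketch, assuming two hypothetical inputs emails and orders, each a PCollection<KV<String, String>> keyed by user id:

import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Tags distinguish the two inputs inside the joined result.
final TupleTag<String> emailsTag = new TupleTag<>();
final TupleTag<String> ordersTag = new TupleTag<>();

// Groups values from both inputs by key; each CoGbkResult holds the
// emails and orders that share a key, retrievable via the tags.
PCollection<KV<String, CoGbkResult>> joined =
    KeyedPCollectionTuple.of(emailsTag, emails)
        .and(ordersTag, orders)
        .apply(CoGroupByKey.create());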
Choose the appropriate runner (Flink, Spark, Dataflow) based on use case.
Use windowing and triggers for accurate streaming computations.
Minimize state and side inputs to reduce resource consumption.
Monitor pipeline metrics and logs for optimization (a metrics counter sketch follows this list).
Write unit tests using Beam’s TestPipeline for correctness (see the test sketch below).
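A minimal sketch of a custom metric inside a DoFn; the "wordcount" namespace and counter name are arbitrary labels chosen for illustration:

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;

stream.apply(ParDo.of(new DoFn<String, String>() {
    // Counter values are reported to the runner's metrics system.
    private final Counter emptyLines = Metrics.counter("wordcount", "empty-lines");

    @ProcessElement
    public void processElement(ProcessContext c) {
        if (c.element().isEmpty()) {
            emptyLines.inc();
        }
        c.output(c.element());
    }
}));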
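And a minimal TestPipeline sketch, assuming JUnit 4 and the DirectRunner on the test classpath; the uppercase transform under test is the illustrative one from above:

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.junit.Rule;
import org.junit.Test;

public class UppercaseTest {
    // TestPipeline runs the pipeline locally inside the test JVM.
    @Rule public final transient TestPipeline p = TestPipeline.create();

    @Test
    public void uppercasesElements() {
        PCollection<String> output = p
            .apply(Create.of("beam", "java"))
            .apply(MapElements.into(TypeDescriptors.strings())
                .via((String s) -> s.toUpperCase()));
        // Assert on the final contents regardless of element order.
        PAssert.that(output).containsInAnyOrder("BEAM", "JAVA");
        p.run().waitUntilFinish();
    }
}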