Language: Java
Big Data / Streaming / ML
Apache Flink is an open-source framework for distributed stream and batch data processing, built for large-scale, real-time workloads. It supports event-time processing, stateful computations, and exactly-once semantics. Flink ML extends Flink with scalable machine learning algorithms that run on both batch and streaming data.
Maven:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java</artifactId>
    <version>1.17.2</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-ml-lib</artifactId>
    <version>2.2.0</version>
</dependency>

Gradle:
implementation 'org.apache.flink:flink-streaming-java:1.17.2'
implementation 'org.apache.flink:flink-ml-lib:2.2.0'

Flink processes high-throughput, low-latency data streams and also supports batch analytics. Flink ML provides scalable implementations of algorithms such as linear regression, logistic regression, and k-means clustering on streaming or batch datasets.
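Beyond linear models, Flink ML exposes the same Estimator/Model interface for clustering. The following is a minimal k-means sketch, assuming the Flink ML 2.2 KMeans API and the Table API bridge (flink-table-api-java-bridge) on the classpath; the sample vectors are illustrative.

import org.apache.flink.ml.clustering.kmeans.KMeans;
import org.apache.flink.ml.clustering.kmeans.KMeansModel;
import org.apache.flink.ml.linalg.DenseVector;
import org.apache.flink.ml.linalg.Vectors;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// Toy feature vectors forming two well-separated clusters
DataStream<DenseVector> points = env.fromElements(
    Vectors.dense(0.0, 0.0), Vectors.dense(0.3, 0.1),
    Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 8.8));
Table input = tEnv.fromDataStream(points).as("features");

// Fit a 2-cluster model and append a cluster id column to every row
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(input);
Table clustered = model.transform(input)[0];
clustered.execute().collect().forEachRemaining(System.out::println);

The same fit/transform pattern applies to the other Flink ML estimators, such as logistic regression.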
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Tokenizer is a user-defined FlatMapFunction<String, Tuple2<String, Integer>>
// that emits a (word, 1) tuple for every word in an input line.
DataStream<String> text = env.socketTextStream("localhost", 9999);
text.flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .sum(1)
    .print();
env.execute("Streaming WordCount");

Processes text data from a socket stream and counts the occurrences of each word in real time.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

// Legacy DataSet (batch) API; newer Flink releases favor batch execution on the DataStream API
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> data = env.readTextFile("data.csv");
data.print();

Reads and prints a batch dataset from a CSV file.
import org.apache.flink.ml.regression.linearregression.LinearRegression;
import org.apache.flink.ml.regression.linearregression.LinearRegressionModel;
import org.apache.flink.table.api.Table;

// trainingData and testData are Flink Table objects with "features" and "label" columns
LinearRegression lr = new LinearRegression().setLearningRate(0.01).setMaxIter(100);
LinearRegressionModel model = lr.fit(trainingData);
Table predictions = model.transform(testData)[0];

Trains a linear regression model using Flink ML and generates predictions on test data.
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

text.flatMap(new Tokenizer())
    .keyBy(value -> value.f0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .sum(1)
    .print();

Counts words in tumbling 5-second windows on a data stream.
// Maintain running totals and custom state for complex streaming analytics

Flink allows storing and updating state for each key in the stream, enabling advanced real-time computations.
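A minimal sketch of such keyed state, assuming events arrive as (key, amount) tuples; the class and state names are illustrative.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Emits a running total per key using keyed ValueState
public class RunningTotal extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> totalState;

    @Override
    public void open(Configuration parameters) {
        totalState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("runningTotal", Types.LONG));
    }

    @Override
    public void flatMap(Tuple2<String, Long> event, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = totalState.value();
        long updated = (current == null ? 0L : current) + event.f1;
        totalState.update(updated);                // state is scoped to the current key
        out.collect(Tuple2.of(event.f0, updated));
    }
}

// Usage: events.keyBy(e -> e.f0).flatMap(new RunningTotal()).print();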
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
// Use a Kafka consumer to read data streams into Flink

Flink can ingest streams from Kafka, RabbitMQ, or other messaging systems for large-scale processing.
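A minimal ingestion sketch using the FlinkKafkaConsumer imported above; it assumes the separate flink-connector-kafka dependency is on the classpath, and the broker address, topic, and group id are placeholders (newer Flink releases favor the KafkaSource builder API).

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Kafka connection settings (placeholder broker and consumer group)
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "flink-consumer-group");

// Read the "events" topic as a stream of strings
DataStream<String> kafkaStream = env.addSource(
    new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

kafkaStream.print();
env.execute("Kafka Ingestion");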
Use event-time processing when timestamps matter for streaming analytics.
Leverage checkpointing and state backends for fault-tolerant streaming (see the sketch after these tips).
Keep state size manageable for high-throughput streaming pipelines.
Combine Flink ML with Flink streaming for real-time model predictions.
Monitor throughput, latency, and backpressure for optimal performance.
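A minimal sketch of the checkpointing and event-time tips above, assuming events are (key, timestampMillis) tuples; the checkpoint interval and out-of-orderness bound are illustrative.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Checkpoint every 10 seconds with exactly-once guarantees for fault tolerance
env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

// Event-time watermarks: each event carries its timestamp (ms) in field f1,
// and up to 5 seconds of out-of-order arrival is tolerated
WatermarkStrategy<Tuple2<String, Long>> watermarks = WatermarkStrategy
    .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
    .withTimestampAssigner((event, previousTimestamp) -> event.f1);

// Apply before event-time windowing:
// stream.assignTimestampsAndWatermarks(watermarks)
//       .keyBy(e -> e.f0)
//       .window(TumblingEventTimeWindows.of(Time.seconds(5)))
//       .sum(1);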