Language: Java
Big Data / Distributed Computing / ML
Spark was originally developed at UC Berkeley’s AMPLab to simplify large-scale data processing. Spark MLlib is Spark’s machine learning library, enabling high-performance distributed computation on massive datasets. It is widely used in data engineering, analytics, and AI pipelines.
Apache Spark is an open-source distributed computing system for big data processing. Spark MLlib provides scalable machine learning algorithms for classification, regression, clustering, and collaborative filtering, usable in Java, Scala, and Python.
Maven:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.5.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.12</artifactId>
    <version>3.5.1</version>
</dependency>

Gradle:

implementation 'org.apache.spark:spark-core_2.12:3.5.1'
implementation 'org.apache.spark:spark-mllib_2.12:3.5.1'

Spark provides RDDs, DataFrames, and Datasets for distributed data processing. MLlib includes scalable implementations of machine learning algorithms, pipelines, feature transformers, and model evaluation methods.
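As a small illustration of the DataFrame API mentioned above, the following sketch builds a DataFrame from an in-memory list; the column names and values are assumptions for illustration:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Define a two-column schema; the names are illustrative.
StructType schema = new StructType()
    .add("name", DataTypes.StringType)
    .add("score", DataTypes.DoubleType);

// A few in-memory rows to turn into a distributed DataFrame.
List<Row> rows = Arrays.asList(
    RowFactory.create("a", 1.0),
    RowFactory.create("b", 2.5));

Dataset<Row> df = spark.createDataFrame(rows, schema);
df.show();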
import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession.builder()
.appName("Spark Example")
.master("local[*]")
.getOrCreate();

Initializes a Spark session for distributed computing tasks.
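The builder also accepts runtime configuration before getOrCreate(), and a session should be stopped when the application finishes. A minimal sketch; the property value shown is illustrative, not a recommendation:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("Configured Example")
    .master("local[*]")
    // Illustrative: shrink the default 200 shuffle partitions for a small local job.
    .config("spark.sql.shuffle.partitions", "8")
    .getOrCreate();

// Release resources when the application is done.
spark.stop();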
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset<Row> df = spark.read().format("csv").option("header", true).load("data.csv");
df.show();

Loads a CSV dataset into a Spark DataFrame and prints the first few rows.
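By default every CSV column is read as a string. If you want Spark to guess column types as well, the reader accepts an inferSchema option at the cost of an extra pass over the file; a minimal sketch, assuming the same data.csv:

Dataset<Row> typed = spark.read()
    .format("csv")
    .option("header", true)
    .option("inferSchema", true) // extra pass over the data to infer column types
    .load("data.csv");
typed.printSchema(); // verify what was inferred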
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
LinearRegression lr = new LinearRegression()
.setLabelCol("label")
.setFeaturesCol("features");
LinearRegressionModel model = lr.fit(trainingData);
Dataset<Row> predictions = model.transform(testData);

Trains a linear regression model and makes predictions on test data using Spark MLlib.
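MLlib estimators such as LinearRegression expect all input features packed into a single vector column. The sketch below shows one common way to build that column with VectorAssembler and to score the resulting predictions with RegressionEvaluator; the DataFrame df and its column names x1 and x2 are assumptions for illustration:

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.evaluation.RegressionEvaluator;

// Pack raw numeric columns into the single "features" vector column MLlib expects.
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{"x1", "x2"}) // hypothetical column names
    .setOutputCol("features");
Dataset<Row> assembled = assembler.transform(df);

// Score predictions with root-mean-squared error.
RegressionEvaluator evaluator = new RegressionEvaluator()
    .setLabelCol("label")
    .setPredictionCol("prediction")
    .setMetricName("rmse");
double rmse = evaluator.evaluate(predictions);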
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
KMeans kmeans = new KMeans().setK(3).setSeed(1L);
KMeansModel kmModel = kmeans.fit(dataset);
Dataset<Row> predictions = kmModel.transform(dataset);

Performs k-means clustering on a dataset with Spark MLlib.
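To judge how well the clusters separate, MLlib ships a ClusteringEvaluator that computes the silhouette score over the clustered output; a minimal sketch using the predictions DataFrame from above:

import org.apache.spark.ml.evaluation.ClusteringEvaluator;

// Silhouette ranges from -1 to 1; values near 1 mean well-separated clusters.
ClusteringEvaluator evaluator = new ClusteringEvaluator();
double silhouette = evaluator.evaluate(predictions);
System.out.println("Silhouette = " + silhouette);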
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{tokenizer, hashingTF, lr});
PipelineModel model = pipeline.fit(trainingData);

Builds an ML pipeline combining preprocessing steps and a learning algorithm for streamlined training and evaluation; the stages tokenizer, hashingTF, and lr are constructed in the sketch below.
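The three stages referenced above must be constructed before the pipeline. A minimal sketch of one common text-classification setup, assuming the training DataFrame has a text column and a label column; the column names and feature-vector size are illustrative:

import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.classification.LogisticRegression;

// Split raw text into words.
Tokenizer tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words");

// Hash words into a fixed-size term-frequency feature vector.
HashingTF hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("features")
    .setNumFeatures(1000); // illustrative vector size

// Final estimator stage: a logistic regression classifier.
LogisticRegression lr = new LogisticRegression()
    .setLabelCol("label")
    .setFeaturesCol("features");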
import org.apache.spark.sql.streaming.StreamingQuery;
Dataset<Row> streamingDF = spark.readStream().format("socket").option("host", "localhost").option("port", 9999).load();
StreamingQuery query = streamingDF.writeStream().format("console").start();
query.awaitTermination(); // block so the query keeps running until stopped or failed

Processes real-time streaming data using Spark Structured Streaming.
Cache frequently used RDDs/DataFrames to improve performance (see the sketch after this list).
Use pipelines for repeatable preprocessing and model training.
Partition large datasets for distributed computation; repartition() appears in the sketch below.
Monitor job execution using the Spark UI.
Combine batch and streaming workflows as needed for real-time analytics.
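As a concrete illustration of the caching and partitioning tips above, here is a minimal sketch; the DataFrame df and the partition count are assumptions for illustration:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Keep a frequently reused DataFrame in memory across actions
// (Dataset.cache() defaults to MEMORY_AND_DISK storage).
Dataset<Row> cached = df.cache();
cached.count(); // the first action materializes the cache

// Redistribute the data for wider parallelism; 16 partitions is illustrative.
Dataset<Row> repartitioned = cached.repartition(16);

// Release the cached blocks once the DataFrame is no longer needed.
cached.unpersist();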