Apache Spark

Language: Java

Big Data / Distributed Computing / ML

Spark was originally developed at UC Berkeley’s AMPLab to simplify large-scale data processing. Spark MLlib extends Spark for machine learning tasks, enabling high-performance distributed computations on massive datasets. It is widely used in data engineering, analytics, and AI pipelines.

Apache Spark is an open-source distributed computing system for big data processing. Spark MLlib provides scalable machine learning algorithms for classification, regression, clustering, and collaborative filtering, usable in Java, Scala, and Python.

Installation

maven:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.5.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.12</artifactId>
    <version>3.5.1</version>
</dependency>

gradle:
implementation 'org.apache.spark:spark-core_2.12:3.5.1'
implementation 'org.apache.spark:spark-mllib_2.12:3.5.1'

Usage

Spark provides RDDs, DataFrames, and Datasets for distributed data processing. MLlib includes scalable implementations of machine learning algorithms, pipelines, feature transformers, and model evaluation methods.
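
As a minimal sketch of the DataFrame abstraction, the snippet below builds a small DataFrame from in-memory rows, using the SparkSession created in the next subsection (the id/name schema and values are illustrative):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Define a two-column schema and a few in-memory rows
StructType schema = new StructType()
    .add("id", DataTypes.IntegerType)
    .add("name", DataTypes.StringType);
List<Row> rows = Arrays.asList(
    RowFactory.create(1, "alice"),
    RowFactory.create(2, "bob"));
Dataset<Row> people = spark.createDataFrame(rows, schema);
people.show();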

Creating a Spark session

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("Spark Example")
    .master("local[*]")
    .getOrCreate();

Initializes a SparkSession, the entry point for the DataFrame and MLlib APIs. The local[*] master runs Spark locally with one worker thread per CPU core; on a cluster, the master is normally supplied externally (e.g. by spark-submit).
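
The session should be stopped once the application is done:

spark.stop();  // release the resources held by the session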

Loading a dataset

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read().format("csv")
    .option("header", true).load("data.csv");
df.show();

Loads a CSV dataset into a Spark DataFrame and prints the first few rows.
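
By default every CSV column is loaded as a string. A common variant, sketched below against the same data.csv, asks Spark to infer column types (inferSchema adds an extra pass over the file):

Dataset<Row> typed = spark.read().format("csv")
    .option("header", true)
    .option("inferSchema", true)  // sample the file to guess column types
    .load("data.csv");
typed.printSchema();              // show the inferred column names and types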

Linear regression with MLlib

import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;

LinearRegression lr = new LinearRegression()
    .setLabelCol("label")        // column holding the target values
    .setFeaturesCol("features"); // column holding the feature vectors
LinearRegressionModel model = lr.fit(trainingData);
Dataset<Row> predictions = model.transform(testData);

Trains a linear regression model and makes predictions on test data using Spark MLlib.
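
MLlib estimators expect the features packed into a single vector column. A minimal sketch of producing that column with VectorAssembler, assuming a hypothetical rawData DataFrame with numeric columns x1 and x2:

import org.apache.spark.ml.feature.VectorAssembler;

// Combine raw numeric columns into the single vector column MLlib expects
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{"x1", "x2"})  // hypothetical input columns
    .setOutputCol("features");
Dataset<Row> trainingData = assembler.transform(rawData);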

K-Means clustering

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;

KMeans kmeans = new KMeans()
    .setK(3)       // number of clusters
    .setSeed(1L);  // fixed seed for reproducible centroids
KMeansModel kmModel = kmeans.fit(dataset);
Dataset<Row> predictions = kmModel.transform(dataset);

Performs k-means clustering on a dataset with Spark MLlib.
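
Clustering quality can be scored without labels using the silhouette metric. A short sketch with ClusteringEvaluator, applied to the predictions DataFrame from above:

import org.apache.spark.ml.evaluation.ClusteringEvaluator;

// Silhouette ranges from -1 to 1; values near 1 mean well-separated clusters
ClusteringEvaluator evaluator = new ClusteringEvaluator();
double silhouette = evaluator.evaluate(predictions);
System.out.println("Silhouette = " + silhouette);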

Pipeline with transformers and estimators

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;

Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{tokenizer, hashingTF, lr});
PipelineModel model = pipeline.fit(trainingData);

Builds an ML pipeline combining preprocessing steps and a learning algorithm for streamlined training and evaluation.
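
The tokenizer, hashingTF, and lr stages are not defined in the snippet above; one plausible definition, following the text-classification example from the Spark documentation (the text, words, and features column names are illustrative):

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;

// Split raw text into words, hash the words into feature vectors, then classify
Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
HashingTF hashingTF = new HashingTF().setInputCol("words").setOutputCol("features");
LogisticRegression lr = new LogisticRegression().setMaxIter(10);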

Streaming data processing

import org.apache.spark.sql.streaming.StreamingQuery;

Dataset<Row> streamingDF = spark.readStream().format("socket")
    .option("host", "localhost").option("port", 9999).load();
StreamingQuery query = streamingDF.writeStream().format("console").start();
query.awaitTermination();  // block the driver until the stream stops

Processes real-time streaming data using Spark Structured Streaming.
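
The socket source is intended for testing (it can be fed with nc -lk 9999), and the console sink prints each micro-batch to stdout; awaitTermination() keeps the driver alive while the query runs.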

Error Handling

org.apache.spark.SparkException: A general wrapper for job failures; check Spark configuration, cluster setup, and data sources for the underlying cause.
OutOfMemoryError: Increase executor or driver memory, raise the number of partitions so each task processes less data, or avoid collecting large results to the driver (see the example after this list).
IllegalArgumentException: Verify the schema, feature column types, and algorithm requirements for Spark MLlib (estimators expect a vector features column and a numeric label column).
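
For the OutOfMemoryError case, memory is usually raised at submit time; an illustrative spark-submit invocation (the class name, jar, and 4g values are placeholders):

spark-submit \
  --class com.example.MyApp \
  --executor-memory 4g \
  --driver-memory 4g \
  my-app.jar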

Best Practices

Cache frequently reused RDDs/DataFrames to avoid recomputation (see the sketch after this list).

Use pipelines for repeatable preprocessing and model training.

Partition large datasets for distributed computation.

Monitor job execution using the Spark UI.

Combine batch and streaming workflows as needed for real-time analytics.
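
A minimal sketch of the caching and partitioning advice above, assuming a hypothetical events.parquet input with a numeric value column (the partition count of 200 is arbitrary and workload-dependent):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> events = spark.read().parquet("events.parquet")  // hypothetical input
    .repartition(200);                       // spread work across 200 tasks
events.cache();                              // keep the data in memory across reuses
long total = events.count();                 // first action materializes the cache
long positive = events.filter("value > 0").count();  // served from the cache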