Apache NiFi

Language: Java

Data Flow / ETL / Integration

NiFi was created by the NSA and later contributed to the Apache Software Foundation. It provides a visual interface to build data pipelines with minimal coding. NiFi is widely used for streaming, ETL, and IoT data flows, enabling reliable and scalable data integration across heterogeneous systems.

Apache NiFi is an open-source data integration and automation tool for designing, managing, and monitoring data flows. It supports real-time data ingestion, routing, transformation, and delivery between systems.

Installation

maven:
<dependency>
    <groupId>org.apache.nifi</groupId>
    <artifactId>nifi-api</artifactId>
    <version>1.25.0</version>
</dependency>

gradle: implementation 'org.apache.nifi:nifi-api:1.25.0'

Note that nifi-api is the library for building custom NiFi components; NiFi itself is installed from the binary distribution and started with bin/nifi.sh start.

Usage

NiFi provides processors to ingest, transform, and route data. Developers build flow-based pipelines in its web-based UI, automate flow management through the REST API, and extend NiFi with custom processors written against the nifi-api library shown above. It supports scheduling, provenance tracking, and backpressure for robust data management.
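
Custom processors are the main programmatic use of the nifi-api dependency shown above. A minimal sketch follows; the class name and attribute are illustrative only, and to run it the class must be packaged as a NAR and placed in NiFi's lib directory:

import java.util.Set;

import org.apache.nifi.annotation.documentation.CapabilityDescription;
import org.apache.nifi.annotation.documentation.Tags;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;

@Tags({"example"})
@CapabilityDescription("Stamps an attribute on each incoming flow file.")
public class StampAttributeProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Flow files that were stamped")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return; // nothing queued on the incoming connection
        }
        // Flow files are immutable; putAttribute returns an updated reference
        flowFile = session.putAttribute(flowFile, "stamped.by", "StampAttributeProcessor");
        session.transfer(flowFile, REL_SUCCESS);
    }
}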

Reading from a file and writing to another directory

// Configure a GetFile processor: set its Input Directory property to the folder to watch
// Configure a PutFile processor: set its Directory property to the destination folder
// Connect GetFile's success relationship to PutFile in the NiFi UI

Moves files from the input to the output directory, with NiFi providing queuing, monitoring, and failure routing along the way.
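
Flows are normally assembled in the UI, but the same GetFile step can be created over NiFi's REST API. A minimal sketch using Java's built-in HttpClient, assuming an unsecured NiFi at localhost:8080 and a placeholder process-group UUID (a secured instance additionally needs a bearer token):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateGetFileProcessor {
    public static void main(String[] args) throws Exception {
        String processGroupId = "REPLACE-WITH-PROCESS-GROUP-UUID";
        // ProcessorEntity payload: new components are created with revision version 0
        String body = """
                {
                  "revision": { "version": 0 },
                  "component": {
                    "type": "org.apache.nifi.processors.standard.GetFile",
                    "position": { "x": 0.0, "y": 0.0 },
                    "config": { "properties": { "Input Directory": "/data/in" } }
                  }
                }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/nifi-api/process-groups/"
                        + processGroupId + "/processors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}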

HTTP ingestion

// Use a ListenHTTP processor: set its Listening Port (e.g. 8081); the default Base Path is "contentListener"
// Each incoming request body becomes a flow file that can be routed onward for transformation or storage

Ingests data from HTTP endpoints into the NiFi pipeline.
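
A data source only needs to issue a plain HTTP POST to reach ListenHTTP. A minimal client sketch, assuming ListenHTTP listens on port 8081 with its default Base Path of contentListener:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PostToListenHttp {
    public static void main(String[] args) throws Exception {
        // Each POST body becomes one flow file in the NiFi pipeline
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/contentListener"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"sensor\":\"s1\",\"value\":42}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode()); // 200 on success
    }
}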

Using ExecuteScript for custom processing

// Add an ExecuteScript processor with a Groovy, Jython (Python), or JavaScript script that reads and rewrites flow files

Enables custom transformations within the data flow.
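
ExecuteScript accepts Groovy, Jython, or JavaScript rather than plain Java; the Java-native route to the same kind of transformation is a custom processor. A minimal sketch, assuming line-oriented text content and an illustrative uppercase transform:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;

public class UppercaseContentProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Transformed flow files").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // session.write streams the old content in and the new content out
        flowFile = session.write(flowFile, (in, out) -> {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.toUpperCase());
                writer.write('\n');
            }
            writer.flush();
        });
        session.transfer(flowFile, REL_SUCCESS);
    }
}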

Routing based on content

// Use a RouteOnAttribute processor with Expression Language conditions on flow file attributes,
// e.g. a dynamic property large = ${fileSize:gt(1048576)}, to send flow files down different paths

Implements attribute-based routing for dynamic pipeline behavior; RouteOnContent covers matching on the payload itself.
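
The same decision can be expressed in a custom processor. A fragment for an onTrigger method, assuming REL_LARGE and REL_SMALL are declared as Relationship constants like the skeleton under Usage; it mirrors the ${fileSize:gt(1048576)} condition above:

FlowFile flowFile = session.get();
if (flowFile == null) {
    return;
}
long size = flowFile.getSize();             // same value EL exposes as ${fileSize}
if (size > 1_048_576L) {
    session.transfer(flowFile, REL_LARGE);  // route files over 1 MiB one way
} else {
    session.transfer(flowFile, REL_SMALL);  // and everything else the other way
}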

Connecting to Kafka

// Use the PublishKafkaRecord and ConsumeKafkaRecord processors, choosing the variant that
// matches your Kafka client version (e.g. PublishKafkaRecord_2_6 or ConsumeKafkaRecord_2_6)

Enables streaming integration between NiFi pipelines and the Kafka messaging system.
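
On the other side of the topic, any standard Kafka client can feed what NiFi consumes. A minimal sketch of a plain kafka-clients producer writing to a hypothetical topic nifi-ingest that a ConsumeKafkaRecord processor subscribes to (broker address is an assumption):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FeedNifiTopic {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record becomes one message for NiFi's record reader to parse
            producer.send(new ProducerRecord<>("nifi-ingest", "sensor-1", "{\"value\":42}"));
        } // close() flushes pending records
    }
}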

Monitoring data provenance

// NiFi automatically records a provenance event (e.g. RECEIVE, ROUTE, CONTENT_MODIFIED, SEND) for each flow file operation, allowing end-to-end traceability

Provides audit and debugging capabilities for complex pipelines.
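
Provenance can also be inspected programmatically over the REST API. A minimal sketch that fetches a single event by id, assuming an unsecured NiFi at localhost:8080 and an event id taken from the UI's Data Provenance view:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchProvenanceEvent {
    public static void main(String[] args) throws Exception {
        long eventId = 42; // placeholder: copy a real id from the Data Provenance view
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/nifi-api/provenance-events/" + eventId))
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON describing the event and its lineage
    }
}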

Error Handling

Processor failed to execute: Check processor configuration, input data format, and dependencies.
FlowFile queue full: Adjust backpressure thresholds or optimize flow to prevent bottlenecks.
Kafka connection error: Ensure Kafka brokers are running, reachable, and credentials are correct.

Best Practices

Use backpressure to control flow rates for large pipelines.

Leverage NiFi provenance and monitoring features for auditing.

Design modular flows with reusable processors.

Validate input data and handle errors gracefully using failure relationships (see the sketch after this list).

Use NiFi Parameter Contexts for environment-specific configurations.
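
A minimal sketch of the relationship-based error handling mentioned above, as an onTrigger fragment for a custom processor that declares REL_FAILURE alongside REL_SUCCESS (the transform call is hypothetical):

FlowFile flowFile = session.get();
if (flowFile == null) {
    return;
}
try {
    // hypothetical content transformation; failures surface as ProcessException
    flowFile = session.write(flowFile, (in, out) -> transform(in, out));
    session.transfer(flowFile, REL_SUCCESS);
} catch (ProcessException e) {
    getLogger().error("Transform failed for {}", flowFile, e);
    flowFile = session.penalize(flowFile);  // back off before the next attempt
    session.transfer(flowFile, REL_FAILURE); // route bad input aside instead of blocking the flow
}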