Apache Avro

Language: Java

Serialization / Data

Apache Avro is a data serialization system that provides compact, fast binary serialization for Java and many other languages. Created as part of the Hadoop project, it is schema-based and language-agnostic: schemas can evolve over time, rich data structures are supported, and it integrates seamlessly with Hadoop, Kafka, and other big data tools. Its compact binary format and dynamic typing make it efficient for data storage and streaming, and it is widely used in big data ecosystems for storing and transmitting structured data.

Installation

Maven: add the dependency to pom.xml:

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.12.3</version>
</dependency>

Gradle:

implementation 'org.apache.avro:avro:1.12.3'

Usage

Avro schemas are defined in JSON. From a schema you can generate Java classes, or work with records dynamically via GenericRecord. Avro supports serialization, deserialization, and schema evolution, and can encode data in a compact binary format or a human-readable JSON format.

Defining an Avro schema (user.avsc)

{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}

Defines a simple Avro record `User` with `name` and `age` fields.

Generating Java classes

# Using avro-tools
java -jar avro-tools-1.12.3.jar compile schema user.avsc src/main/java

Generates Java classes from the Avro schema for use in serialization/deserialization.
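Alternatively, code generation can be wired into the build instead of running avro-tools by hand. A sketch using the official avro-maven-plugin (the version and directory paths shown are assumptions to adapt to your project):

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.12.3</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With this in place, classes are regenerated from any .avsc files under src/main/avro on every build.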

Serializing a record

import java.io.ByteArrayOutputStream;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

User user = User.newBuilder().setName("Alice").setAge(30).build();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<User> writer = new SpecificDatumWriter<>(User.class);
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(user, encoder);
encoder.flush();
byte[] serializedData = out.toByteArray();

Serializes a `User` object to a compact binary format.

Deserializing a record

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
DatumReader<User> reader = new SpecificDatumReader<>(User.class);
BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(serializedData, null);
User deserializedUser = reader.read(null, decoder);
System.out.println(deserializedUser.getName());

Deserializes binary data back into a `User` object.
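The readable JSON encoding mentioned earlier uses the same DatumWriter with a JsonEncoder in place of the BinaryEncoder. A minimal sketch, using a GenericRecord and an inlined copy of the user.avsc schema so it runs without generated classes (the class and method names are illustrative):

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class JsonEncodeExample {
    static String toJson() throws Exception {
        // Same schema as user.avsc, inlined so the example is self-contained
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"com.example\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // Same write path as the binary example; only the encoder changes
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
        writer.write(user, encoder);
        encoder.flush();
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toJson()); // a JSON rendering of the record
    }
}
```

Useful when inspecting payloads during debugging; switch back to binaryEncoder for production storage.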

Using GenericRecord without code generation

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
Schema schema = new Schema.Parser().parse(new File("user.avsc"));
GenericRecord record = new GenericData.Record(schema);
record.put("name", "Bob");
record.put("age", 25);

Demonstrates dynamic usage of Avro records without generating Java classes.
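A GenericRecord is serialized and deserialized the same way as a generated class, except the writer and reader are constructed from the Schema rather than a Class. A sketch of the full round trip, with the schema inlined so the snippet is self-contained (class and method names are illustrative):

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class GenericRoundTrip {
    static GenericRecord roundTrip() throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"com.example\","
            + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "Bob");
        record.put("age", 25);

        // Write with a GenericDatumWriter built from the schema
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();

        // Read back with a GenericDatumReader built from the same schema
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        return reader.read(null, decoder);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip());
    }
}
```

Note that generic string fields come back as Avro's Utf8 type, so compare them with toString() rather than casting to String.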

Schema evolution

// Add a new optional field in schema without breaking old data
{ "name": "email", "type": ["null", "string"], "default": null }

Supports adding optional fields while maintaining backward and forward compatibility.
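Schema resolution happens at read time: the reader is constructed with both the writer's (old) schema and the reader's (new) schema, and the new optional field is filled from its default. A sketch with the old and new User schemas inlined (class and method names are illustrative):

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionExample {
    static GenericRecord readOldWithNew() throws Exception {
        Schema oldSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");
        Schema newSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"},"
            + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        // Data written with the old schema (no email field)
        GenericRecord oldRecord = new GenericData.Record(oldSchema);
        oldRecord.put("name", "Alice");
        oldRecord.put("age", 30);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(oldSchema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(oldRecord, encoder);
        encoder.flush();

        // Read with (writerSchema, readerSchema): email is filled from its default
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(oldSchema, newSchema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        return reader.read(null, decoder);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readOldWithNew()); // email resolves to its default, null
    }
}
```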

Error Handling

AvroTypeException: Occurs when data does not match the schema type. Ensure the data conforms to the defined schema.
IOException during serialization/deserialization: Check streams, encoders, decoders, and ensure proper closing of resources.
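A concrete sketch of both cases: decoding JSON that does not match the schema raises AvroTypeException, which can be caught alongside IOException (the classify method and its return strings are illustrative, not part of the Avro API):

```java
import java.io.IOException;

import org.apache.avro.AvroTypeException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;

public class ErrorHandlingExample {
    static String classify(String json) {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"age\",\"type\":\"int\"}]}");
        try {
            DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
            Decoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
            reader.read(null, decoder);
            return "ok";
        } catch (AvroTypeException e) {
            // Data does not conform to the schema (e.g. a string where an int is expected)
            return "schema mismatch";
        } catch (IOException e) {
            // Stream / decoder failure
            return "io error";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("{\"age\": 30}"));
        System.out.println(classify("{\"age\": \"oops\"}"));
    }
}
```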

Best Practices

Use specific Java classes for better type safety when possible.

Leverage default values in schemas for schema evolution.

Use Avro binary format for compact storage and JSON format for readability during debugging.

Integrate with Kafka or Hadoop for streaming and batch data processing.

Validate schemas before serialization to prevent runtime errors.
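For the last point, one option is GenericData.validate, which checks a datum against a schema without attempting to encode it. A sketch (the isValid helper is hypothetical):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class ValidateExample {
    static boolean isValid(Object age) {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");
        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "Alice");
        record.put("age", age);
        // validate returns false instead of throwing when the datum does not fit
        return GenericData.get().validate(schema, record);
    }

    public static void main(String[] args) {
        System.out.println(isValid(30));       // true
        System.out.println(isValid("thirty")); // false: age must be an int
    }
}
```

Validating before serialization turns late encoder failures into an explicit, testable check.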