JSON is Fine — Until It Isn’t: A Practical Guide to Binary Serialization Formats

abhijeetpnwr

2 weeks ago

Most backend engineers use JSON for everything. Request bodies, Kafka events, database exports, internal service calls — JSON everywhere. And for most systems, that’s completely fine.

But at some point, either through scale or through the need to store data for years, JSON starts showing cracks. This post explains exactly where those cracks appear, what the alternatives are, and — most importantly — when to actually bother switching.

We’ll use one example record throughout to make the comparison concrete:

{
  "userName": "Abhijeet",
  "favoriteNumber": 2412,
  "interests": ["daydreaming", "programming"]
}

The Problem With JSON at Scale

JSON is a text format. Everything in it — field names, values, punctuation — is human-readable text. That is its greatest strength and its biggest weakness.

Take the record above. Encoded compactly, with no whitespace, it’s 87 bytes. Inside those 87 bytes:

"userName" — 10 bytes (including quotes), repeated in every single record
"favoriteNumber" — 16 bytes, repeated in every single record
"interests" — 11 bytes, repeated in every single record
2412 — stored as 4 ASCII characters, not as a binary integer
"daydreaming" and "programming" — 13 and 13 bytes including quotes
Structural characters (braces, brackets, colons, commas): 10 bytes

At 100 requests per second, none of this matters. At 1 million events per second through a Kafka pipeline, the three quoted field names alone (37 bytes per record) add up to about 37 MB every second, or more than 2 GB per minute. That is metadata describing the data rather than being the data, and it is pushed through your network and storage systems on every single message.
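To see where the bytes go, the sketch below measures the example record with Python's standard json module. The compact separators and the variable names are choices made for this illustration; a pretty-printed encoding would be larger.

```python
import json

record = {
    "userName": "Abhijeet",
    "favoriteNumber": 2412,
    "interests": ["daydreaming", "programming"],
}

# Compact encoding: no spaces after ':' or ','.
encoded = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Bytes spent on quoted field names, which repeat in every record.
key_bytes = sum(len(json.dumps(key)) for key in record)

print(len(encoded))  # 87
print(key_bytes)     # 37
print(f"{key_bytes / len(encoded):.0%} of the payload is field names")
```

For this record, more than two fifths of every message is field names.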

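There’s also a parsing cost. JSON is text, so every consumer has to read it character by character, infer types, and allocate strings. A binary format avoids most of that: types and lengths are stated up front, so a decoder can go straight to each value instead of scanning for quotes and delimiters.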

The alternatives — MessagePack, Protobuf, and Avro — each solve this problem differently. Understanding how they differ tells you which one to reach for.

MessagePack — Binary JSON

MessagePack is the most conservative option. It keeps exactly the same structure as JSON — objects, arrays, strings, numbers — but encodes everything in binary instead of text.

The key insight is how it encodes type and length information. Instead of writing { to start an object, it writes a single byte that encodes both “this is an object” and “it has N fields” simultaneously.

For our record with 3 fields, that byte is 0x83:

0x83  →  binary: 1000 0011
          top 4 bits (1000) = "this is a map/object"
          bottom 4 bits (0011) = 3 fields

One byte. Two pieces of information. No curly brace, no colon, no comma.

Strings work similarly. The field name "userName" (8 characters) becomes 0xa8 followed by the raw bytes of the string:

0xa8  →  binary: 101 01000
          top 3 bits (101) = "this is a short string"
          bottom 5 bits (01000) = length 8

MessagePack defines a fixed table of these prefixes, so a decoder knows from the first byte which type follows. Short strings get a 5-bit length (up to 31 bytes); small maps get a 4-bit count (up to 15 entries).
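The header-byte scheme can be checked with a few bit masks. The sketch below is illustrative only: the function name describe_header is made up here, and it covers just the two compact forms from the MessagePack format spec that this example needs, fixmap (prefix 1000, 4-bit count) and fixstr (prefix 101, 5-bit length).

```python
def describe_header(byte: int) -> str:
    """Classify a MessagePack header byte for the two compact cases used here."""
    if byte & 0xF0 == 0x80:   # fixmap: 1000xxxx, low 4 bits = number of entries
        return f"map with {byte & 0x0F} entries"
    if byte & 0xE0 == 0xA0:   # fixstr: 101xxxxx, low 5 bits = length in bytes
        return f"string of {byte & 0x1F} bytes"
    return "some other type"


print(describe_header(0x83))  # map with 3 entries
print(describe_header(0xA8))  # string of 8 bytes
```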

Numbers use a typed encoding. The integer 2412 fits in 16 bits, so MessagePack writes a type marker (0xcd, meaning “uint16 — value in next 2 bytes”) followed by the 2-byte value: 0xcd 0x09 0x6c. Three bytes total, compared to four ASCII characters in JSON.
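The three bytes for 2412 can be reproduced with Python's struct module. This is a minimal sketch of the uint16 case only; a real MessagePack library picks the smallest integer type for each value automatically.

```python
import struct

# 0xcd marks a big-endian unsigned 16-bit integer in MessagePack.
encoded = struct.pack(">BH", 0xCD, 2412)
print(encoded.hex(" "))  # cd 09 6c

# Decoding reverses the step: read the marker, then the 2-byte value.
marker, value = struct.unpack(">BH", encoded)
print(hex(marker), value)  # 0xcd 2412
```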

What does the Abhijeet record look like after MessagePack encoding?

Format	Size
JSON	87 bytes
MessagePack	72 bytes

About 17% smaller. Meaningful, but not transformative. The reason: MessagePack still carries field names in every message. The string "userName" still gets encoded in every record, just with a one-byte length prefix instead of two quote characters. The fundamental inefficiency remains.
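To check the total, a hand-rolled encoder for just the types in the example record is enough. This is a minimal sketch, not a replacement for a real MessagePack library: the function name pack is an assumption, and it only covers short strings, small non-negative integers, and short lists and maps.

```python
def pack(value) -> bytes:
    """Encode the subset of MessagePack needed for the example record."""
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) > 31:
            raise ValueError("sketch only covers fixstr (up to 31 bytes)")
        return bytes([0xA0 | len(raw)]) + raw
    if isinstance(value, int):
        if 0 <= value <= 0x7F:
            return bytes([value])                      # positive fixint
        if 0 <= value <= 0xFF:
            return b"\xcc" + bytes([value])            # uint8
        if 0 <= value <= 0xFFFF:
            return b"\xcd" + value.to_bytes(2, "big")  # uint16
        raise ValueError("sketch only covers 0..65535")
    if isinstance(value, list):
        if len(value) > 15:
            raise ValueError("sketch only covers fixarray (up to 15 items)")
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    if isinstance(value, dict):
        if len(value) > 15:
            raise ValueError("sketch only covers fixmap (up to 15 entries)")
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + body
    raise TypeError(f"unsupported type: {type(value).__name__}")


record = {
    "userName": "Abhijeet",
    "favoriteNumber": 2412,
    "interests": ["daydreaming", "programming"],
}

encoded = pack(record)
print(len(encoded))  # 72
```

Counted this way, the record comes to 72 bytes. The 37 bytes of quoted field names in JSON shrink only to 34, because each name still travels with a one-byte length prefix.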

MessagePack is useful when you want “JSON but faster” without the operational overhead of a schema. It fits use cases like internal caches or gaming backends. But with no schema there is no schema evolution story, and it doesn’t solve the real problem at scale, because field names still travel with every record. Teams that outgrow it often end up moving to a schema-based format like Protobuf.

Protobuf — Fields as Numbers

Protocol Buffers, developed at Google, takes a fundamentally different approach. It eliminates field names from the wire format entirely and replaces them with integers.

You define a schema — a .proto file — that assigns a numeric tag to every field:

message Person {
  string userName           = 1;
  int32  favoriteNumber     = 2;
  repeated string interests = 3;
}

The numbers 1, 2, 3 are field tags. When Protobuf encodes a record, the string "userName" never appears on the wire. Instead, it writes the number 1. The decoder, which has the same schema compiled into it, reads 1 and knows that means userName. The schema is the shared key that unlocks the meaning.

Each field on the wire is encoded as:

[tag number + wire type]  [value]

For our record:

tag=1, type=string, length=8   →  A b h i j e e t
tag=2, type=varint              →  2412
tag=3, type=string, length=11  →  d a y d r e a m i n g
tag=3, type=string, length=11  →  p r o g r a m m i n g

Notice interests is encoded as two separate entries with the same tag (3), because it’s a repeated field.
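The "tag number + wire type" prefix is a single integer in the Protobuf encoding: the field number shifted left by three bits, with the wire type in the low three bits. The sketch below computes it for the three fields; the function name field_key is made up for illustration. For field numbers 1 to 15 the result fits in one byte.

```python
VARINT = 0            # wire type for int32, int64, bool, enum
LENGTH_DELIMITED = 2  # wire type for strings, bytes, nested messages


def field_key(field_number: int, wire_type: int) -> int:
    """Combine a field number and wire type into the key that precedes a value."""
    return (field_number << 3) | wire_type


print(hex(field_key(1, LENGTH_DELIMITED)))  # 0xa   userName
print(hex(field_key(2, VARINT)))            # 0x10  favoriteNumber
print(hex(field_key(3, LENGTH_DELIMITED)))  # 0x1a  interests, once per element
```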

Integers use varint encoding, a different scheme from MessagePack’s fixed-width types. A Protobuf varint splits the number into 7-bit groups, least significant group first, and stores each group in one byte, using the 8th bit to signal “more bytes follow.” The number 2412 needs 12 bits, so it fits in two groups and is encoded in 2 bytes. The wire type rides along in the tag byte, so no separate type marker is needed, unlike MessagePack’s 3-byte uint16 representation.
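A varint encoder and decoder fit in a few lines. This is a minimal sketch for non-negative integers only, with function names chosen for illustration; negative int32 values are handled differently by Protobuf and are not covered here.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, low 7 bits first."""
    if n < 0:
        raise ValueError("sketch only covers non-negative integers")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a varint that starts at the beginning of data."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            break
    return result


print(encode_varint(2412).hex(" "))    # ec 12
print(decode_varint(b"\xec\x12"))      # 2412
```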

Format	Size
JSON	87 bytes
MessagePack	72 bytes
Protobuf	39 bytes

About 55% smaller than JSON. And typically faster to parse: no character scanning, no type inference, direct binary reads.
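The Protobuf size can be checked without the protobuf library by writing the wire format by hand. This is a sketch under simplifying assumptions: the helper names are invented here, only non-negative integers and strings are supported, and real code would use classes generated from the .proto file.

```python
def varint(n: int) -> bytes:
    """Base-128 varint for a non-negative integer."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def string_field(field_number: int, text: str) -> bytes:
    """Key with wire type 2 (length-delimited), then length, then UTF-8 bytes."""
    raw = text.encode("utf-8")
    return varint((field_number << 3) | 2) + varint(len(raw)) + raw


def int_field(field_number: int, value: int) -> bytes:
    """Key with wire type 0 (varint), then the value."""
    return varint((field_number << 3) | 0) + varint(value)


message = (
    string_field(1, "Abhijeet")
    + int_field(2, 2412)
    + string_field(3, "daydreaming")   # repeated field: one entry per element
    + string_field(3, "programming")
)

print(len(message))  # 39
```

Of those 39 bytes, 32 are the values themselves; the remaining 7 are four key bytes and three string length bytes.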

Schema Evolution

The tag number, not the name, is the identity of a field. You can rename userName to displayName and nothing breaks, because the tag 1 still identifies it on the wire. But if you change a tag number, every existing encoded message is silently misread: the bytes are intact, but decoders now attach them to the wrong field or skip them. The rules are strict:

Never reuse a tag number, even after deleting a field
Never change a field’s type
New fields must be optional — old messages won’t have them, new code uses defaults

With these rules, adding a new field is safe. You give it a new tag, deploy the new producer, and old consumers simply skip the unknown tag. New consumers reading old messages see the field is missing and use the default value.
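Both directions of that compatibility can be seen in a small hand-written decoder. This is a sketch, not how generated Protobuf code is structured: the names OLD_SCHEMA and decode_person are invented, tag 4 stands for a hypothetical new email field, and only wire types 0 and 2 are handled.

```python
OLD_SCHEMA = {1: "userName", 2: "favoriteNumber", 3: "interests"}


def read_varint(buf: bytes, pos: int) -> tuple:
    """Return (value, next_position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_person(buf: bytes) -> dict:
    """Decode with the old schema: unknown tags are skipped, missing fields keep defaults."""
    person = {"userName": "", "favoriteNumber": 0, "interests": []}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        name = OLD_SCHEMA.get(tag)
        if name is None:
            continue  # unknown tag: the value was read past and is ignored
        if name == "interests":
            person[name].append(value.decode("utf-8"))
        elif name == "userName":
            person[name] = value.decode("utf-8")
        else:
            person[name] = value
    return person


# Written by a newer producer: tag 4 is new, and favoriteNumber is absent.
new_message = b"\x0a\x08Abhijeet" + b"\x22\x0ba@b.example"
print(decode_person(new_message))
# {'userName': 'Abhijeet', 'favoriteNumber': 0, 'interests': []}
```

The wire type is what makes skipping possible: even without knowing what tag 4 means, the decoder knows how many bytes its value occupies.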

One important gotcha worth understanding: in systems where old and new code run simultaneously (during a rolling deployment, for example), old code may read a record that was written by new code, which includes a new field, and then write that record back to the database. If the old code silently drops fields it doesn’t recognise, that new field is permanently lost. This isn’t a Protobuf-specific problem; it applies to any schema evolution strategy. The safe approach is to ensure that code always preserves unknown fields when reading and rewriting records. Protobuf’s generated classes do this by default in current releases; the usual culprit is a hand-written mapping layer that copies only the fields it knows about.
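The difference between a lossy rewrite and a preserving one is easy to demonstrate on raw bytes. The sketch below is hypothetical throughout: rename_user stands in for old code that updates field 1 and writes the record back, and tag 4 is a field that only newer code knows about.

```python
KNOWN_TAGS = {1, 2, 3}


def read_varint(buf: bytes, pos: int) -> tuple:
    """Return (value, next_position) for the varint starting at pos."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def split_fields(buf: bytes):
    """Yield (tag, raw_bytes) per field; raw_bytes covers the key and the value."""
    pos = 0
    while pos < len(buf):
        start = pos
        key, pos = read_varint(buf, pos)
        wire_type = key & 0x07
        if wire_type == 0:
            _, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        yield key >> 3, buf[start:pos]


def rename_user(buf: bytes, new_name: str, keep_unknown: bool) -> bytes:
    """Rewrite field 1, optionally carrying unrecognised fields along unchanged."""
    raw = new_name.encode("utf-8")
    if len(raw) > 127:
        raise ValueError("sketch only covers one-byte lengths")
    out = bytes([0x0A, len(raw)]) + raw
    for tag, chunk in split_fields(buf):
        if tag == 1:
            continue  # replaced above
        if tag in KNOWN_TAGS or keep_unknown:
            out += chunk
    return out


stored = b"\x0a\x08Abhijeet" + b"\x10\xec\x12" + b"\x22\x0ba@b.example"
lossy = rename_user(stored, "Abhi", keep_unknown=False)
safe = rename_user(stored, "Abhi", keep_unknown=True)

print(b"a@b.example" in lossy)  # False
print(b"a@b.example" in safe)   # True
```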

When to Use Protobuf

Protobuf is designed for communication between services you control. Both ends compile the same .proto file and evolve together. It is the default wire format for gRPC, a modern RPC framework that has largely replaced older approaches like CORBA and SOAP-based RPC and offers efficient binary transport, type safety across languages, and native support for bidirectional streaming. If you are building synchronous service-to-service communication at scale, gRPC and Protobuf are the standard answer.

The limitation: Protobuf is not designed for data that outlives the code that wrote it. If you write Protobuf events to S3 today and need to read them in two years without the original .proto file, you have a problem.

Avro — Schema Travels With Data

Apache Avro strips out even more than Protobuf. There are no field tags at all. Fields are encoded purely by position — just the values, one after another, in the order they appear in the schema.

An Avro schema for our record:

{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "userName",       "type": "string"},
    {"name": "favoriteNumber", "type": "int"},
    {"name": "interests",      "type": {"type": "array", "items": "string"}}
  ]
}

The encoded data:

[length=8]   A b h i j e e t
[varint]     2412
[count=2]    [length=11]  d a y d r e a m i n g
             [length=11]  p r o g r a m m i n g

[block end: 0]

No field names. No tag numbers. Just values in schema order. Arrays use block encoding — a count of items in the block, followed by the items, followed by a zero byte to signal the end.
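The layout can be reproduced by hand. One detail from the Avro specification matters for the byte count: ints, string lengths, and block counts are all written as zigzag-encoded varints, so small negative numbers stay small too. The sketch below uses invented helper names and covers only the three types in the example schema.

```python
def zigzag_varint(n: int) -> bytes:
    """Avro int/long: zigzag-map the signed value, then write it as a varint."""
    z = (n << 1) ^ (n >> 63)  # 0, -1, 1, -2, ... map to 0, 1, 2, 3, ...
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def avro_string(text: str) -> bytes:
    """Length in bytes, then the UTF-8 bytes."""
    raw = text.encode("utf-8")
    return zigzag_varint(len(raw)) + raw


def avro_string_array(items: list) -> bytes:
    """One block holding every item, then a zero count to end the array."""
    if not items:
        return zigzag_varint(0)
    body = b"".join(avro_string(item) for item in items)
    return zigzag_varint(len(items)) + body + zigzag_varint(0)


# Values in schema order: userName, favoriteNumber, interests.
encoded = (
    avro_string("Abhijeet")
    + zigzag_varint(2412)
    + avro_string_array(["daydreaming", "programming"])
)

print(len(encoded))  # 37
```

Nothing in those bytes says which value belongs to which field; only the schema order does.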

Format	Size
JSON	87 bytes
MessagePack	72 bytes
Protobuf	39 bytes
Avro	37 bytes

Marginally smaller than Protobuf because there are no tag bytes at all.

The Catch — and the Solution

To decode those raw bytes, you need the exact schema that wrote them. The bytes A b h i j e e t have no meaning without context — is this userName? Could be any string field.

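Avro itself offers one answer for files: its object container format writes the schema once in the file header, followed by many records, so the schema literally travels with the data. For streams of individual messages, as in Kafka, repeating the schema in every message would cost more than the data, so the usual answer is a schema registry (Confluent Schema Registry is the common choice): a central service that stores every version of every schema, identified by a short integer ID. When a producer encodes a message, it prepends just 5 bytes: a magic byte and the 4-byte schema ID. The binary data follows. When a consumer reads the message, it extracts the schema ID, fetches the writer’s schema from the registry, and resolves it against its own reader’s schema.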
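The 5-byte prefix is simple enough to build and parse with struct. The sketch below follows the framing used by Confluent's Kafka serializers (magic byte 0, then a big-endian 4-byte schema ID); the function names and the schema ID 42 are made up, and the registry lookup itself is not shown.

```python
import struct

MAGIC_BYTE = 0


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prepend the magic byte and the 4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes) -> tuple:
    """Split a framed message into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]


payload = b"\x10Abhijeet"     # an Avro-encoded string, for illustration
message = frame(42, payload)  # 42 is a hypothetical schema ID

print(message[:5].hex(" "))   # 00 00 00 00 2a
print(unframe(message))       # (42, b'\x10Abhijeet')
```

A consumer would use the returned ID to fetch the writer's schema, usually caching it so the registry is contacted only once per schema version.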

Schema Evolution Without Tag Numbers

Avro’s evolution works differently from Protobuf. Instead of tag numbers, it matches fields by name between the writer’s schema and the reader’s schema at decode time:

Field in writer, not in reader → skipped
Field in reader, not in writer → uses the declared default value
Field in both → decoded normally

Adding a new field with a default value is safe. Old messages simply don’t have it — the reader fills in the default. No tag numbers needed, no coordination required between producers and consumers beyond the Schema Registry.
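The three resolution rules can be sketched as a function over an already-decoded record. Real Avro libraries apply them while decoding the binary data, and also handle type promotion and aliases; this simplified version, with an invented function name and a hypothetical email field, shows only the name matching.

```python
def resolve(writer_record: dict, reader_fields: list) -> dict:
    """Project a record decoded with the writer's schema onto the reader's fields."""
    result = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            result[name] = writer_record[name]   # in both: decoded normally
        elif "default" in field:
            result[name] = field["default"]      # reader only: use the default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result  # writer-only fields are never copied, so they are skipped


# The reader's schema dropped "interests" and added "email" with a default.
reader_fields = [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": None},
]

old_record = {
    "userName": "Abhijeet",
    "favoriteNumber": 2412,
    "interests": ["daydreaming", "programming"],
}

resolved = resolve(old_record, reader_fields)
print(resolved)
# {'userName': 'Abhijeet', 'favoriteNumber': 2412, 'email': None}
```

A reader field with neither a writer value nor a default is an error, which is why new fields need defaults to stay compatible with old data.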

Avro supports both generated code (like Protobuf) and schema-based dynamic decoding without any code generation. The dynamic option is why Avro is popular in data pipeline tooling like Spark and Kafka Connect — those systems can inspect and process schemas at runtime without knowing them at compile time.

When to Use Avro

Avro is designed for decoupled producers and consumers — especially when data has a long life. It is the standard format in the Kafka and Hadoop ecosystems precisely because:

A Kafka message might be consumed months after it was produced
Multiple independent teams might consume the same topic with different schema versions
Historical data in S3 or HDFS needs to be readable without the original codebase

When the producer and consumer are different teams with different deployment schedules, Avro’s name-based resolution is safer than Protobuf’s tag-based approach. The Schema Registry ensures compatibility without tight coupling between services.

The Full Comparison

	JSON	MessagePack	Protobuf	Avro
Size (our example)	87 bytes	72 bytes	39 bytes	37 bytes
Schema needed to decode?	No	No	Yes (compiled)	Yes (registry)
Field identity	Name string	Name string	Tag integer	Name + position
Self-describing?	Yes	Yes	No	No
Schema evolution	Manual versioning	None	Tag numbers stable	Name matching + defaults
Code generation	No	No	Required	Optional*
Primary ecosystem	REST APIs	Niche	gRPC	Kafka, Hadoop, Spark

*Avro supports both generated classes and dynamic schema-based decoding, which is why it integrates naturally with data pipeline tooling that processes schemas at runtime.

When to Use Which

The choice is not about performance benchmarks. It comes down to three questions: who holds the schema, how long does the data live, and who are the consumers.

Use JSON when:

The API is public-facing or consumed by external developers
Debuggability matters more than performance (you want to curl it, log it, read it)
Volume is moderate — serialization cost doesn’t show up in your latency metrics
You want the simplest possible operational setup

Use Protobuf when:

Services are internal and you control both producer and consumer
You need RPC with streaming support (gRPC)
Volume is high enough that serialization cost is measurable
Both ends can share and compile the same schema file

Use Avro when:

Producer and consumer are owned by different teams or have independent release cycles
Data flows through Kafka and multiple downstream consumers exist
Data is stored long-term in S3, HDFS, or a data lake
You need the schema to travel with the data so it remains readable without the original codebase

The honest default: most companies, even large ones, use JSON for internal services and are completely fine. The operational simplicity of JSON — loggable, curl-able, no schema files to manage — outweighs the performance gain at most scales. Switch to binary formats when serialization cost actually shows up as a problem, not in anticipation of a problem that may never arrive.

The One-Line Summary

MessagePack — JSON compressed into binary. Still carries field names. No major ecosystem. Use when you want a quick win without schema management overhead.

Protobuf — field names replaced by tag numbers in a shared schema. Fast, compact, type-safe across languages. Designed for services that evolve together. The standard for gRPC.

Avro — no tags, no names in the binary data. Schema stored in a registry and resolved at decode time. Designed for data that outlives the code that wrote it. The standard for Kafka and data pipelines.

The format should follow from your architecture, not drive it. If your producer and consumer are the same team deploying together, Protobuf. If they are different teams or the data lives for years, Avro. If neither applies, JSON is probably fine.