JSON Streaming for Large Files: Process Without Loading All
Standard JSON parsing loads the entire document into memory, builds a complete data structure, then gives you access. For a 10 MB file, that works fine. For a 10 GB file, your process runs out of memory and crashes. Streaming parsers solve this by processing JSON incrementally: reading and handling data as it arrives, without ever holding the entire document in memory.
The Problem with Standard Parsing
import json

# This loads the ENTIRE file into memory
with open('huge.json') as f:
    data = json.load(f)  # 10 GB file = 10+ GB of RAM

# Processing happens only after the full load
for item in data['records']:
    process(item)
For a file with 1 million records at 10 KB each, standard parsing needs:
- File size: ~10 GB
- Memory for parsing: ~10 GB (the raw string)
- Memory for data structure: ~15-20 GB (Python objects are larger than the raw JSON text; see the sketch after this list)
- Total: ~25-30 GB of RAM
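The object-overhead line can be sanity-checked on a single record. A minimal sketch (sys.getsizeof counts only the dict container, not its keys and values, so the real footprint per record is even larger):

import json
import sys

record = {"id": 1, "name": "Alice", "email": "alice@example.com"}
raw = json.dumps(record)

print(len(raw.encode('utf-8')))  # bytes of JSON text for this record
print(sys.getsizeof(record))     # bytes for the dict container alone (keys and values not included)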
Streaming reduces this to megabytes.
Streaming Approaches
SAX-Style (Event-Based)
The parser emits events as it encounters JSON tokens:
import ijson

# Process items one at a time - constant memory usage
with open('huge.json', 'rb') as f:
    for record in ijson.items(f, 'records.item'):
        # Each record is parsed individually;
        # previous records are garbage collected
        process(record)
Events include: start_map, map_key, end_map, start_array, end_array, string, number, boolean, null.
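To make the event model concrete, the following sketch prints the event stream ijson produces for a tiny in-memory document; large files generate the same events, just incrementally:

import io
import ijson

doc = b'{"records": [{"id": 1}]}'
for prefix, event, value in ijson.parse(io.BytesIO(doc)):
    print(prefix, event, value)

# Produces, in order:
#   ('', 'start_map', None), ('', 'map_key', 'records'),
#   ('records', 'start_array', None), ('records.item', 'start_map', None),
#   ('records.item', 'map_key', 'id'), ('records.item.id', 'number', 1),
#   ('records.item', 'end_map', None), ('records', 'end_array', None),
#   ('', 'end_map', None)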
JSON Lines (JSONL / NDJSON)
A simpler approach: one JSON object per line. Each line is a complete, valid JSON document:
{"id": 1, "name": "Alice", "email": "alice@example.com"}
{"id": 2, "name": "Bob", "email": "bob@example.com"}
{"id": 3, "name": "Charlie", "email": "charlie@example.com"}
Processing is trivial; read line by line:
import json

with open('data.jsonl') as f:
    for line in f:
        record = json.loads(line)
        process(record)
Advantages of JSON Lines:
- Each line is independently parseable (parallel processing)
- Append-friendly: just add a new line (see the sketch after this list)
- Works with standard Unix tools (grep, wc, head, tail)
- Natural for log files and streaming data
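As a sketch of the append-friendly property (the helper name is illustrative), new records can be added without rewriting anything already on disk:

import json

def append_records(path, records):
    # Append mode adds new lines; existing lines are never touched
    with open(path, 'a', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

append_records('data.jsonl', [{"id": 4, "name": "Dana", "email": "dana@example.com"}])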
Chunked Processing
For standard JSON arrays, split processing into chunks:
import ijson

def process_in_chunks(filename, chunk_size=1000):
    chunk = []
    with open(filename, 'rb') as f:
        # The prefix 'item' matches each element of a top-level JSON array
        for record in ijson.items(f, 'item'):
            chunk.append(record)
            if len(chunk) >= chunk_size:
                process_batch(chunk)
                chunk = []
    if chunk:
        process_batch(chunk)
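process_batch is whatever sink fits your pipeline. As one hypothetical example (the table name and columns are assumptions), a bulk insert into SQLite keeps database round trips proportional to the number of chunks rather than the number of records:

import sqlite3

def process_batch(chunk):
    # Hypothetical sink: bulk-insert one chunk of records into SQLite.
    # int() guards against Decimal values, which ijson uses for non-integer
    # numbers and which sqlite3 cannot store without an adapter.
    rows = [(int(r['id']), r['name']) for r in chunk]
    with sqlite3.connect('records.db') as conn:
        conn.executemany('INSERT INTO records (id, name) VALUES (?, ?)', rows)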
Language Implementations
Python (ijson)
import ijson

# Stream from a file
with open('large.json', 'rb') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix == 'records.item.name':
            print(value)

# Stream from an HTTP response
import urllib.request

response = urllib.request.urlopen('https://api.example.com/data')
for record in ijson.items(response, 'records.item'):
    process(record)
JavaScript (Node.js)
const { createReadStream } = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');

const pipeline = createReadStream('large.json')
  .pipe(parser())
  .pipe(streamArray());

pipeline.on('data', ({ value }) => {
  handleRecord(value); // your per-record handler (avoid the name "process", which is a Node global)
});

pipeline.on('end', () => {
  console.log('Done processing');
});
Command Line (jq)
# Streaming mode - emit each [path, value] leaf pair without loading the whole file
jq --stream 'select(length == 2) | .[1]' large.json
# Process JSON Lines
cat data.jsonl | jq -c 'select(.age > 30)'
# Convert array to JSON Lines
jq -c '.[]' large_array.json > data.jsonl
When to Use Streaming
| Scenario | Standard | Streaming |
|---|---|---|
| File under 100 MB | Preferred | Overkill |
| File 100 MB to 1 GB | Depends on RAM | Recommended |
| File over 1 GB | Not feasible | Required |
| HTTP response (large) | Risk timeout | Stream as received |
| Real-time data feed | Not applicable | Required |
| Simple one-time read | Preferred | Unnecessary |
Performance Comparison
Processing 1 million records (1 GB file):
| Approach | Memory Usage | Processing Time | Complexity |
|---|---|---|---|
| json.load() | 3-5 GB | 15 sec | Simple |
| ijson streaming | 50 MB | 45 sec | Moderate |
| JSON Lines | 10 MB | 12 sec | Simple |
| Chunked (1000) | 100 MB | 20 sec | Moderate |
Streaming uses far less memory but is slower for SAX-style parsing due to the event-driven overhead. JSON Lines is both fast and memory-efficient because each line is an independent parse operation.
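Exact figures depend on the data, hardware, and library versions, so treat the table as indicative. A sketch for checking the memory side on your own files (tracemalloc only tracks allocations made through Python's allocator, so it understates native buffers):

import json
import tracemalloc

import ijson

def peak_mb(fn):
    # Peak memory allocated through Python while fn() runs, in MB
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / (1024 * 1024)

def load_all():
    with open('large.json') as f:
        json.load(f)

def stream_all():
    with open('large.json', 'rb') as f:
        for _ in ijson.items(f, 'records.item'):
            pass

print(f"json.load peak: {peak_mb(load_all):.1f} MB")
print(f"ijson peak:     {peak_mb(stream_all):.1f} MB")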
Best Practices
- Use JSON Lines when possible: It is the simplest streaming format and works with standard tools.
- Set buffer sizes: Configure read buffers for optimal throughput (64 KB to 1 MB typically).
- Process in batches: Batch database inserts and API calls rather than processing one record at a time.
- Handle errors gracefully: In streaming, one malformed record should not crash the entire pipeline (see the sketch after this list).
- Monitor memory: Use profiling to verify that streaming is actually keeping memory bounded.
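For the error-handling point, a minimal sketch over JSON Lines that skips and counts malformed lines instead of aborting the run:

import json

def iter_records(path):
    bad = 0
    with open(path, encoding='utf-8') as f:
        for line_number, line in enumerate(f, 1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                bad += 1  # in a real pipeline, log line_number and the error
    if bad:
        print(f"Skipped {bad} malformed line(s)")

for record in iter_records('data.jsonl'):
    process(record)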
For formatting and validating smaller JSON files, our JSON Formatter handles documents up to 100 MB with real-time formatting.
FAQ
Can I use JSONPath with streaming parsers?
Some streaming libraries support path-based filtering. Python ijson.items supports filtering by path during streaming. However, complex JSONPath queries (filters, wildcards across levels) typically require the full document in memory. For path-based queries, see our JSONPath guide.
How do I convert a large JSON array to JSON Lines?
Use jq -c '.[]' input.json > output.jsonl for files that fit in memory. For truly large files, use a streaming converter: read the array with a streaming parser and write each element as a line.
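Such a streaming converter is a few lines with ijson; a sketch (the filenames are placeholders):

import json

import ijson

def array_to_jsonl(src, dst):
    # Read array elements one at a time and write each one as its own line.
    # default=float serializes the Decimal values ijson produces for
    # non-integer numbers, which json.dumps cannot handle on its own.
    with open(src, 'rb') as fin, open(dst, 'w', encoding='utf-8') as fout:
        for record in ijson.items(fin, 'item'):
            fout.write(json.dumps(record, default=float) + '\n')

array_to_jsonl('large_array.json', 'data.jsonl')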
Related Resources
- JSON Formatter: Format and validate JSON files
- JSONPath Query Guide: Extract data from JSON efficiently
- JSON Editor Tips: Work with large JSON documents