JSON Streaming for Large Files: Process Without Loading All
Standard JSON parsing loads the entire document into memory, builds a complete data structure, then gives you access. For a 10 MB file, that works fine. For a 10 GB file, your process runs out of memory and crashes. Streaming parsers solve this by processing JSON incrementally: reading and handling data as it arrives, without ever holding the entire document in memory.
The Problem with Standard Parsing
import json

# This loads the ENTIRE file into memory
with open('huge.json') as f:
    data = json.load(f)  # 10 GB file = 10+ GB of RAM

# Processing happens only after the full load
for item in data['records']:
    process(item)
For a file with 1 million records at 10 KB each, standard parsing needs:
- File size: ~10 GB
- Memory for parsing: ~10 GB (the raw string)
- Memory for data structure: ~15-20 GB (Python objects are larger than the raw JSON text; see the sketch after this list)
- Total: ~25-30 GB of RAM
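The object-overhead line can be sanity-checked on a single record. A minimal sketch (sys.getsizeof counts only the dict container, not its keys and values, so the real footprint per record is even larger):

import json
import sys

record = {"id": 1, "name": "Alice", "email": "alice@example.com"}
raw = json.dumps(record)

print(len(raw.encode('utf-8')))  # bytes of JSON text for this record
print(sys.getsizeof(record))     # bytes for the dict container alone (keys and values not included)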
Streaming reduces this to megabytes.
Streaming Approaches
SAX-Style (Event-Based)
The parser emits events as it encounters JSON tokens:
import ijson

# Process items one at a time - constant memory usage
with open('huge.json', 'rb') as f:
    for record in ijson.items(f, 'records.item'):
        # Each record is parsed individually;
        # previous records are garbage collected
        process(record)
Events include: start_map, map_key, end_map, start_array, end_array, string, number, boolean, null.
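To make the event model concrete, the following sketch prints the event stream ijson produces for a tiny in-memory document; large files generate the same events, just incrementally:

import io
import ijson

doc = b'{"records": [{"id": 1}]}'
for prefix, event, value in ijson.parse(io.BytesIO(doc)):
    print(prefix, event, value)

# Produces, in order:
#   ('', 'start_map', None), ('', 'map_key', 'records'),
#   ('records', 'start_array', None), ('records.item', 'start_map', None),
#   ('records.item', 'map_key', 'id'), ('records.item.id', 'number', 1),
#   ('records.item', 'end_map', None), ('records', 'end_array', None),
#   ('', 'end_map', None)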
JSON Lines (JSONL / NDJSON)
A simpler approach: one JSON object per line. Each line is a complete, valid JSON document:
{"id": 1, "name": "Alice", "email": "alice@example.com"}
{"id": 2, "name": "Bob", "email": "bob@example.com"}
{"id": 3, "name": "Charlie", "email": "charlie@example.com"}
Processing is trivial; read line by line:
import json

with open('data.jsonl') as f:
    for line in f:
        record = json.loads(line)
        process(record)
Advantages of JSON Lines:
- Each line is independently parseable (parallel processing)
- Append-friendly: just add a new line (see the sketch after this list)
- Works with standard Unix tools (grep, wc, head, tail)
- Natural for log files and streaming data
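As a sketch of the append-friendly property (the helper name is illustrative), new records can be added without rewriting anything already on disk:

import json

def append_records(path, records):
    # Append mode adds new lines; existing lines are never touched
    with open(path, 'a', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

append_records('data.jsonl', [{"id": 4, "name": "Dana", "email": "dana@example.com"}])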
Chunked Processing
For standard JSON arrays, split processing into chunks:
import ijson

def process_in_chunks(filename, chunk_size=1000):
    chunk = []
    with open(filename, 'rb') as f:
        # The prefix 'item' matches each element of a top-level JSON array
        for record in ijson.items(f, 'item'):
            chunk.append(record)
            if len(chunk) >= chunk_size:
                process_batch(chunk)
                chunk = []
    if chunk:
        process_batch(chunk)
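process_batch is whatever sink fits your pipeline. As one hypothetical example (the table name and columns are assumptions), a bulk insert into SQLite keeps database round trips proportional to the number of chunks rather than the number of records:

import sqlite3

def process_batch(chunk):
    # Hypothetical sink: bulk-insert one chunk of records into SQLite.
    # int() guards against Decimal values, which ijson uses for non-integer
    # numbers and which sqlite3 cannot store without an adapter.
    rows = [(int(r['id']), r['name']) for r in chunk]
    with sqlite3.connect('records.db') as conn:
        conn.executemany('INSERT INTO records (id, name) VALUES (?, ?)', rows)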
Language Implementations
Python (ijson)
import ijson

# Stream from a file
with open('large.json', 'rb') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix == 'records.item.name':
            print(value)

# Stream from an HTTP response
import urllib.request

response = urllib.request.urlopen('https://api.example.com/data')
for record in ijson.items(response, 'records.item'):
    process(record)
JavaScript (Node.js)
const { createReadStream } = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');

const pipeline = createReadStream('large.json')
  .pipe(parser())
  .pipe(streamArray());

pipeline.on('data', ({ value }) => {
  handleRecord(value); // your per-record handler (avoid the name "process", which is a Node global)
});

pipeline.on('end', () => {
  console.log('Done processing');
});
Command Line (jq)
# Streaming mode - emit each [path, value] leaf pair without loading the whole file
jq --stream 'select(length == 2) | .[1]' large.json
# Process JSON Lines
cat data.jsonl | jq -c 'select(.age > 30)'
# Convert array to JSON Lines
jq -c '.[]' large_array.json > data.jsonl
When to Use Streaming
| Scenario | Standard | Streaming |
|---|---|---|
| File under 100 MB | Preferred | Overkill |
| File 100 MB to 1 GB | Depends on RAM | Recommended |
| File over 1 GB | Not feasible | Required |
| HTTP response (large) | Risk timeout | Stream as received |
| Real-time data feed | Not applicable | Required |
| Simple one-time read | Preferred | Unnecessary |
Performance Comparison
Processing 1 million records (1 GB file):
| Approach | Memory Usage | Processing Time | Complexity |
|---|---|---|---|
| json.load() | 3-5 GB | 15 sec | Simple |
| ijson streaming | 50 MB | 45 sec | Moderate |
| JSON Lines | 10 MB | 12 sec | Simple |
| Chunked (1000) | 100 MB | 20 sec | Moderate |
Streaming uses far less memory but is slower for SAX-style parsing due to the event-driven overhead. JSON Lines is both fast and memory-efficient because each line is an independent parse operation.
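Exact figures depend on the data, hardware, and library versions, so treat the table as indicative. A sketch for checking the memory side on your own files (tracemalloc only tracks allocations made through Python's allocator, so it understates native buffers):

import json
import tracemalloc

import ijson

def peak_mb(fn):
    # Peak memory allocated through Python while fn() runs, in MB
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / (1024 * 1024)

def load_all():
    with open('large.json') as f:
        json.load(f)

def stream_all():
    with open('large.json', 'rb') as f:
        for _ in ijson.items(f, 'records.item'):
            pass

print(f"json.load peak: {peak_mb(load_all):.1f} MB")
print(f"ijson peak:     {peak_mb(stream_all):.1f} MB")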
Best Practices
- Use JSON Lines when possible: It is the simplest streaming format and works with standard tools.
- Set buffer sizes: Configure read buffers for optimal throughput (64 KB to 1 MB typically).
- Process in batches: Batch database inserts and API calls rather than processing one record at a time.
- Handle errors gracefully: In streaming, one malformed record should not crash the entire pipeline (see the sketch after this list).
- Monitor memory: Use profiling to verify that streaming is actually keeping memory bounded.
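For the error-handling point, a minimal sketch over JSON Lines that skips and counts malformed lines instead of aborting the run:

import json

def iter_records(path):
    bad = 0
    with open(path, encoding='utf-8') as f:
        for line_number, line in enumerate(f, 1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                bad += 1  # in a real pipeline, log line_number and the error
    if bad:
        print(f"Skipped {bad} malformed line(s)")

for record in iter_records('data.jsonl'):
    process(record)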
For formatting and validating smaller JSON files, our JSON Formatter handles documents up to 100 MB with real-time formatting.
FAQ
Can I use JSONPath with streaming parsers?
Some streaming libraries support path-based filtering. Python ijson.items supports filtering by path during streaming. However, complex JSONPath queries (filters, wildcards across levels) typically require the full document in memory. For path-based queries, see our JSONPath guide.
How do I convert a large JSON array to JSON Lines?
Use jq -c '.[]' input.json > output.jsonl for files that fit in memory. For truly large files, use a streaming converter: read the array with a streaming parser and write each element as a line.
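Such a streaming converter is a few lines with ijson; a sketch (the filenames are placeholders):

import json

import ijson

def array_to_jsonl(src, dst):
    # Read array elements one at a time and write each one as its own line.
    # default=float serializes the Decimal values ijson produces for
    # non-integer numbers, which json.dumps cannot handle on its own.
    with open(src, 'rb') as fin, open(dst, 'w', encoding='utf-8') as fout:
        for record in ijson.items(fin, 'item'):
            fout.write(json.dumps(record, default=float) + '\n')

array_to_jsonl('large_array.json', 'data.jsonl')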
Related Resources
- JSON Formatter: Format and validate JSON files
- JSONPath Query Guide: Extract data from JSON efficiently
- JSON Editor Tips: Work with large JSON documents