JSON • 2025-06-02 • 8 min • alltools.one Team
Tags: JSON, Streaming, Performance, Large Files, Memory

JSON Streaming for Large Files: Process Without Loading All

Standard JSON parsing loads the entire document into memory, builds a complete data structure, then gives you access. For a 10 MB file, that works fine. For a 10 GB file, your process runs out of memory and crashes. Streaming parsers solve this by processing JSON incrementally β€” reading and handling data as it arrives, without ever holding the entire document in memory.

The Problem with Standard Parsing

import json

# This loads the ENTIRE file into memory
with open('huge.json') as f:
    data = json.load(f)  # 10 GB file = 10+ GB of RAM

# Processing happens after full load
for item in data['records']:
    process(item)

For a file with 1 million records at 10 KB each, standard parsing needs:

  • File size: ~10 GB
  • Memory for parsing: ~10 GB (the raw string)
  • Memory for data structure: ~15-20 GB (Python objects are larger than raw JSON)
  • Total: ~25-30 GB of RAM

Streaming reduces this to megabytes.
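
You can sanity-check these estimates on your own data with Python's built-in tracemalloc module. A minimal sketch, assuming a hypothetical sample.json that is small enough to load fully; tracemalloc only tracks Python-level allocations, so treat the result as an approximation:

import json
import tracemalloc

# Measure peak Python memory while fully loading a JSON file.
tracemalloc.start()
with open('sample.json') as f:
    data = json.load(f)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Peak memory during json.load(): {peak / 1_000_000:.1f} MB")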

Streaming Approaches

SAX-Style (Event-Based)

The parser emits events as it encounters JSON tokens:

import ijson

# Process items one at a time - constant memory usage
with open('huge.json', 'rb') as f:
    for record in ijson.items(f, 'records.item'):
        process(record)  # Each record is parsed individually
        # Previous records are garbage collected

Events include: start_map, map_key, end_map, start_array, end_array, string, number, boolean, null.
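
To see these events directly, ijson also exposes a lower-level basic_parse generator that yields one (event, value) pair per token. A short sketch, reusing the hypothetical huge.json from the example above:

import ijson

# Walk the raw token stream: one (event, value) pair per JSON token,
# so memory stays flat regardless of document size.
with open('huge.json', 'rb') as f:
    for event, value in ijson.basic_parse(f):
        if event == 'map_key':
            print('key:', value)
        elif event == 'number':
            print('number:', value)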

JSON Lines (JSONL / NDJSON)

A simpler approach: one JSON object per line. Each line is a complete, valid JSON document:

{"id": 1, "name": "Alice", "email": "alice@example.com"}
{"id": 2, "name": "Bob", "email": "bob@example.com"}
{"id": 3, "name": "Charlie", "email": "charlie@example.com"}

Processing is trivial β€” read line by line:

with open('data.jsonl') as f:
    for line in f:
        record = json.loads(line)
        process(record)

Advantages of JSON Lines:

  • Each line is independently parseable, which enables parallel processing (see the sketch after this list)
  • Append-friendly (just add a new line)
  • Works with standard Unix tools (grep, wc, head, tail)
  • Natural for log files and streaming data
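
Because every line stands alone, a JSON Lines file can be fanned out to worker processes with no coordination between them. A minimal sketch using the standard library's multiprocessing module, assuming a data.jsonl file and a hypothetical handle_line function:

import json
from multiprocessing import Pool

def handle_line(line):
    # Parse and process a single record; lines are independent,
    # so workers never need to see each other's data.
    record = json.loads(line)
    return record.get('id')  # placeholder for real work

if __name__ == '__main__':
    with open('data.jsonl') as f, Pool() as pool:
        # imap feeds lines to workers lazily instead of reading the whole file
        for result in pool.imap(handle_line, f, chunksize=1000):
            pass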

Chunked Processing

For standard JSON arrays, split processing into chunks:

import ijson

def process_in_chunks(filename, chunk_size=1000):
    chunk = []
    with open(filename, 'rb') as f:
        for record in ijson.items(f, 'item'):
            chunk.append(record)
            if len(chunk) >= chunk_size:
                process_batch(chunk)
                chunk = []
    if chunk:
        process_batch(chunk)
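
The process_batch function above is left to the caller. One plausible implementation, sketched here with sqlite3 and assuming records shaped like the earlier examples (an id and a name field), writes each chunk in a single transaction instead of issuing one INSERT per record:

import sqlite3

def process_batch(chunk):
    # Hypothetical handler: insert a whole chunk in one transaction.
    conn = sqlite3.connect('records.db')
    with conn:
        conn.execute('CREATE TABLE IF NOT EXISTS records (id INTEGER, name TEXT)')
        conn.executemany(
            'INSERT INTO records (id, name) VALUES (?, ?)',
            [(int(r['id']), r['name']) for r in chunk],
        )
    conn.close()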

Language Implementations

Python (ijson)

import ijson

# Stream from file
with open('large.json', 'rb') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix == 'records.item.name':
            print(value)

# Stream from HTTP response
import urllib.request
response = urllib.request.urlopen('https://api.example.com/data')
for record in ijson.items(response, 'records.item'):
    process(record)

JavaScript (Node.js)

const { createReadStream } = require('fs');
const { parser } = require('stream-json');
const { streamArray } = require('stream-json/streamers/StreamArray');

const pipeline = createReadStream('large.json')
  .pipe(parser())
  .pipe(streamArray());

pipeline.on('data', ({ value }) => {
  handleRecord(value);  // placeholder handler; avoid naming it `process`, which is Node's global object
});

pipeline.on('end', () => {
  console.log('Done processing');
});

Command Line (jq)

# Stream mode - emits [path, value] pairs; this prints every leaf value
jq --stream 'select(length == 2) | .[1]' large.json

# Process JSON Lines
cat data.jsonl | jq -c 'select(.age > 30)'

# Convert array to JSON Lines
jq -c '.[]' large_array.json > data.jsonl

When to Use Streaming

Scenario               | Standard        | Streaming
File under 100 MB      | Preferred       | Overkill
File 100 MB to 1 GB    | Depends on RAM  | Recommended
File over 1 GB         | Not feasible    | Required
HTTP response (large)  | Risk timeout    | Stream as received
Real-time data feed    | Not applicable  | Required
Simple one-time read   | Preferred       | Unnecessary

Performance Comparison

Processing 1 million records (1 GB file):

Approach         | Memory Usage | Processing Time | Complexity
json.load()      | 3-5 GB       | 15 sec          | Simple
ijson streaming  | 50 MB        | 45 sec          | Moderate
JSON Lines       | 10 MB        | 12 sec          | Simple
Chunked (1000)   | 100 MB       | 20 sec          | Moderate

Streaming uses far less memory but is slower for SAX-style parsing due to the event-driven overhead. JSON Lines is both fast and memory-efficient because each line is an independent parse operation.
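
These numbers depend heavily on hardware, record shape, and the ijson backend in use, so measure on your own data. A minimal timing harness, assuming a large.json file wrapped as {"records": [...]} like the earlier examples:

import json
import time
import ijson

def timed(label, fn):
    # Wall-clock timing for one parsing strategy.
    start = time.perf_counter()
    count = fn()
    print(f'{label}: {count} records in {time.perf_counter() - start:.1f}s')

def load_all():
    with open('large.json') as f:
        return len(json.load(f)['records'])

def stream_items():
    with open('large.json', 'rb') as f:
        return sum(1 for _ in ijson.items(f, 'records.item'))

timed('json.load()', load_all)
timed('ijson streaming', stream_items)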

Best Practices

  1. Use JSON Lines when possible: It is the simplest streaming format and works with standard tools.
  2. Set buffer sizes: Configure read buffers for optimal throughput (64 KB to 1 MB typically).
  3. Process in batches: Batch database inserts and API calls rather than processing one record at a time.
  4. Handle errors gracefully: In streaming, one malformed record should not crash the entire pipeline (see the sketch after this list).
  5. Monitor memory: Use profiling to verify that streaming is actually keeping memory bounded.
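
For point 4, the key is to confine a parse failure to the record that caused it. A minimal sketch for JSON Lines input, assuming a data.jsonl file; malformed lines are logged and skipped instead of raised:

import json
import logging

def stream_jsonl(path):
    # Yield parsed records, skipping malformed lines instead of aborting.
    skipped = 0
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                skipped += 1
                logging.warning('Skipping malformed line %d: %s', line_no, exc)
    if skipped:
        logging.info('Skipped %d malformed lines in %s', skipped, path)

for record in stream_jsonl('data.jsonl'):
    pass  # process(record)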

For formatting and validating smaller JSON files, our JSON Formatter handles documents up to 100 MB with real-time formatting.

FAQ

Can I use JSONPath with streaming parsers?

Some streaming libraries support path-based filtering. Python ijson.items supports filtering by path during streaming. However, complex JSONPath queries (filters, wildcards across levels) typically require the full document in memory. For path-based queries, see our JSONPath guide.

How do I convert a large JSON array to JSON Lines?

Use jq -c '.[]' input.json > output.jsonl for files that fit in memory. For truly large files, use a streaming converter: read the array with a streaming parser and write each element as a line.
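
A streaming converter takes only a few lines with ijson: read array elements one at a time and write each as its own line. A sketch, assuming the array sits at the document root (pass a prefix such as 'records.item' if it is nested) and using ijson's use_float option, available in recent versions, so numbers serialize cleanly with json.dumps:

import json
import ijson

def array_to_jsonl(src_path, dst_path, prefix='item'):
    # Stream array elements and write one JSON document per line.
    with open(src_path, 'rb') as src, open(dst_path, 'w') as dst:
        for element in ijson.items(src, prefix, use_float=True):
            dst.write(json.dumps(element) + '\n')

array_to_jsonl('large_array.json', 'data.jsonl')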


Published on 2025-06-02