Storage Engines: How PostgreSQL and MongoDB Actually Write to Disk

MM
Moinuddin M Masud
20 min read

There is a class of production incident that every senior backend engineer eventually lives through.

The system is under load. Writes are piling up. Latency is spiking. You open your database dashboard and everything looks fine — connections are healthy, no errors, queries are completing. But something is clearly wrong. You start digging: replication lag, WAL archiving, checkpoint frequency, buffer pool hit rate. You are reaching for knobs you barely understand, hoping one of them explains what you are seeing.

The engineer who gets out of that incident quickly is not the one who has read the most runbooks. It is the one who has a clear mental model of how the database actually stores and retrieves data at the mechanical level. Once you understand what happens between the moment your application calls db.query() and the moment those bytes are durable on disk, a whole class of mysterious production behaviour stops being mysterious.

This article builds that mental model for both PostgreSQL and MongoDB. Not from a DBA perspective — from an application engineer's perspective. We will look at how each engine is designed, why those design choices create the performance characteristics you observe, and where the two systems make fundamentally different trade-offs.

By the end, you will understand why a write-heavy workload behaves differently in each system, what "durability" actually means at the byte level, and how storage engine internals should influence your architecture decisions.


Why Storage Engine Design Matters to Application Engineers

Most engineers think of the database as a black box. You write SQL or a query document, you get rows or documents back, the database handles the rest. This abstraction holds right up until it doesn't — and it tends to break at the worst possible moment, under load, in production.

Storage engine design determines:

  • How fast writes are under burst load — can the engine absorb a Kafka consumer spike without degrading reads?
  • How reads behave under concurrent writes — do readers block writers? Do writers block readers?
  • What happens during a crash — how much data can you lose, and how long does recovery take?
  • Why your disk usage grows faster than your actual data — dead tuples, fragmentation, write amplification
  • Why that query was fast yesterday and slow today — the planner changed its plan because storage layout changed

These are not DBA concerns. These are the concerns of any engineer who owns a service that talks to a database.

PostgreSQL and MongoDB are not just different query interfaces. They are fundamentally different answers to the question of how to store and retrieve data reliably. Understanding those answers makes you a better engineer with either system.


PostgreSQL: The Heap, WAL, and MVCC

PostgreSQL's storage model is built around three interconnected concepts: the heap (where your actual data lives), the Write-Ahead Log (how changes are made durable), and MVCC (how concurrent access is managed without locking readers out). These three things are inseparable. You cannot understand one without understanding how it relates to the other two.

The Heap: Where Your Data Actually Lives

In PostgreSQL, every table is stored as a heap file. This is not a data structure heap — the name comes from the fact that rows are stored in no particular order. The heap is a sequence of 8KB pages (configurable, but 8KB is the default and almost universally used). Each page holds multiple row versions, a page header, and a line pointer array that maps row offsets within the page.

When you insert a row, PostgreSQL writes it to the first page that has enough free space. When you update a row, PostgreSQL does not overwrite the existing row — it writes a new version of the row elsewhere in the heap and marks the old version as no longer visible to new transactions. When you delete a row, the row is not immediately removed — it is marked as deleted but the bytes remain on disk.

This is the foundation of PostgreSQL's MVCC model, and it is also the source of one of PostgreSQL's most notorious operational challenges: dead tuples.

Page layout (8KB)
┌─────────────────────────────────────────────────────┐
Page Header (24 bytes)│ ── lsn, checksum, flags, lower, upper, special      │
├─────────────────────────────────────────────────────┤
Line Pointer Array (4 bytes per row)│ ── [ptr1][ptr2][ptr3]...[ptrN]├─────────────────────────────────────────────────────┤
Free Space├─────────────────────────────────────────────────────┤
Row Data (grows downward from top of page)│ ── [row_N]...[row_3][row_2][row_1]└─────────────────────────────────────────────────────┘

Each row (PostgreSQL calls them tuples) carries system columns that most engineers never look at directly:

  • xmin — the transaction ID that created this tuple version
  • xmax — the transaction ID that deleted or superseded this tuple (zero if still live)
  • ctid — the physical location of the tuple: (page_number, slot_number)

When a transaction reads a row, PostgreSQL compares these values against its own transaction ID and a snapshot of which transactions were active when the read began. This is how two transactions can see different versions of the same row simultaneously, without either one blocking the other.

The practical implication for application engineers: UPDATE is more expensive in PostgreSQL than it looks. Every update writes a new tuple and leaves the old one behind. On a table that receives heavy updates, you can accumulate substantial dead tuple overhead that affects both storage space and query performance.

The Write-Ahead Log: Durability Before Durability

Before PostgreSQL writes anything to the heap, it writes a record to the Write-Ahead Log (WAL). The WAL is a sequential append-only log stored separately from the heap files. It records every change — INSERT, UPDATE, DELETE, even changes to internal structures like index pages — as a logical description of what was done.

The fundamental guarantee is: a transaction is only considered committed once its WAL record is flushed to disk. The actual heap pages can be in memory (in PostgreSQL's shared buffer pool) and not yet written to disk. If the system crashes, PostgreSQL can replay the WAL to reconstruct any changes that had been committed but not yet reflected in the heap files on disk.

Transaction lifecycle:

1. Application sends INSERT
2. PostgreSQL writes WAL record (sequential write to WAL file)
3. WAL record flushed to disk (fsync at commit)
4. Transaction acknowledged as committed to client
5. Heap page updated in shared buffers (may happen before or after step 3)
6. Background checkpoint writer eventually flushes dirty heap pages to disk

The sequential nature of WAL writes is important. Sequential I/O is dramatically faster than random I/O on both spinning disks and SSDs. PostgreSQL is designed so that the critical path — making a write durable — involves sequential I/O (WAL), while the random I/O (updating the heap) is deferred and batched through checkpoints.

In practice, WAL gives you fine-grained control over the durability vs performance trade-off:

  • synchronous_commit = on (default): WAL flushed to disk before acknowledging commit. Maximum durability.
  • synchronous_commit = off: WAL buffered; acknowledgment returns before flush. Risk of losing last ~200ms of commits on crash, but significantly higher write throughput.
  • synchronous_commit = remote_write: For replication — WAL sent to replica but not necessarily flushed there before commit.

This is not an exotic configuration. In systems where losing a handful of commits is acceptable — analytics event logging, metrics collection, certain audit trails — turning off synchronous commit can meaningfully increase write throughput without touching application code.

MVCC: Reading Without Blocking

PostgreSQL's Multi-Version Concurrency Control is the mechanism that allows multiple transactions to read and write simultaneously without blocking each other. Understanding it properly changes how you think about isolation levels and concurrent access.

When a transaction starts, PostgreSQL gives it a transaction ID (xid) and takes a snapshot — a record of which transactions were in progress at that moment. Every row read during the transaction is evaluated against this snapshot. A row version is visible to a transaction if:

  • Its xmin (the transaction that created it) was committed before the snapshot was taken
  • Its xmax (the transaction that deleted it) was either not yet committed when the snapshot was taken, or does not exist

This means readers never block writers and writers never block readers. Two transactions can query the same table simultaneously and each sees a consistent view of the data as of when their transaction started.

The consequence engineers need to understand: long-running transactions hold back VACUUM. VACUUM is the process that reclaims dead tuples. It cannot remove a dead tuple if any active transaction might still need to see it (because that transaction's snapshot predates the deletion). A long-running OLAP query running alongside OLTP workload can cause dead tuple accumulation that starts degrading write performance — because new writes have to skip over dead tuples to find free space, and eventually page bloat forces more random I/O for sequential scans.

This is a real production scenario. A reporting query running for 40 minutes against a busy e-commerce database can cause measurable write amplification on high-update tables.

Checkpoints: The Cost of Not Writing to Disk

PostgreSQL defers heap writes through its shared buffer pool and flushes dirty pages to disk at checkpoints. A checkpoint is a moment where PostgreSQL guarantees that all changes up to a certain WAL position are reflected in the heap files on disk. After a checkpoint, WAL records before that point are no longer needed for crash recovery.

Checkpoints have a cost: they involve writing potentially many dirty pages to disk, which creates I/O pressure. PostgreSQL spreads checkpoint writes over time (checkpoint_completion_target, typically 0.9 = spread over 90% of the interval between checkpoints) to avoid a spike.

The tuning parameters that matter:

  • checkpoint_timeout: maximum time between checkpoints (default 5 minutes)
  • max_wal_size: maximum WAL size before forcing a checkpoint (default 1GB)
  • checkpoint_completion_target: fraction of interval over which to spread writes

In write-heavy workloads, checkpoints can become a significant source of I/O latency. If you see periodic latency spikes in your metrics every few minutes, look at checkpoint frequency before you look at query plans.


MongoDB: WiredTiger, the Journal, and Document Storage

MongoDB's default storage engine since version 3.2 is WiredTiger. The shift from the original MMAPv1 engine to WiredTiger was one of the most consequential architectural changes in MongoDB's history, bringing document-level concurrency control, compression, and a fundamentally different approach to durability.

WiredTiger: B-Trees, Not Heaps

WiredTiger stores data in B-tree structures. Each collection has a B-tree, and each index is a separate B-tree. This is structurally different from PostgreSQL's heap model.

The B-tree organises documents by their _id field (or the indexed key for secondary indexes). Documents are stored in the leaf nodes of the tree. When you insert a document, it goes into the appropriate leaf node based on its _id. When you update a document, WiredTiger can update it in place — if the new version fits in the same space — or relocate it if the document has grown.

The key implication: in-place updates are possible in WiredTiger. Unlike PostgreSQL's heap, where an update always writes a new tuple, WiredTiger can modify an existing document in the B-tree node directly, provided the new version fits within the allocated space for that slot. This can mean lower write amplification for small field updates on documents that don't grow significantly.

WiredTiger uses MVCC at the storage engine level as well, but its implementation is different. WiredTiger maintains a transaction list and resolves visibility at read time — conceptually similar to PostgreSQL, but integrated directly into the B-tree structure rather than managed through heap tuple headers.

The WiredTiger Cache and Eviction

WiredTiger maintains its own in-memory cache, separate from the operating system's page cache. By default, this is configured to use 50% of available RAM minus 1GB (with a floor of 256MB). Data is served from this cache when available; when it is not, WiredTiger reads from disk into the cache, potentially evicting other pages to make room.

The eviction mechanism is a source of latency that surprises many engineers. When the cache is under pressure, WiredTiger performs eviction — writing dirty pages back to disk and freeing cache space. Eviction happens in background threads, but when the cache is critically full (above the configured eviction trigger), application threads themselves may be forced to participate in eviction, creating sudden latency spikes.

The practical signals:

  • wiredTiger.cache.pages evicted by application threads in db.serverStatus() — if this is non-zero and growing, your cache is too small or your working set exceeds memory
  • wiredTiger.cache.bytes dirty in cache estimate as a percentage of total cache — sustained values above 80% are a warning sign

In production, I have seen MongoDB deployments where the Atlas instance tier was chosen based on vCPU requirements, with memory as an afterthought. The working set exceeded the WiredTiger cache, eviction was constant, and read latency was 10× what it should have been for the query patterns involved. Upsizing the instance to double the RAM resolved the issue more effectively than any index change.

The Journal: MongoDB's WAL Equivalent

MongoDB uses a journal for durability — conceptually equivalent to PostgreSQL's WAL, but with some important differences in how it interacts with the storage engine.

WiredTiger writes changes to an in-memory write buffer. Every 100ms (configurable via storage.journal.commitIntervalMs), the journal is synced to disk. In the event of a crash, MongoDB replays the journal to restore any writes that were committed but not yet reflected in the collection data files.

The key difference from PostgreSQL: MongoDB's default journal commit interval means up to 100ms of committed writes could be lost on a hard crash, unless write concern j: true is specified, which forces journal sync before acknowledging the write.

// Write acknowledged only after journal sync — maximum durability
db.collection.insertOne(
  { field: "value" },
  { writeConcern: { w: 1, j: true } }
)

// Write acknowledged after in-memory write — up to 100ms loss possible
db.collection.insertOne(
  { field: "value" },
  { writeConcern: { w: 1, j: false } }
)

This is not a MongoDB weakness — it is a tunable trade-off. For high-throughput write paths where losing the last 100ms of data is acceptable (metrics ingestion, event logging), the default is fine. For financial or order data, j: true should be non-negotiable.

Document Storage: BSON and the Padding Factor

MongoDB stores documents as BSON (Binary JSON) — a binary encoding of JSON-like documents that adds support for types like dates, binary data, and ObjectIds that JSON does not natively support.

When a document is stored in a WiredTiger B-tree leaf node, it is stored as its BSON representation. WiredTiger historically used a padding factor to allocate slightly more space than the current document size, anticipating that documents would grow through updates. Modern WiredTiger uses a more sophisticated allocation strategy, but the fundamental challenge remains: documents that grow significantly on update create fragmentation.

When an updated document no longer fits in its allocated space in the B-tree, WiredTiger must relocate it. This creates a hole in the B-tree structure that may not be immediately reusable. Over time, on collections with heavy mixed updates and inserts, this fragmentation can cause collection size on disk to exceed the actual data size significantly.

The compact command reclaims this space, but it requires taking a collection offline (or accepting significant I/O load) and blocks reads and writes during execution in most configurations. This is analogous to PostgreSQL's VACUUM FULL — a necessary but expensive operation that cannot be run transparently. Unlike PostgreSQL's regular VACUUM, which runs continuously in the background and handles routine dead tuple cleanup, MongoDB has no fully equivalent background compaction process for collection-level fragmentation.


Head-to-Head: The Same Problems, Different Trade-offs

Understanding each system in isolation is useful. Understanding how they make different trade-offs on the same fundamental problems is what shapes good architecture decisions.

Write Durability

Both systems make writes durable through a sequential write-ahead mechanism (WAL in PostgreSQL, journal in WiredTiger). The difference is in the default guarantees:

PostgreSQLMongoDB
Default commit durabilityWAL flushed to disk (synchronous_commit=on)Journal synced every 100ms
Maximum potential data loss on crash (default)None (last committed transaction preserved)Up to 100ms of committed writes
Per-operation durability overridesynchronous_commit = off at session levelj: true or j: false in write concern
Durability confirmation to clientAfter WAL flush (default)After in-memory write (default)

For a payment processing system, PostgreSQL's default is safer out of the box. For a high-volume event pipeline, MongoDB's default may be more appropriate, with j: true applied selectively to critical writes.

Read Consistency Under Concurrent Writes

Both systems use MVCC to allow readers and writers to operate without blocking each other. The experience is similar from an application perspective — reads do not block writes, writes do not block reads.

The difference emerges under heavy write load:

  • PostgreSQL: Dead tuples accumulate. VACUUM runs continuously in the background but can fall behind. If VACUUM falls behind, bloat degrades query performance and storage efficiency. Long-running transactions make the problem worse.
  • MongoDB: WiredTiger's in-place updates (for documents that don't grow) create less dead data. But cache pressure during high write load can trigger eviction, which impacts read latency in ways that are harder to observe and reason about than PostgreSQL's dead tuple metrics.

Update Behaviour

This is the most significant architectural difference for write-heavy workloads:

PostgreSQL always writes a new tuple version on UPDATE. The old version remains until VACUUM reclaims it. This means:

  • Every update touches at least one index (ctid must be updated in all indexes that cover the row)
  • Exception: Heap Only Tuple (HOT) updates — if the update doesn't change any indexed column and there's space on the same page, PostgreSQL can update in place and avoid touching indexes. HOT updates are one of the least-known PostgreSQL optimisations.

WiredTiger can update in place. If a document update doesn't change indexed fields and the document doesn't grow beyond its allocated space, the update can be done with minimal write amplification. But this optimisation is document-size-sensitive — a document that consistently grows on update (appending to an array, incrementing a counter, adding fields) will be relocated frequently.

In practice: for workloads with heavy, small-field updates on fixed-schema data, MongoDB with a well-designed document model can have lower write amplification than PostgreSQL. For workloads with large batch updates across many rows, PostgreSQL's HOT update behaviour combined with careful schema design often outperforms.

Crash Recovery

Both systems guarantee crash recovery through their sequential logs.

PostgreSQL's recovery is WAL replay. At startup after a crash, PostgreSQL replays WAL records from the last checkpoint forward, restoring the heap to a consistent state. Recovery time is proportional to the amount of WAL accumulated since the last checkpoint — a reason to tune checkpoint frequency on write-heavy systems.

MongoDB's recovery is journal replay. WiredTiger reads the journal on startup and applies any committed but not yet persisted changes. Recovery is typically fast — the journal covers only the last commit interval (100ms by default).

Both have replica-set and streaming replication capabilities that make crash recovery on a secondary much faster in practice — the primary's standby is already near-current and can be promoted immediately.


What This Means for Your Architecture Decisions

Understanding storage internals at this level is not academic. It directly influences decisions you make every week.

Choosing Between PostgreSQL and MongoDB for a New Service

The storage model should inform the decision, not be an afterthought. Ask:

  • Will documents grow on update? Frequently growing documents mean WiredTiger relocation overhead. If your documents are mostly fixed-size with small field changes, MongoDB's in-place update model is efficient. If you are appending to arrays or adding fields frequently, consider whether the data is better modelled as rows in PostgreSQL.
  • What are your durability requirements? If every write must be acknowledged only after it is truly durable, PostgreSQL's default behaviour is simpler to reason about. MongoDB requires explicit j: true configuration to match this.
  • What is your read/write ratio and pattern? MongoDB's WiredTiger cache model performs well when the working set fits in memory. If your read pattern is unpredictable (full collection scans, ad hoc analytics), PostgreSQL's buffer pool combined with OS page cache often handles mixed workloads more gracefully.

Designing for Write-Heavy Workloads

In PostgreSQL: avoid indexes on columns that are updated frequently. Every index covering an updated column requires a write to that index on every update (unless HOT applies). A table with seven indexes on a high-update workload is paying 8× the write cost (heap + 7 indexes) per update.

In MongoDB: avoid embedding large arrays that grow unboundedly. A document that grows from 1KB to 50KB through array appends will be relocated in the B-tree repeatedly. Consider a separate collection for the growing sub-data, linked by reference, to keep documents stable in size.

Operating Under Load

When PostgreSQL write latency suddenly increases under sustained load, the first things to check are:

  1. pg_stat_bgwriter — are checkpoints happening too frequently?
  2. pg_stat_user_tables — dead tuple ratio on high-update tables
  3. WAL generation rate — are you generating more WAL than your disk can absorb sequentially?

When MongoDB write latency spikes under load:

  1. wiredTiger.cache.pages evicted by application threads — application threads evicting cache
  2. Journal sync latency — wiredTiger.log.log sync time duration
  3. Oplog window — on replica sets, is replication lag growing, indicating write saturation?

These are not obscure metrics. They are the first things you look at when storage engine behaviour is manifesting as application latency.


Conclusion

The storage engine is not invisible plumbing. It is the load-bearing structure of every database-backed system you build or operate. The engineers who treat it as a black box are the ones who spend hours in production incidents reaching for configuration knobs they do not understand.

PostgreSQL's heap + WAL + MVCC model is designed around the assumption that reads and writes are frequent and concurrent, that durability must be guaranteed by default, and that the operational cost of dead tuple cleanup is worth paying for the predictability of the model. WiredTiger's B-tree + journal model optimises for in-place updates and flexible document growth, with durability that is tunable per-operation.

Neither model is universally superior. Both reward engineers who understand them.

The next article in this series covers indexing internals — how B-trees, hash indexes, and MongoDB's compound indexes actually work at the page level, and why the same query can be fast or catastrophically slow depending on which index the engine chooses.


This is Article 1 of Series 1: How Databases Actually Work. The series covers storage engines, indexes, query lifecycle, transactions, and concurrency — from an application engineer's perspective, grounded in production experience with both PostgreSQL and MongoDB.