Building pg2iceberg

For the past month, I’ve been working on an open source (MIT licensed) tool designed specifically for mirroring data from Postgres to Iceberg, called pg2iceberg (yes, I suck at naming things; I might rename it later).

So far it has been a fun journey, and I even decided to launch a cloud version of it called Polynya (it’s an iceberg-adjacent noun I randomly picked). It’s currently in early access, so please try it out and let me know how it goes.

Why make another data replication tool?

If you look around, there are already numerous tools in this space. Here are some I can think of off the top of my head:

So why create another one?

I’m creating another data replication tool because the exact tool I’m looking for doesn’t exist yet:

  1. First class support for Postgres
  2. First class support for streaming writes to Iceberg
  3. Zero infra dependencies beyond Postgres and Iceberg (which includes a catalog and cloud storage)
  4. Open source (doesn’t have to be OSI)
  5. Designed specifically for mirroring data
  6. Not abandoned

Moonlink was probably the closest to what I want. But alas, since Databricks acquired Mooncake Labs, it’s pretty much dead.

Mirroring tool vs CDC tool

Requirement #5 is important. I specifically chose the word “mirror” here because I think it best captures the intent: I want a tool that is designed from the ground up to mirror data, i.e. given source A and target B, if you take a point in time snapshot of A, and the equivalent point in time snapshot of B, both must have the same data.

I haven’t checked thoroughly, but I believe most of the tools I listed above are not true “mirrors”, because they allow the initial snapshot to be skipped. I want a tool that lets me specify a Postgres table and the target Iceberg table, then takes care of the entire lifecycle of mirroring while guaranteeing that no data is lost.
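The mirror invariant can be sketched in a few lines of Go. This is purely illustrative, not part of pg2iceberg: rows are modeled as strings, and two point-in-time snapshots count as mirrors when they contain exactly the same multiset of rows.

```go
package main

// rowsEqual is a hypothetical helper: it compares two snapshots as
// multisets of rows, ignoring order. If it returns false for a source
// snapshot and the equivalent target snapshot, the mirror is broken.
func rowsEqual(source, target []string) bool {
	if len(source) != len(target) {
		return false
	}
	count := map[string]int{}
	for _, row := range source {
		count[row]++ // count rows in the source snapshot
	}
	for _, row := range target {
		count[row]-- // cancel them out against the target snapshot
	}
	for _, c := range count {
		if c != 0 { // any leftover means the snapshots differ
			return false
		}
	}
	return true
}
```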

The architecture of pg2iceberg

Single binary with no dependencies

pg2iceberg is designed to be easy to operate. It’s a single binary, and nothing else is needed besides the source Postgres and the target Iceberg + S3. It uses the source database to store the checkpoint state.

Staging -> Materializer

Because of how Postgres logical replication works, it is super important to advance confirmed_flush_lsn as soon as possible; otherwise we risk blowing up the source database with retained WAL. To minimize this risk, pg2iceberg uses a 2-tiered pipeline: it first stages the change events into a leaderless durable log (using the source database itself as coordinator) on top of S3, then merges them into the Iceberg table downstream.
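A minimal sketch of the staging tier, with an in-memory stand-in for S3. All names here (Event, ObjectStore, stageBatch) are hypothetical, not the actual pg2iceberg API; the point is that once a batch is durably staged, its highest LSN can immediately be reported back to Postgres as confirmed_flush_lsn, without waiting for the slower Iceberg materialization downstream.

```go
package main

import "fmt"

// Event is one change event from the logical replication stream.
type Event struct {
	LSN  uint64 // Postgres WAL position of this change
	Data []byte // serialized change event
}

// ObjectStore abstracts the S3 bucket backing the staging log.
type ObjectStore interface {
	Put(key string, value []byte) error
}

// memStore is an in-memory stand-in for S3, for illustration only.
type memStore struct{ objects map[string][]byte }

func (m *memStore) Put(key string, value []byte) error {
	m.objects[key] = value
	return nil
}

// stageBatch durably writes a batch of change events to the staging
// log and returns the highest LSN that is now safe to confirm back
// to Postgres, letting WAL be reclaimed early.
func stageBatch(store ObjectStore, events []Event) (uint64, error) {
	var maxLSN uint64
	var buf []byte
	for _, ev := range events {
		buf = append(buf, ev.Data...)
		if ev.LSN > maxLSN {
			maxLSN = ev.LSN
		}
	}
	key := fmt.Sprintf("staging/%016x", maxLSN)
	if err := store.Put(key, buf); err != nil {
		return 0, err
	}
	return maxLSN, nil
}
```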

While technically an additional infra dependency, Iceberg already comes with S3, so pg2iceberg can just piggyback on this S3 for staging the change events. The alternative is to stage the events in Iceberg (pg2iceberg did exactly this prior to v0.6.0), but it’s a lot slower due to metadata writes and catalog commits.

Single writer assumption and compaction

pg2iceberg operates under the assumption that it is the only writer to the Iceberg tables (this includes compaction, but not table maintenance). By making this assumption, pg2iceberg is able to eliminate most catalog reads: since pg2iceberg is the only one writing to the Iceberg table, it can simply cache the table metadata from its last write in memory.
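Here is a sketch of what that caching looks like, under the stated single-writer assumption. Catalog, TableMetadata, and cachedWriter are illustrative names, not pg2iceberg's actual types: only the very first commit loads the table from the catalog; every later commit reuses the metadata cached from the writer's own previous commit.

```go
package main

// TableMetadata stands in for the Iceberg table metadata
// (schema, snapshots, manifest list, ...).
type TableMetadata struct {
	SnapshotID int64
}

// Catalog is a minimal stand-in for an Iceberg catalog client,
// modeled as function fields so it can be faked with closures.
type Catalog struct {
	LoadTable   func(name string) (TableMetadata, error)
	CommitTable func(name string, meta TableMetadata) error
}

type cachedWriter struct {
	catalog Catalog
	name    string
	meta    *TableMetadata // nil until the first commit
}

// commit reads the catalog at most once: because this process is the
// only writer, the metadata from its own last commit is always the
// current table state, so later commits skip the catalog read.
func (w *cachedWriter) commit(next TableMetadata) error {
	if w.meta == nil {
		m, err := w.catalog.LoadTable(w.name)
		if err != nil {
			return err
		}
		w.meta = &m
	}
	if err := w.catalog.CommitTable(w.name, next); err != nil {
		return err
	}
	w.meta = &next // cache our own write; no read-back needed
	return nil
}
```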

To maintain this assumption, pg2iceberg performs compaction inline on the materializer thread. Compaction is important to keep the small-files problem under control. Once a certain threshold is met, a compaction cycle is executed in between materializer cycles.
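A sketch of what such a trigger might look like; the field names and the shape of the policy are assumptions for illustration, not pg2iceberg's actual configuration. Running the check (and the compaction itself) on the materializer thread preserves the single-writer assumption, since no second process ever commits to the table.

```go
package main

// compactionPolicy is a hypothetical threshold configuration.
type compactionPolicy struct {
	SmallFileBytes int64 // data files below this size count as "small"
	MaxSmallFiles  int   // trigger once this many small files accumulate
}

// shouldCompact is evaluated between materializer cycles: when enough
// small files have accumulated, a compaction cycle runs inline before
// the next materialization.
func (p compactionPolicy) shouldCompact(fileSizes []int64) bool {
	small := 0
	for _, size := range fileSizes {
		if size < p.SmallFileBytes {
			small++
		}
	}
	return small >= p.MaxSmallFiles
}
```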

Standalone vs distributed mode

Thanks to the leaderless durable log and the 2-tiered pipeline, pg2iceberg can be operated in a distributed mode, where you run 1 instance of pg2iceberg for the staging pipeline and N instances for the table materialization pipeline. pg2iceberg will automatically distribute the materializer tasks across the workers.
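To make the idea concrete, here is one simple way a deterministic table-to-worker assignment could look. This is not pg2iceberg's actual mechanism (the real protocol coordinates through the source database); plain hashing is shown only to illustrate that every worker can independently agree on who owns which table.

```go
package main

import "hash/fnv"

// assignWorker is a hypothetical helper: it maps a table name to one
// of N materializer workers. Because the hash is deterministic, all
// workers compute the same assignment without talking to each other.
func assignWorker(table string, workers int) int {
	h := fnv.New32a()
	h.Write([]byte(table))
	return int(h.Sum32() % uint32(workers))
}
```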

Honestly, this is probably overkill for most use cases out there. It’s provided out of the box because it comes for free with the leaderless log protocol, and it might be useful for the 1% of cases where you have a huge table with high write throughput. The standalone mode of running 1 instance of pg2iceberg per Postgres instance/cluster and scaling vertically is good enough for the remaining 99% of use cases.

Fail fast on potential data loss

Because pg2iceberg is designed to be a mirror, not a general purpose CDC tool, it favors correctness over performance. pg2iceberg fails fast when it gets into a state where data loss is possible. Stale data is better than incomplete data.

Example situations:

  • The checkpoint state is corrupted or tampered with
  • The checkpoint state does not match the external state (e.g. this can happen when you create a logical replica from a snapshot)
  • The replication slot is in an invalid state (e.g. if max_slot_wal_keep_size is set and the replication slot has fallen too far behind)

What’s next?

The short-term goal is to get pg2iceberg to v1.0.0 release. A v1.0.0 release means:

  1. Strong guarantee of no data loss
  2. Stable and backward compatible interfaces, which include the CLI API, the _pg2iceberg state, and the meta tables schema
  3. Benchmarks to establish baseline performance
  4. Catalog integration tests against all supported catalogs to ensure no regression

Honestly, requirement #1 is difficult to achieve in its current state. I might consider rewriting pg2iceberg in Rust, purely because Rust has well-known support for deterministic simulation testing (it’s impossible with Golang): with proper tooling, it’s possible to make the entire program 100% deterministic for a given seed.

Stay tuned.