
Data Gateways: Turning Third-Party APIs into a System of Record

Research Blog

We talk a lot in software about owning your data. The harder problem, in practice, is owning your integration with someone else's data. Every third-party API comes with its own authentication scheme, rate limits, pagination quirks, and an undocumented deletion model that you will discover at the worst possible time. Multiply that by five or ten integrations and you have not just complexity — you have fragility distributed across every service that happens to need external data.

Over the past year we have converged on a pattern we call a Data Gateway. It is not a grand architecture. It is a small, focused service with one job: poll a third-party API on a schedule, and publish everything it gets as messages to Apache Pulsar. That is the whole pattern. What follows from it, though, has meaningfully changed how we build systems.


The Core Insight

REST APIs are pull-based and stateless by design. Your application has to ask for the data, every time, and the API decides what you get and when. This works fine for a single consumer making occasional requests. It falls apart when you have multiple services that need the same data, or when you need to answer questions about the past, or when an upstream API changes and you need to shield your consumers from that change.

The insight behind Data Gateways is straightforward: if you treat the third-party API as an upstream event source and publish its output to a durable log, you can give every downstream consumer the properties they would get from a proper event-sourced system — replayability, fan-out, decoupling from upstream auth and rate limits — even when the upstream has none of those properties itself.

Pulsar is our log of choice. Its topic model, durable storage, and consumer-group semantics make it well suited to this role. The gateway publishes once; consumers subscribe independently and at their own pace.


The Pattern in Practice

Our first Data Gateway, whoop-sync, targets health data: it polls the WHOOP v2 fitness API every 60 minutes. It collects sleep sessions, recovery scores, workouts, physiological cycles, and body measurements, and publishes each to a corresponding Pulsar topic under persistent://chrisdobson/whoop/*. Every message uses a consistent envelope:

{ "type": "sleep", "synced_at": "2026-03-06T07:00:00Z", "data": { ... } }
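On the producer side, wrapping an upstream record in this envelope is a one-liner. A minimal sketch (the function name and the example payload are illustrative, not whoop-sync's actual code):

```python
import json
from datetime import datetime, timezone

def make_envelope(record_type: str, payload: dict) -> bytes:
    """Wrap an upstream API record in the gateway's message envelope."""
    envelope = {
        "type": record_type,                                  # e.g. "sleep", "recovery"
        "synced_at": datetime.now(timezone.utc).isoformat(),  # when the gateway observed it
        "data": payload,                                      # upstream record, passed through
    }
    return json.dumps(envelope).encode("utf-8")  # Pulsar producers send bytes

# e.g. producer.send(make_envelope("sleep", sleep_record))
```

The gateway stamps synced_at itself, so consumers always know when a record was observed even if the upstream payload carries no timestamp.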

When WHOOP notifies us via webhook that a record has been updated or deleted, the gateway publishes a message with "type": "sleep.deleted" (or .updated) to the same topic. This matters because REST APIs almost universally under-represent deletions. A polling-only integration will silently accumulate stale records. By making deletions first-class messages, every consumer can maintain a consistent view of the world.
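Because deletes and updates arrive on the same topic as creates, a consumer's materialised view stays correct by applying one simple rule per message. A sketch of that rule, assuming record ids live at data["id"] (a hypothetical layout, not the actual WHOOP schema):

```python
import json

def apply_message(store: dict, raw: bytes) -> None:
    """Apply one gateway message to a local materialised view.

    A plain type like "sleep" (or "sleep.updated") upserts a record;
    "sleep.deleted" removes it, so the view never accumulates stale rows.
    """
    msg = json.loads(raw)
    kind, _, action = msg["type"].partition(".")
    key = (kind, msg["data"]["id"])
    if action == "deleted":
        store.pop(key, None)        # idempotent: tolerates replayed deletes
    else:
        store[key] = msg["data"]    # create and update are both upserts
```

Making every handler an idempotent upsert or delete is what lets a consumer replay a topic from the beginning without special-casing anything.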

A companion CLI tool, whoop-today, reads back the day's messages directly from Pulsar. The same source, the same topic, the same envelope — no separate database query needed.

Our older MLS gateway follows the same structure. It polls a real estate Multiple Listing Service API and publishes property listing events to Pulsar. Downstream applications that need live listing feeds subscribe to the topic without knowing anything about the MLS authentication model or its pagination behaviour. We wrote that integration once, in one place, and it has stayed there.


Why This Holds Up

A few of the benefits are obvious upfront. The rest reveal themselves over time.

Decoupling. The gateway absorbs the upstream API's auth tokens, rate-limit backoff, schema quirks, and error handling. Downstream consumers see a clean, normalised message. When WHOOP changes an endpoint, we update one service, not five.

Single source of truth. Without a gateway, every service that needs WHOOP data would poll WHOOP directly, each burning rate-limit quota, each implementing its own auth refresh, each with subtly different error handling. A single gateway owns the integration and the rest of the system inherits that correctness.

Fan-out. Pulsar's subscription model lets multiple consumers read the same topic independently. whoop-sync publishes a recovery score once. A dashboard consumer reads it. A future ML pipeline reads it. Neither knows the other exists, and neither slows the other down.

Replayability. Pulsar retains messages durably. If a consumer has a bug and processes records incorrectly, or simply goes offline for a week, it replays from the earliest retained message and catches up. No data is lost simply because no consumer happened to be watching. This is the property that changes how confident you feel deploying consumers.
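Fan-out and replayability both fall out of the same structure: one append-only log, with each subscription keeping its own cursor into it. A toy model of that structure (not the Pulsar client API, just the shape of the semantics):

```python
class Topic:
    """Toy model of a Pulsar topic: one append-only log, many cursors."""

    def __init__(self):
        self.log = []      # messages, retained durably
        self.cursors = {}  # subscription name -> next position to read

    def publish(self, msg):
        self.log.append(msg)

    def read(self, subscription):
        """Return everything this subscription has not yet seen.

        Each subscription advances independently; a new subscriber
        starts at position 0 and replays the full retained history.
        """
        pos = self.cursors.get(subscription, 0)
        batch = self.log[pos:]
        self.cursors[subscription] = len(self.log)
        return batch
```

A dashboard subscription and an ML-pipeline subscription read the same log without coordinating, and a subscription created months later still sees every retained message.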

Event sourcing semantics. This is the benefit that compounds the most over time, so it is worth dwelling on.

In a traditional integration, your database is the system of record. The data in it reflects the last known state of the upstream API, and that is all you have. If you need to answer a question about the past — what did this user's recovery score look like three weeks ago, before we changed the sync logic? — you either have it in the database or you do not. If a consumer had a bug and wrote incorrect data for two weeks, you have a problem that requires either a backfill from the upstream API (if it supports that) or manual correction.

With a Data Gateway, the Pulsar topic is the system of record. The database is just a materialised view of it. This reframes the problem entirely.

Want to bootstrap a new consumer with six months of historical WHOOP data? Do not write a backfill script. Subscribe at position earliest and replay. Want to recover a corrupted database? Drop it, replay, and you are back. Want to add a new derived metric — say, a rolling 7-day HRV average — retroactively applied to all historical data? Write the consumer logic, replay from the beginning, and the full history is computed correctly. None of these operations require touching the upstream API again.
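The rolling 7-day HRV example is worth making concrete. A sketch of a consumer that replays recovery messages oldest-first and computes the metric over the full history (the fields data["hrv"] and data["day"] are illustrative, not the actual WHOOP schema):

```python
import json
from collections import deque

def rolling_hrv(messages, window=7):
    """Replay recovery messages oldest-first and emit a rolling
    window-day HRV average for each day observed."""
    recent = deque(maxlen=window)  # the last `window` daily HRV values
    out = []
    for raw in messages:
        msg = json.loads(raw)
        if msg["type"] != "recovery":
            continue               # ignore sleep, workout, etc. on replay
        recent.append(msg["data"]["hrv"])
        out.append((msg["data"]["day"], sum(recent) / len(recent)))
    return out
```

The consumer logic is ordinary batch code; "retroactively applied to all history" is just a matter of where the subscription starts reading.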

This also changes your relationship with schema changes. When you decide to restructure how you store sleep data, you are not migrating a database — you are rewriting a consumer and replaying. The source of truth did not change; only the projection did.

The other underappreciated property is auditability. Every message in the topic is a timestamped, immutable record of what the gateway observed at that moment. You can answer questions like: what did Plaid report for this transaction when it was first synced, before it was later amended? That kind of audit trail is nearly impossible to reconstruct from a mutable database. In Pulsar, it is just a topic seek.

Schema evolution. The gateway normalises the upstream data shape and can version the envelope. Consumers do not need to track upstream API changelogs. That is the gateway's problem.
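Versioning the envelope keeps old retained messages readable forever. One way a consumer might normalise across envelope versions on read (the "v" field and the v1 layout here are hypothetical, shown only to illustrate the mechanism):

```python
import json

def parse_envelope(raw: bytes) -> dict:
    """Normalise any retained envelope version to the latest shape.

    Hypothetical history: v1 envelopes carried the record under
    "payload"; v2 renamed it to "data" and added an explicit "v" field.
    """
    msg = json.loads(raw)
    if msg.get("v", 1) == 1:
        msg["data"] = msg.pop("payload", msg.get("data"))  # upgrade in place
        msg["v"] = 2
    return msg
```

Upgrading on read means a replay from earliest works even when the topic holds a mix of envelope versions.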


Trade-offs Worth Naming

This pattern adds infrastructure. You are running Pulsar, you are running gateway services, and you have an additional hop between the upstream API and any consumer. For a simple use case with one consumer and no history requirements, that overhead is not justified. Data Gateways earn their keep when you have multiple consumers, when replayability matters, or when upstream API fragility has already caused you pain.

The polling interval is also a deliberate choice. Sixty minutes works for health data that accumulates across a day. For low-latency use cases we supplement polling with webhooks, as whoop-sync does for deletes.
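Stripped to its core, a gateway's main loop is small. A sketch under stated assumptions (fetch and publish are injected stand-ins for the upstream API pull and the Pulsar producer; the real service also handles auth refresh, rate-limit backoff, and webhook callbacks):

```python
import time

def run_gateway(fetch, publish, interval_s=3600, cycles=None, sleep=time.sleep):
    """Poll the upstream API once per interval and publish every record.

    `cycles` bounds the loop for testing; None means run forever.
    """
    n = 0
    while cycles is None or n < cycles:
        for record in fetch():   # one page-walked pull from the upstream API
            publish(record)      # one Pulsar message per record
        n += 1
        if cycles is None or n < cycles:
            sleep(interval_s)
```

Injecting the sleep function keeps the loop testable; in production it is just time.sleep with a 3600-second interval.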


Where This Is Going

We are building a Plaid gateway next. Plaid provides access to bank transaction data, account balances, and statements. The downstream use case is direct: transaction events from Pulsar will drive automatic creation of Payment Entries in ERPNext, closing the loop between the bank and the books without manual data entry. The gateway pattern is unchanged. Only the upstream API and the consumer logic differ.

As we add more gateways, the pattern becomes infrastructure. The WHOOP integration, the MLS integration, the Plaid integration — each is a small, replaceable service with a clearly bounded responsibility. The rest of the system sits downstream, consuming a clean log, insulated from whatever is happening on the other side.

We think that is the right shape for integrating with a world that does not owe you stability.


United Algorithmics builds software at the intersection of data infrastructure, automation, and applied AI. If you are working through similar integration problems, we would like to hear from you.
