Debugging large datasets with DuckDB
At Orb, we’re building real-time infrastructure to power billing and monetization for modern companies. The process of integrating with Orb starts with sending in data in the form of usage events, which represent billable product usage in your application. Orb allows you to build metrics (aggregation queries over your events) and attach those metrics to pricing models.
Although Orb provides a direct batch ingestion API, there’s an ingestion solution that’s become increasingly popular amongst our customers, especially for high volume use cases: ingestion using S3 as a message bus. On Orb’s end, we set up cross-account S3 event notifications on a bucket that you provide us read permissions to, and immediately ingest any files that land in the bucket. It’s also convenient for our customers, since there’s usually an existing S3 sink output (whether that’s Kafka, Kinesis, or a periodic dump from a data warehouse). This latter method has scaled to hundreds of thousands of events a second — not surprising, since S3 is the holy grail of reliability and provides plenty of read and write throughput.
Between the input source (in this case, S3) and our events datastore, a couple things still need to happen:
- Validation: Orb ensures that the payload is valid by checking that the event JSON fits the spec and that any referenced Orb IDs already exist. Orb also ensures that the timestamp of an event is within the ingestion grace period (typically a handful of hours), which prevents our customers from ingesting old data that can no longer be reflected for billing purposes.
- Deduplication: Orb deduplicates based on an idempotency key in the event body.
Only when an event passes validation and we’ve confirmed it’s not a duplicate do we ingest it into our events datastore.
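To make the two stages concrete, here’s a minimal sketch of that validate-then-dedup flow. The field names (`idempotency_key`, `event_name`, `timestamp`, `customer_id`), the 12-hour grace period, and the in-memory dedup set are all illustrative assumptions, not Orb’s actual implementation:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical grace period; Orb's actual window is "a handful of hours".
GRACE_PERIOD = timedelta(hours=12)

seen_idempotency_keys = set()  # stands in for Orb's real dedup store

def validate_and_dedup(event: dict, now: datetime) -> tuple[bool, str]:
    """Return (accepted, reason). Field names here are illustrative."""
    # Validation: required fields must be present in the event JSON.
    for field in ("idempotency_key", "event_name", "timestamp", "customer_id"):
        if field not in event:
            return False, f"missing field: {field}"
    # Validation: the timestamp must fall within the ingestion grace period.
    ts = datetime.fromisoformat(event["timestamp"])
    if ts < now - GRACE_PERIOD:
        return False, "timestamp outside grace period"
    # Deduplication: reject events whose idempotency key was already seen.
    key = event["idempotency_key"]
    if key in seen_idempotency_keys:
        return False, "duplicate idempotency key"
    seen_idempotency_keys.add(key)
    return True, "accepted"
```

Events that fail either stage are the ones that land in the dead letter queue instead of the events datastore.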
What needs debugging?
Although Orb has detailed and flexible in-product tooling on top of ingested data, when customers are first integrating our platform, there’s sometimes an upstream problem: not everything in the bucket is ingested, and instead ends up in a dead letter queue because it fails validation, deduplication, or both. Inspecting the dead letter queue is a start for basic integrations, but actually understanding the shape of the problem requires more work.
When we started looking for a solution that would let our engineering support team help customers ensure their data was production-ready, we had a few criteria to make it use-case appropriate:
- On demand: We needed to query data ad-hoc in S3 in order to debug its contents, but this wasn’t a persistent production workload. We didn’t explore solutions that required keeping around running infrastructure.
- Performant: The tool we picked had to be performant at “medium” scale (tens or hundreds of gigs of data). Although the total ingestion volume into Orb is often much higher, we’re performing analysis over smaller time ranges, and that task needed to take seconds so we could flexibly understand the source of problems.
- Expressive: Orb allows you to define arbitrary SQL queries as your metric, not just a limited number of aggregation functions. We wanted the option to run those queries directly on your raw data, without hand-building a translation layer.
DuckDB meets all three of these criteria.
What DuckDB enables
When customers are seeing a mismatch between their metrics and the data they believe they’re sending into Orb, we can now help them debug straight at the source; they don’t need to pull in any data science help on their side to debug the datasets coming into Orb.
We’ve installed a DuckDB client on a multipurpose production EC2 instance, but other than that, there’s no running infrastructure at all.
It’s easy — instant, in fact — to attach a DuckDB instance to query over a remote file in S3, using a custom role we assume temporarily. We can start by doing simple counts over the timeframe, finding duplicate events that would’ve been rejected by Orb’s ingestion API. This helps us understand the pattern of duplicates, and when they happen:
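A query along these lines does the job. The bucket path, field names, and credential values are placeholders; the `httpfs` extension and `read_json_auto` are standard DuckDB features:

```sql
-- Load the httpfs extension and assume temporary S3 credentials
-- (values below are placeholders).
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-east-1';
SET s3_access_key_id = '...';
SET s3_secret_access_key = '...';
SET s3_session_token = '...';

-- Count events per idempotency key over a time window, surfacing
-- the keys that appear more than once and when they occurred.
SELECT
    idempotency_key,
    count(*) AS occurrences,
    min(timestamp) AS first_seen,
    max(timestamp) AS last_seen
FROM read_json_auto('s3://customer-bucket/events/2023-06-01/*.json')
GROUP BY idempotency_key
HAVING count(*) > 1
ORDER BY occurrences DESC;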
Note how interacting with the S3 file like a relational table is seamless here! DuckDB also lets us directly replicate the metrics we support in Orb over the events that land in S3. Suppose, for example, that we discover that the customer’s client is incorrectly assigning idempotency keys, and so we’re deduplicating more data than intended. Ignoring the duplicate constraint and running the metric over the raw events in the bucket (thanks to support for structs) lets us confirm that the rest of the data is correct, avoiding a re-ingestion loop.
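As a sketch of what that looks like, here is an illustrative Orb-style metric run over the raw events, duplicates included. The dot-notation access into the nested `properties` struct is what DuckDB’s struct support buys us; the specific field names are assumptions:

```sql
-- Re-run a usage metric over the raw events in the bucket, ignoring
-- the duplicate constraint, to confirm the underlying data is correct.
SELECT
    customer_id,
    sum(properties.compute_seconds) AS total_compute_seconds
FROM read_json_auto('s3://customer-bucket/events/2023-06-01/*.json')
GROUP BY customer_id
ORDER BY total_compute_seconds DESC;
```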
We’ve found DuckDB to be very snappy for this sort of debugging over millions of rows of data – and we expect even better performance for Parquet files, where the httpfs protocol and the Parquet metadata format allow DuckDB to selectively download only portions of the file.
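For example, a filtered aggregate like the following (path and column names illustrative) only needs the Parquet footer metadata plus the row groups and columns the query actually touches, rather than the whole object:

```sql
SELECT count(*)
FROM read_parquet('s3://customer-bucket/events/*.parquet')
WHERE event_timestamp >= TIMESTAMP '2023-06-01';
```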
What’s next for DuckDB at Orb
We’re considering expanding the use cases in the following ways:
- Self-serve debugging: We want to bring the debugging experience above to our customers directly, so they can leverage our data infrastructure and their Orb-defined metrics on their raw data.
- Better data exploration: Products like Rill leverage DuckDB to build instant query dashboards, with no run button at all.
- Combining different data sources: A current limitation of this architecture (and other use cases we have in mind for DuckDB) is the requirement to have a fairly large single machine hosting DuckDB. We’re excited about exploring solutions like MotherDuck that provide hosted data persistence, and the ability to query across local and remote data.
- Production use cases: DuckDB can help us speed up production use cases like real-time alerting where we can leverage a significantly smaller in-memory working set for our queries.
Using DuckDB as a debugging tool for support may seem like a minor use case, but it’s helped us to understand its performance characteristics and build confidence in expanding its footprint in Orb’s technical stack.