A guide to evaluating a billing system, part 2
This piece is a follow-up to Part 1 of our guide to evaluating a billing system, where we covered how important product flexibility and configurability are for modeling pricing complexity.
Designing a system to recover from failure
As an engineer designing an internal system, you’re used to thinking about its failure modes and ensuring that no form of downtime leads to correctness or consistency issues in your application. When evaluating a third-party vendor, however, most organizations focus on understanding how the vendor thinks about and communicates system reliability and uptime, whether through a status page, a historical incident review, an empirical stress test, or simply a face-to-face conversation with a technical leader.
Mission-critical vendors that sit at the heart of a business process — billing systems included — fail most critically not when there’s ‘simple’ downtime but rather when the key product integration points break and lead to correctness problems.
In turn, it’s important to carefully analyze the connection between your product and the external system to assess the mechanisms the vendor gives you to facilitate recovery.
Let’s take a look at two basic functions that a modern billing system provides, and how Orb is designed for robust recovery in each area:
- Data ingestion and aggregation
- Subscription lifecycle management
Data ingestion and aggregation
A modern usage-based billing provider should be able to accept hundreds of thousands, if not millions, of events per second in real time. Each event is labeled with a timestamp in the past that represents when the event conceptually happened in your application, which might differ from the time the provider receives it. In the normal course of business, requirements are met as long as the billing system provides an idempotency guarantee, reliably stays up, and accepts events timestamped within some buffer in the past.
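As a rough illustration of the idempotency and late-arrival requirements above, here is a minimal in-memory sketch. The names `UsageEvent`, `IngestionEndpoint`, `idempotency_key`, and `grace_period` are hypothetical and chosen for this example; they are not taken from any particular provider's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict


@dataclass(frozen=True)
class UsageEvent:
    idempotency_key: str  # hypothetical field: unique per logical event
    customer_id: str
    timestamp: datetime   # when the event happened in your application


class IngestionEndpoint:
    """Toy stand-in for a billing provider's ingestion endpoint."""

    def __init__(self, grace_period: timedelta) -> None:
        self.grace_period = grace_period
        self._accepted: Dict[str, UsageEvent] = {}

    def ingest(self, event: UsageEvent, received_at: datetime) -> str:
        # A retried event is a no-op: the same key is never counted twice.
        if event.idempotency_key in self._accepted:
            return "duplicate"
        # Events older than the grace period are rejected in this sketch.
        if event.timestamp < received_at - self.grace_period:
            return "too_old"
        self._accepted[event.idempotency_key] = event
        return "accepted"

    def accepted_count(self) -> int:
        return len(self._accepted)


now = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
endpoint = IngestionEndpoint(grace_period=timedelta(hours=12))
recent = UsageEvent("evt-1", "cust-1", now - timedelta(minutes=5))
stale = UsageEvent("evt-2", "cust-1", now - timedelta(days=3))

first_try = endpoint.ingest(recent, received_at=now)
retry = endpoint.ingest(recent, received_at=now)
stale_result = endpoint.ingest(stale, received_at=now)
```

In this sketch the retry is safe because the key, not the request, identifies the event, so a reporter can resend after a timeout without double-counting usage.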
Things rarely fall into this happy path for very long. Here are a few failure scenarios you should be sure to consider carefully.
Brief reporter downtime: Suppose each part of your application logs events to an internal message stream, and an event sink is responsible for collecting these events to report them to billing. If your event reporting service is down, then recovering from this state may be quite a challenge at high volume. You have to think about locally queuing up events in the reporter or on disk and have to deal with the consequences of temporarily sending 5-10x the typical event volume. Many providers don’t have great tolerance for this and will start strictly enforcing API limits. Although progressive catch-up is often possible in these scenarios, you’re left to build it yourself with a good understanding of API semantics.
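If you do end up building progressive catch-up yourself, one possible shape is a bounded drain of a local backlog, as sketched below. This assumes the provider accepts batches and signals rate limiting with a failed call; `drain_backlog`, `FakeBillingApi`, and their parameters are hypothetical names for illustration only.

```python
from collections import deque
from itertools import islice
from typing import Callable, Deque, List


def drain_backlog(
    backlog: Deque[dict],
    send_batch: Callable[[List[dict]], bool],
    batch_size: int,
    max_batches: int,
) -> int:
    """Send at most max_batches batches; stop at the first rejected batch.

    Events leave the backlog only after the provider accepts them, so a
    rejected batch is retried on the next pass. Returns the number sent.
    """
    sent = 0
    for _attempt in range(max_batches):
        batch = list(islice(backlog, batch_size))  # peek without removing
        if not batch or not send_batch(batch):
            break
        for _event in batch:
            backlog.popleft()
        sent += len(batch)
    return sent


class FakeBillingApi:
    """Stand-in for a provider that rate-limits after a fixed number of calls."""

    def __init__(self, calls_allowed: int) -> None:
        self.calls_allowed = calls_allowed
        self.received: List[dict] = []

    def send(self, batch: List[dict]) -> bool:
        if self.calls_allowed <= 0:
            return False  # simulated rate-limit rejection
        self.calls_allowed -= 1
        self.received.extend(batch)
        return True


backlog = deque({"idempotency_key": f"evt-{i}"} for i in range(25))
api = FakeBillingApi(calls_allowed=2)
sent_first_pass = drain_backlog(backlog, api.send, batch_size=10, max_batches=5)
```

Because unsent events stay at the front of the queue, the reporter can call `drain_backlog` again on a later tick and pick up where it left off. Pairing this with idempotency keys keeps a partially delivered batch from being double-counted.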
When you’re using Orb’s cloud storage integrations (for example with S3), Orb hosts a sync client that automatically ingests data from your bucket. With this ingestion mechanism, you never have to deal with ingestion API error codes or worry about peaks and troughs in ingestion. Orb automatically handles changes in your input ingestion rate. If you have blips of downtime, you don’t need to worry about controlled recovery or the associated API semantics.
Missing or incorrect data: Suppose you realize that there was a period many days ago when your reporters weren’t sending data, or you discover through an audit that you’ve sent incorrect event data. With all other billing systems, you’re left to either solve the problem in extremely hacky ways (sending in the equivalent of ‘reversal’ events) or at the wrong layer of abstraction (having to translate missing events to dollar impact and fixing the invoices). Not only are these methods cumbersome, but they also create large correctness risks and shift the critical burden of billing calculations back to your team.
Orb’s approach is fundamentally different: when events are wrong or missing, you can create a native backfill operation, optionally choosing to completely replace existing events. These backfills follow the same multi-request ingestion format (allowing you to backfill very large volumes) and importantly are atomic and auditable. Because backfills are implemented as append-only actions, you can even revert a backfill or layer one on top of the other — after all, mistakes can happen when correcting mistakes!
The resulting workflow difference here is stark: Orb is responsible for recalculating invoices based on new events, flowing through details of pricing that could differ across hundreds or thousands of customers.
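To make the append-only idea concrete, here is a small conceptual model of backfills that can replace a time range, be layered, and be reverted. It is a sketch for intuition only, not a description of Orb's implementation or API; `EventLedger`, `Backfill`, `Revert`, and `replace_existing` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    quantity: int


@dataclass(frozen=True)
class Backfill:
    backfill_id: str
    start: datetime            # inclusive start of the corrected range
    end: datetime              # exclusive end of the corrected range
    events: Tuple[Event, ...]
    replace_existing: bool


@dataclass(frozen=True)
class Revert:
    backfill_id: str           # which earlier backfill to undo


class EventLedger:
    """Nothing is ever edited in place; corrections are appended to a log."""

    def __init__(self) -> None:
        self._ingested: List[Event] = []
        self._log: List[Union[Backfill, Revert]] = []

    def ingest(self, event: Event) -> None:
        self._ingested.append(event)

    def append(self, action: Union[Backfill, Revert]) -> None:
        self._log.append(action)

    def effective_events(self) -> List[Event]:
        reverted = {a.backfill_id for a in self._log if isinstance(a, Revert)}
        events = list(self._ingested)
        for action in self._log:
            if isinstance(action, Revert) or action.backfill_id in reverted:
                continue
            if action.replace_existing:
                # Drop whatever currently sits in the backfilled range.
                events = [
                    e for e in events
                    if not (action.start <= e.timestamp < action.end)
                ]
            events.extend(action.events)
        return sorted(events, key=lambda e: e.timestamp)

    def total_quantity(self) -> int:
        return sum(e.quantity for e in self.effective_events())


ledger = EventLedger()
ledger.ingest(Event(datetime(2024, 1, 1), 5))
ledger.ingest(Event(datetime(2024, 1, 2), 100))  # reported incorrectly
ledger.ingest(Event(datetime(2024, 1, 3), 7))

fix = Backfill(
    backfill_id="bf-1",
    start=datetime(2024, 1, 2),
    end=datetime(2024, 1, 3),
    events=(Event(datetime(2024, 1, 2), 10),),
    replace_existing=True,
)
ledger.append(fix)
total_after_backfill = ledger.total_quantity()

ledger.append(Revert("bf-1"))
total_after_revert = ledger.total_quantity()
```

Since the effective event set is always derived from the full log, reverting a backfill is just another appended record, and the history of corrections stays auditable.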
Subscription lifecycle management
Managing your customers’ subscriptions typically involves the processes of creating them when service starts, and executing changes via add-ons or upgrades/downgrades. Depending on the shape of your business, these actions may either happen manually in a billing admin view (e.g. if you’re serving a small number of small enterprise deals) or through your product dashboard, triggered by the end-user.
Most billing systems are designed with the happy path in mind: changes are either scheduled for the future, or they happen on time as desired. In reality, this is rarely the case because changes need to be undone or need to be made effective as of a time in the past.
With that in mind, your billing system must follow two basic principles for subscription management.
It should always be possible to execute actions at an explicit time that you determine, not only the “current time”. It’s never safe to assume that the time that you ‘issue’ an action matches the time it’s processed — this is of course true in any distributed system, but very important when dealing with customers’ money. This is particularly important around billing period boundaries, where it could make a big difference if the action happens before or after a midnight boundary.
For example, imagine the case where you cancel a subscription in the current month to avoid incurring an upfront charge for the next period. If the request implicitly uses the “current time” and the month changes as soon as you send the request, you’re liable to overcharge your customer and have to issue corrective credits.
With Orb, actions are attached to explicit timestamps. Whether it’s usage events (which happen at a labeled timestamp, not now) or subscription actions such as edits and cancellations, this ensures that the outcome always matches what you expect.
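The midnight-boundary problem from the cancellation example can be reproduced in a few lines. This is a simplified sketch under the assumption that the next period's upfront charge applies unless the cancellation takes effect before the period ends; `effective_cancel_time` and `next_period_is_charged` are hypothetical helpers, not part of any real API.

```python
from datetime import datetime, timezone
from typing import Optional


def effective_cancel_time(
    explicit_at: Optional[datetime], processed_at: datetime
) -> datetime:
    # An explicit timestamp wins; otherwise fall back to processing time.
    return explicit_at if explicit_at is not None else processed_at


def next_period_is_charged(cancel_at: datetime, period_end: datetime) -> bool:
    # The upfront charge is avoided only if the cancellation takes
    # effect strictly before the current period ends.
    return cancel_at >= period_end


period_end = datetime(2024, 2, 1, 0, 0, 0, tzinfo=timezone.utc)
intended_at = datetime(2024, 1, 31, 23, 59, 59, tzinfo=timezone.utc)
processed_at = datetime(2024, 2, 1, 0, 0, 2, tzinfo=timezone.utc)  # 3s later

# Implicit "current time": the request lands after midnight.
charged_with_implicit_time = next_period_is_charged(
    effective_cancel_time(None, processed_at), period_end
)
# Explicit timestamp: the outcome does not depend on processing delay.
charged_with_explicit_time = next_period_is_charged(
    effective_cancel_time(intended_at, processed_at), period_end
)
```

A three-second processing delay flips the result when the time is implicit, while the explicit timestamp gives the same answer however late the request is handled.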
The second core principle of subscription management safety is that all actions are reversible whenever possible. You should be able to entirely undo a scheduled cancellation, extend a programmed trial period, or change the contract ramp that your subscription is scheduled to undergo at any time. The only exception is when an action has already produced end-user-facing side effects. After all, if you’ve issued an invoice and sent an email, the system must now accommodate this by correctly issuing credit notes rather than modifying the invoice in place.
In order to adhere to these two principles, Orb is architected from the ground up to support ‘backdated’ actions completely: the system will “rewind time” as if the action had happened on the earlier date, and automatically play forward the consequences to catch up to the current time.
This is critical for a few different workflows:
- It’s incredibly easy to migrate subscriptions to Orb accurately. Subscriptions can start at the correct historical date, and Orb will instantly create the invoices and fill them with the usage information required.
- When you sign an enterprise contract, you’ll often backdate a discounted pricing structure or prepaid commitment to the beginning of the current period. Orb doesn’t get in the way, and supports this natively, automatically re-pricing existing usage on elapsed invoices.
- If you make a mistake with a customer’s pricing and no invoices have been issued yet, Orb gives you the control to correct this without any incident whatsoever. Otherwise, Orb will accept the backdated action and automatically void and re-issue all relevant invoices.
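The “rewind and play forward” behavior described above can be pictured as recomputing invoices from a timeline of dated actions. The sketch below is a deliberately simplified model with monthly periods and per-unit pricing in cents, not Orb's actual architecture; `Subscription`, `set_price`, and `invoices_to_reissue` are hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Tuple

Period = Tuple[int, int]  # (year, month)


@dataclass
class Subscription:
    # (effective_date, unit_price_cents), kept sorted by effective date
    price_changes: List[Tuple[date, int]] = field(default_factory=list)
    # (usage_date, quantity)
    usage: List[Tuple[date, int]] = field(default_factory=list)

    def set_price(self, effective: date, unit_price_cents: int) -> None:
        # The effective date may be in the past. The stable sort keeps
        # later-issued actions last among those sharing the same date.
        self.price_changes.append((effective, unit_price_cents))
        self.price_changes.sort(key=lambda change: change[0])

    def price_on(self, day: date) -> int:
        price = 0
        for effective, cents in self.price_changes:
            if effective <= day:
                price = cents
        return price

    def invoice_totals(self) -> Dict[Period, int]:
        # "Play forward": re-price every usage record against the timeline.
        totals: Dict[Period, int] = {}
        for day, quantity in self.usage:
            key = (day.year, day.month)
            totals[key] = totals.get(key, 0) + quantity * self.price_on(day)
        return totals


def invoices_to_reissue(
    before: Dict[Period, int], after: Dict[Period, int]
) -> List[Period]:
    # Only periods whose totals changed need to be voided and re-issued.
    return sorted(k for k in after if before.get(k) != after[k])


sub = Subscription()
sub.set_price(date(2024, 1, 1), 100)
sub.usage = [(date(2024, 1, 10), 5), (date(2024, 2, 10), 3), (date(2024, 3, 5), 2)]
before = sub.invoice_totals()

# In March, backdate a discounted price to February 1.
sub.set_price(date(2024, 2, 1), 80)
after = sub.invoice_totals()
changed_periods = invoices_to_reissue(before, after)
```

In this model the January invoice is untouched while February and March are re-priced, which mirrors the idea that a backdated action only disturbs the periods it actually affects.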
Conclusion
Recovery in the midst of failure should be a key lens of a billing system evaluation. We've explored two fundamental functions of modern billing systems: data ingestion and subscription lifecycle management. Orb's ability to handle high-volume data ingestion, its robust approach to dealing with failures, and its innovative backfill operations for missing or incorrect data set it apart. In terms of subscription management, Orb's adherence to executing actions at explicit timestamps and its flexibility in reversing actions ensure accuracy and adaptability. These features highlight the necessity for businesses to select a billing system that not only meets their immediate needs but also provides comprehensive, reliable solutions for unforeseen challenges and errors.