With my team, I’ve been exploring a new strategy for detecting bugs in production. We’ve found it very effective at flagging issues that slip through more traditional test suites and observability efforts. I wanted to document the approach and write up what we’ve learned after running it in production with hundreds of thousands of users for the better part of a year.

In a nutshell, we use dbt to write SQL queries that test for incorrect state, and every night the test suite runs and posts to Slack each test that crosses specific warn/error thresholds.

The SQL we write here is extremely simple. Here’s an example:

-- subscription discounts accepted in the past two days
{{ config(
    severity = "error",
    error_if = "<5",
    warn_if = "<20",
) }}
select distinct id
from
  {{ ref("stg_koho_db_benefits__subscription_discounts") }}
where
  accepted_date > current_date - interval '2 day'

We have a system for offering users discounts based on tiers. If we see zero discounts created, we have a problem and need to investigate. This test alerted us to a configuration bug that prevented discounts from being created, a bug that emitted no errors and broke no unit or integration tests. Discounts are a major churn-reduction strategy for us, so if this bug had gone undetected, it would have been an immediate, expensive miss. Additionally, since these tests live in their own repository, we can write and merge them without the overhead of a traditional deployment. In the age of AI, this means we can generate tests very quickly without blocking the release pipeline in the main codebase.

The fundamental insight here is that if all data lives in the warehouse, tests on this data become extremely powerful since they have complete visibility into the real-world behaviour of your system. Since we store both data and eventing in the warehouse, SQL transformations can leverage both in highly flexible ways. This can effectively form a full observability layer, without the overhead of adding new instrumentation into your codebase.

Here’s an example combining both data and eventing tables in a single, simple test:

-- Active tier with payment in the past
{{ config(
    severity = "error",
    error_if = ">350",
    warn_if = ">20",
) }}
select distinct
  account_group_id
  , user_reference
  , next_payment_date
from
  {{ ref("int_paid_tier_users") }}
where
  next_payment_date < current_date
  and status = 'active'
order by next_payment_date desc

In a larger distributed system, the flexibility to test on all eventing and DB data across all services becomes a bit of a superpower. My team manages tiers and subscriptions, and we need to verify that all product features are correctly configured for each subscription tier. With this process, we have access to all DB & eventing data, so testing across multiple systems becomes possible. Traditional E2E testing approaches often require significant coordination across teams and can miss the core truth about what data is stored and what events are published.

Here’s an example of testing across multiple databases:

-- test across multiple DBs
{{ config(
    severity = "error",
    error_if = ">100",
    warn_if = ">10",
) }}

with expected_benefits as (
  select
    account_group_id
    , expected_hisa
    , tier_name
    , experiment_variation
    , user_ref

  from {{ ref('int_expected_benefits') }}
)

, downstream_data as (
  select
    account_group_id
    , interest_rate

  from {{ ref('int_downstream_benefit_data') }}
)

select distinct
  expected_benefits.account_group_id
  , downstream_data.interest_rate
  , expected_benefits.expected_hisa
  , expected_benefits.tier_name
  , expected_benefits.experiment_variation
  , expected_benefits.user_ref

from expected_benefits
inner join downstream_data on expected_benefits.account_group_id = downstream_data.account_group_id

where
  downstream_data.interest_rate != expected_benefits.expected_hisa
  and expected_benefits.expected_hisa is not null

There’s a whole class of errors that occur when two dependent systems owned by separate teams need to keep their state in sync. This approach gives us a clean, flexible way to detect them.

Another issue we face when working on tiers and subscriptions is nuanced bugs in the state of long-lived state machines. Because billing cycles are long, bugged state can surface in user reports weeks or months later. Testing with dbt has been a huge asset in ensuring that the state of all our active users remains correct.

Here’s an example of a test for a bugged subscription state that would only appear weeks later:

-- test for a missed or late monthly payment
{{ config(
    severity = "error",
    error_if = ">300",
    warn_if = ">20",
) }}

with account_group_ids_with_payments as (
  select distinct account_group_id

  from {{ ref("int_tier_users_with_payments") }}

  where created_at >= current_date - interval '46 day'
)

, paid_tier_users as (
  select
    account_group_id
    , next_payment_date
    , created_at
    , frequency
    , status

  from {{ ref("int_paid_tier_users") }}

  where
    frequency = 'monthly'
    and status = 'active'
)

select distinct
  paid_tier_users.account_group_id
  , paid_tier_users.next_payment_date
  , paid_tier_users.created_at

from paid_tier_users

left join account_group_ids_with_payments
  on paid_tier_users.account_group_id = account_group_ids_with_payments.account_group_id

where
  account_group_ids_with_payments.account_group_id is null

To make it easy for teams to consume these test results, I built a Slackbot that publishes them daily to an alerting channel. The bot posts a summary of all failing tests with their error counts, along with sample rows and a query to dig into the failures directly.
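To give a feel for how little machinery this takes, here’s a minimal sketch of the bot’s core logic. It assumes dbt’s standard `run_results.json` artifact (written to `target/` after `dbt test`) and a Slack incoming webhook; the function names and the message format are my own illustrative choices, not the actual bot.

```python
import json
import urllib.request


def summarize_run_results(run_results: dict) -> str:
    """Build a Slack-friendly summary of failing dbt tests.

    Relies on dbt's run_results.json schema: each entry in "results"
    has a "unique_id", a "status" (pass/warn/fail/error), and, for
    tests, a "failures" count (rows returned by the test query).
    """
    lines = []
    for result in run_results.get("results", []):
        if result["status"] in ("warn", "fail", "error"):
            # unique_id looks like "test.<project>.<test_name>"; keep the last part
            name = result["unique_id"].split(".")[-1]
            count = result.get("failures", 0)
            lines.append(f"{result['status'].upper()}: {name} ({count} rows)")
    if not lines:
        return "All dbt tests passed."
    return "dbt test failures:\n" + "\n".join(lines)


def post_to_slack(webhook_url: str, text: str) -> None:
    # Slack incoming webhooks accept a JSON payload with a "text" field.
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

After the nightly `dbt test` run, a scheduled job reads `target/run_results.json`, builds the summary, and posts it to the alerting channel’s webhook URL. Sample rows and the drill-down query our bot includes would be layered on top of this same loop.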

Huge thanks to my coworkers on the Data team at Koho for supporting these explorations and to dbt Labs for building dbt. We’re excited to continue exploring how to use tests to detect bugged production state and improve the quality & reliability of our products.