Skip to content

suvie-eng/data-engineering-eval

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Subscription Analytics: Data Engineering Take-Home

Context

You are working with a weekly meal-kit subscription service. Every week, a delivery window opens and each active subscriber has an order record created for that week. Subscribers can choose to skip a delivery, in which case their order record is marked with skip = True and no line items are generated. If they do not skip, skip = False and the order includes one or more line items representing the meals they selected.

The business does not have a formal "cancel" flow. Instead, subscribers who want to leave simply skip every week indefinitely. The company tracks these as Shadow Cancellations internally, and is also very interested in Re-engagements -- subscribers who come back after shadow cancelling.

Your job is to build a dbt pipeline that identifies these two events from the raw order data.


Key Definitions

Shadow Cancellation

A subscriber enters a shadow cancelled state when they have skipped 3 or more consecutive delivery weeks. The shadow cancellation event occurs on the 3rd consecutive skipped week. This is a one-time state transition per skip streak -- a subscriber who skips weeks 10 through 30 has exactly one shadow cancellation event (at week 12, their 3rd consecutive skip), not seven.

For example, if a subscriber skips weeks 10, 11, and 12, they become shadow cancelled as of week 12.

Re-engagement

A re-engagement occurs when a subscriber who is currently in a shadow cancelled state places a successful order (skip = False). The re-engagement event occurs on the week of that first non-skipped order.

For example, if a subscriber is shadow cancelled (skipped weeks 10-12) and then orders in week 13, week 13 is a re-engagement.

Important Edge Cases to Consider

  • A subscriber who skips exactly 2 weeks and then orders again has not shadow cancelled.
  • A subscriber who shadow cancels and never returns is still counted as a single shadow cancellation (on their 3rd consecutive skip week).
  • A subscriber who re-engages and then shadow cancels again should be counted as a new shadow cancellation.
  • Missing order records: Not every account has an order record for every delivery week. Your model must handle this and document its assumption -- are missing weeks skips? Data gaps? Something else? Hint: at least one account in the dataset has weeks with no order record at all.

Data Model

The seed data consists of 5 tables:

accounts        deliveries       variants
+-----------+   +-------------+  +-----------+
| account_id|   | delivery_id |  | variant_id|
| first_name|   | week_number |  | name      |
| last_name |   | delivery_date|  | price_cents|
| email     |   +-------------+  +-----------+
| created_at|
+-----------+
        \             |
         \            |
          v           v
         orders
         +------------+
         | order_id   |
         | account_id |  --> accounts
         | delivery_id|  --> deliveries
         | skip       |  (boolean)
         | created_at |
         +------------+
              |
              v
         line_items (only when skip = False)
         +--------------+
         | line_item_id |
         | order_id     |  --> orders
         | variant_id   |  --> variants
         | quantity     |
         +--------------+

There are 50 accounts and 52 delivery weeks. The first 5 accounts have been hardcoded with specific patterns to test edge cases. Accounts 6-50 have randomized skip behavior (~20% skip rate).


Your Goal

Build a dbt pipeline organized into three layers:

  1. Staging (models/staging/) -- Already provided. Clean, typed versions of the raw seed tables.

  2. Intermediate (models/intermediate/) -- This is where the core logic lives. Build models that:

    • Join orders with deliveries to get a timeline per account.
    • Identify consecutive skip streaks (the "gaps and islands" problem).
    • Flag when a shadow cancellation occurs (3rd consecutive skip).
    • Flag when a re-engagement occurs (first non-skip after shadow cancellation).
  3. Marts (models/marts/) -- Final output table(s). At minimum, produce a model that shows:

    • Per week: the count of new shadow cancellations and the count of re-engagements.
    • This is the table you will visualize in Rill.

Prerequisites

  • Python 3.10+
  • Install the Python dependencies:
pip install dbt-duckdb pandas faker
  • Rill Developer (for visualization):
curl -s https://cdn.rilldata.com/install.sh | bash

Windows users: check the Rill installation docs for alternative instructions.


Step-by-Step Instructions

Step 1: Generate the Seed Data

python generate_data.py

This creates 5 CSV files in the seeds/ directory.

Step 2: Load Seeds into DuckDB

dbt seed --profiles-dir .

This loads the CSVs into a local data.duckdb file. The --profiles-dir . flag tells dbt to use the profiles.yml in the project root rather than ~/.dbt/.

Step 3: Write Your Models

Build out the intermediate and marts models in:

  • models/intermediate/ -- your transformation logic
  • models/marts/ -- your final output tables

The staging layer is already provided as a reference for the pattern. Use {{ ref('stg_orders') }} etc. to reference upstream models.

Step 4: Build the Pipeline

dbt build --profiles-dir .

This runs seeds, models, and tests. Fix any failures before proceeding.

Step 5 (Bonus): Visualize with Rill

rill start

Rill will start a local server. Connect it to the data.duckdb file and build a line chart showing:

  • New shadow cancellations per week
  • Re-engagements per week

If you are short on time, this step is optional. A screenshot, a brief write-up describing what you would build, or even just the final SQL query backing the chart is acceptable.


Project Structure

.
├── README.md
├── generate_data.py
├── dbt_project.yml
├── profiles.yml
├── seeds/
│   ├── accounts.csv
│   ├── deliveries.csv
│   ├── variants.csv
│   ├── orders.csv
│   └── line_items.csv
└── models/
    ├── staging/
    │   ├── schema.yml
    │   ├── stg_accounts.sql
    │   ├── stg_deliveries.sql
    │   ├── stg_variants.sql
    │   ├── stg_orders.sql
    │   └── stg_line_items.sql
    ├── intermediate/
    │   └── (your models here)
    └── marts/
        └── (your models here)

Evaluation Criteria

Criteria What We're Looking For
Correctness Does your logic correctly handle all edge cases, including the hardcoded test accounts?
SQL Proficiency Clean, readable SQL. Effective use of window functions and CTEs.
dbt Best Practices Proper use of ref(), layered model organization, schema documentation.
Data Modeling Thoughtful intermediate models that break the problem into logical steps.
Visualization (Bonus) A clear Rill dashboard showing the two metrics over time.
Documentation Brief comments or a short write-up explaining your approach and any assumptions.

Good luck! We are excited to see your approach.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages