Data In Production

Pipeline Orchestration Patterns: Beyond Simple DAGs

The patterns that make data pipelines actually work in production

Yusuf Ganiyu
Jan 23, 2026

Welcome back to Data in Production. Previously, in Data Observability, we covered how to know when your data breaks. This week, we’re going deeper into orchestration itself: the patterns that separate pipelines that work from pipelines that work at scale.

I used to think orchestration was the easy part.

Define some tasks. Draw some arrows. Hit run. Done, right?

Then I inherited a production Airflow instance with 200 DAGs. Half of them had hardcoded dates. A third couldn’t handle retries without duplicating data. And one particularly creative engineer had built a “dynamic” pipeline by generating Python files on the fly and importing them at runtime.

It worked. Until it didn’t.

The thing about orchestration patterns is that the simple approach works fine until you need to backfill three months of data, or process files that arrive unpredictably, or spin up expensive compute only when you actually need it. That’s when you realize the difference between a DAG that runs and a DAG that’s production-ready.

This week, we’re covering the patterns I wish I’d known earlier.

What We’ll Cover

  1. Dynamic Task Mapping — Processing unknown quantities at runtime

  2. Setup and Teardown — Managing ephemeral resources properly

  3. Asset-Driven Scheduling — Moving beyond time-based triggers

  4. Backfill Patterns — Reprocessing without the chaos

  5. Recovery Strategies — When things go wrong (and they will)

1. Dynamic Task Mapping

Here’s a pattern you’ll recognize: you need to process a list of files, but you don’t know how many until runtime. Maybe it’s customer exports that arrive overnight. Maybe it’s partitions that vary by day.

The old way was to either hardcode a maximum number of tasks or generate DAG code dynamically. Both approaches are fragile.

Dynamic task mapping, introduced in Airflow 2.3 and shown here with the Airflow 3 SDK, solves this elegantly. You define the task once, and Airflow expands it at runtime based on upstream output.

from datetime import datetime
from airflow.sdk import dag, task

@dag(schedule='@daily', start_date=datetime(2025, 1, 1), catchup=False)
def process_daily_files():
    
    @task
    def list_files():
        """Discover files to process — could be 5 or 500."""
        # In reality, this would list from S3, GCS, etc.
        return ['file_001.csv', 'file_002.csv', 'file_003.csv']
    
    @task
    def process_file(filename: str):
        """Process a single file. Airflow creates one task per file."""
        print(f"Processing {filename}")
        return f"Processed: {filename}"
    
    @task
    def summarize(results: list[str]):
        """Collect all results after parallel processing."""
        print(f"Completed {len(results)} files")
    
    files = list_files()
    processed = process_file.expand(filename=files)  # Dynamic expansion
    summarize(processed)

process_daily_files()
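Stripped of the Airflow machinery, what `.expand()` gives you is a fan-out over runtime data followed by a collect step. Here's a plain-Python sketch of that shape (a conceptual model, not Airflow's actual implementation):

```python
def list_files() -> list[str]:
    """Stand-in for listing from S3, GCS, etc. -- the count is only known at runtime."""
    return ['file_001.csv', 'file_002.csv', 'file_003.csv']

def process_file(filename: str) -> str:
    """One unit of work; Airflow would run each call as its own mapped task instance."""
    return f"Processed: {filename}"

def summarize(results: list[str]) -> str:
    """Fan-in: runs once, after every mapped instance has finished."""
    return f"Completed {len(results)} files"

# What .expand(filename=files) amounts to: one process_file call per element,
# which the scheduler is free to run in parallel.
files = list_files()
processed = [process_file(f) for f in files]
print(summarize(processed))  # Completed 3 files
```

When a mapped task also needs arguments that stay constant across instances, Airflow's `.partial()` supplies them once before expanding, e.g. `process_file.partial(bucket='exports').expand(filename=files)` (the `bucket` parameter here is a hypothetical example).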