Portable Data Stack Quickstart

This quickstart guide will get you up and running with a complete Portable Data Stack in minutes.

Prerequisites

  • Python 3.11+ installed
  • Node.js 18+ installed (for Evidence dashboards)
  • Git for version control
  • Basic familiarity with command line interfaces
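
A quick way to sanity-check these prerequisites from Python (a minimal sketch; `missing_prereqs` is an illustrative helper, not part of the stack):

```python
import shutil
import sys

def missing_prereqs(python_version, available_tools,
                    min_python=(3, 11), required=("node", "git")):
    """Return the prerequisites that are not satisfied."""
    missing = []
    if python_version < min_python:
        missing.append(f"Python {min_python[0]}.{min_python[1]}+")
    missing.extend(t for t in required if t not in available_tools)
    return missing

# Inspect the current environment
found = {t for t in ("node", "git") if shutil.which(t)}
print(missing_prereqs(sys.version_info[:3], found))
```

An empty list means you are ready to proceed.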

Setup Options

You have two options for setting up the Portable Data Stack:

  1. Automated Setup: Using the provided setup script
  2. Manual Setup: Following the step-by-step instructions

Option 1: Automated Setup

This option uses our setup script to automate the entire installation process.

Step 1: Download the Setup Script

# Download the setup script
curl -o portable_stack_setup.sh https://example.com/portable_stack_setup.sh
chmod +x portable_stack_setup.sh

Step 2: Run the Setup Script

# Run the script
./portable_stack_setup.sh

The script will:

  1. Install the UV package manager and Dagster DG CLI tool
  2. Create a new Dagster project with the appropriate structure
  3. Set up DuckDB integration
  4. Create data generator assets
  5. Configure dbt for transformations
  6. Set up Evidence for visualization
  7. Provide a helper script for common tasks

Step 3: Start the Services

Once the script completes, you can start the services:

# Navigate to your project directory
cd ~/portable-data-stack
 
# Start Dagster server
./run.sh start
 
# In a new terminal, materialize assets
./run.sh materialize
 
# In another terminal, start Evidence dashboard
./run.sh evidence

Skip to Verification Checklist to confirm everything is working correctly.

Option 2: Manual Setup

This option walks you through the setup process step by step.

Step 1: Install Dependencies

# Install UV (recommended for faster package installation)
curl -sSf https://astral.sh/uv/install.sh | sh
 
# Install Dagster DG command-line tool globally
uv tool install dagster-dg

Step 2: Initialize Project with Dagster DG

# Create project directory
mkdir -p portable-data-stack
cd portable-data-stack
 
# Initialize a new Dagster project using DG
dg init portable_stack
 
# Navigate to the project directory
cd portable_stack

The dg init command creates a standard project structure with:

  • pyproject.toml - Project configuration
  • src/portable_stack/ - Main source code
  • tests/ - Test directory
  • .venv/ - Virtual environment (managed by UV)

Step 3: Add Required Dependencies

# Install project dependencies
uv add duckdb dbt-duckdb pandas pyarrow docling faker

Step 4: Create DuckDB Integration

Create a custom I/O manager for DuckDB:

# Use DG to scaffold an I/O manager
dg scaffold asset_io_manager --name duckdb_io_manager

Edit the generated file at src/portable_stack/defs/asset_io_managers/duckdb_io_manager.py:

import os
from typing import Any

import duckdb
import pandas as pd
from dagster import ConfigurableIOManager, InputContext, OutputContext
 
 
class DuckDBIOManager(ConfigurableIOManager):
    """I/O manager for storing dataframes in DuckDB."""
 
    database_path: str
    schema_name: str = "analytics"
 
    # ConfigurableIOManager is Pydantic-based, so the fields above are set
    # automatically; a custom __init__ would bypass that machinery.

    def handle_output(self, context: OutputContext, obj: Any) -> None:
        """Store a pandas DataFrame in a DuckDB table."""
        if not isinstance(obj, pd.DataFrame):
            return  # Only handle pandas DataFrames

        # Create the database directory if it doesn't exist
        os.makedirs(os.path.dirname(self.database_path), exist_ok=True)
 
        table_name = context.asset_key.path[-1]
 
        # Ensure schema exists
        with duckdb.connect(self.database_path) as conn:
            conn.execute(f"CREATE SCHEMA IF NOT EXISTS {self.schema_name}")
            
            # Write the dataframe to DuckDB; the local variable `obj` is
            # visible to DuckDB's Python replacement scan
            conn.execute(
                f"CREATE OR REPLACE TABLE {self.schema_name}.{table_name} AS SELECT * FROM obj"
            )
            context.log.info(f"Stored dataframe as table {self.schema_name}.{table_name}")
 
    def load_input(self, context: InputContext) -> pd.DataFrame:
        """Load a pandas DataFrame from a DuckDB table."""
        table_name = context.asset_key.path[-1]
        
        with duckdb.connect(self.database_path) as conn:
            # Check if table exists
            result = conn.execute(
                f"""
                SELECT count(*) 
                FROM information_schema.tables 
                WHERE table_schema = '{self.schema_name}' 
                AND table_name = '{table_name}'
                """
            ).fetchone()
            
            if result[0] == 0:
                context.log.warning(f"Table {self.schema_name}.{table_name} does not exist")
                return pd.DataFrame()
            
            # Load the dataframe from DuckDB
            df = conn.execute(f"SELECT * FROM {self.schema_name}.{table_name}").fetchdf()
            context.log.info(f"Loaded dataframe from table {self.schema_name}.{table_name}")
            return df
 
 
def build_duckdb_io_manager(
    database_path: str,
    schema_name: str = "analytics",
) -> DuckDBIOManager:
    """Build a DuckDB I/O manager with the given configuration."""
    return DuckDBIOManager(
        database_path=database_path,
        schema_name=schema_name,
    )
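
To see the contract this I/O manager implements, here is a minimal in-memory stand-in (illustrative only, no DuckDB required): handle_output stores an object under a schema-qualified table name derived from the asset key, and load_input retrieves it, falling back to an empty result when the table is missing.

```python
class InMemoryIOManager:
    """Toy stand-in for DuckDBIOManager, keyed the same way: schema.table."""

    def __init__(self, schema_name="analytics"):
        self.schema_name = schema_name
        self.tables = {}

    def _table_for(self, asset_key_path):
        # As in the real manager, the table name is the last asset-key component
        return f"{self.schema_name}.{asset_key_path[-1]}"

    def handle_output(self, asset_key_path, obj):
        self.tables[self._table_for(asset_key_path)] = obj

    def load_input(self, asset_key_path):
        # Missing tables yield an empty result rather than an error
        return self.tables.get(self._table_for(asset_key_path), [])

mgr = InMemoryIOManager()
mgr.handle_output(["customers"], [{"customer_id": 1}])
print(mgr.load_input(["customers"]))  # the stored rows
print(mgr.load_input(["missing"]))    # []
```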

Step 5: Create Data Generator Assets

Use dg to scaffold assets for data generation:

# Create assets for data generation
dg scaffold asset --name customers
dg scaffold asset --name products
dg scaffold asset --name sales

Edit the generated files:

For src/portable_stack/defs/assets/customers.py:

import pandas as pd
from faker import Faker
from dagster import asset
 
fake = Faker()
 
@asset
def customers():
    """Generate sample customer data."""
    customers = []
    for i in range(100):
        customers.append({
            'customer_id': i + 1,
            'name': fake.name(),
            'email': fake.email(),
            'city': fake.city(),
            'state': fake.state(),
            'country': fake.country(),
            'registration_date': fake.date_between(
                start_date='-2y', end_date='today')
        })
    
    return pd.DataFrame(customers)

For src/portable_stack/defs/assets/products.py:

import pandas as pd
import random
from faker import Faker
from dagster import asset
 
fake = Faker()
 
@asset
def products():
    """Generate sample product data."""
    categories = ['Electronics', 'Clothing', 'Home', 'Books', 'Sports']
    products = []
    
    for i in range(20):
        products.append({
            'product_id': i + 1,
            'name': fake.catch_phrase(),
            'category': random.choice(categories),
            'price': round(random.uniform(10, 1000), 2),
            'created_at': fake.date_between(
                start_date='-1y', end_date='today')
        })
    
    return pd.DataFrame(products)

For src/portable_stack/defs/assets/sales.py:

import pandas as pd
import random
from datetime import datetime, timedelta
from faker import Faker
from dagster import asset

fake = Faker()

# Dagster infers the upstream assets from the argument names, so no
# explicit AssetIn mapping is needed
@asset
def sales(customers: pd.DataFrame, products: pd.DataFrame):
    """Generate sample sales data."""
    sales = []
    end_date = datetime.now()
    start_date = end_date - timedelta(days=30)
    
    customer_ids = customers['customer_id'].tolist()
    product_ids = products['product_id'].tolist()
    product_prices = dict(zip(
        products['product_id'], products['price']
    ))
    
    for _ in range(300):
        product_id = random.choice(product_ids)
        quantity = random.randint(1, 5)
        sales.append({
            'order_id': fake.uuid4(),
            'customer_id': random.choice(customer_ids),
            'product_id': product_id,
            'quantity': quantity,
            'unit_price': product_prices[product_id],
            'total_price': round(quantity * product_prices[product_id], 2),
            'order_date': fake.date_time_between(
                start_date=start_date, end_date=end_date)
        })
    
    return pd.DataFrame(sales)
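
Since sales draws its foreign keys from the other two assets, it is worth sanity-checking referential integrity after generation. A minimal sketch in plain Python (no Faker or pandas; `check_integrity` is an illustrative helper):

```python
import random

def check_integrity(sales_rows, customer_ids, product_ids):
    """True if every sale references a known customer and product."""
    customers, products = set(customer_ids), set(product_ids)
    return all(
        row["customer_id"] in customers and row["product_id"] in products
        for row in sales_rows
    )

customer_ids = list(range(1, 101))  # matches the 100 generated customers
product_ids = list(range(1, 21))    # matches the 20 generated products
sales_rows = [
    {"customer_id": random.choice(customer_ids),
     "product_id": random.choice(product_ids)}
    for _ in range(300)
]
print(check_integrity(sales_rows, customer_ids, product_ids))  # True
```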

Step 6: Create Analysis Assets (dbt)

Set up a dbt project for transformations:

# Create dbt project structure
mkdir -p dbt_project/models/core
mkdir -p dbt_project/profiles
 
# Create dbt project file
cat > dbt_project/dbt_project.yml << EOF
name: 'portable_stack'
version: '1.0.0'
config-version: 2
 
profile: 'portable_stack'
 
model-paths: ["models"]
test-paths: ["tests"]
analysis-paths: ["analyses"]
macro-paths: ["macros"]
 
target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"
 
models:
  portable_stack:
    core:
      +materialized: table
EOF
 
# Create profiles file
cat > dbt_project/profiles/profiles.yml << EOF
portable_stack:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ../../db/datamart.duckdb
      schema: analytics
EOF

Create dbt models:

# Create dim_customers model
cat > dbt_project/models/core/dim_customers.sql << EOF
{{ config(materialized='table') }}
 
SELECT
  customer_id,
  name,
  email,
  city,
  state,
  country,
  registration_date
FROM
  analytics.customers
EOF
 
# Create dim_products model
cat > dbt_project/models/core/dim_products.sql << EOF
{{ config(materialized='table') }}
 
SELECT
  product_id,
  name,
  category,
  price,
  created_at
FROM
  analytics.products
EOF
 
# Create fact_sales model
cat > dbt_project/models/core/fact_sales.sql << EOF
{{ config(materialized='table') }}
 
SELECT
  s.order_id,
  s.customer_id,
  s.product_id,
  s.quantity,
  s.unit_price,
  s.total_price,
  s.order_date,
  p.category,
  c.city,
  c.state,
  c.country
FROM
  analytics.sales s
JOIN
  {{ ref('dim_products') }} p ON s.product_id = p.product_id
JOIN
  {{ ref('dim_customers') }} c ON s.customer_id = c.customer_id
EOF
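
The fact_sales model denormalizes each sale with attributes from both dimensions. The join logic can be sketched in plain Python to show the intended output shape (illustrative only; this is not how dbt executes it):

```python
def build_fact_sales(sales, products, customers):
    """Attach product category and customer location to each sale,
    mirroring the joins in fact_sales.sql."""
    products_by_id = {p["product_id"]: p for p in products}
    customers_by_id = {c["customer_id"]: c for c in customers}
    return [
        {
            **s,
            "category": products_by_id[s["product_id"]]["category"],
            "country": customers_by_id[s["customer_id"]]["country"],
        }
        for s in sales
    ]

fact = build_fact_sales(
    sales=[{"order_id": "a1", "customer_id": 1, "product_id": 2, "total_price": 50.0}],
    products=[{"product_id": 2, "category": "Books"}],
    customers=[{"customer_id": 1, "country": "France"}],
)
print(fact[0]["category"])  # Books
```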

Step 7: Integrate dbt with Dagster

Create dbt assets for Dagster:

# Scaffold dbt assets
dg scaffold asset --name dbt_models

Edit the file at src/portable_stack/defs/assets/dbt_models.py:

from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

# Path to the dbt project, relative to this file (adjust if your layout differs)
DBT_PROJECT_DIR = Path(__file__).joinpath(
    "..", "..", "..", "..", "..", "dbt_project"
).resolve()

dbt_resource = DbtCliResource(
    project_dir=str(DBT_PROJECT_DIR),
    profiles_dir=str(DBT_PROJECT_DIR / "profiles"),
)

# dbt_assets is a decorator and needs a dbt manifest; generate one first by
# running `dbt parse` inside dbt_project so target/manifest.json exists
@dbt_assets(manifest=DBT_PROJECT_DIR / "target" / "manifest.json")
def dbt_transformations(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()

Step 8: Update Dagster Definitions

Update the definitions file at src/portable_stack/definitions.py:

import os

from dagster import (
    Definitions,
    ScheduleDefinition,
    define_asset_job,
    load_assets_from_modules,
)

from .defs import assets
from .defs.assets import dbt_models

# Import the DuckDB I/O manager
from .defs.asset_io_managers.duckdb_io_manager import build_duckdb_io_manager

# Create database directory if it doesn't exist
os.makedirs("../../db", exist_ok=True)

# Load the data generator assets
all_assets = load_assets_from_modules([assets])

# Define a job to materialize all assets
materialize_all_job = define_asset_job(
    name="materialize_all",
    selection="*",
)

# Materialize all assets daily at midnight
daily_schedule = ScheduleDefinition(
    job=materialize_all_job,
    cron_schedule="0 0 * * *",
)

# Define all objects
defs = Definitions(
    assets=[*all_assets, dbt_models.dbt_transformations],
    schedules=[daily_schedule],
    resources={
        "io_manager": build_duckdb_io_manager(database_path="../../db/datamart.duckdb"),
        "dbt": dbt_models.dbt_resource,
    },
)
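
The cron_schedule value "0 0 * * *" fires once a day at midnight: minute 0, hour 0, every day of the month, month, and weekday. A simplified matcher for the first two fields shows how that reads (real cron also supports ranges, lists, and steps):

```python
def cron_field_matches(field, value):
    """Match a single simplified cron field: '*' or an exact number."""
    return field == "*" or int(field) == value

def fires_at(cron, hour, minute):
    """Check the minute and hour fields of a 5-field cron expression."""
    minute_field, hour_field = cron.split()[:2]
    return cron_field_matches(minute_field, minute) and cron_field_matches(hour_field, hour)

print(fires_at("0 0 * * *", hour=0, minute=0))   # True: midnight
print(fires_at("0 0 * * *", hour=12, minute=0))  # False: noon
```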

Step 9: Set Up Evidence for Dashboards

# Create directory for Evidence
cd .. # Go back to the main project directory
mkdir -p evidence_project
cd evidence_project
 
# Initialize Evidence project
npm create evidence@latest . -- --yes
 
# Create a source configuration
mkdir -p sources
cat > sources/duckdb.yml << EOF
name: 'duckdb'
type: 'duckdb'
path: '../db/datamart.duckdb'
EOF
 
# Create a simple dashboard
mkdir -p pages
cat > pages/index.md << EOF
# Sales Dashboard
 
\`\`\`sql sales_by_category
select 
  category,
  sum(total_price) as revenue
from 
  analytics.fact_sales
group by 
  category
order by 
  revenue desc
\`\`\`
 
## Category Performance
 
<BarChart 
  data={sales_by_category}
  x=category
  y=revenue
  title="Revenue by Category"
/>
 
## Sales by Location
 
\`\`\`sql sales_by_country
select 
  country,
  sum(total_price) as revenue
from 
  analytics.fact_sales
group by 
  country
order by 
  revenue desc
\`\`\`
 
<PieChart
  data={sales_by_country}
  value=revenue
  category=country
  title="Revenue by Country"
/>
 
## Daily Trend
 
\`\`\`sql daily_sales
select 
  date_trunc('day', order_date) as date,
  sum(total_price) as revenue
from 
  analytics.fact_sales
group by 
  date
order by 
  date
\`\`\`
 
<LineChart
  data={daily_sales}
  x=date
  y=revenue
  title="Daily Sales Revenue"
/>
EOF
 
# Return to project root
cd ..
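
The daily_sales query above truncates timestamps to the day before summing. The same grouping in plain Python, for reference (an illustrative sketch, not Evidence's execution model):

```python
from collections import defaultdict
from datetime import datetime

def daily_revenue(orders):
    """Group (timestamp, price) pairs by calendar day and sum revenue,
    like date_trunc('day', order_date) in the daily_sales query."""
    totals = defaultdict(float)
    for ts, price in orders:
        totals[ts.date()] += price
    return dict(sorted(totals.items()))

orders = [
    (datetime(2024, 5, 1, 9, 30), 20.0),
    (datetime(2024, 5, 1, 17, 0), 30.0),
    (datetime(2024, 5, 2, 8, 15), 10.0),
]
print(daily_revenue(orders))  # two days: 50.0 and 10.0
```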

Step 10: Start the Services

# Start Dagster in one terminal
cd portable_stack
dg dev
 
# In a new terminal, materialize assets
cd portable_stack
dg materialize
 
# In another terminal, start Evidence dashboard
cd evidence_project
npm run dev

Verification Checklist

Before proceeding further, use this checklist to verify that all components are working correctly:

| Component   | Verification Step                                                             | Expected Result                 |
|-------------|-------------------------------------------------------------------------------|---------------------------------|
| DuckDB      | Run duckdb db/datamart.duckdb "SELECT count(*) FROM analytics.dim_customers;" | Returns a number greater than 0 |
| dbt         | Check dbt_project/target/manifest.json                                        | File exists and contains models |
| Dagster     | Open http://localhost:3000 and check assets                                   | Assets show as materialized     |
| Evidence    | Open http://localhost:4000                                                    | Dashboard appears with charts   |
| Integration | Click a chart in Evidence                                                     | Shows detailed data view        |

Validation Query

Run this query to validate the full pipeline from raw data to transformed analytics:

SELECT 
  p.category,
  COUNT(DISTINCT f.customer_id) AS unique_customers,
  SUM(f.total_price) AS total_revenue,
  AVG(f.total_price) AS avg_order_value
FROM 
  analytics.fact_sales f
JOIN 
  analytics.dim_products p ON f.product_id = p.product_id
GROUP BY 
  p.category
ORDER BY 
  total_revenue DESC;

This query should return results with multiple categories and meaningful metrics.
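
You can replicate the same aggregation over in-memory rows in plain Python, which is handy for spot-checking the SQL results (an illustrative sketch; `category_metrics` is a hypothetical helper):

```python
from collections import defaultdict

def category_metrics(rows):
    """Per-category revenue, unique customers, and average order value."""
    revenue = defaultdict(float)
    buyers = defaultdict(set)
    orders = defaultdict(int)
    for r in rows:
        c = r["category"]
        revenue[c] += r["total_price"]
        buyers[c].add(r["customer_id"])
        orders[c] += 1
    return {
        c: {
            "unique_customers": len(buyers[c]),
            "total_revenue": round(revenue[c], 2),
            "avg_order_value": round(revenue[c] / orders[c], 2),
        }
        for c in revenue
    }

rows = [
    {"category": "Books", "customer_id": 1, "total_price": 30.0},
    {"category": "Books", "customer_id": 1, "total_price": 10.0},
    {"category": "Sports", "customer_id": 2, "total_price": 25.0},
]
print(category_metrics(rows)["Books"])  # 1 unique customer, 40.0 revenue, 20.0 avg
```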

Access Your Services

  • Dagster UI: http://localhost:3000
  • Evidence dashboard: http://localhost:4000
  • DuckDB database file: db/datamart.duckdb

Common Commands

# Start Dagster development server
dg dev
 
# Materialize all assets
dg materialize
 
# Start Evidence server
cd evidence_project && npm run dev
 
# Query DuckDB
duckdb db/datamart.duckdb
 
# Reset database
rm db/datamart.duckdb

Next Steps

  • Customize the dbt models to build more complex transformations
  • Add your own data sources to the generator script
  • Create more advanced Evidence dashboards with interactive filtering
  • Schedule recurring pipelines in Dagster
  • Connect to real data sources such as APIs, databases, or files

Troubleshooting

Here are solutions to common issues you might encounter:

| Issue                              | Solution                                                                       |
|------------------------------------|--------------------------------------------------------------------------------|
| Module not found errors            | Verify you've installed all dependencies with uv add <package>                 |
| DuckDB file permission errors      | Check that the directory permissions allow writing to the database file        |
| Dagster can't find dbt models      | Ensure paths in dbt_models.py are correct and profiles.yml has the right paths |
| Evidence dashboards show no data   | Verify dbt models ran successfully and materialized the tables                 |
| "No such file or directory" errors | Ensure directory structure matches code paths                                  |

For more detailed diagnostics:

# Check if the database exists and has tables
duckdb db/datamart.duckdb -c ".tables"
 
# Verify dbt project configuration
cd dbt_project && dbt debug
 
# Check Evidence configuration
cat evidence_project/sources/duckdb.yml

For more detailed setup instructions and advanced configurations, refer to the complete Portable Data Stack Guide.