Portable Data Stack Quickstart

This quickstart guide will get you up and running with a complete Portable Data Stack in minutes.

Prerequisites

  • Python 3.11+ installed
  • Node.js 18+ installed (for Evidence dashboards)
  • Git for version control
  • Basic familiarity with command line interfaces
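
A quick way to sanity-check these prerequisites from Python (a minimal sketch; `missing_prereqs` is an illustrative helper, not part of the stack):

```python
import shutil
import sys

def missing_prereqs(python_version, available_tools,
                    min_python=(3, 11), required=("node", "git")):
    """Return the prerequisites that are not satisfied."""
    missing = []
    if python_version < min_python:
        missing.append(f"Python {min_python[0]}.{min_python[1]}+")
    missing.extend(t for t in required if t not in available_tools)
    return missing

# Inspect the current environment
found = {t for t in ("node", "git") if shutil.which(t)}
print(missing_prereqs(sys.version_info[:3], found))
```

An empty list means you are ready to proceed.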

Setup Options

You have two options for setting up the Portable Data Stack:

  1. Automated Setup: Using the provided setup script
  2. Manual Setup: Following the step-by-step instructions

Option 1: Automated Setup

This option uses our setup script to automate the entire installation process.

Step 1: Download the Setup Script

# Download the setup script
curl -o portable_stack_setup.sh https://example.com/portable_stack_setup.sh
chmod +x portable_stack_setup.sh

Step 2: Run the Setup Script

# Run the script
./portable_stack_setup.sh

The script will:

  1. Install the UV package manager and Dagster DG CLI tool
  2. Create a new Dagster project with the appropriate structure
  3. Set up DuckDB integration
  4. Create data generator assets
  5. Configure dbt for transformations
  6. Set up Evidence for visualization
  7. Provide a helper script for common tasks

Step 3: Start the Services

Once the script completes, you can start the services:

# Navigate to your project directory
cd ~/portable-data-stack
 
# Start Dagster server
./run.sh start
 
# In a new terminal, materialize assets
./run.sh materialize
 
# In another terminal, start Evidence dashboard
./run.sh evidence

Skip to Verification Checklist to confirm everything is working correctly.

Option 2: Manual Setup

This option walks you through the setup process step by step.

Step 1: Install Dependencies

# Install UV (recommended for faster package installation)
curl -sSf https://astral.sh/uv/install.sh | sh
 
# Install Dagster DG command-line tool globally
uv tool install dagster-dg

Step 2: Initialize Project with Dagster DG

# Create project directory
mkdir -p portable-data-stack
cd portable-data-stack
 
# Initialize a new Dagster project using DG
dg init portable_stack
 
# Navigate to the project directory
cd portable_stack

The dg init command creates a standard project structure with:

  • pyproject.toml - Project configuration
  • src/portable_stack/ - Main source code
  • tests/ - Test directory
  • .venv/ - Virtual environment (managed by UV)

Step 3: Add Required Dependencies

# Install project dependencies
uv add duckdb dbt-duckdb pandas pyarrow docling faker

Step 4: Create DuckDB Integration

Create a custom I/O manager for DuckDB:

# Use DG to scaffold an I/O manager
dg scaffold asset_io_manager --name duckdb_io_manager

Edit the generated file at src/portable_stack/defs/asset_io_managers/duckdb_io_manager.py:

import os
from typing import Any

import duckdb
import pandas as pd
from dagster import ConfigurableIOManager, InputContext, OutputContext
 
 
class DuckDBIOManager(ConfigurableIOManager):
    """I/O manager for storing dataframes in DuckDB."""
 
    database_path: str
    schema_name: str = "analytics"
 
    # ConfigurableIOManager is Pydantic-based, so the fields above are set
    # automatically; a custom __init__ would bypass that machinery.

    def handle_output(self, context: OutputContext, obj: Any) -> None:
        """Store a pandas DataFrame in a DuckDB table."""
        if not isinstance(obj, pd.DataFrame):
            return  # Only handle pandas DataFrames

        # Create the database directory if it doesn't exist
        os.makedirs(os.path.dirname(self.database_path), exist_ok=True)
 
        table_name = context.asset_key.path[-1]
 
        # Ensure schema exists
        with duckdb.connect(self.database_path) as conn:
            conn.execute(f"CREATE SCHEMA IF NOT EXISTS {self.schema_name}")
            
            # Write the dataframe to DuckDB; the local variable `obj` is
            # visible to DuckDB's Python replacement scan
            conn.execute(
                f"CREATE OR REPLACE TABLE {self.schema_name}.{table_name} AS SELECT * FROM obj"
            )
            context.log.info(f"Stored dataframe as table {self.schema_name}.{table_name}")
 
    def load_input(self, context: InputContext) -> pd.DataFrame:
        """Load a pandas DataFrame from a DuckDB table."""
        table_name = context.asset_key.path[-1]
        
        with duckdb.connect(self.database_path) as conn:
            # Check if table exists
            result = conn.execute(
                f"""
                SELECT count(*) 
                FROM information_schema.tables 
                WHERE table_schema = '{self.schema_name}' 
                AND table_name = '{table_name}'
                """
            ).fetchone()
            
            if result[0] == 0:
                context.log.warning(f"Table {self.schema_name}.{table_name} does not exist")
                return pd.DataFrame()
            
            # Load the dataframe from DuckDB
            df = conn.execute(f"SELECT * FROM {self.schema_name}.{table_name}").fetchdf()
            context.log.info(f"Loaded dataframe from table {self.schema_name}.{table_name}")
            return df
 
 
def build_duckdb_io_manager(
    database_path: str,
    schema_name: str = "analytics",
) -> DuckDBIOManager:
    """Build a DuckDB I/O manager with the given configuration."""
    return DuckDBIOManager(
        database_path=database_path,
        schema_name=schema_name,
    )
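
To see the contract this I/O manager implements, here is a minimal in-memory stand-in (illustrative only, no DuckDB required): handle_output stores an object under a schema-qualified table name derived from the asset key, and load_input retrieves it, falling back to an empty result when the table is missing.

```python
class InMemoryIOManager:
    """Toy stand-in for DuckDBIOManager, keyed the same way: schema.table."""

    def __init__(self, schema_name="analytics"):
        self.schema_name = schema_name
        self.tables = {}

    def _table_for(self, asset_key_path):
        # As in the real manager, the table name is the last asset-key component
        return f"{self.schema_name}.{asset_key_path[-1]}"

    def handle_output(self, asset_key_path, obj):
        self.tables[self._table_for(asset_key_path)] = obj

    def load_input(self, asset_key_path):
        # Missing tables yield an empty result rather than an error
        return self.tables.get(self._table_for(asset_key_path), [])

mgr = InMemoryIOManager()
mgr.handle_output(["customers"], [{"customer_id": 1}])
print(mgr.load_input(["customers"]))  # the stored rows
print(mgr.load_input(["missing"]))    # []
```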

Step 5: Create Data Generator Assets

Use dg to scaffold assets for data generation:

# Create assets for data generation
dg scaffold asset --name customers
dg scaffold asset --name products
dg scaffold asset --name sales

Edit the generated files:

For src/portable_stack/defs/assets/customers.py:

import pandas as pd
from faker import Faker
from dagster import asset
 
fake = Faker()
 
@asset
def customers():
    """Generate sample customer data."""
    customers = []
    for i in range(100):
        customers.append({
            'customer_id': i + 1,
            'name': fake.name(),
            'email': fake.email(),
            'city': fake.city(),
            'state': fake.state(),
            'country': fake.country(),
            'registration_date': fake.date_between(
                start_date='-2y', end_date='today')
        })
    
    return pd.DataFrame(customers)

For src/portable_stack/defs/assets/products.py:

import pandas as pd
import random
from faker import Faker
from dagster import asset
 
fake = Faker()
 
@asset
def products():
    """Generate sample product data."""
    categories = ['Electronics', 'Clothing', 'Home', 'Books', 'Sports']
    products = []
    
    for i in range(20):
        products.append({
            'product_id': i + 1,
            'name': fake.catch_phrase(),
            'category': random.choice(categories),
            'price': round(random.uniform(10, 1000), 2),
            'created_at': fake.date_between(
                start_date='-1y', end_date='today')
        })
    
    return pd.DataFrame(products)

For src/portable_stack/defs/assets/sales.py:

import pandas as pd
import random
from datetime import datetime, timedelta
from faker import Faker
from dagster import asset

fake = Faker()

# Dagster infers the upstream assets from the argument names, so no
# explicit AssetIn mapping is needed
@asset
def sales(customers: pd.DataFrame, products: pd.DataFrame):
    """Generate sample sales data."""
    sales = []
    end_date = datetime.now()
    start_date = end_date - timedelta(days=30)
    
    customer_ids = customers['customer_id'].tolist()
    product_ids = products['product_id'].tolist()
    product_prices = dict(zip(
        products['product_id'], products['price']
    ))
    
    for _ in range(300):
        product_id = random.choice(product_ids)
        quantity = random.randint(1, 5)
        sales.append({
            'order_id': fake.uuid4(),
            'customer_id': random.choice(customer_ids),
            'product_id': product_id,
            'quantity': quantity,
            'unit_price': product_prices[product_id],
            'total_price': round(quantity * product_prices[product_id], 2),
            'order_date': fake.date_time_between(
                start_date=start_date, end_date=end_date)
        })
    
    return pd.DataFrame(sales)
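
Since sales draws its foreign keys from the other two assets, it is worth sanity-checking referential integrity after generation. A minimal sketch in plain Python (no Faker or pandas; `check_integrity` is an illustrative helper):

```python
import random

def check_integrity(sales_rows, customer_ids, product_ids):
    """True if every sale references a known customer and product."""
    customers, products = set(customer_ids), set(product_ids)
    return all(
        row["customer_id"] in customers and row["product_id"] in products
        for row in sales_rows
    )

customer_ids = list(range(1, 101))  # matches the 100 generated customers
product_ids = list(range(1, 21))    # matches the 20 generated products
sales_rows = [
    {"customer_id": random.choice(customer_ids),
     "product_id": random.choice(product_ids)}
    for _ in range(300)
]
print(check_integrity(sales_rows, customer_ids, product_ids))  # True
```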

Step 6: Create Analysis Assets (dbt)

Set up a dbt project for transformations:

# Create dbt project structure
mkdir -p dbt_project/models/core
mkdir -p dbt_project/profiles
 
# Create dbt project file
cat > dbt_project/dbt_project.yml << EOF
name: 'portable_stack'
version: '1.0.0'
config-version: 2
 
profile: 'portable_stack'
 
model-paths: ["models"]
test-paths: ["tests"]
analysis-paths: ["analyses"]
macro-paths: ["macros"]
 
target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"
 
models:
  portable_stack:
    core:
      +materialized: table
EOF
 
# Create profiles file
cat > dbt_project/profiles/profiles.yml << EOF
portable_stack:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: ../../db/datamart.duckdb
      schema: analytics
EOF

Create dbt models:

# Create dim_customers model
cat > dbt_project/models/core/dim_customers.sql << EOF
{{ config(materialized='table') }}
 
SELECT
  customer_id,
  name,
  email,
  city,
  state,
  country,
  registration_date
FROM
  analytics.customers
EOF
 
# Create dim_products model
cat > dbt_project/models/core/dim_products.sql << EOF
{{ config(materialized='table') }}
 
SELECT
  product_id,
  name,
  category,
  price,
  created_at
FROM
  analytics.products
EOF
 
# Create fact_sales model
cat > dbt_project/models/core/fact_sales.sql << EOF
{{ config(materialized='table') }}
 
SELECT
  s.order_id,
  s.customer_id,
  s.product_id,
  s.quantity,
  s.unit_price,
  s.total_price,
  s.order_date,
  p.category,
  c.city,
  c.state,
  c.country
FROM
  analytics.sales s
JOIN
  {{ ref('dim_products') }} p ON s.product_id = p.product_id
JOIN
  {{ ref('dim_customers') }} c ON s.customer_id = c.customer_id
EOF
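
The fact_sales model denormalizes each sale with attributes from both dimensions. The join logic can be sketched in plain Python to show the intended output shape (illustrative only; this is not how dbt executes it):

```python
def build_fact_sales(sales, products, customers):
    """Attach product category and customer location to each sale,
    mirroring the joins in fact_sales.sql."""
    products_by_id = {p["product_id"]: p for p in products}
    customers_by_id = {c["customer_id"]: c for c in customers}
    return [
        {
            **s,
            "category": products_by_id[s["product_id"]]["category"],
            "country": customers_by_id[s["customer_id"]]["country"],
        }
        for s in sales
    ]

fact = build_fact_sales(
    sales=[{"order_id": "a1", "customer_id": 1, "product_id": 2, "total_price": 50.0}],
    products=[{"product_id": 2, "category": "Books"}],
    customers=[{"customer_id": 1, "country": "France"}],
)
print(fact[0]["category"])  # Books
```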

Step 7: Integrate dbt with Dagster

Create dbt assets for Dagster:

# Scaffold dbt assets
dg scaffold asset --name dbt_models

Edit the file at src/portable_stack/defs/assets/dbt_models.py:

from pathlib import Path

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

# Path to the dbt project, relative to this file (adjust if your layout differs)
DBT_PROJECT_DIR = Path(__file__).joinpath(
    "..", "..", "..", "..", "..", "dbt_project"
).resolve()

dbt_resource = DbtCliResource(
    project_dir=str(DBT_PROJECT_DIR),
    profiles_dir=str(DBT_PROJECT_DIR / "profiles"),
)

# dbt_assets is a decorator and needs a dbt manifest; generate one first by
# running `dbt parse` inside dbt_project so target/manifest.json exists
@dbt_assets(manifest=DBT_PROJECT_DIR / "target" / "manifest.json")
def dbt_transformations(context: AssetExecutionContext, dbt: DbtCliResource):
    yield from dbt.cli(["build"], context=context).stream()

Step 8: Update Dagster Definitions

Update the definitions file at src/portable_stack/definitions.py:

import os

from dagster import (
    Definitions,
    ScheduleDefinition,
    define_asset_job,
    load_assets_from_modules,
)

from .defs import assets
from .defs.assets import dbt_models

# Import the DuckDB I/O manager
from .defs.asset_io_managers.duckdb_io_manager import build_duckdb_io_manager

# Create database directory if it doesn't exist
os.makedirs("../../db", exist_ok=True)

# Load the data generator assets
all_assets = load_assets_from_modules([assets])

# Define a job to materialize all assets
materialize_all_job = define_asset_job(
    name="materialize_all",
    selection="*",
)

# Materialize all assets daily at midnight
daily_schedule = ScheduleDefinition(
    job=materialize_all_job,
    cron_schedule="0 0 * * *",
)

# Define all objects
defs = Definitions(
    assets=[*all_assets, dbt_models.dbt_transformations],
    schedules=[daily_schedule],
    resources={
        "io_manager": build_duckdb_io_manager(database_path="../../db/datamart.duckdb"),
        "dbt": dbt_models.dbt_resource,
    },
)
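
The cron_schedule value "0 0 * * *" fires once a day at midnight: minute 0, hour 0, every day of the month, month, and weekday. A simplified matcher for the first two fields shows how that reads (real cron also supports ranges, lists, and steps):

```python
def cron_field_matches(field, value):
    """Match a single simplified cron field: '*' or an exact number."""
    return field == "*" or int(field) == value

def fires_at(cron, hour, minute):
    """Check the minute and hour fields of a 5-field cron expression."""
    minute_field, hour_field = cron.split()[:2]
    return cron_field_matches(minute_field, minute) and cron_field_matches(hour_field, hour)

print(fires_at("0 0 * * *", hour=0, minute=0))   # True: midnight
print(fires_at("0 0 * * *", hour=12, minute=0))  # False: noon
```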

Step 9: Set Up Evidence for Dashboards

# Create directory for Evidence
cd .. # Go back to the main project directory
mkdir -p evidence_project
cd evidence_project
 
# Initialize Evidence project
npm create evidence@latest . -- --yes
 
# Create a source configuration
mkdir -p sources
cat > sources/duckdb.yml << EOF
name: 'duckdb'
type: 'duckdb'
path: '../db/datamart.duckdb'
EOF
 
# Create a simple dashboard
mkdir -p pages
cat > pages/index.md << EOF
# Sales Dashboard
 
\`\`\`sql sales_by_category
select 
  category,
  sum(total_price) as revenue
from 
  analytics.fact_sales
group by 
  category
order by 
  revenue desc
\`\`\`
 
## Category Performance
 
<BarChart 
  data={sales_by_category}
  x=category
  y=revenue
  title="Revenue by Category"
/>
 
## Sales by Location
 
\`\`\`sql sales_by_country
select 
  country,
  sum(total_price) as revenue
from 
  analytics.fact_sales
group by 
  country
order by 
  revenue desc
\`\`\`
 
<PieChart
  data={sales_by_country}
  value=revenue
  category=country
  title="Revenue by Country"
/>
 
## Daily Trend
 
\`\`\`sql daily_sales
select 
  date_trunc('day', order_date) as date,
  sum(total_price) as revenue
from 
  analytics.fact_sales
group by 
  date
order by 
  date
\`\`\`
 
<LineChart
  data={daily_sales}
  x=date
  y=revenue
  title="Daily Sales Revenue"
/>
EOF
 
# Return to project root
cd ..
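
The daily_sales query above truncates timestamps to the day before summing. The same grouping in plain Python, for reference (an illustrative sketch, not Evidence's execution model):

```python
from collections import defaultdict
from datetime import datetime

def daily_revenue(orders):
    """Group (timestamp, price) pairs by calendar day and sum revenue,
    like date_trunc('day', order_date) in the daily_sales query."""
    totals = defaultdict(float)
    for ts, price in orders:
        totals[ts.date()] += price
    return dict(sorted(totals.items()))

orders = [
    (datetime(2024, 5, 1, 9, 30), 20.0),
    (datetime(2024, 5, 1, 17, 0), 30.0),
    (datetime(2024, 5, 2, 8, 15), 10.0),
]
print(daily_revenue(orders))  # two days: 50.0 and 10.0
```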

Step 10: Start the Services

# Start Dagster in one terminal
cd portable_stack
dg dev
 
# In a new terminal, materialize assets
cd portable_stack
dg materialize
 
# In another terminal, start Evidence dashboard
cd evidence_project
npm run dev

Verification Checklist

Before proceeding further, use this checklist to verify that all components are working correctly:

| Component   | Verification Step                                                             | Expected Result                 |
|-------------|-------------------------------------------------------------------------------|---------------------------------|
| DuckDB      | Run duckdb db/datamart.duckdb "SELECT count(*) FROM analytics.dim_customers;" | Returns a number greater than 0 |
| dbt         | Check dbt_project/target/manifest.json                                        | File exists and contains models |
| Dagster     | Open http://localhost:3000 and check assets                                   | Assets show as materialized     |
| Evidence    | Open http://localhost:4000                                                    | Dashboard appears with charts   |
| Integration | Click a chart in Evidence                                                     | Shows detailed data view        |

Validation Query

Run this query to validate the full pipeline from raw data to transformed analytics:

SELECT 
  p.category,
  COUNT(DISTINCT f.customer_id) AS unique_customers,
  SUM(f.total_price) AS total_revenue,
  AVG(f.total_price) AS avg_order_value
FROM 
  analytics.fact_sales f
JOIN 
  analytics.dim_products p ON f.product_id = p.product_id
GROUP BY 
  p.category
ORDER BY 
  total_revenue DESC;

This query should return results with multiple categories and meaningful metrics.
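
You can replicate the same aggregation over in-memory rows in plain Python, which is handy for spot-checking the SQL results (an illustrative sketch; `category_metrics` is a hypothetical helper):

```python
from collections import defaultdict

def category_metrics(rows):
    """Per-category revenue, unique customers, and average order value."""
    revenue = defaultdict(float)
    buyers = defaultdict(set)
    orders = defaultdict(int)
    for r in rows:
        c = r["category"]
        revenue[c] += r["total_price"]
        buyers[c].add(r["customer_id"])
        orders[c] += 1
    return {
        c: {
            "unique_customers": len(buyers[c]),
            "total_revenue": round(revenue[c], 2),
            "avg_order_value": round(revenue[c] / orders[c], 2),
        }
        for c in revenue
    }

rows = [
    {"category": "Books", "customer_id": 1, "total_price": 30.0},
    {"category": "Books", "customer_id": 1, "total_price": 10.0},
    {"category": "Sports", "customer_id": 2, "total_price": 25.0},
]
print(category_metrics(rows)["Books"])  # 1 unique customer, 40.0 revenue, 20.0 avg
```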

Access Your Services

  • Dagster UI: http://localhost:3000
  • Evidence dashboard: http://localhost:4000
  • DuckDB database file: db/datamart.duckdb

Common Commands

# Start Dagster development server
dg dev
 
# Materialize all assets
dg materialize
 
# Start Evidence server
cd evidence_project && npm run dev
 
# Query DuckDB
duckdb db/datamart.duckdb
 
# Reset database
rm db/datamart.duckdb

Next Steps

  • Customize the dbt models to build more complex transformations
  • Add your own data sources to the generator script
  • Create more advanced Evidence dashboards with interactive filtering
  • Schedule recurring pipelines in Dagster
  • Connect to real data sources such as APIs, databases, or files

Troubleshooting

Here are solutions to common issues you might encounter:

| Issue                              | Solution                                                                       |
|------------------------------------|--------------------------------------------------------------------------------|
| Module not found errors            | Verify you've installed all dependencies with uv add <package>                 |
| DuckDB file permission errors      | Check that the directory permissions allow writing to the database file        |
| Dagster can't find dbt models      | Ensure paths in dbt_models.py are correct and profiles.yml has the right paths |
| Evidence dashboards show no data   | Verify dbt models ran successfully and materialized the tables                 |
| "No such file or directory" errors | Ensure directory structure matches code paths                                  |

For more detailed diagnostics:

# Check if the database exists and has tables
duckdb db/datamart.duckdb -c ".tables"
 
# Verify dbt project configuration
cd dbt_project && dbt debug
 
# Check Evidence configuration
cat evidence_project/sources/duckdb.yml

For more detailed setup instructions and advanced configurations, refer to the complete Portable Data Stack Guide.