# Operational Intelligence Playbook

Techniques for operational intelligence, process mining, and high-cardinality data analysis. The SQL examples use ClickHouse (our implementation), but the **patterns are universal** -- they apply to any columnar store, streaming engine, or event-driven analytics platform. Sections 1-5 cover ClickHouse-specific primitives and portal recipes. Sections 6+ are implementation-agnostic techniques that work anywhere you have timestamped event data.

---

## 1. Process Mining Primitives

### windowFunnel

Counts how far users progress through an ordered sequence of events within a time window.

**Template:**
```sql
SELECT level, count() AS users
FROM (
  SELECT
    {identity_col} AS uid,
    windowFunnel({window_seconds})(
      {timestamp_col},
      {cond_1},
      {cond_2},
      {cond_3}
    ) AS level
  FROM {table}
  WHERE {timestamp_col} >= now() - INTERVAL {lookback}
  GROUP BY uid
)
GROUP BY level
ORDER BY level
```

**Real example (otel_traces):**
```sql
SELECT level, count() AS users
FROM (
  SELECT
    SpanAttributes['rum.session_id'] AS uid,
    windowFunnel(3600)(
      Timestamp,
      SpanName = 'portal.login',
      SpanName = 'order.status_checked',
      SpanName = 'photo.uploaded'
    ) AS level
  FROM otel_traces
  WHERE Timestamp >= now() - INTERVAL 30 DAY
  GROUP BY uid
)
GROUP BY level
ORDER BY level
```

**Portal usage:** Funnels tab, Process Mining > Funnel sub-tab
**Difficulty:** Easy

---

### sequenceMatch

Returns 1 if a group's event sequence matches a regex-like pattern. Use it for conformance checking and out-of-order detection.

**Template:**
```sql
SELECT
  {identity_col},
  sequenceMatch('{pattern}')(
    {timestamp_col},
    {cond_1},
    {cond_2},
    {cond_3}
  ) AS matched
FROM {table}
GROUP BY {identity_col}
```

Pattern syntax: `(?1)` matches an event where cond_1 is true, `(?2)` one where cond_2 is true; `(?1)(?2)` means cond_1 followed by cond_2 in order. Events that match none of the conditions are ignored, `.*` matches any number of events, and a time constraint such as `(?t<=3600)` between two conditions bounds the gap in seconds.

**Real example -- detect out-of-order steps:**
```sql
SELECT
  SpanAttributes['order.id'] AS order_id,
  sequenceMatch('(?1)(?2)')(
    Timestamp,
    SpanName = 'report.submitted',
    SpanName = 'appraisal.assigned'
  ) AS submitted_before_assigned
FROM otel_traces
WHERE Timestamp >= now() - INTERVAL 30 DAY
  AND SpanAttributes['order.id'] != ''
GROUP BY order_id
HAVING submitted_before_assigned = 1
```

**Portal usage:** Process Mining > Sequences tab (violation detection), Ask AI RCA prompts
**Difficulty:** Medium

---

### sequenceCount

Counts how many times a pattern occurs per group. Use for rework/bounce detection.

**Template:**
```sql
SELECT
  {identity_col},
  sequenceCount('{pattern}')(
    {timestamp_col},
    {cond_1},
    {cond_2}
  ) AS bounce_count
FROM {table}
GROUP BY {identity_col}
```

**Real example -- rework loops between review and revision:**
```sql
SELECT
  SpanAttributes['order.id'] AS order_id,
  sequenceCount('(?1)(?2)')(
    Timestamp,
    SpanName = 'review.started',
    SpanName = 'revision.requested'
  ) AS review_bounces
FROM otel_traces
WHERE Timestamp >= now() - INTERVAL 30 DAY
  AND SpanAttributes['order.id'] != ''
GROUP BY order_id
ORDER BY review_bounces DESC
LIMIT 20
```

**Portal usage:** Process Mining > Bottlenecks tab, automation candidate scoring
**Difficulty:** Medium

---

### lagInFrame / leadInFrame (Bottleneck Detection)

Compute step-to-step transition durations using window functions.

**Template:**
```sql
SELECT
  concat(prev_step, ' -> ', current_step) AS transition,
  count() AS volume,
  avg(step_duration_min) AS avg_min,
  quantile(0.95)(step_duration_min) AS p95_min
FROM (
  SELECT
    {identity_col},
    SpanName AS current_step,
    lagInFrame(SpanName) OVER (
      PARTITION BY {identity_col} ORDER BY {timestamp_col}
    ) AS prev_step,
    dateDiff('minute',
      lagInFrame({timestamp_col}) OVER (
        PARTITION BY {identity_col} ORDER BY {timestamp_col}
      ),
      {timestamp_col}
    ) AS step_duration_min
  FROM {table}
  WHERE {timestamp_col} >= now() - INTERVAL {lookback}
    AND {identity_col} != ''
)
WHERE prev_step != ''
GROUP BY transition
ORDER BY avg_min DESC
```

**Real example:**
```sql
SELECT
  concat(prev_step, ' -> ', current_step) AS transition,
  count() AS volume,
  round(avg(step_duration_min), 1) AS avg_min,
  round(quantile(0.95)(step_duration_min), 1) AS p95_min
FROM (
  SELECT
    SpanAttributes['order.id'] AS oid,
    SpanName AS current_step,
    lagInFrame(SpanName) OVER (PARTITION BY oid ORDER BY Timestamp) AS prev_step,
    dateDiff('minute',
      lagInFrame(Timestamp) OVER (PARTITION BY oid ORDER BY Timestamp),
      Timestamp
    ) AS step_duration_min
  FROM otel_traces
  WHERE Timestamp >= now() - INTERVAL 30 DAY
    AND SpanAttributes['order.id'] != ''
)
WHERE prev_step != ''
GROUP BY transition
ORDER BY avg_min DESC
LIMIT 20
```

**Portal usage:** Process Mining > Bottlenecks tab, step transition heatmap
**Difficulty:** Medium

---

### ASOF JOIN (Nearest-Event Correlation)

Joins two event streams on the nearest timestamp: for each left-table row, ASOF JOIN picks the single closest right-table row that satisfies the inequality. Use it to correlate frontend sessions with backend traces.

**Template:**
```sql
SELECT
  a.{col_a},
  b.{col_b},
  dateDiff('second', b.{ts_b}, a.{ts_a}) AS lag_seconds
FROM {table_a} a
ASOF JOIN {table_b} b
  ON a.{join_key} = b.{join_key}
  AND a.{ts_a} >= b.{ts_b}
WHERE a.{ts_a} >= now() - INTERVAL {lookback}
```

**Real example -- link session replay to backend trace:**
```sql
SELECT
  s.session_id,
  t.TraceId,
  t.SpanName,
  dateDiff('second', t.Timestamp, s.event_time) AS lag_seconds
FROM (
  SELECT
    ResourceAttributes['rum.sessionId'] AS session_id,
    Timestamp AS event_time
  FROM hyperdx_sessions
  WHERE Timestamp >= now() - INTERVAL 7 DAY
) s
ASOF JOIN (
  SELECT TraceId, SpanName, Timestamp,
    SpanAttributes['rum.session_id'] AS session_id
  FROM otel_traces
  WHERE Timestamp >= now() - INTERVAL 7 DAY
) t
  ON s.session_id = t.session_id
  AND s.event_time >= t.Timestamp
```

**Portal usage:** Sessions page (link replay to backend spans), error-to-session correlation
**Difficulty:** Hard

---

### retention

Built-in aggregate for cohort retention matrices.

**Template:**
```sql
SELECT
  cohort_day,
  retention(
    {first_action_cond},
    {return_action_cond_d1},
    {return_action_cond_d7},
    {return_action_cond_d14},
    {return_action_cond_d30}
  ) AS ret
FROM (
  SELECT
    {identity_col},
    toDate(min({timestamp_col})) AS cohort_day,
    -- boolean conditions per retention window
  FROM {table}
  GROUP BY {identity_col}
)
GROUP BY cohort_day
ORDER BY cohort_day
```

**Real example -- lifespan approximation (counts users whose first-to-last activity span reached day N):**
```sql
SELECT
  toDate(first_seen) AS cohort_day,
  count() AS cohort_size,
  countIf(d1_active) AS d1,
  countIf(d7_active) AS d7,
  countIf(d30_active) AS d30
FROM (
  SELECT
    SpanAttributes['rum.session_id'] AS uid,
    min(Timestamp) AS first_seen,
    max(Timestamp) AS last_seen,
    dateDiff('day', min(Timestamp), max(Timestamp)) AS active_span_days,
    active_span_days >= 1 AS d1_active,
    active_span_days >= 7 AS d7_active,
    active_span_days >= 30 AS d30_active
  FROM otel_traces
  WHERE Timestamp >= now() - INTERVAL 90 DAY
    AND SpanAttributes['rum.session_id'] != ''
  GROUP BY uid
)
GROUP BY cohort_day
ORDER BY cohort_day
```

**Portal usage:** Retention tab
**Difficulty:** Easy

---

### groupArray + arrayStringConcat (Variant Paths)

Build human-readable process variant strings from ordered events.

**Template:**
```sql
SELECT
  arrayStringConcat(
    groupArray({step_col} ORDER BY {timestamp_col} ASC),
    ' -> '
  ) AS variant_path,
  count() AS case_count,
  round(avg({duration_col}), 1) AS avg_hours
FROM (
  SELECT {identity_col}, {step_col}, {timestamp_col}, {duration_col}
  FROM {table}
  WHERE {filters}
)
GROUP BY {identity_col}
-- then wrap to aggregate by variant_path:
-- GROUP BY variant_path ORDER BY case_count DESC LIMIT 20
```

**Real example:**
```sql
SELECT variant_path, count() AS cases, round(avg(total_hours), 1) AS avg_hours
FROM (
  SELECT
    SpanAttributes['order.id'] AS oid,
    arrayStringConcat(
      groupArray(SpanName ORDER BY Timestamp ASC),
      ' -> '
    ) AS variant_path,
    dateDiff('hour', min(Timestamp), max(Timestamp)) AS total_hours
  FROM otel_traces
  WHERE Timestamp >= now() - INTERVAL 30 DAY
    AND SpanAttributes['order.id'] != ''
  GROUP BY oid
)
GROUP BY variant_path
ORDER BY cases DESC
LIMIT 20
```

**Portal usage:** Process Mining > Sequences tab (top paths), Variants panel
**Difficulty:** Easy

---

### Rolling Z-Score (Anomaly Detection)

Flag statistical outliers using window-function z-scores over a rolling baseline.

**Template:**
```sql
SELECT
  {time_bucket},
  {metric},
  avg({metric}) OVER (ORDER BY {time_bucket} ROWS BETWEEN {window} PRECEDING AND 1 PRECEDING) AS rolling_avg,
  stddevPop({metric}) OVER (ORDER BY {time_bucket} ROWS BETWEEN {window} PRECEDING AND 1 PRECEDING) AS rolling_std,
  ({metric} - rolling_avg) / nullIf(rolling_std, 0) AS z_score
FROM (
  SELECT
    toStartOfHour({timestamp_col}) AS {time_bucket},
    {agg_func} AS {metric}
  FROM {table}
  WHERE {timestamp_col} >= now() - INTERVAL {lookback}
  GROUP BY {time_bucket}
)
ORDER BY {time_bucket}
```

**Real example -- hourly error rate anomalies:**
```sql
SELECT
  hour,
  error_rate,
  round(rolling_avg, 4) AS baseline,
  round(z_score, 2) AS z_score,
  abs(z_score) > 3 AS is_anomaly
FROM (
  SELECT
    hour,
    error_rate,
    avg(error_rate) OVER (ORDER BY hour ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING) AS rolling_avg,
    stddevPop(error_rate) OVER (ORDER BY hour ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING) AS rolling_std,
    (error_rate - rolling_avg) / nullIf(rolling_std, 0) AS z_score
  FROM (
    SELECT
      toStartOfHour(Timestamp) AS hour,
      countIf(StatusCode = 'ERROR') / count() AS error_rate
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 30 DAY
    GROUP BY hour
  )
)
ORDER BY hour
```
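The rolling-window math is easy to misread in SQL; here is the same computation sketched in JavaScript (the 168-point window mirrors `ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING`, i.e. the prior 7 days of hourly buckets; function and field names are illustrative):

```javascript
// Rolling z-score over ordered metric values: each point is compared
// against the mean/stddev of the previous `window` points, excluding itself.
function rollingZScores(values, window = 168, threshold = 3) {
  return values.map((value, i) => {
    const baseline = values.slice(Math.max(0, i - window), i);
    if (baseline.length < 2) return { value, zScore: null, isAnomaly: false };
    const mean = baseline.reduce((a, b) => a + b, 0) / baseline.length;
    const variance = baseline.reduce((a, b) => a + (b - mean) ** 2, 0) / baseline.length;
    const std = Math.sqrt(variance);
    const zScore = std === 0 ? null : (value - mean) / std;
    return { value, zScore, isAnomaly: zScore !== null && Math.abs(zScore) > threshold };
  });
}
```

Note the window ends at the point *before* the current one -- including the current point in its own baseline dilutes the spike you are trying to detect.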

**Portal usage:** Ask AI anomaly explanation, error tracking sparklines, alerting candidate
**Difficulty:** Hard

---

## 2. Visualization Recipes

### Sankey Diagram (Process Flow)

**Data shape:**
```
| source          | target           | value |
|-----------------|------------------|-------|
| portal.login    | order.created    | 8230  |
| order.created   | appraisal.assign | 7100  |
| order.created   | report.submitted | 1130  |
```

Generate from the `lagInFrame` bottleneck query above -- each transition row becomes a Sankey link.

**Library:** d3-sankey
```html
<script src="https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/d3-sankey@0.12/dist/d3-sankey.min.js"></script>
```
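The transition rows need reshaping into the `{ nodes, links }` structure d3-sankey consumes -- a minimal sketch (field names follow the data-shape table above):

```javascript
// Convert transition rows ({ source, target, value }) into d3-sankey input:
// a deduplicated node list plus links that reference nodes by index.
function toSankeyData(transitions) {
  const names = [...new Set(transitions.flatMap(t => [t.source, t.target]))];
  const index = new Map(names.map((name, i) => [name, i]));
  return {
    nodes: names.map(name => ({ name })),
    links: transitions.map(t => ({
      source: index.get(t.source),
      target: index.get(t.target),
      value: t.value
    }))
  };
}
```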

**Implementation:**
```javascript
const { sankey, sankeyLinkHorizontal } = d3;
const layout = sankey().nodeWidth(15).nodePadding(10)
  .extent([[0, 0], [width, height]]);
const { nodes, links } = layout({
  nodes: nodeData.map(d => ({ ...d })),
  links: linkData.map(d => ({ ...d }))
});
svg.selectAll('.link').data(links).join('path')
  .attr('d', sankeyLinkHorizontal()).attr('stroke-width', d => d.width)
  .attr('fill', 'none').attr('stroke', '#93c5fd').attr('opacity', 0.5);
svg.selectAll('.node').data(nodes).join('rect')
  .attr('x', d => d.x0).attr('y', d => d.y0)
  .attr('width', d => d.x1 - d.x0).attr('height', d => d.y1 - d.y0)
  .attr('fill', '#2563eb');
```

**Portal usage:** Process Mining > Sequences tab (process flow map)

---

### Timeline / Swimlane (Operator Activity)

**Data shape:**
```
| group       | label            | start               | end                 |
|-------------|------------------|---------------------|---------------------|
| Operator A  | appraisal.review | 2026-03-15T09:00:00 | 2026-03-15T09:45:00 |
| Operator A  | report.write     | 2026-03-15T10:00:00 | 2026-03-15T11:30:00 |
| Operator B  | photo.upload     | 2026-03-15T09:15:00 | 2026-03-15T09:20:00 |
```

**Library:** timelines-chart
```html
<script src="https://cdn.jsdelivr.net/npm/timelines-chart/dist/timelines-chart.min.js"></script>
```

**Implementation:**
```javascript
const chart = TimelinesChart()(document.getElementById('timeline'));
chart.data([{
  group: 'Operator A',
  data: [{
    label: 'appraisal.review',
    data: [{ timeRange: [start, end], val: 'active' }]
  }]
}]).zColorScale(d3.scaleOrdinal(['#3b82f6', '#ef4444', '#10b981']));
```

**Portal usage:** Sessions page (per-session trace timeline), By LOB tab (parallel operator activity)

---

### Anomaly Sparkline

**Data shape:**
```
| hour                | value  | is_anomaly |
|---------------------|--------|------------|
| 2026-03-15T08:00:00 | 0.012  | false      |
| 2026-03-15T09:00:00 | 0.089  | true       |
```

**Library:** Chart.js (already loaded in portal)
```html
<script src="https://cdn.jsdelivr.net/npm/chart.js@4/dist/chart.umd.min.js"></script>
```

**Implementation:**
```javascript
new Chart(ctx, {
  type: 'line',
  data: {
    labels: data.map(d => d.hour),
    datasets: [{
      data: data.map(d => d.value),
      pointBackgroundColor: data.map(d => d.is_anomaly ? '#ef4444' : '#3b82f6'),
      pointRadius: data.map(d => d.is_anomaly ? 5 : 1),
      borderColor: '#93c5fd', borderWidth: 1, fill: false, tension: 0.3
    }]
  },
  options: { plugins: { legend: { display: false } }, scales: { x: { display: false }, y: { display: false } } }
});
```

**Portal usage:** Error tracking sparklines, landing page real-time widget, Ask AI anomaly cards

---

### Process Map (Mermaid.js)

**Data shape:** Generate Mermaid syntax from transition query output.

**Library:** Mermaid.js
```html
<script src="https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.min.js"></script>
```

**Implementation:**
```javascript
// Build from transition query results; sanitize step names so they are
// safe Mermaid node IDs (characters like dots can confuse the parser)
const nodeId = s => s.replace(/[^A-Za-z0-9_]/g, '_');
const lines = ['graph LR'];
transitions.forEach(t => {
  lines.push(`  ${nodeId(t.source)}["${t.source}<br/>${t.volume}"] -->|${t.avg_min}m| ${nodeId(t.target)}["${t.target}"]`);
});
const el = document.getElementById('process-map');
el.textContent = lines.join('\n');
mermaid.run({ nodes: [el] }); // mermaid.init() is deprecated as of v10
```

**Portal usage:** Ask AI response rendering (inline process diagrams), documentation generation

---

### Streaming Pivot Table (Perspective)

**Data shape:** Any tabular SQL result set. Perspective handles pivoting, filtering, and charting client-side.

**Library:** Perspective (FINOS)
```html
<script src="https://cdn.jsdelivr.net/npm/@finos/perspective/dist/umd/perspective.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@finos/perspective-viewer/dist/umd/perspective-viewer.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@finos/perspective-viewer-datagrid/dist/umd/perspective-viewer-datagrid.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@finos/perspective-viewer-d3fc/dist/umd/perspective-viewer-d3fc.js"></script>
```

**Implementation:**
```javascript
const viewer = document.createElement('perspective-viewer');
document.getElementById('pivot-container').appendChild(viewer);
const worker = perspective.worker();
const table = await worker.table(queryResultJSON);
await viewer.load(table);
viewer.restore({ group_by: ['ServiceName'], columns: ['count', 'avg_duration_ms'] });
```

**Portal usage:** Query Studio (alternative result rendering), By LOB deep-dive

---

## 3. AI Prompt Templates

Inject these into `/api/query` calls when Ask AI is triggered from a specific page context. Replace `{placeholders}` with actual query results.
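A minimal interpolation helper for filling the placeholders (an illustrative sketch; the portal's actual request plumbing may differ):

```javascript
// Replace {placeholder} tokens in a prompt template with query-result values.
// Unknown placeholders are left intact so gaps stay visible during review.
function fillTemplate(template, values) {
  return template.replace(/\{(\w+)\}/g, (match, key) =>
    key in values ? String(values[key]) : match
  );
}
```

Usage: `fillTemplate('Total sessions analyzed: {total_sessions}', { total_sessions: 4812 })`.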

### Process Mining Analysis

```
You are analyzing process mining data from a ClickHouse-backed analytics portal.

## Data
- Total sessions analyzed: {total_sessions}
- Unique process variants: {unique_paths}
- Top 5 paths (by frequency):
{top_paths_formatted}
- Top 3 bottleneck transitions (by avg duration):
{bottlenecks_formatted}

## Task
Summarize the process in 2-3 sentences for a VP who has never seen this data. Answer:
1. What is the "happy path" and what percentage follows it?
2. Where do deviations occur and what do they suggest?
3. What is the estimated time recoverable through automation?

Be specific with numbers. No hedging. No "it appears that."
```

### Root Cause Analysis

```
You are performing root cause analysis on slow traces.

## Slow population (p95+ duration)
{slow_traces_sample}

## Fast population (p50 duration)
{fast_traces_sample}

## Attribute comparison
{attribute_delta_table}

## Task
Identify the 1-3 most likely root causes for the slow traces. For each:
1. State the cause in one sentence
2. Cite the specific attribute values that differ between slow and fast populations
3. Suggest one concrete remediation action

Do not speculate beyond what the data shows.
```

### Anomaly Explanation

```
You are explaining anomalies detected in operational metrics.

## Anomalies detected
{anomaly_rows}
(Each row: timestamp, metric_value, rolling_baseline, z_score)

## Context
- Metric: {metric_name}
- Normal range: {baseline_avg} +/- {baseline_std}
- Detection threshold: |z_score| > 3

## Task
For each anomaly, explain in one sentence what happened in plain English.
Example: "At 2pm on March 15, the error rate spiked to 8.9% (normally 1.2%), suggesting a deployment or upstream dependency failure."

Then rank them by business impact (highest first).
```

### Automation Candidate Scoring

```
You are scoring process steps for automation ROI.

## Step frequency and duration data
{steps_table}
(Columns: step_name, daily_volume, avg_duration_minutes, p95_duration_minutes, rework_rate_pct)

## Task
Score each step on automation ROI using this formula:
  ROI_score = daily_volume * avg_duration_minutes * (1 + rework_rate_pct/100)

Return a ranked table:
| Rank | Step | ROI Score | Est. Hours Saved/Week | Automation Approach |

For "Automation Approach," pick one: rule-based workflow, API integration, ML classification, or human-in-the-loop review.

Be concrete. "Automate the email-to-order handoff via API integration" not "consider automation."
```
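If you prefer not to trust the model's arithmetic, the same formula is easy to apply upstream and pass in pre-ranked (a sketch; field names match the prompt's column list, and the hours-saved figure assumes 7 days of volume at the average duration):

```javascript
// ROI_score = daily_volume * avg_duration_minutes * (1 + rework_rate_pct/100)
function rankAutomationCandidates(steps) {
  return steps
    .map(s => ({
      ...s,
      roiScore: s.daily_volume * s.avg_duration_minutes * (1 + s.rework_rate_pct / 100),
      estHoursSavedPerWeek: (s.daily_volume * s.avg_duration_minutes * 7) / 60
    }))
    .sort((a, b) => b.roiScore - a.roiScore); // highest ROI first
}
```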

---

## 4. Cost Framing

**Celonis:** $100K+/year starting price for process mining. Our portal runs the same `windowFunnel`, `sequenceMatch`, and variant analysis on a $170/month ClickHouse Cloud instance. Same SQL, same insights, no vendor lock-in.

**Datadog:** Stores OTEL data at roughly 200x the cost of ClickHouse for equivalent retention. Our OTEL collector writes directly to ClickHouse -- no intermediate aggregation tier, no per-host pricing, no trace sampling required.

**GA360 + Pendo:** $162K/year combined for session analytics and product analytics. We deliver session replay (rrweb), funnels (`windowFunnel`), retention matrices, process mining, and an AI query assistant for under $2K/year all-in -- ClickHouse Cloud + AKS compute + Cloudflare Workers.

**Amplitude Enterprise:** $30K-$200K/year with per-seat fees. Our portal has zero per-seat cost. At 10K daily visitors: ~$260/month. Full SQL access, full data ownership, no export restrictions.

---

## 5. Schema Quick Reference

### otel_traces (primary table)

| Column | Type | Description |
|--------|------|-------------|
| `Timestamp` | `DateTime64(9)` | Span start time (nanosecond precision) |
| `TraceId` | `String` | Distributed trace ID |
| `SpanId` | `String` | Unique span ID |
| `ParentSpanId` | `String` | Parent span (empty for root spans) |
| `ServiceName` | `LowCardinality(String)` | Emitting service name |
| `SpanName` | `LowCardinality(String)` | Operation name (the "event type" for process mining) |
| `SpanKind` | `LowCardinality(String)` | INTERNAL / SERVER / CLIENT |
| `StatusCode` | `LowCardinality(String)` | OK / ERROR / UNSET |
| `StatusMessage` | `String` | Error message (when StatusCode = ERROR) |
| `Duration` | `Int64` | Span duration in nanoseconds |
| `SpanAttributes` | `Map(String, String)` | Key-value span attributes |
| `ResourceAttributes` | `Map(String, String)` | Key-value resource attributes |

### Key SpanAttributes Map Keys

| Key | Example Value | Used For |
|-----|---------------|----------|
| `order.id` | `ORD-2026-001` | Process mining case ID (PARTITION BY) |
| `operator.id` | `op-437` | Operator-level analysis, workload distribution |
| `lob` | `Appraisal` | Line-of-business segmentation |
| `core.task` | `valuation_review` | Task-level process mining |
| `automation.candidate` | `true` | Pre-tagged automation candidates |
| `rum.session_id` | `abc123-def456` | Browser session correlation |
| `http.url` | `https://exos.../api/orders` | URL-based filtering |
| `http.method` | `POST` | HTTP method |
| `http.status_code` | `200` | HTTP status |
| `http.target` | `/api/orders` | URL path (no host) |
| `user.id` | `jlanzone@svclnk.com` | User identity (when set) |
| `product.event` | `true` | Filter to product-level events only |

### Table Relationships

```
otel_traces <-- TraceId --> otel_logs
     |
     +-- SpanAttributes['rum.session_id']
     |           |
     |           v
     +-- hyperdx_sessions (ResourceAttributes['rum.sessionId'])
                |
                +-- rrweb DOM replay events (LogAttributes['rr-web.event'])
```

### Common WHERE Clauses

```sql
-- Time window (last N days)
WHERE Timestamp >= now() - INTERVAL 30 DAY

-- Specific service
AND ServiceName = 'exos-orders'

-- Only errors
AND StatusCode = 'ERROR'

-- Only product events (not infra spans)
AND SpanAttributes['product.event'] = 'true'

-- Specific LOB
AND SpanAttributes['lob'] = 'Appraisal'

-- Only spans with an order ID (process mining scope)
AND SpanAttributes['order.id'] != ''

-- Duration threshold (slow spans, > 5 seconds)
AND Duration > 5000000000

-- Exclude health checks
AND SpanName NOT IN ('healthz', 'readyz', 'HEAD /')
```

### Views (defined in SCHEMA_VIEWS.sql)

| View | Base Table | Purpose |
|------|-----------|---------|
| `sessions` | `hyperdx_sessions` | One row per browser session with duration and event count |
| `traces` | `otel_traces` | Human-readable column aliases, extracted HTTP context |
| `errors` | `otel_traces` | Pre-filtered to `StatusCode = 'ERROR'` |
| `logs` | `otel_logs` | Human-readable log view with severity |
| `page_views` | `otel_traces` | Pre-filtered to `documentLoad` / `routeChange` spans |
| `user_actions` | `otel_traces` | Pre-filtered to known product events with labels |
| `services` | `otel_traces` | Per-service aggregates: trace count, error rate, latency percentiles |

---

## 6. Automated Dimensional RCA

When a metric spikes, automatically slice by every dimension and rank by contribution. This is the Datadog Watchdog / Meta DrP pattern -- we do it in pure SQL. Meta's internal DrP platform runs 50,000 analyses daily across 300+ teams using this exact approach (Fast Dimensional Analysis, or FDA). The core insight: don't ask humans "what changed?" -- exhaustively test every dimension and let the math surface the winner.

### The Pattern (Implementation-Agnostic)

1. **Compute baseline**: Rolling 4-week average for the metric, broken down by each dimension
2. **Compute current**: Same metric over the anomaly window
3. **For each dimension** (operator, LOB, order_type, region, mailbox, day_of_week): compute current vs baseline
4. **Rank by variance contribution**: Which dimension value accounts for the largest share of the delta?
5. **Surface the winner**: "Orders from the northeast region account for 73% of the cycle time increase this week"

The key metric is **excess count** -- not just "this dimension has a high rate" but "this dimension contributes N excess errors/minutes/failures above what the baseline would predict given its volume." This prevents large-volume dimensions from always dominating.
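Excess count, sketched in JavaScript (illustrative names; the SQL templates in this section express the related rate-delta contribution):

```javascript
// Excess count: observed events minus what the baseline rate predicts for
// this dimension's current volume. Volume-neutral, so large slices don't
// automatically dominate the ranking.
function excessCounts(dims) {
  return dims
    .map(d => ({
      ...d,
      excess: d.currentErrors - d.baselineRate * d.currentVolume
    }))
    .sort((a, b) => b.excess - a.excess);
}
```

A slice running exactly at its baseline rate scores ~0 no matter how large it is; a small slice running far above baseline surfaces to the top.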

### When to use

- Any anomaly alert fires (z-score > 3)
- Weekly metric reviews ("why did cycle time increase?")
- SLA breaches ("which dimension drove the miss?")
- Capacity planning ("where is demand growing fastest?")

### SQL Template (single dimension)

```sql
WITH
    baseline AS (
        SELECT
            {dimension_expr} AS dim_value,
            count() AS total,
            {metric_agg} AS metric_value
        FROM {table}
        WHERE {timestamp_col} BETWEEN now() - INTERVAL 28 DAY AND now() - INTERVAL 1 DAY
        GROUP BY dim_value
    ),
    current_window AS (
        SELECT
            {dimension_expr} AS dim_value,
            count() AS total,
            {metric_agg} AS metric_value
        FROM {table}
        WHERE {timestamp_col} BETWEEN {anomaly_start} AND {anomaly_end}
        GROUP BY dim_value
    ),
    global_baseline AS (
        SELECT {metric_agg} AS global_rate FROM {table}
        WHERE {timestamp_col} BETWEEN now() - INTERVAL 28 DAY AND now() - INTERVAL 1 DAY
    ),
    global_current AS (
        SELECT {metric_agg} AS global_rate FROM {table}
        WHERE {timestamp_col} BETWEEN {anomaly_start} AND {anomaly_end}
    )
SELECT
    c.dim_value,
    round(b.metric_value, 4) AS baseline_rate,
    round(c.metric_value, 4) AS current_rate,
    round(c.metric_value - coalesce(b.metric_value, 0), 4) AS delta,
    -- Contribution: volume-weighted share of the global delta this dimension value explains
    round(
        (c.metric_value - coalesce(b.metric_value, 0)) * c.total
        / nullIf((SELECT sum(total) FROM current_window), 0)
        / nullIf((SELECT global_rate FROM global_current) - (SELECT global_rate FROM global_baseline), 0)
        * 100
    , 2) AS contribution_pct,
    c.total AS current_volume,
    '{dimension_name}' AS dimension
FROM current_window c
LEFT JOIN baseline b ON c.dim_value = b.dim_value
ORDER BY abs(delta) DESC
LIMIT 10
```

### Real example -- error rate spike by LOB

```sql
WITH
    baseline AS (
        SELECT
            SpanAttributes['lob'] AS dim_value,
            count() AS total,
            countIf(StatusCode = 'ERROR') / count() AS error_rate
        FROM otel_traces
        WHERE Timestamp BETWEEN now() - INTERVAL 28 DAY AND now() - INTERVAL 1 DAY
        GROUP BY dim_value
    ),
    anomaly AS (
        SELECT
            SpanAttributes['lob'] AS dim_value,
            count() AS total,
            countIf(StatusCode = 'ERROR') / count() AS error_rate
        FROM otel_traces
        WHERE Timestamp BETWEEN '2026-03-15 14:00:00' AND '2026-03-15 15:00:00'
        GROUP BY dim_value
    )
SELECT
    a.dim_value,
    round(b.error_rate * 100, 2) AS baseline_pct,
    round(a.error_rate * 100, 2) AS anomaly_pct,
    round((a.error_rate - b.error_rate) * 100, 2) AS delta_pct,
    round(
        (a.error_rate - coalesce(b.error_rate, 0)) * a.total
        / nullIf(sum((a.error_rate - coalesce(b.error_rate, 0)) * a.total) OVER (), 0)
        * 100
    , 2) AS contribution_pct,
    a.total AS volume_in_window,
    'lob' AS dimension
FROM anomaly a
LEFT JOIN baseline b ON a.dim_value = b.dim_value
ORDER BY abs(delta_pct) DESC
LIMIT 10
```

### Multi-dimension sweep (UNION ALL pattern)

Run the single-dimension query once per dimension, then UNION ALL and re-rank:

```sql
-- Sweep all dimensions in one query, then re-rank globally.
-- Each branch is wrapped in a subquery so its ORDER BY / LIMIT applies to
-- that branch alone; the outer ORDER BY ranks across all dimensions.
SELECT * FROM (
    SELECT * FROM (
        -- dimension: lob
        SELECT dim_value, baseline_pct, anomaly_pct, delta_pct, 'lob' AS dimension
        FROM ( /* baseline + anomaly CTEs for SpanAttributes['lob'] */ )
        ORDER BY abs(delta_pct) DESC LIMIT 5
    )
    UNION ALL
    SELECT * FROM (
        -- dimension: operator
        SELECT dim_value, baseline_pct, anomaly_pct, delta_pct, 'operator' AS dimension
        FROM ( /* baseline + anomaly CTEs for SpanAttributes['operator.id'] */ )
        ORDER BY abs(delta_pct) DESC LIMIT 5
    )
    UNION ALL
    SELECT * FROM (
        -- dimension: service
        SELECT dim_value, baseline_pct, anomaly_pct, delta_pct, 'service' AS dimension
        FROM ( /* baseline + anomaly CTEs for ServiceName */ )
        ORDER BY abs(delta_pct) DESC LIMIT 5
    )
    UNION ALL
    SELECT * FROM (
        -- dimension: day_of_week
        SELECT dim_value, baseline_pct, anomaly_pct, delta_pct, 'day_of_week' AS dimension
        FROM ( /* baseline + anomaly CTEs for toDayOfWeek(Timestamp) */ )
        ORDER BY abs(delta_pct) DESC LIMIT 5
    )
)
ORDER BY abs(delta_pct) DESC
LIMIT 10
```

The top row across all dimensions tells you the single most explanatory slice. Example output: "operator=op-437 explains 68% of the error rate increase (baseline 1.2%, anomaly window 14.7%)."

### Advanced: Lift-based scoring (FDA / Apriori)

For multi-dimensional combinations (e.g., "operator X in LOB Y on day Z"), use lift scoring borrowed from Meta's FDA paper:

```
lift(X, failure) = P(failure | X) / P(failure)
```

A lift > 1 means dimension value X is over-represented in failures. A lift of 3.2 means "this combination is 3.2x more likely to fail than average." Combine with support (volume) to filter noise -- require a minimum of 20 observations per combination.
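A minimal lift calculator over per-dimension counts (a sketch of the scoring idea, not Meta's implementation; the 20-observation support floor follows the guidance above):

```javascript
// lift(X, failure) = P(failure | X) / P(failure).
// rows: { dim, total, failures }. Returns values passing the support floor,
// ranked by lift (most over-represented in failures first).
function liftScores(rows, minSupport = 20) {
  const total = rows.reduce((a, r) => a + r.total, 0);
  const failures = rows.reduce((a, r) => a + r.failures, 0);
  const pFailure = failures / total;
  return rows
    .filter(r => r.total >= minSupport)
    .map(r => ({ dim: r.dim, lift: (r.failures / r.total) / pFailure, support: r.total }))
    .sort((a, b) => b.lift - a.lift);
}
```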

**Portal usage:** Ask AI RCA drill-down (auto-generate on anomaly detection), Bottlenecks tab dimensional breakdown
**Difficulty:** Medium

---

## 7. Probabilistic Attribution

Not all event chains are deterministic. When System A produces an event and System B produces a downstream action, but no foreign key connects them, you need probabilistic attribution: "this email **probably** triggered this action" with a confidence score.

### The Pattern (Implementation-Agnostic)

Three scoring components, weighted and summed:

| Component | What it measures | Score range | Weight |
|-----------|-----------------|-------------|--------|
| **Temporal proximity** | How close in time are the two events? | 0.0 - 1.0 | 0.4 |
| **Attribute overlap** | How many shared dimensions (operator, mailbox, client)? | 0.0 - 1.0 | 0.4 |
| **Frequency baseline** | How often does this combination occur normally? (rarer = higher signal) | 0.0 - 1.0 | 0.2 |

**Combined confidence:**
```
confidence = proximity_weight * temporal_score
           + attribute_weight * overlap_score
           + frequency_weight * rarity_score
```

### Temporal proximity scoring

| Time gap | Score | Rationale |
|----------|-------|-----------|
| < 5 min | 1.0 | Almost certainly causally linked |
| 5 - 30 min | 0.7 | Likely linked (operator context-switched briefly) |
| 30 min - 2 hr | 0.4 | Plausible (batched work) |
| 2 - 8 hr | 0.15 | Weak (same shift, possibly related) |
| > 8 hr | 0.0 | Noise -- different shift or day |

### Attribute overlap scoring

Each shared attribute adds to the score:

| Shared attribute | Score contribution |
|-----------------|-------------------|
| Same operator | +0.35 |
| Same mailbox/queue | +0.25 |
| Same client/order keyword | +0.25 |
| Same LOB | +0.15 |

Cap the total at 1.0, and require at least one shared attribute before scoring a pair at all.

### Frequency baseline (rarity bonus)

If an operator handles 200 emails/day and creates 50 orders, any email-to-order pair has a 25% base rate. But if this specific mailbox only produces 3 orders/day, a match is much more informative. Score = `1 - base_rate`.
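Pulling the three components together (a JavaScript sketch mirroring the tables above; the weights and time brackets match the SQL template in the next section):

```javascript
// Temporal proximity bracket (see table above).
function temporalScore(minutesGap) {
  if (minutesGap <= 5) return 1.0;
  if (minutesGap <= 30) return 0.7;
  if (minutesGap <= 120) return 0.4;
  if (minutesGap <= 480) return 0.15;
  return 0.0;
}

// Attribute overlap, capped at 1.0. shared: { operator, mailbox, client, lob }.
function overlapScore(shared) {
  const s = (shared.operator ? 0.35 : 0) + (shared.mailbox ? 0.25 : 0)
          + (shared.client ? 0.25 : 0) + (shared.lob ? 0.15 : 0);
  return Math.min(s, 1.0);
}

// Composite confidence: 0.4 temporal + 0.4 overlap + 0.2 rarity bonus.
// Requires at least one shared attribute to score at all.
function attributionConfidence(minutesGap, shared, baseRate) {
  if (!Object.values(shared).some(Boolean)) return 0;
  return 0.4 * temporalScore(minutesGap)
       + 0.4 * overlapScore(shared)
       + 0.2 * (1 - baseRate);
}
```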

### SQL Template (ASOF JOIN + composite scoring)

```sql
SELECT
    source.{source_id} AS source_event,
    source.{timestamp_col} AS source_time,
    target.{target_id} AS target_event,
    target.{timestamp_col} AS target_time,
    dateDiff('minute', source.{timestamp_col}, target.{timestamp_col}) AS minutes_gap,

    -- Temporal proximity score
    0.4 * multiIf(
        minutes_gap <= 5, 1.0,
        minutes_gap <= 30, 0.7,
        minutes_gap <= 120, 0.4,
        minutes_gap <= 480, 0.15,
        0.0
    )
    -- Attribute overlap score
    + 0.4 * (
        if(source.{attr_1} = target.{attr_1} AND source.{attr_1} != '', 0.35, 0.0)
      + if(source.{attr_2} = target.{attr_2} AND source.{attr_2} != '', 0.25, 0.0)
      + if(source.{attr_3} = target.{attr_3} AND source.{attr_3} != '', 0.25, 0.0)
      + if(source.{attr_4} = target.{attr_4} AND source.{attr_4} != '', 0.15, 0.0)
    )
    -- Frequency baseline (rarity bonus) -- higher when combination is rare
    + 0.2 * (1.0 - {base_rate_for_combination})
    AS attribution_confidence

FROM {source_table} source
ASOF JOIN {target_table} target
    ON source.{join_key} = target.{join_key}
    AND target.{timestamp_col} >= source.{timestamp_col}
WHERE source.{source_filter}
    AND target.{target_filter}
    AND source.{timestamp_col} >= now() - INTERVAL {lookback}
HAVING attribution_confidence >= 0.5
ORDER BY source_time DESC
```

### Real example -- email to order action attribution

```sql
SELECT
    e.SpanAttributes['operator.id'] AS operator,
    e.SpanAttributes['mailbox.id'] AS mailbox,
    e.SpanName AS email_event,
    e.Timestamp AS email_time,
    a.SpanName AS action_event,
    a.Timestamp AS action_time,
    dateDiff('minute', e.Timestamp, a.Timestamp) AS minutes_gap,
    -- Temporal proximity (weight: 0.4)
    0.4 * multiIf(
        dateDiff('minute', e.Timestamp, a.Timestamp) <= 5, 1.0,
        dateDiff('minute', e.Timestamp, a.Timestamp) <= 30, 0.7,
        dateDiff('minute', e.Timestamp, a.Timestamp) <= 120, 0.4,
        dateDiff('minute', e.Timestamp, a.Timestamp) <= 480, 0.15,
        0.0
    )
    -- Attribute overlap (weight: 0.4)
    + 0.4 * (
        if(e.SpanAttributes['operator.id'] = a.SpanAttributes['operator.id']
           AND e.SpanAttributes['operator.id'] != '', 0.35, 0.0)
      + if(e.SpanAttributes['mailbox.id'] = a.SpanAttributes['mailbox.id']
           AND e.SpanAttributes['mailbox.id'] != '', 0.25, 0.0)
      + if(e.SpanAttributes['client.name'] = a.SpanAttributes['client.name']
           AND e.SpanAttributes['client.name'] != '', 0.25, 0.0)
      + if(e.SpanAttributes['lob'] = a.SpanAttributes['lob']
           AND e.SpanAttributes['lob'] != '', 0.15, 0.0)
    )
    AS attribution_confidence,
    multiIf(
        attribution_confidence >= 0.7, 'HIGH',
        attribution_confidence >= 0.5, 'MEDIUM',
        'LOW'
    ) AS confidence_level
FROM otel_traces e
ASOF JOIN otel_traces a
    ON e.SpanAttributes['operator.id'] = a.SpanAttributes['operator.id']
    AND a.Timestamp >= e.Timestamp
WHERE e.SpanName = 'email.opened'
    AND a.SpanName IN ('order.created', 'order.lookup', 'status.updated')
    AND e.Timestamp >= now() - INTERVAL 7 DAY
    AND attribution_confidence >= 0.5
ORDER BY email_time DESC
LIMIT 100
```

**Example output:** "Email from cd_otc mailbox at 10:14 was actioned by R. Martinez at 10:18 on order ORD-2026-00128 -- 74% confidence (HIGH) based on operator + mailbox + client match and 4-minute proximity."

### Calibration

Track attribution accuracy over time. When you eventually add explicit foreign keys (e.g., order ID in the email metadata), compare against your probabilistic assignments to tune weights. Target: 85%+ precision at the HIGH confidence threshold.
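
A sketch of that calibration check, assuming each probabilistic link can later be labeled correct or not against the ground-truth keys (the link data below is made up):

```python
def precision_at(threshold, scored_links):
    """Share of links at or above a confidence threshold later confirmed correct.

    scored_links: (confidence, is_correct) pairs; is_correct comes from the
    ground-truth source added later, e.g. an explicit order ID on the email.
    """
    flagged = [ok for conf, ok in scored_links if conf >= threshold]
    return sum(flagged) / len(flagged) if flagged else None

links = [(0.9, True), (0.8, True), (0.75, False), (0.6, True), (0.55, False)]
high_precision = precision_at(0.7, links)  # 2 of 3 HIGH links were correct
```

If `high_precision` drops below the 85% target, rebalance the temporal/attribute/rarity weights before trusting the HIGH tier in reports.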

**Portal usage:** Ask AI attribution cards, email-to-action reports on the Sessions page
**Difficulty:** Hard

---

## 8. Event Clustering (Unsupervised)

Find patterns without predefined categories. When you don't know what workflows exist, let the data tell you.

### The Pattern (Implementation-Agnostic)

Three complementary clustering approaches:

| Method | What it finds | Best for |
|--------|--------------|----------|
| **Template mining (Drain3)** | Structural patterns in event sequences | Discovering undocumented workflows |
| **Feature-vector clustering (K-means)** | Behavioral groups among operators | Operator segmentation, coaching |
| **Temporal clustering (burst detection)** | When events cluster in time | Load spikes, batch processing windows |

### 8a. Drain3 Template Mining

The Drain algorithm (the log parser behind the Drain3 library) uses a fixed-depth prefix tree to group event sequences by structural similarity. It is streaming-capable (processing events one at a time), requires no labeled data, and reduces millions of raw sequences to a manageable set of templates.

**How it works:**
1. Each event sequence is tokenized (e.g., `["email.opened", "order.created", "appraisal.assigned"]`)
2. Drain routes the sequence through a parse tree based on length and leading tokens
3. Sequences matching an existing template (above similarity threshold) are merged
4. New patterns create new template nodes
5. Output: a ranked list of templates with occurrence counts

**Why Drain3 over K-means for sequences:** Drain3 preserves the ordering and structure of events. K-means on bag-of-words features loses sequence information. Use Drain3 when the order matters (process mining), K-means when you care about proportions (behavior profiling).
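
A two-line sanity check of that distinction (hypothetical event names): two sequences containing the same events in different orders are indistinguishable as bags of words but distinct to any order-aware representation.

```python
from collections import Counter

a = ['email.opened', 'order.created', 'review.started']
b = ['order.created', 'email.opened', 'review.started']  # same events, new order

assert Counter(a) == Counter(b)    # bag-of-words features: identical
assert ' '.join(a) != ' '.join(b)  # order-aware key (what Drain3 sees): different
```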

**SQL -- extract action sequences for clustering:**
```sql
SELECT
    SpanAttributes['operator.id'] AS operator,
    toDate(Timestamp) AS work_date,
    arrayMap(x -> x.2, arraySort(groupArray((Timestamp, SpanName)))) AS action_sequence,
    length(action_sequence) AS action_count
FROM otel_traces
WHERE Timestamp >= now() - INTERVAL 30 DAY
    AND SpanAttributes['operator.id'] != ''
    AND SpanAttributes['product.event'] = 'true'
GROUP BY operator, work_date
ORDER BY operator, work_date
```

**Client-side clustering (Pyodide / WASM):**

ClickStack runs Drain3 in-browser via Pyodide. This keeps data local (no server round-trip for clustering) and works offline.

```javascript
// Load in ClickStack's Pyodide runtime
const pyResult = await pyodide.runPythonAsync(`
from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
config.drain_depth = 4          # Parse tree depth (higher = more specific templates)
config.drain_sim_th = 0.4       # Similarity threshold (lower = more templates)
config.drain_max_children = 100  # Max children per tree node
miner = TemplateMiner(config=config)

for seq in action_sequences:
    miner.add_log_message(' '.join(seq))

clusters = []
for c in miner.drain.clusters:
    clusters.append({
        'template': str(c.get_template()),
        'size': c.size,
        'cluster_id': c.cluster_id
    })
sorted(clusters, key=lambda x: -x['size'])[:20]
`);
```

**Server-side alternative (Python):**
```python
from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig
import json

config = TemplateMinerConfig()
config.drain_depth = 4
config.drain_sim_th = 0.4
miner = TemplateMiner(config=config)

# sequences is a list of lists from the SQL query above
for seq in sequences:
    line = ' '.join(seq['action_sequence'])
    result = miner.add_log_message(line)

# Output: top templates
templates = sorted(miner.drain.clusters, key=lambda c: -c.size)
for t in templates[:20]:
    print(f"[{t.size:5d}x] {t.get_template()}")
    # Example: [ 1247x] email.opened order.created appraisal.assigned <*> report.submitted review.started order.closed
    # The <*> wildcard shows where sequences diverge
```

### 8b. K-means on Operator Behavior Vectors

When you want to group operators (not sequences) by behavioral similarity:

```sql
-- Build feature vectors: one row per operator with behavioral metrics
SELECT
    SpanAttributes['operator.id'] AS operator,
    count() AS total_actions,
    countIf(SpanName = 'email.opened') / count() AS email_pct,
    countIf(SpanName = 'order.created') / count() AS create_pct,
    countIf(SpanName = 'review.started') / count() AS review_pct,
    countIf(StatusCode = 'ERROR') / count() AS error_rate,
    avg(Duration / 1e9) AS avg_duration_sec,
    uniqExact(SpanAttributes['order.id']) AS unique_orders,
    uniqExact(toDate(Timestamp)) AS active_days,
    total_actions / active_days AS actions_per_day
FROM otel_traces
WHERE Timestamp >= now() - INTERVAL 30 DAY
    AND SpanAttributes['operator.id'] != ''
GROUP BY operator
HAVING total_actions >= 50
```

Feed the result into K-means (scikit-learn, Pyodide, or any ML framework). Typical clusters:
- **High-volume processors**: Many orders, fast durations, low error rates
- **Careful reviewers**: Fewer orders, longer durations, lowest error rates
- **Struggling operators**: High rework rates, high error rates, long durations
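
A dependency-free sketch of the clustering step on those feature vectors (in production you would reach for scikit-learn's `KMeans`; the operator vectors below are invented):

```python
import random

def kmeans(vectors, k, iters=20, seed=7):
    """Plain k-means: assign each vector to its nearest centroid, then recenter."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])),
            )
            groups[nearest].append(v)
        # Recompute centroids; keep the old one if a cluster emptied out
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups

# One feature vector per operator: (actions_per_day, error_rate) -- invented values
ops = [(120, 0.01), (115, 0.02), (30, 0.005), (28, 0.004), (60, 0.15)]
centroids, groups = kmeans(ops, 3)
```

Standardize features before clustering real data -- `error_rate` and `actions_per_day` live on very different scales, so raw Euclidean distance is dominated by the larger one.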

### 8c. Temporal Burst Detection

Find when events cluster in time -- revealing batch processing, system outages, or coordinated activity.

```sql
-- Detect hourly event bursts using z-scores
SELECT
    hour,
    event_count,
    round(rolling_avg, 1) AS baseline,
    round(z_score, 2) AS z_score,
    if(z_score > 3, 'BURST', if(z_score < -2, 'DROUGHT', 'NORMAL')) AS classification
FROM (
    SELECT
        toStartOfHour(Timestamp) AS hour,
        count() AS event_count,
        avg(event_count) OVER (ORDER BY hour ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING) AS rolling_avg,
        stddevPop(event_count) OVER (ORDER BY hour ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING) AS rolling_std,
        (event_count - rolling_avg) / nullIf(rolling_std, 0) AS z_score
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 30 DAY
        AND SpanAttributes['product.event'] = 'true'
    GROUP BY hour
)
WHERE abs(z_score) > 2
ORDER BY hour
```

**Portal usage:** Process Mining > Variants tab (Drain3 auto-discovered patterns), operator behavior analysis, burst detection alerts
**Difficulty:** Hard (Drain3 requires Pyodide or server-side Python; burst detection is SQL-only)

---

## 9. Conformance Checking

Compare actual operator workflow to the documented SOP. This is a core process mining technique from the van der Aalst school -- the goal is quantifying "how much do we deviate from the plan?"

### The Pattern (Implementation-Agnostic)

Conformance checking has two inputs:
1. **Reference model**: The expected sequence of steps (SOP, happy path, regulatory requirement)
2. **Event log**: What actually happened

And three types of deviations to detect:
- **Skipped steps**: Expected step never occurred
- **Extra steps**: Steps occurred that aren't in the SOP
- **Out-of-order steps**: Steps occurred but in the wrong sequence

The output is a **conformance rate** (what % of cases follow the SOP) and a **deviation breakdown** (which deviations are most common).
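
The deviation labels can be sketched as a plain function over a case's ordered step list (step names here are placeholders):

```python
def classify_deviation(actual, expected):
    """Label a case's observed step list against the expected SOP sequence."""
    has_all = set(expected) <= set(actual)
    if not has_all and len(actual) < len(expected):
        return 'skipped_steps'
    if len(actual) > len(expected):
        return 'extra_steps'
    if has_all and actual != expected:
        return 'out_of_order'
    if has_all and actual == expected:
        return 'conformant'
    return 'mixed_deviation'  # e.g. duplicates plus skips

sop = ['a', 'b', 'c']
assert classify_deviation(['a', 'b', 'c'], sop) == 'conformant'
assert classify_deviation(['a', 'c'], sop) == 'skipped_steps'
assert classify_deviation(['a', 'c', 'b'], sop) == 'out_of_order'
assert classify_deviation(['a', 'b', 'c', 'd'], sop) == 'extra_steps'
```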

### Why it matters

- **Compliance**: Regulated industries require documented proof that SOPs are followed
- **Training**: Identify which operators deviate most and on which steps
- **Process improvement**: If 40% of cases skip step 4, maybe step 4 should be automated or removed
- **Risk**: Out-of-order steps may indicate system workarounds that mask deeper issues

### SQL -- detect all three deviation types

```sql
WITH order_paths AS (
    SELECT
        SpanAttributes['order.id'] AS oid,
        SpanAttributes['operator.id'] AS operator,
        arrayMap(x -> x.2, arraySort(groupArray((Timestamp, SpanName)))) AS actual_steps,
        length(actual_steps) AS actual_count,
        arrayStringConcat(actual_steps, ' -> ') AS actual_path,
        -- Define expected SOP
        7 AS expected_count,
        'email.opened -> order.created -> appraisal.assigned -> inspection.scheduled -> report.submitted -> review.started -> order.closed' AS expected_path,
        -- Check which expected steps are present
        hasAll(actual_steps, [
            'email.opened', 'order.created', 'appraisal.assigned',
            'inspection.scheduled', 'report.submitted', 'review.started', 'order.closed'
        ]) AS has_all_steps
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 30 DAY
        AND SpanAttributes['order.id'] != ''
        AND SpanName IN ('email.opened', 'order.created', 'appraisal.assigned',
                         'inspection.scheduled', 'report.submitted', 'review.started', 'order.closed')
    GROUP BY oid, operator
)
SELECT
    oid,
    operator,
    actual_path,
    multiIf(
        NOT has_all_steps AND actual_count < expected_count, 'skipped_steps',
        actual_count > expected_count, 'extra_steps',
        has_all_steps AND actual_path != expected_path, 'out_of_order',
        has_all_steps AND actual_path = expected_path, 'conformant',
        'mixed_deviation'
    ) AS deviation_type,
    actual_count,
    expected_count
FROM order_paths
ORDER BY
    multiIf(deviation_type = 'skipped_steps', 1, deviation_type = 'out_of_order', 2,
            deviation_type = 'extra_steps', 3, deviation_type = 'mixed_deviation', 4, 5),
    oid
LIMIT 100
```

### SQL -- skipped step analysis (which steps get skipped most?)

```sql
SELECT
    step_name,
    count() AS times_expected,
    countIf(was_present) AS times_present,
    count() - countIf(was_present) AS times_skipped,
    round((count() - countIf(was_present)) * 100.0 / count(), 1) AS skip_rate_pct
FROM (
    SELECT
        SpanAttributes['order.id'] AS oid,
        arrayJoin(['email.opened', 'order.created', 'appraisal.assigned',
                   'inspection.scheduled', 'report.submitted', 'review.started',
                   'order.closed']) AS step_name,
        has(groupArray(SpanName), step_name) AS was_present
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 30 DAY
        AND SpanAttributes['order.id'] != ''
    GROUP BY oid, step_name
)
GROUP BY step_name
ORDER BY skip_rate_pct DESC
```

### SQL -- out-of-order detection using sequenceMatch

```sql
-- Check if report was submitted BEFORE appraisal was assigned (violation)
SELECT
    SpanAttributes['order.id'] AS order_id,
    SpanAttributes['operator.id'] AS operator,
    sequenceMatch('(?1)(?2)')(
        Timestamp,
        SpanName = 'report.submitted',
        SpanName = 'appraisal.assigned'
    ) AS report_before_assignment
FROM otel_traces
WHERE Timestamp >= now() - INTERVAL 30 DAY
    AND SpanAttributes['order.id'] != ''
GROUP BY order_id, operator
HAVING report_before_assignment = 1
```

### SQL -- conformance summary dashboard

```sql
SELECT
    deviation_type,
    count() AS cases,
    round(count() * 100.0 / sum(count()) OVER (), 1) AS pct_of_total,
    -- Break down by operator to find who deviates most
    topK(3)(operator) AS top_deviating_operators
FROM (
    -- ... use the order_paths CTE from the main query above ...
    SELECT oid, operator, deviation_type FROM order_paths
)
GROUP BY deviation_type
ORDER BY cases DESC
```

**Expected output:** "38% of orders deviate from the 7-step SOP. Skipped steps account for 22%, out-of-order for 11%, extra steps for 5%. The `inspection.scheduled` step is skipped 31% of the time."

**Portal usage:** Process Mining > Sequences tab (conformance overlay), compliance dashboards, operator coaching
**Difficulty:** Medium

---

## 10. Predictive Signals

Leading indicators that predict outcomes before they happen. The core insight: early behavior in a process contains signal about the final result.

### The Pattern (Implementation-Agnostic)

For each candidate leading indicator:
1. **Measure the early signal** (e.g., time-to-first-action after email arrival)
2. **Measure the outcome** (e.g., total order cycle time)
3. **Compute correlation** between signal and outcome
4. **Set thresholds** based on historical percentiles
5. **Alert in real-time** when an in-progress case crosses the threshold
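
Steps 3 and 4 can be sketched in a few lines (the signal/outcome values are illustrative, not real measurements):

```python
def pearson(xs, ys):
    """Correlation between a leading signal and the eventual outcome."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def percentile(values, p):
    """Nearest-rank percentile, used to set the alert threshold."""
    s = sorted(values)
    return s[min(int(p * len(s)), len(s) - 1)]

# Illustrative history: minutes-to-first-action vs total cycle hours per order
signal = [5, 12, 45, 60, 240, 300, 480]
outcome = [20, 22, 30, 35, 55, 60, 80]
r = pearson(signal, outcome)         # strong positive correlation
alert_at = percentile(signal, 0.75)  # flag live cases above the historical p75
```

A high `r` justifies promoting the signal to a real-time alert; `alert_at` becomes the threshold an in-progress case must cross before firing.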

### 10a. Time-to-first-action predicts order cycle time

If an email sits untouched for 4+ hours, the resulting order takes 2.3x longer to complete on average. This is the single strongest predictor in most workflow systems.

```sql
-- Compute time-to-first-action and correlate with total cycle time
WITH order_lifecycle AS (
    SELECT
        SpanAttributes['order.id'] AS order_id,
        minOrNullIf(Timestamp, SpanName = 'email.opened') AS email_time,
        minOrNullIf(Timestamp, SpanName IN ('order.created', 'order.lookup')) AS first_action_time,
        maxOrNullIf(Timestamp, SpanName = 'order.closed') AS close_time,
        dateDiff('minute', email_time, first_action_time) AS minutes_to_first_action,
        dateDiff('hour', email_time, close_time) AS total_cycle_hours
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 90 DAY
        AND SpanAttributes['order.id'] != ''
    GROUP BY order_id
    HAVING email_time IS NOT NULL AND first_action_time IS NOT NULL AND close_time IS NOT NULL
)
SELECT
    multiIf(
        minutes_to_first_action <= 15, '0-15 min',
        minutes_to_first_action <= 60, '15-60 min',
        minutes_to_first_action <= 240, '1-4 hr',
        '4+ hr'
    ) AS response_bucket,
    count() AS orders,
    round(avg(total_cycle_hours), 1) AS avg_cycle_hours,
    round(quantile(0.50)(total_cycle_hours), 1) AS median_cycle_hours,
    round(quantile(0.95)(total_cycle_hours), 1) AS p95_cycle_hours
FROM order_lifecycle
GROUP BY response_bucket
ORDER BY
    multiIf(response_bucket = '0-15 min', 1, response_bucket = '15-60 min', 2,
            response_bucket = '1-4 hr', 3, 4)
```

### 10b. Operator error rate in first hour predicts daily error rate

Early-shift errors compound. If an operator's first-hour error rate exceeds 5%, their full-day error rate averages 3.1x the baseline.

```sql
-- First-hour error rate vs full-day error rate by operator by day
WITH daily_metrics AS (
    SELECT
        operator,
        work_date,
        -- First hour metrics
        countIf(hr = shift_start_hour AND is_error) AS first_hour_errors,
        countIf(hr = shift_start_hour) AS first_hour_total,
        -- Full day metrics
        countIf(is_error) AS daily_errors,
        count() AS daily_total
    FROM (
        SELECT
            SpanAttributes['operator.id'] AS operator,
            toDate(Timestamp) AS work_date,
            toHour(Timestamp) AS hr,
            StatusCode = 'ERROR' AS is_error,
            -- First active hour of the operator's day, via a window min
            -- (ClickHouse does not support correlated subqueries)
            min(toHour(Timestamp)) OVER (PARTITION BY operator, work_date) AS shift_start_hour
        FROM otel_traces
        WHERE Timestamp >= now() - INTERVAL 30 DAY
            AND SpanAttributes['operator.id'] != ''
    )
    GROUP BY operator, work_date
    HAVING first_hour_total >= 5 AND daily_total >= 20
)
SELECT
    multiIf(
        first_hour_errors / first_hour_total <= 0.02, 'first_hr_error_0-2%',
        first_hour_errors / first_hour_total <= 0.05, 'first_hr_error_2-5%',
        'first_hr_error_5%+'
    ) AS first_hour_bucket,
    count() AS operator_days,
    round(avg(daily_errors / daily_total) * 100, 2) AS avg_daily_error_rate_pct,
    round(avg(first_hour_errors / first_hour_total) * 100, 2) AS avg_first_hour_error_pct
FROM daily_metrics
GROUP BY first_hour_bucket
ORDER BY first_hour_bucket
```

### 10c. Rework loop count predicts customer escalation

Orders with 2+ rework loops (review -> revision -> review) are 4.7x more likely to trigger a customer escalation than orders with zero rework.

```sql
-- Rework loops vs escalation correlation
SELECT
    rework_bucket,
    count() AS orders,
    countIf(had_escalation) AS escalations,
    round(countIf(had_escalation) * 100.0 / count(), 1) AS escalation_rate_pct
FROM (
    SELECT
        SpanAttributes['order.id'] AS order_id,
        sequenceCount('(?1)(?2)(?1)')(
            Timestamp,
            SpanName = 'review.started',
            SpanName = 'revision.requested'
        ) AS rework_loops,
        max(if(SpanName = 'escalation.created', 1, 0)) AS had_escalation,
        multiIf(
            rework_loops = 0, '0 loops',
            rework_loops = 1, '1 loop',
            '2+ loops'
        ) AS rework_bucket
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 90 DAY
        AND SpanAttributes['order.id'] != ''
    GROUP BY order_id
)
GROUP BY rework_bucket
ORDER BY multiIf(rework_bucket = '0 loops', 1, rework_bucket = '1 loop', 2, 3)
```

### 10d. Predictive duration modeling (at-risk orders)

Flag in-progress orders whose current step is taking longer than the historical p75 for that step + LOB combination.

```sql
WITH baselines AS (
    SELECT
        SpanAttributes['lob'] AS lob,
        SpanAttributes['core.task'] AS task,
        avg(Duration / 1e9) AS avg_seconds,
        stddevPop(Duration / 1e9) AS std_seconds,
        quantile(0.75)(Duration / 1e9) AS p75_seconds
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 90 DAY
        AND Duration > 0
    GROUP BY lob, task
    HAVING count() >= 20
)
SELECT
    t.SpanAttributes['order.id'] AS order_id,
    t.SpanAttributes['lob'] AS lob,
    t.SpanAttributes['core.task'] AS current_task,
    t.SpanAttributes['operator.id'] AS operator,
    round(t.Duration / 1e9, 1) AS elapsed_seconds,
    round(b.avg_seconds, 1) AS expected_avg,
    round(b.p75_seconds, 1) AS p75_threshold,
    round((t.Duration / 1e9 - b.avg_seconds) / nullIf(b.std_seconds, 0), 2) AS z_score,
    multiIf(z_score > 3, 'CRITICAL', z_score > 2, 'AT_RISK', 'ON_TRACK') AS risk_status
FROM otel_traces t
JOIN baselines b
    ON t.SpanAttributes['lob'] = b.lob
    AND t.SpanAttributes['core.task'] = b.task
WHERE t.Timestamp >= now() - INTERVAL 1 DAY
    AND t.SpanAttributes['order.id'] != ''
    AND t.Duration > 0
    AND z_score > 2
ORDER BY z_score DESC
LIMIT 25
```

**Portal usage:** Landing page "at risk orders" widget, Ask AI duration predictions, operator early warning alerts
**Difficulty:** Medium

---

## 11. Cross-System Temporal Joins

The "impossible join" pattern -- connecting events across systems that share no foreign key. Email system events, EXOS actions, order outcomes, phone calls, and operator sessions all live in different event streams. Joining them requires temporal reasoning, not just key matching.

### The Pattern (Implementation-Agnostic)

Three approaches of increasing complexity:

| Approach | When to use | Precision | Complexity |
|----------|------------|-----------|------------|
| **ASOF JOIN** | One-to-one nearest-event matching | High (when events are sparse) | Low |
| **Windowed correlation** | Many-to-many within a time window | Medium (captures batched work) | Medium |
| **Probabilistic linking** | Events don't align cleanly, multiple candidates | Variable (confidence-scored) | High |

### Challenge: Why this is hard

Real-world operators don't process events sequentially:
- **Batched work**: An operator reads 15 emails, then creates 8 orders. Which email triggered which order?
- **Out-of-order processing**: The order was created before the email was formally "opened" in the tracking system
- **Concurrent cases**: An operator works 3 orders simultaneously across 2 monitors
- **Clock skew**: Email system timestamps are UTC, EXOS uses EST, phone system uses epoch milliseconds

### Example 1: Simple ASOF JOIN (email to next action)

For sparse event streams where events align reasonably well one-to-one.

```sql
-- For each email.opened event, find the NEXT action by the same operator
SELECT
    e.SpanAttributes['operator.id'] AS operator,
    e.SpanName AS source_event,
    e.Timestamp AS source_time,
    a.SpanName AS target_event,
    a.Timestamp AS target_time,
    dateDiff('minute', e.Timestamp, a.Timestamp) AS minutes_gap
FROM otel_traces e
ASOF JOIN otel_traces a
    ON e.SpanAttributes['operator.id'] = a.SpanAttributes['operator.id']
    AND a.Timestamp >= e.Timestamp
WHERE e.SpanName = 'email.opened'
    AND a.SpanName IN ('order.created', 'order.lookup', 'status.updated')
    AND e.Timestamp >= now() - INTERVAL 7 DAY
    AND dateDiff('minute', e.Timestamp, a.Timestamp) <= 480  -- 8hr max window
ORDER BY e.Timestamp DESC
LIMIT 100
```

**Limitation:** ASOF JOIN returns exactly one match per source row (the nearest). If 3 orders were created from 1 email, you only see the first.

### Example 2: Windowed correlation (email burst to order burst)

For batched work patterns. Instead of matching individual events, correlate windows of activity.

```sql
-- Correlate email volume and order creation volume in 30-minute windows
WITH email_windows AS (
    SELECT
        SpanAttributes['operator.id'] AS operator,
        toStartOfInterval(Timestamp, INTERVAL 30 MINUTE) AS window_start,
        count() AS email_count,
        groupArray(SpanAttributes['mailbox.id']) AS mailboxes
    FROM otel_traces
    WHERE SpanName = 'email.opened'
        AND Timestamp >= now() - INTERVAL 7 DAY
        AND SpanAttributes['operator.id'] != ''
    GROUP BY operator, window_start
),
action_windows AS (
    SELECT
        SpanAttributes['operator.id'] AS operator,
        toStartOfInterval(Timestamp, INTERVAL 30 MINUTE) AS window_start,
        count() AS action_count,
        groupArray(SpanName) AS actions,
        groupArrayDistinct(SpanAttributes['order.id']) AS order_ids
    FROM otel_traces
    WHERE SpanName IN ('order.created', 'order.lookup', 'status.updated')
        AND Timestamp >= now() - INTERVAL 7 DAY
        AND SpanAttributes['operator.id'] != ''
    GROUP BY operator, window_start
)
SELECT
    e.operator,
    e.window_start AS email_window,
    a.window_start AS action_window,
    e.email_count,
    a.action_count,
    dateDiff('minute', e.window_start, a.window_start) AS window_lag_minutes,
    round(a.action_count / e.email_count, 2) AS action_to_email_ratio,
    a.order_ids
FROM email_windows e
JOIN action_windows a
    ON e.operator = a.operator
    -- Actions happen in the same window or the next window (0-60 min lag)
    AND a.window_start >= e.window_start
    AND a.window_start <= e.window_start + INTERVAL 60 MINUTE
WHERE e.email_count >= 3  -- Only look at batched email processing
ORDER BY e.operator, e.window_start
```

**Interpretation:** If an operator read 12 emails between 10:00-10:30 and created 8 orders between 10:30-11:00, the action_to_email_ratio of 0.67 suggests a 67% conversion rate for that batch. Some emails were informational or duplicates.

### Example 3: Probabilistic multi-system linking (phone + email + order)

The hardest case: three systems, no shared keys, overlapping time windows. A phone call comes in, then an email arrives, then an order is created. Which events are part of the same case?

```sql
-- Three-way temporal join with confidence scoring
WITH
    phone_events AS (
        SELECT
            SpanAttributes['operator.id'] AS operator,
            SpanAttributes['caller.phone'] AS phone_number,
            Timestamp AS call_time,
            Duration / 1e9 AS call_duration_sec,
            SpanAttributes['call.disposition'] AS disposition
        FROM otel_traces
        WHERE SpanName = 'phone.call_completed'
            AND Timestamp >= now() - INTERVAL 7 DAY
    ),
    email_events AS (
        SELECT
            SpanAttributes['operator.id'] AS operator,
            SpanAttributes['mailbox.id'] AS mailbox,
            SpanAttributes['email.subject'] AS subject,
            Timestamp AS email_time
        FROM otel_traces
        WHERE SpanName = 'email.opened'
            AND Timestamp >= now() - INTERVAL 7 DAY
    ),
    order_events AS (
        SELECT
            SpanAttributes['operator.id'] AS operator,
            SpanAttributes['order.id'] AS order_id,
            SpanAttributes['lob'] AS lob,
            Timestamp AS order_time
        FROM otel_traces
        WHERE SpanName = 'order.created'
            AND Timestamp >= now() - INTERVAL 7 DAY
    )
SELECT
    p.operator,
    p.call_time,
    p.phone_number,
    e.email_time,
    e.subject,
    o.order_time,
    o.order_id,
    -- Composite confidence score
    (
        -- Phone-to-email proximity (same operator, email within 30 min of call)
        0.3 * if(dateDiff('minute', p.call_time, e.email_time) BETWEEN 0 AND 30, 1.0,
                 if(dateDiff('minute', p.call_time, e.email_time) BETWEEN 30 AND 120, 0.5, 0.1))
        -- Email-to-order proximity
      + 0.3 * if(dateDiff('minute', e.email_time, o.order_time) BETWEEN 0 AND 60, 1.0,
                 if(dateDiff('minute', e.email_time, o.order_time) BETWEEN 60 AND 240, 0.5, 0.1))
        -- All three same operator
      + 0.2 * if(p.operator = e.operator AND e.operator = o.operator, 1.0, 0.0)
        -- Temporal ordering is correct (phone -> email -> order)
      + 0.2 * if(p.call_time <= e.email_time AND e.email_time <= o.order_time, 1.0, 0.0)
    ) AS link_confidence
FROM phone_events p
ASOF JOIN email_events e
    ON p.operator = e.operator AND e.email_time >= p.call_time
ASOF JOIN order_events o
    ON e.operator = o.operator AND o.order_time >= e.email_time
WHERE link_confidence >= 0.6
ORDER BY p.call_time DESC
LIMIT 50
```

### Handling edge cases

| Edge case | Solution |
|-----------|----------|
| **Clock skew** between systems | Normalize all timestamps to UTC before joining. Allow a +/- 5 minute tolerance in ASOF JOIN conditions. |
| **Batched work** (1 email -> N orders) | Use windowed correlation (Example 2) instead of ASOF JOIN. |
| **Out-of-order events** | Allow bidirectional time windows: `abs(dateDiff(...)) <= threshold` instead of requiring A before B. |
| **No shared operator ID** | Fall back to session-level matching (same `rum.session_id`) or IP-based correlation as a proxy. |
| **High fan-out** (1 phone call -> many follow-ups) | Score all candidates, take top-K by confidence, flag as "multi-match." |
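
The high fan-out row can be sketched as a small scoring helper, assuming candidate links already carry a confidence score (the order IDs below are made up):

```python
def top_k_matches(candidates, k=3, flag_margin=0.1):
    """Score all candidate links, keep the top-K, and flag ambiguous fan-out.

    candidates: (candidate_id, confidence) pairs. The multi-match flag is set
    when the runner-up is within flag_margin of the best score -- i.e. no
    single candidate clearly wins.
    """
    ranked = sorted(candidates, key=lambda c: -c[1])[:k]
    multi = len(ranked) > 1 and ranked[0][1] - ranked[1][1] < flag_margin
    return ranked, multi

matches, ambiguous = top_k_matches([('ORD-1', 0.82), ('ORD-2', 0.79), ('ORD-3', 0.41)])
```

Here `ambiguous` is true because ORD-1 and ORD-2 score within 0.03 of each other; downstream reports should surface both rather than silently picking one.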

**Portal usage:** Sessions page (cross-system timeline), Ask AI causal chain analysis, SLA end-to-end measurement
**Difficulty:** Hard

---

## 12. Organizational Mining (Handover of Work)

Discover the social network of work handoffs between operators. Who passes work to whom? Where do handoffs create delays? This is the organizational mining perspective from process mining, pioneered by van der Aalst's Social Network Miner.

### The Pattern (Implementation-Agnostic)

A **handover of work** occurs when operator A completes a step on a case and operator B performs the next step. Analyzing the frequency, duration, and direction of handoffs reveals:
- **Bottleneck people**: Operators who receive disproportionate handoff volume
- **Team boundaries**: Clusters of operators who frequently pass work to each other
- **Handoff friction**: Transitions between specific operator pairs that take unusually long

### SQL -- build handoff matrix

```sql
SELECT
    prev_operator AS from_operator,
    current_operator AS to_operator,
    count() AS handoff_count,
    round(avg(handoff_minutes), 1) AS avg_handoff_minutes,
    round(quantile(0.95)(handoff_minutes), 1) AS p95_handoff_minutes
FROM (
    SELECT
        SpanAttributes['order.id'] AS order_id,
        SpanAttributes['operator.id'] AS current_operator,
        lagInFrame(SpanAttributes['operator.id']) OVER (
            PARTITION BY SpanAttributes['order.id'] ORDER BY Timestamp
        ) AS prev_operator,
        dateDiff('minute',
            lagInFrame(Timestamp) OVER (
                PARTITION BY SpanAttributes['order.id'] ORDER BY Timestamp
            ),
            Timestamp
        ) AS handoff_minutes
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 30 DAY
        AND SpanAttributes['order.id'] != ''
        AND SpanAttributes['operator.id'] != ''
)
WHERE prev_operator != ''
    AND prev_operator != current_operator  -- Exclude self-handoffs
GROUP BY from_operator, to_operator
HAVING handoff_count >= 5
ORDER BY handoff_count DESC
LIMIT 50
```

### SQL -- identify bottleneck receivers

```sql
-- Who receives the most handoffs? These are potential bottleneck people.
SELECT
    to_operator,
    count() AS inbound_handoffs,
    uniqExact(from_operator) AS unique_senders,
    round(avg(handoff_minutes), 1) AS avg_wait_minutes,
    round(quantile(0.95)(handoff_minutes), 1) AS p95_wait_minutes
FROM (
    -- ... same inner query as above ...
)
WHERE prev_operator != current_operator
GROUP BY to_operator
ORDER BY inbound_handoffs DESC
LIMIT 20
```

### Visualization

The handoff matrix naturally maps to a directed graph (network diagram) or a chord diagram. Each operator is a node, each handoff pattern is a weighted edge. Use d3-force or d3-chord in the portal.
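
A small sketch of the transformation from handoff rows to the `{nodes, links}` shape d3-force consumes (operator names are invented):

```python
def handoff_graph(rows):
    """Convert handoff-matrix rows into a node/link graph for d3-force.

    rows: (from_operator, to_operator, handoff_count) tuples, as returned by
    the handoff matrix query above.
    """
    operators = sorted({op for r in rows for op in r[:2]})
    return {
        'nodes': [{'id': op} for op in operators],
        'links': [{'source': f, 'target': t, 'value': n} for f, t, n in rows],
    }

graph = handoff_graph([('alice', 'bob', 42), ('bob', 'carol', 17)])
```

Edge `value` maps naturally to stroke width; node degree (inbound link count) highlights the bottleneck receivers.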

**Portal usage:** Process Mining > Organizational view, team structure analysis, workload rebalancing
**Difficulty:** Medium

---

## 13. Rework Detection and Loop Analysis

Identify processes where operators repeat steps -- indicating confusion, errors, or system issues forcing manual retries. Rework is one of the highest-ROI targets for automation because it represents pure waste.

### The Pattern (Implementation-Agnostic)

Three levels of rework analysis:
1. **Step repetition**: Same step occurs multiple times on the same case
2. **Loop detection**: A->B->A pattern (ping-pong between steps)
3. **Root cause correlation**: Cross-reference rework with operator, time-of-day, and system errors
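
The A->B->A loop count can be sketched in plain Python, using the same non-overlapping matching that `sequenceCount` applies (step names mirror the review/revision example):

```python
def count_loops(steps, a, b):
    """Count non-overlapping a -> b -> a loops in an ordered step list,
    mirroring sequenceCount('(?1)(?2)(?1)') semantics (restart after a match)."""
    loops, state = 0, 0  # 0 = want opening a, 1 = want b, 2 = want closing a
    for s in steps:
        if state == 0 and s == a:
            state = 1
        elif state == 1 and s == b:
            state = 2
        elif state == 2 and s == a:
            loops += 1
            state = 0  # non-overlapping: the closing step starts nothing new
    return loops

case = ['review.started', 'revision.requested', 'review.started',
        'revision.requested', 'review.started']
```

Note that `count_loops(case, 'review.started', 'revision.requested')` is 1, not 2: once a loop closes, its final step cannot also open the next one, matching ClickHouse's non-overlapping chain search.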

### SQL -- count rework per step per order

```sql
SELECT
    SpanAttributes['order.id'] AS order_id,
    SpanName AS step,
    count() AS occurrences,
    min(Timestamp) AS first_occurrence,
    max(Timestamp) AS last_occurrence,
    dateDiff('minute', min(Timestamp), max(Timestamp)) AS rework_span_minutes
FROM otel_traces
WHERE Timestamp >= now() - INTERVAL 30 DAY
    AND SpanAttributes['order.id'] != ''
GROUP BY order_id, step
HAVING occurrences > 1
ORDER BY occurrences DESC
LIMIT 50
```

### SQL -- rework heatmap (operator x step)

```sql
SELECT
    operator,
    step,
    sum(total_executions) AS executions,
    count() AS unique_orders,
    round(sum(total_executions) / count(), 2) AS avg_executions_per_order
FROM (
    SELECT
        SpanAttributes['operator.id'] AS operator,
        SpanAttributes['order.id'] AS order_id,
        SpanName AS step,
        count() AS total_executions
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 30 DAY
        AND SpanAttributes['operator.id'] != ''
        AND SpanAttributes['order.id'] != ''
    GROUP BY operator, order_id, step
)
GROUP BY operator, step
HAVING avg_executions_per_order > 1.2
ORDER BY avg_executions_per_order DESC
LIMIT 30
```

### SQL -- formal loop detection with sequenceCount

```sql
SELECT
    SpanAttributes['order.id'] AS order_id,
    sequenceCount('(?1)(?2)(?1)')(
        Timestamp,
        SpanName = 'review.started',
        SpanName = 'revision.requested'
    ) AS review_revision_loops,
    sequenceCount('(?1)(?2)(?1)')(
        Timestamp,
        SpanName = 'report.submitted',
        SpanName = 'report.rejected'
    ) AS submit_reject_loops
FROM otel_traces
WHERE Timestamp >= now() - INTERVAL 30 DAY
    AND SpanAttributes['order.id'] != ''
GROUP BY order_id
HAVING review_revision_loops > 0 OR submit_reject_loops > 0
ORDER BY review_revision_loops + submit_reject_loops DESC
LIMIT 25
```

Cross-reference rework hotspots with session replay (via `rum.session_id`) to see **why** operators rework -- system error, unclear UI, missing data, or process confusion.

**Portal usage:** Process Mining > Bottlenecks tab (rework overlay), operator coaching reports, automation candidate scoring
**Difficulty:** Medium

---

## 14. Prompt Engineering for Operational AI

### Context-Aware Analysis Prompt

When Ask AI analyzes data, inject full operational context. This is the master prompt template -- customize per page.

```
You are analyzing operational workflow data for ServiceLink, a financial services
company that processes mortgage-related orders across three lines of business
(valuations, title, closing).

Key context:
- 22 operators across 6 roles process orders through a lifecycle:
  email intake -> order creation -> assignment -> inspection -> report -> review -> closing
- Operators often batch work, act out of order, and handle multiple cases concurrently
- The goal is identifying automation opportunities and process inefficiencies
- Duration values are in nanoseconds. Divide by 1e9 for seconds.
  Values > 86400 seconds represent calendar days between lifecycle stages, not active work time.
- "automation.candidate" = true means the task is repetitive and could be automated
- "automation.est_minutes_saved" represents estimated time saved per occurrence if automated

Given the following data:
{DATA}

Provide:
1. A 2-3 sentence executive summary of the key finding
2. One specific automation recommendation with estimated weekly hours saved
3. One process risk or inefficiency that warrants investigation

Be specific with numbers. No hedging. No "it appears that."
```
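In practice the `{DATA}` placeholder is filled programmatically from query results before the prompt is sent. A minimal sketch of that injection step, assuming rows arrive as tuples; `build_prompt` and the truncated `MASTER_PROMPT` constant are illustrative stand-ins, not the portal's actual API:

```python
MASTER_PROMPT = """You are analyzing operational workflow data for ServiceLink.

Given the following data:
{data}

Provide a 2-3 sentence executive summary of the key finding."""
# Stand-in: in practice use the full template above.

def build_prompt(columns, rows):
    """Render query results as a pipe-delimited table and inject into the prompt."""
    header = " | ".join(columns)
    body = "\n".join(" | ".join(str(v) for v in row) for row in rows)
    return MASTER_PROMPT.format(data=f"{header}\n{body}")
```

Rendering the data as a delimited table keeps column alignment obvious to the model without wasting tokens on JSON punctuation.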

**Portal usage:** Ask AI system prompt for all operational analysis queries
**Difficulty:** Easy (prompt engineering, no SQL)

---

### Anomaly Drill-Down Prompt

When an anomaly is detected, auto-generate the dimensional breakdown and feed it to the AI with this prompt.

```
You are performing root cause analysis on an anomaly in operational metrics.

## Anomaly
- Metric: {metric_name}
- Normal range: {baseline_avg} +/- {baseline_std}
- Anomaly value: {anomaly_value} at {anomaly_timestamp}
- Z-score: {z_score}

## Dimensional breakdown (sorted by contribution)
{dimensional_breakdown_table}
(Columns: dimension, value, baseline_rate, anomaly_rate, delta, excess_count)

## Recent changes
{recent_deployments_or_config_changes}

## Task
1. Identify the single most likely root cause in one sentence
2. Cite the specific dimension values that explain the spike
3. Classify as: deployment issue / upstream dependency / operator error / data quality / seasonal pattern
4. Recommend one immediate action and one preventive measure
```
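The baseline placeholders come from simple summary statistics over a trailing window. A minimal sketch of filling them, where `history` is a hypothetical list of prior metric values (production would compute these in SQL with window functions, per the rolling z-score technique):

```python
from statistics import mean, pstdev

def zscore_fields(history, anomaly_value):
    """Fill the {baseline_avg}, {baseline_std}, {z_score} placeholders
    from a trailing window of prior metric values."""
    mu, sigma = mean(history), pstdev(history)
    z = (anomaly_value - mu) / sigma if sigma else float("inf")
    return {"baseline_avg": round(mu, 2),
            "baseline_std": round(sigma, 2),
            "z_score": round(z, 2)}
```

The `sigma == 0` guard matters: a perfectly flat baseline makes any deviation infinitely anomalous, which should be surfaced rather than crash the pipeline.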

**Portal usage:** Ask AI auto-RCA (triggered by z-score > 3), alerting system narrative generation
**Difficulty:** Easy (prompt engineering, no SQL)

---

### Automation Business Case Prompt

Generate executive-ready automation recommendations from step frequency data.

```
You are building a business case for process automation at ServiceLink.

## Step frequency and duration data
{steps_table}
(Columns: step_name, daily_volume, avg_duration_minutes, p95_duration_minutes,
 rework_rate_pct, operator_count)

## Cost assumptions
- Fully-loaded operator cost: $45/hour
- Automation development cost: $15,000 per workflow (one-time)
- Maintenance cost: 10% of development cost per year

## Task
For the top 3 automation candidates:
1. Calculate annual cost savings: daily_volume * avg_duration_minutes * 260 work days * $0.75/min
2. Calculate payback period: development_cost / monthly_savings
3. Recommend implementation approach: rule-based workflow, API integration, ML classification, or human-in-the-loop
4. Flag any risks (error rate sensitivity, edge cases, compliance requirements)

Format as a table:
| Rank | Step | Annual Savings | Payback (months) | Approach | Risk |

Follow with a 2-sentence executive summary.
```
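The savings arithmetic in step 1 is deterministic, so it is worth precomputing before handing the table to the model: the AI then narrates verified numbers instead of doing arithmetic. A sketch under the cost assumptions above ($45/hour, $15,000 development, 260 work days); `business_case` is an illustrative helper:

```python
RATE_PER_MIN = 45 / 60   # $45/hour fully-loaded operator cost
WORK_DAYS = 260
DEV_COST = 15_000        # one-time, per workflow

def business_case(daily_volume, avg_duration_minutes):
    """Annual cost savings and payback period for one automation candidate."""
    annual_savings = daily_volume * avg_duration_minutes * WORK_DAYS * RATE_PER_MIN
    payback_months = DEV_COST / (annual_savings / 12)
    return round(annual_savings), round(payback_months, 1)
```

For example, a step run 40 times a day at 5 minutes each saves $39,000/year and pays back the development cost in about 4.6 months, before the 10%/year maintenance cost.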

**Portal usage:** Ask AI automation scoring, executive report generation, quarterly review deck data
**Difficulty:** Easy (prompt engineering, no SQL)

---

## 15. Techniques Quick Reference

A single table mapping every technique in this playbook to the question it answers, the key SQL function or approach, implementation difficulty, and where it surfaces in the portal.

| # | Technique | Question it answers | Key SQL / Approach | Difficulty | Portal Page |
|---|-----------|--------------------|--------------------|------------|-------------|
| 1a | windowFunnel | How far do cases progress through a sequence? | `windowFunnel()` | Easy | Funnels |
| 1b | sequenceMatch | Does this case follow the expected pattern? | `sequenceMatch()` | Medium | Process Mining > Sequences |
| 1c | sequenceCount | How many times does a pattern repeat? | `sequenceCount()` | Medium | Process Mining > Bottlenecks |
| 1d | lagInFrame bottleneck | Which step transitions are slowest? | `lagInFrame()` window function | Medium | Process Mining > Bottlenecks |
| 1e | ASOF JOIN | What happened nearest in time to this event? | `ASOF JOIN` | Hard | Sessions |
| 1f | retention | Do users come back after N days? | `retention()` or manual cohort | Easy | Retention |
| 1g | Variant paths | What are the most common process paths? | `groupArray() + arrayStringConcat()` | Easy | Process Mining > Sequences |
| 1h | Rolling z-score | Is this metric value anomalous? | Window functions + `stddevPop()` | Hard | Error tracking, Ask AI |
| 6 | Automated Dimensional RCA | What dimension explains this metric spike? | CTE baseline/anomaly + UNION ALL sweep | Medium | Ask AI RCA, Bottlenecks |
| 7 | Probabilistic Attribution | Which upstream event triggered this action? | `ASOF JOIN` + composite confidence scoring | Hard | Ask AI attribution, Sessions |
| 8a | Drain3 Template Mining | What undocumented workflows exist? | Drain3 (Pyodide/Python) | Hard | Process Mining > Variants |
| 8b | K-means Behavior Clustering | How do operators group by behavior? | SQL feature vectors + K-means | Hard | Operator analysis |
| 8c | Temporal Burst Detection | When do events cluster in time? | Window function z-scores on counts | Medium | Alerts, Ask AI |
| 9 | Conformance Checking | How often do we deviate from the SOP? | `sequenceMatch()` + path comparison | Medium | Process Mining > Sequences |
| 10a | Time-to-first-action | Does response speed predict cycle time? | Window functions + bucketed aggregation | Medium | Landing page, Ask AI |
| 10b | First-hour error rate | Do early errors predict daily quality? | Time-bucketed error rates | Medium | Operator coaching |
| 10c | Rework loop predictor | Do rework loops predict escalations? | `sequenceCount()` + outcome correlation | Medium | Process Mining, Ask AI |
| 10d | Predictive duration | Is this order at risk of missing SLA? | z-score vs historical baseline | Medium | Landing page "at risk" widget |
| 11 | Cross-System Temporal Joins | How do events connect across systems? | `ASOF JOIN`, windowed correlation, probabilistic linking | Hard | Sessions, Ask AI |
| 12 | Organizational Mining | Who passes work to whom? | `lagInFrame()` + operator transition matrix | Medium | Process Mining > Org view |
| 13 | Rework Detection | Where do operators repeat work? | `sequenceCount()`, step repetition counts | Medium | Bottlenecks, coaching |

---

## References and Further Reading

- **Meta DrP (Fast Dimensional Analysis)**: [engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale](https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/) -- 50K analyses/day, 20-80% MTTR reduction
- **FDA Paper (Apriori/FP-Growth for RCA)**: [arxiv.org/abs/1911.01225](https://arxiv.org/abs/1911.01225) -- The original Facebook paper on automated dimensional analysis
- **Drain3 (Log Template Mining)**: [github.com/logpai/Drain3](https://github.com/logpai/Drain3) -- Streaming log template miner, fixed-depth parse tree
- **van der Aalst (Process Mining)**: [processmining.org](https://processmining.org/) -- Social Network Miner, conformance checking, organizational mining
- **Conformance Checking**: [en.wikipedia.org/wiki/Conformance_checking](https://en.wikipedia.org/wiki/Conformance_checking) -- Model-based vs rule-based approaches
