# Building PostHog/Amplitude-Class Product Analytics on ClickHouse

**Last updated**: 2026-03-22
**Author**: Joe Lanzone
**Context**: EXOS Analytics implements these patterns. This document explains the architecture for anyone extending or auditing the system.

---

## Thesis

Commercial product analytics tools (PostHog, Amplitude, Mixpanel, Heap) are wrappers around columnar databases with pre-built UIs. EXOS Analytics proves you can build the same capabilities on raw ClickHouse with:

- 10-50x lower cost (no per-seat, no per-event pricing)
- Full data ownership (your ClickHouse Cloud instance, your SQL)
- Process mining capabilities that no commercial analytics tool offers
- Session replay integrated with trace data (click any metric, watch the session)

The tradeoff: you build the UI yourself. EXOS Analytics has done this with vanilla HTML + Tailwind + Chart.js -- no framework, no build pipeline, no npm dependency hell.

---

## 1. Event Tracking Schema Design

### The Single-Table Pattern

Commercial analytics tools use a denormalized single-table event model. EXOS Analytics achieves the same via OpenTelemetry's `otel_traces` table:

```
Commercial tool schema:          OTel equivalent in EXOS:
─────────────────────             ──────────────────────
event_id                    →     SpanId
timestamp                   →     Timestamp (DateTime64(9))
user_id                     →     SpanAttributes['user.id'] or SpanAttributes['rum.session_id']
event_name                  →     SpanName
event_properties (JSON)     →     SpanAttributes (Map(String, String))
user_properties (JSON)      →     ResourceAttributes (Map(String, String))
session_id                  →     SpanAttributes['rum.session_id']
device_type                 →     ResourceAttributes['visitor.device_type']
country                     →     ResourceAttributes['visitor.country']
```

The key insight: OTel's `Map(String, String)` columns are ClickHouse's equivalent of JSON event properties, but with better compression because ClickHouse stores maps as parallel arrays of keys and values.

### Why Not a Separate Events Table?

Some architectures create a dedicated `product_events` table. EXOS avoids this because:

1. **Materialized Views are free**: The `user_actions` view already filters `otel_traces` to product events. No storage cost, no sync lag.
2. **Cross-correlation**: Keeping product events in `otel_traces` lets you join frontend events with backend spans in a single query (same TraceId).
3. **Less operational surface**: One table to back up, one TTL policy, one schema to evolve.

If you need a physical table for performance (queries scanning 100M+ rows), create a Materialized View that writes to a dedicated MergeTree:

```sql
CREATE MATERIALIZED VIEW product_events_mv
ENGINE = MergeTree()
ORDER BY (SpanName, toDate(Timestamp), SpanAttributes['user.id'])
AS
SELECT *
FROM otel_traces
WHERE SpanAttributes['product.event'] = 'true';
```

---

## 2. Funnel Analysis with windowFunnel()

ClickHouse's `windowFunnel(window_seconds)` aggregate function is purpose-built for conversion funnel analysis. It counts how far each user progresses through an ordered sequence of events within a time window.

### How It Works

```sql
SELECT level, count() AS users
FROM (
  SELECT
    user_id,
    windowFunnel(3600)(           -- 3600-second (1-hour) conversion window
      Timestamp,                   -- event timestamp
      SpanName = 'portal.login',   -- step 1: must happen first
      SpanName = 'order.created',  -- step 2: must follow step 1 within window
      SpanName = 'photo.uploaded'  -- step 3: must follow step 2 within window
    ) AS level
  FROM otel_traces
  WHERE Timestamp >= now() - INTERVAL 30 DAY
  GROUP BY user_id
)
GROUP BY level
ORDER BY level
```

**Output**: `level=0` (never started), `level=1` (completed step 1 only), `level=2` (completed steps 1+2), `level=3` (completed all three).
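The per-user semantics can be sketched outside ClickHouse. The following is a simplified Python model of `windowFunnel`'s default mode (it ignores the `strict_*` variants): scan the user's events in timestamp order, start a chain at each step-1 event, and advance through the step list as long as the chain stays within the window of its first event.

```python
def window_funnel(window, events, steps):
    """Simplified model of ClickHouse windowFunnel for a single user.
    events: list of (timestamp, name) tuples, any order.
    steps:  ordered list of step names. Returns the max level reached."""
    best = 0
    evs = sorted(events)                       # ascending by timestamp
    for i, (t0, name) in enumerate(evs):
        if name != steps[0]:
            continue                           # every chain starts at step 1
        level = 1
        for t, later_name in evs[i + 1:]:
            if t - t0 > window:                # chain must fit in the window
                break
            if level < len(steps) and later_name == steps[level]:
                level += 1
        best = max(best, level)
    return best
```

Running this over the three-step funnel above: a user who logs in, orders, and uploads within the window reaches level 3; a user who only ever creates an order (no prior login) stays at level 0, matching the SQL output.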

### EXOS Implementation

The funnel builder UI (`funnels.html`) provides:
- Dropdown-based step selection (2-5 steps from available SpanNames)
- Configurable time window
- Horizontal bar chart with step-by-step conversion rates
- Step detail table showing: % of first step, % of previous step, drop-off count

**API**: `GET /api/funnels?steps=portal.login,order.created,photo.uploaded&window=3600&hours=720`

### Advanced: Breakdowns and Comparisons

To break down a funnel by a dimension (device type, LOB, etc.), wrap the query with an additional GROUP BY:

```sql
SELECT
  device_type AS device,
  level,
  count() AS users
FROM (
  SELECT
    SpanAttributes['rum.session_id'] AS user_id,
    any(ResourceAttributes['visitor.device_type']) AS device_type,
    windowFunnel(3600)(
      Timestamp,
      SpanName = 'portal.login',
      SpanName = 'order.created',
      SpanName = 'photo.uploaded'
    ) AS level
  FROM otel_traces
  WHERE Timestamp >= now() - INTERVAL 30 DAY
  GROUP BY user_id
)
GROUP BY device, level
ORDER BY device, level
```

This answers "do mobile users convert at the same rate as desktop users?" -- the kind of breakdown Amplitude charges $30K+/yr to unlock.

---

## 3. Cohort and Retention Analysis

### retention() Function

ClickHouse has a built-in `retention()` aggregate function, but EXOS uses a more flexible manual approach with `dateDiff()` that supports arbitrary retention windows. Note the semantics: a user counts as `dN_active` if their last event falls N or more days after their first (unbounded retention), not only if they returned on exactly day N:

```sql
SELECT
  toDate(first_seen) AS cohort_day,
  count() AS cohort_size,
  countIf(d1_active) AS d1,
  countIf(d7_active) AS d7,
  countIf(d14_active) AS d14,
  countIf(d30_active) AS d30,
  round(countIf(d1_active) * 100.0 / count(), 1) AS d1_pct,
  round(countIf(d7_active) * 100.0 / count(), 1) AS d7_pct,
  round(countIf(d14_active) * 100.0 / count(), 1) AS d14_pct,
  round(countIf(d30_active) * 100.0 / count(), 1) AS d30_pct
FROM (
  SELECT
    SpanAttributes['rum.session_id'] AS uid,
    min(Timestamp) AS first_seen,
    max(Timestamp) AS last_seen,
    dateDiff('day', min(Timestamp), max(Timestamp)) >= 1 AS d1_active,
    dateDiff('day', min(Timestamp), max(Timestamp)) >= 7 AS d7_active,
    dateDiff('day', min(Timestamp), max(Timestamp)) >= 14 AS d14_active,
    dateDiff('day', min(Timestamp), max(Timestamp)) >= 30 AS d30_active
  FROM otel_traces
  WHERE Timestamp >= now() - INTERVAL 90 DAY
    AND SpanAttributes['rum.session_id'] != ''
  GROUP BY uid
)
GROUP BY cohort_day
ORDER BY cohort_day
```
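The per-cohort arithmetic the outer query performs can be sketched in a few lines of Python -- same lifespan-based definition as the SQL (`last_seen - first_seen >= N` days), percentages rounded to one decimal:

```python
def retention_row(first_seen_last_seen, horizons=(1, 7, 14, 30)):
    """first_seen_last_seen: list of (first_seen, last_seen) datetimes,
    one tuple per user in a single cohort. Returns (cohort_size,
    {N: pct of users still active N+ days after first seen})."""
    size = len(first_seen_last_seen)
    pct = {}
    for n in horizons:
        active = sum(1 for first, last in first_seen_last_seen
                     if (last - first).days >= n)   # lifespan-based, like the SQL
        pct[n] = round(active * 100.0 / size, 1)
    return size, pct
```

A cohort of three users whose lifespans are 0, 1, and 9 days yields d1 = 66.7% and d7 = 33.3%, mirroring the `dN_pct` columns.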

### EXOS Implementation

The retention page (`retention.html`) renders:
- Color-coded retention matrix (green = high retention, red = low)
- Hover tooltips with exact counts and percentages
- Daily/weekly toggle
- Configurable first-action and return-action
- Summary cards: Avg D0, D1, D7, D30 retention

**API**: `GET /api/retention?first_action=portal.login&return_action=order.created&mode=daily&hours=720`

### Weekly Cohort Alternative

For B2B products where daily retention is noisy, use weekly cohorts:

```sql
SELECT
  toMonday(first_seen) AS cohort_week,
  count() AS cohort_size,
  countIf(dateDiff('week', first_seen, last_seen) >= 1) AS w1,
  countIf(dateDiff('week', first_seen, last_seen) >= 2) AS w2,
  countIf(dateDiff('week', first_seen, last_seen) >= 4) AS w4
FROM (
  SELECT
    coalesce(nullIf(SpanAttributes['user.id'], ''), SpanAttributes['rum.session_id']) AS uid,
    min(Timestamp) AS first_seen,
    max(Timestamp) AS last_seen
  FROM otel_traces
  WHERE Timestamp >= now() - INTERVAL 90 DAY
  GROUP BY uid
)
GROUP BY cohort_week
ORDER BY cohort_week
```

---

## 4. Materialized Views for Pre-Aggregation

For dashboards that need sub-second response times on billions of rows, pre-aggregate with Materialized Views. These are ClickHouse's killer feature -- they compute incrementally on INSERT, not on query.

### Daily Event Counts (for sparklines and trend charts)

```sql
CREATE MATERIALIZED VIEW daily_event_counts_mv
ENGINE = SummingMergeTree()
ORDER BY (event_date, service, event_name)
AS
SELECT
  toDate(Timestamp) AS event_date,
  ServiceName AS service,
  SpanName AS event_name,
  count() AS event_count,
  countIf(StatusCode = 'ERROR') AS error_count,
  sum(Duration / 1e6) AS total_duration_ms,  -- a plain avg() would be summed
                                             -- incorrectly on merge; divide by
                                             -- event_count at read time instead
  quantileState(0.95)(Duration / 1e6) AS p95_duration_state
FROM otel_traces
GROUP BY event_date, service, event_name;

-- At read time: sum(total_duration_ms) / sum(event_count) AS avg_duration_ms,
-- and quantileMerge(0.95)(p95_duration_state) AS p95_duration_ms.
```

### Hourly Session Metrics (for real-time dashboards)

```sql
CREATE MATERIALIZED VIEW hourly_sessions_mv
ENGINE = SummingMergeTree()
ORDER BY (hour, service)
AS
SELECT
  toStartOfHour(Timestamp) AS hour,
  ServiceName AS service,
  uniqState(SpanAttributes['rum.session_id']) AS unique_sessions_state,
  count() AS total_events,
  countIf(StatusCode = 'ERROR') AS error_count
FROM otel_traces
GROUP BY hour, service;

-- Query the MV (merges states at read time):
SELECT
  hour,
  service,
  uniqMerge(unique_sessions_state) AS unique_sessions,
  sum(total_events) AS events,
  sum(error_count) AS errors,
  round(sum(error_count) * 100.0 / sum(total_events), 2) AS error_rate_pct
FROM hourly_sessions_mv
WHERE hour >= now() - INTERVAL 7 DAY
GROUP BY hour, service
ORDER BY hour;
```

### Cost of MVs

Materialized Views consume additional storage proportional to the aggregation factor. For EXOS at typical scale:

| MV | Storage Overhead | Query Speedup |
|----|-----------------|---------------|
| Daily event counts | ~0.1% of base table | 100-1000x for daily trend queries |
| Hourly session metrics | ~0.5% of base table | 50-500x for dashboard queries |
| Product events (filtered copy) | ~10-20% of base table | 5-10x (smaller scan) |

---

## 5. GUI Query Builder Architecture

End users should not need to write SQL. EXOS Analytics provides two paths to insights without SQL:

### Path A: Pre-Built Analysis Pages

Each analysis type has a dedicated UI with dropdowns, toggles, and date pickers:

| Page | No-SQL Controls | Backend Translation |
|------|----------------|---------------------|
| Funnels | Step dropdowns, time window selector | `windowFunnel()` query |
| Retention | First/return action dropdowns, daily/weekly toggle | `dateDiff()` cohort query |
| Paths | Depth slider, time range | `groupArray()` + `lagInFrame()` |
| Process Mining | Tab selector (Sequences, Bottlenecks, By LOB) | Multiple specialized queries |
| Errors | Service filter, time range | Grouped error aggregation |
| Events | Service, status, time range filters | Paginated trace scan |

### Path B: Natural Language Query (Ask AI)

The Query Studio (`query.html`) accepts natural language questions and generates ClickHouse SQL via Azure AI Foundry:

```
User: "How many orders were created this week by LOB?"

→ AI generates:
SELECT
  SpanAttributes['lob'] AS lob,
  count() AS order_count
FROM otel_traces
WHERE SpanName = 'order.created'
  AND Timestamp >= now() - INTERVAL 7 DAY
  AND SpanAttributes['lob'] != ''
GROUP BY lob
ORDER BY order_count DESC

→ Executed against ClickHouse, results rendered as chart
```

The AI assistant has full schema context injected into its system prompt, including table structures, column types, available SpanAttributes keys, and example queries from the Operational Intelligence Playbook.

### Path C: SQL Power Users

The Query Studio also provides a full SQL editor with:
- Schema browser with type-colored badges
- SQL autocomplete
- Line numbers
- Chart builder (bar/line/pie/area) with axis selectors
- CSV/JSON export
- Query history and saved queries
- Shareable query URLs (SQL encoded in URL parameters)

---

## 6. Session Replay Integration

The differentiator: click any metric, watch the session. This is the bridge between quantitative analytics ("conversion dropped 12%") and qualitative understanding ("users are confused by the submit button placement").

### Architecture

```
Metric Dashboard                   Session Replay
┌───────────────────┐              ┌───────────────────────┐
│ Funnel shows 40%  │  click on    │ rrweb DOM replay at   │
│ drop-off at step  │ ───────────> │ exact moment of drop  │
│ "order.created"   │  deep link   │ off, with console     │
│                   │  with        │ logs, network reqs,   │
│                   │  session_id  │ and error overlay     │
└───────────────────┘              └───────────────────────┘
```

### How It Works

1. Every metric query includes `SpanAttributes['rum.session_id']` in its GROUP BY or SELECT
2. Clicking a data point in a chart extracts the session ID(s) associated with that data point
3. A deep link to `replay.html#session=<id>` opens the session replay at the relevant timestamp
4. The replay player loads rrweb events from `hyperdx_sessions` and renders the DOM reconstruction
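Step 3's deep link is simple string assembly. A sketch, assuming the `#session=<id>` fragment format described above; the optional `t` offset parameter for seeking to a timestamp is an assumed convention, not a documented one:

```python
from urllib.parse import quote

def replay_deep_link(session_id, ts_ms=None, base='replay.html'):
    """Build the deep link opened when a user clicks a chart data point.
    session_id goes in the URL fragment; ts_ms (if given) is an assumed
    seek offset in milliseconds."""
    link = f"{base}#session={quote(session_id, safe='')}"
    if ts_ms is not None:
        link += f"&t={ts_ms}"
    return link
```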

### rrweb Event Types Used

| Type | Name | Purpose |
|------|------|---------|
| 2 | FullSnapshot | Complete DOM tree snapshot |
| 3 | IncrementalSnapshot | Mutations, mouse moves, clicks, scrolls, input |
| 4 | Meta | Page navigation (URL change) |
| 5 | Custom | HyperDX-injected events: console logs, network requests, dead/rage clicks |

### Activity Heatbar

The session replay page features a YouTube-style activity heatbar (`assets/js/activity-heatbar.js`) that visualizes engagement intensity over time using weighted event buckets and Catmull-Rom spline interpolation. Event weights:

| Signal | Weight | Rationale |
|--------|--------|-----------|
| Rage click | 5.0 | Strongest frustration signal |
| Error | 3.0 | Something went wrong |
| Navigation | 2.0 | Intentional page change |
| Click | 1.0 | Normal interaction |
| Input | 0.8 | Form interaction |
| Scroll | 0.3 | Passive consumption |
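The bucketing stage can be sketched as follows: assign each event's weight to a fixed-width time bucket, then normalize to [0, 1] for rendering. The Catmull-Rom spline smoothing that `activity-heatbar.js` applies afterward is omitted from this sketch:

```python
# Weights from the table above.
WEIGHTS = {'rage_click': 5.0, 'error': 3.0, 'navigation': 2.0,
           'click': 1.0, 'input': 0.8, 'scroll': 0.3}

def heatbar_buckets(events, duration_ms, n_buckets=60):
    """events: list of (timestamp_ms, signal_name) within one session.
    Returns n_buckets intensities normalized so the busiest bucket is 1.0."""
    buckets = [0.0] * n_buckets
    for ts, signal in events:
        idx = min(int(ts * n_buckets / duration_ms), n_buckets - 1)
        buckets[idx] += WEIGHTS.get(signal, 0.0)   # unknown signals weigh 0
    peak = max(buckets)
    return [b / peak for b in buckets] if peak > 0 else buckets
```

A rage click early in the session dominates a scroll near the end, so the heatbar peaks where frustration occurred -- which is exactly what an investigating engineer wants to jump to.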

---

## 7. Anomaly Detection

EXOS uses rolling z-score anomaly detection -- the same statistical approach used by Datadog Watchdog and Meta's DrP platform, implemented in pure ClickHouse SQL.

### Rolling Z-Score

```sql
SELECT
  hour,
  error_rate,
  round(rolling_avg, 4) AS baseline,
  round(z_score, 2) AS z_score,
  abs(z_score) > 3 AS is_anomaly
FROM (
  SELECT
    hour,
    error_rate,
    avg(error_rate) OVER w AS rolling_avg,
    stddevPop(error_rate) OVER w AS rolling_std,
    (error_rate - rolling_avg) / nullIf(rolling_std, 0) AS z_score
  FROM (
    SELECT
      toStartOfHour(Timestamp) AS hour,
      countIf(StatusCode = 'ERROR') / count() AS error_rate
    FROM otel_traces
    WHERE Timestamp >= now() - INTERVAL 30 DAY
    GROUP BY hour
  )
  WINDOW w AS (ORDER BY hour ROWS BETWEEN 168 PRECEDING AND 1 PRECEDING)
)
ORDER BY hour
```

The 168-row window (7 days of hourly data) provides a weekly-seasonal baseline. A z-score > 3 (or < -3) flags the hour as anomalous. This catches:

- Error rate spikes (deployment failures, upstream outages)
- Traffic drops (infrastructure issues, DNS problems)
- Latency increases (database contention, network degradation)
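The per-point logic of the SQL above can be sketched in Python: compare each hourly value against the mean and standard deviation of the preceding window (never including the point itself), and flag |z| > 3:

```python
from statistics import mean, pstdev

def flag_anomalies(series, window=168, threshold=3.0):
    """series: hourly metric values, oldest first. Mirrors the SQL's
    ROWS BETWEEN `window` PRECEDING AND 1 PRECEDING frame: each point
    is scored against only the points before it."""
    flags = []
    for i, value in enumerate(series):
        history = series[max(0, i - window):i]
        if len(history) < 2:
            flags.append(False)        # not enough baseline yet
            continue
        std = pstdev(history)
        z = (value - mean(history)) / std if std > 0 else 0.0
        flags.append(abs(z) > threshold)
    return flags
```

Note the frame excludes the current point: including it would let a large spike inflate its own baseline and suppress detection.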

### Automated Dimensional RCA

When an anomaly fires, EXOS automatically slices by every available dimension (LOB, operator, service, day of week) to identify the root cause. This is the Meta DrP / Datadog Watchdog pattern -- implemented in `OPERATIONAL_INTELLIGENCE_PLAYBOOK.md` Section 6.

---

## 8. What Commercial Tools Have That EXOS Does Not (Yet)

Honest gaps, with effort estimates:

| Capability | Amplitude/PostHog | EXOS Status | Effort to Close |
|-----------|-------------------|-------------|-----------------|
| User identity stitching | Built-in | Session ID only; `user.id` attribute available but not yet wired into all queries | 1 day (coalesce change) |
| Behavioral cohort builder | Visual UI | None | 1-2 weeks |
| Saved dashboards | Drag-and-drop | None (each page is standalone) | 1-2 weeks |
| Alerting | Threshold + anomaly | None (anomaly detection exists in SQL but no notification pipeline) | 3-5 days |
| A/B experiment analysis | Built-in | None | 2-4 weeks |
| Data governance/taxonomy | Event blocking, transforming | None (raw OTel data) | 1-2 weeks |
| Self-serve onboarding | Guided setup wizard | One-tag copy-paste (fast but no guided tour) | 1 week |

The critical insight: these are UI/UX gaps, not architectural gaps. The ClickHouse backend can support all of these capabilities. The work is in building the frontend abstractions.

---

## 9. Performance Characteristics

Benchmarks on ClickHouse Cloud (Azure East US 2, dev tier):

| Query Type | Row Count | Latency | Notes |
|-----------|-----------|---------|-------|
| Simple aggregation (count, avg) | 1M traces | 50-200ms | Hot cache |
| windowFunnel() with 5 steps | 1M traces | 200-500ms | Depends on cardinality of user IDs |
| Retention matrix (90-day) | 5M traces | 500ms-1.5s | Heavy: full scan + GROUP BY |
| Path analysis (depth 5) | 1M traces | 300-800ms | groupArray + lagInFrame |
| Full-text search on SpanAttributes | 10M traces | 1-3s | Consider bloom filter index |
| Session replay event load | 1 session (~2K events) | 100-300ms | Filtered by session ID |

ClickHouse Cloud's automatic SSD caching means frequently-queried recent data (last 7 days) returns at the low end of these ranges. Cold data (90+ days) hits the high end on first query, then caches.

---

## 10. Extending EXOS Analytics

### Adding a New Event Type

1. Emit the event as an OTel span with `SpanAttributes['product.event'] = 'true'`:
   ```javascript
   window.HyperDX.addAction('payment.completed', {
     orderId: order.id,
     amount: order.total,
     method: 'credit_card'
   });
   ```

2. Add to the `user_actions` view in `SCHEMA_VIEWS.sql`:
   ```sql
   WHEN 'payment.completed' THEN 'Payment'
   ```

3. The event is immediately available in funnels, retention, paths, and Query Studio.

### Adding a New Analysis Page

1. Copy any existing HTML page as a template
2. Add a `GET /api/<name>` handler in `functions/api/<name>.js`
3. Write the ClickHouse SQL query in the handler
4. Render results with Chart.js on the frontend
5. Add to the nav in all HTML pages

No build step. No compilation. Edit HTML, deploy to Cloudflare Pages or Vercel, done.

### Adding a New Materialized View

1. Write the MV definition in `docs/SCHEMA_VIEWS.sql`
2. Execute against ClickHouse Cloud
3. The MV populates from the next INSERT onward -- existing rows are not backfilled (backfill manually with an `INSERT INTO ... SELECT` if needed)
4. Query the MV from API handlers instead of the base table

---

## Related Documentation

| Document | Covers |
|----------|--------|
| `OPERATIONAL_INTELLIGENCE_PLAYBOOK.md` | SQL templates for all 10 technique categories |
| `ARCHITECTURE.md` | System architecture, component details, deployment models |
| `STORAGE_ARCHITECTURE.md` | Tiered storage, TTL policies, cost projections |
| `vs-commercial-analytics.md` | Feature-by-feature comparison with 9 commercial tools |
| `vs-amplitude.md` | Deep-dive PM evaluation against Amplitude |
| `vs-fullstory.md` | Deep-dive engineering evaluation against FullStory |
