# EXOS Analytics -- System Architecture

**Last updated**: 2026-03-22
**Author**: Joe Lanzone
**Status**: Living document

---

## Overview

EXOS Analytics is an operational intelligence platform built for ServiceLink that combines process mining, product analytics, session replay, and AI-assisted querying into a single ClickHouse-native stack. It replaces a $100K+/yr commercial toolchain (Celonis + FullStory + Amplitude + Pendo) at ~$2K/yr by leveraging:

- **ClickHouse Cloud** (Azure East US 2) as the sole analytical database
- **HyperDX** browser SDK + OTel Collector for telemetry ingestion and session replay
- **Cloudflare Pages / Vercel** for static frontend hosting + serverless API functions
- **Vanilla HTML + Tailwind CSS + Chart.js** for the frontend (zero framework dependencies)
- **Azure AI Foundry** (Claude on Azure) for natural-language query generation

---

## System Architecture Diagram

```
                                EXOS Analytics -- Data Flow

  Browser (Operator)
  +-------------------+
  | HyperDX SDK       |  OTel traces + rrweb replay events
  | (session replay,  | ───────────────────────────┐
  |  console capture, |                             |
  |  network capture, |                             v
  |  dead-click       |               +─────────────────────────+
  |  tracker)         |               | OTel Collector           |
  +-------------------+               | (HyperDX variant)        |
           |                          | Protocols: OTLP/gRPC     |
           |                          | :4317 + OTLP/HTTP :4318  |
           |                          +─────────────────────────+
           |                                       |
           |                    Batch insert via    |
           |                    ClickHouse native   |
           |                    protocol (TCP :9440)|
           |                                       v
           |                   +═══════════════════════════════+
           |                   ║  ClickHouse Cloud             ║
           |                   ║  (Azure East US 2, managed)   ║
           |                   ║                               ║
           |                   ║  Tables:                      ║
           |                   ║  ├── otel_traces              ║
           |                   ║  ├── otel_logs                ║
           |                   ║  ├── otel_metrics_*           ║
           |                   ║  ├── hyperdx_sessions         ║
           |                   ║  └── Views: sessions, traces, ║
           |                   ║      errors, logs, page_views,║
           |                   ║      user_actions, services   ║
           |                   ╚═══════════════════════════════╝
           |                                  ^   |
           |          ClickHouse HTTP API     |   |  Query results (JSON)
           |          (:8443, TLS)            |   |
           |                                  |   v
           |                    +─────────────────────────+
           |                    | API Functions Layer      |
           |                    | (Cloudflare Pages Fns    |
           |    Static HTML +   |  or Vercel Serverless)   |
           |    Tailwind +      |                          |
           |    Chart.js        | 19 handlers:             |
           |                    | stats, schema, sql,      |
           +<───────────────────| events, errors, sessions,|
             fetch(/api/*)      | funnels, paths, retention|
                                | filters, visitor,        |
                                | visualize, workspaces,   |
                                | assistant, query,        |
                                | operators, opportunities,|
                                | insights, ocpm           |
                                +─────────────────────────+
                                            |
                                            | Azure AI Foundry
                                            | (Claude on Azure)
                                            v
                                +─────────────────────────+
                                | NL-to-SQL generation     |
                                | Process analysis prompts |
                                | RCA / anomaly explanation|
                                +─────────────────────────+
```

---

## Component Details

### 1. Browser SDK (HyperDX)

Every page loads the HyperDX browser SDK (`@hyperdx/browser@0.22.0`), which instruments the following signals:

| Signal | Mechanism | Destination Table |
|--------|-----------|-------------------|
| DOM replay (rrweb) | Full DOM snapshot + incremental patches | `hyperdx_sessions` |
| Console logs | `consoleCapture: true` | `hyperdx_sessions` (type 5 custom events) |
| Network requests | `advancedNetworkCapture: true` | `hyperdx_sessions` (type 5 custom events) |
| Dead/rage clicks | `assets/dead-click-tracker.js` via `HyperDX.addAction()` | `hyperdx_sessions` (type 5 custom events) |
| Page navigations | Automatic `documentLoad` / `routeChange` spans | `otel_traces` |
| Visitor enrichment | `/api/visitor` call on page load, then `setGlobalAttributes()` | `ResourceAttributes` on all spans |

The SDK connects to the OTel Collector at `https://collector.exos-demo.com` using OTLP/HTTP.
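
The setup described above can be sketched roughly as follows. This is an illustration, not the repository's actual bootstrap code: the `apiKey` and `service` values are placeholders, the response shape of `/api/visitor` is assumed to be a flat JSON object, and the `dead-click` action name and its attribute keys are hypothetical. The SDK object is passed in as a parameter so the sketch can also run against a stub outside the browser.

```javascript
// Options handed to HyperDX.init(). apiKey and service are placeholders.
function buildHyperDxOptions({ apiKey, service }) {
  return {
    url: "https://collector.exos-demo.com", // collector endpoint named in this document
    apiKey,
    service,
    consoleCapture: true,         // console logs
    advancedNetworkCapture: true, // request/response capture
  };
}

// `sdk` is the HyperDX browser SDK object (injected so a stub can replace it).
async function initTelemetry(sdk, options, fetchFn = globalThis.fetch) {
  sdk.init(buildHyperDxOptions(options));
  // Assumption: /api/visitor returns a flat JSON object of string attributes.
  const res = await fetchFn("/api/visitor");
  if (res.ok) sdk.setGlobalAttributes(await res.json());
}

// Hypothetical dead-click report; action name and attribute keys are illustrative.
function reportDeadClick(sdk, element) {
  sdk.addAction("dead-click", {
    tag: element.tagName,
    text: (element.textContent || "").slice(0, 80), // keep the payload small
  });
}
```

In a page this would be called once, for example `initTelemetry(HyperDX, { apiKey: "<key>", service: "<service>" })`, before any user interaction is tracked.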

### 2. OTel Collector

The HyperDX-flavored OpenTelemetry Collector receives browser telemetry and writes directly to ClickHouse Cloud:

- **Ingress**: OTLP/gRPC on `:4317`, OTLP/HTTP on `:4318`
- **Export**: ClickHouse native protocol on `:9440` (TLS)
- **Database**: `default` schema on ClickHouse Cloud
- **Deployment**: Docker container (`hyperdx/hyperdx-otel-collector:latest`) on a VM or AKS pod

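No intermediate queue, no Kafka, no sampling. The Collector batches raw events and inserts them straight into ClickHouse, so they become queryable shortly after emission; the delay is bounded by the Collector's batch interval.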

### 3. ClickHouse Cloud

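ClickHouse Cloud (Azure Marketplace, East US 2) is the sole analytical database. All telemetry and analytics data lives here; the only other datastore is the PostgreSQL instance that holds Metabase metadata.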

#### Core Tables

| Table | Engine | Primary Key | Purpose |
|-------|--------|-------------|---------|
| `otel_traces` | MergeTree | `(ServiceName, SpanName, toDate(Timestamp), TraceId)` | All backend + frontend spans |
| `otel_logs` | MergeTree | `(ServiceName, SeverityText, toDate(Timestamp))` | Server-side logs |
| `otel_metrics_gauge` | MergeTree | `(MetricName, toDate(TimeUnix))` | Gauge metrics |
| `otel_metrics_histogram` | MergeTree | `(MetricName, toDate(TimeUnix))` | Histogram metrics |
| `otel_metrics_sum` | MergeTree | `(MetricName, toDate(TimeUnix))` | Sum/counter metrics |
| `hyperdx_sessions` | MergeTree | `(ResourceAttributes, toDate(Timestamp))` | rrweb replay events + browser telemetry |

#### Views (zero-storage, computed at query time)

Defined in `docs/SCHEMA_VIEWS.sql`:

| View | Base Table | Purpose |
|------|-----------|---------|
| `sessions` | `hyperdx_sessions` | One row per browser session with duration and event count |
| `traces` | `otel_traces` | Human-readable column aliases, extracted HTTP context |
| `errors` | `otel_traces` | Pre-filtered to `StatusCode = 'ERROR'` |
| `logs` | `otel_logs` | Human-readable log view with severity |
| `page_views` | `otel_traces` | Pre-filtered to `documentLoad` / `routeChange` spans |
| `user_actions` | `otel_traces` | Pre-filtered to known product events with labels |
| `services` | `otel_traces` | Per-service aggregates: trace count, error rate, latency percentiles |

#### Key SpanAttributes

The `Map(String, String)` columns `SpanAttributes` and `ResourceAttributes` carry the domain-specific context that powers process mining:

| Key | Example | Used For |
|-----|---------|----------|
| `order.id` | `ORD-2026-001` | Process mining case ID |
| `operator.id` | `op-437` | Operator-level analysis |
| `lob` | `Appraisal` | Line-of-business segmentation |
| `rum.session_id` | `abc123-def456` | Browser session correlation |
| `user.id` | `jlanzone@svclnk.com` | User identity (when set) |
| `product.event` | `true` | Filter to product-level events |
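
As an illustration of how `order.id` can act as a process mining case ID, the sketch below pairs a query over the `SpanAttributes` map with a small function that counts directly-follows transitions per order. It assumes `SpanName` serves as the activity label, and the activity names in the sample rows are invented; the production queries may be shaped differently.

```javascript
// One row per span that carries an order ID; {lob:String} is a ClickHouse query parameter.
const ORDER_EVENTS_SQL = `
  SELECT SpanAttributes['order.id'] AS case_id,
         SpanName                   AS activity,
         Timestamp
  FROM otel_traces
  WHERE SpanAttributes['order.id'] != ''
    AND SpanAttributes['lob'] = {lob:String}
  ORDER BY case_id, Timestamp`;

// Counts directly-follows pairs (A -> B) inside each case.
// `rows` must already be sorted by case_id, then Timestamp, as the query does.
function directlyFollows(rows) {
  const counts = new Map();
  let prev = null;
  for (const row of rows) {
    if (prev && prev.case_id === row.case_id) {
      const edge = `${prev.activity} -> ${row.activity}`;
      counts.set(edge, (counts.get(edge) || 0) + 1);
    }
    prev = row;
  }
  return counts;
}

// Invented sample rows: two orders, one of which stops early.
const sampleRows = [
  { case_id: "ORD-2026-001", activity: "order.received" },
  { case_id: "ORD-2026-001", activity: "order.assigned" },
  { case_id: "ORD-2026-001", activity: "order.completed" },
  { case_id: "ORD-2026-002", activity: "order.received" },
  { case_id: "ORD-2026-002", activity: "order.assigned" },
];
const edges = directlyFollows(sampleRows);
// edges holds "order.received -> order.assigned" with count 2
// and "order.assigned -> order.completed" with count 1.
```

No edge is counted across the boundary between two orders, which is what makes the case ID essential to the analysis.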

#### Table Relationships

```
otel_traces <── TraceId ──> otel_logs
     |
     +── SpanAttributes['rum.session_id']
     |           |
     |           v
     +── hyperdx_sessions (ResourceAttributes['rum.sessionId'])
                |
                +── rrweb DOM replay events (LogAttributes['rr-web.event'])
```
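
The relationships in the diagram translate into lookups like the ones sketched below. The attribute keys are copied from the diagram exactly as written, including the different spellings on the two tables. The selected columns assume the standard OTel exporter schema (`TraceId`, `SpanName`, `SeverityText`, `Body`), and reading the replay payload from `LogAttributes['rr-web.event']` is an assumption based on the diagram.

```javascript
// Each query takes the session ID as a ClickHouse query parameter ({sid:String}).
function sessionQueries() {
  return {
    // Spans emitted during the browser session.
    spans: `
      SELECT TraceId, SpanName, Timestamp
      FROM otel_traces
      WHERE SpanAttributes['rum.session_id'] = {sid:String}
      ORDER BY Timestamp`,
    // Logs that share a TraceId with any span in the session.
    logs: `
      SELECT Timestamp, SeverityText, Body
      FROM otel_logs
      WHERE TraceId IN (
        SELECT TraceId FROM otel_traces
        WHERE SpanAttributes['rum.session_id'] = {sid:String}
      )
      ORDER BY Timestamp`,
    // rrweb replay events for the same session.
    replay: `
      SELECT Timestamp, LogAttributes['rr-web.event'] AS rrweb_event
      FROM hyperdx_sessions
      WHERE ResourceAttributes['rum.sessionId'] = {sid:String}
      ORDER BY Timestamp`,
  };
}
```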

### 4. Three-MV Pipeline (from RadLo Paper)

The RadLo paper (Lanzone, 2026b) defines a three-stage Materialized View pipeline that transforms raw OTel spans into behavioral intelligence. EXOS Analytics implements this pattern:

```
Stage 1: Raw Ingestion                Stage 2: Session Enrichment       Stage 3: Intent Classification
┌─────────────────────┐               ┌──────────────────────┐          ┌──────────────────────┐
│ otel_traces          │    MV 1       │ session_enrichment    │   MV 2   │ operator_intent       │
│                      │──────────────>│                       │─────────>│                       │
│ Raw OTel spans with  │  Aggregate    │ Per-session metrics:  │  Classify│ Intent labels:        │
│ SpanName, Duration,  │  by session   │ - page_count          │  by rule │ - browsing            │
│ SpanAttributes,      │  ID, compute  │ - action_count        │  engine  │ - searching           │
│ ResourceAttributes   │  features     │ - error_count         │  or ML   │ - transacting         │
│                      │               │ - session_duration    │          │ - investigating       │
│ ~millions of spans   │               │ - unique_pages        │          │ - struggling          │
│ per day              │               │ - has_rage_clicks     │          │                       │
└─────────────────────┘               │ - dominant_lob        │          │ One row per session   │
                                       │                       │          │ with intent + score   │
                                       │ One row per session   │          └──────────────────────┘
                                       └──────────────────────┘
                                                                          MV 3 (optional):
                                                                         ┌──────────────────────┐
                                                                         │ behavioral_segments   │
                                                                         │                       │
                                                                         │ Gamma-Poisson         │
                                                                         │ Bayesian indexing      │
                                                                         │ (from Taste Machine)  │
                                                                         │                       │
                                                                         │ Audience segments     │
                                                                         │ with lift scores      │
                                                                         └──────────────────────┘
```

**Stage 1** (otel_traces) exists today. **Stage 2** (session enrichment) is partially implemented via the `sessions` view. **Stage 3** (intent classification) is the research frontier from the RadLo paper, where browsing behavior signals are classified into intent categories for personalization.

The Free Signal paper (Lanzone, 2026c) supplies the economic argument: a Blackwell informativeness proof that personalization driven by this intent classification strictly lifts ad revenue via the attention channel (alpha > 1), with a Blackwell RPM factor (rho >= 1).
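
To make Stages 2 and 3 concrete, here is a minimal in-memory sketch of the same logic: fold spans into one feature row per session, then label each session with a rule. It is not the Materialized View SQL and not the RadLo classifier. The span shape (`sessionId`, `spanName`, `timestampMs`, `page`, `isError`) is hypothetical, the thresholds are invented, and the `searching` label is left out because it would need a search signal this sketch does not model.

```javascript
// Stage 2 sketch: aggregate spans into per-session features.
function enrichSessions(spans) {
  const sessions = new Map();
  for (const s of spans) {
    let f = sessions.get(s.sessionId);
    if (!f) {
      f = { sessionId: s.sessionId, pageCount: 0, actionCount: 0, errorCount: 0,
            pages: new Set(), first: s.timestampMs, last: s.timestampMs };
      sessions.set(s.sessionId, f);
    }
    if (s.spanName === "documentLoad" || s.spanName === "routeChange") {
      f.pageCount += 1;       // navigation spans count as page views
      f.pages.add(s.page);
    } else {
      f.actionCount += 1;     // everything else counts as a user action
    }
    if (s.isError) f.errorCount += 1;
    f.first = Math.min(f.first, s.timestampMs);
    f.last = Math.max(f.last, s.timestampMs);
  }
  return [...sessions.values()].map((f) => ({
    sessionId: f.sessionId,
    pageCount: f.pageCount,
    actionCount: f.actionCount,
    errorCount: f.errorCount,
    uniquePages: f.pages.size,
    sessionDurationMs: f.last - f.first,
  }));
}

// Stage 3 sketch: rule-based labels, checked in priority order.
// All thresholds are made up for illustration.
function classifyIntent(features) {
  if (features.errorCount >= 3) return "struggling";
  if (features.actionCount >= 5) return "transacting";
  if (features.uniquePages >= 4) return "investigating";
  return "browsing";
}
```

A rule engine like this is easy to audit; swapping `classifyIntent` for a model-backed scorer would leave the Stage 2 feature rows unchanged.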

### 5. API Functions Layer

Nineteen serverless function handlers translate HTTP requests into ClickHouse SQL queries and return JSON results. They run on either Cloudflare Pages Functions or Vercel Serverless Functions via a compatibility adapter.

| Handler | Method | Purpose |
|---------|--------|---------|
| `stats` | GET | Real-time dashboard metrics (active sessions, events/min, error rate) |
| `schema` | GET | ClickHouse table schema browser |
| `sql` | POST | Raw SQL execution (Query Studio) |
| `events` | GET | Event explorer with filters |
| `errors` | GET | Grouped error tracking |
| `sessions` | GET | Session list + individual session traces |
| `funnels` | GET | `windowFunnel()` conversion analysis |
| `paths` | GET | User path analysis with sliding-window approach |
| `retention` | GET | Cohort retention matrices |
| `filters` | GET | Dynamic filter values (services, span names) |
| `visitor` | GET | Visitor enrichment (geo, device, network from Cloudflare headers) |
| `visualize` | POST | Chart rendering from SQL results |
| `workspaces` | GET | Workspace (service) selector |
| `assistant` | POST | AI chat assistant (streaming, Azure AI Foundry) |
| `query` | POST | NL-to-SQL generation (Azure AI Foundry) |
| `operators` | GET | Operator performance metrics |
| `opportunities` | GET | Automation opportunity scoring |
| `insights` | GET | AI-generated operational insights |
| `ocpm` | GET | Object-Centric Process Mining (multi-object interaction analysis) |
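
The sketch below shows the general shape of one such handler, using the `funnels` endpoint as the example: build a `windowFunnel()` query, send it to the ClickHouse HTTP interface, and relay the JSON. The environment binding names (`CLICKHOUSE_URL`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASSWORD`), the `steps` query parameter, and the one-hour window are assumptions, and the real handler may differ. `Timestamp` is wrapped in `toDateTime()` because `windowFunnel()` documents `Date`, `DateTime`, and unsigned integer timestamps.

```javascript
// Quote a value as a ClickHouse string literal (escape backslashes and single quotes).
function quote(value) {
  return `'${String(value).replace(/\\/g, "\\\\").replace(/'/g, "\\'")}'`;
}

// Build a windowFunnel() query; each step is matched against SpanName, in order.
function buildFunnelSql(steps, windowSeconds) {
  if (steps.length === 0) throw new Error("at least one funnel step is required");
  if (!Number.isInteger(windowSeconds) || windowSeconds <= 0) {
    throw new Error("windowSeconds must be a positive integer");
  }
  const conditions = steps.map((s) => `SpanName = ${quote(s)}`).join(", ");
  return `
    SELECT level, count() AS sessions
    FROM (
      SELECT SpanAttributes['rum.session_id'] AS session_id,
             windowFunnel(${windowSeconds})(toDateTime(Timestamp), ${conditions}) AS level
      FROM otel_traces
      WHERE SpanAttributes['rum.session_id'] != ''
      GROUP BY session_id
    )
    GROUP BY level
    ORDER BY level`;
}

// Request for the ClickHouse HTTP interface. The env binding names are assumptions.
function buildClickHouseRequest(env, sql) {
  const url = new URL(env.CLICKHOUSE_URL); // e.g. https://<host>:8443
  url.searchParams.set("default_format", "JSON");
  return {
    url: url.toString(),
    init: {
      method: "POST",
      headers: {
        "X-ClickHouse-User": env.CLICKHOUSE_USER,
        "X-ClickHouse-Key": env.CLICKHOUSE_PASSWORD,
      },
      body: sql,
    },
  };
}

// Shape of a Pages Functions GET handler (it would be exported as onRequestGet).
async function onRequestGet({ request, env }) {
  const params = new URL(request.url).searchParams;
  const steps = (params.get("steps") || "").split(",").filter(Boolean);
  if (steps.length === 0) {
    return new Response(JSON.stringify({ error: "steps is required" }), { status: 400 });
  }
  const { url, init } = buildClickHouseRequest(env, buildFunnelSql(steps, 3600));
  const upstream = await fetch(url, init);
  return new Response(upstream.body, {
    status: upstream.status,
    headers: { "Content-Type": "application/json" },
  });
}
```

Keeping the SQL builder and the request builder as pure functions makes them testable without a ClickHouse connection.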

**Authentication**: Cookie-based Basic Auth gate via `functions/_middleware.js`. First visit prompts HTTP Basic Auth, then sets a 30-day `csauth` cookie.
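
A gate of this kind could look roughly like the sketch below. The document does not describe how the `csauth` cookie value is derived, so the sketch assumes a shared token held in a hypothetical `AUTH_TOKEN` binding, with credentials in hypothetical `AUTH_USER` and `AUTH_PASS` bindings; the realm string is also invented.

```javascript
const THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60;

// Read one cookie from a Cookie header; values may themselves contain "=".
function readCookie(header, name) {
  for (const part of (header || "").split(";")) {
    const i = part.indexOf("=");
    if (i > -1 && part.slice(0, i).trim() === name) return part.slice(i + 1).trim();
  }
  return null;
}

// Pure decision: allow via cookie, allow via Basic credentials (and set the cookie), or deny.
function decideAuth({ authorization, cookie }, env) {
  if (env.AUTH_TOKEN && readCookie(cookie, "csauth") === env.AUTH_TOKEN) {
    return { allow: true };
  }
  if (authorization && authorization.startsWith("Basic ")) {
    let decoded = "";
    try {
      decoded = atob(authorization.slice(6));
    } catch (e) {
      return { allow: false }; // malformed base64
    }
    const [user, ...rest] = decoded.split(":"); // passwords may contain ":"
    // A production gate would use a constant-time comparison here.
    if (user === env.AUTH_USER && rest.join(":") === env.AUTH_PASS) {
      return {
        allow: true,
        setCookie: `csauth=${env.AUTH_TOKEN}; Max-Age=${THIRTY_DAYS_SECONDS}; Path=/; HttpOnly; Secure; SameSite=Lax`,
      };
    }
  }
  return { allow: false };
}

// Wiring for a Pages middleware (it would be exported as onRequest).
async function onRequest(context) {
  const { request, env, next } = context;
  const decision = decideAuth({
    authorization: request.headers.get("Authorization"),
    cookie: request.headers.get("Cookie"),
  }, env);
  if (!decision.allow) {
    return new Response("Authentication required", {
      status: 401,
      headers: { "WWW-Authenticate": 'Basic realm="EXOS Analytics"' },
    });
  }
  const response = await next();
  if (!decision.setCookie) return response;
  const withCookie = new Response(response.body, response); // copy so headers are mutable
  withCookie.headers.append("Set-Cookie", decision.setCookie);
  return withCookie;
}
```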

**Dual-runtime compatibility**: The `api/[action].js` Vercel adapter translates Node.js req/res into Cloudflare-compatible Request/Context objects, allowing the same handler code to run on both platforms.
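
The adapter idea can be sketched as follows. This is a simplified illustration rather than the contents of `api/[action].js`: it assumes Vercel has already parsed `req.body`, it buffers the whole response (so a streaming handler such as `assistant` would need extra work), and the context fields it passes (`request`, `env`, `params`) are assumed to be the only ones the handlers read.

```javascript
// Convert a Vercel Node.js request into a Web-standard Request.
function toWebRequest(req) {
  const host = req.headers.host || "localhost";
  const url = new URL(req.url, `https://${host}`);
  const headers = new Headers();
  for (const [key, value] of Object.entries(req.headers)) {
    if (value !== undefined) {
      headers.set(key, Array.isArray(value) ? value.join(", ") : value);
    }
  }
  const hasBody = req.method !== "GET" && req.method !== "HEAD" && req.body != null;
  return new Request(url, {
    method: req.method,
    headers,
    // Vercel may hand over a parsed object; re-serialize it for the handler.
    body: hasBody
      ? (typeof req.body === "string" ? req.body : JSON.stringify(req.body))
      : undefined,
  });
}

// Wrap a Pages-style handler ({ request, env, params }) => Response
// as a Vercel-style (req, res) function.
function adaptHandler(handler) {
  return async (req, res) => {
    const response = await handler({
      request: toWebRequest(req),
      env: process.env,
      params: req.query || {},
    });
    res.status(response.status);
    response.headers.forEach((value, key) => res.setHeader(key, value));
    res.send(Buffer.from(await response.arrayBuffer()));
  };
}
```

With this shape, a handler written once against `Request` and `Response` can be registered on Vercel as `adaptHandler(handler)` without changes to its body.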

### 6. Frontend

Static HTML pages served from Cloudflare Pages or Vercel's CDN. No React, no Vue, no build step beyond Tailwind CSS compilation.

| Page | File | Purpose |
|------|------|---------|
| Ops Overview | `index.html` | Landing page, automation opportunities, real-time stats |
| Process Mining | `compare.html` | Funnel, Sequences, By LOB, Bottlenecks, Order Traces tabs |
| Analytics | `analytics.html` | Product analytics dashboard |
| Sessions | `sessions.html` | Order lifecycle trace viewer |
| Replay | `replay.html` | Session replay (rrweb DOM replay + activity heatbar) |
| Funnels | `funnels.html` | Standalone funnel builder |
| Paths | `paths.html` | User path analysis |
| Retention | `retention.html` | Cohort retention matrices |
| Errors | `errors.html` | Grouped error tracking |
| Events | `events.html` | Event explorer |
| Engineering | `engineering.html` | Service health, latency percentiles |
| Operators | `operators.html` | Operator performance analysis |
| Query Studio | `query.html` | SQL editor with autocomplete, AI assistant, chart builder |
| Docs | `docs.html` | Architecture documentation and SQL playbook |
| Getting Started | `getting-started.html` | One-tag onboarding |
| Chat | `chat.html` | AI chat interface |
| Architecture | `architecture.html` | System architecture visualization |

**Design system**: Tailwind CSS with custom config (`tailwind.config.js`). Brand color: ServiceLink Blue (`#2B7FBE` / `sl-blue`). Font stack: Oswald (headings), Roboto (body), JetBrains Mono (code). Dark mode support via `html.dark` class toggle.

**Charting**: Chart.js for all visualizations (bar, line, pie, area). d3-sankey for process flow maps. Mermaid.js for AI-generated process diagrams. FINOS Perspective for pivot tables.
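
As a small illustration of the charting path, the sketch below turns ClickHouse JSON rows into a Chart.js bar chart configuration in the brand color. The helper name and the row field names are hypothetical. The `Number()` conversion matters because ClickHouse's JSON format quotes 64-bit integers, such as `count()` results, as strings by default.

```javascript
// Map query rows to a Chart.js bar chart configuration.
function toBarChartConfig(rows, labelKey, valueKey, title) {
  return {
    type: "bar",
    data: {
      labels: rows.map((r) => r[labelKey]),
      datasets: [{
        label: title,
        data: rows.map((r) => Number(r[valueKey])), // "42" -> 42
        backgroundColor: "#2B7FBE",                 // ServiceLink Blue (sl-blue)
      }],
    },
    options: {
      responsive: true,
      plugins: { legend: { display: false } },
    },
  };
}

// In the browser (canvas element ID is illustrative):
//   new Chart(document.getElementById("chart"), toBarChartConfig(rows, "lob", "orders", "Orders by LOB"));
```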

---

## Deployment Architectures

### Current: Cloudflare Pages + VM (v2)

```
Browser ──> Cloudflare Pages (static HTML + API Functions)
              |
              +──> ClickHouse Cloud (Azure East US 2)

Browser SDK ──> Cloudflare Tunnel ──> VM (Docker Compose)
                                       ├── otel-collector ──> ClickHouse Cloud
                                       ├── metabase:3001 ──> ClickHouse Cloud
                                       └── cloudflared (tunnel client)
```

### Alternative: Vercel (current dev/demo)

```
Browser ──> Vercel (static HTML + Serverless Functions)
              |
              +──> ClickHouse Cloud (Azure East US 2)
```

### Target: AKS (v3, production)

```
Browser ──> Cloudflare Tunnel ──> AKS Cluster
                                   ├── portal (nginx + Node.js)
                                   ├── otel-collector
                                   ├── metabase
                                   ├── mcp-server (ClickHouse MCP)
                                   └── cloudflared

All services ──> ClickHouse Cloud (Azure Private Link)
```

See `docs/AKS_MIGRATION.md` for the detailed migration guide.

---

## Infrastructure Services

| Service | Role | Deployment |
|---------|------|-----------|
| ClickHouse Cloud | Analytical database | Azure Marketplace (managed) |
| HyperDX OTel Collector | Telemetry ingestion | Docker on VM / AKS pod |
| Metabase | BI dashboards (16 Grafana-style dashboards also available) | Docker on VM / AKS pod |
| MCP ClickHouse Server | AI-powered SQL via Model Context Protocol | Docker on VM / AKS pod |
| Cloudflared | Tunnel from Cloudflare edge to internal services | Docker on VM / AKS pod |
| PostgreSQL | Metabase metadata store | Docker on VM / AKS pod |
| Azure AI Foundry | Claude on Azure for NL-to-SQL and process analysis | Managed Azure service |

---

## Cost at Scale

| Scale | ClickHouse Cloud | Compute | Total |
|-------|-----------------|---------|-------|
| Demo (~20GB) | ~$150/mo | ~$50/mo (VM) | ~$200/mo |
| 10K visitors/day | ~$150/mo | ~$110/mo | ~$260/mo |
| 100K visitors/day | ~$500/mo | ~$330/mo | ~$830/mo |
| 1M visitors/day | ~$4K/mo | ~$2K/mo | ~$6K/mo |

No per-seat fees. No per-event pricing. Storage is billed at $25.30 per TiB per month, with automatic SSD caching.

---

## Theoretical Foundation

Three academic papers by Joe Lanzone define the theoretical framework that EXOS Analytics implements:

1. **"The Taste Machine" (2026a)**: WidgetLang constraint grammar + Gamma-Poisson Bayesian audience indexing + UCB exploration for automated UX personalization. Provides the mathematical framework for behavioral segmentation (the optional MV 3, `behavioral_segments`).

2. **"RadLo" (2026b)**: Describes how OTel spans feed a three-stage MV pipeline that produces real-time intent classification and behavioral enrichment. It is the architectural blueprint for the three-MV pipeline described above and a proof of concept for personalization using browsing behavior signals.

3. **"Free Signal" (2026c)**: Blackwell informativeness proof that personalization strictly lifts ad revenue via the attention channel (alpha > 1) + Blackwell RPM factor (rho >= 1). The economic justification for investing in behavioral intelligence infrastructure.
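
For readers unfamiliar with Gamma-Poisson indexing (item 1), the sketch below shows the generic textbook form of the idea: shrink a segment's observed event rate toward a baseline using a Gamma prior, then express it as a lift. It is not taken from the paper, whose exact formulation may differ; the function name, the `priorStrength` parameter, and its default are assumptions.

```javascript
// Generic Gamma-Poisson shrinkage.
// Prior: rate ~ Gamma(a, b) with mean a / b equal to the baseline rate.
// After observing `events` over `exposure` units, the posterior mean rate
// is (a + events) / (b + exposure).
function shrunkLift({ events, exposure }, baselineRate, priorStrength = 50) {
  if (baselineRate <= 0) throw new Error("baselineRate must be positive");
  const a = baselineRate * priorStrength; // prior pseudo-events
  const b = priorStrength;                // prior pseudo-exposure
  const posteriorRate = (a + events) / (b + exposure);
  return posteriorRate / baselineRate;    // 1.0 means "same as baseline"
}

// A segment with no data sits exactly at the baseline (lift 1).
const noData = shrunkLift({ events: 0, exposure: 0 }, 0.2);

// 30 events in 50 sessions is a raw rate of 0.6, three times the 0.2 baseline,
// but with 50 pseudo-sessions of prior the estimate is pulled back to a lift of 2.
const smallSegment = shrunkLift({ events: 30, exposure: 50 }, 0.2);
```

The shrinkage keeps small segments from showing extreme lift scores on thin evidence, which is the usual reason for preferring this estimate over a raw ratio.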

---

## Related Documentation

| Document | What it covers |
|----------|---------------|
| `STORAGE_ARCHITECTURE.md` | Tiered storage, TTL policies, cost projections, archive pipeline |
| `OPERATIONAL_INTELLIGENCE_PLAYBOOK.md` | SQL playbook: 10 technique categories with templates and real examples |
| `SCHEMA_VIEWS.sql` | View definitions for human-readable access to raw OTel tables |
| `AKS_MIGRATION.md` | Kubernetes migration guide with manifests and cost estimates |
| `PRODUCT_ANALYTICS.md` | PostHog/Amplitude-class analytics on ClickHouse |
| `vs-commercial-analytics.md` | Comparison against 9 commercial tools |
| `OCPM.md` | Object-Centric Process Mining: architecture, SQL, Celonis comparison |
| `CLICKSTACK_ALIGNMENT.md` | Alignment with ClickHouse's ClickStack initiative |
| `SEQUENCES_TAB_REDESIGN_SPEC.md` | Process Mining sequences tab redesign spec |
| `plans/session-replay-implementation.md` | Session replay implementation plan |
