adarshxs HF Staff committed on
Commit
1d3e970
·
verified ·
1 Parent(s): 200869a

Upload folder using huggingface_hub

Browse files
Files changed (43) hide show
  1. .dockerignore +14 -0
  2. .env.example +19 -0
  3. .gitignore +11 -0
  4. Dockerfile +17 -0
  5. README.md +212 -12
  6. app.py +31 -0
  7. config/monitor.yaml +51 -0
  8. monitoring/docker-compose.yml +45 -0
  9. monitoring/github-actions-post-job.yml +28 -0
  10. monitoring/grafana/dashboards/build-duration-trends.json +474 -0
  11. monitoring/grafana/dashboards/build-failure-overview.json +500 -0
  12. monitoring/grafana/dashboards/build-matrix-overview.json +593 -0
  13. monitoring/grafana/provisioning/dashboards/dashboards.yml +11 -0
  14. monitoring/grafana/provisioning/datasources/prometheus.yml +10 -0
  15. monitoring/prometheus/prometheus.yml +18 -0
  16. monitoring/prometheus/rules/build-alerts.yml +20 -0
  17. requirements-dev.txt +3 -0
  18. requirements.txt +9 -0
  19. scripts/bootstrap_space.py +150 -0
  20. scripts/push_build_metrics.py +48 -0
  21. scripts/smoke_check.py +63 -0
  22. src/kc_monitor/__init__.py +5 -0
  23. src/kc_monitor/config.py +190 -0
  24. src/kc_monitor/github_client.py +456 -0
  25. src/kc_monitor/grafana.py +65 -0
  26. src/kc_monitor/kernel_index.py +108 -0
  27. src/kc_monitor/log_parser.py +216 -0
  28. src/kc_monitor/metrics_push.py +190 -0
  29. src/kc_monitor/models.py +342 -0
  30. src/kc_monitor/service.py +572 -0
  31. src/kc_monitor/stall_detector.py +48 -0
  32. src/kc_monitor/ui.py +1110 -0
  33. tests/conftest.py +10 -0
  34. tests/fixtures/active_build_job.json +45 -0
  35. tests/fixtures/build_release_run.json +19 -0
  36. tests/fixtures/failed_build_job.json +45 -0
  37. tests/fixtures/failed_build_run.json +19 -0
  38. tests/fixtures/manual_build_run.json +19 -0
  39. tests/fixtures/manual_upload_job.json +53 -0
  40. tests/test_grafana.py +44 -0
  41. tests/test_log_parser.py +52 -0
  42. tests/test_metrics_push.py +96 -0
  43. tests/test_service.py +152 -0
.dockerignore ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ .git
2
+ .gitignore
3
+ .env
4
+ .venv
5
+ venv
6
+ __pycache__
7
+ .pytest_cache
8
+ .ruff_cache
9
+ *.pyc
10
+ *.pyo
11
+ *.pyd
12
+ .cursor
13
+ tests
14
+ requirements-dev.txt
.env.example ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ GITHUB_TOKEN=your_github_token_here
2
+ HF_TOKEN=your_huggingface_token_here
3
+ KCM_SPACE_ID=adarshxs/kernels-community-monitor
4
+ KCM_GITHUB_OWNER=huggingface
5
+ KCM_GITHUB_REPO=kernels-community
6
+ KCM_GITHUB_BRANCH=main
7
+ KCM_REFRESH_INTERVAL_SECONDS=300
8
+ KCM_WORKFLOW_RUN_PAGE_SIZE=100
9
+ KCM_WORKFLOW_RUN_PAGES=12
10
+ KCM_CRITICAL_KERNELS=flash-attn3,sgl-flash-attn3,flash-attn4,vllm-flash-attn3,deep-gemm
11
+ KCM_GRAFANA_BASE_URL=http://localhost:3000
12
+ KCM_GRAFANA_ORG_ID=1
13
+ KCM_GRAFANA_THEME=dark
14
+ KCM_GRAFANA_OVERVIEW_UID=kernels-build-matrix
15
+ KCM_GRAFANA_DURATION_UID=kernels-build-durations
16
+ KCM_GRAFANA_FAILURE_UID=kernels-build-failures
17
+ KCM_PROMETHEUS_BASE_URL=http://prometheus:9090
18
+ KCM_PUSHGATEWAY_URL=http://pushgateway:9091
19
+ KCM_PUSHGATEWAY_JOB_NAME=kernels-community-build-matrix
.gitignore ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ __pycache__/
2
+ *.py[cod]
3
+ .env
4
+ .venv/
5
+ venv/
6
+ .pytest_cache/
7
+ .ruff_cache/
8
+ .mypy_cache/
9
+ build/
10
+ dist/
11
+ *.log
Dockerfile ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ FROM python:3.11-slim
2
+
3
+ ENV PYTHONDONTWRITEBYTECODE=1 \
4
+ PYTHONUNBUFFERED=1 \
5
+ PORT=7860
6
+
7
+ WORKDIR /app
8
+
9
+ COPY requirements.txt /app/requirements.txt
10
+ RUN pip install --no-cache-dir --upgrade pip && \
11
+ pip install --no-cache-dir -r /app/requirements.txt
12
+
13
+ COPY . /app
14
+
15
+ EXPOSE 7860
16
+
17
+ CMD ["python", "app.py"]
README.md CHANGED
@@ -1,12 +1,212 @@
1
- ---
2
- title: Kernel Ci Monitor
3
- emoji: 🏃
4
- colorFrom: red
5
- colorTo: purple
6
- sdk: gradio
7
- sdk_version: 6.10.0
8
- app_file: app.py
9
- pinned: false
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: Kernels Community Monitor
3
+ sdk: gradio
4
+ sdk_version: 6.10.0
5
+ python_version: 3.11
6
+ app_file: app.py
7
+ fullWidth: true
8
+ header: mini
9
+ suggested_hardware: cpu-basic
10
+ short_description: Live kernel build table plus optional Grafana metrics deck.
11
+ tags:
12
+ - monitoring
13
+ - github-actions
14
+ - kernels
15
+ - gradio
16
+ - grafana
17
+ ---
18
+
19
+ # Kernels Community Monitor
20
+
21
+ `Kernels Community Monitor` now does two things:
22
+
23
+ 1. Enumerates every kernel source dir in `huggingface/kernels-community`, scans the latest GitHub Actions runs, and renders a live per-kernel / per-variant status table with Actions links.
24
+ 2. Optionally embeds Grafana dashboards for longer-term metrics once a public Grafana endpoint is configured.
25
+
26
+ The app prefers the GitHub Actions REST API when it can, but it also has a public GitHub HTML fallback for workflow runs, job groups, and `build.toml` reads. That avoids the current `huggingface` org restriction that blocks classic PATs on some Actions endpoints.
27
+
28
+ The metrics path is still in the repo:
29
+
30
+ - GitHub Actions pushes per-matrix build metrics to Pushgateway.
31
+ - Prometheus scrapes Pushgateway and evaluates alert rules.
32
+ - Grafana owns the dashboards, filters, and time-series UI.
33
+ - This Hugging Face Space just presents clean links and embeds for those dashboards.
34
+
35
+ ## What Changed
36
+
37
+ The old zero-upstream-change monitor worked, but it had three hard limits:
38
+
39
+ - it depended on GitHub API polling and log scraping
40
+ - it could only infer matrix state indirectly
41
+ - it could not give you clean duration trends or robust alerting without more brittle parsing
42
+
43
+ This cutover replaces that with first-class metrics:
44
+
45
+ - `scripts/push_build_metrics.py` pushes the latest status, duration, and timestamp for each matrix combo.
46
+ - `monitoring/docker-compose.yml` provisions `prometheus`, `pushgateway`, and `grafana`.
47
+ - `monitoring/prometheus/rules/build-alerts.yml` alerts on failing or stale combos.
48
+ - `monitoring/grafana/dashboards/` provides ready dashboards with filters for kernel, backend, compute backend, CUDA, PyTorch, and Python.
49
+ - `src/kc_monitor/ui.py` renders the live kernel matrix table first, then the Grafana deck if configured.
50
+
51
+ ## Metrics Model
52
+
53
+ Each matrix combo is stored as a stable Pushgateway grouping key:
54
+
55
+ `kernel + backend + compute_backend + cuda_version + pytorch_version + python_version`
56
+
57
+ Each push updates these gauges:
58
+
59
+ - `kc_build_last_run_result_code`
60
+ - `kc_build_last_run_failed`
61
+ - `kc_build_last_run_duration_seconds`
62
+ - `kc_build_last_run_timestamp_seconds`
63
+ - `kc_build_last_run_info`
64
+
65
+ That gives you:
66
+
67
+ - current per-combo health
68
+ - duration history per combo
69
+ - stale build telemetry detection
70
+ - alert-friendly failure signals
71
+
72
+ ## Local Setup
73
+
74
+ Install deps:
75
+
76
+ ```bash
77
+ python -m venv .venv
78
+ . .venv/bin/activate
79
+ pip install -r requirements-dev.txt
80
+ ```
81
+
82
+ Windows PowerShell activation:
83
+
84
+ ```powershell
85
+ python -m venv .venv
86
+ .\.venv\Scripts\Activate.ps1
87
+ pip install -r requirements-dev.txt
88
+ ```
89
+
90
+ Create `.env` from `.env.example` and set at least:
91
+
92
+ ```env
93
+ HF_TOKEN=...
94
+ ```
95
+
96
+ If you want local Grafana too, set the Grafana base URL and bring up the local monitoring stack:
97
+
98
+ ```bash
99
+ KCM_GRAFANA_BASE_URL=http://localhost:3000
100
+ docker compose -f monitoring/docker-compose.yml up -d
101
+ ```
102
+
103
+ Run the app locally:
104
+
105
+ ```bash
106
+ python app.py
107
+ ```
108
+
109
+ Run the smoke check:
110
+
111
+ ```bash
112
+ python scripts/smoke_check.py
113
+ ```
114
+
115
+ Run tests:
116
+
117
+ ```bash
118
+ pytest
119
+ ```
120
+
121
+ ## GitHub Actions Step
122
+
123
+ The actual workflow YAMLs live in the `huggingface/kernels-community` repo, not here.
124
+
125
+ Use `monitoring/github-actions-post-job.yml` as the drop-in snippet. The important bit is:
126
+
127
+ ```yaml
128
+ - name: Record matrix job start time
129
+ shell: bash
130
+ run: echo "KCM_JOB_STARTED_AT=$(date +%s)" >> "$GITHUB_ENV"
131
+
132
+ - name: Push matrix build metrics
133
+ if: always()
134
+ shell: bash
135
+ env:
136
+ PUSHGATEWAY_URL: ${{ secrets.PUSHGATEWAY_URL }}
137
+ KCM_PUSHGATEWAY_JOB_NAME: kernels-community-build-matrix
138
+ KCM_JOB_STATUS: ${{ job.status }}
139
+ KCM_KERNEL: ${{ matrix.kernel }}
140
+ KCM_BACKEND: ${{ matrix.backend }}
141
+ KCM_COMPUTE_BACKEND: ${{ matrix.compute_backend }}
142
+ KCM_CUDA_VERSION: ${{ matrix.cuda }}
143
+ KCM_PYTORCH_VERSION: ${{ matrix.torch }}
144
+ KCM_PYTHON_VERSION: ${{ matrix.python }}
145
+ run: python scripts/push_build_metrics.py
146
+ ```
147
+
148
+ The emitter is intentionally low-cardinality: it tracks the latest state for each stable combo, which is what you want for Grafana filters and Prometheus alerts without Pushgateway turning into a per-run junk drawer.
149
+
150
+ ## Dashboards
151
+
152
+ Provisioned dashboards:
153
+
154
+ - `kernels-build-matrix`
155
+ - `kernels-build-durations`
156
+ - `kernels-build-failures`
157
+
158
+ All of them expose variables for:
159
+
160
+ - kernel
161
+ - backend
162
+ - compute backend
163
+ - CUDA version
164
+ - PyTorch version
165
+ - Python version
166
+
167
+ ## Alerting
168
+
169
+ Prometheus rules ship in `monitoring/prometheus/rules/build-alerts.yml`.
170
+
171
+ Current rules:
172
+
173
+ - `KernelsBuildMatrixComboFailing`
174
+ - `KernelsBuildMetricsStale`
175
+
176
+ You can route those through Alertmanager later, but the expression layer is already there.
177
+
178
+ ## Runtime Configuration
179
+
180
+ Main env/config knobs:
181
+
182
+ - `KCM_GRAFANA_BASE_URL`
183
+ - `KCM_GRAFANA_ORG_ID`
184
+ - `KCM_GRAFANA_THEME`
185
+ - `KCM_GRAFANA_OVERVIEW_UID`
186
+ - `KCM_GRAFANA_DURATION_UID`
187
+ - `KCM_GRAFANA_FAILURE_UID`
188
+ - `KCM_PROMETHEUS_BASE_URL`
189
+ - `KCM_PUSHGATEWAY_URL`
190
+ - `KCM_PUSHGATEWAY_JOB_NAME`
191
+
192
+ If `KCM_GRAFANA_BASE_URL` is not set, the Space still works: the live GitHub Actions table stays active and the Grafana section renders as a setup card instead of broken embeds.
193
+
194
+ The base YAML config lives at `config/monitor.yaml`. Environment variables override it at runtime.
195
+
196
+ ## Deploy To Hugging Face Space
197
+
198
+ This repo still includes a bootstrap script that creates or updates the Space and uploads the current folder.
199
+
200
+ ```bash
201
+ python scripts/bootstrap_space.py --space-id adarshxs/kernels-community-monitor
202
+ ```
203
+
204
+ What it does:
205
+
206
+ - creates the Space repo if it does not exist
207
+ - uploads this project as a Gradio Space
208
+ - writes the Grafana, Prometheus, and Pushgateway settings into Space variables
209
+
210
+ After upload, the expected Space URL is:
211
+
212
+ - `https://huggingface.co/spaces/adarshxs/kernels-community-monitor`
app.py ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ruff: noqa: E402
2
+ from __future__ import annotations
3
+
4
+ import os
5
+ import sys
6
+ from pathlib import Path
7
+
8
+
9
+ ROOT_DIR = Path(__file__).resolve().parent
10
+ SRC_DIR = ROOT_DIR / "src"
11
+ if str(SRC_DIR) not in sys.path:
12
+ sys.path.insert(0, str(SRC_DIR))
13
+
14
+ from kc_monitor.config import load_config
15
+ from kc_monitor.service import MonitorService
16
+ from kc_monitor.ui import CSS, PAGE_JS, THEME, build_dashboard
17
+
18
+
19
+ config = load_config(ROOT_DIR / "config" / "monitor.yaml")
20
+ service = MonitorService(config)
21
+ demo = build_dashboard(service, config)
22
+
23
+
24
+ if __name__ == "__main__":
25
+ demo.launch(
26
+ server_name="0.0.0.0",
27
+ server_port=int(os.getenv("PORT", "7860")),
28
+ theme=THEME,
29
+ css=CSS,
30
+ js=PAGE_JS,
31
+ )
config/monitor.yaml ADDED
@@ -0,0 +1,51 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ github:
2
+ owner: huggingface
3
+ repo: kernels-community
4
+ branch: main
5
+ per_page: 100
6
+ request_timeout_seconds: 25
7
+ workflows:
8
+ - path: .github/workflows/build-release.yaml
9
+ label: Build Release
10
+ enabled: true
11
+ - path: .github/workflows/manual-build-upload.yaml
12
+ label: Manual Kernel Build
13
+ enabled: true
14
+
15
+ monitor:
16
+ refresh_interval_seconds: 300
17
+ snapshot_ttl_seconds: 240
18
+ workflow_run_page_size: 100
19
+ workflow_run_pages: 12
20
+ recent_completed_hours: 336
21
+ recent_limit: 30
22
+ completed_runs_per_workflow: 15
23
+ log_line_limit: 400
24
+ log_char_limit: 35000
25
+ detail_event_limit: 25
26
+ stall_without_log_minutes: 45
27
+ stall_active_phase_minutes: 180
28
+ critical_kernels:
29
+ - flash-attn3
30
+ - sgl-flash-attn3
31
+ - flash-attn4
32
+ - vllm-flash-attn3
33
+ - deep-gemm
34
+
35
+ grafana:
36
+ base_url: null
37
+ org_id: 1
38
+ theme: dark
39
+ default_from: now-30d
40
+ default_to: now
41
+ default_refresh: 5m
42
+ overview_dashboard_uid: kernels-build-matrix
43
+ duration_dashboard_uid: kernels-build-durations
44
+ failure_dashboard_uid: kernels-build-failures
45
+
46
+ prometheus:
47
+ base_url: null
48
+
49
+ pushgateway:
50
+ url: null
51
+ job_name: kernels-community-build-matrix
monitoring/docker-compose.yml ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ services:
2
+ prometheus:
3
+ image: prom/prometheus
4
+ command:
5
+ - --config.file=/etc/prometheus/prometheus.yml
6
+ - --web.enable-lifecycle
7
+ ports:
8
+ - "9090:9090"
9
+ volumes:
10
+ - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
11
+ - ./prometheus/rules:/etc/prometheus/rules:ro
12
+ - prometheus-data:/prometheus
13
+
14
+ pushgateway:
15
+ image: prom/pushgateway
16
+ command:
17
+ - --persistence.file=/data/pushgateway.db
18
+ ports:
19
+ - "9091:9091"
20
+ volumes:
21
+ - pushgateway-data:/data
22
+
23
+ grafana:
24
+ image: grafana/grafana-oss
25
+ depends_on:
26
+ - prometheus
27
+ environment:
28
+ GF_SECURITY_ADMIN_USER: ${GRAFANA_ADMIN_USER:-admin}
29
+ GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD:-admin}
30
+ GF_AUTH_ANONYMOUS_ENABLED: ${GRAFANA_ANONYMOUS_ENABLED:-true}
31
+ GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
32
+ GF_SECURITY_ALLOW_EMBEDDING: "true"
33
+ GF_DASHBOARDS_MIN_REFRESH_INTERVAL: 10s
34
+ GF_SERVER_ROOT_URL: ${GRAFANA_ROOT_URL:-http://localhost:3000}
35
+ ports:
36
+ - "3000:3000"
37
+ volumes:
38
+ - ./grafana/provisioning:/etc/grafana/provisioning:ro
39
+ - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
40
+ - grafana-data:/var/lib/grafana
41
+
42
+ volumes:
43
+ prometheus-data:
44
+ pushgateway-data:
45
+ grafana-data:
monitoring/github-actions-post-job.yml ADDED
@@ -0,0 +1,28 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Drop this into the kernels-community workflow repo.
2
+ #
3
+ # Example matrix fields expected below:
4
+ # matrix.kernel
5
+ # matrix.backend
6
+ # matrix.compute_backend
7
+ # matrix.cuda
8
+ # matrix.torch
9
+ # matrix.python
10
+
11
+ - name: Record matrix job start time
12
+ shell: bash
13
+ run: echo "KCM_JOB_STARTED_AT=$(date +%s)" >> "$GITHUB_ENV"
14
+
15
+ - name: Push matrix build metrics
16
+ if: always()
17
+ shell: bash
18
+ env:
19
+ PUSHGATEWAY_URL: ${{ secrets.PUSHGATEWAY_URL }}
20
+ KCM_PUSHGATEWAY_JOB_NAME: kernels-community-build-matrix
21
+ KCM_JOB_STATUS: ${{ job.status }}
22
+ KCM_KERNEL: ${{ matrix.kernel }}
23
+ KCM_BACKEND: ${{ matrix.backend }}
24
+ KCM_COMPUTE_BACKEND: ${{ matrix.compute_backend }}
25
+ KCM_CUDA_VERSION: ${{ matrix.cuda }}
26
+ KCM_PYTORCH_VERSION: ${{ matrix.torch }}
27
+ KCM_PYTHON_VERSION: ${{ matrix.python }}
28
+ run: python scripts/push_build_metrics.py
monitoring/grafana/dashboards/build-duration-trends.json ADDED
@@ -0,0 +1,474 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "annotations": {
3
+ "list": []
4
+ },
5
+ "editable": true,
6
+ "fiscalYearStartMonth": 0,
7
+ "graphTooltip": 1,
8
+ "links": [],
9
+ "panels": [
10
+ {
11
+ "datasource": {
12
+ "type": "prometheus",
13
+ "uid": "prometheus"
14
+ },
15
+ "fieldConfig": {
16
+ "defaults": {
17
+ "color": {
18
+ "mode": "thresholds"
19
+ },
20
+ "thresholds": {
21
+ "mode": "absolute",
22
+ "steps": [
23
+ {
24
+ "color": "green",
25
+ "value": null
26
+ }
27
+ ]
28
+ },
29
+ "unit": "s"
30
+ },
31
+ "overrides": []
32
+ },
33
+ "gridPos": {
34
+ "h": 5,
35
+ "w": 8,
36
+ "x": 0,
37
+ "y": 0
38
+ },
39
+ "id": 1,
40
+ "options": {
41
+ "colorMode": "value",
42
+ "graphMode": "none",
43
+ "justifyMode": "auto",
44
+ "orientation": "auto",
45
+ "reduceOptions": {
46
+ "calcs": [
47
+ "lastNotNull"
48
+ ],
49
+ "fields": "",
50
+ "values": false
51
+ },
52
+ "textMode": "value"
53
+ },
54
+ "targets": [
55
+ {
56
+ "datasource": {
57
+ "type": "prometheus",
58
+ "uid": "prometheus"
59
+ },
60
+ "expr": "avg(kc_build_last_run_duration_seconds{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"})",
61
+ "instant": true,
62
+ "refId": "A"
63
+ }
64
+ ],
65
+ "title": "Average current duration",
66
+ "type": "stat"
67
+ },
68
+ {
69
+ "datasource": {
70
+ "type": "prometheus",
71
+ "uid": "prometheus"
72
+ },
73
+ "fieldConfig": {
74
+ "defaults": {
75
+ "color": {
76
+ "mode": "thresholds"
77
+ },
78
+ "thresholds": {
79
+ "mode": "absolute",
80
+ "steps": [
81
+ {
82
+ "color": "green",
83
+ "value": null
84
+ }
85
+ ]
86
+ },
87
+ "unit": "s"
88
+ },
89
+ "overrides": []
90
+ },
91
+ "gridPos": {
92
+ "h": 5,
93
+ "w": 8,
94
+ "x": 8,
95
+ "y": 0
96
+ },
97
+ "id": 2,
98
+ "options": {
99
+ "colorMode": "value",
100
+ "graphMode": "none",
101
+ "justifyMode": "auto",
102
+ "orientation": "auto",
103
+ "reduceOptions": {
104
+ "calcs": [
105
+ "lastNotNull"
106
+ ],
107
+ "fields": "",
108
+ "values": false
109
+ },
110
+ "textMode": "value"
111
+ },
112
+ "targets": [
113
+ {
114
+ "datasource": {
115
+ "type": "prometheus",
116
+ "uid": "prometheus"
117
+ },
118
+ "expr": "max(kc_build_last_run_duration_seconds{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"})",
119
+ "instant": true,
120
+ "refId": "A"
121
+ }
122
+ ],
123
+ "title": "Slowest current combo",
124
+ "type": "stat"
125
+ },
126
+ {
127
+ "datasource": {
128
+ "type": "prometheus",
129
+ "uid": "prometheus"
130
+ },
131
+ "fieldConfig": {
132
+ "defaults": {
133
+ "color": {
134
+ "mode": "thresholds"
135
+ },
136
+ "thresholds": {
137
+ "mode": "absolute",
138
+ "steps": [
139
+ {
140
+ "color": "green",
141
+ "value": null
142
+ },
143
+ {
144
+ "color": "orange",
145
+ "value": 6
146
+ },
147
+ {
148
+ "color": "red",
149
+ "value": 24
150
+ }
151
+ ]
152
+ },
153
+ "unit": "h"
154
+ },
155
+ "overrides": []
156
+ },
157
+ "gridPos": {
158
+ "h": 5,
159
+ "w": 8,
160
+ "x": 16,
161
+ "y": 0
162
+ },
163
+ "id": 3,
164
+ "options": {
165
+ "colorMode": "value",
166
+ "graphMode": "none",
167
+ "justifyMode": "auto",
168
+ "orientation": "auto",
169
+ "reduceOptions": {
170
+ "calcs": [
171
+ "lastNotNull"
172
+ ],
173
+ "fields": "",
174
+ "values": false
175
+ },
176
+ "textMode": "value"
177
+ },
178
+ "targets": [
179
+ {
180
+ "datasource": {
181
+ "type": "prometheus",
182
+ "uid": "prometheus"
183
+ },
184
+ "expr": "avg((time() - kc_build_last_run_timestamp_seconds{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"}) / 3600)",
185
+ "instant": true,
186
+ "refId": "A"
187
+ }
188
+ ],
189
+ "title": "Average age of last sample",
190
+ "type": "stat"
191
+ },
192
+ {
193
+ "datasource": {
194
+ "type": "prometheus",
195
+ "uid": "prometheus"
196
+ },
197
+ "fieldConfig": {
198
+ "defaults": {
199
+ "color": {
200
+ "mode": "continuous-BlPu"
201
+ },
202
+ "unit": "s"
203
+ },
204
+ "overrides": []
205
+ },
206
+ "gridPos": {
207
+ "h": 8,
208
+ "w": 24,
209
+ "x": 0,
210
+ "y": 5
211
+ },
212
+ "id": 4,
213
+ "options": {
214
+ "legend": {
215
+ "displayMode": "table",
216
+ "placement": "bottom"
217
+ },
218
+ "tooltip": {
219
+ "mode": "multi"
220
+ }
221
+ },
222
+ "targets": [
223
+ {
224
+ "datasource": {
225
+ "type": "prometheus",
226
+ "uid": "prometheus"
227
+ },
228
+ "expr": "kc_build_last_run_duration_seconds{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"}",
229
+ "legendFormat": "{{kernel}} | {{backend}} | {{compute_backend}} | CUDA {{cuda_version}} | torch {{pytorch_version}} | py {{python_version}}",
230
+ "refId": "A"
231
+ }
232
+ ],
233
+ "title": "Duration trends by combo",
234
+ "type": "timeseries"
235
+ },
236
+ {
237
+ "datasource": {
238
+ "type": "prometheus",
239
+ "uid": "prometheus"
240
+ },
241
+ "fieldConfig": {
242
+ "defaults": {
243
+ "color": {
244
+ "mode": "continuous-GrYlRd"
245
+ },
246
+ "unit": "s"
247
+ },
248
+ "overrides": []
249
+ },
250
+ "gridPos": {
251
+ "h": 8,
252
+ "w": 24,
253
+ "x": 0,
254
+ "y": 13
255
+ },
256
+ "id": 5,
257
+ "options": {
258
+ "displayMode": "gradient",
259
+ "orientation": "horizontal",
260
+ "reduceOptions": {
261
+ "calcs": [
262
+ "lastNotNull"
263
+ ],
264
+ "fields": "",
265
+ "values": false
266
+ },
267
+ "showUnfilled": true
268
+ },
269
+ "targets": [
270
+ {
271
+ "datasource": {
272
+ "type": "prometheus",
273
+ "uid": "prometheus"
274
+ },
275
+ "expr": "kc_build_last_run_duration_seconds{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"}",
276
+ "instant": true,
277
+ "legendFormat": "{{kernel}} | {{backend}} | {{compute_backend}} | CUDA {{cuda_version}} | torch {{pytorch_version}} | py {{python_version}}",
278
+ "refId": "A"
279
+ }
280
+ ],
281
+ "title": "Current duration distribution",
282
+ "type": "bargauge"
283
+ }
284
+ ],
285
+ "refresh": "5m",
286
+ "schemaVersion": 39,
287
+ "style": "dark",
288
+ "tags": [
289
+ "kernels-community",
290
+ "ci",
291
+ "durations"
292
+ ],
293
+ "templating": {
294
+ "list": [
295
+ {
296
+ "current": {
297
+ "selected": true,
298
+ "text": [
299
+ "All"
300
+ ],
301
+ "value": [
302
+ "$__all"
303
+ ]
304
+ },
305
+ "datasource": {
306
+ "type": "prometheus",
307
+ "uid": "prometheus"
308
+ },
309
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, kernel)",
310
+ "includeAll": true,
311
+ "label": "Kernel",
312
+ "multi": true,
313
+ "name": "kernel",
314
+ "options": [],
315
+ "query": {
316
+ "query": "label_values(kc_build_last_run_timestamp_seconds, kernel)",
317
+ "refId": "PrometheusVariableQueryEditor-kernel"
318
+ },
319
+ "refresh": 1,
320
+ "sort": 1,
321
+ "type": "query"
322
+ },
323
+ {
324
+ "current": {
325
+ "selected": true,
326
+ "text": [
327
+ "All"
328
+ ],
329
+ "value": [
330
+ "$__all"
331
+ ]
332
+ },
333
+ "datasource": {
334
+ "type": "prometheus",
335
+ "uid": "prometheus"
336
+ },
337
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, backend)",
338
+ "includeAll": true,
339
+ "label": "Backend",
340
+ "multi": true,
341
+ "name": "backend",
342
+ "options": [],
343
+ "query": {
344
+ "query": "label_values(kc_build_last_run_timestamp_seconds, backend)",
345
+ "refId": "PrometheusVariableQueryEditor-backend"
346
+ },
347
+ "refresh": 1,
348
+ "sort": 1,
349
+ "type": "query"
350
+ },
351
+ {
352
+ "current": {
353
+ "selected": true,
354
+ "text": [
355
+ "All"
356
+ ],
357
+ "value": [
358
+ "$__all"
359
+ ]
360
+ },
361
+ "datasource": {
362
+ "type": "prometheus",
363
+ "uid": "prometheus"
364
+ },
365
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, compute_backend)",
366
+ "includeAll": true,
367
+ "label": "Compute backend",
368
+ "multi": true,
369
+ "name": "compute_backend",
370
+ "options": [],
371
+ "query": {
372
+ "query": "label_values(kc_build_last_run_timestamp_seconds, compute_backend)",
373
+ "refId": "PrometheusVariableQueryEditor-compute_backend"
374
+ },
375
+ "refresh": 1,
376
+ "sort": 1,
377
+ "type": "query"
378
+ },
379
+ {
380
+ "current": {
381
+ "selected": true,
382
+ "text": [
383
+ "All"
384
+ ],
385
+ "value": [
386
+ "$__all"
387
+ ]
388
+ },
389
+ "datasource": {
390
+ "type": "prometheus",
391
+ "uid": "prometheus"
392
+ },
393
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, cuda_version)",
394
+ "includeAll": true,
395
+ "label": "CUDA",
396
+ "multi": true,
397
+ "name": "cuda_version",
398
+ "options": [],
399
+ "query": {
400
+ "query": "label_values(kc_build_last_run_timestamp_seconds, cuda_version)",
401
+ "refId": "PrometheusVariableQueryEditor-cuda_version"
402
+ },
403
+ "refresh": 1,
404
+ "sort": 1,
405
+ "type": "query"
406
+ },
407
+ {
408
+ "current": {
409
+ "selected": true,
410
+ "text": [
411
+ "All"
412
+ ],
413
+ "value": [
414
+ "$__all"
415
+ ]
416
+ },
417
+ "datasource": {
418
+ "type": "prometheus",
419
+ "uid": "prometheus"
420
+ },
421
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, pytorch_version)",
422
+ "includeAll": true,
423
+ "label": "PyTorch",
424
+ "multi": true,
425
+ "name": "pytorch_version",
426
+ "options": [],
427
+ "query": {
428
+ "query": "label_values(kc_build_last_run_timestamp_seconds, pytorch_version)",
429
+ "refId": "PrometheusVariableQueryEditor-pytorch_version"
430
+ },
431
+ "refresh": 1,
432
+ "sort": 1,
433
+ "type": "query"
434
+ },
435
+ {
436
+ "current": {
437
+ "selected": true,
438
+ "text": [
439
+ "All"
440
+ ],
441
+ "value": [
442
+ "$__all"
443
+ ]
444
+ },
445
+ "datasource": {
446
+ "type": "prometheus",
447
+ "uid": "prometheus"
448
+ },
449
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, python_version)",
450
+ "includeAll": true,
451
+ "label": "Python",
452
+ "multi": true,
453
+ "name": "python_version",
454
+ "options": [],
455
+ "query": {
456
+ "query": "label_values(kc_build_last_run_timestamp_seconds, python_version)",
457
+ "refId": "PrometheusVariableQueryEditor-python_version"
458
+ },
459
+ "refresh": 1,
460
+ "sort": 1,
461
+ "type": "query"
462
+ }
463
+ ]
464
+ },
465
+ "time": {
466
+ "from": "now-30d",
467
+ "to": "now"
468
+ },
469
+ "timezone": "browser",
470
+ "title": "Kernels Build Duration Trends",
471
+ "uid": "kernels-build-durations",
472
+ "version": 1,
473
+ "weekStart": ""
474
+ }
monitoring/grafana/dashboards/build-failure-overview.json ADDED
@@ -0,0 +1,500 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "annotations": {
3
+ "list": []
4
+ },
5
+ "editable": true,
6
+ "fiscalYearStartMonth": 0,
7
+ "graphTooltip": 1,
8
+ "links": [],
9
+ "panels": [
10
+ {
11
+ "datasource": {
12
+ "type": "prometheus",
13
+ "uid": "prometheus"
14
+ },
15
+ "fieldConfig": {
16
+ "defaults": {
17
+ "color": {
18
+ "mode": "thresholds"
19
+ },
20
+ "thresholds": {
21
+ "mode": "absolute",
22
+ "steps": [
23
+ {
24
+ "color": "green",
25
+ "value": null
26
+ },
27
+ {
28
+ "color": "red",
29
+ "value": 1
30
+ }
31
+ ]
32
+ },
33
+ "unit": "none"
34
+ },
35
+ "overrides": []
36
+ },
37
+ "gridPos": {
38
+ "h": 5,
39
+ "w": 8,
40
+ "x": 0,
41
+ "y": 0
42
+ },
43
+ "id": 1,
44
+ "options": {
45
+ "colorMode": "value",
46
+ "graphMode": "none",
47
+ "justifyMode": "auto",
48
+ "orientation": "auto",
49
+ "reduceOptions": {
50
+ "calcs": [
51
+ "lastNotNull"
52
+ ],
53
+ "fields": "",
54
+ "values": false
55
+ },
56
+ "textMode": "value"
57
+ },
58
+ "targets": [
59
+ {
60
+ "datasource": {
61
+ "type": "prometheus",
62
+ "uid": "prometheus"
63
+ },
64
+ "expr": "sum(kc_build_last_run_failed{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"})",
65
+ "instant": true,
66
+ "refId": "A"
67
+ }
68
+ ],
69
+ "title": "Failing combos",
70
+ "type": "stat"
71
+ },
72
+ {
73
+ "datasource": {
74
+ "type": "prometheus",
75
+ "uid": "prometheus"
76
+ },
77
+ "fieldConfig": {
78
+ "defaults": {
79
+ "color": {
80
+ "mode": "thresholds"
81
+ },
82
+ "thresholds": {
83
+ "mode": "absolute",
84
+ "steps": [
85
+ {
86
+ "color": "green",
87
+ "value": null
88
+ }
89
+ ]
90
+ },
91
+ "unit": "none"
92
+ },
93
+ "overrides": []
94
+ },
95
+ "gridPos": {
96
+ "h": 5,
97
+ "w": 8,
98
+ "x": 8,
99
+ "y": 0
100
+ },
101
+ "id": 2,
102
+ "options": {
103
+ "colorMode": "value",
104
+ "graphMode": "none",
105
+ "justifyMode": "auto",
106
+ "orientation": "auto",
107
+ "reduceOptions": {
108
+ "calcs": [
109
+ "lastNotNull"
110
+ ],
111
+ "fields": "",
112
+ "values": false
113
+ },
114
+ "textMode": "value"
115
+ },
116
+ "targets": [
117
+ {
118
+ "datasource": {
119
+ "type": "prometheus",
120
+ "uid": "prometheus"
121
+ },
122
+ "expr": "count(kc_build_last_run_failed{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"} == 1)",
123
+ "instant": true,
124
+ "refId": "A"
125
+ }
126
+ ],
127
+ "title": "Alerting series",
128
+ "type": "stat"
129
+ },
130
+ {
131
+ "datasource": {
132
+ "type": "prometheus",
133
+ "uid": "prometheus"
134
+ },
135
+ "fieldConfig": {
136
+ "defaults": {
137
+ "color": {
138
+ "mode": "thresholds"
139
+ },
140
+ "thresholds": {
141
+ "mode": "absolute",
142
+ "steps": [
143
+ {
144
+ "color": "green",
145
+ "value": null
146
+ },
147
+ {
148
+ "color": "orange",
149
+ "value": 6
150
+ },
151
+ {
152
+ "color": "red",
153
+ "value": 24
154
+ }
155
+ ]
156
+ },
157
+ "unit": "h"
158
+ },
159
+ "overrides": []
160
+ },
161
+ "gridPos": {
162
+ "h": 5,
163
+ "w": 8,
164
+ "x": 16,
165
+ "y": 0
166
+ },
167
+ "id": 3,
168
+ "options": {
169
+ "colorMode": "value",
170
+ "graphMode": "none",
171
+ "justifyMode": "auto",
172
+ "orientation": "auto",
173
+ "reduceOptions": {
174
+ "calcs": [
175
+ "lastNotNull"
176
+ ],
177
+ "fields": "",
178
+ "values": false
179
+ },
180
+ "textMode": "value"
181
+ },
182
+ "targets": [
183
+ {
184
+ "datasource": {
185
+ "type": "prometheus",
186
+ "uid": "prometheus"
187
+ },
188
+ "expr": "max((time() - kc_build_last_run_timestamp_seconds{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"}) / 3600)",
189
+ "instant": true,
190
+ "refId": "A"
191
+ }
192
+ ],
193
+ "title": "Oldest sample age",
194
+ "type": "stat"
195
+ },
196
+ {
197
+ "datasource": {
198
+ "type": "prometheus",
199
+ "uid": "prometheus"
200
+ },
201
+ "fieldConfig": {
202
+ "defaults": {
203
+ "color": {
204
+ "mode": "palette-classic"
205
+ },
206
+ "custom": {
207
+ "axisBorderShow": false,
208
+ "axisCenteredZero": false,
209
+ "drawStyle": "line",
210
+ "fillOpacity": 18,
211
+ "lineInterpolation": "stepAfter",
212
+ "lineWidth": 2,
213
+ "pointSize": 4,
214
+ "showPoints": "never",
215
+ "spanNulls": true
216
+ },
217
+ "max": 1,
218
+ "min": 0,
219
+ "unit": "none"
220
+ },
221
+ "overrides": []
222
+ },
223
+ "gridPos": {
224
+ "h": 8,
225
+ "w": 24,
226
+ "x": 0,
227
+ "y": 5
228
+ },
229
+ "id": 4,
230
+ "options": {
231
+ "legend": {
232
+ "displayMode": "table",
233
+ "placement": "bottom"
234
+ },
235
+ "tooltip": {
236
+ "mode": "multi"
237
+ }
238
+ },
239
+ "targets": [
240
+ {
241
+ "datasource": {
242
+ "type": "prometheus",
243
+ "uid": "prometheus"
244
+ },
245
+ "expr": "kc_build_last_run_failed{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"}",
246
+ "legendFormat": "{{kernel}} | {{backend}} | {{compute_backend}} | CUDA {{cuda_version}} | torch {{pytorch_version}} | py {{python_version}}",
247
+ "refId": "A"
248
+ }
249
+ ],
250
+ "title": "Failure state by combo",
251
+ "type": "timeseries"
252
+ },
253
+ {
254
+ "datasource": {
255
+ "type": "prometheus",
256
+ "uid": "prometheus"
257
+ },
258
+ "fieldConfig": {
259
+ "defaults": {
260
+ "color": {
261
+ "mode": "palette-classic"
262
+ },
263
+ "custom": {
264
+ "axisBorderShow": false,
265
+ "axisCenteredZero": false,
266
+ "drawStyle": "line",
267
+ "fillOpacity": 20,
268
+ "lineInterpolation": "stepAfter",
269
+ "lineWidth": 2,
270
+ "pointSize": 4,
271
+ "showPoints": "never",
272
+ "spanNulls": true
273
+ },
274
+ "max": 3,
275
+ "min": 0,
276
+ "unit": "none"
277
+ },
278
+ "overrides": []
279
+ },
280
+ "gridPos": {
281
+ "h": 8,
282
+ "w": 24,
283
+ "x": 0,
284
+ "y": 13
285
+ },
286
+ "id": 5,
287
+ "options": {
288
+ "legend": {
289
+ "displayMode": "table",
290
+ "placement": "bottom"
291
+ },
292
+ "tooltip": {
293
+ "mode": "multi"
294
+ }
295
+ },
296
+ "targets": [
297
+ {
298
+ "datasource": {
299
+ "type": "prometheus",
300
+ "uid": "prometheus"
301
+ },
302
+ "expr": "kc_build_last_run_result_code{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"}",
303
+ "legendFormat": "{{kernel}} | {{backend}} | {{compute_backend}} | CUDA {{cuda_version}} | torch {{pytorch_version}} | py {{python_version}}",
304
+ "refId": "A"
305
+ }
306
+ ],
307
+ "title": "Result code over time",
308
+ "type": "timeseries"
309
+ }
310
+ ],
311
+ "refresh": "5m",
312
+ "schemaVersion": 39,
313
+ "style": "dark",
314
+ "tags": [
315
+ "kernels-community",
316
+ "ci",
317
+ "failures"
318
+ ],
319
+ "templating": {
320
+ "list": [
321
+ {
322
+ "current": {
323
+ "selected": true,
324
+ "text": [
325
+ "All"
326
+ ],
327
+ "value": [
328
+ "$__all"
329
+ ]
330
+ },
331
+ "datasource": {
332
+ "type": "prometheus",
333
+ "uid": "prometheus"
334
+ },
335
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, kernel)",
336
+ "includeAll": true,
337
+ "label": "Kernel",
338
+ "multi": true,
339
+ "name": "kernel",
340
+ "options": [],
341
+ "query": {
342
+ "query": "label_values(kc_build_last_run_timestamp_seconds, kernel)",
343
+ "refId": "PrometheusVariableQueryEditor-kernel"
344
+ },
345
+ "refresh": 1,
346
+ "sort": 1,
347
+ "type": "query"
348
+ },
349
+ {
350
+ "current": {
351
+ "selected": true,
352
+ "text": [
353
+ "All"
354
+ ],
355
+ "value": [
356
+ "$__all"
357
+ ]
358
+ },
359
+ "datasource": {
360
+ "type": "prometheus",
361
+ "uid": "prometheus"
362
+ },
363
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, backend)",
364
+ "includeAll": true,
365
+ "label": "Backend",
366
+ "multi": true,
367
+ "name": "backend",
368
+ "options": [],
369
+ "query": {
370
+ "query": "label_values(kc_build_last_run_timestamp_seconds, backend)",
371
+ "refId": "PrometheusVariableQueryEditor-backend"
372
+ },
373
+ "refresh": 1,
374
+ "sort": 1,
375
+ "type": "query"
376
+ },
377
+ {
378
+ "current": {
379
+ "selected": true,
380
+ "text": [
381
+ "All"
382
+ ],
383
+ "value": [
384
+ "$__all"
385
+ ]
386
+ },
387
+ "datasource": {
388
+ "type": "prometheus",
389
+ "uid": "prometheus"
390
+ },
391
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, compute_backend)",
392
+ "includeAll": true,
393
+ "label": "Compute backend",
394
+ "multi": true,
395
+ "name": "compute_backend",
396
+ "options": [],
397
+ "query": {
398
+ "query": "label_values(kc_build_last_run_timestamp_seconds, compute_backend)",
399
+ "refId": "PrometheusVariableQueryEditor-compute_backend"
400
+ },
401
+ "refresh": 1,
402
+ "sort": 1,
403
+ "type": "query"
404
+ },
405
+ {
406
+ "current": {
407
+ "selected": true,
408
+ "text": [
409
+ "All"
410
+ ],
411
+ "value": [
412
+ "$__all"
413
+ ]
414
+ },
415
+ "datasource": {
416
+ "type": "prometheus",
417
+ "uid": "prometheus"
418
+ },
419
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, cuda_version)",
420
+ "includeAll": true,
421
+ "label": "CUDA",
422
+ "multi": true,
423
+ "name": "cuda_version",
424
+ "options": [],
425
+ "query": {
426
+ "query": "label_values(kc_build_last_run_timestamp_seconds, cuda_version)",
427
+ "refId": "PrometheusVariableQueryEditor-cuda_version"
428
+ },
429
+ "refresh": 1,
430
+ "sort": 1,
431
+ "type": "query"
432
+ },
433
+ {
434
+ "current": {
435
+ "selected": true,
436
+ "text": [
437
+ "All"
438
+ ],
439
+ "value": [
440
+ "$__all"
441
+ ]
442
+ },
443
+ "datasource": {
444
+ "type": "prometheus",
445
+ "uid": "prometheus"
446
+ },
447
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, pytorch_version)",
448
+ "includeAll": true,
449
+ "label": "PyTorch",
450
+ "multi": true,
451
+ "name": "pytorch_version",
452
+ "options": [],
453
+ "query": {
454
+ "query": "label_values(kc_build_last_run_timestamp_seconds, pytorch_version)",
455
+ "refId": "PrometheusVariableQueryEditor-pytorch_version"
456
+ },
457
+ "refresh": 1,
458
+ "sort": 1,
459
+ "type": "query"
460
+ },
461
+ {
462
+ "current": {
463
+ "selected": true,
464
+ "text": [
465
+ "All"
466
+ ],
467
+ "value": [
468
+ "$__all"
469
+ ]
470
+ },
471
+ "datasource": {
472
+ "type": "prometheus",
473
+ "uid": "prometheus"
474
+ },
475
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, python_version)",
476
+ "includeAll": true,
477
+ "label": "Python",
478
+ "multi": true,
479
+ "name": "python_version",
480
+ "options": [],
481
+ "query": {
482
+ "query": "label_values(kc_build_last_run_timestamp_seconds, python_version)",
483
+ "refId": "PrometheusVariableQueryEditor-python_version"
484
+ },
485
+ "refresh": 1,
486
+ "sort": 1,
487
+ "type": "query"
488
+ }
489
+ ]
490
+ },
491
+ "time": {
492
+ "from": "now-30d",
493
+ "to": "now"
494
+ },
495
+ "timezone": "browser",
496
+ "title": "Kernels Build Failure Overview",
497
+ "uid": "kernels-build-failures",
498
+ "version": 1,
499
+ "weekStart": ""
500
+ }
monitoring/grafana/dashboards/build-matrix-overview.json ADDED
@@ -0,0 +1,593 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "annotations": {
3
+ "list": []
4
+ },
5
+ "editable": true,
6
+ "fiscalYearStartMonth": 0,
7
+ "graphTooltip": 1,
8
+ "links": [],
9
+ "panels": [
10
+ {
11
+ "datasource": {
12
+ "type": "prometheus",
13
+ "uid": "prometheus"
14
+ },
15
+ "fieldConfig": {
16
+ "defaults": {
17
+ "color": {
18
+ "mode": "thresholds"
19
+ },
20
+ "thresholds": {
21
+ "mode": "absolute",
22
+ "steps": [
23
+ {
24
+ "color": "green",
25
+ "value": null
26
+ }
27
+ ]
28
+ },
29
+ "unit": "none"
30
+ },
31
+ "overrides": []
32
+ },
33
+ "gridPos": {
34
+ "h": 5,
35
+ "w": 6,
36
+ "x": 0,
37
+ "y": 0
38
+ },
39
+ "id": 1,
40
+ "options": {
41
+ "colorMode": "value",
42
+ "graphMode": "none",
43
+ "justifyMode": "auto",
44
+ "orientation": "auto",
45
+ "reduceOptions": {
46
+ "calcs": [
47
+ "lastNotNull"
48
+ ],
49
+ "fields": "",
50
+ "values": false
51
+ },
52
+ "textMode": "value"
53
+ },
54
+ "targets": [
55
+ {
56
+ "datasource": {
57
+ "type": "prometheus",
58
+ "uid": "prometheus"
59
+ },
60
+ "expr": "count(kc_build_last_run_timestamp_seconds{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"})",
61
+ "instant": true,
62
+ "refId": "A"
63
+ }
64
+ ],
65
+ "title": "Tracked combos",
66
+ "type": "stat"
67
+ },
68
+ {
69
+ "datasource": {
70
+ "type": "prometheus",
71
+ "uid": "prometheus"
72
+ },
73
+ "fieldConfig": {
74
+ "defaults": {
75
+ "color": {
76
+ "mode": "thresholds"
77
+ },
78
+ "thresholds": {
79
+ "mode": "absolute",
80
+ "steps": [
81
+ {
82
+ "color": "green",
83
+ "value": null
84
+ },
85
+ {
86
+ "color": "red",
87
+ "value": 1
88
+ }
89
+ ]
90
+ },
91
+ "unit": "none"
92
+ },
93
+ "overrides": []
94
+ },
95
+ "gridPos": {
96
+ "h": 5,
97
+ "w": 6,
98
+ "x": 6,
99
+ "y": 0
100
+ },
101
+ "id": 2,
102
+ "options": {
103
+ "colorMode": "value",
104
+ "graphMode": "none",
105
+ "justifyMode": "auto",
106
+ "orientation": "auto",
107
+ "reduceOptions": {
108
+ "calcs": [
109
+ "lastNotNull"
110
+ ],
111
+ "fields": "",
112
+ "values": false
113
+ },
114
+ "textMode": "value"
115
+ },
116
+ "targets": [
117
+ {
118
+ "datasource": {
119
+ "type": "prometheus",
120
+ "uid": "prometheus"
121
+ },
122
+ "expr": "sum(kc_build_last_run_failed{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"})",
123
+ "instant": true,
124
+ "refId": "A"
125
+ }
126
+ ],
127
+ "title": "Failing combos",
128
+ "type": "stat"
129
+ },
130
+ {
131
+ "datasource": {
132
+ "type": "prometheus",
133
+ "uid": "prometheus"
134
+ },
135
+ "fieldConfig": {
136
+ "defaults": {
137
+ "color": {
138
+ "mode": "thresholds"
139
+ },
140
+ "thresholds": {
141
+ "mode": "absolute",
142
+ "steps": [
143
+ {
144
+ "color": "green",
145
+ "value": null
146
+ }
147
+ ]
148
+ },
149
+ "unit": "none"
150
+ },
151
+ "overrides": []
152
+ },
153
+ "gridPos": {
154
+ "h": 5,
155
+ "w": 6,
156
+ "x": 12,
157
+ "y": 0
158
+ },
159
+ "id": 3,
160
+ "options": {
161
+ "colorMode": "value",
162
+ "graphMode": "none",
163
+ "justifyMode": "auto",
164
+ "orientation": "auto",
165
+ "reduceOptions": {
166
+ "calcs": [
167
+ "lastNotNull"
168
+ ],
169
+ "fields": "",
170
+ "values": false
171
+ },
172
+ "textMode": "value"
173
+ },
174
+ "targets": [
175
+ {
176
+ "datasource": {
177
+ "type": "prometheus",
178
+ "uid": "prometheus"
179
+ },
180
+ "expr": "count(kc_build_last_run_timestamp_seconds{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"}) - sum(kc_build_last_run_failed{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"})",
181
+ "instant": true,
182
+ "refId": "A"
183
+ }
184
+ ],
185
+ "title": "Healthy combos",
186
+ "type": "stat"
187
+ },
188
+ {
189
+ "datasource": {
190
+ "type": "prometheus",
191
+ "uid": "prometheus"
192
+ },
193
+ "fieldConfig": {
194
+ "defaults": {
195
+ "color": {
196
+ "mode": "thresholds"
197
+ },
198
+ "thresholds": {
199
+ "mode": "absolute",
200
+ "steps": [
201
+ {
202
+ "color": "green",
203
+ "value": null
204
+ },
205
+ {
206
+ "color": "orange",
207
+ "value": 6
208
+ },
209
+ {
210
+ "color": "red",
211
+ "value": 24
212
+ }
213
+ ]
214
+ },
215
+ "unit": "h"
216
+ },
217
+ "overrides": []
218
+ },
219
+ "gridPos": {
220
+ "h": 5,
221
+ "w": 6,
222
+ "x": 18,
223
+ "y": 0
224
+ },
225
+ "id": 4,
226
+ "options": {
227
+ "colorMode": "value",
228
+ "graphMode": "none",
229
+ "justifyMode": "auto",
230
+ "orientation": "auto",
231
+ "reduceOptions": {
232
+ "calcs": [
233
+ "lastNotNull"
234
+ ],
235
+ "fields": "",
236
+ "values": false
237
+ },
238
+ "textMode": "value"
239
+ },
240
+ "targets": [
241
+ {
242
+ "datasource": {
243
+ "type": "prometheus",
244
+ "uid": "prometheus"
245
+ },
246
+ "expr": "max((time() - kc_build_last_run_timestamp_seconds{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"}) / 3600)",
247
+ "instant": true,
248
+ "refId": "A"
249
+ }
250
+ ],
251
+ "title": "Oldest metric age",
252
+ "type": "stat"
253
+ },
254
+ {
255
+ "datasource": {
256
+ "type": "prometheus",
257
+ "uid": "prometheus"
258
+ },
259
+ "fieldConfig": {
260
+ "defaults": {
261
+ "color": {
262
+ "mode": "continuous-GrYlRd"
263
+ },
264
+ "unit": "s"
265
+ },
266
+ "overrides": []
267
+ },
268
+ "gridPos": {
269
+ "h": 8,
270
+ "w": 8,
271
+ "x": 0,
272
+ "y": 5
273
+ },
274
+ "id": 5,
275
+ "options": {
276
+ "displayMode": "gradient",
277
+ "orientation": "horizontal",
278
+ "reduceOptions": {
279
+ "calcs": [
280
+ "lastNotNull"
281
+ ],
282
+ "fields": "",
283
+ "values": false
284
+ },
285
+ "showUnfilled": true
286
+ },
287
+ "targets": [
288
+ {
289
+ "datasource": {
290
+ "type": "prometheus",
291
+ "uid": "prometheus"
292
+ },
293
+ "expr": "kc_build_last_run_duration_seconds{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"}",
294
+ "instant": true,
295
+ "legendFormat": "{{kernel}} | {{backend}} | {{compute_backend}} | CUDA {{cuda_version}} | torch {{pytorch_version}} | py {{python_version}}",
296
+ "refId": "A"
297
+ }
298
+ ],
299
+ "title": "Current duration by combo",
300
+ "type": "bargauge"
301
+ },
302
+ {
303
+ "datasource": {
304
+ "type": "prometheus",
305
+ "uid": "prometheus"
306
+ },
307
+ "fieldConfig": {
308
+ "defaults": {
309
+ "color": {
310
+ "mode": "palette-classic"
311
+ },
312
+ "custom": {
313
+ "axisBorderShow": false,
314
+ "axisCenteredZero": false,
315
+ "drawStyle": "line",
316
+ "fillOpacity": 20,
317
+ "lineInterpolation": "stepAfter",
318
+ "lineWidth": 2,
319
+ "pointSize": 4,
320
+ "showPoints": "never",
321
+ "spanNulls": true
322
+ },
323
+ "max": 3,
324
+ "min": 0,
325
+ "unit": "none"
326
+ },
327
+ "overrides": []
328
+ },
329
+ "gridPos": {
330
+ "h": 8,
331
+ "w": 16,
332
+ "x": 8,
333
+ "y": 5
334
+ },
335
+ "id": 6,
336
+ "options": {
337
+ "legend": {
338
+ "displayMode": "list",
339
+ "placement": "bottom"
340
+ },
341
+ "tooltip": {
342
+ "mode": "multi"
343
+ }
344
+ },
345
+ "targets": [
346
+ {
347
+ "datasource": {
348
+ "type": "prometheus",
349
+ "uid": "prometheus"
350
+ },
351
+ "expr": "kc_build_last_run_result_code{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"}",
352
+ "legendFormat": "{{kernel}} | {{backend}} | {{compute_backend}} | CUDA {{cuda_version}} | torch {{pytorch_version}} | py {{python_version}}",
353
+ "refId": "A"
354
+ }
355
+ ],
356
+ "title": "Latest result code over time",
357
+ "type": "timeseries"
358
+ },
359
+ {
360
+ "datasource": {
361
+ "type": "prometheus",
362
+ "uid": "prometheus"
363
+ },
364
+ "fieldConfig": {
365
+ "defaults": {
366
+ "color": {
367
+ "mode": "continuous-BlYlRd"
368
+ },
369
+ "unit": "s"
370
+ },
371
+ "overrides": []
372
+ },
373
+ "gridPos": {
374
+ "h": 8,
375
+ "w": 24,
376
+ "x": 0,
377
+ "y": 13
378
+ },
379
+ "id": 7,
380
+ "options": {
381
+ "legend": {
382
+ "displayMode": "list",
383
+ "placement": "bottom"
384
+ },
385
+ "tooltip": {
386
+ "mode": "multi"
387
+ }
388
+ },
389
+ "targets": [
390
+ {
391
+ "datasource": {
392
+ "type": "prometheus",
393
+ "uid": "prometheus"
394
+ },
395
+ "expr": "kc_build_last_run_duration_seconds{kernel=~\"${kernel:regex}\",backend=~\"${backend:regex}\",compute_backend=~\"${compute_backend:regex}\",cuda_version=~\"${cuda_version:regex}\",pytorch_version=~\"${pytorch_version:regex}\",python_version=~\"${python_version:regex}\"}",
396
+ "legendFormat": "{{kernel}} | {{backend}} | {{compute_backend}} | CUDA {{cuda_version}} | torch {{pytorch_version}} | py {{python_version}}",
397
+ "refId": "A"
398
+ }
399
+ ],
400
+ "title": "Duration history",
401
+ "type": "timeseries"
402
+ }
403
+ ],
404
+ "refresh": "5m",
405
+ "schemaVersion": 39,
406
+ "style": "dark",
407
+ "tags": [
408
+ "kernels-community",
409
+ "ci",
410
+ "matrix"
411
+ ],
412
+ "templating": {
413
+ "list": [
414
+ {
415
+ "current": {
416
+ "selected": true,
417
+ "text": [
418
+ "All"
419
+ ],
420
+ "value": [
421
+ "$__all"
422
+ ]
423
+ },
424
+ "datasource": {
425
+ "type": "prometheus",
426
+ "uid": "prometheus"
427
+ },
428
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, kernel)",
429
+ "includeAll": true,
430
+ "label": "Kernel",
431
+ "multi": true,
432
+ "name": "kernel",
433
+ "options": [],
434
+ "query": {
435
+ "query": "label_values(kc_build_last_run_timestamp_seconds, kernel)",
436
+ "refId": "PrometheusVariableQueryEditor-kernel"
437
+ },
438
+ "refresh": 1,
439
+ "sort": 1,
440
+ "type": "query"
441
+ },
442
+ {
443
+ "current": {
444
+ "selected": true,
445
+ "text": [
446
+ "All"
447
+ ],
448
+ "value": [
449
+ "$__all"
450
+ ]
451
+ },
452
+ "datasource": {
453
+ "type": "prometheus",
454
+ "uid": "prometheus"
455
+ },
456
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, backend)",
457
+ "includeAll": true,
458
+ "label": "Backend",
459
+ "multi": true,
460
+ "name": "backend",
461
+ "options": [],
462
+ "query": {
463
+ "query": "label_values(kc_build_last_run_timestamp_seconds, backend)",
464
+ "refId": "PrometheusVariableQueryEditor-backend"
465
+ },
466
+ "refresh": 1,
467
+ "sort": 1,
468
+ "type": "query"
469
+ },
470
+ {
471
+ "current": {
472
+ "selected": true,
473
+ "text": [
474
+ "All"
475
+ ],
476
+ "value": [
477
+ "$__all"
478
+ ]
479
+ },
480
+ "datasource": {
481
+ "type": "prometheus",
482
+ "uid": "prometheus"
483
+ },
484
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, compute_backend)",
485
+ "includeAll": true,
486
+ "label": "Compute backend",
487
+ "multi": true,
488
+ "name": "compute_backend",
489
+ "options": [],
490
+ "query": {
491
+ "query": "label_values(kc_build_last_run_timestamp_seconds, compute_backend)",
492
+ "refId": "PrometheusVariableQueryEditor-compute_backend"
493
+ },
494
+ "refresh": 1,
495
+ "sort": 1,
496
+ "type": "query"
497
+ },
498
+ {
499
+ "current": {
500
+ "selected": true,
501
+ "text": [
502
+ "All"
503
+ ],
504
+ "value": [
505
+ "$__all"
506
+ ]
507
+ },
508
+ "datasource": {
509
+ "type": "prometheus",
510
+ "uid": "prometheus"
511
+ },
512
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, cuda_version)",
513
+ "includeAll": true,
514
+ "label": "CUDA",
515
+ "multi": true,
516
+ "name": "cuda_version",
517
+ "options": [],
518
+ "query": {
519
+ "query": "label_values(kc_build_last_run_timestamp_seconds, cuda_version)",
520
+ "refId": "PrometheusVariableQueryEditor-cuda_version"
521
+ },
522
+ "refresh": 1,
523
+ "sort": 1,
524
+ "type": "query"
525
+ },
526
+ {
527
+ "current": {
528
+ "selected": true,
529
+ "text": [
530
+ "All"
531
+ ],
532
+ "value": [
533
+ "$__all"
534
+ ]
535
+ },
536
+ "datasource": {
537
+ "type": "prometheus",
538
+ "uid": "prometheus"
539
+ },
540
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, pytorch_version)",
541
+ "includeAll": true,
542
+ "label": "PyTorch",
543
+ "multi": true,
544
+ "name": "pytorch_version",
545
+ "options": [],
546
+ "query": {
547
+ "query": "label_values(kc_build_last_run_timestamp_seconds, pytorch_version)",
548
+ "refId": "PrometheusVariableQueryEditor-pytorch_version"
549
+ },
550
+ "refresh": 1,
551
+ "sort": 1,
552
+ "type": "query"
553
+ },
554
+ {
555
+ "current": {
556
+ "selected": true,
557
+ "text": [
558
+ "All"
559
+ ],
560
+ "value": [
561
+ "$__all"
562
+ ]
563
+ },
564
+ "datasource": {
565
+ "type": "prometheus",
566
+ "uid": "prometheus"
567
+ },
568
+ "definition": "label_values(kc_build_last_run_timestamp_seconds, python_version)",
569
+ "includeAll": true,
570
+ "label": "Python",
571
+ "multi": true,
572
+ "name": "python_version",
573
+ "options": [],
574
+ "query": {
575
+ "query": "label_values(kc_build_last_run_timestamp_seconds, python_version)",
576
+ "refId": "PrometheusVariableQueryEditor-python_version"
577
+ },
578
+ "refresh": 1,
579
+ "sort": 1,
580
+ "type": "query"
581
+ }
582
+ ]
583
+ },
584
+ "time": {
585
+ "from": "now-30d",
586
+ "to": "now"
587
+ },
588
+ "timezone": "browser",
589
+ "title": "Kernels Build Matrix Overview",
590
+ "uid": "kernels-build-matrix",
591
+ "version": 1,
592
+ "weekStart": ""
593
+ }
monitoring/grafana/provisioning/dashboards/dashboards.yml ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ apiVersion: 1
2
+
3
+ providers:
4
+ - name: kernels-community
5
+ orgId: 1
6
+ folder: Kernels Community
7
+ type: file
8
+ disableDeletion: false
9
+ editable: true
10
+ options:
11
+ path: /var/lib/grafana/dashboards
monitoring/grafana/provisioning/datasources/prometheus.yml ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ apiVersion: 1
2
+
3
+ datasources:
4
+ - name: Prometheus
5
+ uid: prometheus
6
+ type: prometheus
7
+ access: proxy
8
+ url: http://prometheus:9090
9
+ isDefault: true
10
+ editable: false
monitoring/prometheus/prometheus.yml ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ global:
2
+ scrape_interval: 15s
3
+ evaluation_interval: 15s
4
+
5
+ rule_files:
6
+ - /etc/prometheus/rules/*.yml
7
+
8
+ scrape_configs:
9
+ - job_name: prometheus
10
+ static_configs:
11
+ - targets:
12
+ - prometheus:9090
13
+
14
+ - job_name: pushgateway
15
+ honor_labels: true
16
+ static_configs:
17
+ - targets:
18
+ - pushgateway:9091
monitoring/prometheus/rules/build-alerts.yml ADDED
@@ -0,0 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ groups:
2
+ - name: kernels-community-build-alerts
3
+ rules:
4
+ - alert: KernelsBuildMatrixComboFailing
5
+ expr: kc_build_last_run_failed == 1
6
+ for: 10m
7
+ labels:
8
+ severity: warning
9
+ annotations:
10
+ summary: "Kernel build matrix combo failing"
11
+ description: "{{ $labels.kernel }} backend={{ $labels.backend }} compute={{ $labels.compute_backend }} cuda={{ $labels.cuda_version }} torch={{ $labels.pytorch_version }} python={{ $labels.python_version }} is currently failing."
12
+
13
+ - alert: KernelsBuildMetricsStale
14
+ expr: (time() - kc_build_last_run_timestamp_seconds) > 86400
15
+ for: 30m
16
+ labels:
17
+ severity: warning
18
+ annotations:
19
+ summary: "Kernel build metrics stale"
20
+ description: "{{ $labels.kernel }} backend={{ $labels.backend }} compute={{ $labels.compute_backend }} cuda={{ $labels.cuda_version }} torch={{ $labels.pytorch_version }} python={{ $labels.python_version }} has not pushed fresh metrics for more than 24 hours."
requirements-dev.txt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ -r requirements.txt
2
+ pytest>=8.3,<9
3
+ ruff>=0.11,<0.12
requirements.txt ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ gradio>=6.10,<7
2
+ httpx>=0.27,<1
3
+ pydantic>=2.7,<3
4
+ PyYAML>=6.0,<7
5
+ cachetools>=5.3,<6
6
+ python-dateutil>=2.9,<3
7
+ python-dotenv>=1.0,<2
8
+ huggingface_hub>=0.30,<1
9
+ beautifulsoup4>=4.14,<5
scripts/bootstrap_space.py ADDED
@@ -0,0 +1,150 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ruff: noqa: E402
2
+ from __future__ import annotations
3
+
4
+ import argparse
5
+ import os
6
+ import subprocess
7
+ import sys
8
+ from pathlib import Path
9
+
10
+ from dotenv import load_dotenv
11
+ from huggingface_hub import HfApi
12
+ from huggingface_hub.utils import get_token
13
+
14
+
15
# Resolve the repository root relative to this script's location.
ROOT_DIR = Path(__file__).resolve().parents[1]
SRC_DIR = ROOT_DIR / "src"
# Make the in-repo `kc_monitor` package importable when this script is run
# directly; the late import below is covered by the file-level `# ruff: noqa: E402`.
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

from kc_monitor.config import load_config
21
+
22
+
23
+ def _cached_github_token() -> str | None:
24
+ try:
25
+ completed = subprocess.run(
26
+ ["gh", "auth", "token"],
27
+ capture_output=True,
28
+ text=True,
29
+ check=True,
30
+ )
31
+ except (OSError, subprocess.CalledProcessError):
32
+ return None
33
+ token = completed.stdout.strip()
34
+ return token or None
35
+
36
+
37
def build_parser() -> argparse.ArgumentParser:
    """Build the command-line parser for the Space bootstrap script."""
    parser = argparse.ArgumentParser(description="Create or update the Kernels Community monitor Space.")
    parser.add_argument(
        "--space-id",
        default=os.getenv("KCM_SPACE_ID", "adarshxs/kernels-community-monitor"),
        help="Target Hugging Face Space repo ID.",
    )
    # All remaining options are boolean opt-in flags; declare them data-driven.
    boolean_flags = (
        ("--private", "Create the Space as private if it does not already exist."),
        ("--skip-secret", "Do not update the GITHUB_TOKEN Space secret."),
        ("--skip-variables", "Do not update Space variables/settings."),
        (
            "--create-pr",
            "Open a Hub pull request instead of pushing directly when write access is unavailable.",
        ),
    )
    for flag, help_text in boolean_flags:
        parser.add_argument(flag, action="store_true", help=help_text)
    return parser
65
+
66
+
67
def main() -> int:
    """Provision the Hugging Face Space hosting the monitor UI.

    Creates the Space if needed, syncs the GITHUB_TOKEN secret and the
    KCM_* runtime variables from ``config/monitor.yaml``, then uploads the
    working tree. Returns 0 on success (argparse errors exit via
    ``parser.error``).
    """
    load_dotenv()
    parser = build_parser()
    args = parser.parse_args()

    # Prefer explicit environment tokens, then fall back to locally cached logins.
    hf_token = os.getenv("HF_TOKEN") or get_token()
    github_token = os.getenv("GITHUB_TOKEN") or os.getenv("GH_TOKEN") or _cached_github_token()
    if not hf_token:
        parser.error("HF_TOKEN must be set in the environment or available from a local Hugging Face login.")

    config = load_config(ROOT_DIR / "config" / "monitor.yaml")
    api = HfApi(token=hf_token)

    # Idempotent: exist_ok=True turns re-runs into updates instead of errors.
    api.create_repo(
        repo_id=args.space_id,
        repo_type="space",
        space_sdk="gradio",
        private=args.private,
        exist_ok=True,
    )

    # The Space polls GitHub, so it needs the token stored as a Space secret.
    if github_token and not args.skip_secret:
        api.add_space_secret(repo_id=args.space_id, key="GITHUB_TOKEN", value=github_token)

    if not args.skip_variables:
        # Non-secret monitor settings mirrored from config/monitor.yaml.
        github_vars = {
            "KCM_GITHUB_OWNER": config.github.owner,
            "KCM_GITHUB_REPO": config.github.repo,
            "KCM_GITHUB_BRANCH": config.github.branch,
            "KCM_REFRESH_INTERVAL_SECONDS": str(config.monitor.refresh_interval_seconds),
            "KCM_WORKFLOW_RUN_PAGE_SIZE": str(config.monitor.workflow_run_page_size),
            "KCM_WORKFLOW_RUN_PAGES": str(config.monitor.workflow_run_pages),
        }
        if config.monitor.critical_kernels:
            github_vars["KCM_CRITICAL_KERNELS"] = ",".join(config.monitor.critical_kernels)
        for key, value in github_vars.items():
            # Skip unset config entries so the Space-side defaults apply.
            if value:
                api.add_space_variable(repo_id=args.space_id, key=key, value=value)

        grafana_vars = {
            "KCM_GRAFANA_BASE_URL": config.grafana.base_url,
            "KCM_GRAFANA_ORG_ID": str(config.grafana.org_id),
            "KCM_GRAFANA_THEME": config.grafana.theme,
            "KCM_GRAFANA_OVERVIEW_UID": config.grafana.overview_dashboard_uid,
            "KCM_GRAFANA_DURATION_UID": config.grafana.duration_dashboard_uid,
            "KCM_GRAFANA_FAILURE_UID": config.grafana.failure_dashboard_uid,
            "KCM_PROMETHEUS_BASE_URL": config.prometheus.base_url,
            "KCM_PUSHGATEWAY_URL": config.pushgateway.url,
            "KCM_PUSHGATEWAY_JOB_NAME": config.pushgateway.job_name,
        }
        for key, value in grafana_vars.items():
            if value:
                api.add_space_variable(repo_id=args.space_id, key=key, value=value)

    # Upload the repo, excluding local-only artifacts and the .env secrets file.
    api.upload_folder(
        repo_id=args.space_id,
        repo_type="space",
        folder_path=str(ROOT_DIR),
        create_pr=args.create_pr,
        ignore_patterns=[
            ".env",
            ".git",
            ".git/*",
            ".venv/*",
            "venv/*",
            "__pycache__/*",
            ".pytest_cache/*",
            ".ruff_cache/*",
            "*.log",
        ],
    )

    print(f"Space URL: https://huggingface.co/spaces/{args.space_id}")
    try:
        runtime = api.get_space_runtime(repo_id=args.space_id)
        print(f"Runtime stage: {runtime.stage}")
        print(f"Hardware: {runtime.hardware}")
    except Exception:
        # Best-effort status report; the Space may still be provisioning.
        print("Runtime not yet available (Space is provisioning).")
    return 0
147
+
148
+
149
if __name__ == "__main__":
    # SystemExit propagates main()'s integer return as the process exit code.
    raise SystemExit(main())
scripts/push_build_metrics.py ADDED
@@ -0,0 +1,48 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ruff: noqa: E402
2
+ from __future__ import annotations
3
+
4
+ import os
5
+ import sys
6
+ from pathlib import Path
7
+
8
+ from dotenv import load_dotenv
9
+
10
+
11
+ ROOT_DIR = Path(__file__).resolve().parents[1]
12
+ SRC_DIR = ROOT_DIR / "src"
13
+ if str(SRC_DIR) not in sys.path:
14
+ sys.path.insert(0, str(SRC_DIR))
15
+
16
+ from kc_monitor.config import load_config
17
+ from kc_monitor.metrics_push import BuildMetricSample, push_build_metrics
18
+
19
+
20
def main() -> int:
    """Push one build-matrix metric sample to the configured Pushgateway.

    Reads the build outcome from environment variables, resolves the
    Pushgateway endpoint (environment wins over the YAML config), and exits
    with a clear error when no endpoint is configured at all.
    """
    load_dotenv()
    config = load_config(ROOT_DIR / "config" / "monitor.yaml")

    # Environment override takes precedence over the YAML-configured URL.
    gateway = os.getenv("PUSHGATEWAY_URL") or config.pushgateway.url
    if not gateway:
        raise SystemExit("Pushgateway URL is required via PUSHGATEWAY_URL or KCM_PUSHGATEWAY_URL.")

    job = os.getenv("KCM_PUSHGATEWAY_JOB_NAME") or config.pushgateway.job_name
    sample = BuildMetricSample.from_env(os.environ)
    destination = push_build_metrics(
        sample,
        pushgateway_url=gateway,
        job_name=job,
    )

    print(f"Pushed metrics to {destination}")
    print(f"Matrix combo: {sample.grouping_key}")
    print(
        "Outcome:"
        f" result={sample.result}"
        f" result_code={sample.result_code}"
        f" failed={sample.failed}"
        f" duration_seconds={sample.duration_seconds:.3f}"
    )
    return 0
45
+
46
+
47
+ if __name__ == "__main__":
48
+ raise SystemExit(main())
scripts/smoke_check.py ADDED
@@ -0,0 +1,63 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ruff: noqa: E402
2
+ from __future__ import annotations
3
+
4
+ import sys
5
+ from pathlib import Path
6
+
7
+
8
+ ROOT_DIR = Path(__file__).resolve().parents[1]
9
+ SRC_DIR = ROOT_DIR / "src"
10
+ if str(SRC_DIR) not in sys.path:
11
+ sys.path.insert(0, str(SRC_DIR))
12
+
13
+ from kc_monitor.config import load_config
14
+ from kc_monitor.grafana import build_dashboard_url, dashboard_catalog
15
+ from kc_monitor.service import MonitorService
16
+
17
+
18
def main() -> int:
    """Run a one-shot health probe of the monitor stack and print a summary.

    Fetches a fresh snapshot, prints the first ten kernel rows, reports the
    Grafana/Prometheus/Pushgateway wiring, and lists the dashboard catalog.
    Returns 1 only when the snapshot produced errors *and* no rows at all.
    """
    config = load_config(ROOT_DIR / "config" / "monitor.yaml")
    service = MonitorService(config)
    try:
        snapshot = service.get_snapshot(force_refresh=True)
    finally:
        # Always release HTTP clients, even when the refresh failed.
        service.close()

    print(f"Generated at: {snapshot.generated_at.isoformat()}")
    print(
        "Summary:"
        f" tracked={snapshot.summary.tracked_kernels}"
        f" active={snapshot.summary.active_builds}"
        f" uploading={snapshot.summary.uploading_builds}"
        f" failed={snapshot.summary.failed_builds}"
    )
    for row in snapshot.kernel_rows[:10]:
        primary = row.primary_group
        run_url = primary.run.html_url if primary else "n/a"
        print(
            f"- {row.kernel_name:20}"
            f" status={row.row_status_label:10}"
            f" runs={row.recent_run_count:2}"
            f" run={run_url}"
        )

    print(f"Grafana enabled: {config.grafana.enabled}")
    print(f"Grafana base URL: {config.grafana.base_url or 'not configured'}")
    print(f"Prometheus base URL: {config.prometheus.base_url or 'not configured'}")
    print(f"Pushgateway URL: {config.pushgateway.url or 'not configured'}")

    for dashboard in dashboard_catalog(config.grafana):
        view_url = build_dashboard_url(config.grafana, dashboard.uid, embed=False) or "not configured"
        print(
            f"- {dashboard.title:18}"
            f" uid={dashboard.uid:24}"
            f" view={view_url}"
        )

    return 1 if snapshot.errors and not snapshot.kernel_rows else 0
60
+
61
+
62
+ if __name__ == "__main__":
63
+ raise SystemExit(main())
src/kc_monitor/__init__.py ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ """Kernels Community Monitor package."""
2
+
3
+ __all__ = ["__version__"]
4
+
5
+ __version__ = "0.2.0"
src/kc_monitor/config.py ADDED
@@ -0,0 +1,190 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import os
4
+ from pathlib import Path
5
+ from typing import Any, Literal
6
+
7
+ import yaml
8
+ from dotenv import load_dotenv
9
+ from pydantic import BaseModel, ConfigDict, Field
10
+
11
+ from kc_monitor.models import WorkflowTarget
12
+
13
+
14
+ ROOT_DIR = Path(__file__).resolve().parents[2]
15
+ DEFAULT_CONFIG_PATH = ROOT_DIR / "config" / "monitor.yaml"
16
+
17
+
18
class GitHubSettings(BaseModel):
    """Connection and lookup settings for the monitored GitHub repository."""

    model_config = ConfigDict(extra="ignore")

    owner: str = "huggingface"
    repo: str = "kernels-community"
    branch: str = "main"
    per_page: int = 30
    request_timeout_seconds: float = 25.0
    user_agent: str = "kernels-community-monitor/0.1"
    token: str | None = None
    workflows: list[WorkflowTarget] = Field(default_factory=list)

    @property
    def repo_slug(self) -> str:
        """Canonical ``owner/repo`` identifier for the repository."""
        return f"{self.owner}/{self.repo}"
33
+
34
+
35
class MonitorSettings(BaseModel):
    """Tuning knobs for snapshot refresh, log trimming, and stall detection."""

    model_config = ConfigDict(extra="ignore")

    refresh_interval_seconds: int = 120
    snapshot_ttl_seconds: int = 45
    workflow_run_page_size: int = 100
    workflow_run_pages: int = 10
    recent_completed_hours: int = 72
    recent_limit: int = 40
    completed_runs_per_workflow: int = 3
    log_line_limit: int = 400
    log_char_limit: int = 35000
    detail_event_limit: int = 25
    stall_without_log_minutes: int = 45
    stall_active_phase_minutes: int = 180
    critical_kernels: list[str] = Field(default_factory=list)

    @property
    def critical_kernel_set(self) -> set[str]:
        """Critical kernel names, whitespace-trimmed with blanks dropped."""
        trimmed = (name.strip() for name in self.critical_kernels)
        return {name for name in trimmed if name}
55
+
56
+
57
class GrafanaSettings(BaseModel):
    """Grafana linking/embedding options; disabled until a base URL is set."""

    model_config = ConfigDict(extra="ignore")

    base_url: str | None = None
    org_id: int = 1
    theme: Literal["dark", "light"] = "dark"
    default_from: str = "now-30d"
    default_to: str = "now"
    default_refresh: str = "5m"
    overview_dashboard_uid: str = "kernels-build-matrix"
    duration_dashboard_uid: str = "kernels-build-durations"
    failure_dashboard_uid: str = "kernels-build-failures"

    @property
    def enabled(self) -> bool:
        """True once a non-empty base URL has been configured."""
        return self.base_url is not None and self.base_url != ""
73
+
74
+
75
class PrometheusSettings(BaseModel):
    """Location of the Prometheus instance backing the dashboards (optional)."""

    model_config = ConfigDict(extra="ignore")

    base_url: str | None = None
79
+
80
+
81
class PushgatewaySettings(BaseModel):
    """Prometheus Pushgateway endpoint and job name for build metrics."""

    model_config = ConfigDict(extra="ignore")

    url: str | None = None
    job_name: str = "kernels-community-build-matrix"
86
+
87
+
88
class AppConfig(BaseModel):
    """Top-level application configuration, one attribute per subsystem."""

    model_config = ConfigDict(extra="ignore")

    github: GitHubSettings = Field(default_factory=GitHubSettings)
    monitor: MonitorSettings = Field(default_factory=MonitorSettings)
    grafana: GrafanaSettings = Field(default_factory=GrafanaSettings)
    prometheus: PrometheusSettings = Field(default_factory=PrometheusSettings)
    pushgateway: PushgatewaySettings = Field(default_factory=PushgatewaySettings)

    @property
    def workflow_targets(self) -> list[WorkflowTarget]:
        """Only the workflows currently enabled for monitoring."""
        return [target for target in self.github.workflows if target.enabled]
100
+
101
+
102
+ def _deep_merge(base: dict[str, Any], updates: dict[str, Any]) -> dict[str, Any]:
103
+ merged = dict(base)
104
+ for key, value in updates.items():
105
+ if isinstance(value, dict) and isinstance(merged.get(key), dict):
106
+ merged[key] = _deep_merge(merged[key], value)
107
+ else:
108
+ merged[key] = value
109
+ return merged
110
+
111
+
112
def _load_yaml(path: Path) -> dict[str, Any]:
    """Parse *path* as YAML, returning {} for a missing or empty file."""
    if not path.exists():
        return {}
    loaded = yaml.safe_load(path.read_text(encoding="utf-8"))
    return loaded or {}
117
+
118
+
119
+ def _csv_env(name: str) -> list[str] | None:
120
+ raw = os.getenv(name)
121
+ if raw is None:
122
+ return None
123
+ return [item.strip() for item in raw.split(",") if item.strip()]
124
+
125
+
126
def _env_overrides() -> dict[str, Any]:
    """Collect environment-variable overrides grouped by config section.

    Unset variables are dropped, so the result can be deep-merged over the
    YAML config without clobbering file-provided values. ``GITHUB_TOKEN``
    (or ``GH_TOKEN``) feeds the GitHub token, and ``KCM_CRITICAL_KERNELS``
    is parsed as a comma-separated list.
    """
    sections: dict[str, dict[str, Any]] = {
        "github": {
            "owner": os.getenv("KCM_GITHUB_OWNER"),
            "repo": os.getenv("KCM_GITHUB_REPO"),
            "branch": os.getenv("KCM_GITHUB_BRANCH"),
            "token": os.getenv("GITHUB_TOKEN") or os.getenv("GH_TOKEN"),
        },
        "monitor": {
            "refresh_interval_seconds": os.getenv("KCM_REFRESH_INTERVAL_SECONDS"),
            "snapshot_ttl_seconds": os.getenv("KCM_SNAPSHOT_TTL_SECONDS"),
            "workflow_run_page_size": os.getenv("KCM_WORKFLOW_RUN_PAGE_SIZE"),
            "workflow_run_pages": os.getenv("KCM_WORKFLOW_RUN_PAGES"),
            "recent_completed_hours": os.getenv("KCM_RECENT_COMPLETED_HOURS"),
            "recent_limit": os.getenv("KCM_RECENT_LIMIT"),
            "completed_runs_per_workflow": os.getenv("KCM_COMPLETED_RUNS_PER_WORKFLOW"),
            "log_line_limit": os.getenv("KCM_LOG_LINE_LIMIT"),
            "log_char_limit": os.getenv("KCM_LOG_CHAR_LIMIT"),
            "detail_event_limit": os.getenv("KCM_DETAIL_EVENT_LIMIT"),
            "stall_without_log_minutes": os.getenv("KCM_STALL_WITHOUT_LOG_MINUTES"),
            "stall_active_phase_minutes": os.getenv("KCM_STALL_ACTIVE_PHASE_MINUTES"),
            "critical_kernels": _csv_env("KCM_CRITICAL_KERNELS"),
        },
        "grafana": {
            "base_url": os.getenv("KCM_GRAFANA_BASE_URL"),
            "org_id": os.getenv("KCM_GRAFANA_ORG_ID"),
            "theme": os.getenv("KCM_GRAFANA_THEME"),
            "default_from": os.getenv("KCM_GRAFANA_FROM"),
            "default_to": os.getenv("KCM_GRAFANA_TO"),
            "default_refresh": os.getenv("KCM_GRAFANA_REFRESH"),
            "overview_dashboard_uid": os.getenv("KCM_GRAFANA_OVERVIEW_UID"),
            "duration_dashboard_uid": os.getenv("KCM_GRAFANA_DURATION_UID"),
            "failure_dashboard_uid": os.getenv("KCM_GRAFANA_FAILURE_UID"),
        },
        "prometheus": {
            "base_url": os.getenv("KCM_PROMETHEUS_BASE_URL"),
        },
        "pushgateway": {
            "url": os.getenv("KCM_PUSHGATEWAY_URL"),
            "job_name": os.getenv("KCM_PUSHGATEWAY_JOB_NAME"),
        },
    }
    # One generic None-filter pass instead of five copy-pasted comprehensions.
    return {
        section: {key: value for key, value in values.items() if value is not None}
        for section, values in sections.items()
    }
183
+
184
+
185
def load_config(config_path: str | Path | None = None) -> AppConfig:
    """Build the application config from YAML plus environment overrides.

    ``.env`` is loaded first so environment overrides see dotenv-provided
    values; a falsy *config_path* falls back to the bundled default path.
    """
    load_dotenv()
    target = Path(config_path) if config_path else DEFAULT_CONFIG_PATH
    file_settings = _load_yaml(target)
    return AppConfig.model_validate(_deep_merge(file_settings, _env_overrides()))
src/kc_monitor/github_client.py ADDED
@@ -0,0 +1,456 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import base64
4
+ import html
5
+ import json
6
+ import re
7
+ import subprocess
8
+ from typing import Any
9
+
10
+ from bs4 import BeautifulSoup
11
+ import httpx
12
+
13
+ from kc_monitor.models import GitHubJob, GitHubRun, parse_github_datetime, utcnow
14
+
15
+
16
class GitHubActionsError(RuntimeError):
    """Signals an unexpected or failing response from the GitHub API."""
18
+
19
+
20
+ class GitHubActionsClient:
21
+ def __init__(
22
+ self,
23
+ owner: str,
24
+ repo: str,
25
+ token: str | None = None,
26
+ request_timeout_seconds: float = 25.0,
27
+ user_agent: str = "kernels-community-monitor/0.1",
28
+ ) -> None:
29
+ if not token:
30
+ token = self._token_from_gh_cli()
31
+ headers = {
32
+ "Accept": "application/vnd.github+json",
33
+ "User-Agent": user_agent,
34
+ "X-GitHub-Api-Version": "2022-11-28",
35
+ }
36
+ if token:
37
+ headers["Authorization"] = f"Bearer {token}"
38
+
39
+ self.owner = owner
40
+ self.repo = repo
41
+ self._client = httpx.Client(
42
+ base_url="https://api.github.com",
43
+ headers=headers,
44
+ timeout=request_timeout_seconds,
45
+ follow_redirects=False,
46
+ )
47
+ self._anonymous_client = httpx.Client(
48
+ base_url="https://api.github.com",
49
+ headers={
50
+ "Accept": "application/vnd.github+json",
51
+ "User-Agent": user_agent,
52
+ "X-GitHub-Api-Version": "2022-11-28",
53
+ },
54
+ timeout=request_timeout_seconds,
55
+ follow_redirects=False,
56
+ )
57
+ self._web_client = httpx.Client(
58
+ base_url="https://github.com",
59
+ headers={"User-Agent": user_agent},
60
+ timeout=request_timeout_seconds,
61
+ follow_redirects=True,
62
+ )
63
+ self._raw_client = httpx.Client(
64
+ base_url="https://raw.githubusercontent.com",
65
+ headers={"User-Agent": user_agent},
66
+ timeout=request_timeout_seconds,
67
+ follow_redirects=True,
68
+ )
69
+
70
+ @staticmethod
71
+ def _token_from_gh_cli() -> str | None:
72
+ try:
73
+ completed = subprocess.run(
74
+ ["gh", "auth", "token"],
75
+ capture_output=True,
76
+ text=True,
77
+ check=True,
78
+ )
79
+ except (OSError, subprocess.CalledProcessError):
80
+ return None
81
+ token = completed.stdout.strip()
82
+ return token or None
83
+
84
+ def close(self) -> None:
85
+ self._client.close()
86
+ self._anonymous_client.close()
87
+ self._web_client.close()
88
+ self._raw_client.close()
89
+
90
+ @staticmethod
91
+ def _is_classic_pat_forbidden(response: httpx.Response) -> bool:
92
+ return response.status_code == 403 and "forbids access via a personal access token (classic)" in response.text
93
+
94
+ def _request_with_fallback(self, method: str, path: str, **kwargs: Any) -> httpx.Response:
95
+ response = self._client.request(method, path, **kwargs)
96
+ if self._is_classic_pat_forbidden(response):
97
+ response = self._anonymous_client.request(method, path, **kwargs)
98
+ return response
99
+
100
+ def _request(self, method: str, path: str, **kwargs: Any) -> httpx.Response:
101
+ response = self._request_with_fallback(method, path, **kwargs)
102
+ if response.status_code >= 400:
103
+ raise GitHubActionsError(
104
+ f"GitHub API request failed for {path}: {response.status_code} {response.text}"
105
+ )
106
+ return response
107
+
108
+ @staticmethod
109
+ def _should_use_public_fallback(response: httpx.Response) -> bool:
110
+ text = response.text.lower()
111
+ return response.status_code in {403, 404, 429} or "rate limit exceeded" in text
112
+
113
+ @staticmethod
114
+ def _workflow_path(workflow_file: str) -> str:
115
+ if workflow_file.startswith(".github/workflows/"):
116
+ return workflow_file
117
+ return f".github/workflows/{workflow_file}"
118
+
119
+ @staticmethod
120
+ def _parse_run_state(aria_label: str) -> tuple[str, str | None]:
121
+ normalized = aria_label.lower()
122
+ if "completed successfully" in normalized:
123
+ return "completed", "success"
124
+ if "cancel" in normalized:
125
+ return "completed", "cancelled"
126
+ if "fail" in normalized:
127
+ return "completed", "failure"
128
+ if "queued" in normalized:
129
+ return "queued", None
130
+ if "in progress" in normalized or "running" in normalized:
131
+ return "in_progress", None
132
+ return "completed", None
133
+
134
+ def _list_workflow_runs_public(self, workflow_file: str, page: int = 1) -> list[GitHubRun]:
135
+ response = self._web_client.get(
136
+ f"/{self.owner}/{self.repo}/actions/workflows/{workflow_file}",
137
+ params={"page": page},
138
+ )
139
+ response.raise_for_status()
140
+ soup = BeautifulSoup(response.text, "html.parser")
141
+ rows = soup.find_all("div", class_="Box-row")
142
+ runs: list[GitHubRun] = []
143
+ run_prefix = f"/{self.owner}/{self.repo}/actions/runs/"
144
+ branch_prefix = f"/{self.owner}/{self.repo}/tree/refs/heads/"
145
+ pull_prefix = f"/{self.owner}/{self.repo}/pull/"
146
+ workflow_path = self._workflow_path(workflow_file)
147
+
148
+ for row in rows:
149
+ run_link = next(
150
+ (a for a in row.find_all("a") if (a.get("href") or "").startswith(run_prefix)),
151
+ None,
152
+ )
153
+ if not run_link:
154
+ continue
155
+
156
+ run_href = run_link.get("href") or ""
157
+ try:
158
+ run_id = int(run_href.rstrip("/").split("/")[-1])
159
+ except ValueError:
160
+ continue
161
+
162
+ display_title = run_link.get_text(" ", strip=True)
163
+ aria_label = run_link.get("aria-label") or ""
164
+ status, conclusion = self._parse_run_state(aria_label)
165
+ relative_time = row.find("relative-time")
166
+ timestamp = parse_github_datetime(relative_time.get("datetime")) if relative_time else None
167
+ branch_link = next(
168
+ (a for a in row.find_all("a") if (a.get("href") or "").startswith(branch_prefix)),
169
+ None,
170
+ )
171
+ actor_link = next(
172
+ (
173
+ a
174
+ for a in row.find_all("a")
175
+ if (href := a.get("href") or "")
176
+ and href.startswith("/")
177
+ and not href.startswith(run_prefix)
178
+ and not href.startswith(branch_prefix)
179
+ and not href.startswith(pull_prefix)
180
+ and href.count("/") == 1
181
+ ),
182
+ None,
183
+ )
184
+ workflow_name = row.find("span", class_="text-bold")
185
+ pull_link = next(
186
+ (a for a in row.find_all("a") if (a.get("href") or "").startswith(pull_prefix)),
187
+ None,
188
+ )
189
+ event = "pull_request" if pull_link else "workflow_dispatch"
190
+ head_branch = branch_link.get_text(" ", strip=True) if branch_link else ""
191
+ actor_login = actor_link.get_text(" ", strip=True) if actor_link else None
192
+ run_time = timestamp or utcnow()
193
+ runs.append(
194
+ GitHubRun(
195
+ id=run_id,
196
+ name=workflow_name.get_text(" ", strip=True) if workflow_name else workflow_file,
197
+ display_title=display_title,
198
+ path=workflow_path,
199
+ status=status,
200
+ conclusion=conclusion,
201
+ head_branch=head_branch,
202
+ head_sha="",
203
+ event=event,
204
+ html_url=f"https://github.com{run_href}",
205
+ jobs_url=f"https://api.github.com/repos/{self.owner}/{self.repo}/actions/runs/{run_id}/jobs",
206
+ created_at=run_time,
207
+ updated_at=run_time,
208
+ run_started_at=run_time,
209
+ actor_login=actor_login,
210
+ raw={"source": "public_html"},
211
+ )
212
+ )
213
+ return runs
214
+
215
+ @staticmethod
216
+ def _runner_group_from_job_name(job_name: str) -> str | None:
217
+ match = re.search(r"\(([^)]+)\)", job_name)
218
+ if not match:
219
+ return None
220
+ parts = [part.strip() for part in match.group(1).split(",") if part.strip()]
221
+ if len(parts) < 2:
222
+ return None
223
+ return parts[1]
224
+
225
+ def _list_jobs_public(self, run_id: int) -> list[GitHubJob]:
226
+ response = self._web_client.get(f"/{self.owner}/{self.repo}/actions/runs/{run_id}")
227
+ response.raise_for_status()
228
+ soup = BeautifulSoup(response.text, "html.parser")
229
+ scripts = [
230
+ script
231
+ for script in soup.find_all("script")
232
+ if script.get("data-target") == "react-partial.embeddedData"
233
+ ]
234
+ jobs_script = next(
235
+ (
236
+ script
237
+ for script in scripts
238
+ if (parent := script.find_parent("react-partial"))
239
+ and parent.get("partial-name") == "actions-run-jobs-list"
240
+ ),
241
+ None,
242
+ )
243
+ if jobs_script is None or not jobs_script.string:
244
+ raise GitHubActionsError(f"Could not locate jobs list for run {run_id} in the public page.")
245
+
246
+ embedded = json.loads(jobs_script.string)
247
+ props = embedded.get("props") or {}
248
+ fetch_url = props.get("jobGroupsFetchUrl")
249
+ if not fetch_url:
250
+ raise GitHubActionsError(f"Public run page for {run_id} did not expose job groups fetch URL.")
251
+
252
+ batch_response = self._web_client.get(
253
+ fetch_url,
254
+ params={"batch": 0, "size": 200},
255
+ headers={"X-Requested-With": "XMLHttpRequest"},
256
+ )
257
+ batch_response.raise_for_status()
258
+ payload = batch_response.json()
259
+ jobs: list[GitHubJob] = []
260
+ run_url = f"https://github.com/{self.owner}/{self.repo}/actions/runs/{run_id}"
261
+
262
+ for group in payload.get("jobGroups") or []:
263
+ non_nested = group.get("nonNested") or {}
264
+ for job_payload in non_nested.get("jobs") or []:
265
+ job_name = job_payload.get("displayName") or group.get("name") or ""
266
+ job_href = job_payload.get("href") or ""
267
+ jobs.append(
268
+ GitHubJob(
269
+ id=job_payload["id"],
270
+ run_id=run_id,
271
+ workflow_name="",
272
+ head_branch="",
273
+ run_url=run_url,
274
+ run_attempt=1,
275
+ head_sha="",
276
+ url="",
277
+ html_url=f"https://github.com{job_href}",
278
+ status=job_payload.get("status") or "unknown",
279
+ conclusion=job_payload.get("conclusion"),
280
+ created_at=utcnow(),
281
+ started_at=None,
282
+ completed_at=None,
283
+ name=job_name,
284
+ steps=[],
285
+ runner_group_name=self._runner_group_from_job_name(job_name),
286
+ )
287
+ )
288
+ return jobs
289
+
290
+ def _list_repo_tree_paths_public(self, ref: str = "main") -> list[str]:
291
+ response = self._web_client.get(f"/{self.owner}/{self.repo}/tree/{ref}")
292
+ response.raise_for_status()
293
+ soup = BeautifulSoup(response.text, "html.parser")
294
+ prefix = f"/{self.owner}/{self.repo}/tree/{ref}/"
295
+ candidates = sorted(
296
+ {
297
+ href.removeprefix(prefix).split("/", 1)[0]
298
+ for anchor in soup.find_all("a")
299
+ if (href := anchor.get("href") or "").startswith(prefix)
300
+ and "/" not in href.removeprefix(prefix)
301
+ }
302
+ )
303
+ paths: list[str] = []
304
+ for candidate in candidates:
305
+ if candidate.startswith("."):
306
+ continue
307
+ raw_response = self._raw_client.get(f"/{self.owner}/{self.repo}/{ref}/{candidate}/build.toml")
308
+ if raw_response.status_code == 200:
309
+ paths.append(f"{candidate}/build.toml")
310
+ return paths
311
+
312
+ def _get_file_text_public(self, path: str, ref: str | None = None) -> str | None:
313
+ target_ref = ref or "main"
314
+ response = self._raw_client.get(f"/{self.owner}/{self.repo}/{target_ref}/{path}")
315
+ if response.status_code == 404:
316
+ return None
317
+ response.raise_for_status()
318
+ return response.text
319
+
320
+ def list_runs(self, per_page: int = 30, page: int = 1) -> list[GitHubRun]:
321
+ response = self._request(
322
+ "GET",
323
+ f"/repos/{self.owner}/{self.repo}/actions/runs",
324
+ params={"per_page": per_page, "page": page},
325
+ )
326
+ payload = response.json()
327
+ return [GitHubRun.from_api(item) for item in payload.get("workflow_runs") or []]
328
+
329
+ def list_workflow_runs(
330
+ self,
331
+ workflow_file: str,
332
+ per_page: int = 30,
333
+ page: int = 1,
334
+ ) -> list[GitHubRun]:
335
+ response = self._request_with_fallback(
336
+ "GET",
337
+ f"/repos/{self.owner}/{self.repo}/actions/workflows/{workflow_file}/runs",
338
+ params={"per_page": per_page, "page": page},
339
+ )
340
+ if self._should_use_public_fallback(response):
341
+ return self._list_workflow_runs_public(workflow_file, page=page)
342
+ if response.status_code >= 400:
343
+ raise GitHubActionsError(
344
+ f"GitHub API request failed for /repos/{self.owner}/{self.repo}/actions/workflows/{workflow_file}/runs: "
345
+ f"{response.status_code} {response.text}"
346
+ )
347
+ payload = response.json()
348
+ return [GitHubRun.from_api(item) for item in payload.get("workflow_runs") or []]
349
+
350
+ def list_jobs(self, run_id: int) -> list[GitHubJob]:
351
+ response = self._request_with_fallback(
352
+ "GET",
353
+ f"/repos/{self.owner}/{self.repo}/actions/runs/{run_id}/jobs",
354
+ params={"per_page": 100},
355
+ )
356
+ if self._should_use_public_fallback(response):
357
+ return self._list_jobs_public(run_id)
358
+ if response.status_code >= 400:
359
+ raise GitHubActionsError(
360
+ f"GitHub API request failed for /repos/{self.owner}/{self.repo}/actions/runs/{run_id}/jobs: "
361
+ f"{response.status_code} {response.text}"
362
+ )
363
+ payload = response.json()
364
+ return [GitHubJob.from_api(item) for item in payload.get("jobs") or []]
365
+
366
+ def list_repo_tree_paths(self, ref: str = "main") -> list[str]:
367
+ response = self._request_with_fallback(
368
+ "GET",
369
+ f"/repos/{self.owner}/{self.repo}/git/trees/{ref}",
370
+ params={"recursive": 1},
371
+ )
372
+ if self._should_use_public_fallback(response):
373
+ return self._list_repo_tree_paths_public(ref=ref)
374
+ if response.status_code >= 400:
375
+ raise GitHubActionsError(
376
+ f"GitHub API request failed for /repos/{self.owner}/{self.repo}/git/trees/{ref}: "
377
+ f"{response.status_code} {response.text}"
378
+ )
379
+ payload = response.json()
380
+ return [item["path"] for item in payload.get("tree") or [] if item.get("path")]
381
+
382
+ def get_job_logs(
383
+ self,
384
+ job_id: int,
385
+ line_limit: int = 400,
386
+ char_limit: int = 35000,
387
+ job_html_url: str | None = None,
388
+ ) -> str | None:
389
+ response = self._request_with_fallback(
390
+ "GET",
391
+ f"/repos/{self.owner}/{self.repo}/actions/jobs/{job_id}/logs",
392
+ )
393
+
394
+ if response.status_code in {301, 302, 307, 308}:
395
+ location = response.headers.get("Location")
396
+ if not location:
397
+ return None
398
+ redirected = self._anonymous_client.get(location, follow_redirects=True)
399
+ if redirected.status_code in {404, 410}:
400
+ return None
401
+ redirected.raise_for_status()
402
+ text = redirected.text
403
+ elif response.status_code in {404, 410}:
404
+ return None
405
+ elif response.status_code == 403 and job_html_url:
406
+ text = self._fetch_public_job_page(job_html_url)
407
+ elif response.status_code >= 400:
408
+ raise GitHubActionsError(
409
+ f"GitHub API request failed for /repos/{self.owner}/{self.repo}/actions/jobs/{job_id}/logs: "
410
+ f"{response.status_code} {response.text}"
411
+ )
412
+ else:
413
+ text = response.text
414
+
415
+ if not text:
416
+ return None
417
+
418
+ lines = text.splitlines()
419
+ if line_limit and len(lines) > line_limit:
420
+ lines = lines[-line_limit:]
421
+ trimmed = "\n".join(lines)
422
+ if char_limit and len(trimmed) > char_limit:
423
+ trimmed = trimmed[-char_limit:]
424
+ return trimmed
425
+
426
+ def _fetch_public_job_page(self, job_html_url: str) -> str | None:
427
+ response = self._anonymous_client.get(job_html_url, follow_redirects=True)
428
+ response.raise_for_status()
429
+ text = response.text
430
+ text = re.sub(r"(?is)<script.*?</script>", " ", text)
431
+ text = re.sub(r"(?is)<style.*?</style>", " ", text)
432
+ text = re.sub(r"(?s)<[^>]+>", "\n", text)
433
+ text = html.unescape(text)
434
+ normalized_lines = [line.strip() for line in text.splitlines() if line.strip()]
435
+ return "\n".join(normalized_lines)
436
+
437
+ def get_file_text(self, path: str, ref: str | None = None) -> str | None:
438
+ params = {"ref": ref} if ref else None
439
+ response = self._request_with_fallback(
440
+ "GET",
441
+ f"/repos/{self.owner}/{self.repo}/contents/{path}",
442
+ params=params,
443
+ )
444
+ if self._should_use_public_fallback(response):
445
+ return self._get_file_text_public(path, ref=ref)
446
+ if response.status_code >= 400:
447
+ raise GitHubActionsError(
448
+ f"GitHub API request failed for /repos/{self.owner}/{self.repo}/contents/{path}: "
449
+ f"{response.status_code} {response.text}"
450
+ )
451
+ payload = response.json()
452
+ encoded = payload.get("content")
453
+ if not encoded:
454
+ return None
455
+ content = base64.b64decode(encoded)
456
+ return content.decode("utf-8", errors="replace")
src/kc_monitor/grafana.py ADDED
@@ -0,0 +1,65 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass
4
+ from urllib.parse import urlencode
5
+
6
+ from kc_monitor.config import GrafanaSettings
7
+
8
+
9
+ @dataclass(frozen=True, slots=True)
10
+ class GrafanaDashboard:
11
+ key: str
12
+ title: str
13
+ description: str
14
+ uid: str
15
+ height: int
16
+
17
+
18
def dashboard_catalog(settings: GrafanaSettings) -> list[GrafanaDashboard]:
    """Return the monitor's Grafana dashboards in display order.

    UIDs come from *settings* so deployments can point at their own copies
    of the provisioned dashboards.
    """
    specs: list[tuple[str, str, str, str, int]] = [
        (
            "overview",
            "Matrix overview",
            "Latest outcome per build matrix combo, with fast filters across kernel, backend, CUDA, PyTorch, and Python.",
            settings.overview_dashboard_uid,
            420,
        ),
        (
            "durations",
            "Duration trends",
            "Compilation and upload duration trends, so regressions show up as rising wall time instead of surprise failures.",
            settings.duration_dashboard_uid,
            460,
        ),
        (
            "failures",
            "Failure overview",
            "Current failing combinations and stale metrics signals, tuned for alert-driven triage instead of log scraping.",
            settings.failure_dashboard_uid,
            420,
        ),
    ]
    return [
        GrafanaDashboard(key=key, title=title, description=description, uid=uid, height=height)
        for key, title, description, uid, height in specs
    ]
42
+
43
+
44
def build_dashboard_url(
    settings: GrafanaSettings,
    uid: str,
    *,
    embed: bool,
) -> str:
    """Compose a Grafana dashboard URL, or "" when no base URL is configured.

    With ``embed=True`` the kiosk query parameter is added so the dashboard
    renders chrome-free inside an iframe.
    """
    root = (settings.base_url or "").rstrip("/")
    if not root:
        return ""

    params: dict[str, object] = {
        "orgId": settings.org_id,
        "from": settings.default_from,
        "to": settings.default_to,
        "theme": settings.theme,
        "refresh": settings.default_refresh,
    }
    if embed:
        params["kiosk"] = "tv"
    return f"{root}/d/{uid}/_?{urlencode(params)}"
65
+
src/kc_monitor/kernel_index.py ADDED
@@ -0,0 +1,108 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import re
4
+ import tomllib
5
+
6
+ from cachetools import TTLCache
7
+
8
+ from kc_monitor.github_client import GitHubActionsClient, GitHubActionsError
9
+ from kc_monitor.models import GitHubRun, KernelInfo
10
+
11
+
12
+ PR_TITLE_RE = re.compile(r"^\s*([A-Za-z0-9_-]+)\s*:")
13
+ MANUAL_BUILD_RE = re.compile(
14
+ r"Manual Kernel Build\s*/\s*([A-Za-z0-9_-]+)\s*/",
15
+ flags=re.IGNORECASE,
16
+ )
17
+
18
+
19
class KernelIndex:
    """Resolves kernel names to Hub metadata, backed by TTL caches.

    Kernel info is read from each kernel's ``build.toml`` in the monitored
    repo; failures degrade to a best-effort fallback that points at the
    conventional ``kernels-community/<name>`` Hub repo.
    """

    def __init__(
        self,
        client: GitHubActionsClient,
        branch: str = "main",
        cache_ttl_seconds: int = 900,
    ) -> None:
        self.client = client
        self.branch = branch
        # Per-kernel info cache plus a single-entry cache for the catalog.
        self._cache: TTLCache[str, KernelInfo] = TTLCache(maxsize=256, ttl=cache_ttl_seconds)
        self._catalog_cache: TTLCache[str, list[KernelInfo]] = TTLCache(maxsize=1, ttl=cache_ttl_seconds)

    @staticmethod
    def infer_kernel_name(run: GitHubRun) -> str | None:
        """Guess which kernel a run belongs to from its title or workflow name.

        Recognizes the ``<kernel>: ...`` PR-title convention first, then the
        ``Manual Kernel Build / <kernel> / ...`` display format.
        """
        for candidate in (run.display_title, run.name):
            if not candidate:
                continue
            pr_match = PR_TITLE_RE.match(candidate)
            if pr_match:
                return pr_match.group(1)
            manual_match = MANUAL_BUILD_RE.search(candidate)
            if manual_match:
                return manual_match.group(1)
        return None

    @staticmethod
    def _fallback_kernel_info(kernel_name: str) -> KernelInfo:
        """Minimal KernelInfo assuming the conventional Hub repo layout."""
        repo_id = f"kernels-community/{kernel_name}"
        return KernelInfo(
            kernel_name=kernel_name,
            repo_id=repo_id,
            hub_url=f"https://huggingface.co/{repo_id}",
        )

    def get_kernel_info(self, kernel_name: str) -> KernelInfo:
        """Return (and cache) the kernel's metadata parsed from build.toml.

        Fetch or parse failures cache and return the fallback info, so
        transient errors don't hammer the API on every call.
        """
        cached = self._cache.get(kernel_name)
        if cached is not None:
            return cached

        fallback = self._fallback_kernel_info(kernel_name)

        try:
            content = self.client.get_file_text(f"{kernel_name}/build.toml", ref=self.branch)
        except GitHubActionsError:
            content = None
        if not content:
            self._cache[kernel_name] = fallback
            return fallback

        try:
            manifest = tomllib.loads(content)
        except tomllib.TOMLDecodeError:
            self._cache[kernel_name] = fallback
            return fallback

        general = manifest.get("general") or {}
        repo_id = (general.get("hub") or {}).get("repo-id") or fallback.repo_id
        info = KernelInfo(
            kernel_name=general.get("name") or kernel_name,
            repo_id=repo_id,
            hub_url=f"https://huggingface.co/{repo_id}",
            version=general.get("version"),
            backends=list(general.get("backends") or []),
        )
        self._cache[kernel_name] = info
        return info

    def list_kernel_catalog(self) -> list[KernelInfo]:
        """List every kernel (directories holding a build.toml), TTL-cached.

        Returns an empty, uncached list when the repo tree cannot be read,
        so the next call retries.
        """
        cached = self._catalog_cache.get("catalog")
        if cached is not None:
            return cached

        try:
            paths = self.client.list_repo_tree_paths(ref=self.branch)
        except GitHubActionsError:
            return []

        names = sorted(
            {
                path.split("/", 1)[0]
                for path in paths
                if path.endswith("/build.toml") and path.count("/") == 1
            }
        )
        catalog = [self._cache.get(name) or self._fallback_kernel_info(name) for name in names]
        self._catalog_cache["catalog"] = catalog
        return catalog
src/kc_monitor/log_parser.py ADDED
@@ -0,0 +1,216 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import re
4
+
5
+ from kc_monitor.models import (
6
+ FAILING_CONCLUSIONS,
7
+ GitHubJob,
8
+ GitHubJobStep,
9
+ GitHubRun,
10
+ ParsedJobState,
11
+ ParsedLogEvent,
12
+ parse_github_datetime,
13
+ )
14
+
15
+
16
+ PHASE_LABELS = {
17
+ "queued": "Queued",
18
+ "setup": "Setup",
19
+ "validating": "Validating",
20
+ "building": "Building",
21
+ "uploading": "Uploading",
22
+ "upload_complete": "Upload complete",
23
+ "testing": "Testing",
24
+ "completed": "Completed",
25
+ "failed": "Failed",
26
+ "cancelled": "Cancelled",
27
+ "stalled": "Stalled",
28
+ }
29
+
30
+ UPLOAD_LABELS = {
31
+ "not_started": "Not started",
32
+ "running": "Running",
33
+ "completed": "Completed",
34
+ "failed": "Failed",
35
+ "skipped": "Skipped",
36
+ }
37
+
38
+ STEP_PHASE_RULES: list[tuple[re.Pattern[str], str]] = [
39
+ (re.compile(r"Set up job|checkout|nix-installer|Nix info|cachix", re.IGNORECASE), "setup"),
40
+ (re.compile(r"Validate kernel directory", re.IGNORECASE), "validating"),
41
+ (re.compile(r"Build and upload kernel|Build kernel|Build and copy kernel", re.IGNORECASE), "building"),
42
+ (re.compile(r"Upload kernel|Upload v1 kernels to main|Upload ci-test closure", re.IGNORECASE), "uploading"),
43
+ (re.compile(r"Run GPU tests", re.IGNORECASE), "testing"),
44
+ ]
45
+
46
+ TIMESTAMP_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z)")
47
+ REPO_ID_RE = re.compile(r"--repo-id(?:=|\s+)?\"?([A-Za-z0-9._-]+/[A-Za-z0-9._-]+)\"?")
48
+ UPLOAD_START_RE = re.compile(r"(kernels\s+--\s+upload|upload\s+--repo-id|Uploading\s+[A-Za-z0-9._-]+/[A-Za-z0-9._-]+)", re.IGNORECASE)
49
+ UPLOAD_SUCCESS_RE = re.compile(r"(Upload finished|Upload complete|Committed|commit created|pushed to hub)", re.IGNORECASE)
50
+ ERROR_RE = re.compile(r"(error:|Process completed with exit code|Traceback|FAILED|fatal:)", re.IGNORECASE)
51
+
52
+
53
+ def classify_step_name(step_name: str | None) -> str | None:
54
+ if not step_name:
55
+ return None
56
+ for pattern, phase in STEP_PHASE_RULES:
57
+ if pattern.search(step_name):
58
+ return phase
59
+ return None
60
+
61
+
62
+ def _interesting_category(line: str) -> str | None:
63
+ if ERROR_RE.search(line):
64
+ return "error"
65
+ if UPLOAD_START_RE.search(line) or "upload" in line.lower():
66
+ return "upload"
67
+ if "build-and-upload" in line or "build-and-copy" in line or "nix build" in line.lower():
68
+ return "build"
69
+ if "validate kernel directory" in line.lower():
70
+ return "validation"
71
+ return None
72
+
73
+
74
class JobLogParser:
    """Derive a high-level job state (phase, upload status, notable events)
    from a GitHub Actions job payload plus its raw log text."""

    def parse(
        self,
        run: GitHubRun,
        job: GitHubJob,
        log_text: str | None,
        event_limit: int = 20,
    ) -> ParsedJobState:
        """Summarize ``job`` and its log as a ParsedJobState.

        ``run`` is accepted for interface symmetry; the current heuristics
        use only the job payload and the log lines. ``event_limit`` caps the
        number of retained log events (most recent kept).
        """
        lines = log_text.splitlines() if log_text else []
        latest_log_at = self._latest_log_timestamp(lines)
        repo_id = self._extract_repo_id(lines)
        events = self._extract_events(lines, limit=event_limit)
        failure_excerpt = self._failure_excerpt(lines) if (job.conclusion or "") in FAILING_CONCLUSIONS else None

        active_step = job.active_step or job.last_step
        step_phase = classify_step_name(active_step.name if active_step else None)
        upload_status = self._upload_status(job, lines)
        phase, reason = self._phase_for_job(job, step_phase, upload_status, lines, active_step)

        return ParsedJobState(
            phase=phase,
            phase_label=PHASE_LABELS.get(phase, phase.title()),
            phase_reason=reason,
            upload_status=upload_status,
            upload_status_label=UPLOAD_LABELS[upload_status],
            repo_id=repo_id,
            latest_log_at=latest_log_at,
            active_step_name=active_step.name if active_step else None,
            active_step_started_at=active_step.started_at if active_step else None,
            events=events,
            failure_excerpt=failure_excerpt,
        )

    def _phase_for_job(
        self,
        job: GitHubJob,
        step_phase: str | None,
        upload_status: str,
        lines: list[str],
        active_step: GitHubJobStep | None,
    ) -> tuple[str, str]:
        """Pick (phase key, human reason) for the job from steps + log markers."""
        upload_started = any(UPLOAD_START_RE.search(line) for line in lines)
        combined_step = any("Build and upload kernel" in step.name for step in job.steps)

        if job.status != "completed":
            if upload_started or upload_status == "running":
                return "uploading", "Upload command detected in the active job log."
            if step_phase:
                return step_phase, f"Current GitHub Actions step: {active_step.name}."
            return "queued", "Job is queued or still waiting for the first step to start."

        conclusion = job.conclusion or "completed"
        if conclusion == "success":
            if upload_status == "completed" or (combined_step and upload_started):
                return "upload_complete", "Build finished and upload markers were detected."
            return "completed", "Job completed successfully."

        if conclusion == "cancelled":
            return "cancelled", "GitHub marked the job as cancelled."

        if upload_status == "failed":
            return "failed", "Job failed after upload started or inside an upload step."

        return "failed", "GitHub marked the job as failed."

    def _upload_status(self, job: GitHubJob, lines: list[str]) -> str:
        """Classify upload progress from dedicated upload steps, falling back
        to log markers for combined build-and-upload steps."""
        upload_steps = [step for step in job.steps if classify_step_name(step.name) == "uploading"]
        if any(step.is_running for step in upload_steps):
            return "running"
        if any((step.conclusion or "") == "success" for step in upload_steps):
            return "completed"
        if any((step.conclusion or "") in FAILING_CONCLUSIONS for step in upload_steps):
            return "failed"
        if upload_steps and all((step.conclusion or "") == "skipped" for step in upload_steps):
            return "skipped"

        upload_started = any(UPLOAD_START_RE.search(line) for line in lines)
        upload_succeeded = any(UPLOAD_SUCCESS_RE.search(line) for line in lines)
        combined_step_success = any(
            "Build and upload kernel" in step.name and (step.conclusion or "") == "success"
            for step in job.steps
        )

        if job.status != "completed":
            return "running" if upload_started else "not_started"

        # BUGFIX: the failure check must run before the success markers.
        # Previously the "completed" branch matched any job whose log merely
        # *started* an upload, which made the "failed" branch unreachable for
        # failed jobs.
        if upload_started and not upload_succeeded and (job.conclusion or "") in FAILING_CONCLUSIONS:
            return "failed"
        if upload_succeeded or upload_started or combined_step_success:
            return "completed"
        if (job.conclusion or "") == "cancelled":
            return "skipped"
        return "not_started"

    def _latest_log_timestamp(self, lines: list[str]) -> datetime | None:
        """Newest GitHub log timestamp found in ``lines``, or None."""
        timestamps = []
        for line in lines:
            match = TIMESTAMP_RE.search(line)
            if match:
                parsed = parse_github_datetime(match.group(1))
                if parsed:
                    timestamps.append(parsed)
        return max(timestamps) if timestamps else None

    def _extract_repo_id(self, lines: list[str]) -> str | None:
        """First `--repo-id owner/name` value seen in the log, or None."""
        for line in lines:
            match = REPO_ID_RE.search(line)
            if match:
                return match.group(1)
        return None

    def _extract_events(self, lines: list[str], limit: int) -> list[ParsedLogEvent]:
        """Collect categorized log events, keeping only the last ``limit``."""
        events: list[ParsedLogEvent] = []
        for index, line in enumerate(lines, start=1):
            category = _interesting_category(line)
            if not category:
                continue
            timestamp = None
            match = TIMESTAMP_RE.search(line)
            if match:
                timestamp = parse_github_datetime(match.group(1))
            events.append(
                ParsedLogEvent(
                    category=category,
                    message=line.strip(),
                    line_number=index,
                    timestamp=timestamp,
                )
            )
        return events[-limit:]

    def _failure_excerpt(self, lines: list[str]) -> str | None:
        """Short failure context: last 8 error lines, else last 10 non-empty lines."""
        if not lines:
            return None

        failure_lines = [line.strip() for line in lines if line.strip() and ERROR_RE.search(line)]
        if failure_lines:
            return "\n".join(failure_lines[-8:])

        non_empty = [line.strip() for line in lines if line.strip()]
        if not non_empty:
            return None
        return "\n".join(non_empty[-10:])
src/kc_monitor/metrics_push.py ADDED
@@ -0,0 +1,190 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass
4
+ from datetime import datetime, timezone
5
+ import time
6
+ from typing import Mapping
7
+ from urllib.parse import quote
8
+
9
+ import httpx
10
+
11
+
12
+ GROUPING_LABEL_ORDER = (
13
+ "kernel",
14
+ "backend",
15
+ "compute_backend",
16
+ "cuda_version",
17
+ "pytorch_version",
18
+ "python_version",
19
+ )
20
+
21
+ METRIC_LABEL_ORDER = (
22
+ "repository",
23
+ "workflow",
24
+ "branch",
25
+ "job",
26
+ "runner_os",
27
+ "runner_arch",
28
+ )
29
+
30
+ RESULT_CODE_BY_STATUS = {
31
+ "success": 0,
32
+ "cancelled": 1,
33
+ "skipped": 1,
34
+ "neutral": 1,
35
+ "failure": 2,
36
+ "timed_out": 2,
37
+ "startup_failure": 2,
38
+ "action_required": 2,
39
+ }
40
+
41
+
42
+ def _coalesce(value: str | None, default: str = "unknown") -> str:
43
+ cleaned = (value or "").strip()
44
+ return cleaned or default
45
+
46
+
47
+ def _escape_label_value(value: str) -> str:
48
+ return value.replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')
49
+
50
+
51
+ def _parse_unix_or_iso(value: str) -> float:
52
+ raw = value.strip()
53
+ try:
54
+ return float(raw)
55
+ except ValueError:
56
+ normalized = raw.replace("Z", "+00:00")
57
+ return datetime.fromisoformat(normalized).astimezone(timezone.utc).timestamp()
58
+
59
+
60
+ def resolve_duration_seconds(env: Mapping[str, str], completed_at_seconds: float) -> float:
61
+ explicit_duration = env.get("KCM_BUILD_DURATION_SECONDS")
62
+ if explicit_duration:
63
+ return max(float(explicit_duration), 0.0)
64
+
65
+ started_at = env.get("KCM_JOB_STARTED_AT")
66
+ if not started_at:
67
+ return 0.0
68
+
69
+ started_at_seconds = _parse_unix_or_iso(started_at)
70
+ return max(completed_at_seconds - started_at_seconds, 0.0)
71
+
72
+
73
+ def result_code_for_status(status: str) -> int:
74
+ return RESULT_CODE_BY_STATUS.get(status.strip().lower(), 3)
75
+
76
+
77
@dataclass(frozen=True, slots=True)
class BuildMetricSample:
    """One build's worth of metrics, ready to be pushed to the Pushgateway."""

    grouping_key: dict[str, str]  # Pushgateway grouping labels (URL path)
    metric_labels: dict[str, str]  # labels attached to every metric line
    duration_seconds: float
    completed_at_seconds: int
    result_code: int  # 0 ok / 1 neutral / 2 failure / 3 unknown
    failed: int  # 1 iff result_code == 2
    result: str  # normalized conclusion string

    @classmethod
    def from_env(
        cls,
        env: Mapping[str, str],
        *,
        completed_at_seconds: int | None = None,
    ) -> "BuildMetricSample":
        """Assemble a sample from GitHub Actions / KCM environment variables.

        ``completed_at_seconds`` defaults to the current time.
        """
        completed_at = completed_at_seconds or int(time.time())
        result = _coalesce(env.get("KCM_JOB_STATUS") or env.get("JOB_STATUS")).lower()
        code = result_code_for_status(result)

        grouping_sources = {
            "kernel": "KCM_KERNEL",
            "backend": "KCM_BACKEND",
            "compute_backend": "KCM_COMPUTE_BACKEND",
            "cuda_version": "KCM_CUDA_VERSION",
            "pytorch_version": "KCM_PYTORCH_VERSION",
            "python_version": "KCM_PYTHON_VERSION",
        }
        grouping_key = {
            label: _coalesce(env.get(variable))
            for label, variable in grouping_sources.items()
        }

        metric_labels = {
            "repository": _coalesce(env.get("GITHUB_REPOSITORY")),
            "workflow": _coalesce(env.get("GITHUB_WORKFLOW")),
            # Branch name: push events set GITHUB_REF_NAME, PRs set
            # GITHUB_HEAD_REF; the raw ref is the last resort.
            "branch": _coalesce(
                env.get("GITHUB_REF_NAME")
                or env.get("GITHUB_HEAD_REF")
                or env.get("GITHUB_REF")
            ),
            "job": _coalesce(env.get("GITHUB_JOB")),
            "runner_os": _coalesce(env.get("RUNNER_OS")),
            "runner_arch": _coalesce(env.get("RUNNER_ARCH")),
        }

        return cls(
            grouping_key=grouping_key,
            metric_labels=metric_labels,
            duration_seconds=resolve_duration_seconds(env, completed_at),
            completed_at_seconds=completed_at,
            result_code=code,
            failed=1 if code == 2 else 0,
            result=result,
        )
127
+
128
+
129
def build_pushgateway_url(base_url: str, job_name: str, grouping_key: Mapping[str, str]) -> str:
    """Assemble the Pushgateway PUT URL with grouping labels in a fixed order.

    NOTE(review): values are percent-encoded with quote(safe=""); the
    Pushgateway requires base64 (`@base64`) encoding for values containing
    "/" — confirm grouping values never contain slashes.
    """
    segments = [base_url.rstrip("/"), "metrics", "job", quote(job_name, safe="")]
    for label in GROUPING_LABEL_ORDER:
        segments.append(quote(label, safe=""))
        segments.append(quote(grouping_key[label], safe=""))
    return "/".join(segments)
135
+
136
+
137
def format_prometheus_metrics(sample: BuildMetricSample) -> str:
    """Render the sample as Prometheus text-format gauges (single trailing newline)."""
    label_blob = ",".join(
        f'{key}="{_escape_label_value(sample.metric_labels[key])}"'
        for key in METRIC_LABEL_ORDER
    )
    # The info metric additionally carries the raw result string as a label.
    info_labels = f'{label_blob},result="{_escape_label_value(sample.result)}"'

    rows: list[str] = []
    for metric, rendered_value in (
        ("kc_build_last_run_result_code", f"{sample.result_code}"),
        ("kc_build_last_run_failed", f"{sample.failed}"),
        ("kc_build_last_run_duration_seconds", f"{sample.duration_seconds:.3f}"),
        ("kc_build_last_run_timestamp_seconds", f"{sample.completed_at_seconds}"),
    ):
        rows.append(f"# TYPE {metric} gauge")
        rows.append(f"{metric}{{{label_blob}}} {rendered_value}")
    rows.append("# TYPE kc_build_last_run_info gauge")
    rows.append(f"kc_build_last_run_info{{{info_labels}}} 1")
    return "\n".join(rows) + "\n"
160
+
161
+
162
def push_build_metrics(
    sample: BuildMetricSample,
    *,
    pushgateway_url: str,
    job_name: str,
    timeout_seconds: float = 10.0,
    max_attempts: int = 3,
) -> str:
    """PUT the sample to the Pushgateway, retrying with linear backoff.

    Returns the final URL on success; re-raises the last httpx error once
    ``max_attempts`` attempts have failed.
    """
    target = build_pushgateway_url(pushgateway_url, job_name, sample.grouping_key)
    body = format_prometheus_metrics(sample).encode("utf-8")
    headers = {"Content-Type": "text/plain; version=0.0.4; charset=utf-8"}

    with httpx.Client(timeout=timeout_seconds) as client:
        for attempt in range(1, max_attempts + 1):
            try:
                response = client.put(target, content=body, headers=headers)
                response.raise_for_status()
            except httpx.HTTPError:
                if attempt == max_attempts:
                    raise
                # Linear backoff: 0.5s, 1.0s, ...
                time.sleep(0.5 * attempt)
            else:
                return target
    return target
190
+
src/kc_monitor/models.py ADDED
@@ -0,0 +1,342 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass, field
4
+ from datetime import datetime, timezone
5
+ from typing import Any
6
+
7
+ from dateutil import parser as date_parser
8
+
9
+
10
# Job/step conclusions that the monitor treats as failures.
FAILING_CONCLUSIONS = {"failure", "timed_out", "cancelled", "startup_failure"}
11
+
12
+
13
def utcnow() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(tz=timezone.utc)
15
+
16
+
17
def parse_github_datetime(value: str | None) -> datetime | None:
    """Parse a GitHub API ISO-8601 timestamp into aware UTC; None/empty stays None."""
    if value:
        return date_parser.isoparse(value).astimezone(timezone.utc)
    return None
21
+
22
+
23
+ @dataclass(slots=True)
24
+ class WorkflowTarget:
25
+ path: str
26
+ label: str
27
+ enabled: bool = True
28
+
29
+ @property
30
+ def basename(self) -> str:
31
+ return self.path.rsplit("/", 1)[-1]
32
+
33
+
34
@dataclass(slots=True)
class GitHubRun:
    """A GitHub Actions workflow run, normalized from the REST payload."""

    id: int
    name: str
    display_title: str
    path: str  # workflow file path, e.g. ".github/workflows/build.yml"
    status: str  # e.g. queued / in_progress / completed
    conclusion: str | None  # set only once the run has completed
    head_branch: str
    head_sha: str
    event: str  # trigger event (push, workflow_dispatch, ...)
    html_url: str
    jobs_url: str
    created_at: datetime
    updated_at: datetime
    run_started_at: datetime | None
    actor_login: str | None = None
    raw: dict[str, Any] = field(default_factory=dict)  # original payload kept for debugging

    @classmethod
    def from_api(cls, payload: dict[str, Any]) -> "GitHubRun":
        """Build a run from the raw REST payload, defaulting missing fields."""
        actor = payload.get("actor") or {}
        return cls(
            id=payload["id"],
            name=payload.get("name") or "",
            display_title=payload.get("display_title") or payload.get("name") or "",
            path=payload.get("path") or "",
            status=payload.get("status") or "unknown",
            conclusion=payload.get("conclusion"),
            head_branch=payload.get("head_branch") or "",
            head_sha=payload.get("head_sha") or "",
            event=payload.get("event") or "",
            html_url=payload.get("html_url") or "",
            jobs_url=payload.get("jobs_url") or "",
            # Missing timestamps fall back to "now" so sorting never breaks.
            created_at=parse_github_datetime(payload.get("created_at")) or utcnow(),
            updated_at=parse_github_datetime(payload.get("updated_at")) or utcnow(),
            run_started_at=parse_github_datetime(payload.get("run_started_at")),
            actor_login=actor.get("login"),
            raw=payload,
        )

    @property
    def is_active(self) -> bool:
        """True while the run has not completed."""
        return self.status != "completed"

    @property
    def sort_time(self) -> datetime:
        """Timestamp used for ordering: actual start when known, else creation."""
        return self.run_started_at or self.created_at
82
+
83
+
84
+ @dataclass(slots=True)
85
+ class GitHubJobStep:
86
+ name: str
87
+ status: str
88
+ conclusion: str | None
89
+ number: int
90
+ started_at: datetime | None
91
+ completed_at: datetime | None
92
+
93
+ @classmethod
94
+ def from_api(cls, payload: dict[str, Any]) -> "GitHubJobStep":
95
+ return cls(
96
+ name=payload.get("name") or "",
97
+ status=payload.get("status") or "unknown",
98
+ conclusion=payload.get("conclusion"),
99
+ number=payload.get("number") or 0,
100
+ started_at=parse_github_datetime(payload.get("started_at")),
101
+ completed_at=parse_github_datetime(payload.get("completed_at")),
102
+ )
103
+
104
+ @property
105
+ def is_running(self) -> bool:
106
+ return self.status != "completed"
107
+
108
+ @property
109
+ def is_failed(self) -> bool:
110
+ return (self.conclusion or "") in FAILING_CONCLUSIONS
111
+
112
+ @property
113
+ def duration_seconds(self) -> float | None:
114
+ if not self.started_at:
115
+ return None
116
+ end = self.completed_at or utcnow()
117
+ return max((end - self.started_at).total_seconds(), 0.0)
118
+
119
+
120
@dataclass(slots=True)
class GitHubJob:
    """A GitHub Actions job within a run, normalized from the REST payload."""

    id: int
    run_id: int
    workflow_name: str
    head_branch: str
    run_url: str
    run_attempt: int
    head_sha: str
    url: str  # API URL for this job
    html_url: str  # browser URL for this job
    status: str  # e.g. queued / in_progress / completed
    conclusion: str | None  # set only once the job has completed
    created_at: datetime
    started_at: datetime | None
    completed_at: datetime | None
    name: str
    steps: list[GitHubJobStep]
    runner_name: str | None = None
    runner_group_name: str | None = None

    @classmethod
    def from_api(cls, payload: dict[str, Any]) -> "GitHubJob":
        """Build a job (including its steps) from the raw REST payload."""
        steps = [GitHubJobStep.from_api(item) for item in payload.get("steps") or []]
        return cls(
            id=payload["id"],
            run_id=payload.get("run_id") or 0,
            workflow_name=payload.get("workflow_name") or "",
            head_branch=payload.get("head_branch") or "",
            run_url=payload.get("run_url") or "",
            run_attempt=payload.get("run_attempt") or 1,
            head_sha=payload.get("head_sha") or "",
            url=payload.get("url") or "",
            html_url=payload.get("html_url") or "",
            status=payload.get("status") or "unknown",
            conclusion=payload.get("conclusion"),
            # Missing creation time falls back to "now" so sorting never breaks.
            created_at=parse_github_datetime(payload.get("created_at")) or utcnow(),
            started_at=parse_github_datetime(payload.get("started_at")),
            completed_at=parse_github_datetime(payload.get("completed_at")),
            name=payload.get("name") or "",
            steps=steps,
            runner_name=payload.get("runner_name"),
            runner_group_name=payload.get("runner_group_name"),
        )

    @property
    def is_active(self) -> bool:
        """True while the job has not completed."""
        return self.status != "completed"

    @property
    def active_step(self) -> GitHubJobStep | None:
        """First step still running, or None when none are."""
        for step in self.steps:
            if step.is_running:
                return step
        return None

    @property
    def last_step(self) -> GitHubJobStep | None:
        """Final step in the list, or None when the job has no steps."""
        return self.steps[-1] if self.steps else None

    @property
    def duration_seconds(self) -> float | None:
        """Elapsed seconds (now() while running); None before the job starts."""
        if not self.started_at:
            return None
        end = self.completed_at or utcnow()
        return max((end - self.started_at).total_seconds(), 0.0)
186
+
187
+
188
@dataclass(slots=True)
class KernelInfo:
    """Metadata for a kernel, resolved from its build.toml or synthesized as a fallback."""

    kernel_name: str  # directory name of the kernel in the source repo
    repo_id: str  # Hugging Face Hub repo id, e.g. "kernels-community/<name>"
    hub_url: str  # full URL to the Hub repo
    version: int | None = None  # NOTE(review): filled from build.toml "version" — confirm it is an int
    backends: list[str] = field(default_factory=list)  # backends declared in build.toml
195
+
196
+
197
@dataclass(slots=True)
class ParsedLogEvent:
    """A single noteworthy log line surfaced by the log parser."""

    category: str  # "error" / "upload" / "build" / "validation"
    message: str  # the stripped log line
    line_number: int  # 1-based position within the log
    timestamp: datetime | None = None  # parsed from the line, when present
203
+
204
+
205
@dataclass(slots=True)
class ParsedJobState:
    """Summary of one job's log: derived phase, upload state and notable events."""

    phase: str  # machine phase key (a key of log_parser.PHASE_LABELS)
    phase_label: str  # human-readable phase
    phase_reason: str  # one-sentence explanation for the chosen phase
    upload_status: str  # not_started / running / completed / failed / skipped
    upload_status_label: str  # human-readable upload status
    repo_id: str | None  # Hub repo id extracted from the log, when present
    latest_log_at: datetime | None  # newest timestamp seen in the log
    active_step_name: str | None
    active_step_started_at: datetime | None
    events: list[ParsedLogEvent] = field(default_factory=list)
    failure_excerpt: str | None = None  # log tail, only set for failing jobs
218
+
219
+
220
@dataclass(slots=True)
class MonitorRecord:
    """One monitored (run, job) pair enriched with parsed log state."""

    key: str  # unique identifier for this record
    kernel_name: str
    critical: bool  # kernel is on the configured critical list
    kernel_info: KernelInfo
    workflow_name: str
    workflow_path: str
    run: GitHubRun
    job: GitHubJob
    phase: str  # machine phase key (a key of log_parser.PHASE_LABELS)
    phase_label: str
    phase_reason: str
    upload_status: str  # not_started / running / completed / failed / skipped
    upload_status_label: str
    arch: str
    runner_group: str | None
    suspected_stalled: bool
    stall_reason: str | None
    latest_signal_at: datetime | None  # most recent activity observed for the job
    events: list[ParsedLogEvent] = field(default_factory=list)
    failure_excerpt: str | None = None
    active_step_name: str | None = None
    active_step_started_at: datetime | None = None

    @property
    def is_active(self) -> bool:
        """Mirrors the underlying job's active state."""
        return self.job.is_active

    @property
    def started_at(self) -> datetime | None:
        """Job start time, falling back to the run's start time."""
        return self.job.started_at or self.run.run_started_at

    @property
    def completed_at(self) -> datetime | None:
        """Job completion time (None while still running)."""
        return self.job.completed_at

    @property
    def elapsed_seconds(self) -> float | None:
        """Seconds from start to completion, or to now() while running."""
        start = self.started_at
        if not start:
            return None
        end = self.completed_at or utcnow()
        return max((end - start).total_seconds(), 0.0)
264
+
265
+
266
@dataclass(slots=True)
class KernelRunGroup:
    """All monitored job records belonging to one workflow run of one kernel."""

    kernel_name: str
    run: GitHubRun
    workflow_name: str
    records: list[MonitorRecord]

    @property
    def is_active(self) -> bool:
        """True when at least one job in the run has not completed."""
        return any(item.is_active for item in self.records)

    @property
    def has_failure(self) -> bool:
        """True when any job concluded with a failing conclusion."""
        for item in self.records:
            if (item.job.conclusion or "") in FAILING_CONCLUSIONS:
                return True
        return False

    @property
    def has_stall(self) -> bool:
        """True when any record was flagged as suspected stalled."""
        return any(item.suspected_stalled for item in self.records)

    @property
    def has_uploading(self) -> bool:
        """True when any record is currently uploading."""
        return any(item.upload_status == "running" for item in self.records)

    @property
    def triggered_at(self) -> datetime:
        """When the run started, falling back to its creation time."""
        return self.run.run_started_at or self.run.created_at

    @property
    def latest_update_at(self) -> datetime:
        """Newest signal across records, else the run's updated_at."""
        signals = [item.latest_signal_at for item in self.records if item.latest_signal_at]
        return max(signals) if signals else self.run.updated_at
299
+
300
+
301
@dataclass(slots=True)
class KernelRow:
    """Dashboard table row summarizing one kernel."""

    kernel_name: str
    kernel_info: KernelInfo
    critical: bool
    current_group: KernelRunGroup | None  # currently active run, if any
    recent_groups: list[KernelRunGroup]  # presumably newest first — primary_group uses index 0
    row_status_kind: str
    row_status_label: str
    row_reason: str
    upload_label: str
    last_triggered_at: datetime | None

    @property
    def primary_group(self) -> KernelRunGroup | None:
        """The active group when present, otherwise the first recent group."""
        if self.current_group is not None:
            return self.current_group
        if self.recent_groups:
            return self.recent_groups[0]
        return None

    @property
    def recent_run_count(self) -> int:
        """Number of recent run groups tracked for this kernel."""
        return len(self.recent_groups)
323
+
324
+
325
@dataclass(slots=True)
class DashboardSummary:
    """Headline counters shown at the top of the dashboard."""

    tracked_kernels: int = 0
    active_builds: int = 0
    uploading_builds: int = 0
    stalled_builds: int = 0
    failed_builds: int = 0
    completed_uploads: int = 0
333
+
334
+
335
@dataclass(slots=True)
class DashboardSnapshot:
    """Complete dashboard state produced by one refresh cycle."""

    generated_at: datetime
    summary: DashboardSummary
    kernel_rows: list[KernelRow]
    active_records: list[MonitorRecord]  # records whose jobs are still running
    recent_records: list[MonitorRecord]  # most recent records, bounded by config
    errors: list[str] = field(default_factory=list)  # non-fatal collection errors
src/kc_monitor/service.py ADDED
@@ -0,0 +1,572 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from collections import defaultdict
4
+ import re
5
+ from datetime import timedelta
6
+
7
+ from cachetools import TTLCache
8
+
9
+ from kc_monitor.config import AppConfig
10
+ from kc_monitor.github_client import GitHubActionsClient
11
+ from kc_monitor.kernel_index import KernelIndex
12
+ from kc_monitor.log_parser import JobLogParser, classify_step_name
13
+ from kc_monitor.models import (
14
+ DashboardSnapshot,
15
+ DashboardSummary,
16
+ FAILING_CONCLUSIONS,
17
+ GitHubJob,
18
+ GitHubJobStep,
19
+ GitHubRun,
20
+ KernelInfo,
21
+ KernelRow,
22
+ KernelRunGroup,
23
+ MonitorRecord,
24
+ utcnow,
25
+ )
26
+ from kc_monitor.stall_detector import detect_stall
27
+
28
+
29
# Captures the first comma-delimited token inside parentheses — presumably the
# architecture segment of a job name like "build (x86_64-linux, ...)"; confirm
# against the workflow's job naming scheme.
ARCH_RE = re.compile(r"\(([^,]+),")
30
+
31
+
32
+ class MonitorService:
33
    def __init__(
        self,
        config: AppConfig,
        client: GitHubActionsClient | None = None,
        parser: JobLogParser | None = None,
        kernel_index: KernelIndex | None = None,
    ) -> None:
        """Wire up the monitor's collaborators; injected instances win over defaults."""
        self.config = config
        # Default client talks to the configured GitHub repository.
        self.client = client or GitHubActionsClient(
            owner=config.github.owner,
            repo=config.github.repo,
            token=config.github.token,
            request_timeout_seconds=config.github.request_timeout_seconds,
            user_agent=config.github.user_agent,
        )
        self.parser = parser or JobLogParser()
        self.kernel_index = kernel_index or KernelIndex(self.client, branch=config.github.branch)
        # Single-entry snapshot cache; a TTL floor of 5s guards against a
        # misconfigured zero/negative TTL.
        self._snapshot_cache: TTLCache[str, DashboardSnapshot] = TTLCache(
            maxsize=1,
            ttl=max(5, config.monitor.snapshot_ttl_seconds),
        )
        # Map workflow path -> display label, plus the set of tracked paths.
        self._workflow_labels = {
            workflow.path: workflow.label for workflow in config.workflow_targets
        }
        self._workflow_paths = set(self._workflow_labels)
58
+
59
    def close(self) -> None:
        """Release the underlying GitHub HTTP client."""
        self.client.close()
61
+
62
+ def get_snapshot(self, force_refresh: bool = False) -> DashboardSnapshot:
63
+ if not force_refresh and "snapshot" in self._snapshot_cache:
64
+ return self._snapshot_cache["snapshot"]
65
+
66
+ snapshot = self._build_snapshot()
67
+ self._snapshot_cache["snapshot"] = snapshot
68
+ return snapshot
69
+
70
+ def _build_snapshot(self) -> DashboardSnapshot:
71
+ errors: list[str] = []
72
+ records: list[MonitorRecord] = []
73
+
74
+ kernel_catalog = self.kernel_index.list_kernel_catalog()
75
+ catalog_names = {info.kernel_name for info in kernel_catalog}
76
+ selected_runs = self._collect_runs(catalog_names, errors)
77
+
78
+ if not selected_runs and not errors:
79
+ errors.append("No kernel runs returned from any tracked workflow.")
80
+
81
+ needs_job_detail: set[int] = {run.id for run in selected_runs}
82
+ for run in selected_runs:
83
+ if run.id in needs_job_detail:
84
+ try:
85
+ jobs = self.client.list_jobs(run.id)
86
+ except Exception as exc: # noqa: BLE001
87
+ errors.append(f"Run {run.id}: {exc}")
88
+ records.append(self._build_lightweight_record(run))
89
+ continue
90
+ for job in jobs:
91
+ try:
92
+ records.append(self._build_record(run, job))
93
+ except Exception as exc: # noqa: BLE001
94
+ errors.append(f"Job {job.id}: {exc}")
95
+ else:
96
+ records.append(self._build_lightweight_record(run))
97
+
98
+ records.sort(key=self._record_sort_key)
99
+ active_records = [record for record in records if record.is_active]
100
+ recent_records = records[: self.config.monitor.recent_limit]
101
+ kernel_rows = self._build_kernel_rows(records)
102
+
103
+ summary = DashboardSummary(
104
+ tracked_kernels=len(kernel_rows),
105
+ active_builds=sum(1 for row in kernel_rows if row.current_group is not None),
106
+ uploading_builds=sum(
107
+ 1 for row in kernel_rows if row.current_group is not None and row.current_group.has_uploading
108
+ ),
109
+ stalled_builds=sum(1 for row in kernel_rows if row.row_status_kind == "stalled"),
110
+ failed_builds=sum(
111
+ 1 for row in kernel_rows if any(group.has_failure for group in row.recent_groups)
112
+ ),
113
+ completed_uploads=sum(
114
+ 1
115
+ for row in kernel_rows
116
+ if any(
117
+ any(record.upload_status == "completed" for record in group.records)
118
+ for group in row.recent_groups
119
+ )
120
+ ),
121
+ )
122
+
123
+ return DashboardSnapshot(
124
+ generated_at=utcnow(),
125
+ summary=summary,
126
+ kernel_rows=kernel_rows,
127
+ active_records=active_records,
128
+ recent_records=recent_records,
129
+ errors=errors,
130
+ )
131
+
132
    def _collect_runs(
        self,
        catalog_names: set[str],
        errors: list[str],
    ) -> list[GitHubRun]:
        """Gather the runs to monitor: every active run plus, per workflow, the
        latest completed run for each kernel. Fetch errors are appended to
        ``errors`` and the affected workflow is skipped."""
        latest_by_workflow_kernel: dict[tuple[str, str], GitHubRun] = {}
        active_runs: dict[int, GitHubRun] = {}
        per_page = max(1, self.config.monitor.workflow_run_page_size)
        max_pages = max(1, self.config.monitor.workflow_run_pages)

        for workflow in self.config.workflow_targets:
            seen_for_workflow: set[str] = set()
            for page in range(1, max_pages + 1):
                try:
                    workflow_runs = self.client.list_workflow_runs(
                        workflow.basename,
                        per_page=per_page,
                        page=page,
                    )
                except Exception as exc:  # noqa: BLE001
                    errors.append(f"Workflow {workflow.label}: {exc}")
                    break

                if not workflow_runs:
                    break

                for run in workflow_runs:
                    # Drop runs whose workflow path is not tracked.
                    if run.path not in self._workflow_paths:
                        continue

                    kernel = KernelIndex.infer_kernel_name(run)
                    if not kernel:
                        continue
                    if catalog_names and kernel not in catalog_names:
                        continue

                    seen_for_workflow.add(kernel)
                    if run.is_active:
                        active_runs[run.id] = run
                        continue

                    # First completed run kept per (workflow, kernel) —
                    # assumes the API yields runs newest-first; confirm.
                    key = (workflow.path, kernel)
                    if key not in latest_by_workflow_kernel:
                        latest_by_workflow_kernel[key] = run

                # A short page means there are no further pages.
                if len(workflow_runs) < per_page:
                    break
                # Stop paging early once every catalog kernel has been seen.
                if catalog_names and seen_for_workflow >= catalog_names:
                    break

        selected = list(active_runs.values())
        selected.extend(latest_by_workflow_kernel.values())
        # Deduplicate by run id, then order active runs first, newest first.
        deduped = {run.id: run for run in selected}
        return sorted(deduped.values(), key=lambda run: (0 if run.is_active else 1, -run.sort_time.timestamp()))
186
+
187
+ def _filter_runs(self, runs: list[GitHubRun]) -> list[GitHubRun]:
188
+ now = utcnow()
189
+ cutoff = now - timedelta(hours=self.config.monitor.recent_completed_hours)
190
+ filtered: list[GitHubRun] = []
191
+ completed_counts: dict[str, int] = {}
192
+ for run in runs:
193
+ if run.path not in self._workflow_paths:
194
+ continue
195
+ if run.is_active:
196
+ filtered.append(run)
197
+ continue
198
+
199
+ if run.updated_at < cutoff:
200
+ continue
201
+
202
+ seen = completed_counts.get(run.path, 0)
203
+ if seen >= self.config.monitor.completed_runs_per_workflow:
204
+ continue
205
+
206
+ completed_counts[run.path] = seen + 1
207
+ filtered.append(run)
208
+ return filtered
209
+
210
+ def _build_lightweight_record(self, run: GitHubRun) -> MonitorRecord:
211
+ kernel_name = KernelIndex.infer_kernel_name(run) or "unknown"
212
+ kernel_info = self._kernel_info_for(kernel_name, None)
213
+ critical = kernel_name in self.config.monitor.critical_kernel_set
214
+ conclusion = run.conclusion or ""
215
+
216
+ if conclusion == "success":
217
+ phase, phase_label = "completed", "Completed"
218
+ elif conclusion == "failure":
219
+ phase, phase_label = "failed", "Failed"
220
+ elif conclusion == "cancelled":
221
+ phase, phase_label = "cancelled", "Cancelled"
222
+ elif run.is_active:
223
+ phase, phase_label = "running", "Running"
224
+ else:
225
+ phase, phase_label = "completed", conclusion.title() or "Done"
226
+
227
+ stub_job = GitHubJob(
228
+ id=0, run_id=run.id, workflow_name=run.name, head_branch=run.head_branch,
229
+ run_url=run.html_url, run_attempt=1, head_sha=run.head_sha, url="",
230
+ html_url=run.html_url, status=run.status, conclusion=run.conclusion,
231
+ created_at=run.created_at, started_at=run.run_started_at,
232
+ completed_at=run.updated_at, name=run.name, steps=[],
233
+ )
234
+
235
+ return MonitorRecord(
236
+ key=f"{run.id}:0",
237
+ kernel_name=kernel_name,
238
+ critical=critical,
239
+ kernel_info=kernel_info,
240
+ workflow_name=self._workflow_labels.get(run.path, run.name),
241
+ workflow_path=run.path,
242
+ run=run,
243
+ job=stub_job,
244
+ phase=phase,
245
+ phase_label=phase_label,
246
+ phase_reason=f"Run {conclusion or run.status} (summary only).",
247
+ upload_status="not_started",
248
+ upload_status_label="Unknown",
249
+ arch="all",
250
+ runner_group=None,
251
+ suspected_stalled=False,
252
+ stall_reason=None,
253
+ latest_signal_at=run.updated_at,
254
+ )
255
+
256
    def _build_record(self, run: GitHubRun, job: GitHubJob) -> MonitorRecord:
        """Build a fully-detailed record for one job of a workflow run.

        Downloads logs only for active or failing jobs, parses them for phase
        and upload progress, then runs stall detection on the assembled
        record.
        """
        job = self._normalize_job(run, job)
        log_text = None
        if self._should_fetch_logs(job):
            log_text = self.client.get_job_logs(
                job.id,
                line_limit=self.config.monitor.log_line_limit,
                char_limit=self.config.monitor.log_char_limit,
                job_html_url=job.html_url,
            )

        parsed = self.parser.parse(
            run,
            job,
            log_text,
            event_limit=self.config.monitor.detail_event_limit,
        )

        kernel_name = KernelIndex.infer_kernel_name(run) or "unknown"
        kernel_info = self._kernel_info_for(kernel_name, parsed.repo_id)
        # Freshest heartbeat available: log timestamp, then run update time,
        # then the job's start time.
        latest_signal_at = parsed.latest_log_at or run.updated_at or job.started_at
        critical = kernel_name in self.config.monitor.critical_kernel_set

        record = MonitorRecord(
            key=f"{run.id}:{job.id}",
            kernel_name=kernel_name,
            critical=critical,
            kernel_info=kernel_info,
            workflow_name=self._workflow_labels.get(run.path, run.name),
            workflow_path=run.path,
            run=run,
            job=job,
            phase=parsed.phase,
            phase_label=parsed.phase_label,
            phase_reason=parsed.phase_reason,
            upload_status=parsed.upload_status,
            upload_status_label=parsed.upload_status_label,
            arch=self._extract_arch(job.name),
            runner_group=job.runner_group_name,
            suspected_stalled=False,
            stall_reason=None,
            latest_signal_at=latest_signal_at,
            events=parsed.events,
            failure_excerpt=parsed.failure_excerpt,
            active_step_name=parsed.active_step_name,
            active_step_started_at=parsed.active_step_started_at,
        )
        # Stall detection inspects the whole record, so it runs after
        # construction and mutates the two stall fields in place.
        stalled, stall_reason = detect_stall(record, self.config.monitor)
        record.suspected_stalled = stalled
        record.stall_reason = stall_reason
        return record
307
+
308
+ @staticmethod
309
+ def _normalize_job(run: GitHubRun, job: GitHubJob) -> GitHubJob:
310
+ if job.steps:
311
+ return job
312
+
313
+ started_at = run.run_started_at or run.created_at
314
+ completed_at = None if job.is_active else run.updated_at
315
+ synthetic_steps: list[GitHubJobStep] = []
316
+
317
+ if run.path.endswith("build-release.yaml"):
318
+ synthetic_steps.append(
319
+ GitHubJobStep(
320
+ name="Build and upload kernel",
321
+ status=job.status,
322
+ conclusion=job.conclusion,
323
+ number=1,
324
+ started_at=started_at,
325
+ completed_at=completed_at,
326
+ )
327
+ )
328
+ if (job.conclusion or "") == "success":
329
+ synthetic_steps.append(
330
+ GitHubJobStep(
331
+ name="Upload v1 kernels to main",
332
+ status="completed",
333
+ conclusion="success",
334
+ number=2,
335
+ started_at=completed_at or started_at,
336
+ completed_at=completed_at,
337
+ )
338
+ )
339
+ elif run.path.endswith("manual-build-upload.yaml"):
340
+ synthetic_steps.append(
341
+ GitHubJobStep(
342
+ name="Build and copy kernel",
343
+ status=job.status,
344
+ conclusion=job.conclusion,
345
+ number=1,
346
+ started_at=started_at,
347
+ completed_at=completed_at,
348
+ )
349
+ )
350
+ if (job.conclusion or "") == "success":
351
+ synthetic_steps.append(
352
+ GitHubJobStep(
353
+ name="Upload kernel",
354
+ status="completed",
355
+ conclusion="success",
356
+ number=2,
357
+ started_at=completed_at or started_at,
358
+ completed_at=completed_at,
359
+ )
360
+ )
361
+
362
+ if synthetic_steps:
363
+ job.steps = synthetic_steps
364
+ return job
365
+
366
+ def _kernel_info_for(self, kernel_name: str, parsed_repo_id: str | None) -> KernelInfo:
367
+ if kernel_name == "unknown":
368
+ repo_id = parsed_repo_id or f"{self.config.github.owner}/{self.config.github.repo}"
369
+ return KernelInfo(
370
+ kernel_name=kernel_name,
371
+ repo_id=repo_id,
372
+ hub_url=f"https://huggingface.co/{repo_id}",
373
+ )
374
+
375
+ info = self.kernel_index.get_kernel_info(kernel_name)
376
+ if not parsed_repo_id or parsed_repo_id == info.repo_id:
377
+ return info
378
+
379
+ return KernelInfo(
380
+ kernel_name=info.kernel_name,
381
+ repo_id=parsed_repo_id,
382
+ hub_url=f"https://huggingface.co/{parsed_repo_id}",
383
+ version=info.version,
384
+ backends=info.backends,
385
+ )
386
+
387
+ def _should_fetch_logs(self, job: GitHubJob) -> bool:
388
+ if job.is_active:
389
+ return True
390
+ if (job.conclusion or "") in FAILING_CONCLUSIONS:
391
+ return True
392
+ return False
393
+
394
+ @staticmethod
395
+ def _extract_arch(job_name: str) -> str:
396
+ match = ARCH_RE.search(job_name)
397
+ if match:
398
+ return match.group(1).strip()
399
+ return "n/a"
400
+
401
+ @staticmethod
402
+ def _record_sort_key(record: MonitorRecord) -> tuple[int, int, float]:
403
+ started_at = record.started_at or utcnow()
404
+ return (
405
+ 0 if record.is_active else 1,
406
+ 0 if record.critical else 1,
407
+ -started_at.timestamp(),
408
+ )
409
+
410
    def _build_kernel_rows(self, records: list[MonitorRecord]) -> list[KernelRow]:
        """Aggregate per-job records into one dashboard row per kernel.

        Every catalog kernel gets a row even with no records (rendered idle);
        kernels that only appear in records are added too. Rows are sorted by
        urgency via ``_kernel_row_sort_key``.
        """
        grouped_records: dict[str, list[MonitorRecord]] = defaultdict(list)
        for record in records:
            grouped_records[record.kernel_name].append(record)

        # Seed with catalog metadata, then let record-derived info win —
        # it may carry a repo override parsed from job logs.
        info_map = {info.kernel_name: info for info in self.kernel_index.list_kernel_catalog()}
        for record in records:
            info_map[record.kernel_name] = record.kernel_info

        rows: list[KernelRow] = []
        for kernel_name, kernel_info in info_map.items():
            kernel_records = sorted(grouped_records.get(kernel_name, []), key=self._record_sort_key)
            recent_groups = self._group_kernel_runs(kernel_name, kernel_records)
            # The newest active group (if any) drives the row's live status.
            current_group = next((group for group in recent_groups if group.is_active), None)
            row_status_kind, row_status_label, row_reason, upload_label = self._summarize_kernel(
                current_group,
                recent_groups,
            )
            rows.append(
                KernelRow(
                    kernel_name=kernel_name,
                    kernel_info=kernel_info,
                    critical=kernel_name in self.config.monitor.critical_kernel_set,
                    current_group=current_group,
                    recent_groups=recent_groups,
                    row_status_kind=row_status_kind,
                    row_status_label=row_status_label,
                    row_reason=row_reason,
                    upload_label=upload_label,
                    last_triggered_at=recent_groups[0].triggered_at if recent_groups else None,
                )
            )

        rows.sort(key=self._kernel_row_sort_key)
        return rows
445
+
446
+ def _group_kernel_runs(
447
+ self,
448
+ kernel_name: str,
449
+ records: list[MonitorRecord],
450
+ ) -> list[KernelRunGroup]:
451
+ grouped: dict[int, list[MonitorRecord]] = defaultdict(list)
452
+ run_lookup: dict[int, GitHubRun] = {}
453
+ workflow_lookup: dict[int, str] = {}
454
+ for record in records:
455
+ grouped[record.run.id].append(record)
456
+ run_lookup[record.run.id] = record.run
457
+ workflow_lookup[record.run.id] = record.workflow_name
458
+
459
+ groups: list[KernelRunGroup] = []
460
+ for run_id, run_records in grouped.items():
461
+ sorted_records = sorted(
462
+ run_records,
463
+ key=lambda record: (
464
+ 0 if record.is_active else 1,
465
+ 0 if record.arch == "x86_64-linux" else 1,
466
+ record.arch,
467
+ ),
468
+ )
469
+ groups.append(
470
+ KernelRunGroup(
471
+ kernel_name=kernel_name,
472
+ run=run_lookup[run_id],
473
+ workflow_name=workflow_lookup[run_id],
474
+ records=sorted_records,
475
+ )
476
+ )
477
+
478
+ groups.sort(key=lambda group: -group.triggered_at.timestamp())
479
+ return groups
480
+
481
+ @staticmethod
482
+ def _summarize_kernel(
483
+ current_group: KernelRunGroup | None,
484
+ recent_groups: list[KernelRunGroup],
485
+ ) -> tuple[str, str, str, str]:
486
+ if current_group is not None:
487
+ if current_group.has_stall:
488
+ status_kind = "stalled"
489
+ status_label = "Stalled"
490
+ elif current_group.has_uploading:
491
+ status_kind = "uploading"
492
+ status_label = "Uploading"
493
+ else:
494
+ status_kind = "running"
495
+ status_label = "Running"
496
+ return (
497
+ status_kind,
498
+ status_label,
499
+ MonitorService._arch_summary(current_group.records),
500
+ MonitorService._upload_summary(current_group.records),
501
+ )
502
+
503
+ if not recent_groups:
504
+ return ("idle", "Idle", "No recent tracked CI run.", "No recent upload")
505
+
506
+ latest_group = recent_groups[0]
507
+ if latest_group.has_failure:
508
+ status_kind = "failed"
509
+ status_label = "Failed"
510
+ elif any(record.upload_status == "completed" for record in latest_group.records):
511
+ status_kind = "completed"
512
+ status_label = "Completed"
513
+ elif all((record.job.conclusion or "") == "cancelled" for record in latest_group.records):
514
+ status_kind = "cancelled"
515
+ status_label = "Cancelled"
516
+ else:
517
+ status_kind = "recent"
518
+ status_label = "Recent"
519
+
520
+ return (
521
+ status_kind,
522
+ status_label,
523
+ MonitorService._arch_summary(latest_group.records),
524
+ MonitorService._upload_summary(latest_group.records),
525
+ )
526
+
527
+ @staticmethod
528
+ def _arch_summary(records: list[MonitorRecord]) -> str:
529
+ if not records:
530
+ return "No job details."
531
+ return " | ".join(
532
+ f"{MonitorService._short_arch(record.arch)}: {record.phase_label}"
533
+ for record in records
534
+ )
535
+
536
+ @staticmethod
537
+ def _upload_summary(records: list[MonitorRecord]) -> str:
538
+ if not records:
539
+ return "No upload"
540
+ return " | ".join(
541
+ f"{MonitorService._short_arch(record.arch)}: {record.upload_status_label}"
542
+ for record in records
543
+ )
544
+
545
+ @staticmethod
546
+ def _short_arch(arch: str) -> str:
547
+ mapping = {
548
+ "x86_64-linux": "x86",
549
+ "aarch64-linux": "arm64",
550
+ "x86_64-darwin": "mac",
551
+ "aarch64-darwin": "mac-arm",
552
+ }
553
+ return mapping.get(arch, arch)
554
+
555
+ @staticmethod
556
+ def _kernel_row_sort_key(row: KernelRow) -> tuple[int, int, int, str]:
557
+ status_rank = {
558
+ "stalled": 0,
559
+ "uploading": 1,
560
+ "running": 2,
561
+ "failed": 3,
562
+ "completed": 4,
563
+ "cancelled": 5,
564
+ "recent": 6,
565
+ "idle": 7,
566
+ }
567
+ return (
568
+ status_rank.get(row.row_status_kind, 99),
569
+ 0 if row.critical else 1,
570
+ 0 if row.last_triggered_at else 1,
571
+ row.kernel_name,
572
+ )
src/kc_monitor/stall_detector.py ADDED
@@ -0,0 +1,48 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from datetime import datetime, timedelta
4
+
5
+ from kc_monitor.config import MonitorSettings
6
+ from kc_monitor.models import MonitorRecord, utcnow
7
+
8
+
9
+ ACTIVE_PHASES = {"building", "uploading", "testing"}
10
+
11
+
12
+ def _format_duration(delta: timedelta) -> str:
13
+ total_seconds = int(delta.total_seconds())
14
+ if total_seconds < 60:
15
+ return f"{total_seconds}s"
16
+ if total_seconds < 3600:
17
+ return f"{total_seconds // 60}m"
18
+ hours, remainder = divmod(total_seconds, 3600)
19
+ minutes = remainder // 60
20
+ if minutes:
21
+ return f"{hours}h {minutes}m"
22
+ return f"{hours}h"
23
+
24
+
25
+ def detect_stall(
26
+ record: MonitorRecord,
27
+ settings: MonitorSettings,
28
+ now: datetime | None = None,
29
+ ) -> tuple[bool, str | None]:
30
+ if not record.is_active:
31
+ return False, None
32
+
33
+ if record.phase not in ACTIVE_PHASES:
34
+ return False, None
35
+
36
+ now = now or utcnow()
37
+ latest_signal = record.latest_signal_at or record.run.updated_at or record.started_at
38
+ if latest_signal:
39
+ silent_for = now - latest_signal
40
+ if silent_for >= timedelta(minutes=settings.stall_without_log_minutes):
41
+ return True, f"No fresh signal for { _format_duration(silent_for) }."
42
+
43
+ if record.active_step_started_at:
44
+ phase_duration = now - record.active_step_started_at
45
+ if phase_duration >= timedelta(minutes=settings.stall_active_phase_minutes):
46
+ return True, f"{record.phase_label} has been running for { _format_duration(phase_duration) }."
47
+
48
+ return False, None
src/kc_monitor/ui.py ADDED
@@ -0,0 +1,1110 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import html
4
+ import re
5
+ from datetime import datetime, timezone
6
+
7
+ import gradio as gr
8
+
9
+ from kc_monitor.config import AppConfig
10
+ from kc_monitor.grafana import build_dashboard_url, dashboard_catalog
11
+ from kc_monitor.models import DashboardSnapshot, KernelRow, KernelRunGroup, MonitorRecord
12
+ from kc_monitor.service import MonitorService
13
+
14
+
15
# Matches a parenthesised variant suffix in a display name, e.g. "name (cuda)".
VARIANT_RE = re.compile(r"\(([^)]+)\)")

# Base Gradio theme; visual styling is carried by the custom CSS constant below.
THEME = gr.themes.Base()
18
+
19
+ PAGE_JS = """
20
+ function kcmBoot() {
21
+ if (window._kcmBooted) return;
22
+ window._kcmBooted = true;
23
+
24
+ function applyFilters() {
25
+ var search = document.querySelector('.kcm-search');
26
+ var status = document.querySelector('.kcm-status-filter');
27
+ var searchValue = search ? search.value.toLowerCase().trim() : '';
28
+ var statusValue = status ? status.value : 'all';
29
+
30
+ document.querySelectorAll('#kernelTable tbody tr[data-idx]').forEach(function(row) {
31
+ var kernel = (row.getAttribute('data-kernel') || '').toLowerCase();
32
+ var rowStatus = row.getAttribute('data-status') || 'all';
33
+ var workflow = (row.getAttribute('data-workflow') || '').toLowerCase();
34
+ var searchOk = !searchValue || kernel.indexOf(searchValue) >= 0 || workflow.indexOf(searchValue) >= 0;
35
+ var statusOk = statusValue === 'all' || rowStatus === statusValue;
36
+ row.style.display = searchOk && statusOk ? '' : 'none';
37
+ });
38
+ }
39
+
40
+ document.addEventListener('click', function(e) {
41
+ var row = e.target.closest('tr[data-idx]');
42
+ if (row && !e.target.closest('a')) {
43
+ var idx = row.getAttribute('data-idx');
44
+ var el = document.getElementById('modal-content-' + idx);
45
+ if (!el) return;
46
+ document.getElementById('kcmModal').innerHTML = el.innerHTML;
47
+ document.getElementById('kcmOverlay').classList.add('open');
48
+ document.body.style.overflow = 'hidden';
49
+ return;
50
+ }
51
+ if (e.target.closest('.kcm-modal-close') || e.target.id === 'kcmOverlay') {
52
+ document.getElementById('kcmOverlay').classList.remove('open');
53
+ document.body.style.overflow = '';
54
+ }
55
+ });
56
+
57
+ document.addEventListener('input', function(e) {
58
+ if (e.target.classList.contains('kcm-search')) applyFilters();
59
+ });
60
+
61
+ document.addEventListener('change', function(e) {
62
+ if (e.target.classList.contains('kcm-status-filter')) applyFilters();
63
+ });
64
+
65
+ document.addEventListener('keydown', function(e) {
66
+ if (e.key === 'Escape') {
67
+ document.getElementById('kcmOverlay').classList.remove('open');
68
+ document.body.style.overflow = '';
69
+ }
70
+ });
71
+
72
+ applyFilters();
73
+ }
74
+
75
+ kcmBoot();
76
+ new MutationObserver(function() {
77
+ window._kcmBooted = false;
78
+ kcmBoot();
79
+ }).observe(document.body, { childList: true, subtree: true });
80
+ """
81
+
82
+ CSS = """
83
+ :root {
84
+ --bg: #050711;
85
+ --surface: rgba(11, 16, 30, 0.92);
86
+ --surface-2: rgba(14, 22, 40, 0.94);
87
+ --surface-3: rgba(19, 29, 53, 0.98);
88
+ --surface-hover: rgba(121, 171, 255, 0.06);
89
+ --text: #f4f7ff;
90
+ --text-secondary: #98a7c4;
91
+ --text-tertiary: #6d7b98;
92
+ --accent: #86b0ff;
93
+ --accent-2: #6ff0c0;
94
+ --ok: #74efab;
95
+ --warn: #ffca6d;
96
+ --bad: #ff808e;
97
+ --border: rgba(255, 255, 255, 0.08);
98
+ --border-strong: rgba(255, 255, 255, 0.14);
99
+ --radius: 24px;
100
+ --radius-sm: 16px;
101
+ --shadow: 0 28px 90px rgba(0, 0, 0, 0.32);
102
+ }
103
+
104
+ *,
105
+ *::before,
106
+ *::after {
107
+ box-sizing: border-box;
108
+ }
109
+
110
+ body,
111
+ .gradio-container {
112
+ background:
113
+ radial-gradient(circle at 0% 0%, rgba(134, 176, 255, 0.18), transparent 28%),
114
+ radial-gradient(circle at 100% 0%, rgba(111, 240, 192, 0.10), transparent 30%),
115
+ radial-gradient(circle at 50% 100%, rgba(110, 130, 255, 0.08), transparent 40%),
116
+ #050711 !important;
117
+ color: var(--text);
118
+ font-family: "Inter", -apple-system, BlinkMacSystemFont, sans-serif;
119
+ }
120
+
121
+ a {
122
+ color: var(--accent);
123
+ text-decoration: none;
124
+ }
125
+
126
+ a:hover {
127
+ text-decoration: underline;
128
+ }
129
+
130
+ .kcm-shell {
131
+ max-width: 1540px;
132
+ margin: 0 auto;
133
+ padding: 18px 20px 28px;
134
+ }
135
+
136
+ .kcm-hero {
137
+ position: relative;
138
+ overflow: hidden;
139
+ background:
140
+ linear-gradient(135deg, rgba(134, 176, 255, 0.14), rgba(111, 240, 192, 0.06)),
141
+ var(--surface);
142
+ border: 1px solid var(--border);
143
+ border-radius: 30px;
144
+ padding: 30px 34px;
145
+ box-shadow: var(--shadow);
146
+ }
147
+
148
+ .kcm-hero::after {
149
+ content: "";
150
+ position: absolute;
151
+ inset: auto -80px -120px auto;
152
+ width: 320px;
153
+ height: 320px;
154
+ border-radius: 50%;
155
+ background: radial-gradient(circle, rgba(111, 240, 192, 0.16), transparent 62%);
156
+ pointer-events: none;
157
+ }
158
+
159
+ .kcm-eyebrow {
160
+ color: var(--accent-2);
161
+ font-size: 11px;
162
+ text-transform: uppercase;
163
+ letter-spacing: 0.16em;
164
+ }
165
+
166
+ .kcm-hero h1 {
167
+ margin: 10px 0 0;
168
+ font-size: 38px;
169
+ line-height: 1.05;
170
+ letter-spacing: -0.05em;
171
+ }
172
+
173
+ .kcm-hero p {
174
+ margin: 12px 0 0;
175
+ max-width: 1040px;
176
+ color: var(--text-secondary);
177
+ font-size: 15px;
178
+ line-height: 1.65;
179
+ }
180
+
181
+ .kcm-meta,
182
+ .kcm-stats,
183
+ .kcm-graphs {
184
+ display: grid;
185
+ gap: 12px;
186
+ }
187
+
188
+ .kcm-meta {
189
+ grid-template-columns: repeat(3, minmax(0, 1fr));
190
+ margin-top: 18px;
191
+ }
192
+
193
+ .kcm-stats {
194
+ grid-template-columns: repeat(5, minmax(0, 1fr));
195
+ margin-top: 18px;
196
+ }
197
+
198
+ .kcm-meta-card,
199
+ .kcm-stat,
200
+ .kcm-panel-link {
201
+ background: rgba(255, 255, 255, 0.04);
202
+ border: 1px solid var(--border);
203
+ border-radius: 20px;
204
+ padding: 16px 18px;
205
+ }
206
+
207
+ .kcm-meta-card-label,
208
+ .kcm-stat-label {
209
+ font-size: 11px;
210
+ text-transform: uppercase;
211
+ letter-spacing: 0.10em;
212
+ color: var(--text-tertiary);
213
+ }
214
+
215
+ .kcm-meta-card-value {
216
+ margin-top: 8px;
217
+ font-size: 14px;
218
+ color: var(--text-secondary);
219
+ word-break: break-word;
220
+ }
221
+
222
+ .kcm-stat-value {
223
+ margin-top: 8px;
224
+ font-size: 30px;
225
+ font-weight: 700;
226
+ letter-spacing: -0.03em;
227
+ }
228
+
229
+ .kcm-toolbar {
230
+ margin-top: 18px;
231
+ display: flex;
232
+ justify-content: space-between;
233
+ align-items: center;
234
+ gap: 14px;
235
+ }
236
+
237
+ .kcm-toolbar-left {
238
+ color: var(--text-tertiary);
239
+ font-size: 13px;
240
+ }
241
+
242
+ .kcm-toolbar-left code {
243
+ padding: 3px 8px;
244
+ background: rgba(255, 255, 255, 0.05);
245
+ border-radius: 999px;
246
+ color: var(--text-secondary);
247
+ }
248
+
249
+ .kcm-toolbar-right {
250
+ display: flex;
251
+ align-items: center;
252
+ gap: 10px;
253
+ }
254
+
255
+ .kcm-search,
256
+ .kcm-status-filter {
257
+ background: var(--surface-2);
258
+ border: 1px solid var(--border);
259
+ border-radius: 14px;
260
+ padding: 10px 14px;
261
+ color: var(--text);
262
+ font-size: 14px;
263
+ outline: none;
264
+ }
265
+
266
+ .kcm-search {
267
+ min-width: 260px;
268
+ }
269
+
270
+ .kcm-table-shell {
271
+ margin-top: 16px;
272
+ background: var(--surface);
273
+ border: 1px solid var(--border);
274
+ border-radius: 26px;
275
+ overflow: hidden;
276
+ box-shadow: var(--shadow);
277
+ }
278
+
279
+ .kcm-table-wrap {
280
+ overflow-x: auto;
281
+ }
282
+
283
+ .kcm-table {
284
+ width: 100%;
285
+ border-collapse: separate;
286
+ border-spacing: 0;
287
+ }
288
+
289
+ .kcm-table th {
290
+ position: sticky;
291
+ top: 0;
292
+ z-index: 2;
293
+ text-align: left;
294
+ padding: 14px 16px;
295
+ font-size: 11px;
296
+ text-transform: uppercase;
297
+ letter-spacing: 0.12em;
298
+ color: var(--text-tertiary);
299
+ background: rgba(7, 11, 23, 0.96);
300
+ border-bottom: 1px solid var(--border-strong);
301
+ }
302
+
303
+ .kcm-table td {
304
+ padding: 16px;
305
+ vertical-align: top;
306
+ border-bottom: 1px solid var(--border);
307
+ font-size: 14px;
308
+ }
309
+
310
+ .kcm-table tbody tr {
311
+ cursor: pointer;
312
+ transition: background 0.16s ease;
313
+ }
314
+
315
+ .kcm-table tbody tr:hover td {
316
+ background: var(--surface-hover);
317
+ }
318
+
319
+ .kcm-table tbody tr:last-child td {
320
+ border-bottom: none;
321
+ }
322
+
323
+ .kcm-kernel-name {
324
+ font-size: 16px;
325
+ font-weight: 700;
326
+ letter-spacing: -0.02em;
327
+ }
328
+
329
+ .kcm-kernel-meta,
330
+ .kcm-subtle,
331
+ .kcm-activity-sub {
332
+ margin-top: 4px;
333
+ color: var(--text-tertiary);
334
+ font-size: 12px;
335
+ line-height: 1.45;
336
+ }
337
+
338
+ .kcm-badges,
339
+ .kcm-variant-stack,
340
+ .kcm-actions {
341
+ display: flex;
342
+ flex-wrap: wrap;
343
+ gap: 8px;
344
+ }
345
+
346
+ .kcm-badge {
347
+ display: inline-flex;
348
+ align-items: center;
349
+ gap: 6px;
350
+ padding: 5px 10px;
351
+ border-radius: 999px;
352
+ font-size: 11px;
353
+ font-weight: 700;
354
+ white-space: nowrap;
355
+ border: 1px solid transparent;
356
+ }
357
+
358
+ .kcm-badge.ok {
359
+ color: var(--ok);
360
+ background: rgba(116, 239, 171, 0.10);
361
+ border-color: rgba(116, 239, 171, 0.14);
362
+ }
363
+
364
+ .kcm-badge.warn {
365
+ color: var(--warn);
366
+ background: rgba(255, 202, 109, 0.10);
367
+ border-color: rgba(255, 202, 109, 0.15);
368
+ }
369
+
370
+ .kcm-badge.bad {
371
+ color: var(--bad);
372
+ background: rgba(255, 128, 142, 0.10);
373
+ border-color: rgba(255, 128, 142, 0.14);
374
+ }
375
+
376
+ .kcm-badge.info {
377
+ color: var(--accent);
378
+ background: rgba(134, 176, 255, 0.12);
379
+ border-color: rgba(134, 176, 255, 0.16);
380
+ }
381
+
382
+ .kcm-badge.muted {
383
+ color: var(--text-tertiary);
384
+ background: rgba(255, 255, 255, 0.05);
385
+ border-color: rgba(255, 255, 255, 0.06);
386
+ }
387
+
388
+ .kcm-badge.critical {
389
+ color: var(--bad);
390
+ background: rgba(255, 128, 142, 0.10);
391
+ border-color: rgba(255, 128, 142, 0.14);
392
+ text-transform: uppercase;
393
+ letter-spacing: 0.12em;
394
+ }
395
+
396
+ .kcm-variant {
397
+ min-width: 180px;
398
+ padding: 10px 12px;
399
+ border-radius: 16px;
400
+ background: rgba(255, 255, 255, 0.04);
401
+ border: 1px solid var(--border);
402
+ }
403
+
404
+ .kcm-variant-head {
405
+ display: flex;
406
+ justify-content: space-between;
407
+ gap: 8px;
408
+ align-items: center;
409
+ }
410
+
411
+ .kcm-variant-name {
412
+ font-size: 12px;
413
+ font-weight: 700;
414
+ }
415
+
416
+ .kcm-variant-sub {
417
+ margin-top: 6px;
418
+ font-size: 11px;
419
+ color: var(--text-tertiary);
420
+ line-height: 1.45;
421
+ }
422
+
423
+ .kcm-action {
424
+ display: inline-flex;
425
+ align-items: center;
426
+ padding: 8px 12px;
427
+ border-radius: 12px;
428
+ background: rgba(255, 255, 255, 0.05);
429
+ border: 1px solid var(--border);
430
+ color: var(--text-secondary);
431
+ font-size: 12px;
432
+ font-weight: 600;
433
+ }
434
+
435
+ .kcm-action:hover {
436
+ text-decoration: none;
437
+ border-color: var(--border-strong);
438
+ color: var(--text);
439
+ }
440
+
441
+ .kcm-section {
442
+ margin-top: 22px;
443
+ }
444
+
445
+ .kcm-section-title {
446
+ margin: 0 0 12px;
447
+ font-size: 18px;
448
+ letter-spacing: -0.02em;
449
+ }
450
+
451
+ .kcm-graphs {
452
+ grid-template-columns: repeat(3, minmax(0, 1fr));
453
+ }
454
+
455
+ .kcm-panel-link {
456
+ transition: transform 0.15s ease, border-color 0.15s ease;
457
+ }
458
+
459
+ .kcm-panel-link:hover {
460
+ transform: translateY(-2px);
461
+ border-color: var(--border-strong);
462
+ text-decoration: none;
463
+ }
464
+
465
+ .kcm-panel-label {
466
+ color: var(--accent-2);
467
+ font-size: 11px;
468
+ text-transform: uppercase;
469
+ letter-spacing: 0.12em;
470
+ }
471
+
472
+ .kcm-panel-title {
473
+ margin-top: 8px;
474
+ font-size: 18px;
475
+ font-weight: 700;
476
+ }
477
+
478
+ .kcm-panel-copy {
479
+ margin-top: 8px;
480
+ color: var(--text-secondary);
481
+ font-size: 13px;
482
+ line-height: 1.55;
483
+ }
484
+
485
+ .kcm-frame {
486
+ margin-top: 16px;
487
+ background: var(--surface-3);
488
+ border: 1px solid var(--border);
489
+ border-radius: 24px;
490
+ overflow: hidden;
491
+ box-shadow: var(--shadow);
492
+ }
493
+
494
+ .kcm-frame-head {
495
+ padding: 14px 18px;
496
+ display: flex;
497
+ justify-content: space-between;
498
+ align-items: center;
499
+ gap: 12px;
500
+ border-bottom: 1px solid var(--border);
501
+ }
502
+
503
+ .kcm-frame-title {
504
+ font-size: 15px;
505
+ font-weight: 700;
506
+ }
507
+
508
+ .kcm-frame-copy {
509
+ font-size: 13px;
510
+ color: var(--text-secondary);
511
+ line-height: 1.45;
512
+ }
513
+
514
+ .kcm-open {
515
+ font-size: 12px;
516
+ font-weight: 700;
517
+ }
518
+
519
+ .kcm-frame iframe {
520
+ display: block;
521
+ width: 100%;
522
+ border: none;
523
+ background: #0b1020;
524
+ }
525
+
526
+ .kcm-overlay {
527
+ position: fixed;
528
+ inset: 0;
529
+ z-index: 9999;
530
+ display: none;
531
+ padding: 26px 16px;
532
+ overflow-y: auto;
533
+ background: rgba(4, 7, 16, 0.82);
534
+ backdrop-filter: blur(16px);
535
+ }
536
+
537
+ .kcm-overlay.open {
538
+ display: block;
539
+ }
540
+
541
+ .kcm-modal {
542
+ max-width: 1180px;
543
+ margin: 0 auto;
544
+ background: var(--surface-3);
545
+ border: 1px solid var(--border-strong);
546
+ border-radius: 28px;
547
+ overflow: hidden;
548
+ box-shadow: 0 40px 140px rgba(0, 0, 0, 0.42);
549
+ }
550
+
551
+ .kcm-modal-header {
552
+ padding: 24px 28px;
553
+ border-bottom: 1px solid var(--border);
554
+ display: flex;
555
+ justify-content: space-between;
556
+ align-items: flex-start;
557
+ gap: 20px;
558
+ }
559
+
560
+ .kcm-modal-header h2 {
561
+ margin: 0;
562
+ font-size: 28px;
563
+ letter-spacing: -0.04em;
564
+ }
565
+
566
+ .kcm-modal-header p {
567
+ margin: 8px 0 0;
568
+ color: var(--text-secondary);
569
+ font-size: 14px;
570
+ line-height: 1.55;
571
+ }
572
+
573
+ .kcm-modal-close {
574
+ padding: 9px 14px;
575
+ border-radius: 12px;
576
+ border: 1px solid var(--border);
577
+ background: rgba(255, 255, 255, 0.05);
578
+ color: var(--text-secondary);
579
+ cursor: pointer;
580
+ font-size: 12px;
581
+ font-weight: 700;
582
+ }
583
+
584
+ .kcm-modal-body {
585
+ padding: 24px 28px 30px;
586
+ }
587
+
588
+ .kcm-run-card {
589
+ margin-top: 14px;
590
+ background: rgba(255, 255, 255, 0.03);
591
+ border: 1px solid var(--border);
592
+ border-radius: 22px;
593
+ padding: 18px;
594
+ }
595
+
596
+ .kcm-run-card-head {
597
+ display: flex;
598
+ justify-content: space-between;
599
+ align-items: flex-start;
600
+ gap: 14px;
601
+ margin-bottom: 14px;
602
+ }
603
+
604
+ .kcm-run-card-title {
605
+ font-size: 16px;
606
+ font-weight: 700;
607
+ }
608
+
609
+ .kcm-run-card-meta {
610
+ margin-top: 6px;
611
+ color: var(--text-tertiary);
612
+ font-size: 12px;
613
+ line-height: 1.55;
614
+ }
615
+
616
+ .kcm-arch-grid {
617
+ display: grid;
618
+ grid-template-columns: repeat(auto-fit, minmax(260px, 1fr));
619
+ gap: 12px;
620
+ }
621
+
622
+ .kcm-arch-card {
623
+ background: rgba(255, 255, 255, 0.03);
624
+ border: 1px solid var(--border);
625
+ border-radius: 18px;
626
+ padding: 14px;
627
+ }
628
+
629
+ .kcm-arch-head {
630
+ display: flex;
631
+ justify-content: space-between;
632
+ align-items: center;
633
+ gap: 10px;
634
+ }
635
+
636
+ .kcm-arch-name {
637
+ font-size: 14px;
638
+ font-weight: 700;
639
+ }
640
+
641
+ .kcm-arch-detail {
642
+ margin-top: 8px;
643
+ font-size: 12px;
644
+ color: var(--text-secondary);
645
+ line-height: 1.55;
646
+ }
647
+
648
+ .kcm-failure-box {
649
+ margin-top: 10px;
650
+ padding: 10px 12px;
651
+ border-radius: 14px;
652
+ background: rgba(255, 128, 142, 0.08);
653
+ border: 1px solid rgba(255, 128, 142, 0.12);
654
+ color: var(--bad);
655
+ font-family: "JetBrains Mono", Consolas, monospace;
656
+ font-size: 12px;
657
+ white-space: pre-wrap;
658
+ max-height: 200px;
659
+ overflow-y: auto;
660
+ }
661
+
662
+ .kcm-empty {
663
+ padding: 16px 0;
664
+ color: var(--text-tertiary);
665
+ font-size: 14px;
666
+ }
667
+
668
+ @media (max-width: 1260px) {
669
+ .kcm-stats,
670
+ .kcm-meta,
671
+ .kcm-graphs {
672
+ grid-template-columns: repeat(2, minmax(0, 1fr));
673
+ }
674
+ }
675
+
676
+ @media (max-width: 900px) {
677
+ .kcm-stats,
678
+ .kcm-meta,
679
+ .kcm-graphs,
680
+ .kcm-arch-grid {
681
+ grid-template-columns: 1fr;
682
+ }
683
+
684
+ .kcm-toolbar,
685
+ .kcm-run-card-head,
686
+ .kcm-modal-header {
687
+ flex-direction: column;
688
+ align-items: stretch;
689
+ }
690
+
691
+ .kcm-search {
692
+ min-width: 0;
693
+ width: 100%;
694
+ }
695
+ }
696
+ """
697
+
698
+
699
+ def _dt(value: datetime | None) -> str:
700
+ if not value:
701
+ return "n/a"
702
+ return value.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
703
+
704
+
705
+ def _short_dt(value: datetime | None) -> str:
706
+ if not value:
707
+ return "Never"
708
+ return value.astimezone(timezone.utc).strftime("%b %d, %H:%M")
709
+
710
+
711
+ def _badge(label: str, kind: str) -> str:
712
+ css = {
713
+ "completed": "ok",
714
+ "uploading": "warn",
715
+ "running": "info",
716
+ "recent": "info",
717
+ "failed": "bad",
718
+ "cancelled": "bad",
719
+ "stalled": "warn",
720
+ "idle": "muted",
721
+ "success": "ok",
722
+ "not_started": "muted",
723
+ "skipped": "muted",
724
+ }.get(kind, "info")
725
+ return f'<span class="kcm-badge {css}">{html.escape(label)}</span>'
726
+
727
+
728
+ def _short_arch(arch: str) -> str:
729
+ return {
730
+ "x86_64-linux": "x86_64-linux",
731
+ "aarch64-linux": "aarch64-linux",
732
+ "x86_64-darwin": "x86_64-darwin",
733
+ "aarch64-darwin": "aarch64-darwin",
734
+ }.get(arch, arch)
735
+
736
+
737
def _variant_label(record: MonitorRecord) -> str:
    """Derive a short human-readable label for a build variant (matrix entry).

    Falls through, in order: parsed matrix suffix from the job name, a
    "manual upload" tag for manual workflows, the record's own arch, and
    finally the raw job name.
    """
    # NOTE(review): assumes VARIANT_RE (defined earlier in this module) captures
    # the parenthesised matrix portion of a job name such as
    # "build-kernel (aarch64-linux, aws-r8g-8xl-plus-nix)" — confirm upstream.
    match = VARIANT_RE.search(record.job.name)
    if match:
        parts = [part.strip() for part in match.group(1).split(",") if part.strip()]
        if parts:
            # The first matrix entry is the architecture slug.
            parts[0] = _short_arch(parts[0])
            return " | ".join(parts)
    if record.workflow_name.lower().startswith("manual"):
        return "manual upload"
    if record.arch and record.arch not in {"all", "n/a"}:
        return _short_arch(record.arch)
    return record.job.name or "job"
749
+
750
+
751
def _variant_chip(record: MonitorRecord) -> str:
    """Render a small HTML chip summarising one variant's phase/upload/runner."""
    # A suspected stall overrides the phase colour so stuck jobs stand out.
    phase_kind = "stalled" if record.suspected_stalled else record.phase
    upload = _badge(record.upload_status_label, record.upload_status)
    return f"""
    <div class="kcm-variant">
      <div class="kcm-variant-head">
        <div class="kcm-variant-name">{html.escape(_variant_label(record))}</div>
        {_badge(record.phase_label, phase_kind)}
      </div>
      <div class="kcm-variant-sub">Upload {upload}</div>
      <div class="kcm-variant-sub">Runner {html.escape(record.runner_group or 'n/a')}</div>
    </div>
    """
764
+
765
+
766
def _group_badges(group: KernelRunGroup) -> str:
    """Build the space-separated badge strip summarising a run group's state."""
    if group.is_active:
        overall = _badge("Running", "running")
    elif group.has_failure:
        overall = _badge("Failed", "failed")
    else:
        overall = _badge("Completed", "completed")
    extras: list[str] = []
    if group.has_uploading:
        extras.append(_badge("Uploading", "uploading"))
    if group.has_stall:
        extras.append(_badge("Stalled", "stalled"))
    return " ".join([overall, *extras])
779
+
780
+
781
+ def _latest_group_for_workflow(row: KernelRow, workflow_path: str) -> KernelRunGroup | None:
782
+ return next((group for group in row.recent_groups if group.run.path == workflow_path), None)
783
+
784
+
785
def _workflow_cell(group: KernelRunGroup | None, empty_label: str) -> str:
    """Render one workflow column of the kernel table.

    Shows *empty_label* when no run group is available; otherwise the group's
    badge strip, run title, and a stack of per-variant chips.
    """
    if not group:
        return f'<div class="kcm-subtle">{html.escape(empty_label)}</div>'
    variant_stack = "".join(_variant_chip(record) for record in group.records)
    return f"""
    <div class="kcm-badges">{_group_badges(group)}</div>
    <div class="kcm-subtle">{html.escape(group.run.display_title or group.run.name)}</div>
    <div class="kcm-variant-stack" style="margin-top:10px">{variant_stack}</div>
    """
794
+
795
+
796
def _actions_cell(row: KernelRow, config: AppConfig) -> str:
    """Render the quick-link column: Actions run links plus an optional Grafana link."""
    actions: list[str] = []
    release_group = _latest_group_for_workflow(row, ".github/workflows/build-release.yaml")
    manual_group = _latest_group_for_workflow(row, ".github/workflows/manual-build-upload.yaml")
    if release_group:
        actions.append(
            f'<a class="kcm-action" href="{html.escape(release_group.run.html_url)}" target="_blank">Release run</a>'
        )
    if manual_group:
        actions.append(
            f'<a class="kcm-action" href="{html.escape(manual_group.run.html_url)}" target="_blank">Manual run</a>'
        )
    if config.grafana.enabled:
        # embed=False yields the full (non-kiosk) dashboard URL for a new tab.
        overview_url = build_dashboard_url(config.grafana, config.grafana.overview_dashboard_uid, embed=False)
        actions.append(
            f'<a class="kcm-action" href="{html.escape(overview_url)}" target="_blank">Grafana</a>'
        )
    return "".join(actions) or '<span class="kcm-subtle">No links</span>'
814
+
815
+
816
def _render_kernel_row(row: KernelRow, idx: int, config: AppConfig) -> str:
    """Render one <tr> of the kernel table.

    The data-* attributes (kernel name, status kind, workflow names) power the
    client-side search box and status filter.
    """
    release_group = _latest_group_for_workflow(row, ".github/workflows/build-release.yaml")
    manual_group = _latest_group_for_workflow(row, ".github/workflows/manual-build-upload.yaml")
    critical_tag = '<span class="kcm-badge critical">critical</span>' if row.critical else ""
    workflows_text = " / ".join(
        group.workflow_name for group in [release_group, manual_group] if group is not None
    )
    activity = row.primary_group
    activity_title = html.escape(activity.run.display_title or activity.run.name) if activity else "No tracked run yet"
    activity_sub = html.escape(_short_dt(row.last_triggered_at)) if row.last_triggered_at else "No activity"
    return f"""
    <tr
      data-idx="{idx}"
      data-kernel="{html.escape(row.kernel_name.lower())}"
      data-status="{html.escape(row.row_status_kind)}"
      data-workflow="{html.escape(workflows_text.lower())}"
    >
      <td style="min-width:220px">
        <div class="kcm-kernel-name">{html.escape(row.kernel_name)} {critical_tag}</div>
        <div class="kcm-kernel-meta">{html.escape(row.kernel_info.repo_id)}</div>
        <div class="kcm-kernel-meta">{html.escape(", ".join(row.kernel_info.backends) or "backend metadata unavailable")}</div>
      </td>
      <td style="min-width:360px">{_workflow_cell(release_group, "No release workflow run found in the scanned history.")}</td>
      <td style="min-width:280px">{_workflow_cell(manual_group, "No manual upload run found in the scanned history.")}</td>
      <td style="min-width:240px">
        <div class="kcm-badges">{_badge(row.row_status_label, row.row_status_kind)}</div>
        <div class="kcm-activity-sub">{activity_title}</div>
        <div class="kcm-activity-sub">{activity_sub}</div>
      </td>
      <td style="min-width:220px"><div class="kcm-actions">{_actions_cell(row, config)}</div></td>
    </tr>
    """
848
+
849
+
850
def _render_arch_card(record: MonitorRecord) -> str:
    """Render one per-architecture job card for the run detail modal."""
    # A suspected stall overrides the phase badge colour.
    phase_kind = "stalled" if record.suspected_stalled else record.phase
    stall_line = (
        f'<div class="kcm-arch-detail" style="color:var(--warn)">{html.escape(record.stall_reason or "")}</div>'
        if record.suspected_stalled
        else ""
    )
    failure = (
        f'<div class="kcm-failure-box">{html.escape(record.failure_excerpt)}</div>'
        if record.failure_excerpt
        else ""
    )
    return f"""
    <div class="kcm-arch-card">
      <div class="kcm-arch-head">
        <span class="kcm-arch-name">{html.escape(_variant_label(record))}</span>
        {_badge(record.phase_label, phase_kind)}
      </div>
      <div class="kcm-arch-detail">Upload { _badge(record.upload_status_label, record.upload_status) }</div>
      <div class="kcm-arch-detail">Runner {html.escape(record.runner_group or 'n/a')}</div>
      <div class="kcm-arch-detail">Started {_dt(record.started_at)} | Latest signal {_dt(record.latest_signal_at)}</div>
      <div class="kcm-arch-detail"><a href="{html.escape(record.job.html_url)}" target="_blank">Open job</a></div>
      {stall_line}
      {failure}
    </div>
    """
876
+
877
+
878
def _render_group(group: KernelRunGroup) -> str:
    """Render a full run card: run metadata header plus one card per architecture job."""
    arch_cards = "".join(_render_arch_card(record) for record in group.records)
    return f"""
    <div class="kcm-run-card">
      <div class="kcm-run-card-head">
        <div>
          <div class="kcm-run-card-title">{html.escape(group.run.display_title or group.run.name)}</div>
          <div class="kcm-run-card-meta">
            {html.escape(group.workflow_name)} | branch {html.escape(group.run.head_branch or 'n/a')} | actor {html.escape(group.run.actor_login or 'n/a')}<br>
            Triggered {_dt(group.triggered_at)}
          </div>
        </div>
        <div>
          <div class="kcm-badges">{_group_badges(group)}</div>
          <div class="kcm-run-card-meta" style="margin-top:8px">
            <a href="{html.escape(group.run.html_url)}" target="_blank">Open Actions run</a>
          </div>
        </div>
      </div>
      <div class="kcm-arch-grid">{arch_cards}</div>
    </div>
    """
900
+
901
+
902
def _render_hidden_modal(row: KernelRow, idx: int, config: AppConfig) -> str:
    """Render the hidden (display:none) modal content for one kernel row.

    The content is revealed client-side; *idx* ties it back to the table row
    via the element id "modal-content-{idx}".
    """
    release_group = _latest_group_for_workflow(row, ".github/workflows/build-release.yaml")
    manual_group = _latest_group_for_workflow(row, ".github/workflows/manual-build-upload.yaml")
    critical_tag = '<span class="kcm-badge critical">critical</span>' if row.critical else ""
    grafana_link = ""
    if config.grafana.enabled:
        grafana_url = build_dashboard_url(config.grafana, config.grafana.overview_dashboard_uid, embed=False)
        grafana_link = f'<a href="{html.escape(grafana_url)}" target="_blank" class="kcm-modal-close">Open Grafana</a>'

    # Sections, in priority order: latest release build, latest manual upload,
    # then up to eight recent tracked runs; an empty notice if none exist.
    sections = []
    if release_group:
        sections.append(f'<h3 class="kcm-section-title">Latest release build</h3>{_render_group(release_group)}')
    if manual_group:
        sections.append(f'<h3 class="kcm-section-title">Latest manual upload</h3>{_render_group(manual_group)}')
    if row.recent_groups:
        sections.append(
            "<h3 class=\"kcm-section-title\">Recent tracked runs</h3>"
            + "".join(_render_group(group) for group in row.recent_groups[:8])
        )
    if not sections:
        sections.append('<div class="kcm-empty">No tracked GitHub Actions runs found for this kernel yet.</div>')

    return f"""
    <div id="modal-content-{idx}" style="display:none">
      <div class="kcm-modal-header">
        <div>
          <h2>{html.escape(row.kernel_name)} {critical_tag}</h2>
          <p>{html.escape(row.kernel_info.repo_id)}</p>
          <p>{_badge(row.row_status_label, row.row_status_kind)} {html.escape(", ".join(row.kernel_info.backends) or "No backend metadata")}</p>
        </div>
        <div style="display:flex;gap:10px;flex-wrap:wrap">
          <a href="{html.escape(row.kernel_info.hub_url)}" target="_blank" class="kcm-modal-close">Open Hub repo</a>
          {grafana_link}
          <button class="kcm-modal-close">Close</button>
        </div>
      </div>
      <div class="kcm-modal-body">
        {"".join(sections)}
      </div>
    </div>
    """
943
+
944
+
945
def _render_graph_section(config: AppConfig) -> str:
    """Render the "Metrics + trends" section.

    When Grafana is not configured, returns setup instructions instead of
    dashboard links/embeds.
    """
    if not config.grafana.enabled:
        return """
    <section class="kcm-section">
      <h2 class="kcm-section-title">Metrics + trends</h2>
      <div class="kcm-panel-link">
        <div class="kcm-panel-label">Grafana not configured</div>
        <div class="kcm-panel-title">The live Actions table is active; the Grafana deck is ready to attach.</div>
        <div class="kcm-panel-copy">
          Set <code>KCM_GRAFANA_BASE_URL</code> on the Space once you have a public Grafana endpoint.
          The provisioning and Actions metrics emitter already live in <code>monitoring/</code> and
          <code>scripts/push_build_metrics.py</code>.
        </div>
      </div>
    </section>
    """
    dashboards = dashboard_catalog(config.grafana)
    # Card links open the dashboards in a new tab (embed=False).
    cards = "".join(
        f"""
        <a class="kcm-panel-link" href="{html.escape(build_dashboard_url(config.grafana, dashboard.uid, embed=False))}" target="_blank">
          <div class="kcm-panel-label">Grafana</div>
          <div class="kcm-panel-title">{html.escape(dashboard.title)}</div>
          <div class="kcm-panel-copy">{html.escape(dashboard.description)}</div>
        </a>
        """
        for dashboard in dashboards
    )
    # Iframe embeds use embed=True (kiosk mode) for in-page display.
    embeds = "".join(
        f"""
        <div class="kcm-frame">
          <div class="kcm-frame-head">
            <div>
              <div class="kcm-frame-title">{html.escape(dashboard.title)}</div>
              <div class="kcm-frame-copy">{html.escape(dashboard.description)}</div>
            </div>
            <a class="kcm-open" href="{html.escape(build_dashboard_url(config.grafana, dashboard.uid, embed=False))}" target="_blank">Open in Grafana</a>
          </div>
          <iframe src="{html.escape(build_dashboard_url(config.grafana, dashboard.uid, embed=True))}" height="{dashboard.height}" loading="lazy"></iframe>
        </div>
        """
        for dashboard in dashboards
    )
    return f"""
    <section class="kcm-section">
      <h2 class="kcm-section-title">Metrics + trends</h2>
      <div class="kcm-graphs">{cards}</div>
      {embeds}
    </section>
    """
994
+
995
+
996
def render_page(snapshot: DashboardSnapshot, config: AppConfig) -> str:
    """Render the full dashboard page HTML for one snapshot.

    Assembles the hero header (meta cards + summary stats), toolbar, kernel
    table, Grafana section, the modal overlay shell, and one hidden modal per
    kernel row.
    """
    summary = snapshot.summary
    meta_cards = "".join(
        [
            f'<div class="kcm-meta-card"><div class="kcm-meta-card-label">Source repo</div><div class="kcm-meta-card-value">{html.escape(config.github.repo_slug)}</div></div>',
            f'<div class="kcm-meta-card"><div class="kcm-meta-card-label">GitHub scans</div><div class="kcm-meta-card-value">{html.escape(str(config.monitor.workflow_run_pages))} pages x {html.escape(str(config.monitor.workflow_run_page_size))} runs</div></div>',
            f'<div class="kcm-meta-card"><div class="kcm-meta-card-label">Grafana</div><div class="kcm-meta-card-value">{html.escape(config.grafana.base_url or "not configured")}</div></div>',
        ]
    )
    stats = "".join(
        f'<div class="kcm-stat"><div class="kcm-stat-label">{label}</div><div class="kcm-stat-value">{value}</div></div>'
        for label, value in [
            ("Kernels", summary.tracked_kernels),
            ("Active", summary.active_builds),
            ("Uploading", summary.uploading_builds),
            ("Stalled", summary.stalled_builds),
            ("Failed", summary.failed_builds),
        ]
    )
    rows_html = "".join(_render_kernel_row(row, idx, config) for idx, row in enumerate(snapshot.kernel_rows))
    errors_html = ""
    if snapshot.errors:
        # Surface at most the first three scan errors inline in the toolbar.
        errors_html = f' | <span style="color:var(--bad)">{html.escape("; ".join(snapshot.errors[:3]))}</span>'

    return f"""
    <div class="kcm-shell">
      <section class="kcm-hero">
        <div class="kcm-eyebrow">Kernels community observatory</div>
        <h1>Kernel CI Command Center.</h1>
        <p>
          Every kernel source directory in <code>{html.escape(config.github.repo_slug)}</code> is enumerated from the repo tree,
          then matched to its latest release and manual-upload GitHub Actions runs. Variant-level job status stays visible, and
          Grafana handles the longer-term duration and failure telemetry.
        </p>
        <div class="kcm-meta">{meta_cards}</div>
        <div class="kcm-stats">{stats}</div>
      </section>

      <div class="kcm-toolbar">
        <div class="kcm-toolbar-left">
          Refreshed <code>{html.escape(_dt(snapshot.generated_at))}</code> | <code>{len(snapshot.kernel_rows)}</code> kernels{errors_html}
        </div>
        <div class="kcm-toolbar-right">
          <input class="kcm-search" type="text" placeholder="Filter kernel or workflow..." />
          <select class="kcm-status-filter">
            <option value="all">All states</option>
            <option value="running">Running</option>
            <option value="uploading">Uploading</option>
            <option value="stalled">Stalled</option>
            <option value="failed">Failed</option>
            <option value="completed">Completed</option>
            <option value="idle">Idle</option>
          </select>
        </div>
      </div>

      <section class="kcm-table-shell">
        <div class="kcm-table-wrap">
          <table class="kcm-table" id="kernelTable">
            <thead>
              <tr>
                <th>Kernel dir</th>
                <th>Latest release build</th>
                <th>Latest manual upload</th>
                <th>Latest activity</th>
                <th>Actions</th>
              </tr>
            </thead>
            <tbody>{rows_html}</tbody>
          </table>
        </div>
      </section>

      {_render_graph_section(config)}
    </div>
    <div class="kcm-overlay" id="kcmOverlay">
      <div class="kcm-modal" id="kcmModal"></div>
    </div>
    {"".join(_render_hidden_modal(row, idx, config) for idx, row in enumerate(snapshot.kernel_rows))}
    """
1076
+
1077
+
1078
# Placeholder page shown before the first snapshot has been fetched and rendered.
LOADING_HTML = """
<div class="kcm-shell">
  <section class="kcm-hero">
    <div class="kcm-eyebrow">Kernels community observatory</div>
    <h1>Booting the kernel CI command center...</h1>
    <p>The first load walks the kernel catalog and scans the latest GitHub Actions runs, so it can take a few seconds.</p>
  </section>
</div>
"""
1087
+
1088
+
1089
def build_dashboard(service: MonitorService, config: AppConfig) -> gr.Blocks:
    """Assemble the Gradio app: refresh button, rendered HTML page, auto-refresh timer."""
    with gr.Blocks() as demo:
        # First tick fires after a fixed 8 seconds so the initial data load
        # happens shortly after the placeholder page is shown; each handler
        # then re-arms the timer at the configured refresh interval.
        refresh_timer = gr.Timer(value=8, active=True)
        # Flips to True after the first successful render; later ticks may
        # serve the service-side cached snapshot.
        loaded_state = gr.State(False)

        with gr.Row():
            refresh_btn = gr.Button("Refresh now", variant="primary", scale=0, min_width=160)

        page_html = gr.HTML(value=LOADING_HTML)

        def refresh(_=None):
            # Manual refresh always bypasses the snapshot cache.
            snapshot = service.get_snapshot(force_refresh=True)
            return render_page(snapshot, config), True, gr.Timer(value=config.monitor.refresh_interval_seconds, active=True)

        def tick_refresh(loaded):
            # Only the very first tick (loaded=False) forces a refresh.
            snapshot = service.get_snapshot(force_refresh=not loaded)
            return render_page(snapshot, config), True, gr.Timer(value=config.monitor.refresh_interval_seconds, active=True)

        refresh_btn.click(refresh, outputs=[page_html, loaded_state, refresh_timer])
        refresh_timer.tick(tick_refresh, inputs=[loaded_state], outputs=[page_html, loaded_state, refresh_timer])

    return demo
tests/conftest.py ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
from __future__ import annotations

import sys
from pathlib import Path


# Tests import "kc_monitor" from the repository's src/ layout without
# installing the package, so prepend <repo root>/src to sys.path once.
ROOT_DIR = Path(__file__).resolve().parents[1]
SRC_DIR = ROOT_DIR / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))
tests/fixtures/active_build_job.json ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "id": 66947931664,
3
+ "run_id": 23049830725,
4
+ "workflow_name": "Build Release",
5
+ "head_branch": "tiny-build-fix",
6
+ "run_url": "https://api.github.com/repos/huggingface/kernels-community/actions/runs/23049830725",
7
+ "run_attempt": 1,
8
+ "head_sha": "ca745cc4e08039817fc47d780f7dd3126187a6d6",
9
+ "url": "https://api.github.com/repos/huggingface/kernels-community/actions/jobs/66947931664",
10
+ "html_url": "https://github.com/huggingface/kernels-community/actions/runs/23049830725/job/66947931664",
11
+ "status": "in_progress",
12
+ "conclusion": null,
13
+ "created_at": "2026-03-22T10:00:00Z",
14
+ "started_at": "2026-03-22T10:00:10Z",
15
+ "completed_at": null,
16
+ "name": "build-kernel (aarch64-linux, aws-r8g-8xl-plus-nix)",
17
+ "steps": [
18
+ {
19
+ "name": "Set up job",
20
+ "status": "completed",
21
+ "conclusion": "success",
22
+ "number": 1,
23
+ "started_at": "2026-03-22T10:00:10Z",
24
+ "completed_at": "2026-03-22T10:00:12Z"
25
+ },
26
+ {
27
+ "name": "Validate kernel directory",
28
+ "status": "completed",
29
+ "conclusion": "success",
30
+ "number": 6,
31
+ "started_at": "2026-03-22T10:00:30Z",
32
+ "completed_at": "2026-03-22T10:00:31Z"
33
+ },
34
+ {
35
+ "name": "Build and upload kernel",
36
+ "status": "in_progress",
37
+ "conclusion": null,
38
+ "number": 7,
39
+ "started_at": "2026-03-22T10:01:00Z",
40
+ "completed_at": null
41
+ }
42
+ ],
43
+ "runner_name": "aws-r8g-8xl-plus-nix-runner",
44
+ "runner_group_name": "aws-r8g-8xl-plus-nix"
45
+ }
tests/fixtures/build_release_run.json ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "id": 23049830725,
3
+ "name": "Build Release",
4
+ "display_title": "sgl-flash-attn3: upload path sanity",
5
+ "path": ".github/workflows/build-release.yaml",
6
+ "status": "in_progress",
7
+ "conclusion": null,
8
+ "head_branch": "tiny-build-fix",
9
+ "head_sha": "ca745cc4e08039817fc47d780f7dd3126187a6d6",
10
+ "event": "pull_request",
11
+ "html_url": "https://github.com/huggingface/kernels-community/actions/runs/23049830725",
12
+ "jobs_url": "https://api.github.com/repos/huggingface/kernels-community/actions/runs/23049830725/jobs",
13
+ "created_at": "2026-03-22T10:00:00Z",
14
+ "updated_at": "2026-03-22T14:20:00Z",
15
+ "run_started_at": "2026-03-22T10:00:00Z",
16
+ "actor": {
17
+ "login": "adarshxs"
18
+ }
19
+ }
tests/fixtures/failed_build_job.json ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "id": 66947931666,
3
+ "run_id": 23049830726,
4
+ "workflow_name": "Build Release",
5
+ "head_branch": "repo-id-bug",
6
+ "run_url": "https://api.github.com/repos/huggingface/kernels-community/actions/runs/23049830726",
7
+ "run_attempt": 1,
8
+ "head_sha": "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
9
+ "url": "https://api.github.com/repos/huggingface/kernels-community/actions/jobs/66947931666",
10
+ "html_url": "https://github.com/huggingface/kernels-community/actions/runs/23049830726/job/66947931666",
11
+ "status": "completed",
12
+ "conclusion": "failure",
13
+ "created_at": "2026-03-21T10:00:00Z",
14
+ "started_at": "2026-03-21T10:00:10Z",
15
+ "completed_at": "2026-03-21T10:26:08Z",
16
+ "name": "build-kernel (aarch64-linux, aws-r8g-8xl-plus-nix)",
17
+ "steps": [
18
+ {
19
+ "name": "Set up job",
20
+ "status": "completed",
21
+ "conclusion": "success",
22
+ "number": 1,
23
+ "started_at": "2026-03-21T10:00:10Z",
24
+ "completed_at": "2026-03-21T10:00:12Z"
25
+ },
26
+ {
27
+ "name": "Validate kernel directory",
28
+ "status": "completed",
29
+ "conclusion": "success",
30
+ "number": 6,
31
+ "started_at": "2026-03-21T10:00:30Z",
32
+ "completed_at": "2026-03-21T10:00:31Z"
33
+ },
34
+ {
35
+ "name": "Build and upload kernel",
36
+ "status": "completed",
37
+ "conclusion": "failure",
38
+ "number": 7,
39
+ "started_at": "2026-03-21T10:01:00Z",
40
+ "completed_at": "2026-03-21T10:26:08Z"
41
+ }
42
+ ],
43
+ "runner_name": "aws-r8g-8xl-plus-nix-runner",
44
+ "runner_group_name": "aws-r8g-8xl-plus-nix"
45
+ }
tests/fixtures/failed_build_run.json ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "id": 23049830726,
3
+ "name": "Build Release",
4
+ "display_title": "sgl-flash-attn3: repo id regression",
5
+ "path": ".github/workflows/build-release.yaml",
6
+ "status": "completed",
7
+ "conclusion": "failure",
8
+ "head_branch": "repo-id-bug",
9
+ "head_sha": "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
10
+ "event": "pull_request",
11
+ "html_url": "https://github.com/huggingface/kernels-community/actions/runs/23049830726",
12
+ "jobs_url": "https://api.github.com/repos/huggingface/kernels-community/actions/runs/23049830726/jobs",
13
+ "created_at": "2026-03-21T10:00:00Z",
14
+ "updated_at": "2026-03-21T10:26:08Z",
15
+ "run_started_at": "2026-03-21T10:00:00Z",
16
+ "actor": {
17
+ "login": "adarshxs"
18
+ }
19
+ }
tests/fixtures/manual_build_run.json ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "id": 23049830727,
3
+ "name": "Manual Kernel Build",
4
+ "display_title": "Manual Kernel Build / flash-attn3 / target=main / request=manual",
5
+ "path": ".github/workflows/manual-build-upload.yaml",
6
+ "status": "completed",
7
+ "conclusion": "success",
8
+ "head_branch": "manual-test",
9
+ "head_sha": "cccccccccccccccccccccccccccccccccccccccc",
10
+ "event": "workflow_dispatch",
11
+ "html_url": "https://github.com/huggingface/kernels-community/actions/runs/23049830727",
12
+ "jobs_url": "https://api.github.com/repos/huggingface/kernels-community/actions/runs/23049830727/jobs",
13
+ "created_at": "2026-03-21T14:00:00Z",
14
+ "updated_at": "2026-03-21T15:01:00Z",
15
+ "run_started_at": "2026-03-21T14:00:00Z",
16
+ "actor": {
17
+ "login": "adarshxs"
18
+ }
19
+ }
tests/fixtures/manual_upload_job.json ADDED
@@ -0,0 +1,53 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "id": 66947931668,
3
+ "run_id": 23049830727,
4
+ "workflow_name": "Manual Kernel Build",
5
+ "head_branch": "manual-test",
6
+ "run_url": "https://api.github.com/repos/huggingface/kernels-community/actions/runs/23049830727",
7
+ "run_attempt": 1,
8
+ "head_sha": "cccccccccccccccccccccccccccccccccccccccc",
9
+ "url": "https://api.github.com/repos/huggingface/kernels-community/actions/jobs/66947931668",
10
+ "html_url": "https://github.com/huggingface/kernels-community/actions/runs/23049830727/job/66947931668",
11
+ "status": "completed",
12
+ "conclusion": "success",
13
+ "created_at": "2026-03-21T14:00:00Z",
14
+ "started_at": "2026-03-21T14:00:10Z",
15
+ "completed_at": "2026-03-21T15:01:00Z",
16
+ "name": "build-and-upload",
17
+ "steps": [
18
+ {
19
+ "name": "Set up job",
20
+ "status": "completed",
21
+ "conclusion": "success",
22
+ "number": 1,
23
+ "started_at": "2026-03-21T14:00:10Z",
24
+ "completed_at": "2026-03-21T14:00:12Z"
25
+ },
26
+ {
27
+ "name": "Validate kernel directory",
28
+ "status": "completed",
29
+ "conclusion": "success",
30
+ "number": 6,
31
+ "started_at": "2026-03-21T14:00:30Z",
32
+ "completed_at": "2026-03-21T14:00:31Z"
33
+ },
34
+ {
35
+ "name": "Build and copy kernel",
36
+ "status": "completed",
37
+ "conclusion": "success",
38
+ "number": 7,
39
+ "started_at": "2026-03-21T14:01:00Z",
40
+ "completed_at": "2026-03-21T14:45:00Z"
41
+ },
42
+ {
43
+ "name": "Upload kernel",
44
+ "status": "completed",
45
+ "conclusion": "success",
46
+ "number": 8,
47
+ "started_at": "2026-03-21T14:45:10Z",
48
+ "completed_at": "2026-03-21T15:01:00Z"
49
+ }
50
+ ],
51
+ "runner_name": "aws-highmemory-32-plus-nix-runner",
52
+ "runner_group_name": "aws-highmemory-32-plus-nix"
53
+ }
tests/test_grafana.py ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from kc_monitor.config import GrafanaSettings
4
+ from kc_monitor.grafana import build_dashboard_url, dashboard_catalog
5
+
6
+
7
def test_dashboard_catalog_uses_configured_uids() -> None:
    """The catalog must list dashboards in overview/durations/failures order with the configured UIDs."""
    settings = GrafanaSettings(
        base_url="https://grafana.example.com",
        overview_dashboard_uid="overview-uid",
        duration_dashboard_uid="durations-uid",
        failure_dashboard_uid="failures-uid",
    )

    dashboards = dashboard_catalog(settings)

    assert [dashboard.uid for dashboard in dashboards] == [
        "overview-uid",
        "durations-uid",
        "failures-uid",
    ]
22
+
23
+
24
def test_build_dashboard_url_supports_embed_mode() -> None:
    """embed=True must append kiosk=tv; both modes share the base query string.

    Also checks that a trailing slash on base_url is normalised away.
    """
    settings = GrafanaSettings(
        base_url="https://grafana.example.com/",
        org_id=7,
        theme="light",
        default_from="now-7d",
        default_to="now",
        default_refresh="30s",
    )

    embed_url = build_dashboard_url(settings, "overview-uid", embed=True)
    full_url = build_dashboard_url(settings, "overview-uid", embed=False)

    assert embed_url == (
        "https://grafana.example.com/d/overview-uid/_?"
        "orgId=7&from=now-7d&to=now&theme=light&refresh=30s&kiosk=tv"
    )
    assert full_url == (
        "https://grafana.example.com/d/overview-uid/_?"
        "orgId=7&from=now-7d&to=now&theme=light&refresh=30s"
    )
tests/test_log_parser.py ADDED
@@ -0,0 +1,52 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ from pathlib import Path
5
+
6
+ from kc_monitor.log_parser import JobLogParser
7
+ from kc_monitor.models import GitHubJob, GitHubRun
8
+
9
+
10
+ FIXTURES_DIR = Path(__file__).parent / "fixtures"
11
+
12
+
13
+ def load_json_fixture(name: str) -> dict:
14
+ return json.loads((FIXTURES_DIR / name).read_text(encoding="utf-8"))
15
+
16
+
17
+ def load_text_fixture(name: str) -> str:
18
+ return (FIXTURES_DIR / name).read_text(encoding="utf-8")
19
+
20
+
21
+ def test_parser_detects_upload_in_progress_from_combined_step() -> None:
22
+ run = GitHubRun.from_api(load_json_fixture("build_release_run.json"))
23
+ job = GitHubJob.from_api(load_json_fixture("active_build_job.json"))
24
+
25
+ parsed = JobLogParser().parse(run, job, load_text_fixture("running_build_upload.log"))
26
+
27
+ assert parsed.phase == "uploading"
28
+ assert parsed.upload_status == "running"
29
+ assert parsed.repo_id == "kernels-community/sgl-flash-attn3"
30
+ assert parsed.latest_log_at is not None
31
+
32
+
33
+ def test_parser_keeps_upload_not_started_when_build_fails_first() -> None:
34
+ run = GitHubRun.from_api(load_json_fixture("failed_build_run.json"))
35
+ job = GitHubJob.from_api(load_json_fixture("failed_build_job.json"))
36
+
37
+ parsed = JobLogParser().parse(run, job, load_text_fixture("failed_build.log"))
38
+
39
+ assert parsed.phase == "failed"
40
+ assert parsed.upload_status == "not_started"
41
+ assert "Mandatory repo-id is missing" in (parsed.failure_excerpt or "")
42
+
43
+
44
+ def test_parser_marks_manual_upload_as_completed() -> None:
45
+ run = GitHubRun.from_api(load_json_fixture("manual_build_run.json"))
46
+ job = GitHubJob.from_api(load_json_fixture("manual_upload_job.json"))
47
+
48
+ parsed = JobLogParser().parse(run, job, load_text_fixture("manual_upload_success.log"))
49
+
50
+ assert parsed.phase == "upload_complete"
51
+ assert parsed.upload_status == "completed"
52
+ assert parsed.repo_id == "kernels-community/flash-attn3"
tests/test_metrics_push.py ADDED
@@ -0,0 +1,96 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from kc_monitor.metrics_push import (
4
+ BuildMetricSample,
5
+ build_pushgateway_url,
6
+ format_prometheus_metrics,
7
+ )
8
+
9
+
10
def test_build_metric_sample_uses_matrix_labels_and_duration() -> None:
    """from_env maps KCM_* matrix vars to grouping labels and derives duration."""
    env = {
        "KCM_JOB_STATUS": "failure",
        "KCM_JOB_STARTED_AT": "100",
        "KCM_KERNEL": "flash-attn3",
        "KCM_BACKEND": "cuda",
        "KCM_COMPUTE_BACKEND": "triton",
        "KCM_CUDA_VERSION": "12.4",
        "KCM_PYTORCH_VERSION": "2.5.1",
        "KCM_PYTHON_VERSION": "3.11",
        "GITHUB_REPOSITORY": "huggingface/kernels-community",
        "GITHUB_WORKFLOW": "Build Release",
        "GITHUB_REF_NAME": "main",
        "GITHUB_JOB": "build_kernel",
        "RUNNER_OS": "Linux",
        "RUNNER_ARCH": "X64",
    }

    sample = BuildMetricSample.from_env(env, completed_at_seconds=145)

    expected_grouping = {
        "kernel": "flash-attn3",
        "backend": "cuda",
        "compute_backend": "triton",
        "cuda_version": "12.4",
        "pytorch_version": "2.5.1",
        "python_version": "3.11",
    }
    assert sample.grouping_key == expected_grouping
    assert sample.metric_labels["repository"] == "huggingface/kernels-community"
    assert sample.result == "failure"
    assert sample.result_code == 2
    assert sample.failed == 1
    # completed_at_seconds (145) minus KCM_JOB_STARTED_AT (100).
    assert sample.duration_seconds == 45.0
44
+
45
+
46
def test_build_pushgateway_url_is_stable_per_matrix_combo() -> None:
    """Grouping-key labels are appended to the job URL in a fixed order."""
    grouping = {
        "kernel": "flash-attn3",
        "backend": "cuda",
        "compute_backend": "triton",
        "cuda_version": "12.4",
        "pytorch_version": "2.5.1",
        "python_version": "3.11",
    }

    url = build_pushgateway_url(
        "http://pushgateway:9091",
        "kernels-community-build-matrix",
        grouping,
    )

    expected = (
        "http://pushgateway:9091/metrics/job/kernels-community-build-matrix/"
        "kernel/flash-attn3/backend/cuda/compute_backend/triton/cuda_version/12.4/"
        "pytorch_version/2.5.1/python_version/3.11"
    )
    assert url == expected
65
+
66
+
67
def test_prometheus_payload_contains_expected_metrics() -> None:
    """The exposition payload carries every kc_build_* metric plus labels/values."""
    env = {
        "KCM_JOB_STATUS": "success",
        "KCM_BUILD_DURATION_SECONDS": "12.5",
        "KCM_KERNEL": "flash-attn3",
        "KCM_BACKEND": "cuda",
        "KCM_COMPUTE_BACKEND": "triton",
        "KCM_CUDA_VERSION": "12.4",
        "KCM_PYTORCH_VERSION": "2.5.1",
        "KCM_PYTHON_VERSION": "3.11",
        "GITHUB_REPOSITORY": "huggingface/kernels-community",
        "GITHUB_WORKFLOW": "Build Release",
        "GITHUB_REF_NAME": "main",
        "GITHUB_JOB": "build_kernel",
        "RUNNER_OS": "Linux",
        "RUNNER_ARCH": "X64",
    }
    sample = BuildMetricSample.from_env(env, completed_at_seconds=1700000000)

    payload = format_prometheus_metrics(sample)

    expected_fragments = (
        "kc_build_last_run_result_code",
        "kc_build_last_run_failed",
        "kc_build_last_run_duration_seconds",
        "kc_build_last_run_timestamp_seconds",
        'result="success"',
        "12.500",
        "1700000000",
    )
    for fragment in expected_fragments:
        assert fragment in payload
tests/test_service.py ADDED
@@ -0,0 +1,152 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ from pathlib import Path
5
+
6
+ from kc_monitor.config import AppConfig
7
+ from kc_monitor.models import GitHubJob, GitHubRun
8
+ from kc_monitor.service import MonitorService
9
+
10
+
11
+ FIXTURES_DIR = Path(__file__).parent / "fixtures"
12
+
13
+
14
def load_json_fixture(name: str) -> dict:
    """Parse and return the JSON fixture stored under *name*."""
    fixture_path = FIXTURES_DIR / name
    with fixture_path.open(encoding="utf-8") as handle:
        return json.load(handle)
16
+
17
+
18
def load_text_fixture(name: str) -> str:
    """Return the raw text content of the fixture file *name*."""
    fixture_path = FIXTURES_DIR / name
    return fixture_path.read_text(encoding="utf-8")
20
+
21
+
22
class FakeGitHubClient:
    """In-memory stand-in for the real GitHub client used by MonitorService.

    Serves canned workflow runs, jobs, logs, build.toml files, and repo tree
    paths loaded from the local JSON/text fixtures, so service tests run
    entirely offline while exercising the same client interface.
    """

    def __init__(self) -> None:
        # One run per fixture scenario: active build, failed build, manual upload.
        self.runs = [
            GitHubRun.from_api(load_json_fixture("build_release_run.json")),
            GitHubRun.from_api(load_json_fixture("failed_build_run.json")),
            GitHubRun.from_api(load_json_fixture("manual_build_run.json")),
        ]
        # run id -> jobs for that run; ids must match the run fixtures above.
        self.jobs = {
            23049830725: [GitHubJob.from_api(load_json_fixture("active_build_job.json"))],
            23049830726: [GitHubJob.from_api(load_json_fixture("failed_build_job.json"))],
            23049830727: [GitHubJob.from_api(load_json_fixture("manual_upload_job.json"))],
        }
        # job id -> raw log text returned by get_job_logs.
        self.logs = {
            66947931664: load_text_fixture("running_build_upload.log"),
            66947931666: load_text_fixture("failed_build.log"),
            66947931668: load_text_fixture("manual_upload_success.log"),
        }
        # repo path -> build.toml content served by get_file_text.
        self.build_toml = {
            "sgl-flash-attn3/build.toml": """
[general]
name = "sgl-flash-attn3"
version = 1
backends = ["cuda"]

[general.hub]
repo-id = "kernels-community/sgl-flash-attn3"
""".strip(),
            "flash-attn3/build.toml": """
[general]
name = "flash-attn3"
version = 1
backends = ["cuda"]

[general.hub]
repo-id = "kernels-community/flash-attn3"
""".strip(),
        }
        # deep-gemm appears in the tree but has no content entry above, so
        # get_file_text returns None for it.
        self.tree_paths = [
            "sgl-flash-attn3/build.toml",
            "flash-attn3/build.toml",
            "deep-gemm/build.toml",
        ]

    def close(self) -> None:
        # Nothing to release; present only to mirror the real client's interface.
        return None

    def list_runs(self, per_page: int = 30, page: int = 1) -> list[GitHubRun]:
        # Pagination is simplified: `page` is ignored, only per_page is honored.
        return self.runs[:per_page]

    def list_workflow_runs(
        self,
        workflow_file: str,
        per_page: int = 30,
        page: int = 1,
    ) -> list[GitHubRun]:
        # Filter by workflow path suffix, mimicking the per-workflow endpoint.
        return [r for r in self.runs if r.path.endswith(workflow_file)][:per_page]

    def list_jobs(self, run_id: int) -> list[GitHubJob]:
        # Raises KeyError for unknown run ids, surfacing bad fixture wiring.
        return self.jobs[run_id]

    def get_job_logs(
        self,
        job_id: int,
        line_limit: int = 400,
        char_limit: int = 35000,
        job_html_url: str | None = None,
    ) -> str:
        # Truncation limits are ignored; the fixture logs are already small.
        return self.logs[job_id]

    def get_file_text(self, path: str, ref: str | None = None) -> str | None:
        # None for paths without canned content (e.g. deep-gemm/build.toml).
        return self.build_toml.get(path)

    def list_repo_tree_paths(self, ref: str = "main") -> list[str]:
        return self.tree_paths
96
+
97
+
98
def test_service_builds_summary_and_records() -> None:
    """End-to-end snapshot assembly over the canned runs, jobs, and logs."""
    raw_config = {
        "github": {
            "owner": "huggingface",
            "repo": "kernels-community",
            "branch": "main",
            "per_page": 10,
            "workflows": [
                {
                    "path": ".github/workflows/build-release.yaml",
                    "label": "Build Release",
                    "enabled": True,
                },
                {
                    "path": ".github/workflows/manual-build-upload.yaml",
                    "label": "Manual Kernel Build",
                    "enabled": True,
                },
            ],
        },
        "monitor": {
            "recent_completed_hours": 400,
            "critical_kernels": ["flash-attn3", "sgl-flash-attn3"],
            "snapshot_ttl_seconds": 1,
        },
    }
    config = AppConfig.model_validate(raw_config)

    service = MonitorService(config, client=FakeGitHubClient())
    snapshot = service.get_snapshot(force_refresh=True)

    summary = snapshot.summary
    assert summary.active_builds == 1
    assert summary.completed_uploads == 1
    assert summary.failed_builds == 1
    assert summary.uploading_builds == 1
    assert summary.tracked_kernels == 3

    assert len(snapshot.active_records) == 1
    assert len(snapshot.kernel_rows) == 3

    active = snapshot.active_records[0]
    assert active.kernel_name == "sgl-flash-attn3"
    assert active.critical is True

    assert snapshot.kernel_rows[0].kernel_name == "sgl-flash-attn3"
    assert snapshot.kernel_rows[-1].kernel_name == "deep-gemm"

    upload_statuses = [record.upload_status for record in snapshot.recent_records]
    phases = [record.phase for record in snapshot.recent_records]
    assert "completed" in upload_statuses
    assert "failed" in phases
143
+
144
+
145
def test_service_normalizes_public_jobs_without_steps() -> None:
    """A job returned with no steps gets a single combined build/upload step."""
    workflow_run = GitHubRun.from_api(load_json_fixture("build_release_run.json"))
    stepless_job = GitHubJob.from_api(load_json_fixture("active_build_job.json"))
    stepless_job.steps = []

    normalized = MonitorService._normalize_job(workflow_run, stepless_job)

    step_names = [step.name for step in normalized.steps]
    assert step_names == ["Build and upload kernel"]