.. _research-ngf-alpha-benchmarks:

Benchmarks
=====================

Stage-11 introduced the breakthrough:

- **Warp:** Embed latents into PCA(3) space, warp into a single dominant well.
- **Detect:** Use matched filters with null calibration to identify the true well.
- **Denoise:** Apply smoothing, phantom guards, and jitter averaging to suppress false wells.

(Part A) Latent-ARC Results (n=100)
-----------------------------------

This benchmark evaluates the geometric WDD pipeline (Warp → Detect → Denoise) on
synthetic ARC-like tasks. Instead of using raw pixel grids, problems are projected
into a latent space where ARC transformations (flip, rotate, color swap, etc.) are
simulated. The goal is to stress-test whether warped semantic manifolds can preserve
task structure and recover the correct primitive sequence. In short: it’s a sandbox
for testing ARC-style reasoning in latent space, validating that WDD can generalize
without hallucination.

+-------------------+-----------+-----------+--------+--------+---------+----------+
| Model             | Exact Acc | Precision | Recall | F1     | Halluc. | Omission |
+===================+===========+===========+========+========+=========+==========+
| Denoise (Stage 11)| **1.000** | 0.9977    | 0.9989 | 0.9983 | 0.0045  | 0.0023   |
+-------------------+-----------+-----------+--------+--------+---------+----------+
| Geodesic (pre)    | 0.640     | 0.8450    | 1.0000 | 0.8980 | 0.1550  | 0.0000   |
+-------------------+-----------+-----------+--------+--------+---------+----------+
| Stock baseline    | 0.490     | 0.8900    | 0.7767 | 0.7973 | 0.1100  | 0.2233   |
+-------------------+-----------+-----------+--------+--------+---------+----------+

**Note (Part A):** Stock baseline approximates what you’d see if you used simple
thresholds on LLM latents/logits without NGF’s Warp → Detect → Denoise.

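
As a reading aid for the Part A columns, here is a minimal sketch of how such
per-task scores could be computed from a ground-truth primitive sequence and a
predicted one. The function names ``task_metrics`` and ``macro_average`` are
hypothetical, and the definitions are assumptions made for illustration
(exact match compares the ordered sequences; hallucination is the share of
predicted primitives that are not in the truth; omission is the share of true
primitives that were missed). This is not the benchmark's scoring code.

.. code-block:: python

   def task_metrics(truth, predicted):
       """Score one task from two lists of primitive names (assumed definitions)."""
       truth_set, pred_set = set(truth), set(predicted)
       hits = len(truth_set & pred_set)
       precision = hits / len(pred_set) if pred_set else 0.0
       recall = hits / len(truth_set) if truth_set else 0.0
       f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
       return {
           "exact": float(list(truth) == list(predicted)),  # order-sensitive
           "precision": precision,
           "recall": recall,
           "f1": f1,
           # Assumed definitions: complements of precision and recall.
           "hallucination": 1.0 - precision if pred_set else 0.0,
           "omission": 1.0 - recall if truth_set else 0.0,
       }


   def macro_average(rows):
       """Average each metric over a non-empty list of per-task score dicts."""
       return {key: sum(row[key] for row in rows) / len(rows) for key in rows[0]}


   # One extra predicted primitive lowers precision but not recall.
   example = task_metrics(["flip_h", "rotate"], ["flip_h", "rotate", "color_swap"])

Under these assumptions, averaging per-task F1 need not equal the harmonic mean
of the averaged precision and recall, so the F1 column should be read as its
own aggregate rather than derived from the two neighbouring columns.
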
(Part B) LMM-HellaSwag Results (n=1000)
---------------------------------------

This benchmark instruments a frozen GPT-2 small model, tapped nine layers from the
output, to evaluate how well raw transformer latents support downstream WDD analysis.
The idea is to test whether the geometry of pre-trained language model embeddings
(without fine-tuning or training) already contains enough separability for stable
warp–detect–denoise behavior. It’s a probe of “out-of-the-box” model latents to
assess how much structure the sidecar can extract from a standard foundation model.

+-------------------+--------+------------------+-------------+-----------------------+
| Model             | F1     | ECE (Calibration)| Brier Score | Overconfidence > 0.70 |
+===================+========+==================+=============+=======================+
| MaxWarp (Stage 11)| 0.35   | 0.080            | 0.743       | 1.2%                  |
+-------------------+--------+------------------+-------------+-----------------------+
| Stock baseline    | 0.324  | 0.122            | 0.750       | 0.7%                  |
+-------------------+--------+------------------+-------------+-----------------------+
| Change (Δ)        | +0.032 | -0.032           | -0.007      | +0.5%                 |
+-------------------+--------+------------------+-------------+-----------------------+

.. figure:: /_static/stage11_well_compare.png
   :alt: NGF Warped vs Flat Paths
   :align: center
   :scale: 25%

   Fig 1. PCA-2 visualization of “semantic wells” (pre vs post warp) on GPT-2 (tap 9).

(Part C) Tier-1 micro-LM Sidecar Benchmarks (SBERT, n = 750)
----------------------------------------------------------------------

This benchmark runs the Tier-1 micro-lm pipeline using SBERT latents (MiniLM-L6-v2)
as input. Prompts from DeFi and ARC domains are parsed through the sidecar to yield
deterministic PASS / ABSTAIN / REJECT verdicts, with coverage, abstain, and
hallucination rates logged. This is the first reproducible production-grade benchmark
for the micro-lm sidecar, showing ~94–95% coverage with hallucination suppressed to ~1–2%.
It serves as the reference “community edition” baseline for ngeodesic.ai.

**Aggregate**

- Coverage: **94.5%** (95% CI **92.7–95.9%**)
- Abstain: **5.5%**
- Hallucination: **1.7%** (95% CI **1.0–2.9%**)
- Multi-accept: **46.3%**
- Span yield: **100%**

**Per-primitive (with 95% Wilson CIs)**

+------------------+-----+----------+------------------+---------------+------------------------+
| Primitive        | n   | Coverage | 95% CI           | Hallucination | Notes                  |
+==================+=====+==========+==================+===============+========================+
| borrow_asset     | 78  | 100.0%   | 95.3% – 100.0%   | 0.0%          | Strong                 |
+------------------+-----+----------+------------------+---------------+------------------------+
| claim_rewards    | 111 | 100.0%   | 96.7% – 100.0%   | 0.0%          | Strong                 |
+------------------+-----+----------+------------------+---------------+------------------------+
| deposit_asset    | 105 | 92.4%    | 85.7% – 96.1%    | 1.0%          | Slightly lower         |
+------------------+-----+----------+------------------+---------------+------------------------+
| repay_asset      | 98  | 100.0%   | 96.2% – 100.0%   | 0.0%          | Strong                 |
+------------------+-----+----------+------------------+---------------+------------------------+
| stake_asset      | 99  | 100.0%   | 96.3% – 100.0%   | 0.0%          | Strong                 |
+------------------+-----+----------+------------------+---------------+------------------------+
| swap_asset       | 82  | 100.0%   | 95.5% – 100.0%   | 0.0%          | Strong                 |
+------------------+-----+----------+------------------+---------------+------------------------+
| unstake_asset    | 85  | 100.0%   | 95.7% – 100.0%   | 0.0%          | Strong                 |
+------------------+-----+----------+------------------+---------------+------------------------+
| withdraw_asset   | 92  | 64.1%    | 53.9% – 73.2%    | 4.3%          | **Tuning underway**    |
+------------------+-----+----------+------------------+---------------+------------------------+

**Setup**: SBERT ``sentence-transformers/all-MiniLM-L6-v2``;
thresholds ``tau_span=0.5``, ``tau_rel=0.6``, ``tau_abs=0.93``; ``n_max=4``,
``topk=3``; ``T=720``, ``L=160``, ``beta=8.6``, ``sigma=0.0``.

**Definition notes**: *Coverage* = correct PASS on in-scope primitive;
*Abstain* = ABSTAIN when ambiguous; *Hallucination* = confident wrong PASS;
*Span yield* = share of predictions with grounded spans.

(Part D) Tier-2 micro-LM Sidecar Benchmarks (WDD + SBERT, n = 750, **coming**)
------------------------------------------------------------------------------

This benchmark extends the Tier-1 setup by fully integrating the WDD audit layer
(Warp → Detect → Denoise) on top of SBERT latents. Whereas Tier-1 establishes
deterministic single-primitive verdicts, Tier-2 stress-tests multi-primitive
sequences (e.g. rotate → flip → swap) and evaluates whether the sidecar maintains
low hallucination rates under compositional load. Early runs on 20–25 sample prompts
show stable performance and safe abstentions, with the full benchmark suite
(n ≈ 500+) slated next. This milestone demonstrates how micro-lm scales from
primitive-level correctness to sequence-level reasoning, validating the
geometry-first audit pipeline in higher-complexity domains.

**Metric Definitions**

- **Coverage** → % of prompts where the sidecar returned the correct primitive (PASS).
- **Abstain** → % of prompts where the sidecar explicitly declined to guess (ABSTAIN).
- **Hallucination** → % of prompts where the sidecar produced a confident but wrong PASS verdict.
- **Multi-accept** → % of prompts where the sidecar marked **more than one parse as valid**
  (e.g., a prompt could map to both ``unstake_asset`` and ``withdraw_asset``).
- **Span yield** → % of predictions where the model produced grounded spans
  (inputs/outputs aligned to canonical tokens).

Key Results → Micro-LMs
-----------------------

1. **WDD Works in Practice**
   The staged experiments confirmed that WDD reliably improved reproducibility,
   separation, and robustness. This validated NGF as more than theory: it was a
   working framework.

2. **Deterministic Traces**
   By Stage-11, identical prompts produced **identical traces and outcomes** across
   runs, a property absent in baseline models.

3. **Principled Abstains**
   Experiments showed that ambiguity could be detected geometrically: when wells
   overlapped or margins were thin, abstains were triggered instead of guesses.

4. **Blueprint for Sidecars**
   The insights from ngf-alpha directly inspired **Micro-LMs**: lightweight,
   domain-specific reasoning sidecars that use NGF rails to map intent to primitives
   under verifiers.

   - **ARC Micro-LM** became the reasoning stress test.
   - **DeFi Micro-LM** became the production-style demo with policy checks.
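
The abstain behaviour described under *Principled Abstains* can be illustrated
with a simple margin rule. This is a sketch under stated assumptions, not the
sidecar's implementation: the function ``verdict``, its parameters
``min_score`` and ``min_margin``, their default values, and the example scores
are all hypothetical. It only shows the idea that a thin gap between the best
and second-best candidate leads to ABSTAIN rather than a guess.

.. code-block:: python

   def verdict(scores, min_score=0.6, min_margin=0.1):
       """Return ("PASS", name) or ("ABSTAIN", None) from per-primitive scores.

       `scores` maps primitive names to similarity-like scores. Both thresholds
       are hypothetical illustration values, not the documented tau_* settings.
       """
       if not scores:
           return ("ABSTAIN", None)
       # Rank candidates from strongest to weakest.
       ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
       best_name, best_score = ranked[0]
       runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
       # Abstain when the winner is weak or barely ahead of the runner-up.
       if best_score < min_score or best_score - runner_up < min_margin:
           return ("ABSTAIN", None)
       return ("PASS", best_name)


   clear = verdict({"swap_asset": 0.91, "deposit_asset": 0.40})
   ambiguous = verdict({"unstake_asset": 0.80, "withdraw_asset": 0.76})

Because the rule is a pure function of its inputs, identical scores always
yield the same verdict, which is the property the *Deterministic Traces* item
refers to.
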