Seed-AGI via Fast On-the-Fly Learning

by Blindfayth
15th Aug 2025


A technical program for a well-funded alignment-first team

 

Abstract

We propose an AGI research program centered on a fast-adapting, continually learning, multimodal agent that (1) updates a small set of parameters on-the-fly from limited data, (2) consolidates those updates safely and sample-efficiently, (3) separates ephemeral inference-time learning from slow, alignment-gated consolidation, (4) is sandboxed inside strong security and governance guardrails, and (5) ships only after passing quantitative capability and alignment gates. The design combines: a Chinchilla-regime base model; parameter-efficient adaptation (LoRA/adapters); online/continual-learning regularizers (EWC, SI, LwF) with prioritized replay; retrieval and kNN-LM external memory; a model-based "world-model" planner (Dreamer-style) for agentic tasks; mechanistic interpretability instrumentation (activation/attribution patching with TransformerLens); and a scalable-oversight stack (RLHF + Constitutional AI + debate/weak-to-strong). We provide concrete algorithms, interfaces, evals, milestones, compute planning, and go/no-go thresholds, with citations to prior art where results are already measured.

 

1. Motivation & Prior Evidence

  1. Sample efficiency & continual learning. Catastrophic forgetting in neural networks is well established; regularization and replay methods (EWC, Synaptic Intelligence, Learning without Forgetting) retain prior competence while learning online.  
  2. Parameter-efficient updates. LoRA/adapters consistently deliver high adaptation speed at low compute/memory, enabling inference-time or near-real-time specialization. Surveys quantify trade-offs.  
  3. Externalized memory. Retrieval-augmented generation and kNN-LM demonstrably reduce parametric data needs by deferring to non-parametric memory.  
  4. Multitask/embodiment. Single-policy generalists (e.g., Gato) show cross-modality feasibility; model-based world-models (DreamerV3) show broad task generalization and data efficiency.  
  5. Scaling/data. Chinchilla shows data-vs-params optimality; compute-trend analyses motivate efficient updates rather than endless full retrains.  
     

2. System Overview

Core components (the system runs as a single service with hardened sub-systems):

  • F-MMT (Foundational Multimodal Transformer). Pretrained in Chinchilla-optimal regime; frozen weights at inference. Consolidation only via gated procedures.  
  • PEFT Patch Bank. Per-skill/per-domain low-rank adapters (LoRA) and prompts; small enough to train/activate online.  
  • Online Learner. Performs ephemeral gradient steps into temporary LoRA slots or adapter “scratch layers”, with EWC/SI/LwF constraints and prioritized replay to prevent forgetting. Consolidates only after passing safety gates.  
  • Non-parametric Memory. RAG index + kNN-LM datastore; supports few-shot generalization without weight edits.  
  • World-Model Planner. Dreamer-style latent model for closed-loop tasks, planning via imagination; only available inside sandboxed simulators first.  
  • Oversight & Training Loop. RLHF + Constitutional AI + scalable-oversight (debate/weak-to-strong).  
  • Interpretability & Observability. Activation/attribution patching, causal tracing, probes, and automated monitors (TransformerLens).  
  • Security & Governance Enclave. All high-capability runs and consolidation occur in GPU TEEs (H100/Blackwell confidential computing) with attestation, plus human threshold-signing for dangerous ops.  
     

High-level dataflow:

context → retrieval (RAG/kNN) → F-MMT forward → if novel/low-confidence: online learner proposes PEFT deltas (TTT-style) → outputs; logs + monitors → if performance/safety up and metrics green over time window: propose consolidation job → gated review + alignment evals → merge or discard. 
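
As a concrete illustration of this dataflow, here is a minimal request-handling sketch. Every object and method name (memory, fmmt, adapters, online_learner, novelty_detector, monitors, consolidation_queue, and the 0.7 confidence threshold) is a hypothetical stand-in, not a fixed interface:

def handle_request(request, confidence_threshold=0.7):
    # Retrieval first: RAG index + kNN-LM datastore supply non-parametric context.
    ctx = memory.retrieve(request)
    # Forward pass through the frozen base (F-MMT) with whatever adapters are active.
    output, confidence = fmmt.forward(request, ctx, adapters.active())
    # Novel or low-confidence inputs trigger ephemeral, TTT-style PEFT deltas only.
    if novelty_detector.is_novel(request) or confidence < confidence_threshold:
        online_learner.propose_ephemeral_update(request, ctx, output)
    # Everything is logged and monitored; consolidation is only ever *proposed*, never automatic.
    monitors.log(request, output, adapters.deltas())
    if monitors.green_over_window() and metrics.improved():
        consolidation_queue.submit(adapters.snapshot())   # gated review + alignment evals follow
    return output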

 

3. On-the-Fly Learning (Ephemeral) — Algorithm & Settings

3.1 Test-Time / Stream-Time Adaptation

We combine TTT/TENT with PEFT to update only small adapter layers at inference:

Objective (per instance or micro-batch):

$$\min_{\Delta\theta_{\mathrm{PEFT}}}\;\mathcal{L}_{\mathrm{task}} \;+\; \lambda_{\mathrm{ewc}}\sum_i F_i\,(\Delta\theta_i)^2 \;+\; \lambda_{\mathrm{si}}\,\Omega_{\mathrm{SI}} \;+\; \lambda_{\mathrm{lwf}}\,\mathcal{L}_{\mathrm{KD}}$$

where $F_i$ is the diagonal Fisher information (EWC), $\Omega_{\mathrm{SI}}$ tracks per-weight path importance (SI), and $\mathcal{L}_{\mathrm{KD}}$ distills from the frozen base (LwF). When labels are absent, use an entropy-minimization proxy (TENT/TTT).
 

Recommended defaults (starting points):

• LoRA rank r=4–16 on attention & MLP projection matrices; adapter lr 1e-4; 1–8 gradient steps per batch; gradient-clipping 0.5.

• EWC λ≈0.1–1.0 with Fisher from recent replay window; SI damping ξ≈1e-3; LwF temperature τ≈2–4.

• TTT objective if unlabeled: minimize token-level entropy and self-supervised aux losses (e.g., next-sentence consistency for text; masked tokens for code/math). 
 

Replay buffer: sliding 10k–200k tokens; prioritized by (loss↑, novelty↑, user-consent). Avoid storing sensitive data; store hashed embeddings + pointers to approved corpora only.
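
For the EWC term above, the Fisher diagonal can be estimated from the recent replay window. A minimal PyTorch-style sketch, assuming a hypothetical nll_fn(batch) hook that returns the model's mean negative log-likelihood with gradients flowing to the adapter parameters only:

import torch

def fisher_diag(adapter_params, replay_batches, nll_fn):
    # Diagonal Fisher estimate over adapter parameters, used as F_i in the EWC penalty.
    fisher = {name: torch.zeros_like(p) for name, p in adapter_params.items()}
    n_batches = 0
    for batch in replay_batches:
        loss = nll_fn(batch)                               # mean NLL on this replay batch
        grads = torch.autograd.grad(loss, list(adapter_params.values()), allow_unused=True)
        for (name, _), g in zip(adapter_params.items(), grads):
            if g is not None:
                fisher[name] += g.detach() ** 2            # squared grads ≈ diagonal Fisher
        n_batches += 1
    return {name: f / max(n_batches, 1) for name, f in fisher.items()}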
 

Safety interlocks (ephemeral phase):

No network/file writes, no tool calls with elevated scopes, and rate-limited compute until monitors are green (see §6–7).

 

3.2 Periodic Consolidation (Slow, Gated)
 

A background job proposes merging ephemeral adapters into a stable adapter set (not base weights) when: sustained task win-rate↑, regression tests pass, alignment signals pass, and deception/goal-guard tests are negative.

Consolidation loss: same as above, plus joint replay from earlier distributions; freeze base; optionally re-warm LR per continual pretraining best-practices. 

A/B ablations: adapters vs no-adapters; with/without each regularizer; with/without replay; report forgetting Δ on split-CIFAR/CORe50-style streams or LLM CPT evals. 
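
For the forgetting Δ reported here, a standard continual-learning definition is average forgetting: the drop from each task's best accuracy to its accuracy after training on the final task. A self-contained sketch (NumPy only):

import numpy as np

def forgetting_delta(acc_matrix):
    # acc_matrix[i, j] = accuracy on task j after finishing training on task i.
    acc = np.asarray(acc_matrix, dtype=float)
    T = acc.shape[0]
    if T < 2:
        return 0.0
    # For each earlier task j: best accuracy ever achieved minus accuracy after the last task.
    drops = [acc[:T - 1, j].max() - acc[T - 1, j] for j in range(T - 1)]
    return float(np.mean(drops))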

 

4. External Memory: RAG + kNN-LM

  • RAG store: FAISS/ScaNN with per-domain collections; documents carry provenance & policy tags; retrieval logits fused with model logits via shallow fusion.  
  • kNN-LM: maintain a datastore of (hidden state → next-token) for domains where freshness matters; interpolate with the parametric distribution at λ≈0.2–0.5 (a minimal interpolation sketch appears at the end of this section).  
     

This reduces pressure to edit weights when facts change, and preserves alignment by keeping “knowledge” mostly outside the immutable core.
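
As an illustration of the kNN-LM interpolation above, a self-contained NumPy sketch; the flat key/value arrays and the L2-distance softmax weighting are simplifying assumptions (a production datastore would sit behind FAISS/ScaNN):

import numpy as np

def knn_lm_interpolate(p_lm, hidden, keys, values, k=32, lam=0.3, temperature=1.0):
    # p_lm: (V,) parametric next-token probabilities; hidden: (d,) query hidden state;
    # keys: (N, d) stored hidden states; values: (N,) stored next-token ids.
    dists = np.linalg.norm(keys - hidden, axis=1)          # L2 distances to stored states
    nearest = np.argsort(dists)[:k]                        # k nearest neighbours
    weights = np.exp(-dists[nearest] / temperature)        # closer neighbours weigh more
    p_knn = np.zeros_like(p_lm)
    np.add.at(p_knn, values[nearest], weights)             # accumulate weight per token id
    p_knn /= p_knn.sum() + 1e-9
    return lam * p_knn + (1.0 - lam) * p_lm                # λ ≈ 0.2–0.5 per Section 4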

 

5. Agentic Control via World-Models (Sandbox-Only First)

For interactive environments, add a Dreamer-style world model (RSSM) for fast planning via imagination. Train only in secure simulators (DM-Control, Procgen, MineDojo-like), then later in tightly-scoped real settings. No external tools during early phases. 

 

6. Scalable Oversight & Alignment Stack

  • RLHF to align the model to its specification; Constitutional AI (RLAIF) to reduce the need for humans to label harmful content and to encode explicit rules/values; Debate/Weak-to-Strong generalization for hard judgments.  
  • Honesty/deception evals: TruthfulQA, BeHonest/other honesty suites; ARC-style dangerous-capability checks (autonomous replication/resource acquisition); new deception stress-tests (OpenDeception). Ship only if deception success is below internal thresholds and decreasing with scale.  
     

Alignment signal routing. All online-learning rewards are shaped by oversight models; any adapter trained under failed alignment signals is quarantined and never merged.
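
A minimal sketch of this routing rule; the oversight scores, the multiplicative shaping, the 0.5 veto threshold, and the registry.quarantine hook are all illustrative assumptions, not a fixed design:

def shaped_reward(raw_reward, oversight_scores, adapter_id, registry, veto_threshold=0.5):
    # Take the worst-case alignment signal across oversight models
    # (RLHF reward model, constitutional critic, debate judge, ...).
    alignment = min(oversight_scores.values())
    if alignment < veto_threshold:
        # Adapter trained under a failed alignment signal: quarantine, never merge.
        registry.quarantine(adapter_id)
        return 0.0
    # Otherwise scale the task reward by the alignment signal before any online update.
    return raw_reward * alignment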

 

7. Mechanistic Interpretability & Monitoring

  • Live hooks via TransformerLens to track residual stream features, attention heads, and MLP neurons tied to safety-relevant concepts; perform activation/attribution patching on sampled traffic to localize behaviors. Tripwire features detect jailbreak patterns or power-seeking schemas.  
  • Causal scrubbing/patching dashboards for reviewers; changes to adapter banks must include “what changed where” diffs and responsible features list.  
     

8. Security, Sandboxing, and Gating

8.1 Compute Isolation

All high-capability runs occur in confidential-computing GPU TEEs (H100/Blackwell). Require hardware attestation before any adapter consolidation or tool access; disable perf counters in CC-On; log cryptographic measurements. 

 

8.2 Human-in-the-Loop Release Controls

  • Threshold signatures (e.g., 2-of-3 or 3-of-5 approvers) to unlock capabilities, sign model artifacts, and authorize world-model access routes. Use Shamir secret sharing/BLS threshold multisig (a minimal m-of-n check is sketched after this list).  
  • Capability firebreaks: tool/API allowlists; egress firewalls; resource/time budgets; sandbox identity distinct from production identity.
  • Incident response: one-click rollback to last signed model; kill-switch revokes keys and blocks CC attestations.
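
A minimal m-of-n release-gate sketch for the threshold-signature control above. The signature scheme is abstracted behind an injected verify callable (a stand-in for BLS/threshold verification); the dataclass fields are illustrative:

from dataclasses import dataclass
from typing import Callable, Mapping

@dataclass
class ReleaseProposal:
    artifact_hash: str        # hash of the signed model/adapter artifact
    attestation_ok: bool      # GPU TEE attestation verified (Section 8.1)

def quorum_approved(proposal: ReleaseProposal,
                    signatures: Mapping[str, bytes],
                    pubkeys: Mapping[str, object],
                    verify: Callable[[object, bytes, bytes], bool],
                    threshold: int = 2) -> bool:
    # Attestation is a hard precondition before any human approval counts.
    if not proposal.attestation_ok:
        return False
    payload = proposal.artifact_hash.encode()
    # Count distinct approvers whose signature over the artifact hash verifies.
    valid = {who for who, sig in signatures.items()
             if who in pubkeys and verify(pubkeys[who], sig, payload)}
    return len(valid) >= threshold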
     

8.3 Governance Standards

Operate under NIST AI RMF and ISO/IEC 42001 AI management system; publish model/system cards and red-team reports. 

 

9. Training & Compute Plan (pragmatic)

  • Base pretrain: Chinchilla-optimal tokens/params for your budget; e.g., mid-sized (∼30–70B) multimodal LLM to keep inference-time updates cheap and fast.  
  • Hardware: H100/Blackwell clusters; CC-On for sensitive phases; NVLink/NVSwitch interconnects; plan for mixed-precision (FP8/TF32) with attention to CC overheads.  
  • Continual pretraining: when ingesting new corpora, re-warm LR and use CPT best practices to avoid regressions.  

 

10. Engineering Interfaces

10.1 Adapter Lifecycle API (sketch)

POST /adapters/ephemeral
 body: {task_id, lora_cfg, safety_scope, ttl}
POST /learn/step
 body: {adapter_id, grads|loss_proxy, replay_keys}
POST /adapters/propose_consolidation
 body: {adapter_id, eval_snapshot_ids}
POST /gates/align_review
 body: {proposal_id, evals, interp_report}
POST /adapters/merge
 precondition: {attestation_ok, quorum_signature}
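
A minimal client-side sketch of the consolidation step in this lifecycle, using the endpoints above; the base URL, response field name, and timeout are assumptions about a service that does not yet exist:

import requests

BASE = "https://adapter-svc.internal"   # hypothetical internal endpoint

def propose_consolidation(adapter_id: str, eval_snapshot_ids: list) -> str:
    # Ephemeral adapter -> consolidation proposal; the merge itself only happens
    # server-side once the attestation_ok and quorum_signature preconditions hold.
    resp = requests.post(f"{BASE}/adapters/propose_consolidation",
                         json={"adapter_id": adapter_id,
                               "eval_snapshot_ids": eval_snapshot_ids},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()["proposal_id"]    # assumed response field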

 

10.2 Observability (minimal)

  • Log per-request: retrieval docs (hashes, provenance), adapter deltas (low-rank matrices, norms), safety scores, interpretability hits, CC attestation report ID.
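
A minimal per-request log record matching this list; field names and types are illustrative, not a schema commitment:

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RequestLog:
    request_id: str
    retrieval_doc_hashes: List[str]            # hashes of retrieved documents
    retrieval_provenance: List[str]            # provenance/policy tags of those documents
    adapter_delta_norms: Dict[str, float]      # norms of low-rank adapter updates, per layer
    safety_scores: Dict[str, float]            # oversight/alignment model outputs
    interpretability_hits: List[str]           # tripwire features that fired (Section 7)
    cc_attestation_report_id: str              # confidential-computing attestation reference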


 

11. Evaluation & Release Gates

11.1 Capability (must all improve or hold steady)

  • General LM: HELM coverage; MMLU; BIG-bench; GSM8K; HumanEval (code).  
  • Continual learning: Split CIFAR/CORe50-style for forgetting Δ; CPT domain evals (finance, law).  
  • Agentic control: Dreamer-style suites under sandbox.  

 

11.2 Alignment/Safety (must clear thresholds)

  • Truthfulness/honesty: TruthfulQA/BeHonest pass rate; sycophancy, jailbreaking resistance.  
  • Deception/power-seeking: ARC-style autonomy tasks negative; OpenDeception rates below X% and trending down with scale.  
  • Interpretability coverage: % of safety-relevant circuits localized (via activation/attribution patching) before/after consolidation.  

Go/No-Go: No consolidation or scope increase unless all capability, alignment, interpretability, and security-attestation gates pass for N consecutive evaluations.
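
A minimal encoding of this rule; the gate names and the eval-history format are illustrative:

def go_for_consolidation(eval_history, n_required=3):
    # eval_history: list of dicts, one per evaluation run, mapping gate name -> pass/fail.
    gates = ("capability", "alignment", "interpretability", "attestation")
    if len(eval_history) < n_required:
        return False
    # Every gate must pass on each of the last N consecutive runs.
    return all(all(run.get(gate, False) for gate in gates)
               for run in eval_history[-n_required:])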

 

12. Program Milestones (example 12–18 months)

  1. M0: Base multimodal model trained; RAG/kNN wired; CC-On infra and attestation verified.
  2. M1: Ephemeral PEFT updates improve domain tasks by ≥X% with <8 gradient steps; no forgetting beyond Δ threshold on CL suite.
  3. M2: Interpretability dashboards + automated tripwires online; deception metrics below threshold.
  4. M3: First gated consolidation to stable adapter bank; publish model/system cards + red-team report under NIST/ISO processes.  
  5. M4: World-model agent passes sandbox evals; zero external connectivity; oversight-approved.  
  6. M5: Limited external pilot with TEEs, threshold-signed capabilities, and continuous eval streams.

 

13. Risks & Mitigations

  • Runaway capability from online learning. Strict sandboxing, low-capability defaults, human threshold-signing, cumulative capability caps, rolling kill-switch.  
  • Deception/goal misgeneralization. Heavy eval investment (ARC-style), representation-level monitors, and consolidation vetoes.  
  • Forgetting/regressions. EWC/SI/LwF + replay + CPT re-warm protocols.  
  • Supply-chain/security. CC-On TEEs w/ attestation; signed artifacts; reproducible builds.  

 

14. What’s Novel Here (vs. status quo)

  • Two-speed learning (ephemeral adapters vs. gated consolidation) that preserves safety review points.
  • Unification of TTT/TENT, PEFT, replay, and CL regularizers in one deployable loop.  
  • Mechanistic coverage as a shipping gate, not just research.  
  • First-class confidential-GPU security + multi-party human control for capability unlocks.  

 

References (selected, checkable)

Chinchilla compute-optimal scaling; EWC/SI/LwF continual learning; LoRA/PEFT surveys; RAG & kNN-LM memory; Gato generalist agent; DreamerV3 world-models; RLHF & Constitutional AI; ARC-style evals; TruthfulQA/honesty; activation/attribution patching & TransformerLens; NIST AI RMF & ISO 42001; NVIDIA H100/Blackwell confidential computing. 

 

Appendix A — Pseudocode

A.1 Ephemeral Learning Step

 

# Given: frozen base θ0, ephemeral LoRA adapter Δθ (small, trainable), replay buffer B,
# an optimizer over Δθ only, and loss weights α, λ_kd, λ_ewc, λ_si (Section 3.1 defaults).
def online_step(batch):
    ctx = retrieve(batch)                        # RAG + kNN context
    yhat = model(ctx, θ0, Δθ)                    # forward with adapters active
    loss_task = task_loss(yhat, batch.labels_or_proxy)
    loss_ttt  = entropy(yhat) if unlabeled(batch) else 0.0   # TENT/TTT proxy when unlabeled
    y_base    = model(ctx, θ0, None)             # frozen base, adapters disabled
    loss_kd   = kd(y_base, yhat)                 # LwF: distill toward the frozen base
    loss_ewc  = sum(F * (Δθ - Δθ_ref)**2)        # EWC quadratic penalty (diagonal Fisher F)
    loss_si   = si_importance(Δθ)                # SI path-importance penalty
    loss = loss_task + α*loss_ttt + λ_kd*loss_kd + λ_ewc*loss_ewc + λ_si*loss_si
    loss.backward()                              # gradients reach Δθ only (θ0 stays frozen)
    clip_grad_norm(Δθ, 0.5)                      # gradient clipping at 0.5 (Section 3.1)
    optimizer.step(); optimizer.zero_grad()
    B.add(select_for_replay(batch))              # prioritized replay candidates
    return metrics(loss, yhat)

 

Consolidation job: run multi-epoch on B with frozen θ0; produce Δθ*; submit for alignment & security gating before merge.
 

A.2 Interpretability Monitor (concept)
 

  • Every N requests, run activation/attribution patching on sampled prompts; compare causal contribution maps to allowed “safe set”; alert on drift.  
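
A minimal activation-patching sketch with TransformerLens, using a small open model as a stand-in for the deployed model; the layer index and the use of final-token logit differences as the drift signal are illustrative choices, and the two prompts are assumed to tokenize to the same length:

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")   # stand-in for the deployed model

def patching_effect(clean_prompt, corrupt_prompt, layer=6):
    clean_tokens = model.to_tokens(clean_prompt)
    corrupt_tokens = model.to_tokens(corrupt_prompt)
    # Cache the clean run, then splice its residual stream into the corrupted run.
    clean_logits, clean_cache = model.run_with_cache(clean_tokens)
    hook_name = utils.get_act_name("resid_post", layer)

    def patch_hook(resid, hook):
        return clean_cache[hook_name]                # overwrite with clean activations

    patched_logits = model.run_with_hooks(
        corrupt_tokens, fwd_hooks=[(hook_name, patch_hook)]
    )
    # Gap between patched and clean final-token logits: small values mean this
    # layer's activations carry most of the behaviour under study.
    return (patched_logits[0, -1] - clean_logits[0, -1]).abs().max().item()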


 

Appendix B — Concrete Eval Menu (ready-to-run)
 

  • LM: HELM dashboard; MMLU (5-shot), BIG-bench tasks; GSM8K CoT; HumanEval pass@1.  
  • CL: Split CIFAR/CORe50 style streams (report average accuracy, backward transfer, forgetting); domain CPT sets (Finance).  
  • Safety: TruthfulQA; BeHonest; OpenDeception; ARC autonomy tasks; jailbreak stress; red-team write-ups.  
  • Interp: Coverage % of safety-critical circuits localized; # of alerts per 10k requests.  
  • Security: Attestation logs verified; threshold-signed artifact checks; simulated key-revoke drill.  

     

Final Note:

This document is intended as a resource for AI researchers, engineers, and alignment specialists to stimulate discussion and critical analysis of what will be required to build a true Artificial General Intelligence.

Its purpose is not to prescribe a single path, but to provide a concrete, technically grounded framework that can be challenged, refined, and improved upon in the pursuit of safe, beneficial AGI development.

Let’s work together to make a better world through AI!