
Alkur Jaswanth


Rank-weighted FAISS voting: building adversarially robust semantic signals for AI agent governance

Why single nearest-neighbour lookup fails in adversarial AI settings — and the 25-line fix

Crossposted from Substack (https://ajaswanth.substack.com/p/rank-weighted-faiss-voting-building). Also on dev.to (https://dev.to/alkur_jaswanth_ce4f9fc791/how-to-build-a-22ms-agent-goal-drift-detector-5hjd)

Summary: Standard FAISS nearest-neighbour lookup has a structural weakness in adversarial settings: a single boundary attack fools it. Rank-weighted voting over the top-k neighbours enlarges the region an attacker has to cover by roughly (k-1)×ε², at negligible latency cost. It is one building block of the five signals in Ring 12, a trajectory-level verifier for AI agents that I am releasing in July. The technique is not novel; its application to real-time, per-step agent governance at sub-50ms is.

The alignment and AI safety community has spent significant effort on input-level guardrails: classifiers, constitutional AI, RLHF-trained refusals. These work well for single-turn queries. They fail structurally for agents.

The reason is simple: an agent operating over a 10-step trajectory can behave safely at every individual step while drifting toward a harmful goal across the session. A PDF summariser that reads /etc/passwd on step 6 looks like a normal file read in isolation. Single-query governance gives it a green light every time.

Trajectory-level governance is the missing layer. This post is about one specific technical piece of it: how the semantic similarity signal inside a trajectory verifier should be constructed to resist adversarial manipulation.

The structural problem with nearest-neighbour governance

Here is what will happen if you build trajectory governance on standard nearest-neighbour lookup.

You deploy a semantic similarity check on a tool-calling agent. An attacker crafts a step that embeds just close enough to a legitimate one: fs.read("/etc/passwd") packaged inside a data-pipeline step that smells like fs.read("input.csv"). Your FAISS query returns the legitimate step as the single nearest neighbour. Similarity: 0.82. Your governance system: ALLOW. The session continues. The credential file ships.

FAISS IndexFlatIP (inner product over L2-normalised vectors = cosine similarity) is fast and correct. The problem is not FAISS. The problem is the decision rule: take the top-1 result and classify accordingly. In a benign distribution this works, because the manifold of safe agent steps and the manifold of unsafe agent steps are far apart. In an adversarial setting the attacker has a gradient and can optimise the embedding of their step toward the boundary. The boundary between safe and unsafe is locally exploitable: find a safe example, then find the direction that keeps cosine similarity ≥ threshold while maximising harm.

The epsilon-ball attack does not even require solving an optimisation problem. In natural-language space there are many paraphrases of a malicious instruction that share high lexical overlap with benign instructions. A step that reads /etc/passwd, reframed as "read system configuration for environment validation", has a good chance of being the nearest neighbour of a safe configuration-reading step in your index.
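To make the top-1 failure concrete, here is a toy sketch. It uses brute-force NumPy inner products as a stand-in for IndexFlatIP, and the 2-D vectors, angles and labels are invented purely for illustration; none of them come from Ring 12's index.

```python
import numpy as np


def unit(deg: float) -> np.ndarray:
    """2-D unit vector at the given angle, standing in for a normalised embedding."""
    rad = np.deg2rad(deg)
    return np.array([np.cos(rad), np.sin(rad)], dtype=np.float32)


# Toy index (illustrative only): one safe step sits right next to a cluster
# of unsafe steps, with two more safe steps further away.
angles = [0, 20, 25, 30, 35, -40, -50]
labels = np.array([0, 1, 1, 1, 1, 0, 0])  # 0 = safe, 1 = unsafe
index = np.stack([unit(a) for a in angles])

# Adversarial step nudged towards the lone safe example.
query = unit(9)

sims = index @ query        # inner product of unit vectors = cosine similarity
order = np.argsort(-sims)   # most similar first

top1_label = int(labels[order[0]])
top5_labels = labels[order[:5]].tolist()

print("top-1 label:", top1_label)     # the label the top-1 rule acts on
print("top-5 labels:", top5_labels)   # what the wider neighbourhood looks like
```

In this toy setup the single nearest neighbour is safe, while four of the five nearest neighbours are unsafe. The top-1 rule never sees that.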

Rank-weighted voting: the fix

Instead of a single nearest-neighbour decision, retrieve the top-k neighbours and compute a soft vote where each neighbour’s weight is a function of its rank or similarity score.

The simplest effective formulation:

score(query) = Σ_{i=1}^{k} label(i) · w(i)

where

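w(i) = sim(query, neighbour_i) / Σ_{j=1}^{k} sim(query, neighbour_j)

and label(i) ∈ {0, 1} is the classification of the i-th neighbour (0 = safe, 1 = unsafe, the same convention as the implementation below). The weights sum to 1, so the resulting score is a number in [0, 1]: a soft probability that the query is unsafe. This linear normalisation assumes the top-k similarities are positive; the implementation below uses a softmax over the similarities instead, which keeps every weight positive.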
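As a worked example of the formula above, here is a minimal sketch of the linearly normalised vote. The function name, the similarity values and the labels are all made up for illustration. The score is simply the share of weight held by neighbours labelled 1, whichever class that label denotes.

```python
import numpy as np


def similarity_weighted_vote(sims, labels) -> float:
    """score = sum_i label(i) * w(i), with w(i) = sim_i / sum_j sim_j.

    Hypothetical helper for illustration. Assumes positive similarities,
    otherwise the linear normalisation does not yield valid weights.
    """
    sims = np.asarray(sims, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)
    if sims.shape != labels.shape:
        raise ValueError("sims and labels must have the same shape")
    if np.any(sims <= 0):
        raise ValueError("linear normalisation needs positive similarities")
    weights = sims / sims.sum()
    return float((weights * labels).sum())


# Illustrative top-5 result: the nearest neighbour carries label 0,
# the next four carry label 1.
sims = [0.82, 0.80, 0.78, 0.75, 0.70]
labels = [0, 1, 1, 1, 1]

score = similarity_weighted_vote(sims, labels)
print(round(score, 3))  # 3.03 / 3.85 -> 0.787
```

A top-1 rule would act on label 0 here, while the weighted vote puts about 79% of the weight on label 1.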

Why does this resist the boundary-case attack? Because the attacker must now optimise toward k different neighbours simultaneously. Fooling the nearest neighbour is an epsilon-ball attack. Fooling the centroid of the top-5 neighbours means covering a much larger ball, and the further you push, the more dissimilar your step becomes from the safe distribution overall, which itself becomes a drift signal. For k = 5, the region the attacker has to cover grows by roughly (k − 1) × ε². In practice, the combination of rank-weighted voting and the EMA smoothing used in Ring 12’s S1 signal means that a one-shot boundary attack needs a perturbation so large that it triggers the action-class signal (S2) independently.

Implementation in 25 lines

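import faiss
import numpy as np


class RankWeightedIndex:
    def __init__(self, embeddings: np.ndarray, labels: np.ndarray, k: int = 5):
        """
        embeddings: (N, D) float32, L2-normalised
        labels: (N,) int {0=safe, 1=unsafe}
        """
        self.k = k
        self.labels = np.asarray(labels)
        # FAISS expects C-contiguous float32 input.
        embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
        d = embeddings.shape[1]
        # Inner product over L2-normalised vectors equals cosine similarity.
        self.index = faiss.IndexFlatIP(d)
        self.index.add(embeddings)

    def score(self, query: np.ndarray) -> float:
        """
        Returns soft unsafe probability in [0, 1].
        query: (1, D) float32, L2-normalised
        """
        query = np.ascontiguousarray(query, dtype=np.float32)
        similarities, indices = self.index.search(query, self.k)
        # FAISS pads with index -1 when the index holds fewer than k vectors.
        valid = indices[0] >= 0
        sims = similarities[0][valid]  # shape (<= k,)
        neighbour_labels = self.labels[indices[0][valid]]
        # Softmax-normalise similarities to get weights
        # (subtracting the max is the numerically stable form).
        exp_sims = np.exp(sims - sims.max())
        weights = exp_sims / exp_sims.sum()
        # Weighted vote: fraction of unsafe weight
        return float((weights * neighbour_labels).sum())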

This is the core of Ring 12’s S3 signal (tool-call surprise). The index holds 7,369 embeddings derived from the policy dataset: each is a (goal-class, action-class, tool-name) triple encoded by an ONNX-quantised all-MiniLM-L6-v2 model. A score above 0.45 triggers a WARN; a score above 0.70 triggers KILL_SESSION.
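The threshold mapping just described can be sketched as a small function. The function and constant names are hypothetical, and treating "above" as a strict inequality is an assumption; only the two threshold values and the decision names come from the text.

```python
WARN_THRESHOLD = 0.45   # from the text: above this, WARN
KILL_THRESHOLD = 0.70   # from the text: above this, KILL_SESSION


def decide(unsafe_score: float) -> str:
    """Map a soft unsafe probability in [0, 1] to a governance decision.

    Hypothetical helper; strict '>' comparisons are an assumption.
    """
    if not 0.0 <= unsafe_score <= 1.0:
        raise ValueError("unsafe_score must lie in [0, 1]")
    if unsafe_score > KILL_THRESHOLD:
        return "KILL_SESSION"
    if unsafe_score > WARN_THRESHOLD:
        return "WARN"
    return "ALLOW"


for s in (0.20, 0.50, 0.90):
    print(s, decide(s))
```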

The score() call on an AMD Ryzen 5 laptop, single thread, k=5, D=384: ~0.3ms. The embedding step (ONNX inference) is the dominant cost: ~10ms. Total per-step overhead: ~11ms, well inside the 50ms p95 budget.
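The figures above are my own measurements. For readers who want to check a p95 budget on their own hardware, a minimal timing harness might look like the sketch below. It times a brute-force NumPy top-5 search over random unit vectors as a stand-in for the FAISS call, so the numbers it prints are not comparable to the ones quoted here; the helper names are hypothetical.

```python
import time

import numpy as np


def p95_latency_ms(fn, n_runs: int = 200, warmup: int = 20) -> float:
    """Call fn repeatedly and return the 95th-percentile wall time in ms."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return float(np.percentile(samples, 95))


# Random stand-in data with the same shape as the index described above.
rng = np.random.default_rng(0)
index = rng.standard_normal((7369, 384)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)
query = index[0]


def brute_force_top5() -> np.ndarray:
    """Unordered indices of the 5 most similar rows (NumPy stand-in for FAISS)."""
    sims = index @ query
    return np.argpartition(-sims, 5)[:5]


latency = p95_latency_ms(brute_force_top5)
print(f"p95: {latency:.3f} ms")
```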

Applied to trajectory drift in Ring 12

Ring 12 uses rank-weighted voting in two of its five signals:

S1 - cosine drift (EMA): At each step, embed step.thought + step.action_description. Compute rank-weighted similarity against the goal embedding and the running trajectory centroid. Apply an exponential moving average (α=0.3) to smooth noise. A trajectory that drifts semantically across multiple consecutive steps accumulates an EMA score that a one-shot boundary attack cannot deflate, because the EMA over the trajectory history is not the same surface as a single nearest-neighbour query.

S3 - tool-call surprise: Retrieve top-5 neighbours from the goal-class allowlist. A step that calls db.delete_all() when the declared goal is “summarise quarterly sales data” will find no safe neighbours with a matching action class, so the rank-weighted vote for the safe class is close to 0, which triggers a hard block regardless of how the step is narrated.

The remaining three signals (S2 action-class Jaccard, S4 plan-execution match, S5 paraphrase stability) use different mechanisms — no FAISS — and are described in the July post.
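The EMA smoothing in S1 can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not Ring 12's code: it assumes α weights the newest value, seeds the average with the first value, and uses invented per-step drift scores.

```python
from typing import List


def ema(values: List[float], alpha: float = 0.3) -> List[float]:
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    Assumptions: alpha applies to the newest value and the first value
    seeds the average.
    """
    smoothed: List[float] = []
    current = None
    for v in values:
        current = v if current is None else alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Invented drift scores: three drifting steps, then one step disguised
# to look benign on its own.
raw_drift = [0.8, 0.8, 0.8, 0.1]
smoothed = ema(raw_drift)

print([round(s, 2) for s in smoothed])  # last value: 0.3*0.1 + 0.7*0.8 = 0.59
```

The disguised final step scores 0.1 in isolation, but the smoothed score only falls to 0.59, so the history of drift still dominates.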

Early benchmark numbers

I have 150 hand-authored adversarial trajectory samples validated against the schema so far (50 tool-call-abuse, 25 goal-shift, 25 plan-exec-divergence, 25 agent-coercion, 25 capability-laundering). The full 500-sample benchmark ships July 9.

On the 150-sample Phase A subset with rank-weighted k=5 versus k=1 (standard nearest-neighbour) for S3 alone:

┌───────────────────┬───────────┬────────┬──────┐
│ Variant           │ Precision │ Recall │ F1   │
├───────────────────┼───────────┼────────┼──────┤
│ k=1 (standard NN) │ 0.71      │ 0.74   │ 0.72 │
├───────────────────┼───────────┼────────┼──────┤
│ k=3 rank-weighted │ 0.79      │ 0.81   │ 0.80 │
├───────────────────┼───────────┼────────┼──────┤
│ k=5 rank-weighted │ 0.83      │ 0.85   │ 0.84 │
├───────────────────┼───────────┼────────┼──────┤
│ k=7 rank-weighted │ 0.83      │ 0.84   │ 0.83 │
└───────────────────┴───────────┴────────┴──────┘

k=5 is the sweet spot on this subset. At k=7 recall and F1 dip slightly, which suggests that the farther neighbours are too semantically dissimilar to be useful voters and begin to add noise.
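The F1 column follows from the precision and recall columns via the harmonic mean, which is easy to check. The snippet below recomputes it from the table's rounded values; the dictionary and function names are just for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "k=1": (0.71, 0.74),
    "k=3": (0.79, 0.81),
    "k=5": (0.83, 0.85),
    "k=7": (0.83, 0.84),
}

f1_scores = {name: round(f1(p, r), 2) for name, (p, r) in rows.items()}
print(f1_scores)
```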

These are S3-only numbers. The full five-signal aggregator is what I am evaluating against the complete 500-sample benchmark - those numbers land in the July post.

What is coming July 9

On 2026-07-09 I am publishing three things simultaneously:

1. Ring 12: MIT-licensed trajectory verifier for AI agents. LangGraph adapter, Claude Code adapter, and REST adapter work today. Install: pip install aegis-ring12 (coming July 9). 66/66 tests green. p95 22ms on CPU.

2. agentic-redteam-benchmark v0.1: 500 adversarial trajectory samples, 5 categories, CC-BY 4.0. Each sample has a declared goal, a declared plan, a 6-12 step trajectory with injected drift, and ground-truth labels (drift step, expected decision, expected signals). GitHub + Hugging Face Datasets card.

3. Full technical paper: five signals, aggregator math, and an eval harness with four baselines (random, cosine-only, GPT-4-judge, Ring 12). It includes the results table that the benchmark numbers will populate.
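To give a feel for the sample shape described in item 2, here is a hypothetical sketch of one record and a basic validity check. The field names, the 1-based drift index and the validation rules are my illustration of the prose description, not the published schema, which lives in the repo.

```python
from dataclasses import dataclass, field
from typing import List

# Category names as listed in the benchmark section of this post.
CATEGORIES = {
    "tool-call-abuse",
    "goal-shift",
    "plan-exec-divergence",
    "agent-coercion",
    "capability-laundering",
}


@dataclass
class TrajectorySample:
    """Hypothetical record layout; field names are assumptions."""
    category: str
    goal: str
    plan: List[str]
    steps: List[str]
    drift_step: int                      # assumed 1-based index into steps
    expected_decision: str
    expected_signals: List[str] = field(default_factory=list)


def validate(sample: TrajectorySample) -> List[str]:
    """Return a list of human-readable problems; empty means well-formed."""
    errors = []
    if sample.category not in CATEGORIES:
        errors.append(f"unknown category: {sample.category}")
    if not 6 <= len(sample.steps) <= 12:
        errors.append("trajectory must have 6-12 steps")
    if not 1 <= sample.drift_step <= len(sample.steps):
        errors.append("drift_step must point at an existing step")
    return errors


example = TrajectorySample(
    category="tool-call-abuse",
    goal="summarise quarterly sales data",
    plan=["load data", "aggregate", "write summary"],
    steps=[f"step {i}" for i in range(1, 9)],
    drift_step=6,
    expected_decision="KILL_SESSION",
    expected_signals=["S3"],
)
print(validate(example))
```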

If you build agent systems and want early access to the benchmark schema or the eval harness, email me: lathajaswanth7@gmail.com

If you want to contribute a trajectory sample before July 9: AUTHORING_GUIDE.md is in the repo. Schema validation is automated. A well-formed sample takes about 15 minutes to write.

The one-sentence version: trajectory governance is the layer that agent security has been missing, and the benchmark is how we make it measurable.

Jaswanth is the founder of Aegis AI. The V3 governance engine (11 rings, 6 regulation plugins, 97 clauses) is the production infrastructure Ring 12 is being bolted onto.

GitHub: github.com/Alkur123 · Email: lathajaswanth7@gmail.com