Sealed Computation: Towards Low-Friction Proof of Locality

10mo

Inference Certificates

As a prerequisite for the virtuality.network, we need to enable organizations which host inference workloads to prove the following about a particular AI output:

It was generated during this time period.
It was generated in this geographical region.
It was generated using this unique chip.
The generation workload was appropriately monitored.

In addition, to make it as easy as possible for Hosts to accede to the structured consortium, the techniques which facilitate these guarantees need to satisfy the following properties:

Hosts do not need to publish their inference logic.
Workloads can connect to the internet by default.
Producing inference certificates incurs negligible overhead.
Workloads require an additional interface if and only if Hosts want them to produce inference certificates. As

... (read 2715 more words →)

Reverse engineering the memory layout of GPU inference

Paul Bricman

10mo

Background Context

This research note provides a brief overview of our recent work on reverse engineering the memory layout of an inference process running on a modern hardware accelerator. We situate this work as follows:

Virtual diplomacy in relation to AI governance. Our organization focuses on developing the practice of virtual diplomacy. Introduced as a response to labor market disruption, dual-use capabilities proliferation, and other challenges posed by AI, it relies on verifiable monitoring instruments called virtual embassies. By taking stock of the inference deployments in which they are embedded, virtual embassies may allow third-parties to answer questions such as: "How much autonomous research has this organization consumed internally?" or "How much autonomous hacking

... (read 1619 more words →)

Join the $10K AutoHack 2024 Tournament

Paul Bricman

What is AutoHack?

AutoHack is a new platform that offers offensive security challenges modeled after real-world cybersecurity kill chains. Each challenge progresses through a sequence of stages, such as: Reconnaissance, Remote Code Execution, Privilege Escalation, Persistence, and Acting on Objectives.

What sets AutoHack apart from other platforms for assessing and practicing offensive security skills is its procedural nature. Each challenge consists of a sequence of stages, and each stage is sampled from a broad pool of dozens of options covering diverse vulnerabilities, misconfigurations, and behaviors. This approach ensures that every end-to-end challenge requires a unique mix of tactics, offering a rich assessment of pentesting skills for both humans and AI agents.

Why run a private

... (read 276 more words →)

Towards AI Safety Infrastructure: Talk & Outline

Paul Bricman

Thanks to Esben Kran and the whole Alignment Jam team for setting this up.

Context: We recently got the chance to share a bit more about the flavor of research we're particularly excited about at Straumli AI — that is, designing infrastructure that could help the relevant parties (e.g., developers, auditors, regulators, users) coordinate more effectively, and so make new governance initiatives possible. The talk is mostly an attempt to gesture at this line of work through a series of examples, each of which generally highlights how a certain cryptographic primitive could be used to address one particular coordination problem.

The talk expands on the following pieces of infrastructure:

Hashmarking (2:10) is the only item

... (read 422 more words →)

Replying toHashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation

Paul Bricman2y

Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation

That's an interesting idea. In its simplest form, the escrow could have draconic NDAs with both parties, even if it doesn't have the technology to prove deletion. In general, I'm excited about techniques that influence the type of relations that players can have with each other.

However, one logistical difficulty is getting a huge model from the developer's (custom) infra on the hypothetical escrow infra... It'd be very attractive if the model could just stay with the dev somehow...

Replying toHashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation

Paul Bricman2y

Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation

Thanks for the reply! I think the flaw you suggested is closely related to the "likelihood prioritization" augmentation of dictionary attacks from 3.2.1. Definitely something to keep in mind, though one measure against dictionary attacks in general is slow hashing, the practice of configuring a hashing function to be unusually expensive to compute.

For instance, with the current configuration, it would take you a couple years to process a million hashes if all you had was one core of a modern CPU. But this can be arbitrarily modified. The current slow hashing algorithm used also requires a memory burden, so as to make it harder to parallelize on accelerators.

Curious if you have other ideas on this!

-2

Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation

Paul Bricman

TL;DR: Quick "idea paper" describing a protocol for benchmarking AI capabilities in the open without disclosing sensitive information. It's similar to how passwords are used for user registration and authentication in a web app: experts first hash their answers, then developers hash the model's answers to check whether the hashes match, but at no point are the cleartext answers freely floating around. The paper explores the protocol's resilience against half a dozen failure modes, and speculates on future infrastructure for high-stakes evaluation.

Extended summary as Twitter thread.

Context: Below you can find the raw contents of the paper. While all feedback is welcome, identifying unseen failure modes would be particularly helpful ahead of potentially... (read 4573 more words →)

Replying toElements of Computational Philosophy, Vol. I: Truth

Paul Bricman2y

Elements of Computational Philosophy, Vol. I: Truth

Thanks for the interest! I'm not really sure what you mean, though. By components, do you mean circuits or shards or...? I'm not sure what you mean by clarifying or deconfusing components, this sounds like interpretability, but there's not much interpretability going on in the linked project. Feel free to elaborate, though, and I'll try to respond again.

Replying toElements of Computational Philosophy, Vol. I: Truth

Paul Bricman2y

Elements of Computational Philosophy, Vol. I: Truth

Thanks a lot for the feedback!

All I want for christmas is a "version for engineers." Here's how we constructed the reward, here's how we did the training, here's what happened over the course of training.

For sure, I greatly underestimated the importance of legible and concise communication in the increasingly crowded and dynamic space that is alignment. Future outputs will at the very least include an accompanying paper-overview-in-a-post, and in general a stronger focus on self-contained papers. I see the booklet as a preliminary, highly exploratory bit of work that focused more on the conceptual and theoretical rather than the applied, a goal for which I think it was very suitable (e.g. introducing... (read more)

Elements of Computational Philosophy, Vol. I: Truth

Paul Bricman

Paul Bricman, Tom Feeney

We're excited to share the first volume of Elements of Computational Philosophy, an interdisciplinary and collaborative project series focused on operationalizing fundamental philosophical notions in ways that are natively compatible with the current paradigm in AI.

The first volume paints a broad-strokes picture of operationalizing truth and truth-seeking. Beyond this high-level focus, its 100+ pages can be framed in several different ways, which is why we placed multiple topic-based summaries at the beginning of the document. The note to the reader and the table of contents should further help scope and navigate the document.

Have a pleasant read, and feel free to use this linkpost to comment on the document as you go. Questions, criticism, and suggestions are all welcome.

PS: There will soon be a presentation about the overarching project series as part of the alignment speaker series hosted by EleutherAI. Expect more information soon on the #announcements channel of their Discord server. In general, keep an eye on this space.

Replying toCataloguing Priors in Theory and Practice

Paul Bricman3y

Cataloguing Priors in Theory and Practice

I feel a lot of the problem relates to an Extremal Goodhart effect, where in the popular imagination views simulations as not equivalent to reality.

That seems right, but aren't all those heuristics prone to Goodharting? If your prior distribution is extremely sharp and you barely update from it, it seems likely that you run into all those various failure modes.

However my guess is that simplicity, not speed or stability priors are the default.

Not sure what you mean by default here. Likely to be used, effective, or?

Replying toCataloguing Priors in Theory and Practice

Paul Bricman3y

Cataloguing Priors in Theory and Practice

Thanks a lot for the reference, I haven't came across it before. Would you say that it focuses on gauging modularity?

Cataloguing Priors in Theory and Practice

Paul Bricman

This post is part of my hypothesis subspace sequence, a living collection of proposals I'm exploring at Refine. Preceded by an exploration of Boolean primitives in the context of coupled optimizers.

Thanks Alexander Oldenziel and Paul Colognese for discussions which inspired this post.

Intro

Simplicity prior, speed prior, and stability prior — what do they all have in common? They are all means of tilting an optimization surface towards solutions with certain properties. In other words, they are all heuristics informing the navigation of model space, trainer space, etc. However, what brings them together is also a systematic divide between their theoretical/conceptual/abstract framings and their practical/engineering implementations. All those priors appear to have been used... (read 2018 more words →)

Paul Bricman3y

You mean, in that you can simply prompt for a reasonable non-infinite performance and get said outcome?

(Structural) Stability of Coupled Optimizers

Paul Bricman

This post is part of my hypothesis subspace sequence, a living collection of proposals I'm exploring at Refine. Preceded by an interlude.

Thanks Adam Shimi, @artaxerxes, Herbert Jaeger, Tamsin Leake, and Ze Shen for discussion informing this post.

Intro

In the previous post, I explored a grammar of alignment proposals based on (what I later learned could be referred to as) coupled dynamical systems. To help you get a sense of this view, one family of proposals tackles the alignment problem by having human researchers optimize a trainer $^{2}$ (e.g. oversight leagues) which optimizes a trainer (e.g. SGD across model space employing empirical risk as heuristic) which optimizes a model (e.g. LLM) which optimizes a world (e.g.... (read 2715 more words →)

Interlude: But Who Optimizes The Optimizer?

Paul Bricman

This post is part of my hypothesis subspace sequence, a living collection of proposals I'm exploring at Refine. Preceded by representational tethers.

Intro

Consider diffusion models. DALL-E is able to take in an image as input and imperceptibly edit it so that it becomes slightly more meaningful (i.e. closer to the intended content). In this, DALL-E is an optimizer of images. It can steadily nudge images across image space towards what people generally deem meaningful images. But an image diffusion model need not nudge images around in this specific way. Consider an untrained version of DALL-E, whose parameters have just been initialized a moment ago. It still implicitly exerts currents across image space, it's... (read 2936 more words →)

Representational Tethers: Tying AI Latents To Human Ones

Paul Bricman

This post is part of my hypothesis subspace sequence, a living collection of proposals I'm exploring at Refine. Preceded by ideological inference engines, and followed by an interlude.

Thanks Adam Shimi, Alexander Oldenziel, Tamsin Leake, and Ze Shen for useful feedback.

TL;DR: Representational tethers describe ways of connecting internal representations employed by ML models to internal representations employed by humans. This tethering has two related short-term goals: (1) making artificial conceptual frameworks more compatible to human ones (i.e. the tension in the tether metaphor), and (2) facilitating direct translation between representations expressed in the two (i.e. the physical link in the tether metaphor). In the long-term, those two mutually-reinforcing goals (1) facilitate human oversight... (read 4710 more words →)

Replying toOversight Leagues: The Training Game as a Feature

Paul Bricman3y

Oversight Leagues: The Training Game as a Feature

Hm, I think I get the issue you're pointing at. I guess the argument for the evaluator learning accurate human preferences in this proposal is that it can make use of infinitely many examples of inaccurate human preferences conferred by the agent as negative examples. However, the argument against can be summed up in the following comment of Adam:

I get the impression that with Oversight Leagues, you don't necessarily consider the possibility that there might be many different "limits" of the oversight process, that are coherent with the initial examples. And it's not clear you have an argument that it's going to pick one that we actually want.

Or in your terms:

Not just

Paul Bricman3y

Oversight Leagues: The Training Game as a Feature

Thanks a lot for the feedback!

How are you getting the connection between the legible property the evaluator is selecting for and actual alignment?

Quoting from another comment (not sure if this is frowned upon):

1. (Outer) align one subsystem (agent) to the other subsystem (evaluator), which we know how to do because the evaluator runs on a computer.
2. Attempt to (outer) align the other subsystem (evaluator) to the human's true objective through a fixed set of positive examples (initial behaviors or outcomes specified by humans) and a growing set of increasingly nuanced negative examples (specified by the improving agent).

As it stands, this seems like a way to train a capable agent that's hyperspecialized on

Paul Bricman3y

Oversight Leagues: The Training Game as a Feature

In general could someone explain how these alignment approaches do not simply shift the question from "how do we align this one system" to "how do we align this one system (that consists of two interacting sub-systems)"

Thanks for pointing out another assumption I didn't even consider articulating. The way this proposal answers the second question is:

1. (Outer) align one subsystem (agent) to the other subsystem (evaluator), which we know how to do because the evaluator runs on a computer.
2. Attempt to (outer) align the other subsystem (evaluator) to the human's true objective through a fixed set of positive examples (initial behaviors or outcomes specified by humans) and a growing set of increasingly... (read more)

LESSWRONG
LW

LESSWRONG
LW

Paul Bricman

Representational Tethers: Tying AI Latents To Human Ones

(Structural) Stability of Coupled Optimizers

Benchmarking Proposals on Risk Scenarios

Steelmining via Analogy

Paul Bricman

Sealed Computation: Towards Low-Friction Proof of Locality

Reverse engineering the memory layout of GPU inference

Join the $10K AutoHack 2024 Tournament

Towards AI Safety Infrastructure: Talk & Outline

Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation

Elements of Computational Philosophy, Vol. I: Truth

Cataloguing Priors in Theory and Practice

Hypothesis Subspace

Paul Bricman

Representational Tethers: Tying AI Latents To Human Ones

(Structural) Stability of Coupled Optimizers

Benchmarking Proposals on Risk Scenarios

Steelmining via Analogy

Paul Bricman

Sealed Computation: Towards Low-Friction Proof of Locality

Reverse engineering the memory layout of GPU inference

Join the $10K AutoHack 2024 Tournament

Towards AI Safety Infrastructure: Talk & Outline

Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation

Elements of Computational Philosophy, Vol. I: Truth

Cataloguing Priors in Theory and Practice

Hypothesis Subspace

Inference Certificates

Background Context

What is AutoHack?

Why run a private

Intro

Intro

Intro