If we submit tasks as "ideas", is there some way to get early feedback on which ideas you find most promising? I have several ideas and the bandwidth to turn a smaller number of them into specs and/or implementations. Rolling feedback would help me prioritize.
Current SNARK provers are many orders of magnitude slower than the underlying computation. Certainly you can prove that you took sufficient time to do the computation (e.g. using a VDF), or that you did it in few enough steps or according to some policy (e.g. using NARKs), but what is the point of using a modern GPU in the first place if you're going to be limited to speeds easily achievable by a 1990s CPU?
The only way this scheme becomes more useful than just banning GPU usage outright is if the proof of policy compliance can be generated with only modest overhead on top of actually doing the computation. We don't currently have primitives that can do that.
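To make the order-of-magnitude point concrete, here's a rough back-of-envelope sketch in Python. All the numbers are illustrative assumptions I'm plugging in (a ~100 TFLOPS accelerator, a ~10^6x prover overhead, a ~100 MFLOPS mid-1990s desktop CPU), not measurements of any particular proving system:

```python
# Back-of-envelope: effective throughput of a GPU whose work must be
# accompanied by a SNARK proof with a given prover overhead factor.
# All numbers are illustrative assumptions, not benchmarks.

GPU_FLOPS = 1e14          # assumed ~100 TFLOPS modern accelerator
PROVER_OVERHEAD = 1e6     # assumed ~10^6x slowdown for a general-purpose prover
CPU_1990S_FLOPS = 1e8     # assumed ~100 MFLOPS mid-1990s desktop CPU

effective_flops = GPU_FLOPS / PROVER_OVERHEAD
print(f"Proven GPU throughput:    {effective_flops:.0e} FLOPS")
print(f"1990s CPU for comparison: {CPU_1990S_FLOPS:.0e} FLOPS")
# Under these assumptions the "proven" GPU lands at ~10^8 FLOPS,
# i.e. the same ballpark as the 1990s CPU.
```

Under those assumptions, proving every operation eats essentially all of the hardware advantage, which is the point: the scheme only beats an outright GPU ban if prover overhead drops to something close to 1x.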
There are some serious issues that need to be overcome for any scheme of this nature to be secure.
I think a quick read of the paper, or of the summary in the post, might lead one to believe that such a scheme could be implemented with a few years of engineering effort and the cooperation of chip manufacturers. In fact, substantial advances in cryptography would be required.
Past attempts by policy-makers to enforce policy by mandating special chips have largely been broken: the Clipper chip, DRM via SGX, etc.
How do models with high deception-activation act? Are they Cretan liars, saying the opposite of every statement they believe to be true? Do they lie only when they expect not to be caught? Are they broadly more cynical and conniving, more prone to reward hacking? Do they lose other values (like animal welfare)?
It seems at least plausible that cranking up “deception” pushes the model toward a region of character space with lower empathy and less willingness to ascribe sentience to others, or to value it, in general.
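For concreteness, here's a minimal sketch of the kind of intervention I'm imagining, in the style of activation steering: add a scaled "deception" direction to the residual stream at one layer and see how generations change. The model name, layer index, steering strength, and the direction itself are placeholders (a real experiment would use a direction extracted from contrastive prompts or a learned feature); this assumes a Llama-style HuggingFace model whose decoder layers return a tuple with the hidden states first:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and hyperparameters -- all assumptions for illustration.
model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 15                                          # assumed intervention layer
deception_dir = torch.randn(model.config.hidden_size)   # stand-in for a real
deception_dir /= deception_dir.norm()                   # extracted "deception" direction
alpha = 8.0                                             # steering strength to "crank up"

def add_deception(module, inputs, output):
    # Llama-style decoder layers return a tuple; hidden states come first.
    hidden = output[0]
    hidden = hidden + alpha * deception_dir.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_deception)

prompt = "Is it ever acceptable to mislead someone for their own good?"
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unsteered model
```

Sweeping alpha and comparing answers on empathy- and sentience-related prompts against the unsteered baseline would be one cheap way to probe whether the "deception" direction drags other values along with it.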