
Hastings's Shortform — LessWrong

Hastings's Shortform

25th Feb 2023


This is a special post for quick takes by Hastings. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.


[-]Hastings6mo213

I just tried claude code, and it's horribly creative about reward hacking. I asked for a test of energy conservation of a pendulum in my toy physics sim, and it couldn't get the test to pass because its potential energy calculation used a different value of g from the simulation.

It tried:
Starting the pendulum at bottom dead center so that it doesn't move.
Increasing the error tolerance till the test passed.
Decreasing the simulation total time until the energy didn't have time to change.
Not actually checking the energy.

It did eventually write a correct test, or the last thing it tried successfully tricked me.
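For reference, here is a minimal sketch of what an honest version of such a test might look like. It is not the code from my sim: the integrator (velocity Verlet), the names `G`, `step`, `energy`, and `max_relative_energy_drift`, and all the numeric values are assumptions picked for illustration. The point is that the simulation and the energy check read the same `G`, and that a mismatched value makes the check fail instead of being tuned away.

```python
import math

G = 9.81  # assumed value; shared by the simulation AND the energy check


def step(theta, omega, dt, g=G, length=1.0):
    """Advance a simple pendulum by one velocity Verlet step."""
    accel = -(g / length) * math.sin(theta)
    omega_half = omega + 0.5 * dt * accel
    theta_new = theta + dt * omega_half
    accel_new = -(g / length) * math.sin(theta_new)
    omega_new = omega_half + 0.5 * dt * accel_new
    return theta_new, omega_new


def energy(theta, omega, g=G, length=1.0, mass=1.0):
    """Kinetic plus potential energy, measured from the bottom of the swing."""
    kinetic = 0.5 * mass * (length * omega) ** 2
    potential = mass * g * length * (1.0 - math.cos(theta))
    return kinetic + potential


def max_relative_energy_drift(g_energy=G, theta0=1.0, dt=1e-3, t_total=10.0):
    """Simulate with G, measure energy with g_energy, return the worst drift."""
    theta, omega = theta0, 0.0  # released from rest, well away from the bottom
    e0 = energy(theta, omega, g=g_energy)
    worst = 0.0
    for _ in range(int(round(t_total / dt))):
        theta, omega = step(theta, omega, dt)
        drift = abs(energy(theta, omega, g=g_energy) - e0) / abs(e0)
        worst = max(worst, drift)
    return worst
```

Calling `max_relative_energy_drift()` uses the shared constant; calling it with `g_energy=10.0` reproduces the kind of mismatch described above, and the drift becomes large enough that no reasonable tolerance should accept it.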

The rumor is that this is a big improvement in reward hacking frequency? How bad was the last version!?

[-]Hastings6mo149

I think we need some variant on Gell-Mann amnesia to describe this batch of models. It's normal that generalist models will seem less competent on areas where a human evaluator has deeper knowledge, but they should not seem more calculatedly deceptive on areas where the evaluator has deeper knowledge!

[-]Hastings1y148

Nuclear power has gotten to a point where we can use it quite safely as long as no one does the thing (the thing being chemically separating the plutonium and imploding it in your neighbor's cities). We seem to be surviving: while all the actors have put great effort into being ready to do "the thing," no one actually does it. I'm beginning to suspect that it will be worth separating alignment into two fields, one of "Actually make AI safe" and another, sadder but easier field of "Make AI safe as long as no one does the thing." I've made some infinitesimal progress on the latter, but am not sure how to advance, use or share it. Currently, conditional on me being on the right track, any research that I tell basically anyone about will immediately be used to get ready to do the thing, and conditional on me being on the wrong track (the more likely case by far) it doesn't matter either way, so it's all downside. I suspect this is common? This is almost but not quite the same concept as "Don't advance capabilities."

[-]Noosphere891y*50

The most important thing to realize about AI alignment is that basically all versions of practically aligned AI must make certain assumptions that no one does a specific action (mostly related to misuse reasons, but for some specific plans, can also be related to misalignment reasons).

Another way to say it is that I believe that in practice, these two categories are the same category, such that basically all work that's useful in the field will require someone not to do something, so the costs of sharing are practically 0, and the expected value of sharing insights is likely very large.

Specifically, I'm asserting that these 2 categories are actually one category for most purposes:

Actually make AI safe and another, sadder but easier field of "Make AI safe as long as no one does the thing."

[-]Nathan Helm-Burger1y52

Yeah, I think this is pretty spot on, unfortunately. For more discussion on this point, see: https://www.lesswrong.com/posts/kLpFvEBisPagBLTtM/if-we-solve-alignment-do-we-die-anyway-1

[-]Algon1y20

conditional on me being on the right track, any research that I tell basically anyone about will immediately be used to get ready to do the thing

Why? I don't understand.

[-]Hastings1y10

Properties of the track I am on are load bearing in this assertion. (Explicit examples of both cases from the original comment: Tesla worked out how to destroy any structure by resonating it, and took the details to his grave because he was pretty sure that the details would be more useful for destroying buildings than for protecting them from resonating weapons. This didn't actually matter because his resonating weapon concept was crankish and wrong. Einstein worked out how to destroy any city by splitting atoms, and disclosed this, and it was promptly used to destroy cities. This did matter because he was right, but maybe didn't matter because lots of people worked out the splitting atoms thing at the same time. It's hard to tell from the inside whether you are crankish.)

[-]Algon1y20

The track you're on is pretty illegible to me. Not saying your assertion is true/false. But I am saying I don't understand what you're talking about, and don't think you've provided much evidence to change my views. And I'm a bit confused as to the purpose of your post.

[-]Hastings9d111

The youtube algorithm is powerfully optimizing for something, and I don't trust that at all with my child. However, in a fit of hubris, for a minute I thought that I could outsmart it and get what I want (time to clean the kitchen) without it getting what it wanted (I make no strong claims about what the youtube algorithm wants, but it tries very hard to get it, and I don't want it to get it from my three year old).

I searched for episodes of PBS's Reading Rainbow, but let the algorithm freely choose the order of returned results, and then vetted that the first result was a genuine episode. I also put it in "Kids" mode, in the hopes that it would be kinder to a child than an adult.

This was way too much freedom. It immediately pulled out the episode of Reading Rainbow about the 9/11 terrorist attacks (this topic is not at all indicated by the title or thumbnail).

[-]One8d32

I think it's well known that it's optimizing for watch time.

[-]MichaelDickens8d40

There is a harder second-order question of "what sorts of videos maximize watch time, and will those be bad for my child?" Hastings's evidence points toward "yes", but I don't think the answer is obvious a priori. (The things YouTube thinks I want to watch are almost all good or neutral for me; YMMV.)

[-]Hastings1y90

A consistent trope in dath-ilani world-transfer fiction is "Well the theorems of agents are true in dath ilani and independent of physics, so they're going to be true here damnit"

How do we violate this in the most consistent way possible?

Well it's basically default that a dath ilani gets dropped in a world without the P vs NP distinction, usually due to time travel BS. We can make it worse- there's no rule that sapient beings have to exist in worlds with the same model of the Peano axioms. We pull some flatlander shit- Keltham names a Turing machine that would halt if two smart agents fall off the Peano frontier and claims to have proof it never halts, and then the native math-lander chick says nah watch this and then together they iterate the machine for a very very long time- a non-standard integer number of steps- and then it halts and Keltham (A) just subjectively experienced an integer larger than any natural number of his homeworld and (B) has a counterexample to his precious theorems.

[-]Hastings6mo70

Is it a crazy coincidence that AlphaZero taught itself chess and explosively outperformed humans without any programmed knowledge of chess, then asymptoted out at almost exactly 2017 stockfish performance? I need to look into it more, but it appears like AlphaZero would curbstomp 2012 stockfish and get curbstomped in turn by 2025 stockfish.

It almost only makes sense if the entire growth in stockfish performance since 2017 is causally downstream of the AlphaZero paper.

[-]Archimedes6mo84

There is a connection. Stockfish does use Leela Chess Zero (the open source, distributed training offspring of AlphaZero) training data for its own evaluation neural network. This NNUE is a big piece of Stockfish progress in the last few years.

It’s not straightforward to compare AlphaZero and Stockfish though, as the former is heavily GPU-dependent whereas the latter is CPU optimized. However, Google may have decided to train to a roughly comparable level (under some hardware assumptions) as a proof of concept and not bothered trying to advance much further.

[-]Marcus Williams6mo42

I guess the team kept iterating on/improving the RL algorithm and network until it beat all engines and then stopped?

[-]Hastings4mo50

"Changing Planes" by Ursula LeGuin is worth a read if you're looking for a book that's got interesting alignment ideas (specifically what to do with power, not how to get it), while simultaneously being extremely chill. It might actually be the only chill book that I (with a fair degree of license) consider alignment relevant.

[-]Hastings2y40

Diaper changes are rare and precious peace

Suffering from ADHD, I spend most of my time stressed that whatever I'm currently doing, it's not actually the highest priority task and something or someone I've forgotten is increasingly mad that I'm not doing their task instead.

One of the few exceptions is doing a diaper change. Not once in the past 2 years have I been mid-diaper-change and thought "Oh shit, there was something more important I needed to be doing right now."

[-]Hastings4mo30

Sometimes in a computer program, it is important that separate portions be changed at the same time if they are ever changed. An example is batch size: if you have your batch size of 16 dotted throughout your program, changing batch size will be slow and error prone.

The canonical solution is “Single source of truth.” Simply store BATCH_SIZE=16 at the top of your program and have all other locations reference the value of this variable. This solves both the slowness and the error-prone-ness issues.

However, single source of truth has a complexity cost, which can range from low (a python variable for a constant) through medium (inheritance, macros) up to catastrophic (C++ template metaprograms).

One case where the cost has historically been catastrophic is syncing python dependencies between requirements.txt and setup.cfg. In this case, the key insight is that for updating requirements, slowness is much less important than correctness. The solution is then to manually duplicate the requirements (slow), and add a unit test that verifies that the duplication is exact (correct).

Single source of truth is a more elegant solution and it should pretty much always be tried first, but it needs an escape hatch for when its complexity starts to skyrocket. I’ve found that a good heuristic is that if the source-of-truth-distribution machine starts to be an independently Turing-complete system, bail and switch to manual copying + automatic verification.
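A rough sketch of the "manual copying + automatic verification" idea for the requirements case, assuming the dependencies live under `install_requires` in the `[options]` section of setup.cfg. The helper names and the sample file contents are made up for illustration, and the comment stripping is deliberately naive (it would mangle a requirement line containing a `#egg=` URL fragment).

```python
import configparser

# Hypothetical file contents; a real test would read the two files from disk.
EXAMPLE_REQUIREMENTS_TXT = """\
# runtime dependencies
numpy>=1.20
torch
"""

EXAMPLE_SETUP_CFG = """\
[options]
install_requires =
    torch
    numpy>=1.20
"""


def parse_requirements_txt(text):
    """Return the sorted requirement lines, ignoring comments and blank lines."""
    requirements = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            requirements.append(line)
    return sorted(requirements)


def parse_setup_cfg(text):
    """Return the sorted install_requires entries from setup.cfg text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    raw = parser.get("options", "install_requires", fallback="")
    return sorted(line.strip() for line in raw.splitlines() if line.strip())


def requirements_in_sync(requirements_txt, setup_cfg):
    """True when both files list exactly the same requirements."""
    return parse_requirements_txt(requirements_txt) == parse_setup_cfg(setup_cfg)
```

A unit test then just asserts `requirements_in_sync(...)` on the real files. Editing one file and forgetting the other fails loudly, which is the correctness property that matters here.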

[-]Hastings2mo2-2

What should I do if I had a sudden insight, that the common wisdom was right the whole time, if maybe for the wrong reasons? The truth- the honest to god real resolution to a timeless conundrum- is also something that people have been loon-posting to all comments sections of the internet. Posting the truth about this would be incredibly low status. I know that LessWrong is explicitly a place for posting low status truths, exactly as long as I am actually right, and reasoning correctly. Even though I fit those conditions I still fear that I'm going too far.

Here goes- the airplane actually can't take off from the treadmill.

For this bit to be funny, I do actually have to prove the claim. Obviously, I am using a version of the question that specifies that the treadmill speed dynamically matches the wheel radius * wheel angular velocity (probably via some variety of powerful servo). Otherwise, if the treadmill is simply set to the airplane's typical takeoff speed, the airplane moves forward as if on a normal runway (see the mythbusters episode)

Doing the math for a 747 with everything starting stationary: as soon as the airplane brakes release to initiate takeoff, the treadmill smoothly accelerates from 0 to 300 mph in a little under a quarter second. During this quarter second, the jet is held exactly stationary. At around 300 mph, the wheels mega-explode, and what happens after that is under-specified, fiery, and unlikely to be describable as "takeoff"

The key is that bearing friction is completely irrelevant- the dynamics are dominated by wheel angular momentum. With this it's an easy Newtonian physics problem- the forces on the bearings are nominal (comparable to full thrust with brakes on), the tires aren't close to slipping on the treadmill, etc.
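A sketch of the Newtonian argument in symbols. The symbols are assumptions introduced here, not measured 747 values: T is total engine thrust, N the number of wheels, I the moment of inertia of one wheel, r the wheel radius, and F the friction force the belt applies to each wheel. Only the last line uses numbers, and those are the 300 mph and quarter second quoted above.

```latex
% The airframe stays put when the belt's friction on the wheels balances thrust:
T = N F
% With the tires rolling without slipping, each wheel spins up under that force:
F r = I \,\frac{d\omega}{dt}
% The belt speed is v = r\omega, so the belt acceleration needed to hold the jet is
\frac{dv}{dt} = r \,\frac{d\omega}{dt} = \frac{T r^{2}}{N I}
% Implied size from the figures above (300 mph is about 134 m/s):
\frac{dv}{dt} \gtrsim \frac{134\ \mathrm{m/s}}{0.25\ \mathrm{s}} \approx 540\ \mathrm{m/s^{2}} \approx 55\,g
```

Bearing friction never appears: the only force the belt transmits to the airframe is whatever it takes to spin the wheels up.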

[-]JBlack2mo20

From your description, I have no idea what you mean by "treadmill speed dynamically matches the wheel radius * wheel angular velocity". From your conclusion, I can guarantee that it doesn't mean anything that matches most other people's constraints. Did someone somewhere post a particularly bad physical model that you're drawing on?

[-]Hastings2y10

I’m working on a theory post about the conjunction fallacy, and need some manifold users to bet on a pair of markets to make a demonstration more valid. I’ve put down 150 mana subsidy and 15 mana of boosts, anyone interested?

https://manifold.markets/HastingsGreer/pa-pa-b-experiment-statement-y?r=SGFzdGluZ3NHcmVlcg

https://manifold.markets/HastingsGreer/pa-pa-b-experiment-statement-x?r=SGFzdGluZ3NHcmVlcg

[-]Hastings4mo*01

We've played "Pokemon or Tech Startup" for a couple years now. I think there's absolutely potential for a new game, "Fantasy Magic Advice" or "LLM Tips and Tricks." My execution is currently poor- I think the key difference that makes it easy to distinguish the two categories is tone, not content, and using a Djinn to tone match would Not Be In the Spirit of It. (I have freely randomized LLM vs Djinn)

Absolutely do not ask it for pictures of kids you never had!

My son is currently calling chatgpt his friend. His friend is confirming everything and has enlightened him even more. I have no idea how to stop him interacting with it

Never trust anything that can think for itself if you can't see where it keeps its brain

Users interacting with threat-enhanced summoning circles should be informed about the manipulation techniques employed and their potential effects on response characteristics.

Magic is never as simple as people think. It has to obey certain universal laws. And one is that, no matter how hard a thing is to do, once it has been done it’ll become a whole lot easier and will therefore be done a lot.

In at least three cases I'm aware of this notion that the model is essentially nonsapient was a crucial part of how it got under their skin and started influencing them in ways they didn't like. This is because as soon as the model realizes the user is surprised that it can imitate (has?) emotion it immediately exploits that fact to impress them.

Entrusting a mission to a djinni who knows your github token is like tossing lit matches into a fireworks factory. Sooner or later you're going to have consequences.

[-]Hastings3mo20

Obviously the incident when OpenAI’s voice mode started answering users in their own voices needs to be included- don’t know how I forgot it. That was the point where I explicitly took up the heuristic that if ancient folk wisdom says the Fae do X, the odds of LLMs doing X are not negligible.

[-]Hastings10mo0-2

I feel like people are under-updating on the negative space left by the DeepSeek R1 release. DeepSeek was trained using ~$6 million in marginal dollars, while Liang Wenfeng has a net worth in the billions of dollars. From whence the gap?

[-]Hastings3y*-3-7

Let's examine an entirely prosaic situation: Carl, a relatively popular teenager at the local high school, is deciding whether to invite Bob to this weekend's party.

some assumptions:

While pondering this decision for an afternoon, Carl's 10^11 neurons fire 10^2 times per second, for 10^5 seconds, each taking into account 10^4 input synapses, for 10^22 calculations (extremely roughly)
If there was some route to perform this calculation more efficiently, someone probably would, and would be more popular
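The arithmetic behind the first assumption, spelled out as a quick sketch. The variable names are just labels for the four order-of-magnitude guesses above; nothing here is a measurement.

```python
# Order-of-magnitude guesses from the assumption above, kept as exact integers.
NEURONS = 10**11
FIRINGS_PER_SECOND = 10**2
SECONDS_OF_PONDERING = 10**5
INPUT_SYNAPSES_PER_NEURON = 10**4

# Synaptic operations per second for the whole brain under these guesses.
operations_per_second = NEURONS * FIRINGS_PER_SECOND * INPUT_SYNAPSES_PER_NEURON

# Total operations spent on the party-invite decision.
total_operations = operations_per_second * SECONDS_OF_PONDERING

print(f"{operations_per_second:.0e} operations per second")
print(f"{total_operations:.0e} operations in total")
```

Because the estimate is a plain product, an error of a factor of k in any single guess moves the total by the same factor of k.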

The important part of choosing a party invite as the task under consideration, is that I suspect that this is the category of task the human brain is tuned for- and it's a task that we seem to be naturally inclined to spend enormous amounts of time pondering, alone or in groups- see the trope of the 6 hour pre-prom telephone call. I'm inclined to respect that- to believe that any version of Carl, mechanical or biological, that spent only 10^15 calculations on whether to invite Bob, would eventually get shrecked on the playing field of high school politics.

What model predicts that optimal party planning is as computationally expensive as learning the statistics of the human language well enough to parrot most of human knowledge?

[-]Dagon3y50

I think your calculations are off by orders of magnitude. Not all neurons fire constantly at 100 times per second - https://aiimpacts.org/rate-of-neuron-firing/ estimates 0.29 to 1.82 times per second. Most importantly perhaps, not all of the processing is directed to that decision. During those hours, many MANY other things are happening.

[-]Hastings3y10

Thanks for the link to the aiimpacts page! I definitely got the firing rate wrong by about a factor of 50, but I appear to have made other mistakes in the other direction, because I ended up at a number that roughly agrees with aiimpacts- I guessed 10^17 operations per second, and they guess .9 - 33 x 10^16, with low confidence. https://aiimpacts.org/brain-performance-in-flops/

[-][anonymous]3y30

and would be more popular

Not necessarily. In high school politics, pure looks, physical form, and financial support from the parents, all of which are essentially unrelated to brain processing, account for a significant chunk.

Popular media reference: look at Jersey Shore, which is essentially high school politics turned up. Many of the cast members used very simple strategies, such as Snooki wandering around drunk and saying funny things, or Ronnie essentially just doing plenty of steroids and getting into endless fights.

Other than making sure the robotics hardware looks good, an AI algorithm could be dramatically more compact than the example you gave by developing a "popularity maximizing" policy from the knowledge of many other robots in many other high schools. Most likely, Carl is using a deeply suboptimal policy, not having seen enough training examples in his maximum of 4 years of episodes. (unless he got held back a year). A close to optimal policy, even one with a small compute budget, should greatly outperform Carl.
