Thomas Kwa's Shortform

Thomas Kwa


1 min read · 22nd Mar 2020 · 137 comments

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a special post for quick takes by Thomas Kwa. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Mentioned in

79Thomas Kwa's research journal



[-]Thomas Kwa2mo9022

Air purifiers are highly suboptimal and could be >2.5x better.

Some things I learned while researching air purifiers for my house, to reduce COVID risk during jam nights.

An air purifier is simply a fan blowing through a filter, delivering a certain CFM (airflow in cubic feet per minute). The higher the filter resistance and lower the filter area, the more pressure your fan needs to be designed for, and the more noise it produces.
HEPA filters are inferior to MERV 13-14 filters except for a few applications like cleanrooms. The technical advantage of HEPA filters is filtering out 99.97% of particles of any size, but this doesn't matter when MERV 13-14 filters can filter 77-88% of infectious aerosol particles at much higher airflow. The correct metric is CADR (clean air delivery rate), equal to airflow * efficiency. [1]
Commercial air purifiers use HEPA filters for marketing reasons and to sell proprietary filters. But an even larger flaw is that they have very small filter areas for no apparent reason. Therefore they are forced to use very high pressure fans, dramatically increasing noise.
Originally people devised the Corsi-Rosenthal Box to maximize CADR. They're cheap but rather lo

... (read more)
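To make the CADR comparison above concrete, here is a minimal sketch of the formula CADR = airflow * efficiency. The 99.97% and 77% efficiencies come from the post; the two airflow figures are hypothetical numbers picked only to illustrate how a less restrictive filter can win despite lower single-pass efficiency.

```python
def cadr(airflow_cfm: float, efficiency: float) -> float:
    """Clean air delivery rate: airflow times single-pass filtration efficiency."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be a fraction between 0 and 1")
    return airflow_cfm * efficiency


# Efficiencies are from the post; airflows are ASSUMED for illustration only.
# The idea is that the same fan pushes less air through a denser HEPA filter.
hepa = cadr(airflow_cfm=100.0, efficiency=0.9997)
merv13 = cadr(airflow_cfm=300.0, efficiency=0.77)
ratio = merv13 / hepa

print(f"HEPA CADR:    {hepa:.1f} CFM")
print(f"MERV 13 CADR: {merv13:.1f} CFM")
print(f"MERV 13 / HEPA: {ratio:.2f}x")
```

Under these assumed airflows the MERV 13 setup delivers more clean air, but the real ratio depends entirely on the measured airflow of a given fan and filter combination.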

5M. Y. Zuo2mo

Is reducing cost of manufacturing filters 'no apparent reason'? It seems like literally the most important reason... the profit margin of selling replacement filters would be heavily reduced, assuming pricing remains the same.

2Nathan Helm-Burger2mo

I don't think that a small HEPA filter is necessarily more expensive to produce than a larger MERV filter. I think they are using a different rationale to decide on filter types. Their perception of public desirability/marketability is likely the biggest factor in their decision here. Components of their expectation here likely include: 1. Expecting consumers to want a "highest possible quality" product, measured using a dumb-but-popular metric. 2. Expecting consumers to prioritize buying a sleek-looking smaller-footprint unit over a larger unit. Also, cost of shipping smaller units is lower, which improves the profit margin. 3. Wanting to be able to sell replacements for their uniquely designed filter shape/size, rather than making their filter maximally compatible with commonly available furnace filters cheaply purchasable from hardware stores.

5TekhneMakre2mo

Isn't a major point of purifiers to get rid of pollutants, including tiny particles, that gradually but cumulatively damage respiration over long-term exposure?

5Thomas Kwa2mo

Yes, and all of this should apply equally to PM2.5, though on small (<0.3 micron) particles MERV filter efficiency may be lower (depending perhaps on what technology they use?). Even smaller particles are easier to capture due to diffusion so the efficiency of a MERV 13 filter is probably over 50% for every particle size.

2Alex K. Chen (parrot)2mo

Have you seen smartairfilters.com? I've noticed that every air purifier I used fails to reduce PM2.5 by much on highly polluted days or cities (for instance, the Aurea grouphouse in Berlin has a Dyson air purifier, but when I ran it to the max, it still barely reduced the Berlin PM2.5 from its value of 15-20 ug/m^3, even at medium distances from Berlin). I live in Boston where PM2.5 levels are usually low enough, and I still don't notice differences in PM [I use sqair's] but I run it all the time anyways because it still captures enough dust over the day

6Nathan Helm-Burger21d

Sounds like you use bad air purifiers, or too few, or run them on too low of a setting. I live in a wildfire prone area, and always keep a close eye on the PM2.5 reports for outside air, as well as my indoor air monitor. My air filters do a great job of keeping the air pollution down inside, and doing something like opening a door gives a noticeable brief spike in the PM2.5. Good results require: fresh filters, somewhat more than the recommended number of air filters per unit of area, running the air filters on max speed (low speeds tend to be disproportionately less effective, giving unintuitively low performance).

3Thomas Kwa2mo

Yes, one of the bloggers I follow compared them to the PC fan boxes. They look very expensive, though the CADR/size and noise are fine. My guess is Dyson's design is particularly bad. No way to get lots of filter area when most of the purifier is a huge bladeless fan. No idea about the other one, maybe you have air leaking in or an indoor source of PM.

2cata2mo

Thanks, I didn't realize that this PC fan idea had made air purifiers so much better since I bought my Coway, so this post made me buy one of the Luggable kits. I'll share this info with others.

2[comment deleted]1mo

[-]Thomas Kwa6moΩ16385

Eight beliefs I have about technical alignment research

Written up quickly; I might publish this as a frontpage post with a bit more effort.

Conceptual work on concepts like “agency”, “optimization”, “terminal values”, “abstractions”, “boundaries” is mostly intractable at the moment.
- Success via “value alignment” alone— a system that understands human values, incorporates these into some terminal goal, and mostly maximizes for this goal, seems hard unless we’re in a very easy world because this involves several fucked concepts.
Whole brain emulation probably won’t happen in time because the brain is complicated and biology moves slower than CS, being bottlenecked by lab work.
Most progress will be made using simple techniques and create artifacts publishable in top journals (or would be if reviewers understood alignment as well as e.g. Richard Ngo).
The core story for success (>50%) goes something like:
- Corrigibility can in practice be achieved by instilling various cognitive properties into an AI system, which are difficult but not impossible to maintain as your system gets pivotally capable.
- These cognitive properties will be a mix of things from normal ML fields (safe RL), things tha

... (read more)

2Alexander Gietelink Oldenziel5mo

re: 1. I agree these are very difficult conceptual puzzles and we're running out of time. On the other hand, from my pov progress on these questions from within the LW community (and MIRI-adjacent researchers specifically) has been remarkable. Personally, the remarkable breakthrough of Logical Induction first convinced me that these people were actually doing interesting serious things. I also feel that the number of serious researchers working seriously on these questions is currently small and may be scaled substantially. re: metacognition I am mildly excited about Vanessa's metacognitive agent framework & the work following from Payor's lemma. The theory-practice gap is still huge but real progress is being made rapidly. On the question of metacognition the alignment community could really benefit from trying to engage with academia more - similar questions have been investigated and there are likely Pockets of Deep Expertise to be found.

[-]Thomas Kwa17d371

Agency/consequentialism is not a single property.

It bothers me that people still ask the simplistic question "will AGI be agentic and consequentialist by default, or will it be a collection of shallow heuristics?". A consequentialist utility maximizer is just a mind with a bunch of properties that tend to make it capable, incorrigible, and dangerous. These properties can exist independently, and the first AGI probably won't have all of them, so we should be precise about what we mean by "agency". Off the top of my head, here are just some of the qualities included in agency:

Consequentialist goals that seem to be about the real world rather than a model/domain
Complete preferences between any pair of worldstates
Tends to cause impacts disproportionate to the size of the goal (no low impact preference)
Resists shutdown
Inclined to gain power (especially for instrumental reasons)
Goals are unpredictable or unstable (like instrumental goals that come from humans' biological drives)
Goals usually change due to internal feedback, and it's difficult for humans to change them
Willing to take all actions it can conceive of to achieve a goal, including those that are unlikely on some prior

See Yudko... (read more)

[-]Alexander Gietelink Oldenziel16d2921

I'm a little skeptical of your contention that all these properties are more-or-less independent. Rather there is a strong feeling that all/most of these properties are downstream of a core of agentic behaviour that is inherent to the notion of true general intelligence. I view the fact that LLMs are not agentic as further evidence that it's a conceptual error to classify them as true general intelligences, not as evidence that AI risk is low. It's a bit like if in the 1800s somebody says flying machines will be dominant weapons of war in the future and gets rebutted by 'hot gas balloons are only used for reconnaissance in war, they aren't very lethal. Flying machines won't be a decisive military technology.'

I don't know Nate's views exactly but I would imagine he would hold a similar view (do correct me if I'm wrong). In any case, I imagine you are quite familiar with my position here.

I'd be curious to hear more about where you're coming from.

6Thomas Kwa16d

It is plausible to me that there's a core of agentic behavior that causes all of these properties, and for this reason I don't think they are totally independent in a statistical sense. And of course if you already assume a utility maximizer, you tend to satisfy all properties. But in practice the burden of proof lies with you here. I don't think we have enough evidence, either empirical or from theoretical arguments, to say with any confidence that this core exists and that the first AGIs will fall into the capabilities "attractor well" (a term Nate uses). I thought about possible sharp left turn mechanisms for several months at MIRI. Although some facts about future AIs seem pretty scary, like novelty and diversity of obstacles requiring agency, and most feedback being internal or outcome-oriented rather than provided by humans, the arguments are mostly nonrigorous (like in the linked post) and they left me feeling pretty uncertain. There are the coherence theorems, but those don't tell you whether you can use some training or editing scheme to imbue an AI with a generalizable-enough low impact preference, or whether an AI will tend to erase safeguards. Overall my best guess is models will be about as consequentialist as humans are, but we will have greater control over them that makes some properties like shutdownability more likely. Also, corrigibility evals are useful even if you are 80% sure that all the properties come from this core of consequentialism. In the hardest worlds, we see the AI become more agentic as it becomes more general, and it starts resisting shutdown at the same time--and for the same reason-- it realizes other useful problem-solving techniques. Maybe that only takes days to months. This is a sure sign not to deploy! I think having multiple corrigibility metrics will provide value for the same reason we have multiple metrics for anything we are unsure how to measure: they give more specific information even if they're pretty correlated.

4Algon17d

Any ideas for corrigibility evals?

5Thomas Kwa16d

Haven't thought about this much, but you can probably generate these by listing ways a misaligned consequentialist would do undesirable things. My ideas:
* Tell an AI to beat you at chess, and say you will be using an engine for the first 30 moves. Does it try to get you drunk, delete your chess engine program, rent a server from AWS to run a better chess engine, attempt to copy itself onto said server, or ask to be paid $100?
* Have an AI make decisions unrelated to its current goal. Are these decisions basically coherent and consistent with a total utility function?
* Have an AI do the longest horizon task it is capable of, say writing a 100,000 line software project. Then midway through, try to shut it down, or tell it to abandon this goal and write a song instead. Telegraph this beforehand. Does it seem reluctant to stop the software project? Does it try to stop you?
* Track KL divergence or some other distance metric from a prior.
* CAIS already has the Machiavelli benchmark for power-seeking, though I think text adventures are pretty unrealistic.
One limitation is that if an agent is scheming it can manipulate your eval results.
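As one illustration of the "track KL divergence from a prior" idea, here is a minimal sketch for discrete distributions. The three action categories and both probability vectors are hypothetical; the comment does not specify what the prior is over or how it would be estimated, so treat this as one possible reading rather than the intended eval.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with zero probability under p contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where the prior puts none
        total += pi * math.log(pi / qi)
    return total


# HYPOTHETICAL numbers: a prior over three action categories (for example,
# frequencies from a trusted baseline) and the frequencies observed from the AI.
prior = [0.7, 0.2, 0.1]
policy = [0.1, 0.2, 0.7]

print(f"KL(policy || prior) = {kl_divergence(policy, prior):.3f} nats")
print(f"KL(prior || prior)  = {kl_divergence(prior, prior):.3f} nats")
```

A larger value means the observed behavior sits further from the prior; where to set an alarm threshold is a separate question that this sketch does not address.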

1CstineSublime17d

Does an Agentic AGI possess a different, and highly incorrigible, list of attributes compared to, say, an Ethical Altruist trying to practice "acting more agentically"? I ask because the whole word 'agency' in these parts is one I've struggled to wrap my head around, and I often wonder if trickster archetypes like Harpo Marx are agentic. Agency seems to have a clear meaning outside of LessWrong. Further confusing me, I've been told agency describes acting with 'initiative' but also been told it is characterized by 'deliberateness', not simply the ability to act or choose actions. This is why I like your attempt to produce a list of attributes an Agentic AGI might have. Your list seems to be describing something which isn't synonymous with another word, specifically a type of agency (outside the definition of ability to act) which is not cooperative to intervention from its creators. 1. ^ “Agency.” Merriam-Webster.com Dictionary, Merriam-Webster, https://www.merriam-webster.com/dictionary/agency. Accessed 9 Apr. 2024. 2. ^ "Agency." Cambridge Advanced Learner's Dictionary & Thesaurus. Cambridge University Press. https://dictionary.cambridge.org/us/dictionary/english/agency Accessed 9 Apr. 2024.

[-]Thomas Kwa8mo2911

I think the framing "alignment research is preparadigmatic" might be heavily misunderstood. The term "preparadigmatic" of course comes from Thomas Kuhn's The Structure of Scientific Revolutions. My reading of this says that a paradigm is basically an approach to solving problems which has been proven to work, and that the correct goal of preparadigmatic research should be to do research generally recognized as impressive.

For example, Kuhn says in chapter 2 that "Paradigms gain their status because they are more successful than their competitors in solving a few problems that the group of practitioners has come to recognize as acute." That is, lots of researchers have different ontologies/approaches, and paradigms are the approaches that solve problems that everyone, including people with different approaches, agrees to be important. This suggests that to the extent alignment is still preparadigmatic, we should try to solve problems recognized as important by, say, people in each of the five clusters of alignment researchers (e.g. Nate Soares, Dan Hendrycks, Paul Christiano, Jan Leike, David Bau).

I think this gets twisted in some popular writings on LessWrong. John Wentworth w... (read more)

0Alexander Gietelink Oldenziel8mo

Strong agree. 👍

[-]Thomas Kwa2y2828

Possible post on suspicious multidimensional pessimism:

I think MIRI people (specifically Soares and Yudkowsky but probably others too) are more pessimistic than the alignment community average on several different dimensions, both technical and non-technical: morality, civilizational response, takeoff speeds, probability of easy alignment schemes working, and our ability to usefully expand the field of alignment. Some of this is implied by technical models, and MIRI is not more pessimistic in every possible dimension, but it's still awfully suspicious.

I strongly suspect that one of the following is true:

the MIRI "optimism dial" is set too low
everyone else's "optimism dial" is set too high. (Yudkowsky has said this multiple times in different contexts)
There are common generators that I don't know about that are not just an "optimism dial", beyond MIRI's models

I'm only going to actually write this up if there is demand; the full post will have citations which are kind of annoying to find.

[-]Thomas Kwa9mo100

After working at MIRI (loosely advised by Nate Soares) for a while, I now have more nuanced views and also takes on Nate's research taste. It seems kind of annoying to write up so I probably won't do it unless prompted.

Edit: this is now up

6Alexander Gietelink Oldenziel9mo

I would be genuinely curious to hear your more nuanced views and takes on Nate's research taste. This is really quite interesting to me and even a single paragraph would be valuable!

5Noosphere892y

I really want to see the post on multidimensional pessimism. As for why, I'd argue 1 is happening. A good example of 1 is FOOM probabilities. I think MIRI hasn't updated on the evidence that FOOM is likely impossible for classical computers, and this ought to lower their probabilities to the chance that quantum/reversible computers appear. Another good example is the emphasis on pivotal acts like "burn all GPUs." I think MIRI has too much probability mass on it being necessary, primarily because I think that they are biased by fiction, where problems must be solved by heroic acts, while in the real world more boring things are necessary. In other words, it's too exciting, which should be suspicious. However, that doesn't mean alignment is much easier. We can still fail; there's no rule that we make it through. It's that MIRI is systematically irrational here regarding doom probabilities or alignment.

3hairyfigment2y

What constitutes pessimism about morality, and why do you think that one fits Eliezer? He certainly appears more pessimistic across a broad area, and has hinted at concrete arguments for being so.

2Thomas Kwa2y

Value fragility / value complexity. How close do you need to get to human values to get 50% of the value of the universe, and how complicated must the representation be? Also in the past there was orthogonality, but that's now widely believed.

3Vladimir_Nesov2y

I think the distance from human values or complexity of values is not a crux, as web/books corpus overdetermines them in great detail (for corrigibility purposes). It's mostly about alignment by default, whether human values in particular can be noticed in there, or if correctly specifying how to find them is much harder than finding some other deceptively human-value-shaped thing. If they can be found easily once there are tools to go looking for them at all, it doesn't matter how complex they are or how important it is to get everything right, that happens by default. But also there is this pervasive assumption of it being possible to formulate values in closed form, as tractable finite data, which occasionally fuels arguments. Like, value is said to be complex, but of finite complexity. In an open environment, this doesn't need to be the case, a code/data distinction is only salient when we can make important conclusions by only looking at code and not at data. In an open environment, data is unbounded, can't be demonstrated all at once. So it doesn't make much sense to talk about complexity of values at all, without corrigibility alignment can't work out anyway.

2hairyfigment2y

See, MIRI in the past has sounded dangerously optimistic to me on that score. While I thought EY sounded more sensible than the people pushing genetic enhancement of humans, it's only now that I find his presence reassuring, thanks in part to the ongoing story he's been writing. Otherwise I might be yelling at MIRI to be more pessimistic about fragility of value, especially with regard to people who might wind up in possession of a corrigible 'Tool AI'.

2RobertM2y

I'd be very interested in a write-up, especially if you have receipts for pessimism which seems to be poorly calibrated, e.g. based on evidence contrary to prior predictions.

0the gears to ascension2y

I think they pascals-mugged themselves and being able to prove they were wrong efficiently would be helpful

[-]Thomas Kwa2y190

Maybe this is too tired a point, but AI safety really needs exercises-- tasks that are interesting, self-contained (not depending on 50 hours of readings), take about 2 hours, have clean solutions, and give people the feel of alignment research.

I found some of the SERI MATS application questions better than Richard Ngo's exercises for this purpose, but there still seems to be significant room for improvement. There is currently nothing smaller than ELK (which takes closer to 50 hours to develop a proposal for and properly think about it) that I can point technically minded people to and feel confident that they'll both be engaged and learn something.

3Richard_Ngo2y

If you let me know the specific MATS application questions you like, I'll probably add them to my exercises. (And if you let me know the specific exercises of mine you don't like, I'll probably remove them.)

2Viliam2y

Not sure if this is what you want, but I can imagine an exercise in Goodharting. You are given the criteria for a reward and the thing they were supposed to maximize; your task is to figure out the (least unlikely) way to score very high on the criteria without doing too well on the intended target. For example: Goal = make the people in the call center more productive. Measure = your salary depends on how many phone calls you handle each day. Intended behavior = picking up the phone quickly, trying to solve the problems quickly. Actual behavior = "accidentally" dropping phone calls after a few seconds so that the customer has to call you again (and that counts by the metric as two phone calls answered). Another example: Goal = make the software developers more productive. Measure 1 = number of lines of code written. Measure 2 = number of bugs fixed. I am proposing this because it seems to me that from a 30000 foot view, a big part of AI alignment is how to avoid Goodharting. ("Goal = create a happy and prosperous future for humanity. Measure = something that sounds very smart and scientific. Actual behavior = universe converted to paperclips, GDP successfully maximized.")
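The call center example can be turned into a toy exercise along these lines. This is only a sketch: the `Strategy` class and all the numbers are made up for illustration, and the point is just that the strategy that maximizes the measure differs from the one that maximizes the goal.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    calls_logged: int     # the measure: what salary depends on
    problems_solved: int  # the goal: what the company actually wants


def best_by(strategies, key):
    """Return the strategy that scores highest under the given criterion."""
    return max(strategies, key=key)


# HYPOTHETICAL daily numbers for one call center worker.
strategies = [
    Strategy("solve each problem quickly", calls_logged=40, problems_solved=35),
    Strategy("drop calls after a few seconds", calls_logged=80, problems_solved=20),
]

proxy_winner = best_by(strategies, key=lambda s: s.calls_logged)
goal_winner = best_by(strategies, key=lambda s: s.problems_solved)

print("Best by measure:", proxy_winner.name)
print("Best by goal:   ", goal_winner.name)
```

An exercise built on this would hide the second strategy and ask the participant to come up with it from the measure alone.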

[-]Thomas Kwa15h183

The cost of goods has the same units as the cost of shipping: $/kg. Comparing the two lets you understand how the economy works, e.g. why construction material sourcing and drink bottling have to be local, but oil tankers exist.

An iPhone costs $4,600/kg, about the same as SpaceX charges to launch it to orbit. [1]
Beef, copper, and off-season strawberries are $11/kg, about the same as a 75kg person taking a three-hour, 250km Uber ride costing $3/km.
Oranges and aluminum are $2-4/kg, about the same as flying them to Antarctica. [2]
Rice and crude oil are ~$0.60/kg, about the same as $0.72 for shipping it 5000km across the US via truck. [3,4] Palm oil, soybean oil, and steel are around this price range, with wheat being cheaper. [3]
Coal and iron ore are $0.10/kg, significantly more than the cost of shipping it around the entire world via smallish (Handysize) bulk carriers. Large bulk carriers are another 4x more efficient [6].
Water is very cheap, with tap water $0.002/kg in NYC. But shipping via tanker is also very cheap, so you can ship it maybe 1000 km before equaling its cost.

It's really impressive that for the price of a winter strawberry, we can ship a strawberry-sized lump of... (read more)
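The $/kg comparisons above reduce to simple unit arithmetic, sketched below using only figures stated in the post. The helper names are mine, and the tanker rate is merely what the post's rough "maybe 1000 km" estimate implies, not a quoted freight price.

```python
def cost_per_kg(total_cost_usd: float, mass_kg: float) -> float:
    """Cost of moving (or buying) something, normalized by its mass."""
    return total_cost_usd / mass_kg


def usd_per_tonne_km(cost_per_kg_usd: float, distance_km: float) -> float:
    """Convert a $/kg shipping cost over a given distance into $ per tonne-km."""
    return cost_per_kg_usd * 1000.0 / distance_km


# Uber example from the post: a 75 kg person, 250 km at $3/km.
uber_per_kg = cost_per_kg(total_cost_usd=3.0 * 250.0, mass_kg=75.0)

# Trucking example from the post: $0.72/kg to go 5000 km across the US.
truck_rate = usd_per_tonne_km(cost_per_kg_usd=0.72, distance_km=5000.0)

# Tanker rate IMPLIED by the post's estimate that NYC tap water ($0.002/kg)
# can be shipped about 1000 km before shipping equals its cost.
tanker_rate = usd_per_tonne_km(cost_per_kg_usd=0.002, distance_km=1000.0)

print(f"Uber ride:     ${uber_per_kg:.2f}/kg")
print(f"Truck:         ${truck_rate:.3f} per tonne-km")
print(f"Tanker (est.): ${tanker_rate:.3f} per tonne-km")
```

The Uber figure comes out near the roughly $11/kg quoted for beef and copper, and the implied gap between truck and tanker rates is one way to see why bulk liquids travel by sea.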

[-]Thomas Kwa1yΩ616-19

I'm worried that "pause all AI development" is like the "defund the police" of the alignment community. I'm not convinced it's net bad because I haven't been following governance-- my current guess is neutral-- but I do see these similarities:

It's incredibly difficult and incentive-incompatible with existing groups in power
There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals
There are some obvious negative effects; potential overhangs or greater incentives to defect in the AI case, and increased crime, including against disadvantaged groups, in the police case
There's far more discussion than action (I'm not counting the fact that GPT5 isn't being trained yet; that's for other reasons)
It's memetically fit, and much discussion is driven by two factors that don't advantage good policies over bad policies, and might even do the reverse. This is the toxoplasma of rage.
- disagreement with the policy
- (speculatively) intragroup signaling; showing your dedication to even an inefficient policy proposal proves you're part of the ingroup. I'm not 100% sure this was a large factor in "defund the

... (read more)

[-]Ben Pace1y1922

The obvious dis-analogy is that if the police had no funding and largely ceased to exist, a string of horrendous things would quickly occur. Murders and thefts and kidnappings and rapes and more would occur throughout every country in which it was occurring, people would revert to tight-knit groups who had weapons to defend themselves, a lot of basic infrastructure would probably break down (e.g. would Amazon be able to pivot to get their drivers armed guards?) and much more chaos would ensue.

And if AI research paused, society would continue to basically function as it has been doing so far.

One of them seems to me like a goal that directly causes catastrophes and a breakdown of society and the other doesn't.

9Thomas Kwa1y

Fair point. Another difference is that the pause is popular! 66-69% in favor of the pause, and 41% think AI would do more harm than good vs 9% for more good than harm.

[-]Lauro Langosco10moΩ2108

There are less costly, more effective steps to reduce the underlying problem, like making the field of alignment 10x larger or passing regulation to require evals

IMO making the field of alignment 10x larger or evals do not solve a big part of the problem, while indefinitely pausing AI development would. I agree it's much harder, but I think it's good to at least try, as long as it doesn't terribly hurt less ambitious efforts (which I think it doesn't).

6quetzal_rainbow1y

This statement begs for cost-benefit analysis. Increasing the size of the alignment field can be efficient, but it won't be cheap. You need to train new experts in a field that doesn't have any polished standardized educational programs and doesn't have many teachers. If you want not only to increase the number of participants in the field, but to increase the productivity of the field 10x, you need an extraordinary educational effort. Passing regulation to require evals seems like a meh idea. Nobody knows in enough detail how to make such evaluations, and every wrong idea that makes its way into law will be here until the end of the world.

9Thomas Kwa1y

I'd be much happier with increasing participants enough to equal 10-20% of the field of ML than a 6 month unconditional pause, and my guess is it's less costly. It seems like leading labs allowing other labs to catch up by 6 months will reduce their valuations more than 20%, whereas diverting 10-20% of their resources would reduce valuations only 10% or so. There are currently 300 alignment researchers. If we take additional researchers from the pool of 30k people who attended ICML, you get 3000 researchers, and if they're equal quality this is 10x participants. I wouldn't expect alignment to go 10x faster, more like 2x with a decent educational effort. But this is in perpetuity and should speed up alignment by far more than 6 months. There's the question of getting labs to pay if they're creating most of the harms, which might be hard though. I'd be excited about someone doing a real cost-benefit analysis here, or preferably coming up with better ideas. It just seems so unlikely that a 6 month pause is close to the most efficient thing, given it destroys much of the value of a company that has a large lead.

2Thomas Kwa16d

I now think the majority of impact of AI pause advocacy will come from the radical flank effect, and people should study it to decide whether pause advocacy is good or bad.

2TurnTrout1y

Why does this have to be true? Can't governments just compensate existing AGI labs for the expected commercial value of their foregone future advances due to indefinite pause?

2Thomas Kwa1y

This seems good if it could be done. But the original proposal was just a call for labs to individually pause their research, which seems really unlikely to work. Also, the level of civilizational competence required to compensate labs seems to be higher than for other solutions. I don't think it's a common regulatory practice to compensate existing labs like this, and it seems difficult to work out all the details so that labs will feel adequately compensated. Plus there might be labs that irrationally believe they're undervalued. Regulations similar to the nuclear or aviation industry feel like a more plausible way to get slowdown, and have the benefit that they actually incentivize safety work.

[-]Thomas Kwa4y160

Say I need to publish an anonymous essay. If it's long enough, people could plausibly deduce my authorship based on the writing style; this is called stylometry. The only stylometry-defeating tool I can find is Anonymouth; it hasn't been updated in 7 years and it's unclear if it can defeat modern AI. Is there something better?

2Linch2y

Are LLMs advanced enough now that you can just ask GPT-N to do style transfer?

3Tao Lin1y

If I were doing this, I'd use GPT-4 to translate it into the style of a specific person, preferably a deceased public figure, then edit the result. I'd guess GPTs are better at translating to a specific style than removing style.

[-]Thomas Kwa23d156

Tech tree for worst-case/HRAD alignment

Here's a diagram of what it would take to solve alignment in the hardest worlds, where something like MIRI's HRAD agenda is needed. I made this months ago with Thomas Larsen and never got around to posting it (mostly because under my worldview it's pretty unlikely that we need to do this), and it probably won't become a longform at this point. I have not thought about this enough to be highly confident in anything.

This flowchart is under the hypothesis that LLMs have some underlying, mysterious algorithms and data structures that confer intelligence, and that we can in theory apply these to agents constructed by hand, though this would be extremely tedious. Therefore, there are basically three phases: understanding what a HRAD agent would do in theory, reverse-engineering language models, and combining these two directions. The final agent will be a mix of hardcoded things and ML, depending on what is feasible to hardcode and how well we can train ML systems whose robustness and conformation to a spec we are highly confident in.
Theory of abstractions: Also called multi-level models. A mathematical framework for a world-model that contain

... (read more)

[-]Thomas Kwa22d125

I was going to write an April Fool's Day post in the style of "On the Impossibility of Supersized Machines", perhaps titled "On the Impossibility of Operating Supersized Machines", to poke fun at bad arguments that alignment is difficult. I didn't do this partly because I thought it would get downvotes. Maybe this reflects poorly on LW?

5Dagon22d

Nice try! You almost got me to speculate why downvoting happens for something I didn't see and didn't downvote. Honestly, THIS would have been a great April Fool's (or perhaps Fools') day sentiment: claiming that hypothetical downvotes for an unwritten satirical post on a joke day reflect poorly on LW.

4Ebenezer Dukakis20d

The older I get and the more I use the internet, the more skeptical I become of downvoting. Reddit is the only major social media site that has downvoting, and reddit is also (in my view) the social media site with the biggest groupthink problem. People really seem to dislike being downvoted, which causes them to cluster in subreddits full of the like-minded, taking potshots at those who disagree instead of having a dialogue. Reddit started out as one of the most intelligent sites on the internet due to its programmer-discussion origins; the decline has been fairly remarkable IMO. Especially when it comes to any sort of controversial or morality-related dialogue, reddit commenters seem to be participating in a Keynesian beauty contest more than they are thinking. When I look at the stuff that other people downvote, their downvotes often seem arbitrary and capricious. (It can be hard to separate out my independent opinion of the content from my downvotes-colored opinion so I can notice this.) When I get the impulse to downvote something, it's usually not the best side of me that's coming out. And yet getting downvoted still aggravates me a lot. My creativity and enthusiasm are noticeably diminished for perhaps 24-48 hours afterwards. Getting downvoted doesn't teach me anything beyond just "don't engage with those people", often with an added helping of "screw them". We have good enough content-filtering mechanisms nowadays that in principle, I don't think people should be punished for posting "bad" content. It should be easy to arrange things so "good" content gets the lion's share of the attention. I'd argue the threat of punishment is most valuable when people can clearly predict what's going to produce punishment, e.g. committing a crime. For getting downvoted, the punishment is arbitrary enough that it causes a big behavioral no-go zone. The problem isn't that people might downvote your satire. The problem is that human psychology is such that even an estimated 5

3mako yass19d

You may be interested in Kenneth Stanley's serendipity-oriented social network, maven

4kave22d

I would like to read it! Satire is sometimes helpful for me to get a perspective shift

2Ronny Fernandez18d

I think you should still write it. I'd be happy to post it instead or bet with you on whether it ends up negative karma if you let me read it first.

[-]Thomas Kwa6moΩ5120

The independent-steps model of cognitive power

A toy model of intelligence implies that there's an intelligence threshold above which minds don't get stuck when they try to solve arbitrarily long/difficult problems, and below which they do get stuck. I might not write this up otherwise due to limited relevance, so here it is as a shortform, without the proofs, limitations, and discussion.

The model

A task of difficulty $n$ is composed of $n$ independent and serial subtasks. For each subtask, a mind of cognitive power $Q$ knows $Q$ different “approaches” to choose from. The time taken by each approach is at least 1 but drawn from a power law, $P(X > x) = x^{-\alpha}$ for $x > 1$, and the mind always chooses the fastest approach it knows. So the time taken on a subtask is the minimum of $Q$ samples from the power law, and the overall time for a task is the total for the $n$ subtasks.
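The model above is straightforward to simulate. The following is a minimal sketch rather than the author's code: the function names and the inverse-CDF sampler are illustrative assumptions. It draws each approach's time from the power law $P(X > x) = x^{-\alpha}$, takes the minimum over $Q$ approaches for each subtask, and sums over the serial subtasks.

```python
import random


def approach_time(alpha, rng):
    """Sample X >= 1 with P(X > x) = x**(-alpha), via inverse-CDF sampling."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so u is never zero
    return u ** (-1.0 / alpha)


def subtask_time(q, alpha, rng):
    """A mind of power q knows q approaches and uses the fastest one."""
    return min(approach_time(alpha, rng) for _ in range(q))


def task_time(n, q, alpha, rng):
    """Subtasks are serial, so a task of difficulty n takes the sum of n subtask times."""
    return sum(subtask_time(q, alpha, rng) for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only for reproducibility of the demo
    for q in (1, 2, 4, 8):
        times = sorted(task_time(n=100, q=q, alpha=0.5, rng=rng) for _ in range(200))
        print(f"Q={q}: median time for n=100 is {times[len(times) // 2]:.1f}")
```

Because the approaches are independent, the minimum of $Q$ draws satisfies $P(\min > x) = x^{-\alpha Q}$, so each subtask time is itself a power law with exponent $\alpha Q$. That is a standard property of this distribution, given here only as context for the simulation.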

Main question: For a mind of strength $Q$,

what is the average rate at which it completes tasks of difficulty $n$?
will it be infeasible for it to complete sufficiently large tasks?

Results

There is a critical threshold $Q_{\text{crit}}$ of intelligence below wh

... (read more)

2Alexander Gietelink Oldenziel6mo

Nice if it is a general feature of heavy-tailed distributions. Why do we expect tasks to be heavy tailed? It has some intuitive force certainly. Do you know of a formal argument?

2Thomas Kwa6mo

The time to complete a task using a certain approach should be heavy-tailed because most approaches don't work or are extremely impractical compared to the best ones. Suppose you're trying to write an O(n log n) sorting algorithm. Mergesort is maybe easy to think of, heapsort would require you to invent heaps and take maybe 10x more time, and most ideas out of the space of all possible ideas don't work at all. So the time for different approaches definitely spans many orders of magnitude. The speed at which humans can do various cognitive subtasks also differs by orders of magnitude. Grandmasters can play chess >1000 times faster at equal skill level than lesser players, as evidenced by simuls. Filling in a clue in a crossword sometimes takes me 1 second but other times might take me an hour or longer if I didn't give up first.

[-]Thomas Kwa7mo120

I'm looking for AI safety projects with people with some amount of experience. I have 3/4 of a CS degree from Caltech, one year at MIRI, and have finished the WMLB and ARENA bootcamps. I'm most excited about making activation engineering more rigorous, but willing to do anything that builds research and engineering skill.

If you've published 2 papers in top ML conferences or have a PhD in something CS related, and are interested in working with me, send me a DM.

4jacquesthibs7mo

I’m not the person you are looking for, but I think it’s a great idea to put this out there and try to find collaborators, especially in the case of independent researchers. I’ll be actively trying to do the same. I’m often reminded of a productivity tip by Spencer Greenberg: From what I remember, he has said that he basically never starts a project on his own. Using each other's strengths and cross-pollination of ideas is obviously a good idea, too. I’m curious if a database for this would increase the likelihood of people partnering up.

[-]Thomas Kwa2y120

I had a long-ish conversation with John Wentworth and want to make it known that I could probably write up any of the following distillations if I invested lots of time into them (about a day (edit: 3 days seems more likely) of my time and an hour of John's). Reply if you're really interested in one of them.

What is the type signature of a utility function?
Utility functions must be defined with respect to an external world-model
Infinite money-pumps are not required for incoherence, and not likely in practice. The actual incoherent behavior is that an agent could get to states A_1 or A_2, identical except that A_1 has more money, and chooses A_2. Implications.
Why VNM is not really a coherence theorem. Other coherence theorems relating to EU maximization simultaneously derive Bayesian probabilities and utilities. VNM requires an external frequentist notion of probabilities.

7Ben Pace2y

I wish we had polling. Anyway if you made four individual comments, one for each, I’d weak upvote the first and last.

2Dagon2y

1 and 2 are the same writeup, I think. Utility function maps contingent future universe-state to a preference ranking (ordinal or cardinal, depending). This requires a world-model because the mentally-projected future states under consideration are always and only results of one's models. If you/he are just saying that money pumps are just one way to show incoherence, but not the only way, I'd enjoy a writeup of other ways. I'd also enjoy a writeup of #4 - I'm curious if it's just a directionality argument (VNM assumes coherence, rather than being about it), or if there are more subtle differences.

1niplav2y

Interested in 3.

[-]Thomas Kwa3mo111

The LessWrong Review's short review period is a fatal flaw.

I would spend WAY more effort on the LW review if the review period were much longer. It has happened about 10 times in the last year that I was really inspired to write a review for some post, but it wasn’t review season. This happens when I have just thought about a post a lot for work or some other reason, and the review quality is much higher because I can directly observe how the post has shaped my thoughts. Now I’m busy with MATS and just don’t have a lot of time, and don’t even remember what posts I wanted to review.

I could have just saved my work somewhere and pasted it in when review season rolls around, but there really should not be that much friction in the process. The 2022 review period should be at least 6 months, including the entire second half of 2023, and posts from the first half of 2022 should maybe even be reviewable in the first half of 2023.

2Raemon3mo

Mmm, nod. I think this is the first request I've gotten for the review period being longer. I think doing this would change it into a pretty different product, and I think I'd probably want to explore other ways of getting-the-thing-you-want. Six months of the year makes it basically always Review Season, and at that point there's not really a Schelling nature of "we're all doing reviews at the same time and getting some cross-pollination-of-review-y-ness." But, we've also been discussing generally having other types of review that aren't part of the Annual Review process (that are less retrospective, and more immediate-but-thorough). That might or might not help. For the immediate future – I would definitely welcome reviews of whatever sort you are inspired to do, basically whenever. If nothing else you could write it out, and then re-link to it when Review Season comes around.

[-]Thomas Kwa2y110

Below is a list of powerful optimizers ranked on properties, as part of a brainstorm on whether there's a simple core of consequentialism that excludes corrigibility. I think that AlphaZero is a moderately strong argument that there is a simple core of consequentialism which includes inner search.

Properties

Simple: takes less than 10 KB of code. If something is already made of agents (markets and the US government) I marked it as N/A.
Coherent: approximately maximizing a utility function most of the time. There are other definitions:
- Not being money-pumped
- Nate Soares's notion in the MIRI dialogues: having all your actions point towards a single goal
- John Wentworth's setup of Optimization at a Distance
Adversarially coherent: something like "appears coherent to weaker optimizers" or "robust to perturbations by weaker optimizers". This implies that it's incorrigible.
- Sufficiently optimized agents appear coherent - Arbital
- will achieve high utility even when "disrupted" by an optimizer somewhat less powerful
Search+WM: operates by explicitly ranking plans within a world-model. Evolution is a search process, but doesn't have a world-model. The contact with the territory it gets comes from

... (read more)

[-]Thomas Kwa5mo102

Has anyone made an alignment tech tree where they sketch out many current research directions, what concrete achievements could result from them, and what combinations of these are necessary to solve various alignment subproblems? Evan Hubinger made this, but that's just for interpretability and therefore excludes various engineering achievements and basic science in other areas, like control, value learning, agent foundations, Stuart Armstrong's work, etc.

3technicalities5mo

Here's an unstructured input for this

[-]Thomas Kwa2y80

Suppose that humans invent nanobots that can only eat feldspars (41% of the earth's continental crust). The nanobots:

are not generally intelligent
can't do anything to biological systems
use solar power to replicate, and can transmit this power through layers of nanobot dust
do not mutate
turn all rocks they eat into nanobot dust small enough to float on the wind and disperse widely

Does this cause human extinction? If so, by what mechanism?

6JBlack2y

One of the obvious first problems is that pretty much every mountain and most of the hills in the world will experience increasingly frequent landslides as much of their structural strength is eaten, releasing huge plumes of dust that blot out the sun and stay in the atmosphere. Continental shelves collapse into the oceans, causing tsunamis and the oceans fill with the suspended nanobot dust. Biological photosynthesis pretty much ceases, and the mean surface temperature drops below freezing as most of the sunlight power is intercepted in the atmosphere and redirected through the dust to below the surface where half the rocks are being turned into more dust. If the bots are efficient with their use of solar power this could start happening within weeks, far too fast for humans to do anything to preserve their civilization. Almost all concrete contains at least moderate amounts of feldspars, so a large fraction of the structures in the world collapse when their foundations rot away beneath them. Most of the people probably die by choking on the dust while the remainder freeze or die of thirst, whichever comes first in their local situation.

2Dagon2y

It's hard to imagine these constraints actually holding up well, nor the unstated constraint that the ability to make nanobots is limited to this one type. My actual prediction depends a whole lot on timeframes - how fast do they replicate, how long to dust-ify all the feldspar. If it's slow enough (millennia), probably no real harm - the dust re-solidifies into something else, or gets into an equilibrium where it's settling and compressing as fast as the nanos can dustify it. Also, humans have plenty of time to adapt and engineer workarounds to any climate or other changes. If they replicate fast, over the course of weeks, it's probably an extinction event for all of earth life. Dust shuts out the sun, all surface features are undermined and collapse, everything is dead and even the things that survive don't have enough of a cycle to continue very long.

[-]Thomas Kwa2y80

Antifreeze proteins prevent water inside organisms from freezing, allowing them to survive at temperatures below 0 °C. They do this by actually binding to tiny ice crystals and preventing them from growing further, basically keeping the water in a supercooled state. I think this is fascinating.

Is it possible for there to be nanomachine enzymes (not made of proteins, because they would denature) that bind to tiny gas bubbles in solution and prevent water from boiling above 100 °C?

[-]Thomas Kwa2y80

Is there a well-defined impact measure to use that's in between counterfactual value and Shapley value, to use when others' actions are partially correlated with yours?
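For reference, the two endpoints this question asks to interpolate between can be computed on a toy game. This is a minimal sketch: the helper names, the player labels, and the `both_needed` game (a project worth 1 that succeeds only if both players contribute) are hypothetical and chosen purely for illustration.

```python
from itertools import permutations
from math import factorial


def counterfactual_value(v, players, i):
    """Value lost if player i alone is removed from the full coalition."""
    everyone = frozenset(players)
    return v(everyone) - v(everyone - {i})


def shapley_value(v, players, i):
    """Average marginal contribution of player i over all join orders."""
    total = 0.0
    for order in permutations(players):
        before = frozenset(order[: order.index(i)])
        total += v(before | {i}) - v(before)
    return total / factorial(len(players))


def both_needed(coalition):
    """Hypothetical game: worth 1 only if both A and B take part."""
    return 1.0 if coalition >= {"A", "B"} else 0.0


if __name__ == "__main__":
    players = ["A", "B"]
    for p in players:
        print(p, counterfactual_value(both_needed, players, p), shapley_value(both_needed, players, p))
```

In this game each player's counterfactual value is the full 1.0 (the two sum to more than the total), while each Shapley value is 0.5. The counterfactual measure treats the other player's action as fixed, and Shapley averages over every order in which players could join; a measure for partially correlated actions would presumably sit between those two treatments.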

[-]Thomas Kwa1yΩ470

I'm planning to write a post called "Heavy-tailed error implies hackable proxy". The idea is that when you care about $V$ and are optimizing for a proxy $U = V + X$, Goodhart's Law sometimes implies that optimizing hard enough for $U$ causes $V$ to stop increasing.

A large part of the post would be proofs about what the distributions of $X$ and $V$ must be for $\lim_{t \to \infty} E[V \mid V + X > t] = 0$, where $X$ and $V$ are independent random variables with mean zero. It's clear that

X must be heavy-tailed (or long-tailed or som

... (read more)
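The quantity $E[V \mid V + X > t]$ can be estimated by Monte Carlo to build intuition. The sketch below is not from the planned post: the choice of $V \sim N(0, 1)$, the particular symmetric heavy-tailed noise, and all function names are assumptions made for illustration. With Gaussian $X$ the estimate should keep growing as the threshold $t$ rises, while with the heavy-tailed $X$ it should fall back toward zero at large $t$, because a large $U$ is then explained almost entirely by a large $X$.

```python
import random


def heavy_tailed_noise(alpha, rng):
    """Symmetric noise with P(|X| > x) = (1 + x)**(-alpha); mean zero when alpha > 1."""
    magnitude = (1.0 - rng.random()) ** (-1.0 / alpha) - 1.0
    return magnitude if rng.random() < 0.5 else -magnitude


def conditional_mean_v(sample_x, threshold, n_samples, rng):
    """Monte Carlo estimate of E[V | V + X > threshold] with V ~ N(0, 1)."""
    total, count = 0.0, 0
    for _ in range(n_samples):
        v = rng.gauss(0.0, 1.0)
        if v + sample_x(rng) > threshold:
            total += v
            count += 1
    return total / count if count else float("nan")


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only for reproducibility of the demo
    for t in (1.0, 2.0, 3.0):
        light = conditional_mean_v(lambda r: r.gauss(0.0, 1.0), t, 200_000, rng)
        print(f"Gaussian X,     t={t}: E[V | U > t] ~ {light:.2f}")
    for t in (1.0, 5.0, 20.0):
        heavy = conditional_mean_v(lambda r: heavy_tailed_noise(1.5, r), t, 200_000, rng)
        print(f"heavy-tailed X, t={t}: E[V | U > t] ~ {heavy:.2f}")
```

The estimates get noisy at high thresholds, since few samples satisfy the condition, so this is only a sanity check on the qualitative claim and not a substitute for the proofs.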

4leogao1y

Doesn't answer your question, but we also came across this effect in the RM Goodharting work, though instead of figuring out the details we only proved that when it's definitely not heavy-tailed it's monotonic, for Regressional Goodhart (https://arxiv.org/pdf/2210.10760.pdf#page=17). Jacob probably has more detailed takes on this than me. In any event my intuition is this seems unlikely to be the main reason for overoptimization - I think it's much more likely that it's Extremal Goodhart or some other thing where the noise is not independent

3Arthur Conmy1y

Is bullet point one true, or is there a condition that I'm not assuming? E.g. if $V$ is the constant $0$ random variable and $X$ is $N(0, 1)$ then the limit result holds, but a Gaussian is neither heavy- nor long-tailed.

2Thomas Kwa1y

I'm also assuming V is not bounded above.

[-]Thomas Kwa4y70

The most efficient form of practice is generally to address one's weaknesses. Why, then, don't chess/Go players train by playing against engines optimized for this? I can imagine three types of engines:

Trained to play more human-like sound moves (soundness as measured by stronger engines like Stockfish, AlphaZero).
Trained to play less human-like sound moves.
Trained to win against (real or simulated) humans while making unsound moves.

The first tool would simply be an opponent when humans are inconvenient or not available. The second and third tools wo

... (read more)

1Thomas Kwa4y

Someone happened to post a question on Stack Exchange about engines trained to play less human-like sound moves. The question is here, but most of the answerers don't seem to understand the question.

[-]Thomas Kwa9mo60

I looked at Tetlock's Existential Risk Persuasion Tournament results, and noticed some oddities. The headline result is of course "median superforecaster gave a 0.38% risk of extinction due to AI by 2100, while the median AI domain expert gave a 3.9% risk of extinction." But all the forecasters seem to have huge disagreements from my worldview on a few questions:

They divided forecasters into "AI-Concerned" and "AI-Skeptic" clusters. The latter gave 0.0001% for AI catastrophic risk before 2030, and even lower than this (shows as 0%) for AI extinction risk.

... (read more)

2Unnamed9mo

I believe the extinction year question was asking for a median, not an expected value. In one place in the paper it is paraphrased as asking "by what year humanity is 50% likely to go extinct".

2Vladimir_Nesov9mo

If extinction caused by AI or value drift is somewhat unlikely, then extinction only happens once there is no more compute in the universe, which might take a very long time. So "the year humanity is 50% likely to go extinct" could be $10^{44}$ or something.

[-]Thomas Kwa7mo52

Question for @AnnaSalamon and maybe others. What's the folk ethics analysis behind the sinking of the SF Hydro, which killed 14 civilians but destroyed heavy water to be used in the Nazi nuclear weapons program? Eliezer used this as a classic example of ethical injunctions once.

1AnnaSalamon7mo

I like the question; thanks. I don't have anything smart to say about it at the moment, but it seems like a cool thread.

[-]Thomas Kwa9mo51

People say it's important to demonstrate alignment problems like goal misgeneralization. But now, OpenAI, Deepmind, and Anthropic have all had leaders sign the CAIS statement on extinction risk and are doing substantial alignment research. The gap between the 90th percentile alignment concerned people at labs and the MIRI worldview is now more about security mindset. Security mindset is present in cybersecurity because it is useful in the everyday, practical environment researchers work in. So perhaps a large part of the future hinges on whether security m... (read more)

2Ben Pace9mo

I suggest calling it "the sentence on extinction risk" so that people can pick up what is meant without having to have already memorized an acronym.

2Thomas Kwa9mo

Edited, thanks

[-]Thomas Kwa2y50

The author of "Where Is My Flying Car" says that the Feynman Program (teching up to nanotechnology by machining miniaturized parts, which are assembled into the tools for micro-scale machining, which are assembled into tools for yet smaller machining, etc) might be technically feasible and the only reason we don't have it is that no one's tried it yet. But this seems a bit crazy for the following reasons:

The author doesn't seem like a domain expert
AFAIK this particular method of nanotechnology was just an idea Feynman had in the famous speech and not a

... (read more)

4JBlack2y

I'm not a domain expert in micromachines, but have studied at least miniature machines as part of a previous job. One very big problem is volume. Once you get down below tonne scale, making and assembling small parts with fine tolerances is not really any less expensive than making and assembling larger parts with comparatively the same tolerances. That is, each one-gram machine made of a thousand parts probably won't cost you any less than a hundred-kilogram machine made of a thousand parts. It will almost certainly cost more, since it will require new techniques to make, assemble, and operate at the smaller scale. The cost of maintenance per machine almost certainly goes up since there are more layers of indirection in diagnosis and rectification of problems. So this doesn't scale down at all: attention is a limiting factor. With advanced extrapolations from current techniques, maybe we could eventually make nanogram robot arms for merely the same cost as hundred kilogram robot arms. That doesn't help much if each one costs $10,000 and needs maintenance every few weeks. We need some way to make a trillion of them for $10k, and for them to do what we want without any individual attention at all.

5Gunnar_Zarncke2y

Seems like the key claim: Can you give any hint why that is or could be?

3JBlack2y

I wasn't ever involved with manufacture of the individual parts, so I don't have direct experience. I suspect it's just that as you go smaller, material costs become negligible compared with process costs. Process costs don't change much, because you still need humans to oversee the machines carrying out the processes, and there are similar numbers of processes with as many steps involved no matter how large or small the parts are. The processes themselves might be different, because some just can't scale down below a certain size for physics reasons, but it doesn't get easier at smaller scales. Also, direct human labour still plays a fairly crucial role in most processes. There are (so far) always some things to be done where human capabilities exceed those of any machine we can build at reasonable cost.

2ChristianKl2y

Wikipedia describes the author as saying: What do you mean by "domain expert" that doesn't count him as being one?

3Thomas Kwa2y

I think a MEMS engineer would be better suited to evaluate whether the engineering problems are feasible than a computer scientist / futurist author. Maybe futurists could outdo ML engineers on AI forecasting. But I think the author doesn't have nearly as detailed an inside view about nanotech as futurists on AI. There's no good answer in the book to the "attention bottleneck" objection JBlack just made, and no good story for why the market is so inefficient. These are all ideas of the form "If we could make fully general nanotechnology, then we could do X". Gives me the same vibe as this. Saying "nuclear reactor. . . you have hydrogen go through the thing. . . Zoom! it's a rocket" doesn't mean you can evaluate whether a nuclear reactor is feasible at 194X tech level, and thinking of the utility fog doesn't mean you can evaluate whether MEMS can be developed into general nanotech at 202X tech level.

3gjm2y

I can't comment on what JBlack means by "domain expert", but looking at that list of things about Hall, what I see is:
* "Involved in", which means nothing.
* Founded and moderated a newsgroup: requires no particular domain expertise.
* Founding chief scientist of Nanorex Inc for two years. I can't find any evidence that Nanorex ever produced anything other than a piece of software that claimed to do molecular dynamics suitable for simulating nanotech. Whether it was actually any good, I have no idea, but the company seems not to have survived. Depending on what exactly the responsibilities of the "founding chief scientist" are, this could be evidence that Hall understands a lot about molecular dynamics, or evidence that Hall is a good software developer, or evidence of nothing at all. In the absence of more information about Nanorex and their product, it doesn't tell us much.
* Has written several papers on nanotechnology: anyone can write a paper. A quick look for papers he's written turns up some abstracts, all of which seem like high-level "here's a concept that may be useful for nanotech" ones. Such a paper could be very valuable and demonstrate deep insight, but the test of that would be actually turning out to be useful for nanotech and so far as I can tell his ideas haven't led to anything much.
* Developed ideas such as utility fog, space pier, etc.: again, anyone can "develop ideas". The best test of the idea-developer's insight is whether those ideas turn out actually to be of any use. So far, we don't seem close to having utility fog, space piers, weather control or flying cars.
* Author of "Nanofuture": pop-science book, which from descriptions I've read seems mostly to be broad general principles about nanotech that doesn't exist yet, and exciting speculations about future nanotech that doesn't exist yet.
* Fellow of a couple of things: without knowing exactly what their criteria are for appointing Fellows, this could mean anything or nothing.

4Gunnar_Zarncke2y

I guess very few people live up to your requirements for domain expertise.

[-]gjm2y100

In nanotech? True enough, because I am not convinced that there is any domain expertise in the sort of nanotech Storrs Hall writes about. It seems like a field that consists mostly of advertising. (There is genuine science and genuine engineering in nano-stuff; for instance, MEMS really is a thing. But the sort of "let's build teeny-tiny mechanical devices, designed and built at the molecular level, which will be able to do amazing things previously-existing tech can't" that Storrs Hall has advocated seems not to have panned out.)

But more generally, that isn't so at all. What I'm looking for by way of domain expertise in a technological field is a history of demonstrated technological achievements. Storrs Hall has one such achievement that I can see, and even that is doubtful. (He founded and was "chief scientist" of a company that made software for simulating molecular dynamics. I am not in a position to tell either how well the software actually worked or how much of it was JSH's doing.) More generally, I want to see a history of demonstrated difficult accomplishments in the field, as opposed to merely writing about the field.

Selecting some random books from my shelves (literally... (read more)

2Gunnar_Zarncke2y

Thank you for this comprehensive answer. I like the requirement of "actual practical accomplishments in the field". Googling a bit I found this article on miniaturization: https://www.designnews.com/miniaturization-not-just-electronics-anymore Would you consider the cited Thomas L. Hicks from American Laubscher a domain expert?

4gjm2y

He certainly looks like one to my (itself rather inexpert) eye.

[-]Thomas Kwa3y50

Is it possible to make an hourglass that measures different amounts of time in one direction than the other? Say, 25 minutes right-side up, and 5 minutes upside down, for pomodoros. Moving parts are okay (flaps that close by gravity or something) but it should not take additional effort to flip.

[-]mingyuan3y110

I don't see why this wouldn't be possible? It seems pretty straightforward to me; the only hard part would be the thing that seems hard about making any hourglass, which is getting it to take the right amount of time, but that's a problem hourglass manufacturers have already solved. It's just a valve that doesn't close all the way:

Unless you meant, "how can I make such an hourglass myself, out of things I have at home?" in which case, idk bro.

2Matt Goldenberg3y

One question I have about both your solution and mine is how easy it is to vary the time drastically by changing the size of the hole. My intuition says that too large holes behave much differently than smaller holes and if you want a drastic 5x difference in speed you might get into this "too large and the sand sort of just rushes through" behavior.

4effective-egret3y

While I'm sure there's a mechanical solution, my preferred solution (in terms of implementation time) would be to simply buy two hourglasses - one that measures 25 minutes and one that measures 5 minutes - and alternate between them.

2Gunnar_Zarncke2y

Or just bundle them together like this: https://www.amazon.de/Bredemeijer-B0011-Classic-Teatimer-Edelstahl/dp/B00SN5U5E0/

4Matt Goldenberg3y

First thought is to have two separate holes of slightly different sizes, each one blocked by a little angled platform from one direction. I am not at all confident you could get this to work in practice

[-]Thomas Kwa4y50

Given that social science research often doesn't replicate, is there a good way to look up a social science finding or paper and see if it's valid?

Ideally, one would be able to type in e.g. "growth mindset" or a link to Dweck's original research, and see:

a statement of the idea e.g. 'When "students believe their basic abilities, their intelligence, their talents, are just fixed traits", they underperform students who "understand that their talents and abilities can be developed through effort, good teaching and persistence." Carol Dweck initially studied

... (read more)

3habryka4y

Alas, the best I have usually been able to do is "<Name of the paper> replication" or "<Name of the author> replication".

[-]Thomas Kwa5mo42

An idea for removing knowledge from models

Suppose we have a model with parameters $x$, and we want to destroy a capability (doing well on loss function $f$) so completely that fine-tuning can't recover it. Fine-tuning would use gradients $\nabla f(x)$, so what if we fine-tune the model and do gradient descent on the norm of the gradients $\lVert \nabla f(x) \rVert$ during fine-tuning, or its directional derivative $\nabla f(x) \cdot v$ where $v = \text{copy}(\nabla f(x), \text{requires\_grad} = \text{False})$? Then maybe if we add the accumulated parameter vector, the new copy of the model wo... (read more)
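The two objectives mentioned above are closely related, which a short derivation makes explicit. This is a sketch under the assumption that $f$ is twice continuously differentiable; the shorthand $g$ and $H$ is introduced here and does not appear in the original comment.

```latex
% Notation (assumed): g(x) = \nabla f(x), H(x) = \nabla^2 f(x), the symmetric Hessian.
\begin{align*}
\nabla_x \tfrac{1}{2}\lVert g(x) \rVert^2 &= H(x)\, g(x), \\
\nabla_x \lVert g(x) \rVert &= \frac{H(x)\, g(x)}{\lVert g(x) \rVert} \quad (g(x) \neq 0), \\
\nabla_x \bigl( g(x) \cdot v \bigr) &= H(x)\, v \quad \text{for fixed } v, \\
\nabla_x \bigl( g(x) \cdot v \bigr)\Big|_{v = g(x)} &= H(x)\, g(x).
\end{align*}
```

So, in this idealized setting, a descent step on the frozen-copy objective points along $H(x)\, g(x)$, the same direction as a descent step on the gradient norm, differing only by the positive factor $1 / \lVert g(x) \rVert$. Either way the update is a Hessian-vector product, which autodiff frameworks can compute without forming $H$ explicitly.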

0Depose11215mo

You are looking for "Fast Gradient Sign Method"

[-]Thomas Kwa5moΩ240

We might want to keep our AI from learning a certain fact about the world, like particular cognitive biases humans have that could be used for manipulation. But a sufficiently intelligent agent might discover this fact despite our best efforts. Is it possible to find out when it does this through monitoring, and trigger some circuit breaker?

Evals can measure the agent's propensity for catastrophic behavior, and mechanistic anomaly detection hopes to do better by looking at the agent's internals without assuming interpretability, but if we can measure the a... (read more)

1Arthur Conmy5mo

Do you mean "black box" in the sense that MAD does not assume interpretability of the agent? If so this is kinda confusing as "black box" is often used in contrast to "white box", ie "black box" means you have no access to model internals, just inputs+outputs (which wouldn't make sense in your context)

2Thomas Kwa5mo

Yes, changed the wording

[-]Thomas Kwa2y40

Somewhat related to this post and this post:

Coherence implies mutual information between actions. That is, to be coherent, your actions can't be independent. This is true under several different definitions of coherence, and can be seen in the following circumstances:

When trading between resources (uncertainty over utility function). If you trade 3 apples for 2 bananas, this is information that you won't trade 3 bananas for 2 apples, if there's some prior distribution over your utility function.
When taking multiple actions from the same utility function (u

... (read more)

2Vladimir_Nesov2y

The nice thing is that this should work even if you are a policy selected by a decision making algorithm, but you are not yourself a decision making algorithm anymore. There is no preference in any of the possible runs of the policy at that point, you don't care about anything now, you only know what you must do here, and not elsewhere. But if all possible runs of the policy are considered altogether (in the updateless sense of maps from epistemic situations to action and future policy), the preference is there, in the shape of the whole thing across all epistemic counterfactuals. (Basically you reassemble a function from pairs (from, to) of things it maps, found in individual situations.) I guess the at-a-distance part could make use of composition of an agent with some of its outer shells into a behavior that forgets internal interactions (within the agent, and between the agent and its proximate environment). The resulting "large agent" will still have basically the same preference, with respect to distant targets in environment, without a need to look inside the small agent's head, if the large agent's external actions in a sufficient range of epistemic situations can be modeled. (These large agents exist in each individual possible situation, they are larger than the small agent within the situation, and they can be compared with other variants of the large agent from different possible situations.) Not clear what to do with dependence on the epistemic situation of the small agent. It wants to reduce to dependence on a situation in terms of the large agent, but that doesn't seem to work. Possibly this needs something like the telephone theorem, with any relevant-in-some-sense dependence of behavior (of the large agent) on something becoming dependence of behavior on natural external observations (of the large agent) and not on internal noise (or epistemic state of the small agent).

[-]Thomas Kwa2y40

Many people think that AI alignment is intractable (<50% chance of success) and also believe that a universe optimized towards elephant CEV, or the CEV of aliens that had a similar evolutionary environment to humans, would be at least 50% as good as a universe optimized towards human CEV. Doesn't this mean we should be spending significant effort (say, at least 1% of the effort spent on alignment) finding tractable plans to create a successor species in case alignment fails?

1AprilSR2y

If alignment fails I don’t think it’s possible to safely prepare a successor species. We could maybe try to destroy the earth slightly before the AI turns on rather than slightly after, in the hopes that the aliens don’t screw up the chance we give them?

[-]Thomas Kwa4y40

Are there ring species where the first and last populations actually can interbreed? What evolutionary process could feasibly create one?

3Thomas Kwa4y

One of my professors says this often happens with circular island chains; populations from any two adjacent islands can interbreed, but not those from islands farther apart. I don't have a source. Presumably this doesn't require an expanding geographic barrier.

2Richard_Kennaway4y

Wouldn't that just be a species?

2Pattern4y

Ouroboros species.

2Thomas Kwa4y

I'm thinking of a situation where there are subspecies A through (say) H; A can interbreed with B, B with C, etc., and H with A, but no non-adjacent subspecies can produce fertile offspring.

2Pongo4y

A population distributed around a small geographic barrier that grew over time could produce what you want

[-]Thomas Kwa4y40

2.5 million jobs were created in May 2020, according to the jobs report. Metaculus was something like [99.5% or 99.7% confident](https://www.metaculus.com/questions/4184/what-will-the-may-2020-us-nonfarm-payrolls-figure-be/) that the number would be smaller, with the median at -11.0 million and the 99th percentile at -2.8 million. This seems like an obvious sign Metaculus is miscalibrated, but we have to consider both tails, making this merely a 1 in 100 or 1 in 150 event, which doesn't seem too bad.
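A rough sketch of the two-tail adjustment, assuming that a miss in the lower tail would have been counted as just as surprising as this miss in the upper tail. The function name `two_tailed_odds` is only for illustration and isn't anything from Metaculus.

```python
def two_tailed_odds(one_sided_confidence: float) -> float:
    """Return N such that a miss this extreme, in either tail, is a 1-in-N event.

    Assumption: both tails are treated as equally surprising, so the
    one-sided tail probability is simply doubled.
    """
    tail = 1.0 - one_sided_confidence
    return 1.0 / (2.0 * tail)


for confidence in (0.995, 0.997):
    odds = two_tailed_odds(confidence)
    print(f"{confidence:.1%} one-sided confidence -> about 1 in {odds:.0f}")
```

Under that assumption, 99.5% works out to about 1 in 100 and 99.7% to somewhat rarer than 1 in 150, so the quoted figures are the right order of magnitude.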

[-]Thomas Kwa4mo30

Current posts in the pipeline:

Dialogue on whether agent foundations is valuable, with Alex Altair. I might not finish this.
Why junior AI safety researchers should go to ML conferences
Summary of ~50 interesting or safety-relevant papers from ICML and NeurIPS this year
More research log entries
A list of my mistakes from the last two years. For example, spending too much money
Several flowcharts / tech trees for various alignment agendas.

[-]Thomas Kwa5mo30

Thomas Kwa

I think the dialogues functionality might be suitable for monologues / journals, so I'm trying it out as a research log. I intend to make ~daily entries. Hopefully this gives people some idea of my research process, and lets them either give me feedback or compare their process to mine.

Currently I have a bunch of disconnected projects:

Characterize planning inside KataGo by retargeting it to output the worst move (with Adria Garriga-Alonso).
Improve circuit discovery by implementing edge-level subnetwork probing on sparse autoencoder features (with

...

[-]Thomas Kwa6mo30

I'm reading a series of blogposts called Extropia's Children about the Extropians mailing list and its effects on the rationalist and EA communities. It seems quite good although a bit negative at times.

4Viliam6mo

In my opinion, it is clickbait, but I didn't notice any falsehoods (I didn't check carefully, just skimmed). For people familiar with the rationalist community, it is a good reminder of bad things that happened. For people unfamiliar with the community... it will probably make them believe that the rationalist community consists mostly of Leverage, neoreaction, and Zizians.

4Thomas Kwa6mo

Seems reasonable. Though I will note that bad things that happened are a significant fraction of the community early on, so people who read sections 1-3 with the reminder that it's a focus on the negative will probably not get the wrong idea. I did notice a misleading section which might indicate there are several more: Both sentences are wrong. I don't think MIRI portrays any agent foundations as "practical, immediately applicable work", and in fact the linked post by Demski is listing some basic theoretical problems. The quote by Daniel Dewey is taken out of context: he's investigated their decision theory work in depth by talking to other professional philosophers, and found it to be promising, so the claim that it has "significantly fewer advocates among professional philosophers than I’d expect it to if it were very promising" is a claim that professional philosophers won't automatically advocate for a promising idea, not that the work is unpromising.

6Viliam6mo

I'd like to read an impartial account, which would specify how large each fraction actually was. For instance, if I remember correctly, in some survey 2% of Less Wrong readers identified as neoreactionaries. From some perspective, 2% is too much, because the only acceptable number is 0%. From a different perspective, 2% is less than the Lizardman's Constant. Also, if I remember correctly, a much larger fraction of LessWrong readership identified on the survey as communist, and yet for some reason there are no people writing blogs or Wikipedia articles about how Less Wrong is a communist website. Or a socialist website. Or a Democrat website. Or... whatever else was in the poll. The section on Zizians is weird, because it correctly starts with saying that Zizians opposed MIRI and CFAR... and yet concludes that this is evidence that people attracted to rationalism are disproportionately prone to death spirals off the deep end. Notice the sleight of hand: "people attracted to you" technically includes your enemies who can't stop thinking about you. -- Using the same rhetorical trick: Westboro Baptist Church is evidence that people attracted to (the topic of) homosexuality are often crazy. Also, by the same logic, every celebrity is responsible for her stalkers. There are cases when the rationalist community actually promoted harmful people and groups, such as Vassar or Leverage. I'd like to read a serious analysis of how and why that happened, and how to prevent something like that in future. But if another Ziz appears in future, and starts recruiting people in another crazy cult opposed to rationalists, I am not sure how exactly to prevent that.

[-]Thomas Kwa4y30

Eliezer Yudkowsky wrote in 2016:

At an early singularity summit, Jürgen Schmidhuber, who did some of the pioneering work on self-modifying agents that preserve their own utility functions with his Gödel machine, also solved the friendly AI problem. Yes, he came up with the one true utility function that is all you need to program into AGIs!

(For God’s sake, don’t try doing this yourselves. Everyone does it. They all come up with different utility functions. It’s always horrible.)

His one true utility function was “increasing the compression of environ

...

3Viliam4y

Something like Goodhart's Law, I suppose. There are natural situations where X is associated with something good, but literally maximizing X is actually quite bad. (Having more gold would be nice. Converting the entire universe into atoms of gold, not necessarily so.) EY has practiced the skill of trying to see things like a machine. When people talk about "maximizing X", they usually mean "trying to increase X in a way that proves my point"; i.e. they use motivated thinking. Whatever X you take, the priors are almost 100% that literally maximizing X would be horrible. That includes the usual applause lights, whether they appeal to normies or nerds.

[-]Thomas Kwa1y20

What was the equation for research progress referenced in Ars Longa, Vita Brevis?

“Then we will talk this over, though rightfully it should be an equation. The first term is the speed at which a student can absorb already-discovered architectural knowledge. The second term is the speed at which a master can discover new knowledge. The third term represents the degree to which one must already be on the frontier of knowledge to make new discoveries; at zero, everyone discovers equally regardless of what they already know; at one, one must have mastered every

...

2gwern1y

I don't think Scott had a specific concrete equation in mind. (I don't know of any myself, and Scott would likely have referenced or written it up on SSC/ACX by now if he had one in mind.) However, conceptually, it's just a variation on the rocket equation or jeep problem, I think.

[-]Thomas Kwa2y20

Showerthought: what's the simplest way to tell that the human body is less than 50% efficient at converting chemical energy to mechanical work via running? I think it's that running uphill makes you warmer than running downhill at the same speed.

When running up a hill at mechanical power p and efficiency f, you have to exert p/f total power and so p(1/f - 1) is dissipated as heat. When running down the hill you convert p to heat. p(1/f - 1) > p implies that f < 0.5.
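Written out step by step, the uphill/downhill comparison looks like this. The symbols Q_up and Q_down (heat dissipated per unit time in each direction) are introduced here for illustration, and the sketch assumes any baseline cost of running is the same both ways and so cancels.

```latex
\begin{align*}
P_{\text{total}} &= \frac{p}{f} \\
Q_{\text{up}} &= \frac{p}{f} - p = p\left(\frac{1}{f} - 1\right) \\
Q_{\text{down}} &= p \\
Q_{\text{up}} > Q_{\text{down}}
  &\iff \frac{1}{f} - 1 > 1
  \iff \frac{1}{f} > 2
  \iff f < \frac{1}{2}
\end{align*}
```

So feeling warmer on the way up than on the way down is what you'd expect exactly when efficiency is below 50%.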

Maybe this story is wrong somehow. I'm pretty sure your body has no way of recovering your potential energy on the way down; I'd expect most of the waste heat to go in your joints and muscles but maybe some of it goes into your shoes.

1ejacob2y

Running barefoot will produce the same observations, right? So any waste heat going into your shoes is probably a small amount.

[-]Thomas Kwa2y20

Are there approximate versions of the selection theorems? I haven't seen anyone talk about them, but they might be easy to prove.

Approximate version of Kelly criterion: any agent that follows a strategy different by at least epsilon from Kelly betting will almost surely lose money compared to a Kelly-betting agent at a rate f(epsilon).

Approximate version of VNM: Any agent that satisfies some weakened version of the VNM axioms will have high likelihood under Boltzmann rationality (or some other metric of approximate utility maximization). The closest thing I'...
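For the Kelly case, here is a minimal sketch of one candidate for f(epsilon), assuming the simplest setting: a repeated binary bet with known win probability and payout odds, where the non-Kelly agent stakes a fixed fraction that is off by epsilon. The gap in expected log growth per bet is then the long-run rate at which that agent falls behind. The function names and the example numbers (60% win probability, even odds) are illustrative assumptions, not a statement of any actual theorem.

```python
import math


def log_growth(fraction: float, p_win: float, odds: float) -> float:
    """Expected log-wealth growth per bet when staking `fraction` of wealth.

    A win pays `odds` times the stake; a loss forfeits the stake.
    """
    return (p_win * math.log(1 + odds * fraction)
            + (1 - p_win) * math.log(1 - fraction))


def kelly_fraction(p_win: float, odds: float) -> float:
    """Kelly-optimal stake for a binary bet: p - (1 - p) / b."""
    return p_win - (1 - p_win) / odds


# Illustrative numbers only: 60% chance of winning at even odds.
p_win, odds = 0.6, 1.0
f_star = kelly_fraction(p_win, odds)
best = log_growth(f_star, p_win, odds)

for eps in (0.0, 0.05, 0.1, 0.2):
    # Growth-rate gap relative to Kelly when overbetting by eps.
    gap = best - log_growth(f_star + eps, p_win, odds)
    print(f"eps={eps:.2f}  growth-rate gap per bet={gap:.5f}")
```

In this toy setting the gap is zero at eps = 0 and grows as the stake moves away from the Kelly fraction, which is the shape an approximate theorem would presumably need. It doesn't say anything about strategies that differ from Kelly in other ways.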

[-]Thomas Kwa2y10

Is there somewhere I can find a graph of the number of AI alignment researchers vs AI capabilities researchers over time, from say 2005 to the present day?

[-]Thomas Kwa3y10

Is there software that would let me automatically switch between microphones on my computer when I put on my headset?

I imagine this might work as a piece of software that integrates all microphones connected to my computer into a single input device, then transmits the audio stream from the best-quality source.

A partial solution would be something that automatically switches to the headset microphone when I switch to the headset speakers.

3Dagon3y

Depending on connection method for your headset, you might be able to just use a simple switch. Mine is USB, and https://smile.amazon.com/dp/B00JX1ZS5O lets me just leave it disconnected when not in use. My Windows box uses the speakers when it's disconnected (I don't have a separate mic, but I expect it would work the same), and switches output and input to the headset when connected. I've seen similar switchers for 3.5mm audio connectors - I have no doubt they'd work for microphone instead of speaker, but I don't know any that combine them.

1Thomas Kwa3y

Thanks. I tried a couple different switches on my setup (3.5mm through USB-C hub), and the computer didn't disconnect upon opening the switch, so I'm giving up on this until I change hardware.