A book review examining Elinor Ostrom's "Governing the Commons", in light of Eliezer Yudkowsky's "Inadequate Equilibria." Are successful local institutions for governing common pool resources possible without government intervention? Under what circumstances can such institutions emerge spontaneously to solve coordination problems?

robo
Our current big stupid: not preparing for 40% agreement

Epistemic status: lukewarm take from the gut (not brain) that feels rightish.

The "Big Stupid" of the AI doomers of 2013-2023 was that AI nerds' solution to the problem "How do we stop people from building dangerous AIs?" was "research how to build AIs". Methods normal people would consider to stop people from building dangerous AIs, like asking governments to make it illegal to build dangerous AIs, were considered gauche. When the public turned out to be somewhat receptive to the idea of regulating AIs, doomers were unprepared.

Take: the "Big Stupid" of right now is still the same thing. (We've not corrected enough.) Between now and transformative AGI we are likely to encounter a moment where 40% of people realize AIs really could take over (say, if every month another 1% of the population loses their job). If 40% of the world were as scared of AI loss-of-control as you, what could the world do? I think a lot! Do we have a plan for then?

Almost every LessWrong post on AIs is about analyzing AIs. Almost none are about how, given widespread public support, people/governments could stop bad AIs from being built. [Example: if 40% of people were as worried about AI as I am, the US would treat GPU manufacture like uranium enrichment. And fortunately GPU manufacture is hundreds of times harder than uranium enrichment! We should be nerding out researching integrated circuit supply chains, choke points, foundry logistics in jurisdictions the US can't unilaterally sanction, that sort of thing.]

TL;DR: stopping deadly AIs from being built needs less research on AIs and more research on how to stop AIs from being built.

*My research included 😬
Very Spicy Take

Epistemic note: Many highly respected community members with substantially greater decision-making experience (and LessWrong karma) presumably disagree strongly with my conclusion.

Premise 1: It is becoming increasingly clear that OpenAI is not appropriately prioritizing safety over advancing capabilities research.

Premise 2: This was the default outcome. Instances in history in which private companies (or any individual humans) have intentionally turned down huge profits and power are the exception, not the rule.

Premise 3: Without repercussions for terrible decisions, decision makers have no skin in the game.

Conclusion: Anyone and everyone involved with Open Phil recommending a grant of $30 million be given to OpenAI in 2017 shouldn't be allowed anywhere near AI safety decision making in the future. To go one step further, potentially any and every major decision they have played a part in needs to be reevaluated by objective third parties. This must include Holden Karnofsky and Paul Christiano, both of whom were closely involved. To quote Open Phil: "OpenAI researchers Dario Amodei and Paul Christiano are both technical advisors to Open Philanthropy and live in the same house as Holden. In addition, Holden is engaged to Dario's sister Daniela."
Why no prediction markets for large infrastructure projects?

Been reading this excellent piece on why prediction markets aren't popular. They say that without subsidies prediction markets won't be large enough; the information value of prediction markets is often not high enough.

Large infrastructure projects undertaken by governments and other large actors often go over budget, often hilariously so: 3x, 5x, 10x or more is not uncommon, indeed often even the standard. One of the reasons is that government officials deciding on billion-dollar infrastructure projects don't have enough skin in the game. Politicians are often not in office long enough to care about the time horizons of large infrastructure projects. Contractors don't gain by being efficient or delivering on time; on the contrary, infrastructure projects are huge cash cows. Another problem is that there are often far too many veto-stakeholders. All too often the initial bid is wildly overoptimistic. Similar considerations apply to other government projects like defense procurement or IT projects.

Okay, how to remedy this situation? Internal prediction markets could theoretically prove beneficial: all stakeholders and decision makers are endowed with vested equity with which they are forced to bet on building timelines and other key performance indicators. External traders may also enter the market, selling and buying the contracts. The effective subsidy could be quite large; key decisions could save billions. In this world, government officials could gain a large windfall, which may be difficult to explain to voters. This is a legitimate objection.

A very simple mechanism would simply ask people to estimate the cost C and the timeline T for completion. Your eventual payout would be proportional to how close you ended up to the real C, T compared to the other bettors. [Something something log scoring rule is proper.]
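For concreteness, here is a minimal sketch (in Python, purely illustrative; the bettor names, error metric, and prize-pool split are hypothetical, not a claim about how such a market would actually be run) of the proximity-based payout described above:

```python
def payouts(bets, actual_cost, actual_time, pool=1.0):
    """Split a prize pool among bettors in proportion to how close their
    (cost, time) estimates landed to the realized values.

    bets: dict mapping bettor name -> (estimated_cost, estimated_time)
    Closeness uses relative error so that cost and time are comparable
    despite having different units.
    """
    def closeness(est_cost, est_time):
        cost_err = abs(est_cost - actual_cost) / actual_cost
        time_err = abs(est_time - actual_time) / actual_time
        # Smaller total error -> larger score; the +1 avoids division by zero.
        return 1.0 / (1.0 + cost_err + time_err)

    scores = {name: closeness(c, t) for name, (c, t) in bets.items()}
    total = sum(scores.values())
    return {name: pool * s / total for name, s in scores.items()}


# Hypothetical example: a project budgeted at $1.0B that actually costs
# $2.2B and takes 7 years.
bets = {"official_a": (1.2e9, 4), "contractor_b": (2.0e9, 6), "trader_c": (3.0e9, 9)}
print(payouts(bets, actual_cost=2.2e9, actual_time=7))
```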
Akash
My current perspective is that criticism of AGI labs is an under-incentivized public good. I suspect there's a disproportionate amount of value that people could have by evaluating lab plans, publicly criticizing labs when they break commitments or make poor arguments, talking to journalists/policymakers about their concerns, etc. Some quick thoughts:

* Soft power– I think people underestimate how strong the "soft power" of labs is, particularly in the Bay Area.
* Jobs– A large fraction of people getting involved in AI safety are interested in the potential of working for a lab one day. There are some obvious reasons for this– lots of potential impact from being at the organizations literally building AGI, big salaries, lots of prestige, etc.
  * People (IMO correctly) perceive that if they acquire a reputation for being critical of labs, their plans, or their leadership, they will essentially sacrifice the ability to work at the labs.
  * So you get an equilibrium where the only people making (strong) criticisms of labs are those who have essentially chosen to forgo their potential of working there.
* Money– The labs and Open Phil (which has been perceived, IMO correctly, as investing primarily into metastrategies that are aligned with lab interests) have an incredibly large share of the $$$ in the space. When funding became more limited, this became even more true, and I noticed a very tangible shift in the culture & discourse around labs + Open Phil.
* Status games//reputation– Groups who were more inclined to criticize labs and advocate for public or policymaker outreach were branded as "unilateralist", "not serious", and "untrustworthy" in core EA circles. In many cases, there were genuine doubts about these groups, but my impression is that these doubts got amplified/weaponized in cases where the groups were more openly critical of the labs.
* Subjectivity of "good judgment"– There is a strong culture of people getting jobs/status for having "good judgment". This is sensible insofar as we want people with good judgment (who wouldn't?), but it often ends up being so subjective that it leads to people being quite afraid to voice opinions that go against mainstream views and metastrategies (particularly those endorsed by labs + Open Phil).
* Anecdote– Personally, I found my ability to evaluate and critique labs + mainstream metastrategies substantially improved when I spent more time around folks in London and DC (who were less closely tied to the labs). In fairness, I suspect that if I had lived in London or DC *first* and then moved to the Bay Area, it's plausible I would've had a similar feeling but in the "reverse direction".

With all this in mind, I find myself more deeply appreciating folks who have publicly and openly critiqued labs, even in situations where the cultural and economic incentives to do so were quite weak (relative to staying silent or saying generic positive things about labs). Examples: Habryka, Rob Bensinger, CAIS, MIRI, Conjecture, and FLI. More recently, @Zach Stein-Perlman, and of course Jan Leike and Daniel K.
If your endgame strategy involved relying on OpenAI, DeepMind, or Anthropic to implement your alignment solution that solves science / super-cooperation / nanotechnology, consider figuring out another endgame plan.

Recent Discussion

Summary

We study language models' capability to perform parallel reasoning in one forward pass. To do so, we test GPT-3.5's ability to solve (in one token position) one or two instances of algorithmic problems. We consider three different problems: repeatedly iterating a given function, evaluating a mathematical expression, and calculating terms of a linearly recursive sequence.

We found no evidence for parallel reasoning in algorithmic problems: The total number of steps the model could perform when handed two independent tasks was comparable to (or less than) the number of steps it could perform when given one task.
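For concreteness, a minimal sketch (not the actual experiment code; n, k, and the combination rule are illustrative) of how single- and double-instance versions of the function-iteration task could be generated:

```python
import random

def make_permutation(n):
    """A random permutation of {0, ..., n-1}, playing the role of the function f."""
    values = list(range(n))
    random.shuffle(values)
    return values

def iterate(perm, x, k):
    """Apply the permutation k times: f^k(x)."""
    for _ in range(k):
        x = perm[x]
    return x

n, k = 6, 2
f = make_permutation(n)
g = make_permutation(n)
x, y = random.randrange(n), random.randrange(n)

# Single-instance task: the model must output f^k(x) at one token position.
single_answer = iterate(f, x, k)

# Double-instance task: two independent functions, whose results must be
# combined at one token position (here, their sum).
double_answer = iterate(f, x, k) + iterate(g, y, k)

print(f"f={f}, g={g}, x={x}, y={y}")
print("single:", single_answer, "double:", double_answer)
```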

Motivation

Broadly, we are interested in AI models' capability to perform hidden cognition: Agendas such as scalable oversight and AI control rely (to some degree) on our ability to supervise and bound models' thinking....

Olli Järviniemi
We performed few-shot testing before fine-tuning (this didn't make it to the post). I reran some experiments on the permutation iteration problem and got similar results as before: for one function (and n = 6), the model got ~60% accuracy for k = 2, but not great[1] accuracy for k = 3. For two functions, it already failed at the f(x) + g(y) problem. (This was with 50 few-shot examples; gpt-3.5-turbo-0125 only allows 16k tokens.) So fine-tuning really does give considerably better capabilities than simply many-shot prompting.

Let me clarify that with fine-tuning, our intent wasn't so much to create or teach the model new capabilities, but to elicit the capabilities the model already has. (C.f. Hubinger's When can we trust model evaluations?, section 3.) I admit that it's not clear where to draw the lines between teaching and eliciting, though.

Relatedly, I do not mean to claim that one simply cannot construct a 175B model that successfully performs nested addition and multiplication. Rather, I'd take the results as evidence for GPT-3.5 not doing much parallel reasoning off-the-shelf (e.g. with light fine-tuning). I could see this being consistent with the data multiplexing paper (they do much heavier training). I'm still confused, though.

I tried to run experiments on open source models with full fine-tuning, but it does, in fact, require much more RAM. I don't currently have the multi-GPU setups required to do full fine-tuning on even 7B models (I could barely fine-tune Pythia-1.4B on a single A100, and did not get much oomph out of it). So I'm backing down; if someone else is able to do proper tests here, go ahead.

1. ^ Note that while you can get 1/6 accuracy trivially, you can get 1/5 if you realize that the data is filtered so that f^k(x) ≠ x, and 1/4 if you also realize that f^k(x) ≠ f(x) (and are able to compute f(x)), ...
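As a quick sanity check on the footnote's baselines, here is a small simulation (my own illustration, not part of the original experiments) of the three guessing strategies for a random permutation on n = 6 with the filters f^k(x) ≠ x and f^k(x) ≠ f(x) applied:

```python
import random

def trial(n=6, k=2):
    perm = list(range(n)); random.shuffle(perm)
    x = random.randrange(n)
    fx = perm[x]
    target = x
    for _ in range(k):
        target = perm[target]
    # Filters described in the footnote: f^k(x) != x and f^k(x) != f(x).
    if target == x or target == fx:
        return None
    guesses = {
        "uniform over all n": random.randrange(n),
        "exclude x": random.choice([v for v in range(n) if v != x]),
        "exclude x and f(x)": random.choice([v for v in range(n) if v not in (x, fx)]),
    }
    return {name: int(g == target) for name, g in guesses.items()}

results = [r for r in (trial() for _ in range(200_000)) if r is not None]
for name in results[0]:
    print(name, sum(r[name] for r in results) / len(results))
# Expected accuracies: roughly 1/6, 1/5, and 1/4 respectively.
```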
Ann

Going to message you a suggestion I think.

Please help me find research on aspiring AI Safety folk!

I am two weeks into the strategy development phase of my movement building and almost ready to start ideating some programs for the year.

But I want these programs to solve the biggest pain points people experience when trying to have a positive impact in AI Safety.

Has anyone seen any research that looks at this in depth? For example, through an interview process and then survey to quantify how painful the pain points are?

Some examples of pain points I've observed so far through my interviews wit... (read more)

If it’s worth saying, but not worth its own post, here's a place to put it.

If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.

If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.

If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.

The Open Thread tag is here. The Open Thread sequence is here.

Linch

Do we know if @paulfchristiano or other ex-lab people working on AI policy have non-disparagement agreements with OpenAI or other AI companies? I know Cullen doesn't, but I don't know about anybody else.

I know NIST isn't a regulatory body, but it still seems like standards-setting should be done by people who have no unusual legal obligations. 

To be clear, I want to differentiate between Non-Disclosure Agreements, which are perfectly sane and reasonable in at least a limited form as a way to prevent leaking trade secrets, and non-disparagement agree... (read more)

spencerkaplan
Hi everyone! I'm new to LW and wanted to introduce myself. I'm from the SF Bay Area and working on my PhD in anthropology. I study AI safety, and I'm mainly interested in research efforts that draw methods from the human sciences to better understand present and future models. I'm also interested in AI safety's sociocultural dynamics, including how ideas circulate within the research community and how uncertainty figures into our interactions with models. All thoughts and leads are welcome. This work led me to LW. Originally all the content was overwhelming, but now there's much I appreciate. It's my go-to place for developments in the field and informed responses. More broadly, learning about rationality through the Sequences and other posts is helping me improve my work as a researcher, and I'm looking forward to continuing this process.
habryka
Welcome! I hope you have a good time here!

 [memetic status: stating directly despite it being a clear consequence of core AI risk knowledge because many people have "but nature will survive us" antibodies to other classes of doom and misapply them here.]

Unfortunately, no.[1]

Technically, “Nature”, meaning the fundamental physical laws, will continue. However, people usually mean forests, oceans, fungi, bacteria, and generally biological life when they say “nature”, and those would not have much chance competing against a misaligned superintelligence for resources like sunlight and atoms, which are useful to both biological and artificial systems.

There’s a thought that comforts many people when they imagine humanity going extinct due to a nuclear catastrophe or runaway global warming: Once the mushroom clouds or CO2 levels have settled, nature will reclaim the cities. Maybe mankind in our hubris will have wounded Mother Earth and paid the price ourselves, but...

I've thought a bit about actions to reduce the probability that AI takeover involves violent conflict.

I don't think there are any amazing-looking options. If governments were generally more competent, that would help.

Having some sort of apparatus for negotiating with rogue AIs could also help, but I expect this is politically infeasible and not that leveraged to advocate for on the margin.

Mitchell_Porter
In preparation for what?
jaan
AI takeover.
owencb
OK hmm I think I understand what you mean. I would have thought about it like this:

* "our reference class" includes roughly the observations we make before observing that we're very early in the universe
* This includes stuff like being a pre-singularity civilization
* The anthropics here suggest there won't be lots of civs later arising and being in our reference class and then finding that they're much later in universe histories
* It doesn't speak to the existence or otherwise of future human-observer moments in a post-singularity civilization

... but as you say anthropics is confusing, so I might be getting this wrong.
This is a linkpost for https://ailabwatch.org

I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.

It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.

(It's much better on desktop than mobile — don't read it on mobile.)

It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly.

It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.

Some clarifications and disclaimers.

How you can help:

  • Give feedback on how this project is helpful or how it could be different to be much more helpful
  • Tell me what's wrong/missing; point me to sources
...

So is the Alignment program to be updated to 0 for OpenAI, now that the Superalignment team is no more? ( https://docs.google.com/document/d/1uPd2S00MqfgXmKHRkVELz5PdFRVzfjDujtu8XLyREgM/edit?usp=sharing )

When working with numbers that span many orders of magnitude it's very helpful to use some form of scientific notation. At its core, scientific notation expresses a number by breaking it down into a decimal ≥1 and <10 (the "significand" or "mantissa") and an integer representing the order of magnitude (the "exponent"). Traditionally this is written as:

3 × 10⁴

While this communicates the necessary information, it has two main downsides:

  • It uses three constant characters ("× 10") to separate the significand and exponent.

  • It uses superscript, which doesn't work with some typesetting systems and adds awkwardly large line spacing at the best of times. And is generally lost on cut-and-paste.

Instead, I'm a big fan of e-notation, commonly used in programming and on calculators. This looks like:

3e4

This works everywhere, doesn't mess up your line spacing, and requires half as...

I'd like to second this comment, at least broadly. I've seen the e notation in blog posts and the like and I've struggled to put the × 10 in the right place.

One of the reasons why I dislike trying to understand numbers written in scientific notation is because I have trouble mapping them to normal numbers with lots of commas in them. Engineering notation helps a lot with this — at least for numbers greater than 1 — by having the exponent be a multiple of 3. Oftentimes, losing significant figures isn't an issue in anything but the most technical scientific writing.
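A quick illustration in Python (the post doesn't specify a language, and the engineering() helper below is hypothetical, not a standard library function): e-notation parses directly, and engineering notation can be produced by forcing the exponent to a multiple of three:

```python
import math

x = 3e4            # e-notation parses directly: 30000.0
print(f"{x:.2e}")  # scientific/e-notation formatting: 3.00e+04

def engineering(value, digits=3):
    """Format a nonzero number with an exponent that is a multiple of 3,
    e.g. 30000 -> '30e3', which maps more directly onto thousands/millions."""
    if value == 0:
        return "0"
    exp = 3 * math.floor(math.log10(abs(value)) / 3)
    mantissa = value / 10**exp
    return f"{mantissa:.{digits}g}e{exp}"

print(engineering(30000))      # 30e3
print(engineering(6.02e23))    # 602e21
print(engineering(0.00047))    # 470e-6
```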


by Lucius Bushnaq, Jake Mendel, Kaarel Hänni, Stefan Heimersheim.

A short post laying out our reasoning for using integrated gradients as an attribution method. It is intended as a stand-alone post based on our LIB papers [1] [2]. This work was produced at Apollo Research.

Context

Understanding circuits in neural networks requires understanding how features interact with other features. There are a lot of features, and their interactions are generally non-linear. A good starting point for understanding the interactions might be to just figure out how strongly each pair of features in adjacent layers of the network interacts. But since the relationships are non-linear, how do we quantify their 'strength' in a principled manner that isn't vulnerable to common and simple counterexamples? In other words, how do we quantify how much the...

We now have a method for how to do attributions on single data points. But when we're searching for circuits, we're probably looking for variables that have strong attributions between each other on average, measured over many data points.

Maybe?

One thing I've been thinking about a lot recently is that building tools to interpret networks on individual datapoints might be more relevant than attributing over a dataset. This applies if the goal is to make statistical generalizations, since a richer structure on an individual datapoint gives you more to generalize wi... (read more)
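For illustration, a minimal numpy sketch (a toy example, not the LIB implementation; the function f and its weights are made up) of integrated-gradients attributions computed per datapoint and then averaged over a dataset, i.e. the two levels of analysis being contrasted here:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)                      # toy "downstream feature" weights

def f(x):
    return np.tanh(x @ w)                   # scalar downstream feature

def grad_f(x):
    return (1 - np.tanh(x @ w) ** 2) * w    # analytic gradient of f at x

def integrated_gradients(x, baseline=None, steps=64):
    """Approximate IG_i(x) = (x_i - b_i) * integral_0^1 df/dx_i(b + a(x - b)) da
    with a midpoint Riemann sum along the straight-line path from b to x."""
    b = np.zeros_like(x) if baseline is None else baseline
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(b + a * (x - b)) for a in alphas])
    return (x - b) * grads.mean(axis=0)

# Per-datapoint attributions: one vector per input, inspectable individually.
X = rng.normal(size=(100, 4))
per_point = np.stack([integrated_gradients(x) for x in X])

# Dataset-level summary: average attribution strength per input feature.
print("example datapoint attribution:", per_point[0])
print("mean |attribution| over dataset:", np.abs(per_point).mean(axis=0))
# Sanity check (completeness): attributions approximately sum to f(x) - f(baseline).
print(per_point[0].sum(), f(X[0]) - f(np.zeros(4)))
```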

This is the script for a video I made about my current full-time project. I think the LW community will understand its value better than the average person I talk to does.

Hi, I'm Bruce Lewis. I'm a computer programmer. For a long time, I've been fascinated by how computers can help people process information. Lately I've been thinking about and experimenting with ways that computers help people process lines of reasoning. This video will catch you up on the series of thoughts and experiments that led me to HowTruthful, and tell you why I'm excited about it. This is going to be a long video, but if you're interested in how people arrive at truth, it will be worth it.

Ten or 15 years ago I noticed how...

I like that HowTruthful uses the idea of (independent) hierarchical subarguments, since I had the same idea. Have you been able to persuade very many to pay for it?

My first thought about it was that the true/false scale should have two dimensions, knowledge & probability:

One of the many things I wanted to do on my site was to gather user opinions, and this does that. ✔ I think of opinions as valuable evidence, just not always valuable evidence about the question under discussion (though to the extent people with "high knowledge" really have high knowle... (read more)
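As an illustration of the data model being described (hypothetical field names, not HowTruthful's actual schema), a claim with independent sub-arguments and a two-dimensional knowledge/probability score might look like:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    """A statement with independent sub-arguments for and against, scored on
    two axes: the rater's probability that it is true and the rater's
    self-assessed knowledge of the topic. Hypothetical schema for illustration."""
    text: str
    probability: float      # 0.0 (certainly false) .. 1.0 (certainly true)
    knowledge: float        # 0.0 (no familiarity) .. 1.0 (expert)
    supports: List["Claim"] = field(default_factory=list)
    rebuttals: List["Claim"] = field(default_factory=list)

root = Claim("This policy would reduce cost overruns",
             probability=0.6, knowledge=0.3)
root.supports.append(Claim("Decision makers would have skin in the game",
                           probability=0.9, knowledge=0.5))
root.rebuttals.append(Claim("The incentives are politically toxic",
                            probability=0.7, knowledge=0.4))
```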

This is a linkpost for our two recent papers:

  1. An exploration of using degeneracy in the loss landscape for interpretability https://arxiv.org/abs/2405.10927
  2. An empirical test of an interpretability technique based on the loss landscape https://arxiv.org/abs/2405.10928

This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu. Not to be confused with Apollo's recent Sparse Dictionary Learning paper.

A key obstacle to mechanistic interpretability is finding the right representation of neural network internals. Optimally, we would like to derive our features from some high-level principle that holds across different architectures and use cases. At a minimum, we know two things:

  1. We know that the training loss goes down during training. Thus, the features learned during training must be determined by the loss
...

I was thinking along similar lines, but eventually dropped it because I felt like the gradients would likely miss something if e.g. a saturated softmax prevents any gradient from going through. I find it interesting that the experiments also find that the interaction basis didn't work, and I wonder whether any of the failure here is due to saturated softmaxes.
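A quick numerical illustration of the worry (a toy numpy example, not from the papers): when one logit dominates, the softmax Jacobian is essentially zero, so gradient-based attributions through it vanish:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = d softmax(z)_i / d z_j = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

mild = np.array([1.0, 0.5, -0.5])        # unsaturated logits
saturated = np.array([20.0, 0.5, -0.5])  # one logit dominates

print(np.abs(softmax_jacobian(mild)).max())       # ~0.25: gradients flow
print(np.abs(softmax_jacobian(saturated)).max())  # ~5e-9: gradients (almost) blocked
```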
