A book review examining Elinor Ostrom's "Governing the Commons" in light of Eliezer Yudkowsky's "Inadequate Equilibria." Are successful local institutions for governing common pool resources possible without government intervention? Under what circumstances can such institutions emerge spontaneously to solve coordination problems?
We study language models' capability to perform parallel reasoning in one forward pass. To do so, we test GPT-3.5's ability to solve (in one token position) one or two instances of algorithmic problems. We consider three different problems: repeatedly iterating a given function, evaluating a mathematical expression, and calculating terms of a linearly recursive sequence.
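To make two of the task families above concrete, here is a minimal Python sketch. The function names, the lookup-table representation of the function, and the example inputs are illustrative assumptions, not the setup used in the study.

```python
from typing import Dict, Sequence


def iterate_function(mapping: Dict[int, int], start: int, steps: int) -> int:
    """Apply a function (given as a lookup table) `steps` times to `start`."""
    value = start
    for _ in range(steps):
        value = mapping[value]
    return value


def linear_recurrence(coeffs: Sequence[int], initial: Sequence[int], n: int) -> int:
    """Return term n (0-indexed) of a linear recurrence.

    Each new term is sum(coeffs[i] * previous_terms[i]) over the last
    len(coeffs) terms, oldest first.
    """
    terms = list(initial)
    while len(terms) <= n:
        recent = terms[-len(coeffs):]
        terms.append(sum(c * t for c, t in zip(coeffs, recent)))
    return terms[n]


# Hypothetical example instances (not taken from the study).
cycle = {0: 2, 1: 0, 2: 1}
print(iterate_function(cycle, start=0, steps=4))    # 0 -> 2 -> 1 -> 0 -> 2
print(linear_recurrence([1, 1], [0, 1], n=10))      # Fibonacci-style recurrence
```

Each loop iteration is one serial "step"; the question in the post is how many such steps the model can carry out within a single token position, for one task or for two at once.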
We found no evidence for parallel reasoning in algorithmic problems: The total number of steps the model could perform when handed two independent tasks was comparable to (or less than) the number of steps it could perform when given one task.
Broadly, we are interested in AI models' capability to perform hidden cognition: Agendas such as scalable oversight and AI control rely (to some degree) on our ability to supervise and bound models' thinking....
Going to message you a suggestion I think.
Please help me find research on aspiring AI Safety folk!
I am two weeks into the strategy development phase of my movement building and almost ready to start ideating some programs for the year.
But I want these programs to solve the biggest pain points people experience when trying to have a positive impact in AI Safety.
Has anyone seen any research that looks at this in depth? For example, through an interview process and then a survey to quantify how painful the pain points are?
Some examples of pain points I've observed so far through my interviews wit...
If it’s worth saying, but not worth its own post, here's a place to put it.
If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.
If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.
If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.
The Open Thread tag is here. The Open Thread sequence is here.
Do we know if @paulfchristiano or other ex-lab people working on AI policy have non-disparagement agreements with OpenAI or other AI companies? I know Cullen doesn't, but I don't know about anybody else.
I know NIST isn't a regulatory body, but it still seems like standards-setting should be done by people who have no unusual legal obligations.
To be clear, I want to differentiate between Non-Disclosure Agreements, which are perfectly sane and reasonable in at least a limited form as a way to prevent leaking trade secrets, and non-disparagement agree...
[memetic status: stating directly despite it being a clear consequence of core AI risk knowledge because many people have "but nature will survive us" antibodies to other classes of doom and misapply them here.]
Unfortunately, no.[1]
Technically, “Nature”, meaning the fundamental physical laws, will continue. However, people usually mean forests, oceans, fungi, bacteria, and generally biological life when they say “nature”, and those would not have much chance competing against a misaligned superintelligence for resources like sunlight and atoms, which are useful to both biological and artificial systems.
There’s a thought that comforts many people when they imagine humanity going extinct due to a nuclear catastrophe or runaway global warming: Once the mushroom clouds or CO2 levels have settled, nature will reclaim the cities. Maybe mankind in our hubris will have wounded Mother Earth and paid the price ourselves, but...
I've thought a bit about actions to reduce the probability that AI takeover involves violent conflict.
I don't think there are any amazing-looking options. If governments were generally more competent that would help.
Having some sort of apparatus for negotiating with rogue AIs could also help, but I expect this is politically infeasible and not that leveraged to advocate for on the margin.
I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.
It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.
(It's much better on desktop than mobile — don't read it on mobile.)
It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly.
It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.
Some clarifications and disclaimers.
How you can help:
So the Alignment program is to be updated to 0 for OpenAI now that the Superalignment team is no more? ( https://docs.google.com/document/d/1uPd2S00MqfgXmKHRkVELz5PdFRVzfjDujtu8XLyREgM/edit?usp=sharing )
When working with numbers that span many orders of magnitude it's very helpful to use some form of scientific notation. At its core, scientific notation expresses a number by breaking it down into a decimal ≥1 and <10 (the "significand" or "mantissa") and an integer representing the order of magnitude (the "exponent"). Traditionally this is written as:
3 × 10⁴
While this communicates the necessary information, it has two main downsides:
It uses three constant characters ("× 10") to separate the significand and exponent.
It uses superscript, which doesn't work with some typesetting systems, adds awkwardly large line spacing at the best of times, and is generally lost on cut-and-paste.
Instead, I'm a big fan of e-notation, commonly used in programming and on calculators. This looks like:
3e4
This works everywhere, doesn't mess up your line spacing, and requires half as...
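As a small illustration of e-notation in a programming language, here is a Python sketch. Python parses `3e4` directly, but its built-in formatting produces `3e+04`, so the hypothetical helper `to_e_notation` below (my own name, not a standard function) strips the sign and zero padding to get the compact form.

```python
def to_e_notation(value: float, digits: int = 0) -> str:
    """Format `value` as compact e-notation with `digits` decimals in the significand."""
    # Built-in formatting gives e.g. '3e+04'; split it into significand and exponent.
    significand, exponent = f"{value:.{digits}e}".split("e")
    # int() drops the '+' sign and leading zeros from the exponent.
    return f"{significand}e{int(exponent)}"


print(float("3e4"))               # parsing e-notation
print(to_e_notation(30000))       # compact form of thirty thousand
print(to_e_notation(0.00012, 1))  # small numbers get a negative exponent
```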
I'd like to second this comment, at least broadly. I've seen the e notation in blog posts and the like and I've struggled to put the × 10 in the right place.
One reason I dislike trying to understand numbers written in scientific notation is that I have trouble mapping them to normal numbers with lots of commas in them. Engineering notation helps a lot with this, at least for numbers greater than 1, by having the exponent be a multiple of 3. Oftentimes, losing significant figures isn't an issue in anything but the most technical scientific writing.
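A rough Python sketch of engineering notation, where the exponent is forced to a multiple of 3 so that it lines up with comma groups. The helper name `to_engineering` and the default of three significant figures are assumptions for illustration.

```python
def to_engineering(value: float, sig_figs: int = 3) -> str:
    """Format `value` in e-notation with an exponent that is a multiple of 3."""
    # Round to the requested significant figures using scientific formatting.
    significand_str, exponent_str = f"{value:.{sig_figs - 1}e}".split("e")
    exponent = int(exponent_str)
    # Move 0, 1, or 2 powers of ten from the exponent into the significand.
    shift = exponent % 3
    eng_exponent = exponent - shift
    significand = float(significand_str) * 10 ** shift
    # Keep the same number of significant figures after the shift.
    decimals = max(sig_figs - 1 - shift, 0)
    return f"{significand:.{decimals}f}e{eng_exponent}"


print(f"{30000:,}", "=", to_engineering(30000))
print(f"{1234567:,}", "~", to_engineering(1234567))
```

With the exponent at 3, 6, 9, and so on, the significand reads as thousands, millions, or billions, matching the comma grouping of the written-out number.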
by Lucius Bushnaq, Jake Mendel, Kaarel Hänni, Stefan Heimersheim.
A short post laying out our reasoning for using integrated gradients as an attribution method. It is intended as a stand-alone post based on our LIB papers [1] [2]. This work was produced at Apollo Research.
Understanding circuits in neural networks requires understanding how features interact with other features. There are a lot of features and their interactions are generally non-linear. A good starting point for understanding the interactions might be to just figure out how strongly each pair of features in adjacent layers of the network interacts. But since the relationships are non-linear, how do we quantify their 'strength' in a principled manner that isn't vulnerable to common and simple counterexamples? In other words, how do we quantify how much the...
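For readers unfamiliar with integrated gradients, here is a minimal pure-Python sketch of the standard formulation: the attribution to input i is (x_i - baseline_i) times the gradient with respect to x_i, averaged along the straight path from the baseline to x. This is a generic illustration on a toy scalar function, not the feature-to-feature variant used in the LIB papers; `toy_model`, the zero baseline, the finite-difference gradient, and the step count are all assumptions.

```python
from typing import Callable, List, Sequence


def numerical_gradient(f: Callable[[Sequence[float]], float],
                       x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f: Callable[[Sequence[float]], float],
                         x: Sequence[float], baseline: Sequence[float],
                         steps: int = 50) -> List[float]:
    """Approximate the path integral with a midpoint Riemann sum."""
    attributions = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        for i in range(len(x)):
            attributions[i] += grad[i] * (x[i] - baseline[i]) / steps
    return attributions


def toy_model(v: Sequence[float]) -> float:
    """Hypothetical non-linear function with an interaction term."""
    return v[0] * v[1] + v[0] ** 2


x, baseline = [2.0, 3.0], [0.0, 0.0]
attr = integrated_gradients(toy_model, x, baseline)
# Completeness: attributions sum to f(x) - f(baseline).
print(attr, sum(attr), toy_model(x) - toy_model(baseline))
```

The completeness check at the end is one of the properties that makes this method attractive for non-linear functions: the attributions account for the whole change in output between the baseline and the input.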
We now have a method for how to do attributions on single data points. But when we're searching for circuits, we're probably looking for variables that have strong attributions between each other on average, measured over many data points.
Maybe?
One thing I've been thinking about a lot recently is that building tools to interpret networks on individual datapoints might be more relevant than attributing over a dataset. This applies if the goal is to make statistical generalizations since a richer structure on an individual datapoint gives you more to generalize wi...
This is the script for a video I made about my current full-time project. I think the LW community will understand its value better than the average person I talk to does.
Hi, I'm Bruce Lewis. I'm a computer programmer. For a long time, I've been fascinated by how computers can help people process information. Lately I've been thinking about and experimenting with ways that computers help people process lines of reasoning. This video will catch you up on the series of thoughts and experiments that led me to HowTruthful, and tell you why I'm excited about it. This is going to be a long video, but if you're interested in how people arrive at truth, it will be worth it.
Ten or 15 years ago I noticed how...
I like that HowTruthful uses the idea of (independent) hierarchical subarguments, since I had the same idea. Have you been able to persuade very many people to pay for it?
My first thought about it was that the true/false scale should have two dimensions, knowledge & probability:
One of the many things I wanted to do on my site was to gather user opinions, and this does that. ✔ I think of opinions as valuable evidence, just not always valuable evidence about the question under discussion (though to the extent people with "high knowledge" really have high knowle...
This is a linkpost for our two recent papers:
This work was produced at Apollo Research in collaboration with Kaarel Hänni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu. Not to be confused with Apollo's recent Sparse Dictionary Learning paper.
A key obstacle to mechanistic interpretability is finding the right representation of neural network internals. Optimally, we would like to derive our features from some high-level principle that holds across different architectures and use cases. At a minimum, we know two things:
I was thinking along similar lines, but eventually dropped it because I felt like the gradients would likely miss something if e.g. a saturated softmax prevents any gradient from going through. I find it interesting that experiments also find that the interaction basis didn't work, and I wonder whether any of the failure here is due to saturated softmaxes.