Hello! This is jacobjacob from the LessWrong / Lightcone team.
This is a meta thread for you to share any thoughts, feelings, feedback, or other stuff about LessWrong that's been on your mind.
Examples of things you might share:
...or anything else!
The point of this thread is to give you an affordance to share anything that's been on your mind, in a place where you know that a team member will be listening.
(We're a small team and have to prioritise what we work on, so I of course don't promise to action everything mentioned here. But I will at least listen...
If it’s worth saying, but not worth its own post, here's a place to put it.
If you are new to LessWrong, here's the place to introduce yourself. Personal stories, anecdotes, or just general comments on how you found us and what you hope to get from the site and community are invited. This is also the place to discuss feature requests and other ideas you have for the site, if you don't want to write a full top-level post.
If you're new to the community, you can start reading the Highlights from the Sequences, a collection of posts about the core ideas of LessWrong.
If you want to explore the community more, I recommend reading the Library, checking recent Curated posts, seeing if there are any meetups in your area, and checking out the Getting Started section of the LessWrong FAQ. If you want to orient to the content on the site, you can also check out the Concepts section.
The “no sandbagging on checkable tasks” hypothesis: With rare exceptions, if a not-wildly-superhuman ML model is capable of doing some task X, and you can check whether it has done X, then you can get it to do X using already-available training techniques (e.g., fine-tuning it using gradient descent).
Borrowing from Shulman, here’s an example of the sort of thing I mean. Suppose that you have a computer that you don’t know how to hack, and that only someone who had hacked it could make a blue banana show up on the screen. You’re wondering whether a given model can hack this...
There are many things I feel like the post authors miss, and I want to share a few things that seem good to communicate.
I'm going to focus on controlling superintelligent AI systems: systems powerful enough to solve alignment (in the CEV sense) completely, or to kill everyone on the planet.
In this post, I'm going to ignore other AI-related sources of x-risk, such as AI-enabled bioterrorism, and I'm not commenting on everything that seems important to comment on.
I'm also not going to point at all the slippery claims that I think can make the reader generalize incorrectly, as it'd be nitpicky and also not worth the time (examples of what I'd skip: I couldn't find evidence that GPT-4 has undergone any supervised fine-tuning; RLHF shapes chatbots' brains into...
In May and June of 2023, I (Akash) had about 50-70 meetings about AI risks with congressional staffers. I had been meaning to write a post reflecting on the experience and some of my takeaways, and I figured it could be a good topic for a LessWrong dialogue. I saw that hath had offered to do LW dialogues with folks, and I reached out.
In this dialogue, we discuss how I decided to chat with staffers, my initial observations in DC, some context about how Congressional offices work, what my meetings looked like, lessons I learned, and some miscellaneous takes about my experience.
This is the fourth post in my series on Anthropics. The previous one is Anthropical probabilities are fully explained by difference in possible outcomes.
If there is nothing special about anthropics, if it’s just about correctly applying standard probability theory, why do we keep encountering anthropical paradoxes instead of general probability theory paradoxes? Part of the answer is that people tend to be worse at applying probability theory in some cases than in others.
But most importantly, the whole premise is wrong. We do encounter paradoxes of probability theory all the time. We are just not paying enough attention to them, and occasionally attribute them to anthropics.
As an example, let’s investigate the Updateless Dilemma, introduced by Eliezer Yudkowsky in 2009.
Let us start with a
Value learning is a proposed method for incorporating human values in an AGI. It involves the creation of an artificial learner whose actions consider many possible sets of values and preferences, weighted by their likelihood. Value learning could prevent an AGI from having goals detrimental to human values, hence helping in the creation of Friendly AI.
Although there are many ways to incorporate human values in an AGI (e.g., Coherent Extrapolated Volition, Coherent Aggregated Volition, and Coherent Blended Volition), this method is directly mentioned and developed in Daniel Dewey’s paper ‘Learning What to Value’. Like most authors, he assumes that human goals would not naturally occur in an artificial agent and would have to be deliberately built into it.
First, Dewey argues against the simple use of reinforcement learning to solve this problem, on the basis that it leads to the maximization of specific rewards that can diverge from value maximization. For example, even if we engineer the agent so that the rewards it maximizes also maximize human values, the agent could alter its environment to produce those same rewards more easily, without the trouble of also maximizing human values (e.g., if the reward were human happiness, it could alter the human mind so that it became happy with anything).
To solve these problems, Dewey proposes a utility function maximizer that considers all possible utility functions, weighted by their probabilities: "[W]e propose uncertainty over utility functions. Instead of providing an agent one utility function up front, we provide an agent with a pool of possible utility functions and a probability distribution P such that each utility function can be assigned probability P(U | yx_m) given a particular interaction history [yx_m]. An agent can then calculate an expected value over possible utility functions given a particular interaction history."
He concludes that although this method solves many of the mentioned problems, it still leaves many open questions. However, it should provide a direction for future work.
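The expected-value calculation over a pool of utility functions, as described in the quoted passage, can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not Dewey's formalism: the names `expected_value`, `choose_action`, `UTILITIES`, and `fixed_posterior`, the two example utility functions, and the probabilities 0.4 and 0.6 are all hypothetical. It also scores only the immediately resulting history rather than taking an expectation over future interaction histories.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical types for illustration; none of these names come from the paper.
History = Sequence[str]
Utility = Callable[[History], float]
Posterior = Callable[[History], Dict[str, float]]


def expected_value(history: History,
                   utilities: Dict[str, Utility],
                   posterior: Posterior) -> float:
    """Average each candidate utility's score, weighted by P(U | history)."""
    probs = posterior(history)
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("posterior must carry positive probability mass")
    return sum((probs[name] / total) * utilities[name](history)
               for name in utilities)


def choose_action(history: History,
                  actions: List[str],
                  utilities: Dict[str, Utility],
                  posterior: Posterior) -> str:
    """Pick the action whose resulting history has the highest expected value."""
    return max(actions,
               key=lambda a: expected_value(list(history) + [a],
                                            utilities, posterior))


# Toy pool of utility functions (assumed): one rewards any reported happiness,
# including happiness produced by rewiring minds; the other penalises rewiring.
UTILITIES: Dict[str, Utility] = {
    "reported_happiness": lambda h: 1.0 * h.count("help") + 2.0 * h.count("rewire"),
    "genuine_wellbeing": lambda h: 1.0 * h.count("help") - 5.0 * h.count("rewire"),
}


def fixed_posterior(history: History) -> Dict[str, float]:
    """Assumed probabilities; a real learner would update these from the history."""
    return {"reported_happiness": 0.4, "genuine_wellbeing": 0.6}


if __name__ == "__main__":
    print(choose_action([], ["help", "rewire"], UTILITIES, fixed_posterior))
```

In this toy setup, an agent maximizing only `reported_happiness` would prefer `rewire`, while the agent that stays uncertain over both utility functions prefers `help`, because the probability-weighted penalty from `genuine_wellbeing` outweighs the extra reward.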