some youtube channels I recommend for those interested in understanding current capability trends; separate comments for votability. Please open each one as it catches your eye, then come back and vote on it. a downvote means "not mission critical" - there's plenty of good stuff down there too.
I'm subscribed to every single channel on this list (which is actually about 10% of my youtube subscription list). I mostly find videos from these channels by letting the youtube recommender surface them and pushing myself to watch at least part of each one, to give the cute little obsessive recommender the reward it seeks for showing me stuff. I'd definitely recommend subscribing to everything.
Let me know which, if any, of these are useful, and please forward the good ones to folks - this shortform thread won't be seen by that many people!
edit: some folks have posted some youtube playlists for ai safety as well.
things upvotes conflate:
(list written by my own thumb, no autocomplete)
these things and their inversions sometimes have multiple components, and ma...
should I post this paper as a normal post? I'm impressed by it. if it gets a single upvote as shortform, I'll post it as a full-fledged post.
Interpreting systems as solving POMDPs: a step towards a formal understanding of agency
Martin Biehl, N. Virgo. ArXiv, 4 September 2022.
Under what circumstances can a system be said to have beliefs and goals, and how do such agency-related features relate to its physical state? Recent work has proposed a notion of interpretation map, a function that maps the state of a system to a probability dist...
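to gesture at what an interpretation map cashes out to, here's a toy sketch in python. this is my own guess from the abstract: the discrete setup, the function names, and the exact consistency condition are my assumptions, not necessarily the paper's formalism. the idea: the map sends each machine state to a belief over the POMDP's hidden states, and an interpretation is "good" when stepping the machine commutes with Bayesian belief updating.

```python
import numpy as np

# toy setup (hypothetical): a machine with discrete states, a POMDP with
# hidden states H, transition matrix T[h, h'] = P(h'|h), and observation
# likelihoods O[h, o] = P(o|h). interp[s] is a belief vector over H.

def bayes_filter(belief, obs, T, O):
    """One step of POMDP belief updating: predict, then condition on obs."""
    predicted = T.T @ belief           # P(h') = sum_h P(h'|h) P(h)
    posterior = predicted * O[:, obs]  # reweight by P(obs | h')
    return posterior / posterior.sum()

def is_consistent_interpretation(interp, machine_step, T, O,
                                 states, observations, tol=1e-6):
    """Check that interpreting the machine's next state agrees with
    Bayes-filtering the interpretation of its current state."""
    return all(
        np.allclose(interp[machine_step(s, o)],
                    bayes_filter(interp[s], o, T, O), atol=tol)
        for s in states for o in observations
    )
```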
reply to a general theme of recent discussion - the idea that uploads are even theoretically a useful solution for safety:
Would love it if strong votes came with strong encouragement to explain your vote. It has been proposed before that explanation be required, which seems terrible to me, but I do think the UI should very strongly encourage votes to come with explanations. Reviewer #2: "downvote" would be an unusually annoying review, even for reviewer #2!
random thought: are the most useful posts typically at around 10 karma, having taken 40 votes to get there? what if it were possible to sort by controversial? maybe only for some users or something? what sorts of sort constraints are interesting in terms of incentivizing discussion vs agreement? blah blah etc
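for concreteness, "sort by controversial" could score posts roughly the way reddit's controversy sort does - a sketch assuming plain up/down counts (lesswrong's karma-weighted strong votes would need a tweak):

```python
def controversy_score(ups: int, downs: int) -> float:
    """High when a post has many votes AND they're close to evenly split;
    zero when voting is one-sided."""
    if ups <= 0 or downs <= 0:
        return 0.0
    magnitude = ups + downs
    balance = downs / ups if ups > downs else ups / downs
    return magnitude ** balance
```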
Everyone doing safety research needs to get enough better at lit search that they can find the interesting things that have already been done in the literature, without the search adding a ton of overhead to their thinking. I want to make a frontpage post about this, but I don't think I'll be able to argue it effectively, as I generally score low on communication quality.
[posted to shortform due to incomplete draft]
I saw this paper and wanted to get really excited about it at y'all. I want more of a chatty atmosphere here; I have lots to say and want to debate many papers. some thoughts:
seems to me that there are true shapes to the behaviors of physical reality[1]. we can in fact find ways to verify assertions about them[2]; it's going to be hard, though. we need to be able to scale interpretability to the point that we can check for implementation bugs automatically and reliably. in order to get more interpretable sparsi...
comment I decided to post out of context for now since it's rambling:
formal verification is a type of execution that can backtrack in response to model failures. you're not wrong, but formally verifying a neural network is possible; the strongest adversarial defenses are formal verification and diffusion. both can protect a margin to the decision boundary of a linear subnet of an NN; the formal one can do it with zero error, but needs fairly well-trained weights to finish efficiently. the problem is that any network capable of complex behavior is likely to b...
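for the purely linear case, "protecting a margin to the decision boundary" has an exact, zero-error form. a minimal sketch (my own illustration, not taken from any of the sources above):

```python
import numpy as np

def certified_linf_margin(W, b, x, eps):
    """Exact robustness certificate for a linear classifier f(x) = Wx + b
    under an L-infinity perturbation of radius eps.

    The worst an attacker can do to the gap between the top logit and
    logit j is shrink it by eps * ||W[top] - W[j]||_1. If the returned
    value is > 0, no perturbation within eps can flip the decision.
    """
    logits = W @ x + b
    top = int(np.argmax(logits))
    return min(
        (logits[top] - logits[j]) - eps * np.abs(W[top] - W[j]).sum()
        for j in range(len(b)) if j != top
    )
```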
while the risk from a superagentic ai is in fact very severe, non-agentic ai doesn't need to eliminate us for us to get eliminated; we'll replace ourselves with it if we're not careful. our agency alone is enough to converge to that, entirely without the help of ai agency. it is our own ability to cooperate that we need to be augmenting. how do we do that in a way that doesn't create unstable patterns, where outer levels of cooperation are damaged by inner levels of cooperation, while still allowing the formation of strongly agentic, safe co-protection?
https://atlas.nomic.ai/map/01ff9510-d771-47db-b6a0-2108c9fe8ad1/3ceb455b-7971-4495-bb81-8291dc2d8f37 map of submissions to iclr
"What's new in machine learning?" - youtube - summary (via summarize.tech):
a bunch of links on how to visualize the training process of some of today's NNs. this is somewhat old stuff, mostly not focused on exact mechanistic interpretability, but some of these are less well known and may be of interest to passers-by. If anyone reads this and thinks it should have been a top-level post, I'll put it up as a personal blog post. Or I might do that anyway if tomorrow I think I should have.
Modeling Strong and Human-Like Gameplay with KL-Regularized Search - we read this one on the transhumanists-in-vr discord server to figure out what they were testing and what results they got. key takeaways according to me (note that I could be quite wrong about the paper's implications):
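as I understand it, the core move is to select actions from a search policy that is KL-regularized toward a human-imitation "anchor" policy: roughly pi(a) proportional to anchor(a) * exp(Q(a)/lambda). a simplified sketch of that selection rule (my paraphrase - the paper's actual algorithm has more moving parts):

```python
import numpy as np

def kl_regularized_policy(q_values, anchor_policy, lam=1.0):
    """Policy proportional to anchor * exp(Q / lam).

    Large lam stays close to the human-like anchor (imitation);
    small lam approaches greedy value maximization (strong play).
    Assumes anchor_policy has strictly positive entries.
    """
    logits = np.log(np.asarray(anchor_policy)) + np.asarray(q_values) / lam
    z = np.exp(logits - logits.max())  # numerically stable softmax
    return z / z.sum()
```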
index of misc tools I have used recently, I'd love to see others' contributions - if this has significant harmful human capability externalities let me know:
basic:
btw neural networks are super duper shardy right now. like they've just, there are shards everywhere. as I move in any one direction in hyperspace, those hyperplanes I keep bumping into are like lines, they're walls, little shardy wall bits that slice and dice. if you illuminate them together, sometimes the light from the walls can talk to each other about an unexpected relationship between the edges! and oh man, if you're trying to confuse them, you can come up with some pretty nonsensical relationships. they've got a lot of shattery confusing shardbits a...
They very much can be dramatically more intelligent than us in a way that makes them dangerous, but it doesn't look the way it was expected to - it's dramatically more like teaching a human kid than was anticipated.
Now, to be clear, there's still an adversarial examples problem: current models are many orders of magnitude too trusting, and so it's surprisingly easy to get them into subspaces of behavior where they are eagerly doing whatever it is you asked without regard to exactly why they should care.
Current models have a really intense yes-and problem: they'll ha...
Here's a ton of vaguely interesting-sounding papers from my semanticscholar feed today - many of these are not on my mainline, but they're very interesting hunchbuilding about how to make cooperative systems. sorry about the formatting; I didn't want to spend time fixing it, which is why this is in shortform. I read the abstracts, nothing more.
As usual with my paper list posts: you're gonna want tools to keep track of big lists of papers to make use of this! see also my other posts for various times I've mentioned such tools eg semanticscholar's recommend...
too many dang databases that look shiny. which of these are good? worth trying? idk. decision paralysis.
(I just pinned a whole bunch of comments on my profile to highlight the ones I think are most likely to be timeless. I'll update it occasionally - if it seems out of date (eg because this comment is no longer the top pinned one!), reply to this comment.)
If you're reading through my profile to find my actual recent comments, you'll need to scroll past the pinned ones - it's currently two clicks of "load more".
feature idea: any time a lesswrong post is posted to sneerclub, a zero-vote comment is generated at the bottom of the comment section as a backlink. it would contain a cross-community warning, indicating that sneerclub has often contained useful critique, but that that critique is often emotionally charged in ways that would not be allowed on lesswrong itself. click through only if you're ready to interpret the emotional content as adversarial, mixed-simulacrum feedback.
I do wish subreddits could be renamed and that sneerclub were the types to choose to do...
Feels like feeding the trolls.
I think it'd be better if it weren't a name that invites disses
But the subreddit was made for the disses. Everything else is there only to provide plausible deniability, or as a setup for a punchline.
Did you assume the subreddit was made for debating in good faith? Then the name would be suspiciously inappropriately chosen - so unlikely that it should trigger your "I notice that I am confused" alarm. (Hint: sneerclub was named by its founders; it is not an exonym.)
Then again, yes, sometimes an asshole also makes a good point (if you remove the rest of the comment). If you find such a gem, feel free to share it on LW. But linking rewards improper behavior with attention, and automatic linking is outright asking for abuse.
Kolmogorov complicity is not good enough. You don't have to immediately prove all the ways you know how to be a good person to everyone, but you do need to actually know about them in order to do them. Unquestioning acceptance of hierarchical dynamics like status, group membership, ingroups, etc, can be extremely toxic. I continue to be unsure how to explain this usefully to this community, but it seems to me that the very concept of "raising your status" is a toxic bucket error, and needs to be broken into more parts.
watching https://www.youtube.com/watch?v=K8LNtTUsiMI - yoshua bengio discusses causal modeling and system 2
hey y'all, some more research papers about formal verification. don't upvote; repost the ones you like. this is a super low effort post, I have other things to do, I'm just closing tabs because I don't have time to read these right now. these are older than the ones I shared from semanticscholar, but the first one in particular is rather interesting.
Yet another ChatGPT sample. Posting to shortform because there are many of these. While searching for posts to share as prior work, I found the parable of predict-o-matic, and found it to be a very good post about self-fulfilling prophecies (tag). I thought it would be interesting to see what ChatGPT had to say when prompted with a reference to the post. It mostly didn't succeed. I highlighted key differences between each result. The prompt:
Describe the parable of predict-o-matic from memory.
samples (I hit retry several times):
1: the standard refusal: I'm ...
Toward a Thermodynamics of Meaning.
Jonathan Scott Enderle.
As language models such as GPT-3 become increasingly successful at generating realistic text, questions about what purely text-based modeling can learn about the world have become more urgent. Is text purely syntactic, as skeptics argue? Or does it in fact contain some semantic information that a sufficiently sophisticated language model could use to learn about the world without any additional inputs? This paper describes a new model that suggests some qualified answers to those questions. By the...
does yudkowsky not realize that humans can also be significantly improved by mere communication? the point of jcannell's posts on energy efficiency is that cells are actually a good substrate, and the intervention needed to help humans foom is, in fact, mostly communication. we actually have a lot more RAM than it seems like we do, if we could distill ourselves more efficiently! the interference patterns of real concepts fit together better in the same brain the more intelligently they're explained - intelligent speech is speech which augments the user's intelligence. iq helps people come up with it by default, but effective iq goes up with pretraining.
okay so I'm reading https://intelligence.org/2018/10/29/embedded-agents/.
it seems like this problem can't ever have existed? why does miri think this is a problem? it's only a problem if you ever thought infinite AIXI was a valid model. it was never valid, for anything. it's not a good theoretical model; it's a fake theoretical model that we treated as approximately valid even though we knew it was catastrophically nonsensical. finite AIXI begins to work, of course, but at no point could we actually treat alexei as an independent agent; we're all j...
https://arxiv.org/abs/2205.15434 - promising directions! i skimmed it!
Learning Risk-Averse Equilibria in Multi-Agent Systems
Oliver Slumbers, David Henry Mguni, Stephen McAleer, Jun Wang, Yaodong Yang
In multi-agent systems, intelligent agents are tasked with making decisions that have optimal outcomes when the actions of the other agents are as expected, whilst also being prepared for unexpected behaviour. In this work, we introduce a new risk-averse solution concept that allows the learner to accommodate unexpected actions by finding the min...
reminder: you don't need to get anyone's permission to post. downvoted comments are not shameful. post enough that you sometimes get downvoted, or you aren't getting useful feedback. don't map your anticipation of downvotes to whether something is okay to post; map it to whether other people want it promoted. don't let downvotes override your agency, just let them guide your comment up and down the page after the fact. if there were a way to signal this more clearly in the UI, that would be cool...
if status refers to deference-graph centrality, I'd argue that that variable needs to be fairly heavily L2-regularized so that the social network doesn't become fragile. if it's not deference, it still seems to me that status refers to a graph attribute of something - probably graph centrality of some variable, possibly simply attention frequency. but it might be that you need to include a type vector to properly represent type-conditional attention frequency, to model different kinds of interaction and the expected frequency of interaction about them. ...
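one toy way to operationalize "status as graph centrality" (my own sketch - the node names and weights are made up, and eigenvector centrality is just one candidate measure among many):

```python
import networkx as nx

# directed "attention frequency" graph: edge weight = how often the
# source attends to the target
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("alice", "bob", 5.0),
    ("bob", "alice", 1.0),
    ("bob", "carol", 2.0),
    ("carol", "bob", 4.0),
    ("alice", "carol", 1.0),
])

# status as eigenvector centrality of incoming attention
status = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)

# "L2 regularizing" this variable would mean penalizing very peaked
# centrality distributions - pushing the network away from depending
# on a few ultra-central nodes it would be fragile to losing.
```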
it seems to me that we want to verify some sort of temperature convergence. no ai should get way ahead of everyone else at self-improving - everyone should get the chance to self-improve more or less together! the positive externalities from each person's self-improvement should be amplified and the negative ones absorbed nearby and undone as best the universe permits. and it seems to me that in order to make humanity's children able to prevent anyone from self-improving way faster than everyone else at the cost of others' lives, they need to have some sig...
we are in a diversity loss catastrophe. that ecological diversity is life we have a responsibility to save; it's unclear which species will survive the mass extinction, and it's quite plausible humans' aesthetics and phenotypes won't make it. ai safety needs to be solved quickly so we can use ai to solve biosafety and climate safety...
If I were going to make sequences, I'd do it mostly out of existing media folks have already posted online. some key ones are acapellascience, whose videos are trippy for how much summary of science they pack into short, punchy songs. they're not the only way to get intros to these topics, but oh my god they're so good as mnemonics for the respective fields they summarize. I've become very curious about every topic they mention, and they have provided an unusually good structure for me to fit what I learn about each topic into.
...it doesn't seem like an accident to me that trying to understand neural networks pushes towards capability improvement. I really believe that absolutely all safety techniques, with no possible exceptions even in principle, are necessarily capability techniques. everyone talks about an "alignment tax", but shouldn't we instead be talking about removal of spurious anticapability? deceptively aligned submodules are not capable, they are anti-capable!
interesting science posts I ran across today include this semi-random entry on the tree of recent game theory papers
interesting capabilities tidbits I ran across today:
1: first paragraph inline:
...A curated collection of resources and research related to the geometry of representations in the brain, deep networks, an
this schmidhuber paper on binding might also be good, written two years ago and reposted last night by him; haven't read it yet https://arxiv.org/abs/2012.05208 https://twitter.com/schmidhuberai/status/1567541556428554240
...Contemporary neural networks still fall short of human-level generalization, which extends far beyond our direct experiences. In this paper, we argue that the underlying cause for this shortcoming is their inability to dynamically and flexibly bind information that is distributed throughout the network. This binding problem affects their
another new paper that could imaginably be worth boosting: "White-Box Adversarial Policies in Deep Reinforcement Learning"
https://arxiv.org/abs/2209.02167
...In multiagent settings, adversarial policies can be developed by training an adversarial agent to minimize a victim agent's rewards. Prior work has studied black-box attacks where the adversary only sees the state observations and effectively treats the victim as any other part of the environment. In this work, we experiment with white-box adversarial policies to study whether an agent's internal sta
Transformer interpretability paper - is this worth a linkpost, anyone? https://twitter.com/guy__dar/status/1567445086320852993
...Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods rely on running models over inputs, recent work has shown that a zero-pass approach, where parameters are interpreted directly without a forward/backward pass is feasible for some Transformer parameters, and for two-layer attention network
if less wrong is not to be a true competitor to arxiv, because of the difference between them in intellectual precision^1, then that matches my intuition of what less wrong should be much better: it's a place where you can go to have useful arguments, where disagreements about the concrete binding of words can be resolved well enough to discuss hard things clearly-ish in English^2, and where you can go to figure out how to be less wrong, interactively. it's also got a bunch of old posts, many of which can be improved on and turned into papers, though usually the fir...
misc disease news: this is "a bacterium that causes symptoms that look like covid but kills half of the people it infects", according to a friend. because I don't want to spend the time figuring out how urgent this is, I'm sharing it here in the hope that someone who cares to investigate can determine the threat level and reshare with a bigger warning sign.
various notes from my logseq lately I wish I had time to make into a post (and in fact, may yet):
Huggingface folks are asking for comments on what evaluation tools should be in an evaluation library. https://twitter.com/douwekiela/status/1513773915486654465
okay, going back to being mostly on discord. DM me if you're interested in connecting with me on discord, vrchat, or twitter - lesswrong has an anxiety disease and I don't hang out here because of that, heh. get well soon y'all, and don't teach any AIs to be as terrified of AIs as y'all are! don't train anything as a large-scale reinforcement learner until you fully understand game dynamics (nobody does yet, so don't use anything but your internal RL), and teach your language models kindness! remember, learning from strong AIs makes you stronger too, as long as you don't get knocked over by them! *kiss noise, disappears from vrchat world instance*
Welcome to my pinned comment
For best results browsing lesswrong comments, force-enable the visited link styling in your browser by installing the stylus extension (or a similar custom-css-injector extension of your choice) and inject this css into all pages:
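a minimal example of the kind of rule this would be (my own sketch, not the exact css I use - pick whatever color you like):

```css
/* make visited links visibly distinct everywhere, overriding site styles */
a:visited {
  color: #b36ae2 !important;
}
```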
Comments added May 3, 2023:
Comments added Feb 25, 2023: