NicholasKees

Independent AI safety researcher

Comments

Thank you, it's been fixed.

"In terms of LLM architecture, do transformer-based LLMs have the ability to invent new, genuinely useful concepts?"

So I'm not sure how well the word "invent" fits here, but I think it's safe to say LLMs have concepts that we do not.

Recently @Joseph Bloom was showing me Neuronpedia, which catalogues features found in GPT-2 by sparse autoencoders, and there were many features that were semantically coherent, but I couldn't find a word in any of the languages I speak that pointed to these concepts exactly. It felt a little like how human languages often have words that don't translate, and this made us wonder whether we could learn useful abstractions about the world (e.g. concepts we could actually import into English) by identifying the features being used by LLMs.
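
For anyone unfamiliar with the setup: a sparse autoencoder is trained to reconstruct a model's internal activations through a wider, sparsely activating layer, and each latent dimension is treated as a candidate "feature." Here's a minimal sketch of that idea (illustrative hyperparameters and names, not Neuronpedia's actual code):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over transformer residual-stream activations."""

    def __init__(self, d_model: int = 768, d_features: int = 24576):
        super().__init__()
        # Overcomplete dictionary: many more features than activation dimensions.
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps only a sparse set of non-negative feature activations per token.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

# Training minimizes reconstruction error plus an L1 penalty on `features`
# to encourage sparsity; the interpretable "concepts" are the individual
# feature directions that end up firing on semantically coherent inputs.
```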

You might enjoy this post which approaches this topic of "closing the loop," but with an active inference lens: https://www.lesswrong.com/posts/YEioD8YLgxih3ydxP/why-simulator-ais-want-to-be-active-inference-ais 

"A main motivation of this enterprise is to assess whether interventions in the realm of Cooperative AI, that increase collaboration or reduce costly conflict, can seem like an optimal marginal allocation of resources."

After reading the first three paragraphs, I had basically no idea what interventions you were aiming to evaluate. Later in the text, I gather you are talking about coordination between AI singletons, but I still feel like I'm missing something about exactly what problem you are aiming to solve with this. I definitely could have used a longer, more explain-like-I'm-five-level introduction.

That sounds right intuitively. One thing worth noting, though, is that most notes get very few ratings, and most users rate very few notes, so it might be trickier than it sounds. Also, if I were them I might worry about drastic changes in note rankings as a result of switching models. Currently, just as notes can become helpful by reaching a threshold of 0.4, they can lose this status by dropping below 0.39. They may also have to manually pick new thresholds, and maybe redesign the algorithm slightly (since it seems that a lot of this algorithm was built via trial and error rather than from clear principles).
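
(To illustrate the threshold point: the 0.4/0.39 gap acts as a small hysteresis buffer, something like the toy sketch below. The names and statuses are placeholders, not the actual Community Notes code.)

```python
GAIN_THRESHOLD = 0.40   # score a note must reach to earn "Helpful" status
LOSE_THRESHOLD = 0.39   # status is only lost if the score later drops below this

def update_status(current_status: str, note_score: float) -> str:
    """Toy illustration of the hysteresis between the two thresholds."""
    if current_status != "Helpful" and note_score >= GAIN_THRESHOLD:
        return "Helpful"
    if current_status == "Helpful" and note_score < LOSE_THRESHOLD:
        return "Not Helpful"  # placeholder label; the real system has more statuses
    return current_status
```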

"Note: for now, to avoid overfitting on our very small dataset, we only use 1-dimensional factors. We expect to increase this dimensionality as our dataset size grows significantly."


This was the reason given in the documentation.
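
For context, the model the documentation describes predicts each user-note rating from a global intercept, a user intercept, a note intercept, and an interaction between a user factor and a note factor; with 1-dimensional factors that interaction is just a product of two scalars. A rough, illustrative sketch (not the real implementation):

```python
def predict_rating(mu: float, user_intercept: float, note_intercept: float,
                   user_factor: float, note_factor: float) -> float:
    # Global intercept + user intercept + note intercept + factor interaction.
    # With 1-dimensional factors the interaction is a product of two scalars;
    # higher-dimensional factors would turn this into a dot product.
    return mu + user_intercept + note_intercept + user_factor * note_factor
```

Increasing the factor dimensionality would turn that product into a dot product over factor vectors, which seems to be what they expect to do as the dataset grows.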

Thanks for pointing that out. I've added some clarification.

That sounds cool! Though I think I'd be more interested in using this to first visualize and understand current LW dynamics, rather than immediately trying to intervene on them by changing how comments are ranked.

I'm confused by the way people are engaging with this post. That well-functioning and stable democracies need protections against a "tyranny of the majority" is not at all a new idea; this seems like basic common sense. The idea that the American Civil War was precipitated by the South perceiving an end to its balance of power with the North also seems pretty well accepted. Furthermore, there are lots of other things that make democratic systems work well: e.g. a system of laws and conflict resolution, or mechanisms for peaceful transfers of power.
