learn an implicit rule like "if I can control it by an act of will, it is me"

This was empirically demonstrated to be possible in the paper "Curiosity-driven Exploration by Self-supervised Prediction" by Pathak et al.:

We formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model.

It probably could be extended to learn "other" and the "boundary between self and other" in a similar way.

I implemented a version of it myself years ago, and it worked. I can only imagine what will happen when someone redoes some of these old RL algos with LLMs providing the world model.
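For concreteness, here is a minimal sketch of the curiosity signal described in the quote: the intrinsic reward is the forward model's squared prediction error in feature space, scaled by a constant. The function name, the `eta` parameter, and the toy feature vectors are assumptions for illustration; in the paper the features come from a learned inverse dynamics model, which is not reproduced here.

```python
def intrinsic_reward(predicted_next_features, actual_next_features, eta=1.0):
    """Curiosity bonus: half of eta times the squared prediction error.

    Both arguments are equal-length sequences of floats. In a real agent they
    would be the forward model's prediction and the encoder's output for the
    next state; here they are plain lists (an assumption for illustration).
    """
    squared_error = sum(
        (p - a) ** 2
        for p, a in zip(predicted_next_features, actual_next_features)
    )
    return 0.5 * eta * squared_error


# A well-predicted transition earns little reward; a surprising one earns more.
familiar = intrinsic_reward([0.9, 0.1], [1.0, 0.0])
surprising = intrinsic_reward([0.9, 0.1], [0.0, 1.0])
```

The agent is pushed toward transitions it cannot yet predict, and the bonus shrinks as the forward model improves.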

What we really want with interpretability is high accuracy, out of distribution, scaling to large models. You got very high accuracy... but I have no context to say if this is good or bad. What could a naïve baseline get? And what do SAEs get? It would also be nice to see an out-of-distribution set, because getting 100% on your test set suggests that it's fully within the training distribution (or that your VQ-VAE worked perfectly).

I tried something similar but only got half as far as you. Still, my code may be of interest. I wanted to know if it would help with out-of-distribution lie detection, but didn't get great results. I was using a very hard setup where no methods work well.

I think VQ-VAE is a promising approach because it's more scalable than SAEs, which have 8 times the parameters of the model they are interpreting. Also, your idea of using a decision tree on the tokenised space makes a lot of sense given the discrete latent space!
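To illustrate why a discrete latent space suits tree-based models, here is a minimal sketch of the quantization step alone: each activation vector is replaced by the index of its nearest codebook entry, giving integer tokens that a decision tree can split on. The `quantize` helper, the codebook, and the activations are made-up values for illustration, not anything from the post, and a real VQ-VAE would learn the codebook during training.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector`.

    Distance is squared Euclidean. `codebook` is a list of vectors with the
    same length as `vector`.
    """
    def squared_distance(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))

    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


# Hypothetical 2-D codebook with three entries, and three activations to tokenise.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
activations = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]
tokens = [quantize(a, codebook) for a in activations]
```

Each activation becomes a single categorical token, so downstream rules can take the form "token at position i equals k".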

In that way it's similar to charter cities. A cheaper intermediate stage could be online organizations like World of Warcraft gaming clans, or internet forums, or project overviews. They are not quite the same, but they are cheap.

Yeah, it does seem tricky. OpenAI recently tried a unique governance structure, and it is turning out to be unpredictable and might be costly in terms of legal fees and malfunction.

That's true. But just because people aren't motivated doesn't mean we shouldn't try. It's possible to create incentives with subsidies, direct payments, etc.

My reaction to the Churchill quote is: why don't we try more forms of government?

Of course we can't start with animal trials, but we can try with the American Antarctic base or a village then move up to city trials, state trials, and so on.

"But Sir, I just need your order."

It's also kind of a negative place to put your attention. People probably prefer not to think about it.

At best, it's a boring chore; at worst, it's a negative cognitive hazard. Best to minimize time spent on things that will make you feel angry, mistreated, unfairly treated, etc. Especially if you might get trapped in those states, making everything turn out worse, and making day-to-day interactions a struggle rife with negative associations.

In short, it's vibe.

This topic is a little stressful to read about, because in the course of doing our jobs, we risk enacting low or high status behaviors. And these often lead to conflicts with status-obsessed people.

For instance, deferring to a more knowledgeable colleague might set a precedent that's hard to change, while assertively leading an initiative can sometimes spark disputes over status.

It doesn't help that we rationalists are part of a very direct ask/tell culture, so we can often conflict with most other cultures.

But being rationalists, we should not just theorycel but consider helpful strategies. Here are a few:

  • Reevaluate your need for status: We all feel like we need status, but we often do not. Mostly it's expensive and fleeting. It's expensive because all humans crave it, and it's fleeting because it disappears when you change city/job/relationship. It also decays with the passing of time so it has a carry cost.
  • Buy low/sell high: You do need enough to get by. Enough so that people will listen to you when it's important and not mistreat you. You want to stock up when it's cheap and needed, and you want to give it up when it's expensive and unneeded. It does require maintenance, and it is not transferable, so this limits the value of stocking up. But within a friend group, team, or hobby group you could build up status when it's cheap (during the founding days for example, or during a lull in popularity, or a crisis).
  • Stake a small defensible territory: Status can be expensive. If you need it, you could aim to defend a small area, like a technical niche where you have some natural advantage and can make a difference.
  • Leverage your strengths: It's easier to behave like the boss when you are in fact the boss. And it's easier to maintain some minimum status when you have the credentials/friends/expertise to back it up.
  • Strategic deference: If you're employed by someone, they likely expect a degree of deference. They may value certain forms of deference (such as letting them dominate conversations) while being indifferent to others (like genuine agreement). Or it may be the other way round. This distinction allows for strategic interaction. I'm reminded of Bryan Caplan's post on non-conformism:
    • Don’t be an absolutist non-conformist. Conforming in small ways often gives you the opportunity to non-conform in big ways. Being deferential to your boss, for example, opens up a world of possibilities.

  • Adjust your communication to the status hierarchy you are in, ideally in the cheapest ways possible. It's cheap if it doesn't take much effort, and doesn't sacrifice your goals. We rationalists tend to be seen as more blunt, disagreeable, argumentative, pedantic, and complicated than other cultures, so you should almost always compensate somewhat.
  • Earn weirdness points: Establish a reputation for being unorthodox in ways that are tolerated. This approach helps ensure that your actions are less likely to be interpreted purely through a status lens.
  • Allies: One thing this essay seemed to miss is the status benefit of having many friends and allies. This is a positive-sum status game, so it's good to participate!

Anyone got other ideas?

Unsurprisingly, you are not the first to think of this. Cicero wrote on this about two thousand years ago in How to Be a Friend!

https://time.com/5361671/how-to-be-a-friend-cicero/
