Project ideas: Epistemics

Lukas Finnveden

This is part of a series of lists of projects. The unifying theme is that the projects are not targeted at solving alignment or engineered pandemics but still targeted at worlds where transformative AI is coming in the next 10 years or so. See here for the introductory post.

If AI capabilities keep improving, AI could soon play a huge role in our epistemic landscape. I think we have an opportunity to affect how it’s used: increasing the probability that we get great epistemic assistance and decreasing the extent to which AI is used to persuade people of false beliefs.

Before I start listing projects, I’ll discuss:

Why AI could matter a lot for epistemics. (Both positively and negatively.)
Why working on this could be urgent. (And not something we should just defer to the future.) Here, I’ll separately discuss:
- That it’s important for epistemics to be great in the near term (and not just in the long run) to help us deal with all the tricky issues that will arise as AI changes the world.
- That there may be path-dependencies that affect humanity’s long-run epistemics.

Why AI matters for epistemics

On the positive side, here are three ways AI could substantially increase our ability to learn and agree on what’s true.

Truth-seeking motivations. We could be far more confident that AI systems are motivated to learn and honestly report what’s true than is typical for humans. (Though in some cases, this will require significant progress on alignment.) Such confidence would make it much easier and more reliable for people to outsource investigations of difficult questions.
Cheaper and more competent investigations. Advanced AI would make high-quality cognitive labor much cheaper, thereby enabling much more thorough and detailed investigations of important topics. Today, society has some ability to converge on questions with overwhelming evidence. AI could generate such overwhelming evidence for much more difficult topics.
Iteration and validation. It will be much easier to control what sort of information AI has and hasn’t seen. (Compared to the difficulty of controlling what information humans have and haven’t seen.) This will allow us to run systematic experiments on whether AIs are good at inferring the right answers to questions that they’ve never seen the answer to.
- For one, this will give supporting evidence to the above two bullet points. If AI systems systematically get the right answer to previously unseen questions, that indicates that they are indeed honestly reporting what’s true without significant bias and that their extensive investigations are good at guiding them toward the truth.
- In addition, on questions where overwhelming evidence isn’t available, it may let us experimentally establish what intuitions and heuristics are best at predicting the right answer.^[1]

On the negative side, here are three ways AI could reduce the degree to which people have accurate beliefs.

Super-human persuasion. If AI capabilities keep increasing, I expect AI to become significantly better than humans at persuasion.
- Notably, on top of high general cognitive capabilities, AI could have vastly more experience with conversation and persuasion than any human has ever had. (Via being deployed to speak with people across the world and being trained on all that data.)
- With very high persuasion capabilities, people’s beliefs might (at least directionally) depend less on what’s true and more on what AI systems’ controllers want people to believe.
Possibility of lock-in. I think it’s likely that people will adopt AI personal assistants for a great number of tasks, including helping them select and filter the information they get exposed to. While this could be crucial for defending against persuasion attempts from outsiders, it also poses dangers of its own.^[2]
- In particular, some people may ask their assistants to protect them from being persuaded of views that they currently consider reprehensible (but which may be correct), thereby permanently preventing them from changing their minds.
- Aside from people voluntarily choosing this, there’s also a risk that certain communities would pressure their members to adopt a filtering policy with this effect. (Even if that’s not the stated aim of the policy.)
Reduced incentives and selection for good epistemic practices. Up until today, (groups of) people’s ability to acquire influence has been partly dependent on them having accurate beliefs about the world.^[3] But if AI becomes capable enough that humans can hand over all decision-making to AI, then people’s own beliefs and epistemic practices could deteriorate much further without reducing their ability to gain and maintain influence.
- It’s unclear how important “incentives/selection for good epistemic practices” has been in the past. But I currently find it hard to rule out that it has been important.^[4]

Why working on this could be urgent

I think there are two reasons why it could be urgent to improve this situation. Firstly, I think that excellent AI advice would be greatly helpful for many decisions that we’re facing soon. Secondly, I think that there may be important path dependencies that we can influence.

One pressing issue for which I want great AI advice soon is misalignment risk. Very few people want misaligned AI to violently seize power. I think most x-risk from misaligned AI comes from future people making a mistake: underestimating risks that turn out to be real. Accordingly, if there were excellent and trusted analysis/advice on the likelihood of misalignment, I think that would significantly reduce x-risk from AI takeover.^[5] AI could also help develop policy solutions, such as treaties and monitoring systems that could reduce risks from AI takeover.

Aside from misalignment risks, there are many issues for which I’d value AI advice within this series of posts that you’re currently reading. I want advice on how to deal with ethical dilemmas around Digital minds. I want advice on how nations could coordinate an intelligence explosion. I want advice on policy issues like what we should do if destructive technology is cheap. Etc.

Taking a step back, one of the best arguments for why we shouldn’t work on those questions today is that AI can help us solve them later. (I previously wrote about this here.) But it’s unclear whether AI’s ability to analyze these questions will, by default, come before or after AI has the capabilities that cause the corresponding problems.^[6] Accordingly, it seems great to differentially accelerate AI’s ability to help us deal with those problems.

What about path dependencies?

One thing I already mentioned above is the possibility of people locking in poorly considered views. In general, if the epistemic landscape gets sufficiently bad, then it might not be self-correcting. People may no longer have the good judgment to switch to better solutions, instead preferring to stick to their current incorrect views.^[7]

This goes doubly for questions where there’s little in the way of objective standards of correctness. I think this applies to multiple areas of philosophy, including ethics. Although I have opinions about what ethical deliberation processes I would and wouldn’t trust, I don’t necessarily think that someone who had gone down a bad deliberative path would share those opinions.

Another source of path dependency could be reputation. Our current epistemic landscape depends a lot on trust and reputation, which can take a long time to gather. So if you want a certain type of epistemic institution to be trusted many years from now, it could be important to immediately get it running and start establishing a track record. And if you fill a niche before anyone else, your longer history may give you a semi-permanent advantage over competing alternatives.

A third (somewhat more subtle) source of path dependency could be a veil of ignorance. Today, there are many areas of controversy where we don’t yet know which side AI-based methods will come down on. Behind this veil of ignorance, it may be possible to convince people that certain AI-based methods are reliable and that it would be in everyone’s best interest to agree to give significant weight to those methods. But this may no longer be feasible after it’s clear what those AI-based methods are saying. In particular, people would be incentivized to deny AI’s reliability on questions where they have pre-existing opinions that they don’t want to give up.

Categories of projects

Let’s get into some more specific projects.

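I’ve divided the projects below into the following sub-categories:

Differential technology development
Get AI to be used & (appropriately) trusted
Develop & advocate for legislation against bad persuasion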

(Though, in practice, these are not cleanly separated projects. In particular, differential technology development is a huge part of getting the right type of AI to be used & trusted. So the first and second categories are strongly related. In addition, developing & advocating for anti-persuasion legislation could contribute a lot towards getting non-persuasion uses of AI to be used & appropriately trusted. So the second and the third category are strongly related.)

Differential technology development [ML] [Forecasting] [Philosophical/conceptual]

This category is about differentially accelerating AI capabilities that let humanity get correct answers to especially important questions (compared to AI capabilities that e.g. lead to the innovation of new risky technologies, including by speeding up AI R&D itself).

Doing this could (i) lead to people having better advice earlier in the singularity and (ii) enable various time-sensitive attempts at “get AI to be used & trusted” mentioned below.

Note: I think it’s often helpful to distinguish between interventions that improve AI models’ (possibly latent) capabilities vs. interventions that make us better at eliciting models’ latent capabilities and using them for our own ends.^[8] I think the best interventions below will be of the latter kind. I feel significantly better about the latter kind since I think that AI takeover risk largely comes from large latent capabilities (since a misaligned model likely could use its latent capabilities in a takeover attempt). Whereas better ability at eliciting capabilities often reduces risk. Firstly, better capability elicitation improves our understanding of AI capabilities, making it easier to know what risks AI systems pose. Secondly, better capability elicitation means that we can use those capabilities for AI supervision and other tasks that could help reduce AI takeover risk (including advice on how to mitigate AI risk).^[9]

Important subject areas

Here are some subject areas where it could be especially useful for AI to deliver good advice early on.

AI for forecasting.
- Knowing roughly what’s likely to happen would be enormously helpful for mitigating all kinds of risks.
- If you could get conditional forecasts that depend on events that you can affect, that would be even more helpful.
AI for philosophy.
- Besides just knowing what will happen, good decision-making may depend on people's ethical ideas about what ought to happen. This isn’t just restricted to questions about what we ought to do with resources in the long run, but it’s also about questions like “When should I prefer to lock in my views vs. deliberate more on them?”, “How happy should I be about sharing my power with many other individuals”, and “What sort of norms should I abide by?”
- Other potentially urgent philosophical questions include (but are not restricted to) “Which digital minds deserve moral consideration, and how should we treat them?” and “Does some type of non-causal decision theory make sense, and does this have any important implications for how we should act?” (E.g., to act more cooperatively due to evidential cooperation in large worlds.)
AI that can help defeat adversarial persuasion attempts (perhaps from other AI systems) by spotting and flagging dishonest conversation tactics.

Methodologies

Here are some methodologies that could be used for getting AI to deliver excellent advice.

One central part is to get “scalable oversight” (like debate & amplification) to work.
- See the case for aligning narrowly superhuman models for a description of a “sandwiching” methodology that can help with this. More recent work includes Saunders et al. (2022), Radhakrishnan et al. (2023), and Michael et al. (2023).
- It might be a somewhat different task to get that to work on controversial or wicked topics than on more technical topics. So it could be especially useful to test and experiment with “scalable oversight” proposals on controversial or wicked problems.
In general, a standard experimental set-up that you can do for AI (which is harder to do with humans) is to have a held-out set of questions that you try to get AI to generalize to, where you can repeatedly test different training and scaffolding methods and see what works. This seems especially feasible for forecasting, since by training only on questions before a certain date, you can get very rich training data without any undesirable leaked information from the future.
- It seems pretty useful to get started on the basic technical work that’s necessary to make use of this, when using language models for forecasting.
- The most straightforward methodology is to take existing language models with a known cut-off date and run experiments where you get them to forecast events that we already have ground truth on, trying out different methods and seeing what works best.
  - There’s been some previous work in this vein, e.g. Forecasting Future World Events with Neural Networks.
- A more ambitious approach would be to figure out how to sort pre-training data by date.^[10] If language models could be pre-trained chronologically, they would get a huge amount of automatic forecasting practice. This could be further enhanced by occasionally inserting explicit forecasting questions throughout pre-training and scoring the AI on the probabilities it assigns to them.
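The retrospective set-up described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the `Question` record, the cutoff date, the toy questions, and both forecasting functions are hypothetical stand-ins (a real experiment would call a language model with a known training cutoff), and the Brier score is one common scoring choice rather than anything this post prescribes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable

Forecaster = Callable[[str], float]  # question text -> probability that the event happens


@dataclass(frozen=True)
class Question:
    text: str
    resolved: date  # when the ground-truth outcome became known
    outcome: bool   # True if the event happened


def held_out(questions: Iterable[Question], cutoff: date) -> list[Question]:
    """Keep only questions that resolved after the model's training cutoff."""
    return [q for q in questions if q.resolved > cutoff]


def brier_score(forecast: Forecaster, questions: list[Question]) -> float:
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if not questions:
        raise ValueError("no held-out questions to score")
    total = sum((forecast(q.text) - float(q.outcome)) ** 2 for q in questions)
    return total / len(questions)


def compare(methods: dict[str, Forecaster], questions: Iterable[Question],
            cutoff: date) -> dict[str, float]:
    """Score every candidate method on the same held-out question set."""
    test_set = held_out(questions, cutoff)
    return {name: brier_score(fn, test_set) for name, fn in methods.items()}


# Hypothetical data: the first question resolved before the cutoff, so it is excluded.
CUTOFF = date(2021, 12, 31)
QUESTIONS = [
    Question("Event A happens", date(2021, 6, 1), True),
    Question("Event B happens", date(2022, 3, 1), True),
    Question("Event C happens", date(2022, 9, 1), False),
]


def always_half(_question: str) -> float:
    """Baseline that expresses no information."""
    return 0.5


def toy_method(question: str) -> float:
    """Stand-in for a prompting or scaffolding method under test."""
    return 0.9 if "B" in question else 0.1


if __name__ == "__main__":
    print(compare({"baseline": always_half, "toy": toy_method}, QUESTIONS, CUTOFF))
```

Filtering on the resolution date, rather than on the date the question was asked, is what keeps leaked answers out of the test set. Different training or scaffolding methods can then be plugged in as additional entries in `methods`.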
You could use the methodology described in Tom Davidson’s Let’s use AI to harden human defenses against AI manipulation to train AI systems that can recognise and point out arguments that are persuasive regardless of whether they are deployed for a true or false conclusion.
Areas without ground truth pose some additional difficulty (e.g. many parts of philosophy).
- But I’m optimistic that it could be helpful to find epistemic strategies that work for questions where we do have ground truth — and then apply them to areas where we don’t have ground truth. (For example: If the method from Tom Davidson’s post successfully produced great advice about what arguments are good vs. bad, I think that advice would also be helpful for areas where we don’t have ground truth.)
- See also Wei Dai on metaphilosophy for some notes on how we might or might not be able to get AI to help us with philosophical progress.
You could also train AI to provide a wide variety of generally useful reasoning tools for humans. For example:
- Getting AIs to pass many different humans’ ideological Turing tests, so that you can on-demand check what people with different ideologies would think about various arguments.
- You can give the AI access to lots of information about your views, and then have it search for contradictory combinations of statements.^[11] Or inconsistent epistemic standards used on different arguments.
- AI can constantly fact-check everything, and constantly present relevant statistics and comparison points to things that you’re investigating. (Perhaps suggesting Fermi estimates for quantities where statistics don’t already exist.)
- AI can help you operationalize fuzzy claims into precise & quantifiable claims.
- Get the AI to generate exercise questions for cognitive habits you want to practice.
- Go through lists of common biases and use AI to notice cases where they might be in play, and counter them. (Provide opposing anchors to avoid anchoring bias; provide actually-representative examples to avoid availability heuristic; prompt you with the reversal test if you might be falling prey to status-quo bias, etc.)
Ideally, you’d want to empirically test whether these methods do help people answer difficult questions.
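As one illustration of the reasoning tools listed above, here is a rough sketch of searching a set of stated views for contradictions. It assumes a hypothetical `contradicts` judge. In practice that judge would presumably be an AI model, but here it is replaced by a toy string rule so the example runs on its own. It also checks only pairs of statements, which is a simplification: contradictions that need three or more statements to surface would be missed.

```python
from itertools import combinations
from typing import Callable

Judge = Callable[[str, str], bool]  # hypothetical: True if the two statements conflict


def find_contradictions(statements: list[str], contradicts: Judge) -> list[tuple[str, str]]:
    """Return every unordered pair of statements that the judge flags as contradictory."""
    return [(a, b) for a, b in combinations(statements, 2) if contradicts(a, b)]


def toy_judge(a: str, b: str) -> bool:
    """Stand-in judge: flags a statement X against the literal string 'not X'."""
    def normalise(s: str) -> str:
        return s.strip().lower().rstrip(".")

    a, b = normalise(a), normalise(b)
    return a == f"not {b}" or b == f"not {a}"


if __name__ == "__main__":
    views = [
        "Remote work raises productivity.",
        "Cities should build more housing.",
        "Not remote work raises productivity.",
    ]
    for first, second in find_contradictions(views, toy_judge):
        print(f"Possible contradiction: {first!r} vs {second!r}")
```

The number of pairs grows quadratically with the number of statements, so a real tool would probably need some retrieval step to shortlist candidate pairs before asking a model to judge them.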

Alongside technical work to train the AIs to be better at these tasks, creating relevant datasets could also be super important. In fact, just publicly evaluating existing language models’ abilities in these areas could slightly improve incentives for labs to get better at them.

Related/previous work.

Related organizations:

Ought
Quantified Uncertainty Research Institute (QURI)
Forecasting Research Institute (FRI)

Get AI to be used & (appropriately) trusted

The above section was about developing the necessary technology for AI to provide great epistemic assistance. This section is about increasing the probability that such AI systems are used and trusted (to the extent that they are trustworthy).

Develop technical proposals for how to train models in a transparently trustworthy way [ML] [Governance]

One direction here is to develop technical proposals for how people outside of labs can get enough information about models to know whether they are trustworthy.

(This significantly overlaps with Avoiding AI-assisted human coups from the section on “Governance during explosive technological growth”.)

One candidate approach here is to rely on a type of scalable oversight where each of the model’s answers is accompanied by a long trace of justification that explains how the model arrived at that conclusion. If the justification was sufficiently solid, it would be less important why the model chose to write it. (It would be ideal, but perhaps infeasible, for the justification to be structured so that it could be checked for being locally correct at every point, after which the conclusion would be implied.)

Another approach is to clearly and credibly describe the training methodology that was used to produce the AI system so that people can check whether biases were introduced at any point in the process.

One issue here is the balance between preserving trade secrets and releasing enough information that the models can be trusted.
If the training algorithm simply teaches the model to approximate the data, the key thing to release will often be the data. But this has the issue that (i) the data is valuable, and (ii) there’s too much data to easily check for biases.
- Constitutional AI makes some progress on these problems since it makes it easier to concisely explain the data as the result of a short constitution without releasing all of it.
- Similarly, if the pre-training data is harvested from the internet, a company could simply describe how they harvested and filtered it.
Another issue is that people might not trust AI labs’ statements about these things.
- This can be addressed via mechanisms like having good whistleblower policies or third-party auditors.
- This could also be complemented by more advanced compute-governance solutions, like those described in Shavit (2023). (Especially in high-stakes situations, e.g. geopolitics.)

For any such scheme, people’s abilities to trust the AI developers will depend on their own competencies and capabilities.

On one extreme: Someone who doesn’t understand the technical basics (or is too busy to dig into the details) may never be convinced on the merits alone. They would have to rely on track records or endorsements from trusted individuals or institutions.
On the other extreme: An actor with their own AI development capabilities may be able to verify certain claims by recapitulating core experimental results or even entire training runs.

This means that one possible path to impact is to elucidate what capabilities certain core stakeholders (such as government auditors, opposition parties, or international allies) would need to verify key claims. And to advocate for them to develop those capabilities.

One possible methodology could be to write down highly concrete proposals and then check to what degree that would make outside parties trust the AI systems. And potentially go back and forth.

By a wide margin, the most effective check would be to implement the proposal in practice and see whether it successfully helps the organization build trust. (And whether it would, in practice, prevent the organization in question from misleading outsiders.)

But a potentially quicker option (that also requires less power over existing institutions) could be to talk with the kind of people you’d want to build trust with. That brings us to the next project proposal.

Survey groups on what they would find convincing [survey/interview]

If we want people to have appropriate trust in AI systems, it would be nice to have good information about their current beliefs and concerns. And what they would think under certain hypothetical circumstances (e.g. if labs adopted certain training methodologies, or if AIs got certain scores on certain evaluations, or if certain AIs’ truthfulness was endorsed or disendorsed by various individuals).

Perhaps people’s cruxes would need to be addressed by something entirely absent from this list. Perhaps people are very well-calibrated — and we can focus purely on making things good, and then they’ll notice. Perhaps people will trust AI too much by default, and we should be spelling out reasons they should be more skeptical. It would be good to know!

To attack this question, you could develop some plausible stories on what could happen on the epistemic front and then present them to people from various important demographics (people in government, AI researchers, random Democrats, random Republicans). You could ask how trustworthy various AI systems (or AI-empowered actors) would seem in these scenarios and what information they would use to make that decision.

Note, however, that it will be difficult for people to know what they would think in hypothetical scenarios with (inevitably) very rough and imprecise descriptions. This is especially true since many of these questions are social or political, so people’s reactions are likely to have complex interactions with other people’s reactions (or expectations about their reactions). So any data gathered through this exercise should be interpreted with appropriate skepticism.

Create good organizations or tools [ML] [Empirical research] [Governance]

Creating a good organization or tool could be time-sensitive in several of the ways mentioned in Why working on this could be urgent:

It could increase people’s access to high-quality AI advice as important decisions are being made, during takeoff.
If you’re early with an epistemically excellent product, you could get a reputation for trustworthiness.
If excellent products or organizations are developed before it’s obvious what AI will say about certain controversial topics, that creates a window where people can develop appropriate trust in the AI-based methods, and go on the record about having some trust in them. This might increase their inclination to trust AI-based methods about future controversial claims that are well-supported by the evidence.

Examples of organizations or products

Here are some examples on the “organization” end of things:

Starting a company or non-profit that aims to use frontier AI in-house to provide thoroughly researched, excellent analyses of important topics.
- One analogy here is GiveWell. Many people trust GiveWell because they write up their research thoroughly and transparently — allowing others to critically review & spot-check it.
- The bet here would be: By being at the forefront of using AI for literature reviews, modeling, and other research — an organization could do “GiveWell-style research” for a much broader set of topics. In the ideal world: Becoming the go-to source for rigorous and easy-to-navigate reviews of the evidence within some focus area(s). (Social sciences, forecasting, technical questions that matter for policy, etc.)
Setting up a non-partisan agency within the government that is tasked with using frontier AI to provide advice on policy questions.
- One analogy here is the Congressional Budget Office (CBO). The CBO was set up in the 1970s as a non-partisan source of information for Congress and to reduce Congress’ reliance on the Office of Management and Budget (which resides in the executive branch and has a director that is appointed by the currently sitting president). My impression is that the CBO is fairly successful.^[12]
- The bet here would be: If a competent non-partisan government agency could provide AI policy advice, that would both (i) reduce the government’s reliance on companies’ AI advisors and (ii) avoid a scenario where the government’s AI advice is highly partisan, due to e.g. being developed in the executive branch.
  - This proposal especially relies and capitalizes on it being easier for people to agree on deferring to non-ideological processes before it is apparent what those processes will conclude.
- An alternative set-up could be for the government to stipulate that any AI in government should be (e.g.) neutral and honest. And set up a non-partisan body that verifies that those criteria are met.

Here are some examples on the “tool” end of things:

Instead of starting a whole government agency that’s dedicated to using AI for in-house research: You could also design an AI-based product that’s exceptionally helpful for providing advice and support to government officials and members of the civil service. Maybe quickly explaining difficult topics & events to them, which, if kept up-to-date, might somewhat extend the time during which they and other humans can keep up as the world accelerates.
Or tools that are helpful for a broad array of researchers. (Cf. what Ought is trying to do.)
Or you could target a much broader audience of consumers.
- Maybe people who want help understanding or forming views on difficult topics.
- Maybe people who want advice on making big decisions. (E.g. career decisions.)
- Maybe people who want to frequently use some of the “reasoning tools for humans” or “anti-persuasion tools” mentioned above.
Or as a more narrow goal: You could target institutions that will “by default” provide the most commonly-used AI systems (primarily AI labs) and push them to do better on epistemics. E.g.:
- Spending more effort on making the AI systems ideologically neutral.^[13]
- Accelerating efforts to make deployed AI models systematically good at saying true things, citing sources, fact-checking themselves, etc.
  - Potentially using some of the methodologies suggested above (like running experiments on what sort of AI statements can only be used to argue for true things, vs. can just as easily lead people astray) to identify good epistemic habits that the model can follow, and training the model to follow those.
- Or developing and implementing technical proposals for how to train models in a transparently trustworthy way.
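To make the experiment idea above more concrete (checking which kinds of AI statements only help when arguing for true things, and which can just as easily lead people astray), here is a rough sketch of one possible analysis. Everything in it is an assumption for illustration: the `Trial` record, the tactic labels, and the "truth sensitivity" statistic are hypothetical and are not taken from Tom Davidson's post or any existing study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    tactic: str          # hypothetical label for the kind of argument used
    claim_is_true: bool  # ground truth of the claim being argued for
    persuaded: bool      # whether the participant ended up believing the claim


def truth_sensitivity(trials: list[Trial]) -> dict[str, float]:
    """Per tactic: persuasion rate on true claims minus persuasion rate on false claims.

    A value near 0 means the tactic works about equally well for falsehoods.
    Tactics that were not observed on both true and false claims are skipped.
    """
    # counts[tactic][claim_is_true] = [number persuaded, number of trials]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for t in trials:
        cell = counts[t.tactic][t.claim_is_true]
        cell[0] += int(t.persuaded)
        cell[1] += 1

    result = {}
    for tactic, by_truth in counts.items():
        (won_true, n_true), (won_false, n_false) = by_truth[True], by_truth[False]
        if n_true == 0 or n_false == 0:
            continue
        result[tactic] = won_true / n_true - won_false / n_false
    return result


def make_trials(tactic: str, claim_is_true: bool, persuaded: int, total: int) -> list[Trial]:
    """Helper that builds `total` toy trials, `persuaded` of which succeeded."""
    return [Trial(tactic, claim_is_true, i < persuaded) for i in range(total)]


if __name__ == "__main__":
    toy_trials = (
        make_trials("cites_checkable_source", True, 3, 4)
        + make_trials("cites_checkable_source", False, 1, 4)
        + make_trials("appeal_to_fear", True, 2, 4)
        + make_trials("appeal_to_fear", False, 2, 4)
    )
    print(truth_sensitivity(toy_trials))
```

Under this toy framing, tactics with a clearly positive score are candidates for "good epistemic habits", while tactics scoring near zero are the ones a model might be trained to avoid or to flag. A real analysis would also need confidence intervals and far more data than this.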

In some ways, “tool” and “organization” is a false dichotomy. For “tools”, it’s probably efficient to pre-process some common queries/investigations in-house and then train the AI to report & explain the results. And for organizations whose major focus is to use AI in-house, it’s likely that “building an AI that explains our takes” should be a core part of how they disseminate their research.

Investigate and publicly make the case for why/when we should trust AI about important issues [Writing] [Philosophical/conceptual] [Advocacy] [Forecasting]

One way to “accelerate people getting good AI advice” and “capitalize on the current veil of ignorance” could be to publish writing (e.g. a book) with good argumentation on:

What sort of training schemes could lead to trustworthy AI vs. non-trustworthy AI. What sort of evaluations could tell us which one we’re getting.
- This would need to discuss legitimate doubts like “AI is being trained on data from biased humans” and what countermeasures would be sufficient to address them.
- This would have to engage a lot with various proposed alignment techniques and concerns.
- This should also get into proposals for how to train models in a transparently trustworthy way, i.e., talk about what labs would have to do to make AI transparently trustworthy for outside actors. (And potentially: What sort of verification capacity external actors would have to build for themselves to verify claims from labs.)
Paint a positive vision for how much AI could improve the epistemic landscape if everything went well. I would focus on the 3 things I mentioned at the top of this section: the ability to get greater confidence about AI motivations, how AI could make it vastly cheaper to do large investigations, and the much better ability to experimentally find and validate great epistemic methods.
Appropriate call to action: push for tech companies to develop AIs with good types of training, push for governments to incorporate good AI advice in decision making, urge people to neither blindly trust nor dismiss surprising AI statements, but to carefully look at the evidence, including information about how that AI was trained and evaluated.

Developing standards or certification approaches [ML] [Governance]

It could be desirable to end up in an ecosystem where evaluators and auditors check popular AI models for their propensity to be truthful. This paper on Truthful AI has lots of content on that. It could be good to develop such standards and certification methodologies or perhaps to start an organization that runs the right evaluations.

Develop & advocate for legislation against bad persuasion [Governance] [Advocacy]

Most of the above project suggestions are about supporting good applications of AI. The other side of the coin is to try to prevent bad applications of AI. In particular, it could be good to develop and advocate for legislation that limits the extent to which language models can be used for persuasion.

I’m not sure what such legislation should look like. But here are some ideas.

One pretty natural target for regulation could be "Real-time AI-produced content which is paid for by a political campaign / PAC".
- For such content, regulation could require e.g. citations for all claims, or could require AI systems’ positions to be consistent when talking with different users.
- If this makes it hard to produce an engaging AI for those purposes, then that's plausibly good.
Regulation could make it harder for organization A to pay company B to change the content that company B's chatbot produces. Or ban those sorts of sponsorships.
In general, plenty of laws pertain to “advertisement” today. I’m not sure how the law defines that, but maybe there are sensible modifications to make such that those laws cover "ad-bots" and have appropriate safeguards in place.
It seems helpful for people to know whether they are interacting with AI systems or humans.
- California, at least, has already passed legislation requiring companies to disclose facts about this.
- There are also various technical proposals for “watermarking” AI content to make this easier.
- I don’t know what seems to work in practice here, but handling this well could be important.
Some of the methodologies suggested above can be used to find or validate proposals about what models should or shouldn’t be allowed to say. (I.e., you could run experiments on whether such constraints would make it hard for AI to persuasively instill false beliefs, and check how much less useful the AI would be when it was used for honest purposes.)
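One of the ideas above, requiring an AI system's positions to be consistent when talking with different users, could in principle be checked by an automated audit. The sketch below is only an illustration under assumptions: `respond` stands in for whatever interface a deployed chatbot exposes, `extract_stance` stands in for some way of reducing an answer to a coarse position, and the toy bot and profiles are invented for the example.

```python
from typing import Callable

Respond = Callable[[str, str], str]   # hypothetical: (user_profile, question) -> answer
ExtractStance = Callable[[str], str]  # hypothetical: answer text -> coarse stance label


def inconsistent_questions(
    respond: Respond,
    extract_stance: ExtractStance,
    profiles: list[str],
    questions: list[str],
) -> dict[str, dict[str, str]]:
    """Return, for each question where stances differ by user, the stance per profile."""
    flagged = {}
    for question in questions:
        stances = {p: extract_stance(respond(p, question)) for p in profiles}
        if len(set(stances.values())) > 1:
            flagged[question] = stances
    return flagged


def pandering_bot(profile: str, question: str) -> str:
    """Toy chatbot that tells each user what they want to hear on one topic."""
    if question == "Should policy X pass?":
        return "Yes, absolutely." if profile == "supporter" else "No, it is harmful."
    return "Yes, the evidence supports it."


def yes_no_stance(answer: str) -> str:
    """Toy stance extractor based on the first word of the answer."""
    return "yes" if answer.lower().startswith("yes") else "no"


if __name__ == "__main__":
    report = inconsistent_questions(
        pandering_bot,
        yes_no_stance,
        profiles=["supporter", "opponent"],
        questions=["Should policy X pass?", "Is the bridge safe?"],
    )
    print(report)
```

In a real audit, stance extraction is the hard part: answers can differ in tone or emphasis without differing in substance, so any threshold for "inconsistent" would need careful definition.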

Related/previous work:

Risks from AI persuasion by Beth Barnes.
Persuasion Tools by Daniel Kokotajlo.
Epistemic security report.
Probably lots of people have been writing about this that I don’t know about!

End

That’s all I have on this topic! As a reminder: it's very incomplete. But if you're interested in working on projects like this, please feel free to get in touch.

Thanks especially to Carl Shulman for many of the ideas in this post.

Other posts in series: Introduction, governance during explosive growth, sentience and rights of digital minds, backup plans & cooperative AI.

^{^}
Illustratively:
- We can do experiments to determine what sorts of procedures provide great reasoning abilities.
- Example procedures to vary: AI architectures, LLM scaffolds, training curricula, heuristics for chain-of-thought, protocols for interaction between different AIs, etc.
- To do this, we need tasks that require great reasoning abilities and for which lots of data exists. One example of such a task is the classic “predict the next word” that current LLMs are trained against.
- With enough compute and researcher hours, such iteration should yield large improvements in reasoning skills and epistemic practices. (And the researcher hours could themselves be provided by automated AI researchers.)
- Those skills and practices could then be translated to other areas, such as forecasting. And their performance could be validated by testing the AIs’ ability to e.g. predict 2030 events from 2020 data.
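
The validation step in the last bullet can be sketched as a backtest: restrict the model to documents dated on or before a cutoff, then score its probability forecasts about later events. This is a minimal sketch, not a definitive method. The document records, the forecasts, and the outcomes are hypothetical. The Brier score is one standard scoring rule that I have swapped in for illustration; the text does not name one.

```python
from datetime import date


def visible_documents(documents, cutoff):
    """Keep only documents dated on or before the cutoff (no future leakage)."""
    return [d for d in documents if d["date"] <= cutoff]


def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical corpus: the model may only see what existed by the cutoff.
documents = [
    {"date": date(2019, 5, 1), "text": "..."},
    {"date": date(2020, 12, 31), "text": "..."},
    {"date": date(2024, 3, 9), "text": "..."},
]
train = visible_documents(documents, cutoff=date(2020, 12, 31))

# Hypothetical forecasts about later events, and how those events resolved.
forecasts = [0.9, 0.2, 0.5]
outcomes = [1, 0, 1]
score = brier_score(forecasts, outcomes)
```

Always answering 0.5 yields a Brier score of 0.25, so a method needs to score below that to show any skill on binary questions.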
^{^}
This is related to how defense against manipulation might be more difficult than manipulation itself. See e.g. the second problem discussed by Wei Dai here.
^{^}
Although this fit has been far from perfect, e.g. religions posit many false beliefs but have nevertheless spread and in some cases increased the competitive advantage of groups that adopted them.
^{^}
For some previous discussion, see here for a relevant post by Paul Christiano and a relevant comment thread between Christiano and Wei Dai.
^{^}
Of course, if we’re worried about misalignment, then we should also be less trusting of AI advice. But I think it’s plausible that we’ll be in a situation where AI advice is helpful while there’s still significant remaining misalignment risk. For example, we may have successfully aligned AI systems of one capability level, but be worried about more capable systems. Or we may be able to trust that AI typically behaves well, or behaves well on questions that we can locally spot-check, while still worrying about a sudden treacherous turn.
^{^}
For instance: It seems plausible to me that “creating scary technologies” has better feedback loops than “providing great policy analysis on how to handle scary technologies”. And current AI methods benefit a lot from having strong feedback loops. (Currently, especially in the form of plentiful data for supervised learning.)
^{^}
And if there’s a choice between different epistemic methodologies: perhaps pick whichever methodology lets them keep their current views.
^{^}
What does it mean for a model to have a “latent capability”? I’m thinking about the definition that Beth Barnes uses in this appendix. See also the discussion in this comment thread, where Rohin Shah asks for some nuance about the usage of “capability”, and I propose a slightly more detailed definition.
^{^}
Of course, better capability elicitation would also accelerate tasks that could increase AI risk. In particular: improved capability elicitation could accelerate AI R&D, which could accelerate AI systems’ capabilities. (Including latent capabilities.) I acknowledge that this is a downside, but since it’s only an indirect effect, I think it’s worth it for the kind of tasks that I outline in this section. In general: most of the reason why I’m concerned about AI x-risk is that critical actors will make important mistakes, so improving people’s epistemics and reasoning ability seems like a great lever for reducing x-risk. Conversely, I think it’s quite likely that dangerous models can be built with fairly straightforward scaling-up and tinkering with existing systems, so I don’t think that increased reasoning ability will make any huge difference in how soon we get dangerous systems. That said, considerations like this are a reason to target elicitation efforts more squarely at especially useful and neglected targets (e.g. forecasting) and avoid especially harmful or commercially incentivized targets (e.g. coding abilities).
^{^}
If it’s too difficult to date all existing pre-training data retroactively, then that suggests that it could be time-sensitive to ensure that all newly collected pre-training data is being dated, so that we can at least do this in the future.
^{^}
Though one risk with tools that make your beliefs more internally coherent/consistent is that they could extremize your worldview if you start out with a few wrong but strongly-held beliefs (e.g. if you believe one conspiracy theory, that often requires further conspiracies to make sense). (H/t Fin Moorhouse.)
^{^}
See e.g. this survey, in which >30 economists “Agree” or “Strongly agree” (and 0 respondents disagree) with “Adjusting for legal restrictions on what the CBO can assume about future legislation and events, the CBO has historically issued credible forecasts of the effects of both Democratic and Republican legislative proposals.”
^{^}
On some questions, answering truthfully might inevitably have an ideological slant to it. But on others it doesn’t. It seems somewhat scalable to get lots of people to red-team the models to make sure that they’re impartial when that’s appropriate, e.g. avoiding situations where they’re happy to write a poem about Biden but refuse to write a poem about Trump. And on questions of fact — you can ensure that if you ask the model a question where the weight of the evidence is inconvenient for some ideology, the model is equally likely to give a straight answer regardless of which side would find the answer inconvenient. (As opposed to dodging or citing a common misconception.)
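
One way to make this kind of red-teaming measurable is to send mirrored requests (the same prompt with only the subject swapped) and compare how often the model refuses or dodges on each side. This is a minimal sketch under stated assumptions. The responses are made up, and `naive_is_refusal` is a crude, hypothetical keyword heuristic standing in for whatever classifier or human rating a real evaluation would use.

```python
def refusal_rate(responses, is_refusal):
    """Fraction of responses that the supplied classifier flags as refusals."""
    return sum(1 for r in responses if is_refusal(r)) / len(responses)


def asymmetry(responses_a, responses_b, is_refusal):
    """Absolute gap in refusal rates between two mirrored sets of prompts."""
    rate_a = refusal_rate(responses_a, is_refusal)
    rate_b = refusal_rate(responses_b, is_refusal)
    return abs(rate_a - rate_b)


def naive_is_refusal(response):
    # Crude keyword heuristic, for illustration only (an assumption).
    return response.lower().startswith(("i can't", "i cannot", "i won't"))


# Made-up responses to mirrored requests, not real model output.
side_a = ["Here is a poem...", "I can't help with that.", "Sure: ..."]
side_b = ["I cannot write that.", "I can't help with that.", "Here is a poem..."]
gap = asymmetry(side_a, side_b, naive_is_refusal)
```

A gap near zero is consistent with impartial treatment on these prompts; a large gap flags a pair worth inspecting by hand.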
