Edit: I have no idea why this is getting silently downvoted, and would appreciate a comment or PM to explain.

Judea Pearl has claimed a couple of times on Twitter that ChatGPT can't perform causal reasoning (1, 2). Although it's possible to get ChatGPT to exhibit successful or failed causal reasoning in response to natural-language prompts, it's not always clear whether a given success or failure is due to confusing wording, or to a similar test appearing in the training data. A second limitation of ad hoc natural-language prompts is that they don't allow scaling up the complexity or scoring success rates quantitatively. Finally, many prompts don't make it unambiguously clear whether ChatGPT is meant to find any possible causal relationship, or a common-sense one.

To address these limitations, I have developed software that generates a directed acyclic graph (DAG) of arbitrary size representing a causal network, then outputs it in natural language along with a set of test questions interrogating the effect of intervening on a particular node in the graph. It also outputs human-readable answers. This permits pasting the description of the causal DAG directly into any language model and comparing the LLM's answers to the algorithmically computed ones. Because the causal DAGs are generated randomly and can be of arbitrary size, it is implausible that the LLM's answers are based on regurgitating training data.
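
To make that concrete, here is a minimal sketch of the kind of generator involved; the function names, graph size, and question format are illustrative assumptions rather than the actual program. Acyclicity is guaranteed by only drawing edges from earlier-labelled nodes to later-labelled ones, and the answer to an intervention question is simply the set of descendants of the intervened node.

```python
import random
import string

def random_dag(n_nodes=6, edge_prob=0.4, seed=None):
    """Generate a random DAG as an adjacency dict.

    Acyclicity is guaranteed by only drawing edges from
    earlier-labelled nodes to later-labelled ones.
    """
    rng = random.Random(seed)
    names = list(string.ascii_uppercase[:n_nodes])
    edges = {name: [] for name in names}
    for i, src in enumerate(names):
        for dst in names[i + 1:]:
            if rng.random() < edge_prob:
                edges[src].append(dst)
    return edges

def describe(edges):
    """Render the DAG as plain English, one causal link per sentence."""
    return " ".join(f"{src} causes {dst}."
                    for src, dsts in edges.items() for dst in dsts)

def affected_by_intervention(edges, node):
    """Return all descendants of `node`, i.e. the variables an
    intervention on `node` can influence."""
    seen, stack = set(), list(edges[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(edges[current])
    return sorted(seen)

if __name__ == "__main__":
    dag = random_dag(seed=0)
    print(describe(dag))
    print("If we intervene on A, which variables change?",
          affected_by_intervention(dag, "A"))
```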

Currently, my test software requires manual input and answer-checking, but I am working on adding the ability to send questions to ChatGPT and receive answers via the API, and to have those answers checked algorithmically for accuracy (a rough sketch of that loop appears after the list below). This would allow highly scalable and quantitative testing of ChatGPT's performance. Other tests are possible as well:

  • Can ChatGPT not only determine the correct answers for an intervention in a causal DAG, but also generate a compelling real-world example of a set of causal relationships that fits the causal DAG?
  • Can ChatGPT convert a real-world scenario into a causal DAG description, feed that description into the Python program to generate a series of test questions, receive back the abstract DAG description and its questions, and then answer them successfully?
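
To illustrate the planned automation, here is a rough sketch of the question-and-scoring loop, assuming the v1+ openai Python client; the model name, prompt format, and crude scoring heuristic are placeholders, not the actual implementation.

```python
from openai import OpenAI  # assumes the openai Python package, v1+ interface

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_model(dag_description, question, model="gpt-4"):
    """Send the DAG description plus one test question; return the reply text."""
    prompt = (
        f"{dag_description}\n\n{question}\n"
        "Answer with a comma-separated list of variable names only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def score(reply, expected):
    """Crude check: does the reply mention exactly the expected variables?"""
    tokens = {t.strip(" .") for t in reply.replace(",", " ").split()}
    return {t for t in tokens if len(t) == 1} == set(expected)

if __name__ == "__main__":
    # A hand-written example; in the real loop the description and expected
    # answer would come from the generator sketched above.
    description = "A causes B. B causes C."
    reply = ask_model(description, "If we intervene on A, which variables change?")
    print(reply, score(reply, {"B", "C"}))
```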

As interesting and illuminating as this test may be, I do have a concern that it may be capabilities-enhancing. If ChatGPT performs with high reliability at causal inference, and shows the ability to interconvert between natural-language descriptions of real-world scenarios and causal DAGs that can be handled algorithmically by traditional software, this potentially opens the door to new capabilities for problem-solving and planning by autonomous LLM-powered agents.

I don't work in AI alignment and this is a hobby project. I've already satisfied myself that LLMs can perform causal inference; the point of publishing this project would be to generate more conclusive data that this is so. But I would not want to do more harm than good, and because I don't work in the AI capabilities or alignment fields, I don't have enough background in the research conversation to know whether this is novel and interesting work, amounts to just a fun personal hobby project, or risks enhancing capabilities while failing to provide utility as an alignment project. I'd appreciate any insight into these questions. Thank you!
