BlackBoxQuery [BBQ]-Bench: Measuring Hypothesis Formation and Experimentation Capabilities in LLMs — LessWrong