BlackBoxQuery [BBQ]-Bench: Measuring Hypothesis Formation and Experimentation Capabilities in LLMs
The following is a revised version of the winning paper that my team (Daniel Wu, David Zhang, Justin Zhang) produced as part of the Impact Research Initiative Fall 2025 cohort. We were mentored by Nikola Jurkovic.

Abstract

We introduce BBQ-Bench: a novel benchmark designed to evaluate research-relevant reasoning skills of...
Jan 1210