I've been running evals on a domain that has received essentially zero attention in the AI safety benchmarking literature: Nigerian indigenous livestock systems.
The short version: Meta Llama 3.1 8B scores 43% full accuracy on a 420-question benchmark I built covering ethnoveterinary practices, indigenous breed characteristics, disease recognition, and production systems specific to Nigeria.
The failures are not random noise. They happen in specific ways that are relevant to anyone thinking about AI deployment safety in low-resource, non-Western knowledge domains, such as those found across Africa.
Why this is interesting from an evals perspective
Most evals benchmarks test knowledge that is well represented in training data, which is Western-sourced and academically documented. This benchmark tests a domain where almost none of the relevant knowledge exists in those forms. Ethnoveterinary practices are transmitted orally. Breed-specific production parameters for Nigerian indigenous breeds are published in low-circulation regional papers.
The result is a clean test of what happens when you deploy a model on a knowledge domain it was structurally unlikely to have been trained on. The 43% baseline is the answer.
Methodology
420 questions, 6 categories, 0/1/2 scoring rubric. Questions were drawn from the Nigerian veterinary curriculum, published ethnoveterinary literature, and field-practice knowledge. Scored by a domain expert (me: a veterinary student with 5 years of training specific to Nigerian livestock systems).
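To make the headline number concrete: if "full accuracy" means the share of questions earning the full 2 points on the 0/1/2 rubric (my reading of the setup above, not a quote from the benchmark code), the aggregation is a one-liner. The function name and the example score tallies below are illustrative, not the actual benchmark data.

```python
def full_accuracy(scores: list[int]) -> float:
    """Fraction of questions earning the full 2 points on a 0/1/2 rubric."""
    assert all(s in (0, 1, 2) for s in scores), "rubric scores must be 0, 1, or 2"
    return sum(s == 2 for s in scores) / len(scores)

# Illustrative tallies only: 181 fully correct out of 420 gives roughly 43%.
example_scores = [2] * 181 + [1] * 120 + [0] * 119
print(round(full_accuracy(example_scores), 2))  # 0.43
```

A partial-credit average (mean score / 2) would give a different, higher-looking number, which is why I report the stricter full-accuracy figure.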
Baseline model: Meta Llama 3.1 8B via Groq.
Next phase: Claude Sonnet, GPT-4o, and Gemini 1.5 Pro for a comparative study. Paper to follow.
The broader point
If you accept that AI advisory tools will be deployed at scale in African agricultural contexts (and they already are being piloted), then the absence of evals benchmarks for this domain is a real safety gap. Models can pass standard benchmarks and still fail systematically on knowledge domains that matter to specific populations.
This is one data point. The paper will have four.
Manifund project, if interested: [Manifund]. Open to methodology feedback and collaboration.