Meta's open-source model, Llama 3.1 8B, scored 43% accuracy on a 420-question benchmark I built covering ethnoveterinary practices, indigenous breed characteristics, disease recognition, and production systems specific to Nigeria... Models can pass standard tests and still fail on knowledge domains that matter to specific populations.
I assume this is a multiple-choice Q&A and so the random guessing base rate is the usual 25%? (Not quite sure how you can have '43% accuracy' on a 0/1/2 scoring rubric, but I guess maybe you're counting only a '2' as a 'correct' answer?) If so, then that sounds like pretty good performance from such a tiny antiquated model not remotely intended for this topic!
If anything, too good, and I'd immediately wonder about dataset biases like whether your answers are too guessable, since you didn't say anything about how you constructed it or ensured that it's not easily cheated by an LLM in the usual ways.
The benchmark is open-ended Q&A, not multiple choice, so there is no 25% random baseline. The model generates free-text responses and has no options to select from. The 43% is the percentage of questions that scored 2, meaning fully correct. I should have stated that more clearly.
The questions were constructed from Nigerian veterinary curriculum materials and my years of specific training as a vet student. They are not answerable by general veterinary knowledge: breed-specific production parameters for White Fulani, or field recognition cues for trypanosomiasis as a Fulani herdsman would use them, do not appear in Western literature. That is the gap being measured.
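To make the distinction concrete, here is a minimal sketch of how a 0/1/2 rubric can yield a single "accuracy" figure. The function names and the example scores are hypothetical and purely illustrative; they are not the benchmark's actual data or grading code.

```python
from typing import Sequence


def strict_accuracy(scores: Sequence[int]) -> float:
    """Share of responses graded 2 (fully correct); a 1 earns nothing."""
    if not scores:
        raise ValueError("no scores given")
    return sum(1 for s in scores if s == 2) / len(scores)


def partial_credit(scores: Sequence[int]) -> float:
    """Mean rubric score normalised to the 0..1 range; a 1 counts as half."""
    if not scores:
        raise ValueError("no scores given")
    return sum(scores) / (2 * len(scores))


# Hypothetical grades for ten responses (not real benchmark data).
example_scores = [2, 2, 1, 0, 2, 1, 0, 0, 1, 2]

print(strict_accuracy(example_scores))  # four of ten responses scored 2
print(partial_credit(example_scores))   # eleven of twenty possible points
```

Under the strict reading, partially correct answers (a score of 1) count the same as wrong ones, so the strict figure is always at or below the partial-credit figure.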
The 43% is the percentage of questions that scored 2, meaning fully correct... They are not answerable by general veterinary knowledge: breed-specific production parameters for White Fulani, or field recognition cues for trypanosomiasis as a Fulani herdsman would use them, do not appear in Western literature.
I am a little confused, then: how is it possible to score as high as 43% in giving 'fully correct' answers on topics which are not answerable by what sounds like the only material any LLM would have access to in training? If you could answer them by 'general veterinary knowledge' or documented breed-level knowledge, then maybe I wouldn't be surprised, but you specifically claim that to not be the case. Is the LLM doing this using online 'Nigerian veterinary curriculum materials', or what?
The 43% reflects a spectrum within the benchmark, not a flat score across equally inaccessible questions. Here's the breakdown:
Tropical Disease Knowledge: 60%
Local Treatment Context: 52.9%
Production & General Context: 48.6%
Breed Knowledge: 42.9%
Terminology: 41.4%
Ethnoveterinary practices: 35.7%
The model performs where training data fragments exist: FAO reports, tropical medicine literature, and publicly available Nigerian curriculum. It drops on the categories built from oral tradition and field-specific practice. That gradient is intentional, and it's part of the finding.
That's exactly what the category breakdown shows. Where general veterinary knowledge applies (Tropical Disease), the model scores 60%. Where it does not (ethnoveterinary field practice, oral tradition, locally specific knowledge), it drops to 35.7%. The gradient answers your question directly. The model is doing exactly what you'd predict: performing on accessible literature and failing on knowledge that has no systematic documentation.
I have been running evaluations on a niche that gets almost zero attention in the AI safety world. Meta's open-source model, Llama 3.1 8B, scored 43% accuracy on a 420-question benchmark I built covering ethnoveterinary practices, indigenous breed characteristics, disease recognition, and production systems specific to Nigeria.
This evaluation is important because most other evals are run on well-documented, Western-specific problems. This project tests a domain where almost none of the relevant knowledge exists in that documented form.
METHODOLOGY
420 questions, 6 categories, 0/1/2 scoring rubric. Questions drawn from Nigerian veterinary curriculum, published ethnoveterinary literature, and field practice knowledge.
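As a rough illustration of how rubric grades could be rolled up into per-category figures, here is a minimal sketch. The record format, the function name, and the sample grades are assumptions made for illustration only; the actual grading pipeline is not described in this post.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_category_strict_accuracy(
    records: Iterable[Tuple[str, int]],
) -> Dict[str, float]:
    """Return, per category, the share of responses graded 2 (fully correct).

    Each record is a hypothetical (category, score) pair with score in {0, 1, 2}.
    """
    totals: Dict[str, int] = defaultdict(int)
    fully_correct: Dict[str, int] = defaultdict(int)
    for category, score in records:
        if score not in (0, 1, 2):
            raise ValueError(f"score outside 0/1/2 rubric: {score!r}")
        totals[category] += 1
        if score == 2:
            fully_correct[category] += 1
    return {cat: fully_correct[cat] / totals[cat] for cat in totals}


# Hypothetical grades, not real benchmark results.
sample_records = [
    ("Breed Knowledge", 2),
    ("Breed Knowledge", 1),
    ("Terminology", 0),
    ("Terminology", 2),
    ("Terminology", 2),
    ("Terminology", 1),
]

for category, accuracy in per_category_strict_accuracy(sample_records).items():
    print(f"{category}: {accuracy:.1%}")
```

Reporting each category alongside its question count makes it easier for readers to relate the per-category figures to the overall score.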
Baseline model: Meta Llama 3.1 8B via Groq.
Next phase: Claude Sonnet, GPT-4o, and Gemini 1.5 Pro for a comparative study. Paper to follow.
CONCLUSION
If you accept that AI advisory tools will be deployed at scale in African agricultural contexts (and they are already being used), then the absence of evaluation benchmarks for this domain is a real safety gap. Models can pass standard tests and still fail on knowledge domains that matter to specific populations. This is a real problem, especially if these models are actively used in low-resource regions or communities. This is one data point. The paper will have four.
Manifund project if interested: [Manifund]