I have been running evaluations in a niche that gets almost no attention in the AI safety world. Meta's open-source model Llama 3.1 8B scored 43% on a 420-question benchmark I built covering ethnoveterinary practices, indigenous breed characteristics, disease recognition, and production systems specific to Nigeria.
This evaluation matters because most other evals are run on well-documented, Western-specific problems. This project tests a domain where almost none of the relevant knowledge is documented in that way.
METHODOLOGY
420 questions, 6 categories, 0/1/2 scoring rubric. Questions are drawn from the Nigerian veterinary curriculum, published ethnoveterinary literature, and field practice knowledge.
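To make the rubric concrete, here is a minimal sketch of how 0/1/2 grades could be rolled up into a percentage, overall and per category. It assumes the score is points earned divided by the maximum possible points (2 per question); the post does not say whether the 43% figure was computed this way. The function name `score_summary` and the category labels in the usage example are hypothetical.

```python
from collections import defaultdict

MAX_POINTS = 2  # top of the 0/1/2 rubric


def score_summary(graded):
    """Aggregate rubric grades into fractions of the maximum score.

    graded: list of (category, points) pairs, with points in {0, 1, 2}.
    Assumption: score = points earned / points possible. The original
    benchmark may define its accuracy figure differently.
    """
    if not graded:
        raise ValueError("no graded answers supplied")

    totals = defaultdict(lambda: [0, 0])  # category -> [earned, possible]
    for category, points in graded:
        if points not in (0, 1, 2):
            raise ValueError(f"invalid rubric score: {points!r}")
        totals[category][0] += points
        totals[category][1] += MAX_POINTS

    earned = sum(e for e, _ in totals.values())
    possible = sum(p for _, p in totals.values())
    return {
        "overall": earned / possible,
        "by_category": {c: e / p for c, (e, p) in totals.items()},
    }


# Toy data with made-up category labels, not results from the benchmark.
example = [
    ("disease recognition", 2),
    ("disease recognition", 0),
    ("indigenous breeds", 1),
    ("indigenous breeds", 1),
]
summary = score_summary(example)
print(summary["overall"])
print(summary["by_category"])
```

Reporting the per-category breakdown alongside the overall number shows whether a model is weak across the board or only in specific areas such as ethnoveterinary practice.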
Baseline model: Meta Llama 3.1 8B via Groq.
Next phase: Claude Sonnet, GPT-4o, and Gemini 1.5 Pro for a comparative study. Paper to follow.
CONCLUSION
If you accept that AI advisory tools will be deployed at scale in African agricultural contexts, and they are already being used, then the absence of eval benchmarks for this domain is a real safety gap. Models can pass standard tests and still fail on knowledge domains that matter to specific populations. This is a real problem, especially if these models are actively used in low-resource regions or communities. This is one data point. The paper will have four.
Manifund project if interested: [Manifund]