Benchmarking Open-Source LLMs as Model Evaluators
We evaluate open-source LLMs against proprietary models using benchmarks for instruction adherence and positional bias, finding that open models are closing the performance gap with GPT-4, though GPT-4 still leads in overall consistency and fairness.
Abstract
Using Large Language Models (LLMs) to evaluate other LLMs is emerging as a scalable alternative to human assessment. While promising, this approach faces challenges such as positional bias and fairness concerns. We provide a comprehensive evaluation of open-source LLMs as evaluators, using benchmarks for instruction adherence and positional bias. Our results show that open models are rapidly closing the gap with proprietary models such as GPT-4. Although some open-source models match GPT-4 on metrics such as extraction success rate, GPT-4 still leads in overall consistency and fairness. This study highlights the potential of open-source models for evaluation tasks while identifying areas where they still lag behind their proprietary counterparts.
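To make the positional-bias notion concrete, the sketch below shows one common way such a check can be set up: the same pair of candidate answers is judged in both orderings, and a verdict only counts as consistent if it survives the swap. This is a minimal illustration, not the benchmark used in this study; the `Judge` callable, the `positional_consistency` function, and the toy judge are all hypothetical names introduced here for demonstration.

```python
# Minimal sketch of a positional-bias consistency check for an LLM judge.
# `judge` is any callable that takes (question, answer_a, answer_b) and returns
# "A" or "B" for the preferred answer; the stub below is a stand-in, not a real model.
from typing import Callable, List, Tuple

Judge = Callable[[str, str, str], str]


def positional_consistency(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of comparisons where the judge's verdict survives swapping the
    order of the two candidate answers (higher = less positional bias)."""
    consistent = 0
    for question, ans_1, ans_2 in pairs:
        first = judge(question, ans_1, ans_2)   # ans_1 shown in position "A"
        second = judge(question, ans_2, ans_1)  # ans_1 shown in position "B"
        # A consistent judge prefers the same underlying answer in both orders.
        if (first == "A" and second == "B") or (first == "B" and second == "A"):
            consistent += 1
    return consistent / len(pairs) if pairs else 0.0


if __name__ == "__main__":
    # Toy judge that always picks the first answer shown -- maximally position-biased.
    always_first: Judge = lambda q, a, b: "A"
    demo_pairs = [("What is 2+2?", "4", "5"), ("Capital of France?", "Paris", "Lyon")]
    print(positional_consistency(always_first, demo_pairs))  # prints 0.0
```

A real judge would be queried via a model API rather than a lambda, but the scoring logic stays the same: swap the answer order and measure how often the verdict flips.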