Crowdsourced AI benchmarks have serious flaws, some experts say | TechCrunch
Briefly

AI labs are increasingly using crowdsourced platforms like Chatbot Arena to benchmark their models, but experts say the practice raises serious ethical and academic concerns. Emily Bender critiques the lack of construct validity in these assessments, arguing that votes for one output over another have not been shown to correlate with users' actual preferences. Asmelash Teka Hadgu flags that AI labs often misuse these benchmarks to promote exaggerated claims about their models' capabilities. Some experts instead advocate dynamic benchmarks distributed across multiple independent entities and tailored to distinct use cases, such as education and healthcare.
"To be valid, a benchmark needs to measure something specific, and it needs to have construct validity - that is, there has to be evidence that the construct of interest is well-defined and that the measurements actually relate to the construct."
"Benchmarks should be dynamic rather than static data sets, distributed across multiple independent entities, such as organizations or universities, and tailored specifically to distinct use cases, like education, healthcare, and others."
Read at TechCrunch