Contra Labs - Human Creativity Benchmark
Briefly

Professional creatives produce two evaluation signals when judging AI-generated work: convergence and divergence. Convergence occurs when evaluators agree on what works, indicating shared best practices such as readable typography, functional layout, and clear visual hierarchy. Divergence occurs when evaluators disagree, reflecting real differences in taste, aesthetic direction, and creative intent rather than simple mistakes. Many benchmarks treat disagreement as noise and try to resolve it with majority voting, adjudication, or reconciliation, which assumes an objective ground truth. Creative domains lack that ground truth, so resolving disagreement can remove the most informative variation, contributing to generic outputs and mode collapse, where models converge on safe, averaged aesthetics instead of distinct directions.

"When professional creatives evaluate AI-generated work, their judgments produce two distinct signals. The first is convergence: evaluators agree on what works, revealing shared best practices like readable typography, functional layout, and strong visual hierarchy. The second is divergence: evaluators disagree, and that disagreement reflects genuine differences in taste, aesthetic direction, and creative intent. Most AI benchmarks treat the second signal as noise to be resolved. The Human Creativity Benchmark separates the two, distinguishing where a model needs to be correct from where it needs to be steerable toward taste, and finds that no current model is reliably both."
"This distinction matters because creative work has no ground truth. The dimensions on which experts disagree - aesthetic direction, mood, conceptual risk - are not reducible to miscalibration or error [1][2]. Standard evaluation approaches, including majority voting, adjudication, and gold-standard reconciliation, treat evaluator disagreement as something to resolve [3][4]. These methods work where labels have objective answers. In creative domains, they would smooth out the information most worth preserving."
"Work in annotation science has recognized that disagreement can carry signal [5], and frameworks like CrowdTruth have formalized this for labeling tasks [4]. The Human Creativity Benchmark applies that insight to creative evaluation, where the standard resolution strategies are structurally wrong because taste is legitimately distributed across professionals. Flattening it into a single quality score artificially homogenizes an otherwise diverse workflow and creative process, and produces exactly the generic output that professionals already find unusable."
"That homogeneity is already a practical problem. Generative models tend toward mode collapse [6][7]: when multiple models are given the same creative brief, they converge on safe, averaged aesthetics rather than distinctive directions. Cre"