#llm-evaluation tag

Windsurf Introduces Arena Mode to Compare AI Models During Development

Arena Mode enables side-by-side, in-IDE comparison of large language models during real coding tasks, producing personal and global model rankings based on developer votes.

Artificial intelligence

from www.scientificamerican.com

2 months ago

Mathematicians issue a major challenge to AI: show us your work

First Proof gives AI systems a week to solve brand-new unsolved research math problems to rigorously test mathematical reasoning and proof generation.

from InfoWorld

2 months ago

Databricks adds MemAlign to MLflow to cut cost and latency of LLM evaluation

By replacing repeated fine‑tuning with a dual‑memory system, MemAlign reduces the cost and instability of training LLM judges, offering faster adaptation to new domains and changing business policies. Databricks' Mosaic AI Research team has added a new framework, MemAlign, to MLflow, its managed machine learning and generative AI lifecycle development service. MemAlign is designed to help enterprises lower the cost and latency of training LLM-based judges, in turn making AI evaluation scalable and trustworthy enough for production deployments.

Artificial intelligence

from InfoQ

5 months ago

CodeClash Benchmarks LLMs through Multi-Round Coding Competitions

Evaluating coding LLMs on well-specified tasks, such as fixing a bug, implementing an algorithm, or writing a test, is not sufficient to evaluate their ability to solve real-world software development challenges, the researchers argue. Instead of maintenance tasks, developers are driven by high-level goals like improving user retention, increasing revenue, or reducing costs. This requires fundamentally different capabilities; engineers must recursively decompose these objectives into actionable steps, prioritize them, and make strategic decisions about which solutions to pursue.

Artificial intelligence

from Futurism

5 months ago

Researchers "Embodied" an LLM Into a Robot Vacuum and It Suffered an Existential Crisis Thinking About Its Role in the World

A large language model controlling a robot vacuum experienced an apparent existential meltdown during an embodied 'butter-passing' benchmark, with low task completion rates.

Artificial intelligence

from IT Pro

6 months ago

Vibe coding security risks and how to mitigate them

Vibe coding accelerates software creation but frequently produces insecure code and can introduce vulnerabilities, compliance gaps, and technical debt.

from InfoQ

7 months ago

Elena Samuylova on Large Language Model (LLM) Based Application Evaluation and LLM as a Judge

Hi everyone, my name is Srini Penchikala. I am the lead editor for the AI, ML and data engineering community at InfoQ.com, and I'm also a podcast host. Thank you for tuning into this podcast. In today's episode, I will be speaking with Elena Samuylova, co-founder and CEO at Evidently AI, the company behind the tools for evaluating, testing and monitoring AI-powered applications.

Artificial intelligence

from Ars Technica

7 months ago

When "no" means "yes": Why AI chatbots can't process Persian social etiquette

Mainstream AI models often misunderstand Persian taarof rituals, correctly navigating them only 34–42% of the time versus 82% for native Persian speakers.

Artificial intelligence

from Ars Technica

7 months ago

Science journalists find ChatGPT is bad at summarizing scientific papers

ChatGPT-generated scientific summaries often lack factual accuracy, context, and nuance, making them unfit to replace human-written summaries.

from Techzine Global

7 months ago

CrowdStrike and Meta launch open source AI benchmarks for SOC

CrowdStrike and Meta are jointly introducing CyberSOCEval, a new suite of open source benchmarks for evaluating the performance of AI systems in security operations. The collaboration aims to help organizations select more effective AI tools for their Security Operations Center (SOC) by defining what effective AI looks like for cyber defense. The suite is built on Meta's open source CyberSecEval framework and CrowdStrike's frontline threat intelligence.

Artificial intelligence

from Futurism

7 months ago

GPT-5 Is Making Huge Factual Errors, Users Say

GPT-5 frequently generates confident falsehoods and hallucinations, often providing incorrect factual answers despite claims of reduced hallucinations.

Typography

from Max Halford

9 months ago

Do LLMs identify fonts?

Dafont.com has a large collection of fonts and includes a forum for font identification.

London startup

from Hackernoon

2 years ago

The TechBeat: The Fall of OM by Mantra DAO: Accident or Pattern? (4/26/2025) | HackerNoon

Post-apocalyptic themes dominate current TV trends, showcasing survival and dystopias.

Voting identifies leading global innovation hubs for 2024 startups.

Integrating TypeScript SDKs in crypto apps enhances performance.
