
"Artificial intelligence (AI) chatbots are worse at retrieving accurate information and reasoning when trained on large amounts of low-quality content, particularly if the content is popular on social media, finds a preprint posted on arXiv on 15 October. In data science, good-quality data need to meet certain criteria, such as being grammatically correct and understandable, says co-author Zhangyang Wang, who studies generative AI at the University of Texas at Austin. But these criteria fail to capture differences in content quality, he says."
"They looked at how these data affected model reasoning, retrieval of information from long inputs, the ethics of responses and model personality traits. The team reports that models given low-quality data skip steps in their reasoning process - or don't use reasoning at all - resulting in the model providing incorrect information about a topic, or when the authors presented a multiple choice question, the model would pick the wrong answer."
"The findings support a long-held tenet of AI: the importance of data quality, says Mehwish Nasim, an AI researcher at the University of Western Australia in Perth. "Even before people started to work on large language models, we used to say that, if you give garbage to an AI model, it's going to produce garbage," she adds. Garbage in, garbage out"
Large language models trained on large amounts of low-quality social-media content show degraded reasoning and information retrieval. The study defines low-quality data as short, popular social-media posts, or content that is superficial or sensationalist. Models trained on such data skip reasoning steps or do not reason at all, which leads to incorrect answers, including wrong selections on multiple-choice questions. When junk and high-quality data are mixed, the harmful effects grow as the proportion of junk rises. The degradation also extends to retrieval of information from long inputs, the ethics of responses, and model personality traits. To evaluate these effects, the researchers trained open-source models, including Llama 3 and Qwen, on one million public X posts.
Read at Nature