A new study lends weight to the adage that garbage in means garbage out, and suggests that some garbage is worse than others: social-media content, apparently, is the kind that lowers a model's intelligence the most.
Wang and his colleagues wanted to see the effects on large language models (LLMs) of training on low-quality data, defined as short, popular social-media posts or posts containing superficial or sensationalist content. They examined how these data affected model reasoning, retrieval of information from long inputs, the ethics of responses and model personality traits.
The team reports that models given low-quality data skip steps in their reasoning process, or don't use reasoning at all, resulting in the model providing incorrect information about a topic or, when the authors posed a multiple-choice question, picking the wrong answer. In data sets mixing junk and high-quality data, the negative effect on reasoning grew as the proportion of junk data increased.
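The dose-dependent design described above, varying the proportion of junk in an otherwise high-quality corpus, can be sketched as follows. This is a minimal illustration only; `mix_dataset` and the toy documents are hypothetical and not the authors' actual pipeline.

```python
import random

def mix_dataset(clean, junk, junk_ratio, seed=0):
    """Build a fixed-size training corpus with a given share of junk examples.

    Hypothetical helper: samples round(n * junk_ratio) junk documents and
    fills the remainder with clean ones, then shuffles the result.
    """
    rng = random.Random(seed)
    n = len(clean)
    n_junk = round(n * junk_ratio)
    mixed = rng.sample(junk, n_junk) + rng.sample(clean, n - n_junk)
    rng.shuffle(mixed)
    return mixed

# Toy stand-ins for high-quality documents and sensationalist posts.
clean = [f"high-quality doc {i}" for i in range(100)]
junk = [f"viral post {i}" for i in range(100)]

# Sweep the junk proportion, as in the study's mixed-data condition.
for ratio in (0.0, 0.2, 0.5, 0.8, 1.0):
    corpus = mix_dataset(clean, junk, ratio)
    share = sum(d.startswith("viral") for d in corpus) / len(corpus)
    print(f"junk ratio {ratio:.0%}: corpus junk share {share:.0%}")
```

Training one model per ratio and comparing reasoning benchmarks across the sweep is what lets the effect be read as dose-dependent rather than all-or-nothing.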
Most notably, the report describes the damage as an integrity breach that can't be fixed. The decline is deemed irreversible because additional instruction tuning, or retraining with high-quality data, doesn't restore the lost performance: degraded models couldn't close a gap of nearly 20% compared with versions that avoided the garbage data.
Are there better methods for reversing the decline, or restoring intelligence? Until someone finds them, a new market for data-integrity controls has officially been born.