A collaborative effort by AI researchers from prominent institutions has produced a large language model trained solely on publicly available data. The process, described in a recent paper, revealed that the main challenge was not computational power but the human effort required for data curation. The dataset, called Common Pile v0.1, needed extensive manual cleanup and verification to ensure copyright compliance. The resulting model is competitive with established models, suggesting that ethical AI development is achievable, even though the work cannot be sped up simply by adding more compute.
"This isn't a thing where you can just scale up the resources that you have available... all of our stuff was manually annotated at the end of the day and checked by people."
"The result? An AI that admirably stacks up against industry models like Meta's Llama 1 and Llama 2 7B - which is impressive, but those were versions released over two years ago."