The Tech Industry Said It Was "Impossible" to Create AI Based Entirely on Ethically-Sourced Data, So These Scientists Proved Them Wrong in Spectacular Fashion

"This isn't a thing where you can just scale up the resources that you have available... all of our stuff was manually annotated at the end of the day and checked by people."

"The result? An AI that admirably stacks up against industry models like Meta's Llama 1 and Llama 2 7B - which is impressive, but those were versions released over two years ago."

AI researchers from several prominent institutions have collaborated to build a large language model trained solely on publicly available data. According to their recent paper, the main challenge was not computational power but the human effort that data curation required. The dataset, called Common Pile v0.1, needed extensive manual cleanup and verification to ensure copyright compliance. The resulting model holds up against older industry models such as Meta's Llama 1 and Llama 2 7B, showing that ethical AI development is achievable, though labor-intensive.

#ai-ethics #language-models #open-source #data-curation #research-collaboration

Read at Futurism
