Re-LAION 5B Dataset Released: Improving Safety and Transparency in Web-Scale Datasets for Foundation Model Research Through Rigorous Content Filtering
Background and Motivation
The LAION-5B dataset was updated to address critical issues related to potentially illegal content, notably Child Sexual Abuse Material (CSAM), and to ensure the legal compliance of web-scale datasets used in foundation model research.
The Re-LAION 5B Update
Re-LAION 5B removes 2,236 suspect links, including links pointing to CSAM, which were identified by matching against hashes of known illegal content. It is offered in two versions, research and research-safe, which differ in how strictly sensitive content is filtered.
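The hash-matching step described above can be illustrated with a minimal sketch. This is not LAION's actual pipeline: the record fields (`url`, `caption`), the helper names, and the use of SHA-256 over the URL string are all assumptions made for illustration, and a real blocklist would be supplied by an expert organization rather than computed locally.

```python
import hashlib


def url_digest(url: str) -> str:
    # Stand-in hash: SHA-256 of the URL string (an assumption for this sketch;
    # the hash scheme used for the real dataset is not specified here).
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def split_by_blocklist(records, blocked_digests):
    """Split records into (kept, removed) by membership in a hash blocklist."""
    kept, removed = [], []
    for record in records:
        if url_digest(record["url"]) in blocked_digests:
            removed.append(record)
        else:
            kept.append(record)
    return kept, removed


# Hypothetical sample metadata; example.com URLs are placeholders.
records = [
    {"url": "https://example.com/a.jpg", "caption": "a"},
    {"url": "https://example.com/b.jpg", "caption": "b"},
    {"url": "https://example.com/c.jpg", "caption": "c"},
]
# Hypothetical blocklist containing one known-bad digest.
blocked = {url_digest("https://example.com/b.jpg")}

kept, removed = split_by_blocklist(records, blocked)
```

Holding the blocklist in a set keeps each lookup constant-time on average, which matters when the metadata runs to billions of rows.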
Ensuring Ongoing Safety and Compliance
LAION made the metadata of the updated dataset available to third parties so that they can clean their own derivatives of LAION-5B. This improves the safety of derivative datasets and preserves LAION-5B’s usability as a reference dataset for ongoing research.
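One way a third party might use the updated metadata to clean a derivative is to keep only the entries that still appear in the reference set. The sketch below assumes that entries can be matched by URL and that rows are plain dictionaries; both are illustrative assumptions, and `clean_derivative` is a hypothetical helper, not part of any LAION tooling.

```python
def clean_derivative(derivative_rows, reference_urls):
    """Keep only derivative rows whose URL is still present in the reference metadata."""
    allowed = set(reference_urls)  # set gives fast membership checks
    return [row for row in derivative_rows if row["url"] in allowed]


# Hypothetical updated reference metadata (URLs only) and a derivative subset.
reference_urls = [
    "https://example.com/a.jpg",
    "https://example.com/c.jpg",
]
derivative_rows = [
    {"url": "https://example.com/a.jpg", "score": 0.91},
    {"url": "https://example.com/b.jpg", "score": 0.87},  # no longer in the reference set
    {"url": "https://example.com/c.jpg", "score": 0.78},
]

cleaned = clean_derivative(derivative_rows, reference_urls)
```

Filtering by what remains in the reference set, rather than by a list of removed items, means the party cleaning the derivative never needs to handle the list of suspect links itself.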
A Call to Action for the Research Community
LAION encourages researchers and organizations to migrate to the updated version of LAION-5B to ensure safety and legal compliance. It also recommends partnering with expert organizations to obtain the resources needed for effective filtering.
Conclusion
Re-LAION 5B is a significant step forward in LAION’s mission to provide open, transparent, and safe datasets for the machine learning research community, reaffirming its commitment to advancing the field of ML responsibly and ethically.