New research into datasets reveals systematic ethical and legal issues

AI relies on data, but the legal and ethical origins of that data are often unclear. Large language models (LLMs) require substantial amounts of text data, much of it drawn from datasets hosted on platforms like Kaggle, GitHub, and Hugging Face. However, many of these datasets lack clear licensing information, which raises copyright and fair use concerns. The Data Provenance Initiative has audited over 1,800 datasets and found that approximately 70% either lack clear licensing information or carry licenses more permissive than their creators intended. This raises concerns about copyright infringement, commercial usage restrictions, and the obligation to credit data creators. The initiative aims to shed light on dataset origins and patterns. There is also a lack of representation and attribution for datasets from the Global South, as most datasets come from English-speaking countries. The study highlights significant issues with how data is collected, distributed, and used, and it emphasizes the need for responsible and transparent practices.

The Importance of Ethical and Legal Data in AI

AI relies on data, but where does that data come from? Is it legal to use? These are important questions that need to be addressed.

Training machine learning models, like large language models, requires vast amounts of text data. While there are numerous datasets available on platforms like Kaggle, GitHub, and Hugging Face, they come with ethical and legal complexities, primarily related to licensing and fair use.

The Data Provenance Initiative, a collaboration between AI researchers and legal professionals, has conducted an investigation to shed light on the true origins of these datasets.

Key Findings from the Investigation

The initiative reviewed over 1,800 datasets from platforms such as Hugging Face, GitHub, and Papers With Code. The focus was on datasets used for fine-tuning open-source models.

The study revealed that around 70% of these datasets either lacked clear licensing information or had overly permissive licenses that deviated from the creators’ original intentions.

This lack of clarity raises concerns about copyright infringement, commercial usage restrictions, and the obligation to acknowledge the work of the dataset creators.
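
As a rough illustration of the kind of check such an audit involves, the sketch below sorts dataset metadata into licensing buckets and reports the share with no usable license. It is not the initiative's actual tooling: the records, the field names, and the two license sets are hypothetical and deliberately incomplete, and a real audit would also trace each license back to the original source.

```python
from collections import Counter

# Hypothetical metadata records; names and licenses are illustrative only.
DATASETS = [
    {"name": "example-qa", "license": "apache-2.0"},
    {"name": "example-chat", "license": None},
    {"name": "example-summaries", "license": "unknown"},
    {"name": "example-code", "license": "cc-by-nc-4.0"},
]

# Assumed, non-exhaustive buckets keyed by lowercase license identifier.
COMMERCIAL_OK = {"apache-2.0", "mit", "cc-by-4.0", "cc0-1.0"}
NON_COMMERCIAL = {"cc-by-nc-4.0", "cc-by-nc-sa-4.0"}
PLACEHOLDERS = {"", "unknown", "other", "unspecified"}


def classify_license(record):
    """Return a coarse licensing bucket for one metadata record."""
    license_id = (record.get("license") or "").strip().lower()
    if license_id in PLACEHOLDERS:
        return "unspecified"
    if license_id in COMMERCIAL_OK:
        return "commercial-ok"
    if license_id in NON_COMMERCIAL:
        return "non-commercial"
    # Anything unrecognized is flagged for a human to check.
    return "needs-review"


def audit(records):
    """Count how many records fall into each licensing bucket."""
    return Counter(classify_license(r) for r in records)


summary = audit(DATASETS)
unspecified_share = summary["unspecified"] / len(DATASETS)
print(dict(summary))
print(f"Share without a usable license: {unspecified_share:.0%}")
```

Treating unrecognized licenses as "needs-review" rather than guessing keeps the sketch conservative, which matters when the concern is a license that is more permissive than the creator intended.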

Shayne Longpre, a PhD candidate at MIT Media Lab who led the audit, emphasized that the issue lies within the machine-learning community and not with the hosting platforms.

Lawsuits targeting major AI developers such as Meta, Anthropic, and OpenAI have put pressure on them to adopt more transparent data collection practices. Regulations like the EU’s AI Act aim to enforce these practices.

The Data Provenance Initiative’s Solution

The Data Provenance Initiative provides machine learning developers with access to audited datasets. The initiative also analyzes patterns within the datasets, highlighting their geographic and institutional origins.

The analysis found that the majority of datasets come from English-speaking countries in the Global North.

The initiative’s research reveals systematic issues in how data is collected and distributed. It emphasizes the need for standards in tracing data lineage, ensuring proper attribution, and promoting responsible data use.
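
To make "tracing data lineage" and "proper attribution" concrete, here is a minimal sketch of a lineage record. The DatasetRecord class, its fields, and the attribution_chain helper are hypothetical names invented for this example, not a standard proposed by the initiative. The idea is that each dataset lists the datasets it was derived from, so everyone who should be credited can be collected by walking back through its sources.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical lineage record for one dataset."""
    name: str
    creators: list
    license: str
    derived_from: list = field(default_factory=list)


def attribution_chain(record):
    """Collect creators of a dataset and all of its sources, nearest first."""
    credited = []
    queue = deque([record])
    while queue:
        current = queue.popleft()
        for creator in current.creators:
            if creator not in credited:
                credited.append(creator)
        # Visit the direct sources next, then their sources, and so on.
        queue.extend(current.derived_from)
    return credited


# Illustrative data only.
forum_posts = DatasetRecord("forum-posts", ["Creator A"], "cc-by-4.0")
news_text = DatasetRecord("news-text", ["Creator B"], "cc-by-nc-4.0")
mixed = DatasetRecord(
    "instruction-mix", ["Creator C"], "unknown",
    derived_from=[forum_posts, news_text],
)

print(attribution_chain(mixed))
```

A record like this also makes a mismatch visible: the derived dataset here is marked "unknown" even though both of its sources carry explicit licenses.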

The Value of Ethical and Legal Data

Data is a critical resource for AI, and its availability is finite. The study raises the concern that AI models may eventually outgrow existing datasets and start learning from AI-generated text, which could degrade their quality.

High-quality, ethical, and legal data will become increasingly valuable. To stay competitive and evolve with AI, companies must address these issues and leverage AI in their workflows.

How AI Can Transform Your Company

If you want to embrace AI and stay competitive, it’s crucial to address the ethical and legal aspects of data. Here are some practical steps:

Identify Automation Opportunities

Locate key customer interaction points that can benefit from AI. Automating these processes can improve efficiency and customer experience.

Define KPIs

Ensure that your AI initiatives have measurable impacts on business outcomes. Define key performance indicators that align with your goals.

Select an AI Solution

Choose AI tools that meet your specific needs and provide customization options. This ensures that the solution aligns with your business requirements.

Implement Gradually

Start with a pilot project to gather data and test the effectiveness of AI. Gradually expand AI usage based on the insights gained.

For AI KPI management advice and continuous insights into leveraging AI, connect with us at hello@itinai.com. Stay updated on our Telegram channel t.me/itinainews or follow us on Twitter @itinaicom.

Spotlight on a Practical AI Solution: AI Sales Bot

Consider our AI Sales Bot, available at itinai.com/aisalesbot. It is designed to automate customer engagement and manage interactions across all stages of the customer journey.

Discover how AI can redefine your sales processes and customer engagement. Explore our solutions at itinai.com.
