This research paper investigates the prevalence of low-cost machine translation (MT) on the web and its impact on large multi-lingual language models (LLMs). It highlights how much web content is machine translated, how often that content is multi-way parallel, and what this implies for LLMs, raising concerns about quality, bias, and fluency. It also recommends ways to address these challenges.
Unmasking the Web’s Tower of Babel: How Machine Translation Floods Low-Resource Languages with Low-Quality Content
Many modern Artificial Intelligence (AI) models are trained on enormous datasets, ranging from billions to trillions of tokens, a scale that is only possible with web-scraped data. Content on the web is often translated into many languages, and the low quality of these multi-way translations suggests they were primarily created using Machine Translation (MT).
Research Findings
The research paper studies the impact that low-cost MT has on the web and on large multi-lingual language models (LLMs). The analysis suggests that much of the web is machine translated and that translations on the web are highly multi-way parallel, meaning the same sentence appears in translation in many languages at once. Low-resource languages have an average parallelism of 8.6. In addition, these multi-way parallel translations are of significantly lower quality than 2-way parallel translations.
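To make the parallelism figure concrete, here is a minimal Python sketch of how an average parallelism per language could be computed. It assumes that the parallelism of a translation tuple is simply the number of languages in it, and that a language's average is taken over every tuple containing that language. The function name `average_parallelism` and the toy data are illustrative assumptions, not the paper's code or data.

```python
from collections import defaultdict
from statistics import mean


def average_parallelism(translation_tuples):
    """Return the mean tuple size for each language.

    translation_tuples: iterable of dicts mapping a language code to the
    sentence in that language. All sentences in one dict are assumed to
    be translations of each other (an assumption of this sketch).
    """
    sizes = defaultdict(list)
    for tup in translation_tuples:
        parallelism = len(tup)  # number of languages in this tuple
        for lang in tup:
            sizes[lang].append(parallelism)
    return {lang: mean(values) for lang, values in sizes.items()}


# Toy data with placeholder sentences (hypothetical, for illustration only).
toy_tuples = [
    {"en": "a", "fr": "b"},                        # 2-way parallel
    {"en": "c", "fr": "d", "wo": "e", "xh": "f"},  # 4-way parallel
    {"en": "g", "wo": "h", "xh": "i"},             # 3-way parallel
]

result = average_parallelism(toy_tuples)
print(result)
```

In this toy data, the languages that appear only in the larger tuples (`wo`, `xh`) end up with a higher average than `en` or `fr`, which is the pattern the article describes for low-resource languages.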
Furthermore, the findings show that multi-way parallel data generally consists of shorter, more predictable sentences and has a different topic distribution than other data. This raises concerns that multi-lingual LLMs trained on such data will be less fluent and less accurate, and more prone to hallucinations and bias.
Practical Solutions
The researchers suggest that MT detection, in addition to its use in filtering bitext (parallel text), could also be used to filter text in lower-resource languages. This would help remove low-quality data, which in turn could reduce hallucinations and bias and eventually improve the performance of multi-lingual LLMs.
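As a rough illustration of what such filtering might look like, the Python sketch below keeps only sentences whose MT-likelihood score falls below a threshold. The article does not describe a specific detector, so everything here is an assumption: `filter_likely_mt`, the `mt_score` callable, the threshold, and the placeholder scorer, which uses a sentence's multi-way parallelism as a crude proxy for "likely machine translated". A real system would plug in an actual MT detector.

```python
from typing import Callable, Dict, Iterable, List


def filter_likely_mt(
    sentences: Iterable[str],
    mt_score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep sentences whose MT-likelihood score is below the threshold."""
    return [s for s in sentences if mt_score(s) < threshold]


def make_parallelism_scorer(
    parallelism: Dict[str, int], cap: int = 8
) -> Callable[[str], float]:
    """Build a placeholder scorer (an assumption, not a real MT detector).

    The score is the sentence's known parallelism, clipped to `cap` and
    scaled to the range (0, 1]. Unknown sentences count as parallelism 1.
    """
    def score(sentence: str) -> float:
        return min(parallelism.get(sentence, 1), cap) / cap
    return score


# Hypothetical parallelism counts for placeholder sentences.
known_parallelism = {"s1": 2, "s2": 8, "s3": 5}
scorer = make_parallelism_scorer(known_parallelism)

kept = filter_likely_mt(["s1", "s2", "s3", "s4"], scorer)
print(kept)
```

Passing the scorer in as a parameter keeps the filtering step independent of the detection method, so the parallelism proxy can be swapped for a trained classifier without changing the filter.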