LLM360 Group Introduces TxT360: A Top-Quality LLM Pre-Training Dataset with 15T Tokens

Introduction to TxT360: A Revolutionary Dataset

The quality of pre-training data largely determines how well a large language model (LLM) can understand and generate human-like text. LLM360 has released TxT360, a pre-training dataset of 15 trillion tokens. The dataset is notable for its diversity, scale, and thorough data filtering, which makes it one of the most advanced open-source datasets available.

A Dataset Built on New Foundations

TxT360 stands out by combining web data with curated sources such as FreeLaw (legal texts), PG-19 (a collection of books), scientific papers, and Wikipedia. This mix gives the dataset broader domain coverage than web data alone, which should benefit LLMs trained on it.

From Common Crawl to Clean Data

The development of TxT360 started with Common Crawl, a publicly available corpus of crawled web pages. To meet its quality standards, LLM360 applied a multi-stage filtering process:

Text Extraction: Clean and coherent text was extracted from noisy web data.
Language Filtering: Non-English content was removed for consistency.
URL Filtering: Low-value sources were eliminated, including spammy sites.
Repetition Removal: Efforts were made to remove repeated lines and paragraphs.
Document and Line-Level Filtering: Heuristics ensured only quality documents remained.

As a result, 97.65% of the original data was filtered out, leaving roughly 2.35% of the crawl as high-quality text for training robust language models.
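The stages above can be illustrated with a minimal Python sketch. This is not LLM360's code: the blocklist, thresholds, and function names (`url_allowed`, `drop_repeated_lines`, `passes_document_heuristics`, `filter_document`) are hypothetical choices made for illustration, and text extraction and language filtering are omitted because they typically rely on external tools.

```python
from collections import Counter
from typing import Optional
from urllib.parse import urlparse

# Hypothetical blocklist; the actual URL filters are not described in this article.
BLOCKED_DOMAINS = {"spam.example", "ads.example"}


def url_allowed(url: str) -> bool:
    """URL filtering: drop documents whose host is on the blocklist."""
    host = urlparse(url).hostname or ""
    return host not in BLOCKED_DOMAINS


def drop_repeated_lines(text: str, max_repeats: int = 1) -> str:
    """Repetition removal: keep at most `max_repeats` copies of each non-empty line."""
    seen = Counter()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        if key:
            seen[key] += 1
            if seen[key] > max_repeats:
                continue
        kept.append(line)
    return "\n".join(kept)


def passes_document_heuristics(
    text: str, min_words: int = 5, min_alpha_ratio: float = 0.6
) -> bool:
    """Document-level filtering: require a minimum length and mostly alphabetic text.

    Both thresholds are assumed values, not figures from TxT360.
    """
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    non_space = sum(not ch.isspace() for ch in text)
    return non_space > 0 and alpha / non_space >= min_alpha_ratio


def filter_document(url: str, text: str) -> Optional[str]:
    """Run the stages in order; return cleaned text, or None if the document is dropped."""
    if not url_allowed(url):
        return None
    cleaned = drop_repeated_lines(text)
    if not passes_document_heuristics(cleaned):
        return None
    return cleaned
```

Running cheap checks such as the URL filter first means the more expensive text-level heuristics only see documents that have already survived the earlier stages.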

Global Deduplication

LLM360 then deduplicated the data globally using two methods: exact deduplication with a Bloom filter, which catches identical documents, and fuzzy deduplication with a MinHash algorithm, which catches near-duplicates. Removing duplicates keeps models from training on the same content over and over.
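A small Python sketch may help make the two techniques concrete. It is an illustration under stated assumptions, not LLM360's implementation: the filter size, number of hash functions, three-word shingles, and all names (`BloomFilter`, `exact_dedup`, `minhash_signature`, `estimated_jaccard`) are hypothetical, and a production pipeline would add locality-sensitive hashing to avoid comparing every pair of signatures.

```python
import hashlib


def _hash64(seed: int, item: str) -> int:
    """Seeded 64-bit hash built from SHA-256 (an assumed choice for this sketch)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            yield _hash64(seed, item) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def exact_dedup(docs):
    """Exact deduplication: keep the first copy of each document."""
    seen = BloomFilter()
    unique = []
    for doc in docs:
        if doc in seen:
            continue
        seen.add(doc)
        unique.append(doc)
    return unique


def shingles(text: str, n: int = 3):
    """Split text into overlapping n-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text: str, num_perm: int = 64):
    """MinHash signature: the minimum seeded hash over all shingles, per seed."""
    grams = shingles(text)
    return [min(_hash64(seed, g) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In use, documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates, with one copy kept. A Bloom filter trades a small false-positive rate (a few unique documents wrongly dropped) for far lower memory use than storing every document hash.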

High-Quality Sources

After filtering, LLM360 included carefully selected high-quality sources, such as scientific papers, legal documents, classic literature, and curated Wikipedia entries. Each source was processed to maintain data integrity and quality, enabling language models to cover a wide range of topics effectively.

TxT360: A New Era for Open-Source AI

The launch of TxT360 represents a major advancement in AI and natural language processing (NLP) research. LLM360’s careful construction and filtering show that quality and quantity can go hand in hand. With 15 trillion tokens, TxT360 supports the creation of nuanced and intelligent language models.

LLM360’s transparency about their processes sets a new benchmark in the industry. They plan to release their codebase, providing insights into the methodologies behind this impressive dataset.


