SmolDocling: IBM and Hugging Face’s 256M Open-Source Vision Language Model for Document OCR

Challenges in Document Conversion

Converting complex documents into structured data has been a significant challenge in computer science. Traditional methods, such as ensemble systems and large foundational models, often face issues like fine-tuning difficulties, generalization problems, hallucinations, and high computational costs. Ensemble systems may excel in specific tasks but struggle to generalize due to reliance on handcrafted pipelines. Meanwhile, multimodal foundational models, while powerful, can be costly and unreliable.

Introducing SmolDocling

Researchers from IBM and Hugging Face have developed SmolDocling, a 256M open-source vision-language model (VLM) tailored for multi-modal document conversion. Unlike larger models, SmolDocling simplifies the process by handling entire pages with a single model, reducing complexity and resource requirements. Its compact design, with only 256 million parameters, makes it lightweight and efficient.

Innovative Features

SmolDocling utilizes a universal markup format called DocTags, which effectively captures page elements, structures, and spatial contexts. Built on Hugging Face’s SmolVLM-256M architecture, it minimizes computational demands through optimized tokenization and visual feature compression. The innovative DocTags format allows for clear separation of document layout, text, and visual elements like equations and charts.

Performance and Efficiency

SmolDocling demonstrates exceptional performance in benchmark tests, outperforming larger models in various document conversion tasks. For instance, it achieved a lower edit distance (0.48) and higher F1-score (0.80) in full-page document OCR tasks compared to models with significantly more parameters. It also excelled in equation transcription and code snippet recognition, setting new benchmarks in precision and recall.

Versatile Applications

What distinguishes SmolDocling from other OCR solutions is its ability to manage diverse document elements, including complex items like code, charts, and equations. It effectively handles a wide range of documents, from scientific papers to patents and business forms. By providing structured metadata through DocTags, it enhances usability and eliminates ambiguity found in formats like HTML or Markdown.

Conclusion

SmolDocling marks a significant advancement in document conversion technology, proving that compact models can outperform larger counterparts in critical tasks. The research demonstrates how targeted training and innovative data formats can address traditional challenges. SmolDocling sets a new standard for efficiency and versatility in OCR technologies, offering valuable resources for the community with openly available datasets and a compact model architecture.

Next Steps

Explore how AI can transform your business processes. Identify areas for automation, assess key performance indicators (KPIs), and choose tools that align with your objectives. Start with small projects to evaluate effectiveness before scaling up your AI initiatives.

Contact Us

If you need assistance with managing AI in your business, reach out to us at hello@itinai.ru. Connect with us on Telegram, X, and LinkedIn.

Unleash Your Creative Potential with AI Agents

Competitors are already using AI Agents

Business Problems We Solve

Automation of internal processes.
Optimizing AI costs without huge budgets.
Training staff, developing custom courses for business needs
Integrating AI into client work, automating first lines of contact

Large and Medium Businesses

Startups

Offline Business

Get a plan to reduce routine and improve metrics

100% of clients report increased productivity and reduced operati

AI Agents

Localization Project Manager – Coordinating translation workflows, answering vendor or process-related questions.

Job Title: Localization Project Manager Overview The Localization Project Manager plays a vital role in coordinating translation workflows while addressing vendor and process-related queries. This position is crucial for ensuring that translation projects are executed efficiently…
AI Agents

Environmental Health & Safety Officer – Answering compliance-related questions, retrieving safety protocols or audit histories.

Professional Summary The AI-driven Environmental Health & Safety Officer is a reliable and effective digital team member that performs repetitive and time-consuming tasks with remarkable speed, accuracy, and stability. By automating these tasks, it frees up…
AI Agents

Legal Contract Reviewer – Auto-flagging clause inconsistencies or retrieving precedent cases for review.

Job Title: Legal Contract Reviewer – Auto-flagging Clause Inconsistencies or Retrieving Precedent Cases for Review The AI functions as a reliable and effective digital team member that excels in performing repetitive and time-consuming tasks. With remarkable…
AI Agents

Customer Retention Analyst – Creating customer summaries, identifying churn risk patterns, and suggesting retention steps.

Customer Retention Analyst Professional Summary A highly analytical and detail-oriented Customer Retention Analyst with a proven track record in creating comprehensive customer summaries, identifying churn risk patterns, and suggesting effective retention strategies. Adept at leveraging data-driven…

Itinai.com httpss.mj.runmrqch2uvtvo russian handsome charisma 9fdbb2d5 a55b 425d 8f3b 76d26f86710f 2

AI Business Accelerator

Start Your AI Business in Just a Week with itinai.com

You’re a great fit if you:

Have an audience (even 500+ followers in Instagram, email, etc.)
Have an idea, service, or product you want to scale
Can invest 2–3 hours a day
You’re motivated to earn with AI but don’t want to handle technical setup

AI news and solutions

AI finally solves the mystery behind a Renaissance painting

Researchers used machine learning to determine the true artist of the Renaissance painting Madonna della Rosa. While there were lingering doubts, a machine learning model developed by Professor Ugail identified high probabilities that certain parts were…

AI Tech News
XElemNet: A Machine Learning Framework that Applies a Suite of Explainable AI (XAI) for Deep Neural Networks in Materials Science

Advancements in Deep Learning for Material Sciences Transforming Material Design Deep learning has greatly improved material sciences by predicting material properties and optimizing compositions. This technology speeds up material design and allows for exploration of new…

AI Tech News
How to Use Git and Git Bash Locally: A Complete Guide

Using Git and Git Bash: A Business Guide Using Git and Git Bash Locally: A Business Guide Table of Contents Introduction Installation Windows macOS Linux Basic Git Commands Git Configuration Git Workflow Creating a Repository Committing…

AI Tech News
Fondant AI Releases Fondant-25M Dataset of Image-Text Pairs with a Creative Commons License

Researchers have developed an open-source framework called Fondant to simplify and accelerate large-scale data processing. It includes embedded tools for data download, exploration, and processing. They have also created a data-processing pipeline to generate datasets of…

AI Tech News
LG AI Research Open-Sources EXAONE 3.0: A 7.8B Bilingual Language Model Excelling in English and Korean with Top Performance in Real-World Applications and Complex Reasoning

Introduction to EXAONE 3.0: The Vision and Objectives EXAONE 3.0 is a significant advancement in LG AI Research’s language models, designed to democratize access to expert-level AI capabilities. Its release marked the introduction of the EXAONE…

AI Tech News
NV-Embed: NVIDIA’s Groundbreaking Embedding Model Dominates MTEB Benchmarks

NV-Embed: NVIDIA’s Groundbreaking Embedding Model Dominates MTEB Benchmarks NVIDIA has recently introduced NV-Embed on Hugging Face, a revolutionary embedding model poised to redefine the landscape of NLP. This model, characterized by its impressive versatility and performance,…

AI Tech News
NavGPT-2: Integrating LLMs and Navigation Policy Networks for Smarter Agents

NavGPT-2: Integrating LLMs and Navigation Policy Networks for Smarter Agents NavGPT-2 effectively combines Large Language Models (LLMs) and Vision-and-Language Navigation (VLN) tasks to enhance navigation capabilities. Practical Solutions and Value NavGPT-2 overcomes the limitations of integrating…

AI Tech News
Quantum Tunneling Meets AI: How Deep Neural Networks are Transforming Optical Applications

Understanding Quantum Tunneling and AI The quantum tunneling (QT) effect, discovered in the 1920s, is a key advancement in quantum mechanics. Unlike human brains, artificial intelligence (AI) struggles to interpret complex visual illusions, such as the…

AI Tech News
GPZ: Revolutionizing Particle Data Compression with GPU Acceleration for Researchers

Understanding the Target Audience The primary audience for GPZ consists of researchers and practitioners in fields such as cosmology, geology, molecular dynamics, and 3D imaging. These professionals confront significant challenges related to managing large-scale scientific datasets,…

AI Tech News
Think While You Write Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation

Enhance Knowledge-to-Text Generation with TWEAK Neural knowledge-to-text generation models often struggle to faithfully generate descriptions for the input facts. To address this, we propose a novel decoding method, TWEAK (Think While Effectively Articulating Knowledge), which reduces…

AI Tech News
Introducing GS-LoRA++: A Novel Approach to Machine Unlearning for Vision Tasks

Understanding the Importance of Pre-Trained Vision Models Pre-trained vision models play a crucial role in advanced computer vision tasks, such as: Image Classification Object Detection Image Segmentation The Challenge of Data Management As we gather more…

AI Tech News
Microsoft expected to post its best quarterly revenue growth in two years

Microsoft is poised for its best quarterly growth in nearly two years, with a projected 15.8% revenue rise. Its alliance with OpenAI has propelled it to a $3 trillion valuation, establishing dominance in AI. Analysts project…

AI Tech News
Beyond a Single LLM: Advancing AI Through Multi-Model Collaboration

The Evolution of Language Models The rapid advancement of Large Language Models (LLMs) is fueled by the belief that larger models and datasets will lead to human-like intelligence. As these models shift from research to commercial…

AI Tech News
Humane, an OpenAI and Apple collaboration, drop the “AI Pin”

Humane, a startup led by former Apple innovators, has unveiled the AI Pin, a wearable projector priced at $699. The device functions as a personal assistant and comes with features like ultrawide camera capabilities, text/email communication,…

AI Tech News
Content-Adaptive Tokenizer (CAT): An Image Tokenizer that Adapts Token Count based on Image Complexity, Offering Flexible 8x, 16x, or 32x Compression

Overcoming Challenges in AI Image Modeling One major challenge in AI image modeling is the difficulty in handling the variety of image complexities. Current methods use static compression ratios, treating all images the same. This leads…

AI Tech News
Upstage Unveils Solar-10.7B: Pioneering Large Language Models with Depth Up-Scaling and Fine-Tuned Precision for Single-Turn Conversations

Upstage introduces Solar-10.7B, a groundbreaking language model with 10.7 billion parameters, balancing size and performance. It employs the Llama 2 architecture and Upstage Depth Up-Scaling technique, outperforming larger models. The fine-tuned SOLAR-10.7B-Instruct-v1.0 excels in single-turn conversations…

AI Tech News
DataDecide: A Benchmark Suite for Optimizing LLM Pretraining Data Selection

Enhancing AI Model Performance Through Data Optimization Enhancing AI Model Performance Through Data Optimization Understanding the Challenge of Data Selection in LLM Pretraining Creating large language models (LLMs) requires significant computational resources, particularly when testing various…

AI Tech News
Alibaba Researchers Introduce Qwen-Audio Series: A Set of Large-Scale Audio-Language Models with Universal Audio Understanding Abilities

Alibaba Group’s Qwen-Audio series introduces large-scale audio-language models with universal understanding across diverse audio types and tasks. Overcoming prior limitations, Qwen-Audio excels in various benchmarks without fine-tuning, while Qwen-Audio-Chat extends capabilities for versatile human interaction. Future…

AI Tech News
Google Search Enhances User Experience with Gemini 2.5 Pro and Deep Search AI Upgrades

Google Search Just Got a Major AI Upgrade: Gemini 2.5 Pro, Deep Search, and Agentic Intelligence Google is transforming how we interact with Search. With the recent rollout of Gemini 2.5 Pro, Deep Search, and a…

AI Tech News
torchao: A PyTorch Native Library that Makes Models Faster and Smaller by Leveraging Low Bit Dtypes, Quantization and Sparsity

torchao: Enhancing PyTorch Models with Advanced Optimization Practical Solutions and Value Highlights: Optimized Performance: Achieve up to 97% speedup and reduced memory usage during model inference and training. Quantization Techniques: Utilize low-bit dtypes like int4 and…

AI Tech News