Groundlight Launches Open-Source AI Framework for Visual Reasoning Agents

Challenges in Visual Language Models (VLMs)

Modern VLMs face difficulties with complex visual reasoning tasks, where simply understanding an image is not enough. Recent improvements in text-based reasoning have not been matched in the visual domain. VLMs often struggle to combine visual and textual information for logical deductions, revealing a significant gap in their capabilities. This is especially true for tasks requiring stepwise reasoning, where recognizing objects alone is insufficient without understanding their relationships and context.

Current Research Limitations

Most research on multimodal AI has concentrated on object detection, captioning, and question answering, with little focus on advanced reasoning. Some work has tried to enhance VLMs through chain-of-thought prompting or explicit reasoning structures, but these methods are often limited to textual data or do not generalize well across visual tasks. Many open-source initiatives in this field are also still underdeveloped, which slows progress in visual reasoning beyond basic recognition tasks.

Innovative Approaches by Groundlight Researchers

Groundlight researchers investigated training VLMs for visual reasoning with reinforcement learning, using Group Relative Policy Optimization (GRPO) to make training more efficient. They designed a cryptogram-solving task that requires both visual and textual processing and report 96% accuracy with a 3B-parameter model. Attention analysis showed that the model actively engages with the visual input, focusing on the relevant regions while solving the task.

Challenges in Training VLMs

Training VLMs with GRPO presents challenges, particularly in tokenization and reward design. Because models process text as tokens rather than individual characters, tasks that need precise character-level reasoning can be problematic. To address this, the researchers formatted messages with spaces between letters. Reward design was equally critical. They used three rewards: a format reward for output consistency, a decoding reward for meaningful transformations, and a correctness reward for accuracy. Balancing these rewards kept the model from exploiting unintended shortcuts, so that improvements reflected genuine progress in cryptogram solving.
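The letter spacing and the three rewards described above can be illustrated with a minimal sketch. It is not Groundlight's implementation: the `<answer>` tag convention, the function names, the similarity-based decoding reward, and the weights are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical output convention: the final answer sits inside <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def space_letters(message: str) -> str:
    """Put a space between every character so letters tend to tokenize separately."""
    return " ".join(message)


def normalize(text: str) -> str:
    """Drop all whitespace and uppercase, so spaced and unspaced answers compare equal."""
    return "".join(text.split()).upper()


def extract_answer(completion: str) -> Optional[str]:
    """Return the text inside the (assumed) answer tags, or None if they are missing."""
    match = ANSWER_PATTERN.search(completion)
    return match.group(1) if match else None


def format_reward(completion: str) -> float:
    """1.0 when the completion follows the expected output format, else 0.0."""
    return 1.0 if extract_answer(completion) is not None else 0.0


def decoding_reward(completion: str, target: str) -> float:
    """Partial credit for a decoding close to the target (assumed similarity metric)."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    return SequenceMatcher(None, normalize(answer), normalize(target)).ratio()


def correctness_reward(completion: str, target: str) -> float:
    """1.0 only when the decoded message matches the target exactly."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if normalize(answer) == normalize(target) else 0.0


def total_reward(completion: str, target: str,
                 weights: tuple = (0.2, 0.3, 0.5)) -> float:
    """Weighted sum of the three rewards; the weights are illustrative only."""
    w_format, w_decode, w_correct = weights
    return (w_format * format_reward(completion)
            + w_decode * decoding_reward(completion, target)
            + w_correct * correctness_reward(completion, target))
```

For example, `space_letters("HI THERE")` separates every letter, and a completion that only gets the format right still earns less than one that also decodes the message correctly. Keeping the rewards separate makes it easier to see which behavior a training run is actually rewarding.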

Advantages of GRPO

GRPO generates several responses for each query and scores them against one another, so each output is judged relative to its group rather than by a separately learned value estimate. According to the researchers, this leads to more stable training and smoother learning curves. The research also highlighted the potential of VLMs in reasoning tasks while acknowledging the high computational cost of complex vision models. To improve efficiency, the team proposed techniques such as selective model escalation, in which advanced models are used only for ambiguous cases. Integrating pre-trained models for object detection, segmentation, and depth estimation can also improve reasoning without significantly increasing computational demands.

Conclusion and Future Directions

The Groundlight team has made notable progress in enhancing VLMs through reinforcement learning techniques, particularly GRPO. Their successful application in a cryptogram-solving task demonstrates the potential of integrating visual and textual data to boost VLM performance. By open-sourcing their methodology and tools, Groundlight aims to empower the broader community to advance visual reasoning capabilities in AI systems.

Explore Further

Check out the Technical details, GitHub Page, and Demo. All credit for this research goes to the researchers of this project.


