Singapore University of Technology and Design (SUTD) Explores Advancements and Challenges in Multimodal Reasoning for AI Models Through Puzzle-Based Evaluations and Algorithmic Problem-Solving Analysis

Advancements in AI Multimodal Reasoning

Overview of Current Research

After the success of large language models (LLMs), research is now focusing on multimodal reasoning, which combines vision and language and is widely regarded as a key step toward artificial general intelligence (AGI). New cognitive benchmarks like PuzzleVQA and AlgoPuzzleVQA are designed to test AI’s ability to understand complex visual information and solve algorithmic problems.

Challenges in Multimodal Reasoning

Despite advancements, LLMs still face difficulties in multimodal reasoning, especially in recognizing patterns and solving spatial problems. High computational costs add to these challenges. Previous evaluations using symbolic benchmarks did not adequately test AI’s ability to handle multimodal inputs.

New Evaluation Datasets

Recent datasets like PuzzleVQA and AlgoPuzzleVQA assess AI’s skills in abstract visual reasoning and algorithmic problem-solving. These require models to integrate visual perception, logical deduction, and structured reasoning.

Research Findings

Researchers from the Singapore University of Technology and Design (SUTD) evaluated OpenAI’s GPT and o-series models on multimodal puzzle-solving tasks. They aimed to identify gaps in AI’s perception and reasoning skills by comparing GPT-4-Turbo, GPT-4o, and o1 on the new datasets.

Key Datasets Used

– **PuzzleVQA**: Focuses on recognizing patterns in numbers, shapes, colors, and sizes.
– **AlgoPuzzleVQA**: Involves logical deduction and computational reasoning tasks.

Evaluation Methodology

The evaluation included multiple-choice and open-ended questions. A zero-shot Chain of Thought (CoT) prompting method was used for reasoning. The study analyzed performance drops when switching from multiple-choice to open-ended tasks.
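The two question formats can be illustrated with a minimal sketch. It assumes the widely used zero-shot CoT trigger phrase "Let's think step by step." and a simple exact-match scorer; the function names `build_prompt` and `accuracy` are hypothetical, and the study's actual prompts and scoring may differ.

```python
from typing import Optional, Sequence

# Common zero-shot CoT trigger phrase; assumed here, not quoted from the study.
COT_TRIGGER = "Let's think step by step."


def build_prompt(question: str, options: Optional[Sequence[str]] = None) -> str:
    """Build a zero-shot CoT prompt; options=None yields the open-ended variant."""
    lines = [f"Question: {question}"]
    if options:
        labels = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        lines.append("Options:")
        lines.extend(f"({labels[i]}) {opt}" for i, opt in enumerate(options))
    lines.append(f"Answer: {COT_TRIGGER}")
    return "\n".join(lines)


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Exact-match accuracy after trimming whitespace and lowercasing."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    hits = sum(p.strip().lower() == a.strip().lower()
               for p, a in zip(predictions, answers))
    return hits / len(answers)
```

Calling `build_prompt(question, options)` gives the multiple-choice form, while `build_prompt(question)` drops the options so the model has to produce the answer itself. That missing list of candidate answers is what makes the open-ended setting harder.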

Results and Observations

– **Improvement in Reasoning**: There was a noticeable improvement in reasoning capabilities from GPT-4-Turbo to GPT-4o and o1, with o1 showing the most significant advancements, especially in algorithmic reasoning.
– **Performance Metrics**:
  – In PuzzleVQA, o1 achieved 79.2% accuracy in multiple-choice tasks, outperforming GPT-4o and GPT-4-Turbo.
  – In open-ended tasks, all models showed performance drops, with o1 at 66.3%.
  – In AlgoPuzzleVQA, o1 scored 55.3% in multiple-choice tasks, significantly better than previous models.
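To make the size of the multiple-choice to open-ended drop concrete, the short sketch below works through the arithmetic for o1. It assumes the 66.3% open-ended figure refers to PuzzleVQA, as its position in the list suggests; the helper name `accuracy_drop` is hypothetical.

```python
from typing import Tuple


def accuracy_drop(mc_acc: float, open_acc: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = mc_acc - open_acc
    relative = 100.0 * absolute / mc_acc
    return round(absolute, 1), round(relative, 1)


# o1 figures reported above; the open-ended value is assumed to be for PuzzleVQA.
abs_drop, rel_drop = accuracy_drop(79.2, 66.3)
print(f"o1 drop: {abs_drop} points ({rel_drop}% relative)")
# → o1 drop: 12.9 points (16.3% relative)
```

Under that assumption, removing the answer options costs o1 about 12.9 percentage points, or roughly a sixth of its multiple-choice accuracy.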

Identified Limitations

Perception was a major challenge across all models. Providing explicit visual details improved accuracy significantly. Inductive reasoning guidance also enhanced performance, particularly in numerical and spatial tasks. While o1 excelled in numerical reasoning, it struggled with shape-based puzzles.
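One way to picture the guidance experiments is as extra text prepended to the question. The sketch below is only an illustration under assumed names and wording: `add_guidance`, the field labels, and the example puzzle are hypothetical and not taken from the study.

```python
from typing import Optional


def add_guidance(question: str,
                 visual_details: Optional[str] = None,
                 pattern_hint: Optional[str] = None) -> str:
    """Prepend optional perception and inductive-reasoning guidance to a question."""
    parts = []
    if visual_details:
        # Perception guidance: describe what is in the image explicitly.
        parts.append(f"Image description: {visual_details}")
    if pattern_hint:
        # Inductive guidance: state the underlying pattern to look for.
        parts.append(f"Pattern hint: {pattern_hint}")
    parts.append(f"Question: {question}")
    return "\n".join(parts)


# Hypothetical example puzzle, for illustration only.
prompt = add_guidance(
    "What is the missing number?",
    visual_details="Three circles labelled 2, 4 and ?",
    pattern_hint="Each number doubles the previous one.",
)
```

Comparing accuracy with and without each kind of guidance is a simple way to tell whether errors stem from perceiving the image or from reasoning about it.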

Conclusion

The study highlights the progress and ongoing challenges in AI multimodal reasoning. For businesses looking to leverage AI, consider the following practical steps:

– **Identify Automation Opportunities**: Find customer interaction points that can benefit from AI.
– **Define KPIs**: Ensure measurable impacts on business outcomes.
– **Select an AI Solution**: Choose tools that fit your needs and allow customization.
– **Implement Gradually**: Start with a pilot project, gather data, and expand AI usage wisely.
