IBM Researchers Introduce ACPBench: An AI Benchmark for Evaluating Reasoning Tasks in the Field of Planning

Understanding LLMs and Their Role in Planning

Large Language Models (LLMs) are becoming increasingly important as industries explore artificial intelligence for better planning and decision-making. Generative foundation models in particular are being applied to complex reasoning tasks. However, better benchmarks are still needed to evaluate their reasoning and decision-making capabilities effectively.

Challenges in Evaluating LLMs

Despite advancements, validating these models remains difficult because they evolve so quickly. For instance, a model that produces an answer satisfying a goal has not necessarily demonstrated real planning ability. In addition, real-world scenarios often admit multiple possible plans, which complicates evaluation. Researchers worldwide are working to make LLMs effective planners, which highlights the need for robust benchmarks that measure their reasoning capabilities.

Introducing ACPBench

ACPBench is a benchmark developed by IBM Research for evaluating how well LLMs reason about planning. It covers seven reasoning tasks across 13 planning domains:

  • Applicability: Identifies which actions are valid in a given state.
  • Progression: Determines the state that results from applying an action.
  • Reachability: Assesses whether a goal or subgoal can eventually be reached from a given state through a sequence of actions.
  • Action Reachability: Assesses whether an action can eventually be carried out, that is, whether its prerequisites can be met.
  • Validation: Evaluates whether a sequence of actions is valid and achieves the goal.
  • Justification: Determines whether an action in a plan is necessary.
  • Landmarks: Identifies subgoals that must be achieved in order to reach the main goal.
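To make three of these tasks concrete, here is a minimal Python sketch of applicability, progression, and validation on a toy planning model. It is an illustration only, not ACPBench's implementation: the `Action` class, the helper functions, and the two-block domain are all hypothetical, and states are modelled simply as sets of facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A hypothetical STRIPS-style action over sets of facts."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def is_applicable(state, action):
    # Applicability: every precondition must hold in the state.
    return action.preconditions <= state


def progress(state, action):
    # Progression: remove deleted facts, then add new ones.
    if not is_applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete_effects) | action.add_effects


def validate(state, plan, goal):
    # Validation: each step must be applicable in turn,
    # and the goal must hold in the final state.
    for action in plan:
        if not is_applicable(state, action):
            return False
        state = progress(state, action)
    return goal <= state


# Toy two-block domain (assumed for illustration).
pickup_a = Action(
    "pickup-a",
    frozenset({"clear-a", "ontable-a", "handempty"}),
    frozenset({"holding-a"}),
    frozenset({"clear-a", "ontable-a", "handempty"}),
)
stack_a_b = Action(
    "stack-a-b",
    frozenset({"holding-a", "clear-b"}),
    frozenset({"on-a-b", "clear-a", "handempty"}),
    frozenset({"holding-a", "clear-b"}),
)

initial_state = frozenset(
    {"clear-a", "clear-b", "ontable-a", "ontable-b", "handempty"}
)
goal = frozenset({"on-a-b"})

print(is_applicable(initial_state, pickup_a))
print(validate(initial_state, [pickup_a, stack_a_b], goal))
```

In this sketch, picking up block A is applicable in the initial state, while stacking it is not until the block is being held, so a plan that starts with `stack_a_b` fails validation.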

Unique Features of ACPBench

Unlike previous benchmarks limited to a few domains, ACPBench generates its datasets from formal task descriptions written in the Planning Domain Definition Language (PDDL). This approach allows diverse problems to be created without human input.
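As a rough illustration of how questions can be derived automatically from a formal model, the Python sketch below turns a toy state and action set into applicability questions with known answers. It does not parse PDDL and is not ACPBench's generator: the function name `make_applicability_questions`, the ferry-style facts, and the question wording are assumptions made for this example.

```python
def make_applicability_questions(state, actions):
    """Build one yes/no question per action, with the answer
    computed from the model rather than written by a human.

    `actions` maps a (hypothetical) action name to its set of
    preconditions; `state` is a set of facts that currently hold.
    """
    questions = []
    for name in sorted(actions):
        preconditions = actions[name]
        facts = ", ".join(sorted(state))
        questions.append({
            "question": (
                f"In a state where {facts} hold, "
                f"is the action '{name}' applicable?"
            ),
            # The ground truth follows directly from the formal model.
            "answer": preconditions <= state,
        })
    return questions


# Toy ferry-style domain (assumed for illustration).
state = {"ferry-at-port1", "car-at-port1", "ferry-empty"}
actions = {
    "board-car-at-port1": {"ferry-at-port1", "car-at-port1", "ferry-empty"},
    "debark-car-at-port1": {"ferry-at-port1", "car-on-ferry"},
    "sail-port1-port2": {"ferry-at-port1"},
}

questions = make_applicability_questions(state, actions)
for q in questions:
    print(q["question"], "->", q["answer"])
```

Because the answers are computed from the model, new states or domains yield new labelled questions without manual annotation, which is the property the paragraph above describes.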

Testing and Results

ACPBench was tested on 22 LLMs, both open-source and advanced proprietary ones, including well-known models such as GPT-4o and Llama. Results showed that even the top models struggled with certain tasks. For example, GPT-4o had an average accuracy of only 52% on planning tasks. However, with careful prompt design and fine-tuning, smaller models such as Granite-code 8B achieved performance comparable to that of larger models.

Key Takeaway

The findings indicate that LLMs generally underperform in planning tasks, regardless of their size. Yet, with appropriate techniques, their capabilities can be significantly enhanced.
