Researchers from Columbia University and Apple have developed Ferret, a multimodal large language model (MLLM) that unifies referring and grounding for finer-grained image understanding and description. Ferret uses a hybrid region representation and a spatial-aware visual sampler to handle regions of varied shape, and it accepts input that mixes free-form text with referred regions. On Ferret-Bench, it outperforms the best existing MLLMs by an average of 20.4%, and it hallucinates objects less often. The researchers also built GRIT, a dataset for model training, and introduced Ferret-Bench to evaluate tasks that require referring, grounding, semantics, knowledge, and reasoning together.
**Researchers from Columbia University and Apple Introduce Ferret: A Groundbreaking Multimodal Language Model for Advanced Image Understanding and Description**
In vision-language learning, a major challenge is giving models spatial understanding. This involves two capabilities: referring and grounding. Referring requires the model to comprehend the semantics of a specific region it is given, while grounding requires the model to locate the region that matches a given semantic description. Both depend on aligning spatial information with semantics.
However, existing work often learns referring and grounding separately, whereas humans combine the two effortlessly in everyday conversation and reasoning, learning from one task and carrying that knowledge over to the other.
To close this gap, researchers from Columbia University and Apple AI/ML developed Ferret, a new refer-and-ground Multimodal Large Language Model (MLLM). Ferret unifies referring and grounding in a single framework so that each benefits the other. It uses a hybrid region representation that pairs discrete coordinates with continuous visual features, which lets it handle regions of many shapes, such as strokes, scribbles, and polygons. Ferret accepts input that mixes free-form text with referred regions, and in its output it can generate coordinates for each groundable object alongside the text.
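To make the idea of a hybrid region representation concrete, here is a minimal sketch in plain Python. It is not Ferret's implementation: the function name `hybrid_region_representation`, the `num_bins` quantization, and the use of a simple masked average are all assumptions chosen for illustration. In particular, the masked average stands in for Ferret's spatial-aware visual sampler, which the article does not describe in detail.

```python
def hybrid_region_representation(feature_map, mask, num_bins=1000):
    """Return (discrete_box, region_feature) for a free-form region.

    feature_map: H x W x C nested lists of floats (hypothetical encoder output).
    mask:        H x W nested lists of bools marking the referred region.
    num_bins:    number of integer bins used to quantize coordinates (assumed).
    """
    h, w = len(mask), len(mask[0])
    pixels = [(y, x) for y in range(h) for x in range(w) if mask[y][x]]
    if not pixels:
        raise ValueError("mask selects no pixels")

    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]

    # Discrete part: the region's bounding box, quantized to integer bins.
    box = (
        min(xs) * num_bins // w,
        min(ys) * num_bins // h,
        (max(xs) + 1) * num_bins // w,
        (max(ys) + 1) * num_bins // h,
    )

    # Continuous part: average the features of the pixels inside the mask.
    # (A stand-in for a learned sampler; it keeps shape information that
    # the bounding box alone would lose.)
    channels = len(feature_map[0][0])
    region_feature = [
        sum(feature_map[y][x][c] for y, x in pixels) / len(pixels)
        for c in range(channels)
    ]
    return box, region_feature


# Toy 4x4 image whose two feature channels are just the x and y position.
feature_map = [[[float(x), float(y)] for x in range(4)] for y in range(4)]

# An L-shaped region: three pixels that a box alone cannot describe exactly.
mask = [[False] * 4 for _ in range(4)]
for y, x in [(1, 1), (1, 2), (2, 1)]:
    mask[y][x] = True

box, region_feature = hybrid_region_representation(feature_map, mask)
print(box)             # quantized bounding box of the region
print(region_feature)  # mean feature over the masked pixels only
```

The point of pairing the two parts is that the box gives the language model a compact, token-friendly location, while the pooled feature carries what is actually inside an irregular region.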
According to the researchers, Ferret is the first work able to process free-form region inputs in an MLLM. To train it, they created the GRIT dataset, which contains 1.1 million samples for refer-and-ground instruction tuning. The dataset covers multiple levels of spatial knowledge, including objects, relationships, region descriptions, and complex reasoning. It also includes data that combines location and text in both input and output, which supports referring and grounding tasks alike.
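As a rough illustration of data that combines location and text in both input and output, the sketch below separates the text of a sample from its coordinates. The sample itself is invented, and the inline `[x1, y1, x2, y2]` bracket format is an assumption made for this example; the article does not specify how GRIT serializes regions.

```python
import re

# Assumed format: a box is written inline as [x1, y1, x2, y2] with integers.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(text):
    """Split text with inline boxes into (plain_text, list_of_boxes)."""
    boxes = [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]
    plain = BOX_PATTERN.sub("", text)
    plain = re.sub(r"\s+", " ", plain)                     # collapse gaps left by removed boxes
    plain = re.sub(r"\s+([.,!?])", r"\1", plain).strip()   # re-attach punctuation
    return plain, boxes


# Hypothetical sample: a region appears in the question (referring) and
# boxes appear in the answer (grounding).
sample = {
    "input": "What is the object [120, 80, 340, 400] doing?",
    "output": "The dog [120, 80, 340, 400] is chasing a ball [400, 300, 460, 360].",
}

question_text, question_boxes = extract_boxes(sample["input"])
answer_text, answer_boxes = extract_boxes(sample["output"])
print(question_text, question_boxes)
print(answer_text, answer_boxes)
```

Keeping coordinates inline with the words means one text sequence can serve both directions: a box in the input is something to describe, and a box in the output is something the model has localized.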
To further strengthen Ferret, the researchers collected 34,000 refer-and-ground instruction-tuning conversations with the help of ChatGPT/GPT-4. They also performed spatially aware negative data mining to make the model more robust. Ferret shows strong open-vocabulary spatial understanding and localization ability, achieves strong results on conventional referring and grounding tasks, and hallucinates objects less often.
The researchers make three contributions: introducing Ferret, which enables fine-grained, open-vocabulary referring and grounding in an MLLM; creating the GRIT dataset for model training; and developing Ferret-Bench, which covers new types of tasks that require referring, grounding, semantics, knowledge, and reasoning together.