Researchers from Nankai University and ByteDance have developed a framework called ChatAnything that generates anthropomorphized personas for large language model (LLM)-based characters. The framework uses in-context learning and system prompts to create customized personalities, voices, and visual appearances, introducing two new mechanisms, the mixture of voices (MoV) and the mixture of diffusers (MoD), for voice and appearance generation. The researchers also identify a face landmark detection problem in images produced by current generative models and propose a solution that enables automatic face animation. The framework comprises four main blocks, shows promising results, and opens avenues for integrating generative models with talking head algorithms.
Introducing ChatAnything: An AI Framework for Generating LLM-Enhanced Personas
Researchers from Nankai University and ByteDance have developed a groundbreaking framework called ChatAnything. This framework enables the creation of anthropomorphized personas for large language models (LLMs) in an online manner. The goal is to generate personas with customized visual appearance, personality, and tones based solely on text descriptions.
The researchers leverage the in-context learning capability of LLMs to generate personalities using carefully designed system prompts. They introduce two innovative concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD) for diverse voice and appearance generation.
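The exact system prompts used in ChatAnything are not published in this summary, but the core idea of steering an LLM into a persona through a system prompt can be sketched as follows. The model name, prompt wording, and helper functions below are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of persona creation via a system prompt.
# The prompt wording and model name are assumptions, not ChatAnything's own.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def make_persona_system_prompt(description: str) -> str:
    """Turn a free-form text description into a system prompt that fixes
    the character's personality and speaking style."""
    return (
        "You are a character with the following persona: "
        f"{description}\n"
        "Stay in character at all times and answer in that character's voice."
    )


def persona_reply(description: str, user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat-capable model works here
        messages=[
            {"role": "system", "content": make_persona_system_prompt(description)},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content


print(persona_reply("a cheerful talking teapot who loves tea trivia",
                    "Introduce yourself."))
```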
MoV utilizes text-to-speech (TTS) algorithms with pre-defined tones, selecting the best-matching one based on the user-provided text description. MoD combines text-to-image generation techniques with talking head algorithms to simplify the process of generating talking objects. However, the researchers identified a challenge: anthropomorphic objects generated by current models are often undetectable by pre-trained face landmark detectors, causing face motion generation to fail. To overcome this, they incorporate pixel-level guidance during image generation so that the output contains human face landmarks. This significantly improves the face landmark detection rate, enabling automatic face animation driven by the generated speech content.
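The detectability problem can be illustrated by checking whether a pre-trained face landmark detector finds a face in a generated image. The snippet below uses MediaPipe's FaceMesh as a stand-in detector; the specific detector and thresholds used in the paper may differ:

```python
# Check whether a pre-trained face landmark detector finds a face in a
# generated image. MediaPipe FaceMesh is used as a stand-in detector here;
# the detector used in the paper may differ.
import cv2
import mediapipe as mp


def has_detectable_face(image_path: str) -> bool:
    image_bgr = cv2.imread(image_path)
    if image_bgr is None:
        raise FileNotFoundError(image_path)
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                         max_num_faces=1) as face_mesh:
        result = face_mesh.process(image_rgb)
    return result.multi_face_landmarks is not None


# Anthropomorphic objects rendered without a clear human facial layout
# frequently fail this check, which breaks downstream face animation.
print(has_detectable_face("generated_persona.png"))
```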
The researchers situate their work within recent advances in large language models and their in-context learning capabilities, which have placed LLMs at the center of current research discussions. They argue for a framework that can generate LLM-enhanced personas with customized personalities, voices, and visual appearances. Personality generation relies on the in-context learning capability of LLMs; for voices, a pool of voice modules is created from text-to-speech APIs, and the MoV module selects a tone from this pool based on the user's text input.
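A rough sketch of how such an MoV-style selection step could look: the pre-defined tones are described in text, and the LLM is asked to pick the one that best matches the persona description. The voice names, descriptions, and prompt below are illustrative assumptions, not the actual pool used in the paper:

```python
# Sketch of MoV-style tone selection. The voice pool and prompt wording are
# illustrative assumptions, not the configuration used in ChatAnything.
VOICE_POOL = {
    "warm_female": "a gentle, warm female voice for friendly characters",
    "deep_male": "a deep, calm male voice for serious or wise characters",
    "bright_child": "an energetic, high-pitched voice for playful characters",
}


def build_selection_prompt(persona_description: str) -> str:
    options = "\n".join(f"- {name}: {desc}" for name, desc in VOICE_POOL.items())
    return (
        "Pick the voice below that best matches this persona.\n"
        f"Persona: {persona_description}\n"
        f"Voices:\n{options}\n"
        "Answer with the voice name only."
    )


def select_voice(persona_description: str, query_llm) -> str:
    """query_llm is any callable mapping a prompt string to the LLM's reply,
    e.g. a thin wrapper around the chat API shown earlier."""
    answer = query_llm(build_selection_prompt(persona_description)).strip()
    # Fall back to the first voice if the model answers with something unexpected.
    return answer if answer in VOICE_POOL else next(iter(VOICE_POOL))
```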
To give the personas speech-driven talking motions and expressions, the researchers utilize recent talking head algorithms. However, they face challenges when using images generated by diffusion models as input to talking head models: only around 30% of generated images yield faces that state-of-the-art talking head models can detect, indicating a misalignment between the distribution of generated images and the data those models were trained on. To bridge this gap, the researchers propose a zero-shot method that incorporates face landmarks during the image generation phase.
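One plausible way to realize such pixel-level guidance, sketched here under the assumption that generation is seeded from a human face template (the paper's exact mechanism may differ), is to run Stable Diffusion in image-to-image mode so the output inherits a detectable facial layout:

```python
# Sketch of landmark-friendly generation: start from a human face template
# image (img2img) so the generated persona keeps a detectable facial layout.
# This is one plausible realization, not necessarily the paper's exact method.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

face_template = Image.open("face_template.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="an anthropomorphized teapot with a friendly human-like face, portrait",
    image=face_template,
    strength=0.6,       # lower strength preserves more of the template's face structure
    guidance_scale=7.5,
).images[0]
result.save("teapot_persona.png")
```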
The proposed ChatAnything framework consists of four main blocks: LLM-based control module, portrait initializer, mixture of text-to-speech modules, and motion generation module. The researchers have incorporated diffusion models, voice changers, and structural control to create a modular and flexible system. To validate the effectiveness of their proposed method, they have created a validation dataset with prompts from different categories. They use a pre-trained face keypoint detector to assess the face landmark detection rates, demonstrating the impact of their approach.
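The modular design can be pictured as four pluggable components wired in sequence. The skeleton below is a structural sketch only; the class and method names are invented for illustration and do not mirror the ChatAnything codebase:

```python
# Structural sketch of the four-block design. Class and method names are
# invented for illustration; they do not mirror the ChatAnything codebase.
from dataclasses import dataclass


@dataclass
class ChatAnythingPipeline:
    llm_control: object           # builds the persona and drives the conversation
    portrait_initializer: object  # text-to-image generation with face guidance
    tts_mixture: object           # MoV: pool of text-to-speech voices
    motion_generator: object      # talking head model animating the portrait

    def create_persona(self, description: str):
        """Generate the portrait and pick a voice once, when the persona is set up."""
        self.description = description
        self.portrait = self.portrait_initializer.generate(description)
        self.voice = self.tts_mixture.select(description)

    def respond(self, user_message: str):
        """Answer in character, synthesize speech, and animate the portrait."""
        reply_text = self.llm_control.reply(self.description, user_message)
        speech = self.tts_mixture.synthesize(reply_text, self.voice)
        video = self.motion_generator.animate(self.portrait, speech)
        return reply_text, speech, video
```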
This comprehensive framework, ChatAnything, enables the generation of LLM-enhanced personas with anthropomorphic characteristics. The researchers address challenges in face landmark detection and propose innovative solutions, showing promising results in their validation dataset. This work opens up possibilities for future research in integrating generative models with talking head algorithms and improving data distribution alignment.
For more details, you can check out the original paper and project.
Credit for this research goes to the researchers of this project.