OpenAI's ChatGPT is known for its advanced text-based interactions among the growing family of large language models (LLMs). However, many UI clients for LLMs, including ChatGPT, are still text-only. We wanted to enhance this experience by adding an avatar to humanize the LLM and by enabling voice interactions. Our goal is to give the LLM more personality and make conversations more enjoyable, although it already has a great personality!
To accomplish this, we need to integrate several models, with the LLM at the center. Because the LLM consumes and produces text, we have to convert speech to text on the way in and text back to speech on the way out. Specifically, we must:

- convert the user's speech to text (speech-to-text, STT);
- send that text to the LLM and receive its text response;
- convert the response back to speech (text-to-speech, TTS);
- animate the avatar's face in sync with the synthesized speech.
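One conversational turn therefore flows through four stages. The sketch below is only a minimal outline of that loop; `transcribeAudio`, `chatWithLLM`, `synthesizeSpeech`, and `playWithAvatar` are hypothetical placeholders for the STT, LLM, TTS, and animation pieces discussed later, not functions from any specific SDK.

```typescript
// Hypothetical helper signatures (placeholders for the concrete services below).
declare function transcribeAudio(audio: Blob): Promise<string>;
declare function chatWithLLM(text: string): Promise<string>;
declare function synthesizeSpeech(text: string): Promise<{ audio: ArrayBuffer; blendShapes: number[][] }>;
declare function playWithAvatar(audio: ArrayBuffer, blendShapes: number[][]): Promise<void>;

// One conversational turn: speech in, speech (plus facial animation) out.
async function handleUserTurn(recordedAudio: Blob): Promise<void> {
  const userText = await transcribeAudio(recordedAudio);            // speech-to-text
  const replyText = await chatWithLLM(userText);                    // LLM (ChatGPT) response
  const { audio, blendShapes } = await synthesizeSpeech(replyText); // TTS + viseme blend shapes
  await playWithAvatar(audio, blendShapes);                         // animate the 3D avatar in sync
}
```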
The outcome is this:
The diagram below shows the structure of the application:
In this Proof of Concept (PoC), we use several SaaS models and APIs, as well as open-source ones:
For 3D face animation, such as lip syncing, eye blinking, and head tilting, there are two potential approaches:
Machine-learning approach: generate a talking face from an image and text or audio. Several models can handle this, including:
Although generating a talking head from a single image (or a few images) is impressive and the output quality is convincing, these models are slow and difficult to run in real time on commercial devices.
3D-model approach: drive the facial features of a pre-built 3D head, such as the lips, eyes, and head movements. This requires a 3D head model, which makes the approach less flexible, but it is guaranteed to run in real time on most commercial devices. Visemes play a critical role here: they are the visual counterparts of phonemes in spoken language. By manipulating them, we can align the avatar's facial movements with the synthetic speech, creating a more realistic and engaging conversational experience. To achieve this, we need the Text-to-Speech (TTS) model to return 'blend shape' controls, which drive the facial movements of our 3D character. These blend shapes are represented as a two-dimensional matrix in which each row is a frame, and each frame contains an array of 55 facial positions. Conveniently, the Azure TTS API returns exactly this, ready to apply to the 3D model.
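To make that concrete, here is a minimal sketch assuming the JavaScript Speech SDK (`microsoft-cognitiveservices-speech-sdk`). Blend-shape output has to be requested via the `mstts:viseme` SSML element, and each `visemeReceived` event then carries a JSON payload whose `BlendShapes` field is the frames x 55 matrix described above. `applyBlendShapeFrames` is a hypothetical helper that feeds those frames to the 3D model; keys, region, and voice name are placeholders.

```typescript
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

// Hypothetical helper: pushes frames (each an array of 55 blend-shape weights) to the 3D model.
declare function applyBlendShapeFrames(frames: number[][]): void;

const speechConfig = sdk.SpeechConfig.fromSubscription("<your-key>", "<your-region>");
const synthesizer = new sdk.SpeechSynthesizer(speechConfig);

// Each viseme event may carry an `animation` JSON string; its BlendShapes field
// holds one row per frame, with 55 facial positions per row.
synthesizer.visemeReceived = (_sender, e) => {
  if (e.animation) {
    applyBlendShapeFrames(JSON.parse(e.animation).BlendShapes);
  }
};

// Blend-shape output must be requested explicitly via the mstts:viseme SSML element.
const ssml = `
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:viseme type="FacialExpression"/>
    Hello! Nice to meet you.
  </voice>
</speak>`;

synthesizer.speakSsmlAsync(
  ssml,
  (result) => { /* play result.audioData through the avatar */ },
  (error) => console.error(error)
);
```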
We picked this approach because it allows real-time rendering.
For this PoC, we utilized various prompts to simulate different roles, such as:
These prompts were sourced from the excellent repository at https://github.com/f/awesome-chatgpt-prompts, with minor adjustments to fit our needs. They worked well at keeping each character in persona.
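The basic pattern is to pin the persona prompt as the system message of every chat request. The sketch below assumes the official `openai` Node SDK; the model name and the exact persona wording are illustrative, not the ones used in the PoC.

```typescript
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Persona prompt in the style of awesome-chatgpt-prompts (wording here is only an example).
const personaPrompt =
  "I want you to act as a spoken English teacher. Reply in English, " +
  "correct my grammar mistakes, and keep your answers short.";

async function chatAsPersona(
  history: OpenAI.Chat.Completions.ChatCompletionMessageParam[],
  userText: string
): Promise<string> {
  const response = await openai.chat.completions.create({
    model: "gpt-3.5-turbo", // model choice is an assumption for this sketch
    messages: [
      { role: "system", content: personaPrompt }, // the persona stays pinned as the system message
      ...history,
      { role: "user", content: userText },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```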
We used readily available SaaS services to build something quickly for the PoC. If you prefer open-source solutions, there are other options:
We can enhance its capabilities with tools like LangChain, AutoGPT, or BabyAGI. For instance, in the English-teacher use case, it can currently help fix writing and grammar and provide feedback. By incorporating a speech-analysis model, it could also give feedback on pronunciation. Additionally, we could integrate internet and YouTube search so it can recommend relevant lessons for further improving English.
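Frameworks like LangChain wrap this tool-use pattern for you. As a minimal framework-free illustration, the sketch below exposes a hypothetical YouTube-lesson search to the model through OpenAI's function-calling (tools) API; the tool name, its schema, and the model are all assumptions made for this example.

```typescript
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// A hypothetical tool the model may choose to call when the student asks for material.
const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "search_youtube_lessons",
      description: "Search YouTube for English lessons on a given topic",
      parameters: {
        type: "object",
        properties: { query: { type: "string" } },
        required: ["query"],
      },
    },
  },
];

async function teacherTurn(userText: string): Promise<void> {
  const response = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: "You are an English teacher." },
      { role: "user", content: userText },
    ],
    tools,
  });

  const call = response.choices[0].message.tool_calls?.[0];
  if (call) {
    // Run our own YouTube search with the arguments the model proposed,
    // then send the results back in a follow-up "tool" message.
    const { query } = JSON.parse(call.function.arguments);
    console.log("Model asked us to search YouTube for:", query);
  }
}
```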
Here is the live demo so you can play with it: https://ava-ai-ef611.web.app/
Enjoy!