Conversational AI Framework in Unreal Engine

Developing a Conversational AI Framework for Unreal Engine: Bringing Gaya to Life
To further develop our game concept, it is crucial to understand the technologies that will bring it to life. We need to establish the Unreal Engine Conversational AI Framework, exploring its capabilities to design Gaya. This framework, primarily developed by December 2023, does not include updates from tools and technologies post that period.
6.1 Conversational AI Framework Development
Generative AI technology has existed for some time, and its open-source availability has grown recently. This project leverages tools integrating with Unreal Engine through direct API access or plugins. An API (Application Programming Interface) facilitates communication between applications, typically using standards such as HTTP and REST, ensuring ease of use and thorough documentation.
For API communication, we use JSON (JavaScript Object Notation), which consists of key-value pairs enclosed in curly braces {}. JSON supports various data types, enabling complex, nested data structures. It facilitates data exchange across programming languages through POST and GET requests.
The goal is to integrate various technologies into a seamless system where speech input translates into speech output, providing feedback in the 3D world. The initial phase involved developing a basic conversational AI for straightforward two-way verbal communication. Existing solutions like Convai lacked the flexibility and control over in-game dynamics required for this project. For instance, Convai's speech generation quality was inadequate, prompting the need for a custom setup.
This framework aims to be as "no-code" as possible, enhancing accessibility for those without advanced programming skills. It is designed to be adaptable, encouraging innovation as technologies advance. The conversational AI setup includes processes for converting speech to text, generating chat completions from text, and transforming text back into speech. Additional functionalities like image generation, sentiment analysis, and real-time lip-syncing enhance interaction. Vision capabilities were explored but proved challenging to implement during runtime.
6.2 Speech to Text (Whisper)
In integrating Speech-to-Text (STT) technology, several options were evaluated, including OpenAI's Whisper, Amazon STT, and Vosk. Whisper.cpp, a C++ port of Whisper OpenAI, was chosen for its superior balance of transcription quality, latency, and system stability. It operates in real-time on the local system, eliminating the need for an internet connection. Implemented through the open-source plugin Runtime Speech Recognizer in Unreal Engine, Whisper.cpp demonstrated remarkable resilience and accuracy.
6.3 Text Generation (GPT)
At the core of this framework and game concept is GPT, a family of text generation models developed by OpenAI. GPT models provide text outputs in response to various inputs, supporting an extensive array of human languages. This versatility allows the game, based on Hindu mythology, to incorporate Sanskrit words and phrases. We use the OpenAI API to communicate with GPT models, with prompt engineering playing a crucial role in defining AI behavior.
6.3.1 Input (Prompt)
GPT models take requests specifying messages with roles and parameters to control response randomization and uniqueness. Roles include system (instructions for the AI), user (players), and assistant (AI or NPCs). Adding the "assistant" role to the JSON message creates a chat history, enabling the AI to retain memory of past interactions.
6.3.2 Prompt Engineering
Prompt engineering involves communicating clear instructions to an AI. The RISEN framework (Role, Instructions, Steps, End goal, Narrowing) effectively prompts LLMs or chatbots like ChatGPT. Strategies include adopting a persona, using delimiters, specifying task steps, providing examples, and controlling output length. These tactics enhance the AI's response consistency and quality.
6.3.3 Differences in GPT Models
GPT models vary in performance and accuracy. GPT-4, though advanced, was costly, leading to the use of GPT-4 Turbo for its cost-effectiveness and impressive performance. Fine-tuning emerged as a strategy to enhance model performance for specific needs, reducing costs and improving response handling.
6.3.4 Testing GPT
Various tests were conducted to assess GPT's capabilities, including acting as a game master, simulating three-way conversations, and fulfilling tasks. These tests were integrated into Unreal Engine using a voice interface, demonstrating GPT's versatility and effectiveness.
6.3.5 Gaya’s System Prompt
Gaya's character is crafted through a blend of fine-tuning and a foundational system prompt. The prompt defines her as a sarcastic river deity and guide for the players, providing responses in a specified JSON format. This setup ensures Gaya’s behavior aligns with her personality and game requirements.
6.3.6 Fine-Tuning
Fine-tuning enhances character-driven responses, enforcing a structured personality and reducing deviations. A dataset for Gaya was created by generating dialogues that capture her tone of voice. This dataset was used to fine-tune a GPT-3.5 Turbo model, ensuring consistent and cost-effective responses.
6.4 Text to Speech
Text-to-speech (TTS) technologies, including Microsoft Azure TTS and ElevenLabs, were explored. ElevenLabs was chosen for its human-like voice nuances and multilingual support. It understands the context and performs sentiment analysis, producing realistic speech. Future expansions may involve collaborating with a voice actor to train an ElevenLabs clone of Gaya’s voice.
6.5 Image Generation
Image generation enhances the game experience by providing real-time visuals. Initially, local processing with Stable Diffusion faced stability issues, leading to the use of cloud-based services via API calls. ControlNET and OpenAI's DALLE3 streamline the workflow, reducing local system load.
6.6 Sentiment Analysis
Sentiment analysis, supported by Google’s Natural Language API, plays a critical role in the game. It evaluates the emotional tone of player dialogue, providing another layer of interaction. Whisper technology captures players' emotions through punctuation, enhancing AI understanding.
6.7 Viseme Generation and Additional Audio Effects
Viseme generation enables real-time lip-sync animations for characters, synchronizing facial expressions with speech. Tools like Meta’s OVR Lip Sync and NVIDIA's Audio2Face were explored. The Text to LipSync plugin was chosen for its user-friendliness and performance, ensuring visual feedback from the Conversational AI.
6.8 Unreal Engine
The framework development took place in Unreal Engine 5.3, integrating GPT, Speech Recognition, ElevenLabs, and Stable Diffusion. The goal was to create a no-code setup using Unreal Engine Blueprints and plugins like VARest for JSON requests. The final process is summarized in a comprehensive diagram, detailing the Conversational AI Framework for Gaya in Unreal Engine 5.3.
In conclusion, this project aims to develop a seamless Conversational AI Framework using various technologies integrated into Unreal Engine. The framework supports dynamic interactions, providing an immersive gaming experience with Gaya, the river deity. Through iterative testing and fine-tuning, the framework ensures cost-effective and high-quality AI-driven interactions, paving the way for innovative game design.

Finetuning dataset: link
Gaya System Prompt: link