Inspiration

Recently, character experiences powered by LLMs have become extremely popular. Platforms like Character.AI, boasting 54M monthly active users and a staggering 230M monthly visits, are a testament to this trend. Yet despite these figures, most experiences on the market offer text-to-text interfaces with little variation.

We wanted to take chatting with characters to the next level. Instead of a simple, standard text-based interface, we wanted an intricate visualization of your character: a 3D model viewable in your real-life environment, genuinely low-latency, immersive, realistic spoken dialogue, and a really fun, dynamic 3D graphics experience in which objects are generated on the fly and appear as they are mentioned in conversation - a novel capability only made possible recently.

What it does

An overview: CharactAR is a fun, immersive, and interactive AR experience where you speak your character’s personality into existence, upload an image of your character or take a selfie, pick their outfit, and bring your custom character to life in an AR world, where you can chat using your microphone or type a question, and even have your character run around in AR! As an additional super cool feature, we compiled, hosted, and deployed the open-source OpenAI Shap-E model (ourselves, on NVIDIA A100 GPUs from Google Cloud) to do text-to-3D generation, meaning your character can generate 3D objects mid-conversation and place them in the scene. Imagine the Terminator generating robots, or a marine biologist generating fish and other wildlife! The combination of these novel technologies makes experiences like those possible for the first time!

How we built it

[Flowchart: CharactAR system overview]

So how does CharactAR work?

To begin, we built https://charactar.org, a web application that uses AssemblyAI (state-of-the-art speech-to-text) to do real-time speech-to-text transcription. Simply click the “Record” button, speak your character’s personality into existence, and click the “Begin AR Experience” button to enter your AR experience. We built the site with HTML, CSS, and JavaScript, bought the domain through GoDaddy, and hosted the website on Replit!
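For the curious, here is a minimal sketch of the browser side of that pipeline, following AssemblyAI’s v2 real-time WebSocket API. `getSessionToken` is a hypothetical helper that fetches a temporary token from our own backend, since the API key must never ship to the browser:

```javascript
// Minimal sketch: stream microphone audio to AssemblyAI's real-time API.
// getSessionToken() is a hypothetical helper that asks our backend for a
// temporary AssemblyAI token.
async function startTranscription(onTranscript) {
  const token = await getSessionToken();
  const socket = new WebSocket(
    `wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token=${token}`
  );

  socket.onmessage = (event) => {
    const msg = JSON.parse(event.data);
    if (msg.message_type === 'FinalTranscript' && msg.text) {
      onTranscript(msg.text); // append to the character description box
    }
  };

  // Capture the mic and forward base64-encoded 16-bit PCM chunks.
  // (ScriptProcessorNode is deprecated but keeps the sketch short.)
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext({ sampleRate: 16000 });
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  source.connect(processor);
  processor.connect(ctx.destination);
  processor.onaudioprocess = (e) => {
    const float32 = e.inputBuffer.getChannelData(0);
    const int16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      int16[i] = Math.max(-1, Math.min(1, float32[i])) * 0x7fff;
    }
    let binary = '';
    new Uint8Array(int16.buffer).forEach((b) => (binary += String.fromCharCode(b)));
    socket.send(JSON.stringify({ audio_data: btoa(binary) }));
  };
}
```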

In the background, we’ve already used OpenAI Function Calling, a recent OpenAI product offering, to choose a voice for your custom character based on the original description you provided. Once we have the character’s voice and description, we’re ready to jump into the AR environment.
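Here is a minimal sketch of how that voice selection can be expressed with the function calling API. The function schema and voice names are our own illustrative placeholders, and in practice the API key stays server-side:

```javascript
// Sketch: use OpenAI function calling to pick a voice for the character.
// The schema and the voice enum are placeholders for the ElevenLabs voices
// the experience actually offers.
async function chooseVoice(characterDescription) {
  const res = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${OPENAI_API_KEY}`, // placeholder; keep server-side
    },
    body: JSON.stringify({
      model: 'gpt-3.5-turbo-0613',
      messages: [
        { role: 'user', content: `Pick a voice for: ${characterDescription}` },
      ],
      functions: [
        {
          name: 'set_character_voice',
          description: 'Choose the best-fitting voice for a character',
          parameters: {
            type: 'object',
            properties: {
              voice: {
                type: 'string',
                enum: ['deep_male', 'warm_female', 'robotic', 'childlike'],
              },
            },
            required: ['voice'],
          },
        },
      ],
      function_call: { name: 'set_character_voice' }, // force the function call
    }),
  });
  const data = await res.json();
  // The model returns the arguments as a JSON string inside function_call.
  return JSON.parse(data.choices[0].message.function_call.arguments).voice;
}
```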

The AR platform we chose is 8th Wall, an AR deployment platform built by Niantic that focuses on web experiences. Because it targets the web, CharactAR runs on virtually any device - mobile phones, laptops, even VR headsets (yes, really!).

To power our customizable character backend, we employed the Ready Player Me avatar-creation SDK, which provides a responsive UI that lets users create any character they want - by taking a selfie, uploading an image of their favorite celebrity, or simply choosing from a predefined set of models.
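For reference, this is roughly how the Ready Player Me frame API hands the finished avatar back to the page; the pattern follows Ready Player Me’s documented postMessage events, the subdomain is illustrative, and `loadCharacter` is a hypothetical hook into our AR scene:

```javascript
// Sketch: receive the finished avatar URL from the Ready Player Me iframe.
const frame = document.getElementById('rpm-frame');
frame.src = 'https://demo.readyplayer.me/avatar?frameApi';

window.addEventListener('message', (event) => {
  let json;
  try { json = JSON.parse(event.data); } catch { return; }
  if (json?.source !== 'readyplayerme') return;

  if (json.eventName === 'v1.frame.ready') {
    // Subscribe to all frame events once the avatar editor has loaded.
    frame.contentWindow.postMessage(
      JSON.stringify({ target: 'readyplayerme', type: 'subscribe', eventName: 'v1.**' }),
      '*'
    );
  }

  if (json.eventName === 'v1.avatar.exported') {
    // json.data.url points at a rigged .glb we can load into 8th Wall.
    loadCharacter(json.data.url); // hypothetical loader in our AR scene
  }
});
```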

Once the model is loaded into the 8th Wall experience, we use a mix of OpenAI (character intelligence), Inworld (microphone input and output), and ElevenLabs (voice generation) to create an extremely immersive character experience from the get-go. We animated each character using the standard Ready Player Me animation rigs, and you can even walk your character around your environment by dragging your finger on the screen.
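A simplified sketch of how that drag-to-walk behavior can be written as an A-Frame component (8th Wall experiences are built on A-Frame); for brevity this raycasts against a fixed ground plane at y = 0 rather than 8th Wall’s detected surfaces:

```javascript
// Sketch: walk the character entity toward wherever the user drags.
AFRAME.registerComponent('drag-to-walk', {
  init() {
    this.target = null;
    this.plane = new THREE.Plane(new THREE.Vector3(0, 1, 0), 0); // y = 0 ground
    this.raycaster = new THREE.Raycaster();
    window.addEventListener('touchmove', (e) => {
      const t = e.touches[0];
      const ndc = new THREE.Vector2(
        (t.clientX / window.innerWidth) * 2 - 1,
        -(t.clientY / window.innerHeight) * 2 + 1
      );
      this.raycaster.setFromCamera(ndc, this.el.sceneEl.camera);
      const hit = new THREE.Vector3();
      if (this.raycaster.ray.intersectPlane(this.plane, hit)) this.target = hit;
    });
  },
  tick(time, dt) {
    if (!this.target) return;
    const pos = this.el.object3D.position;
    const dist = pos.distanceTo(this.target);
    if (dist < 0.05) { this.target = null; return; } // arrived
    pos.lerp(this.target, Math.min(1, (0.001 * dt) / dist)); // ~1 m/s walk
    this.el.object3D.lookAt(this.target.x, pos.y, this.target.z);
  },
});
```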

Each time your character responds to you, we make an API call to our own custom-hosted Shap-E API, which runs on an NVIDIA A100 on Google Cloud. A short prompt based on your conversation with the character is sent to our hosted instance of OpenAI’s novel text-to-3D model, which generates a 3D object that is automatically inserted into your environment.
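Below is a minimal sketch of this step, assuming a hypothetical endpoint and request shape for our hosted service (the URL and the `num_inference_steps` field, discussed under Challenges, are illustrative, not a public API). Note that loading a Draco-compressed .glb in A-Frame also requires pointing the scene’s gltf-model system at a Draco decoder:

```javascript
// Sketch: ask our self-hosted Shap-E service for a 3D object and drop it
// into the A-Frame scene. Endpoint and fields are illustrative.
async function spawnObject(prompt, position) {
  const res = await fetch('https://shap-e.example.com/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, num_inference_steps: 32 }),
  });
  const blob = await res.blob(); // Draco-compressed .glb bytes
  const url = URL.createObjectURL(blob);

  const entity = document.createElement('a-entity');
  entity.setAttribute('gltf-model', url);
  entity.setAttribute('position', position); // e.g. '0 0 -2'
  entity.setAttribute('scale', '0.3 0.3 0.3');
  document.querySelector('a-scene').appendChild(entity);
}
```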

For example, if you are talking with Barack Obama about his time in the White House, our Shap-E API will generate a 3D model of the White House - and it’s really fun (and funny!) to see in-game what Shap-E comes up with.

Challenges we ran into

One of our favorite parts of CharactAR is the automatic generation of objects during conversations with the character. However, each added object also increases the scene’s triangle count, which quickly builds up lag. So when designing this pipeline, we worked on cutting unnecessary detail from model generation. One method is reducing the number of inference steps Shap-E runs before producing a 3D model - fewer denoising steps means coarser but much faster generation.

The other is compressing the generated 3D model, which ended up being harder to integrate than expected. At first we generated the 3D models in the .ply format, but realized that .ply files are a nightmare to work with in 8th Wall. So we converted them to .glb files, which are more efficient to send through the API and better suited to AR. The .glb files could still get quite large, so we used Google’s Draco compression library to reduce file sizes by 10 to 100 times. Getting this to work took a lot of debugging and dependency resolution, but it was awesome to see it functioning, as sketched below.
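For reference, here is roughly how that transcoding step can be done in Node.js with the gltf-pipeline package, which wraps Google’s Draco encoder; `compressionLevel: 7` is gltf-pipeline’s default, and the file names are placeholders:

```javascript
// Sketch (Node.js): Draco-compress a Shap-E .glb with gltf-pipeline.
const fs = require('fs');
const { processGlb } = require('gltf-pipeline');

async function compressGlb(inPath, outPath) {
  const glb = fs.readFileSync(inPath);
  const { glb: compressed } = await processGlb(glb, {
    dracoOptions: { compressionLevel: 7 }, // 0-10; higher = smaller, slower
  });
  fs.writeFileSync(outPath, compressed);
  console.log(`${glb.length} B -> ${compressed.length} B`);
}

compressGlb('bananaman.glb', 'bananaman.draco.glb');
```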

Below, we have “banana man” renders from our hosted Shap-E model.

[Renders: “banana man,” left and right views]

Even after transcoding the .glb file with Draco compression, the banana man still stands gloriously (1 MB → 78 KB).

Although 8th Wall made development much more streamlined, AR development as a whole still has a ways to go, and here are some of the challenges we faced. There were countless undefined errors with no documentation, many of which took hours of debugging to overcome. Working with the animated Ready Player Me models and the .glb files generated by our OpenAI Shap-E model posed many challenges around model formats and dynamically loading models, which required lots of reading up on 3D file formats.

Accomplishments that we're proud of

Each of the interconnected portions of the project came with its own small challenges, and we are proud to have persevered through the bugs and roadblocks. The satisfaction of small victories, like seeing our prompts come to life in 3D or seeing the character walk around our table, always invigorated us to keep on pushing.

Running AI models is computationally expensive, so it made sense to offload this work to Google Cloud’s servers. This gave us access to powerful A100 GPUs, which made Shap-E generation orders of magnitude faster than it would be on CPUs. It was also a great opportunity to work with FastAPI to build a convenient and extremely efficient endpoint that takes a prompt and returns a compressed 3D representation of the query.

We integrated AssemblyAI's real-time transcription service to transcribe live audio streams with high accuracy and low latency. This capability was crucial for our project, as it converts spoken language into text that the rest of our system can process. AssemblyAI's WebSocket API was secure, fast, and effective in meeting our transcription requirements.

The function calling capabilities of OpenAI's latest models were an exciting addition to our project. Developers can now describe functions to these models, and the models intelligently output a JSON object containing the arguments for those functions. This feature enabled us to integrate GPT's capabilities seamlessly with external tools and APIs, offering a new level of functionality and reliability.

For enhanced user experience and interactivity between our website and the 8th Wall environment, we leveraged the URLSearchParams interface, which let us pass the initial character prompt along seamlessly.
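As a concrete sketch (the 8th Wall URL and parameter names here are illustrative):

```javascript
// On charactar.org, after recording the character description:
const params = new URLSearchParams({ prompt: characterDescription, voice });
window.location.href = `https://charactar.8thwall.app/experience?${params}`;

// Inside the 8th Wall experience, on load:
const incoming = new URLSearchParams(window.location.search);
const characterPrompt = incoming.get('prompt');
```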

What we learned

For the majority of the team, this was our first AR project using 8th Wall, so we learned the ins and outs of building for AR, the A-Frame library, and deploying a final product that end-users can actually use. We had also never used AssemblyAI for real-time transcription, so we learned how to use WebSockets for real-time transcription streaming.

We also learned many of the intricacies of 3D objects and their file formats, getting truly low-level with meshes, file types, and triangle counts to ensure a smooth rendering experience.

Since our project wove together so many technologies, there were many times where we had to find unique workarounds and stitch together our distributed systems. Our prompt engineering skills were put to the test, as we experimented with countless phrasings to get our agent behaviors and 3D model generations to match our expectations. After this experience, we feel much more confident using state-of-the-art generative AI models to produce top-notch content. We also learned to apply LLMs to more specific and unusual use cases; for example, we used GPT to pull the most important object prompts out of a long conversation transcript, and to choose the voice for our character.

What's next for CharactAR

Using 8th Wall technology like Shared AR, we could have up to 250 players in the same virtual room, meaning you could play with your friends no matter how far away they are. These kinds of collaborative, virtual, and engaging experiences are exactly the environments we want CharactAR to enable. And while each CharactAR custom character is animated with a custom rigging system, we believe the new OpenAI Function Calling schema (which we used several times in this project) could be used to generate animations dynamically, giving us endless character animations and facial expressions to match endless conversations.
