Inspiration

Not all products are designed in a user-friendly and intuitive way. We often come across devices that are annoying and unclear to use. This is especially true for people with less exposure to tech, such as seniors. Whether it’s setting up a new tech gadget or controlling the AC in a new rental car, reading long user manuals or finding a random YouTube tutorial is currently the best course of action. But what if an AI could generate the tutorial specifically for you directly on your phone and visually explain the product using interactive AR?

What it does

We leave AI chatbots in the dust by combining them with 3D stable diffusion and augmented reality, creating a user experience as if an expert were physically next to you, visually answering your question with a helpful virtual demonstration.

Workflow

  1. The user wants to know how to interact with an object.
  2. They open the app and point their camera at the object.
  3. The user asks their question, e.g. "How do I do X?"
  4. An object detection model identifies the item in front of the user.
  5. Speech-to-text transcribes the user's question, and the object label and prompt are sent to the backend LLM instruction agent.
  6. The instruction agent takes the user's prompt and generates a list of clear instructions to resolve the user's problem.
  7. The detected object and contextualised instructions are fed into a 3D stable diffusion model, which generates a digital twin that is displayed alongside the real object in AR.
  8. The 3D models are positioned in AR space as visual guidance for the written instructions, which are also shown to the user.
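End to end, the workflow above can be sketched roughly as follows. Every function, field, and value here is a hypothetical stand-in for illustration (the real detector is CoreML on-device and the agent is a backend LLM call), not our actual API:

```python
# Illustrative sketch of the Aira pipeline; all names are hypothetical.
from dataclasses import dataclass

@dataclass
class TutorialStep:
    text: str           # written instruction shown to the user
    model_prompt: str   # contextualised prompt for the text-to-3D model

def detect_object(frame) -> str:
    # Placeholder for the on-device object detection model.
    return "coffee machine"

def transcribe(audio) -> str:
    # Placeholder for speech-to-text.
    return "How do I descale it?"

def instruction_agent(label: str, question: str) -> list[dict]:
    # Placeholder for the backend LLM instruction agent (returns JSON).
    return [
        {"instruction": f"Step {i} for the {label}.",
         "3d_prompt": f"{label}, step {i}"}
        for i in (1, 2, 3)
    ]

def run_pipeline(frame, audio) -> list[TutorialStep]:
    label = detect_object(frame)
    question = transcribe(audio)
    return [TutorialStep(s["instruction"], s["3d_prompt"])
            for s in instruction_agent(label, question)]
```

In the real app, each `TutorialStep` would also carry an AR anchor so the generated digital twin can be positioned next to the physical object.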

How we built it

Frontend: The core frontend was developed in SwiftUI, using ARKit to render the tutorials in space and an on-device CoreML model to detect the object in front of the camera. We also used AVFoundation to enable speech-to-text and simplify the user experience. For more complex and involved tutorials, we aim to make the frontend compatible with the Apple Vision Pro in the near future.

Instruction Agent: The instruction agent simplifies user guidance by generating concise instructions in three clear steps. It receives the user's prompt and the detected object's label from the frontend via a REST API and passes them through an LLM to produce a finalised JSON response. The resulting instructions are then contextualised for the text-to-3D model, which drives the generation and positioning of the AR objects.
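To make the agent's contract concrete, here is a minimal sketch of prompting for and validating a three-step JSON response. The schema and field names are illustrative assumptions, not our exact format:

```python
# Hypothetical sketch of the instruction agent's JSON contract.
import json

SYSTEM_PROMPT = (
    "You are an instruction agent. Given an object label and a user "
    "question, reply with JSON of the form "
    '{"steps": [{"instruction": str, "3d_prompt": str}, ...]} '
    "containing exactly three steps."
)

def validate_response(raw: str) -> list[dict]:
    """Check that the LLM's raw reply matches the assumed schema."""
    data = json.loads(raw)
    steps = data["steps"]
    if len(steps) != 3:
        raise ValueError("expected exactly three steps")
    for step in steps:
        if not {"instruction", "3d_prompt"} <= step.keys():
            raise ValueError("step missing required fields")
    return steps
```

Validating the reply before handing it to the text-to-3D stage keeps a malformed LLM response from silently breaking the AR pipeline.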

Text-to-3D Stable Diffusion: The text-to-3D model uses a pre-trained 2D text-to-image diffusion model to perform text-to-3D synthesis. A probability density distillation loss optimises a NeRF via gradient descent, so the resulting 3D model can be viewed from any angle while requiring no 3D training data and no modifications to the image diffusion model. Because querying every ray through a NeRF is computationally expensive, we precompute the NeRF into a Sparse Neural Radiance Grid (SNeRG), which enables real-time rendering by reformulating the architecture as a sparse voxel grid with learned feature vectors. Finally, we used USDPython with ARConvert for USDZ compatibility on iOS.
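As a rough illustration of the probability density distillation idea (DreamFusion-style score distillation), the toy NumPy sketch below computes a gradient with respect to a rendered image while treating the frozen 2D diffusion model as a black box. The stub noise predictor and the linear noise schedule are placeholders, not the real model:

```python
# Toy sketch of a score-distillation-style gradient; illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def predicted_noise(x_t, t, prompt):
    # Stub for the frozen 2D diffusion model's noise prediction.
    return 0.9 * x_t

def sds_gradient(rendered, t, prompt):
    """Gradient w.r.t. the rendered image; backpropagating through the
    NeRF renderer would carry this to the NeRF parameters."""
    eps = rng.standard_normal(rendered.shape)       # sampled noise
    alpha = 1.0 - t                                 # toy noise schedule
    x_t = np.sqrt(alpha) * rendered + np.sqrt(1 - alpha) * eps
    w_t = 1.0 - alpha                               # weighting w(t)
    # Key trick: the diffusion model's Jacobian is skipped entirely,
    # so the frozen image model never needs to be modified.
    return w_t * (predicted_noise(x_t, t, prompt) - eps)

rendered = rng.standard_normal((8, 8, 3))
grad = sds_gradient(rendered, t=0.5, prompt="a coffee machine")
```

Each optimisation step renders the NeRF from a random camera, computes this gradient, and nudges the NeRF parameters so its renders look plausible to the 2D diffusion model.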

The following papers were used as technical support and inspiration:

Challenges we ran into

Rendering 3D models at high speed and quality turned out to be very tough. Our model initially produced low-quality AR objects in about a minute; after precomputing and storing the NeRF as a SNeRG, we cut that down to several seconds. Producing the highest-quality models takes longer and is a challenge we want to address in the future. For now the lower-quality version suffices, and on a smartphone-sized screen the difference is hardly noticeable.

Accomplishments that we're proud of

We made a fully functional demo and MVP! Despite facing many technical challenges along the way, we managed to overcome them all and are proud of the functionality and complexity of our product. We were able to integrate many packages and models into a complex pipeline that seamlessly converts the user’s question into a visual tutorial. The technical complexity of our solution was both challenging and rewarding, and we are excited to work on this further and see how far we can push the performance and quality of the model, especially considering how close to the edge of research it is.

What we learned

We used many new packages and techniques in this project, significantly expanding our skillset. Our biggest breakthrough was getting the 3D stable diffusion algorithm to work, as this was something we had never done before. We also expanded our AR capabilities by learning about ARKit, RealityKit and AVFoundation as well as using the ‘Combine’ and ‘Speech’ packages to transcribe the user’s spoken prompt and ensure a smooth experience.

What's next for Aira

Our next goal is to animate the AR objects generated by 3D stable diffusion. This involves identifying each moving component as a separate object, generating the components separately, and then using the instruction agent's contextualisation ability to understand how the components should move relative to each other, outputting the motion in polar coordinates. Following this, we will further fine-tune and optimise our model to cut down the time it takes to generate the 3D AR models. To improve the UX, we also plan to add arrows visualising the actions the user needs to take.
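As a sketch of what the planned polar-coordinate motion output could look like, the snippet below generates keyframes for a component (say, a hinged lid) sweeping around its pivot. The format and parameters are purely hypothetical:

```python
# Hypothetical sketch: a component's motion relative to its parent,
# expressed in polar coordinates and sampled into per-frame offsets.
import math

def polar_keyframes(radius, start_deg, end_deg, frames):
    """Sample (x, y) offsets along an arc of the given radius."""
    out = []
    for i in range(frames):
        theta = math.radians(
            start_deg + (end_deg - start_deg) * i / (frames - 1))
        out.append((radius * math.cos(theta), radius * math.sin(theta)))
    return out

# e.g. a lid sweeping 90 degrees around its hinge, 10 cm from the pivot
path = polar_keyframes(radius=0.1, start_deg=0, end_deg=90, frames=5)
```

Describing motion as (radius, angle) around a pivot keeps the instruction agent's output compact; the AR layer would interpolate between keyframes when animating the digital twin.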

Deck: https://www.canva.com/design/DAF9EZRlAW8/lDw9k8mMUDGqLUeVQBfBbw/edit?utm_content=DAF9EZRlAW8&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton
