Inspiration
Since the theme is building foundations, what could be more important than vision? (Both for humans and computers!)
According to gov.uk, at least 1 in 5 people in the UK have a long-term illness, impairment or disability. Many more have a temporary disability.
I can only imagine how that might make everyday tasks I take for granted exceedingly difficult: struggling to pick items out of a cupboard, for example, or worse, not being able to see my keys in front of me.
What it does
I provide a user interface that allows users to send a picture of their current environment, have a computer vision model analyse it in real time, and then verbalise the result using Neuphonic's text-to-speech capability.
It's not limited to describing a scene. For example, someone with impaired vision and a dietary requirement may be struggling to purchase the specific version of a product they need.
They can't tell the gluten-free version apart from the regular one because the packaging is too similar. They could take a picture of the two items and ask which is which, freeing them to complete everyday tasks like shopping on their own, without needing the assistance of a full-time carer.
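The whole flow can be sketched as a single backend handler in Go. This is an illustrative sketch, not my actual code: describeImage is a stand-in for the real vision-model call, and the TTS step is omitted.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// describeImage stands in for the real vision-model call (Replicate in my
// build); in the full app the returned text is then sent to Neuphonic's TTS.
func describeImage(img []byte) (string, error) {
	return fmt.Sprintf("description of a %d-byte image", len(img)), nil // stub
}

// describeHandler: the browser uploads a photo, the backend replies with a
// text description of the scene.
func describeHandler(w http.ResponseWriter, r *http.Request) {
	file, _, err := r.FormFile("image")
	if err != nil {
		http.Error(w, "missing image", http.StatusBadRequest)
		return
	}
	defer file.Close()

	img, err := io.ReadAll(file)
	if err != nil {
		http.Error(w, "read failed", http.StatusInternalServerError)
		return
	}
	text, err := describeImage(img)
	if err != nil {
		http.Error(w, "vision model failed", http.StatusBadGateway)
		return
	}
	fmt.Fprint(w, text)
}

func main() {
	// In the real app this sits behind nginx:
	//   http.HandleFunc("/describe", describeHandler)
	//   log.Fatal(http.ListenAndServe(":1337", nil))
	text, _ := describeImage([]byte("example"))
	fmt.Println(text)
}
```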
How I built it
I am familiar with Go, and hence it is my Go-to choice for any sort of web application; the same goes for Tailwind and the rest of my tech stack. I was already running an nginx reverse proxy, so on deployment it was as simple as adding an additional subdomain to my configuration, proxying to my Linux service (running on port 1337).
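That extra subdomain amounts to one more nginx server block, roughly like the one below. The domain name is a placeholder, and TLS/cert directives are omitted; only the proxy target matches my setup.

```nginx
server {
    listen 80;
    server_name narrator.example.com;  # placeholder subdomain

    location / {
        proxy_pass http://127.0.0.1:1337;  # the Go service
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```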
A key component of my application is the vision model; without the hard work of open-source researchers, none of this would be possible. I talk about my specific process in the next section.
As for the frontend, I scrambled together parts of Neuphonic's new JS SDK with a few event listeners to communicate with my backend. I reused the specific functions from Neuphonic's SDK that I needed to decode their WAV audio response, as upon testing, it did not work out of the box with popular base64 decoders found online.
I managed to call Neuphonic's services using raw web requests in Go, meaning no API keys are exposed in the public frontend, even though desperation was beginning to tempt me otherwise...
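A minimal sketch of that server-side call, with a hypothetical endpoint URL, header name, and request body (Neuphonic's real API details differ; see their docs). The point is that the key lives only in backend code, never in the browser.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// buildSpeakRequest constructs a server-side TTS request. The URL and field
// names are placeholders, not Neuphonic's actual API shape.
func buildSpeakRequest(apiKey, text string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"text": text})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", "https://api.example.com/tts", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-API-Key", apiKey) // key read from env/config on the server
	return req, nil
}

func main() {
	req, err := buildSpeakRequest("secret-key", "Hello from the narrator")
	if err != nil {
		panic(err)
	}
	// In the real app: resp, err := http.DefaultClient.Do(req), then decode
	// the base64 audio from the response body.
	fmt.Println(req.Method, req.URL.Host)
}
```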
Challenges I ran into
When first deciding how to run my vision model, I had a number of options. As it is advertised to run on mobile devices, I first attempted to run it directly on my laptop for simplicity. I quickly ran into dependency issues with numpy (it needed version one, not two), and later I found out it even needed my toaster to be blue! (Almost.)
While learning the proper dependency management strategy for Python (I don't use it often, and initially wanted only a simple script for model inference), I found out that I am GPU POOR: with my integrated Intel graphics, the model would not run, as the script imported CUDA!
Having sunk some time into this already, I had to look around for alternative solutions, of which there were many, all seemingly valid. I attempted to use a Hugging Face Space via the Gradio API, but I had trouble finding any documentation whatsoever on how to upload an image for inference.
This led me to be tempted by the self-hosted approach again. In an ideal world I would have used Google Colab, but I settled on running inference through Replicate as it was the most "plug and play" option; I needed something in a pinch, had spent enough time on this already, and knew I still had to implement Neuphonic's TTS.
Moving on to the next challenge: Neuphonic only has SDKs for Python and, recently, JavaScript, and the examples on the website do not explicitly state how the received audio response is decoded and played back.
Calling the API was not a problem, as curl examples are provided, but parsing the response was not explained in detail, even in the JavaScript examples.
Luckily, Jiameng from Neuphonic was on hand to explain the format of the file, so at least I knew I had to decode to WAV. I almost changed my approach to just using the JavaScript SDK on the frontend and exposing my API key, but I decided to give the proper route one last try: I had already managed to call the API and parse the response to get the encoded audio, and the last step was merely playing it back.
My last attempt involved scouring the recently created JavaScript SDK for Neuphonic, taking functions that looked promising. ("toWav" is certainly my all-time favourite.)
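For anyone curious what a "toWav"-style helper does, here is a sketch in Go of the same idea: wrapping raw PCM samples in a minimal RIFF/WAV header so a player can understand them. The 16 kHz mono 16-bit parameters are illustrative assumptions, not necessarily what the API returns.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// pcmToWav prepends a 44-byte RIFF/WAV header to raw PCM audio.
func pcmToWav(pcm []byte, sampleRate, channels, bitsPerSample int) []byte {
	var buf bytes.Buffer
	byteRate := sampleRate * channels * bitsPerSample / 8
	blockAlign := channels * bitsPerSample / 8

	buf.WriteString("RIFF")
	binary.Write(&buf, binary.LittleEndian, uint32(36+len(pcm))) // remaining size
	buf.WriteString("WAVE")

	buf.WriteString("fmt ")
	binary.Write(&buf, binary.LittleEndian, uint32(16)) // fmt chunk size
	binary.Write(&buf, binary.LittleEndian, uint16(1))  // 1 = uncompressed PCM
	binary.Write(&buf, binary.LittleEndian, uint16(channels))
	binary.Write(&buf, binary.LittleEndian, uint32(sampleRate))
	binary.Write(&buf, binary.LittleEndian, uint32(byteRate))
	binary.Write(&buf, binary.LittleEndian, uint16(blockAlign))
	binary.Write(&buf, binary.LittleEndian, uint16(bitsPerSample))

	buf.WriteString("data")
	binary.Write(&buf, binary.LittleEndian, uint32(len(pcm)))
	buf.Write(pcm)
	return buf.Bytes()
}

func main() {
	wav := pcmToWav(make([]byte, 320), 16000, 1, 16)
	fmt.Println(len(wav), string(wav[:4])) // 44-byte header + 320 data bytes
}
```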
Accomplishments that I am proud of
I spent longer than I'd like to admit on parsing Neuphonic's API response, but when I finally got it working, I screamed internally with joy.
To me, this made my previous accomplishment, setting up a frontend and forwarding pictures to the vision model, pale in comparison, as I now had a complete MVP on my hands rather than just a text description of an environment. That description was still an exciting first step at the time, so much so that I sent a picture of it to my parents!
What I learned
I will be sure to use Python virtual environments next time I am dealing with anything remotely complicated.
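A sketch of that workflow; the numpy version pin is an illustrative assumption based on the dependency clash I hit.

```shell
# Isolate the project's dependencies so numpy v1 vs v2 conflicts stay
# contained to this project instead of breaking the system Python.
python3 -m venv .venv
. .venv/bin/activate
python -m pip --version              # pip now resolves inside .venv
echo "numpy<2" > requirements.txt    # pin the major version the model expects
deactivate
```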
I'll also try to make better use of the GPU I have at home to run different experiments (it's a little too slow for Stable Diffusion, but I reckon it would run this vision model just fine), instead of just playing Elden Ring.
I also expanded my rudimentary JavaScript knowledge, as I typically use the bare minimum to get by, with my focus remaining on the backend, one of my primary passions in technology.
What's next for Realtime narrator
I would make it truly real time, instead of waiting for the user to specify when they want a frame of their video described. This could open up broader use cases, such as warning about unexpected hazards that a person with limited vision might miss and be endangered by.
According to the model's authors, this is completely feasible using a quantized model.
To add to this, I would want to run the model locally on the user's device, so that they can use it without internet access or a subscription fee, and I would make it open source, since this project would not have been possible without open-source technology anyway. (In a hackathon scenario, however, I was tasked with using Neuphonic's API.)
Moreover, for a production application, I would perform extensive accessibility and user testing to ensure that my intended users actually benefit from the application and find it easy to use; in particular, I would need to test screen-reader support.
Built With
- docker
- go
- javascript
- neuphonic
- nginx
- replicate
- tailwind