Problem Statement
People with vision and physical impairments are becoming more disconnected from our rapidly evolving digital world.
Mission
To provide every human with access to the internet.
Vision
Our vision is to be Earth's most customer-centric company: to build a reality where, no matter your circumstances, you have access to the collective knowledge of all humans.
If the internet is the collective consciousness of humanity, then every human has the right to access it in full.
Values
Accessibility
We want our software to be accessible to everyone, regardless of who they are or their physical capabilities.
Personal Freedom
Our team is passionate about empowering people and opening ways to sense and shape the world around them.
Customer-focused
By prioritising our target market, we motivate ourselves to create robust, high-quality products at reasonable speed.
About the project
Inspiration
As our digital world evolves ever faster, people with vision and physical impairments are being left behind.
How are they supposed to participate in society without tools made for them?
Trivial tasks like reading and answering emails become close to impossible without help from another person.
Target market
The WHO has reported that:
- Globally, at least 2.2 billion people have a near or distance vision impairment. In at least 1 billion of these cases, the vision impairment could have been prevented or is yet to be addressed.
- 200 million people worldwide are estimated to have AMD (age-related macular degeneration), and by 2040 this number will have risen to 300 million [2].
  - It is a leading cause of vision loss for older adults.
  - AMD doesn't cause complete blindness, but losing your central vision can make it harder to see faces, read, drive, or do close-up work like cooking or fixing things around the house.
- 1.1 billion people live with vision loss (IAPB Vision Atlas) [1], comprising:
  - 43 million people who are blind (crude prevalence 0.5%)
  - 295 million people with moderate to severe visual impairment (crude prevalence 3.7%)
  - 258 million people with mild visual impairment (crude prevalence 3.3%)
  - 510 million people with near vision problems (crude prevalence 6.5%)
- 1.6 billion people live with other visual impairments.
- Vision impairment poses an enormous global financial burden, with annual global productivity losses estimated at US$411 billion.
- Vision loss can affect people of all ages; however, most people with vision impairment and blindness are over the age of 50.
We also have an aging population that is spending far more time on screens, which contributes to the rising prevalence of vision impairment.
Over a million people in Australia live with physical disabilities [5] that hamper their productivity and their ability to use online services, some of which are a major part of daily life [6].
Existing Solutions
- IoT devices like Amazon's Echo and Google's voice assistant provide limited functionality for our target market.
- Screen readers like VoiceOver (iOS) and TalkBack (Android).
- Refreshable braille displays. These are too expensive and inaccessible to meet the requirements of scale we need.
- Be My Eyes: Connects blind and visually impaired users with sighted volunteers through live video calls, allowing volunteers to assist with tasks such as reading text or navigating surroundings. Our take: What if you didn’t need another person?
- Seeing AI: This Microsoft mobile app uses the device camera to identify people and objects, and then audibly describes them for people with visual impairment. Our take: This is great for understanding the world around you. But how do you interact with that world?
Solution
What it does
Right now, assistive technology on the web consists largely of using a screen reader to read content to you, and then touch-typing or using a Braille keyboard to act on a website, which is totally impractical for those with acquired visual impairment, especially the elderly.
Voxurf is a web extension, built entirely in Rust, that provides a simple interface where users can speak a command, which is interpreted by an AI in the context of a simplified version of the active webpage. It then executes actions corresponding to the parts of the user's command, allowing the blind and visually impaired to act in a way screen readers could never allow.
How we built it
We were committed to doing this entire project in Rust for extreme speed and safety, so we began by setting up a web extension that uses Rust for everything, from rendering the popup to interacting with the browser. After writing a little bit of glue code and wading through some content security policy documentation, we got this working pretty quickly, and proceeded to get transcription and website structure extraction working. One member of our team worked on a server that transcribes speech to text using OpenAI's Whisper model, all completely locally, while another worked on the UI and another on the extraction component.
By using Chrome's DevTools protocol, we were able to extract the browser's computed accessibility tree, which then needed to be deserialised into an appropriate data structure in Rust, and filtered to remove irrelevant elements. Once this was done (which took unreasonably long, thank you JS types and Chrome docs), we formatted this in a way an LLM could ingest, and engineered a prompt that would get it to produce some JavaScript code that would execute the actions the user's command corresponded to.
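To make that last step concrete, here is a minimal sketch in plain Rust of flattening the filtered accessibility nodes into text an LLM can ingest. The struct, field names, and output format here are illustrative assumptions, not our exact data structures (the real deserialised CDP nodes are far richer, with nested role and name value objects):

```rust
// Illustrative accessibility node (hypothetical fields, not the real CDP shape).
struct AxNode {
    backend_id: u64,
    ignored: bool,
    role: String,
    name: String,
}

// Render only the relevant nodes, one line each, so the prompt can ask the
// LLM to produce JavaScript that references elements by these IDs.
fn format_for_llm(nodes: &[AxNode]) -> String {
    nodes
        .iter()
        .filter(|n| !n.ignored)
        .map(|n| format!("[{}] {} \"{}\"", n.backend_id, n.role, n.name))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let nodes = vec![
        AxNode { backend_id: 1, ignored: false, role: "button".into(), name: "Submit".into() },
        AxNode { backend_id: 2, ignored: true, role: "generic".into(), name: "".into() },
        AxNode { backend_id: 3, ignored: false, role: "link".into(), name: "Home".into() },
    ];
    // Ignored nodes are dropped; the rest become a compact listing.
    println!("{}", format_for_llm(&nodes));
}
```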
For that, we needed to implement a system that resolved the backend IDs the accessibility API returns into frontend IDs that could be used to reference the nodes in the DOM API, and then we needed to add attributes to those nodes so the JS code could reference these. Again, thank you Chrome.
Then, with the transcription server ready, we threw everything together in the UI and got a record-transcribe-execute loop working! After figuring out a way of executing JS from a web extension, bypassing Chrome's normal security settings, we were in business, and could execute the mind-blowing action of hiding a sidebar in the Chrome developer documentation, using your voice! Very happily, no further code changes were needed to use the system to file GitHub issues, and even fill out the UniHack registration form.
The architectural diagram is included as an asset in the submission.
Challenges we ran into
Along the way, we ran into plenty of challenges. The first was in getting a Wasm-based web extension to work: that required manually writing the glue code between Wasm and JS, rather than using a build tool, because inline execution is disabled in v3 Chrome extensions. Then we had to work with the accessibility tree, which, as mentioned above, gives completely different IDs than the DOM API uses, and the latter all start as 0 unless you specially initialise them through an undocumented method. Figuring this out involved going through Chromium's actual source code, which was a spiritual experience, in some sense or another.
Even worse than this, half the elements provided in that tree have ignored: true, and are totally irrelevant to completing actions on the page. But, Chrome provides a flat tree structure, where each element references the IDs of its parent and children. If you filter out irrelevant nodes, you can't reconstitute the nested structure of the tree (because their children might be relevant), so we had to implement an algorithm to reconstitute the tree while also hoisting relevant children out from under irrelevant parents. With that done, we found ourselves stymied by JS execution, which was the biggest bottleneck by far.
In manifest v3, Chrome prevents extensions from executing anything dynamic, only allowing static scripts packed in with the extension to be run. This means no eval(), no inserting scripts into the <head> of the host page, and even enabling the userScripts privilege didn't work, because we needed to dynamically register and execute scripts, which that facility couldn't do (it only supports running scripts that have been pre-registered when a page loads). After a good deal of head-banging and trying to sneak eval() calls in anywhere we could, we realised the Chrome debugger API actually supports arbitrary code execution, so we used that, with its very convenient ability to totally bypass content security policies. Interestingly, Chrome requires users to enable developer mode to use the userScripts privilege, but the infinitely more powerful debugger permission does not require this, rendering that entire part of their security model utterly pointless.
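The shape of this escape hatch is simple: the extension ends up issuing a CDP `Runtime.evaluate` command through the `chrome.debugger` API. As a hedged sketch (assuming, as in our hand-written glue, that the parameters are assembled as a JSON string on the Rust side; `userGesture` is a real `Runtime.evaluate` parameter, but the function here is illustrative):

```rust
// Sketch: serialising `Runtime.evaluate` parameters by hand. The real
// extension hands the resulting object to `chrome.debugger.sendCommand`
// via JS glue; this helper is a hypothetical illustration.
fn evaluate_payload(expression: &str) -> String {
    // Minimal JSON string escaping for the generated JS snippet.
    let escaped = expression
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n");
    format!(r#"{{"expression":"{escaped}","userGesture":true}}"#)
}

fn main() {
    let js = "document.querySelector('aside').hidden = true;";
    println!("{}", evaluate_payload(js));
}
```

Because the debugger evaluates this in the page's context with CSP bypassed, whatever JavaScript the LLM produced runs unimpeded.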
We also initially tried to do transcription inside the browser, but this completely failed due to compilation issues, so we settled on a native server, communicated with over REST, for this prototype.
The source code for this project is available at https://github.com/arctic-hen7/voxurf
Accomplishments that we're proud of
We managed to build a Rust-based web extension, using a Rust frontend framework to render everything, and we also managed to make that extension let you use websites with your voice, even submitting GitHub issues with it! That was enough of an exciting moment that we got a noise complaint in the Library ;)
What we learned
Through this project, we learned how to build a web extension, none of us having ever done that before, and one in Rust at that! We also implemented a complex tree traversal algorithm that taught us both algorithmic lessons and some lessons in Rust lifetimes. We also spent plenty of time prompt engineering, learning about what GPT-3.5 interprets as subtle signals to either follow your instructions better, or completely ignore them.
From working with the DevTools API, we learned all sorts of things about how Chrome processes the DOM, and how JS works under the hood, including how a part of Chrome's extension security model is largely pointless.
What's next for Voxurf
Our roadmap can be split into three phases:
Phase I
This is where we are now, so we'll polish the code we have, especially focusing on getting recording and transcription working natively without a separate server (which, while it still runs locally, makes portability of the extension much more complex). We'll also continue to refine the prompt to allow the AI to execute composite actions (where it runs some code, then looks at the new state of the page, and then does some more). We tried to get that working for our prototype, but it was very unreliable.
Phase II
Here, we'll aim to integrate our solution with Sotto, a previous project of one of our team members, allowing users to dictate and edit longform text (imagine Vim, but for voice), and to then interact with the web, to do anything from writing and editing an email to posting an academic paper to arXiv.
We also want to scale our solution to work with web-based apps (e.g. Electron), and eventually any native app through an OS layer. Mobile support will also be in order here.
Phase III
Here, we would build a proprietary hardware device that could run our system against any app or interface, allowing the blind and visually impaired to interact with a device completely optimised for them, while also providing a productivity aid to anyone who wants it.
References
[1] https://www.iapb.org/learn/vision-atlas/magnitude-and-projections/global/
[2] Steinmetz, J.D., Bourne, R.R., Briant, P.S., Flaxman, S.R., Taylor, H.R., Jonas, J.B., Abdoli, A.A., Abrha, W.A., Abualhasan, A., Abu-Gharbieh, E.G. and Adal, T.G., 2021.
[3] https://www.healthline.com/health/jobs-for-blind-people
[4] http://webinsight.cs.washington.edu/papers/webanywhere-html/
[6] https://www.pewresearch.org/internet/2004/08/11/the-internet-and-daily-life/