SharedSight

Inspiration

SharedSight was designed as an explainable AI system that detects misinformation but can also learn when it is wrong. To that end, SharedSight offers an efficient search platform that not only identifies misinformation, but also justifies its decisions and corrects itself based on human feedback.

What it does

SharedSight is a search engine for news articles published by the Huffington Post between 2012 and 2022, though the system can work with any natural-language text corpus. SharedSight also identifies misinformation, with a focus on explaining its decisions through nested semantic search and letting users correct it when it produces incorrect results.
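To make the explain-and-correct idea concrete, here is a minimal sketch using only the Python standard library. It is not SharedSight's actual implementation: the `FeedbackStore` class, its method names, the labels, and the toy two-dimensional vectors are all hypothetical. It assumes a decision can be justified by returning the nearest labeled reference texts as evidence, and that a user correction can be stored as a new labeled example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FeedbackStore:
    """Hypothetical store of labeled reference texts (illustration only)."""

    def __init__(self):
        self.items = []  # each item: {"text": ..., "vector": ..., "label": ...}

    def add(self, text, vector, label):
        self.items.append({"text": text, "vector": vector, "label": label})

    def classify(self, vector, k=3):
        """Return a label plus the k nearest texts that justify it."""
        ranked = sorted(
            self.items,
            key=lambda item: cosine(vector, item["vector"]),
            reverse=True,
        )
        evidence = ranked[:k]
        votes = Counter(item["label"] for item in evidence)
        label = votes.most_common(1)[0][0]
        return label, [item["text"] for item in evidence]

    def correct(self, text, vector, label):
        """Record a user's correction as a new labeled example."""
        self.add(text, vector, label)


# Toy usage: the vectors are made up and stand in for real text embeddings.
store = FeedbackStore()
store.add("reference article A", [1.0, 0.0], "reliable")
store.add("reference article B", [0.0, 1.0], "misinformation")

query_vector = [0.1, 0.9]
label, evidence = store.classify(query_vector, k=1)
print(label, evidence)

# A user disagrees, so the correction is stored and affects later queries.
store.correct("user-reviewed article", query_vector, "reliable")
print(store.classify(query_vector, k=1))
```

Returning the evidence alongside the label is what makes the decision inspectable: a user can read the neighbors and judge whether the comparison was reasonable before accepting or correcting the result.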

How we built it

I built SharedSight using Co:here text embeddings, FAISS for similarity search with an HNSW index, and Python to tie everything together. Other tools included PyTorch, argparse, Matplotlib, scikit-learn, and more.

Challenges we ran into

Finding a good embedding space for the text was quite tricky, since embedding is the most crucial step of the process. If the mapping did not produce embeddings that properly captured the topology of the topic space, any similarity search run on them would ultimately be useless.

Accomplishments that we're proud of

I'm proud of building a similarity search that runs in near real time and can both identify misinformation and justify its decisions. This is a big first step for me toward building explainable and effective AI.

What we learned

I learned a great deal, especially about transformer embeddings, BERT, dimensionality reduction with t-SNE, similarity search with FAISS, prompt engineering, and topic modelling.

What's next for SharedSight

Some next steps for SharedSight could include finding a smarter embedding space, perhaps one that supports joint image-text embeddings to allow interchangeable search between articles and images. Additionally, the system could be decentralized, which could soon be an active area of research for students at UC. The system could also be integrated into web apps or other utilities, hosted on a Flask or FastAPI web server, or integrated with other systems like Elasticsearch. Finally, the system could be used with a larger text corpus, as it currently uses only 2k of the 201k text samples in the Huffington Post dataset.