X-arena

Inspiration

People have always been bad at predicting the future, but they keep doing it anyway. At hackathons, conferences, and product launches, everyone is building for a future shaped by technology. We noticed that when people, especially students, try to understand that future, they turn to Twitter.

Tech Twitter has effectively become an informal curriculum. Students use it to learn about AI, robotics, biotech, and space because technology moves faster than college courses can keep up. The problem is that this stream of information is flooded with confident but inaccurate predictions. Claims like "AGI by 2028" or "Mars colonization by 2030" spread without any accountability. We realized that this valuable source of knowledge has no verification layer. That insight led us to ask a simple question: if we benchmark AI models for accuracy, why not benchmark human predictions the same way?

What it does

Our product is a prediction verification layer for Tech Twitter. We analyze years of tweets from influential figures and identify the ones that make concrete, resolvable predictions. We then evaluate whether those predictions actually came true based on real-world outcomes. Each person is assigned an accuracy score and a power score that reflects how specific and testable their predictions are. The results are presented in a public leaderboard inspired by LLM Arena, ranking thought leaders across domains like AI, robotics, space, biotech, and energy. The goal is not to shame people, but to help users understand whose predictions have historically held up and whose have not.

How we built it

We scraped thousands of tweets spanning roughly a decade and stored them in an AWS S3 bucket, which served as our raw data store. We used Gemini to classify tweets into prediction and non-prediction categories. This step filtered out opinions, commentary, and hype so we only evaluated statements that made concrete claims about the future. For verification, we used AWS Bedrock to run a retrieval-augmented generation pipeline. Gemini accessed external information to determine whether a prediction was correct, partially correct, or incorrect based on historical events.
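The flow described above (filter tweets down to predictions, retrieve evidence, then judge the outcome) can be sketched roughly as follows. This is only an illustration under stated assumptions, not our actual code: the `Tweet` and `Verdict` types, the `run_pipeline` function, the outcome label strings, and the three toy stand-in functions are all hypothetical. In the real system the classifier and judge would be model calls and the retriever would query an external source; here they are plain Python callables so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Tweet:
    author: str
    text: str


@dataclass
class Verdict:
    tweet: Tweet
    evidence: List[str]
    outcome: str  # assumed labels: "correct", "partial", "incorrect"


def run_pipeline(
    tweets: Iterable[Tweet],
    is_prediction: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    judge: Callable[[str, List[str]], str],
) -> List[Verdict]:
    """Filter tweets to predictions, gather evidence, and judge each one."""
    verdicts = []
    for tweet in tweets:
        # Stage 1: drop opinions, commentary, and hype.
        if not is_prediction(tweet.text):
            continue
        # Stage 2: retrieve outside evidence before judging,
        # so the verdict is grounded rather than purely generated.
        evidence = retrieve(tweet.text)
        # Stage 3: decide the outcome from the claim plus the evidence.
        outcome = judge(tweet.text, evidence)
        verdicts.append(Verdict(tweet, evidence, outcome))
    return verdicts


# Toy stand-ins (hypothetical) so the sketch is runnable without any API.
def toy_is_prediction(text: str) -> bool:
    return " by 20" in text  # crude check for a dated claim


def toy_retrieve(text: str) -> List[str]:
    return ["placeholder evidence snippet"]


def toy_judge(text: str, evidence: List[str]) -> str:
    return "incorrect" if evidence else "partial"


sample = [
    Tweet("alice", "Great demo today!"),
    Tweet("bob", "Product X will ship by 2020"),
]
verdicts = run_pipeline(sample, toy_is_prediction, toy_retrieve, toy_judge)
```

Passing the three stages in as callables keeps the control flow separate from any particular model or retrieval backend, which makes it easy to swap a stage or test the pipeline offline.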

We then computed two core metrics. Accuracy measured how often a user’s predictions matched real outcomes. Power score weighted each prediction by a specificity factor k, so more specific and testable predictions carried more weight. Correct predictions were rewarded with a positive multiplier and incorrect ones were penalized with a negative multiplier. Finally, we ranked users on a leaderboard interface inspired by LLM Arena to make the results intuitive and comparable.
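A minimal sketch of the two metrics, under assumptions this write-up does not spell out: the multiplier values (`REWARD`, `PENALTY`), the range of k (taken here as 0 to 1), the half credit for partially correct predictions in accuracy, and their zero contribution to the power score are all illustrative choices. The names `Prediction`, `accuracy`, `power_score`, and `leaderboard` are hypothetical as well.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical multipliers; the actual values are not given in this write-up.
REWARD = 1.0
PENALTY = -1.0


@dataclass
class Prediction:
    author: str
    outcome: str  # assumed labels: "correct", "partial", "incorrect"
    k: float      # specificity factor, assumed to lie in [0, 1]


def accuracy(preds: List[Prediction]) -> float:
    """Share of predictions that held up; partial counts as half (assumption)."""
    if not preds:
        return 0.0
    credit = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}
    return sum(credit[p.outcome] for p in preds) / len(preds)


def power_score(preds: List[Prediction]) -> float:
    """Sum of specificity-weighted rewards and penalties."""
    total = 0.0
    for p in preds:
        if p.outcome == "correct":
            total += REWARD * p.k
        elif p.outcome == "incorrect":
            total += PENALTY * p.k
        # Partial outcomes add nothing here (assumption).
    return total


def leaderboard(preds: List[Prediction]) -> List[Tuple[str, float, float]]:
    """Return (author, accuracy, power score) rows, highest power score first."""
    by_author: Dict[str, List[Prediction]] = {}
    for p in preds:
        by_author.setdefault(p.author, []).append(p)
    rows = [(a, accuracy(ps), power_score(ps)) for a, ps in by_author.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up data: "safe" is always right but vague; "bold" is specific and mostly right.
example = [
    Prediction("safe", "correct", 0.25),
    Prediction("safe", "correct", 0.25),
    Prediction("safe", "correct", 0.25),
    Prediction("bold", "correct", 0.75),
    Prediction("bold", "correct", 0.75),
    Prediction("bold", "incorrect", 0.5),
]
board = leaderboard(example)
```

With this made-up data, "safe" has perfect accuracy but a power score of 0.75, while "bold" has lower accuracy but a power score of 1.0 and ranks first. That is the effect the specificity weighting is meant to produce: low-risk, vague predictions cannot dominate on accuracy alone.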

Challenges we ran into

The hardest challenge was defining what counts as a valid prediction. Many tweets are vague, speculative, or intentionally ambiguous, which makes evaluation difficult. Another major challenge was reducing hallucinations when verifying outcomes. We addressed this by combining retrieval with model reasoning instead of relying on pure generation. We also had to design a scoring system that balanced accuracy with specificity. Without that, users making extremely safe predictions would dominate the rankings.

Accomplishments that we're proud of

We built an end-to-end system that goes from raw social media data to a ranked, interpretable leaderboard. We successfully treated human predictions as evaluable artifacts rather than opinions, which is a nontrivial shift in how online discourse is usually handled. We also demonstrated that large language models can be used not just to generate content, but to audit and evaluate historical claims when combined with retrieval and structure.

What we learned

We learned that, in our data, prediction quality is strongly correlated with specificity. People who make bold but vague claims tend to perform worse than those who make narrow, testable predictions. We also learned that verification is just as important as generation in AI systems. Without grounding and retrieval, even strong models struggle with historical accuracy. Finally, we learned that trust in information is something that can be engineered, not just assumed.

What's next

Next, we want to expand beyond Twitter to include podcasts, blog posts, and conference talks. We also want to allow users to drill into individual predictions to see exactly how and why they were scored. Long term, we see this becoming a reputation layer for public forecasting that helps students, investors, and builders identify which voices are worth listening to when thinking about the future.

Built With

amazon-web-services
aws-bedrock
aws-vectordbs
gemini-api
python
rag
react
typescript

Updates

Talha Gondal started this project — Feb 01, 2026 11:35 AM EST
