Inspiration

Repost detection on Reddit is dominated by image-hash bots like Repost Sleuth and Magic Eye. They're great for memes and screenshots, but they ignore text posts entirely — or fall back to keyword overlap, which breaks the moment someone rewords a question.

Anyone who has moderated a discussion-heavy subreddit (r/AskReddit, r/personalfinance, r/explainlikeimfive, r/AskHistorians) has watched the same question come in dozens of times, phrased differently each time. "Why are NYC apartments impossible to afford?" and "How does anyone survive Manhattan rent prices?" share almost no keywords, but they're the same question. AutoModerator regex can't catch that. Manual deduplication doesn't scale.

I wanted to close that gap with something mods could actually trust — not a black-box AI score, but a tool that explains itself and fails safely.

What it does

AlreadyAsked watches new text posts in a subscribed subreddit and flags any that ask the same question as a recent post, even when the wording is completely different.

When a duplicate is confirmed, the app takes a moderator-configured action: report to the mod queue, remove the post, or sticky a comment linking the original discussion. Each flag comes with a plain-language explanation of why the two posts matched, so mods aren't staring at a similarity score wondering what the bot was thinking.

Crucially, it also knows when not to flag. Two posts in the same topic neighborhood — say, "How do I get into machine learning?" and "Should I learn Python or JavaScript first?" — are related but asking different things. AlreadyAsked leaves those alone.

How we built it

The app is built on Reddit's Developer Platform (Devvit) in TypeScript. Detection runs in two stages:

  1. Embedding stage. For every new self-post, the app computes a sentence embedding of the title and body using Gemini's gemini-embedding-001 model (768-dimensional output, L2-normalized). The vector is compared by cosine similarity against the embedding of every post from the same subreddit within the lookback window (default 60 days). The index lives in Devvit Redis, namespaced per subreddit, and a daily prune job at 03:00 UTC drops entries that have aged out of the window.

  2. LLM verification. If the top candidate's similarity is at or above the verification floor (default 0.70), the candidate pair is sent to gemini-2.5-flash with a strict prompt asking whether the two posts are the same specific question — not just the same topic. The LLM's verdict, not the embedding score, decides whether to flag.
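
The first stage can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the IndexedPost shape, the findCandidate name, and the in-memory array standing in for the Redis-backed index are all assumptions made for the example. Only the defaults (60-day window, 0.70 floor) come from the description above.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical shape of one indexed post; field names are illustrative.
interface IndexedPost {
  postId: string;
  createdAt: number; // Unix epoch milliseconds
  embedding: number[];
}

interface Candidate {
  postId: string;
  similarity: number;
}

// Cosine similarity of two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force scan: find the most similar post inside the lookback window
// and return it only if it reaches the verification floor.
function findCandidate(
  query: number[],
  index: IndexedPost[],
  now: number,
  lookbackDays = 60,
  floor = 0.7,
): Candidate | null {
  const cutoff = now - lookbackDays * MS_PER_DAY;
  let best: Candidate | null = null;
  for (const post of index) {
    if (post.createdAt < cutoff) continue; // outside the lookback window
    const similarity = cosineSimilarity(query, post.embedding);
    if (best === null || similarity > best.similarity) {
      best = { postId: post.postId, similarity };
    }
  }
  return best !== null && best.similarity >= floor ? best : null;
}
```

A non-null result here means only that the pair is worth sending to the LLM; it is not a flag by itself.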

A backfill menu item seeds the index with the subreddit's 500 most recent posts so the bot is useful on day one. Settings let mods tune the verification floor, lookback window, duplicate action, and comment template.
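
As a rough sketch of how these settings and the daily prune might fit together, the snippet below uses plain TypeScript types rather than any Devvit API. The Settings field names, the default action, the default comment text, and the resolveSettings and pruneExpired helpers are all hypothetical. Only the 0.70 floor and 60-day window are defaults stated above.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

type DuplicateAction = "report" | "remove" | "comment";

// Hypothetical settings shape; the real setting names may differ.
interface Settings {
  similarityFloor: number;
  lookbackDays: number;
  action: DuplicateAction;
  commentTemplate: string;
}

const DEFAULT_SETTINGS: Settings = {
  similarityFloor: 0.7, // default stated in the writeup
  lookbackDays: 60, // default stated in the writeup
  action: "report", // assumed default; the writeup does not say
  commentTemplate: "This question was discussed recently: {{original_url}}", // assumed
};

// Merge mod overrides onto the defaults and keep the floor inside [0, 1].
function resolveSettings(overrides: Partial<Settings>): Settings {
  const merged: Settings = { ...DEFAULT_SETTINGS, ...overrides };
  merged.similarityFloor = Math.min(1, Math.max(0, merged.similarityFloor));
  return merged;
}

// Remove entries older than the lookback window and return the removed ids.
// The map (post id -> creation time in ms) stands in for the Redis-backed index.
function pruneExpired(
  createdAtById: Map<string, number>,
  now: number,
  lookbackDays: number,
): string[] {
  const cutoff = now - lookbackDays * MS_PER_DAY;
  const removed: string[] = [];
  createdAtById.forEach((createdAt, postId) => {
    if (createdAt < cutoff) {
      createdAtById.delete(postId);
      removed.push(postId);
    }
  });
  return removed;
}
```

Tying the prune cutoff to the same lookbackDays value that the search uses keeps the index from holding vectors the search would never consider.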

Challenges we ran into

The hardest part wasn't catching duplicates — embeddings handle that easily. The hard part was not over-flagging. Pure embedding similarity will happily group "How do I get into ML?" with "Should I learn Python or JavaScript?" because they're in adjacent topic space. False positives erode mod trust faster than missed catches do, so the bar for flagging had to be high.

That's what drove the two-stage design. The embedding score is the recall gate; the LLM is the precision gate. Tuning the prompt to consistently distinguish "same topic" from "same specific question" took a lot of iteration on real Reddit post pairs.

The other challenge was failure handling. If the LLM call times out or returns malformed JSON, the bot fails closed: it logs the error and takes no action on the post. Silent misses are recoverable; a wave of bogus flags is not.
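
One way to get that behavior is to treat anything other than a well-formed positive verdict as "do nothing". The sketch below assumes the model is asked to reply with JSON like {"sameQuestion": true, "reason": "..."}; those field names, the 10-second timeout, and the helper names are illustrative, not taken from the app.

```typescript
// Hypothetical verdict shape; the app's real JSON schema may differ.
interface Verdict {
  sameQuestion: boolean;
  reason: string;
}

// Parse the model's raw reply. Anything malformed yields null (no flag).
function parseVerdict(raw: string): Verdict | null {
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data !== "object" || data === null) return null;
    const record = data as Record<string, unknown>;
    const sameQuestion = record["sameQuestion"];
    const reason = record["reason"];
    if (typeof sameQuestion !== "boolean" || typeof reason !== "string") {
      return null;
    }
    return { sameQuestion, reason };
  } catch {
    return null; // invalid JSON
  }
}

// Reject if the work does not settle within the given number of milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("LLM call timed out")), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Returns a verdict only for a confirmed duplicate. Timeouts, thrown errors,
// malformed replies, and negative verdicts all resolve to null.
async function verifyOrSkip(
  callModel: () => Promise<string>,
  timeoutMs = 10_000,
): Promise<Verdict | null> {
  try {
    const raw = await withTimeout(callModel(), timeoutMs);
    const verdict = parseVerdict(raw);
    return verdict !== null && verdict.sameQuestion ? verdict : null;
  } catch (err) {
    console.error("verification failed; taking no action", err);
    return null;
  }
}
```

Because every failure path collapses into the same null result, the caller needs only one check before acting on a post.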

Accomplishments that we're proud of

  • The two-stage architecture does what it was designed to do: it catches paraphrased reposts that keyword bots miss without flagging loosely related posts.
  • The flag includes a one-sentence explanation of why the posts matched, so mods can sanity-check the decision instantly.
  • Sensible defaults. A mod can install the app, paste an API key, run backfill, and be done — no other knobs to touch.
  • Fail-closed behavior on any LLM error, so a flaky API call never produces a false positive.

What we learned

Embeddings alone aren't enough for a moderation tool. They're a great first-pass filter, but the semantic neighborhood they capture is broader than what mods actually want flagged. Layering a stricter LLM verification step on top changes the product from "interesting demo" to "something a mod team could trust."

Also: explainability matters more than accuracy in moderation tooling. A 95%-accurate bot that gives no reasoning is worse than a 90%-accurate bot that tells you exactly why it made each call. Mods need to be able to override and learn the tool's behavior, and that requires transparency.

What's next for AlreadyAsked

  • Image / OCR support — extract text from image posts for parity with image-hash bots.
  • Cross-subreddit matching — for mods running networks of related subs.
  • ANN index — switch from brute-force cosine to HNSW for subreddits with >10,000 posts in the lookback window.
  • Mod-curated allow list — recurring weekly-thread titles ("What did you buy this week?") shouldn't trigger dedup.
  • False-positive feedback loop — let mods one-click "this isn't a duplicate" and have the bot adjust over time.
