michi883/already-asked


AlreadyAsked

Semantic duplicate-post detection for Reddit moderators.

AlreadyAsked catches paraphrased reposts — questions that ask the same thing in different words — using sentence embeddings plus an LLM second pass, instead of keyword or hash matching.

"Why are NYC rents impossible?" and "How can anyone afford Manhattan apartments?"

Existing repost bots see two unrelated posts. AlreadyAsked sees the same question.

Built on Reddit's Developer Platform (Devvit).


What problem this solves

Repost detection on Reddit today is dominated by image-hash bots like Repost Sleuth and Magic Eye. They work well on memes and screenshots but ignore text posts entirely — or fall back to keyword overlap, which misses anything reworded.

Moderators of discussion-heavy subs (r/AskReddit, r/personalfinance, r/explainlikeimfive, r/AskHistorians) handle the same question rephrased dozens of times. Manual deduplication is unscalable; AutoModerator regex doesn't generalize.

AlreadyAsked closes that gap.

How it works

Two-stage detection:

  1. Embedding stage. On every new self-post in a subscribed subreddit, the app computes a sentence embedding of the title + body using Gemini's gemini-embedding-001 (768 dimensions, normalized). The vector is compared (cosine similarity) against the embeddings of every post submitted to that subreddit in the configurable lookback window (default 60 days). The top candidate is identified.
  2. LLM verification. If the top candidate's similarity is at or above the verification floor (0.70), the candidate pair is sent to gemini-2.5-flash with a strict prompt asking whether the two posts are the same specific question (not just the same topic). The LLM's verdict — not the embedding score — decides whether to flag.
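The embedding stage above can be sketched as a brute-force cosine scan followed by a floor check. This is a minimal illustration, not the code in dedup.ts or similarity.ts: the `IndexedPost` and `Candidate` shapes and the function names are assumptions made for the example, and only the 0.70 floor comes from the description above.

```typescript
// Hypothetical shape of one indexed post; field names are illustrative.
interface IndexedPost {
  id: string;
  embedding: number[]; // assumed to be normalized, as described above
}

interface Candidate {
  id: string;
  similarity: number;
}

// Cosine similarity. For unit-length vectors this equals the dot product,
// but dividing by the norms keeps the result correct for unnormalized input.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Brute-force scan: returns the most similar indexed post,
// or null when the index is empty.
function findTopCandidate(query: number[], index: IndexedPost[]): Candidate | null {
  let best: Candidate | null = null;
  for (const post of index) {
    const similarity = cosineSimilarity(query, post.embedding);
    if (best === null || similarity > best.similarity) {
      best = { id: post.id, similarity };
    }
  }
  return best;
}

// Verification floor from the description above.
const VERIFICATION_FLOOR = 0.7;

// Only candidates at or above the floor go on to the LLM stage.
function needsVerification(candidate: Candidate | null): boolean {
  return candidate !== null && candidate.similarity >= VERIFICATION_FLOOR;
}
```

A linear scan like this is simple and exact; its cost grows with the number of posts in the lookback window, which is the trade-off the ANN roadmap item refers to.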

If the LLM call fails (timeout, malformed JSON, HTTP error), the bot fails closed: it logs the failure and does not flag. False positives erode mod trust faster than missed catches.
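The fail-closed rule can be illustrated with a small decision function. This is a sketch under stated assumptions: the `{ duplicate, reason }` reply schema and the names `Verdict`, `parseVerdict`, and `decideFailClosed` are hypothetical, and the caller is assumed to pass `null` when the request itself failed (timeout or HTTP error).

```typescript
// Hypothetical verdict shape; the real prompt's JSON schema is not documented here.
interface Verdict {
  duplicate: boolean;
  reason: string;
}

// Parses the model's raw text reply. Throws on anything that is not
// a JSON object with a boolean "duplicate" field.
function parseVerdict(raw: string): Verdict {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("verdict is not a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  if (typeof record.duplicate !== "boolean") {
    throw new Error("verdict is missing a boolean 'duplicate' field");
  }
  return {
    duplicate: record.duplicate,
    reason: typeof record.reason === "string" ? record.reason : "",
  };
}

// Fail closed: any failure is logged and treated as "not a duplicate".
// rawReply is null when the call itself failed (timeout, HTTP error).
function decideFailClosed(
  rawReply: string | null,
  logFailure: (message: string) => void = () => {},
): Verdict {
  try {
    if (rawReply === null) {
      throw new Error("verification call failed");
    }
    return parseVerdict(rawReply);
  } catch (err) {
    logFailure(err instanceof Error ? err.message : String(err));
    return { duplicate: false, reason: "verification failed" };
  }
}
```

Routing every failure path through one `catch` means a new failure mode cannot accidentally produce a flag.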

Confirmed duplicates trigger the configured action: report, remove, or sticky comment linking the original. Every new post is indexed regardless of outcome, so future submissions can be matched against it.

The index lives in Devvit Redis, namespaced per subreddit, with a daily prune job (03:00 UTC) to drop entries older than the lookback window.
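The per-subreddit namespacing and the prune cutoff can be sketched without touching Redis. Everything named here is an assumption for illustration: the key format, the `IndexEntry` shape, and the in-memory array standing in for the Redis-backed index. Only the 60-day default comes from the description above.

```typescript
// Hypothetical index entry; the real storage layout in Devvit Redis may differ.
interface IndexEntry {
  postId: string;
  createdAtMs: number; // submission time in milliseconds since the epoch
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Illustrative key scheme: one namespace per subreddit.
function indexKey(subreddit: string): string {
  return `alreadyasked:${subreddit.toLowerCase()}:index`;
}

// Keeps only entries submitted within the lookback window.
// An entry exactly at the cutoff is kept.
function pruneIndex(
  entries: IndexEntry[],
  nowMs: number,
  lookbackDays: number = 60,
): IndexEntry[] {
  const cutoff = nowMs - lookbackDays * DAY_MS;
  return entries.filter((entry) => entry.createdAtMs >= cutoff);
}
```

A daily job would call something like `pruneIndex` once per subreddit namespace and write the surviving entries back.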

Install

  1. Install AlreadyAsked from the Devvit App Directory (or upload your own build — see Development below).
  2. Open the app's settings:
    • Gemini API key (app-wide, set once). Used for both embeddings and verification.
    • Embedding similarity floor — default 0.70. Pairs at or above this score are sent to the LLM. Lower means more LLM calls and better recall; the LLM is the precision gate.
    • Lookback window — default 60 days.
    • Action — report / remove / comment.
    • Comment template — optional, for the comment action. Supports {permalink}, {confidence}, {why_matched}, {match_title}.
  3. From the subreddit's mod menu, run "AlreadyAsked: Backfill index (last 500 posts)". This seeds the index so the bot is useful from day one.
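The comment template placeholders listed in step 2 could be filled in along these lines. This is a sketch, not the app's renderer: `TemplateValues` and `renderComment` are hypothetical names, and leaving unknown placeholders untouched is a design choice made for the example.

```typescript
// The four placeholders the settings description lists.
interface TemplateValues {
  permalink: string;
  confidence: string;
  why_matched: string;
  match_title: string;
}

// Replaces each known {placeholder} with its value.
// Anything else in braces is left as written.
function renderComment(template: string, values: TemplateValues): string {
  return template.replace(
    /\{(permalink|confidence|why_matched|match_title)\}/g,
    (_match: string, key: string) => values[key as keyof TemplateValues],
  );
}
```

For example, the template `This looks like "{match_title}" ({confidence}): {permalink}` would render with the matched post's title, the confidence label, and a link to the original.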

Confidence labels

Surfaced to moderators in the report reason and sticky comment:

  • Highly likely duplicate: embedding similarity ≥ 0.90 and LLM confirms duplicate.
  • Possible duplicate: embedding similarity at least 0.70 but below 0.90, and LLM confirms duplicate.

The LLM verdict is the gate; the embedding score is only used to bucket confirmed duplicates into a label.
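The bucketing rule can be written as a small function. The name `confidenceLabel` and the `null` return for unflagged posts are assumptions for this sketch; the thresholds and label strings are the ones listed above.

```typescript
type ConfidenceLabel = "Highly likely duplicate" | "Possible duplicate";

// The LLM verdict gates the result; the similarity score only picks the label.
// Returns null when the post should not be flagged at all.
function confidenceLabel(
  similarity: number,
  llmConfirmed: boolean,
): ConfidenceLabel | null {
  if (!llmConfirmed || similarity < 0.7) {
    return null;
  }
  return similarity >= 0.9 ? "Highly likely duplicate" : "Possible duplicate";
}
```

Note that a pair scoring 0.95 without LLM confirmation gets no label, while a confirmed pair at exactly 0.90 falls in the higher bucket.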

Cost

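gemini-embedding-001 at 768 dimensions and gemini-2.5-flash for verification are both inexpensive. A typical post is ~200 tokens for embedding; verification calls are short and capped at 500 output tokens with thinking disabled. A subreddit with 1,000 new posts/day therefore embeds roughly 200,000 tokens per day (about 6 million per month) and makes at most one verification call per post, and only for posts whose top candidate clears the similarity floor. The resulting monthly bill is small.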

Configuration philosophy

AlreadyAsked ships with opinionated defaults. The threshold, lookback window, and action are all tunable, but a moderator can install the app, set the API key, run backfill, and be done — without touching any other knob.

Roadmap

  • Image / OCR support — extract text from image thumbnails for parity with image-hash bots.
  • Cross-subreddit matching — for moderators who run a network of related subs.
  • ANN index — for subreddits with > 10,000 posts/window, switch from brute-force cosine to an approximate-nearest-neighbor index (HNSW).
  • Mod-curated allow list — recurring weekly-thread titles ("What did you buy this week?") shouldn't trigger dedup.
  • False-positive feedback loop — let mods one-click "this isn't a duplicate" and the bot adjusts thresholds.

Project structure

src/
  main.tsx        # entry point — triggers, settings, menu items, scheduled prune
  dedup.ts        # embedding-stage candidate search and Redis index
  embeddings.ts   # pluggable embedding provider (default: Gemini)
  verification.ts # LLM verification stage (Gemini Flash)
  similarity.ts   # cosine math and text preparation
  backfill.ts     # cold-start indexing from /new

Development

npm install
npm run check             # TypeScript type check
npx devvit login          # one-time
npx devvit upload         # publishes a new version
npx devvit playtest <sub> # iterate against a test subreddit
npx devvit logs <sub>     # tail logs

You'll need a Gemini API key from Google AI Studio and a Reddit account with mod access to a subreddit you can playtest against.

License

MIT — see LICENSE.
