
Memoria

An AI-powered blog engine that turns PDFs and documents into publish-ready blog posts. Built with Flask, Supabase, Groq, and Moondream.

Upload a PDF, and Memoria will extract the text and structure, pull out embedded images, generate captions with a vision model, format everything into clean Markdown using an LLM, and publish it — all while preserving the original document's content word-for-word.



How It Works

  1. Upload a PDF through the admin panel or API.
  2. Extract text, headings, lists, tables, and images using PyMuPDF and pdfplumber.
  3. Caption extracted images using Moondream (local or cloud vision model).
  4. Format the extracted content into Markdown using Groq (LLaMA 3.3 70B). The AI only adds formatting — it does not rewrite, paraphrase, or alter the original text.
  5. Generate metadata: title, excerpt, tags, and meta description.
  6. Publish the result as a blog post with full SEO support, or save it as a draft for editing.

All of this happens in a background pipeline. You can also create posts manually with the built-in Markdown editor.
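
The background-pipeline shape described above can be sketched as a thread plus a shared status map. This is an illustration only, not Memoria's actual code; the `TASKS` registry, function names, and the placeholder processing step are all hypothetical:

```python
import threading
import uuid

# Hypothetical in-memory task registry, mirroring the task_id/status
# polling the API exposes.
TASKS: dict[str, dict] = {}

def process_pdf(task_id: str, pdf_bytes: bytes) -> None:
    """Run the extract -> caption -> format -> publish steps in order."""
    try:
        TASKS[task_id]["status"] = "processing"
        # ... extract text/images, caption, format with the LLM, publish ...
        TASKS[task_id].update(status="completed", post_id="generated-post-id")
    except Exception as exc:
        TASKS[task_id].update(status="failed", error=str(exc))

def start_pipeline(pdf_bytes: bytes) -> str:
    """Kick off processing in a background thread and return a task id."""
    task_id = uuid.uuid4().hex
    TASKS[task_id] = {"status": "queued"}
    threading.Thread(target=process_pdf, args=(task_id, pdf_bytes), daemon=True).start()
    return task_id
```

The daemon thread keeps the upload request non-blocking; callers hold on to the returned task id and poll for completion.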


Architecture

Browser
  |
  v
Flask Application (WSGI)
  ├── Blog (SSR)         — public-facing pages, sitemap, RSS, robots.txt
  ├── Admin Panel        — dashboard, PDF upload, Markdown editor, site settings
  ├── Auth               — session-based login via Supabase Auth
  └── REST API           — JSON endpoints for everything
        |
        v
  Service Layer
  ├── blog_service       — Supabase CRUD for posts and images
  ├── pdf_processor      — PyMuPDF + pdfplumber text/image extraction
  ├── ai_writer          — Groq LLM formatting and metadata generation
  ├── image_service      — Moondream captioning + Supabase Storage uploads
  ├── seo_service        — meta tags, JSON-LD, sitemap, RSS feed
  └── settings_service   — database-backed site configuration & theming
        |
        v
External Services
  ├── Supabase           — Postgres database + file storage + auth
  ├── Groq               — LLM inference (LLaMA 3.3 70B)
  └── Moondream          — Vision model for image captioning (local or cloud)

Requirements

  • Python 3.12+
  • A Supabase project (free tier works)
  • A Groq API key (free tier works)
  • A CUDA-capable GPU is recommended for image captioning (CPU works but is slower)
  • Optional: A Moondream cloud API key (if you prefer cloud over local inference)

Setup

1. Clone and install

git clone https://github.com/natyavidhan/memoria.git
cd memoria

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

2. Configure Supabase

Create the database tables. Go to your Supabase project's SQL Editor and run these migrations in order:

  1. migrations/001_initial_schema.sql — creates:

    • posts table with full-text search, tag arrays, and slug indexing
    • images table linked to posts with cascade delete
    • Row Level Security policies (public read for published content, service role for writes)
  2. migrations/002_site_settings.sql — creates:

    • site_settings table for editable site configuration (name, tagline, icon, theme colors, etc.)
    • Default settings for branding, theme, SEO, and general options
    • RLS policies (public read, service role write)

Create storage buckets. In the Supabase dashboard under Storage:

| Bucket | Visibility | Allowed types | Max size |
|---|---|---|---|
| blog-images | Public | png, jpg, gif, webp, svg | 5 MB |
| pdfs | Private | pdf | 16 MB |

Get your keys. Under Settings > API, copy:

  • The project URL
  • The service_role secret key (not the anon key)

3. Set environment variables

cp .env.example .env

Edit .env with your actual values. See the Configuration Reference for all options.
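
A minimal `.env` with the required variables might look like this (all values are placeholders to replace with your own; defaults and descriptions are in the Configuration Reference):

```env
SECRET_KEY=replace-with-a-long-random-string
SUPABASE_URL=https://xxxxx.supabase.co
SUPABASE_KEY=your-service-role-key
GROQ_API_KEY=your-groq-api-key
ADMIN_EMAIL=admin@memoria.local
ADMIN_PASSWORD=change-this-password
SITE_URL=http://localhost:8000
```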

4. Run

python run.py

The app starts at http://localhost:8000. Log in at /auth/login with the ADMIN_EMAIL and ADMIN_PASSWORD you set in .env.


Docker Deployment

CPU only (image captioning runs on CPU — slower but works everywhere):

docker compose up -d

With GPU acceleration (requires NVIDIA Container Toolkit):

docker compose --profile gpu up -d

The Moondream model (~3.7 GB) is downloaded automatically on first start and cached in a Docker volume across restarts.


Site Settings & Customization

Memoria has a built-in admin settings panel at /admin/settings that lets you customize the site without touching code or environment variables. Settings are stored in Supabase and take effect immediately.

Branding

| Setting | Description |
|---|---|
| Site Name | Blog title displayed in the navbar, page titles, and metadata. |
| Site Tagline | Short description used in meta tags. |
| Site Icon | Upload a custom image (PNG, SVG, ICO, etc.) used as the navbar icon and favicon. Falls back to a text letter if no image is set. |
| Brand Icon Text | Fallback letter or emoji shown in the navbar when no icon image is uploaded. |
| Brand Icon Colors | Background and foreground colors for the fallback text icon. |

Theme

| Setting | Description |
|---|---|
| Accent Color | Primary accent color used for links, buttons, and highlights. |
| Accent Hover Color | Accent color on hover/focus states. |
| Background Color | Page background color. |
| Text Color | Primary text color. |

General

| Setting | Description |
|---|---|
| Site URL | Public URL used for sitemaps, canonical links, and RSS. |
| Footer Text / Link | Customizable footer content. |
| Enable Search | Toggle the search bar on the blog index. |
| Posts Per Page | Number of posts shown per page. |

SEO

| Setting | Description |
|---|---|
| Site Description | Default meta description for the blog homepage. |

All settings can also be managed directly in the site_settings table in Supabase.


Configuration Reference

All configuration is through environment variables, loaded from .env. These serve as defaults — most can be overridden from the admin settings panel.

| Variable | Required | Default | Description |
|---|---|---|---|
| SECRET_KEY | Yes | change-me | Flask session signing key. Use a long random string in production. |
| FLASK_DEBUG | No | 0 | Set to 1 for development mode with auto-reload. |
| SUPABASE_URL | Yes | (none) | Your Supabase project URL (https://xxxxx.supabase.co). |
| SUPABASE_KEY | Yes | (none) | Supabase service_role key. Not the anon/public key. |
| GROQ_API_KEY | Yes | (none) | API key from console.groq.com. |
| GROQ_MODEL | No | llama-3.3-70b-versatile | Groq model identifier. |
| MOONDREAM_MODEL | No | vikhyatk/moondream2 | HuggingFace model ID for local image captioning. |
| MOONDREAM_REVISION | No | 2025-06-21 | Model revision/commit to use. |
| MOONDREAM_API_KEY | No | (none) | If set, uses Moondream's cloud API instead of the local model. |
| ADMIN_EMAIL | Yes | admin@memoria.local | Login email for the admin account. |
| ADMIN_PASSWORD | Yes | admin123 | Login password. Change this. |
| SITE_NAME | No | Memoria | Blog title (initial default; can be changed in admin settings). |
| SITE_URL | No | http://localhost:8000 | Public URL (initial default; can be changed in admin settings). |
| ALLOWED_ORIGINS | No | * | CORS allowed origins (comma-separated). |
| MAX_UPLOAD_SIZE_MB | No | 16 | Maximum file upload size in megabytes. |
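
Loading these variables with defaults typically looks like the sketch below. The `Config` class and attribute names are illustrative, not Memoria's actual code; only the variable names and defaults come from the table above:

```python
import os

# A minimal sketch of env-var loading with the documented defaults.
# Attribute names are assumptions, not Memoria's real config object.
class Config:
    SECRET_KEY = os.environ.get("SECRET_KEY", "change-me")
    FLASK_DEBUG = os.environ.get("FLASK_DEBUG", "0") == "1"
    GROQ_MODEL = os.environ.get("GROQ_MODEL", "llama-3.3-70b-versatile")
    # Flask enforces upload limits via a byte count, so convert from MB.
    MAX_CONTENT_LENGTH = int(os.environ.get("MAX_UPLOAD_SIZE_MB", "16")) * 1024 * 1024
```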

API Reference

All API endpoints are under /api. Responses use a standard JSON envelope. Errors return {"error": "message"} with the appropriate HTTP status code.

Public Endpoints

These require no authentication.

List posts

GET /api/posts?status=published&tag=python&q=search&page=1&per_page=12&sort=created_at&order=desc

Get a single post

GET /api/posts/<slug>

Returns the post with rendered HTML, SEO metadata, and linked images.

List tags

GET /api/tags

Returns all tags with post counts.

Search

GET /api/search?q=query&page=1&per_page=12

Full-text search across titles, excerpts, and content. Rate limited to 30 requests per minute.

SEO metadata

GET /api/posts/<slug>/seo

Returns Open Graph tags, Twitter card data, and JSON-LD structured data.

Sitemap

GET /api/sitemap.xml

RSS Feed

GET /api/feed/rss

Post images

GET /api/posts/<post_id>/images
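
A small client-side helper for building the list-posts query above (the helper itself is illustrative; the base URL and parameter names follow the endpoint documentation):

```python
from urllib.parse import urlencode

def list_posts_url(base="http://localhost:8000", **params):
    """Build a URL for GET /api/posts from keyword filters.

    Expected parameters are the ones the API documents: status, tag,
    q, page, per_page, sort, order. None values are dropped.
    """
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}/api/posts" + (f"?{query}" if query else "")
```

Usage: `list_posts_url(status="published", tag="python", page=1)` yields `http://localhost:8000/api/posts?status=published&tag=python&page=1`.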

Admin Endpoints

These require a valid admin session (cookie-based) or the appropriate auth header.

Create a post

POST /api/posts
Content-Type: application/json

{
  "title": "My Post",
  "markdown_content": "# Hello\n\nContent here.",
  "status": "draft",
  "tags": ["python", "ai"],
  "excerpt": "A short summary.",
  "meta_description": "SEO description."
}

Update a post

PUT /api/posts/<post_id>
Content-Type: application/json

{
  "title": "Updated Title",
  "markdown_content": "...",
  "status": "published"
}

Delete a post

DELETE /api/posts/<post_id>

Upload an image

POST /api/posts/<post_id>/images
Content-Type: multipart/form-data

file: <image file>

Delete an image

DELETE /api/images/<image_id>

Upload a PDF for processing

POST /api/upload
Content-Type: multipart/form-data

file: <pdf file>

Returns {"task_id": "..."}. Poll for status:

GET /api/upload/status/<task_id>

Returns {"status": "processing|completed|failed", "post_id": "...", "error": "..."}.
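
Polling that status endpoint can be sketched as follows. The fetcher is injected as a callable (for example, a thin wrapper around `requests.get`) so the sketch stays transport-agnostic; the function itself is an illustration, not part of Memoria's API:

```python
import time

def poll_upload_status(fetch_status, task_id, interval=2.0, timeout=300.0):
    """Poll GET /api/upload/status/<task_id> until the task finishes.

    fetch_status: callable(task_id) -> dict, returning the status JSON.
    Returns the final payload, or raises TimeoutError on expiry.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch_status(task_id)
        if result["status"] in ("completed", "failed"):
            return result
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish in {timeout}s")
```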

Reformat a post with AI

POST /api/posts/<post_id>/reformat

Re-runs the AI formatting pipeline on the post's existing content. Rate limited to 5 per minute.


PDF Processing Pipeline

When a PDF is uploaded, a background thread runs this pipeline:

PDF bytes
  │
  ▼
PyMuPDF — two-pass extraction
  ├── Pass 1: collect font-size statistics to detect heading thresholds
  ├── Pass 2: classify each text block as heading, paragraph, list, or table
  └── Extract embedded images (skips icons < 50px)
  │
  ▼
pdfplumber — table extraction
  └── Converts detected tables to Markdown format
  │
  ▼
Moondream — image captioning (if available)
  └── Generates descriptive captions and alt text for each image
  │
  ▼
Supabase Storage — image upload
  └── Uploads images to the blog-images bucket, returns public URLs
  │
  ▼
Groq (LLaMA 3.3 70B) — document formatting
  ├── Converts structured content nodes into clean Markdown
  ├── Inserts image references at contextually appropriate positions
  └── Fidelity check: verifies ≥85% word overlap with original text
  │
  ▼
Groq — metadata generation
  └── Title, excerpt, tags, meta description (as structured JSON)
  │
  ▼
Supabase — post creation
  └── Stores the post with auto-generated slug and linked image records

The entire pipeline is non-blocking. The admin UI polls for status, and the API exposes a task_id for programmatic polling.
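
The final step stores the post with an auto-generated slug. Memoria's exact slug rules are not documented here, but a typical slugify routine looks like this (a sketch under that assumption):

```python
import re
import unicodedata

def slugify(title: str) -> str:
    """One common way to derive a URL slug from a title.

    Not necessarily Memoria's implementation: normalize accents to
    ASCII, collapse non-alphanumeric runs to hyphens, and lowercase.
    """
    value = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
    value = re.sub(r"[^a-zA-Z0-9]+", "-", value).strip("-").lower()
    return value
```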


AI Integration Details

Text formatting (Groq): The system prompt strictly prohibits the LLM from changing the document's text. It may only add Markdown syntax (headings, bold, lists, code blocks, etc.) and insert image placeholders. After formatting, a fidelity check compares word overlap between the original and formatted text. If overlap drops below 85%, the original text is kept.
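
A word-overlap fidelity check like the one described could be computed as below. This is a plausible sketch, not the project's actual implementation; the tokenization and threshold handling are assumptions (only the 85% figure comes from the text):

```python
import re

def word_overlap(original: str, formatted: str) -> float:
    """Fraction of the original's word tokens that survive into the
    formatted text. Markdown syntax is ignored since only words count."""
    tokenize = lambda s: re.findall(r"[a-z0-9']+", s.lower())
    orig = tokenize(original)
    kept = set(tokenize(formatted))
    if not orig:
        return 1.0
    return sum(1 for w in orig if w in kept) / len(orig)

def keep_formatted(original: str, formatted: str, threshold: float = 0.85) -> bool:
    """Accept the LLM output only if it preserves enough of the text;
    otherwise the caller falls back to the original."""
    return word_overlap(original, formatted) >= threshold
```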

Image captioning (Moondream): Each extracted image is sent to the Moondream vision model, which returns a descriptive caption and a placement hint (e.g., "diagram showing network topology"). The model is loaded directly via transformers and runs on GPU (CUDA/MPS) or CPU. Moondream is optional — if unavailable, images are included without AI-generated captions. A cloud API is also supported via the MOONDREAM_API_KEY setting.

Model configuration: The Groq model defaults to llama-3.3-70b-versatile but can be changed via the GROQ_MODEL environment variable. Temperature is set to 0 for deterministic, minimal-deviation output.


Frontend & Responsive Design

Memoria uses a clean editorial design inspired by academic journals and literary magazines.

  • Typography: Newsreader (display/headings), Source Sans 3 (body text), JetBrains Mono (code blocks). Fluid sizing with clamp() for smooth scaling.
  • Color palette: Warm amber and stone tones — configurable via admin settings.
  • Responsive: Four breakpoints (768px, 480px, 360px, and landscape) with mobile-first adjustments. Touch-friendly tap targets (44px minimum), horizontally scrollable tag clouds, and responsive images constrained to viewport width.
  • Accessibility: Semantic HTML, ARIA landmarks, proper heading hierarchy, focus-visible outlines, prefers-reduced-motion support, and sufficient color contrast ratios.
  • Server-side rendered: All pages are rendered on the server as plain HTML — no JavaScript framework required. This means instant page loads, full search engine crawlability, and compatibility with AI search engines.

SEO

Every published post gets:

  • Open Graph meta tags (title, description, image, URL, type)
  • Twitter Card meta tags (summary or summary_large_image)
  • JSON-LD structured data (BlogPosting schema with headline, dates, author, keywords)
  • Canonical URL
  • Auto-generated meta description (from excerpt or AI-generated)

Site-wide:

  • XML sitemap at /sitemap.xml with all published posts
  • RSS 2.0 feed at /feed.xml with the latest 20 posts
  • robots.txt with sitemap reference and admin/API exclusions
  • Organization schema (name, logo, URL) for entity recognition
  • WebSite schema with SearchAction for sitelinks search
  • CollectionPage schema on the blog index with item list
  • BreadcrumbList schema on post pages
  • article:tag meta tags for topic signals

All SEO metadata uses the site name, URL, and icon configured in admin settings — no hardcoded values.


Security

  • Authentication: Session-based with HTTP-only, secure cookies. Single admin account authenticated via Supabase Auth.
  • CSRF protection: All form submissions are CSRF-protected via Flask-WTF. The API blueprint is exempt (uses session auth instead).
  • Rate limiting: Login (10/min), search (30/min), PDF upload (5/min), AI reformat (5/min). Uses in-memory storage — swap to Redis for multi-process deployments.
  • Input sanitization: All rendered HTML is sanitized through nh3 with a strict tag and attribute allowlist. External links get rel="noopener noreferrer".
  • File validation: Uploads are validated by extension and MIME type. Only PDFs and common image formats are accepted.
  • Security headers: X-Content-Type-Options, X-Frame-Options, X-XSS-Protection, Referrer-Policy, and Strict-Transport-Security (in production).
  • Row Level Security: Supabase RLS policies restrict public access to published posts only. Write operations require the service role key.
  • LLM prompt injection defense: Document content is wrapped in structured XML-like tags before being sent to the LLM, reducing the attack surface for prompt injection.
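
The XML-style wrapping from the last point can be sketched as follows. The tag name and escaping scheme are assumptions, not Memoria's exact prompt format:

```python
# Hypothetical sketch of delimiting untrusted document text before it
# is embedded in an LLM prompt. The <document> tag is an assumption.
def wrap_document(text: str) -> str:
    """Wrap untrusted text so instructions inside it are less likely
    to be interpreted as part of the prompt. Escapes any embedded
    closing tag so the document cannot break out of its delimiter."""
    safe = text.replace("</document>", "&lt;/document&gt;")
    return f"<document>\n{safe}\n</document>"
```

Delimiting reduces, but does not eliminate, prompt-injection risk; the strict system prompt and the fidelity check act as additional layers.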

License

This project is licensed under the GNU General Public License v3.0. See LICENSE for details.
