SebastianMeisel/org-parser

org-parser – Documentation (English)

org-parser

What is this?

org-parser is a small streaming Org reader and minimal Org→HTML exporter.

I built it so I can read my Org files on devices where I can’t install Emacs.

Building blocks:

  • org_reader.py: reads Org files line-by-line and expands #+INCLUDE depth-first
  • org_parser.py: streaming parser emitting events (OrgEvent)
  • org_to_html.py: minimal renderer
  • webapp.py: web viewer (Flask)
  • math_renderer.py: LaTeX → SVG (cache)

Design goals:

  • streaming / lazy
  • “good enough” Org support
  • clear state + events pipeline

Installation & running

Option A: Local (system Python)

Requirements:

  • Python ≥ 3.12 recommended
  • for math SVG: latex and dvisvgm in PATH

Minimal:

python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install flask gunicorn pyyaml

Option B: Local with uv

uv venv
source .venv/bin/activate
uv pip install flask gunicorn pyyaml

Option C: Container (Podman/Docker)

Build an image and run it with (podman-)compose.

Benefits:

  • reproducible
  • no Python setup needed on the host

Build

podman build -t org-viewer:latest -f Containerfile .

Run (Compose)

podman-compose up -d
podman logs -f org-viewer

TLS options

1) TLS inside the container (Gunicorn)

  • mount cert/key to /certs
  • set CERT_FILE and KEY_FILE

2) TLS via reverse proxy (recommended)

The container serves plain HTTP internally; let Caddy, Nginx, or Traefik handle TLS.

Math cache & permissions (important!)

The renderer writes to /app/.math-cache.

With a bind mount:

  • use Podman's :U mount option so the directory's ownership is changed to match the user inside the container
  • on SELinux hosts, also add :Z

Example: ./.math-cache:/app/.math-cache:rw,Z,U

(better than chmod 777 on the host)

Usage (CLI / export / web)

1) Expand includes (debug)

Shows how the reader resolves #+INCLUDE directives.

python3 org_reader.py

2) Org → HTML export

Exports an Org file to HTML.

python3 org_to_html.py org/90-feature-demo.org -o out.html

3) Start web viewer locally

Starts the Flask web app.

python3 webapp.py
# then: http://localhost:5000

4) Web viewer via Gunicorn

HTTP:

gunicorn -w 4 -b 0.0.0.0:5000 webapp:app

TLS:

gunicorn -w 4 -b 0.0.0.0:5000 webapp:app --certfile /certs/tls.crt --keyfile /certs/tls.key

API reference

This section aggregates the documentation of the internal org-parser modules.

Each module has its own file. They are included here using #+INCLUDE so they stay modular, but can also be read as one continuous document.

config_loader.py

Overview

The config_loader.py module encapsulates all configuration for the Org reader (regexes, block types, header keys, etc.) in a dedicated class.

Class: OrgReaderConfig

Container for regexes and parser settings.

Key fields (excerpt):

  • verbatim_blocks: set[str]
  • skip_header_keys: set[str]
  • quotes: dict[str,str]
  • block_re: re.Pattern
  • header_kv_re: re.Pattern
  • include_keyword_re: re.Pattern
  • section_heading_re: re.Pattern
  • comment_begin_re, comment_end_re
  • latex_macro_re

An instance is passed around to the reader/parser as a configuration object.

DEFAULT_CONFIG

DEFAULT_CONFIG is a preconfigured OrgReaderConfig instance with compiled regexes for “normal” Org files.

Intended as:

  • a sensible default configuration
  • a reference for how the YAML config file is structured

load_config(path: Path) -> OrgReaderConfig

Reads a YAML file (e.g. config.yml) and creates an OrgReaderConfig instance.

Typical behavior:

  • parses YAML
  • compiles regex strings
  • converts fields like verbatim_blocks / skip_header_keys to proper types (set, dict, …)

Used to override or customize the default configuration.
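
The conversion steps listed above can be sketched roughly as follows. This is not the code from config_loader.py: ReaderConfigSketch and config_from_mapping are hypothetical names, only three fields are shown, and the sketch assumes the YAML file has already been parsed into a plain dict (for example with yaml.safe_load).

```python
import re
from dataclasses import dataclass, field


@dataclass
class ReaderConfigSketch:
    """Hypothetical stand-in for OrgReaderConfig (only three fields shown)."""
    verbatim_blocks: set[str] = field(default_factory=set)
    skip_header_keys: set[str] = field(default_factory=set)
    include_keyword_re: re.Pattern = field(
        # Assumed default pattern, not taken from the project.
        default_factory=lambda: re.compile(r"^\s*#\+INCLUDE:\s*(.+)$", re.IGNORECASE)
    )


def config_from_mapping(raw: dict) -> ReaderConfigSketch:
    """Turn a parsed YAML mapping into typed fields."""
    cfg = ReaderConfigSketch()
    # YAML lists become sets so membership tests are cheap and duplicates vanish.
    if "verbatim_blocks" in raw:
        cfg.verbatim_blocks = set(raw["verbatim_blocks"])
    if "skip_header_keys" in raw:
        cfg.skip_header_keys = set(raw["skip_header_keys"])
    # Regexes arrive as strings and are compiled once, up front.
    if "include_keyword_re" in raw:
        cfg.include_keyword_re = re.compile(raw["include_keyword_re"])
    return cfg


cfg = config_from_mapping({"verbatim_blocks": ["example", "src", "example"]})
print(sorted(cfg.verbatim_blocks))
```

Keys that are missing from the mapping keep their defaults, which is one simple way to let a config file override only part of the configuration.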

org_parser.py

Overview

org_parser.py contains the streaming parser that turns lines into OrgEvent objects and maintains the stateful OrgState.

Dataclasses

  • OrgEvent(type: str, data: dict[str, Any])
    • generic event object
    • type describes the kind of event (e.g. "heading", "block_begin", …)
    • data holds context-specific information (level, text, options, …)
  • OrgPreamble(headers: dict[str,str])
    • represents preamble headers (#+TITLE, #+AUTHOR, …)
    • convenient properties for title/author/date/options
  • OrgState
    • mutable streaming state
    • tracks e.g.:
      • whether we are in the preamble
      • whether we are inside a block / src block / comment block
      • context for lists, tables, etc.

Important functions

  • parse_org_line(line, cfg, state) -> (state, events): core parser function
    • takes a single line
    • uses regexes from cfg
    • updates state
    • returns a list of events (often empty or 1–2 entries)

Typical event types:

  • heading
  • block_begin / block_end
  • src_begin / src_end
  • list_item / ordered_list_item
  • table_row / table_hline
  • tblfm
  • name, caption, attr_html
  • comment, comment_block
  • latex_macro
  • line_tokens (inline tokenization of text lines)
  • tokenize_inline_org_markup(text) -> list[(type, text)]: minimal inline tokenizer supporting:
    • plaintext
    • bold_text
    • italic_text
    • code
    • link (combined url/desc)
    • math_inline (for \(..\) and $..$)
  • parse_src_block_options(arg_string) -> dict[str,str]: parses the arguments of the #+begin_src line:
    • language (e.g. python, bash)
    • header args (e.g. :results, :session, :tangle, :var …)
  • parse_html_attr_args(arg_string) -> dict[str,str]: parses lines like:
    • #+ATTR_HTML: :width 50% :class foo

    into a dictionary that the renderer turns into HTML attributes.
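
To illustrate the two argument parsers described above, here is a minimal sketch. It is not the project's implementation: parse_keyword_args and parse_src_options_sketch are hypothetical names, and the rules (a word starting with ":" opens a key, the following words form its value) are an assumption about how such lines are usually read.

```python
def parse_keyword_args(arg_string: str) -> dict[str, str]:
    """Split ':key value :key2 value2' into a dict; a key without a value maps to ''."""
    result: dict[str, str] = {}
    key = None
    for word in arg_string.split():
        if word.startswith(":") and len(word) > 1:
            key = word[1:]
            result[key] = ""
        elif key is not None:
            # Words after a key are joined into that key's value.
            result[key] = (result[key] + " " + word).strip()
    return result


def parse_src_options_sketch(arg_string: str) -> dict[str, str]:
    """Treat the first word as the language and the rest as header args."""
    stripped = arg_string.strip()
    if stripped.startswith(":"):
        # No language given, only header args.
        language, rest = "", stripped
    else:
        parts = stripped.split(None, 1)
        language = parts[0] if parts else ""
        rest = parts[1] if len(parts) > 1 else ""
    options = {"language": language}
    options.update(parse_keyword_args(rest))
    return options


print(parse_keyword_args(":width 50% :class foo"))
# {'width': '50%', 'class': 'foo'}
```

The same helper serves both cases because #+ATTR_HTML lines and src header args share the ":key value" shape.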

Events (current)

The parser currently emits, among others, the following event types:

  • preamble_kv, preamble_end
  • heading
  • block_begin, block_end
  • src_begin, src_end
  • list_item, ordered_list_item
  • table_row, table_hline, tblfm
  • name, caption, attr_html
  • comment, comment_block
  • latex_macro
  • line_tokens
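
The state + events idea can be shown with a toy line parser. This is a sketch under assumptions, not the code in org_parser.py: EventSketch, StateSketch and parse_line_sketch are hypothetical names, the regexes are simplified, and the "src_line" event type is made up for the illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EventSketch:
    type: str
    data: dict[str, Any] = field(default_factory=dict)


@dataclass
class StateSketch:
    in_src_block: bool = False


HEADING_RE = re.compile(r"^(\*+)\s+(.*)$")
SRC_BEGIN_RE = re.compile(r"^\s*#\+begin_src\b\s*(.*)$", re.IGNORECASE)
SRC_END_RE = re.compile(r"^\s*#\+end_src\s*$", re.IGNORECASE)


def parse_line_sketch(line: str, state: StateSketch) -> tuple[StateSketch, list[EventSketch]]:
    """Consume one line, update the state, and return zero or more events."""
    line = line.rstrip("\n")
    if state.in_src_block:
        # Inside a src block nothing is interpreted except the closing line.
        if SRC_END_RE.match(line):
            state.in_src_block = False
            return state, [EventSketch("src_end")]
        return state, [EventSketch("src_line", {"text": line})]  # hypothetical type
    match = SRC_BEGIN_RE.match(line)
    if match:
        state.in_src_block = True
        return state, [EventSketch("src_begin", {"args": match.group(1)})]
    match = HEADING_RE.match(line)
    if match:
        return state, [EventSketch("heading", {"level": len(match.group(1)), "text": match.group(2)})]
    if not line.strip():
        return state, []  # blank lines produce no event in this sketch
    return state, [EventSketch("line_tokens", {"tokens": [("plaintext", line)]})]


state = StateSketch()
for raw in ["* Title", "#+begin_src python", "* not a heading", "#+end_src"]:
    state, events = parse_line_sketch(raw, state)
    print([event.type for event in events])
```

Because the state travels with every call, the third line is not mistaken for a heading: the parser knows it is still inside the src block.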

org_reader.py

Overview

org_reader.py is responsible for reading Org files with #+INCLUDE support and preamble handling. The result is an iterator over lines (with includes expanded).

Functions

  • un_quote_string(string, cfg) -> str
    • removes quotes based on the rules in cfg.quotes
    • useful for paths/strings from headers or INCLUDE lines
  • resolve_include(line, path, cfg) -> Path
    • evaluates a #+INCLUDE: line
    • resolves the file path relative to path (current file)
    • returns a Path instance
  • is_include(line, cfg) -> bool
    • checks whether a line is an include directive
    • uses regexes from the configuration
  • should_skip_header_line(line, cfg) -> bool
    • decides whether a preamble line should be skipped
    • driven by settings like skip_header_keys
  • preamble_decision(line, cfg) -> (skip: bool, still_in_preamble: bool)
    • central logic for “Are we still in the preamble?”
    • determines:
      • whether the current line is ignored as preamble
      • whether the preamble ends at this line
  • read_with_includes(path, cfg, *, is_root=True) -> Iterator[str]
    • main entry point of the reader
    • reads a file line by line
    • expands #+INCLUDE: directives depth-first
    • respects preamble handling and skip rules

read_with_includes – rules

  • depth-first include expansion
  • no expansion inside:
    • blocks (state.is_inside_block)
    • drawers
    • comment blocks
  • for included files:
    • preamble is skipped until the first “real” content line
    • the preamble_decision logic defines that behavior
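
The first two rules (depth-first expansion, no expansion inside blocks) can be sketched with a recursive generator. This is only an illustration under assumptions: expand_includes_sketch and its regexes are hypothetical, and preamble skipping, drawers, and cycle detection are left out.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterator

# Simplified patterns, assumed for this sketch only.
INCLUDE_RE = re.compile(r'^\s*#\+INCLUDE:\s*"?([^"\s]+)"?', re.IGNORECASE)
BLOCK_BEGIN_RE = re.compile(r"^\s*#\+begin_", re.IGNORECASE)
BLOCK_END_RE = re.compile(r"^\s*#\+end_", re.IGNORECASE)


def expand_includes_sketch(path: Path) -> Iterator[str]:
    """Yield the lines of path lazily, replacing include lines by the included file."""
    inside_block = False
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if BLOCK_BEGIN_RE.match(line):
                inside_block = True
            elif BLOCK_END_RE.match(line):
                inside_block = False
            match = None if inside_block else INCLUDE_RE.match(line)
            if match:
                # Depth-first: finish the included file before reading on.
                yield from expand_includes_sketch(path.parent / match.group(1))
            else:
                yield line


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "root.org"
    (Path(tmp) / "child.org").write_text("** Child\n", encoding="utf-8")
    root.write_text(
        '* Root\n#+INCLUDE: "child.org"\n#+begin_example\n#+INCLUDE: "child.org"\n#+end_example\n',
        encoding="utf-8",
    )
    expanded = list(expand_includes_sketch(root))

print("".join(expanded), end="")
```

The include line inside the example block is passed through unchanged, while the one outside is replaced by the content of child.org.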

org_to_html.py

Overview

org_to_html.py contains the minimal Org→HTML renderer. It consumes OrgEvent streams and produces a complete HTML document.

Main functions

  • render_org_to_html_document(input_path, cfg) -> str
    • reads an Org file (via reader + parser)
    • renders the event stream into an HTML string
    • includes a basic HTML header/body scaffold
  • org_to_html(input_path, output_path, cfg) -> None
    • convenience function
    • calls render_org_to_html_document
    • writes the result to output_path

Rendering features (current)

It currently supports:

  • headings (h1..h6) with tags
  • paragraphs
  • unordered/ordered lists
  • verbatim blocks + src blocks
    • data-language attribute
    • additional data-* attributes from src header args
  • inline markup:
    • bold, italic, code, links
  • image-only lines:
    • rendered as <figure> with caption + ATTR_HTML
  • tables:
    • simple tables + a subset of TBLFM
  • comments + comment blocks:
    • rendered as collapsible sections
  • :noexport: headings:
    • treated like comment sections (collapsible)
  • verse blocks:
    • own container preserving line breaks
  • inline math:
    • $..$ and \(..\) are rendered as SVG images
    • URLs: /math/<digest>.svg

Important internal helpers (“stable API-ish”)

  • render_inline_tokens(tokens, *, preamble_macros="") -> str
    • renders a list of inline tokens to HTML
    • optionally applies LaTeX macros from the preamble
  • flush_paragraph(..., preamble_macros="") -> None
    • writes the currently accumulated paragraph into the output stream
    • ensures clean paragraph separation
  • math_image_url(math_src, *, preamble_macros="") -> str
    • builds the URL/path for the SVG math image
    • delegates to the math renderer / cache
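
As a rough picture of what rendering inline tokens involves, here is a small sketch. It is not the renderer from org_to_html.py: render_tokens_sketch is a hypothetical name, the chosen HTML tags are an assumption, and link and math_inline tokens are left out because their payload format is not documented here.

```python
from html import escape


def render_tokens_sketch(tokens: list[tuple[str, str]]) -> str:
    """Render (type, text) tokens to HTML, escaping the text of every token."""
    parts = []
    for kind, text in tokens:
        safe = escape(text)  # never emit raw user text into the page
        if kind == "bold_text":
            parts.append(f"<b>{safe}</b>")
        elif kind == "italic_text":
            parts.append(f"<i>{safe}</i>")
        elif kind == "code":
            parts.append(f"<code>{safe}</code>")
        else:
            # plaintext and any unknown token type fall back to escaped text
            parts.append(safe)
    return "".join(parts)


print(render_tokens_sketch([("plaintext", "a < b "), ("bold_text", "x"), ("code", "f()")]))
# a &lt; b <b>x</b><code>f()</code>
```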

Math & Web app

Module: math_renderer.py

  • render_math_to_svg(math_src, out_path, *, preamble_macros="") -> None
    • writes a temporary standalone LaTeX file
    • runs latex → DVI → dvisvgm
    • the result is an SVG file at out_path
    • uses a cache so the same formula isn’t rendered repeatedly

Typical usage:

  • called indirectly from the /math/<digest>.svg endpoint
  • digest is based on the math source string + optional macros
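
A digest-keyed cache of this kind could look like the sketch below. The hash algorithm (SHA-256), the way source and macros are combined, and the function names are all assumptions made for illustration; the project may do this differently.

```python
import hashlib
from pathlib import Path


def math_digest_sketch(math_src: str, preamble_macros: str = "") -> str:
    """Derive a stable key from the formula and the macros that affect its rendering."""
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = preamble_macros + "\0" + math_src
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def math_image_url_sketch(math_src: str, preamble_macros: str = "") -> str:
    """Build the URL the HTML output would point at."""
    return f"/math/{math_digest_sketch(math_src, preamble_macros)}.svg"


def cached_svg_path(cache_dir: Path, digest: str) -> Path:
    """Location of the cached SVG; if it exists, no LaTeX run is needed."""
    return cache_dir / f"{digest}.svg"


digest = math_digest_sketch(r"\frac{a}{b}")
print(math_image_url_sketch(r"\frac{a}{b}"))
print(cached_svg_path(Path(".math-cache"), digest).exists())
```

Including the macros in the digest matters: the same formula can render differently under different macro definitions, so it needs a different cache entry.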

Module: webapp.py

  • index()
    • lists README.org and files under org/*.org
    • provides entry points to view Org documents
  • view_file(filename)
    • renders an Org file via reader + parser + renderer into HTML
    • returns the HTML view in the browser
  • assets(subpath)
    • serves static assets under /assets/... (CSS, images, …)
  • math_image(digest)
    • creates (or returns from cache) an SVG image for a math expression
    • uses math_renderer.render_math_to_svg

Important:

  • math cache must be writable: /app/.math-cache
  • in containers, a volume mount with :Z,U is recommended

Permission denied: .math-cache/*.tex

Cause:

  • the container user cannot write into the bind mount

Recommended fix (Podman):

  • mount the volume with :U: ./.math-cache:/app/.math-cache:rw,Z,U

Why not chmod 777?

  • it works
  • but it makes the cache directory world-writable on the host, which is messy and potentially unsafe

curl: “Empty reply” / “unexpected eof”

Typical causes:

  • you call the service via HTTPS, but the container serves HTTP
  • or the other way around

Checklist:

podman logs org-viewer
ss -tlpn | grep 5000
curl http://localhost:5000

TLS disabled (no cert/key found)

Possible reasons:

  • CERT_FILE / KEY_FILE incorrectly set
  • certificate volume not mounted to /certs
  • wrong filenames or ownership inside the container

Debug:

ls -al /certs

What this document tests

  • preamble parsing
  • INCLUDE (depth-first)
  • headings & tags
  • lists
  • inline markup & links
  • images + caption + ATTR_HTML + NAME anchors
  • verbatim and src blocks
  • tables + TBLFM
  • inline math + LaTeX macros

Images / anchors / caption / ATTR_HTML

Rendered example screenshot

Inline math and LaTeX

Example: \(∑superscript\) and $\textsf{\LaTeX}$

Tables + TBLFM

| Name     | German | Math | Average |
|----------+--------+------+---------|
| Student1 |      2 |    3 |         |
| Student2 |      1 |    1 |         |

Note

This is a hobby project, partly created with help from ChatGPT, and not meant for production use. License: GPLv3
