📄 Idea Pitch – BECONEX Challenge

TUM Science Hackathon 2025

Team name: Begoonex
Date: 22 June 2025
Team members: Grazvy Kuras, Houssem Kotti, Baran Kilic, Martin Mirwald, Kenny Nguyen, Hannes Nguyen
GitHub: https://github.com/Grazvy/Beconex-Challange

🔍 Our Approach

Our system combines several robust metrics that indicate where a batch may need to be split into documents with page-level localisation of object-related information, i.e. identifying the pages on which each object's data appears.

Using these insights, we apply predicate logic to determine split points. The goal is to cover most document structures through logic and defined features.

However, since not every batch will offer sufficient information for complete rule-based splitting, we integrate a machine learning model that predicts the most likely remaining splits.

If a split cannot be made confidently using logic and the ML prediction does not meet a defined confidence threshold, the system raises a warning. During manual review, ML predictions can then be confirmed or corrected. The system also detects and highlights likely mistakes, such as typos or missing objects/documents, for manual assessment.
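The escalation order described above (logic first, then ML, then a warning for manual review) can be sketched roughly as follows. The threshold value, the SplitDecision type and the decide_split function are assumptions made for illustration; the pitch does not fix any of them.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value; the pitch only speaks of "a defined confidence threshold".
CONFIDENCE_THRESHOLD = 0.8


@dataclass
class SplitDecision:
    page_index: int
    is_split: bool
    source: str         # "logic" or "ml"
    needs_review: bool  # True means: raise a warning for manual review


def decide_split(page_index: int,
                 logic_result: Optional[bool],
                 ml_probability: float) -> SplitDecision:
    """Decide whether a new document starts at page_index.

    logic_result is True/False when the rules reach a verdict and None
    when they cannot decide. ml_probability is the model's estimate that
    the page starts a new document.
    """
    # 1) A verdict from the rules always wins.
    if logic_result is not None:
        return SplitDecision(page_index, logic_result, "logic", False)

    # 2) Otherwise fall back to the ML prediction.
    is_split = ml_probability >= 0.5
    # Confidence in the predicted class, whichever class that is.
    confidence = max(ml_probability, 1.0 - ml_probability)

    # 3) Low confidence: keep the prediction but flag it for review.
    return SplitDecision(page_index, is_split, "ml",
                         needs_review=confidence < CONFIDENCE_THRESHOLD)
```

For example, decide_split(3, None, 0.6) keeps the ML prediction (split) but flags it for review, while decide_split(3, True, 0.1) follows the rule-based verdict without consulting the model.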

📍 Object Localisation

We apply both custom techniques and established methods like fuzzy search to identify the pages where object-related information originates.
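As a rough illustration of the fuzzy-search part, the sketch below uses Python's standard difflib module to find pages that contain an approximate match for an object string such as an address. The function names, the sliding-window strategy and the 0.85 similarity cut-off are assumptions chosen for the example, not the team's actual implementation.

```python
from difflib import SequenceMatcher


def fuzzy_contains(page_text: str, needle: str, min_ratio: float = 0.85) -> bool:
    """Return True if some window of page_text is similar enough to needle."""
    text = page_text.lower()
    target = needle.lower()
    n = len(target)
    if n == 0 or len(text) < n:
        return False
    # Slide a window of the needle's length over the page text and
    # compare each window with the needle.
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        if SequenceMatcher(None, window, target).ratio() >= min_ratio:
            return True
    return False


def locate_object(pages: list[str], needle: str, min_ratio: float = 0.85) -> list[int]:
    """Return the indices of all pages that approximately contain needle."""
    return [i for i, page in enumerate(pages)
            if fuzzy_contains(page, needle, min_ratio)]
```

With pages such as "Objekt: Hauptstrase 5, 80331 Muenchen" (note the typo) and "Lieferadresse Hauptstrasse 5", a search for "Hauptstrasse 5" would match both. The window scan is quadratic and only meant to show the idea; a dedicated fuzzy-matching library would likely be faster on real batches.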

📐 Metrics

We use several metrics that give structural insight into the document layout:

Page Description
Phrases like "Seite 1 von x" ("page 1 of x") indicate the minimum number of pages a document has.
AGB Detection
Terms and conditions (AGBs) are easy to detect and typically appear at the end of a document.
Identifier Distribution
If, for example, zip codes belonging to different objects are found on adjacent pages, a split between those pages is likely.
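Two of these metrics can be sketched with regular expressions. The patterns and function names below are assumptions for illustration: the page marker pattern only covers the "Seite n von x" wording, and the zip code pattern treats any standalone five-digit number as a German postal code, which would also catch unrelated numbers in real documents.

```python
import re

# Matches page markers such as "Seite 2 von 5".
PAGE_MARKER = re.compile(r"Seite\s+(\d+)\s+von\s+(\d+)", re.IGNORECASE)
# Simplification: any standalone five-digit number counts as a zip code.
ZIP_CODE = re.compile(r"\b\d{5}\b")


def page_marker(page_text):
    """Return (page_number, total_pages) if a marker is found, else None."""
    match = PAGE_MARKER.search(page_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def zip_codes(page_text):
    """Return the set of zip code candidates found on a page."""
    return set(ZIP_CODE.findall(page_text))


def zip_change_candidates(pages):
    """Return indices i where pages i-1 and i both contain zip codes
    but have none in common, hinting at a split before page i."""
    candidates = []
    for i in range(1, len(pages)):
        prev_codes = zip_codes(pages[i - 1])
        curr_codes = zip_codes(pages[i])
        if prev_codes and curr_codes and prev_codes.isdisjoint(curr_codes):
            candidates.append(i)
    return candidates
```

In practice the zip codes would presumably be matched against the known objects of the batch rather than extracted blindly, which would reduce false positives.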

→ We combine the results from object localisation and the metrics to reliably detect document boundaries using predicate logic.
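One way to picture the predicate logic is as a small set of rules over per-page facts, where each rule either reaches a verdict or leaves the question open. The PageFacts fields and the four rules below are illustrative assumptions derived from the metrics listed above, not the team's actual rule set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageFacts:
    page_number: Optional[int] = None   # n from "Seite n von x", if present
    total_pages: Optional[int] = None   # x from "Seite n von x", if present
    is_agb: bool = False                # page looks like terms and conditions
    object_ids: frozenset = frozenset() # identifiers (e.g. zip codes) on the page


def split_before(prev: PageFacts, curr: PageFacts) -> Optional[bool]:
    """Return True if a new document starts at curr, False if curr
    continues prev's document, and None if the rules cannot decide."""
    # Rule 1: a page counter that restarts at 1 starts a new document.
    if curr.page_number == 1:
        return True
    # Rule 2: a consecutive counter with the same total continues the document.
    if (prev.page_number is not None
            and curr.page_number == prev.page_number + 1
            and curr.total_pages == prev.total_pages):
        return False
    # Rule 3: AGBs typically close a document, so a non-AGB page
    # following an AGB page starts a new one.
    if prev.is_agb and not curr.is_agb:
        return True
    # Rule 4: identifiers of different objects on adjacent pages.
    if (prev.object_ids and curr.object_ids
            and prev.object_ids.isdisjoint(curr.object_ids)):
        return True
    # No rule applies: leave the decision to the fallback.
    return None
```

Returning None rather than guessing keeps the rule layer explainable: every True or False can be traced back to exactly one rule.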

🤖 Machine Learning

Where logical inference cannot be applied, we resort to machine learning.
We start with simple models like logistic regression, whose predicted probabilities can serve as confidence scores.

These models aim to predict whether a page is likely a document header, based on features such as:

Presence of both sender and receiver addresses
Large or prominent font usage
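A minimal sketch of such a header classifier, written as plain-Python logistic regression trained by gradient descent, might look like this. The two binary features mirror the list above; the training data is made up purely for illustration, and the learning rate and epoch count are arbitrary assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def header_probability(features, w, b):
    """Probability that a page is a document header."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features)) + b)


# Made-up toy data: (has_both_addresses, has_large_font) -> is_header
X = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (0, 0)]
y = [1, 1, 0, 0, 1, 0]
w, b = train_logistic(X, y)
```

Calling header_probability((1, 1), w, b) then yields the model's confidence that a page with both addresses and a prominent font is a header. In a real setup an established library implementation would likely replace the hand-written training loop.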

✅ Advantage

Our system is logic-first and designed for explainability. Any machine learning model, including LLMs, can be integrated as a fallback.

By relying primarily on rule-based logic, and only falling back to ML/AI where logic fails, we minimize the potential for errors while maintaining flexibility.