Inspiration

I was working on FinChat, a chatbot for financial annual reports, for a professor. The challenge was managing and processing large, complex financial statements. These statements often span multiple PDFs, and linking data across them to reach accurate conclusions became time-consuming and cumbersome. Key data embedded in images complicated the task further. The problem was clear: how could I automate the extraction of tables, text, and images from these long financial reports, link them together meaningfully, and present the results in a more digestible format?

This led me to think about creating a solution that could not only handle multiple data types (tables, images, text) in a single document but also organize and present them efficiently. From there, I began building Markdrop.

What it does

Markdrop is a Python package designed to process multimodal PDFs, meaning documents that combine images, tables, and text. It converts these documents into structured .md and .html formats, making them easier to navigate and analyze. It extracts tables and images and can generate Excel downloads directly from the HTML files. Markdrop offers two approaches to table extraction: Docling, which balances speed and accuracy, and Table Transformers, which gives more robust, accurate extraction at the cost of a longer processing time. For .md files, Markdrop generates descriptive placeholders for tables and images so that users can understand complex visual elements without opening them. The descriptions come from any of six supported LLM clients, both local and API-based.

Markdrop is designed to be highly extensible, making it applicable to any domain. After installing it from the command line, users can integrate it into their workflows, whether for analysis, research, or document management. In just 2 months, Markdrop has gained more than 8,000 installs, showing its growing popularity and utility.

pip install markdrop

How I built it

The core of Markdrop is built on PyMuPDF, a Python library for PDF processing, and uses each object's xref ID to extract text and images. I used Docling and Table Transformers for table extraction and let users choose between speed and robustness. Integrating LLM clients for generating descriptions meant connecting several APIs, which is what lets Markdrop handle multimodal content effectively. I also created a streamlined workflow for converting PDFs into .html and .md formats, so the system works with both text-based content and more complex elements like tables and images.
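As a rough illustration of how descriptions from an LLM client can be slotted into the .md output, here is a minimal sketch. It is not Markdrop's actual code: the `Element` class, `render_markdown`, and `stub_describer` are hypothetical names, and the describer is an offline stand-in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    """One extracted PDF element (hypothetical structure, not Markdrop's own)."""
    kind: str      # "text", "table", or "image"
    content: str   # raw text, or a file path for tables and images


def render_markdown(elements: List[Element], describe: Callable[[Element], str]) -> str:
    """Join elements into Markdown, swapping tables and images for descriptions."""
    parts = []
    for index, element in enumerate(elements, start=1):
        if element.kind == "text":
            parts.append(element.content)
        else:
            # `describe` stands in for whichever LLM client is configured.
            label = element.kind.capitalize()
            parts.append(f"> **{label} {index}:** {describe(element)}")
    return "\n\n".join(parts)


def stub_describer(element: Element) -> str:
    """Offline stand-in for an LLM call, so the sketch runs without API keys."""
    return f"description of {element.content}"


elements = [
    Element("text", "Revenue grew year over year."),
    Element("table", "tables/income_statement.png"),
    Element("image", "images/segment_chart.png"),
]
markdown = render_markdown(elements, stub_describer)
print(markdown)
```

Passing the describer in as a plain callable keeps the rendering step independent of any one provider, so a local model and an API-based client can be swapped without touching the Markdown logic.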

For the Excel integration, I used libraries like pandas and openpyxl to handle the data, making it easy to offer downloadable tables directly from the .html interface. Additionally, the .md format replaced images and tables with descriptive placeholders, making the content more accessible for researchers and those working with large datasets.
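The step that sits between an HTML table and a spreadsheet is turning the markup back into rows. The sketch below shows one way to do that with only the standard library. The `TableRows` class and `table_to_rows` helper are hypothetical, the sample figures are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: List[str] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_data(self, data):
        # Whitespace between tags is ignored because it falls outside a cell.
        if self._in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr":
            self.rows.append(self._row)


def table_to_rows(html: str) -> List[List[str]]:
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Illustrative sample data only.
sample = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2022</td><td>1,200</td></tr>
  <tr><td>2023</td><td>1,450</td></tr>
</table>
"""
rows = table_to_rows(sample)
print(rows)
```

With pandas and openpyxl installed, the rows can then be written out with `pandas.DataFrame(rows[1:], columns=rows[0]).to_excel("table.xlsx", index=False)`, where the output filename is just an example.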

Challenges I ran into

One of the biggest challenges I faced was handling the multimodal nature of the PDFs. Tables embedded within text and images often presented formatting and extraction issues. Ensuring the accuracy and consistency of table extraction across different types of financial reports took a lot of fine-tuning. Additionally, generating meaningful and accurate descriptions for images and tables via LLMs was a bit tricky; I had to balance description quality with the need for efficiency.

Another hurdle was implementing the Excel download functionality in HTML while maintaining a smooth user experience, ensuring it worked across different browsers and for various document sizes.

Accomplishments that I am proud of

I am particularly proud of the package's ability to generate structured .md files from multimodal PDFs and offer a clear, digestible representation of tables, images, and text. The ability to download tables as Excel files directly from an .html page is a game-changer, especially for data analysts, financial professionals, and researchers. Markdrop enables users to transform raw financial reports into meaningful insights, saving both time and effort. I also received the moderately rare Starstruck badge on GitHub.

Additionally, the system works with various LLM clients to provide customizable descriptions for tables and images, enhancing its utility in different use cases like document hosting, training, and data analysis. Markdrop's flexibility also allows it to be extended to any domain, enabling users across various industries to adapt it to their needs.

What I learned

Through this process, I learned how powerful multimodal content extraction can be when it's done right. Combining table extraction, image handling, and text parsing from PDFs is an intricate task, but it yields valuable results when the right tools and workflows are applied. One key technical insight was the use of the xref ID in documents. This unique identifier links the objects within a document, which makes extracting both text and images faster. One challenge with xref IDs, however, is that images are often represented as text: the image data is encoded in a text format and requires additional parsing and interpretation.
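One practical consequence of xref IDs being unique is that an image reused on many pages, such as a logo, can be extracted once instead of once per page. The sketch below shows that de-duplication under stated assumptions: the `first_page_per_xref` helper and the sample listings are hypothetical, and each image entry is assumed to be a tuple whose first item is the xref, which is the layout PyMuPDF's `page.get_images(full=True)` returns.

```python
from typing import Dict, Iterable, Sequence, Tuple


def first_page_per_xref(pages: Iterable[Sequence[Tuple]]) -> Dict[int, int]:
    """Map each image xref to the first page (0-based) that references it."""
    seen: Dict[int, int] = {}
    for page_number, images in enumerate(pages):
        for image in images:
            xref = image[0]  # assumed layout: xref is the first tuple item
            seen.setdefault(xref, page_number)
    return seen


# Hypothetical per-page image listings; a logo (xref 12) repeats on every page.
pages = [
    [(12, "logo"), (31, "chart")],
    [(12, "logo")],
    [(12, "logo"), (47, "signature")],
]
unique = first_page_per_xref(pages)
print(unique)
```

In a PyMuPDF-based pipeline, each key of the resulting dictionary could then be passed to `doc.extract_image(xref)` a single time to fetch the image bytes.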

Additionally, I gained valuable insights into the workings of LLMs and their integration into projects aimed at improving document accessibility. The project also taught me the importance of user-centric design, as the primary goal is to make data from complex documents easier to manipulate and analyze.

What's next for Markdrop

In the future, I plan to extend Markdrop to support other document types, such as Word and LaTeX, in addition to PDFs. I am also working on improving the formatting of LaTeX formulas in the extracted .md files. Another step will be to explore Adobe's document intelligence and integrate relevant features that can make document processing more precise. For instance, Adobe's AI-powered features in Acrobat and Reader support multiple document types, including PDF, DOCX, PPTX, TXT, and RTF files, and offer capabilities like AI-powered authoring, editing, and formatting. Integrating similar functionality into Markdrop could enhance its utility.

I will also continue to refine the user interface for better interaction, adding features like advanced table filtering and enhanced image recognition, ensuring Markdrop becomes even more powerful and efficient for users in various domains. Peer reviews have highlighted that with these additional functionalities, Markdrop has the potential to become a strong alternative to existing document intelligence solutions.
