Inspiration

In-Line Inspection (ILI) tools are sophisticated devices that travel through pipelines to detect and measure anomalies such as metal loss (corrosion, erosion), dents and deformations, and weld defects. When a pipeline is inspected multiple times, each ILI run produces an independent dataset. These datasets are not automatically aligned: the same physical anomaly may appear at slightly different reported locations in each run due to factors such as odometer drift and differences between tools and runs. Currently, engineers are forced to align datasets by hand, which can take weeks, even months. To help solve this problem, we built PenguinPipe.

What it does

PenguinPipe reads, cleans, and analyzes past ILI datasets and uses machine learning to match anomalies across runs, identify new ones, and predict future behavior, e.g. growth rates and vulnerable zones. It also helps users understand anomaly clustering and interaction zones through visualization maps and severity charts.
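To make the growth-rate idea concrete, here is a minimal sketch of predicting a matched anomaly's future depth with a linear fit. The inspection years, depth values, and column semantics are illustrative assumptions, not PenguinPipe's actual data or model:

```python
# Hypothetical sketch: estimating a corrosion growth rate for one matched
# anomaly across three ILI runs with a simple linear fit. The years and
# depths below are made-up illustrative values.
import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([[2015], [2019], [2023]])   # inspection years
depth_pct = np.array([12.0, 18.0, 27.0])     # metal loss, % of wall thickness

model = LinearRegression().fit(years, depth_pct)
growth_rate = model.coef_[0]                 # % wall thickness per year
predicted_2027 = model.predict([[2027]])[0]  # projected depth at next run

print(f"growth rate: {growth_rate:.2f} %/yr, "
      f"projected 2027 depth: {predicted_2027:.1f} %")
```

A real pipeline model would use more features than time alone, but the same fit-then-extrapolate pattern applies per matched anomaly.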

How we built it

The backend is built in Python, using NumPy and pandas to clean and normalize datasets and establish reference frames for anomaly matching. Matplotlib is used to generate cluster visualizations, while scikit-learn and SciPy power the model training pipeline for predicting future anomalies.
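As an illustration of reference-frame matching, a common approach (a sketch under our own assumptions, not necessarily PenguinPipe's exact logic) is to normalize log distances to a shared frame and then pair each anomaly with its nearest neighbor in the other run, within a tolerance:

```python
# Illustrative sketch: pair anomalies from two ILI runs by nearest log
# distance within a tolerance, after a reference-frame correction.
# Column names and the 0.5 m offset are hypothetical.
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

run_a = pd.DataFrame({"log_dist_m": [100.2, 250.7, 400.1]})
run_b = pd.DataFrame({"log_dist_m": [100.9, 251.0, 399.4, 520.3]})

# Hypothetical reference-frame correction: shift run B by a known offset
run_b["log_dist_m"] -= 0.5

# Nearest-neighbor lookup; misses get dist = inf and idx = len(run_b)
tree = cKDTree(run_b[["log_dist_m"]].to_numpy())
dist, idx = tree.query(run_a[["log_dist_m"]].to_numpy(),
                       distance_upper_bound=1.0)

matched = dist < 1.0   # True where a run-B anomaly lies within 1 m
print(list(zip(run_a.index, idx, matched)))
```

In practice the matching would also weigh clock position, anomaly type, and size, but the normalize-then-nearest-neighbor core is the same.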

The backend API is implemented with FastAPI, Uvicorn, and Pydantic, enabling efficient data validation and high-performance request handling. User-uploaded CSV files are stored in MongoDB, allowing the system to incorporate new datasets and retrain the model in near real time to improve prediction accuracy.
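The kind of row-level validation Pydantic enables can be sketched as follows; the `AnomalyRow` model and its field names are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical sketch of Pydantic validation for an uploaded CSV row.
# The AnomalyRow model and field names are illustrative only.
from pydantic import BaseModel, ValidationError

class AnomalyRow(BaseModel):
    log_distance_m: float   # distance along the pipeline
    depth_pct: float        # metal loss, % of wall thickness
    anomaly_type: str

# Numeric strings and ints from a parsed CSV are coerced to float
row = AnomalyRow(log_distance_m="1203.4", depth_pct=27,
                 anomaly_type="corrosion")
print(row.log_distance_m, row.depth_pct)

# Malformed rows raise ValidationError instead of silently corrupting data
try:
    AnomalyRow(log_distance_m="not-a-number", depth_pct=5,
               anomaly_type="dent")
except ValidationError:
    print("rejected bad row")
```

Validating each row before it reaches MongoDB keeps bad uploads from poisoning later retraining runs.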

The frontend is built with Next.js, TypeScript, and Tailwind CSS. Additionally, we integrate the Gemini and ElevenLabs APIs to power an AI chatbot with natural language understanding and a text-to-speech feature.

Challenges we ran into

The biggest challenge was aligning data from multiple years, especially because of duplicate column names, inconsistent and vaguely defined data fields, and NaN values. Beyond that, this was our first experience with data analytics, machine learning libraries, and integrating a backend API with a frontend application, which made the project challenging yet highly rewarding.
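A minimal sketch of the kind of cleanup this required, with made-up column names: de-duplicating repeated headers so columns can be addressed individually, then coalescing them and filling NaNs:

```python
# Hedged sketch of cleaning a multi-year ILI export with a duplicated
# "depth" header and missing values. Data and names are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame([[10.0, np.nan, "dent"],
                   [np.nan, 22.5, "corrosion"]],
                  columns=["depth", "depth", "type"])  # duplicate header

# Suffix duplicate column names so each can be referenced on its own
seen = {}
new_cols = []
for c in df.columns:
    seen[c] = seen.get(c, 0) + 1
    new_cols.append(c if seen[c] == 1 else f"{c}_{seen[c]}")
df.columns = new_cols

# Coalesce the duplicated depth columns, preferring the first one's value
df["depth"] = df["depth"].fillna(df["depth_2"])
df = df.drop(columns="depth_2")
print(df)
```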

Accomplishments that we're proud of

We built an end-to-end system that cleans, aligns, and analyzes multi-year ILI datasets to identify, cluster, and predict anomalies, successfully handling inconsistent schemas, missing values, and duplicate columns along the way. We are also proud of building a machine learning pipeline that not only identifies and clusters anomalies but also predicts future trends such as growth rates and vulnerable zones. Integrating visualizations like cluster maps and severity charts helped make these insights interpretable and accessible to users. Delivering a full-stack solution with real-time data ingestion, model retraining, and an AI-powered chatbot was a major achievement.

What we learned

This project gave us our first hands-on experience with data analytics, machine learning pipelines, and full-stack API integration. We learned how much trouble bad data can cause, and how to work around messy real-world data, design resilient preprocessing pipelines, and deploy models within a scalable backend connected to a modern frontend.

What's next for PenguinPipe

Next, we plan to continuously improve model accuracy by incorporating additional features and more advanced spatial modeling techniques. We aim to enhance real-time learning by optimizing incremental retraining as new data is uploaded. On the user side, we plan to expand visualizations, add anomaly explainability, and improve deployment scalability to support larger datasets and broader use cases.
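One possible shape for the incremental retraining mentioned above (an assumption about a future direction, not the current implementation) is a scikit-learn estimator updated batch-by-batch with `partial_fit`, so each new upload refines the model without a full refit:

```python
# Possible approach to incremental retraining (illustrative, not the
# project's current code): update an SGDRegressor on each new batch.
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(random_state=0)
rng = np.random.default_rng(0)

for _ in range(200):                       # simulate successive uploads
    X = rng.uniform(0, 10, size=(32, 1))   # e.g. years in service
    y = 2.0 * X[:, 0] + 1.0                # synthetic growth target
    model.partial_fit(X, y)                # update weights on this batch only

pred = float(model.predict([[5.0]])[0])    # should approach 2*5 + 1 = 11
print(pred)
```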
