"Garbage In, Gold Out" - Real-time data cleaning and enrichment using Confluent and Google Vertex AI. This project was developed specifically for the AI Partner Catalyst: Accelerate Innovation hackathon to demonstrate real-time data cleansing using GenAI.
Click the image above to watch the full demo (2 min).
See how it cleans dirty data streams instantly using Confluent Cloud & Gemini 2.5.
Standardize and fix "dirty" data streams (typos in locations, product names) on the fly using LLMs, transforming a chaotic raw stream into a high-quality analytical dataset without manual intervention.
Real-time pipeline: Confluent Cloud streams → Gemini AI cleaning → Streamlit dashboard.
- Ingestion: Python Producer generates mock transactions with intentional errors.
- Transport: Confluent Cloud (Kafka) streams the raw data to the `raw-data` topic.
- Intelligence: Google Vertex AI (Gemini) processes the JSON, fixes typos, and enriches the data.
- Output: Clean, validated data is produced back to the `clean-data` topic in real time.
- Visualization: A Streamlit dashboard consumes both topics to display a live side-by-side comparison with a history stack.
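The Intelligence step hinges on the prompt sent to Gemini. The helper below is a minimal sketch of how such a prompt could be assembled; the function name, instruction text, and record fields are illustrative assumptions, not the project's actual code.

```python
import json

# Hypothetical cleaning instruction; the real project's prompt may differ.
# It asks the model for strict JSON so the consumer can parse the reply
# and produce it to the clean-data topic.
CLEANING_INSTRUCTIONS = (
    "You are a data-cleaning assistant. Fix typos in the 'location' and "
    "'product' fields of the transaction below, keep all keys unchanged, "
    "and reply with valid JSON only."
)

def build_cleaning_prompt(raw_record: dict) -> str:
    """Wrap one raw transaction in the cleaning instruction."""
    return CLEANING_INSTRUCTIONS + "\n\nTransaction:\n" + json.dumps(raw_record)

# Example dirty record like those the producer might emit (illustrative).
dirty = {"id": 1, "location": "Nwe York", "product": "Labtop"}
prompt = build_cleaning_prompt(dirty)
```

In the consumer, this string would be passed to a Gemini model call (e.g. via the `google-generativeai` SDK) and the model's JSON reply parsed before producing to `clean-data`.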
- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Configure environment: create a `client.properties` file in the root directory. You need Confluent credentials and a Google AI Studio key:

  ```properties
  # Confluent settings
  bootstrap.servers=YOUR_BOOTSTRAP_SERVER
  security.protocol=SASL_SSL
  sasl.mechanisms=PLAIN
  sasl.username=YOUR_KAFKA_API_KEY
  sasl.password=YOUR_KAFKA_API_SECRET

  # Google AI settings
  google.api.key=YOUR_GOOGLE_AI_KEY
  ```
- Run the pipeline:

  Terminal 1 (data source):

  ```
  python producer.py
  ```

  Terminal 2 (AI processor):

  ```
  python consumer.py
  ```

  Terminal 3 (live dashboard):

  ```
  streamlit run app.py
  ```
- Verification: open your browser at `http://localhost:8501`. You will see the "Dirty" stream on the left and the AI-cleaned "Enriched" stream on the right, updating in real time.
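Since `client.properties` holds both Kafka and Google settings, the scripts have to parse it themselves (Confluent's Python client takes a plain configuration dict rather than a Java-style `.properties` file). A minimal parser sketch; the function name and usage lines are assumptions for illustration:

```python
def parse_properties(text: str) -> dict:
    """Parse Java-style .properties content into a flat dict.

    Skips blank lines and '#' comments; splits each line on the first '='.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# Usage (illustrative): read the file, then split Kafka vs Google settings.
# cfg = parse_properties(open("client.properties").read())
# api_key = cfg.pop("google.api.key")       # for the Gemini client
# producer = confluent_kafka.Producer(cfg)  # remaining keys go to Kafka
```

Popping `google.api.key` before constructing the producer matters, because `confluent-kafka` rejects configuration keys it does not recognize.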
This project is open-source and available under the MIT License.

