Simple Chatbot Application Using Python + Google API Key (Gemini)

I still remember the first time I needed a quick, reliable chatbot for a small internal tool. The goal wasn’t a futuristic assistant; it was something humble and practical: answer a question, keep context for a few turns, and do it in a way I could explain to a teammate in ten minutes. If you’re in the same situation, you’re in the right place. I’ll show you how I build a simple chatbot with Python and the Google AI Python SDK (the google-generativeai package), using a Google API key and a Streamlit interface. You’ll walk away with a runnable app, a mental model of what’s happening under the hood, and clear guidance on when this lightweight approach is a good fit and when it isn’t.

The focus here is the simplest practical chatbot: a clean UI, a prompt, a response, and short memory. I’ll keep the architecture intentionally small so you can ship something in an afternoon, then I’ll point out optional upgrades you can add when you need more capability.

What the Google AI Python SDK really gives you

When I say “Google AI Python SDK,” I mean the google-generativeai package. It’s the Python client for accessing Gemini models for text generation and chat. For a basic chatbot, you only need three operations: configure the SDK with your API key, create a model, and open a chat session. Everything else (UI, state, logging, error handling) you control in your Python app.

In practice, the SDK solves two important problems:

Authentication and request signing: you pass your API key once and the client handles the rest.
Chat session orchestration: you can send messages in a conversation and let the model respond without building your own request format.

That’s the core. Everything else you build is app logic.

The minimal architecture: three moving parts

I like to explain a simple chatbot as three parts that move together:

UI layer: a lightweight web form for input and output. Streamlit is perfect for this because it gives you a front end from pure Python.
Conversation state: a small piece of memory, usually stored in Streamlit’s session_state, so messages persist across reruns.
Model interaction: a helper that sends a prompt to Gemini and returns a response.

If any of those pieces get too complex, the chatbot stops being “simple.” So I keep them focused. You’ll see this pattern in the code below.

Setting up your environment (fast and clean)

I prefer a small virtual environment so the demo is clean and reproducible. Here’s what I do on a fresh project:

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install streamlit google-generativeai

That’s it for dependencies. The google-generativeai package is the SDK, and Streamlit is the UI.

Project structure I recommend

For a tiny app, you don’t need a lot of files. I keep this:

chatbot/
app.py
.env

I’ll explain .env in a moment. The app file is the whole project.

Creating and protecting your API key

I never hard-code API keys in code. It’s the easiest way to leak them in logs or commits. Instead, I store the key in an environment variable. In a local project, I usually use a .env file and load it manually.

If you prefer not to add a dependency for dotenv, you can set the key in your shell and keep your app simple:

export GOOGLEAPIKEY="YOURKEYHERE"

Then in Python, read it with os.getenv. This is enough for a tutorial and keeps your secrets out of source control.

A complete, runnable chatbot in one file

Below is the full app. It is small, but it handles all the essential details: API key configuration, model setup, memory, input/output, and a clean UI. I added comments only where the logic might surprise you.

import os
import streamlit as st
import google.generativeai as genai
---- Page setup ----
st.setpageconfig(pagetitle="Simple Gemini Chatbot", pageicon="💬")
st.title("Simple Chatbot")
st.caption("A minimal Streamlit + Gemini chat app")
---- API key ----
apikey = os.getenv("GOOGLEAPI_KEY")
if not api_key:
st.error("Missing GOOGLEAPIKEY environment variable.")
st.stop()
genai.configure(apikey=apikey)
---- Model + chat session ----
Choose a text model that supports chat. The name can change over time,
so use the one available in your account or SDK docs.
model = genai.GenerativeModel("gemini-pro")
if "chat" not in st.session_state:
# Store the chat object in session state so it persists across reruns
st.sessionstate.chat = model.startchat(history=[])
if "messages" not in st.session_state:
st.session_state.messages = []  # list of dicts: {"role": "user"|"bot", "text": "..."}
---- UI: display prior messages ----
for msg in st.session_state.messages:
with st.chat_message("user" if msg["role"] == "user" else "assistant"):
st.markdown(msg["text"])
---- UI: input box ----
userinput = st.chatinput("Ask me anything")
if user_input:
# Add user message to transcript
st.sessionstate.messages.append({"role": "user", "text": userinput})
with st.chat_message("user"):
st.markdown(user_input)
# Send message to Gemini
with st.chat_message("assistant"):
try:
response = st.sessionstate.chat.sendmessage(user_input)
reply_text = response.text if response else "(No response)"
except Exception as exc:
reply_text = f"Error: {exc}"
st.markdown(reply_text)
st.sessionstate.messages.append({"role": "bot", "text": replytext})

That’s a complete app. Run it with:

streamlit run app.py

When I run this, I can chat in the browser and see the conversation persist, even though Streamlit re-runs the script after each input. That’s the magic of session_state.

Why this works: a quick mental model

Streamlit runs the script top to bottom every time you interact with it. If you don’t store state, your chat history disappears. That’s why I stash two things in session_state:

chat: the Gemini chat session object, so the model has context.
messages: the list of visible messages, so the UI can repaint the transcript.

This combination keeps the UI and the model aligned. I’ve seen people keep only one of these and then wonder why the assistant forgets the earlier turns or why the UI resets. Keeping both avoids that confusion.

A small but important update: safety and clarity

When I build something user-facing, I add two tiny layers of safety:

Input size guard: limit how large user input can be.
Error handling: show a friendly message if the API call fails.

Here’s how I do it without complicating the app:

MAX_CHARS = 4000
if user_input:
if len(userinput) > MAXCHARS:
st.warning("Your message is too long. Please shorten it.")
st.stop()

That’s enough for most demo use cases and prevents runaway prompts or accidental pastes of large files.

How I think about prompts in a simple chatbot

A basic chatbot doesn’t need complex prompt engineering, but it does need a baseline instruction. I usually set a short “system-style” instruction and add it to the first message. The SDK can accept a system_instruction in model config, or you can manually include it in your conversation. I keep it simple:

SYSTEM_PROMPT = "You are a helpful assistant. Keep answers concise and practical."
model = genai.GenerativeModel("gemini-pro", systeminstruction=SYSTEMPROMPT)

That single line brings consistency to responses without adding complexity. I avoid long prompt templates at this stage; those can come later when you know your users’ needs.

Traditional vs modern: what changed by 2026

In 2026, the big shift is the ease of model access. Five years ago, I’d often write a lot of HTTP client code, manage retries, and shape prompt payloads by hand. Now, SDKs handle most of that. Here’s a quick comparison of how I think about this today.

Approach

What it looks like

Why I use it —

—

— Traditional (manual HTTP calls)

Custom requests code, explicit JSON payloads, hand-built retry logic

Useful when you need full control or advanced routing Modern (SDK-based)

google-generativeai model objects with startchat and sendmessage

Faster setup, fewer bugs, clear intent, less boilerplate

For a simple chatbot, I choose the SDK almost every time. It saves time and keeps the example clear for readers and teammates.

Common mistakes I see (and how to avoid them)

I’ve reviewed a lot of small chatbot demos, and these mistakes come up again and again:

Putting the API key in the code: It leaks when you share the file. Use environment variables instead.
Recreating the chat session on every turn: The model forgets previous messages. Store it in session_state.
Showing only the model response: Users want the full transcript. Store and render both sides of the conversation.
No error handling: Any network hiccup becomes a blank screen. Catch exceptions and show an error message.
Sending huge prompts: Long inputs slow response and raise costs. Add a length guard.

I recommend fixing these early. Each takes only a few lines, and they save time later.

When this approach is perfect

This lightweight pattern is ideal in specific situations. I use it when:

I need a demo or proof of concept.
I’m building a single-purpose assistant, like a FAQ helper or coding companion.
I want to test a prompt quickly with real users.
I need a small internal tool that doesn’t justify a larger frontend.

If any of those match your goal, the code above is enough. It’s fast, clear, and easy to share.

When this approach is not enough

There are also times I skip this pattern entirely. You should consider a larger architecture when:

You need long-term memory across many sessions.
You have compliance requirements that require audit trails.
You need user authentication and multi-tenant access.
You must do streaming responses with advanced UI control.
You need tool calls or structured outputs with strict schemas.

In those cases, I move to a proper backend with a database, job queue, and a frontend that can handle streaming updates. I still prototype with Streamlit, but I don’t deploy it directly.

Adding basic persistence (optional upgrade)

If you want your conversation to persist even after a browser refresh, you’ll need a tiny storage layer. For a local demo, I use a JSON file. It isn’t robust, but it works for small tests. Here’s a simple pattern:

import json
from pathlib import Path
STOREPATH = Path("chathistory.json")
Load at startup
if STOREPATH.exists() and "messages" not in st.sessionstate:
st.sessionstate.messages = json.loads(STOREPATH.read_text())
Save after every turn
STOREPATH.writetext(json.dumps(st.session_state.messages, indent=2))

I only recommend this for a quick demo. For real usage, move to a database and handle multiple users properly.

Performance and responsiveness: what to expect

A simple chatbot with a hosted model is usually fast enough for interactive use. In my experience, you’ll see response times that feel snappy for short prompts and short replies. On a typical connection, I often see response times in the 400–1500ms range for short questions, and 1.5–4s for longer or more complex prompts. That’s enough for a conversational experience.

If you need faster results, here are the practical levers:

Keep prompts short and focused.
Ask for concise answers explicitly in the system prompt.
Limit the visible history to the last few messages for the model.

I avoid hard promises about speed, but these general patterns hold in most real-world usage.

A simple memory window (another optional upgrade)

If you ask the model to keep every message forever, the conversation grows and slows down. A simple fix is a rolling window: keep only the last N turns for model context, but keep the full transcript for display. That keeps the experience consistent without large prompts.

Here’s a simple way to do it:

MAX_TURNS = 8  # user+assistant pairs
When sending to the model, use a short history window
recentmessages = st.sessionstate.messages[-(MAX_TURNS * 2):]
Recreate a chat with trimmed history (simple approach)
chat = model.start_chat(history=[])
for msg in recent_messages:
role = "user" if msg["role"] == "user" else "model"
chat.history.append({"role": role, "parts": [msg["text"]]})
response = chat.sendmessage(userinput)

This is a basic pattern and may vary depending on SDK updates. The concept is the key: short memory keeps things fast and cheap.

Streamlit UI touches that matter

Even in a small demo, a few UI choices make the chatbot feel more polished:

A short caption that explains what the bot is for.
Consistent chat bubbles using st.chat_message.
A reset button to clear conversation state.

Here’s how I add a reset button:

if st.sidebar.button("Reset chat"):
st.session_state.messages = []
st.sessionstate.chat = model.startchat(history=[])
st.experimental_rerun()

This is tiny, but users always appreciate a reset action.

Security notes I don’t skip

Even for a toy app, I follow basic rules:

Keep the API key server-side: don’t expose it in frontend code or browser logs.
Avoid logging raw user prompts if there’s sensitive data.
Rotate keys if you ever post code publicly.

If you plan to share your app, store keys in an environment variable or secret manager. It’s one of those habits that saves you later.

Handling errors with a user-friendly message

Most errors in this setup will be network or authentication issues. I handle them gently so the user isn’t confused. Instead of showing a traceback, I show a short error message and a hint.

try:
response = st.sessionstate.chat.sendmessage(user_input)
reply_text = response.text
except Exception:
reply_text = "I had trouble reaching the model. Please try again in a moment."

This keeps the UI clean and avoids revealing details to the user.

Real-world scenarios where this chatbot shines

I’ve used variations of this pattern in several real situations:

Internal tool Q&A: team asks a bot about build rules or dev workflows.
Product discovery: a bot that answers FAQs about a small product line.
Teaching assistant: a bot that helps learners practice Python basics.
Creative helper: a bot that brainstorms copy ideas or names.

Each of these started as a simple Streamlit app with a minimal prompt, and only grew later if needed.

Edge cases worth planning for

Even a simple chatbot can hit tricky scenarios. These are the ones I plan for:

Empty input: ignore it.
Very long text: ask for a shorter version.
Sensitive content: set boundaries.

I’ll go deeper on these below and show a few practical fixes that keep the app robust without adding complexity.

Deep dive: a clearer prompt strategy without overengineering

A common question I get is, “Do I need a system prompt?” Short answer: yes, but keep it short. A single instruction sets the tone and reduces randomness. I treat prompts in three layers, even for a simple app:

System instruction: defines the assistant’s role and tone.
User input: the question or task.
Optional context: only if you have specific references the bot should use.

For a simple chatbot, I keep the third layer empty. Here’s a small variation that works well in practice:

SYSTEM_PROMPT = (
"You are a helpful assistant for a small internal tool. "
"Be concise, practical, and ask a clarifying question when needed."
)
model = genai.GenerativeModel("gemini-pro", systeminstruction=SYSTEMPROMPT)

That line does three things: sets tone, controls length, and gives you a backup behavior for ambiguous input. You don’t need a complex prompt template. You need a stable baseline.

A slightly more complete app (still simple)

If you’re willing to add a few more lines, you can make the chatbot feel more “real” without bloating the code. Here’s a version that includes a reset button, input limit, and safer error messaging while keeping the overall flow identical.

import os
import streamlit as st
import google.generativeai as genai
st.setpageconfig(pagetitle="Simple Gemini Chatbot", pageicon="💬")
st.title("Simple Chatbot")
st.caption("A minimal Streamlit + Gemini chat app")
apikey = os.getenv("GOOGLEAPI_KEY")
if not api_key:
st.error("Missing GOOGLEAPIKEY environment variable.")
st.stop()
genai.configure(apikey=apikey)
SYSTEM_PROMPT = "You are a helpful assistant. Keep answers concise and practical."
model = genai.GenerativeModel("gemini-pro", systeminstruction=SYSTEMPROMPT)
if "chat" not in st.session_state:
st.sessionstate.chat = model.startchat(history=[])
if "messages" not in st.session_state:
st.session_state.messages = []
Sidebar controls
if st.sidebar.button("Reset chat"):
st.session_state.messages = []
st.sessionstate.chat = model.startchat(history=[])
st.experimental_rerun()
Render transcript
for msg in st.session_state.messages:
with st.chat_message("user" if msg["role"] == "user" else "assistant"):
st.markdown(msg["text"])
MAX_CHARS = 4000
userinput = st.chatinput("Ask me anything")
if user_input:
if len(userinput) > MAXCHARS:
st.warning("Your message is too long. Please shorten it.")
st.stop()
st.sessionstate.messages.append({"role": "user", "text": userinput})
with st.chat_message("user"):
st.markdown(user_input)
with st.chat_message("assistant"):
try:
response = st.sessionstate.chat.sendmessage(user_input)
reply_text = response.text if response else "(No response)"
except Exception:
reply_text = "I had trouble reaching the model. Please try again in a moment."
st.markdown(reply_text)
st.sessionstate.messages.append({"role": "bot", "text": replytext})

This is still a single file, still easy to share, and now includes minimal polish that users will notice.

Handling empty input and whitespace-only input

Streamlit’s st.chat_input won’t send an empty string, but users can still send whitespace or paste blank lines. I add a simple guard that ignores whitespace-only input. It keeps your transcript clean and avoids paying for empty prompts.

userinput = st.chatinput("Ask me anything")
if userinput and userinput.strip():
# proceed as usual
pass

It’s a tiny detail, but it prevents weird UX.

A basic “safe mode” for sensitive content

Even a toy chatbot can be used in unexpected ways. If you’re building something public or semi-public, I add a small safety rule to the system prompt. This doesn’t replace formal safety features, but it does shape responses.

SYSTEM_PROMPT = (
"You are a helpful assistant. Keep answers concise and practical. "
"Avoid providing harmful, illegal, or sensitive advice."
)

This isn’t a silver bullet. It’s a first layer that keeps the experience clean for most users.

Limiting model context: the practical budget mindset

When you run a chatbot with a hosted model, you’re trading context length for latency and cost. A small app benefits from a simple rule: keep the last N turns for the model, but keep the full transcript for the user. You already saw the code for this above, but here’s why it matters:

Latency grows with prompt size. Shorter history is usually faster.
Cost often scales with token usage. Smaller prompts mean lower costs.
Signal-to-noise improves when the model sees recent context instead of stale details.

I treat this as a default for any simple chatbot used by more than a handful of people.

Optional upgrade: streaming responses (when you want more “chat feel”)

Even though I consider streaming an “advanced” feature for a small app, it’s a powerful UX upgrade. Users perceive the bot as faster because they see words appearing. If the SDK supports streaming in your environment, the pattern looks like this (simplified):

with st.chat_message("assistant"):
response = st.sessionstate.chat.sendmessage(user_input, stream=True)
chunks = []
placeholder = st.empty()
for chunk in response:
if hasattr(chunk, "text") and chunk.text:
chunks.append(chunk.text)
placeholder.markdown("".join(chunks))

If you choose this upgrade, keep the rest of the app minimal. Streaming is a UX feature, not a reason to redesign the whole architecture.

Optional upgrade: a tiny log for debugging

When I’m demoing a chatbot to a team, I keep a minimal debug log so I can reproduce errors. I don’t log full prompts, just metadata:

import time
log_entry = {
"ts": time.time(),
"promptchars": len(userinput),
"responsechars": len(replytext),
}

I keep this in memory or write to a small local file. The point is to keep visibility without storing sensitive data.

Common pitfalls in detail (and what I do instead)

Here are a few subtle mistakes that show up once you move beyond the first test:

Mixing UI transcript with model history: You’ll see people append assistant replies to the model history as plain text, then also keep a UI transcript. If those diverge, the conversation becomes inconsistent. I keep UI transcript and model history separate but aligned.
Rebuilding the model object each turn: This works, but it’s wasteful. Create the model once at the top, then reuse it.
Assuming the model returns text: Sometimes responses can be empty or structured. I always add a fallback: response.text if response else "(No response)".
Overloading the system prompt: A long system prompt is easy to break. I keep it short and stable.
Adding too many features too soon: The whole point is a simple chatbot. If you add tools, file uploads, and database storage on day one, you lose the point of the exercise.

If you avoid these, your demo will feel steady and professional.

Comparison: simplest chatbot vs a production-ready assistant

I often show this contrast to teams so they don’t accidentally overbuild:

Dimension

Simple Chatbot

Production Assistant —

—

— UI

Streamlit

Dedicated frontend (React, Next.js, etc.) State

In-memory session

Database-backed, multi-user Auth

None

OAuth or SSO Memory

Short rolling window

Summaries + long-term storage Logging

Minimal

Full observability pipeline Safety

System prompt

Policy checks + moderation

This table keeps expectations aligned. A simple chatbot is a prototype, a demo, or a small internal tool. It shouldn’t pretend to be a production assistant.

Practical scenario walkthrough: internal FAQ bot

Let me show how the simple chatbot fits a realistic internal use case. Suppose your team has a list of build rules and quick tips stored in a wiki. You can’t ship a full search system, but you want a conversational helper that answers small questions. Here’s how I’d do it with this lightweight pattern:

Keep the prompt narrow: “You are a helpful assistant for our build rules. Keep answers short.”
Add a short context snippet: paste a few key rules in the system prompt.
Keep memory short: only last 3–5 turns.

That’s it. You don’t need a database. You don’t need custom authentication. It’s enough to get a feel for the user experience and decide whether to build something bigger.

Practical scenario walkthrough: learning buddy for Python basics

This is another common use case. I keep the assistant constrained: short answers, gentle tone, and a suggestion to try a small exercise. The prompt could look like this:

SYSTEM_PROMPT = (
"You are a helpful Python tutor. Explain concepts simply, "
"give short examples, and suggest a tiny practice task."
)

That small change turns a general chatbot into a focused tutor without changing the rest of the code.

Practical scenario walkthrough: copy and naming helper

If you need a quick creative tool for a marketing team, the same app works. You simply change the system prompt and add an output rule. For example:

SYSTEM_PROMPT = (
"You are a creative assistant for product naming. "
"Provide 5 options and a one-line rationale for each."
)

The app stays exactly the same, but the output feels tailored. This is a great example of why a simple chatbot is still valuable.

Edge case: repeated prompt injections

Even in small apps, users can try to “break” the instructions. You can’t fully prevent this, but a short defensive system prompt helps. I also recommend you avoid putting sensitive information in the prompt itself. If you include internal rules, assume they could be revealed.

If you need strong protections, this simple approach is no longer enough. That’s a good sign to move to a more robust system.

Edge case: multi-user collisions

Streamlit’s session_state is scoped per user session, which is good, but if you deploy this somewhere public with a single process, you might still see weirdness when scaling. The simple fix is to run multiple app instances and rely on Streamlit’s session isolation. The robust fix is a proper backend with explicit user IDs and storage.

For a small internal tool, the simple approach is usually fine.

Optional upgrade: an ultra-simple “memory summary”

If your conversation window grows, you can keep it short while preserving meaning by summarizing. Here’s a small pattern I’ve used for demos:

When the conversation hits N turns, ask the model to summarize the chat so far.
Replace the old messages with the summary plus the last few turns.

It’s not perfect, but it’s a useful bridge before building a real memory system.

if len(st.session_state.messages) > 20:
summary_prompt = "Summarize this conversation in 6 bullet points."
summary = st.sessionstate.chat.sendmessage(summary_prompt).text
st.session_state.messages = [
{"role": "bot", "text": "Summary: " + summary}
] + st.session_state.messages[-6:]

That one block keeps the chat usable without major architecture changes.

Alternative approaches (if you want a different UI)

Streamlit is great for speed, but it’s not the only option. Here’s a quick comparison of UI choices:

UI Option

Pros

Cons —

—

— Streamlit

Fastest to build, pure Python

Limited UI customization Flask + HTML

More control, simple deployment

More boilerplate FastAPI + React

Production-ready, scalable

More code and setup

If your goal is to ship in a day, Streamlit wins. If your goal is a polished consumer app, use a dedicated frontend.

Alternative approach: direct REST calls without the SDK

You can skip the SDK and call the API directly using requests. I only do this when I need custom routing or advanced control. For a simple chatbot, it’s extra work. Still, it’s worth knowing what it looks like:

import requests
url = "https://..."
headers = {"Authorization": f"Bearer {api_key}"}
payload = {"contents": [{"role": "user", "parts": [{"text": user_input}]}]}
response = requests.post(url, headers=headers, json=payload)

This gives you flexibility but adds complexity. For a basic chatbot, the SDK is the faster, safer choice.

Production considerations (only if you choose to scale)

If your “simple chatbot” starts to grow, the next steps typically look like this:

Authentication: tie chat sessions to real users.
Persistence: store messages in a database.
Observability: track latency, token usage, and error rates.
Rate limiting: prevent abuse and control costs.
Structured outputs: use JSON or schemas when responses must be consistent.

I treat these as a separate project. They are not small add-ons; they are architectural changes.

Deployment choices for a simple app

If you want to deploy the Streamlit app for a small group, I use one of these patterns:

Local network: run it on a shared machine and restrict access to your LAN.
Lightweight server: deploy to a small VM with a reverse proxy.
Temporary demo: run it on your laptop and share the URL on a call.

For most internal teams, the local-network approach is enough. If it’s a long-term tool, you should think about authentication and logs.

Monitoring and cost awareness

A “simple” chatbot can still generate cost if it’s used heavily. Even without a full monitoring setup, I add two small controls:

Message length limit: stops runaway prompts.
Turn limit: optionally reset the chat after N turns.

This keeps usage predictable and avoids surprises.

Testing the chatbot (quick sanity checks)

I don’t over-test a demo, but I do run a few quick checks:

Can it answer a simple question?
Does the transcript persist across a second input?
Does the reset button clear the conversation?
Does it handle a long message with a warning?
Does it show a graceful error when the API key is missing?

These checks take two minutes and catch the most common issues.

Debugging checklist when things don’t work

If your app fails, here’s the order I check:

Environment variable: is GOOGLEAPIKEY set correctly?
Model name: is the model string valid for your SDK version?
Network: is the server able to reach the API?
Session state: are you accidentally overwriting st.session_state?
Input: are you sending a non-empty prompt?

In most cases, one of these is the issue.

A simple checklist before sharing the app

If you plan to share the app with teammates, I use this quick checklist:

API key is in an environment variable, not in code.
.env is in .gitignore if you use it.
Error handling shows friendly messages.
The reset button works.
A short caption explains what the bot is for.

Small details, big impact.

Summary: the simplest chatbot that still feels solid

A simple chatbot is not a toy if you build it with care. It can be a demo, an internal assistant, or a learning tool. The recipe is straightforward:

Use the Google AI Python SDK for a clean chat interface.
Store a chat object and transcript in Streamlit session_state.
Add minimal safety: prompt length guards, friendly error handling.
Keep prompts short and focused.
Know when to stop adding features and ship.

If you follow this pattern, you’ll have a working chatbot in an afternoon and a foundation you can scale later. That’s the sweet spot I aim for every time.

Next steps (if you want to go further)

If you decide to keep building, here are the natural extensions I explore next:

Add streaming to improve perceived speed.
Add persistence with a lightweight database (SQLite) for multi-session memory.
Add user authentication if you plan to share widely.
Add tool calls if you want the bot to perform actions, not just answer.

But don’t rush. The simple version is powerful, and you’ll learn more by shipping it early than by perfecting it in isolation.