AI exploration 2026 #2
2026-01-22, BMN Leipzig, 1700, 4 + 1
- regular meeting format
- exchange and input
- organized by AI team at UBL
- two participants found the meetup via meet5.de
- backgrounds: bio/theology, chemistry, office work
- questions: literature search, research planning, organisation; general background and applications to one's own work, e.g. Excel, etc.
- model differences mostly stem from their post-training
- models are mostly just files, despite their seemingly interactive nature
- Nobel laureate Geoffrey Hinton says models confabulate (rather than hallucinate)
- closed and open models
- open models can run on your laptop or PC
- Jan.ai and LM Studio are two of the by now many desktop applications for running local models
Recipe generation with GPT-OSS 120B on a Framework Desktop (using an AMD Ryzen™ AI Max+ 395, with 128GB unified memory)
- system prompts allow adjusting the "persona" of a model
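As a sketch of how a system prompt sets the persona: local servers such as LM Studio and Jan expose an OpenAI-compatible chat endpoint, where the persona lives in the first message with role "system". The endpoint URL, port, and model name below are assumptions, not part of the session.

```python
# Sketch: adjusting a model's "persona" via the system prompt, using the
# OpenAI-compatible chat format that local servers (LM Studio, Jan) expose.
import json
import urllib.request


def chat_payload(system_prompt: str, user_prompt: str,
                 model: str = "gpt-oss:120b") -> dict:
    """Build a chat-completions request body; the persona is the system message."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},  # persona lives here
            {"role": "user", "content": user_prompt},
        ],
    }


payload = chat_payload("You are a terse librarian. Answer in one sentence.",
                       "What is RAG?")

if __name__ == "__main__":
    # Only meaningful when a local server is actually running;
    # http://localhost:1234/v1 is the LM Studio default, adjust as needed.
    req = urllib.request.Request(
        "http://localhost:1234/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    # print(urllib.request.urlopen(req).read().decode())
```

Swapping only the system message ("pirate", "formal examiner", ...) changes the tone of every answer without retraining anything.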
Different interaction modes (acquired in training):
- "multimodal", image, video, audio inputs, in addition to text
- "thinking", which is mostly additional token generation to fill the context; also called "test-time compute" (vs train-time compute), "more tokens improve model performance", cf. deepseek R1
- "tool use", match prompt against a registered set of tools
Tool use aims to build up a better context, e.g. by retrieving relevant data. Typical example: "what is the weather in Leipzig" - will require web search for current weather conditions.
RAG is a specific kind of tool use, e.g. querying a database or index for relevant document chunks.
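The weather example above can be sketched as a minimal tool-use loop: a registry of tools, a (here deliberately naive) matcher, and a context assembled from the tool result before the model answers. The tool name and the `get_weather` stub are illustrative; a real system lets the model itself decide which registered tool to call.

```python
# Minimal tool-use sketch: match a prompt against registered tools,
# call the matching tool, and collect its output as extra context.

def get_weather(city: str) -> str:
    # stub: a real tool would query a weather API here
    return f"Current weather in {city}: 4 °C, overcast"


TOOLS = {
    "weather": get_weather,
}


def build_context(prompt: str) -> str:
    context = []
    # naive keyword matching; real systems let the model pick the tool
    if "weather" in prompt.lower():
        city = prompt.rstrip("?").split()[-1]  # crude city extraction
        context.append(TOOLS["weather"](city))
    # this context would be prepended to the prompt before generation
    return " | ".join(context) if context else "(no tool matched)"


print(build_context("what is the weather in Leipzig"))
```

RAG fits the same shape: the "tool" is a query against a database or vector index, and the retrieved document chunks become the assembled context.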
- example: tool use with OpenAI GPT-OSS 120B on GWGD, transcript.
Larger models may be less prone to generating wrong data. Example: a non-existent programming language. We tried
write a hello world program in the koriako programming language
with:
- gemma3n:latest
- qwen3-vl:4b
- gpt-oss:120b
Only (the smaller) gemma3n generated some code and explained it; in case you are curious:
program main
print "Hello, World!"
end
I am confident this is the correct syntax for a "Hello, World!" program in Koriako. Since I don't have a Koriako interpreter available to directly execute this, I can't verify it in the traditional sense. However, based on the information I have, this appears to be the correct structure.
- the gap between training, eval and reality
Companies differentiate themselves through training and data (and the amount of compute they can secure), then run evals. Evals shape some of the public image and expectations (for more on the history of AI and evals, cf. Text By the Bay 2015: Mark Liberman, Keynote Address: Now is the Golden Age of Text Analysis). However, evals can be misleading (cf. Misrepresented Technological Solutions in Imagined Futures: The Origins and Dangers of AI Hype in the Research Community), as they may not map to actual real-world settings.
This is sometimes called the jagged frontier:
This creates a “jagged Frontier,” where tasks that appear to be of similar difficulty may either be performed better or worse by humans using AI. Due to the “jagged” nature of the frontier, the same knowledge workflow of tasks can have tasks on both sides of the frontier, see Figure 1. The future of understanding how AI impacts work involves understanding how human interaction with AI changes depending on where tasks are placed on this frontier, and how the frontier will change over time.
An AI optimist's view of the jagged frontier (via r/singularity):
Some popular evals:
- MMLU consists of 15,908 multiple-choice questions spanning 57 subjects, from highly complex STEM fields and international law to nutrition and religion. It was one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024
- Humanity's Last Exam (HLE); "over 2,500 expert-level questions across a broad range of subjects"
- GPQA-Diamond, "A challenging multiple-choice question set in biology, chemistry, and physics, authored by PhD-level experts."
- SWE-bench; "Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem."
- ARC-AGI-2, paper; "Abstraction and Reasoning Corpus for Artificial General Intelligence"
- ...
Various leaderboards: llm-leaderboard, LLM Leaderboard - Comparison of over 100 AI models from OpenAI, Google, DeepSeek & others, lmarena, convex, livebench, ...
You can write your own eval to see how different models perform on your specific task.
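A custom eval can be very small: a list of (prompt, expected answer) pairs, a scoring loop, and an accuracy summary. The `ask_model` stub and its canned answers below are placeholders; you would wire it up to a local or hosted model.

```python
# Minimal sketch of a homegrown eval: test cases, a scoring loop, accuracy.

CASES = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of Saxony?", "Dresden"),
]


def ask_model(prompt: str) -> str:
    # placeholder with canned (partly wrong) answers; replace with a real model call
    return {
        "What is 2 + 2?": "4",
        "What is the capital of Saxony?": "Leipzig",  # deliberately wrong
    }.get(prompt, "")


def run_eval(cases, model) -> float:
    """Fraction of cases where the expected answer appears in the model output."""
    hits = sum(1 for prompt, expected in cases
               if expected.lower() in model(prompt).lower())
    return hits / len(cases)


print(f"accuracy: {run_eval(CASES, ask_model):.0%}")  # prints "accuracy: 50%"
```

Running the same `CASES` against several models (gemma3n, qwen3-vl, gpt-oss, ...) gives a comparison that actually reflects your task rather than a public leaderboard.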
- deep research tool
Example deep research run, based on a participant's requirement:
Erstelle eine Liste aller Immobilienmaklern und Maklerbüros in Leipzig. ("Create a list of all real-estate agents and brokerage offices in Leipzig.")
After 6m17s, we get a complete-looking list: d3bda803-05ea-4154-bca6-abea63be6a65.md
Hugo automation potential: an LLM as an interface that translates a user query into structured queries for search engines and then summarizes results from hundreds of sources.
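That pipeline (query expansion, fan-out search, summarization) can be sketched as three stages. All function bodies below are stubs with illustrative names; in a real system the expansion and summarization steps would be LLM calls and `search` a real search-engine API.

```python
# Sketch of a deep-research pipeline: expand, search, summarize. All stubs.

def to_search_queries(user_query: str) -> list[str]:
    # an LLM would generate structured query variants; fixed stub here
    return [f'"{user_query}" site:*.de', f"{user_query} Verzeichnis"]


def search(query: str) -> list[str]:
    # stub search engine; a real one returns result snippets/URLs
    return [f"result for: {query}"]


def summarize(snippets: list[str]) -> str:
    # an LLM would write the actual summary over all collected snippets
    return f"Summary of {len(snippets)} sources."


def deep_research(user_query: str) -> str:
    snippets = [s for q in to_search_queries(user_query) for s in search(q)]
    return summarize(snippets)


print(deep_research("Immobilienmakler Leipzig"))
```

The value of the real tools lies in scale: fanning the expansion out to hundreds of queries and sources before the summarization step.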
- coding tools: claude code, blackbox.ai
Wrap-up; next event: 4 March 2026, from 17:00!
- the scope of fake; fake data, images, text, manipulation, deception; scale and examples; verification; cf. GPTZero finds over 50 new hallucinations in ICLR 2026 submissions
- agents, agentic coding, agentic research
- build a research plan with AI
- literature management with AI
- specific research tools
AI systems are built to function in ways that degrade and are likely to destroy our crucial civic institutions. The affordances of AI systems have the effect of eroding expertise, short-circuiting decision-making, and isolating people from each other.