Conversation
tidy tidy

Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: rltakashige <rl.takashige@gmail.com>

ready for review? or not quite yet
Evanev7 left a comment:

exciting!!! these are mostly stylistic changes with one or two minor correctness things we were probably doing wrong before anyway.

!!!!! continuous batching !!!!!
```python
else:
    cache = make_kv_cache(self.model)
```
```python
seed = task_params.seed or 42
```

minor: should use an explicit None check - this will override 0 with 42.
```python
last_tokens = prompt_tokens[-2:]
```
```python
logits_processors: list[Any] = []
```

i believe we have some repetition penalty logits processors? assuming that merged, this should presumably duplicate that logic (or maybe a single make_logits_processors idk)
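For reference, a repetition-penalty logits processor of the kind mentioned above can be sketched framework-free; the function names and the list-of-processors shape are illustrative, not the project's actual API:

```python
def make_repetition_penalty_processor(penalty: float):
    """Return a processor that damps logits of already-seen tokens.

    Uses the common CTRL-style convention: positive logits are divided
    by `penalty`, negative ones multiplied, so for penalty > 1 a
    repeated token always becomes less likely.
    """
    def processor(generated_tokens: list[int], logits: list[float]) -> list[float]:
        out = list(logits)
        for tok in set(generated_tokens):
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
        return out
    return processor

# A single processor list shared by the sampling loop, as the review suggests.
logits_processors = [make_repetition_penalty_processor(1.3)]

logits = [2.6, -1.3, 0.5]
for proc in logits_processors:
    logits = proc([0, 1], logits)  # tokens 0 and 1 were already generated
# token 2 was never generated, so its logit is untouched
```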
```python
max_tokens = task_params.max_output_tokens or MAX_TOKENS
```
```python
uids = self._mlx_gen.insert(
```

can this ever return multiple uids? we should guard that case

reading further this seems to be for multiple insertion - assuming we have no interest in multiple insertion, we should just assert it's a single uid.
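The suggested single-uid guard could look like the helper below; `insert_single` and `FakeGen` are hypothetical illustrations, not the project's code:

```python
def insert_single(gen, tokens: list[int]) -> int:
    """Insert one request and return its uid, asserting the batch
    generator did not fan it out into multiple uids."""
    uids = gen.insert(tokens)
    assert len(uids) == 1, f"expected exactly one uid, got {uids!r}"
    return uids[0]

class FakeGen:
    """Stand-in for the real batch generator, for illustration only."""
    def insert(self, tokens: list[int]) -> list[int]:
        return [7]

uid = insert_single(FakeGen(), [1, 2, 3])
assert uid == 7
```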
```python
return []
```

```python
responses = self._mlx_gen.next()
mx.clear_cache()
```

this clear_cache could use a comment imo
```python
results: list[tuple[int, GenerationResponse]] = []
```

```python
for response in responses:
    if response.uid not in self._active_tasks:
```

feels like an error we should report
```python
] = field(default_factory=dict, init=False)
```

```python
def __post_init__(self) -> None:
    self._mlx_gen = ExoBatchGenerator(
```
```python
while self._queue and len(self._active_tasks) < EXO_MAX_CONCURRENT_REQUESTS:
    task = self._queue.popleft()
    try:
        uid = self._build_generator(task)
```

it is not clear that both _build_generator and _mlx_gen.submit run prefill immediately - i think we should keep behaviour as is but change the interface slightly
```python
self._active_tasks[uid] = (task, queue, output_generator)
```

```python
if not self._mlx_gen.has_work:
    return self._drain_cancellations()
```

similarly, "drain cancellations" imo implies removing a cancellation rather than removing its corresponding task
```python
] = []
for uid, response in results:
    if uid not in self._active_tasks:
        # should we error here?
```

ref comment and review comment above, i think at least a log is due here
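A minimal logging sketch for that unknown-uid path (names illustrative, not the project's actual API):

```python
import logging

logger = logging.getLogger("batch_generator")

def dispatch(results, active_tasks):
    """Route responses to their tasks, logging (rather than silently
    dropping) any uid we no longer track, e.g. a cancelled task."""
    routed = []
    for uid, response in results:
        task = active_tasks.get(uid)
        if task is None:
            logger.warning("dropping response for unknown uid %r", uid)
            continue
        routed.append((task, response))
    return routed

# uid 2 is unknown: it gets logged and skipped instead of vanishing silently.
routed = dispatch([(1, "tok_a"), (2, "tok_b")], {1: "task-1"})
assert routed == [("task-1", "tok_a")]
```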
```python
def build(
    self,
) -> InferenceGenerator:
    import os
```
Motivation
Following the changes made in #1632!
Closes #1020
Changes
Why It Works
Test Plan
Manual Testing
Automated Testing