Conversation
tidy tidy

Co-authored-by: Ryuichi Leo Takashige <leo@exolabs.net>
Co-authored-by: rltakashige <rl.takashige@gmail.com>

ready for review? or not quite yet
Evanev7 left a comment:

exciting!!! these are mostly stylistic changes with one or two minor correctness things we were probably doing wrong before anyway.

!!!!! continuous batching !!!!!
```python
else:
    cache = make_kv_cache(self.model)
```
```python
seed = task_params.seed or 42
```

minor: should use an explicit None check - this will override 0 with 42.
```python
last_tokens = prompt_tokens[-2:]
```
```python
logits_processors: list[Any] = []
```

i believe we have some repetition penalty logits processors? assuming that merged, this should presumably duplicate that logic (or maybe a single make_logits_processors idk)
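For reference, a repetition-penalty logits processor of the kind mentioned above can be sketched framework-free; the function names and the list-of-processors shape are illustrative, not the project's actual API:

```python
def make_repetition_penalty_processor(penalty: float):
    """Return a processor that damps logits of already-seen tokens.

    Uses the common CTRL-style convention: positive logits are divided
    by `penalty`, negative ones multiplied, so for penalty > 1 a
    repeated token always becomes less likely.
    """
    def processor(generated_tokens: list[int], logits: list[float]) -> list[float]:
        out = list(logits)
        for tok in set(generated_tokens):
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
        return out
    return processor

# A single processor list shared by the sampling loop, as the review suggests.
logits_processors = [make_repetition_penalty_processor(1.3)]

logits = [2.6, -1.3, 0.5]
for proc in logits_processors:
    logits = proc([0, 1], logits)  # tokens 0 and 1 were already generated
# token 2 was never generated, so its logit is untouched
```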
```python
max_tokens = task_params.max_output_tokens or MAX_TOKENS
```
```python
uids = self._mlx_gen.insert(
```

can this ever return multiple uids? we should guard that case

reading further this seems to be for multiple insertion - assuming we have no interest in multiple insertion, we should just assert it's a single uid.
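The suggested single-uid guard could look like the helper below; `insert_single` and `FakeGen` are hypothetical illustrations, not the project's code:

```python
def insert_single(gen, tokens: list[int]) -> int:
    """Insert one request and return its uid, asserting the batch
    generator did not fan it out into multiple uids."""
    uids = gen.insert(tokens)
    assert len(uids) == 1, f"expected exactly one uid, got {uids!r}"
    return uids[0]

class FakeGen:
    """Stand-in for the real batch generator, for illustration only."""
    def insert(self, tokens: list[int]) -> list[int]:
        return [7]

uid = insert_single(FakeGen(), [1, 2, 3])
assert uid == 7
```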
```python
return []
```

```python
responses = self._mlx_gen.next()
mx.clear_cache()
```

this clear_cache could use a comment imo
```python
results: list[tuple[int, GenerationResponse]] = []
```

```python
for response in responses:
    if response.uid not in self._active_tasks:
```

feels like an error we should report
```python
] = field(default_factory=dict, init=False)
```

```python
def __post_init__(self) -> None:
    self._mlx_gen = ExoBatchGenerator(
```
```python
while self._queue and len(self._active_tasks) < EXO_MAX_CONCURRENT_REQUESTS:
    task = self._queue.popleft()
    try:
        uid = self._build_generator(task)
```

it is not clear that both _build_generator and _mlx_gen.submit run prefill immediately - i think we should keep behaviour as is but change the interface slightly
```python
self._active_tasks[uid] = (task, queue, output_generator)
```

```python
if not self._mlx_gen.has_work:
    return self._drain_cancellations()
```

similarly, "drain cancellations" imo implies removing a cancellation rather than removing its corresponding task
```python
] = []
for uid, response in results:
    if uid not in self._active_tasks:
        # should we error here?
```

ref comment and review comment above, i think at least a log is due here
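A minimal logging sketch for that unknown-uid path (names illustrative, not the project's actual API):

```python
import logging

logger = logging.getLogger("batch_generator")

def dispatch(results, active_tasks):
    """Route responses to their tasks, logging (rather than silently
    dropping) any uid we no longer track, e.g. a cancelled task."""
    routed = []
    for uid, response in results:
        task = active_tasks.get(uid)
        if task is None:
            logger.warning("dropping response for unknown uid %r", uid)
            continue
        routed.append((task, response))
    return routed

# uid 2 is unknown: it gets logged and skipped instead of vanishing silently.
routed = dispatch([(1, "tok_a"), (2, "tok_b")], {1: "task-1"})
assert routed == [("task-1", "tok_a")]
```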
```python
def build(
    self,
) -> InferenceGenerator:
    import os
```
Motivation
Following the changes made in #1632!
Closes #1020
Changes
Why It Works
Test Plan
Manual Testing
Automated Testing