Points: 2-3 days
Description: Unify logprobs handling and UsageInfo in v1/chat/completions and v1/completions. Reduce repetitive code and improve reusability and structure.
Deliverables:
Task 1: Token Logprobs Handling
Current logic in adapter.py#L1327-L1368 (sglang/python/sglang/srt/openai_api/adapter.py, lines 1327-1368 at commit ca92911):

```python
logprobs = False
if isinstance(request, list) and request[idx].logprobs:
    logprobs = True
elif (not isinstance(request, list)) and request.logprobs:
    logprobs = True
if logprobs:
    logprobs = to_openai_style_logprobs(
        output_token_logprobs=ret_item["meta_info"]["output_token_logprobs"],
        output_top_logprobs=ret_item["meta_info"].get(
            "output_top_logprobs", None
        ),
    )
    token_logprobs = []
    for token_idx, (token, logprob) in enumerate(
        zip(logprobs.tokens, logprobs.token_logprobs)
    ):
        token_bytes = list(token.encode("utf-8"))
        top_logprobs = []
        if logprobs.top_logprobs:
            for top_token, top_logprob in logprobs.top_logprobs[
                token_idx
            ].items():
                top_token_bytes = list(top_token.encode("utf-8"))
                top_logprobs.append(
                    TopLogprob(
                        token=top_token,
                        bytes=top_token_bytes,
                        logprob=top_logprob,
                    )
                )
        token_logprobs.append(
            ChatCompletionTokenLogprob(
                token=token,
                bytes=token_bytes,
                logprob=logprob,
                top_logprobs=top_logprobs,
            )
        )

    choice_logprobs = ChoiceLogprobs(content=token_logprobs)
else:
    choice_logprobs = None
```
New logic in serving_chat.py (sglang/python/sglang/srt/entrypoints/openai/serving_chat.py, lines 786-794 at commit 70c471a):

```python
def _process_response_logprobs(self, ret_item: Dict[str, Any]) -> ChoiceLogprobs:
    """Process logprobs for non-streaming response"""
    logprobs = to_openai_style_logprobs(
        output_token_logprobs=ret_item["meta_info"]["output_token_logprobs"],
        output_top_logprobs=ret_item["meta_info"].get("output_top_logprobs", None),
    )

    token_logprobs = self._process_logprobs_tokens(logprobs, use_token_index=True)
    return ChoiceLogprobs(content=token_logprobs)
```
For non-streaming responses, the flow first calls `_process_response_logprobs`, which in turn calls `_process_logprobs_tokens`.
Unify Logprobs
The logic itself is sound, but it is entangled with the streaming logprobs path and the completions endpoint:
- Inconsistent Entry Points: Chat has two different methods (`_process_response_logprobs` vs `_process_streaming_logprobs`) for essentially the same work
- Duplicated Logic: Both chat and completions call `to_openai_style_logprobs`
- Mixed Responsibilities: Some methods do conversion + processing, others just processing
- Hard to Test: Complex call chains make unit testing difficult
Design
- Approach: Create a unified `LogProbsProcessor` using a factory pattern to eliminate code duplication and inconsistent APIs.
- New File: sglang/python/sglang/srt/entrypoints/openai/logprobs_processor.py
- High Level Design:
  - serving_chat.py: Replace `_process_streaming_logprobs` and `_process_response_logprobs` with factory calls; remove `_process_logprobs_tokens`
  - serving_completions.py: Replace inline `to_openai_style_logprobs` calls with factory methods
  - utils.py: Deprecate or remove the `to_openai_style_logprobs` function
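As a sketch of the direction, the factory pattern could look like the following. All names here (`LogProbsProcessor`, `from_meta_info`, `from_stream_chunk`, `to_openai_tokens`) and the assumed `(logprob, token_id, token_text)` tuple layout are illustrative assumptions, not the actual implementation:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class TokenLogprob:
    """Simplified stand-in for ChatCompletionTokenLogprob / TopLogprob."""
    token: str
    bytes: List[int]
    logprob: float
    top_logprobs: List["TokenLogprob"] = field(default_factory=list)


class LogProbsProcessor:
    """Hypothetical single entry point shared by chat and completions,
    streaming and non-streaming."""

    def __init__(
        self,
        token_logprobs: List[Tuple[float, int, str]],
        top_logprobs: Optional[List[Dict[str, float]]] = None,
    ):
        self._token_logprobs = token_logprobs  # [(logprob, token_id, token_text), ...]
        self._top_logprobs = top_logprobs      # optional per-token {token: logprob}

    @classmethod
    def from_meta_info(cls, meta_info: Dict[str, Any]) -> "LogProbsProcessor":
        # Factory for non-streaming responses: the one place that reads meta_info
        return cls(
            meta_info["output_token_logprobs"],
            meta_info.get("output_top_logprobs"),
        )

    @classmethod
    def from_stream_chunk(cls, token_logprobs, top_logprobs=None):
        # Factory for streaming chunks: different source, same downstream loop
        return cls(token_logprobs, top_logprobs)

    def to_openai_tokens(self) -> List[TokenLogprob]:
        # The token loop currently duplicated across endpoints, written once
        result = []
        for idx, (logprob, _token_id, token) in enumerate(self._token_logprobs):
            tops = []
            if self._top_logprobs:
                for t, lp in self._top_logprobs[idx].items():
                    tops.append(TokenLogprob(t, list(t.encode("utf-8")), lp))
            result.append(
                TokenLogprob(token, list(token.encode("utf-8")), logprob, tops)
            )
        return result


# Usage: both entry points converge on one factory + one conversion method
meta_info = {
    "output_token_logprobs": [(-0.1, 42, "Hi")],
    "output_top_logprobs": [{"Hi": -0.1, "Hey": -2.3}],
}
tokens = LogProbsProcessor.from_meta_info(meta_info).to_openai_tokens()
```

The point of the two classmethod factories is that `serving_chat.py` and `serving_completions.py` no longer need to know how raw logprobs are stored; they only choose the constructor matching their data source.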
Task 2: UsageInfo
Current Problem
- Code Duplication: `aggregate_token_usage` (utils.py) vs `_calculate_streaming_usage_base` (serving_base.py)
- Different Data Formats: Non-streaming uses response lists, streaming uses token dictionaries
- Similar Logic: Both calculate total tokens with n_choices handling and cache reporting
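To make the format mismatch concrete, a minimal sketch of the two shapes (field names are assumptions based on the meta_info keys shown earlier):

```python
# Non-streaming: usage is derived from a flat list of per-choice responses
responses = [
    {"meta_info": {"prompt_tokens": 5, "completion_tokens": 7}},
    {"meta_info": {"prompt_tokens": 5, "completion_tokens": 3}},
]

# Streaming: usage is accumulated per choice index as chunks arrive
streaming_completion_tokens = {0: 7, 1: 3}

# Both paths ultimately need the same aggregate numbers
total_completion = sum(r["meta_info"]["completion_tokens"] for r in responses)
```

Because both representations reduce to the same totals, a single processor can own the aggregation and accept either shape through dedicated constructors.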
Design Recommendation
- Approach: Create a unified `UsageProcessor` following the same factory pattern as LogProbs.
- New File: sglang/python/sglang/srt/entrypoints/openai/usage_processor.py
- Files to Update:
  - serving_chat.py: Replace `aggregate_token_usage` calls with factory methods
  - serving_completions.py: Replace `aggregate_token_usage` calls with factory methods
  - serving_base.py: Replace `_calculate_streaming_usage_base` with factory calls
  - utils.py: Deprecate the `aggregate_token_usage` function
- Functions to Consolidate:
  - `aggregate_token_usage` (utils.py) → `UsageProcessor.calculate_response_usage`
  - `_calculate_streaming_usage_base` (serving_base.py) → `UsageProcessor.calculate_streaming_usage`
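A minimal sketch of what the unified class could look like. The method names follow the consolidation table above, but the signatures, field names, and bodies are assumptions, not the actual implementation:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class UsageInfo:
    """Simplified stand-in for the OpenAI usage payload."""
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    prompt_tokens_details: Optional[Dict[str, int]] = None


class UsageProcessor:
    """Hypothetical shared factory for both endpoints; one class owns
    n_choices handling and cache reporting."""

    @staticmethod
    def calculate_response_usage(
        responses: List[Dict[str, Any]],
        n_choices: int = 1,
        enable_cache_report: bool = False,
    ) -> UsageInfo:
        # Non-streaming: aggregate over the flat response list. With n > 1
        # choices per request the prompt is shared, so count it once per
        # request (every n-th entry), not once per choice.
        prompt = sum(
            r["meta_info"]["prompt_tokens"] for r in responses[::n_choices]
        )
        completion = sum(r["meta_info"]["completion_tokens"] for r in responses)
        details = None
        if enable_cache_report:
            cached = sum(r["meta_info"].get("cached_tokens", 0) for r in responses)
            details = {"cached_tokens": cached}
        return UsageInfo(prompt, completion, prompt + completion, details)

    @staticmethod
    def calculate_streaming_usage(
        prompt_tokens: Dict[int, int],
        completion_tokens: Dict[int, int],
        n_choices: int = 1,
    ) -> UsageInfo:
        # Streaming: counts are tracked per choice index in dicts as chunks
        # arrive; the shared prompt is again counted once per request.
        prompt = sum(v for i, v in prompt_tokens.items() if i % n_choices == 0)
        completion = sum(completion_tokens.values())
        return UsageInfo(prompt, completion, prompt + completion)


# Both data formats now converge on the same UsageInfo result
responses = [
    {"meta_info": {"prompt_tokens": 5, "completion_tokens": 7}},
    {"meta_info": {"prompt_tokens": 5, "completion_tokens": 3}},
]
usage = UsageProcessor.calculate_response_usage(responses, n_choices=2)
```

With this shape, `serving_base.py` can drop `_calculate_streaming_usage_base` entirely, and the n_choices prompt-deduplication logic exists in exactly one module instead of two.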