Since jump forward breaks a single decoding process into multiple ones, the reported numbers of prompt_tokens and completion_tokens are incorrect. Here is an example:
Request:
regex = (r"""\{\n"""
+ r""" "name": "[\w]{1,8}",\n"""
+ r""" "description": "[\w\d\s]{1,64}"\n"""
+ r"""\}"""
)
response = requests.post(
url + "/generate",
json={
"text": "Here is the info of France's capital: ",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 128,
"regex": regex
},
"stream": True,
},
stream=True,
)
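For context, the chunk counts below were read off the stream roughly like this (a minimal sketch; the "data: " SSE framing and the meta_info / prompt_tokens / completion_tokens field names are assumptions about the response schema, inferred from the counts shown):

import json

for line in response.iter_lines():
    # Assumed SSE framing: each event is a "data: <json>" line,
    # terminated by "data: [DONE]".
    if not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    meta = chunk["meta_info"]  # assumed field names, matching the counts below
    print(f"Chunk (prompt {meta['prompt_tokens']}, decode {meta['completion_tokens']}): {chunk['text']}")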
Streaming response by chunk:
Chunk (prompt 10, decode 1): {
Chunk (prompt 15, decode 1): {
"name": "Paris
Chunk (prompt 22, decode 1): {
"name": "Paris",
"description": "Capital
Chunk (prompt 22, decode 2): {
"name": "Paris",
"description": "Capital city
Chunk (prompt 22, decode 10): {
"name": "Paris",
"description": "Capital city of France and one of the most beautiful
Chunk (prompt 37, decode 1): {
"name": "Paris",
"description": "Capital city of France and one of the most beautiful cities in"
}
Non-streaming response:
{'prompt_tokens': 37, 'completion_tokens': 1, 'id': '44f7ddf966de459da6954d0de1e4434d'}
{
"name": "Paris",
"description": "Capital city of France and one of the most beautiful cities in"
}
Note that the correct number of prompt tokens is 10 (the prompt count of the first chunk, which is the only sub-decode that sees the real prompt) and the correct number of completion tokens is 28 (the last chunk processed 37 + 1 = 38 tokens in total, and 38 - 10 = 28 of them were generated or jump-forwarded). We may fix the prompt token count by taking the number from the first chunk, but we probably need to directly look up the length of the final decoding output to fix the completion token count.
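A minimal client-side sketch of that arithmetic, assuming chunks is the list of streamed chunk dicts collected above (same assumed field names):

def corrected_usage(chunks):
    first, last = chunks[0]["meta_info"], chunks[-1]["meta_info"]
    # Only the first sub-decode sees the real prompt.
    prompt_tokens = first["prompt_tokens"]
    # Everything the last sub-decode processed beyond the real prompt was
    # generated or jump-forwarded: 37 + 1 - 10 = 28 in the example above.
    completion_tokens = last["prompt_tokens"] + last["completion_tokens"] - prompt_tokens
    return {"prompt_tokens": prompt_tokens, "completion_tokens": completion_tokens}

A proper server-side fix would instead measure the length of the final decoding output directly, which also stays correct if jump-forward retokenization changes token boundaries.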