Deepseek R1 tokenization support #159

Merged: pcuenca merged 4 commits into main from jinja-upgrade on Jan 24, 2025

pcuenca (Member) commented Jan 24, 2025

The new test does not pass for some reason. Is there anything you think I'm doing wrong @DePasqualeOrg?

DePasqualeOrg (Contributor) commented Jan 24, 2025

The encoded tokens from the test look like this:

[4913, 70398, 788, 895, 11, 13265, 1313, 788, 330, 19337, 3323, 497, 330, 38460, 788, 830, 11, 330, 75, 13105, 788, 895, 11, 330, 1796, 788, 330, 151646, 497, 330, 15338, 13533, 788, 895, 92, 151644, 74785, 279, 23670, 15473, 4128, 13, 151645]

And the test target looks like this:

[151646, 151644, 74785, 279, 23670, 15473, 4128, 13, 151645]

But even when changing the test target to the actually encoded tokens, it still crashes, so I still need to investigate. In any case, I have already verified that the DeepSeek models work with the latest Jinja.
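A quick way to see how the two sequences differ is to compare them programmatically. The sketch below only uses the token IDs quoted above; the variable names are illustrative and nothing here comes from the test suite itself.

```swift
// Token IDs copied verbatim from the two lists quoted above.
let encoded: [Int] = [
    4913, 70398, 788, 895, 11, 13265, 1313, 788, 330, 19337, 3323, 497, 330,
    38460, 788, 830, 11, 330, 75, 13105, 788, 895, 11, 330, 1796, 788, 330,
    151646, 497, 330, 15338, 13533, 788, 895, 92,
    151644, 74785, 279, 23670, 15473, 4128, 13, 151645,
]
let target: [Int] = [151646, 151644, 74785, 279, 23670, 15473, 4128, 13, 151645]

// Everything in the target after its first ID.
let firstTargetId = target[0]
let expectedTail = Array(target.dropFirst())

// The same number of IDs taken from the end of the encoded sequence.
let actualTail = Array(encoded.suffix(expectedTail.count))
let tailsMatch = (expectedTail == actualTail)

// Whatever precedes that shared tail in the encoded sequence.
let extraPrefix = Array(encoded.dropLast(expectedTail.count))
let prefixContainsFirstTargetId = extraPrefix.contains(firstTargetId)

print("tails match:", tailsMatch)
print("prefix contains first target ID:", prefixContainsFirstTargetId)
```

The comparison suggests that the two sequences agree on everything after the first target ID, and that the encoded output replaces that single ID with a longer run of tokens that happens to include it.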

pcuenca (Member, Author) commented Jan 24, 2025

The test targets were obtained from the Python tokenizer; they correspond to <|begin▁of▁sentence|><|User|>Describe the Swift programming language.<|Assistant|>. The problem here is that the bos_token is passed in the context as a dictionary, not a String. This means that the result from applyChatTemplate won't be correct, as the test shows.

Looking into it.
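As a minimal sketch of the mismatch described above, the helper below accepts a special-token value that is either a plain String or a dictionary and returns the token text in both cases. The function name `specialTokenText(from:)`, the `"content"` and `"__type"` keys, and the sample dictionary are assumptions made for illustration; this is not the code that was merged in this PR.

```swift
/// Returns the literal text of a special token taken from a tokenizer config.
/// Assumption: the value is either a plain String, or a dictionary (a
/// serialized AddedToken) that stores the token text under a "content" key.
func specialTokenText(from value: Any?) -> String? {
    // Case 1: the config stores the token directly as a string.
    if let text = value as? String {
        return text
    }
    // Case 2: the config stores a dictionary; only "content" is read here.
    if let dictionary = value as? [String: Any] {
        return dictionary["content"] as? String
    }
    // Anything else (missing value, number, array) is treated as absent.
    return nil
}

// Hypothetical serialized form, reduced to the two keys this sketch relies on.
let serializedBos: [String: Any] = [
    "__type": "AddedToken",
    "content": "<|begin▁of▁sentence|>",
]

let fromString = specialTokenText(from: "<s>")
let fromDictionary = specialTokenText(from: serializedBos)
print(fromString ?? "nil", fromDictionary ?? "nil")
```

Normalizing the value before it reaches the template context means the template only ever sees a String, whichever form the config uses.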

DePasqualeOrg (Contributor) commented

As you can see here, the prompt from the test is being encoded correctly, and there are no problems interacting with the model, but as soon as you call decode on the encoded tokens, it crashes. I think the problem must be somewhere in swift-transformers. Perhaps it's a text encoding issue. I noticed that spaces are getting encoded as an unusual character.
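One possible explanation for the unusual space character, assuming the tokenizer uses GPT-2 style byte-level BPE (the thread does not confirm this), is that each raw byte is remapped to a printable character before merging. Under that scheme the space byte becomes "Ġ" (U+0120), which is expected behaviour rather than an encoding bug. The sketch below rebuilds that table; the function name is hypothetical.

```swift
/// Builds the GPT-2 style byte-to-character table used by byte-level BPE.
/// Bytes that are already printable keep their own code point; every other
/// byte is assigned the next free code point starting at 256.
func byteToUnicodeTable() -> [UInt8: Character] {
    var printable = Set<Int>()
    printable.formUnion(33...126)   // "!" through "~"
    printable.formUnion(161...172)  // "¡" through "¬"
    printable.formUnion(174...255)  // "®" through "ÿ"

    var table: [UInt8: Character] = [:]
    var nextFreeOffset = 0
    for byte in 0...255 {
        let scalarValue: Int
        if printable.contains(byte) {
            scalarValue = byte
        } else {
            scalarValue = 256 + nextFreeOffset
            nextFreeOffset += 1
        }
        // All values are below 0xD800, so the scalar is always valid.
        table[UInt8(byte)] = Character(Unicode.Scalar(UInt32(scalarValue))!)
    }
    return table
}

let table = byteToUnicodeTable()
let spaceSymbol = table[0x20]!   // the character standing in for a space
print(spaceSymbol)               // prints "Ġ"
```

If this is what is being observed, decoding has to apply the inverse table to turn "Ġ" back into a space, so the mapping itself would not explain the crash.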

@pcuenca, since you're more familiar with the library's internals than me, perhaps you have a better intuition about how to approach the solution. My initial attempts at solutions with Sonnet and the entire library as context were unsuccessful.

Serialized AddedToken class partially supported (in addition to String values)
pcuenca changed the title from "Update Jinja, add Qwen R1 test" to "Qwen R1 tokenization support" on Jan 24, 2025
pcuenca changed the title from "Qwen R1 tokenization support" to "Deepseek R1 tokenization support" on Jan 24, 2025
Package.swift (outdated)
  dependencies: [
      .package(url: "https://github.com/apple/swift-argument-parser.git", from: "1.4.0"),
-     .package(url: "https://github.com/maiqingqiang/Jinja", from: "1.0.6")
+     .package(url: "https://github.com/johnmai-dev/Jinja", from: "1.1.0")
Member Author

I'm going to move this and the chat template test to a new PR, since the rest of the fixes here are more general and unrelated to the new jinja engine.

pcuenca (Member, Author) commented Jan 24, 2025

Merging this. As explained, the Jinja upgrade will follow shortly in a separate PR, since the changes here are general.

pcuenca merged commit 1fab24c into main on Jan 24, 2025
1 check passed
pcuenca deleted the jinja-upgrade branch on January 24, 2025 19:11
pcuenca mentioned this pull request on Jan 26, 2025