GPT3 tokenizer #1869
Merged
johnoliver merged 25 commits into microsoft:experimental-java from milderhc:gpt3-tokenizer on Jul 18, 2023
### Motivation and Context

### Description

Opening a PR with initial CI changes to build and run tests against Java packages.

### Contribution Checklist

- [ ] The code builds clean without any errors or warnings
- [ ] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [ ] The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with `dotnet format`
- [ ] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄

---------

Co-authored-by: joe-braley <joebraley@microsoft.com>
Co-authored-by: Luigi96 <luiseduardom@microsoft.com>
Co-authored-by: Mark Wallace <127216156+markwallace-microsoft@users.noreply.github.com>
### Motivation and Context

### Description

### Contribution Checklist

- [ ] The code builds clean without any errors or warnings
- [ ] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [ ] The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with `dotnet format`
- [ ] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄

Co-authored-by: Mark Wallace <127216156+markwallace-microsoft@users.noreply.github.com>
### Motivation and Context

Complete the implementation of VolatileMemoryStoreTests

### Description

Complete the implementation of VolatileMemoryStoreTests. Make implementation consistent with tests.

Please note that I added equals and hashCode methods to Embedding, MemoryRecord, and MemoryRecordMetadata because these unit tests use assertEquals. Alternatively, I could have created methods in VolatileMemoryStoreTests to check equality. I'm good with either way.

### Contribution Checklist

- [x] The code builds clean without any errors or warnings
- [x] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [x] ~The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with `dotnet format`~ Java code follows AOSP style
- [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone 😄

---------

Co-authored-by: Mark Wallace <127216156+markwallace-microsoft@users.noreply.github.com>
### Motivation and Context

### Description

Add command to PRs to properly format Java code.

### Contribution Checklist

- [ ] The code builds clean without any errors or warnings
- [ ] The PR follows SK Contribution Guidelines (https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [ ] The code follows the .NET coding conventions (https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions) verified with `dotnet format`
- [ ] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄

---------

Co-authored-by: Mark Wallace <127216156+markwallace-microsoft@users.noreply.github.com>
…ntic-kernel into experimental-java
dsgrieve approved these changes on Jul 6, 2023
Resolved review threads:
- .../semantickernel-api/src/main/java/com/microsoft/semantickernel/tokenizers/GPT3Tokenizer.java (6 threads, 1 outdated)
- ...nel-api/src/main/java/com/microsoft/semantickernel/tokenizers/settings/EmbeddedResource.java (1 thread, outdated)
johnoliver reviewed on Jul 7, 2023
Resolved review threads:
- .../semantickernel-api/src/main/java/com/microsoft/semantickernel/tokenizers/GPT3Tokenizer.java (1 thread, outdated)
Member commented:
/spotless
added 5 commits on July 11, 2023 23:21
johnoliver reviewed on Jul 12, 2023
added 3 commits on July 12, 2023 08:15
brunoborges reviewed on Jul 12, 2023
brunoborges (Member) left a comment:
Besides the comments I've added, another one:
What is the purpose of the encoder.json?
Resolved review threads:
- .../semantickernel-gpt3-tokenizer/src/main/java/com/microsoft/semantickernel/GPT3Tokenizer.java (2 threads)
…ntic-kernel into experimental-java
Author (Contributor) replied:
It maps the tokens to their IDs, which is memory efficient. The original string can be restored afterwards with those IDs.
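To illustrate the reply above, here is a minimal sketch of a token-to-ID round trip. It is not the PR's `GPT3Tokenizer` code: the class name `TokenIdSketch`, its methods, and the three-entry vocabulary are hypothetical stand-ins, and the real `encoder.json` is assumed to hold a much larger map of the same shape. The sketch also takes already-split tokens as input, so it does not show how text is split into tokens in the first place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: a hand-made vocabulary, not the contents of encoder.json.
class TokenIdSketch {
    // Token string -> integer ID (the direction the reply describes).
    private static final Map<String, Integer> ENCODER = new HashMap<>();
    // Integer ID -> token string (the reverse map used to restore text).
    private static final Map<Integer, String> DECODER = new HashMap<>();

    static {
        String[] vocabulary = {"Hello", " world", "!"};
        for (int id = 0; id < vocabulary.length; id++) {
            ENCODER.put(vocabulary[id], id);
            DECODER.put(id, vocabulary[id]);
        }
    }

    // Looks up the ID of each token; unknown tokens are rejected.
    static List<Integer> encode(List<String> tokens) {
        List<Integer> ids = new ArrayList<>();
        for (String token : tokens) {
            Integer id = ENCODER.get(token);
            if (id == null) {
                throw new IllegalArgumentException("Token not in vocabulary: " + token);
            }
            ids.add(id);
        }
        return ids;
    }

    // Restores the original string by concatenating the token for each ID.
    static String decode(List<Integer> ids) {
        StringBuilder text = new StringBuilder();
        for (Integer id : ids) {
            String token = DECODER.get(id);
            if (token == null) {
                throw new IllegalArgumentException("Unknown token ID: " + id);
            }
            text.append(token);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<Integer> ids = encode(Arrays.asList("Hello", " world", "!"));
        System.out.println(ids);          // [0, 1, 2]
        System.out.println(decode(ids));  // Hello world!
    }
}
```

Storing a list of small integers instead of the token strings is what makes the representation compact, and the reverse map is what lets the original string be restored from those IDs.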
johnoliver previously approved these changes on Jul 13, 2023
markwallace-microsoft previously approved these changes on Jul 14, 2023
markwallace-microsoft (Member) left a comment:
Just a couple of nits
Resolved review threads:
- ...mple-code/src/main/java/com/microsoft/semantickernel/syntaxexamples/Example29_Tokenizer.java (2 threads)
1d0419f
johnoliver approved these changes on Jul 18, 2023
This was referenced Aug 1, 2023
johnoliver added a commit to johnoliver/semantic-kernel that referenced this pull request on Jun 5, 2024:

Add GPT3 tokenizer. Based on .NET implementation.

---------

Co-authored-by: Luigi Montoya <yayodelta@gmail.com>
Co-authored-by: joe-braley <joebraley@microsoft.com>
Co-authored-by: Luigi96 <luiseduardom@microsoft.com>
Co-authored-by: Mark Wallace <127216156+markwallace-microsoft@users.noreply.github.com>
Co-authored-by: John Oliver <1615532+johnoliver@users.noreply.github.com>
Co-authored-by: David Grieve <dsgrieve@yahoo.com>
Add GPT3 tokenizer.
Based on .NET implementation.