GPT3 tokenizer #1869

Merged
johnoliver merged 25 commits into microsoft:experimental-java from milderhc:gpt3-tokenizer
Jul 18, 2023

@milderhc (Contributor) commented Jul 6, 2023

Add GPT3 tokenizer.

Based on .NET implementation.

Luigi96 and others added 11 commits June 2, 2023 08:24
### Motivation and Context

### Description
Opening a PR with initial CI changes to build and run tests against Java
packages.

### Contribution Checklist
- [ ] The code builds clean without any errors or warnings
- [ ] The PR follows SK Contribution Guidelines
(https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [ ] The code follows the .NET coding conventions
(https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions)
verified with `dotnet format`
- [ ] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄

---------

Co-authored-by: joe-braley <joebraley@microsoft.com>
Co-authored-by: Luigi96 <luiseduardom@microsoft.com>
Co-authored-by: Mark Wallace <127216156+markwallace-microsoft@users.noreply.github.com>
### Motivation and Context
Complete the implementation of VolatileMemoryStoreTests


### Description
Complete the implementation of VolatileMemoryStoreTests. Make
implementation consistent with tests.

Please note that I added equals and hashCode methods to Embedding,
MemoryRecord, and MemoryRecordMetadata because these unit tests use
assertEquals. Alternatively, I could have added helper methods to
VolatileMemoryStoreTests to check equality. Either approach is fine with me.
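
For readers unfamiliar with why assertEquals needs these methods: without an equals override, two objects compare equal only when they are the same instance, so a record read back from the store would not match the one the test built. Below is a minimal sketch of the idea. `EmbeddingSketch` is a hypothetical, simplified stand-in; the fields and constructors of the real Embedding class in this PR may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical, simplified stand-in for the Embedding class named above.
// The real class's fields and constructors may differ.
final class EmbeddingSketch {
    private final List<Float> vector;

    EmbeddingSketch(List<Float> vector) {
        // Defensive copy so later changes to the caller's list cannot alter equality.
        this.vector = Collections.unmodifiableList(new ArrayList<>(vector));
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof EmbeddingSketch)) return false;
        // Value equality: two embeddings are equal when their vectors match element-wise.
        return vector.equals(((EmbeddingSketch) o).vector);
    }

    @Override
    public int hashCode() {
        // Must be consistent with equals: equal vectors produce equal hash codes.
        return Objects.hash(vector);
    }
}
```

With overrides like these, an assertEquals on two separately constructed instances holding the same values passes. The alternative mentioned above, comparison helpers inside the test class, would keep the production classes unchanged at the cost of repeating field-by-field checks in tests.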

### Contribution Checklist
- [x] The code builds clean without any errors or warnings
- [x] The PR follows SK Contribution Guidelines
(https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [x] ~The code follows the .NET coding conventions
(https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions)
verified with `dotnet format`~ Java code follows AOSP style
- [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone 😄

---------

Co-authored-by: Mark Wallace <127216156+markwallace-microsoft@users.noreply.github.com>
### Motivation and Context

### Description
Add command to PRs to properly format Java code.

### Contribution Checklist
- [ ] The code builds clean without any errors or warnings
- [ ] The PR follows SK Contribution Guidelines
(https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md)
- [ ] The code follows the .NET coding conventions
(https://learn.microsoft.com/dotnet/csharp/fundamentals/coding-style/coding-conventions)
verified with `dotnet format`
- [ ] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄

---------

Co-authored-by: Mark Wallace <127216156+markwallace-microsoft@users.noreply.github.com>
@milderhc requested a review from a team as a code owner July 6, 2023 08:24
@dmytrostruk added the java label (Issue or PR regarding Java code) Jul 6, 2023
@markwallace-microsoft (Member) commented

/spotless

@brunoborges (Member) left a comment

Besides the comments I've added, another one:

What is the purpose of the encoder.json?

@milderhc (Contributor, Author) commented

It maps the tokens to their IDs, which is memory efficient. The original string can be restored afterwards with those IDs.
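
To illustrate the answer above, here is a toy sketch of a token-to-ID mapping and its inverse. Everything in it is an assumption made for illustration: `VocabularySketch` is a hypothetical class, the caller supplies tokens that are already split, and the IDs in the usage note are made up. It does not reproduce how the tokenizer in this PR loads encoder.json or splits text into tokens.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: a hypothetical class, not the tokenizer added in this PR.
// It shows the lookup role of a token-to-ID map and how IDs map back to text.
final class VocabularySketch {
    private final Map<String, Integer> tokenToId = new HashMap<>();
    private final Map<Integer, String> idToToken = new HashMap<>();

    VocabularySketch(Map<String, Integer> encoder) {
        // Build the reverse map once so decoding is a direct lookup.
        for (Map.Entry<String, Integer> e : encoder.entrySet()) {
            tokenToId.put(e.getKey(), e.getValue());
            idToToken.put(e.getValue(), e.getKey());
        }
    }

    // Replace each (already split) token with its integer ID.
    List<Integer> encode(List<String> tokens) {
        List<Integer> ids = new ArrayList<>();
        for (String token : tokens) {
            Integer id = tokenToId.get(token);
            if (id == null) {
                throw new IllegalArgumentException("Unknown token: " + token);
            }
            ids.add(id);
        }
        return ids;
    }

    // Restore the original string by concatenating the tokens behind the IDs.
    String decode(List<Integer> ids) {
        StringBuilder text = new StringBuilder();
        for (Integer id : ids) {
            String token = idToToken.get(id);
            if (token == null) {
                throw new IllegalArgumentException("Unknown id: " + id);
            }
            text.append(token);
        }
        return text.toString();
    }
}
```

For example, with a made-up vocabulary of `"Hello"` mapped to 0 and `" world"` mapped to 1, encoding the two tokens yields the IDs 0 and 1, and decoding those IDs returns `Hello world`.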

johnoliver previously approved these changes Jul 13, 2023

@markwallace-microsoft (Member) left a comment

Just a couple of nits

@johnoliver merged commit e20b6f4 into microsoft:experimental-java Jul 18, 2023
@milderhc deleted the gpt3-tokenizer branch July 18, 2023 21:43
johnoliver added a commit to johnoliver/semantic-kernel that referenced this pull request Jun 5, 2024
Add GPT3 tokenizer.

Based on .NET implementation.

---------

Co-authored-by: Luigi Montoya <yayodelta@gmail.com>
Co-authored-by: joe-braley <joebraley@microsoft.com>
Co-authored-by: Luigi96 <luiseduardom@microsoft.com>
Co-authored-by: Mark Wallace <127216156+markwallace-microsoft@users.noreply.github.com>
Co-authored-by: John Oliver <1615532+johnoliver@users.noreply.github.com>
Co-authored-by: David Grieve <dsgrieve@yahoo.com>
Labels

java Issue or PR regarding Java code

7 participants