Continue on empty line #529

Closed
thement wants to merge 1 commit into ggml-org:master from thement:continue-on-empty-line

Conversation

@thement
Contributor

@thement thement commented Mar 26, 2023

Do not insert a "newline" token if the user inputs an empty line. This lets
the user continue the output after being asked by the reverse prompt for
more data. Otherwise an empty-line input would insert a "newline" token,
which would break the flow of the conversation.

@thement thement marked this pull request as ready for review March 26, 2023 14:57
@anzz1 anzz1 self-requested a review March 26, 2023 15:51
Contributor

@anzz1 anzz1 left a comment


This is a good change, however it needs some tweaks to work with instruct mode.
Currently you wrap the suffix "### Response:" inside the buffer.length check but not the prefix "### Instruction:", which breaks the model.
The prefix needs to be wrapped too. Something like this:

Line 437:

                if (params.instruct) {
                    printf("\n> ");
                }

Line 468:

                // Add tokens to embd_inp only if the line is non-empty.
                // This lets the user continue the chat if generation stopped
                // on a reverse prompt.
                if (buffer.length() > 1) {
                    if (params.instruct) {
                        n_consumed = embd_inp.size();
                        embd_inp.insert(embd_inp.end(), inp_pfx.begin(), inp_pfx.end());
                    }

                    auto line_inp = ::llama_tokenize(ctx, buffer, false);
                    embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());

                    if (params.instruct) {
                        embd_inp.insert(embd_inp.end(), inp_sfx.begin(), inp_sfx.end());
                    }

                    n_remain -= line_inp.size();
                }

Also, replace the tabs on L480 with spaces.

Edit:

or actually should it be:

Line 437:

                if (params.instruct) {
                    n_consumed = embd_inp.size();
                    printf("\n> ");
                }

Line 468:

                // Add tokens to embd_inp only if the line is non-empty.
                // This lets the user continue the chat if generation stopped
                // on a reverse prompt.
                if (buffer.length() > 1) {
                    if (params.instruct) {
                        embd_inp.insert(embd_inp.end(), inp_pfx.begin(), inp_pfx.end());
                    }

                    auto line_inp = ::llama_tokenize(ctx, buffer, false);
                    embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());

                    if (params.instruct) {
                        embd_inp.insert(embd_inp.end(), inp_sfx.begin(), inp_sfx.end());
                    }

                    n_remain -= line_inp.size();
                }

I'm not 100% certain which of the two placements is right for the n_consumed part. Should test both.

ping @blackhole89 @ggerganov Requesting backup 😄

@anzz1 anzz1 requested review from blackhole89 and ggerganov March 26, 2023 16:25
@thement
Contributor Author

thement commented Mar 26, 2023

> This is a good change, however it needs some changes to work with the instruct mode.

I wasn't aware of instruct mode. I'll look into it and update the pull request.

@thement thement marked this pull request as draft March 26, 2023 19:18
@rabidcopy
Contributor

rabidcopy commented Mar 26, 2023

> This is a good change, however it needs some changes to work with the instruct mode.

> I wasn't aware of instruct mode. I'll look into it and update the pull request.

Apparently @ggerganov is looking to move instruct mode out of main.cpp into its own thing, according to #508. It may be worth waiting for progress on that, or any code here pertaining to instruct mode may be for naught.

@thement
Contributor Author

thement commented Mar 26, 2023

> This is a good change, however it needs some changes to work with the instruct mode.

> I wasn't aware of instruct mode. I'll look into it and update the pull request.

> Apparently @ggerganov is looking to move instruct mode out of main.cpp and as its own thing according to #508. May be worth waiting for progress on that or any code on this pertaining to instruct mode may be for naught.

Very well. Let's close it for now, then.

@thement thement closed this Mar 26, 2023
@anzz1
Contributor

anzz1 commented Mar 26, 2023

Hmm, the change is so small and the work is already done, so why not just merge it? The only thing we still need to know is whether option 1 or option 2 above is the right placement for n_consumed, and that's why I pinged @blackhole89, since he originally made the instruct option and probably knows the answer on the spot. The work would then transfer over to the instruct implementation when it's moved.

@rabidcopy
Contributor

Ah, sorry. Didn't mean to nudge this down. The above is right. The change is still good as far as I can tell.

@anzz1
Contributor

anzz1 commented Mar 27, 2023

I'll reopen this so that whichever happens first (this gets merged, or instruct mode is moved to another file), it doesn't get forgotten in any case.

@anzz1 anzz1 reopened this Mar 27, 2023
@anzz1 anzz1 added enhancement New feature or request generation quality Quality of model output labels Mar 27, 2023
@anzz1
Contributor

anzz1 commented Mar 27, 2023

Will be implemented in #555, closing.

Merged 7b8dbcb (#571)

@anzz1 anzz1 closed this Mar 27, 2023
SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
* New iq4_kt trellis

The new trellis generates int8_t values via
sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126.
CUDA dequantize works.
AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B.
PPL is on par or even slightly lower than original QTIP trellis.

* Something is not working with the AVX2 dot product

* New iq4_kt: CUDA MMVQ

* New iq4_kt: CUDA MMQ

* For now have only iq4_kt use the new trellis

* Fix iq2_kt that got broken along the way

* New iq4_kt: AVX2 dot product finally works

We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic.
Still somewhat slower than other quants, but no longer pathetic.

* New iq4_kt: fix vanilla AVX2

* New iq4_kt: NEON implementation

We get very respectable PP-512 = 120 t/s.
TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.

* New iq4_kt: slightly faster NEON

* New iq4_kt: slightly faster NEON

* New iq4_kt: faster NEON

We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.

* Minor

* New iq4_kt trellis: not working Metal implementation

* Remove the extra 4 bytes of row meta data that is no longer used

* Cleanup

* Adding forgotten file

* Switching iq2_kt to new trellis - CUDA MMQ

* New iq2_kt: CUDA GEMV

* New iq2_kt: AVX2 dequantize

* New iq2_kt: AVX2 GEMM/GEMV

* Adding forgotten file

* New iq2_kt: NEON GEMM/GEMV

* New iq2_kt: slightly faster NEON GEMM

* New iq2_kt: Metal - very slow.

It seems Apple Silicon cannot quickly add 4 8-bit ints.
Or I don't know how to do it - but I didn't find anything
in the Metal Shading Language Specification.
So, performance is quite a bit worse than the original trellis.

* Add missing break

* Trying @louiehelm's multiplier

* CPU

* iq3_kt: use integer trellis + CUDA dequantize and MMVQ

* iq3_kt: MMQ

* iq3_kt: AVX2 GEMM

* iq3_kt: AVX2 GEMV

* The trellis quants now need super-blocks of 256, so we need a check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
