Continue on empty line #529
thement wants to merge 1 commit into ggml-org:master from thement:continue-on-empty-line
Conversation
Do not insert a "newline" token if the user inputs an empty line. This lets the user continue the output after being asked by the reverse prompt for more data. Otherwise an empty-line input would insert a "newline" token, which would break the flow of the conversation.
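The core of the change can be sketched as a single predicate on the input buffer (a minimal sketch, not the actual main.cpp code; `should_feed` is a hypothetical helper name introduced here for illustration — main.cpp stores the typed line with its trailing newline, so an "empty" line has length 1):

```cpp
#include <string>

// Hypothetical helper illustrating the check; main.cpp stores the typed
// line with its trailing '\n', so an empty input line is just "\n".
bool should_feed(const std::string& buffer) {
    // Only tokenize and append the input if the user actually typed
    // something; on an empty line we skip insertion entirely, so
    // generation simply continues past the reverse prompt.
    return buffer.length() > 1;
}
```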
This is a good change; however, it needs some changes to work with the instruct mode.
Currently you wrap the suffix "### Response:" inside the buffer.length check but not the prefix "### Instruction:", which breaks the model.
The prefix needs to be wrapped too. Something like this:
Line 437:
```cpp
if (params.instruct) {
    printf("\n> ");
}
```
Line 468:
```cpp
// Add tokens to buffer only if the line is non-empty.
// This lets the user make the chat continue if it was stopped
// on a reverse prompt.
if (buffer.length() > 1) {
    if (params.instruct) {
        n_consumed = embd_inp.size();
        embd_inp.insert(embd_inp.end(), inp_pfx.begin(), inp_pfx.end());
    }
    auto line_inp = ::llama_tokenize(ctx, buffer, false);
    embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
    if (params.instruct) {
        embd_inp.insert(embd_inp.end(), inp_sfx.begin(), inp_sfx.end());
    }
    n_remain -= line_inp.size();
}
```
Also replace the tabs on L480 with spaces.
or actually, should it be:

Line 437:
```cpp
if (params.instruct) {
    n_consumed = embd_inp.size();
    printf("\n> ");
}
```
Line 468:
```cpp
// Add tokens to buffer only if the line is non-empty.
// This lets the user make the chat continue if it was stopped
// on a reverse prompt.
if (buffer.length() > 1) {
    if (params.instruct) {
        embd_inp.insert(embd_inp.end(), inp_pfx.begin(), inp_pfx.end());
    }
    auto line_inp = ::llama_tokenize(ctx, buffer, false);
    embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
    if (params.instruct) {
        embd_inp.insert(embd_inp.end(), inp_sfx.begin(), inp_sfx.end());
    }
    n_remain -= line_inp.size();
}
```
I'm not 100% certain which of the two placements of n_consumed is right. Should test both.
ping @blackhole89 @ggerganov Requesting backup 😄
I wasn't aware of instruct mode. I'll look into it and update the pull request.
Apparently @ggerganov is looking to move instruct mode out of main.cpp and into its own thing, according to #508. It may be worth waiting for progress on that, or any code here pertaining to instruct mode may be for naught.
Very well. Let's close it for now then.
Hmm, the change is so small and the work is already done, so why not just merge it? The only thing left to settle is whether option 1 or option 2 above for the placement of n_consumed is right, and that's why I pinged @blackhole89, since he originally made the instruct option and probably knows the answer on the spot. The work would then transfer over to the instruct implementation when it's moved.
Ah, sorry. Didn't mean to nudge this down. The above is right. The change is still good as far as I can tell.
I'll reopen this so that whichever happens first (this gets merged, or instruct mode is moved to another file), it doesn't get forgotten in any case.
* New iq4_kt trellis
  The new trellis generates int8_t values via sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126. CUDA dequantize works. AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B. PPL is on par or even slightly lower than the original QTIP trellis.
* Something is not working with the AVX2 dot product
* New iq4_kt: CUDA MMVQ
* New iq4_kt: CUDA MMQ
* For now have only iq4_kt use the new trellis
* Fix iq2_kt that got broken along the way
* New iq4_kt: AVX2 dot product finally works
  We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic. Still somewhat slower than other quants, but no longer pathetic.
* New iq4_kt: fix vanilla AVX2
* New iq4_kt: NEON implementation
  We get very respectable PP-512 = 120 t/s. TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.
* New iq4_kt: slightly faster NEON
* New iq4_kt: slightly faster NEON
* New iq4_kt: faster NEON
  We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.
* Minor
* New iq4_kt trellis: not working Metal implementation
* Remove the extra 4 bytes of row meta data that is no longer used
* Cleanup
* Adding forgotten file
* Switching iq2_kt to new trellis - CUDA MMQ
* New iq2_kt: CUDA GEMV
* New iq2_kt: AVX2 dequantize
* New iq2_kt: AVX2 GEMM/GEMV
* Adding forgotten file
* New iq2_kt: NEON GEMM/GEMV
* New iq2_kt: slightly faster NEON GEMM
* New iq2_kt: Metal - very slow
  It seems Apple Silicon cannot quickly add 4 8-bit ints. Or I don't know how to do it - but I didn't find anything in the Metal Shading Language Specification. So, performance is quite a bit worse than the original trellis.
* Add missing break
* Trying @louiehelm's multiplier
* CPU
* iq3_kt: use integer trellis + CUDA dequantize and MMVQ
* iq3_kt: MMQ
* iq3_kt: AVX2 GEMM
* iq3_kt: AVX2 GEMV
* The trellis quants now need super-blocks of 256, so we need a check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
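The integer trellis described in the first bullet can be sketched as follows (a minimal sketch; the ka and kb multiplier/offset constants below are placeholders for illustration, not the values used in the actual quantization code):

```cpp
#include <cstdint>

// Placeholder constants for illustration only; the real ka/kb values
// live in the quantization code.
static const uint32_t ka = 0xCBAC1FEDu;
static const uint32_t kb = 0x9E3779B9u;

// One trellis value: mask each byte of ka*idx + kb to 6 bits, sum the
// four resulting bytes (each in [0, 63]), and subtract 126 to center
// the result, giving an int8_t in [-126, 126].
static int8_t trellis_value(uint32_t idx) {
    uint32_t x = (ka * idx + kb) & 0x3f3f3f3fu;
    int sum = (x & 0xff) + ((x >> 8) & 0xff) + ((x >> 16) & 0xff) + ((x >> 24) & 0xff);
    return (int8_t)(sum - 126);
}
```

Generating values from an integer recurrence like this lets the dot products run entirely in int8 arithmetic, which is what makes the AVX2/NEON speedups in the list above possible.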
Do not insert a "newline" token if the user inputs an empty line. This
lets the user continue the output after being asked by the reverse
prompt for more data. Otherwise an empty-line input would insert a
"newline" token, which would break the flow of the conversation.