Continue on empty line #529
thement wants to merge 1 commit into ggml-org:master from thement:continue-on-empty-line
Conversation
Do not insert a "newline" token if the user inputs an empty line. This lets the user continue the output after being asked by the reverse prompt for more data. Otherwise an empty-line input would insert a "newline" token, which would break the flow of the conversation.
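The core of the change can be sketched as a single predicate on the input buffer (a minimal sketch, not the actual main.cpp code; `should_feed` is a hypothetical helper name introduced here for illustration — main.cpp stores the typed line with its trailing newline, so an "empty" line has length 1):

```cpp
#include <string>

// Hypothetical helper illustrating the check; main.cpp stores the typed
// line with its trailing '\n', so an empty input line is just "\n".
bool should_feed(const std::string& buffer) {
    // Only tokenize and append the input if the user actually typed
    // something; on an empty line we skip insertion entirely, so
    // generation simply continues past the reverse prompt.
    return buffer.length() > 1;
}
```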
This is a good change; however, it needs some changes to work with the instruct mode.
Currently you wrap the suffix "### Response:" inside the buffer.length check but not the prefix "### Instruction:", which breaks the model.
The prefix needs to be wrapped too. Something like this:
Line 437:
```cpp
if (params.instruct) {
    printf("\n> ");
}
```
Line 468:
```cpp
// Add tokens to buffer only if the line is non-empty.
// This lets the user make the chat continue if it was stopped
// on a reverse prompt.
if (buffer.length() > 1) {
    if (params.instruct) {
        n_consumed = embd_inp.size();
        embd_inp.insert(embd_inp.end(), inp_pfx.begin(), inp_pfx.end());
    }
    auto line_inp = ::llama_tokenize(ctx, buffer, false);
    embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
    if (params.instruct) {
        embd_inp.insert(embd_inp.end(), inp_sfx.begin(), inp_sfx.end());
    }
    n_remain -= line_inp.size();
}
```
Also replace the tabs on L480 with spaces.
or actually, should it be:

Line 437:
```cpp
if (params.instruct) {
    n_consumed = embd_inp.size();
    printf("\n> ");
}
```
Line 468:
```cpp
// Add tokens to buffer only if the line is non-empty.
// This lets the user make the chat continue if it was stopped
// on a reverse prompt.
if (buffer.length() > 1) {
    if (params.instruct) {
        embd_inp.insert(embd_inp.end(), inp_pfx.begin(), inp_pfx.end());
    }
    auto line_inp = ::llama_tokenize(ctx, buffer, false);
    embd_inp.insert(embd_inp.end(), line_inp.begin(), line_inp.end());
    if (params.instruct) {
        embd_inp.insert(embd_inp.end(), inp_sfx.begin(), inp_sfx.end());
    }
    n_remain -= line_inp.size();
}
```
I'm not 100% certain which of the two placements of n_consumed is right. Should test both.
ping @blackhole89 @ggerganov Requesting backup 😄
I wasn't aware of instruct mode. I'll look into it and update the pull request.
Apparently @ggerganov is looking to move instruct mode out of main.cpp and into its own thing, according to #508. It may be worth waiting for progress on that, or any code here pertaining to instruct mode may be for naught.
Very well. Let's close it for now then.
Hmm, the change is so small and the work is already done, so why not just merge it? The only thing left to settle is whether option 1 or option 2 above for the placement of n_consumed is right, and that's why I pinged @blackhole89, since he originally made the instruct option and probably knows the answer on the spot. The work would then transfer over to the instruct implementation when it's moved.
Ah, sorry. Didn't mean to nudge this down. The above is right. The change is still good as far as I can tell.
I'll reopen this so that whichever happens first (this gets merged, or instruct mode is moved to another file), it doesn't get forgotten in any case.
* New iq4_kt trellis
  The new trellis generates int8_t values via sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126. CUDA dequantize works. AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B. PPL is on par or even slightly lower than the original QTIP trellis.
* Something is not working with the AVX2 dot product
* New iq4_kt: CUDA MMVQ
* New iq4_kt: CUDA MMQ
* For now have only iq4_kt use the new trellis
* Fix iq2_kt that got broken along the way
* New iq4_kt: AVX2 dot product finally works
  We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic. Still somewhat slower than other quants, but no longer pathetic.
* New iq4_kt: fix vanilla AVX2
* New iq4_kt: NEON implementation
  We get very respectable PP-512 = 120 t/s. TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.
* New iq4_kt: slightly faster NEON
* New iq4_kt: slightly faster NEON
* New iq4_kt: faster NEON
  We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.
* Minor
* New iq4_kt trellis: not working Metal implementation
* Remove the extra 4 bytes of row meta data that is no longer used
* Cleanup
* Adding forgotten file
* Switching iq2_kt to new trellis - CUDA MMQ
* New iq2_kt: CUDA GEMV
* New iq2_kt: AVX2 dequantize
* New iq2_kt: AVX2 GEMM/GEMV
* Adding forgotten file
* New iq2_kt: NEON GEMM/GEMV
* New iq2_kt: slightly faster NEON GEMM
* New iq2_kt: Metal - very slow
  It seems Apple Silicon cannot quickly add 4 8-bit ints. Or I don't know how to do it - but I didn't find anything in the Metal Shading Language Specification. So, performance is quite a bit worse than the original trellis.
* Add missing break
* Trying @louiehelm's multiplier
* CPU
* iq3_kt: use integer trellis + CUDA dequantize and MMVQ
* iq3_kt: MMQ
* iq3_kt: AVX2 GEMM
* iq3_kt: AVX2 GEMV
* The trellis quants now need super-blocks of 256, so we need a check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
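The integer trellis described in the first bullet can be sketched as follows (a minimal sketch; the ka and kb multiplier/offset constants below are placeholders for illustration, not the values used in the actual quantization code):

```cpp
#include <cstdint>

// Placeholder constants for illustration only; the real ka/kb values
// live in the quantization code.
static const uint32_t ka = 0xCBAC1FEDu;
static const uint32_t kb = 0x9E3779B9u;

// One trellis value: mask each byte of ka*idx + kb to 6 bits, sum the
// four resulting bytes (each in [0, 63]), and subtract 126 to center
// the result, giving an int8_t in [-126, 126].
static int8_t trellis_value(uint32_t idx) {
    uint32_t x = (ka * idx + kb) & 0x3f3f3f3fu;
    int sum = (x & 0xff) + ((x >> 8) & 0xff) + ((x >> 16) & 0xff) + ((x >> 24) & 0xff);
    return (int8_t)(sum - 126);
}
```

Generating values from an integer recurrence like this lets the dot products run entirely in int8 arithmetic, which is what makes the AVX2/NEON speedups in the list above possible.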
Do not insert a "newline" token if the user inputs an empty line. This
lets the user continue the output after being asked by the reverse
prompt for more data. Otherwise an empty-line input would insert a
"newline" token, which would break the flow of the conversation.