As discussed in issue https://github.com/unslothai/unsloth/issues/154#issue-2119969174, I am also working with an extended tokenizer to accommodate words of a new language. I merged the Llama 3.2 tokenizer with my own tokenizer, which increased the vocabulary size to 146,452 (the original Llama 3.2 tokenizer has 128,256 entries). I am running continual pretraining and saving a checkpoint every fixed number of steps. I want to finetune each checkpoint further on an instruction dataset to track its performance. However, I cannot load the checkpoints because the vocabulary size of the base model does not match that of the adapter. I read about the suggested solution: merge the adapter into the model and save the result. However, since Unsloth saves the checkpoints automatically during training, I have no chance to merge before saving, and I cannot merge afterwards without first loading the model, which is the step that fails. So, what should I do? Any suggestions are appreciated!
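To make the resize-then-load idea concrete, here is a minimal sketch. It is not confirmed to work with Unsloth's own loader. It assumes the plain `transformers` and `peft` APIs, and it assumes each checkpoint directory holds a PEFT adapter that also saved the resized `embed_tokens` and `lm_head` weights. The function names and all path arguments are hypothetical placeholders.

```python
def needs_resize(embedding_rows: int, tokenizer_size: int) -> bool:
    """Return True when the embedding matrix and the tokenizer disagree on vocabulary size."""
    return embedding_rows != tokenizer_size


def load_checkpoint(base_model_name: str, tokenizer_dir: str, checkpoint_dir: str):
    """Resize the base model to the extended vocabulary, then attach the adapter.

    All three arguments are placeholders for your own model name and directories.
    """
    # Imported lazily so needs_resize() can be used without these packages installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    model = AutoModelForCausalLM.from_pretrained(base_model_name)

    # The base model still has the original number of embedding rows (128,256 here),
    # while the extended tokenizer has more entries (146,452 here).
    rows = model.get_input_embeddings().weight.shape[0]
    if needs_resize(rows, len(tokenizer)):
        # Must happen before the adapter is attached, otherwise the shapes mismatch.
        model.resize_token_embeddings(len(tokenizer))

    # is_trainable=True keeps the adapter weights trainable for further finetuning.
    model = PeftModel.from_pretrained(model, checkpoint_dir, is_trainable=True)
    return model, tokenizer


def merge_and_save(model, tokenizer, output_dir: str) -> None:
    """Fold the adapter into the base weights and save a standalone model."""
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
```

Under these assumptions, `load_checkpoint` would be called once per saved checkpoint, and `merge_and_save` only if a standalone merged model is wanted before instruction finetuning.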