Documentation
Initialize the inference session, tokenizer, and runtime environment.
Parameters
- `model_name` (str): Identifier of the model (`llama3-1B/3B/8B-chat`, `gpt-oss-20B`, `qwen3-next-80B`).
- `device` (str): Target device, usually `"cuda:0"`.
- `logging` (bool): If `True`, enables stats logging (layer load time, KV cache save time, etc.).
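For example, a minimal initialization sketch. The class name `Inference` and the import path below are assumptions for illustration; this section documents only the parameters.

```python
# Sketch only: the class name "Inference" and the import path are
# assumptions, since this section documents only the parameters.
from ollm import Inference

# Initialize the session on GPU 0 with stats logging enabled.
o = Inference("llama3-1B-chat", device="cuda:0", logging=True)
```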
Load or download the model weights.
Parameters
- `models_dir` (str): Directory path where models are stored.
- `force_download` (bool): If `True`, forces re-download even if the model exists locally.
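A sketch of a typical call, assuming the loader is exposed as `ini_model` on the session object (the method name is an assumption):

```python
# Hypothetical method name; only the parameters are documented above.
# Downloads the weights on first use, then reuses the local copy.
o.ini_model(models_dir="./models/", force_download=False)
```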
Offload a specified number of transformer layers from disk to CPU memory. This allows layer weights to be loaded from CPU RAM instead of the SSD, significantly improving inference speed. For example, Llama3-8B has 32 layers, each approximately 0.46 GB (~15 GB / 32). We strongly recommend keeping at least 6 GB of RAM free for the operating system and background processes.
Parameters
- `layers_num` (int): Number of layers to keep in CPU memory.
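For instance, keeping 8 of Llama3-8B's 32 layers in RAM costs roughly 8 × 0.46 GB ≈ 3.7 GB, which leaves headroom for the recommended 6 GB of free memory. A sketch (the method name is an assumption):

```python
# Hypothetical method name. Keeping 8 of Llama3-8B's 32 layers in RAM
# costs roughly 8 * 0.46 GB ~= 3.7 GB; keep >= 6 GB free for the OS.
o.offload_layers_to_cpu(layers_num=8)
```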
Create and manage a disk-based key/value cache (an SSD is strongly recommended) for long-context inference.
Parameters
- `cache_dir` (str): Directory to store serialized KV tensors.
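A sketch, assuming the cache is created through a `DiskCache` factory on the session object (the name is an assumption):

```python
# Hypothetical factory name; only cache_dir is documented above.
# KV tensors are serialized to this directory during generation.
past_key_values = o.DiskCache(cache_dir="./kv_cache/")
```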
Run model inference and generate new tokens. Supports KV caching and streaming output.
Parameters
- `input_ids` (Tensor): Encoded input sequence.
- `past_key_values` (DiskCache or None): If `None`, a default KV cache without disk offloading is used.
- `max_new_tokens` (int): Maximum number of new tokens to generate. Default = 500.
- `streamer` (TextStreamer, optional): Streamer for real-time token output.
- `**kwargs`: Additional Hugging Face `generate()` arguments.
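A sketch of an end-to-end generation call, assuming the session exposes Hugging Face-style `tokenizer` and `model` attributes (both attribute names are assumptions; the actual entry point may differ):

```python
# Attribute names (o.tokenizer, o.model) are assumptions of this sketch.
prompt = "List the planets of the solar system."
input_ids = o.tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")

# Generate with the disk-backed KV cache created above.
output = o.model.generate(
    input_ids=input_ids,
    past_key_values=past_key_values,
    max_new_tokens=500,
)
print(o.tokenizer.decode(output[0], skip_special_tokens=True))
```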
Stream generated tokens as they are produced.
Parameters
- `tokenizer`: Hugging Face tokenizer instance.
- `skip_prompt` (bool): If `True`, the original prompt is excluded from the stream.
- `skip_special_tokens` (bool): If `True`, removes special tokens from the streamed output.
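Continuing the sketch above with Hugging Face's `TextStreamer`, which prints tokens to stdout as they are produced (`skip_special_tokens` is forwarded through its decode kwargs):

```python
from transformers import TextStreamer

# Stream tokens to stdout as they arrive; the prompt itself is skipped.
streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=True)
o.model.generate(
    input_ids=input_ids,
    past_key_values=past_key_values,
    max_new_tokens=500,
    streamer=streamer,
)
```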