infill command
Generate an infill completion for the given prefix and suffix texts
Usage
```shell
npx --no node-llama-cpp infill [modelPath]
```
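As a minimal sketch of a typical invocation, the script below assembles a command that passes a prefix and a suffix inline and prints it for review before you run it. The model path and the prefix and suffix texts are hypothetical placeholders; the flags themselves are the ones documented on this page.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical local model path; replace with your own file or a model URI.
MODEL_PATH="./models/example-model.gguf"

# Assemble the command as an array so arguments with spaces stay intact.
infill_cmd=(
  npx --no node-llama-cpp infill
  --model "$MODEL_PATH"
  --prefix "function add(a, b) {"
  --suffix "}"
)

# Print the command, shell-quoted, for review.
printf '%q ' "${infill_cmd[@]}"
echo

# Uncomment to actually run it:
# "${infill_cmd[@]}"
```

Building the command as an array is only a convenience for quoting; typing the same flags directly after `npx --no node-llama-cpp infill` should behave the same way.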
Options
| Option | Description |
|---|---|
| `-m [string]`, `--modelPath [string]`, `--model [string]`, `--path [string]`, `--url [string]`, `--uri [string]` | Model file to use for the infill. Can be a path to a local file or a URI of a model file to download. Leave empty to choose from a list of recommended models (string) |
| `-H [string]`, `--header [string]` | Headers to use when downloading a model from a URL, in the format key: value. You can pass this option multiple times to add multiple headers. (string[]) |
| `--gpu [string]` | Compute layer implementation type to use for llama.cpp. If omitted, uses the latest local build and falls back to "auto" (default: uses the latest local build and falls back to "auto") (string) |
| `-i`, `--systemInfo` | Print llama.cpp system info (default: false) (boolean) |
| `--prefix [string]` | First prefix text to automatically load (string) |
| `--prefixFile [string]` | Path to a file to load prefix text from automatically (string) |
| `--suffix [string]` | First suffix text to automatically load. Requires prefix or prefixFile to be set (string) |
| `--suffixFile [string]` | Path to a file to load suffix text from automatically. Requires prefix or prefixFile to be set (string) |
| `-c <number>`, `--contextSize <number>` | Context size to use for the model context (default: Automatically determined based on the available VRAM) (number) |
| `-b <number>`, `--batchSize <number>` | Batch size to use for the model context (number) |
| `--flashAttention`, `--fa` | Enable flash attention (default: false) (boolean) |
| `--swaFullCache`, `--noSwa` | Disable SWA (Sliding Window Attention) on supported models (default: false) (boolean) |
| `--threads <number>` | Number of threads to use for the evaluation of tokens (default: Number of cores that are useful for math on the current machine) (number) |
| `-t <number>`, `--temperature <number>` | Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The suggested temperature is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run. Set to 0 to disable. (default: 0) (number) |
| `--minP <number>`, `--mp <number>` | From the next token candidates, discard the percentage of tokens with the lowest probability. For example, if set to 0.05, 5% of the lowest probability tokens will be discarded. This is useful for generating more high-quality results when using a high temperature. Set to a value between 0 and 1 to enable. Only relevant when temperature is set to a value greater than 0. (default: 0) (number) |
| `-k <number>`, `--topK <number>` | Limits the model to consider only the K most likely next tokens for sampling at each step of sequence generation. An integer number between 1 and the size of the vocabulary. Set to 0 to disable (which uses the full vocabulary). Only relevant when temperature is set to a value greater than 0. (default: 40) (number) |
| `-p <number>`, `--topP <number>` | Dynamically selects the smallest set of tokens whose cumulative probability exceeds the threshold P, and samples the next token only from this set. A float number between 0 and 1. Set to 1 to disable. Only relevant when temperature is set to a value greater than 0. (default: 0.95) (number) |
| `--seed <number>` | Used to control the randomness of the generated text. Only relevant when using temperature. (default: The current epoch time) (number) |
| `--gpuLayers <number>`, `--gl <number>` | Number of layers to store in VRAM (default: Automatically determined based on the available VRAM) (number) |
| `--repeatPenalty <number>`, `--rp <number>` | Prevent the model from repeating the same token too much. Set to 1 to disable. (default: 1.1) (number) |
| `--lastTokensRepeatPenalty <number>`, `--rpn <number>` | Number of recent tokens generated by the model to which repetition penalties are applied (default: 64) (number) |
| `--penalizeRepeatingNewLine`, `--rpnl` | Penalize new line tokens. Set `--no-penalizeRepeatingNewLine` or `--no-rpnl` to disable (default: true) (boolean) |
| `--repeatFrequencyPenalty <number>`, `--rfp <number>` | If a token appears n times in the punishTokens array, lower its probability by n * repeatFrequencyPenalty. Set to a value between 0 and 1 to enable. (number) |
| `--repeatPresencePenalty <number>`, `--rpp <number>` | Lower the probability of all the tokens in the punishTokens array by repeatPresencePenalty. Set to a value between 0 and 1 to enable. (number) |
| `--maxTokens <number>`, `--mt <number>` | Maximum number of tokens to generate in responses. Set to 0 to disable. Set to -1 to use the context size (default: 0) (number) |
| `--tokenPredictionDraftModel [string]`, `--dm [string]`, `--draftModel [string]` | Model file to use for draft sequence token prediction (speculative decoding). Can be a path to a local file or a URI of a model file to download (string) |
| `--tokenPredictionModelContextSize <number>`, `--dc <number>`, `--draftContextSize <number>`, `--draftContext <number>` | Max context size to use for the draft sequence token prediction model context (default: 4096) (number) |
| `-d`, `--debug` | Print llama.cpp info and debug logs (default: false) (boolean) |
| `--numa [string]` | NUMA allocation policy. See the numa option on the getLlama method for more information (default: false) (string) |
| `--meter` | Log how many tokens were used as input and output for each response (default: false) (boolean) |
| `--timing` | Print how long it took to generate each response (default: false) (boolean) |
| `--noMmap` | Disable mmap (memory-mapped file) usage (default: false) (boolean) |
| `--printTimings`, `--pt` | Print llama.cpp's internal timings after each response (default: false) (boolean) |
| `-h`, `--help` | Show help |
| `-v`, `--version` | Show version number |
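The sampling options above only take effect once the temperature is raised above its default of 0. The sketch below shows one way to combine them with file-based prefix and suffix input. The model path, file paths, and chosen values are hypothetical placeholders picked for illustration, not recommendations beyond what the table states.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical paths; replace with your own model and input files.
MODEL_PATH="./models/example-model.gguf"
PREFIX_FILE="./prefix.txt"
SUFFIX_FILE="./suffix.txt"

# Sampling flags. topK, topP, minP and seed only matter when temperature > 0.
sampling_args=(
  --temperature 0.8   # the suggested temperature from the table
  --topK 40           # same as the documented default
  --topP 0.95         # same as the documented default
  --minP 0.05         # illustrative value between 0 and 1
  --seed 1234         # fixed seed, chosen arbitrarily for this example
  --maxTokens 128     # illustrative cap on generated tokens
)

# Full command: model, file-based prefix and suffix, then the sampling flags.
infill_cmd=(
  npx --no node-llama-cpp infill
  --model "$MODEL_PATH"
  --prefixFile "$PREFIX_FILE"
  --suffixFile "$SUFFIX_FILE"
  "${sampling_args[@]}"
)

# Print the command, shell-quoted, for review.
printf '%q ' "${infill_cmd[@]}"
echo

# Uncomment to actually run it:
# "${infill_cmd[@]}"
```

Keeping the sampling flags in their own array makes it easy to swap in a different set of values while leaving the model and input arguments untouched.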