
infill command

Generate an infill completion for the given prefix and suffix texts

Usage

npx --no node-llama-cpp infill [modelPath]
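As an illustration of the usage line above, the following sketch prepares a prefix file and a suffix file and passes them with `--prefixFile` and `--suffixFile`. The file names and their contents are hypothetical, and the `RUN` variable is only a convention of this example: the command is printed for review and runs only when `RUN=1` is set. No model is given here, so the command is expected to offer a list of recommended models, as the `--modelPath` description states.

```shell
#!/bin/sh
set -eu

# Hypothetical example inputs: the model is asked to fill the gap between them
workdir="$(mktemp -d)"
printf 'function add(a, b) {\n    return ' > "$workdir/prefix.txt"
printf ';\n}\n' > "$workdir/suffix.txt"

# Store the command in the positional parameters so it can be printed and run as-is
set -- npx --no node-llama-cpp infill \
    --prefixFile "$workdir/prefix.txt" \
    --suffixFile "$workdir/suffix.txt" \
    --maxTokens 64

# Print the command for review; set RUN=1 to actually execute it
echo "$*"
if [ "${RUN:-0}" = "1" ]; then
    "$@"
fi
```

`--suffixFile` requires `--prefix` or `--prefixFile` to be set, which is why both files are passed together.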

Options

Option Description
-m [string], --modelPath [string], --model [string], --path [string], --url [string], --uri [string] Model file to use for the infill. Can be a path to a local file or a URI of a model file to download. Leave empty to choose from a list of recommended models (string)
-H [string], --header [string] Headers to use when downloading a model from a URL, in the format key: value. You can pass this option multiple times to add multiple headers. (string[])
--gpu [string] Compute layer implementation type to use for llama.cpp (default: uses the latest local build, and falls back to "auto") (string) (choices: auto, metal, cuda, vulkan, false)

-i, --systemInfo Print llama.cpp system info (default: false) (boolean)
--prefix [string] First prefix text to automatically load (string)
--prefixFile [string] Path to a file to load prefix text from automatically (string)
--suffix [string] First suffix text to automatically load. Requires prefix or prefixFile to be set (string)
--suffixFile [string] Path to a file to load suffix text from automatically. Requires prefix or prefixFile to be set (string)
-c <number>, --contextSize <number> Context size to use for the model context (default: Automatically determined based on the available VRAM) (number)
-b <number>, --batchSize <number> Batch size to use for the model context. The default value is the context size (number)
--flashAttention, --fa Enable flash attention (default: false) (boolean)
--threads <number> Number of threads to use for the evaluation of tokens (default: Number of cores that are useful for math on the current machine) (number)
-t <number>, --temperature <number> Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The suggested temperature is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run. Set to 0 to disable. (default: 0) (number)
--minP <number>, --mp <number> From the next token candidates, discard the percentage of tokens with the lowest probability. For example, if set to 0.05, 5% of the lowest probability tokens will be discarded. This is useful for generating more high-quality results when using a high temperature. Set to a value between 0 and 1 to enable. Only relevant when temperature is set to a value greater than 0. (default: 0) (number)
-k <number>, --topK <number> Limits the model to consider only the K most likely next tokens for sampling at each step of sequence generation. An integer number between 1 and the size of the vocabulary. Set to 0 to disable (which uses the full vocabulary). Only relevant when temperature is set to a value greater than 0. (default: 40) (number)
-p <number>, --topP <number> Dynamically selects the smallest set of tokens whose cumulative probability exceeds the threshold P, and samples the next token only from this set. A float number between 0 and 1. Set to 1 to disable. Only relevant when temperature is set to a value greater than 0. (default: 0.95) (number)
--seed <number> Used to control the randomness of the generated text. Only relevant when using temperature. (default: The current epoch time) (number)
--gpuLayers <number>, --gl <number> Number of layers to store in VRAM (default: Automatically determined based on the available VRAM) (number)
--repeatPenalty <number>, --rp <number> Prevent the model from repeating the same token too much. Set to 1 to disable. (default: 1.1) (number)
--lastTokensRepeatPenalty <number>, --rpn <number> Number of most recent tokens generated by the model to apply repetition penalties to (default: 64) (number)
--penalizeRepeatingNewLine, --rpnl Penalize new line tokens. Set --no-penalizeRepeatingNewLine or --no-rpnl to disable (default: true) (boolean)
--repeatFrequencyPenalty <number>, --rfp <number> If a token appears n times in the punishTokens array, lower its probability by n * repeatFrequencyPenalty. Set to a value between 0 and 1 to enable. (number)
--repeatPresencePenalty <number>, --rpp <number> Lower the probability of all the tokens in the punishTokens array by repeatPresencePenalty. Set to a value between 0 and 1 to enable. (number)
--maxTokens <number>, --mt <number> Maximum number of tokens to generate in responses. Set to 0 to disable. Set to -1 to set to the context size (default: 0) (number)
--tokenPredictionDraftModel [string], --dm [string], --draftModel [string] Model file to use for draft sequence token prediction (speculative decoding). Can be a path to a local file or a URI of a model file to download (string)
--tokenPredictionModelContextSize <number>, --dc <number>, --draftContextSize <number>, --draftContext <number> Max context size to use for the draft sequence token prediction model context (default: 4096) (number)
-d, --debug Print llama.cpp info and debug logs (default: false) (boolean)
--meter Log how many tokens were used as input and output for each response (default: false) (boolean)
--timing Print how long it took to generate each response (default: false) (boolean)
--noMmap Disable mmap (memory-mapped file) usage (default: false) (boolean)
--printTimings, --pt Print llama.cpp's internal timings after each response (default: false) (boolean)
-h, --help Show help
-v, --version Show version number
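
The sampling options above can be combined in a single invocation. The sketch below is one possible combination, not a recommended configuration: the model path `./models/my-model.gguf` and the seed value are hypothetical, and the `RUN` variable is only a convention of this example that keeps the command from running until `RUN=1` is set. Because `--topK`, `--topP`, `--minP` and `--seed` are only relevant when the temperature is greater than 0, the example sets `--temperature` explicitly.

```shell
#!/bin/sh
set -eu

# Hypothetical model path; replace it with a real local file or a model URI
model="./models/my-model.gguf"

# Store the command in the positional parameters so it can be printed and run as-is
set -- npx --no node-llama-cpp infill \
    --model "$model" \
    --prefix "def fibonacci(n):" \
    --temperature 0.8 \
    --topK 40 \
    --topP 0.95 \
    --minP 0.05 \
    --seed 1234 \
    --repeatPenalty 1.1 \
    --maxTokens 128

# Print the command for review; set RUN=1 to actually execute it
echo "$*"
if [ "${RUN:-0}" = "1" ]; then
    "$@"
fi
```

Passing a fixed `--seed` is meant to make runs with a non-zero temperature easier to compare; with the default temperature of 0 the seed has no effect.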