# `infill` command

Generate an infill completion for given prefix and suffix texts.

## Usage

```shell
npx --no node-llama-cpp infill [modelPath]
```
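For example, a minimal invocation that loads a local model and generates the text that belongs between a given prefix and suffix. This is a sketch: the model path, prefix, and suffix below are illustrative placeholders, not recommendations.

```shell
# Infill the body of a function: the model generates the text that fits
# between the --prefix and --suffix texts.
# The model file name is a placeholder; use any infill-capable model.
npx --no node-llama-cpp infill ./models/my-model.gguf \
    --prefix "function add(a, b) {" \
    --suffix "}"
```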

## Options

| Option | Description |
| ------ | ----------- |
| `-m [string]`, `--modelPath [string]`, `--model [string]`, `--path [string]`, `--url [string]`, `--uri [string]` | Model file to use for the infill. Can be a path to a local file or a URI of a model file to download. Leave empty to choose from a list of recommended models (`string`) |
| `-H [string]`, `--header [string]` | Headers to use when downloading a model from a URL, in the format `key: value`. You can pass this option multiple times to add multiple headers (`string[]`) |
| `--gpu [string]` | Compute layer implementation type to use for llama.cpp. If omitted, uses the latest local build and falls back to `"auto"` (`string`). Choices: `auto`, `metal`, `cuda`, `vulkan`, `false` |
| `-i`, `--systemInfo` | Print llama.cpp system info (default: `false`) (`boolean`) |
| `--prefix [string]` | First prefix text to automatically load (`string`) |
| `--prefixFile [string]` | Path to a file to automatically load prefix text from (`string`) |
| `--suffix [string]` | First suffix text to automatically load. Requires `prefix` or `prefixFile` to be set (`string`) |
| `--suffixFile [string]` | Path to a file to automatically load suffix text from. Requires `prefix` or `prefixFile` to be set (`string`) |
| `-c <number>`, `--contextSize <number>` | Context size to use for the model context (default: automatically determined based on the available VRAM) (`number`) |
| `-b <number>`, `--batchSize <number>` | Batch size to use for the model context (`number`) |
| `--flashAttention`, `--fa` | Enable flash attention (default: `false`) (`boolean`) |
| `--swaFullCache`, `--noSwa` | Disable SWA (Sliding Window Attention) on supported models (default: `false`) (`boolean`) |
| `--threads <number>` | Number of threads to use for the evaluation of tokens (default: the number of cores that are useful for math on the current machine) (`number`) |
| `-t <number>`, `--temperature <number>` | Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The suggested temperature is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run. Set to `0` to disable (default: `0`) (`number`) |
| `--minP <number>`, `--mp <number>` | From the next-token candidates, discard the percentage of tokens with the lowest probability. For example, if set to `0.05`, the lowest-probability 5% of tokens will be discarded. This is useful for generating higher-quality results when using a high temperature. Set to a value between 0 and 1 to enable. Only relevant when `temperature` is set to a value greater than 0 (default: `0`) (`number`) |
| `-k <number>`, `--topK <number>` | Limits the model to consider only the K most likely next tokens for sampling at each step of sequence generation. An integer between 1 and the size of the vocabulary. Set to `0` to disable (which uses the full vocabulary). Only relevant when `temperature` is set to a value greater than 0 (default: `40`) (`number`) |
| `-p <number>`, `--topP <number>` | Dynamically selects the smallest set of tokens whose cumulative probability exceeds the threshold P, and samples the next token only from this set. A float between 0 and 1. Set to `1` to disable. Only relevant when `temperature` is set to a value greater than 0 (default: `0.95`) (`number`) |
| `--seed <number>` | Used to control the randomness of the generated text. Only relevant when using `temperature` (default: the current epoch time) (`number`) |
| `--xtc [string]` | Exclude Top Choices (XTC) removes the top tokens from consideration, avoiding more obvious and repetitive generations. `probability` (a number between 0 and 1) controls the chance that the top tokens will be removed in the next token generation step, and `threshold` (a number between 0 and 1) controls the minimum probability of a token for it to be removed. Set this argument to `probability,threshold` to set both values; for example, `0.5,0.1` (`string`) |
| `--gpuLayers <number>`, `--gl <number>` | Number of layers to store in VRAM (default: automatically determined based on the available VRAM) (`number`) |
| `--repeatPenalty <number>`, `--rp <number>` | Prevent the model from repeating the same token too much. Set to `1` to disable (default: `1.1`) (`number`) |
| `--lastTokensRepeatPenalty <number>`, `--rpn <number>` | Number of recent tokens generated by the model to apply repetition penalties to (default: `64`) (`number`) |
| `--penalizeRepeatingNewLine`, `--rpnl` | Penalize new line tokens. Set `--no-penalizeRepeatingNewLine` or `--no-rpnl` to disable (default: `true`) (`boolean`) |
| `--repeatFrequencyPenalty <number>`, `--rfp <number>` | If a token appears `n` times in the `punishTokens` array, lower its probability by `n * repeatFrequencyPenalty`. Set to a value between 0 and 1 to enable (`number`) |
| `--repeatPresencePenalty <number>`, `--rpp <number>` | Lower the probability of all the tokens in the `punishTokens` array by `repeatPresencePenalty`. Set to a value between 0 and 1 to enable (`number`) |
| `--dryRepeatPenaltyStrength <number>`, `--drps <number>`, `--dryStrength <number>` | The strength of DRY (Don't Repeat Yourself) penalties. A number between 0 and 1, where 0 means no DRY penalties and 1 means full-strength DRY penalties. The recommended value is 0.8 (default: `0`) (`number`) |
| `--dryRepeatPenaltyBase <number>`, `--drpb <number>`, `--dryBase <number>` | The base value for the exponential penalty calculation for DRY (Don't Repeat Yourself) penalties. A higher value leads to more aggressive penalization of repetitions (default: `1.75`) (`number`) |
| `--dryRepeatPenaltyAllowedLength <number>`, `--drpal <number>`, `--dryAllowedLength <number>` | The maximum sequence length (in tokens) that DRY (Don't Repeat Yourself) will allow to be repeated without being penalized. Repetitions shorter than or equal to this length will not be penalized (default: `2`) (`number`) |
| `--dryRepeatPenaltyLastTokens <number>`, `--drplt <number>`, `--dryLastTokens <number>` | Number of recent tokens generated by the model for DRY (Don't Repeat Yourself) to consider for sequence repetition matching. Set to `-1` to consider all tokens in the context sequence. Setting to `0` will disable DRY penalties (default: `-1`) (`number`) |
| `--maxTokens <number>`, `--mt <number>` | Maximum number of tokens to generate in responses. Set to `0` to disable. Set to `-1` to set it to the context size (default: `0`) (`number`) |
| `--tokenPredictionDraftModel [string]`, `--dm [string]`, `--draftModel [string]` | Model file to use for draft sequence token prediction (speculative decoding). Can be a path to a local file or a URI of a model file to download (`string`) |
| `--tokenPredictionModelContextSize <number>`, `--dc <number>`, `--draftContextSize <number>`, `--draftContext <number>` | Max context size to use for the draft sequence token prediction model context (default: `4096`) (`number`) |
| `-d`, `--debug` | Print llama.cpp info and debug logs (default: `false`) (`boolean`) |
| `--numa [string]` | NUMA allocation policy. See the `numa` option on the `getLlama` method for more information (default: `false`) (`string`). Choices: `distribute`, `isolate`, `numactl`, `mirror`, `false` |
| `--meter` | Log how many tokens were used as input and output for each response (default: `false`) (`boolean`) |
| `--timing` | Print how long it took to generate each response (default: `false`) (`boolean`) |
| `--noMmap` | Disable mmap (memory-mapped file) usage (default: `false`) (`boolean`) |
| `--useDirectIo` | Use Direct I/O when available (default: `false`) (`boolean`) |
| `--printTimings`, `--pt` | Print llama.cpp's internal timings after each response (default: `false`) (`boolean`) |
| `-h`, `--help` | Show help |
| `-v`, `--version` | Show version number |
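
As a further sketch combining several of the options above, a model can be downloaded from a URL that requires authentication (via `--url` and `-H`) and the prefix and suffix texts loaded from files. The URL, token, and file paths below are placeholders.

```shell
# Download a model from an authenticated URL, then load the prefix and
# suffix texts from files instead of inline arguments.
# URL, token, and file paths are placeholders.
npx --no node-llama-cpp infill --url "https://example.com/models/my-model.gguf" \
    -H "Authorization: Bearer <token>" \
    --prefixFile ./prefix.txt \
    --suffixFile ./suffix.txt
```

And a sampling-tuning sketch: enabling temperature-based sampling with min-P filtering and capping the response length. The values shown are illustrative, not recommendations beyond the defaults described in the table above.

```shell
# Enable temperature sampling (the suggested 0.8), filter out the
# lowest-probability 5% of candidate tokens, and cap output at 256 tokens.
# Model path and prefix text are placeholders.
npx --no node-llama-cpp infill ./models/my-model.gguf \
    --prefix "const result = " \
    --temperature 0.8 \
    --minP 0.05 \
    --maxTokens 256
```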