`infill` command

Generate an infill completion for the given prefix and suffix texts

Usage

npx --no node-llama-cpp infill [modelPath]
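As a minimal sketch of a one-off invocation, the snippet below assembles an `infill` command from the `--prefix`, `--suffix`, and `--maxTokens` options listed under Options. The model path and the `RUN_INFILL` switch are assumptions made for illustration and are not part of the CLI. The command is printed rather than executed unless `RUN_INFILL=1` is set.

```shell
# Hypothetical model path (an assumption); replace with a real local file or URI.
MODEL_PATH="${MODEL_PATH:-./models/example-model.gguf}"

# Collect the arguments as positional parameters so that quoting is preserved.
set -- --no node-llama-cpp infill "$MODEL_PATH" \
    --prefix 'function add(a, b) {' \
    --suffix '}' \
    --maxTokens 64

# Display-only rendering of the command (quoting is flattened here).
INFILL_CMD="npx $*"
echo "$INFILL_CMD"

# RUN_INFILL is a hypothetical switch of this sketch, not a CLI option.
if [ "${RUN_INFILL:-0}" = "1" ]; then
    npx "$@"
fi
```

`--suffix` is passed together with `--prefix` here because the options table states that a suffix requires `prefix` or `prefixFile` to be set.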

Options

Option	Description
`-m [string]`, `--modelPath [string]`, `--model [string]`, `--path [string]`, `--url [string]`, `--uri [string]`	Model file to use for the infill. Can be a path to a local file or a URI of a model file to download. Leave empty to choose from a list of recommended models `(string)`
`-H [string]`, `--header [string]`	Headers to use when downloading a model from a URL, in the format `key: value`. You can pass this option multiple times to add multiple headers. `(string[])`
`--gpu [string]`	Compute layer implementation type to use for llama.cpp (default: uses the latest local build, and falls back to `auto`) `(string)` choices: `auto`, `metal`, `cuda`, `vulkan`, `false`
`-i`, `--systemInfo`	Print llama.cpp system info (default: `false`) `(boolean)`
`--prefix [string]`	First prefix text to automatically load `(string)`
`--prefixFile [string]`	Path to a file to load prefix text from automatically `(string)`
`--suffix [string]`	First suffix text to automatically load. Requires `prefix` or `prefixFile` to be set `(string)`
`--suffixFile [string]`	Path to a file to load suffix text from automatically. Requires `prefix` or `prefixFile` to be set `(string)`
`-c <number>`, `--contextSize <number>`	Context size to use for the model context (default: Automatically determined based on the available VRAM) `(number)`
`-b <number>`, `--batchSize <number>`	Batch size to use for the model context `(number)`
`--flashAttention`, `--fa`	Enable flash attention (default: `false`) `(boolean)`
`--swaFullCache`, `--noSwa`	Disable SWA (Sliding Window Attention) on supported models (default: `false`) `(boolean)`
`--threads <number>`	Number of threads to use for the evaluation of tokens (default: Number of cores that are useful for math on the current machine) `(number)`
`-t <number>`, `--temperature <number>`	Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The suggested temperature is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run. Set to `0` to disable. (default: `0`) `(number)`
`--minP <number>`, `--mp <number>`	From the next token candidates, discard the percentage of tokens with the lowest probability. For example, if set to `0.05`, 5% of the lowest probability tokens will be discarded. This is useful for generating more high-quality results when using a high temperature. Set to a value between `0` and `1` to enable. Only relevant when `temperature` is set to a value greater than `0`. (default: `0`) `(number)`
`-k <number>`, `--topK <number>`	Limits the model to consider only the K most likely next tokens for sampling at each step of sequence generation. An integer number between `1` and the size of the vocabulary. Set to `0` to disable (which uses the full vocabulary). Only relevant when `temperature` is set to a value greater than 0. (default: `40`) `(number)`
`-p <number>`, `--topP <number>`	Dynamically selects the smallest set of tokens whose cumulative probability exceeds the threshold P, and samples the next token only from this set. A float number between `0` and `1`. Set to `1` to disable. Only relevant when `temperature` is set to a value greater than `0`. (default: `0.95`) `(number)`
`--seed <number>`	Used to control the randomness of the generated text. Only relevant when using `temperature`. (default: The current epoch time) `(number)`
`--gpuLayers <number>`, `--gl <number>`	Number of layers to store in VRAM (default: Automatically determined based on the available VRAM) `(number)`
`--repeatPenalty <number>`, `--rp <number>`	Prevent the model from repeating the same token too much. Set to `1` to disable. (default: `1.1`) `(number)`
`--lastTokensRepeatPenalty <number>`, `--rpn <number>`	Number of most recently generated tokens that the repeat penalty applies to (default: `64`) `(number)`
`--penalizeRepeatingNewLine`, `--rpnl`	Penalize new line tokens. Set `--no-penalizeRepeatingNewLine` or `--no-rpnl` to disable (default: `true`) `(boolean)`
`--repeatFrequencyPenalty <number>`, `--rfp <number>`	If a token appears `n` times in the `punishTokens` array, lower its probability by `n * repeatFrequencyPenalty`. Set to a value between `0` and `1` to enable. `(number)`
`--repeatPresencePenalty <number>`, `--rpp <number>`	Lower the probability of all the tokens in the `punishTokens` array by `repeatPresencePenalty`. Set to a value between `0` and `1` to enable. `(number)`
`--maxTokens <number>`, `--mt <number>`	Maximum number of tokens to generate in responses. Set to `0` to disable the limit. Set to `-1` to use the context size as the limit (default: `0`) `(number)`
`--tokenPredictionDraftModel [string]`, `--dm [string]`, `--draftModel [string]`	Model file to use for draft sequence token prediction (speculative decoding). Can be a path to a local file or a URI of a model file to download `(string)`
`--tokenPredictionModelContextSize <number>`, `--dc <number>`, `--draftContextSize <number>`, `--draftContext <number>`	Max context size to use for the draft sequence token prediction model context (default: `4096`) `(number)`
`-d`, `--debug`	Print llama.cpp info and debug logs (default: `false`) `(boolean)`
`--numa [string]`	NUMA allocation policy. See the `numa` option on the `getLlama` method for more information (default: `false`) `(string)` choices: `distribute`, `isolate`, `numactl`, `mirror`, `false`
`--meter`	Log how many tokens were used as input and output for each response (default: `false`) `(boolean)`
`--timing`	Print how long it took to generate each response (default: `false`) `(boolean)`
`--noMmap`	Disable mmap (memory-mapped file) usage (default: `false`) `(boolean)`
`--printTimings`, `--pt`	Print llama.cpp's internal timings after each response (default: `false`) `(boolean)`
`-h`, `--help`	Show help
`-v`, `--version`	Show version number
