complete command
Generate a completion for a given text
Usage
shell
npx --no node-llama-cpp complete [modelPath]
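As a minimal sketch of the usage line above, the following passes a model and a starting text explicitly. The model path and prompt are hypothetical placeholders, not values from this documentation; only flags listed in the options table are used. The command is printed rather than run, so it can be reviewed first.

```shell
# Hypothetical values for illustration; replace with your own model and prompt.
MODEL_PATH="./models/example-model.gguf"
PROMPT="The quick brown fox"

# Dry run: the leading `echo` prints the command so it can be reviewed.
# Remove `echo` to actually run the completion.
echo npx --no node-llama-cpp complete \
  --model "$MODEL_PATH" \
  --text "$PROMPT" \
  --maxTokens 64
```

Leaving out the model option should instead let you choose from a list of recommended models, as described for that option.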
Options
Option | Description |
---|---|
-m [string], --modelPath [string], --model [string], --path [string], --url [string], --uri [string] | Model file to use for the completion. Can be a path to a local file or a URI of a model file to download. Leave empty to choose from a list of recommended models (string) |
-H [string], --header [string] | Headers to use when downloading a model from a URL, in the format key: value. You can pass this option multiple times to add multiple headers. (string[]) |
--gpu [string] | Compute layer implementation type to use for llama.cpp. If omitted, uses the latest local build, and falls back to "auto" (default: Uses the latest local build, and falls back to "auto") (string) |
-i, --systemInfo | Print llama.cpp system info (default: false) (boolean) |
--text [string] | First text to automatically start generating completion for (string) |
--textFile [string] | Path to a file to load text from and use as the first text to automatically start generating completion for (string) |
-c <number>, --contextSize <number> | Context size to use for the model context (default: Automatically determined based on the available VRAM) (number) |
-b <number>, --batchSize <number> | Batch size to use for the model context. The default value is the context size (number) |
--flashAttention, --fa | Enable flash attention (default: false) (boolean) |
--threads <number> | Number of threads to use for the evaluation of tokens (default: Number of cores that are useful for math on the current machine) (number) |
-t <number>, --temperature <number> | Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The suggested temperature is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run. Set to 0 to disable. (default: 0) (number) |
--minP <number>, --mp <number> | From the next token candidates, discard the percentage of tokens with the lowest probability. For example, if set to 0.05, 5% of the lowest probability tokens will be discarded. This is useful for generating more high-quality results when using a high temperature. Set to a value between 0 and 1 to enable. Only relevant when temperature is set to a value greater than 0. (default: 0) (number) |
-k <number>, --topK <number> | Limits the model to consider only the K most likely next tokens for sampling at each step of sequence generation. An integer number between 1 and the size of the vocabulary. Set to 0 to disable (which uses the full vocabulary). Only relevant when temperature is set to a value greater than 0. (default: 40) (number) |
-p <number>, --topP <number> | Dynamically selects the smallest set of tokens whose cumulative probability exceeds the threshold P, and samples the next token only from this set. A float number between 0 and 1. Set to 1 to disable. Only relevant when temperature is set to a value greater than 0. (default: 0.95) (number) |
--seed <number> | Used to control the randomness of the generated text. Only relevant when using temperature. (default: The current epoch time) (number) |
--gpuLayers <number>, --gl <number> | Number of layers to store in VRAM (default: Automatically determined based on the available VRAM) (number) |
--repeatPenalty <number>, --rp <number> | Prevent the model from repeating the same token too much. Set to 1 to disable. (default: 1.1) (number) |
--lastTokensRepeatPenalty <number>, --rpn <number> | Number of recent tokens generated by the model to which repetition penalties are applied (default: 64) (number) |
--penalizeRepeatingNewLine, --rpnl | Penalize new line tokens. Set --no-penalizeRepeatingNewLine or --no-rpnl to disable (default: true) (boolean) |
--repeatFrequencyPenalty <number>, --rfp <number> | If a token appears n times in the punishTokens array, lower its probability by n * repeatFrequencyPenalty. Set to a value between 0 and 1 to enable. (number) |
--repeatPresencePenalty <number>, --rpp <number> | Lower the probability of all the tokens in the punishTokens array by repeatPresencePenalty. Set to a value between 0 and 1 to enable. (number) |
--maxTokens <number>, --mt <number> | Maximum number of tokens to generate in responses. Set to 0 to disable. Set to -1 to set to the context size (default: 0) (number) |
--tokenPredictionDraftModel [string], --dm [string], --draftModel [string] | Model file to use for draft sequence token prediction (speculative decoding). Can be a path to a local file or a URI of a model file to download (string) |
--tokenPredictionModelContextSize <number>, --dc <number>, --draftContextSize <number>, --draftContext <number> | Max context size to use for the draft sequence token prediction model context (default: 4096) (number) |
-d, --debug | Print llama.cpp info and debug logs (default: false) (boolean) |
--meter | Log how many tokens were used as input and output for each response (default: false) (boolean) |
--timing | Print how long it took to generate each response (default: false) (boolean) |
--noMmap | Disable mmap (memory-mapped file) usage (default: false) (boolean) |
--printTimings, --pt | Print llama.cpp's internal timings after each response (default: false) (boolean) |
-h, --help | Show help |
-v, --version | Show version number |
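The sampling options above interact: per their descriptions, topK, topP, and minP only matter once temperature is greater than 0, and temperature defaults to 0. The sketch below shows one way to combine them, using the suggested temperature of 0.8 and a fixed seed, which should make runs repeatable. The model path, prompt, and seed value are hypothetical placeholders chosen for illustration.

```shell
# Hypothetical values for illustration; replace with your own model and prompt.
MODEL_PATH="./models/example-model.gguf"
PROMPT="Once upon a time"
SEED=1234

# Dry run: the leading `echo` prints the command so it can be reviewed.
# Remove `echo` to actually run the completion.
echo npx --no node-llama-cpp complete \
  --model "$MODEL_PATH" \
  --text "$PROMPT" \
  --temperature 0.8 \
  --topK 40 \
  --topP 0.95 \
  --minP 0.05 \
  --seed "$SEED" \
  --repeatPenalty 1.1 \
  --maxTokens 128
```

The topK, topP, and repeatPenalty values shown are the documented defaults, written out only to make the configuration explicit; omitting them should give the same behavior.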