
chat command

Chat with a model

Usage

npx --no node-llama-cpp chat [modelPath]
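
As a minimal sketch of a typical invocation, the snippet below starts a chat with a local model, a system prompt, and a first prompt, using only flags listed under Options. The model path and the prompt texts are hypothetical placeholders, and the `run` helper with its `DRY_RUN` variable is an illustrative convenience that is not part of node-llama-cpp: by default it only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical path to a local model file; replace it with a real one.
MODEL_PATH="./models/my-model.gguf"

# Illustrative helper (not part of node-llama-cpp):
# prints the command when DRY_RUN=1 (the default here), runs it otherwise.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Start a chat with a system prompt and an automatic first prompt.
run npx --no node-llama-cpp chat \
  --model "$MODEL_PATH" \
  --systemPrompt "You are a concise assistant." \
  --prompt "Explain what a context size is in one sentence."
```

Running the script with `DRY_RUN=0` launches the interactive chat instead of printing the command. Note that the printed form joins arguments with spaces and does not preserve quoting.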

Options

Option Description
-m [string], --modelPath [string], --model [string], --path [string], --url [string], --uri [string] Model file to use for the chat. Can be a path to a local file or a URI of a model file to download. Leave empty to choose from a list of recommended models (string)
-H [string], --header [string] Headers to use when downloading a model from a URL, in the format key: value. You can pass this option multiple times to add multiple headers. (string[])
--gpu [string] Compute layer implementation type to use for llama.cpp. If omitted, uses the latest local build, and falls back to "auto" (default: Uses the latest local build, and falls back to "auto") (string)

choices: auto, metal, cuda, vulkan, false

-i, --systemInfo Print llama.cpp system info (default: false) (boolean)
-s [string], --systemPrompt [string] System prompt to use against the model (string)
--systemPromptFile [string] Path to a file to load text from and use as the model system prompt (string)
--prompt [string] First prompt to automatically send to the model when starting the chat (string)
--promptFile [string] Path to a file to load text from and use as a first prompt to automatically send to the model when starting the chat (string)
-w [string], --wrapper [string] Chat wrapper to use. Use auto to automatically select a wrapper based on the model's metadata and tokenizer (default: auto) (string)

choices: auto, general, llama3.2-lightweight, llama3.1, llama3, llama2Chat, mistral, alpacaChat, functionary, chatML, falconChat, gemma

--noJinja Don't use a Jinja wrapper, even if it's the best option for the model (default: false) (boolean)
-c <number>, --contextSize <number> Context size to use for the model context (default: Automatically determined based on the available VRAM) (number)
-b <number>, --batchSize <number> Batch size to use for the model context. The default value is the context size (number)
--flashAttention, --fa Enable flash attention (default: false) (boolean)
--noTrimWhitespace, --noTrim Don't trim whitespace from the model response (default: false) (boolean)
-g [string], --grammar [string] Restrict the model response to a specific grammar, like JSON for example (default: text) (string)

choices: text, json, list, arithmetic, japanese, chess

--jsonSchemaGrammarFile [string], --jsgf [string] File path to a JSON schema file, to restrict the model response to only generate output that conforms to the JSON schema (string)
--threads <number> Number of threads to use for the evaluation of tokens (default: Number of cores that are useful for math on the current machine) (number)
-t <number>, --temperature <number> Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The suggested temperature is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run. Set to 0 to disable. (default: 0) (number)
--minP <number>, --mp <number> From the next token candidates, discard the percentage of tokens with the lowest probability. For example, if set to 0.05, 5% of the lowest probability tokens will be discarded. This is useful for generating more high-quality results when using a high temperature. Set to a value between 0 and 1 to enable. Only relevant when temperature is set to a value greater than 0. (default: 0) (number)
-k <number>, --topK <number> Limits the model to consider only the K most likely next tokens for sampling at each step of sequence generation. An integer number between 1 and the size of the vocabulary. Set to 0 to disable (which uses the full vocabulary). Only relevant when temperature is set to a value greater than 0. (default: 40) (number)
-p <number>, --topP <number> Dynamically selects the smallest set of tokens whose cumulative probability exceeds the threshold P, and samples the next token only from this set. A float number between 0 and 1. Set to 1 to disable. Only relevant when temperature is set to a value greater than 0. (default: 0.95) (number)
--seed <number> Used to control the randomness of the generated text. Only relevant when using temperature. (default: The current epoch time) (number)
--gpuLayers <number>, --gl <number> Number of layers to store in VRAM (default: Automatically determined based on the available VRAM) (number)
--repeatPenalty <number>, --rp <number> Prevent the model from repeating the same token too much. Set to 1 to disable. (default: 1.1) (number)
--lastTokensRepeatPenalty <number>, --rpn <number> Number of most recent tokens generated by the model to apply the repeat penalty to (default: 64) (number)
--penalizeRepeatingNewLine, --rpnl Penalize new line tokens. Set --no-penalizeRepeatingNewLine or --no-rpnl to disable (default: true) (boolean)
--repeatFrequencyPenalty <number>, --rfp <number> For a token that appears n times in the punishTokens array, lower its probability by n * repeatFrequencyPenalty. Set to a value between 0 and 1 to enable. (number)
--repeatPresencePenalty <number>, --rpp <number> Lower the probability of all the tokens in the punishTokens array by repeatPresencePenalty. Set to a value between 0 and 1 to enable. (number)
--maxTokens <number>, --mt <number> Maximum number of tokens to generate in responses. Set to 0 to disable. Set to -1 to set to the context size (default: 0) (number)
--noHistory, --nh Don't load or save chat history (default: false) (boolean)
--environmentFunctions, --ef Provide access to environment functions like getDate and getTime (default: false) (boolean)
--tokenPredictionDraftModel [string], --dm [string], --draftModel [string] Model file to use for draft sequence token prediction (speculative decoding). Can be a path to a local file or a URI of a model file to download (string)
--tokenPredictionModelContextSize <number>, --dc <number>, --draftContextSize <number>, --draftContext <number> Max context size to use for the draft sequence token prediction model context (default: 4096) (number)
-d, --debug Print llama.cpp info and debug logs (default: false) (boolean)
--meter Print how many tokens were used as input and output for each response (default: false) (boolean)
--timing Print how long it took to generate each response (default: false) (boolean)
--noMmap Disable mmap (memory-mapped file) usage (default: false) (boolean)
--printTimings, --pt Print llama.cpp's internal timings after each response (default: false) (boolean)
-h, --help Show help
-v, --version Show version number
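
Several of the options above can be combined in one invocation. The sketch below writes a small JSON schema file and passes it with `--jsonSchemaGrammarFile`, together with explicit sampling settings, a fixed seed, a response length limit, and token metering. The model path, the schema file name, and the schema contents are hypothetical, and the schema is deliberately minimal because which JSON schema keywords are supported is not stated here. The `run` helper and `DRY_RUN` variable are illustrative and not part of node-llama-cpp.

```shell
#!/bin/sh
# Hypothetical paths; replace them with real ones.
MODEL_PATH="./models/my-model.gguf"
SCHEMA_FILE="${TMPDIR:-/tmp}/answer.schema.json"

# Illustrative helper (not part of node-llama-cpp):
# prints the command when DRY_RUN=1 (the default here), runs it otherwise.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# A minimal example schema: an object with a text answer and a numeric score.
cat > "$SCHEMA_FILE" <<'EOF'
{
  "type": "object",
  "properties": {
    "answer": { "type": "string" },
    "confidence": { "type": "number" }
  }
}
EOF

# Restrict responses to the schema, sample with the suggested temperature,
# fix the seed, cap the response length, and print token usage.
run npx --no node-llama-cpp chat \
  --model "$MODEL_PATH" \
  --jsonSchemaGrammarFile "$SCHEMA_FILE" \
  --temperature 0.8 \
  --topK 40 \
  --topP 0.95 \
  --seed 1234 \
  --maxTokens 256 \
  --meter
```

The temperature is set explicitly because the listed default is 0, and `--topK`, `--topP`, and `--seed` are described as relevant only when the temperature is greater than 0.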