`chat` command

Chat with a model

Usage

```shell
npx --no node-llama-cpp chat [modelPath]
```

Options

Option	Description
`-m [string]`, `--modelPath [string]`, `--model [string]`, `--path [string]`, `--url [string]`, `--uri [string]`	Model file to use for the chat. Can be a path to a local file or a URI of a model file to download. Leave empty to choose from a list of recommended models `(string)`
`-H [string]`, `--header [string]`	Headers to use when downloading a model from a URL, in the format `key: value`. You can pass this option multiple times to add multiple headers. `(string[])`
`--gpu [string]`	Compute layer implementation type to use for llama.cpp (default: uses the latest local build, and falls back to `auto`) `(string)` choices: `auto`, `metal`, `cuda`, `vulkan`, `false`
`-i`, `--systemInfo`	Print llama.cpp system info (default: `false`) `(boolean)`
`-s [string]`, `--systemPrompt [string]`	System prompt to use against the model `(string)`
`--systemPromptFile [string]`	Path to a file to load text from and use as the model system prompt `(string)`
`--prompt [string]`	First prompt to automatically send to the model when starting the chat `(string)`
`--promptFile [string]`	Path to a file to load text from and use as a first prompt to automatically send to the model when starting the chat `(string)`
`-w [string]`, `--wrapper [string]`	Chat wrapper to use. Use `auto` to automatically select a wrapper based on the model's metadata and tokenizer (default: `auto`) `(string)` choices: `auto`, `general`, `deepSeek`, `qwen`, `llama3.2-lightweight`, `llama3.1`, `llama3`, `llama2Chat`, `mistral`, `alpacaChat`, `functionary`, `chatML`, `falconChat`, `gemma`, `harmony`, `seed`
`--noJinja`	Don't use a Jinja wrapper, even if it's the best option for the model (default: `false`) `(boolean)`
`-c <number>`, `--contextSize <number>`	Context size to use for the model context (default: Automatically determined based on the available VRAM) `(number)`
`-b <number>`, `--batchSize <number>`	Batch size to use for the model context `(number)`
`--flashAttention`, `--fa`	Enable flash attention (default: `false`) `(boolean)`
`--swaFullCache`, `--noSwa`	Disable SWA (Sliding Window Attention) on supported models (default: `false`) `(boolean)`
`--noTrimWhitespace`, `--noTrim`	Don't trim whitespace from the model response (default: `false`) `(boolean)`
`-g [string]`, `--grammar [string]`	Restrict the model response to a specific grammar, like JSON for example (default: `text`) `(string)` choices: `text`, `json`, `list`, `arithmetic`, `japanese`, `chess`
`--jsonSchemaGrammarFile [string]`, `--jsgf [string]`	File path to a JSON schema file, to restrict the model response to only generate output that conforms to the JSON schema `(string)`
`--threads <number>`	Number of threads to use for the evaluation of tokens (default: Number of cores that are useful for math on the current machine) `(number)`
`-t <number>`, `--temperature <number>`	Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative. The suggested temperature is 0.8, which provides a balance between randomness and determinism. At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run. Set to `0` to disable. (default: `0`) `(number)`
`--minP <number>`, `--mp <number>`	From the next token candidates, discard the percentage of tokens with the lowest probability. For example, if set to `0.05`, 5% of the lowest probability tokens will be discarded. This is useful for generating more high-quality results when using a high temperature. Set to a value between `0` and `1` to enable. Only relevant when `temperature` is set to a value greater than `0`. (default: `0`) `(number)`
`-k <number>`, `--topK <number>`	Limits the model to consider only the K most likely next tokens for sampling at each step of sequence generation. An integer number between `1` and the size of the vocabulary. Set to `0` to disable (which uses the full vocabulary). Only relevant when `temperature` is set to a value greater than `0`. (default: `40`) `(number)`
`-p <number>`, `--topP <number>`	Dynamically selects the smallest set of tokens whose cumulative probability exceeds the threshold P, and samples the next token only from this set. A float number between `0` and `1`. Set to `1` to disable. Only relevant when `temperature` is set to a value greater than `0`. (default: `0.95`) `(number)`
`--seed <number>`	Used to control the randomness of the generated text. Only relevant when using `temperature`. (default: The current epoch time) `(number)`
`--gpuLayers <number>`, `--gl <number>`	Number of layers to store in VRAM (default: Automatically determined based on the available VRAM) `(number)`
`--repeatPenalty <number>`, `--rp <number>`	Prevent the model from repeating the same token too much. Set to `1` to disable. (default: `1.1`) `(number)`
`--lastTokensRepeatPenalty <number>`, `--rpn <number>`	Number of most recent tokens generated by the model that the repetition penalties apply to (default: `64`) `(number)`
`--penalizeRepeatingNewLine`, `--rpnl`	Penalize new line tokens. Set `--no-penalizeRepeatingNewLine` or `--no-rpnl` to disable (default: `true`) `(boolean)`
`--repeatFrequencyPenalty <number>`, `--rfp <number>`	If a token appears n times in the `punishTokens` array, lower its probability by `n * repeatFrequencyPenalty`. Set to a value between `0` and `1` to enable. `(number)`
`--repeatPresencePenalty <number>`, `--rpp <number>`	Lower the probability of all the tokens in the `punishTokens` array by `repeatPresencePenalty`. Set to a value between `0` and `1` to enable. `(number)`
`--maxTokens <number>`, `--mt <number>`	Maximum number of tokens to generate in responses. Set to `0` to disable. Set to `-1` to set to the context size (default: `0`) `(number)`
`--reasoningBudget <number>`, `--tb <number>`, `--thinkingBudget <number>`, `--thoughtsBudget <number>`	Maximum number of tokens the model can use for thoughts. Set to `0` to disable reasoning (default: Unlimited) `(number)`
`--noHistory`, `--nh`	Don't load or save chat history (default: `false`) `(boolean)`
`--environmentFunctions`, `--ef`	Provide access to environment functions like `getDate` and `getTime` (default: `false`) `(boolean)`
`--tokenPredictionDraftModel [string]`, `--dm [string]`, `--draftModel [string]`	Model file to use for draft sequence token prediction (speculative decoding). Can be a path to a local file or a URI of a model file to download `(string)`
`--tokenPredictionModelContextSize <number>`, `--dc <number>`, `--draftContextSize <number>`, `--draftContext <number>`	Max context size to use for the draft sequence token prediction model context (default: `4096`) `(number)`
`-d`, `--debug`	Print llama.cpp info and debug logs (default: `false`) `(boolean)`
`--numa [string]`	NUMA allocation policy. See the `numa` option on the `getLlama` method for more information (default: `false`) `(string)` choices: `distribute`, `isolate`, `numactl`, `mirror`, `false`
`--meter`	Print how many tokens were used as input and output for each response (default: `false`) `(boolean)`
`--timing`	Print how long it took to generate each response (default: `false`) `(boolean)`
`--noMmap`	Disable mmap (memory-mapped file) usage (default: `false`) `(boolean)`
`--printTimings`, `--pt`	Print llama.cpp's internal timings after each response (default: `false`) `(boolean)`
`-h`, `--help`	Show help
`-v`, `--version`	Show version number
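
To make the interaction between `--temperature`, `--topK`, and `--topP` more concrete, here is a minimal TypeScript sketch of how such filters can narrow the set of next-token candidates. It is an illustration under simplifying assumptions, not the implementation used by node-llama-cpp or llama.cpp: the `Candidate` type, the `filterCandidates` function, and the order in which the filters run are all hypothetical, and probabilities are not renormalized between steps.

```typescript
// Illustrative only: a simplified model of temperature, top-K and top-P filtering.
// All names here are hypothetical and do not belong to node-llama-cpp's API.

interface Candidate { token: string; logit: number }
interface Scored { token: string; probability: number }
interface SamplingOptions { temperature: number; topK: number; topP: number }

// Softmax over logits divided by the temperature (expects temperature > 0).
function softmax(candidates: Candidate[], temperature: number): Scored[] {
    const max = Math.max(...candidates.map((c) => c.logit / temperature));
    const weigh = (c: Candidate) => Math.exp(c.logit / temperature - max);
    const sum = candidates.reduce((total, c) => total + weigh(c), 0);
    return candidates.map((c) => ({ token: c.token, probability: weigh(c) / sum }));
}

// Sort by probability and keep the K most likely tokens; 0 keeps every token.
function applyTopK(scored: Scored[], topK: number): Scored[] {
    const sorted = [...scored].sort((a, b) => b.probability - a.probability);
    return topK > 0 ? sorted.slice(0, topK) : sorted;
}

// Keep the smallest prefix whose cumulative probability reaches topP; 1 disables it.
function applyTopP(sorted: Scored[], topP: number): Scored[] {
    if (topP >= 1) return sorted;
    const kept: Scored[] = [];
    let cumulative = 0;
    for (const item of sorted) {
        kept.push(item);
        cumulative += item.probability;
        if (cumulative >= topP) break;
    }
    return kept;
}

// Returns the tokens that remain eligible for sampling (candidates must be non-empty).
function filterCandidates(candidates: Candidate[], options: SamplingOptions): string[] {
    if (options.temperature <= 0) {
        // A temperature of 0 means greedy decoding: only the most likely token remains.
        const best = candidates.reduce((a, b) => (b.logit > a.logit ? b : a));
        return [best.token];
    }
    const scored = softmax(candidates, options.temperature);
    return applyTopP(applyTopK(scored, options.topK), options.topP).map((s) => s.token);
}

const example: Candidate[] = [
    { token: "a", logit: 2 },
    { token: "b", logit: 1 },
    { token: "c", logit: 0 },
    { token: "d", logit: -1 },
];
console.log(filterCandidates(example, { temperature: 1, topK: 3, topP: 0.8 }));
```

In this sketch a temperature of `0` short-circuits the other filters, which mirrors the note in the table that `topK`, `topP`, and `minP` are only relevant when `temperature` is greater than `0`.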
