# `inspect measure` command

Measure the VRAM consumption of a GGUF model file with all possible combinations of GPU layer counts and context sizes.

## Usage

```shell
npx --no node-llama-cpp inspect measure [modelPath]
```
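
For example, to measure a local model file (the path below is hypothetical):

```shell
npx --no node-llama-cpp inspect measure ./models/model.Q4_K_M.gguf
```

Leave out `[modelPath]` to choose from a list of recommended models instead.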

## Options

| Option | Description |
| --- | --- |
| `-m [string]`, `--modelPath [string]`, `--model [string]`, `--path [string]`, `--url [string]`, `--uri [string]` | Model file to use for the measurements. Can be a path to a local file or a URI of a model file to download. Leave empty to choose from a list of recommended models (`string`) |
| `-H [string]`, `--header [string]` | Headers to use when downloading a model from a URL, in the format `key: value`. You can pass this option multiple times to add multiple headers (`string[]`) |
| `--gpu [string]` | Compute layer implementation type to use for `llama.cpp`. If omitted, uses the latest local build and falls back to `"auto"`. Choices: `auto`, `metal`, `cuda`, `vulkan`, `false` (`string`) |
| `--minLayers <number>`, `--mnl <number>` | Minimum number of layers to offload to the GPU (default: `1`) (`number`) |
| `--maxLayers <number>`, `--mxl <number>` | Maximum number of layers to offload to the GPU (default: all layers) (`number`) |
| `--minContextSize <number>`, `--mncs <number>` | Minimum context size (default: `512`) (`number`) |
| `--maxContextSize <number>`, `--mxcs <number>` | Maximum context size (default: the model's train context size) (`number`) |
| `--flashAttention`, `--fa` | Enable flash attention for the context (default: `false`) (`boolean`) |
| `-n <number>`, `--measures <number>` | Number of context size measures to take for each GPU layer count (default: `10`) (`number`) |
| `--printHeaderBeforeEachLayer`, `--ph` | Print a header before each layer's measures (default: `true`) (`boolean`) |
| `--evaluateText [string]`, `--evaluate [string]`, `--et [string]` | Text to evaluate with the model (`string`) |
| `--repeatEvaluateText <number>`, `--repeatEvaluate <number>`, `--ret <number>` | Number of times to repeat the evaluation text before sending it for evaluation, to make it longer (default: `1`) (`number`) |
| `-h`, `--help` | Show help |
| `-v`, `--version` | Show version number |
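
As a sketch, the following invocation (again with a hypothetical model path) restricts the measurements to the CUDA backend, offloads between 16 and 33 layers, and caps the context size at 8192 tokens; all flags used here are documented above, and the specific bounds are arbitrary:

```shell
npx --no node-llama-cpp inspect measure ./models/model.Q4_K_M.gguf \
  --gpu cuda \
  --minLayers 16 \
  --maxLayers 33 \
  --maxContextSize 8192
```

Adjust the layer and context bounds to fit the model and hardware being measured.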