
CUDA Support

CUDA is a parallel computing platform and API created by NVIDIA for NVIDIA GPUs.

node-llama-cpp ships with pre-built binaries with CUDA support for Windows and Linux, and these are automatically used when CUDA is detected on your machine.

To use node-llama-cpp's CUDA support with your NVIDIA GPU, make sure you have CUDA Toolkit 12.2 or higher installed on your machine.

If the pre-built binaries don't work with your CUDA installation, node-llama-cpp will automatically download a release of llama.cpp and build it from source with CUDA support. Building from source with CUDA support is slow and can take up to an hour.

The pre-built binaries are compiled with CUDA Toolkit 12.2, so any version of CUDA Toolkit that is 12.2 or higher should work with the pre-built binaries. If you have an older version of CUDA Toolkit installed on your machine, consider updating it to avoid having to wait the long build time.
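
If you're not sure which version of the CUDA Toolkit is installed, you can check it with nvcc, the compiler that ships with the toolkit (this assumes nvcc is available on your PATH):

shell
nvcc --version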

Testing CUDA Support

To check whether the CUDA support works on your machine, run this command:

shell
npx --no node-llama-cpp inspect gpu

You should see an output like this:

CUDA: available

CUDA device: NVIDIA RTX A6000
CUDA used VRAM: 0.54% (266.88MB/47.65GB)
CUDA free VRAM: 99.45% (47.39GB/47.65GB)

CPU model: Intel(R) Xeon(R) Gold 5315Y CPU @ 3.20GHz
Used RAM: 2.51% (1.11GB/44.08GB)
Free RAM: 97.48% (42.97GB/44.08GB)

If you see CUDA used VRAM in the output, it means that CUDA support is working on your machine.
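
You can also check this programmatically. Here's a minimal sketch that uses getLlama (covered in more detail below) to detect the best available GPU type and report whether CUDA was picked:

typescript
import {getLlama} from "node-llama-cpp";

// detect the best available GPU type automatically
const llama = await getLlama();

if (llama.gpu === "cuda")
    console.log("CUDA support is working");
else
    console.log("CUDA is not being used; detected GPU type:", llama.gpu);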

Prerequisites

- CUDA Toolkit 12.2 or higher installed on your machine
- cmake (if it isn't installed, node-llama-cpp will automatically download it to an internal directory and use it to build llama.cpp from source)

Manually Building node-llama-cpp With CUDA Support

Run this command inside your project:

shell
npx --no node-llama-cpp source download --gpu cuda

If cmake is not installed on your machine, node-llama-cpp will automatically download cmake to an internal directory and try to use it to build llama.cpp from source.

If you see the message CUDA not found during the build, it means that the CUDA Toolkit is not installed on your machine or could not be detected by the build process.

Custom llama.cpp CMake Options

llama.cpp has some options you can use to customize your CUDA build.

llama.cpp CUDA CMake build options:

| Option | Description | Default value |
|---|---|---|
| GGML_CUDA_FORCE_DMMV | ggml: use dmmv instead of mmvq CUDA kernels | OFF |
| GGML_CUDA_FORCE_MMQ | ggml: use mmq kernels instead of cuBLAS | OFF |
| GGML_CUDA_FORCE_CUBLAS | ggml: always use cuBLAS instead of mmq kernels | OFF |
| GGML_CUDA_F16 | ggml: use 16 bit floats for some calculations | OFF |
| GGML_CUDA_NO_PEER_COPY | ggml: do not use peer to peer copies | OFF |
| GGML_CUDA_NO_VMM | ggml: do not try to use CUDA VMM | OFF |
| GGML_CUDA_FA_ALL_QUANTS | ggml: compile all quants for FlashAttention | OFF |
| GGML_CUDA_GRAPHS | ggml: use CUDA graphs (llama.cpp only) | ${GGML_CUDA_GRAPHS_DEFAULT} |

Source: CMakeLists (filtered for only CUDA-related options)

You can see all the available llama.cpp CMake build options in the llama.cpp repository's CMakeLists files.

To build node-llama-cpp with any of these options, set an environment variable named after the option you want to set, prefixed with NODE_LLAMA_CPP_CMAKE_OPTION_, before running the build command.
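
For example, here's a sketch of building from source with the GGML_CUDA_F16 option enabled (the option is taken from the table above and chosen purely for illustration):

shell (Linux)
export NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_F16=ON
npx --no node-llama-cpp source download --gpu cuda

cmd (Windows)
set NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_F16=ON
npx --no node-llama-cpp source download --gpu cuda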

Fix the Failed to detect a default CUDA architecture Build Error

To fix this issue, set the CUDACXX environment variable to the path of the nvcc compiler.

For example, if you have CUDA Toolkit 12.2 installed, run the command that matches your platform:

shell (Linux)
export CUDACXX=/usr/local/cuda-12.2/bin/nvcc

cmd (Windows)
set CUDACXX=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\bin\nvcc.exe

Then run the build command again to check whether setting the CUDACXX environment variable fixed the issue.

Fix the The CUDA compiler identification is unknown Build Error

The solution to this error is the same as the solution to the Failed to detect a default CUDA architecture error: set the CUDACXX environment variable to the path of the nvcc compiler, as described above.

If that doesn't resolve the error, you can also try setting the CMAKE_GENERATOR_TOOLSET cmake option to the CUDA home directory, which is usually already available in the CUDA_PATH environment variable.

To do this, set the NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_GENERATOR_TOOLSET environment variable to the path of your CUDA home directory:

shell (Linux)
export NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_GENERATOR_TOOLSET=$CUDA_PATH

cmd (Windows)
set NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_GENERATOR_TOOLSET=%CUDA_PATH%

Then run the build command again to check whether setting the CMAKE_GENERATOR_TOOLSET cmake option fixed the issue.

Using node-llama-cpp With CUDA

It's recommended to use getLlama without specifying a GPU type, so it'll detect the available GPU types and use the best one automatically.

To do this, just use getLlama without any parameters:

typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
console.log("GPU type:", llama.gpu);

To force it to use CUDA, you can use the gpu option:

typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama({
    gpu: "cuda"
});
console.log("GPU type:", llama.gpu);

By default, node-llama-cpp will offload as many layers of the model to the GPU as it can fit in the VRAM.

To force it to offload a specific number of layers, you can use the gpuLayers option:

typescript
const model = await llama.loadModel({
    modelPath,
    gpuLayers: 33 // or any other number of layers you want
});

WARNING

Attempting to offload more layers to the GPU than the available VRAM can fit will result in an InsufficientMemoryError being thrown.
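
If you want to handle this case gracefully, you can catch the error when loading the model. The sketch below assumes that InsufficientMemoryError is exported from node-llama-cpp and uses a hypothetical model path and layer count purely for illustration:

typescript
import {getLlama, InsufficientMemoryError} from "node-llama-cpp";

const llama = await getLlama({gpu: "cuda"});
const modelPath = "path/to/model.gguf"; // hypothetical path, replace with your own

try {
    // attempt to offload a fixed number of layers, which may not fit in VRAM
    const model = await llama.loadModel({
        modelPath,
        gpuLayers: 60 // illustrative value
    });
    console.log("Model loaded with 60 GPU layers");
} catch (err) {
    if (err instanceof InsufficientMemoryError)
        console.error("Not enough VRAM to offload this many layers; try a lower gpuLayers value");
    else
        throw err;
}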

On Linux, you can monitor GPU usage with this command:

shell
watch -d nvidia-smi