# CUDA Support
CUDA is a parallel computing platform and API created by NVIDIA for NVIDIA GPUs.
`node-llama-cpp` ships with pre-built binaries with CUDA support for Windows and Linux, and these are automatically used when CUDA is detected on your machine.
To use `node-llama-cpp`'s CUDA support with your NVIDIA GPU, make sure you have CUDA Toolkit 12.2 or higher installed on your machine.
If the pre-built binaries don't work with your CUDA installation, `node-llama-cpp` will automatically download a release of `llama.cpp` and build it from source with CUDA support. Building from source with CUDA support is slow and can take up to an hour.
The pre-built binaries are compiled with CUDA Toolkit 12.2, so any CUDA Toolkit version of 12.2 or higher should work with them. If you have an older version of CUDA Toolkit installed on your machine, consider updating it to avoid the long build from source.
## Testing CUDA Support
To check whether CUDA support works on your machine, run this command:

```shell
npx --no node-llama-cpp inspect gpu
```
You should see an output like this:
```
CUDA: available
CUDA device: NVIDIA RTX A6000
CUDA used VRAM: 0.54% (266.88MB/47.65GB)
CUDA free VRAM: 99.45% (47.39GB/47.65GB)
CPU model: Intel(R) Xeon(R) Gold 5315Y CPU @ 3.20GHz
Used RAM: 2.51% (1.11GB/44.08GB)
Free RAM: 97.48% (42.97GB/44.08GB)
```
If you see `CUDA used VRAM` in the output, it means that CUDA support is working on your machine.
## Prerequisites
- CUDA Toolkit 12.2 or higher
- `cmake-js` dependencies
- CMake 3.26 or higher (optional, recommended if you have build issues)
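To verify that the build prerequisites are available, you can check their versions from your terminal (this assumes `nvcc` and `cmake` are on your `PATH`):

```shell
# Print the installed CUDA Toolkit (nvcc) version
nvcc --version

# Print the installed CMake version (only relevant if you use your own CMake for the build)
cmake --version
```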
## Manually Building `node-llama-cpp` With CUDA Support
Run this command inside your project:

```shell
npx --no node-llama-cpp source download --gpu cuda
```
If `cmake` is not installed on your machine, `node-llama-cpp` will automatically download `cmake` to an internal directory and try to use it to build `llama.cpp` from source.
If you see the message `CUDA not found` during the build process, it means that CUDA Toolkit is not installed on your machine or that it is not detected by the build process.
## Custom `llama.cpp` CMake Options
`llama.cpp` has some options you can use to customize your CUDA build.

`llama.cpp` CUDA CMake build options:
| Option | Description | Default value |
|---|---|---|
| `GGML_CUDA_FORCE_DMMV` | ggml: use dmmv instead of mmvq CUDA kernels | `OFF` |
| `GGML_CUDA_FORCE_MMQ` | ggml: use mmq kernels instead of cuBLAS | `OFF` |
| `GGML_CUDA_FORCE_CUBLAS` | ggml: always use cuBLAS instead of mmq kernels | `OFF` |
| `GGML_CUDA_F16` | ggml: use 16 bit floats for some calculations | `OFF` |
| `GGML_CUDA_NO_PEER_COPY` | ggml: do not use peer to peer copies | `OFF` |
| `GGML_CUDA_NO_VMM` | ggml: do not try to use CUDA VMM | `OFF` |
| `GGML_CUDA_FA_ALL_QUANTS` | ggml: compile all quants for FlashAttention | `OFF` |
| `GGML_CUDA_GRAPHS` | ggml: use CUDA graphs (llama.cpp only) | `${GGML_CUDA_GRAPHS_DEFAULT}` |
Source: `llama.cpp`'s CMake configuration, filtered for only CUDA-related options. You can find all of the available `llama.cpp` CMake build options there.
To build `node-llama-cpp` with any of these options, set an environment variable named after the option you want to set, prefixed with `NODE_LLAMA_CPP_CMAKE_OPTION_`.
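For example, to turn on the `GGML_CUDA_F16` option from the table above, you could set its prefixed environment variable before running the build command. This is a minimal sketch using the Linux shell syntax shown elsewhere on this page; use `set` instead of `export` on Windows:

```shell
# Enable the GGML_CUDA_F16 llama.cpp option for the source build
export NODE_LLAMA_CPP_CMAKE_OPTION_GGML_CUDA_F16=ON

# Build llama.cpp from source with CUDA support using this option
npx --no node-llama-cpp source download --gpu cuda
```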
### Fix the `Failed to detect a default CUDA architecture` Build Error
To fix this issue, you have to set the `CUDACXX` environment variable to the path of the `nvcc` compiler.
For example, if you have installed CUDA Toolkit 12.2, you have to run a command like this:

On Linux:
```shell
export CUDACXX=/usr/local/cuda-12.2/bin/nvcc
```

On Windows:
```shell
set CUDACXX=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\bin\nvcc.exe
```
Then run the build command again to check whether setting the `CUDACXX` environment variable fixed the issue.
### Fix the `The CUDA compiler identification is unknown` Build Error

The solution to this error is the same as the solution to the `Failed to detect a default CUDA architecture` error.
### Fix the `A single input file is required for a non-link phase when an outputfile is specified` Build Error

To fix this issue, you have to set the `CMAKE_GENERATOR_TOOLSET` CMake option to your CUDA home directory, which is usually already available in the `CUDA_PATH` environment variable.
To do this, set the `NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_GENERATOR_TOOLSET` environment variable to the path of your CUDA home directory:

On Linux:
```shell
export NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_GENERATOR_TOOLSET=$CUDA_PATH
```

On Windows:
```shell
set NODE_LLAMA_CPP_CMAKE_OPTION_CMAKE_GENERATOR_TOOLSET=%CUDA_PATH%
```
Then run the build command again to check whether setting the `CMAKE_GENERATOR_TOOLSET` CMake option fixed the issue.
## Using `node-llama-cpp` With CUDA
It's recommended to use `getLlama` without specifying a GPU type, so it'll detect the available GPU types and use the best one automatically.
To do this, just use `getLlama` without any parameters:

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
console.log("GPU type:", llama.gpu);
```
To force it to use CUDA, you can use the `gpu` option:

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama({
    gpu: "cuda"
});
console.log("GPU type:", llama.gpu);
```
By default, `node-llama-cpp` will offload as many layers of the model to the GPU as it can fit in the VRAM.

To force it to offload a specific number of layers, you can use the `gpuLayers` option:
```typescript
const model = await llama.loadModel({
    modelPath,
    gpuLayers: 33 // or any other number of layers you want
});
```
> **Warning:** Attempting to offload more layers to the GPU than the available VRAM can fit will result in an `InsufficientMemoryError` error.
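If you're not sure how many layers can fit, one approach is to catch that error and fall back to automatic layer selection. The following is a minimal sketch that assumes `InsufficientMemoryError` is exported from `node-llama-cpp` (check the library's exports before relying on this); `modelPath` is a placeholder, as in the snippets above:

```typescript
import {getLlama, InsufficientMemoryError} from "node-llama-cpp";

const llama = await getLlama({gpu: "cuda"});

let model;
try {
    // Try to offload a fixed number of layers to the GPU
    model = await llama.loadModel({
        modelPath,
        gpuLayers: 33
    });
} catch (error) {
    if (error instanceof InsufficientMemoryError) {
        // Too many layers for the available VRAM; let node-llama-cpp
        // decide how many layers to offload instead
        model = await llama.loadModel({modelPath});
    } else
        throw error;
}
```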
On Linux, you can monitor GPU usage with this command:

```shell
watch -d nvidia-smi
```