Downloading Models
node-llama-cpp is equipped with solutions for downloading models to use in your project. The most common way is to download models using the CLI.
For a tutorial on how to choose models and where to get them from, read the choosing a model tutorial.
Using the CLI
node-llama-cpp is equipped with a model downloader you can use to download models and their related files easily and at high speed (using ipull).
It's recommended to add a models:pull script to your package.json to download all the models used by your project into a local models folder.
It's also recommended to ensure all the models are automatically downloaded after running npm install by setting up a postinstall script.
Here's an example of how you can set this up in your package.json:
{
    "scripts": {
        "postinstall": "npm run models:pull",
        "models:pull": "node-llama-cpp pull --dir ./models <model-url>"
    }
}
Don't forget to add the models folder to your .gitignore file to avoid committing the models to your repository:
/models
If the model consists of multiple files, only use the URL of the first one, and the rest will be downloaded automatically. For more information, see createModelDownloader.
Calling models:pull multiple times will only download models that haven't been downloaded yet. If a model file was updated, calling models:pull will download the updated file and overwrite the old one.
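If you want to pull a model ad-hoc, without going through the npm script, you can run the same command directly (the flags match the models:pull script above):
npx --no node-llama-cpp pull --dir ./models <model-url>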
You can pass a list of model URLs to download multiple models at once:
{
    "scripts": {
        "postinstall": "npm run models:pull",
        "models:pull": "node-llama-cpp pull --dir ./models <model1-url> <model2-url> <model3-url>"
    }
}
TIP
When scaffolding a new project, this setup is already included in the generated project.
Programmatically Downloading Models
You can also download models programmatically using the createModelDownloader method, and combine multiple model downloaders with combineModelDownloaders.
This option is recommended for more advanced use cases, such as downloading models based on user input.
If you know the exact model URLs you'll need in your project, it's better to download the models automatically after running npm install, as described in the Using the CLI section above.
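Here's a minimal sketch of the programmatic route. The modelUri and dirPath option names and the download() method follow the createModelDownloader documentation, but check the API reference of your installed version before relying on them:
import {fileURLToPath} from "url";
import path from "path";
import {createModelDownloader, getLlama} from "node-llama-cpp";
const __dirname = path.dirname(fileURLToPath(import.meta.url));
// download the model file into a local models folder (or reuse it if it's already there)
const downloader = await createModelDownloader({
    modelUri: "hf:user/model/model-file.gguf",
    dirPath: path.join(__dirname, "models")
});
const modelPath = await downloader.download();
const llama = await getLlama();
const model = await llama.loadModel({modelPath});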
Model URIs
You can reference models using a URI instead of their full download URL when using the CLI and relevant methods.
When downloading a model from a URI, the model files will be prefixed with a corresponding adaptation of the URI.
To reference a model from Hugging Face, you can use the scheme hf:<user>/<model>/<file-path>#<branch> (the #<branch> part is optional).
Here's an example usage of the Hugging Face URI scheme:
hf:mradermacher/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf
When using a URI to reference a model, it's recommended to add it to your package.json file to ensure it's downloaded when running npm install, and to resolve it using the resolveModelFile method to get the full path of the resolved model file.
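For example, a models:pull script can be given the URI directly in place of a full download URL, reusing the same pattern shown above (the URI below is a placeholder):
{
    "scripts": {
        "postinstall": "npm run models:pull",
        "models:pull": "node-llama-cpp pull --dir ./models hf:user/model/model-file.gguf"
    }
}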
Here's an example usage of the resolveModelFile method:
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, resolveModelFile} from "node-llama-cpp";
const __dirname = path.dirname(fileURLToPath(import.meta.url));
const modelsDirectory = path.join(__dirname, "models");
const modelPath = await resolveModelFile(
    "hf:user/model/model-file.gguf",
    modelsDirectory
);
const llama = await getLlama();
const model = await llama.loadModel({modelPath});
NOTE
If a corresponding model file is not found in the given directory, the model will automatically be downloaded.
When a file is being downloaded, the download progress is shown in the console by default.
Set the cli option to false to disable this behavior.
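Here's a sketch of disabling the progress output; it assumes resolveModelFile accepts an options object with directory and cli fields, so verify the exact option names against the resolveModelFile reference:
import {resolveModelFile} from "node-llama-cpp";
// assumption: the second argument can be an options object with `directory` and `cli` fields
const modelPath = await resolveModelFile("hf:user/model/model-file.gguf", {
    directory: "./models",
    cli: false // don't print download progress to the console
});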
Downloading Gated Models From Hugging Face
Some models on Hugging Face are "gated", meaning they require manual consent before you can download them.
To download such models, after completing the consent form on the model card, you need to create a Hugging Face token and set it in one of the following locations:
- Set an environment variable called HF_TOKEN to the token
- Set the content of the ~/.cache/huggingface/token file to the token
Once the token is set, the CLI, the createModelDownloader method, and the resolveModelFile method will automatically use it to download gated models.
Alternatively, you can pass the token via the tokens option when using createModelDownloader or resolveModelFile, as sketched below.
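Here's a sketch of that programmatic route. The huggingFace field name under tokens, and the directory option, are assumptions based on the library's option shapes, so verify them against the createModelDownloader and resolveModelFile references:
import {resolveModelFile} from "node-llama-cpp";
// assumption: the Hugging Face token is passed under a `huggingFace` field of the `tokens` option
const modelPath = await resolveModelFile("hf:user/gated-model/model-file.gguf", {
    directory: "./models",
    tokens: {huggingFace: process.env.HF_TOKEN}
});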
Inspecting Remote Models
You can inspect the metadata of a remote model without downloading it by using either the inspect gguf command with a URL or the readGgufFileInfo method with a URL:
import {readGgufFileInfo} from "node-llama-cpp";
const modelMetadata = await readGgufFileInfo("<model url>");
If the URL is of a model with multiple parts (either separate files or binary-split files), pass the URL of the first file and it'll automatically inspect the rest of the files and combine the metadata.
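The CLI form of the same inspection uses the inspect gguf command, following the same npx invocation shown below:
npx --no node-llama-cpp inspect gguf <model-url>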
Detecting the Compatibility of Remote Models
It's handy to check the compatibility of a remote model with your current machine hardware before downloading it, so you won't waste time downloading a model that won't work on your machine.
You can do so using the inspect estimate command with a URL:
npx --no node-llama-cpp inspect estimate <model-url>
Running this command attempts to find the best balance of parameters for running the model on your machine, and outputs the estimated compatibility of the model with flash attention turned off (the default) and with it turned on.
Note: you don't need to specify any of these configurations when loading the model; node-llama-cpp balances the parameters automatically when loading the model, creating a context, etc.
You can also estimate the compatibility of a model programmatically using the GgufInsights class:
import {getLlama, readGgufFileInfo, GgufInsights} from "node-llama-cpp";
const llama = await getLlama();
const modelMetadata = await readGgufFileInfo("<model url>");
const insights = await GgufInsights.from(modelMetadata, llama);
const resolvedConfig =
    await insights.configurationResolver.resolveAndScoreConfig();
const flashAttentionConfig =
    await insights.configurationResolver.resolveAndScoreConfig({
        flashAttention: true
    });
console.log(`Compatibility: ${resolvedConfig.compatibilityScore * 100}%`);
console.log(
    `With flash attention: ${flashAttentionConfig.compatibilityScore * 100}%`
);