# Type Alias: LlamaModelOptions

```ts
type LlamaModelOptions = {
  modelPath: string;
  gpuLayers?:
    | "auto"
    | "max"
    | number
    | {
        min?: number;
        max?: number;
        fitContext?: {
          contextSize?: number;
          embeddingContext?: boolean;
        };
      };
  vocabOnly?: boolean;
  useMmap?: boolean;
  useDirectIo?: boolean;
  useMlock?: boolean;
  checkTensors?: boolean;
  defaultContextFlashAttention?: boolean;
  experimentalDefaultContextKvCacheKeyType?:
    | "currentQuant"
    | keyof typeof GgmlType
    | GgmlType;
  experimentalDefaultContextKvCacheValueType?:
    | "currentQuant"
    | keyof typeof GgmlType
    | GgmlType;
  defaultContextSwaFullCache?: boolean;
  onLoadProgress?: (loadProgress: number) => void;
  loadSignal?: AbortSignal;
  ignoreMemorySafetyChecks?: boolean;
  metadataOverrides?: OverridesObject<GgufMetadata, number | bigint | boolean | string>;
};
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:27`
## Properties

### modelPath

```ts
modelPath: string;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:29`

The path to the model on the filesystem.
### gpuLayers?

```ts
optional gpuLayers:
  | "auto"
  | "max"
  | number
  | {
      min?: number;
      max?: number;
      fitContext?: {
        contextSize?: number;
        embeddingContext?: boolean;
      };
    };
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:45`
Number of layers to store in VRAM.
- `"auto"` - adapt to the current VRAM state and try to fit as many layers as possible in it. Takes into account the VRAM required to create a context with a `contextSize` set to `"auto"`.
- `"max"` - store all layers in VRAM. If there's not enough VRAM, an error will be thrown. Use with caution.
- `number` - store the specified number of layers in VRAM. If there's not enough VRAM, an error will be thrown. Use with caution.
- `{min?: number, max?: number, fitContext?: {contextSize: number}}` - adapt to the current VRAM state and try to fit as many layers as possible in it, but at least `min` and at most `max` layers. Set `fitContext` to the parameters of a context you intend to create with the model, so it'll take it into account in the calculations and leave enough memory for such a context.
If GPU support is disabled, will be set to 0 automatically.
Defaults to "auto".
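The accepted forms can be illustrated with a small self-contained sketch. The `GpuLayersOption` type below is a local stand-in mirroring the union above, and `describeGpuLayers` is a hypothetical helper used only to show how each form is meant to be read; neither is part of the library:

```typescript
// Local stand-in mirroring the gpuLayers union from the type alias above.
type GpuLayersOption =
    | "auto"
    | "max"
    | number
    | {
        min?: number,
        max?: number,
        fitContext?: {contextSize?: number, embeddingContext?: boolean}
    };

// Hypothetical helper: turns each accepted form into a plain-English reading.
function describeGpuLayers(option: GpuLayersOption): string {
    if (option === "auto")
        return "fit as many layers as possible in the available VRAM";

    if (option === "max")
        return "store all layers in VRAM, throw if it doesn't fit";

    if (typeof option === "number")
        return `store exactly ${option} layers in VRAM, throw if it doesn't fit`;

    const min = option.min ?? 0;
    const max = option.max != null ? String(option.max) : "all";
    return `fit between ${min} and ${max} layers, reserving memory for the intended context`;
}

console.log(describeGpuLayers({min: 8, fitContext: {contextSize: 4096}}));
```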
### vocabOnly?

```ts
optional vocabOnly: boolean;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:65`
Only load the vocabulary, not weight tensors.
Useful when you only want to use the model's tokenizer but not run evaluation.
Defaults to false.
### useMmap?

```ts
optional useMmap: boolean;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:78`
Use mmap (memory-mapped file) to load the model.
Using mmap allows the OS to load the model tensors directly from the file on the filesystem, and makes it easier for the system to manage memory.
When using mmap, you might notice a delay the first time you actually use the model, which is caused by the OS itself loading the model into memory.
Defaults to true if the current system supports it.
### useDirectIo?

```ts
optional useDirectIo: boolean;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:94`
Direct I/O is a method of reading and writing data between the storage device and the application memory directly, bypassing the OS in-memory caches.
It can improve model loading times and reduce RAM usage, at the expense of slower loading when the model is unloaded and loaded again repeatedly within a short period of time.
When this option is enabled and Direct I/O is supported by the system (and for the given file), it will be used and mmap will be disabled.
Unsupported on macOS.
Defaults to false.
### useMlock?

```ts
optional useMlock: boolean;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:100`
Force the system to keep the model in the RAM/VRAM. Use with caution as this can crash your system if the available resources are insufficient.
### checkTensors?

```ts
optional checkTensors: boolean;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:108`
Check for tensor validity before actually loading the model. Using it increases the time it takes to load the model.
Defaults to false.
### defaultContextFlashAttention?

```ts
optional defaultContextFlashAttention: boolean;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:129`
Enable flash attention by default for contexts created with this model. Only works with models that support flash attention.
Flash attention is an optimization in the attention mechanism that makes inference faster, more efficient and uses less memory.
The support for flash attention is currently experimental and may not always work as expected. Use with caution.
This option will be ignored if flash attention is not supported by the model.
Enabling this affects the calculations of default values for the model and contexts created with it, as flash attention reduces the amount of memory required, which allows more layers to be offloaded to the GPU and larger context sizes to be used.
Defaults to false.
Upon flash attention exiting the experimental status, the default value will become true.
### experimentalDefaultContextKvCacheKeyType?

```ts
optional experimentalDefaultContextKvCacheKeyType:
  | "currentQuant"
  | keyof typeof GgmlType
  | GgmlType;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:145`
#### Experimental
The default type of the key for the KV cache tensors used for contexts created with this model.
Set to "currentQuant" to use the same type as the current quantization of the model weights tensors.
Defaults to F16.
#### Deprecated
- this option is experimental and highly unstable. Only use with a hard-coded model and on specific hardware that you verify where the type passed to this option works correctly. Avoid allowing end users to configure this option, as it's highly unstable.
- this option is experimental and highly unstable. It may not work as intended or even crash the process. Use with caution. This option may change or get removed in the future without a breaking change version.
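The union accepts the `"currentQuant"` sentinel, an enum key name, or an enum value. A self-contained sketch of how the three forms collapse to a single type (the `GgmlType` enum below is a local stand-in with illustrative values, not the library's real enum, and `resolveKvCacheType` is a hypothetical helper):

```typescript
// Local stand-in for the GgmlType enum (values here are illustrative only).
enum GgmlType {
    F32 = 0,
    F16 = 1,
    Q8_0 = 8
}

// Mirrors the union above: a quant-following sentinel, an enum key, or an enum value.
type KvCacheType = "currentQuant" | keyof typeof GgmlType | GgmlType;

// Hypothetical resolver showing how each form maps to one concrete enum value.
function resolveKvCacheType(type: KvCacheType, currentQuant: GgmlType): GgmlType {
    if (type === "currentQuant")
        return currentQuant; // follow the model weights' quantization

    if (typeof type === "string")
        return GgmlType[type]; // enum key name, e.g. "F16"

    return type; // already an enum value
}
```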
### experimentalDefaultContextKvCacheValueType?

```ts
optional experimentalDefaultContextKvCacheValueType:
  | "currentQuant"
  | keyof typeof GgmlType
  | GgmlType;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:161`
#### Experimental
The default type of the value for the KV cache tensors used for contexts created with this model.
Set to "currentQuant" to use the same type as the current quantization of the model weights tensors.
Defaults to F16.
#### Deprecated
- this option is experimental and highly unstable. Only use with a hard-coded model and on specific hardware that you verify where the type passed to this option works correctly. Avoid allowing end users to configure this option, as it's highly unstable.
- this option is experimental and highly unstable. It may not work as intended or even crash the process. Use with caution. This option may change or get removed in the future without a breaking change version.
### defaultContextSwaFullCache?

```ts
optional defaultContextSwaFullCache: boolean;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:172`
When using SWA (Sliding Window Attention) on a supported model, extend the sliding window size to the current context size (meaning practically disabling SWA) by default for contexts created with this model.
See the `swaFullCache` option of the `.createContext()` method for more information.
Defaults to false.
### loadSignal?

```ts
optional loadSignal: AbortSignal;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:181`

An abort signal to abort the model load.
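The option follows the standard `AbortSignal` pattern. A self-contained sketch of the idea, using a stand-in async loader (`loadWithSignal` is hypothetical; with the real library you would pass the signal as `loadSignal` in the options of a model-loading call):

```typescript
// Stand-in async loader demonstrating the AbortSignal pattern used by loadSignal.
async function loadWithSignal(signal: AbortSignal): Promise<string> {
    if (signal.aborted)
        throw new Error("Model load aborted");

    return "model loaded";
}

const controller = new AbortController();

// Abort the load if it takes longer than 30 seconds.
const timeout = setTimeout(() => controller.abort(), 30_000);

loadWithSignal(controller.signal)
    .then((result) => console.log(result))
    .finally(() => clearTimeout(timeout));
```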
### ignoreMemorySafetyChecks?

```ts
optional ignoreMemorySafetyChecks: boolean;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:189`
Ignore insufficient memory errors and continue with the model load. Can cause the process to crash if there's not enough VRAM to fit the model.
Defaults to false.
### metadataOverrides?

```ts
optional metadataOverrides: OverridesObject<GgufMetadata, number | bigint | boolean | string>;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:198`
Metadata overrides to load the model with.

Note: most metadata value overrides aren't supported, and overriding them will have no effect on `llama.cpp`. Only use this for metadata values that are explicitly documented as overridable in `llama.cpp`, and only when doing so is crucial, as this is not guaranteed to always work as expected.
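The overrides object mirrors the metadata structure as a nested partial: only the keys you want to override need to be present. A minimal sketch of the idea, using simplified local types (the real `OverridesObject` and `GgufMetadata` types are more elaborate, and the metadata keys shown are illustrative):

```typescript
// A simplified deep-partial type in the spirit of OverridesObject<GgufMetadata, ...>.
type Overrides<T> = {
    [K in keyof T]?: T[K] extends object ? Overrides<T[K]> : T[K]
};

// Tiny illustrative slice of GGUF-style metadata (not the real GgufMetadata type).
type ExampleMetadata = {
    general: {name: string},
    tokenizer: {ggml: {add_bos_token: boolean}}
};

// Only the overridden keys appear; everything else is left to the model file.
const metadataOverrides: Overrides<ExampleMetadata> = {
    tokenizer: {ggml: {add_bos_token: false}}
};
```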
## Methods

### onLoadProgress()?

```ts
optional onLoadProgress(loadProgress: number): void;
```

Defined in: `evaluator/LlamaModel/LlamaModel.ts:178`
Called with the load percentage when the model is being loaded.
#### Parameters

| Parameter | Type | Description |
|---|---|---|
| `loadProgress` | `number` | A number between 0 (exclusive) and 1 (inclusive). |
#### Returns

`void`
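A typical use is rendering a progress indicator. A self-contained sketch, using a stand-in loader (`simulateLoad` is hypothetical; the real library invokes the callback during the actual model load):

```typescript
// Stand-in loader that reports progress the way onLoadProgress is described:
// called with values between 0 (exclusive) and 1 (inclusive).
function simulateLoad(onLoadProgress: (loadProgress: number) => void): void {
    for (const progress of [0.25, 0.5, 0.75, 1])
        onLoadProgress(progress);
}

const seen: number[] = [];
simulateLoad((loadProgress) => {
    seen.push(loadProgress);
    console.log(`Loading model: ${Math.round(loadProgress * 100)}%`);
});
```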