Type Alias: LlamaContextOptions
```ts
type LlamaContextOptions = {
  sequences?: number;
  contextSize?: "auto" | number | {
      min?: number;
      max?: number;
  };
  batchSize?: number;
  flashAttention?: boolean;
  threads?: number | {
      ideal?: number;
      min?: number;
  };
  batching?: BatchingOptions;
  lora?: string | {
      adapters: {
          filePath: string;
          scale: number;
      }[];
      onLoadProgress?: (loadProgress: number) => void;
  };
  createSignal?: AbortSignal;
  ignoreMemorySafetyChecks?: boolean;
  failedCreationRemedy?: false | {
      retries?: number;
      autoContextSizeShrink?: number | ((contextSize: number) => number);
  };
  performanceTracking?: boolean;
};
```
Defined in: evaluator/LlamaContext/types.ts:8
Properties
sequences?
optional sequences: number;
Defined in: evaluator/LlamaContext/types.ts:19
The number of sequences for the context. Each sequence is a different "text generation process" that can run in parallel to other sequences in the same context. Although a single context can have multiple sequences, the sequences are separate from each other and do not share data with each other. This is beneficial for performance, as multiple sequences can be evaluated in parallel (on the same batch).
Each sequence increases the memory usage of the context.
Defaults to 1.
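For illustration, a minimal sketch of creating a context with two sequences (the model path is hypothetical, and the setup assumes the usual getLlama() / loadModel() flow):

```ts
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // hypothetical path

// Two independent "text generation processes" that can be evaluated in parallel
const context = await model.createContext({sequences: 2});
const sequenceA = context.getSequence();
const sequenceB = context.getSequence();
```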
contextSize?
```ts
optional contextSize: "auto" | number | {
    min?: number;
    max?: number;
};
```
Defined in: evaluator/LlamaContext/types.ts:33
The number of tokens the model can see at once.
* "auto" - adapt to the current VRAM state and attempt to set the context size as high as possible, up to the size the model was trained on.
* number - set the context size to a specific number of tokens. If there's not enough VRAM, an error will be thrown. Use with caution.
* {min?: number, max?: number} - adapt to the current VRAM state and attempt to set the context size as high as possible, up to the size the model was trained on, but at least min and at most max.
Defaults to "auto"
.
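For example, the three accepted forms look like this (an illustrative sketch; it assumes the LlamaContextOptions type is importable from the package, and the values are arbitrary):

```ts
import type {LlamaContextOptions} from "node-llama-cpp";

const adaptive: LlamaContextOptions = {contextSize: "auto"}; // adapt to the current VRAM state
const fixed: LlamaContextOptions = {contextSize: 4096}; // exact size; throws if there's not enough VRAM
const bounded: LlamaContextOptions = {contextSize: {min: 2048, max: 8192}}; // adapt, but stay within bounds
```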
batchSize?
optional batchSize: number;
Defined in: evaluator/LlamaContext/types.ts:43
The number of tokens that can be processed at once by the GPU.
Defaults to 512, or to contextSize if contextSize is less than 512.
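For example (an illustrative sketch; the values are arbitrary):

```ts
import type {LlamaContextOptions} from "node-llama-cpp";

// Let the GPU process up to 1024 tokens per batch instead of the default 512
const options: LlamaContextOptions = {
    contextSize: 8192,
    batchSize: 1024
};
```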
flashAttention?
optional flashAttention: boolean;
Defined in: evaluator/LlamaContext/types.ts:58
Flash attention is an optimization in the attention mechanism that makes inference faster, more efficient and uses less memory.
The support for flash attention is currently experimental and may not always work as expected. Use with caution.
This option will be ignored if flash attention is not supported by the model.
Defaults to false (inherited from the model option defaultContextFlashAttention).
Upon flash attention exiting the experimental status, the default value will become true (the inherited value from the model option defaultContextFlashAttention will become true).
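A minimal sketch of opting in (the option is simply passed alongside the other context options):

```ts
import type {LlamaContextOptions} from "node-llama-cpp";

// Opt into the experimental flash attention optimization.
// The option is ignored if the model doesn't support flash attention.
const options: LlamaContextOptions = {
    flashAttention: true
};
```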
threads?
```ts
optional threads: number | {
    ideal?: number;
    min?: number;
};
```
Defined in: evaluator/LlamaContext/types.ts:73
The number of threads to use to evaluate tokens. Set to 0 to use the maximum number of threads supported by the current machine hardware.
This value is considered a hint, and the actual number of threads used may be lower when other evaluations are running. To ensure the minimum number of threads you want is always used, set this to an object with a min property (see the min property description for more details).
If maxThreads from the Llama instance is set to 0, this value will always be the actual number of threads used.
If maxThreads from the Llama instance is set to 0, defaults to the .cpuMathCores value from the Llama instance, otherwise defaults to maxThreads from the Llama instance (see the maxThreads option of the getLlama method for more details).
Type declaration
number
{ ideal?: number; min?: number }
ideal?
optional ideal: number;
The ideal number of threads to use for evaluations.
If other evaluations are running, the actual number of threads may be lower than this value.
If maxThreads from the Llama instance is set to 0, this value will always be the actual number of threads used.
If maxThreads from the Llama instance is set to 0, defaults to the .cpuMathCores value from the Llama instance, otherwise defaults to maxThreads from the Llama instance (see the maxThreads option of the getLlama method for more details).
min?
optional min: number;
Ensure evaluations always use at least this number of threads.
Use with caution, since setting this value too high can cause the context to wait a long time to reserve this number of threads before an evaluation can start.
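For example, a sketch of requesting an ideal thread count while guaranteeing a minimum (the numbers are arbitrary):

```ts
import type {LlamaContextOptions} from "node-llama-cpp";

// Aim for 8 threads, but never evaluate with fewer than 4
const options: LlamaContextOptions = {
    threads: {
        ideal: 8,
        min: 4
    }
};
```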
batching?
optional batching: BatchingOptions;
Defined in: evaluator/LlamaContext/types.ts:100
Control the behavior of parallel sequence processing.
See BatchingOptions for more information.
lora?
```ts
optional lora: string | {
    adapters: {
        filePath: string;
        scale: number;
    }[];
    onLoadProgress?: (loadProgress: number) => void;
};
```
Defined in: evaluator/LlamaContext/types.ts:109
Load the provided LoRA adapters onto the context. LoRA adapters are used to modify the weights of a pretrained model to adapt to new tasks or domains without the need for extensive retraining from scratch.
If a string is provided, it will be treated as a path to a single LoRA adapter file.
Type declaration
string
{ adapters: { filePath: string; scale: number }[]; onLoadProgress?: (loadProgress: number) => void }
adapters
```ts
adapters: {
    filePath: string;
    scale: number;
}[];
```
onLoadProgress()?
Called with the LoRA adapters load percentage when the LoRA adapters are being loaded.
Parameters
| Parameter | Type | Description |
| --- | --- | --- |
| loadProgress | number | A number between 0 (exclusive) and 1 (inclusive). |
Returns
void
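For example, a sketch of loading a single LoRA adapter with a progress callback (the adapter path is hypothetical):

```ts
import type {LlamaContextOptions} from "node-llama-cpp";

const options: LlamaContextOptions = {
    lora: {
        adapters: [{
            filePath: "path/to/adapter.gguf", // hypothetical path
            scale: 1
        }],
        onLoadProgress(loadProgress: number) {
            // loadProgress is a number between 0 (exclusive) and 1 (inclusive)
            console.log(`LoRA load progress: ${Math.round(loadProgress * 100)}%`);
        }
    }
};
```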
createSignal?
optional createSignal: AbortSignal;
Defined in: evaluator/LlamaContext/types.ts:127
An abort signal to abort the context creation.
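For example, a sketch of aborting the context creation after a timeout (the model path is hypothetical, and the setup assumes the usual getLlama() / loadModel() flow):

```ts
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // hypothetical path

// Abort the context creation if it takes longer than 30 seconds
const context = await model.createContext({
    createSignal: AbortSignal.timeout(30_000)
});
```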
ignoreMemorySafetyChecks?
optional ignoreMemorySafetyChecks: boolean;
Defined in: evaluator/LlamaContext/types.ts:135
Ignore insufficient memory errors and continue with the context creation. Can cause the process to crash if there's not enough VRAM for the new context.
Defaults to false.
failedCreationRemedy?
```ts
optional failedCreationRemedy: false | {
    retries?: number;
    autoContextSizeShrink?: number | ((contextSize: number) => number);
};
```
Defined in: evaluator/LlamaContext/types.ts:144
On failed context creation, retry the creation with a smaller context size.
Only works if contextSize is set to "auto", left as default, or set to an object with min and/or max properties.
Set retries to false to disable.
Type declaration
false
{ retries?: number; autoContextSizeShrink?: number | ((contextSize: number) => number) }
retries?
optional retries: number;
The number of retries to attempt when creating the context.
Defaults to 6.
autoContextSizeShrink?
optional autoContextSizeShrink: number | ((contextSize: number) => number);
The percentage to decrease the context size by on each retry. Should be a number between 0 and 1.
If a function is provided, it will be called with the current context size and should return the new context size.
Defaults to 0.16.
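For example, a sketch of retrying with a custom shrink function (the numbers are arbitrary):

```ts
import type {LlamaContextOptions} from "node-llama-cpp";

const options: LlamaContextOptions = {
    contextSize: "auto",
    failedCreationRemedy: {
        retries: 4,
        // Shrink the attempted context size by 25% on each retry
        autoContextSizeShrink: (contextSize: number) => Math.floor(contextSize * 0.75)
    }
};
```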
performanceTracking?
optional performanceTracking: boolean;
Defined in: evaluator/LlamaContext/types.ts:168
Track the inference performance of the context, so using .printTimings() will work.
Defaults to false.
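For example, a sketch of enabling performance tracking and printing the timings afterwards (the model path is hypothetical, and the setup assumes the usual getLlama() / loadModel() flow):

```ts
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "path/to/model.gguf"}); // hypothetical path

const context = await model.createContext({
    performanceTracking: true
});

// ... run some evaluations on the context's sequences ...

await context.printTimings(); // only works when performanceTracking is enabled
```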