Class: LlamaContext
Defined in: evaluator/LlamaContext/LlamaContext.ts:65
Properties
onDispose
```ts
readonly onDispose: EventRelay<void>;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:98
Accessors
disposed
Get Signature
```ts
get disposed(): boolean;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:215
Returns
boolean
model
Get Signature
```ts
get model(): LlamaModel;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:219
Returns
LlamaModel
contextSize
Get Signature
```ts
get contextSize(): number;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:223
Returns
number
batchSize
Get Signature
```ts
get batchSize(): number;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:227
Returns
number
flashAttention
Get Signature
```ts
get flashAttention(): boolean;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:231
Returns
boolean
kvCacheKeyType
Get Signature
```ts
get kvCacheKeyType(): GgmlType;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:235
Returns
GgmlType
kvCacheValueType
Get Signature
```ts
get kvCacheValueType(): GgmlType;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:239
Returns
GgmlType
stateSize
Get Signature
```ts
get stateSize(): number;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:247
The actual size of the state in memory, in bytes. This value is provided by llama.cpp and doesn't include all the memory overhead of the context.
Returns
number
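Since `stateSize` is a plain byte count, it can be formatted for display with ordinary arithmetic. A minimal sketch (the `formatBytes` helper is hypothetical, not part of the library):

```typescript
// Hypothetical helper: format a byte count (such as context.stateSize)
// into a human-readable binary unit. Plain arithmetic, no library calls.
function formatBytes(bytes: number): string {
    if (bytes >= 1024 * 1024 * 1024)
        return (bytes / (1024 * 1024 * 1024)).toFixed(2) + " GiB";
    if (bytes >= 1024 * 1024)
        return (bytes / (1024 * 1024)).toFixed(2) + " MiB";
    if (bytes >= 1024)
        return (bytes / 1024).toFixed(2) + " KiB";
    return bytes + " B";
}

// With a live context: console.log(formatBytes(context.stateSize));
console.log(formatBytes(268435456)); // "256.00 MiB"
```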
currentThreads
Get Signature
```ts
get currentThreads(): number;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:254
The number of threads currently used to evaluate tokens
Returns
number
idealThreads
Get Signature
```ts
get idealThreads(): number;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:265
The number of threads that are preferred to be used to evaluate tokens.
The actual number of threads used may be lower when other evaluations are running in parallel.
Returns
number
totalSequences
Get Signature
```ts
get totalSequences(): number;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:278
Returns
number
sequencesLeft
Get Signature
```ts
get sequencesLeft(): number;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:282
Returns
number
Methods
dispose()
```ts
dispose(): Promise<void>;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:201
Returns
Promise<void>
getAllocatedContextSize()
```ts
getAllocatedContextSize(): number;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:269
Returns
number
getSequence()
```ts
getSequence(options?: {
    contextShift?: ContextShiftOptions;
    tokenPredictor?: TokenPredictor;
    checkpoints?: {
        max?: number;
        interval?: number | false;
        maxMemory?: number | null;
    };
}): LlamaContextSequence;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:290
Before calling this method, check sequencesLeft to make sure there are sequences left; when there are no sequences left, this method throws an error.
Parameters
| Parameter | Type | Description |
|---|---|---|
| options | { contextShift?: ContextShiftOptions; tokenPredictor?: TokenPredictor; checkpoints?: { max?: number; interval?: number \| false; maxMemory?: number \| null; }; } | - |
| options.contextShift? | ContextShiftOptions | - |
| options.tokenPredictor? | TokenPredictor | Token predictor to use for the sequence. Don't share the same token predictor between multiple sequences. Using a token predictor doesn't affect the generation output itself - it only allows for greater parallelization of the token evaluation to speed up the generation. Note that if a token predictor is too resource-intensive, it can slow down the generation process due to the overhead of running the predictor. Testing the effectiveness of a token predictor on the target machine is recommended before using it in production. Automatically disposed when disposing the sequence. See Using Token Predictors. |
| options.checkpoints? | { max?: number; interval?: number \| false; maxMemory?: number \| null; } | Options for context state checkpoints for the sequence. When reusing a prefix evaluation state is not possible for the context sequence (like in contexts from recurrent and hybrid models, or with models that use SWA (Sliding Window Attention) when the swaFullCache option is not enabled on the context), storing checkpoints allows reusing the context state at certain points in the sequence to speed up the evaluation when erasing parts of the context state that come after those points. Those checkpoints are automatically used when trying to erase parts of the context state that come after a checkpointed state, and are freed from memory when no longer relevant. Checkpoints are relatively lightweight compared to saving the entire state, but taking too many checkpoints can increase memory usage. Checkpoints are stored in RAM (not VRAM). See LlamaContextSequence.takeCheckpoint for more details on how checkpoints are taken and used. |
| options.checkpoints.max? | number | The maximum number of checkpoints to keep for the sequence when needed. Defaults to 32. |
| options.checkpoints.interval? | number \| false | Take a checkpoint every interval tokens when the sequence needs taking checkpoints. Defaults to 8192. |
| options.checkpoints.maxMemory? | number \| null | The maximum memory in bytes to use for checkpoints for the sequence when needed. When taking a checkpoint causes the checkpoints pool memory to exceed this value, older checkpoints are pruned until the total checkpoints memory usage is under this limit, while ensuring that at least one checkpoint is kept. Defaults to null (no memory limit). |
Returns
LlamaContextSequence
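The checkpoints options above can be sketched as a plain object, using the documented defaults. The getSequence call itself is shown in a comment since it needs a live LlamaContext:

```typescript
// Checkpoints options for getSequence(), filled with the documented defaults
const sequenceOptions = {
    checkpoints: {
        max: 32,                           // keep at most 32 checkpoints (the default)
        interval: 8192,                    // take a checkpoint every 8192 tokens (the default)
        maxMemory: null as number | null   // no memory cap on the checkpoints pool (the default)
    }
};

// With a live context:
// if (context.sequencesLeft === 0)
//     throw new Error("No sequences left in this context");
// const sequence = context.getSequence(sequenceOptions);

console.log(sequenceOptions.checkpoints.max); // 32
```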
dispatchPendingBatch()
```ts
dispatchPendingBatch(): void;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:387
Returns
void
printTimings()
```ts
printTimings(): Promise<void>;
```
Defined in: evaluator/LlamaContext/LlamaContext.ts:704
Print the timings of token evaluation since the last print for this context.
Requires the performanceTracking option to be enabled.
Note: it prints on the LlamaLogLevel.info level, so if you set the log level of your Llama instance higher than that, it won't print anything.
Returns
Promise<void>