Type Alias: LLamaChatGenerateResponseOptions<Functions>

type LLamaChatGenerateResponseOptions<Functions> = {
  onTextChunk?: (text: string) => void;
  onToken?: (tokens: Token[]) => void;
  onResponseChunk?: (chunk: LlamaChatResponseChunk) => void;
  signal?: AbortSignal;
  stopOnAbortSignal?: boolean;
  maxTokens?: number;
  temperature?: number;
  minP?: number;
  topK?: number;
  topP?: number;
  seed?: number;
  trimWhitespaceSuffix?: boolean;
  repeatPenalty?: false | LLamaContextualRepeatPenalty;
  tokenBias?: TokenBias | (() => TokenBias);
  evaluationPriority?: EvaluationPriority;
  contextShift?: LLamaChatContextShiftOptions;
  customStopTriggers?: readonly (LlamaText | string | readonly (string | Token)[])[];
  lastEvaluationContextWindow?: {
     history?: ChatHistoryItem[];
     minimumOverlapPercentageToPreventContextShift?: number;
  };
  onFunctionCallParamsChunk?: (chunk: LlamaChatResponseFunctionCallParamsChunk) => void;
  budgets?: {
     includeCurrentResponse?: boolean;
     thoughtTokens?: number;
     commentTokens?: number;
  };
  abortOnNonText?: boolean;
} & (
  | {
      grammar?: LlamaGrammar;
      functions?: never;
      documentFunctionParams?: never;
      maxParallelFunctionCalls?: never;
      onFunctionCall?: never;
      onFunctionCallParamsChunk?: never;
    }
  | {
      grammar?: never;
      functions?: Functions | ChatModelFunctions;
      documentFunctionParams?: boolean;
      maxParallelFunctionCalls?: number;
      onFunctionCall?: (functionCall: LlamaChatResponseFunctionCall<Functions extends ChatModelFunctions ? Functions : ChatModelFunctions>) => void;
      onFunctionCallParamsChunk?: (chunk: LlamaChatResponseFunctionCallParamsChunk) => void;
    }
);

Defined in: evaluator/LlamaChat/LlamaChat.ts:140

Type declaration

onTextChunk()?

optional onTextChunk: (text: string) => void;

Called as the model generates the main response with the generated text chunk.

Useful for streaming the generated response as it's being generated.

Includes only the main response without any text segments (like thoughts). For streaming the response with segments, use `onResponseChunk`.

Parameters

Parameter	Type
`text`	`string`

Returns

void
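
The streaming pattern described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `fakeGenerate` is a hypothetical stand-in for a real generation call so that the example runs without a model, and only the `onTextChunk` callback shape is taken from the declaration above.

```typescript
// Only the callback shape below is taken from the type declaration above.
type TextChunkOptions = {
    onTextChunk?: (text: string) => void;
};

// Hypothetical stand-in for a real generation call: it "generates" fixed chunks,
// reports each one through the callback, and returns the full text at the end.
function fakeGenerate(chunks: string[], options: TextChunkOptions = {}): string {
    let response = "";
    for (const chunk of chunks) {
        response += chunk;
        options.onTextChunk?.(chunk);
    }
    return response;
}

const streamedTextChunks: string[] = [];
const fullTextResponse = fakeGenerate(["Hel", "lo", " world"], {
    onTextChunk(text) {
        // In a real application this could write to stdout or update a UI instead
        streamedTextChunks.push(text);
    }
});
```

Joining the streamed chunks gives the same text as the final response, which is why the callback is enough for incremental display.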

onToken()?

optional onToken: (tokens: Token[]) => void;

Called as the model generates the main response with the generated tokens.

In most cases, prefer `onTextChunk` over this callback.

Includes only the main response without any segments (like thoughts). For streaming the response with segments, use `onResponseChunk`.

Parameters

Parameter	Type
`tokens`	`Token`[]

Returns

void

onResponseChunk()?

optional onResponseChunk: (chunk: LlamaChatResponseChunk) => void;

Called as the model generates a response with the generated text and tokens, including segment information (when the generated output is part of a segment).

Useful for streaming the generated response as it's being generated, including the main response and all segments.

Only use this function when you need the segmented texts, like thought segments (chain of thought text).

Parameters

Parameter	Type
`chunk`	`LlamaChatResponseChunk`

Returns

void
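
A minimal sketch of separating the main response from segments while streaming. The `ResponseChunk` type here is an assumed, simplified shape written for this example (segment chunks carry a `segmentType`, main-response chunks do not); check the actual `LlamaChatResponseChunk` definition before relying on any field name.

```typescript
// Assumed, simplified chunk shape for illustration only.
type ResponseChunk = {
    type?: "segment";
    segmentType?: "thought" | "comment";
    text: string;
};

// Routes streamed chunks into the main response or a per-segment buffer.
function createChunkCollector() {
    const result = {main: "", segments: {thought: "", comment: ""}};
    return {
        result,
        onResponseChunk(chunk: ResponseChunk): void {
            if (chunk.type === "segment" && chunk.segmentType != null)
                result.segments[chunk.segmentType] += chunk.text;
            else
                result.main += chunk.text;
        }
    };
}

const chunkCollector = createChunkCollector();
const exampleResponseChunks: ResponseChunk[] = [
    {type: "segment", segmentType: "thought", text: "The user greets me. "},
    {type: "segment", segmentType: "thought", text: "I should greet back."},
    {text: "Hello!"}
];
for (const chunk of exampleResponseChunks)
    chunkCollector.onResponseChunk(chunk);
```

After the loop, the thought text and the main response are held in separate buffers, so a UI could render them differently.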

signal?

optional signal: AbortSignal;

stopOnAbortSignal?

optional stopOnAbortSignal: boolean;

If the signal is aborted after a response has already started being generated, the generation stops and the response generated so far is returned as is instead of an error being thrown.

Defaults to false.
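
The difference between the two behaviors can be sketched with a simulated generation loop. `fakeAbortableGenerate` and `runAbortExample` are hypothetical helpers written for this example, and the thrown error is a generic `Error`; the error actually thrown by the library may differ.

```typescript
type AbortOptions = {
    signal?: AbortSignal;
    stopOnAbortSignal?: boolean;
};

// Hypothetical stand-in for a generation loop that checks the signal before each chunk.
function fakeAbortableGenerate(
    chunks: string[],
    options: AbortOptions,
    afterChunk?: (index: number) => void
): string {
    let response = "";
    for (let i = 0; i < chunks.length; i++) {
        if (options.signal?.aborted) {
            if (options.stopOnAbortSignal && response.length > 0)
                return response; // return what was generated so far
            throw new Error("Generation aborted");
        }
        response += chunks[i];
        afterChunk?.(i);
    }
    return response;
}

function runAbortExample(stopOnAbortSignal: boolean): string {
    const controller = new AbortController();
    try {
        return fakeAbortableGenerate(
            ["One", " two", " three"],
            {signal: controller.signal, stopOnAbortSignal},
            (index) => {
                if (index === 1) controller.abort(); // abort after the second chunk
            }
        );
    } catch (err) {
        return "threw: " + (err as Error).message;
    }
}

const partialResponse = runAbortExample(true);
const abortedResult = runAbortExample(false);
```

With `stopOnAbortSignal` enabled the partial text is returned; with it disabled the same abort surfaces as an error.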

maxTokens?

optional maxTokens: number;

temperature?

optional temperature: number;

Temperature is a hyperparameter that controls the randomness of the generated text. It affects the probability distribution of the model's output tokens.

A higher temperature (e.g., 1.5) makes the output more random and creative, while a lower temperature (e.g., 0.5) makes the output more focused, deterministic, and conservative.

The suggested temperature is 0.8, which provides a balance between randomness and determinism.

At the extreme, a temperature of 0 will always pick the most likely next token, leading to identical outputs in each run.

Set to 0 to disable. Disabled by default (set to 0).
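
To make the effect of temperature concrete, here is a small sketch of the standard temperature-scaled softmax. It illustrates the general technique and is not the library's sampler; `softmaxWithTemperature` is a name chosen for this example.

```typescript
// Converts raw logits to probabilities after dividing them by the temperature.
// A temperature of 0 is treated as greedy selection: all probability goes to the top token.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
    const maxLogit = Math.max(...logits);
    if (temperature === 0) {
        const topIndex = logits.indexOf(maxLogit);
        return logits.map((_, index) => (index === topIndex ? 1 : 0));
    }

    // Subtracting the maximum logit keeps Math.exp numerically stable
    const exps = logits.map((logit) => Math.exp((logit - maxLogit) / temperature));
    const sum = exps.reduce((total, value) => total + value, 0);
    return exps.map((value) => value / sum);
}

const exampleLogits = [2, 1, 0];
const sharpened = softmaxWithTemperature(exampleLogits, 0.5); // more focused
const flattened = softmaxWithTemperature(exampleLogits, 1.5); // more random
const greedy = softmaxWithTemperature(exampleLogits, 0);      // always the top token
```

The lower temperature concentrates probability on the most likely token, while the higher one spreads it across the alternatives.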

minP?

optional minP: number;

From the next token candidates, discard the percentage of tokens with the lowest probability. For example, if set to 0.05, 5% of the lowest probability tokens will be discarded. This is useful for generating more high-quality results when using a high temperature. Set to a value between 0 and 1 to enable.

Only relevant when temperature is set to a value greater than 0. Disabled by default.

topK?

optional topK: number;

Limits the model to consider only the K most likely next tokens for sampling at each step of sequence generation. An integer number between 1 and the size of the vocabulary. Set to 0 to disable (which uses the full vocabulary).

Only relevant when temperature is set to a value greater than 0.
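
A minimal sketch of top-K filtering over a list of candidates. The `TopKCandidate` type and `applyTopK` function are illustrative names, not part of the library; a real sampler would also renormalize the remaining probabilities before sampling.

```typescript
type TopKCandidate = {token: number; probability: number};

// Keeps only the K most likely candidates. A value of 0 disables the filter,
// so the full (sorted) candidate list is returned.
function applyTopK(candidates: TopKCandidate[], topK: number): TopKCandidate[] {
    const sorted = [...candidates].sort((a, b) => b.probability - a.probability);
    if (topK <= 0)
        return sorted;

    return sorted.slice(0, topK);
}

const topKCandidates: TopKCandidate[] = [
    {token: 10, probability: 0.1},
    {token: 11, probability: 0.6},
    {token: 12, probability: 0.05},
    {token: 13, probability: 0.25}
];
const topTwo = applyTopK(topKCandidates, 2);
const unfiltered = applyTopK(topKCandidates, 0);
```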

topP?

optional topP: number;

Dynamically selects the smallest set of tokens whose cumulative probability exceeds the threshold P, and samples the next token only from this set. A float number between 0 and 1. Set to 1 to disable.

Only relevant when temperature is set to a value greater than 0.
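
The selection rule for top-P (nucleus) filtering can be sketched as follows. `applyTopP` is an illustrative helper, not the library's sampler, and it treats the threshold as reached once the cumulative probability is greater than or equal to P.

```typescript
type TopPCandidate = {token: number; probability: number};

// Keeps the smallest prefix of the sorted candidates whose cumulative
// probability reaches the threshold. A threshold of 1 keeps every candidate.
function applyTopP(candidates: TopPCandidate[], topP: number): TopPCandidate[] {
    const sorted = [...candidates].sort((a, b) => b.probability - a.probability);
    const kept: TopPCandidate[] = [];
    let cumulative = 0;

    for (const candidate of sorted) {
        kept.push(candidate);
        cumulative += candidate.probability;
        if (cumulative >= topP)
            break;
    }

    return kept;
}

const topPCandidates: TopPCandidate[] = [
    {token: 1, probability: 0.5},
    {token: 2, probability: 0.3},
    {token: 3, probability: 0.15},
    {token: 4, probability: 0.05}
];
const nucleus = applyTopP(topPCandidates, 0.9);  // keeps tokens 1, 2 and 3
const everything = applyTopP(topPCandidates, 1); // keeps all four tokens
```

Unlike top-K, the number of surviving candidates varies with how peaked the distribution is.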

seed?

optional seed: number;

Used to control the randomness of the generated text.

Change the seed to get different results.

Only relevant when using temperature.
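
To illustrate why a fixed seed gives repeatable results, the sketch below samples token indices with a seeded pseudo-random generator. It uses the well-known mulberry32 generator as a stand-in; the generator used internally by the library is not specified here, and all function names are chosen for this example.

```typescript
// mulberry32: a small seeded PRNG that returns numbers in the range [0, 1).
function createSeededRandom(seed: number): () => number {
    let state = seed >>> 0;
    return () => {
        state = (state + 0x6d2b79f5) >>> 0;
        let t = state;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

// Picks an index according to the given probabilities, using the supplied random source.
function sampleIndex(probabilities: number[], random: () => number): number {
    const target = random();
    let cumulative = 0;
    for (let i = 0; i < probabilities.length; i++) {
        cumulative += probabilities[i];
        if (target < cumulative)
            return i;
    }
    return probabilities.length - 1;
}

// Samples a sequence of indices; the same seed always yields the same sequence.
function sampleSequence(seed: number, length: number): number[] {
    const random = createSeededRandom(seed);
    const probabilities = [0.5, 0.3, 0.2];
    const picks: number[] = [];
    for (let i = 0; i < length; i++)
        picks.push(sampleIndex(probabilities, random));
    return picks;
}

const firstRun = sampleSequence(42, 10);
const secondRun = sampleSequence(42, 10);
```

Both runs produce the same picks because the random source is fully determined by the seed.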

trimWhitespaceSuffix?

optional trimWhitespaceSuffix: boolean;

Trim whitespace from the end of the generated text.

Defaults to false.

repeatPenalty?

optional repeatPenalty: false | LLamaContextualRepeatPenalty;

tokenBias?

optional tokenBias: TokenBias | (() => TokenBias);

Adjust the probability of tokens being generated. Can be used to bias the model to generate tokens that you want it to lean towards, or to avoid generating tokens that you want it to avoid.
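
Conceptually, a token bias shifts the scores of specific tokens before sampling. The sketch below shows that idea with a hypothetical `BiasMap` (token ID to logit adjustment, or `"never"` to exclude the token); it is not the library's `TokenBias` class, whose API and value scale may differ. It also shows how an option that accepts either a value or a function returning one can be resolved.

```typescript
// Hypothetical representation for this sketch: token ID -> value added to its logit,
// or "never" to prevent the token from being picked at all.
type BiasMap = Map<number, number | "never">;

// The option may be given directly or as a function that returns it.
function resolveBias(option: BiasMap | (() => BiasMap)): BiasMap {
    return typeof option === "function" ? option() : option;
}

// Returns a new logits array with the bias applied; the array index is the token ID.
function applyTokenBias(logits: number[], option: BiasMap | (() => BiasMap)): number[] {
    const bias = resolveBias(option);
    return logits.map((logit, tokenId) => {
        const tokenBias = bias.get(tokenId);
        if (tokenBias === undefined)
            return logit;
        if (tokenBias === "never")
            return Number.NEGATIVE_INFINITY;
        return logit + tokenBias;
    });
}

const exampleBias: BiasMap = new Map<number, number | "never">([
    [0, 1.5],     // lean towards token 0
    [2, "never"]  // never generate token 2
]);
const biasedLogits = applyTokenBias([1, 2, 3], exampleBias);
const lazilyBiasedLogits = applyTokenBias([1, 2, 3], () => exampleBias);
```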

evaluationPriority?

optional evaluationPriority: EvaluationPriority;

See the `evaluationPriority` parameter of the `LlamaContextSequence.evaluate()` function for more information.

contextShift?

optional contextShift: LLamaChatContextShiftOptions;

customStopTriggers?

optional customStopTriggers: readonly (LlamaText | string | readonly (string | Token)[])[];

Custom stop triggers to stop the generation of the response when any of the provided triggers are found.
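
The idea of a stop trigger can be sketched for the simplest case of plain string triggers. `cutAtStopTrigger` is a hypothetical helper, and it assumes that the trigger text itself is left out of the result; the library's handling of `LlamaText` and token-based triggers is not covered here.

```typescript
// Cuts the text at the earliest occurrence of any trigger.
// Returns the text unchanged when no trigger is found.
function cutAtStopTrigger(
    text: string,
    triggers: readonly string[]
): {text: string; stopped: boolean} {
    let earliest = -1;
    for (const trigger of triggers) {
        if (trigger.length === 0)
            continue; // an empty trigger would match everywhere

        const index = text.indexOf(trigger);
        if (index !== -1 && (earliest === -1 || index < earliest))
            earliest = index;
    }

    if (earliest === -1)
        return {text, stopped: false};

    return {text: text.slice(0, earliest), stopped: true};
}

const stoppedResult = cutAtStopTrigger("Answer: 42\nUser: next question", ["\nUser:", "###"]);
const untouchedResult = cutAtStopTrigger("Answer: 42", ["\nUser:", "###"]);
```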

lastEvaluationContextWindow?

optional lastEvaluationContextWindow: {
  history?: ChatHistoryItem[];
  minimumOverlapPercentageToPreventContextShift?: number;
};

The evaluation context window returned from the last evaluation. This is an optimization to utilize existing context sequence state better when possible.

lastEvaluationContextWindow.history?

optional history: ChatHistoryItem[];

The history of the last evaluation.

lastEvaluationContextWindow.minimumOverlapPercentageToPreventContextShift?

optional minimumOverlapPercentageToPreventContextShift: number;

Minimum overlap percentage with existing context sequence state to use the last evaluation context window. If the last evaluation context window is not used, a new context will be generated based on the full history, which will decrease the likelihood of another context shift happening so soon.

A number between 0 (exclusive) and 1 (inclusive).

onFunctionCallParamsChunk()?

optional onFunctionCallParamsChunk: (chunk: LlamaChatResponseFunctionCallParamsChunk) => void;

Called as the model generates function calls with the generated parameters chunk for each function call.

Useful for streaming the generated function call parameters as they're being generated. Only useful in specific use cases, such as showing the generated textual file content as it's being generated (note that doing this requires parsing incomplete JSON).

The constructed text from all the params chunks of a given function call can be parsed as a JSON object, according to the function parameters schema.

Each function call has its own callIndex you can use to distinguish between them.

Only relevant when using function calling (by passing the `functions` option).

Parameters

Parameter	Type
`chunk`	`LlamaChatResponseFunctionCallParamsChunk`

Returns

void
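
A minimal sketch of collecting parameter chunks per function call and parsing them once complete. The `ParamsChunk` type is an assumed, simplified shape: only `callIndex` is named in the description above, and the `paramsChunk` field name is an assumption of this example.

```typescript
// Assumed, simplified chunk shape for illustration only.
type ParamsChunk = {
    callIndex: number;
    paramsChunk: string;
};

// Accumulates the streamed parameter text separately for each function call.
function createParamsCollector() {
    const buffers = new Map<number, string>();
    return {
        onFunctionCallParamsChunk(chunk: ParamsChunk): void {
            buffers.set(chunk.callIndex, (buffers.get(chunk.callIndex) ?? "") + chunk.paramsChunk);
        },
        // Call this only after the parameters of a call are complete;
        // partial text is generally not valid JSON.
        parse(callIndex: number): unknown {
            return JSON.parse(buffers.get(callIndex) ?? "null");
        }
    };
}

const paramsCollector = createParamsCollector();
const exampleParamsChunks: ParamsChunk[] = [
    {callIndex: 0, paramsChunk: '{"city": "Par'},
    {callIndex: 1, paramsChunk: '{"path": '},
    {callIndex: 0, paramsChunk: 'is"}'},
    {callIndex: 1, paramsChunk: '"notes.txt"}'}
];
for (const chunk of exampleParamsChunks)
    paramsCollector.onFunctionCallParamsChunk(chunk);

const firstCallParams = paramsCollector.parse(0);
const secondCallParams = paramsCollector.parse(1);
```

Keying the buffers by `callIndex` keeps interleaved chunks from parallel function calls apart.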

budgets?

optional budgets: {
  includeCurrentResponse?: boolean;
  thoughtTokens?: number;
  commentTokens?: number;
};

Set the maximum number of tokens the model is allowed to spend on various segmented responses.

budgets.includeCurrentResponse?

optional includeCurrentResponse: boolean;

Whether to include the tokens already consumed by the current model response being completed in the budget.

Defaults to true.

budgets.thoughtTokens?

optional thoughtTokens: number;

Budget for thought tokens.

Defaults to Infinity.

budgets.commentTokens?

optional commentTokens: number;

Budget for comment tokens.

Defaults to Infinity.

abortOnNonText?

optional abortOnNonText: boolean;

Stop the generation when the model tries to generate a non-textual segment or call a function.

Useful for generating completions in the form of a model response.

Defaults to false.

Type Parameters

Type Parameter	Default type
`Functions` extends `ChatModelFunctions` \| `undefined`	`undefined`
