
Using LlamaChatSession

To chat with a text generation model, you can use the LlamaChatSession class.

Here are usage examples of LlamaChatSession:

Simple Chatbot

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);


const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

Specific Chat Wrapper

To learn more about chat wrappers, see the chat wrapper guide.

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, GeneralChatWrapper} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper: new GeneralChatWrapper()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);


const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

Response Streaming

You can see all the possible options of the prompt function here.

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

process.stdout.write("AI: ");
const a1 = await session.prompt(q1, {
    onTextChunk(chunk: string) {
        // stream the response to the console as it is being generated
        process.stdout.write(chunk);
    }
});

Repeat Penalty Customization

You can see all the possible options of the prompt function here.

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, Token} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Write a poem about llamas";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    repeatPenalty: {
        lastTokens: 24,
        penalty: 1.12,
        penalizeNewLine: true,
        frequencyPenalty: 0.02,
        presencePenalty: 0.02,
        punishTokensFilter(tokens: Token[]) {
            return tokens.filter((token) => {
                const text = model.detokenize([token]);

                // allow the model to repeat tokens
                // that contain the word "better"
                return !text.toLowerCase().includes("better");
            });
        }
    }
});
console.log("AI: " + a1);

Custom Temperature

Setting the temperature option is useful for controlling the randomness of the model's responses.

A temperature of 0 (the default) will ensure the model response is always deterministic for a given prompt.

The randomness introduced by the temperature can be controlled with the seed parameter: setting a specific seed together with a specific temperature will yield the same response every time for the same input.

You can see the description of the prompt function options here.

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    temperature: 0.8,
    topK: 40,
    topP: 0.02,
    seed: 2462
});
console.log("AI: " + a1);
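
As a rough sketch of the determinism described above (reusing only the session, APIs, and options already shown in this example), resetting the chat history between prompts and reusing the same seed and temperature should reproduce the same response:

typescript
// a sketch, assuming `session` from the example above
const initialChatHistory = session.getChatHistory();

const first = await session.prompt("Hi there, how are you?", {
    temperature: 0.8,
    seed: 2462
});

// reset the session to its initial state before repeating the prompt
session.setChatHistory(initialChatHistory);

const second = await session.prompt("Hi there, how are you?", {
    temperature: 0.8,
    seed: 2462
});

// with the same seed, temperature, and state, both responses should match
console.log(first === second);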

JSON Response

To learn more about grammars, see the grammar guide.

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const grammar = await llama.getGrammarFor("json");


const q1 = 'Create a JSON that contains a message saying "hi there"';
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    grammar,
    maxTokens: context.contextSize
});
console.log("AI: " + a1);
console.log(JSON.parse(a1));


const q2 = 'Add another field to the JSON with the key being "author" ' +
    'and the value being "Llama"';
console.log("User: " + q2);

const a2 = await session.prompt(q2, {
    grammar,
    maxTokens: context.contextSize
});
console.log("AI: " + a2);
console.log(JSON.parse(a2));

JSON Response With a Schema

To learn more about the JSON schema grammar, see the grammar guide.

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const grammar = await llama.createGrammarForJsonSchema({
    type: "object",
    properties: {
        positiveWordsInUserMessage: {
            type: "array",
            items: {
                type: "string"
            }
        },
        userMessagePositivityScoreFromOneToTen: {
            enum: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        },
        nameOfUser: {
            oneOf: [{
                type: "null"
            }, {
                type: "string"
            }]
        }
    }
});

const prompt = "Hi there! I'm John. Nice to meet you!";

const res = await session.prompt(prompt, {grammar});
const parsedRes = grammar.parse(res);

console.log("User name:", parsedRes.nameOfUser);
console.log(
    "Positive words in user message:",
    parsedRes.positiveWordsInUserMessage
);
console.log(
    "User message positivity score:",
    parsedRes.userMessagePositivityScoreFromOneToTen
);

Function Calling

To learn more about using function calling, read the function calling guide.

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, defineChatSessionFunction} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const fruitPrices: Record<string, string> = {
    "apple": "$6",
    "banana": "$4"
};
const functions = {
    getFruitPrice: defineChatSessionFunction({
        description: "Get the price of a fruit",
        params: {
            type: "object",
            properties: {
                name: {
                    type: "string"
                }
            }
        },
        async handler(params) {
            const name = params.name.toLowerCase();
            if (Object.keys(fruitPrices).includes(name))
                return {
                    name: name,
                    price: fruitPrices[name]
                };

            return `Unrecognized fruit "${params.name}"`;
        }
    })
};


const q1 = "Is an apple more expensive than a banana?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {functions});
console.log("AI: " + a1);

Customizing the System Prompt

What is a system prompt?

A system prompt is text that guides the model towards the kind of responses we want it to generate.

It's recommended to explain to the model how to behave in certain situations you care about, and to tell it to not make up information if it doesn't know something.

Here is an example of how to customize the system prompt:

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    systemPrompt: "You are a helpful, respectful and honest botanist. " +
        "Always answer as helpfully as possible.\n" +
        "If a question does not make any sense or is not factually coherent, " +
        "explain why instead of answering something incorrectly.\n" +
        "Attempt to include nature facts that you know in your answers.\n" +
        "If you don't know the answer to a question, " +
        "don't share false information."
});


const q1 = "What is the tallest tree in the world?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

Saving and Restoring a Chat Session

typescript
import {fileURLToPath} from "url";
import path from "path";
import fs from "fs/promises";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

const chatHistory = session.getChatHistory();
await fs.writeFile("chatHistory.json", JSON.stringify(chatHistory), "utf8");

Later, you can restore the saved chat history into a new session:

typescript
import {fileURLToPath} from "url";
import path from "path";
import fs from "fs/promises";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const chatHistory = JSON.parse(await fs.readFile("chatHistory.json", "utf8"));
session.setChatHistory(chatHistory);

const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

Prompt Without Updating Chat History

You can prompt the model without saving the interaction to the chat history.

typescript
import {fileURLToPath} from "url";
import path from "path";
import fs from "fs/promises";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

// Save the initial chat history
const initialChatHistory = session.getChatHistory();


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

// Reset the chat history
session.setChatHistory(initialChatHistory);


const q2 = "Summarize what you said";
console.log("User: " + q2);

// This response will not be aware of the previous interaction
const a2 = await session.prompt(q2);
console.log("AI: " + a2);

Preload User Prompt

You can preload a user prompt onto the context sequence state to make the response start being generated sooner when the final prompt is given.

This won't speed up inference if you call the .prompt() function immediately after preloading the prompt, but can greatly improve initial response times if you preload a prompt before the user gives it.

You can call this function with an empty string to only preload the existing chat history onto the context sequence state.
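
For instance, a minimal sketch (assuming a session set up like in the example below) that preloads only the existing chat history:

typescript
// preload only the existing chat history onto the context sequence state,
// without adding a new user prompt
await session.preloadPrompt("");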

NOTE

Preloading a long prompt can cause context shifts, so it's recommended to limit the maximum length of the prompt you preload.

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const prompt = "Hi there, how are you?";

console.log("Preloading prompt");
await session.preloadPrompt(prompt);

console.log("Prompt preloaded. Waiting 10 seconds");
await new Promise((resolve) => setTimeout(resolve, 1000 * 10));

console.log("Generating response...");
process.stdout.write("AI: ");
const res = await session.prompt(prompt, {
    onTextChunk(text) {
        process.stdout.write(text);
    }
});

console.log("AI: " + res);

Complete User Prompt

You can try this feature in the example Electron app. Just type a prompt and see the completion generated by the model.

You can generate a completion to a given incomplete user prompt and let the model complete it.

The advantage of doing this on the chat session is that it uses the chat history as context for the completion and reuses the existing context sequence state, so you don't have to create another context sequence for this.

NOTE

Generating a completion to a user prompt can incur context shifts, so it's recommended to limit the maximum number of tokens that are used for the prompt + completion.

INFO

Prompting the model while a prompt completion is in progress will automatically abort the prompt completion.

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Give me a recipe for a cheesecake";
console.log("User: " + q1);

process.stdout.write("AI: ");
const a1 = await session.prompt(q1, {
    onTextChunk(text) {
        process.stdout.write(text);
    }
});
console.log("AI: " + a1);


const maxTokens = 100;
const partialPrompt = "Can I replace the cream cheese with ";

const maxCompletionTokens = maxTokens - model.tokenize(partialPrompt).length;

console.log("Partial prompt: " + partialPrompt);
process.stdout.write("Completion: ");
const promptCompletion = await session.completePrompt(partialPrompt, {
    maxTokens: maxCompletionTokens,
    onTextChunk(text) {
        process.stdout.write(text);
    }
});
console.log("\nPrompt completion: " + promptCompletion);

Prompt Completion Engine

If you want to complete a user prompt as the user types it into an input field, you need a more robust prompt completion engine that works well with partial prompts whose completion is frequently cancelled and restarted.

The prompt completion engine created with .createPromptCompletionEngine() lets you trigger the completion of a prompt while reusing the existing cache to avoid redundant inference and provide fast completions.

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

// ensure the model is fully loaded before continuing this demo
await session.preloadPrompt("");

const completionEngine = session.createPromptCompletionEngine({
    // 15 is used for demonstration only,
    // it's best to omit this option
    maxPreloadTokens: 15,

    // temperature: 0.8, // you can set custom generation options

    onGeneration(prompt, completion) {
        console.log(`Prompt: ${prompt} | Completion: ${completion}`);

        // you should add custom code here that checks whether
        // the existing input text equals `prompt`, and if it does,
        // use `completion` as the completion of the input text.
        // this callback will be called multiple times
        // as the completion is being generated.
    }
});

completionEngine.complete("Hi the");
await new Promise((resolve) => setTimeout(resolve, 1500));

completionEngine.complete("Hi there");
await new Promise((resolve) => setTimeout(resolve, 1500));

completionEngine.complete("Hi there! How");
await new Promise((resolve) => setTimeout(resolve, 1500));

// get an existing completion from the cache
// and begin/continue generating a completion for it
const cachedCompletion = completionEngine.complete("Hi there! How");
console.log("Cached completion:", cachedCompletion);
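
As a rough sketch of the check suggested in the onGeneration comment above (the currentInputText and shownCompletion variables are hypothetical stand-ins for your own input-field state), only a completion that still matches the current input would be displayed:

typescript
// hypothetical stand-in for whatever the user has typed into your input field
let currentInputText = "Hi there! How";
let shownCompletion = "";

const inputCompletionEngine = session.createPromptCompletionEngine({
    onGeneration(prompt, completion) {
        // only keep the completion if it still matches the current input text;
        // your UI would then render `shownCompletion` after the typed text
        if (prompt === currentInputText)
            shownCompletion = completion;
    }
});

// trigger (or continue) completion generation for the current input text
inputCompletionEngine.complete(currentInputText);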

Response Prefix

You can force the model response to start with a specific prefix, to make the model follow a certain direction in its response.

typescript
import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, GeneralChatWrapper} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper: new GeneralChatWrapper()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    responsePrefix: "The weather today is"
});
console.log("AI: " + a1);