gpt-oss is here!
August 9, 2025

node-llama-cpp v3.12 is here, with full support for `gpt-oss` models!
gpt-oss
`gpt-oss` comes in two flavors:
- `gpt-oss-20b` - 21B parameters with 3.6B active parameters
- `gpt-oss-120b` - 117B parameters with 5.1B active parameters
Here are a few highlights of these models:
- Due to the low number of active parameters, these models are very fast
- These are reasoning models, and you can adjust their reasoning effort
- They are very good at function calling, and are built with agentic capabilities in mind
- These models were trained with native MXFP4 precision, so there's no need to quantize them further; they're already small relative to their capabilities
- They are released under the Apache 2.0 license, so you can use them in your commercial applications
Recommended Models
Here are some recommended model URIs you can use to try out `gpt-oss` right away:

| Model | Size | URI |
|---|---|---|
| gpt-oss-20b | 12.1GB | `hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf` |
| gpt-oss-120b | 63.4GB | `hf:giladgd/gpt-oss-120b-GGUF/gpt-oss-120b.MXFP4-00001-of-00002.gguf` |
TIP
Estimate the compatibility of a model with your machine before downloading it:
```shell
npx -y node-llama-cpp inspect estimate <model URI>
```
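If you'd rather download a model ahead of time instead of on first use, you can use the CLI's `pull` command (shown here with the `gpt-oss-20b` URI from the table above):
```shell
npx -y node-llama-cpp pull hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf
```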
MXFP4 Quantization
You might be used to looking for a `Q4_K_M` quantization because of its good balance between quality and size, and thus be tempted to look for a `Q4_K_M` quantization of `gpt-oss` models. You don't have to, because these models are already provided natively in a similar quantization format called `MXFP4`.
Let's break down what `MXFP4` is:
- `MXFP4` stands for Microscaling FP4 (Floating Point, 4-bit). Like `Q4_K_M`, it is a 4-bit quantization.
- It's a format that was created and standardized by the Open Compute Project (OCP) in early 2024. OCP is backed by big players like OpenAI, NVIDIA, AMD, Microsoft, and Meta, with the goal of lowering the hardware and compute barriers to running AI models.
- It's designed to dramatically reduce the memory and compute requirements for training and running AI models, while preserving as much precision as possible.
This format was used to train the `gpt-oss` models, so the most precise format of these models is `MXFP4`.
Since this is a 4-bit precision format, its size footprint is similar to a `Q4_K_M` quantization, but it provides better precision and thus better quality. First-class support for `MXFP4` in `llama.cpp` was introduced as part of the `gpt-oss` release.
The bottom line is that you don't have to find a `Q4_K_M` quantization of `gpt-oss` models, because the `MXFP4` format is as small, efficient, and fast as `Q4_K_M`, but offers better precision and thus better quality.
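You can see this for yourself by inspecting the metadata of a GGUF model file. For example, the `inspect gguf` CLI command prints a file's metadata and tensor information, so you can check that the recommended models above already use MXFP4 tensors (assuming here that the command accepts a model URI directly; you can also point it at a downloaded `.gguf` file):
```shell
npx -y node-llama-cpp inspect gguf hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf
```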
Try It Using the CLI
To quickly try out `gpt-oss-20b`, you can use the CLI `chat` command:
```shell
npx -y node-llama-cpp chat --ef --prompt "Hi there" hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf
```
`thought` Segments
Since `gpt-oss` models are reasoning models, they generate thoughts as part of their response. These thoughts are useful for debugging and understanding the model's reasoning process, and can be used to iterate on the system prompt and inputs you provide to the model to improve its responses.
However, OpenAI emphasizes that the thoughts generated by these models may not be safe to show to end users, as they are unrestricted and might include sensitive information, uncontained language, hallucinations, or other issues. Thus, OpenAI recommends not showing them to users without further filtering, moderation, or summarization.
Check out the segment streaming example to learn how to use segments.
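For a quick idea of what that looks like in code, here's a minimal sketch based on that example (treat the exact chunk fields as an assumption that may differ between versions): it streams `thought` segments separately from the main response text using the `onResponseChunk` option of `session.prompt`:
```ts
import {getLlama, resolveModelFile, LlamaChatSession} from "node-llama-cpp";

const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: await resolveModelFile(modelUri)
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const q1 = "What is the weather like in SF?";
console.log("User: " + q1);

process.stdout.write("AI: ");
await session.prompt(q1, {
    onResponseChunk(chunk) {
        if (chunk.type === "segment" && chunk.segmentType === "thought") {
            // reasoning text; useful for debugging, but OpenAI recommends
            // not showing it to end users without filtering or summarization
            if (chunk.segmentStartTime != null)
                process.stdout.write(" [thought start] ");

            process.stdout.write(chunk.text);

            if (chunk.segmentEndTime != null)
                process.stdout.write(" [thought end] ");
        } else if (chunk.type == null) {
            // regular response text
            process.stdout.write(chunk.text);
        }
    }
});
console.log();
```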
`comment` Segments
`gpt-oss` models output "preamble" messages in their response; these are segmented as a new `comment` segment in the model's response.
The model might choose to generate those segments to inform the user about the functions it's about to call. For example, when it plans to use multiple functions, it may generate a plan in advance.
These are intended for the user to see, but not as part of the main response.
Check out the segment streaming example to learn how to use segments.
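The same `onResponseChunk` pattern from the sketch above can be used to surface these preambles separately from the main answer, for example as status updates in a UI. Here's another minimal sketch (again assuming the same chunk fields) that collects `comment` segments while prompting:
```ts
import {getLlama, resolveModelFile, LlamaChatSession} from "node-llama-cpp";

const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: await resolveModelFile(modelUri)
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const comments: string[] = [];
const answer = await session.prompt("What is the weather like in SF?", {
    onResponseChunk(chunk) {
        // preamble text intended for the user,
        // such as a plan of the function calls the model is about to make
        if (chunk.type === "segment" && chunk.segmentType === "comment")
            comments.push(chunk.text);
    }
});

console.log("Comments: " + comments.join(""));
console.log("Answer: " + answer);
```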
Experiment with `comment` segments
The Electron app template has been updated to properly segment comments in the response.
Try it out by downloading the latest build from GitHub, or by scaffolding a new project based on the Electron template:
```shell
npm create node-llama-cpp@latest
```
Customizing gpt-oss
You can adjust `gpt-oss`'s responses by configuring the options of `HarmonyChatWrapper`, such as the model identity and the reasoning effort:
```ts
import {
    getLlama, resolveModelFile, LlamaChatSession,
    HarmonyChatWrapper
} from "node-llama-cpp";

const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: await resolveModelFile(modelUri)
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper: new HarmonyChatWrapper({
        modelIdentity: "You are ChatGPT, a large language model trained by OpenAI.",
        reasoningEffort: "high"
    })
});

const q1 = "What is the weather like in SF?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);
```
Using Function Calling
`gpt-oss` models have great support for function calling. However, these models don't support parallel function calling, so only one function will be called at a time.
```ts
import {
    getLlama, resolveModelFile, LlamaChatSession,
    defineChatSessionFunction
} from "node-llama-cpp";

const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: await resolveModelFile(modelUri)
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const functions = {
    getCurrentWeather: defineChatSessionFunction({
        description: "Gets the current weather in the provided location.",
        params: {
            type: "object",
            properties: {
                location: {
                    type: "string",
                    description: "The city and state, e.g. San Francisco, CA"
                },
                format: {
                    enum: ["celsius", "fahrenheit"]
                }
            }
        },
        handler({location, format}) {
            console.log(`Getting current weather for "${location}" in ${format}`);

            return {
                // simulate a weather API response
                temperature: format === "celsius" ? 20 : 68,
                format
            };
        }
    })
};

const q1 = "What is the weather like in SF?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {functions});
console.log("AI: " + a1);
```