
gpt-oss is here!

August 9, 2025

node-llama-cpp + gpt-oss

node-llama-cpp v3.12 is here, with full support for gpt-oss models!


gpt-oss

gpt-oss comes in two flavors:

  • gpt-oss-20b (~21B parameters, with only a few billion active per token)
  • gpt-oss-120b (~117B parameters, also with only a few billion active per token)

Here are a few highlights of these models:

  • Due to the low number of active parameters, these models are very fast
  • These are reasoning models, and you can adjust their reasoning effort
  • They are very good at function calling, and are built with agentic capabilities in mind
  • These models were trained with native MXFP4 precision, so there's no need to quantize them further; they're already small relative to their capabilities
  • They are released under the Apache 2.0 license, so you can use them in your commercial applications

Here are some recommended model URIs you can use to try out gpt-oss right away:

Model           Size     URI
gpt-oss-20b     12.1GB   hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf
gpt-oss-120b    63.4GB   hf:giladgd/gpt-oss-120b-GGUF/gpt-oss-120b.MXFP4-00001-of-00002.gguf

TIP

Estimate the compatibility of a model with your machine before downloading it:

shell
npx -y node-llama-cpp inspect estimate <model URI>

MXFP4 Quantization

You might be used to looking for a Q4_K_M quantization of a model because of its good balance between quality and size, and expect to do the same for gpt-oss. You don't have to, because these models are already provided natively in a similar quantization format called MXFP4.

Let's break down what MXFP4 is:

  • MXFP4 stands for Microscaling FP4 (Floating Point, 4-bit). Q4_K_M is also a 4-bit quantization.
  • It's a format that was created and standardized by the Open Compute Project (OCP) in early 2024. OCP is backed by big players like OpenAI, NVIDIA, AMD, Microsoft, and Meta, with the goal of lowering the hardware and compute barriers to running AI models.
  • It's designed to dramatically reduce the memory and compute requirements for training and running AI models, while preserving as much precision as possible.

This format was used to train the gpt-oss models, so the most precise format of these models is MXFP4.
Since this is a 4-bit precision format, its size footprint is similar to a Q4_K_M quantization, but it provides better precision and thus better quality. First-class support for MXFP4 in llama.cpp was introduced as part of the gpt-oss release.

The bottom line is that you don't have to find a Q4_K_M quantization of gpt-oss models, because the MXFP4 format is as small, efficient, and fast as Q4_K_M, but offers better precision and thus better quality.
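
To make the size math more concrete, here's a rough, simplified sketch of MXFP4-style block quantization. This is an educational illustration only, not node-llama-cpp API and not the exact OCP specification or llama.cpp implementation: it assumes blocks of 32 values that share one power-of-two scale, with each value stored in 4 bits (a sign plus an E2M1 magnitude).

typescript
// Simplified illustration of MXFP4-style block quantization (educational sketch only;
// the real OCP MX spec and the llama.cpp implementation differ in details).
// The representable E2M1 magnitudes for a 4-bit element:
const fp4Magnitudes = [0, 0.5, 1, 1.5, 2, 3, 4, 6];

function quantizeBlock(block: number[]) {
    // pick a shared power-of-two scale so the largest value lands near the top FP4 magnitude (6)
    const maxAbs = Math.max(...block.map(Math.abs), 1e-12);
    const scaleExponent = Math.ceil(Math.log2(maxAbs / 6));
    const scale = 2 ** scaleExponent;

    const elements = block.map((value) => {
        const target = Math.abs(value) / scale;

        // round to the nearest representable FP4 magnitude
        let index = 0;
        for (let i = 1; i < fp4Magnitudes.length; i++) {
            if (Math.abs(fp4Magnitudes[i] - target) < Math.abs(fp4Magnitudes[index] - target))
                index = i;
        }

        return {negative: value < 0, index};
    });

    // 32 elements x 4 bits + one shared 8-bit exponent ≈ 4.25 bits per weight,
    // which is why MXFP4 ends up in the same size ballpark as Q4_K_M
    return {scaleExponent, elements};
}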

Try It Using the CLI

To quickly try out gpt-oss-20b, you can use the CLI chat command:

shell
npx -y node-llama-cpp chat --ef --prompt "Hi there" hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf

thought Segments

Since gpt-oss models are reasoning models, they generate thoughts as part of their response. These thoughts are useful for debugging and understanding the model's reasoning process, and can be used to iterate on the system prompt and inputs you provide to the model to improve its responses.

However, OpenAI emphasizes that the thoughts generated by these models may not be safe to show to end users: they are unrestricted and might include sensitive information, unfiltered language, hallucinations, or other issues. Thus, OpenAI recommends not showing them to users without further filtering, moderation, or summarization.

Check out the segment streaming example to learn how to use segments.
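
As a minimal sketch, here's one way you might separate thought segments from the user-facing text while streaming, using the onResponseChunk option of session.prompt(). It assumes an existing session like the ones created in the examples below; see the segment streaming example for the full API.

typescript
// Minimal sketch: stream the response and keep "thought" segments out of the user-facing text.
// Assumes an existing `session` (a LlamaChatSession), like the ones created in the examples below.
const response = await session.prompt("What is the weather like in SF?", {
    onResponseChunk(chunk) {
        if (chunk.type === "segment" && chunk.segmentType === "thought")
            console.log(`[thought] ${chunk.text}`); // useful for debugging, but not meant for end users as-is
        else if (chunk.type == null)
            process.stdout.write(chunk.text); // the main response text
    }
});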

comment Segments

gpt-oss models can output "preamble" messages as part of their response; these are exposed as a new comment segment type.

The model might choose to generate those segments to inform the user about the functions it's about to call. For example, when it plans to use multiple functions, it may generate a plan in advance.

These are intended for the user to see, but not as part of the main response.

Check out the segment streaming example to learn how to use segments.
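
As a rough sketch, you could collect comment segments separately and present them as a status or preamble line, apart from the main answer. This again assumes an existing session and the onResponseChunk streaming option, plus the functions object defined in the function calling example below; see the segment streaming example for the exact API.

typescript
// Minimal sketch: collect "comment" (preamble) segments separately from the main response,
// so they can be shown to the user as a status line rather than as part of the answer.
// Assumes an existing `session` (a LlamaChatSession) and the `functions` object
// from the function calling example below.
let preamble = "";
let mainResponse = "";

await session.prompt("What is the weather like in SF and in NYC?", {
    functions,
    onResponseChunk(chunk) {
        if (chunk.type === "segment" && chunk.segmentType === "comment")
            preamble += chunk.text;
        else if (chunk.type == null)
            mainResponse += chunk.text;
    }
});

console.log("Preamble:", preamble);
console.log("Answer:", mainResponse);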

Experiment with comment segments

The Electron app template has been updated to properly segment comments in the response.

Try it out by downloading the latest build from GitHub, or by scaffolding a new project based on the Electron template:

shell
npm create node-llama-cpp@latest

Customizing gpt-oss

You can adjust gpt-oss's responses by configuring the options of HarmonyChatWrapper:

typescript
import {
    getLlama,
    resolveModelFile,
    LlamaChatSession,
    HarmonyChatWrapper
} from "node-llama-cpp";

const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: await resolveModelFile(modelUri)
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper: new HarmonyChatWrapper({
        modelIdentity: "You are ChatGPT, a large language model trained by OpenAI.",
        reasoningEffort: "high"
    })
});

const q1 = "What is the weather like in SF?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

Using Function Calling

gpt-oss models have great support for function calling. However, these models don't support parallel function calling, so only one function will be called at a time.

typescript
import {
    getLlama,
    resolveModelFile,
    LlamaChatSession,
    defineChatSessionFunction
} from "node-llama-cpp";

const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: await resolveModelFile(modelUri)
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const functions = {
    getCurrentWeather: defineChatSessionFunction({
        description: "Gets the current weather in the provided location.",
        params: {
            type: "object",
            properties: {
                location: {
                    type: "string",
                    description: "The city and state, e.g. San Francisco, CA"
                },
                format: {
                    enum: ["celsius", "fahrenheit"]
                }
            }
        },
        handler({location, format}) {
            console.log(`Getting current weather for "${location}" in ${format}`);

            return {
                // simulate a weather API response
                temperature: format === "celsius" ? 20 : 68,
                format
            };
        }
    })
};

const q1 = "What is the weather like in SF?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {functions});
console.log("AI: " + a1);