
gpt-oss is here!

August 9, 2025

node-llama-cpp + gpt-oss

node-llama-cpp v3.12 is here, with full support for gpt-oss models!


gpt-oss

gpt-oss comes in two flavors:

  • gpt-oss-20b (~21B parameters, with only a few billion active per token)
  • gpt-oss-120b (~117B parameters, also with only a few billion active per token)

Here are a few highlights of these models:

  • Due to the low number of active parameters, these models are very fast
  • These are reasoning models, and you can adjust their reasoning effort
  • They are very good at function calling, and are built with agentic capabilities in mind
  • These models were trained with native MXFP4 precision, so there's no need to quantize them further; they're already small relative to their capabilities
  • They are released under the Apache 2.0 license, so you can use them in your commercial applications

Here are some recommended model URIs you can use to try out gpt-oss right away:

Model           Size     URI
gpt-oss-20b     12.1GB   hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf
gpt-oss-120b    63.4GB   hf:giladgd/gpt-oss-120b-GGUF/gpt-oss-120b.MXFP4-00001-of-00002.gguf

TIP

Estimate the compatibility of a model with your machine before downloading it:

shell
npx -y node-llama-cpp inspect estimate <model URI>

MXFP4 Quantization

You might be used to looking for a Q4_K_M quantization of a model because of its good balance between quality and size, and expect to do the same for gpt-oss. You don't have to, because these models are already provided natively in a similar quantization format called MXFP4.

Let's break down what MXFP4 is:

  • MXFP4 stands for Microscaling FP4 (Floating Point, 4-bit). Q4_K_M is also a 4-bit quantization.
  • It's a format that was created and standardized by the Open Compute Project (OCP) in early 2024. OCP is backed by big players like OpenAI, NVIDIA, AMD, Microsoft, and Meta, with the goal of lowering the hardware and compute barriers to running AI models.
  • It's designed to dramatically reduce the memory and compute requirements for training and running AI models, while preserving as much precision as possible.

This format was used to train the gpt-oss models, so the most precise format of these models is MXFP4.
Since this is a 4-bit precision format, its size footprint is similar to a Q4_K_M quantization, but it provides better precision and thus better quality. First-class support for MXFP4 in llama.cpp was introduced as part of the gpt-oss release.

The bottom line is that you don't have to find a Q4_K_M quantization of gpt-oss models, because the MXFP4 format is as small, efficient, and fast as Q4_K_M, but offers better precision and thus better quality.
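
To make the size math more concrete, here's a rough, simplified sketch of MXFP4-style block quantization. This is an educational illustration only, not node-llama-cpp API and not the exact OCP specification or llama.cpp implementation: it assumes blocks of 32 values that share one power-of-two scale, with each value stored in 4 bits (a sign plus an E2M1 magnitude).

typescript
// Simplified illustration of MXFP4-style block quantization (educational sketch only;
// the real OCP MX spec and the llama.cpp implementation differ in details).
// The representable E2M1 magnitudes for a 4-bit element:
const fp4Magnitudes = [0, 0.5, 1, 1.5, 2, 3, 4, 6];

function quantizeBlock(block: number[]) {
    // pick a shared power-of-two scale so the largest value lands near the top FP4 magnitude (6)
    const maxAbs = Math.max(...block.map(Math.abs), 1e-12);
    const scaleExponent = Math.ceil(Math.log2(maxAbs / 6));
    const scale = 2 ** scaleExponent;

    const elements = block.map((value) => {
        const target = Math.abs(value) / scale;

        // round to the nearest representable FP4 magnitude
        let index = 0;
        for (let i = 1; i < fp4Magnitudes.length; i++) {
            if (Math.abs(fp4Magnitudes[i] - target) < Math.abs(fp4Magnitudes[index] - target))
                index = i;
        }

        return {negative: value < 0, index};
    });

    // 32 elements x 4 bits + one shared 8-bit exponent ≈ 4.25 bits per weight,
    // which is why MXFP4 ends up in the same size ballpark as Q4_K_M
    return {scaleExponent, elements};
}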

Try It Using the CLI

To quickly try out gpt-oss-20b, you can use the CLI chat command:

shell
npx -y node-llama-cpp chat --ef --prompt "Hi there" hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf

thought Segments

Since gpt-oss models are reasoning models, they generate thoughts as part of their response. These thoughts are useful for debugging and understanding the model's reasoning process, and can be used to iterate on the system prompt and inputs you provide to the model to improve its responses.

However, OpenAI emphasizes that the thoughts generated by these models may not be safe to show to end users: they are unrestricted and might include sensitive information, unfiltered language, hallucinations, or other issues. Thus, OpenAI recommends not showing them to users without further filtering, moderation, or summarization.

Check out the segment streaming example to learn how to use segments.
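
As a minimal sketch, here's one way you might separate thought segments from the user-facing text while streaming, using the onResponseChunk option of session.prompt(). It assumes an existing session like the ones created in the examples below; see the segment streaming example for the full API.

typescript
// Minimal sketch: stream the response and keep "thought" segments out of the user-facing text.
// Assumes an existing `session` (a LlamaChatSession), like the ones created in the examples below.
const response = await session.prompt("What is the weather like in SF?", {
    onResponseChunk(chunk) {
        if (chunk.type === "segment" && chunk.segmentType === "thought")
            console.log(`[thought] ${chunk.text}`); // useful for debugging, but not meant for end users as-is
        else if (chunk.type == null)
            process.stdout.write(chunk.text); // the main response text
    }
});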

comment Segments

gpt-oss models can output "preamble" messages as part of their response; these are exposed as a new comment segment type.

The model might choose to generate those segments to inform the user about the functions it's about to call. For example, when it plans to use multiple functions, it may generate a plan in advance.

These are intended for the user to see, but not as part of the main response.

Check out the segment streaming example to learn how to use segments.
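
As a rough sketch, you could collect comment segments separately and present them as a status or preamble line, apart from the main answer. This again assumes an existing session and the onResponseChunk streaming option, plus the functions object defined in the function calling example below; see the segment streaming example for the exact API.

typescript
// Minimal sketch: collect "comment" (preamble) segments separately from the main response,
// so they can be shown to the user as a status line rather than as part of the answer.
// Assumes an existing `session` (a LlamaChatSession) and the `functions` object
// from the function calling example below.
let preamble = "";
let mainResponse = "";

await session.prompt("What is the weather like in SF and in NYC?", {
    functions,
    onResponseChunk(chunk) {
        if (chunk.type === "segment" && chunk.segmentType === "comment")
            preamble += chunk.text;
        else if (chunk.type == null)
            mainResponse += chunk.text;
    }
});

console.log("Preamble:", preamble);
console.log("Answer:", mainResponse);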

Experiment with comment segments

The Electron app template has been updated to properly segment comments in the response.

Try it out by downloading the latest build from GitHub, or by scaffolding a new project based on the Electron template:

shell
npm create node-llama-cpp@latest

Customizing gpt-oss

You can adjust gpt-oss's responses by configuring the options of HarmonyChatWrapper:

typescript
import {
    getLlama,
    resolveModelFile,
    LlamaChatSession,
    HarmonyChatWrapper
} from "node-llama-cpp";

const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: await resolveModelFile(modelUri)
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper: new HarmonyChatWrapper({
        modelIdentity: "You are ChatGPT, a large language model trained by OpenAI.",
        reasoningEffort: "high"
    })
});

const q1 = "What is the weather like in SF?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

Using Function Calling

gpt-oss models have great support for function calling. However, these models don't support parallel function calling, so only one function will be called at a time.

typescript
import {
    getLlama,
    resolveModelFile,
    LlamaChatSession,
    defineChatSessionFunction
} from "node-llama-cpp";

const modelUri = "hf:giladgd/gpt-oss-20b-GGUF/gpt-oss-20b.MXFP4.gguf";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: await resolveModelFile(modelUri)
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const functions = {
    getCurrentWeather: defineChatSessionFunction({
        description: "Gets the current weather in the provided location.",
        params: {
            type: "object",
            properties: {
                location: {
                    type: "string",
                    description: "The city and state, e.g. San Francisco, CA"
                },
                format: {
                    enum: ["celsius", "fahrenheit"]
                }
            }
        },
        handler({location, format}) {
            console.log(`Getting current weather for "${location}" in ${format}`);

            return {
                // simulate a weather API response
                temperature: format === "celsius" ? 20 : 68,
                format
            };
        }
    })
};

const q1 = "What is the weather like in SF?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {functions});
console.log("AI: " + a1);