
Using Batching

Batching is the process of grouping multiple input sequences together to be processed simultaneously, which improves computational efficiency and reduces overall inference time.

This is useful when you have a large number of inputs to evaluate and want to speed up the process.

When evaluating inputs on multiple context sequences in parallel, batching is automatically used.

To create a context that has multiple context sequences, you can set the sequences option when creating a context.

Here's an example of how to process 2 inputs in parallel, utilizing batching:

```typescript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

// path to a local .gguf model file (placeholder - adjust to your setup)
const modelPath = "path/to/your/model.gguf";

const llama = await getLlama();
const model = await llama.loadModel({modelPath});
const context = await model.createContext({
    sequences: 2
});

const sequence1 = context.getSequence();
const sequence2 = context.getSequence();

const session1 = new LlamaChatSession({
    contextSequence: sequence1
});
const session2 = new LlamaChatSession({
    contextSequence: sequence2
});

const q1 = "Hi there, how are you?";
const q2 = "How much is 6+6?";

// prompting both sessions at once lets their tokens be processed in the same batch
const [a1, a2] = await Promise.all([
    session1.prompt(q1),
    session2.prompt(q2)
]);

console.log("User: " + q1);
console.log("AI: " + a1);

console.log("User: " + q2);
console.log("AI: " + a2);
```

INFO

Since multiple context sequences are processed in parallel, aborting the evaluation of one of them will only cancel the next evaluations of that sequence, and the existing batched evaluation will continue.

To clarify: when you abort a response on a chat session, generation stops only after the token currently being generated finishes; no further tokens are generated after it.
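Below is a minimal sketch of aborting one of two parallel prompts, reusing the sessions from the example above and assuming `prompt()` accepts an `AbortSignal` via a `signal` option (check the API of your node-llama-cpp version); the aborted prompt rejects while the other keeps generating:

```typescript
const abortController = new AbortController();

// abort the first prompt shortly after it starts;
// the second prompt continues generating unaffected
setTimeout(() => abortController.abort(), 1000);

const [r1, r2] = await Promise.allSettled([
    session1.prompt(q1, {signal: abortController.signal}), // assumed signal option
    session2.prompt(q2)
]);

console.log("First prompt:", r1.status);  // likely "rejected" due to the abort
console.log("Second prompt:", r2.status); // "fulfilled"
```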

Custom batchSize

You can set the batchSize option when creating a context to change the maximum number of tokens that can be processed in parallel.

Note that a larger batchSize will require more memory and may slow down inference if the GPU is not powerful enough to handle it.
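For example, a custom batch size can be passed alongside the sequences option (the value below is only illustrative; tune it to your hardware and memory budget):

```typescript
const context = await model.createContext({
    sequences: 2,
    batchSize: 512 // maximum number of tokens evaluated per batch (illustrative value)
});
```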