1 Introduction#
This article uses the Ministral-3-3B-Instruct-2512 model, fine-tuned via SFT on an image classification task, as an example of how to add new special tokens in LLaMA-Factory. The experiment is launched with the following commands:
| # install newest transformers
pip install git+https://github.com/huggingface/transformers
DISABLE_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=7 python src/train.py examples/train_lora/ministral3_lora_sft.yaml
|
The ministral3_lora_sft.yaml file must be configured before the training command is run.
2 Dataset Loading and Preprocessing#
In the file
LLaMA-Factory/src/llamafactory/data/loader.py,
the get_dataset function is responsible for loading the dataset and preprocessing the data using the tokenizer.
2.1 Data Loading#
The following code is part of the
LLaMA-Factory/src/llamafactory/data/loader.py:get_dataset
function. It handles reading the data and converting it into the required format.
| # Load and preprocess dataset
with training_args.main_process_first(desc="load dataset", local=(not data_args.data_shared_file_system)):
    dataset = _get_merged_dataset(data_args.dataset, model_args, data_args, training_args, stage)
    eval_dataset = _get_merged_dataset(
        data_args.eval_dataset,
        model_args,
        data_args,
        training_args,
        stage,
        return_dict=data_args.eval_on_each_dataset,
    )
|
The loaded data is stored in dataset, with each sample converted into the following internal format:
| [
    {
        '_prompt': [{'role': 'user', 'content': 'Transform the following sentence using a synonym: The car sped quickly.'}],
        '_response': [{'role': 'assistant', 'content': 'The car accelerated rapidly.'}],
        '_system': '',
        '_tools': '',
        '_images': None,
        '_videos': None,
        '_audios': None
    }
]
|
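The conversion into this internal format can be sketched roughly as follows. This is a minimal illustration, not LLaMA-Factory's actual converter: the helper name `to_internal_format` and the input keys (`messages`, `system`, `tools`, `images`, `videos`, `audios`) are assumptions made for the example.

```python
def to_internal_format(record: dict) -> dict:
    """Split a chat record into prompt turns and a final response (hypothetical helper)."""
    messages = record["messages"]
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("the last message must come from the assistant")
    return {
        "_prompt": messages[:-1],      # every turn before the final answer
        "_response": messages[-1:],    # the final assistant turn, kept as a list
        "_system": record.get("system", ""),
        "_tools": record.get("tools", ""),
        "_images": record.get("images"),  # None when the sample has no images
        "_videos": record.get("videos"),
        "_audios": record.get("audios"),
    }


example = to_internal_format(
    {
        "messages": [
            {"role": "user", "content": "Transform the following sentence using a synonym: The car sped quickly."},
            {"role": "assistant", "content": "The car accelerated rapidly."},
        ]
    }
)
print(example["_prompt"])
print(example["_response"])
```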
2.2 Data Preprocessing#
The data preprocessing code is located in
LLaMA-Factory/src/llamafactory/data/loader.py:get_dataset, as shown below:
| with training_args.main_process_first(desc="pre-process dataset", local=(not data_args.data_shared_file_system)):
    dataset = _get_preprocessed_dataset(
        dataset, data_args, training_args, stage, template, tokenizer, processor, is_eval=False
    )
|
This code renders each JSON-style sample into a formatted text sequence according to the chat template. For example, with a ChatML-style template,
| '_prompt': [{'role': 'user', 'content': 'Transform the following sentence using a synonym: The car sped quickly.'}]
|
is converted to
| '<|im_start|>user\nTransform the following sentence using a synonym: The car sped quickly.<|im_end|>\n<|im_start|>assistant\n'
|
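The rendering step for this particular example can be reproduced with a few lines of Python. This is only a sketch of a ChatML-style layout; the function name `render_chatml` is hypothetical, and the real template logic in LLaMA-Factory is more general.

```python
def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat messages in a ChatML-style layout (illustrative only)."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


prompt = [{"role": "user", "content": "Transform the following sentence using a synonym: The car sped quickly."}]
rendered = render_chatml(prompt)
print(repr(rendered))
```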
Then, the sequence is converted into token IDs, and the function call flow is as follows:
_get_preprocessed_dataset → SupervisedDatasetProcessor.preprocess_dataset → SupervisedDatasetProcessor._encode_data_example → SupervisedDatasetProcessor.template.encode_multiturn → Template._encode
Template._encode performs the conversion from sequences to token IDs. The code is as follows:
| def _encode(
    self,
    tokenizer: "PreTrainedTokenizer",
    messages: list[dict[str, str]],
    system: Optional[str],
    tools: Optional[str],
) -> list[list[int]]:
    r"""Encode formatted inputs to pairs of token ids.

    Turn 0: prefix + system + query resp
    Turn t: query resp.
    """
    system = system or self.default_system
    encoded_messages = []
    for i, message in enumerate(messages):
        elements = []

        if i == 0:
            elements += self.format_prefix.apply()
            if system or tools:
                tool_text = self.format_tools.apply(content=tools)[0] if tools else ""
                elements += self.format_system.apply(content=(system + tool_text))

        if message["role"] == Role.USER:
            elements += self.format_user.apply(content=message["content"], idx=str(i // 2))
        elif message["role"] == Role.ASSISTANT:
            elements += self.format_assistant.apply(content=message["content"])
        elif message["role"] == Role.OBSERVATION:
            elements += self.format_observation.apply(content=message["content"])
        elif message["role"] == Role.FUNCTION:
            elements += self.format_function.apply(
                content=message["content"], thought_words=self.thought_words, tool_call_words=self.tool_call_words
            )
        else:
            raise NotImplementedError("Unexpected role: {}".format(message["role"]))

        encoded_messages.append(self._convert_elements_to_ids(tokenizer, elements))

    return encoded_messages
|
This function first performs format conversion to obtain elements, and then uses the tokenizer to convert elements into token IDs.
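The second half of that process, turning elements into token IDs, can be pictured with a toy example. The sketch below assumes that an element is either a plain string (encoded as ordinary text) or a dict of the form `{"token": ...}` (mapped directly to one ID); `ToyTokenizer` and `convert_elements_to_ids` are stand-ins written for this illustration, not the library's classes.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a growing vocabulary (illustration only)."""

    def __init__(self, special_tokens: list[str]):
        self.vocab = {token: idx for idx, token in enumerate(special_tokens)}

    def convert_tokens_to_ids(self, token: str) -> int:
        return self.vocab[token]

    def encode(self, text: str) -> list[int]:
        # Unknown words receive the next free ID
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]


def convert_elements_to_ids(tokenizer: ToyTokenizer, elements: list) -> list[int]:
    token_ids = []
    for element in elements:
        if isinstance(element, str):
            token_ids += tokenizer.encode(element)  # ordinary text
        elif isinstance(element, dict):
            token_ids.append(tokenizer.convert_tokens_to_ids(element["token"]))  # one special token
        else:
            raise TypeError(f"unsupported element type: {type(element)}")
    return token_ids


toy = ToyTokenizer(["<|im_start|>", "<|im_end|>"])
elements = [{"token": "<|im_start|>"}, "user hello world", {"token": "<|im_end|>"}]
ids = convert_elements_to_ids(toy, elements)
print(ids)  # [0, 2, 3, 4, 1]
```

The point of the sketch is that a special token maps to exactly one ID, whereas ordinary text may be split into several.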
3 Special Tokens Parameter Passing#
With Hugging Face transformers, special tokens are added through the tokenizer's add_special_tokens method, for example:
| from transformers import AutoTokenizer

model_name = "path/to/model"  # placeholder: a local path or Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
special_tokens_dict = {
    "additional_special_tokens": [
        "<start>",
        "<end>",
    ]
}
# Returns the number of tokens actually added to the vocabulary
num_added = tokenizer.add_special_tokens(special_tokens_dict)
print("Added tokens:", num_added)
|
To add special tokens in LLaMA-Factory, therefore, the tokens must reach the tokenizer that the framework loads. The rest of this section traces how that happens.
3.1 Tokenizer Loading Method#
In run_sft under
LLaMA-Factory/src/llamafactory/train/sft/workflow.py,
the tokenizer is loaded.
| def run_sft(
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    finetuning_args: "FinetuningArguments",
    generating_args: "GeneratingArguments",
    callbacks: Optional[list["TrainerCallback"]] = None,
):
    tokenizer_module = load_tokenizer(model_args)
    ......
|
The function call path is load_tokenizer → patch_tokenizer:
| def patch_tokenizer(tokenizer: "PreTrainedTokenizer", model_args: "ModelArguments") -> None:
    if "PreTrainedTokenizerBase" not in str(tokenizer._pad.__func__):
        tokenizer._pad = MethodType(PreTrainedTokenizerBase._pad, tokenizer)
    ......
    if model_args.add_special_tokens is not None:
        num_added_special_tokens = tokenizer.add_tokens(new_tokens=model_args.add_special_tokens, special_tokens=True)
        logger.info_rank0(
            "Add special tokens {} to tokenizer's vocabulary.".format(",".join(model_args.add_special_tokens))
        )
        if num_added_special_tokens > 0 and not model_args.resize_vocab:
            model_args.resize_vocab = True
            logger.warning_rank0("New special tokens have been added, changed `resize_vocab` to True.")
|
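As the code shows, if model_args.add_special_tokens is set, those tokens are added to the tokenizer's vocabulary as special tokens, and resize_vocab is switched to True so that the model's embedding matrix can be resized to match the enlarged vocabulary.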
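Outside of LLaMA-Factory, the same two steps (add the tokens, then resize the embeddings) can be sketched as below. `add_tokens(..., special_tokens=True)` and `resize_token_embeddings` are the standard transformers calls; the two `_Stub*` classes are hypothetical stand-ins included only so the sketch runs without downloading a model.

```python
def add_special_tokens_and_resize(tokenizer, model, new_tokens: list[str]) -> int:
    """Add tokens as special tokens and grow the embedding matrix if anything was added."""
    num_added = tokenizer.add_tokens(new_tokens, special_tokens=True)
    if num_added > 0:
        # New IDs need embedding rows, otherwise lookups would go out of range
        model.resize_token_embeddings(len(tokenizer))
    return num_added


class _StubTokenizer:
    """Stand-in exposing only the tokenizer methods used above (assumption for the demo)."""

    def __init__(self, vocab: list[str]):
        self._vocab = list(vocab)

    def __len__(self) -> int:
        return len(self._vocab)

    def add_tokens(self, new_tokens, special_tokens=False) -> int:
        fresh = [token for token in new_tokens if token not in self._vocab]
        self._vocab.extend(fresh)
        return len(fresh)


class _StubModel:
    """Stand-in exposing only resize_token_embeddings (assumption for the demo)."""

    def __init__(self, num_embeddings: int):
        self.num_embeddings = num_embeddings

    def resize_token_embeddings(self, new_num_tokens: int) -> None:
        self.num_embeddings = new_num_tokens


stub_tokenizer = _StubTokenizer(["hello", "world"])
stub_model = _StubModel(num_embeddings=2)
added = add_special_tokens_and_resize(stub_tokenizer, stub_model, ["[start]", "[end]"])
print(added, stub_model.num_embeddings)  # 2 4
```

With a real tokenizer and model loaded through transformers, the same helper can be called unchanged.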
3.2 Model Arguments Loading Method#
Now that we understand how the tokenizer is loaded, the key question becomes how model_args and its internal add_special_tokens are loaded.
In _training_function under
LLaMA-Factory/src/llamafactory/train/tuner.py,
the function reads the model arguments, data arguments, training arguments, and so on.
| def _training_function(config: dict[str, Any]) -> None:
    args = config.get("args")
    callbacks: list[Any] = config.get("callbacks")
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
    ......
|
The definition of get_train_args is as follows:
| def get_train_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _TRAIN_CLS:
    if is_env_enabled("USE_MCA"):
        model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_mca_args(args)
    else:
        model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
        finetuning_args.use_mca = False
    ......
|
It then calls _parse_train_args, which is defined as follows:
| def _parse_train_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _TRAIN_CLS:
    parser = HfArgumentParser(_TRAIN_ARGS)
    allow_extra_keys = is_env_enabled("ALLOW_EXTRA_ARGS")
    return _parse_args(parser, args, allow_extra_keys=allow_extra_keys)
|
Finally, it calls _parse_args, which is defined as follows:
| def _parse_args(
    parser: "HfArgumentParser", args: Optional[Union[dict[str, Any], list[str]]] = None, allow_extra_keys: bool = False
) -> tuple[Any]:
    args = read_args(args)
    if isinstance(args, dict):
        return parser.parse_dict(args, allow_extra_keys=allow_extra_keys)

    (*parsed_args, unknown_args) = parser.parse_args_into_dataclasses(args=args, return_remaining_strings=True)

    if unknown_args and not allow_extra_keys:
        print(parser.format_help())
        print(f"Got unknown args, potentially deprecated arguments: {unknown_args}")
        raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {unknown_args}")

    return tuple(parsed_args)
|
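The parser created by HfArgumentParser(_TRAIN_ARGS) parses every field of the argument dataclasses listed in _TRAIN_ARGS, including those of model_args. Consequently, add_special_tokens only needs to appear among the arguments handed to the parser, such as the entries of the training configuration file.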
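The idea of parsing a flat configuration dictionary into an argument dataclass can be sketched with the standard library alone. Everything here is illustrative: `ToyModelArguments` and `parse_dict` are hypothetical simplifications, and the comma-splitting in `__post_init__` is an assumption inferred from the `",".join(model_args.add_special_tokens)` call in patch_tokenizer, which suggests the value is held as a list.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional, Union


@dataclass
class ToyModelArguments:
    """Simplified stand-in for the real model argument dataclass."""

    model_name_or_path: str
    add_special_tokens: Optional[Union[str, list[str]]] = None
    resize_vocab: bool = False

    def __post_init__(self) -> None:
        # Assumption: a comma-separated string is normalised into a list of tokens
        if isinstance(self.add_special_tokens, str):
            self.add_special_tokens = [t.strip() for t in self.add_special_tokens.split(",") if t.strip()]


def parse_dict(cls, args: dict[str, Any], allow_extra_keys: bool = False):
    """Build a dataclass from a dict, rejecting unknown keys unless explicitly allowed."""
    known = {f.name for f in fields(cls)}
    unknown = sorted(set(args) - known)
    if unknown and not allow_extra_keys:
        raise ValueError(f"Some keys are not used by the parser: {unknown}")
    return cls(**{key: value for key, value in args.items() if key in known})


toy_model_args = parse_dict(
    ToyModelArguments,
    {"model_name_or_path": "Qwen2.5-3B-Instruct", "add_special_tokens": "[start],[end]"},
)
print(toy_model_args.add_special_tokens)  # ['[start]', '[end]']
```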
4 Example: Adding Special Tokens#
4.1 Adding Directly in the YAML File#
To add special tokens, you only need to include the add_special_tokens parameter in the training configuration file, for example:
| ### model
model_name_or_path: Qwen2.5-3B-Instruct
trust_remote_code: true
add_special_tokens: "[start],[end]"
...
|
4.2 Adding via a Token Configuration File#
Alternatively, the special tokens can be defined in a separate YAML configuration file, for example:
| # SVG Container Tags
"<|START_OF_SVG|>": "Marks the beginning of an SVG document"
"<|END_OF_SVG|>": "Marks the end of an SVG document"
# SVG Group Tags
"<|start_of_g|>": "Begins a group element in SVG for organizing related shapes"
"<|end_of_g|>": "Ends a group element"
|
This file must define each special token together with its description. The training configuration then points to it:
| ### model
model_name_or_path: Qwen2.5-3B-Instruct
trust_remote_code: true
...
# Training config
new_special_tokens_config: examples/extras/multi_tokens/tokens_cfg.yaml
init_special_tokens: desc_init
...
# Inference config
skip_special_tokens: false # Must set to false for structured tokens
...
|
new_special_tokens_config specifies the path to the token configuration file (tokens_cfg.yaml in the example above), while init_special_tokens selects how the embeddings of the new tokens are initialized. The available options include desc_init and desc_init_w_noise, both of which initialize each new token's embedding based on its description.
Note: Loading special tokens from a file takes higher priority than specifying special tokens directly in the configuration file.
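One plausible reading of description-based initialization is that a new token's embedding starts as the average of the embeddings of the tokens in its description, optionally with a little noise added. The sketch below illustrates that idea with plain Python lists; it is an assumption about the general technique, not a reproduction of how LLaMA-Factory implements desc_init or desc_init_w_noise, and the function name and parameters are hypothetical.

```python
import random


def init_from_description(
    description_ids: list[int],
    embedding_table: list[list[float]],
    noise_std: float = 0.0,
    seed: int = 0,
) -> list[float]:
    """Average the embedding rows of the description tokens, optionally adding Gaussian noise."""
    if not description_ids:
        raise ValueError("the description must contain at least one token")
    rows = [embedding_table[i] for i in description_ids]
    dim = len(rows[0])
    mean = [sum(row[d] for row in rows) / len(rows) for d in range(dim)]
    if noise_std > 0:
        rng = random.Random(seed)  # seeded so the sketch is reproducible
        mean = [value + rng.gauss(0.0, noise_std) for value in mean]
    return mean


# Toy embedding table with three existing tokens and two dimensions
table = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]
new_row = init_from_description([0, 1], table)
print(new_row)  # [1.0, 3.0]
noisy_row = init_from_description([0, 1], table, noise_std=0.01)
```

Starting from the description's average places the new token near semantically related tokens, which is the usual motivation for this kind of initialization.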
4.3 Adding via the Graphical User Interface#

In the WebUI, enter the same settings that would normally go in the YAML file into the Extra arguments field; this is equivalent to adding them directly in the YAML file.
5 Validating Special Tokens#
Here, a Pokémon image classification task is used to verify whether the special tokens can be correctly added, and to perform training and inference.
5.1 Preparing the Dataset#
| from huggingface_hub import snapshot_download
repo_id = "fcakyon/pokemon-classification"
local_dir = "./pokemon-classification"
snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)
print("Done!")
|
Use the script above to download the dataset.
Unzip the train.zip file under pokemon-classification/data, then use the script below to generate a JSON file adapted for LLaMA-Factory for training.
| import os
import json

train_dir = "train"
output_file = "pokemon_dataset.json"
image_exts = ('.png', '.jpg', '.jpeg', '.bmp', '.gif')
dataset = []
special_tokens_list = []
# Keep only class directories before taking the first 20
# (os.listdir order is arbitrary, so the selected classes may differ between machines)
class_names = [d for d in os.listdir(train_dir) if os.path.isdir(os.path.join(train_dir, d))]
for class_name in class_names[:20]:
    class_path = os.path.join(train_dir, class_name)
    special_tokens_list.append(class_name)
    for img_file in os.listdir(class_path):
        if not img_file.lower().endswith(image_exts):
            continue
        img_path = os.path.join(class_path, img_file)
        # Each sample has two turns; each <image> placeholder pairs with one entry in "images"
        data_item = {
            "messages": [
                {
                    "role": "user",
                    "content": "<image>Who is this Pokemon?"
                },
                {
                    "role": "assistant",
                    "content": f"[{class_name}]"
                },
                {
                    "role": "user",
                    "content": "What type is it?<image>"
                },
                {
                    "role": "assistant",
                    "content": f"[{class_name}]"
                }
            ],
            "images": [
                img_path,
                img_path
            ]
        }
        dataset.append(data_item)
with open(output_file, "w") as f:
    json.dump(dataset, f, indent=2)
print(f"Generation completed. A total of {len(dataset)} data entries were generated and saved to {output_file}.")
# Join with commas (no trailing comma) so the string can be passed to add_special_tokens as-is
special_tokens = ",".join(f"[{token}]" for token in special_tokens_list)
print(f"special_tokens: {special_tokens}")
|
The resulting JSON file has the following format:
| [
  {
    "messages": [
      {
        "role": "user",
        "content": "<image>Who is this Pokemon?"
      },
      {
        "role": "assistant",
        "content": "[Dratini]"
      },
      {
        "role": "user",
        "content": "What type is it?<image>"
      },
      {
        "role": "assistant",
        "content": "[Dratini]"
      }
    ],
    "images": [
      "train/Dratini/d767470f6a6e44f6b3076282d4d416cf_jpg.rf.0d1a118bbc525e1772ace46ea075ca1e.jpg",
      "train/Dratini/d767470f6a6e44f6b3076282d4d416cf_jpg.rf.0d1a118bbc525e1772ace46ea075ca1e.jpg"
    ]
  }
]
|
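Before training, it can be worth checking that every sample is internally consistent. The sketch below assumes that each `<image>` placeholder in the messages must be paired with exactly one path in images; the helper name `check_image_placeholders` is hypothetical.

```python
def check_image_placeholders(sample: dict, placeholder: str = "<image>") -> bool:
    """Return True when the number of placeholders equals the number of image paths."""
    num_placeholders = sum(message["content"].count(placeholder) for message in sample["messages"])
    return num_placeholders == len(sample.get("images", []))


good_sample = {
    "messages": [
        {"role": "user", "content": "<image>Who is this Pokemon?"},
        {"role": "assistant", "content": "[Dratini]"},
        {"role": "user", "content": "What type is it?<image>"},
        {"role": "assistant", "content": "[Dratini]"},
    ],
    "images": ["train/Dratini/a.jpg", "train/Dratini/a.jpg"],
}
bad_sample = {**good_sample, "images": ["train/Dratini/a.jpg"]}

print(check_image_placeholders(good_sample))  # True
print(check_image_placeholders(bad_sample))   # False
```

Running this over the whole generated file, for instance with `all(check_image_placeholders(s) for s in dataset)`, catches mismatches before they surface as preprocessing errors.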
5.2 Training the Pokémon Multimodal Classification Model#
Copy the generated dataset JSON file and the corresponding train folder into LLaMA-Factory/data. Then, add the following configuration to the LLaMA-Factory/data/dataset_info.json file to register the dataset:
| "pokemon_dataset": {
  "file_name": "pokemon_dataset.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant"
  }
}
|
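Instead of editing the file by hand, the entry can also be merged in with a short script. This is a convenience sketch only: `register_dataset` is a hypothetical helper, and the demo writes to a temporary directory rather than to the real LLaMA-Factory/data/dataset_info.json.

```python
import json
import tempfile
from pathlib import Path

POKEMON_ENTRY = {
    "file_name": "pokemon_dataset.json",
    "formatting": "sharegpt",
    "columns": {"messages": "messages", "images": "images"},
    "tags": {
        "role_tag": "role",
        "content_tag": "content",
        "user_tag": "user",
        "assistant_tag": "assistant",
    },
}


def register_dataset(info_path, name: str, entry: dict) -> dict:
    """Insert or overwrite one dataset entry in a dataset_info-style JSON file."""
    path = Path(info_path)
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    info[name] = entry
    path.write_text(json.dumps(info, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
    return info


# Demo against a throwaway file; point info_path at the real file to use it
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "dataset_info.json"
    register_dataset(demo_path, "pokemon_dataset", POKEMON_ENTRY)
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))

print(sorted(reloaded))  # ['pokemon_dataset']
```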
- Training the Model with Special Tokens
| DISABLE_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=7 USE_MODELSCOPE_HUB=1 llamafactory-cli webui
|
The special tokens used in this task are the names of Pokémon, and add_special_tokens needs to be added under Extra arguments.
| "add_special_tokens":"[Dratini],[Kabuto],[Articuno],[Farfetchd],[Parasect],[Alolan Sandslash],[Gloom],[Jynx],[Muk],[Mew],[Machamp],[Eevee],[Doduo],[Kingler],[Kakuna],[MrMime],[Ninetales],[Golem],[Gyarados],[Dragonite]"
|
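Typing twenty bracketed names by hand is error-prone, so the value can be generated from the class names instead. The sketch assumes the Extra arguments field accepts a JSON object; `build_extra_args` is a hypothetical helper.

```python
import json


def build_extra_args(class_names: list[str]) -> str:
    """Build a JSON string whose add_special_tokens value is a comma-separated token list."""
    tokens = ",".join(f"[{name}]" for name in class_names)
    return json.dumps({"add_special_tokens": tokens}, ensure_ascii=False)


extra_args = build_extra_args(["Dratini", "Kabuto", "Articuno"])
print(extra_args)  # {"add_special_tokens": "[Dratini],[Kabuto],[Articuno]"}
```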

Once added, training can be started.

5.3 Inference Using the Model#
Similarly, “add_special_tokens” needs to be added under Extra arguments.

Input an image for classification. Since the classification labels are special tokens, be sure to uncheck “Skip special tokens”.
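The reason for unchecking that option can be seen in a toy decoder. The sketch below is a simplified stand-in for tokenizer decoding, with a made-up three-token vocabulary; it only illustrates that skipping special tokens removes the class label along with other special tokens such as the end-of-sequence marker.

```python
def decode(ids: list[int], id_to_token: dict[int, str], special_ids: set[int], skip_special_tokens: bool = True) -> str:
    """Join token strings, optionally dropping every token registered as special."""
    pieces = [id_to_token[i] for i in ids if not (skip_special_tokens and i in special_ids)]
    return "".join(pieces)


id_to_token = {0: "</s>", 1: "[Dratini]", 2: "It is "}
special_ids = {0, 1}  # the class label was added as a special token
generated = [2, 1, 0]

print(repr(decode(generated, id_to_token, special_ids, skip_special_tokens=True)))   # 'It is '
print(repr(decode(generated, id_to_token, special_ids, skip_special_tokens=False)))  # 'It is [Dratini]</s>'
```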

The results from the original model are as follows:

This indicates that the model has been properly trained and the special tokens have been successfully learned.