1 Introduction

This article uses the Ministral-3-3B-Instruct-2512 model and an SFT-fine-tuned image classification task as an example to show how to add new special tokens in LLaMA-Factory. The experimental command is as follows:

# install newest transformers
pip install git+https://github.com/huggingface/transformers

DISABLE_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=7 python src/train.py examples/train_lora/ministral3_lora_sft.yaml

The ministral3_lora_sft.yaml configuration file needs to be prepared in advance.

2 Dataset Loading and Preprocessing

In the file LLaMA-Factory/src/llamafactory/data/loader.py, the get_dataset function is responsible for loading the dataset and preprocessing the data using the tokenizer.

2.1 Data Loading

The following code is part of the LLaMA-Factory/src/llamafactory/data/loader.py:get_dataset function. It handles reading the data and converting it into the required format.

# Load and preprocess dataset
with training_args.main_process_first(desc="load dataset", local=(not data_args.data_shared_file_system)):
    dataset = _get_merged_dataset(data_args.dataset, model_args, data_args, training_args, stage)
    eval_dataset = _get_merged_dataset(
        data_args.eval_dataset,
        model_args,
        data_args,
        training_args,
        stage,
        return_dict=data_args.eval_on_each_dataset,
    )

The loaded data are stored in dataset, and each example is transformed into the following format:

[
    {
        '_prompt': [{'role': 'user', 'content': 'Transform the following sentence using a synonym: The car sped quickly.'}],
        '_response': [{'role': 'assistant', 'content': 'The car accelerated rapidly.'}],
        '_system': '',
        '_tools': '',
        '_images': None, 
        '_videos': None, 
        '_audios': None
    }
]
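
For reference, here is a minimal sketch of the kind of alignment performed on a raw alpaca-style record to reach the internal format shown above. This is illustrative only and is not LLaMA-Factory's actual converter code; the field handling is an assumption.

# Minimal sketch (not LLaMA-Factory's actual converter) of aligning a raw
# alpaca-style record into the internal format shown above.
def align_alpaca_example(example: dict) -> dict:
    prompt = example.get("instruction", "")
    if example.get("input"):
        prompt += "\n" + example["input"]
    return {
        "_prompt": [{"role": "user", "content": prompt}],
        "_response": [{"role": "assistant", "content": example.get("output", "")}],
        "_system": example.get("system", ""),
        "_tools": "",
        "_images": None,
        "_videos": None,
        "_audios": None,
    }

print(align_alpaca_example({
    "instruction": "Transform the following sentence using a synonym: The car sped quickly.",
    "output": "The car accelerated rapidly.",
}))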

2.2 Data Preprocessing

The data preprocessing code is located in LLaMA-Factory/src/llamafactory/data/loader.py:get_dataset, as shown below:

with training_args.main_process_first(desc="pre-process dataset", local=(not data_args.data_shared_file_system)):
    dataset = _get_preprocessed_dataset(
        dataset, data_args, training_args, stage, template, tokenizer, processor, is_eval=False
    )

This code converts the JSON-format data into formatted text sequences. For example:

'_prompt': [{'role': 'user', 'content': 'Transform the following sentence using a synonym: The car sped quickly.'}]

is converted to

'<|im_start|>user\nTransform the following sentence using a synonym: The car sped quickly.<|im_end|>\n<|im_start|>assistant\n'

Then, the sequence is converted into token IDs, and the function call flow is as follows:

_get_preprocessed_dataset → SupervisedDatasetProcessor.preprocess_dataset → SupervisedDatasetProcessor._encode_data_example → SupervisedDatasetProcessor.template.encode_multiturn → Template._encode

Template._encode performs the conversion from sequences to token IDs. The code is as follows:

def _encode(
    self,
    tokenizer: "PreTrainedTokenizer",
    messages: list[dict[str, str]],
    system: Optional[str],
    tools: Optional[str],
) -> list[list[int]]:
    r"""Encode formatted inputs to pairs of token ids.

    Turn 0: prefix + system + query        resp
    Turn t: query                          resp.
    """
    system = system or self.default_system
    encoded_messages = []
    for i, message in enumerate(messages):
        elements = []

        if i == 0:
            elements += self.format_prefix.apply()
            if system or tools:
                tool_text = self.format_tools.apply(content=tools)[0] if tools else ""
                elements += self.format_system.apply(content=(system + tool_text))

        if message["role"] == Role.USER:
            elements += self.format_user.apply(content=message["content"], idx=str(i // 2))
        elif message["role"] == Role.ASSISTANT:
            elements += self.format_assistant.apply(content=message["content"])
        elif message["role"] == Role.OBSERVATION:
            elements += self.format_observation.apply(content=message["content"])
        elif message["role"] == Role.FUNCTION:
            elements += self.format_function.apply(
                content=message["content"], thought_words=self.thought_words, tool_call_words=self.tool_call_words
            )
        else:
            raise NotImplementedError("Unexpected role: {}".format(message["role"]))

        encoded_messages.append(self._convert_elements_to_ids(tokenizer, elements))

    return encoded_messages

This function first applies the template formatters to build elements, and then uses the tokenizer to convert those elements into token IDs.
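
To make the two-step process concrete, here is a minimal sketch that mirrors the format-then-tokenize idea. The tokenizer name and the <|im_start|>/<|im_end|> markers are assumptions taken from the formatted example above; this is not Template._encode itself.

from transformers import AutoTokenizer

# Illustrative only: format the message with chat markers, then tokenize.
# The model name and markers are assumptions, not LLaMA-Factory internals.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

user_content = "Transform the following sentence using a synonym: The car sped quickly."
formatted = f"<|im_start|>user\n{user_content}<|im_end|>\n<|im_start|>assistant\n"

# add_special_tokens=False: the chat markers are already part of the string
token_ids = tokenizer.encode(formatted, add_special_tokens=False)
print(token_ids[:10])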

3 Special Tokens Parameter Passing

Adding special tokens requires using the add_special_tokens interface of the tokenizer, for example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

special_tokens_dict = {
    "additional_special_tokens": [
        "<start>",
        "<end>",
    ]
}

num_added = tokenizer.add_special_tokens(special_tokens_dict)
print("Added tokens:", num_added)

Therefore, adding special tokens in LLaMA-Factory comes down to registering the required tokens with the tokenizer.
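
A quick way to check that the registration worked is to verify that each new token maps to a single token ID instead of being split into sub-word pieces. The model name below is an assumption for illustration.

from transformers import AutoTokenizer

# Sanity check: an added special token should survive tokenization as one piece.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
tokenizer.add_special_tokens({"additional_special_tokens": ["<start>", "<end>"]})

print(tokenizer.tokenize("<start>hello<end>"))     # e.g. ['<start>', 'hello', '<end>']
print(tokenizer.convert_tokens_to_ids("<start>"))  # a single id beyond the original vocab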

3.1 Tokenizer Loading Method

In run_sft under LLaMA-Factory/src/llamafactory/train/sft/workflow.py, the tokenizer is loaded.

def run_sft(
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    finetuning_args: "FinetuningArguments",
    generating_args: "GeneratingArguments",
    callbacks: Optional[list["TrainerCallback"]] = None,
):
    tokenizer_module = load_tokenizer(model_args)
    ......

The function call path is: load_tokenizer → patch_tokenizer.

def patch_tokenizer(tokenizer: "PreTrainedTokenizer", model_args: "ModelArguments") -> None:
    if "PreTrainedTokenizerBase" not in str(tokenizer._pad.__func__):
        tokenizer._pad = MethodType(PreTrainedTokenizerBase._pad, tokenizer)

    ......

    if model_args.add_special_tokens is not None:
        num_added_special_tokens = tokenizer.add_tokens(new_tokens=model_args.add_special_tokens, special_tokens=True)
        logger.info_rank0(
            "Add special tokens {} to tokenizer's vocabulary.".format(",".join(model_args.add_special_tokens))
        )
        if num_added_special_tokens > 0 and not model_args.resize_vocab:
            model_args.resize_vocab = True
            logger.warning_rank0("New special tokens have been added, changed `resize_vocab` to True.")

As the code shows, if model_args provides the add_special_tokens parameter, those tokens are added to the tokenizer as special tokens, and resize_vocab is forced to True so that the model's embedding matrix is resized to cover them.
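
The resize_vocab flag matters because newly added tokens receive IDs beyond the model's current embedding matrix. Below is a minimal sketch of the equivalent manual steps in plain transformers, not LLaMA-Factory's exact code path; the model name is an assumption.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Equivalent manual steps in plain transformers (illustrative only).
model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

num_added = tokenizer.add_tokens(["[start]", "[end]"], special_tokens=True)
if num_added > 0:
    # Grow the embedding matrix (and tied output head) to cover the new token IDs
    model.resize_token_embeddings(len(tokenizer))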

3.2 Model Arguments Loading Method

Now that we understand how the tokenizer is loaded, the key question becomes how model_args and its internal add_special_tokens are loaded.

In _training_function under LLaMA-Factory/src/llamafactory/train/tuner.py, the function reads the model arguments, data arguments, training arguments, and so on.

def _training_function(config: dict[str, Any]) -> None:
    args = config.get("args")
    callbacks: list[Any] = config.get("callbacks")
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
    ......

The definition of get_train_args is as follows:

def get_train_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _TRAIN_CLS:
    if is_env_enabled("USE_MCA"):
        model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_mca_args(args)
    else:
        model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
        finetuning_args.use_mca = False
    ......

Then it calls _parse_train_args, which is defined as follows:

def _parse_train_args(args: Optional[Union[dict[str, Any], list[str]]] = None) -> _TRAIN_CLS:
    parser = HfArgumentParser(_TRAIN_ARGS)
    allow_extra_keys = is_env_enabled("ALLOW_EXTRA_ARGS")
    return _parse_args(parser, args, allow_extra_keys=allow_extra_keys)

Finally, it calls _parse_args, which is defined as follows:

def _parse_args(
    parser: "HfArgumentParser", args: Optional[Union[dict[str, Any], list[str]]] = None, allow_extra_keys: bool = False
) -> tuple[Any]:
    args = read_args(args)
    if isinstance(args, dict):
        return parser.parse_dict(args, allow_extra_keys=allow_extra_keys)

    (*parsed_args, unknown_args) = parser.parse_args_into_dataclasses(args=args, return_remaining_strings=True)

    if unknown_args and not allow_extra_keys:
        print(parser.format_help())
        print(f"Got unknown args, potentially deprecated arguments: {unknown_args}")
        raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {unknown_args}")

    return tuple(parsed_args)

parser: "HfArgumentParser" parses all parameters defined in _TRAIN_ARGS within parser = HfArgumentParser(_TRAIN_ARGS), including model_args.

4 Example: Adding Special Tokens

4.1 Adding Special Tokens Directly in the YAML File

To add special tokens, you only need to include the add_special_tokens parameter in the training configuration file, for example:

### model
model_name_or_path: Qwen2.5-3B-Instruct
trust_remote_code: true
add_special_tokens: "[start],[end]"
...
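
The value is written as a single comma-separated string; LLaMA-Factory splits it into individual tokens before handing them to the tokenizer. A minimal sketch of that step (the exact splitting logic inside ModelArguments is assumed here, not copied):

# Assumed splitting step: one comma-separated string -> list of token strings
add_special_tokens = "[start],[end]"
token_list = [token.strip() for token in add_special_tokens.split(",")]
print(token_list)  # ['[start]', '[end]']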

4.2 Adding Special Tokens via the new_special_tokens_config File Parameter

This method requires a separate token-configuration YAML file, for example:

# SVG Container Tags
"<|START_OF_SVG|>": "Marks the beginning of an SVG document"
"<|END_OF_SVG|>": "Marks the end of an SVG document"

# SVG Group Tags
"<|start_of_g|>": "Begins a group element in SVG for organizing related shapes"
"<|end_of_g|>": "Ends a group element"

In this file, both the special tokens and their corresponding descriptions must be defined. The training configuration then references this file and selects an initialization method:

### model
model_name_or_path: Qwen2.5-3B-Instruct
trust_remote_code: true
...

# Training config
new_special_tokens_config: examples/extras/multi_tokens/tokens_cfg.yaml
init_special_tokens: desc_init
...

# Inference config
skip_special_tokens: false  # Must set to false for structured tokens
...

new_special_tokens_config specifies the path to the token-configuration YAML file, while init_special_tokens selects how the embeddings of the new special tokens are initialized. The available options are desc_init and desc_init_w_noise; both derive each new token's initial embedding from its description, with desc_init_w_noise additionally adding random noise.
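
The general idea behind description-based initialization can be sketched as follows: average the embeddings of the description's tokens and use that as the new token's starting embedding. This is a conceptual sketch only, not LLaMA-Factory's implementation, and the model name is an assumption.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Conceptual sketch of desc_init, NOT LLaMA-Factory's actual implementation.
model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

new_token = "<|START_OF_SVG|>"
description = "Marks the beginning of an SVG document"

tokenizer.add_tokens([new_token], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    embeddings = model.get_input_embeddings().weight
    desc_ids = tokenizer.encode(description, add_special_tokens=False)
    new_id = tokenizer.convert_tokens_to_ids(new_token)
    # Initialize the new row with the mean embedding of the description tokens
    embeddings[new_id] = embeddings[desc_ids].mean(dim=0)
    # desc_init_w_noise would additionally perturb it, e.g.:
    # embeddings[new_id] += 0.01 * torch.randn_like(embeddings[new_id])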

Note: Special tokens loaded from this file take precedence over special tokens specified directly via add_special_tokens in the configuration file.

4.3 Adding via the Graphical User Interface

(Screenshot: the LLaMA-Factory WebUI)

Simply add the content that would normally be specified in the YAML file under Extra arguments; this method is equivalent to adding it directly in the YAML file.

5 Validating Special Tokens

Here, a Pokémon image classification task is used to verify whether the special tokens can be correctly added, and to perform training and inference.

5.1 Preparing the Dataset

from huggingface_hub import snapshot_download

repo_id = "fcakyon/pokemon-classification"
local_dir = "./pokemon-classification"

snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)
print("Done!")

Use the script above to download the dataset.

Unzip the train.zip file under pokemon-classification/data (a scripted alternative is sketched below), then use the generation script that follows to produce a JSON file in the format LLaMA-Factory expects for training.
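
A minimal scripted alternative for the unzip step (paths are assumptions based on the download layout above):

import zipfile

# Extract train.zip next to where it was downloaded (paths are assumptions)
with zipfile.ZipFile("pokemon-classification/data/train.zip") as zf:
    zf.extractall("pokemon-classification/data")

The generation script is as follows: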

import os
import json

train_dir = "train"
output_file = "pokemon_dataset.json"

dataset = []

special_tokens_list = []

# Use only the first 20 Pokémon classes to keep the example small
for class_name in os.listdir(train_dir)[:20]:
    class_path = os.path.join(train_dir, class_name)
    if not os.path.isdir(class_path):
        continue

    special_tokens_list.append(class_name)

    for img_file in os.listdir(class_path):
        if not img_file.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.gif')):
            continue

        img_path = os.path.join(class_path, img_file)

        data_item = {
            "messages": [
                {
                    "role": "user",
                    "content": "<image>Who is this Pokemon?"
                },
                {
                    "role": "assistant",
                    "content": f"[{class_name}]"
                },
                {
                    "role": "user",
                    "content": "What type is it?<image>"
                },
                {
                    "role": "assistant",
                    "content": f"[{class_name}]"
                }
            ],
            "images": [
                img_path,
                img_path
            ]
        }

        dataset.append(data_item)

with open(output_file, "w") as f:
    json.dump(dataset, f, indent=2)

print(f"Generation completed. A total of {len(dataset)} data entries were generated and saved to {output_file}.")
special_tokens = ""
for token in special_tokens_list:
    special_tokens += f"[{token}],"
print(f"special_tokens: {special_tokens}.")

The resulting JSON file has the following format:

[
  {
    "messages": [
      {
        "role": "user",
        "content": "<image>Who is this Pokemon?"
      },
      {
        "role": "assistant",
        "content": "[Dratini]"
      },
      {
        "role": "user",
        "content": "What type is it?<image>"
      },
      {
        "role": "assistant",
        "content": "[Dratini]"
      }
    ],
    "images": [
      "train/Dratini/d767470f6a6e44f6b3076282d4d416cf_jpg.rf.0d1a118bbc525e1772ace46ea075ca1e.jpg",
      "train/Dratini/d767470f6a6e44f6b3076282d4d416cf_jpg.rf.0d1a118bbc525e1772ace46ea075ca1e.jpg"
    ]
  }
]
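
Before registering the dataset, a quick sanity check on the generated file can catch missing image paths early (illustrative, standard library only; it assumes the script above was run in the current directory):

import json
import os

# Sanity-check the generated dataset before registering it with LLaMA-Factory
with open("pokemon_dataset.json") as f:
    data = json.load(f)

missing = [p for item in data for p in item["images"] if not os.path.exists(p)]
print(f"{len(data)} examples, {len(missing)} missing image paths")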

5.2 Training the Pokémon Multimodal Classification Model

  • Registering the Dataset

Copy the generated dataset JSON file and the corresponding train folder into LLaMA-Factory/data. Then, add the following configuration to the LLaMA-Factory/data/dataset_info.json file to register the dataset:

"pokemon_dataset": {
    "file_name": "pokemon_dataset.json",
    "formatting": "sharegpt",
    "columns": {
        "messages": "messages",
        "images": "images"
    },
    "tags": {
        "role_tag": "role",
        "content_tag": "content",
        "user_tag": "user",
        "assistant_tag": "assistant"
    }
}

  • Training the Model with Special Tokens

Launch the LLaMA-Factory WebUI:
DISABLE_VERSION_CHECK=1 CUDA_VISIBLE_DEVICES=7 USE_MODELSCOPE_HUB=1 llamafactory-cli webui

The special tokens used in this task are the names of Pokémon, and add_special_tokens needs to be added under Extra arguments.

"add_special_tokens":"[Dratini],[Kabuto],[Articuno],[Farfetchd],[Parasect],[Alolan Sandslash],[Gloom],[Jynx],[Muk],[Mew],[Machamp],[Eevee],[Doduo],[Kingler],[Kakuna],[MrMime],[Ninetales],[Golem],[Gyarados],[Dragonite]"

(Screenshot: add_special_tokens entered in the WebUI Extra arguments field)

Once added, training can be started.

(Screenshot: training in progress in the WebUI)

5.3 Inference Using the Model

Similarly, “add_special_tokens” needs to be added under Extra arguments.

(Screenshot: Extra arguments configured for inference)

Input an image for classification. Since the classification labels are special tokens, be sure to uncheck “Skip special tokens”.
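
Unchecking this option corresponds to decoding with skip_special_tokens=False; with it enabled, the predicted label token would typically be stripped from the decoded output. A minimal illustration follows (tokenizer name and token are assumptions):

from transformers import AutoTokenizer

# Why "Skip special tokens" must be unchecked: special tokens are dropped
# from the decoded text when skip_special_tokens=True.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
tokenizer.add_tokens(["[Dratini]"], special_tokens=True)

ids = tokenizer.encode("[Dratini]", add_special_tokens=False)
print(tokenizer.decode(ids, skip_special_tokens=False))  # "[Dratini]"
print(tokenizer.decode(ids, skip_special_tokens=True))   # typically empty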

(Screenshot: classification output of the fine-tuned model)

For comparison, the results from the original (non-fine-tuned) model are as follows:

(Screenshot: classification output of the original model)

Comparing the two outputs shows that the fine-tuned model has been properly trained and has successfully learned the new special tokens.