This tutorial demonstrates how to fine-tune a language model using the LLaMA-Factory framework with Direct Preference Optimization (DPO). DPO trains the model directly on human preference data, so its outputs align more closely with human expectations and better serve users.
1 Environment Setup
Software & hardware requirements: the CPU must support Intel AMX, the system glibc version must be ≥ 2.32, and a GPU with at least 32 GB of VRAM is recommended.
Step 1: Create a Conda Environment for KTransformers
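A minimal sketch of this step, assuming the environment name is arbitrary and Python 3.12 (the version used later in this blog):

```bash
# Create and activate a dedicated Conda environment
conda create -n ktransformers python=3.12 -y
conda activate ktransformers
```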
Step 2: Install LLaMA-Factory
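Following the LLaMA-Factory README, a source installation looks roughly like this:

```bash
# Clone LLaMA-Factory and install it in editable mode with the common extras
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```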
Step 3: Install KTransformers
Option 1: Download and install a KTransformers wheel that matches your Torch and Python versions from https://github.com/kvcache-ai/ktransformers/releases/tag/v0.4.4
✨ The CUDA version can be different from the version indicated in the wheel filename.
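For instance, with the wheel used later in this blog (Python 3.12, Torch 2.9, CUDA 12.8):

```bash
# Install the downloaded KTransformers wheel
pip install ./ktransformers-0.4.4+cu128torch29fancy-cp312-cp312-linux_x86_64.whl
```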
❗❗❗ The Python and Torch versions of the wheel must exactly match the current environment (only the CUDA version may differ, as noted above).
Option 2: Install KTransformers from source
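A rough sketch of the source build, based on the KTransformers repository instructions (the exact build script may differ between releases; check the repository README):

```bash
# Clone KTransformers with its submodules and build from source
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init --recursive
bash install.sh   # build script name assumed; verify against the repository README
```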
Step 4: Install Flash-Attention Wheel
Download and install a Flash-Attention wheel that matches your Torch and Python versions from https://github.com/Dao-AILab/flash-attention/releases
❗❗❗ The Python, CUDA, and Torch versions must match the environment, and you must also verify whether the ABI is True or False.
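A quick way to check the ABI flag of the current Torch build, and to install the wheel used later in this blog:

```bash
# Print whether Torch was built with the CXX11 ABI; must match the "abi" tag in the wheel name
python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)"

# Install the downloaded Flash-Attention wheel
pip install ./flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
```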
Step 5: (Optional) Enable flash_infer
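Enabling flash_infer requires the flashinfer library in the same environment. A hedged sketch (the package name and any required wheel index depend on your CUDA/Torch versions; consult the flashinfer documentation):

```bash
# Optional: install flashinfer (package name assumed; verify against the flashinfer docs)
pip install flashinfer-python
```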
2 DPO Training
2.1 Prepare the Model
This blog uses the DeepSeek-V2-Lite-Chat model as an example. You may replace it with another model if needed.
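One way to fetch the weights is via the Hugging Face CLI (the local directory is arbitrary):

```bash
# Download DeepSeek-V2-Lite-Chat from the Hugging Face Hub
huggingface-cli download deepseek-ai/DeepSeek-V2-Lite-Chat --local-dir ./models/DeepSeek-V2-Lite-Chat
```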
2.2 Configure Training Parameter Files
(1) examples/train_lora/deepseek2_lora_dpo_kt.yaml
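The exact file ships with LLaMA-Factory; the sketch below shows a representative DPO + LoRA configuration for DeepSeek-V2-Lite-Chat. Paths, hyperparameters, and the dataset are illustrative, and any KTransformers-specific keys are omitted here and should be taken from the shipped example:

```yaml
### model
model_name_or_path: deepseek-ai/DeepSeek-V2-Lite-Chat
trust_remote_code: true

### method
stage: dpo                 # Direct Preference Optimization
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
pref_beta: 0.1             # DPO beta
pref_loss: sigmoid         # standard DPO loss

### dataset
dataset: dpo_en_demo       # preference dataset bundled with LLaMA-Factory; replace with your own
template: deepseek
cutoff_len: 2048
preprocessing_num_workers: 16

### output
output_dir: saves/deepseek2-lite/lora/dpo
logging_steps: 10
save_steps: 500
plot_loss: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 5.0e-6
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
```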
(2) examples/inference/deepseek2_lora_dpo_kt.yaml
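Similarly, an inference configuration pointing at the trained LoRA adapter might look like the following sketch (the adapter path matches the output_dir above; the backend switch for KTransformers should be copied from the shipped example):

```yaml
model_name_or_path: deepseek-ai/DeepSeek-V2-Lite-Chat
adapter_name_or_path: saves/deepseek2-lite/lora/dpo   # output_dir of the DPO run
template: deepseek
infer_backend: huggingface   # illustrative; switch to the KTransformers backend as documented
trust_remote_code: true
```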
2.3 Train the Model
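Training is launched through the LLaMA-Factory CLI with the configuration file from section 2.2:

```bash
llamafactory-cli train examples/train_lora/deepseek2_lora_dpo_kt.yaml
```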
Training results:
2.4 Model Inference
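Interactive chat with the fine-tuned adapter uses the inference configuration from section 2.2:

```bash
llamafactory-cli chat examples/inference/deepseek2_lora_dpo_kt.yaml
```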
2.5 Use the Model API
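LLaMA-Factory can also expose an OpenAI-compatible API. A minimal sketch, assuming port 8000 and an illustrative model name in the request body:

```bash
# Start the API server
API_PORT=8000 llamafactory-cli api examples/inference/deepseek2_lora_dpo_kt.yaml

# Query it with an OpenAI-style chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek2-lite-dpo", "messages": [{"role": "user", "content": "Hello!"}]}'
```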
Error Examples
Environment Installation Errors
PyTorch, Python, FlashAttention, and CUDA must all be version-compatible. Before installing FlashAttention and KTransformers using wheel packages, check the installed Python and Torch versions with the following command:
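(Assuming pip manages the packages in the active Conda environment.)

```bash
pip list
```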
This will display the versions of all installed packages. Locate the Torch version, for example:
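(Illustrative output; the exact version string depends on how Torch was installed.)

```
torch                    2.9.1+cu128
```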
Based on the Python version and CUDA runtime version installed in Step 1 of the environment setup, you can determine the correct wheel packages for FlashAttention and KTransformers.
Then download the corresponding versions from:
- https://github.com/kvcache-ai/ktransformers/releases/tag/v0.4.4
- https://github.com/Dao-AILab/flash-attention/releases
In this blog, the environment uses:
- Python = 3.12
- Torch = 2.9.1
- CUDA = 12.8
- Architecture = x86
Therefore, the correct wheels to install are:
- ktransformers-0.4.4+cu128torch29fancy-cp312-cp312-linux_x86_64.whl
- flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
The suffixes in the wheel filenames indicate:
- cu: CUDA version
- torch: PyTorch version
- cp: Python version
- cxx: C++ standard
- abi: whether the C++ ABI is enabled
KTransformers Only Supports CPUs with AMX
AMX refers to Intel Advanced Matrix Extensions, a set of hardware-accelerated matrix computation instructions introduced by Intel for server and high-performance CPUs. It is primarily designed for AI, deep learning, and HPC workloads.
You can check whether your CPU supports AMX with the following command:
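For example, by searching the CPU flags reported by the kernel:

```bash
lscpu | grep -o 'amx[^ ]*' | sort -u
# or equivalently
grep -o 'amx[^ ]*' /proc/cpuinfo | sort -u
```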
If the output contains something like:
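(The exact set varies by CPU model; these are the AMX-related flag names typically reported on Linux.)

```
amx_bf16
amx_int8
amx_tile
```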
then your CPU supports AMX.
If no such output appears, the CPU does not support AMX, and you will need to switch to a different machine.