1 Introduction
Easy Dataset is an application designed specifically for creating fine-tuning datasets for large language models (LLMs). It provides an intuitive interface for uploading domain-specific documents, intelligently segmenting content, generating questions, and producing high-quality training data for model fine-tuning. It can call large models through APIs such as OpenAI, DeepSeek, and Volcano Engine, as well as local models served via Ollama.
LLaMA Factory is an open-source, low-code fine-tuning framework for large language models. It integrates the most widely used fine-tuning techniques in the industry and supports zero-code model fine-tuning through a Web UI. It has become one of the most popular fine-tuning frameworks in the open-source community, with over 63K stars on GitHub. It supports full-parameter and LoRA fine-tuning, as well as training algorithms such as SFT and DPO.
This tutorial uses Easy Dataset to construct an SFT fine-tuning dataset from the publicly available financial reports of five internet companies and uses LLaMA Factory to fine-tune the Qwen2.5-3B-Instruct model, enabling the fine-tuned model to learn the knowledge contained in the financial report dataset.
2 System Requirements
- GPU Memory: ≥ 12 GB
- CUDA Version: 11.6 or above
- Python Version: 3.10
3 Generating Fine-Tuning Data with Easy Dataset
3.1 Installing Easy Dataset
Method 1: Using the Installation Package
If your operating system is Windows, macOS, or an ARM-based Unix system, you can download the installation package directly from the Easy Dataset releases page: https://github.com/ConardLi/easy-dataset/releases/latest
Method 2: Using Dockerfile
1. Clone the Easy Dataset Repository from GitHub
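Using the repository URL linked at the end of this tutorial:

```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```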
2. Build the Docker Image
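A minimal build command; the image tag easy-dataset is an arbitrary choice:

```bash
docker build -t easy-dataset .
```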
3. Run the Container
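A sketch of the run command, using the port and mount path described below; the -d and --name flags are optional conveniences:

```bash
docker run -d \
  -p 1717:1717 \
  -v {YOUR_LOCAL_DB_PATH}:/app/local-db \
  --name easy-dataset \
  easy-dataset
```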
Replace {YOUR_LOCAL_DB_PATH} with a local directory to mount as /app/local-db inside the container. After the container starts, open http://localhost:1717 in your browser to use Easy Dataset's web interface.
Method 3: Using NPM
1. Install Node.js and pnpm
Visit the official websites to install Node.js and pnpm: https://nodejs.org/en/download | https://pnpm.io/
Check that the Node.js version is 18.0 or above:
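For example:

```bash
node -v   # should print v18.0.0 or newer
pnpm -v
```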
2. Clone the Easy Dataset Repository from GitHub
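Using the repository URL from above:

```bash
git clone https://github.com/ConardLi/easy-dataset.git
```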
3. Install Dependencies
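Assuming pnpm from step 1 (npm works as well):

```bash
cd easy-dataset
pnpm install
```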
4. Start the Easy Dataset Application
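A sketch assuming the repository's standard build and start scripts:

```bash
pnpm build
pnpm start
```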
Once the console reports that the server is running, the application has started successfully. Open your browser and visit http://localhost:1717 to access the Easy Dataset interface.
3.2 Sample Data Download
This tutorial provides a set of financial reports from internet companies as sample data, including the Q2 2024 reports of five Chinese internet companies in TXT and Markdown formats. You can download them using Git or by visiting the repository directly.
All data are in plain text format.
3.3 Fine-Tuning Data Generation
Create Project and Configure Parameters
1. After opening the Easy Dataset homepage in your browser, click Create Project.
2. First, enter the Project Name (required); the other two fields can be left blank. Then click Create Project to confirm.
3. After the project is created, you will be redirected to the Project Settings page. Open Model Configuration and select the large model API to be used for data generation.
4. Here, we use the DeepSeek model as an example. Enter the model provider and model name, provide the API Key, and click Save to store the configuration locally. Then select the configured model from the top-right corner. The API Key must be obtained from the model provider and must be valid for that provider's service.
5. Open the Task Configuration page and set the text segmentation length to a minimum of 500 and a maximum of 2000 characters. In the question generation settings, set it to generate one question per 10 characters. After making the changes, click Save Task Configuration at the bottom of the page.
Process Data Files
1. Open the Document Processing page and select a model.
2. After selecting the files, click Upload and Process Files.
3. After uploading, the large model parses and segments the file content. Please wait for processing to complete; the sample data usually takes around 2 minutes.
Generate Fine-Tuning Data
1. Once file processing is complete, you can see the resulting text segments. Select all of them and click Auto Generate.
2. The large model will then generate questions based on the text segments. Please wait for the process to complete; depending on API speed, it usually takes around 2 minutes.
Export Dataset to LLaMA Factory
1. After all answers have been generated, open the Dataset Management page and click Export Dataset.
You can see the task in progress in the background. Wait approximately 2 minutes for it to complete.
2. Export the dataset on the Single-Turn QA Dataset page.
3. In the export configuration, select Use in LLaMA Factory, then click Update LLaMA Factory Configuration. This generates a configuration file in the corresponding folder. Click the Copy button to copy the configuration path to the clipboard.
4. In the folder at the copied configuration path, you will find the generated data files. The main files to note are:
a. dataset_info.json: The dataset configuration file required by LLaMA Factory
b. alpaca.json: The dataset file organized in Alpaca format
c. sharegpt.json: The dataset file organized in ShareGPT format
Either format can be used for fine-tuning; the two files contain the same data, just organized differently.
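For reference, here is a minimal sketch of what the exported files look like; the field values are illustrative, not taken from the actual export. An Alpaca-format record in alpaca.json:

```json
[
  {
    "instruction": "A question generated from a text segment",
    "input": "",
    "output": "The answer generated by the configured model"
  }
]
```

dataset_info.json registers the dataset so LLaMA Factory can find it; the dataset name Easy Dataset assigns may differ from the placeholder used here:

```json
{
  "easy-dataset-export": {
    "file_name": "alpaca.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
}
```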
4 Fine-Tune the Qwen2.5-3B-Instruct Model Using LLaMA Factory
4.1 Install LLaMA Factory
1. Create a Virtual Environment for the Experiment (Optional)
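A minimal sketch using conda; the environment name is arbitrary, and Python 3.10 matches the system requirements above:

```bash
conda create -n llama-factory python=3.10 -y
conda activate llama-factory
```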
2. Clone the LLaMA Factory Repository from GitHub and Install Environment Dependencies
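The installation commands follow the LLaMA Factory README:

```bash
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```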
3. Run llamafactory-cli version to verify the installation. If the current LLaMA Factory version is displayed, the installation was successful.
4.2 Start the Fine-Tuning Task
1. After confirming that LLaMA Factory has been installed successfully, run the following command to launch LLaMA Board.
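The launch command, using the environment variables explained below:

```bash
CUDA_VISIBLE_DEVICES=0 USE_MODELSCOPE_HUB=1 llamafactory-cli webui
```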
Environment variable explanation:
- CUDA_VISIBLE_DEVICES: Specifies the GPU device index to use. By default, all GPUs are used.
- USE_MODELSCOPE_HUB: Enables accelerated model downloads from the ModelScope Hub (China). Disabled by default.
After a successful startup, the console prints the local address. Open http://localhost:7860 in your browser to access the Web UI.
2. After entering the Web UI, select the model Qwen2.5-3B-Instruct. You may specify a local absolute path for the model; if left blank, the model will be downloaded from the internet.
3. Set the dataset path to the configuration path exported from Easy Dataset, and select the Alpaca-format dataset.
4. To help the model learn the dataset more effectively, set the learning rate to 1e-4 and increase the number of training epochs to 8. Adjust the batch size and gradient accumulation according to the available GPU memory; if memory allows, a larger batch size speeds up training. In general, keep Batch Size × Gradient Accumulation × Number of GPUs = 32. For example, on a single GPU, a batch size of 4 with gradient accumulation of 8 gives 4 × 8 × 1 = 32.
5. Click Other Parameters and set the save interval to 50. Saving more checkpoints makes it easier to observe how the model's performance changes over the course of training.
6. Click LoRA Parameter Settings, set the LoRA rank to 16, and set the LoRA scaling factor to 32. The full set of training parameters is summarized in the config sketch after this list.
7. Click the Start button and wait for the model to download; after some time you should be able to observe the loss curve during training.
8. Wait for the model training to complete. Depending on GPU performance, training may take 20 to 60 minutes.
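For reproducibility, the same settings can also be expressed as a training config and run with llamafactory-cli train. This is a sketch under assumptions: the dataset name and dataset_dir must match your exported dataset_info.json, and output_dir is an arbitrary choice.

```yaml
# sft.yaml -- LoRA SFT settings matching the Web UI choices above
model_name_or_path: Qwen/Qwen2.5-3B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 16
lora_alpha: 32
dataset: easy-dataset-export          # assumed name; see dataset_info.json
dataset_dir: /path/to/exported/config # folder containing dataset_info.json
template: qwen
cutoff_len: 2048
learning_rate: 1.0e-4
num_train_epochs: 8.0
per_device_train_batch_size: 4
gradient_accumulation_steps: 8
save_steps: 50
output_dir: saves/qwen2.5-3b/lora/sft
```

Run it with `llamafactory-cli train sft.yaml`.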
4.3 Validate Fine-Tuning Results
1. Set the Checkpoint Path to the output directory from training, open the Chat page, and click Load Model.
2. Enter a question in the chat box below and click Submit to interact with the model. Compared against the source financial reports, the fine-tuned model gives correct answers.
3. Click Unload Model to unload the fine-tuned model, clear the Checkpoint Path, and click Load Model to load the original pre-trained model.
4. Enter the same question and interact with the model. You will find that the original model answers incorrectly, which demonstrates that the fine-tuning was effective.
The fine-tuning effect on the 3B model is relatively limited; it is used here only for demonstration. For better results, try the 7B or 14B models when sufficient resources are available.
You are welcome to follow the GitHub repositories:
- Easy Dataset: https://github.com/ConardLi/easy-dataset
- LLaMA Factory: https://github.com/hiyouga/LLaMA-Factory