At the beginning of 2026, from the Consumer Electronics Show (CES) in Las Vegas to the China Central Television (CCTV) Spring Festival Gala, China's self-developed humanoid robots have repeatedly gone viral. Products and demonstrations from multiple Chinese enterprises have not only sparked discussion in the overseas industry but have also swept across global social media platforms and international media coverage. Embodied intelligence is widely regarded as the next stage of artificial intelligence development; its core is a deep coupling between the intelligent "brain" and the physical "body," directly transforming data, algorithms, and computing power into the ability to act on and reshape the physical world. Humanoid robots, with their human-like appearance and functionality, are considered a high-level form and an ideal carrier of embodied intelligence, poised to become the next-generation super terminal after smartphones and new energy vehicles.
LlamaFactory is an open-source, low-code large model fine-tuning framework. It integrates the most widely used fine-tuning techniques in the industry and supports zero-code fine-tuning of large models through a Web UI. It has become one of the most popular fine-tuning frameworks in the open-source community, with nearly 70,000 GitHub stars.
The Tongyi Qianwen team has open-sourced the new-generation multimodal large model Qwen3.5. This tutorial shows how to use LlamaFactory to fine-tune the open-source Qwen3.5-9B model for a specific task: identifying humanoid robot models. Through this practice, we aim to demonstrate how lightweight large models can empower embodied-intelligence applications, enabling robots not only to "see" but also to "understand," and to contribute a practical example from the open-source community to this global wave of intelligent robotics.
Runtime Environment Requirements
- It is recommended to have a GPU with at least 32 GB of video memory.
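Before starting, it can be worth confirming how much video memory is actually available. The helper below parses the output of `nvidia-smi` (a minimal sketch; it assumes the NVIDIA driver is installed and sums memory across all visible GPUs):

```python
import subprocess

def parse_vram_mib(line: str) -> int:
    """Parse one line of `nvidia-smi --query-gpu=memory.total
    --format=csv,noheader` output, e.g. '32768 MiB' -> 32768."""
    return int(line.strip().split()[0])

def total_vram_gib() -> float:
    """Sum total memory across all visible GPUs, in GiB."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
        text=True,
    )
    mib = sum(parse_vram_mib(l) for l in out.splitlines() if l.strip())
    return mib / 1024
```

Calling `total_vram_gib()` on the training machine and comparing the result against the recommended 32 GB gives a quick sanity check before downloading a 9B-parameter model.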
1. Install LlamaFactory
Clone LlamaFactory to your local machine and install its environment dependencies:

```shell
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
```
Run the following command. If it prints the LlamaFactory version, the installation was successful:

```shell
llamafactory-cli version
```
2. Prepare the Dataset
Manus is a general-purpose AI agent focused on executing complex tasks, capable of completing them autonomously end to end, from planning to execution. We used Manus to automate the construction of a data-scraping workflow, which is far more efficient than writing crawler scripts by hand. For example, a prompt along the following lines (illustrative, not the original) can accomplish dataset acquisition:

```text
Collect images of the humanoid robots that recently appeared at CES and the
CCTV Spring Festival Gala. For each image, record the manufacturer and the
model name, and save the images together with a JSON file that pairs every
image with a one-sentence identification of the robot.
```
This tutorial provides a ready-made conversation dataset. The link is: mllm_robot.zip. The samples are in a single-turn conversation format, 405 samples in total, each consisting of one user instruction and one model response. During fine-tuning, the model continuously learns the response style of the samples, thereby learning to identify the robots. A sample of the data is shown below:
The layout follows LlamaFactory's sharegpt multimodal format (file path and wording here are illustrative):

```json
{
  "messages": [
    {
      "role": "user",
      "content": "<image>Which robot appears in this picture?"
    },
    {
      "role": "assistant",
      "content": "This is the MagicBot Z1 humanoid robot developed by MagicLab."
    }
  ],
  "images": ["mllm_robot/images/0001.jpg"]
}
```
You can download this dataset, unzip it under LlamaFactory/data, and add the following entries to the dataset_info.json file:
The entries below follow the format LlamaFactory uses for its bundled multimodal demo datasets (the file names are assumptions; match them to the files inside the zip):

```json
"mllm_robot": {
  "file_name": "mllm_robot.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  }
},
"mllm_robot_en": {
  "file_name": "mllm_robot_en.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages",
    "images": "images"
  }
}
```
This allows LlamaFactory to recognize the newly added dataset.
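Before launching training, it can be worth sanity-checking the samples you registered. The sketch below assumes the sharegpt-style multimodal layout (a `messages` list plus an `images` list per sample); adjust the field names if your copy of the dataset differs:

```python
import json

def check_sample(sample: dict) -> list[str]:
    """Return a list of problems found in one single-turn multimodal sample."""
    problems = []
    messages = sample.get("messages", [])
    if len(messages) != 2:
        problems.append("expected exactly one user turn and one assistant turn")
    roles = [m.get("role") for m in messages]
    if roles and roles != ["user", "assistant"]:
        problems.append(f"unexpected role order: {roles}")
    # Every image in the list should have a matching <image> tag in the text.
    n_tags = sum(m.get("content", "").count("<image>") for m in messages)
    if n_tags != len(sample.get("images", [])):
        problems.append("<image> tag count does not match the images list")
    return problems

def check_dataset(path: str) -> int:
    """Validate every sample in a JSON dataset file; return the error count."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    errors = 0
    for i, sample in enumerate(data):
        for p in check_sample(sample):
            print(f"sample {i}: {p}")
            errors += 1
    return errors
```

Running `check_dataset("data/mllm_robot.json")` (path is an assumption) before training catches malformed samples early, which is much cheaper than discovering them mid-run.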
3. Model Fine-Tuning
3.1 Launch the Web UI
After completing the preliminary preparations, launch the Web UI with the following command:

```shell
llamafactory-cli webui
```
Open the returned URL in a browser to reach the Web UI page.
3.2 Configure Parameters
After entering the Web UI, you can switch the interface language as needed. First, configure the model: this tutorial selects the Qwen3.5-9B model and sets the fine-tuning method to LoRA.
For the dataset, select mllm_robot and mllm_robot_en. Set the learning rate to 1e-4 and the number of epochs to 5.
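LoRA (Low-Rank Adaptation) freezes the pretrained weight matrix W and learns only a low-rank update, so the effective weight becomes W + (α/r)·B·A for small trainable matrices A and B of rank r. A minimal NumPy sketch of the idea (an illustration of the technique, not LlamaFactory's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 64, 64, 8, 16   # rank r is much smaller than d

W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable rank-r down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, init to zero

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = (W + (alpha / r) * B @ A) @ x  -- only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, the LoRA branch is inactive at step 0,
# so the adapted model starts out identical to the base model.
assert np.allclose(lora_forward(x), W @ x)
```

Because only A and B (2·d·r parameters per layer instead of d²) are trained, LoRA needs far less optimizer memory than full fine-tuning, which is why a single 32 GB GPU can handle a 9B model.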
3.3 Start Fine-Tuning
Set the output directory to train_qwen3_5_9B; the trained model weights will be saved there. Clicking "Preview Command" displays all the configured parameters; if you prefer to run the fine-tuning from the command line, copy this command and execute it there.
After starting fine-tuning, you need to wait a while for the base model to download; you can then follow the training progress and the loss curve in the interface. On an RTX 5090, fine-tuning takes approximately 30 minutes. The message "Training Finished" indicates the run completed successfully.
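The loss curve shown in the Web UI is driven by a JSON-lines training log written to the output directory, and you can also inspect it yourself. A sketch (the `trainer_log.jsonl` file name and the `loss`/`current_steps` field names are assumptions about the log format; check your output directory):

```python
import json

def read_losses(path: str) -> list[tuple[int, float]]:
    """Extract (step, loss) pairs from a JSON-lines training log,
    skipping entries (e.g. eval records) that carry no loss value."""
    points = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if "loss" in entry:
                points.append((entry.get("current_steps", len(points)),
                               entry["loss"]))
    return points
```

A steadily decreasing series of losses over the 5 epochs is the signal you want; a flat or rising curve usually means the learning rate or data needs another look.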
4. Model Dialogue
4.1 Dialogue with the Fine-Tuned Model
Select the “Chat” tab. Change the Checkpoint Path to train_qwen3_5_9B, and click “Load Model” to start a dialogue with the fine-tuned model in the Web UI.
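Besides chatting in the Web UI, LlamaFactory can also serve the fine-tuned model behind an OpenAI-compatible API (via `llamafactory-cli api`). The helper below builds a vision chat request with an inline base64-encoded image; the model name, port, and endpoint path in the comment are assumptions for illustration:

```python
import base64

def build_vision_request(image_bytes: bytes, question: str,
                         model: str = "qwen") -> dict:
    """Build an OpenAI-style chat payload containing one inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
    }

# POST this payload to e.g. http://127.0.0.1:8000/v1/chat/completions
# (with requests or the openai client) once the API server is running.
```

This makes it straightforward to batch-test the fine-tuned model against a held-out set of robot images instead of uploading them one by one in the browser.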
Upload an image and ask the model to identify the robot in it.
The model correctly identified the robot in the image as the MagicBot Z1 (2026 Spring Festival Gala Custom Edition) designed by MagicLab, indicating that the fine-tuning worked well.
4.2 Dialogue with the Original Model
Click "Unload Model," clear the Checkpoint Path field, and click "Load Model" again to chat with the original, pre-fine-tuning model.
The original model failed to identify the robot in the image, instead guessing that it was a person in a costume. This contrast confirms that the fine-tuning was effective.
5. Summary
This tutorial introduced how to use Manus for dataset construction and LlamaFactory for LoRA fine-tuning of the Qwen3.5-9B model, enabling it to identify robot models, and verified the result through manual testing. In subsequent practice, you can fine-tune on your actual business data to obtain a local, domain-specific multimodal large model that solves problems in real business scenarios.