Issues with the Qwen3-VL Model
This post looks at a few small issues with the Qwen3-VL model and gives the cause of and fix for each.

1 Slow Qwen3-VL training and inference

Problem: several posts and issues report that under torch==2.9, when Conv3D is used, Qwen3-VL training and inference are dramatically slower than under torch==2.8; see https://github.com/pytorch/pytorch/issues/166122 .

1.1 Comparing kernel invocations

First, the CUDA kernels launched by Conv3D were checked under both torch==2.8 and torch==2.9, using the following test script:

```python
import torch
import torch.nn as nn


class Glm4vVisionPatchEmbed(nn.Module):
    def __init__(
        self,
        patch_size: int = 14,
        temporal_patch_size: int = 1,
        in_channels: int = 3,
        hidden_size: int = 1536,
    ) -> None:
        super().__init__()
        self.patch_size = patch_size
        self.temporal_patch_size = temporal_patch_size
        self.hidden_size = hidden_size

        # Stride equal to kernel size: each (T, P, P) patch is projected
        # independently to a hidden_size-dim embedding.
        kernel_size = (temporal_patch_size, patch_size, patch_size)
        self.proj = nn.Conv3d(
            in_channels,
            hidden_size,
            kernel_size=kernel_size,
            stride=kernel_size,
            bias=True,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        L, C = x.shape
        # (L, C) -> (L, in_channels, T, P, P) -> Conv3d -> (L, hidden_size)
        x = x.view(L, -1, self.temporal_patch_size, self.patch_size, self.patch_size)
        x = self.proj(x).view(L, self.hidden_size)
        return x


net = Glm4vVisionPatchEmbed(
    patch_size=14,
    temporal_patch_size=2,
    in_channels=3,
    hidden_size=1536,
)
net = net.to('cuda').bfloat16()

x = torch.randn(8192, 14 * 14 * 3 * 2).to('cuda').bfloat16()

# Warm-up call, then a profiled call wrapped in an NVTX range so it is
# easy to locate in the profiler timeline.
y = net(x)
print(y.shape)
with torch.cuda.nvtx.range("Glm4vVisionPatchEmbed"):
    y = net(x)
torch.cuda.synchronize()
```

Running the following command yields the CUDA kernel invocation info ...
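As an aside (not part of the original investigation), a Conv3d whose stride equals its kernel size is mathematically just an independent linear projection of each flattened patch, so the patch embedding above can be expressed with `nn.Linear`, which avoids the Conv3D kernel path entirely. A minimal CPU sketch verifying the equivalence, with hypothetical variable names:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
patch, tpatch, cin, hid = 14, 2, 3, 1536

# Conv3d with stride == kernel_size, as in the patch embed above.
conv = nn.Conv3d(
    cin, hid,
    kernel_size=(tpatch, patch, patch),
    stride=(tpatch, patch, patch),
    bias=True,
)

# Equivalent Linear: the conv weight (hid, cin, T, P, P) flattens to a
# (hid, cin*T*P*P) projection matrix over the flattened patch vector.
linear = nn.Linear(cin * tpatch * patch * patch, hid, bias=True)
with torch.no_grad():
    linear.weight.copy_(conv.weight.view(hid, -1))
    linear.bias.copy_(conv.bias)

# Flat patch vectors, same layout the patch embed's forward() assumes.
x = torch.randn(8, cin * tpatch * patch * patch)
y_conv = conv(x.view(8, cin, tpatch, patch, patch)).view(8, hid)
y_lin = linear(x)
print(torch.allclose(y_conv, y_lin, atol=1e-4))
```

This only restates the operation; whether swapping in a Linear layer is the right fix for the torch==2.9 regression depends on the profiling results discussed in this section.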