Qwen Ascend

Overview

Qwen Ascend is currently the main end-to-end validation path for the Ascend backend. The code lives in mllm/models/qwen_ascend, and the example lives in examples/qwen_ascend.

This model path consists of the following files:

File	Description
`configuration_qwen_ascend.hpp`	Reads the Qwen Ascend configuration.
`tokenization_qwen_ascend.hpp`	Tokenizer, message template, and input construction.
`modeling_qwen_ascend.hpp`	Model structure: MLP, Attention, Decoder, LM.
`qwen_ascend_decoder_graph.hpp`	Decoder graph construction and graph forward.
`qwen_ascend_graph_ops.hpp`	Graph op factory, runner, and graph environment.
`qwen_ascend_rope.hpp`	RoPE position ids, inv_freq, and cache.

QwenAscendForCausalLM inherits from ARGeneration and nn::Module. Internally it contains:

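During inference, the input token ids first pass through the embedding, then through the stacked decoder layers, and finally through the LM head to produce logits. Generation reuses mLLM's ARGeneration::chat interface.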

The Qwen Ascend example moves the model to Ascend first and then loads the weights:

auto model = QwenAscendForCausalLM(cfg);
model.to(mllm::kAscend);
model.load(mllm::load(model_path, file_version));

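This order matters: the model must already be on Ascend when the weights are loaded. For W8A8 weights, AscendLinearOp::load() needs to read the scale and scale_x parameters directly from the model file and prepare the quantization artifacts that the graph needs later.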

AscendKVCache is the Ascend-specific KV cache. It stores the K/V buffers per layer:

K cache: [1, kv_heads, max_cache_length, head_dim]
V cache: [1, kv_heads, max_cache_length, head_dim]
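To make the layout concrete, the sketch below shows how an element of a [1, kv_heads, max_cache_length, head_dim] buffer maps to a flat row-major offset, and how one token's K (or V) vector could be written at the current sequence position. `ToyKVCache` and all of its members are hypothetical, and it uses host-side float storage; it illustrates the indexing only and is not the AscendKVCache interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for one layer's K (or V) buffer.
// Layout: [1, kv_heads, max_cache_length, head_dim], row-major.
struct ToyKVCache {
  std::size_t kv_heads;
  std::size_t max_cache_length;
  std::size_t head_dim;
  std::size_t seq_len = 0;  // number of positions already filled
  std::vector<float> data;

  ToyKVCache(std::size_t heads, std::size_t max_len, std::size_t dim)
      : kv_heads(heads), max_cache_length(max_len), head_dim(dim),
        data(heads * max_len * dim, 0.0f) {}

  // Flat offset of element (head, pos, d); the leading batch dimension is 1.
  std::size_t offset(std::size_t head, std::size_t pos, std::size_t d) const {
    assert(head < kv_heads && pos < max_cache_length && d < head_dim);
    return (head * max_cache_length + pos) * head_dim + d;
  }

  // Append one token. `token` holds kv_heads * head_dim values, head-major.
  void append(const std::vector<float>& token) {
    assert(token.size() == kv_heads * head_dim);
    assert(seq_len < max_cache_length);
    for (std::size_t h = 0; h < kv_heads; ++h) {
      for (std::size_t d = 0; d < head_dim; ++d) {
        data[offset(h, seq_len, d)] = token[h * head_dim + d];
      }
    }
    ++seq_len;
  }
};
```

Because the sequence axis sits between the head axis and head_dim, each head's positions are contiguous in memory, and appending a token touches one head_dim-sized slice per head.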

Main interfaces:

The K/V repeat for GQA is not done inside the cache; it is handled in the attention computation.
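One way to handle the repeat at attention time without copying K/V is to map each query head to the K/V head it shares. The sketch below assumes the common GQA convention in which consecutive query heads form a group that shares one K/V head; the source does not state how the Ascend attention path implements this, and `kvHeadForQueryHead` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

// Map a query head to its shared K/V head under grouped-query attention.
// Assumption: consecutive groups of (num_q_heads / num_kv_heads) query
// heads share one K/V head.
inline std::size_t kvHeadForQueryHead(std::size_t q_head,
                                      std::size_t num_q_heads,
                                      std::size_t num_kv_heads) {
  assert(num_kv_heads > 0 && num_q_heads % num_kv_heads == 0);
  assert(q_head < num_q_heads);
  const std::size_t group_size = num_q_heads / num_kv_heads;
  return q_head / group_size;
}
```

With 16 query heads and 2 K/V heads, query heads 0 to 7 read K/V head 0 and query heads 8 to 15 read K/V head 1, so the cache only ever stores kv_heads entries per position.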

qwen_ascend_rope.hpp provides:

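The RoPE cache prepares the sin/cos tensors for the current sequence so that identical position encodings are not rebuilt. Ascend RoPE currently arranges Q/K, cos, sin, and position ids according to the input convention of ATB RoPE.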
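As an illustration of what such a cache computes, the sketch below builds inv_freq with the standard RoPE formula, inv_freq[i] = theta^(-2i / head_dim), derives sin/cos tables from position ids, and rebuilds them only when the position ids change. The names `makeInvFreq`, `makeSinCos`, and `ToyRopeCache` are hypothetical, the tables hold head_dim / 2 columns on the host, and the layout ATB RoPE actually expects may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// inv_freq[i] = theta^(-2i / head_dim) for i in [0, head_dim / 2).
inline std::vector<float> makeInvFreq(std::size_t head_dim, float theta) {
  assert(head_dim % 2 == 0);
  std::vector<float> inv_freq(head_dim / 2);
  for (std::size_t i = 0; i < inv_freq.size(); ++i) {
    inv_freq[i] = std::pow(theta, -2.0f * static_cast<float>(i) /
                                      static_cast<float>(head_dim));
  }
  return inv_freq;
}

struct SinCosTable {
  std::size_t half_dim = 0;
  std::vector<float> sin;  // [num_positions, half_dim], row-major
  std::vector<float> cos;  // same shape as sin
};

inline SinCosTable makeSinCos(const std::vector<int>& position_ids,
                              const std::vector<float>& inv_freq) {
  SinCosTable t;
  t.half_dim = inv_freq.size();
  for (int pos : position_ids) {
    assert(pos >= 0);
    for (float f : inv_freq) {
      const float angle = static_cast<float>(pos) * f;
      t.sin.push_back(std::sin(angle));
      t.cos.push_back(std::cos(angle));
    }
  }
  return t;
}

// Hypothetical cache: rebuilds the tables only when position ids change.
class ToyRopeCache {
 public:
  explicit ToyRopeCache(std::vector<float> inv_freq)
      : inv_freq_(std::move(inv_freq)) {}

  const SinCosTable& get(const std::vector<int>& position_ids) {
    if (!valid_ || position_ids != cached_ids_) {
      table_ = makeSinCos(position_ids, inv_freq_);
      cached_ids_ = position_ids;
      valid_ = true;
      ++rebuilds_;
    }
    return table_;
  }

  int rebuilds() const { return rebuilds_; }

 private:
  std::vector<float> inv_freq_;
  std::vector<int> cached_ids_;
  SinCosTable table_;
  bool valid_ = false;
  int rebuilds_ = 0;
};
```

Calling `get` twice with the same position ids returns the cached tables; only a new set of positions triggers another build.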

By default, the Qwen Ascend decoder prefers the graph path:

export MLLM_ASCEND_QWEN_DECODER_GRAPH=1

Set it to 0 to disable the graph and fall back to eager execution:

export MLLM_ASCEND_QWEN_DECODER_GRAPH=0

QwenAscendDecoder::canUseGraph() checks the graph switch, the input shapes, and the operator state. The first time a layer enters the graph path, ensureGraphExecutor() builds and caches the graph executor for that layer.
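The switch-plus-lazy-build pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual QwenAscendDecoder code: `ToyDecoderLayer` and `ToyExecutor` are hypothetical, the flag is assumed to be "on unless set to exactly 0", and the batch-size condition merely stands in for the real shape and operator-state checks, which the source does not spell out.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <memory>

// Assumed parsing rule: graph is enabled unless the value is exactly "0".
inline bool graphFlagEnabled(const char* value) {
  return value == nullptr || std::strcmp(value, "0") != 0;
}

struct ToyExecutor {
  int layer_index;  // hypothetical stand-in for a built graph executor
};

class ToyDecoderLayer {
 public:
  explicit ToyDecoderLayer(int layer_index) : layer_index_(layer_index) {}

  // Hypothetical check: env switch plus one placeholder shape condition.
  bool canUseGraph(int batch_size) const {
    const char* flag = std::getenv("MLLM_ASCEND_QWEN_DECODER_GRAPH");
    return graphFlagEnabled(flag) && batch_size == 1;
  }

  // Build the executor on first use, then reuse the cached instance.
  ToyExecutor& ensureGraphExecutor() {
    if (!executor_) {
      executor_ = std::make_unique<ToyExecutor>(ToyExecutor{layer_index_});
      ++build_count_;
    }
    assert(executor_ != nullptr);
    return *executor_;
  }

  int buildCount() const { return build_count_; }

 private:
  int layer_index_;
  int build_count_ = 0;
  std::unique_ptr<ToyExecutor> executor_;
};
```

Keeping one cached executor per layer means the build cost is paid once, on the first forward that takes the graph path, and later calls only reuse it.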

There are currently two kinds of graph:

FP16 decoder graph: uses ATB Linear, RMSNorm, RoPE, Transpose, the Attention plugin, SiLU, and Add.
W8A8 decoder graph: replaces the Linear nodes with AscendLinearW8A8PluginOperation.

The attention graph node is wrapped by AscendAttentionWithKVCachePluginOperation, which encapsulates the prefill/decode subgraphs, KV cache access, and sequence length updates.

Qwen Ascend supports static W8A8 linear. When INT8 weights are loaded, AscendLinearOp prepares:

The default production path uses the statically calibrated W8A8 graph plugin:

x_fp16
  -> x * (1 / scale_x)
  -> round
  -> clamp [-128, 127]
  -> cast int8
  -> ATB Linear W8A8
  -> y_fp16
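The activation quantization steps in this pipeline (scale, round, clamp, cast) can be written out as a small host-side reference. The sketch assumes float stands in for FP16 and that "round" means round-to-nearest with ties away from zero (`std::round`); the tie-breaking rule used on the device is not stated in the source. `quantizeStatic` is a hypothetical helper, and the ATB Linear W8A8 step itself is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Quantize activations with a fixed, pre-calibrated scale_x:
//   q = cast_int8(clamp(round(x * (1 / scale_x)), -128, 127))
inline std::vector<std::int8_t> quantizeStatic(const std::vector<float>& x,
                                               float scale_x) {
  assert(scale_x > 0.0f);
  const float inv_scale = 1.0f / scale_x;
  std::vector<std::int8_t> q(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    float r = std::round(x[i] * inv_scale);        // assumed rounding mode
    r = std::max(-128.0f, std::min(127.0f, r));    // clamp to the int8 range
    q[i] = static_cast<std::int8_t>(r);
  }
  return q;
}
```

Because scale_x comes from calibration rather than from the current input, values outside the calibrated range saturate at -128 or 127 instead of rescaling the whole tensor.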

The dynamic W8A8 eager path must be enabled explicitly:

export MLLM_ASCEND_ENABLE_DYNAMIC_W8A8=1

This path is intended for accuracy analysis and debugging; it is not the default inference path.

examples/qwen_ascend/main.cpp supports two modes.

Default QA generation:

./mllm-qwen-ascend-runner \
  -m /path/to/model.mllm \
  -c /path/to/config.json \
  -t /path/to/tokenizer.json \
  -p "Introduce yourself in one sentence." \
  -g 64

Forward smoke test:

./mllm-qwen-ascend-runner \
  -m /path/to/model.mllm \
  -c /path/to/config.json \
  --forward_smoke_test \
  -s 8

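QA generation prints the generated tokens and then prints perfSummary() when it finishes. When the outer loop truncates generation at max_new_tokens, the example fills in the decode end time, provided the decode stage has already started, so that the decode speed statistics remain available.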
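The reason for filling in the decode end time is easiest to see in a small timing model: without an end timestamp there is no decode interval to divide by. The sketch below is a hypothetical `ToyPerf` struct, not the structure behind perfSummary(); it only shows the "record an end time if decode started but never finished" step.

```cpp
#include <cassert>
#include <chrono>

struct ToyPerf {
  using Clock = std::chrono::steady_clock;
  Clock::time_point decode_start{};
  Clock::time_point decode_end{};
  bool decode_started = false;
  bool decode_ended = false;
  int decode_tokens = 0;

  void onDecodeStart(Clock::time_point now) {
    decode_start = now;
    decode_started = true;
  }

  void onDecodeToken() {
    assert(decode_started);
    ++decode_tokens;
  }

  void onDecodeEnd(Clock::time_point now) {
    decode_end = now;
    decode_ended = true;
  }

  // Called by the outer loop when it stops at max_new_tokens: if decode
  // began but no end time was recorded, record one now.
  void finalize(Clock::time_point now) {
    if (decode_started && !decode_ended) onDecodeEnd(now);
  }

  // Tokens per second, or 0 when the decode interval is unavailable.
  double decodeTokensPerSecond() const {
    if (!decode_started || !decode_ended) return 0.0;
    const double seconds =
        std::chrono::duration<double>(decode_end - decode_start).count();
    return seconds > 0.0 ? decode_tokens / seconds : 0.0;
  }
};
```

If generation stops early and `finalize` is never called, `decodeTokensPerSecond()` returns 0 even though tokens were produced, which is the situation the example avoids.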

Common debugging approaches:

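If the first prefill in default QA generation is slow, distinguish cold-start overhead from steady-state performance. The first real forward includes initialization costs for Ascend, ATB, the runtime, the memory pool, and so on.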
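A simple way to separate the two is to time the first call on its own and average the calls that follow. The helper below is a generic sketch: `measureColdAndSteady` and `TimingResult` are hypothetical names, and `forward` is assumed to block until the device work has finished (an asynchronous launch without synchronization would make both numbers misleading).

```cpp
#include <cassert>
#include <chrono>
#include <functional>

struct TimingResult {
  double cold_ms = 0.0;        // first call, includes one-time initialization
  double steady_avg_ms = 0.0;  // mean of the following calls
};

// Time the first call separately, then average `steady_runs` further calls.
inline TimingResult measureColdAndSteady(const std::function<void()>& forward,
                                         int steady_runs) {
  assert(steady_runs > 0);
  using Clock = std::chrono::steady_clock;
  auto time_ms = [&forward]() {
    const auto start = Clock::now();
    forward();
    const auto stop = Clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
  };

  TimingResult result;
  result.cold_ms = time_ms();
  double total = 0.0;
  for (int i = 0; i < steady_runs; ++i) total += time_ms();
  result.steady_avg_ms = total / steady_runs;
  return result;
}
```

A large gap between cold_ms and steady_avg_ms points to initialization cost rather than to the prefill computation itself.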