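A Deep Dive into DeepSpeed (Part 2)

High-Performance Operator Implementations in DeepSpeed

In the previous post I listed DeepSpeed's inference optimizations and analyzed in detail how the DeepSpeed inference engine replaces network layers and implements tensor parallelism. So what is special about the high-performance layers DeepSpeed implements internally, and why are they faster than ordinary ones? Let's start from the source code.

Note: the source analysis in this post is based on deepspeed-0.14.2.

Picking Up from the Previous Post

As mentioned in the previous post, DeepSpeed ships its own high-performance implementations for a number of mainstream large models. Once DeepSpeed detects that the user is running one of these models, it enables module replacement and swaps part or all of the network structure for the efficient implementation. Taking llama2 as an example, DeepSpeedLlama2Inference is the high-performance inference model DeepSpeed developed for llama2. In this post we look closely at how DeepSpeed builds an efficient large-model architecture to improve inference performance.

Starting from Initialization

The previous post already covered much of the DeepSpeed inference engine, so here is only a brief recap.

Suppose we write and run the following code: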

import torch
import deepspeed
from transformers import AutoTokenizer, AutoModelForCausalLM

# args is assumed to be an argparse namespace defined elsewhere in the script
model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(model,
                                 tensor_parallel={"tp_size": 8},
                                 dtype=torch.half,
                                 checkpoint=None if args.pre_load_checkpoint else args.checkpoint_json,
                                 replace_with_kernel_inject=True)

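DeepSpeed's init_inference records the inference config and starts the inference engine, InferenceEngine. If replace_with_kernel_inject=True, the engine scans the whole model while it is being built and replaces certain layers with DeepSpeed's own high-performance layers, which speeds up inference.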
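The scan-and-replace step can be pictured with a toy module tree. The sketch below is not DeepSpeed's code: Module, DecoderLayer, FastLayer, and replace_layers are all hypothetical names, and a plain isinstance check stands in for whatever matching logic the real engine uses.

```python
class Module:
    """Toy stand-in for a network module that owns named child modules."""

    def __init__(self, **children):
        self.children = dict(children)


class DecoderLayer(Module):
    """Hypothetical 'slow' layer that we want to swap out."""


class FastLayer(Module):
    """Hypothetical optimized layer that wraps the layer it replaces."""

    def __init__(self, original):
        super().__init__()
        self.original = original


def replace_layers(module, target_cls, make_replacement):
    """Recursively swap every child of type target_cls; return the swap count."""
    replaced = 0
    for name, child in list(module.children.items()):
        if isinstance(child, target_cls):
            module.children[name] = make_replacement(child)
            replaced += 1
        else:
            replaced += replace_layers(child, target_cls, make_replacement)
    return replaced


# A model with an embedding-like module and two decoder layers.
model = Module(
    embed=Module(),
    layers=Module(layer0=DecoderLayer(), layer1=DecoderLayer()),
)
num_replaced = replace_layers(model, DecoderLayer, FastLayer)
print(num_replaced)
```

Keeping a reference to the original layer inside the replacement is one simple way to let the optimized layer reuse the existing weights; how DeepSpeed actually transfers weights is not shown here.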

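For llama2, DeepSpeed even implements the entire model internally, so the layers can be replaced directly with DeepSpeed's DeepSpeedLlama2Inference class. The figure below shows the process:

Printing the module-replacement log from an actual run shows that every LlamaDecoderLayer was replaced (I was running llama-1, so the layers were replaced with DeepSpeedGPTInference 😢).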

# Original model
LlamaDecoderLayer(
  (self_attn): LlamaAttention(
    (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (mlp): LlamaMLP(
    (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
    (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
    (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
    (act_fn): SiLUActivation()
  )
  (input_layernorm): LlamaRMSNorm()
  (post_attention_layernorm): LlamaRMSNorm()
)
# Replacement class
<class 'deepspeed.module_inject.containers.llama.LLAMALayerPolicy'>
DeepSpeedGPTInference(
  (attention): DeepSpeedSelfAttention(
    (qkv_func): QKVGemmOp()
    (score_context_func): SoftmaxContextOp()
    (linear_func): LinearOp()
    (vector_matmul_func): VectorMatMulOp()
  )
  (mlp): DeepSpeedMLP(
    (mlp_gemm_func): MLPGemmOp()
    (vector_matmul_func): VectorMatMulOp()
    (fused_gemm_gelu): GELUGemmOp()
    (residual_add_func): ResidualAddOp()
  )
)

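Two things stand out: 1) DeepSpeed uses DeepSpeedSelfAttention and DeepSpeedMLP to replace and fuse llama's attention, MLP, and layernorm. 2) At the lowest level DeepSpeed uses its own high-performance operators, such as QKVGemmOp and MLPGemmOp. Next we examine the implementations of DeepSpeedSelfAttention and DeepSpeedMLP, and then look at how these Ops are implemented.

Implementing the High-Performance Layers

To avoid getting lost, here is a diagram that roughly describes the structure of DeepSpeed's inference code:

DeepSpeed-Inference

As the diagram shows, the large-model inference classes implemented by DeepSpeed Inference all derive from DeepSpeedTransformerInference. So far there are the following classes: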

DeepSpeedBloomInference
DeepSpeedBERTInference
DeepSpeedLlama2Inference
DeepSpeedGPTInference
DeepSpeedMegatronGPTInference
DeepSpeedOPTInference

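Most of these subclasses add very little on top of the base class, so we go straight to the DeepSpeedTransformerInference implementation.

First, note that DeepSpeedTransformerInference corresponds to a single transformer layer of a large model, not the whole model. The class supports triton as a backend for optimized inference, and it has two key members: DeepSpeedMLP and DeepSpeedSelfAttention.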

allocate workspace

Let's walk through its forward implementation step by step:

def forward(self, input=None, input_mask=None, attention_mask=None, attn_mask=None, head_mask=None, layer_past=None,
        get_key_value=False, get_present=False, encoder_output=None, enc_dec_attn_mask=None, x=None,
        encoder_hidden_states=None, encoder_attention_mask=None, use_cache=False, alibi=None, output_attentions=False,
        layer_head_mask=None, past_key_value=None, **kwargs):
    # ... #
    input_mask = (input_mask if attn_mask is None else attn_mask) if attention_mask is None else attention_mask

    # Allocate memory only on first layer forward
    if self.config.layer_id == 0 and self._alloc_workspace:
        self.allocate_workspace(self.config.hidden_size, self.config.heads,
                                input.size()[1],
                                input.size()[0], DeepSpeedTransformerInference.layer_id, self.config.mp_size,
                                self.config.bigscience_bloom,
                                dist.get_rank() if dist.is_initialized() else 0, self.config.max_out_tokens,
                                self.config.min_out_tokens)
        self._alloc_workspace = False

Here allocate_workspace is the memory-allocation function bound at initialization time; it actually calls DeepSpeed's wrapped C++/CUDA implementation:

def __init__(self):
  # ...
  if config.dtype == torch.float32:
      self.allocate_workspace = inference_module.allocate_workspace_fp32
  elif config.dtype == torch.bfloat16:
      self.allocate_workspace = inference_module.allocate_workspace_bf16
  else:
      self.allocate_workspace = inference_module.allocate_workspace_fp32
  self._alloc_workspace = True

InferenceContext::Instance().GenWorkSpace(num_layers, num_heads, batch_size,
                                          prompt_length, hidden_dim, mp_size,
                                          external_cache, sizeof(T), rank,
                                          max_out_tokens, min_out_tokens);

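A quick aside on how much memory large-model inference needs, that is, how much memory is required to run inference beyond what the model parameters themselves occupy: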

size_t activation_size = 10 * (num_heads * effective_head_size) * batch_size;
// Other sequence length dimension is added when the final workSpaceSize is calculated
size_t temp_size = batch_size * (num_heads / mp_size) * max_out_tokens;
size_t cache_size =
    num_layers * batch_size * ((num_heads * effective_head_size) / mp_size) * 2;
size_t workSpaceSize = ((external_cache ? (activation_size + temp_size)
                                                : (activation_size + temp_size + cache_size))) *
                               _max_seq_len * elem_size;

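For the full derivation, see the Zhihu article on memory usage during large-model training. A brief annotation follows:

Number of transformer layers: $l$
Hidden dimension: $h$
Number of attention heads: $a$
Vocabulary size: $v$
Batch size: $b$
Sequence length: $s$

In multi-head attention there are three forward matrix multiplications, $Q=XW_Q$, $K=XW_K$, and $V=XW_V$, where $X$ has shape (b, s, h). The resulting $Q$, $K$, and $V$ each have shape (b, a, s, h/a) (ignoring GQA), so together they take $3bsh$ elements. The subsequent layernorm, attention computation, and other operations need roughly another $5bsh$, so the code simply allocates $10bsh$ for activation_size, comfortably above the roughly $8bsh$ estimated here.

temp_size in the code holds the attention scores $QK^T$, so its size is $bas^2$.

Every transformer layer needs a KV cache for every sequence in the batch, so the total size is $2bslh \times$ sizeof(T), which matches the cache_size computation in the code.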
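To get a feel for the numbers, the workspace formula above can be transcribed into a few lines of Python. This is only a sketch: head_size stands in for effective_head_size (which the real code may compute differently), max_seq_len stands in for _max_seq_len, and the llama-7B-like shapes are assumed values chosen for illustration.

```python
def workspace_bytes(num_layers, num_heads, head_size, batch_size,
                    max_seq_len, max_out_tokens, mp_size=1,
                    elem_size=2, external_cache=False):
    """Mirror the workspace formula above and return a size in bytes."""
    hidden = num_heads * head_size
    # Activations: 10 * b * h (the sequence dimension is applied at the end).
    activation_size = 10 * hidden * batch_size
    # Attention scores: b * (a / mp_size) * max_out_tokens.
    temp_size = batch_size * (num_heads // mp_size) * max_out_tokens
    # KV cache: one K and one V buffer per layer, hence the factor of 2.
    cache_size = num_layers * batch_size * (hidden // mp_size) * 2
    per_token = activation_size + temp_size
    if not external_cache:
        per_token += cache_size
    return per_token * max_seq_len * elem_size


# Assumed shapes: 32 layers, 32 heads of size 128, fp16 (2 bytes per element).
total = workspace_bytes(num_layers=32, num_heads=32, head_size=128,
                        batch_size=1, max_seq_len=1024, max_out_tokens=1024)
print(total / 2**20, "MiB")  # 656.0 MiB
```

With these assumed shapes the KV cache term dominates: passing external_cache=True drops the result to 144 MiB.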

attention

Next comes the attention computation. Once the arguments are prepared, a direct call to DeepSpeedSelfAttention.forward produces the attention output:

# We set the prev key/value to None when there is a prompt
if input.shape[1] > 1:
    self.layer_past = None
layer_past = layer_past if layer_past is not None else self.layer_past
# ....
with torch.no_grad():
    attention_output, key, value, context_outputtn_ctx, inp_norm = \
            self.attention(input,
                    input_mask,
                    head_mask,
                    layer_past,
                    get_present,
                    encoder_hidden_states,
                    encoder_attention_mask,
                    output_attentions,
                    self.norm_w,
                    self.norm_b,
                    alibi)

    presents = (key, value)

self.attention is exactly the DeepSpeedSelfAttention implementation, so let's look at it next:

def forward(self, input, input_mask, head_mask=None, layer_past=None,
            get_present=False, encoder_hidden_states=None,
            encoder_attention_mask=None, output_attentions=False,
            norm_w=None, norm_b=None, alibi=None):
        # ...
        if not self.config.pre_layer_norm:
            qkv_out = self.linear_func(input=input,
                                       weight=self._attn_qkvw,
                                       bias=self._attn_qkvb,
                                       add_bias=self.attn_qkvb is not None,
                                       do_flash_attn=False,
                                       num_heads=self.num_attention_heads_per_partition,
                                       num_layers=DeepSpeedSelfAttention.num_layers)
        else:
            qkv_out = self.qkv_func(input=input,
                                    weight=self._attn_qkvw,
                                    bias=self._attn_qkvb,
                                    gamma=norm_w,
                                    beta=norm_b)

        context_layer, key_layer, value_layer = self.compute_attention(qkv_out=qkv_out,
                                                                       input_mask=input_mask,
                                                                       layer_past=layer_past,
                                                                       alibi=alibi)
        output = self.vector_matmul_func(input=context_layer, weight=self.attn_ow)
        inp_norm = qkv_out[-1]
        if (self.config.mlp_after_attn and self.mp_group is not None
                and dist.get_world_size(group=self.mp_group) > 1):
            dist.all_reduce(output, group=self.mp_group)
        return (output, key_layer, value_layer, context_layer, inp_norm)

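Four Op operators are involved here; the flow is shown in the figure below. QKVGemmOp computes the pre layer norm together with the fused projections $Q=XW_Q$, $K=XW_K$, and $V=XW_V$; SoftmaxContextOp computes $\mathrm{softmax}(QK^T/\sqrt{n_{dim}})V$; and finally VectorMatMulOp computes $\mathrm{Attn}\,W_O$. The fourth, LinearOp, takes the place of QKVGemmOp when pre_layer_norm is disabled.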
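As a reference for the math these operators fuse, here is a minimal single-head attention in plain Python. It is a textbook sketch, not DeepSpeed's kernel: it omits layer norm, masking, multiple heads, and the KV cache, and the helper names matmul, softmax, and attention are my own.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v, w_o):
    """Single-head attention for one sequence; x has shape (seq, hidden)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]            # K^T
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    probs = [softmax(row) for row in scores]        # softmax(QK^T / sqrt(d))
    context = matmul(probs, v)                      # ... times V
    return matmul(context, w_o)                     # output projection W_O


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
out = attention(x, identity, identity, identity, identity)
print(out)
```

With identity weights the output is just the attention probability matrix, so each row sums to 1 and the diagonal entries are the largest.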

mlp

Once attention is done, the MLP computation follows immediately.

    self.layer_past = presents if layer_past is None else None
    output = self.mlp(attention_output, input, inp_norm, self.attention.attn_ob)

    if not self.config.pre_layer_norm:
        output = inference_module.layer_norm(output, self.norm_w, self.norm_b, self.config.epsilon)

    output = output.to(input_type)
if get_present:
    output = (output, presents)

if self.config.return_single_tuple:
    return (output, )
elif self.config.return_tuple:
    return output if type(output) is tuple else (output, attn_mask)
else:
    return output

Likewise, self.mlp corresponds to the DeepSpeedMLP implementation:

def forward(self, input, residual, residual_norm, bias):
  # ...
  residual_add = None  # only assigned by mlp_gemm_func in the else branch
  if self.attn_nw is None:
    output = self.fused_gemm_gelu(input=residual_norm,
                                  weight=self._inter_w,
                                  bias=self._inter_b,
                                  weight_out=self.output_w)
  else:
    output, residual_add = self.mlp_gemm_func(input=input,
                                              residual=residual,
                                              weight_interm=self._inter_w,
                                              weight_out=self.output_w,
                                              input_bias=bias,
                                              bias=self._inter_b,
                                              gamma=self.attn_nw,
                                              beta=self.attn_nb)
  residual = self.residual_add_func(hidden_state=output,
                                    residual=residual,
                                    add_bias=bias is not None,
                                    attention_output=input,
                                    attention_bias=bias if bias is not None else self.output_b,
                                    final_bias=self.output_b,
                                    residual_add=residual_add)
  if self.mp_group is not None and dist.get_world_size(group=self.mp_group) > 1:
    dist.all_reduce(residual, group=self.mp_group)
  return residual

Four Op operators are involved here; the flow is shown in the figure below. MLPGemmOp computes the FFN, and ResidualAddOp adds the bias and residual terms.
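For orientation, the unfused computation can be written as a textbook pre-layernorm FFN with a residual connection. This is a simplified sketch under my own assumptions: it uses GELU (matching the GELUGemmOp name, whereas llama's original MLP is a gated SiLU), it handles a single token vector, and it ignores how DeepSpeed folds the attention output and biases into its fused ops. All function names are hypothetical.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, weight, bias):
    """x has shape (in,), weight has shape (in, out), bias has shape (out,)."""
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(zip(*weight), bias)]


def mlp_block(x, gamma, beta, w_in, b_in, w_out, b_out):
    """Layer norm -> up projection -> GELU -> down projection -> residual add."""
    normed = layer_norm(x, gamma, beta)
    hidden = [gelu(v) for v in linear(normed, w_in, b_in)]
    ffn_out = linear(hidden, w_out, b_out)
    return [r + o for r, o in zip(x, ffn_out)]


x = [1.0, 3.0]
w_in = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]       # 2 -> 3
w_out = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]    # 3 -> 2, zeroed for the demo
y = mlp_block(x, [1.0, 1.0], [0.0, 0.0], w_in, [0.0] * 3, w_out, [0.5, -0.5])
print(y)
```

Zeroing the down projection in the demo isolates the residual path: the output is the input plus the output bias.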

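Implementing the High-Performance Operators

The operator code in DeepSpeed inference v1 is extensive. I will only pick out the highlights here, starting with the attention part.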

template <typename T>
void launch_bias_add_transform_0213(T* output, T* k_cache, T* v_cache,const T* vals, const T* bias,
      int batch_size, int seq_length, unsigned seq_offset, int all_tokens, int hidden_dim,
      int heads, int num_kv, int rotary_dim, bool rotate_half, bool rotate_every_two,
      cudaStream_t stream, int trans_count, int max_out_tokens, float rope_theta) {
    hidden_dim >>= 3;
    int head_ext = 1;  // (hidden_dim - 1) / MAX_THREADS + 1;
    dim3 block_dim(hidden_dim / heads, (heads / head_ext));
    dim3 grid_dim(batch_size, seq_length, (trans_count * head_ext));
    bias_add_transform_0213<<<grid_dim, block_dim, 0, stream>>>(output,
                                                                k_cache,
                                                                v_cache,
                                                                vals,
                                                                bias,
                                                                hidden_dim,
                                                                seq_length,
                                                                seq_offset,
                                                                all_tokens,
                                                                heads,
                                                                num_kv > 0 ? (heads / num_kv) : 1,
                                                                num_kv > 0 ? num_kv : heads,
                                                                rotary_dim >> 3,
                                                                rotate_half,
                                                                rotate_every_two,
                                                                head_ext,
                                                                max_out_tokens,
                                                                rope_theta);
}
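The launch geometry chosen by this launcher is easy to recompute by hand. The sketch below repeats its arithmetic in Python; launch_config is a hypothetical helper, the shapes are assumed llama-7B-like values, and trans_count=3 is my assumption (one slice each for Q, K, and V). The shift by 3 suggests that each thread handles 8 elements, but that is an inference from the launcher, not something read from the kernel body.

```python
def launch_config(batch_size, seq_length, hidden_dim, heads, trans_count,
                  num_kv=-1):
    """Recompute the grid/block shapes and GQA arguments used by the launcher."""
    hidden_dim >>= 3                     # same shift as the launcher: divide by 8
    head_ext = 1
    block_dim = (hidden_dim // heads, heads // head_ext)
    grid_dim = (batch_size, seq_length, trans_count * head_ext)
    # With num_kv <= 0 every query head has its own key/value head.
    heads_per_kv = heads // num_kv if num_kv > 0 else 1
    kv_heads = num_kv if num_kv > 0 else heads
    return grid_dim, block_dim, heads_per_kv, kv_heads


grid, block, heads_per_kv, kv_heads = launch_config(
    batch_size=1, seq_length=16, hidden_dim=4096, heads=32, trans_count=3)
threads_per_block = block[0] * block[1]
print(grid, block, threads_per_block)  # (1, 16, 3) (16, 32) 512
```

So under these assumptions one thread block covers one token's hidden vector for one of the Q/K/V slices, with 512 threads per block.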

#AI #deepspeed

https://dingfen.github.io/2024/05/15/2024-5-15-deepspeed/

Author: Bill Ding

Published: May 15, 2024

Updated: January 26, 2025
