TVM 加速模型，優化推斷

語言: CN / TW / HK

時間 2022-05-22 10:51:09 SegmentFault

TVM 是一個開源深度學習編譯器，可適用於各類 CPUs, GPUs 及其他專用加速器。它的目標是使得我們能夠在任何硬體上優化和執行自己的模型。不同於深度學習框架關注模型生產力，TVM 更關注模型在硬體上的效能和效率。

本文只簡單介紹 TVM 的編譯流程，及如何自動調優自己的模型。更深入瞭解，可見 TVM 官方內容：

文件: http://tvm.apache.org/docs/
原始碼: http://github.com/apache/tvm

編譯流程

TVM 文件 Design and Architecture 講述了例項編譯流程、邏輯結構元件、裝置目標實現等。其中流程見下圖：

從高層次上看，包含了如下步驟：

匯入（Import）：前端元件將模型提取進 IRModule，其是模型內部表示（IR）的函式集合。
轉換（Transformation）：編譯器將 IRModule 轉換為另一個功能等效或近似等效（如量化情況下）的 IRModule。大多轉換都是獨立於目標（後端）的。TVM 也允許目標影響轉換通道的配置。
目標翻譯（Target Translation）：編譯器翻譯（程式碼生成） IRModule 到目標上的可執行格式。目標翻譯結果被封裝為 runtime.Module，可以在目標執行時環境中匯出、載入和執行。
執行時執行（Runtime Execution）：使用者載入一個 runtime.Module 並在支援的執行時環境中執行編譯好的函式。

調優模型

TVM 文件User Tutorial 從怎麼編譯優化模型開始，逐步深入到 TE, TensorIR, Relay 等更底層的邏輯結構元件。

這裡只講下如何用 AutoTVM 自動調優模型，實際瞭解 TVM 編譯、調優、執行模型的過程。原文見 Compiling and Optimizing a Model with the Python Interface (AutoTVM) 。

準備 TVM

首先，安裝 TVM。可見文件Installing TVM，或筆記「TVM 安裝」。

之後，即可通過 TVM Python API 來調優模型。我們先匯入如下依賴：

import onnx
from tvm.contrib.download import download_testdata
from PIL import Image
import numpy as np
import tvm.relay as relay
import tvm
from tvm.contrib import graph_executor

準備模型，並載入

獲取預訓練的 ResNet-50 v2 ONNX 模型，並載入：

model_url = "".join(
    [
        "http://github.com/onnx/models/raw/",
        "main/vision/classification/resnet/model/",
        "resnet50-v2-7.onnx",
    ]
)

model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
onnx_model = onnx.load(model_path)

準備圖片，並前處理

獲取一張測試圖片，並前處理成 224x224 NCHW 格式：

img_url = "http://s3.amazonaws.com/model-server/inputs/kitten.jpg"
img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

# Resize it to 224x224
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")

# Our input image is in HWC layout while ONNX expects CHW input, so convert the array
img_data = np.transpose(img_data, (2, 0, 1))

# Normalize according to the ImageNet input specification
imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev

# Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
img_data = np.expand_dims(norm_img_data, axis=0)

編譯模型，用 TVM Relay

TVM 匯入 ONNX 模型成 Relay，並建立 TVM 圖模型：

target = input("target [llvm]: ")
if not target:
    target = "llvm"
    # target = "llvm -mcpu=core-avx2"
    # target = "llvm -mcpu=skylake-avx512"

# The input name may vary across model types. You can use a tool
# like Netron to check input names
input_name = "data"
shape_dict = {input_name: img_data.shape}

mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))

其中 target 是目標硬體平臺。 llvm 指用 CPU，建議指明架構指令集，可更優化效能。如下命令可檢視 CPU：

$ llc --version | grep CPU
  Host CPU: skylake
$ lscpu

或直接上廠商網站（如Intel® Products）檢視產品引數。

執行模型，用 TVM Runtime

用 TVM Runtime 執行模型，進行預測：

dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

收集優化前的效能資料

收集優化前的效能資料：

import timeit

timing_number = 10
timing_repeat = 10
unoptimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
unoptimized = {
    "mean": np.mean(unoptimized),
    "median": np.median(unoptimized),
    "std": np.std(unoptimized),
}

print(unoptimized)

之後，用以對比優化後的效能。

後處理輸出，得知預測結果

輸出的預測結果，後處理成可讀的分類結果：

from scipy.special import softmax

# Download a list of labels
labels_url = "http://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")

with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]

# Open the output and read the output tensor
scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

調優模型，獲取調優資料

於目標硬體平臺，用 AutoTVM 自動調優，獲取調優資料：

import tvm.auto_scheduler as auto_scheduler
from tvm.autotvm.tuner import XGBTuner
from tvm import autotvm

number = 10
repeat = 1
min_repeat_ms = 0  # since we're tuning on a CPU, can be set to 0
timeout = 10  # in seconds

# create a TVM runner
runner = autotvm.LocalRunner(
    number=number,
    repeat=repeat,
    timeout=timeout,
    min_repeat_ms=min_repeat_ms,
    enable_cpu_cache_flush=True,
)

tuning_option = {
    "tuner": "xgb",
    "trials": 10,
    "early_stopping": 100,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="default"), runner=runner
    ),
    "tuning_records": "resnet-50-v2-autotuning.json",
}

# begin by extracting the tasks from the onnx model
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

# Tune the extracted tasks sequentially.
for i, task in enumerate(tasks):
    prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
    tuner_obj = XGBTuner(task, loss_type="rank")
    tuner_obj.tune(
        n_trial=min(tuning_option["trials"], len(task.config_space)),
        early_stopping=tuning_option["early_stopping"],
        measure_option=tuning_option["measure_option"],
        callbacks=[
            autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
            autotvm.callback.log_to_file(tuning_option["tuning_records"]),
        ],
    )

上述 tuning_option 選用的 XGBoost Grid 演算法進行優化搜尋，資料記錄進 tuning_records 。

重編譯模型，用調優資料

重新編譯出一個優化模型，依據調優資料：

with autotvm.apply_history_best(tuning_option["tuning_records"]):
    with tvm.transform.PassContext(opt_level=3, config={}):
        lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))


# Verify that the optimized model runs and produces the same results

dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

對比調優與非調優模型

收集優化後的效能資料，與優化前的對比：

import timeit

timing_number = 10
timing_repeat = 10
optimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}

print("optimized: %s" % (optimized))
print("unoptimized: %s" % (unoptimized))

調優模型，整個過程的執行結果，如下：

$ time python autotvm_tune.py
# TVM 編譯執行模型
## Downloading and Loading the ONNX Model
## Downloading, Preprocessing, and Loading the Test Image
## Compile the Model With Relay
target [llvm]: llvm -mcpu=core-avx2
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
## Execute on the TVM Runtime
## Collect Basic Performance Data
{'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}
## Postprocess the output
class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
# AutoTVM 調優模型 [Y/n]
## Tune the model
[Task  1/25]  Current/Best:  156.96/ 353.76 GFLOPS | Progress: (10/10) | 4.78 s Done.
[Task  2/25]  Current/Best:   54.66/ 241.25 GFLOPS | Progress: (10/10) | 2.88 s Done.
[Task  3/25]  Current/Best:  116.71/ 241.30 GFLOPS | Progress: (10/10) | 3.48 s Done.
[Task  4/25]  Current/Best:  119.92/ 184.18 GFLOPS | Progress: (10/10) | 3.48 s Done.
[Task  5/25]  Current/Best:   48.92/ 158.38 GFLOPS | Progress: (10/10) | 3.13 s Done.
[Task  6/25]  Current/Best:  156.89/ 230.95 GFLOPS | Progress: (10/10) | 2.82 s Done.
[Task  7/25]  Current/Best:   92.33/ 241.99 GFLOPS | Progress: (10/10) | 2.40 s Done.
[Task  8/25]  Current/Best:   50.04/ 331.82 GFLOPS | Progress: (10/10) | 2.64 s Done.
[Task  9/25]  Current/Best:  188.47/ 409.93 GFLOPS | Progress: (10/10) | 4.44 s Done.
[Task 10/25]  Current/Best:   44.81/ 181.67 GFLOPS | Progress: (10/10) | 2.32 s Done.
[Task 11/25]  Current/Best:   83.74/ 312.66 GFLOPS | Progress: (10/10) | 2.74 s Done.
[Task 12/25]  Current/Best:   96.48/ 294.40 GFLOPS | Progress: (10/10) | 2.82 s Done.
[Task 13/25]  Current/Best:  123.74/ 354.34 GFLOPS | Progress: (10/10) | 2.62 s Done.
[Task 14/25]  Current/Best:   23.76/ 178.71 GFLOPS | Progress: (10/10) | 2.90 s Done.
[Task 15/25]  Current/Best:  119.18/ 534.63 GFLOPS | Progress: (10/10) | 2.49 s Done.
[Task 16/25]  Current/Best:  101.24/ 172.92 GFLOPS | Progress: (10/10) | 2.49 s Done.
[Task 17/25]  Current/Best:  309.85/ 309.85 GFLOPS | Progress: (10/10) | 2.69 s Done.
[Task 18/25]  Current/Best:   54.45/ 368.31 GFLOPS | Progress: (10/10) | 2.46 s Done.
[Task 19/25]  Current/Best:   78.69/ 162.43 GFLOPS | Progress: (10/10) | 3.29 s Done.
[Task 20/25]  Current/Best:   40.78/ 317.50 GFLOPS | Progress: (10/10) | 4.52 s Done.
[Task 21/25]  Current/Best:  169.03/ 296.36 GFLOPS | Progress: (10/10) | 3.95 s Done.
[Task 22/25]  Current/Best:   90.96/ 210.43 GFLOPS | Progress: (10/10) | 2.28 s Done.
[Task 23/25]  Current/Best:   48.93/ 217.36 GFLOPS | Progress: (10/10) | 2.87 s Done.
[Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.
[Task 25/25]  Current/Best:   25.50/  33.86 GFLOPS | Progress: (10/10) | 9.28 s Done.
## Compiling an Optimized Model with Tuning Data
class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
## Comparing the Tuned and Untuned Models
optimized: {'mean': 34.736288779822644, 'median': 34.547542000655085, 'std': 0.5144378649382363}
unoptimized: {'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}

real    3m23.904s
user    5m2.900s
sys     5m37.099s

對比效能資料，可以發現：調優模型的執行速度更快、更平穩。

參考

筆記:start-ai-compiler
資料:
- 2020 / The Deep Learning Compiler: A Comprehensive Survey
  - [[譯] 深度學習編譯器綜述]( http://www.jianshu.com/p/ed3... )
- 2018 / TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
  - [[譯] TVM: 一個自動的端到端深度學習優化編譯器]( http://zhuanlan.zhihu.com/p/... )

GoCoding 個人實踐的經驗分享，可關注公眾號！

「其他文章」