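Accelerating Models and Optimizing Inference with TVM
TVM is an open-source deep learning compiler for CPUs, GPUs, and other specialized accelerators. Its goal is to let us optimize and run our own models on any hardware. Whereas deep learning frameworks focus on model productivity, TVM focuses on the performance and efficiency of models on hardware.
This article gives only a brief introduction to TVM's compilation flow and to auto-tuning your own model. For a deeper understanding, see the official TVM documentation.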
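Compilation Flow
The Design and Architecture section of the TVM documentation describes an example compilation flow, the logical architecture components, device and target implementations, and more. It also includes a diagram of the flow.
At a high level, the flow consists of the following steps: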
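- Import: The frontend component ingests a model into an IRModule, which is a collection of functions that internally represent the model (IR).
- Transformation: The compiler transforms an IRModule into another IRModule that is functionally equivalent or approximately equivalent (for example, in the case of quantization). Most transformations are independent of the target (backend). TVM also allows the target to affect the configuration of the transformation pipeline.
- Target Translation: The compiler translates (code generation) the IRModule into an executable format for the target. The result is encapsulated as a runtime.Module that can be exported, loaded, and executed in the target runtime environment.
- Runtime Execution: The user loads a runtime.Module and runs the compiled functions in a supported runtime environment.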
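Tuning a Model
The TVM User Tutorial starts with how to compile and optimize a model, then goes progressively deeper into lower-level logical components such as TE, TensorIR, and Relay.
Here we only cover how to auto-tune a model with AutoTVM, to get a hands-on feel for how TVM compiles, tunes, and runs a model. The original tutorial is Compiling and Optimizing a Model with the Python Interface (AutoTVM).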
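Preparing TVM
First, install TVM. See the Installing TVM documentation, or the note "TVM Installation".
After that, you can tune the model through the TVM Python API. We first import the following dependencies: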
    import onnx
    from tvm.contrib.download import download_testdata
    from PIL import Image
    import numpy as np
    import tvm.relay as relay
    import tvm
    from tvm.contrib import graph_executor
Preparing and Loading the Model
Download the pre-trained ResNet-50 v2 ONNX model and load it:
    model_url = "".join(
        [
            "http://github.com/onnx/models/raw/",
            "main/vision/classification/resnet/model/",
            "resnet50-v2-7.onnx",
        ]
    )

    model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
    onnx_model = onnx.load(model_path)
Preparing and Preprocessing the Image
Download a test image and preprocess it into 224x224 NCHW format:
    img_url = "http://s3.amazonaws.com/model-server/inputs/kitten.jpg"
    img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

    # Resize it to 224x224
    resized_image = Image.open(img_path).resize((224, 224))
    img_data = np.asarray(resized_image).astype("float32")

    # Our input image is in HWC layout while ONNX expects CHW input, so convert the array
    img_data = np.transpose(img_data, (2, 0, 1))

    # Normalize according to the ImageNet input specification
    imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
    imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
    norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev

    # Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
    img_data = np.expand_dims(norm_img_data, axis=0)
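As a quick sanity check of the preprocessing steps above, the following minimal sketch applies them to a synthetic image, so it runs without downloading anything. The helper name `preprocess` and the random input are illustrative assumptions and are not part of the original script. The sketch also casts the result back to float32, because subtracting the float64 mean array promotes the data to float64.

```python
import numpy as np

# ImageNet normalization constants, as used in the script above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
IMAGENET_STDDEV = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))


def preprocess(image_hwc):
    """Convert an HWC uint8 RGB image to a normalized NCHW float32 batch.

    `preprocess` is a hypothetical helper name used for illustration.
    """
    data = np.asarray(image_hwc).astype("float32")
    data = np.transpose(data, (2, 0, 1))  # HWC -> CHW
    data = (data / 255 - IMAGENET_MEAN) / IMAGENET_STDDEV  # promoted to float64 here
    data = np.expand_dims(data, axis=0)  # add the batch dimension -> NCHW
    return data.astype("float32")  # cast back to the float32 dtype the script declares


# Synthetic 224x224 RGB image standing in for the resized test photo.
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
batch = preprocess(fake_image)
print(batch.shape, batch.dtype)
```

Checking the shape and dtype before calling `set_input` is a cheap way to catch layout mistakes such as a forgotten transpose.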
Compiling the Model with TVM Relay
TVM imports the ONNX model into Relay and builds a TVM graph module:
    target = input("target [llvm]: ")
    if not target:
        target = "llvm"
    # target = "llvm -mcpu=core-avx2"
    # target = "llvm -mcpu=skylake-avx512"

    # The input name may vary across model types. You can use a tool
    # like Netron to check input names
    input_name = "data"
    shape_dict = {input_name: img_data.shape}

    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)

    dev = tvm.device(str(target), 0)
    module = graph_executor.GraphModule(lib["default"](dev))
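Here, target is the target hardware platform. llvm means running on the CPU; specifying the architecture's instruction set is recommended, since it lets TVM generate better-optimized code. The following commands show the CPU information: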
    $ llc --version | grep CPU
      Host CPU: skylake
    $ lscpu
Alternatively, look up the product specifications on the vendor's website (for example, Intel® Products).
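To turn the `llc --version` output into a target string, a minimal sketch could look like the following. The helper name `target_from_llc_output` and the sample text are assumptions made for illustration; the function only parses text and does not call llc or TVM. It also assumes the llc on your PATH reports a CPU name that the LLVM used by your TVM build understands.

```python
import re


def target_from_llc_output(llc_output, default="llvm"):
    """Build a TVM target string from the text printed by `llc --version`.

    `target_from_llc_output` is a hypothetical helper, not a TVM API.
    """
    match = re.search(r"Host CPU:\s*(\S+)", llc_output)
    if match is None or match.group(1) == "generic":
        return default  # no usable CPU name, fall back to plain "llvm"
    return "llvm -mcpu=" + match.group(1)


# Illustrative sample of the relevant part of `llc --version` output.
sample = "  Default target: x86_64-pc-linux-gnu\n  Host CPU: skylake\n"
print(target_from_llc_output(sample))
```

The resulting string can be typed at the `target [llvm]:` prompt of the script, in the same way as `llvm -mcpu=core-avx2`.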
Running the Model with the TVM Runtime
Use the TVM runtime to run the model and make a prediction:
    dtype = "float32"
    module.set_input(input_name, img_data)
    module.run()
    output_shape = (1, 1000)
    tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()
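Collecting Performance Data Before Optimization
Time the unoptimized model; each value is the time per run in milliseconds: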
    import timeit

    timing_number = 10
    timing_repeat = 10
    unoptimized = (
        np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
        * 1000
        / timing_number
    )
    unoptimized = {
        "mean": np.mean(unoptimized),
        "median": np.median(unoptimized),
        "std": np.std(unoptimized),
    }

    print(unoptimized)
These numbers serve as the baseline for comparison with the tuned model later.
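Since the same measurement is needed again after tuning, the timing logic can be wrapped in a small function. The sketch below is one way to do that; `benchmark_ms` is a hypothetical helper name, and the stand-in workload replaces `module.run` so the example runs without TVM.

```python
import timeit

import numpy as np


def benchmark_ms(fn, number=10, repeat=10):
    """Return mean/median/std of the per-call time of `fn` in milliseconds.

    `benchmark_ms` is a hypothetical helper; it mirrors the timing code
    above with the callable passed in as an argument.
    """
    # repeat() returns the total seconds for `number` calls, once per repetition.
    totals = np.array(timeit.Timer(fn).repeat(repeat=repeat, number=number))
    per_call_ms = totals * 1000 / number
    return {
        "mean": float(np.mean(per_call_ms)),
        "median": float(np.median(per_call_ms)),
        "std": float(np.std(per_call_ms)),
    }


# Stand-in workload; in the article's script this would be `module.run`.
stats = benchmark_ms(lambda: sum(range(1000)), number=5, repeat=3)
print(stats)
```

With such a helper, the baseline and the tuned measurements are guaranteed to use the same `number` and `repeat` settings.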
Postprocessing the Output to Get the Prediction
Postprocess the raw output into human-readable classification results:
    from scipy.special import softmax

    # Download a list of labels
    labels_url = "http://s3.amazonaws.com/onnx-model-zoo/synset.txt"
    labels_path = download_testdata(labels_url, "synset.txt", module="data")

    with open(labels_path, "r") as f:
        labels = [l.rstrip() for l in f]

    # Open the output and read the output tensor
    scores = softmax(tvm_output)
    scores = np.squeeze(scores)
    ranks = np.argsort(scores)[::-1]
    for rank in ranks[0:5]:
        print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
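To see what the postprocessing does on a small input, here is a minimal sketch on a toy tensor. The helper name `top_k`, the toy logits, and the toy labels are all illustrative assumptions. Softmax is written out with NumPy instead of `scipy.special.softmax` so the sketch has no extra dependency.

```python
import numpy as np


def top_k(logits, labels, k=5):
    """Return the k most probable (label, probability) pairs.

    `top_k` is a hypothetical helper used only for illustration.
    """
    logits = np.squeeze(np.asarray(logits, dtype="float64"))
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    scores = np.exp(shifted) / np.sum(np.exp(shifted))
    ranks = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(labels[i], float(scores[i])) for i in ranks]


# Toy output with a batch dimension, shaped like the (1, N) tensor above.
toy_logits = np.array([[1.0, 3.0, 0.5, 2.0]])
toy_labels = ["dog", "cat", "car", "fox"]
for label, prob in top_k(toy_logits, toy_labels, k=2):
    print("class='%s' with probability=%f" % (label, prob))
```

The class with the largest raw score is always ranked first, because softmax preserves the ordering of its inputs.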
Tuning the Model and Collecting Tuning Data
On the target hardware platform, auto-tune the model with AutoTVM and collect the tuning data:
    import tvm.auto_scheduler as auto_scheduler
    from tvm.autotvm.tuner import XGBTuner
    from tvm import autotvm

    number = 10
    repeat = 1
    min_repeat_ms = 0  # since we're tuning on a CPU, can be set to 0
    timeout = 10  # in seconds

    # create a TVM runner
    runner = autotvm.LocalRunner(
        number=number,
        repeat=repeat,
        timeout=timeout,
        min_repeat_ms=min_repeat_ms,
        enable_cpu_cache_flush=True,
    )

    tuning_option = {
        "tuner": "xgb",
        "trials": 10,
        "early_stopping": 100,
        "measure_option": autotvm.measure_option(
            builder=autotvm.LocalBuilder(build_func="default"), runner=runner
        ),
        "tuning_records": "resnet-50-v2-autotuning.json",
    }

    # begin by extracting the tasks from the onnx model
    tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

    # Tune the extracted tasks sequentially.
    for i, task in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
        tuner_obj = XGBTuner(task, loss_type="rank")
        tuner_obj.tune(
            n_trial=min(tuning_option["trials"], len(task.config_space)),
            early_stopping=tuning_option["early_stopping"],
            measure_option=tuning_option["measure_option"],
            callbacks=[
                autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
                autotvm.callback.log_to_file(tuning_option["tuning_records"]),
            ],
        )
The tuning_option above uses the XGBoost Grid algorithm to guide the search, and the tuning data is recorded in the file named by tuning_records.
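The tuning loop measures at most `min(trials, len(task.config_space))` configurations per task, so the total tuning cost grows with both the number of tasks and the trials setting. The sketch below estimates that upper bound; `measurement_budget` and the config-space sizes are hypothetical values chosen for illustration, not figures taken from the ResNet-50 run.

```python
def measurement_budget(config_space_sizes, trials):
    """Upper bound on measurements when tasks are tuned sequentially.

    Mirrors n_trial=min(trials, len(task.config_space)) from the tuning loop.
    `measurement_budget` is a hypothetical helper, not an AutoTVM API.
    """
    return sum(min(trials, size) for size in config_space_sizes)


# Made-up config-space sizes for three tasks; real sizes come from
# len(task.config_space) for each extracted task.
sizes = [800, 6, 1200]
small_budget = measurement_budget(sizes, trials=10)    # 10 + 6 + 10
large_budget = measurement_budget(sizes, trials=1500)  # 800 + 6 + 1200
print(small_budget, large_budget)
```

Because trials is set to 10 while early_stopping is 100, early stopping cannot trigger in this configuration; the trial count alone bounds each task.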
Recompiling the Model with the Tuning Data
Recompile an optimized model based on the tuning data:
    with autotvm.apply_history_best(tuning_option["tuning_records"]):
        with tvm.transform.PassContext(opt_level=3, config={}):
            lib = relay.build(mod, target=target, params=params)

    dev = tvm.device(str(target), 0)
    module = graph_executor.GraphModule(lib["default"](dev))

    # Verify that the optimized model runs and produces the same results
    dtype = "float32"
    module.set_input(input_name, img_data)
    module.run()
    output_shape = (1, 1000)
    tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

    scores = softmax(tvm_output)
    scores = np.squeeze(scores)
    ranks = np.argsort(scores)[::-1]
    for rank in ranks[0:5]:
        print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
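Comparing the Tuned and Untuned Models
Collect performance data for the optimized model and compare it with the data from before optimization: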
    import timeit

    timing_number = 10
    timing_repeat = 10
    optimized = (
        np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
        * 1000
        / timing_number
    )
    optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}

    print("optimized: %s" % (optimized))
    print("unoptimized: %s" % (unoptimized))
The output of the whole tuning run is as follows:
    $ time python autotvm_tune.py
    # Compile and run the model with TVM
    ## Downloading and Loading the ONNX Model
    ## Downloading, Preprocessing, and Loading the Test Image
    ## Compile the Model With Relay
    target [llvm]: llvm -mcpu=core-avx2
    One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
    ## Execute on the TVM Runtime
    ## Collect Basic Performance Data
    {'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}
    ## Postprocess the output
    class='n02123045 tabby, tabby cat' with probability=0.621104
    class='n02123159 tiger cat' with probability=0.356378
    class='n02124075 Egyptian cat' with probability=0.019712
    class='n02129604 tiger, Panthera tigris' with probability=0.001215
    class='n04040759 radiator' with probability=0.000262
    # Tune the model with AutoTVM [Y/n]
    ## Tune the model
    [Task 1/25] Current/Best: 156.96/ 353.76 GFLOPS | Progress: (10/10) | 4.78 s Done.
    [Task 2/25] Current/Best: 54.66/ 241.25 GFLOPS | Progress: (10/10) | 2.88 s Done.
    [Task 3/25] Current/Best: 116.71/ 241.30 GFLOPS | Progress: (10/10) | 3.48 s Done.
    [Task 4/25] Current/Best: 119.92/ 184.18 GFLOPS | Progress: (10/10) | 3.48 s Done.
    [Task 5/25] Current/Best: 48.92/ 158.38 GFLOPS | Progress: (10/10) | 3.13 s Done.
    [Task 6/25] Current/Best: 156.89/ 230.95 GFLOPS | Progress: (10/10) | 2.82 s Done.
    [Task 7/25] Current/Best: 92.33/ 241.99 GFLOPS | Progress: (10/10) | 2.40 s Done.
    [Task 8/25] Current/Best: 50.04/ 331.82 GFLOPS | Progress: (10/10) | 2.64 s Done.
    [Task 9/25] Current/Best: 188.47/ 409.93 GFLOPS | Progress: (10/10) | 4.44 s Done.
    [Task 10/25] Current/Best: 44.81/ 181.67 GFLOPS | Progress: (10/10) | 2.32 s Done.
    [Task 11/25] Current/Best: 83.74/ 312.66 GFLOPS | Progress: (10/10) | 2.74 s Done.
    [Task 12/25] Current/Best: 96.48/ 294.40 GFLOPS | Progress: (10/10) | 2.82 s Done.
    [Task 13/25] Current/Best: 123.74/ 354.34 GFLOPS | Progress: (10/10) | 2.62 s Done.
    [Task 14/25] Current/Best: 23.76/ 178.71 GFLOPS | Progress: (10/10) | 2.90 s Done.
    [Task 15/25] Current/Best: 119.18/ 534.63 GFLOPS | Progress: (10/10) | 2.49 s Done.
    [Task 16/25] Current/Best: 101.24/ 172.92 GFLOPS | Progress: (10/10) | 2.49 s Done.
    [Task 17/25] Current/Best: 309.85/ 309.85 GFLOPS | Progress: (10/10) | 2.69 s Done.
    [Task 18/25] Current/Best: 54.45/ 368.31 GFLOPS | Progress: (10/10) | 2.46 s Done.
    [Task 19/25] Current/Best: 78.69/ 162.43 GFLOPS | Progress: (10/10) | 3.29 s Done.
    [Task 20/25] Current/Best: 40.78/ 317.50 GFLOPS | Progress: (10/10) | 4.52 s Done.
    [Task 21/25] Current/Best: 169.03/ 296.36 GFLOPS | Progress: (10/10) | 3.95 s Done.
    [Task 22/25] Current/Best: 90.96/ 210.43 GFLOPS | Progress: (10/10) | 2.28 s Done.
    [Task 23/25] Current/Best: 48.93/ 217.36 GFLOPS | Progress: (10/10) | 2.87 s Done.
    [Task 25/25] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.
    [Task 25/25] Current/Best: 25.50/ 33.86 GFLOPS | Progress: (10/10) | 9.28 s Done.
    ## Compiling an Optimized Model with Tuning Data
    class='n02123045 tabby, tabby cat' with probability=0.621104
    class='n02123159 tiger cat' with probability=0.356378
    class='n02124075 Egyptian cat' with probability=0.019712
    class='n02129604 tiger, Panthera tigris' with probability=0.001215
    class='n04040759 radiator' with probability=0.000262
    ## Comparing the Tuned and Untuned Models
    optimized: {'mean': 34.736288779822644, 'median': 34.547542000655085, 'std': 0.5144378649382363}
    unoptimized: {'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}

    real 3m23.904s
    user 5m2.900s
    sys 5m37.099s
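Comparing the performance data shows that the tuned model runs faster and more consistently: the mean time per run drops from about 45.0 ms to about 34.7 ms, and the standard deviation drops from about 6.87 ms to about 0.51 ms.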
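The improvement can be quantified directly from the two result dictionaries. The following minimal sketch uses the numbers printed in the run log; the helper name `speedup` is an assumption introduced for illustration.

```python
# Numbers copied from the run log (milliseconds per inference).
unoptimized = {"mean": 44.97057118016528, "median": 42.52320024970686, "std": 6.870915251002107}
optimized = {"mean": 34.736288779822644, "median": 34.547542000655085, "std": 0.5144378649382363}


def speedup(before, after, key="mean"):
    """Ratio of the untuned time to the tuned time; above 1.0 means faster."""
    return before[key] / after[key]


mean_speedup = speedup(unoptimized, optimized, "mean")
median_speedup = speedup(unoptimized, optimized, "median")
std_ratio = unoptimized["std"] / optimized["std"]

print("mean speedup:   %.2fx" % mean_speedup)
print("median speedup: %.2fx" % median_speedup)
print("std ratio:      %.1fx" % std_ratio)
```

Reporting the median alongside the mean is useful here, because the untuned run has a large standard deviation and its mean is more sensitive to outliers.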