
nnPerf+: Enabling Real-time On-device Profiling for Mobile DNNs and LLMs

Overview | Key Features | Installation instructions | Visualization tool
Support for LLM | ARM GPU Counters | Citation | Get Help | Credits | GitHub | PaperPDF | FAQ

Our paper "nnPerf" won the Best Paper Award Runner-up at ACM SenSys 2023!

Overview

This webpage contains instructions for using nnPerf+. nnPerf+ is a real-time on-device profiler designed to collect and analyze the runtime inference latency of DNN models and LLMs (large language models) on mobile platforms. nnPerf+ demystifies the hidden layers and metrics behind DNN and LLM optimization and adaptation at the granularity of operators and kernels, so that every facet contributing to a DNN model's or LLM's runtime efficiency is easily accessible to mobile developers via well-defined APIs.

With nnPerf+, mobile developers can easily identify bottlenecks in a model's runtime efficiency and optimize the model architecture to meet service-level objectives (SLOs).

The figure below compares nnPerf+ to existing DNN model and LLM analyzers designed for mobile platforms.

The effectiveness of nnPerf+

To cite this tool, the best reference is the SenSys 2023 paper.

Key Features

Demo of nnPerf.

[News] We will later add features for measuring the frequency, utilization, temperature, and PMU parameters (e.g., instructions, L2 cache misses) of multi-core processors.


1. Plug-and-play design principles

• Follows a self-contained approach with no need for extra libraries or complex installations.

2. Real-time on-device profiling

• Monitors DNN inference latency directly on the device, without external dependencies such as adb.

3. Fine-grained measurement at the GPU kernel level

• Allows deep inspection of GPU kernels for detailed insights into DNN model optimization.

For more design details and features, please refer to our SenSys 2023 paper.

Installation instructions

Set up nnPerf in just a few steps. You can download the latest version of nnPerf here.

Quick start with apk

1. Use adb to connect to a smartphone or other Android-based mobile platform

2. Install the nnPerf_v1.0.apk

adb install -t .\nnPerf_v1.0.apk

Build android project

1. Install Android Studio 3.6.3 (Runtime version: 1.8.0_212-release-1586-b04 amd64).

2. Import Project

File -> Open -> select the project directory

3. Android Studio settings

Android Gradle Plugin Version:   3.1.3
Gradle Version:                  4.4
NDK Version:                     21.0.6113669
JDK Version:                     1.8.0_211
Compile Sdk Version:             27
Build Tools Version:             27.0.3

4. Run the app to profile

Output path: /data/data/com.example.android.nnPerf/

5. Supported models (other .tflite models can also be added; see the sketch after this list)

mobilenetV3-Large-Float
mobilenetV3-Small-Float
EfficientNet-b0-Float
mobilenetV1-Quant
mobilenetV2-Float
mobilenetV1-Float
Squeezenet-Float
Densenet-Float
MNasNet-1.0
MobileBert
SSDMobileV2
Esrgan
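
To profile a model that is not in the list above, the standard TensorFlow Lite Java API can load any .tflite file. Below is a minimal sketch, assuming the model is bundled in the app's assets and runs on the GPU delegate so that kernel-level GPU profiling applies to it; the class name, asset name, and tensor shapes are placeholders, not part of nnPerf+.

    import android.content.Context;
    import android.content.res.AssetFileDescriptor;
    import org.tensorflow.lite.Interpreter;
    import org.tensorflow.lite.gpu.GpuDelegate;
    import java.io.FileInputStream;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Hypothetical helper, not part of nnPerf+: loads an additional .tflite
    // model from the app's assets and runs it on the mobile GPU.
    public final class CustomModelRunner {
        private final Interpreter interpreter;

        public CustomModelRunner(Context context, String assetName) throws Exception {
            // Memory-map the bundled .tflite file.
            AssetFileDescriptor fd = context.getAssets().openFd(assetName);
            FileChannel channel = new FileInputStream(fd.getFileDescriptor()).getChannel();
            MappedByteBuffer model = channel.map(
                    FileChannel.MapMode.READ_ONLY, fd.getStartOffset(), fd.getDeclaredLength());

            Interpreter.Options options = new Interpreter.Options();
            options.addDelegate(new GpuDelegate());   // run on the GPU delegate
            interpreter = new Interpreter(model, options);
        }

        // Shapes are placeholders; match them to the model's actual tensors.
        public void run(float[][][][] input, float[][] output) {
            interpreter.run(input, output);
        }
    }

Note that .tflite assets typically need to be excluded from APK compression (e.g., aaptOptions { noCompress "tflite" } in build.gradle) so the file can be memory-mapped.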

Online timeline visualization tool

We are developing an online timeline visualization tool for nnPerf and will release it later.

Example of the visualization tool

Features of our visualization tool:

1. Easily upload test data files and resize the interface using the scroll wheel.

2. Utilize the "Sort" button to organize data within the file based on three distinct categories.

3. Retrieve and query previously uploaded files conveniently through the History feature.

4. Selectively hide specific filters through the intuitive legend in the upper right corner of the interface.

Support for LLM

We also support profiling LLMs such as GPT, BERT, Qwen, Gemma, RedPajama, and Mistral.

Kernel-level GPU stall ratios during a single token generation on five mobile devices with different computation capacities.
Token-level GPU stall ratios across multiple tokens, grouped by five mobile devices. Devices 𝐷1 and 𝐷2 exclude model 𝑀5 due to their inadequate computation capacity.
𝐷1: OnePlus 8
𝐷2: Samsung Note10
𝐷3: OnePlus 9RT
𝐷4: OnePlus Ace Pro
𝐷5: Xiaomi 14Pro
𝑀1: Qwen2.5 0.5B
𝑀2: Qwen2.5 1.5B
𝑀3: Gemma2 2B
𝑀4: RedPajama 3B
𝑀5: Mistral 7B
Example of GPT-2
Example of MobileBERT

Newly Added PMU Counters Supported by nnPerf+

EVENT_GROUP_UCHE - Unified L2 Cache Event Group
  UCHE_VBIF_READ_BEATS_SP: Number of 128-bit data beats read from VBIF/External Memory to UCHE for SP
  UCHE_READ_REQUESTS_SP: Total read requests issued from Shader Processor (SP) to UCHE
  UCHE_WRITE_REQUESTS_SP: Total write requests issued from Shader Processor (SP) to UCHE
  UCHE_VBIF_READ_BEATS_TP: Number of 128-bit data beats read from VBIF/External Memory to UCHE for TP
  UCHE_READ_REQUESTS_TP: Total read requests issued from Texture Pipe (TP) to UCHE

EVENT_GROUP_SP - Shader Processor Event Group
  SP_BUSY_CYCLES: Number of cycles where the Shader Processor is actively executing instructions
  SP_NON_EXECUTION_CYCLES: Cycles where SP is active but not executing instructions
  SP_FS_STAGE_FULL_ALU_INSTRUCTIONS: Number of full-precision ALU instructions executed in Fragment Shader stage
  SP_VS_STAGE_FULL_ALU_INSTRUCTIONS: Number of full-precision ALU instructions executed in Vertex Shader stage
  SP_CS_INSTRUCTIONS: Total number of Compute Shader instructions executed
  SP_ICL1_MISSES: Level 1 Instruction Cache misses in the Shader Processor
  SP_ICL1_REQUESTS: Total Level 1 Instruction Cache fetch requests
  SP_LM_LOAD_INSTRUCTIONS: Number of Load instructions from Local Memory
  SP_LM_STORE_INSTRUCTIONS: Number of Store instructions to Local Memory
  SP_LM_ATOMICS: Number of Atomic operations performed on Local Memory
  SP_GM_LOAD_INSTRUCTIONS: Number of Load instructions from Global Memory
  SP_GM_STORE_INSTRUCTIONS: Number of Store instructions to Global Memory
  SP_GM_ATOMICS: Number of Atomic operations performed on Global Memory
  SP_STALL_CYCLES_VPC: Cycles where SP is stalled waiting for the Vertex Parameter Cache (VPC)
  SP_STALL_CYCLES_TP: Cycles where SP is stalled waiting for the Texture Pipe (TP)
  SP_STALL_CYCLES_UCHE: Cycles where SP is stalled waiting for the Unified L2 Cache (UCHE)
  SP_STALL_CYCLES_RB: Cycles where SP is stalled waiting for the Render Backend (RB)

EVENT_GROUP_TP - Texture Pipe Event Group
  TP_BUSY_CYCLES: Total cycles during which the Texture Pipe is performing processing tasks
  TP_L1_CACHELINE_MISSES: Level 1 Texture Cache line misses
  TP_L1_CACHELINE_REQUESTS: Total fetch requests issued to the Level 1 Texture Cache
  TP_STALL_CYCLES_UCHE: Cycles where the Texture Pipe is stalled waiting for data from UCHE (L2)
  TP_LATENCY_CYCLES: Accumulated cycles reflecting the latency of texture fetch operations
  TP_STARVE_CYCLES_SP: Cycles where the TP is idle/starving due to lack of input from the Shader Processor
  TP_STARVE_CYCLES_UCHE: Cycles where the TP is idle/starving waiting for memory responses from UCHE
  TP_OUTPUT_PIXELS_POINT: Number of pixels produced using Point (Nearest) filtering
  TP_OUTPUT_PIXELS_BILINEA: Number of pixels produced using Bilinear filtering
  TP_OUTPUT_PIXELS_MIP: Number of pixels produced using Mipmapped filtering
  TP_OUTPUT_PIXELS_ANISO: Number of pixels produced using Anisotropic filtering
  TP_OUTPUT_PIXELS_ZERO_LOD: Number of pixels produced with Level of Detail (LOD) zero

EVENT_GROUP_CUSTOM - Custom Derived Metrics Group
  GFLOPs: Giga Floating Point Operations per second (Total throughput)
  GBPs: Giga Bytes per second (Memory bandwidth throughput)
  GPUCycles: Total GPU internal clock cycles elapsed during the task
  ShaderComputeCycles: Estimated cycles spent strictly on shader compute/ALU tasks
  ShaderLoadStoreCycles: Estimated cycles spent on memory Load/Store operations
  ShaderTextureCycles: Estimated cycles spent on texture mapping and sampling operations
  AluUtil: ALU Utilization percentage (Active compute relative to available capacity)
  LoadStoreUtil: Load/Store Unit utilization percentage
  TextureUtil: Texture Pipe utilization percentage
  FullAluRatio: Ratio of full-precision (32-bit) ALU instructions vs total ALU instructions
  ShaderBusyRatio: Percentage of time the shader cores are actively executing
  ShaderStalledRatio: Percentage of time shader cores are stalled waiting for resources
  TexturePipesBusyRatio: Percentage of time the texture units are active
  TextureL1MissRatio: Miss rate of the Level 1 Texture Cache
  TextureL2ReadMissRatio: Miss rate of the L2 Cache (UCHE) specifically for texture read requests
  L2ReadMissRatio: General L2 Cache (UCHE) read miss rate
  InstructionCacheMissRatio: Overall Instruction Cache miss rate for the GPU cores
  L1TextureMissPerPixel: Average number of L1 texture cache misses per processed pixel
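
As a rough illustration of how the derived metrics in EVENT_GROUP_CUSTOM relate to the raw counters above, the minimal sketch below computes a few of the ratio metrics directly from counter values. These definitions are assumptions based on the counter descriptions; the exact formulas and any scaling used inside nnPerf+ may differ.

    // Illustrative only: plausible definitions of a few derived ratios built
    // from the raw counters listed above; nnPerf+'s internal formulas may differ.
    public final class DerivedGpuMetrics {

        // Fraction of GPU cycles in which the shader cores were executing
        // (from SP_BUSY_CYCLES and GPUCycles).
        public static double shaderBusyRatio(long spBusyCycles, long gpuCycles) {
            return gpuCycles == 0 ? 0.0 : (double) spBusyCycles / gpuCycles;
        }

        // L1 texture cache miss rate: misses per fetch request
        // (from TP_L1_CACHELINE_MISSES and TP_L1_CACHELINE_REQUESTS).
        public static double textureL1MissRatio(long l1CachelineMisses, long l1CachelineRequests) {
            return l1CachelineRequests == 0 ? 0.0 : (double) l1CachelineMisses / l1CachelineRequests;
        }

        // Instruction cache miss rate (from SP_ICL1_MISSES and SP_ICL1_REQUESTS).
        public static double instructionCacheMissRatio(long icl1Misses, long icl1Requests) {
            return icl1Requests == 0 ? 0.0 : (double) icl1Misses / icl1Requests;
        }
    }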

Citation

If you find nnPerf useful in your research, please consider citing:

    @inproceedings{nnPerf,
        author = {Chu, Haolin and Zheng, Xiaolong and Liu, Liang and Ma, Huadong},
        title = {nnPerf: Demystifying DNN Runtime Inference Latency on Mobile Platforms},
        year = {2023},
        publisher = {Association for Computing Machinery},
        address = {New York, NY, USA},
        url = {https://dl.acm.org/doi/10.1145/3625687.3625797},
        doi = {10.1145/3625687.3625797},
        booktitle = {Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems},
        pages = {125–137},
    }

Get Help

If you have any feature requests or usage feedback for nnPerf, please send them to this email address: buptwins#163.com (replace # with @).

Gain a "Sophon" (智子) view into on-device neural network inference @ SenSys'23

Credits

Authors: Haolin Chu | Xiaolong Zheng | Liang Liu | Huadong Ma
Maintainers: Haolin Chu | Haiteng Xin | Boyan Lu | Zixu Wang