per-layer feature mask #4273
Create a new param entry with id 31 (uint).
Use its bits for per-layer feature masking.
```cpp
bool use_fp16_packed;
bool use_fp16_storage;
bool use_fp16_arithmetic;
```

Sample use case
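For context, these flags already exist globally on ncnn::Option. A minimal sketch of enabling them for a whole network is shown below; the per-layer mask proposed here (param id 31) would then override them for individual layers. The file names are placeholders.

```cpp
#include "net.h"

int main()
{
    ncnn::Net net;

    // global fp16 options from ncnn::Option; the proposed per-layer
    // feature mask (param id 31) would override these for single layers
    net.opt.use_fp16_packed = true;
    net.opt.use_fp16_storage = true;
    net.opt.use_fp16_arithmetic = true;

    // placeholder file names for illustration
    net.load_param("model.param");
    net.load_model("model.bin");

    return 0;
}
```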
```
7767517
6 6
Input data 0 1 data 0=224 1=224 2=3
Convolution conv1_1 1 1 data conv1_1 0=64 1=3 4=1 5=1 6=1728 9=1
Convolution conv1_2 1 1 conv1_1 conv1_2 0=64 1=3 4=1 5=1 6=36864 9=1
Pooling pool1 1 1 conv1_2 pool1 1=2 2=2
Convolution conv2_1 1 1 pool1 conv2_1 0=128 1=3 4=1 5=1 6=73728 9=1
Convolution conv2_2 1 1 conv2_1 output 0=128 1=3 4=1 5=1 6=147456 9=1
```

Typically, we use fp16 computation to improve inference speed.
Because the weight values of conv2_1 are large, fp16 accumulation may cause numerical overflow, so fp16 needs to be disabled for conv2_1 individually while the other layers keep using fp16 mode.

Add 31=1, i.e. (1<<0), as the disable bit to turn off fp16 for that layer:

```
Convolution conv2_1 1 1 pool1 conv2_1 0=128 1=3 4=1 5=1 6=73728 9=1 31=1
```

It is also possible to control num_threads for each layer individually, but it is not very useful, so no more precious bits are spent on it.
| mask | bit | rationale |
|---|---|---|
| no fp16 arithmetic | 1<<0 | precision concern |
| no fp16 storage | 1<<1 | precision concern |
| no bf16 storage | 1<<2 | precision concern |
| no int8 | 1<<3 | debug dynamic quantized model |
| no vulkan | 1<<4 | reduce overhead for cpu op - gpu split - cpu op |
| no sgemm | 1<<5 | reduce some memory |
| no winograd | 1<<6 | reduce some memory |
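To combine several masks for one layer, OR the bit values from the table together and write the result as the value of param id 31. A minimal sketch assuming the bit layout above; the enum names are illustrative, not an existing ncnn API:

```cpp
// illustrative names for the proposed mask bits, matching the table above
enum FeatMaskBit
{
    FM_NO_FP16_ARITHMETIC = 1 << 0,
    FM_NO_FP16_STORAGE    = 1 << 1,
    FM_NO_BF16_STORAGE    = 1 << 2,
    FM_NO_INT8            = 1 << 3,
    FM_NO_VULKAN          = 1 << 4,
    FM_NO_SGEMM           = 1 << 5,
    FM_NO_WINOGRAD        = 1 << 6,
};

// e.g. disable fp16 arithmetic and winograd for one layer
int featmask = FM_NO_FP16_ARITHMETIC | FM_NO_WINOGRAD; // = 65, written as 31=65
```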
These masks will be implemented, and more bits can be added for other needs in the future.