Skip to content

Cython Performance Optimizations#178

Merged
oyvindln merged 11 commits intooyvindln:vhs_decodefrom
eshaz:cython-perf
Jan 6, 2025
Merged

Cython Performance Optimizations#178
oyvindln merged 11 commits intooyvindln:vhs_decodefrom
eshaz:cython-perf

Conversation

@eshaz
Copy link
Copy Markdown

@eshaz eshaz commented Dec 24, 2024

Fixes

  • Remove -1 assignment to initial hsync location. This should be set to 0 when not yet filled in. This will fix a possible scenario where the estimated hsync drifts over time if there is a long span of time with no vblanking intervals.

Performance

Various performance tweaks to the cython sync code. This should help a bit with performance, especially with computers with CPUs that have modern vector instructions.

  • Added C compiler flags to optimize more, i.e. -O3 -march=native -flto
  • Refactor to use nogil where possible
  • Use contiguous arrays to allow for better auto-vectorization
  • Update types to reduce casting

Before (current vhs_decode branch):

$ sudo perf stat -e instructions:u,fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.256b_packed_single:u,fp_arith_inst_retired.256b_packed_double:u,fp_arith_inst_retired.512b_packed_single:u,fp_arith_inst_retired.512b_packed_double:u -- ./vhs-decode -l 50 -n -t 3 -f 40 --overwrite [file_location].flac test
File Frame 50: VHS                                                              
Completed: saving JSON and exiting.  Took 9.12 seconds to decode 50 frames (5.79 FPS post-setup)

 Performance counter stats for './vhs-decode -l 50 -n -t 3 -f 40 --overwrite [file_location].flac test':

   195,119,171,841      instructions:u                                                        
                 0      fp_arith_inst_retired.128b_packed_single:u                                      
       526,252,196      fp_arith_inst_retired.128b_packed_double:u                                      
             1,783      fp_arith_inst_retired.256b_packed_single:u                                      
        53,500,004      fp_arith_inst_retired.256b_packed_double:u                                      
         4,688,137      fp_arith_inst_retired.512b_packed_single:u                                      
       300,630,107      fp_arith_inst_retired.512b_packed_double:u                                      

      10.112633828 seconds time elapsed

      23.986897000 seconds user
       3.288829000 seconds sys

After (this branch):

$ sudo perf stat -e instructions:u,fp_arith_inst_retired.128b_packed_single:u,fp_arith_inst_retired.128b_packed_double:u,fp_arith_inst_retired.256b_packed_single:u,fp_arith_inst_retired.256b_packed_double:u,fp_arith_inst_retired.512b_packed_single:u,fp_arith_inst_retired.512b_packed_double:u -- ./vhs-decode -l 50 -n -t 3 -f 40 --overwrite [file_location].flac test
File Frame 50: VHS                                                              
Completed: saving JSON and exiting.  Took 8.83 seconds to decode 50 frames (5.98 FPS post-setup)

 Performance counter stats for './vhs-decode -l 50 -n -t 3 -f 40 --overwrite [file_location].flac test':

   191,981,798,294      instructions:u                                                        
                 0      fp_arith_inst_retired.128b_packed_single:u                                      
       526,252,196      fp_arith_inst_retired.128b_packed_double:u                                      
             1,927      fp_arith_inst_retired.256b_packed_single:u                                      
        53,502,084      fp_arith_inst_retired.256b_packed_double:u                                      
         4,687,993      fp_arith_inst_retired.512b_packed_single:u                                      
       300,630,280      fp_arith_inst_retired.512b_packed_double:u                                      

       9.827384407 seconds time elapsed

      23.535043000 seconds user
       3.147495000 seconds sys

@eshaz eshaz marked this pull request as draft December 29, 2024 03:48
@eshaz eshaz marked this pull request as ready for review January 3, 2025 07:20
@oyvindln
Copy link
Copy Markdown
Owner

oyvindln commented Jan 3, 2025

CMAKE_BUILD_TYPE "Release means it's a release optimized build as opposed to debug build - which is what you want to use outside of development so it's the wrong thing to test for for adding --march=native. I think it's better to just have a suggestion to use ```CXXFLAGS="-march=native" CFLAGS="-march=native cmake ...." when building locally in the docs instead of putting it into cmakelists.txt.

LTO already gets enabled earlier in the cmakelists file if supported by the compiler, (set(CMAKE_INTERPROCEDURAL_OPTIMIZATION_RELEASE ON)) so don't need to add -flto in there, we could add -O3 I suppose but not sure if it makes any difference to the default O2 in release mode.

As for the cython side, I think it will break when using MSVC as it is now though would have to test to be sure. Not sure if it will pick up CXXFLAGS/CFLAGS from the environment similarly to cmake or if there is some conditional way to add flags only when using gcc/clang, but maybe it's better to just leave it off for now.

@eshaz
Copy link
Copy Markdown
Author

eshaz commented Jan 3, 2025

CMAKE_BUILD_TYPE "Release means it's a release optimized build as opposed to debug build - which is what you want to use outside of development so it's the wrong thing to test for for adding --march=native. I think it's better to just have a suggestion to use ```CXXFLAGS="-march=native" CFLAGS="-march=native cmake ...." when building locally in the docs instead of putting it into cmakelists.txt.

LTO already gets enabled earlier in the cmakelists file if supported by the compiler, (set(CMAKE_INTERPROCEDURAL_OPTIMIZATION_RELEASE ON)) so don't need to add -flto in there, we could add -O3 I suppose but not sure if it makes any difference to the default O2 in release mode.

That all sounds good, I'll make these changes.

As for the cython side, I think it will break when using MSVC as it is now though would have to test to be sure. Not sure if it will pick up CXXFLAGS/CFLAGS from the environment similarly to cmake or if there is some conditional way to add flags only when using gcc/clang, but maybe it's better to just leave it off for now.

Quickly searching this, it looks like there is a way to detect the compiler. I'll update this, and probably just omit the extra flags for MSVC. Someone who can test it / uses Windows can verify if any flags help or harm performance.

@eshaz
Copy link
Copy Markdown
Author

eshaz commented Jan 5, 2025

@oyvindln I made the updates discussed above. Let me know if there's anything else that needs changing.

@oyvindln oyvindln merged commit a0c0007 into oyvindln:vhs_decode Jan 6, 2025
@eshaz eshaz deleted the cython-perf branch March 8, 2025 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants