@Quentin-Anthony Quentin-Anthony commented Jun 13, 2022

This PR introduces the DeepSpeed Communication Logger, which implements logging for all DeepSpeed communication calls.

After this PR, all communication calls from #1985 are automatically detected and logged (depending on config options). A final summary is then printed. For example:

Comm. Op            Message Size        Count               Total Latency(ms)   Avg Latency(ms)     tput_avg (Gbps)     busbw_avg (Gbps)    
broadcast
                    0B                  2                   0.19                0.10                0.00                0.00                
                    2.0 KB              146                 11.12               0.08                0.43                0.41                
                    6.0 KB              24                  1.84                0.08                1.29                1.21                
                    6.31 KB             1                   0.12                0.12                0.87                0.82                
                    8.0 KB              24                  1.85                0.08                1.73                1.62                
                    2.0 MB              24                  2.05                0.08                397.06              372.24              
                    4.0 MB              1                   0.15                0.15                434.01              406.89              
                    6.0 MB              24                  2.78                0.12                872.00              817.50              
                    8.0 MB              48                  6.36                0.13                1020.36             956.59              
                    98.25 MB            1                   8317.12             8317.12             0.20                0.19                
barrier
                    0B                  3                   237.96              79.32               0.00                0.00                
all_gather
                    128.0 B             146                 70.81               0.38                0.01                0.01                
                    384.0 B             24                  11.45               0.37                0.02                0.02                
                    512.0 B             24                  10.72               0.38                0.02                0.02                
                    128.0 KB            24                  12.44               0.52                4.20                3.94                
                    256.0 KB            1                   0.45                0.45                9.29                8.71                
                    384.0 KB            24                  18.01               0.57                11.50               10.78               
                    512.0 KB            48                  43.42               0.65                13.22               12.40               
                    6.14 MB             2                   17.76               8.88                40.96               38.40               
all_gather_base
                    128.0 B             1460                87.36               0.06                0.03                0.03                
                    256.0 B             147                 3.18                0.02                0.67                0.63                
                    384.0 B             240                 14.28               0.06                0.10                0.10                
                    512.0 B             240                 14.43               0.06                0.14                0.13                
                    128.0 KB            48                  3.38                0.07                29.79               27.93               
                    128.12 KB           24                  0.24                0.00                471.78              442.30              
                    256.0 KB            2                   0.18                0.09                45.52               42.67               
                    384.38 KB           72                  3.69                0.05                353.74              331.63              
                    512.0 KB            96                  5.22                0.06                412.72              386.92              
                    512.12 KB           24                  1.16                0.05                275.55              258.33              
                    512.5 KB            24                  1.70                0.07                120.35              112.83              
                    6.0 MB              219                 16.63               0.07                1401.24             1313.66             
                    6.0 MB              6                   0.51                0.08                1234.17             1157.03             
                    6.14 MB             11                  1.00                0.08                1308.19             1226.43             
                    6.25 MB             9                   0.31                0.03                15263.03            14309.09            
reduce_scatter_base
                    678.86 MB           40                  602.29              9.69                1468.06             1376.31             
all_reduce
                    1.0 B               20                  5572.57             6.37                0.00                0.00                
                    8.0 B               40                  100.00              0.58                0.00                0.00                
log_summary_barrier
                    0B                  1                   0.11                0.11                0.00                0.00     
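
A few relationships can be read straight off the summary above. The sketch below copies several rows verbatim and checks two things: the ratio of busbw_avg to tput_avg, and whether Avg Latency equals Total Latency divided by Count. The `Row` type and helper names are hypothetical and exist only for this illustration. The interpretation in the comments (a rank-count factor of 15/16, and an average that is not a plain mean) is an inference from the printed numbers, not something the PR text states.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    """One summary line; field names are illustrative, not DeepSpeed's."""
    op: str
    size: str
    count: int
    total_ms: float
    avg_ms: float
    tput_gbps: float
    busbw_gbps: float


# Rows copied from the summary table above.
ROWS = [
    Row("broadcast", "2.0 MB", 24, 2.05, 0.08, 397.06, 372.24),
    Row("broadcast", "8.0 MB", 48, 6.36, 0.13, 1020.36, 956.59),
    Row("all_gather", "6.14 MB", 2, 17.76, 8.88, 40.96, 38.40),
    Row("all_gather_base", "6.25 MB", 9, 0.31, 0.03, 15263.03, 14309.09),
    Row("reduce_scatter_base", "678.86 MB", 40, 602.29, 9.69, 1468.06, 1376.31),
    Row("all_reduce", "1.0 B", 20, 5572.57, 6.37, 0.00, 0.00),
]


def busbw_ratio(row: Row) -> Optional[float]:
    """busbw_avg / tput_avg, or None when throughput is reported as zero."""
    if row.tput_gbps == 0:
        return None
    return row.busbw_gbps / row.tput_gbps


def naive_mean_ms(row: Row) -> float:
    """Plain mean latency, i.e. total latency divided by call count."""
    return row.total_ms / row.count


for row in ROWS:
    ratio = busbw_ratio(row)
    ratio_text = "n/a" if ratio is None else f"{ratio:.4f}"
    print(f"{row.op:<20} {row.size:<10} busbw/tput={ratio_text:<7} "
          f"total/count={naive_mean_ms(row):.2f} ms, reported avg={row.avg_ms:.2f} ms")
```

For the rows with non-zero throughput the ratio comes out very close to 15/16, which would be consistent with an (n-1)/n bus-bandwidth factor on 16 ranks. In several rows (for example all_reduce at 1.0 B) the reported average is far below total/count, which suggests the average column is not a plain mean, possibly because outliers are trimmed. Both readings are assumptions to check against the DeepSpeed source.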

This PR contributes the following features:

  • Automatic detection and logging of all comms calls, with custom log names
  • A final comms summary (printed manually via the log_summary method)
  • Config support
  • Verbosity levels (for automatic grouping within DeepSpeed, e.g. all_gather_zero3)
  • An associated tutorial and documentation
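
To make the feature list concrete, here is a minimal, self-contained sketch of how a communication logger of this kind could work: a decorator times each wrapped call, groups the latency by log name and message size, honours simple config switches, and builds a summary on request. It is not DeepSpeed's implementation. `ToyCommsLogger`, `timed_op`, `log_all`, `only_ops`, and `fake_all_gather` are hypothetical names invented for this example, and the "communication" is a local placeholder.

```python
import time
from collections import defaultdict
from functools import wraps


class ToyCommsLogger:
    """Hypothetical stand-in for a comms logger; not DeepSpeed's class."""

    def __init__(self, enabled=True, log_all=True, only_ops=()):
        self.enabled = enabled          # master switch (assumed config option)
        self.log_all = log_all          # log every op, or only those in only_ops
        self.only_ops = set(only_ops)
        # records[log_name][message_size_bytes] -> list of latencies in ms
        self.records = defaultdict(lambda: defaultdict(list))

    def should_log(self, name):
        return self.enabled and (self.log_all or name in self.only_ops)

    def timed_op(self, func):
        """Wrap a comm function so each call is timed and recorded."""
        @wraps(func)
        def wrapper(payload, *args, log_name=None, **kwargs):
            name = log_name or func.__name__   # custom log name if given
            if not self.should_log(name):
                return func(payload, *args, **kwargs)
            start = time.perf_counter()
            try:
                # A real GPU collective would need a device synchronize here
                # before stopping the timer; this sketch runs on the CPU only.
                return func(payload, *args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1e3
                self.records[name][len(payload)].append(elapsed_ms)
        return wrapper

    def summary(self):
        """Return (name, size_bytes, count, total_ms, mean_ms) tuples."""
        rows = []
        for name, by_size in self.records.items():
            for size, lats in sorted(by_size.items()):
                total = sum(lats)
                rows.append((name, size, len(lats), total, total / len(lats)))
        return rows

    def log_summary(self):
        """Print the summary, grouped by log name."""
        for name, size, count, total, mean in self.summary():
            print(f"{name:<20} {size:>8} B {count:>6} {total:>10.3f} {mean:>10.3f}")


logger = ToyCommsLogger()


@logger.timed_op
def fake_all_gather(payload: bytes) -> bytes:
    # Placeholder that performs no real communication.
    return payload * 2


fake_all_gather(b"x" * 128)
fake_all_gather(b"x" * 128, log_name="all_gather_zero3")
logger.log_summary()
```

Passing `log_name` mirrors the idea of grouping internal calls under a separate label such as all_gather_zero3, so they appear as their own section in the summary instead of being mixed with user-issued calls.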

Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

@Quentin-Anthony Quentin-Anthony left a comment

This has been reviewed by Ammar

@Quentin-Anthony Quentin-Anthony changed the title from "DeepSpeed Communication Logging" to "DeepSpeed Communication Profiling and Logging" on Jun 30, 2022

@jeffra jeffra left a comment

LGTM

@Quentin-Anthony

@jeffra and @awan-10 -- All comments have been resolved and we're ready to merge from my side.

@jeffra jeffra merged commit 5349347 into master Jul 25, 2022
@jeffra jeffra deleted the staging-comms-logging-v1 branch July 25, 2022 20:35