Description
Issue Type
Bug
Have you reproduced the bug with TF nightly?
Yes
Source
source
Tensorflow Version
tf 2.13.0-dev20230406
Custom Code
No
OS Platform and Distribution
Linux Ubuntu 22.04
Mobile device
n/a
Python version
3.9.16
Bazel version
n/a
GCC/Compiler version
n/a
CUDA/cuDNN version
11.8.0/8.6.0.163
GPU model and memory
NVIDIA GeForce RTX 3080 Ti 12GiB
Current Behaviour?
I am following the tutorial at https://www.tensorflow.org/tutorials/quickstart/beginner
I modified the code according to the instructions at https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras in order to enable profiling for a range of batches during training.
With this change, training seems to proceed as normal, with the logs indicating that a profiler session is created, and a profile is collected.
The logs directory contains one non-empty plugins/profile/<date>/<host>.xplane.pb file.
However, when I run TensorBoard (either main or tb-nightly) on the logs, it fails to detect a profile: the Profile tab is missing from the UI. I confirmed that I ran pip install -U tensorboard-plugin-profile first.
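One way to double-check the install step is to ask the Python environment which distributions it can see. The snippet below is a minimal sketch, not part of the original reproduction. It assumes it is run with the same interpreter that launches TensorBoard, and the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names taken from the report; each one may or may not be installed.
    for name in ("tensorboard", "tb-nightly", "tensorboard-plugin-profile"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

If the plugin shows as not installed here, it was probably installed into a different environment than the one running TensorBoard.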
I would have expected one of two outcomes: either (a) TensorBoard would show me the profiles, or (b) if something went wrong while collecting or displaying the profiles, an error message would say so, so that I could fix the issue.
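The claim that the logs directory contains a non-empty .xplane.pb file can be checked with a short script. This is a rough sketch using only the standard library. The helper name find_profile_files is hypothetical, and the "logs/fit" default is an assumption based on the log_dir used in the reproduction code.

```python
from pathlib import Path


def find_profile_files(log_dir):
    """Return sorted (relative path, size in bytes) pairs for *.xplane.pb files under log_dir."""
    root = Path(log_dir)
    if not root.is_dir():
        return []
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*.xplane.pb")
        if p.is_file()
    )


if __name__ == "__main__":
    # "logs/fit" mirrors the log_dir prefix in the report; adjust as needed.
    for rel_path, size in find_profile_files("logs/fit"):
        print(f"{rel_path}: {size} bytes")
```

An empty result, or a file of size 0, would point to a collection problem. A non-empty file, as reported here, suggests the problem is on the display side.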
Standalone code to reproduce the issue
# The code is at https://www.tensorflow.org/tutorials/quickstart/beginner
# I changed the model.fit() call to use the TensorBoard callback to collect a profile:
import datetime

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    # Profile the batch range 500 to 600.
    profile_batch=(500, 600))
model.fit(x_train, y_train, epochs=5, callbacks=[tensorboard_callback])
Relevant log output
2023-04-06 23:17:28.048863: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-04-06 23:17:28.048880: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2023-04-06 23:17:28.048915: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1671] Profiler found 1 GPUs
2023-04-06 23:17:28.237604: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2023-04-06 23:17:28.237742: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
Epoch 1/5
2023-04-06 23:17:28.747772: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f08c0180cf0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-04-06 23:17:28.747785: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3080 Ti, Compute Capability 8.6
2023-04-06 23:17:28.751189: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-04-06 23:17:28.834436: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:426] Loaded cuDNN version 8600
2023-04-06 23:17:28.868033: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-04-06 23:17:28.900180: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
522/1875 [=======>......................] - ETA: 5s - loss: 0.4875 - accuracy: 0.8590
2023-04-06 23:17:30.991051: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-04-06 23:17:30.991106: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
645/1875 [=========>....................] - ETA: 4s - loss: 0.4499 - accuracy: 0.8701
2023-04-06 23:17:31.542500: I tensorflow/tsl/profiler/lib/profiler_session.cc:70] Profiler session collecting data.
2023-04-06 23:17:31.545123: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
2023-04-06 23:17:31.570874: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_collector.cc:541] GpuTracer has collected 6158 callback api events and 5891 activity events.
2023-04-06 23:17:31.598454: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3017 - accuracy: 0.9121
Epoch 2/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1441 - accuracy: 0.9570
Epoch 3/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1075 - accuracy: 0.9685
Epoch 4/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0878 - accuracy: 0.9732
Epoch 5/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0737 - accuracy: 0.9771