Collected Tensorflow profiles are not recognized in Tensorboard #60262

@stefanbucur

Description

Issue Type

Bug

Have you reproduced the bug with TF nightly?

Yes

Source

source

Tensorflow Version

tf 2.13.0-dev20230406

Custom Code

No

OS Platform and Distribution

Linux Ubuntu 22.04

Mobile device

n/a

Python version

3.9.16

Bazel version

n/a

GCC/Compiler version

n/a

CUDA/cuDNN version

11.8.0/8.6.0.163

GPU model and memory

NVIDIA GeForce RTX 3080 Ti 12GiB

Current Behaviour?

I am following the tutorial at https://www.tensorflow.org/tutorials/quickstart/beginner

I modified the code according to the instructions at https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras in order to enable profiling for a range of batches during training.

With this change, training seems to proceed as normal, and the logs indicate that a profiler session is created and a profile is collected.

The logs directory contains one non-empty plugins/profile/<date>/<host>.xplane.pb file.
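A minimal sketch of how the presence and size of the collected profile could be checked from Python. The helper name `find_xplane_files` is hypothetical, and the glob pattern only assumes the `plugins/profile/<date>/<host>.xplane.pb` layout described above, at any depth below the log directory.

```python
from pathlib import Path


def find_xplane_files(log_dir):
    """Return sorted (path, size_in_bytes) pairs for every *.xplane.pb file
    found under a plugins/profile/<run>/ directory below log_dir."""
    root = Path(log_dir)
    # "**" matches zero or more directory levels, so timestamped run
    # subdirectories between log_dir and plugins/ are covered as well.
    matches = root.glob("**/plugins/profile/*/*.xplane.pb")
    return sorted((path, path.stat().st_size) for path in matches)
```

Calling `find_xplane_files("logs/fit")` after training should list one entry with a non-zero size if the profile was written as described; a size of 0 would point at the collection step rather than at TensorBoard.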

However, when I run TensorBoard (either main or tb-nightly) on the logs, it fails to detect a profile: the Profile tab is missing from the UI. I can confirm that I ran pip install -U tensorboard-plugin-profile beforehand.
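One thing that may be worth ruling out (an assumption, not a diagnosis of this issue) is that the plugin was installed into a different Python environment than the one TensorBoard runs from. The sketch below uses only the standard library; the helper names `installed_versions` and `describe_environment` are hypothetical, and the package list is just the set of distributions mentioned in this report.

```python
import sys
from importlib import metadata

# Distribution names mentioned in this report (stable and nightly variants).
PACKAGES = [
    "tensorflow",
    "tf-nightly",
    "tensorboard",
    "tb-nightly",
    "tensorboard-plugin-profile",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def describe_environment(names=PACKAGES):
    """Return the interpreter path together with the versions found in it."""
    return {"python": sys.executable, "packages": installed_versions(names)}
```

Printing `describe_environment()` with the same interpreter that launches TensorBoard shows whether tensorboard-plugin-profile is visible there, and whether both a stable and a nightly build are installed side by side.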

I expected one of two outcomes: (a) TensorBoard shows me the profiles, or (b) if something went wrong while collecting or displaying the profiles, an error message says so, so that I can fix the issue.

Standalone code to reproduce the issue

# The base code is at https://www.tensorflow.org/tutorials/quickstart/beginner
# (it imports tensorflow as tf and defines model, x_train and y_train).
# I change the model.fit() call to use the TensorBoard callback to collect a profile:

import datetime  # needed for the timestamped log directory

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1,
    profile_batch=(500, 600))  # profile batches 500 through 600
model.fit(x_train, y_train, epochs=5, callbacks=[tensorboard_callback])

Relevant log output

2023-04-06 23:17:28.048863: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-04-06 23:17:28.048880: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2023-04-06 23:17:28.048915: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1671] Profiler found 1 GPUs
2023-04-06 23:17:28.237604: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2023-04-06 23:17:28.237742: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
Epoch 1/5
2023-04-06 23:17:28.747772: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f08c0180cf0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-04-06 23:17:28.747785: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3080 Ti, Compute Capability 8.6
2023-04-06 23:17:28.751189: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-04-06 23:17:28.834436: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:426] Loaded cuDNN version 8600
2023-04-06 23:17:28.868033: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-04-06 23:17:28.900180: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
 522/1875 [=======>......................] - ETA: 5s - loss: 0.4875 - accuracy: 0.8590
2023-04-06 23:17:30.991051: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-04-06 23:17:30.991106: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
 645/1875 [=========>....................] - ETA: 4s - loss: 0.4499 - accuracy: 0.8701
2023-04-06 23:17:31.542500: I tensorflow/tsl/profiler/lib/profiler_session.cc:70] Profiler session collecting data.
2023-04-06 23:17:31.545123: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
2023-04-06 23:17:31.570874: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_collector.cc:541]  GpuTracer has collected 6158 callback api events and 5891 activity events. 
2023-04-06 23:17:31.598454: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
1875/1875 [==============================] - 8s 4ms/step - loss: 0.3017 - accuracy: 0.9121
Epoch 2/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1441 - accuracy: 0.9570
Epoch 3/5
1875/1875 [==============================] - 7s 4ms/step - loss: 0.1075 - accuracy: 0.9685
Epoch 4/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0878 - accuracy: 0.9732
Epoch 5/5
1875/1875 [==============================] - 6s 3ms/step - loss: 0.0737 - accuracy: 0.9771
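The log above can be summarized mechanically to confirm that a profiler session started, collected data, and was torn down. This is a small sketch: the function name `summarize_profiler_log` is hypothetical, and the regular expressions assume the exact message wording shown in this log, which may differ in other TensorFlow versions.

```python
import re

# Matches e.g. "profiler_session.cc:119] Profiler session started."
SESSION_RE = re.compile(r"profiler_session\.cc:\d+\] (Profiler session [a-z ]+)\.")
# Matches the GpuTracer summary line and captures both event counts.
GPU_RE = re.compile(
    r"GpuTracer has collected (\d+) callback api events and (\d+) activity events"
)


def summarize_profiler_log(lines):
    """Return (session_messages, gpu_event_counts) extracted from log lines."""
    sessions = []
    gpu_counts = []
    for line in lines:
        match = SESSION_RE.search(line)
        if match:
            sessions.append(match.group(1))
        match = GPU_RE.search(line)
        if match:
            gpu_counts.append((int(match.group(1)), int(match.group(2))))
    return sessions, gpu_counts
```

Non-zero GpuTracer counts together with a "collecting data" message suggest that collection itself succeeded, which narrows the problem to how the profile is written or displayed.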

Metadata

Labels

TF 2.12 (For issues related to Tensorflow 2.12)
comp:tensorboard (Tensorboard related issues)
stale (This label marks the issue/pr stale - to be closed automatically if no activity)
stat:awaiting response (Status - Awaiting response from author)
type:bug (Bug)
