
[BUG] Multi-gpu training notebook is giving error if we generate schema from core #651

@rnyak

Description

Bug description

I am getting the following error when I run the multi-GPU training notebook:

/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/usr/local/lib/python3.8/dist-packages/merlin/dtypes/mappings/tf.py:52: UserWarning: Tensorflow dtype mappings did not load successfully due to an error: No module named 'tensorflow'
  warn(f"Tensorflow dtype mappings did not load successfully due to an error: {exc.msg}")
/usr/local/lib/python3.8/dist-packages/merlin/dtypes/mappings/tf.py:52: UserWarning: Tensorflow dtype mappings did not load successfully due to an error: No module named 'tensorflow'
  warn(f"Tensorflow dtype mappings did not load successfully due to an error: {exc.msg}")
Traceback (most recent call last):
  File "pyt_trainer.py", line 41, in <module>
    input_module = tr.TabularSequenceFeatures.from_schema(
  File "/usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/sequence.py", line 193, in from_schema
    output: TabularSequenceFeatures = super().from_schema(  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/tabular.py", line 176, in from_schema
    output = cls(
  File "/usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/sequence.py", line 127, in __init__
    super().__init__(
  File "/usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/tabular.py", line 84, in __init__
    assert to_merge != {}, "Please provide at least one input layer"
AssertionError: Please provide at least one input layer
Traceback (most recent call last):
  File "pyt_trainer.py", line 41, in <module>
    input_module = tr.TabularSequenceFeatures.from_schema(
  File "/usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/sequence.py", line 193, in from_schema
    output: TabularSequenceFeatures = super().from_schema(  # type: ignore
  File "/usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/tabular.py", line 176, in from_schema
    output = cls(
  File "/usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/sequence.py", line 127, in __init__
    super().__init__(
  File "/usr/local/lib/python3.8/dist-packages/transformers4rec/torch/features/tabular.py", line 84, in __init__
    assert to_merge != {}, "Please provide at least one input layer"
AssertionError: Please provide at least one input layer
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23905) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pyt_trainer.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-16_19:30:24
  host      : 1902e905751e
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 23906)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-16_19:30:24
  host      : 1902e905751e
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 23905)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
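For context, the assertion at `transformers4rec/torch/features/tabular.py:84` fires when `from_schema` cannot build any feature branch from the schema. A minimal sketch of that selection logic (hypothetical tag names and structures, not the actual Transformers4Rec code), illustrating how a schema whose columns carry no usable tags leads to an empty `to_merge` dict:

```python
# Hypothetical sketch of why the assertion fires. from_schema builds one
# input branch per column whose tags match (e.g. categorical -> embedding).
# Assumption: the schema generated from merlin core arrives without the
# expected tags, so nothing is selected.
schema = {
    "item_id": set(),   # no tags survived schema generation (assumption)
    "category": set(),
}

to_merge = {
    name: "EmbeddingFeatures"  # placeholder for the real branch object
    for name, tags in schema.items()
    if "categorical" in tags
}

# This empty dict is exactly the state that trips the library's
# `assert to_merge != {}, "Please provide at least one input layer"`.
assert to_merge == {}
```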

Steps/Code to reproduce bug

You need to run the 01 and 03 notebooks in this folder in order. For dataset generation you can use:

import calendar
import datetime

import cudf
import numpy as np
import pandas as pd


def generate_synthetic_data(
    start_date: datetime.date, end_date: datetime.date, rows_per_day: int = 1000
) -> pd.DataFrame:
    assert end_date > start_date, "end_date must be later than start_date"

    number_of_days = (end_date - start_date).days
    total_number_of_rows = number_of_days * rows_per_day

    # Generate a long-tail distribution of item interactions. This simulates that some items are
    # more popular than others.
    long_tailed_item_distribution = np.clip(
        np.random.lognormal(3.0, 1.0, total_number_of_rows).astype(np.int64), 1, 50000
    )

    # generate random item interaction features
    df = pd.DataFrame(
        {
            "session_id": np.random.randint(70000, 80000, total_number_of_rows),
            "item_id": long_tailed_item_distribution,
        },
    )

    # generate category mapping for each item-id
    df["category"] = pd.cut(df["item_id"], bins=334, labels=np.arange(1, 335)).astype(
        np.int64
    )

    max_session_length = 60 * 60  # 1 hour

    def add_timestamp_to_session(session: pd.DataFrame):
        random_start_date_and_time = calendar.timegm(
            (
                start_date
                # Add day offset from start_date
                + datetime.timedelta(days=np.random.randint(0, number_of_days))
                # Add time offset within the random day
                + datetime.timedelta(seconds=np.random.randint(0, 86_400))
            ).timetuple()
        )
        session["timestamp"] = random_start_date_and_time + np.clip(
            np.random.lognormal(3.0, 1.0, len(session)).astype(np.int64),
            0,
            max_session_length,
        )
        return session

    # group_keys=False keeps the original flat index; on recent pandas,
    # a bare reset_index() would fail because session_id already exists
    # as a column
    df = (
        df.groupby("session_id", group_keys=False)
        .apply(add_timestamp_to_session)
        .reset_index(drop=True)
    )

    return df

interactions_df = generate_synthetic_data(datetime.date(2014, 4, 1), datetime.date(2014, 6, 30))
interactions_df = cudf.from_pandas(interactions_df)
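Two details of the generator are worth noting: `np.random.lognormal` produces the long-tailed item popularity, and `pd.cut` buckets item ids into 334 categories. A quick standalone check of both (smaller sample size and an illustrative seed, neither of which appears in the original snippet):

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # illustrative seed, not in the original snippet

# Long-tailed item popularity, as in generate_synthetic_data
item_ids = np.clip(np.random.lognormal(3.0, 1.0, 10_000).astype(np.int64), 1, 50000)

df = pd.DataFrame({"item_id": item_ids})
# 334 equal-width bins over the observed item_id range, labelled 1..334
df["category"] = pd.cut(df["item_id"], bins=334, labels=np.arange(1, 335)).astype(
    np.int64
)

assert df["item_id"].between(1, 50000).all()
assert df["category"].between(1, 334).all()
# the distribution is long-tailed: the median sits far below the max
assert df["item_id"].median() < df["item_id"].max() / 10
```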

Environment details

  • Transformers4Rec version:
  • Platform:
  • Python version:
  • Huggingface Transformers version:
  • PyTorch version (GPU?):
  • Tensorflow version (GPU?):

Additional context

I am using the merlin-pytorch:23.02 image with the latest main branches of the libraries pulled in.
