OneHotEncoder can accidentally create columns with same name #201

@lars-reimann

Describe the bug

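The OneHotEncoder names the columns it creates according to the schema <old_column_name>_<value>. Because column names and values can themselves contain underscores, two different (column, value) pairs can map to the same name.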

To Reproduce

Run this program:

from safeds.data.tabular.containers import Table
from safeds.data.tabular.transformation import OneHotEncoder

if __name__ == '__main__':
    table = Table.from_dict({"a_b": ["c"], "a": ["b_c"]})
    transformed_table = OneHotEncoder().fit_and_transform(table)

    print(transformed_table)

It raises an exception:

ValueError: Length mismatch: Expected axis has 2 elements, new values have 1 elements

The issue is that two columns with the same name (a_b_c) get created: column a_b with value c and column a with value b_c both map to a_b_c.

Expected behavior

No exception. The names of all created columns should be unique. They should also not conflict with existing columns in the Table. This can be done by detecting conflicts between two created columns or between a created column and an existing, unchanged column and appending a suffix _<counter> to the names of the created columns (e.g. a_b_c_1 vs. a_b_c_2).
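The renaming described above could be sketched roughly as follows. This is only an illustration in plain Python, not the actual OneHotEncoder implementation; the helper name make_unique_names and its parameters are hypothetical. It assumes the candidate names have already been built with the <old_column_name>_<value> schema.

```python
from collections import Counter


def make_unique_names(candidates, existing=()):
    """Return candidate names made unique (hypothetical helper, not Safe-DS API).

    candidates: proposed names of created columns, in creation order.
    existing:   names of unchanged columns that must not be shadowed.
    """
    counts = Counter(candidates)
    existing = set(existing)
    # A name conflicts if it is proposed more than once or already exists.
    conflicted = {n for n in counts if counts[n] > 1 or n in existing}
    # Reserve existing names and all candidates that are kept unchanged,
    # so a generated suffix never collides with them.
    taken = existing | (set(counts) - conflicted)
    counters = {}
    result = []
    for name in candidates:
        if name not in conflicted:
            result.append(name)
            continue
        counter = counters.get(name, 0)
        while True:
            counter += 1
            unique = f"{name}_{counter}"
            if unique not in taken:
                break
        counters[name] = counter
        taken.add(unique)
        result.append(unique)
    return result


# The table from the reproduction above, as a plain dict.
data = {"a_b": ["c"], "a": ["b_c"]}
candidates = [
    f"{column}_{value}"
    for column, values in data.items()
    for value in dict.fromkeys(values)  # distinct values, order preserved
]
print(make_unique_names(candidates))  # ['a_b_c_1', 'a_b_c_2']
```

Only conflicting names receive a suffix in this sketch, so columns that were already unique keep their <old_column_name>_<value> name.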

Labels

released (Included in a release)

Status

✔️ Done
