Throw error on tensor creation when sequence shape cannot be determined#7583

Merged
soumith merged 3 commits into pytorch:master from sethah:pandas_segfault
May 18, 2018
Conversation

sethah (Contributor) commented May 15, 2018

Fixes #7278

Currently, tensors can be created from Python sequences (determined by PySequence_Check). The shape of the tensor to be created is determined by iterating over the first element in each of the (potentially nested) sequences. This is done here.

There is an assumption that it is safe to index the PyObject at element zero if PySequence_Check(obj) is true and PySequence_Length(obj) > 0. Unfortunately, Python objects are still free to raise errors in their __getitem__ methods under these conditions, which is often the case when creating tensors from Pandas objects. In this case, PySequence_GetItem returns a null pointer, which in turn causes a segmentation fault when the next PySequence_Check call is made.

This patch adds a simple check for a null pointer and raises a ValueError when this happens. The error trace from the call to __getitem__ is not propagated since it is generally unhelpful and confusing. A unit test is added that verifies the appropriate error is raised in this situation.
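The shape inference and the new guard can be sketched in pure Python. This is only an illustration under stated assumptions, not the C++ code in the patch: infer_shape, _looks_like_sequence and BrokenSequence are hypothetical names, the attribute check is a rough stand-in for PySequence_Check, and the error wording is illustrative.

```python
def _looks_like_sequence(obj):
    # Rough stand-in for PySequence_Check (assumption, not the real C API rule).
    # Strings are excluded because indexing a string yields another string.
    return (
        hasattr(obj, "__getitem__")
        and hasattr(obj, "__len__")
        and not isinstance(obj, (str, bytes))
    )


def infer_shape(obj):
    """Build a shape by descending into the first element of each nested sequence."""
    shape = []
    while _looks_like_sequence(obj):
        length = len(obj)
        shape.append(length)
        if length == 0:
            break
        try:
            obj = obj[0]
        except Exception:
            # Python analogue of the null-pointer check: raise a ValueError and
            # suppress the original __getitem__ trace with "from None".
            raise ValueError(
                f"could not determine the shape of object type '{type(obj).__name__}'"
            ) from None
    return shape


class BrokenSequence:
    """Hypothetical object: non-empty, but indexing position 0 raises."""

    def __len__(self):
        return 2

    def __getitem__(self, index):
        raise KeyError(index)
```

Calling infer_shape on a nested list returns one length per nesting level, while infer_shape(BrokenSequence()) raises a ValueError instead of failing in an uncontrolled way.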

Examples

import numpy as np
import pandas as pd
import torch

seq = pd.Series([1.0, 2.0, 3.0])
torch.Tensor(seq)  # succeeds, since seq[0] is defined
torch.Tensor(seq[1:])  # segfault, since seq[0] generates a KeyError

df = pd.DataFrame(np.ones((2, 3)), columns=['a', 'b', 'c'])
torch.Tensor(df)  # segfault, since df[0] tries to access a column named 0
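Why the sliced series fails at index 0 can be shown without Pandas. The sketch below uses a hypothetical LabelIndexed class (not a Pandas type) that, like the series in the example above, looks items up by label and keeps the original labels after slicing.

```python
class LabelIndexed:
    """Hypothetical stand-in for a label-indexed container such as a sliced Series."""

    def __init__(self, labels, values):
        self._data = dict(zip(labels, values))

    def __len__(self):
        return len(self._data)

    def __getitem__(self, label):
        # Lookup is by label, not by position, so a missing label raises KeyError.
        return self._data[label]

    def tail(self, start):
        # Drop the first `start` entries but keep the remaining original labels.
        labels = list(self._data)[start:]
        return LabelIndexed(labels, [self._data[k] for k in labels])


seq = LabelIndexed([0, 1, 2], [1.0, 2.0, 3.0])
sliced = seq.tail(1)

first = seq[0]  # label 0 exists in the full container
try:
    sliced[0]  # label 0 was sliced away, although len(sliced) > 0
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

The sliced object is non-empty and indexable, yet index 0 raises, which is exactly the condition the shape-inference code did not anticipate.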

Notes

  • It would be better to be able to handle Pandas objects in general, or at least give a nicer error message (e.g. "did you mean torch.Tensor(df.values)?"), but that code would be specific to checking for Pandas objects.
  • I don't believe there's any surefire way to get the first element in the underlying sequence, which is what PySequence_GetItem(obj, 0) tries to do, but I could have missed it.
  • I am new to the code here, so if there is a better way to handle the error, or if the unit test is not quite exhaustive, please let me know.

ezyang (Contributor) commented May 15, 2018

@pytorchbot retest this please

yf225 (Contributor) commented May 16, 2018

@pytorchbot retest this please

@soumith soumith merged commit 32b23a4 into pytorch:master May 18, 2018
soumith (Collaborator) commented May 18, 2018

thank you @sethah

onnxbot added a commit to onnxbot/onnx-fb-universe that referenced this pull request May 18, 2018
weiyangfb pushed a commit to weiyangfb/pytorch that referenced this pull request Jun 11, 2018
kigenchesire commented

This worked for me. I was trying to convert y_train and y_val into tensors.

train_labels = torch.tensor(y_train.to_numpy())
val_labels = torch.tensor(y_val.to_numpy())

Development

Successfully merging this pull request may close these issues.

Python segfaults if torch.tensor is called with pandas series slice

5 participants