How can I refactor this NLP PyTorch code to work with pyspark+petastorm (custom collate needed...)?
Hi, I have created a variation of the PyTorch NLP tutorial https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html to work with a custom dataset (a pandas dataframe), and you can find it here.
I would like to refactor it to work with pyspark dataframes using petastorm, but it's not clear to me how to do it, and I would be grateful if somebody here could help clarify the steps to undertake (in broad strokes, sorry if I'm asking too much).
In particular, I would like to know:
- Assuming that `df` is dumped to parquet format with `df.write.parquet("file:///databricks/driver/test.parquet")`, am I fine creating a `converter_train` as in the snippet below?

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///databricks/driver')  # I am running on Databricks
converter_train = make_spark_converter(df)
```
- Should I dispose of the `TokenizedDataset` class entirely, given that using `converter_train.make_torch_dataloader()` doesn't need a custom `Dataset` class?
- Should `generate_batch` be refactored using `TransformSpec` from petastorm? If so, can anybody share any hints about how to do it?
- Should I replace any `DataLoader` creation with `converter_train.make_torch_dataloader()`? Can anyone point me to what function args I should use for my case?
I tried to take inspiration from https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/deep-learning/petastorm-spark-converter-pytorch.html, but point 3 in particular is a bit too tricky for me to overcome without hints... glad to share the final solution here if I succeed with the task!
- Yes.
- With the current petastorm architecture: yes. The petastorm library is used instead of the `DataLoader`+`Dataset` classes.
- No. Your `generate_batch` is an implementation of collate. Petastorm does not support custom collate implementations; it implements its own naive collate that assumes all fields have the same tensor dimensions.
- Yes.
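A minimal plain-NumPy illustration of the collation constraint (this is not petastorm's actual code, just a sketch of the idea): stacking per-row arrays into a batch tensor only works when every row has the same shape.

```python
import numpy as np

# Naive collation effectively stacks each field across the rows of a
# batch, which requires every row to have the same shape.
same_len = [np.array([1, 2, 3]), np.array([4, 5, 6])]
batch = np.stack(same_len)  # works: shape (2, 3)

ragged = [np.array([1, 2, 3]), np.array([4, 5])]
try:
    np.stack(ragged)        # fails: rows have different lengths
    stack_failed = False
except ValueError:
    stack_failed = True
```

This is why variable-length tokenized text needs either a custom collate function (as in your `generate_batch`) or fixed-size padded fields.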
We may try to let users customize the collate logic. I think this would solve your case, and I don't think it would be too difficult. I can take a stab at this in the next couple of days if you think it will help you.
The way you can use the current implementation is to pad all tokenized fields to a maximal expected length with a sentinel value (and optionally store the original length in a new column added in the `TransformSpec` function). That way petastorm's naive collation logic will work.
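A rough sketch of such a padding transform, assuming a hypothetical `tokens` column holding lists of token ids and an arbitrary `MAX_LEN` (adapt the names and sizes to your actual schema); how it would plug into `TransformSpec` is indicated in the docstring:

```python
import numpy as np
import pandas as pd

MAX_LEN = 16  # assumed maximal expected length
PAD_ID = 0    # sentinel padding value

def pad_fn(pdf: pd.DataFrame) -> pd.DataFrame:
    """Pad/truncate every tokenized row to MAX_LEN and record the original length.

    Petastorm's TransformSpec calls a function like this on each pandas
    batch before collation, e.g.:
        TransformSpec(pad_fn,
                      edit_fields=[('tokens', np.int64, (MAX_LEN,), False),
                                   ('length', np.int64, (), False)])
    """
    pdf['length'] = pdf['tokens'].map(lambda t: min(len(t), MAX_LEN))
    pdf['tokens'] = pdf['tokens'].map(
        lambda t: np.asarray(
            list(t[:MAX_LEN]) + [PAD_ID] * (MAX_LEN - len(t[:MAX_LEN])),
            dtype=np.int64))
    return pdf

# Small demonstration on an in-memory pandas batch:
example = pd.DataFrame({'tokens': [[5, 7, 9], list(range(30))], 'label': [0, 1]})
padded = pad_fn(example)
```

With all rows padded to the same fixed shape, the naive stacking collation described above works, and the `length` column lets the model mask out the padding downstream.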
Hi @selitvin, thanks! If you can take a stab at 3 (collating input of variable length) and update the issue... well, it would be super-helpful for me (eager to try it out). Thanks for all the tips; in the meantime I will try to resort to padding in the way you described, with fixed tensor dimensions.