How can I refactor this NLP PyTorch code to work with pyspark+petastorm (custom collate needed...)?
Hi, I have created a variation of the PyTorch NLP tutorial https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html to work with a custom dataset (a pandas dataframe), and you can find it here.
I would like to refactor it to work with pyspark dataframes using petastorm, but it's not clear to me how to do it, and I would be grateful if somebody here could help clarify the steps to undertake (in broad strokes, sorry if I'm asking too much).
In particular, I would like to know:
- Assuming that `df` is dumped to parquet format with `df.write.parquet("file:///databricks/driver/test.parquet")`, am I fine creating a `converter_train` as in the snippet below?

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///databricks/driver')  # I am running on Databricks
converter_train = make_spark_converter(df)
```
- Should I dispose of the `TokenizedDataset` class entirely, given that using `converter_train.make_torch_dataloader()` doesn't need a custom `Dataset` class?
- Should `generate_batch` be refactored using `TransformSpec` from petastorm? If so, can anybody share any hints about how to do it?
- Should I replace any `DataLoader` creation with `converter_train.make_torch_dataloader()`? Can anyone point me to what function args I should use for my case?
I tried to take inspiration from https://docs.microsoft.com/en-us/azure/databricks/_static/notebooks/deep-learning/petastorm-spark-converter-pytorch.html, but point 3 in particular is a bit too tricky for me to overcome without hints... glad to share the final solution here if I succeed with the task!
- Yes.
- With the current petastorm architecture: yes. The petastorm library is used instead of the `DataLoader`+`Dataset` classes.
- No. Your `generate_batch` is an implementation of collate. Petastorm does not support custom collate implementations; it implements its own naive collate that assumes all fields have the same tensor dimensions.
- Yes.
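A minimal plain-NumPy illustration of the collation constraint (this is not petastorm's actual code, just a sketch of the idea): stacking per-row arrays into a batch tensor only works when every row has the same shape.

```python
import numpy as np

# Naive collation effectively stacks each field across the rows of a
# batch, which requires every row to have the same shape.
same_len = [np.array([1, 2, 3]), np.array([4, 5, 6])]
batch = np.stack(same_len)  # works: shape (2, 3)

ragged = [np.array([1, 2, 3]), np.array([4, 5])]
try:
    np.stack(ragged)        # fails: rows have different lengths
    stack_failed = False
except ValueError:
    stack_failed = True
```

This is why variable-length tokenized text needs either a custom collate function (as in your `generate_batch`) or fixed-size padded fields.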
We may try to let users customize the collate logic. I think this would solve your case, and I don't think it would be too difficult. I can take a stab at this in the next couple of days if you think it will help you.
The way you can use the current implementation is to pad all tokenized fields to a maximal expected length with a sentinel value (and optionally store the original length in a new column added in the `TransformSpec` function). That way petastorm's naive collation logic will work.
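A rough sketch of such a padding transform, assuming a hypothetical `tokens` column holding lists of token ids and an arbitrary `MAX_LEN` (adapt the names and sizes to your actual schema); how it would plug into `TransformSpec` is indicated in the docstring:

```python
import numpy as np
import pandas as pd

MAX_LEN = 16  # assumed maximal expected length
PAD_ID = 0    # sentinel padding value

def pad_fn(pdf: pd.DataFrame) -> pd.DataFrame:
    """Pad/truncate every tokenized row to MAX_LEN and record the original length.

    Petastorm's TransformSpec calls a function like this on each pandas
    batch before collation, e.g.:
        TransformSpec(pad_fn,
                      edit_fields=[('tokens', np.int64, (MAX_LEN,), False),
                                   ('length', np.int64, (), False)])
    """
    pdf['length'] = pdf['tokens'].map(lambda t: min(len(t), MAX_LEN))
    pdf['tokens'] = pdf['tokens'].map(
        lambda t: np.asarray(
            list(t[:MAX_LEN]) + [PAD_ID] * (MAX_LEN - len(t[:MAX_LEN])),
            dtype=np.int64))
    return pdf

# Small demonstration on an in-memory pandas batch:
example = pd.DataFrame({'tokens': [[5, 7, 9], list(range(30))], 'label': [0, 1]})
padded = pad_fn(example)
```

With all rows padded to the same fixed shape, the naive stacking collation described above works, and the `length` column lets the model mask out the padding downstream.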
Hi @selitvin, thanks! If you can take a stab at 3 (collating input of variable length) and update the issue... well, it would be super-helpful for me (eager to try it out). Thanks for all the tips; in the meantime I will try to resort to padding in the way you described, with fixed tensor dimensions.