Preprocess the data with the four simple steps described below. Depending on how large your data is, set aside about 20 minutes for these preprocessing steps.
The goal of preprocessing is to truncate the vocabulary, remove all words that are not in the vocabulary, and store the text in numpy arrays. The data is then split into training, validation, and test sets, and the statistics of the data needed by the algorithm are computed (e.g. the number of words in each group and the names of the groups).
These are the steps. They are described in detail below:
- Decide on filenames for groups
- Create vocabulary file and save text data in numpy arrays
- Subsample the data and split into training and testing
- Create dat_stats.pkl
- Generate negative samples for evaluation
An example dataset called lorem_ipsum is included to demonstrate the steps.
The following folder structure is required to run structured Bernoulli embeddings.
Assumption: Your text is in text files in the dat/[dataset_name]/raw/ subfolder.
dat/
[dataset_name]/
raw/
*.txt
The preprocessing scripts will add the following folders and files
dat/
[dataset_name]/
unigram.txt
dat_stats.pkl
raw/
*.txt
*.npy
train/
*.npy
test/
*.npy
            neg/
*.npy
valid/
*.npy
            neg/
*.npy
The train/, test/ and valid/ folders will contain the .npy files with the data.
The file unigram.txt contains the vocabulary together with the vocabulary counts. The file dat_stats.pkl contains a pickled Python dictionary with the information about the data that is required to run the algorithm.
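As an illustration of how dat_stats.pkl can be inspected after preprocessing, here is a minimal sketch. The exact keys are determined by step_3_create_data_stats.py; the keys used below ("name", "states") are assumptions for demonstration only.

```python
import io
import pickle

# Hypothetical contents of dat_stats.pkl; the actual keys are set by
# step_3_create_data_stats.py and may differ.
dat_stats = {"name": "lorem_ipsum", "states": ["A", "B"]}

# Round-trip through pickle, exactly as the scripts would write/read the file
# (an in-memory buffer stands in for dat/[dataset_name]/dat_stats.pkl).
buf = io.BytesIO()
pickle.dump(dat_stats, buf)
buf.seek(0)
loaded = pickle.load(buf)
print(loaded["states"])  # ['A', 'B']
```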
This folder contains an example dataset. Text files are in lorem_ipsum/raw/*.txt.
Running the python scripts
python step_1_count_words.py
python step_2_split_data.py
python step_3_create_data_stats.py
python step_4_negative_samples.py
without modification (and in numerical order) will prepare the dataset for fitting structured embeddings.
Your text files are in the raw/ subfolder. Decide now how many groups you want and give each group its own name.
Group names cannot contain underscores as the underscore separates the group name from the rest of the file name.
For example, your folder could contain the file names:
Group1_rest_of_filename_xxx.txt
Group1_rest_of_filename_xyz.txt
Group1_rest_of_filename_yyy.txt
...
Group2_rest_of_filename_xxx.txt
Group2_rest_of_filename_xyz.txt
...
...
GroupN_rest_of_filename_zzz.txt
GroupN_rest_of_filename_yyy.txt
Save the list of group names [Group1, Group2, Group3, ..., GroupN]. You will need it in step 3.
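If you want to double-check your naming, the group names can be recovered from the filenames themselves, since everything before the first underscore is the group name. A small sketch (the filenames below are illustrative):

```python
# Derive the set of group names from raw filenames of the form
# GroupName_rest_of_filename.txt: the group name is everything
# before the first underscore.
filenames = [
    "A_doc1.txt",
    "A_doc2.txt",
    "B_doc1.txt",
]
groups = sorted({fname.split("_", 1)[0] for fname in filenames})
print(groups)  # ['A', 'B']
```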
Tip: Instead of having many short files, you might want to have a few longer files in each group. Simply concatenate multiple short files into longer text files.
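One way to do this concatenation is sketched below. The paths and filenames are illustrative (a temporary directory stands in for dat/[dataset_name]/raw/):

```python
import glob
import os
import tempfile

# Set up two short example files for group "A" in a temporary directory
# (stand-in for dat/[dataset_name]/raw/).
tmp = tempfile.mkdtemp()
for i, text in enumerate(["lorem", "ipsum"]):
    with open(os.path.join(tmp, f"A_part{i}.txt"), "w") as f:
        f.write(text + "\n")

# Concatenate all short files of the group into one longer file.
with open(os.path.join(tmp, "A_merged.txt"), "w") as out:
    for path in sorted(glob.glob(os.path.join(tmp, "A_part*.txt"))):
        with open(path) as f:
            out.write(f.read())

with open(os.path.join(tmp, "A_merged.txt")) as f:
    merged = f.read()
print(merged)
```

Note that the merged file still starts with the group name followed by an underscore, so the group assignment is preserved.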
For the lorem ipsum example, the raw filenames (in lorem_ipsum/raw/) either start with A_ or B_. This means that there are 2 groups with names 'A' and 'B'.
In this step you will run the script step_1_count_words.py.
The script counts distinct words and truncates the resulting vocabulary.
Then each word is replaced with its index in the vocabulary and the resulting numpy arrays are saved.
Go into this file and change dataset_name to the name of the folder in which the data is.
It should be the subfolder under dat/ (as specified above your data is in dat/[dataset_name]/raw/).
The current default dataset name is lorem_ipsum.
Also modify the vocabulary size you want to use; a good size for English corpora is V = 10000.
Assuming you are in dat/ simply run
python step_1_count_words.py
Tip: This script also handles text preprocessing: punctuation is removed, line breaks are handled, and everything is lowercased. Depending on your dataset, additional or different preprocessing steps might be required. We also recommend extracting bigrams and adding them to the vocabulary.
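The core of this step can be sketched in a few lines. This is a simplified illustration, not the actual contents of step_1_count_words.py: lowercase and strip punctuation, count words, keep the V most frequent as the vocabulary, and map the remaining in-vocabulary words to their indices in a numpy array.

```python
import re
from collections import Counter

import numpy as np

# Toy settings; for real English corpora something like V = 10000 is typical.
V = 3
text = "Lorem ipsum, dolor! Lorem ipsum lorem sit."

# Lowercase, strip punctuation, and split into tokens.
tokens = re.sub(r"[^\w\s]", "", text.lower()).split()

# Count distinct words and truncate the vocabulary to the V most frequent.
counts = Counter(tokens)
vocab = [w for w, _ in counts.most_common(V)]
word2idx = {w: i for i, w in enumerate(vocab)}

# Drop out-of-vocabulary words and replace each word by its vocabulary index.
data = np.array([word2idx[w] for w in tokens if w in word2idx])
print(vocab)         # ['lorem', 'ipsum', 'dolor']
print(data.tolist())  # [0, 1, 2, 0, 1, 0]
```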
Now open the file step_2_split_data.py and change the dataset name.
Then from dat/ run
python step_2_split_data.py
Now, open the file step_3_create_data_stats.py and change the dataset_name.
Then add the list of group names from step 0 under states
(e.g. states = ['Group1', 'Group2', 'Group3', ..., 'GroupN']).
Then run
python step_3_create_data_stats.py
In this step, you will generate negative samples to be used during model evaluation. By drawing negative samples before running any code, you ensure that all methods are evaluated on the same set of negative examples.
Open step_4_negative_samples.py and change the dataset_name.
Then run
python step_4_negative_samples.py
You will get an error during training if you train with more negative samples than you generated in step 4.
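To give an idea of what this step does, here is a simplified sketch of drawing negative samples from the unigram distribution of the vocabulary. The details of step_4_negative_samples.py may differ (e.g. the actual counts come from unigram.txt); the counts below are a toy assumption.

```python
import numpy as np

# Toy unigram counts for a 3-word vocabulary (in practice these would
# come from unigram.txt).
unigram_counts = np.array([5.0, 3.0, 2.0])
probs = unigram_counts / unigram_counts.sum()

# Draw negative samples once, up front, so all methods are later evaluated
# on the same set. n_neg must be at least as large as the number of
# negative samples used during training.
rng = np.random.default_rng(0)
n_neg = 10
neg_samples = rng.choice(len(probs), size=n_neg, p=probs)
print(neg_samples.shape)  # (10,)
```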