preprocess_csv

Preprocess Text in CSV

Pre-process text data for text-as-data analyses.

The script takes a CSV with raw text and outputs a CSV with pre-processed text. Depending on the options chosen, it removes stop words, stems words, and normalizes the text (strips special characters, extra whitespace, and so on), then writes the cleaned text to a new CSV along with all the other columns.
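As a rough illustration of the steps described above, here is a minimal sketch written for Python 3. It is not the code in preprocessData.py: the `preprocess` function, its parameters, and the tiny `STOPWORDS` set are hypothetical stand-ins (the script itself relies on nltk), and the sketch does not handle the numbers option.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the script relies on nltk for the real list.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def preprocess(text, stopwords=STOPWORDS, stem=None):
    """Return a lower-cased, space-separated, cleaned version of text."""
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not an ASCII letter, digit or whitespace.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text.lower())
    # split() with no arguments also collapses runs of whitespace.
    tokens = [tok for tok in text.split() if tok not in stopwords]
    if stem is not None:
        # stem is any callable mapping a token to its stem.
        tokens = [stem(tok) for tok in tokens]
    return " ".join(tokens)


print(preprocess("The  café is   OPEN, to all!"))
```

With nltk installed, `nltk.stem.PorterStemmer().stem` could be passed as `stem`; which stemmer the script actually uses is not stated here.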

You can also output just a random sample of the file. See this Stack Overflow documentation for how random sampling is implemented.
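The linked answer is the reference for how the script samples. As a hedged illustration only, one common way to take a percentage sample of rows is to keep each row independently with probability `percent / 100`. The `sample_rows` function and its `seed` parameter below are hypothetical and may differ from what the script does.

```python
import random


def sample_rows(rows, percent, seed=None):
    """Yield each row independently with probability percent / 100."""
    rng = random.Random(seed)
    for row in rows:
        # rng.random() lies in [0.0, 1.0), so percent=100 keeps every row
        # and percent=0 keeps none.
        if rng.random() * 100 < percent:
            yield row


kept = list(sample_rows(range(1000), 25, seed=1))
print(len(kept), "of 1000 rows kept")
```

This approach streams the file once and needs no row count up front, at the cost of returning only approximately the requested percentage.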

The script depends on the nltk library and Python 2.7. To install the dependency:

pip install nltk

You also need to download the nltk data. The script downloads it to the ./nltk_data directory if no such directory exists. Alternatively, you can download all of the nltk data via:

python -m nltk.downloader all

Or, from a Python shell:

import nltk
nltk.download()
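The download-if-missing behaviour mentioned above can be sketched roughly as follows. This is an assumption about the shape of the logic, not the script's actual code: `ensure_nltk_data` is a hypothetical helper, and the `stopwords` package is only a guess at what the script needs.

```python
import os


def ensure_nltk_data(data_dir="./nltk_data", packages=("stopwords",)):
    """Download nltk packages into data_dir unless the directory exists.

    Returns True if a download was attempted, False otherwise.
    """
    if os.path.isdir(data_dir):
        # Directory already present: assume the data was fetched earlier.
        return False
    import nltk  # imported lazily so the check above works without nltk

    for package in packages:
        # download_dir tells nltk where to store the package.
        nltk.download(package, download_dir=data_dir)
    return True
```

If the data lives in a non-standard directory, that path also has to be appended to `nltk.data.path` so that nltk can find it.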

To run the script, type the following in a shell:

python preprocessData.py [options] <CSV input file>

Script options and their default values:

Options:
  -h, --help            show this help message and exit
  -b BEGIN, --begin=BEGIN
                        Begin row number (default: 1)
  -e END, --end=END     End row number (default: 0)
  -r RANDOM, --random=RANDOM
                        Percent random sampling (default: 100)
  -c COLUMN, --column=COLUMN
                        Data column to be cleaned (default: 'Body')
  -k, --keep            Keep original data column (default: No)
  -o OUTFILE, --outfile=OUTFILE
                        Clean output CSV filename (default: 'cleaned.csv')
  --append              Append if output CSV exists (default: No)
  --keep-accented       Keep accented (default: No)
  --keep-punct          Keep punctuation (default: No)
  --keep-stopwords      Keep stopwords (default: No)
  --keep-numbers        Keep numbers and words that begin with numbers
                        (default: No)
  --keep-stems          Do not stem (default: No)
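The help text above has the layout that Python's `optparse` module produces. The following is a minimal sketch of how a parser with these options and defaults might be declared. It is not taken from preprocessData.py; `build_parser` is a hypothetical name, and the integer types for the row numbers and the sampling percentage are assumptions.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser mirroring the options and defaults listed above."""
    parser = OptionParser(usage="usage: %prog [options] <CSV input file>")
    parser.add_option("-b", "--begin", type="int", default=1,
                      help="Begin row number (default: %default)")
    parser.add_option("-e", "--end", type="int", default=0,
                      help="End row number (default: %default)")
    parser.add_option("-r", "--random", type="int", default=100,
                      help="Percent random sampling (default: %default)")
    parser.add_option("-c", "--column", default="Body",
                      help="Data column to be cleaned (default: '%default')")
    parser.add_option("-k", "--keep", action="store_true", default=False,
                      help="Keep original data column (default: No)")
    parser.add_option("-o", "--outfile", default="cleaned.csv",
                      help="Clean output CSV filename (default: '%default')")
    parser.add_option("--append", action="store_true", default=False,
                      help="Append if output CSV exists (default: No)")
    # The --keep-* switches are all off by default; optparse stores them
    # as keep_accented, keep_punct, and so on.
    for name in ("accented", "punct", "stopwords", "numbers", "stems"):
        parser.add_option("--keep-" + name, action="store_true",
                          default=False)
    return parser


options, args = build_parser().parse_args(
    ["-c", "speaking", "sample_in.csv", "-o", "sample-out.csv"])
print(options.column, options.outfile, args)
```

Because `optparse` allows options and positional arguments to be interleaved by default, the input file can appear before `-o`.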

Example

To clean the column named speaking in sample_in.csv and save the output as sample-out.csv:

python preprocessData.py -c speaking sample_in.csv -o sample-out.csv
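Conceptually, a run like the one above reads each row, cleans one column, and writes every column back out. Below is a minimal sketch of that loop in Python 3 (the script itself targets Python 2.7). `clean_column` and the inline cleaning function are hypothetical, the in-memory data is made up for the demonstration, and the sketch always overwrites the column rather than modelling the `--keep` option.

```python
import csv
import io


def clean_column(infile, outfile, column, clean):
    """Copy CSV rows from infile to outfile, cleaning a single column."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only the chosen column changes; all other columns pass through.
        row[column] = clean(row[column])
        writer.writerow(row)


# In-memory stand-ins for the input and output files.
source = io.StringIO("id,speaking\n1,Hello  World!\n")
target = io.StringIO()
clean_column(source, target, "speaking",
             lambda text: " ".join(text.lower().split()))
print(target.getvalue())
```

With real files, open them with `newline=""` as the `csv` module documentation recommends.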
