Preprocess Text in CSV

Preprocess text data for text-as-data analyses.

The script takes a CSV with raw text and outputs a CSV with preprocessed text. Depending on the options chosen, it removes stop words, stems words, and normalizes the text (strips special characters, extra whitespace, etc.), then writes the clean text to a new CSV along with all the other columns.
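As a rough illustration of the normalization steps, here is a minimal sketch. It is not the script's code: the function name `clean_text`, the lower-casing step, and the tiny `STOPWORDS` set are assumptions made for this example, stemming is left out, and it is written for Python 3 whereas the script targets Python 2.7.

```python
import re
import string
import unicodedata

# Tiny illustrative stop-word list (assumption); the script uses nltk data instead.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def clean_text(text, keep_accented=False, keep_punct=False,
               keep_numbers=False, keep_stopwords=False):
    """Return a lower-cased, normalized version of `text` (hypothetical helper)."""
    text = text.lower()  # lower-casing is an assumption of this sketch
    if not keep_accented:
        # Decompose accented characters, then drop the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if not keep_punct:
        # Replace every ASCII punctuation character with a space.
        text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)
    tokens = text.split()  # splitting also collapses extra whitespace
    if not keep_numbers:
        # Drop numbers and words that begin with a digit.
        tokens = [t for t in tokens if not t[0].isdigit()]
    if not keep_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)
```

With the defaults, `clean_text("The  café has 3 cats!")` returns `"cafe has cats"`. The keyword arguments loosely mirror the script's `--keep-*` flags.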

You can also output just a random sample of the file. See this Stack Overflow documentation for how random sampling is implemented.
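The linked documentation is not reproduced here, so the following is only a sketch of one common way to do percent sampling in a single pass: keep each row independently with probability `percent / 100`. The function name `sample_rows` and the `seed` parameter are hypothetical, and the script may sample differently.

```python
import random


def sample_rows(rows, percent, seed=None):
    """Yield each row with probability percent/100 (Bernoulli sampling)."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    threshold = percent / 100.0
    for row in rows:
        # rng.random() is in [0, 1), so percent=100 keeps every row.
        if rng.random() < threshold:
            yield row
```

This approach never holds the whole file in memory, but the sample size is only approximately the requested percentage.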

The script depends on the nltk library and Python 2.7. To install the dependency:

pip install nltk

You also need the nltk data. The script downloads it to the ./nltk_data directory if no such directory exists. Alternatively, you can download all the nltk data via:

python -m nltk.downloader all

Or by opening a Python shell and typing:

import nltk
nltk.download()
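The download-if-missing behavior mentioned above could look roughly like the sketch below. The function name `ensure_nltk_data`, the `download` hook, and the choice of the `stopwords` package are assumptions for illustration; the README does not say which packages the script fetches.

```python
import os


def ensure_nltk_data(data_dir="./nltk_data", packages=("stopwords",), download=None):
    """Download `packages` into `data_dir` only if the directory is missing.

    Returns True if a download was triggered, False otherwise.
    """
    if os.path.isdir(data_dir):
        return False
    if download is None:
        import nltk  # imported lazily so the directory check needs no nltk
        download = nltk.download
    os.makedirs(data_dir)
    for package in packages:
        # nltk.download accepts a package id and a download_dir keyword.
        download(package, download_dir=data_dir)
    return True
```

nltk searches the directories listed in `nltk.data.path`, so a non-default location such as `./nltk_data` generally has to be appended to that list before the data can be loaded.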

To run the script, type the following in a shell:

python preprocessData.py [options] <CSV input file>

Script options and their default values:

Options:
  -h, --help            show this help message and exit
  -b BEGIN, --begin=BEGIN
                        Begin row number (default: 1)
  -e END, --end=END     End row number (default: 0)
  -r RANDOM, --random=RANDOM
                        Percent random sampling (default: 100)
  -c COLUMN, --column=COLUMN
                        Data column to be cleaned (default: 'Body')
  -k, --keep            Keep original data column (default: No)
  -o OUTFILE, --outfile=OUTFILE
                        Clean output CSV filename (default: 'cleaned.csv')
  --append              Append if output CSV exists (default: No)
  --keep-accented       Keep accented (default: No)
  --keep-punct          Keep punctuation (default: No)
  --keep-stopwords      Keep stopwords (default: No)
  --keep-numbers        Keep numbers and words that begin with numbers
                        (default: No)
  --keep-stems          Do not stem (default: No)
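The help text above has the layout that Python's standard `optparse` module produces. As a sketch only, a subset of these options could be declared as follows; the option types (for example `int` for `--random`) and the `build_parser` name are assumptions, not taken from the script.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser covering a subset of the options listed above."""
    parser = OptionParser(usage="usage: %prog [options] <CSV input file>")
    parser.add_option("-b", "--begin", type="int", default=1,
                      help="Begin row number (default: %default)")
    parser.add_option("-e", "--end", type="int", default=0,
                      help="End row number (default: %default)")
    parser.add_option("-r", "--random", type="int", default=100,
                      help="Percent random sampling (default: %default)")
    parser.add_option("-c", "--column", default="Body",
                      help="Data column to be cleaned (default: '%default')")
    parser.add_option("-k", "--keep", action="store_true", default=False,
                      help="Keep original data column (default: No)")
    parser.add_option("-o", "--outfile", default="cleaned.csv",
                      help="Clean output CSV filename (default: '%default')")
    # optparse stores this flag as options.keep_stopwords.
    parser.add_option("--keep-stopwords", action="store_true", default=False,
                      help="Keep stopwords (default: No)")
    return parser
```

Calling `build_parser().parse_args(["-c", "speaking", "sample_in.csv"])` returns the parsed options plus the positional arguments, here `["sample_in.csv"]`.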

Example

To clean the column named speaking in sample_in.csv and save the output as sample-out.csv:

python preprocessData.py -c speaking sample_in.csv -o sample-out.csv
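To make the effect of such a run concrete, here is a minimal sketch of the CSV side of the task: clean one column and copy every other column through unchanged. The `clean_csv` function, the `clean` callback, and the `<column>_clean` naming used when `keep=True` are hypothetical; the README does not specify how the script names its output columns. The sketch is written for Python 3, unlike the script itself.

```python
import csv
import io


def clean_csv(infile, outfile, column, clean, keep=False):
    """Copy rows from `infile` to `outfile`, cleaning one column.

    `clean` is any function that maps raw text to clean text.
    """
    reader = csv.DictReader(infile)
    fieldnames = list(reader.fieldnames)
    # Hypothetical naming: with keep=True the cleaned text goes into a new
    # "<column>_clean" column; otherwise it overwrites the original column.
    target = column + "_clean" if keep else column
    if keep:
        fieldnames.append(target)
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row[target] = clean(row[column])
        writer.writerow(row)


# In-memory demo; real use would pass open file objects instead.
raw = io.StringIO("id,speaking\n1,Hello   World\n2,Good  Morning\n")
out = io.StringIO()
clean_csv(raw, out, "speaking", lambda s: " ".join(s.lower().split()))
print(out.getvalue())
```

The demo's `clean` callback only lower-cases and collapses whitespace; any fuller cleaning function can be passed in its place.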