Information Retrieval and Text Mining

2023 Spring NTUIM

Programming Assignment 1: Term Extraction

Objective: Extract terms from a single English news document.
Process:
- Tokenization of the document text.
- Conversion of all text to lowercase.
- Stemming using the Porter Stemmer algorithm.
- Removal of stopwords.
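
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the repository's code: the README does not name a library, so the use of NLTK's `PorterStemmer` is an assumption (with a no-op fallback so the sketch still runs without it), and the `STOPWORDS` set and the `extract_terms` name are hypothetical. Because stopwords are removed after stemming, the sketch stems the stopword list as well, so that both sides of the comparison are in the same form.

```python
import re

try:
    # Assumption: NLTK is installed; the README does not say which stemmer library is used.
    from nltk.stem import PorterStemmer
    stem = PorterStemmer().stem
except ImportError:
    def stem(word):
        # Fallback so the sketch still runs: leaves the word unstemmed.
        return word

# Tiny illustrative stopword list (hypothetical, not the assignment's list).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
# Filtering happens after stemming, so compare against stemmed stopwords.
STEMMED_STOPWORDS = {stem(w) for w in STOPWORDS}

def extract_terms(text):
    """Return the list of terms for one English document."""
    tokens = re.findall(r"[A-Za-z]+", text)        # 1. tokenize on runs of letters
    tokens = [t.lower() for t in tokens]           # 2. lowercase
    stems = [stem(t) for t in tokens]              # 3. Porter stemming
    return [s for s in stems if s not in STEMMED_STOPWORDS]  # 4. stopword removal
```

Calling `extract_terms("The cats are in the garden")` keeps only the two content words; the exact stems depend on whether the NLTK stemmer was available.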

Programming Assignment 2: TF-IDF Vectorization

Objective: Convert a set of documents into TF-IDF vectors.
Process:
- Construction of a dictionary based on extracted terms.
- Recording of document frequency for each term.
- Transformation of each document into a TF-IDF unit vector.
- Implementation of a cosine similarity function cosine(Docx, Docy) to calculate the cosine similarity between any two documents.
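
As a rough sketch of this pipeline, the code below builds the dictionary, records document frequencies, and produces sparse unit vectors. The weighting `tf * log10(N / df)` is an assumption, since the README does not state which TF-IDF variant is used, and `build_tfidf` is a hypothetical name. Only `cosine` comes from the README; because the vectors are already unit length, it reduces to a dot product.

```python
import math
from collections import Counter

def build_tfidf(docs):
    """docs: list of term lists. Returns (dictionary, df, sparse unit vectors)."""
    n = len(docs)
    df = Counter()
    for terms in docs:
        df.update(set(terms))                      # count each term once per document
    dictionary = {t: i for i, t in enumerate(sorted(df))}  # term -> index
    vectors = []
    for terms in docs:
        tf = Counter(terms)
        # Assumed weighting: raw term frequency times log10(N / df).
        vec = {dictionary[t]: c * math.log10(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {i: w / norm for i, w in vec.items()}    # scale to unit length
        vectors.append(vec)
    return dictionary, df, vectors

def cosine(doc_x, doc_y):
    """Cosine similarity of two sparse unit vectors (a plain dot product)."""
    if len(doc_x) > len(doc_y):
        doc_x, doc_y = doc_y, doc_x                # iterate over the shorter vector
    return sum(w * doc_y.get(i, 0.0) for i, w in doc_x.items())
```

Storing each vector as an index-to-weight dict keeps the representation sparse, which matters once the dictionary grows to thousands of terms.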

Programming Assignment 3: Multinomial Naive Bayes Classifier

Objective: Implement and test a Multinomial Naive Bayes Classifier.
Process:
- Classification of documents into 13 classes using a training set of 15 documents per class.
- Feature selection to reduce the vocabulary to the top 500 terms, using methods such as the χ² test and the likelihood ratio.
- Add-one smoothing used to avoid zero probabilities in the classification.
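
The sketch below illustrates two of the pieces named above: a χ² score for one term/class pair, computed from a 2x2 table of document counts, and a multinomial Naive Bayes classifier with add-one smoothing. All function names (`chi_square`, `train_nb`, `classify_nb`) and the data layout are hypothetical, and the sketch assumes the vocabulary has already been reduced by feature selection; it is not the repository's implementation.

```python
import math
from collections import Counter, defaultdict

def chi_square(n11, n10, n01, n00):
    """Chi-square score from document counts.
    n11: in class, has term; n10: not in class, has term;
    n01: in class, lacks term; n00: not in class, lacks term."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0                                 # degenerate table: no evidence
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def train_nb(labeled_docs, vocabulary):
    """labeled_docs: list of (label, term list); vocabulary: set of selected terms."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    for label, terms in labeled_docs:
        class_docs[label] += 1
        term_counts[label].update(t for t in terms if t in vocabulary)
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(k / n_docs) for c, k in class_docs.items()}
    log_cond = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        # Add-one smoothing: (count + 1) / (total + |V|), so no probability is zero.
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / (total + len(vocabulary)))
                       for t in vocabulary}
    return log_prior, log_cond

def classify_nb(terms, log_prior, log_cond):
    """Return the class with the highest log posterior; out-of-vocabulary terms are skipped."""
    scores = {c: log_prior[c] + sum(log_cond[c][t] for t in terms if t in log_cond[c])
              for c in log_prior}
    return max(scores, key=scores.get)
```

Working in log space avoids floating-point underflow when many small probabilities are multiplied together.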

Programming Assignment 4: Hierarchical Agglomerative Clustering (HAC)

Objective: Perform HAC on the collection of documents.
Process:
- Documents represented as normalized TF-IDF vectors (from Assignment 2).
- Use of cosine similarity for pairwise document similarity.
- Exploration of different similarity measures between clusters including single-link, complete-link, group-average, and centroid similarity.
- Clustering results for K = 8, 13, and 20.
- Implementation of a custom HEAP to optimize the retrieval of the cluster pair with the maximal similarity.
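
One way to combine HAC with a heap is sketched below. It is an illustration under stated assumptions: Python's standard `heapq` stands in for the custom heap, stale heap entries are skipped lazily using per-cluster version counters, and only single-link and complete-link are covered (group-average and centroid similarity would need extra bookkeeping). The `hac` function and its parameters are hypothetical names.

```python
import heapq

def hac(sim, k, linkage=min):
    """Cluster documents until k clusters remain.
    sim: symmetric NxN similarity matrix (list of lists).
    linkage=min gives complete-link, linkage=max gives single-link."""
    n = len(sim)
    sim = [row[:] for row in sim]                  # work on a copy
    members = {i: [i] for i in range(n)}           # active cluster id -> document ids
    version = [0] * n                              # bumped whenever a cluster changes
    # heapq is a min-heap, so store negated similarities to pop the maximum first.
    heap = [(-sim[i][j], i, j, 0, 0) for i in range(n) for j in range(i + 1, n)]
    heapq.heapify(heap)
    while len(members) > k and heap:
        _, i, j, vi, vj = heapq.heappop(heap)
        if (i not in members or j not in members
                or version[i] != vi or version[j] != vj):
            continue                               # stale entry: a cluster has changed
        members[i].extend(members.pop(j))          # merge cluster j into cluster i
        version[i] += 1
        for m in members:
            if m == i:
                continue
            s = linkage(sim[i][m], sim[j][m])      # similarity of merged cluster to m
            sim[i][m] = sim[m][i] = s
            a, b = (i, m) if i < m else (m, i)
            heapq.heappush(heap, (-s, a, b, version[a], version[b]))
    return sorted(sorted(c) for c in members.values())
```

For example, with four documents where 0 and 1 are similar and 2 and 3 are similar, `hac(sim, 2)` groups them as `[[0, 1], [2, 3]]`. Lazy deletion trades a larger heap for not having to remove entries from the middle of it.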
