This project processes large text datasets using a custom MapReduce framework implemented in C. It demonstrates parallel data processing by coordinating multiple worker threads to count word frequencies across numerous input files.
The core of the project is a library that abstracts the complexity of concurrent programming, providing a clean API for Map and Reduce operations, backed by a Shortest Job First (SJF) thread pool.
- MapReduce Library (`mapreduce.c`):
  - Handles the orchestration of the Mapping, Partitioning, and Reducing phases.
  - Supports user-defined `Map` and `Reduce` functions.
- Custom Thread Pool (`threadpool.c`):
  - Manages a fixed number of worker threads (default: 5).
  - Implements Shortest Job First (SJF) scheduling to optimize throughput by prioritizing smaller files/partitions.
- Partitioning System:
  - Uses a hashing algorithm to distribute keys (words) across 10 distinct partitions.
  - Stores intermediate data in sorted linked lists for efficient reduction.
- Memory Management:
  - Clean resource handling with no memory leaks (verified via Valgrind).
- `distwc.c`: The "Word Count" client application that uses the MapReduce library.
- `mapreduce.c/h`: The framework implementing `MR_Emit`, `MR_Run`, and phase orchestration.
- `threadpool.c/h`: A generic thread pool implementation using Pthreads, mutexes, and condition variables.
- `list.c/h`: A linked-list data structure used to store key-value pairs within partitions.
- `Makefile`: Automation for building, running tests, and cleaning up.
Use the included Makefile to compile the source code:

```shell
make
```

This creates the `wordcount` executable.
You can run the word count program on the provided test cases located in the testcase/ directory:
```shell
./wordcount testcase/*.txt
```

Alternatively, use the make target:
```shell
make run
```

The program produces one result file per partition (result-0.txt ... result-9.txt). Each file contains unique words and their frequencies:

```
apple: 5
banana: 2
...
```
To verify the results (summing all counts):
```shell
make verify-results
```

The project includes several make targets used for testing correctness and performance:
- Check for memory leaks: `make valgrind`
- Check for threading errors: `make helgrind`
- Performance timing: `make time-test`
- Map Phase:
  - The framework assigns each input file to a job in the thread pool.
  - `distwc.c` parses files, tokenizes words, and calls `MR_Emit(word, "1")`.
  - `MR_Emit` hashes the word to find the correct partition and inserts it into a thread-safe sorted list.
- Sort/Shuffle:
  - Insertion into the partition lists keeps data sorted by key, simplifying the reduce step.
- Reduce Phase:
  - Once all map jobs are complete, the framework creates a reduce job for each partition.
  - `Reduce` iterates through the sorted list, aggregates counts for identical words, and writes the result to disk.