This project processes large text datasets using a custom MapReduce framework implemented in C. It demonstrates parallel data processing by coordinating multiple worker threads to count word frequencies across numerous input files.
The core of the project is a library that abstracts the complexity of concurrent programming, providing a clean API for Map and Reduce operations, backed by a Shortest Job First (SJF) thread pool.
- MapReduce Library (`mapreduce.c`):
  - Handles the orchestration of the Mapping, Partitioning, and Reducing phases.
  - Supports user-defined `Map` and `Reduce` functions.
- Custom Thread Pool (`threadpool.c`):
  - Manages a fixed number of worker threads (default: 5).
  - Implements Shortest Job First (SJF) scheduling to optimize throughput by prioritizing smaller files/partitions.
- Partitioning System:
  - Uses a hashing algorithm to distribute keys (words) across 10 distinct partitions.
  - Stores intermediate data in sorted linked lists for efficient reduction.
- Memory Management:
  - Clean resource handling with no memory leaks (verified via Valgrind).
- `distwc.c`: The "Word Count" client application that uses the MapReduce library.
- `mapreduce.c/h`: The framework implementing `MR_Emit`, `MR_Run`, and phase orchestration.
- `threadpool.c/h`: A generic thread pool implementation using Pthreads, mutexes, and condition variables.
- `list.c/h`: A linked-list data structure used to store key-value pairs within partitions.
- `Makefile`: Automation for building, running tests, and cleaning up.
Use the included Makefile to compile the source code:

```shell
make
```

This creates the `wordcount` executable.
You can run the word count program on the provided test cases located in the testcase/ directory:
```shell
./wordcount testcase/*.txt
```

Alternatively, use the make target:
```shell
make run
```

The program produces one result file per partition (result-0.txt ... result-9.txt). Each file contains unique words and their frequencies:

```
apple: 5
banana: 2
...
```
To verify the results (summing all counts):
```shell
make verify-results
```

The project includes several make targets used for testing correctness and performance:
- Check for memory leaks: `make valgrind`
- Check for threading errors: `make helgrind`
- Performance timing: `make time-test`
- Map Phase:
  - The framework assigns each input file to a job in the thread pool.
  - `distwc.c` parses files, tokenizes words, and calls `MR_Emit(word, "1")`.
  - `MR_Emit` hashes the word to find the correct partition and inserts it into a thread-safe sorted list.
- Sort/Shuffle:
  - Insertion into the partition lists keeps data sorted by key, simplifying the reduce step.
- Reduce Phase:
  - Once all map jobs are complete, the framework creates a reduce job for each partition.
  - `Reduce` iterates through the sorted list, aggregates counts for identical words, and writes the result to disk.