YousefMoussa/MapReduce

MapReduce Word Count Project

Overview

This project processes large text datasets using a custom MapReduce framework implemented in C. It demonstrates parallel data processing by coordinating multiple worker threads to count word frequencies across numerous input files.

The core of the project is a library that abstracts the complexity of concurrent programming, providing a clean API for Map and Reduce operations, backed by a Shortest Job First (SJF) thread pool.

Features

  • MapReduce Library (mapreduce.c):
    • Handles the orchestration of Mapping, Partitioning, and Reducing phases.
    • Supports user-defined Map and Reduce functions.
  • Custom Thread Pool (threadpool.c):
    • Manages a fixed number of worker threads (default: 5).
    • Implements SJF scheduling, running smaller files/partitions first to reduce average job wait time.
  • Partitioning System:
    • Uses a hashing algorithm to distribute keys (words) across 10 distinct partitions.
    • Stores intermediate data in sorted linked lists for efficient reduction.
  • Memory Management:
    • Clean resource handling with no memory leaks (verified via Valgrind).
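The SJF scheduling described above amounts to keeping the job queue sorted by job size, so the dequeue side always pops the shortest pending job. A minimal sketch of that insertion (the struct layout and names here are illustrative assumptions, not the project's actual threadpool.c API):

```c
#include <stdlib.h>

/* A pending job, ordered by its estimated size (e.g., input file length). */
typedef struct job {
    void (*run)(void *arg);
    void *arg;
    size_t size;            /* scheduling key: smaller jobs run first */
    struct job *next;
} job_t;

/* Insert a job so the queue stays sorted by ascending size; the worker
 * side then simply pops the head to get the shortest job. */
void sjf_enqueue(job_t **head, job_t *j) {
    job_t **cur = head;
    while (*cur != NULL && (*cur)->size <= j->size)
        cur = &(*cur)->next;
    j->next = *cur;
    *cur = j;
}
```

In a real pool the enqueue and dequeue would be guarded by a mutex and a condition variable, as the README's threadpool.c description implies.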

File Structure

  • distwc.c: The "Word Count" client application that utilizes the MapReduce library.
  • mapreduce.c/h: The framework handling the logic for MR_Emit, MR_Run, and phase orchestration.
  • threadpool.c/h: A generic thread pool implementation using Pthreads, Mutexes, and Condition Variables.
  • list.c/h: A linked-list data structure used to store key-value pairs within partitions.
  • Makefile: Automation for building, running tests, and cleaning up.

Building the Project

Use the provided Makefile to compile the source code:

make

This creates the wordcount executable.

Running the Application

You can run the word count program on the provided test cases located in the testcase/ directory:

./wordcount testcase/*.txt

Alternatively, use the make target:

make run

Output

The program produces result files corresponding to each partition (e.g., result-0.txt ... result-9.txt). Each file contains unique words and their frequencies:

apple: 5
banana: 2
...
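Because each partition's list is kept sorted, output in this shape can be produced in a single pass: identical keys sit next to each other, so a running counter suffices. A sketch (the node type and names are assumptions):

```c
#include <stdio.h>
#include <string.h>

typedef struct node {
    char *key;
    struct node *next;
} node_t;

/* One pass over a sorted partition list: count each run of identical keys
 * and write "word: count" lines in the format shown above. */
void reduce_partition(node_t *head, FILE *out) {
    while (head != NULL) {
        char *word = head->key;
        int count = 0;
        while (head != NULL && strcmp(head->key, word) == 0) {
            count++;
            head = head->next;
        }
        fprintf(out, "%s: %d\n", word, count);
    }
}
```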

To verify the results (summing all counts):

make verify-results

Testing & Debugging

The project includes several make targets used for testing correctness and performance:

  • Check for Memory Leaks:
    make valgrind
  • Check for Threading Errors:
    make helgrind
  • Performance Timing:
    make time-test

Implementation Details

  1. Map Phase:

    • The framework assigns each input file to a job in the thread pool.
    • distwc.c parses files, tokenizes words, and calls MR_Emit(word, "1").
    • MR_Emit hashes the word to find the correct partition and inserts it into a thread-safe sorted list.
  2. Sort/Shuffle:

    • Insertion into the partition lists keeps data sorted by key, simplifying the reduce step.
  3. Reduce Phase:

    • Once all maps are complete, the framework creates reduce jobs for each partition.
    • Reduce iterates through the sorted list, aggregates counts for identical words, and writes the result to disk.
