DIEPREYECD/assignment2-mapreduce-DIEPREYECD

- - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Name : Diepreye Charles-Daniel

SID : 1766907

CCID : diepreye

- - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Overview

This project implements a multithreaded MapReduce runtime in C using POSIX threads. It includes a thread pool for concurrency, sorted intermediate key-value structures, and Shortest Job First (SJF) scheduling for both Map and Reduce phases. The example application is a distributed word count program (distwc.c).
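
In this design, SJF amounts to ordering a batch of pending jobs by a size estimate before handing them to the pool. A minimal sketch, assuming a hypothetical `Job` record with a `size_hint` field (the repository's actual names may differ):

```c
#include <stdlib.h>

/* Hypothetical job record for illustration; size_hint would be the input
 * file size for Map jobs and the partition's byte count for Reduce jobs. */
typedef struct {
    void (*func)(void *);
    void *arg;
    size_t size_hint;
} Job;

/* qsort comparator: smallest estimated job first (SJF). */
static int job_cmp(const void *a, const void *b) {
    size_t sa = ((const Job *)a)->size_hint;
    size_t sb = ((const Job *)b)->size_hint;
    return (sa > sb) - (sa < sb);
}

/* Sort a batch of jobs before submitting them to the thread pool. */
static void order_sjf(Job *jobs, size_t n) {
    qsort(jobs, n, sizeof *jobs, job_cmp);
}
```

Sorting once at submission time keeps the job queue itself simple: workers just pop from the front.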


Synchronization Primitives Used

Thread Pool

  • pthread_mutex_t mutex: guards the shared job queue, idle counters, and shutdown flag.
  • pthread_cond_t cond: signals worker threads when new jobs are available.
  • pthread_cond_t cond_all_idle: used to synchronize between map and reduce phases.

MapReduce Library

  • pthread_mutex_t (per partition): protects each partition’s sorted key-value list during concurrent MR_Emit calls.

Data Structures

Each partition maintains sorted key-value pairs:

typedef struct ValueNode {
    char *value;
    struct ValueNode *next;
} ValueNode;

typedef struct KVNode {
    char *key;
    ValueNode *values_head, *values_tail;
    struct KVNode *next;
} KVNode;

typedef struct Partition {
    pthread_mutex_t mutex;
    KVNode *kv_head;
    size_t bytes_used;
} Partition;

Keys are inserted in ascending order (strcmp) during MR_Emit. Each partition’s size is tracked for SJF scheduling.


Testing & Verification

  • Inputs: all testcase/sample*.txt (20 files).

  • Partitions: 10 (fixed).

  • Threads: varied as shown below.

  • Correctness checks:

    • No duplicate keys within or across partitions.
    • Each key’s count equals the expected total (e.g., 5000 in provided cases).
  • Tools: valgrind --tool=memcheck (no leaks) and valgrind --tool=helgrind (no races).

  • Command: make time-test (uses /usr/bin/time).

Runtime scaling (20 inputs, 10 partitions; elapsed wall time)

Threads   Elapsed
1         0:00.14
2         0:00.15
4         0:00.13
8         0:00.13
16        0:00.15

Discussion (why speedup is limited here):

  • Workload size is tiny, so thread overhead and I/O dominate; adding threads doesn’t substantially reduce wall time.
  • Amdahl’s Law: serial sections remain (job submission, Map→Reduce barrier, one reducer per partition).
  • Contention: brief but frequent locking on the job queue and hot partitions during MR_Emit.
  • Load balance & tail effects: some partitions finish later; total time is gated by the slowest partition’s reducer.
  • Cache & scheduling overheads grow with more threads, offsetting any potential gains on such a small dataset.

On larger inputs you should see clearer improvements from 1→4→8 threads until the factors above dominate.
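
To make the Amdahl's Law point concrete: if a fraction p of the work parallelizes, the speedup with n threads is bounded by

```latex
S(n) = \frac{1}{(1 - p) + p/n}
```

Even with p = 0.5 the limit as n grows is only 2x; on a workload this small, where I/O and job submission dominate, p is low enough that the near-flat timings above are the expected outcome.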


Build & Run

make           # build
make run       # run on testcase/*.txt
make memcheck  # check for leaks
make helgrind  # check for data races

Result files (result-*.txt) are deleted before each run to prevent duplicates.


Sources

  • Course handout/spec (CMPUT 379 A2).
  • man pthreads, man pthread_mutex, man pthread_cond.
  • Valgrind documentation (memcheck, helgrind).

No external code was copied; all implementation is original.
