This example demonstrates Breadth-First Search (BFS) on a GPU using HIP, based on the algorithm presented in the HiPC'07 paper "Accelerating Large Graph Algorithms on the GPU using CUDA" by Pawan Harish and P.J. Narayanan.
BFS is a fundamental graph traversal algorithm that explores nodes level by level, starting from a source node. This parallel implementation efficiently processes large graphs by assigning one thread per node, allowing thousands of nodes to be processed simultaneously.
For more information on HIP programming, please refer to the HIP documentation.
- A graph is loaded from a text file containing node and edge information.
- Host memory is allocated for node structures, edge lists, and traversal state.
- The source node (node 0) is marked as the starting point.
- Device memory is allocated for all graph data structures.
- Graph data is copied from host to device memory.
- The BFS kernel is launched iteratively until no more nodes can be visited:
  - Kernel 1 (BFS_Kernel): Processes all nodes in the current frontier:
    - Each thread handles one node
    - If a node is in the frontier, it explores all its neighbors
    - Unvisited neighbors are marked for the next iteration
    - Distance/cost is updated for newly discovered nodes
  - Kernel 2 (BFS_Update): Updates the frontier for the next iteration:
    - Nodes discovered in this iteration become the new frontier
    - Visited flags are updated to prevent revisiting
    - A stop flag indicates whether any new nodes were discovered
- The process repeats until no new nodes are found (convergence).
- Results are copied back to host memory.
- The final costs (distances from source) are written to a file.
- All device memory is freed.
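The two-phase loop above can be sketched in plain C++ as a serial stand-in for the two kernels. The flag names (graph_mask, updating_graph_mask, graph_visited, over) mirror the description; the CSR arrays (node_start, node_degree, edges) are illustrative, not the application's exact layout:

```cpp
#include <cassert>
#include <vector>

struct Graph {
    std::vector<int> node_start;   // first edge index per node
    std::vector<int> node_degree;  // number of edges per node
    std::vector<int> edges;        // flat array of destination node IDs
};

std::vector<int> bfs(const Graph& g, int source) {
    int n = (int)g.node_start.size();
    std::vector<char> graph_mask(n, 0), updating_graph_mask(n, 0), graph_visited(n, 0);
    std::vector<int> cost(n, -1);
    graph_mask[source] = 1;   // source is the initial frontier
    graph_visited[source] = 1;
    cost[source] = 0;

    bool over = true;
    while (over) {
        over = false;
        // Phase 1 (BFS_Kernel): each "thread" = one node; frontier nodes
        // explore their neighbors and mark unvisited ones for the next level.
        for (int tid = 0; tid < n; ++tid) {
            if (!graph_mask[tid]) continue;
            graph_mask[tid] = 0;
            for (int i = g.node_start[tid]; i < g.node_start[tid] + g.node_degree[tid]; ++i) {
                int nb = g.edges[i];
                if (!graph_visited[nb]) {
                    cost[nb] = cost[tid] + 1;     // one hop further from source
                    updating_graph_mask[nb] = 1;  // candidate for next frontier
                }
            }
        }
        // Phase 2 (BFS_Update): promote discoveries to the new frontier and
        // raise the flag if anything was found, so the loop runs again.
        for (int tid = 0; tid < n; ++tid) {
            if (updating_graph_mask[tid]) {
                graph_mask[tid] = 1;
                graph_visited[tid] = 1;
                updating_graph_mask[tid] = 0;
                over = true;
            }
        }
    }
    return cost;  // hop distance from source, -1 if unreachable
}
```

On the GPU each of the two `for (tid ...)` loops becomes one kernel launch, with the loop body executed by one thread per node.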
The input graph file format is:
```
<number_of_nodes>
<starting_edge_index> <number_of_edges>   // For each node
...
<source_node>
<total_number_of_edges>
<destination_node> <edge_weight>          // For each edge
...
```
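A minimal parser for this layout could look as follows. This is a sketch: the struct and field names are illustrative, and edge weights are read but discarded, since BFS counts hops rather than weighted distances:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <vector>

struct GraphFile {
    std::vector<int> start;   // starting edge index per node
    std::vector<int> degree;  // number of edges per node
    std::vector<int> dest;    // destination node per edge
    int source = 0;
};

GraphFile parse_graph(std::istream& in) {
    GraphFile g;
    int num_nodes;
    in >> num_nodes;
    g.start.resize(num_nodes);
    g.degree.resize(num_nodes);
    for (int i = 0; i < num_nodes; ++i)
        in >> g.start[i] >> g.degree[i];        // one line per node
    in >> g.source;                             // BFS source node
    int num_edges;
    in >> num_edges;
    g.dest.resize(num_edges);
    for (int i = 0; i < num_edges; ++i) {
        int weight;
        in >> g.dest[i] >> weight;              // weight is ignored by BFS
    }
    return g;
}
```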
BFS explores a graph level by level:
- Start with a source node at distance 0
- Visit all neighbors of the source (distance 1)
- Visit all unvisited neighbors of distance-1 nodes (distance 2)
- Continue until all reachable nodes are visited
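The level-by-level behavior is easiest to see in the textbook sequential version, shown here over an illustrative adjacency-list representation:

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Sequential BFS: the queue processes all distance-d nodes before any
// distance-(d+1) node, matching the levels described above.
std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj, int src) {
    std::vector<int> dist(adj.size(), -1);  // -1 marks "not yet visited"
    std::queue<int> q;
    dist[src] = 0;
    q.push(src);
    while (!q.empty()) {
        int u = q.front();
        q.pop();
        for (int v : adj[u]) {
            if (dist[v] == -1) {        // unvisited neighbor
                dist[v] = dist[u] + 1;  // one level deeper
                q.push(v);
            }
        }
    }
    return dist;
}
```

The parallel version replaces the queue with per-node frontier flags so that all nodes of one level can be expanded simultaneously.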
The parallel implementation uses a frontier-based approach:
- Frontier: The set of nodes to be processed in the current iteration
- Two-kernel design:
- Process current frontier and discover new nodes
- Update frontier for next iteration
- One thread per node: Each thread is responsible for processing one node
- Iterative execution: Kernels repeat until the frontier is empty
- Nodes: Store the starting index in the edge list and number of edges
- Edges: Flat array of destination node IDs
- Masks and Flags:
  - graph_mask: Current frontier (nodes to process this iteration)
  - updating_graph_mask: Nodes discovered this iteration (next frontier)
  - graph_visited: Tracks all visited nodes (prevents revisiting)
  - over: Flag indicating whether any new nodes were discovered
The graph uses a Compressed Sparse Row (CSR) format:
```
Nodes: [Node 0] [Node 1] [Node 2] ...
           ↓
Edges: [2,3,5]  [1,4]    [0,6,7] ...
```
Each node stores where its edges start in the edge array and how many edges it has.
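Using the numbers from the diagram, the CSR arrays and a neighbor lookup can be sketched as follows (array names are illustrative):

```cpp
#include <cassert>
#include <vector>

// CSR layout from the diagram: node 0 -> {2,3,5}, node 1 -> {1,4},
// node 2 -> {0,6,7}, all stored back-to-back in one flat edge array.
std::vector<int> edge_start = {0, 3, 5};               // offset into `edges`
std::vector<int> edge_count = {3, 2, 3};               // edges per node
std::vector<int> edges      = {2, 3, 5, 1, 4, 0, 6, 7};

// Neighbors of node u occupy edges[edge_start[u] .. edge_start[u]+edge_count[u]-1].
std::vector<int> neighbors(int u) {
    return std::vector<int>(edges.begin() + edge_start[u],
                            edges.begin() + edge_start[u] + edge_count[u]);
}
```

Because each node's edges are contiguous, neighboring threads reading consecutive edge ranges can produce coalesced memory accesses.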
- hipMalloc: Allocates device memory
- hipMemcpy: Transfers data between host and device
- hipFree: Frees device memory
- hipGetLastError: Retrieves the last error from a runtime call
- hipDeviceSynchronize: Blocks until all device operations complete
- __global__: Declares a kernel function callable from the host
- blockIdx, blockDim, threadIdx: Built-in variables for grid/block indexing
- Iterative kernel execution until convergence
- Frontier-based traversal: Efficiently processes only active nodes
- Level synchronization: All nodes at distance d are processed before distance d+1
- Distance computation: Computes shortest path distances (in number of hops)
- Work distribution: One thread per node provides good load balancing
- Memory coalescing: Threads access memory in patterns that can be coalesced
- Irregular parallelism: Graph structure affects parallel efficiency
- Warp-level optimization: Could use warp-level primitives to reduce divergence
- Load balancing: Large-degree nodes can cause load imbalance
- Memory layout: Could optimize edge list layout for better cache utilization
- Early termination: Could terminate when a specific target node is reached
```
make
```
This builds both:
- hip_bfs: The BFS application
- graphgen: Graph generator utility

Alternatively, with CMake:
```
mkdir build && cd build
cmake ..
make
```
Generate test graphs with the graphgen utility:
```
./graphgen 1024 1k    # Creates graph1k.txt with 1024 nodes
./graphgen 4096 4k    # Creates graph4k.txt with 4096 nodes
./graphgen 65536 64k  # Creates graph64k.txt with 65536 nodes
```
Run BFS on a generated graph:
```
./hip_bfs graph4096.txt
```
result.txt: Contains the distance from source to each node
```
Reading graph file: graph4096.txt
Number of nodes: 4096
Number of edges: 49152
Graph loaded successfully
Allocating GPU memory...
Copying data to GPU...
Starting BFS traversal...
Grid: 8 blocks, 512 threads per block
BFS completed after 12 iterations
Writing results to result.txt...
Results saved to result.txt
BFS Statistics:
Source node: 0
Reachable nodes: 4096 / 4096
Maximum distance: 11
Execution completed successfully.
```
- Sequential: O(V + E) where V = vertices, E = edges
- Parallel: O(D × (V/P + E/P)) where D = graph diameter, P = processors
- Best case: O(D) when V, E >> P (high parallelism)
- O(V + E) for graph storage
- O(V) for traversal state (masks, costs, visited flags)
The algorithm terminates when no new nodes are discovered, which occurs after D iterations where D is the graph diameter (longest shortest path).
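This termination behavior can be checked on a small example: counting the level iterations until no new nodes appear yields the eccentricity of the source (which equals the diameter when the source is an endpoint of a longest shortest path). A sketch with illustrative names:

```cpp
#include <cassert>
#include <vector>

// Counts how many frontier expansions discover at least one new node.
int bfs_iterations(const std::vector<std::vector<int>>& adj, int src) {
    std::vector<char> visited(adj.size(), 0);
    std::vector<int> frontier = {src};
    visited[src] = 1;
    int iters = 0;
    while (!frontier.empty()) {
        std::vector<int> next;
        for (int u : frontier)
            for (int v : adj[u])
                if (!visited[v]) {
                    visited[v] = 1;
                    next.push_back(v);
                }
        frontier.swap(next);
        if (!frontier.empty()) ++iters;  // this level discovered new nodes
    }
    return iters;
}
```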
- blockIdx
- blockDim
- threadIdx
- hipDeviceSynchronize
- hipFree
- hipGetLastError
- hipMalloc
- hipMemcpy
- hipMemcpyHostToDevice
- hipMemcpyDeviceToHost
- Harish, P., & Narayanan, P. J. (2007). "Accelerating Large Graph Algorithms on the GPU using CUDA." In International Conference on High Performance Computing (HiPC).
The included graphgen utility creates random graphs for testing:
- Generates undirected graphs
- Random number of edges per node (2-4)
- Edge weights between 1 and 10
- Output format compatible with the BFS application
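A generator with these properties can be sketched as follows. This is illustrative only: the actual graphgen source may differ in details such as seeding and handling of duplicate edges or self-loops, which this sketch does not filter out:

```cpp
#include <cassert>
#include <random>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Emits a random undirected graph in the input format shown earlier:
// node count, per-node (offset, degree) pairs, source node, edge count,
// then (destination, weight) pairs.
std::string generate_graph(int num_nodes, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> deg(2, 4), node(0, num_nodes - 1), w(1, 10);
    std::vector<std::vector<std::pair<int, int>>> adj(num_nodes);
    for (int u = 0; u < num_nodes; ++u) {
        int d = deg(rng);                 // 2-4 edges per node
        for (int k = 0; k < d; ++k) {
            int v = node(rng), wt = w(rng);
            adj[u].push_back({v, wt});
            adj[v].push_back({u, wt});    // undirected: mirror the edge
        }
    }
    std::ostringstream out;
    out << num_nodes << "\n";
    int offset = 0, total = 0;
    for (auto& a : adj) {                 // CSR-style offset + degree per node
        out << offset << " " << a.size() << "\n";
        offset += (int)a.size();
        total += (int)a.size();
    }
    out << 0 << "\n" << total << "\n";    // source node 0, then edge count
    for (auto& a : adj)
        for (auto& e : a)
            out << e.first << " " << e.second << "\n";
    return out.str();
}
```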