## GP-GPU and High Performances Computing

Lecture 12 – Graph

December 18, 2023

- ► Sparse data.
- ► Histograms.
- ► Shared and private memory.
- ► Atomic operations.

# An other sorting

➤ A bitonic sequence is a sequence of numbers a<sub>0</sub>, a<sub>1</sub>, ..., a<sub>n-1</sub> which monotonically increases in value, reaches a single maximum, and then monotonically decreases in value.

$$a_0 < a_1 < \ldots < a_{i-1} < a_i > a_{i+1} > \ldots > a_{n-2} > a_{n-1}$$

for some value of i. A sequence is also considered to be bitonic if the relation above can be achieved by shifting the numbers cyclically.

- ► Every 2 element sequence is bitonic
- $\blacktriangleright$  4 element:  $(a_0,a_1,a_2,a_3)$ 
  - ▶ Sort  $(a_0, a_1)$  such that  $a_0 \le a_1$
  - ▶ Sort  $(a_2, a_3)$  such that  $a_2 \ge a_3$
- ▶ 8 element:  $(a_0, a_1, a_2, a_3, a_4, a_5, a_6, a_7)$ 
  - Make  $(a_0, a_1, a_2, a_3)$  and  $(a_4, a_5, a_6, a_7)$  bitonic
  - Use bitonic split to make  $(a_0, a_1, a_2, a_3)$  increasing
  - Use bitonic split to make  $(a_4, a_5, a_6, a_7)$  decreasing

#### Example 3: Sequential Idea

- $\blacktriangleright$  For list length  $l=2,4,8,\ldots 2^m=n$
- Use bitonic sort to make successive groups of *l* elements alternatively increasing and decreasing





Graph traversal computation

- > Study graph search as a prototypical graph-based algorithm
- Learn techniques to mitigate the memory-bandwidth-centric nature of graph-based algorithms
- Introduce work queues and see how they fit into a massively parallel programming framework

- ► Social media connection graphs
- Driving directions
- ► Telecommunication networks
- Manufacturing process dependencies
- ► Computation graph
- ► 3D Meshes
- ► Graphical models

Massive graphs tend to be sparse!



A B C D E F G H I



Adjacency matrix might be store using CSR format.

Given a source node S, find the number of steps required to reach each node N in the graph.

Given this labelling of the graph, one can easily find a shortest path from S to a destination T.

## Example graph



## Example graph



#### Sequential code

```
1
     void BFS sequential(int source, const int * row ptr, const int * dest, int * dist) {
 ^{2}
       int queue[2][MAX QUEUE SIZE];
 3
       int * currentOueue= & gueue[0]:
 4
       int * previousOueue = &gueue[1]:
       int currentQueueSize= 0, previousQueueSize = 0;
 5
       insertIntoQueue(source, previousQueue, &previousQueueSize);
 6
       dist[source] = 0:
 7
       while (previousOueueSize > 0) {
 8
         // visit all vertices on the previous Queue
9
         for (int f = 0; f < previousQueueSize; f++) {</pre>
10
           const int currentVertex = previousOueue[f]:
11
           // check all outgoing edges
12
           for (int i = row_ptr[currentVertex]; i < row_ptr[currentVertex+1]; ++i) {</pre>
13
             if (dist[dest[i]] == -1) {
14
               // this vertex has not been visited vet
15
               insertIntoOueue(dest[i], currentOueue, &currentOueueSize);
16
               dist[dest[i]] = dist[currentVertex] + 1:
17
             }
18
           }
19
20
         swap(currentQueue, previousQueue);
21
         previousQueueSize = currentQueueSize;
22
23
         currentQueueSize = 0;
24
       }
25
```

- ► Assign one thread per vertex
- For each iteration, check all incoming edges to see if the source vertex was just visited in the last iteration; if so, mark as visited in this iteration
- Not very work efficient; O(VL) for V = number of vertices, L = length of longest path
- ► Difficult to detect stopping criterion

- Parallelize each individual iteration of the while loop in the sequential BFS code
- > Assign a section of the vertices in the previous Queue to each thread
- > Introduce a synchronization point at the end of each iteration

#### **BFS** host

```
void BFS_host(int source, const int * row_ptr, const int * dest. int * dist) {
 1
 2
         int dQueue[2][MAX_Queue_SIZE];
         int * dCurrentQueueSize;
 3
         int * dPreviousQueueSize; int hPreviousQueueSize;
 4
         int * dVisited:
 5
         int * dCurrentOueue = &Oueue[0]:
 6
         int * dPreviousQueue = &Queue[1];
 \overline{7}
         // allocate device memory, copy memory from device to host, initialize Queue sizes, etc.
8
9
         hPreviousOueueSize = 1:
10
         while (hPreviousOueueSize > 0) {
11
           int numBlocks = (hPreviousQueueSize-1) / BLOCK SIZE + 1;
12
           BFS Bqueue kernel<<<numBlocks, BLOCK SIZE>>>(dPreviousQueue, dPreviousQueueSize, dCurre
13
           drow ptr. dDestinations. dDistances. dVisited):
14
           swap(dCurrentOueue.dPreviousOueue):
15
16
           cudaMemcpy(dPreviousQueueSize, dCurrentQueueSize, sizeof(int), cudaMemcpyDeviceToDevice
17
           cudaMemset(dCurrentOueueSize. 0. sizeof(int)):
18
           cudaMemcpy(&hPreviousOueueSize, dPreviousOueueSize, sizeof(int), cudaMemcpyDeviceToHost
19
20
21
```

#### **Output Interference**

- > A flag marks whether or not a vertex has been visited.
- From a correctness perspective, output interference on flags can be ignored, but it will lead to additional/replicated work.
- ▶ We will be using atomicExch.





```
__global__ void BFS_Bqueue_kernel(const int * previousQueue, const int * pre
1
           int * currentQueue, int * currentQueueSize, const int * row ptr,
2
           const int * destinations, int * distances, int * visited) {
3
           const int t = threadIdx.x + blockDim.x * blockIdx.x;
4
           if (t < *previousQueueSize) {</pre>
\mathbf{5}
             const int vertex = previousOueue[t]:
6
             for (int i = row ptr[vertex]; i < row ptr[vertex+1]; ++i) {</pre>
7
               // check visitation atomically, avoiding redundant expansion
8
               const int alreadyVisited = atomicExch(&(visited[destinations[i]]),1);
9
               if (!alreadvVisited) {
10
                 // we're visiting a new vertex: get a spot in line atomically
11
                 const int gueueIndex = atomicAdd(currentQueueSize, 1);
12
                 // place the vertex in line
13
                 currentQueue[queueIndex] = destinations[i];
14
               }
15
             }
16
17
           __syncthreads();
18
19
```

Inference occurred when writing when placing vertices in the queue (line 14 of kernel code).

To obtain a correct output, threads need to synchronize with an atomic operation.

Main bottleneck of the basic kernel.



## Privatization of the Queue

- ► We can make a private, block-level copy of the queue.
- Once complete, the private queues are combined to form the global queue.



#### Kernel

```
global void BFS Bqueue kernel(const int * previousQueue, const int * previousQueueSize,
 1
 2
           int * currentQueue, int * currentQueueSize, const int * row ptr, const int * destinations,
 3
           int * distances. int * visited)
 4
 \mathbf{5}
            shared int sharedCurrentQueue[BLOCK QUEUE SIZE];
 6
            shared int sharedCurrentQueueSize, blockGlobalQueueIndex;
 7
 8
            if (threadIdx.x == 0) sharedCurrentOueueSize = 0:
 9
            syncthreads();
10
            const int t = threadIdx.x + blockDim.x * blockIdx.x:
            if (t < *previousOueueSize) {</pre>
12
              const int vertex = previousQueue[t];
13
14
              for (int i = row ptr[vertex]; i < row ptr[vertex+1]; ++i) {</pre>
                const int alreadyVisited = atomicExch(&(visited[destinations[i]]),1);
15
                if (!alreadvVisited) {
16
17
                  distances[destinations[i]] = distances[i] + 1:
18
                  const int sharedQueueIndex = atomicAdd(&sharedCurrentQueueSize,1);
                  if (sharedQueueIndex < BLOCK QUEUE SIZE) { // there is space in the</pre>
19
                                                                                          local queue
                     sharedCurrentOueue[sharedOueueIndex] = destinations[i]:
20
21
                  } else { // go directly to the global queue
^{22}
                     sharedCurrentQueueSize = BLOCK QUEUE SIZE;
                     const int globalQueueIndex = atomicAdd(currentQueueSize, 1);
                     currentQueue[globalQueueIndex] = destinations[i];
^{24}
25
26
27
^{28}
            __syncthreads();
29
30
            if (threadIdx.x == 0) blockGlobalQueueIndex = atomicAdd(currentQueueSize, sharedCurrentQueueSize);
32
            syncthreads();
33
            for (int i = threadIdx.x: i < sharedCurrentOueueSize: i += blockDim.x) {</pre>
34
35
              currentQueue[blockGlobalQueueIndex + i] = sharedCurrentQueue[i];
36
```

21

- Irregular global memory access: Access patterns depend on graph structure and is unpredictable
- Kernel launch overhead: There is little parallel work in iterations with narrow Queues
- Block-level queue length counter contention: Better than before, but there will still be many serialized atomic operations
- Load imbalance: Vertices can have vastly different numbers of outgoing edges



- ► Texture memory is another form of global memory
- ► Like constant memory, it is aggressively cached for read-only access
- Originally developed and optimized for storing and reading textures for graphics applications
  - ► Has hardware-level support for 1-,2-, or 3-D layouts and interpolated reads
  - > The texture cache is also spatial layout-aware
- > Can be useful for irregular access patterns with un-coalesced reads

Declaration:

```
texture<int, 1, cudaReadModeElementType> row_ptrTexture;
Host side:
    int * hrow_ptr;
    int row_ptrLength;
    cudaArray * texArray = 0;
    cudaChanneFormatDesc channelDesc = cudaCreateChannelDesc<int>();
    cudaMemcpvToArray(&texArray, &channelDesc, row_ptrLength);
    cudaMemcpvToArray(&texArray, 0, 0, hrow ptr,
```

```
row_ptrLength*sizeof(int), cudaMemcpyHostToDevice);
cudaBindTextureToArray(row_ptrTexture, texArray);
```

Device side:

```
for (int i = tex1D(row_ptrTexture,vertex); i < tex1D(row_ptrTexture,vertex+1); ++i)</pre>
```

For some iterations of BFS (especially near the beginning), the Queue can be quite small

The benefits of parallelism only outweigh the kernel launch overhead when the Queue becomes large enough

Some options:

Use the CPU if the queue size is below some threshold

Create a single-block variant of the BFS kernel that iterates until its block-level queue is full before returning to the host

```
// is the most up-to-date Queue information on host or device?
bool currentDataOnDevice = false:
while (hPreviousOueueSize > 0) {
    int numBlocks = (hPreviousQueueSize-1) / BLOCK SIZE + 1;
    if (numBlocks < NUM BLOCKS THRESHOLD) {</pre>
      if (currentDataOnDevice) {
      // copv data to host
      . . .
      BFS iterate sequential(hPreviousOueue. hPreviousOueueSize.
          hCurrentQueue. hCurrentQueueSize, row_ptr, destinations, distances);
      currentDataOnDevice = false:
    } else {
      if (!currentDataOnDevice) {
      // copy data to device
      . . .
    BFS Bqueue kernel<<<numBlocks, BLOCK SIZE>>>(dPreviousQueue, dPreviousQueueSize,
                        dCurrentOueue. dCurrentOueueSize.
                        drow ptr. dDestinations. dDistances. dVisited):
    currentDataOnDevice = true;
  }
```

- While the block-level queues reduced contention for global memory, the block-level counter is now the bottleneck
- > We can extend the hierarchy by further splitting the block-level queue

### Three-Level Queue Hierarchy

#### Global queue







#### Each thread may assign a element based on its index.

subQueueIndex = threadIdx.x / (blockDim.x / NUM\_SUB\_QUEUES);



Threads in the same warp will be using the same queue!

#### subQueueIndex = threadIdx.x & (NUM\_SUB\_QUEUES-1);



Load imbalance is caused by a data dependency and is thus tricky to avoid Two potential strategies:

- 1. Delay the assignment of work to threads until after the total amount of work to be done is known
- 2. Spawn new threads when needed to account for additional work

- ► Graphs can be processed in parallel!
- Texture memory can help with large, read-only memory w/ irregular access
- ► Work queues can be used to track tasks of varying size
- Privatization (and multi-level privatization hierarchies) can be used to reduce contention for work queue insertion