

Dynamic Thread Block Launch: A Lightweight Execution Mechanism to Support Irregular Applications on GPUs

Jin Wang*, Norm Rubin†, Sudhakar Yalamanchili*, Albert Sidelnik†
*Georgia Institute of Technology, †NVIDIA Research
{jin.wang,sudha}@gatech.edu, {nrubin,asidelnik}@nvidia.com

Abstract

GPUs have been proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data intensive irregular applications such as graph analytics, relational databases, and machine learning. Recently introduced nested device-side kernel launching functionality in the GPU is a step in the right direction, but still falls short of being able to effectively harness the GPU's performance potential.

We propose a new mechanism called Dynamic Thread Block Launch (DTBL) to extend the bulk synchronous parallel model underlying the current GPU execution model by supporting dynamic spawning of lightweight thread blocks. This mechanism supports the nested launching of thread blocks rather than kernels to execute dynamically occurring parallel work elements. This paper describes the execution model of DTBL, the device-runtime support, and the microarchitecture extensions to track and execute dynamically spawned thread blocks. Experiments with a set of irregular data intensive CUDA applications executing on a cycle-level simulator show that DTBL achieves an average 1.21x speedup over the original flat implementation and an average 1.40x over the implementation with device-side kernel launches using CUDA Dynamic Parallelism.

1. Introduction

There has been considerable success in harnessing the superior compute and memory bandwidth of GPUs to accelerate traditional scientific and engineering computations [3][24][30][31]. These computations are dominated by structured control and data flows across large data sets that can be effectively mapped to the 1D-3D massively parallel thread block structures underlying modern bulk synchronous programming languages for GPUs such as CUDA [17] and OpenCL [27]. However, emerging data intensive applications in analytics, planning, retail forecasting, and similar domains are dominated by sophisticated algorithms from machine learning, graph analysis, and deductive reasoning characterized by irregular control, data, and memory access flows. Thus, it is challenging to effectively harness data parallel accelerators such as GPUs for these applications. This paper presents, evaluates, and demonstrates an effective solution to this challenge.

We first observe that within many irregular applications, segments of the computation locally exhibit structured control and memory access behaviors. These pockets of parallelism occur in a data dependent, nested, time-varying manner. The CUDA Dynamic Parallelism (CDP) [28] model extends the base CUDA programming model with device-side nested kernel launch capabilities to enable programmers to exploit this dynamic evolution of parallelism in applications. OpenCL provides similar capabilities with the OpenCL device-side kernel enqueue. However, recent studies have shown that while these extensions do address the productivity and algorithmic issues, the ability to harness modern high performance hardware accelerators such as GPUs is still difficult in most cases [36][39]. This is primarily due to the cost of a device-side kernel launch, which is in turn a function of the semantics of kernels.

We also observe that much of the dynamic parallelism that occurs in these applications can be effectively harnessed with a lighter weight mechanism to spawn parallel work. Accordingly, we introduce the Dynamic Thread Block Launch (DTBL) mechanism to launch lightweight thread blocks dynamically and on demand from GPU threads. These thread blocks semantically correspond to a cooperative thread array (CTA) in CUDA or a work group in OpenCL. The overhead of launching a thread block is considerably smaller than that of launching a kernel. The finer granularity of a thread block provides effective control for exploiting smaller-scale, dynamically occurring pockets of parallelism during the computation. We extend the GPU microarchitecture to integrate the execution of these dynamically spawned parallel work units into the base GPU microarchitecture. The results show improved execution performance for irregular applications with relatively modest hardware investments and minimal extensions to the base execution model in CUDA or OpenCL. In fact, DTBL can be integrated within existing languages, retaining the device-side nested kernel launch capability and reusing much of the existing device API stack without any impact on applications that do not use DTBL functionality. Consequently, extensions to the device-side runtime are minimal.

Accordingly, this paper makes the following contributions:
1. We introduce DTBL as an effective lightweight execution mechanism for spawning dynamically created parallel work. DTBL can be used to effectively harness the compute and memory bandwidth of GPUs for data intensive irregular applications.
2. We implement a device-side runtime API for supporting DTBL with a single call by leveraging the API for CUDA Dynamic Parallelism.

3. The base GPU microarchitecture is extended at a modest cost to support the dynamic generation, tracking, and efficient management of the spawned thread blocks.
4. An evaluation of several CUDA applications from multiple performance perspectives is presented, executing on a cycle-level simulator. DTBL achieves an average 1.21x speedup over the original implementation with the flat GPU programming methodology and an average 1.40x over the implementation with device-side kernel launches using CDP.

The remainder of the paper introduces DTBL, the microarchitecture extensions to support it, a detailed description of its execution behavior, and an evaluation across a range of representative benchmarks.

2. Baseline GPU Architecture and Computation Execution

This section provides a brief background on modern general purpose GPUs and the details of their execution model.

Figure 1: Baseline GPU Architecture

2.1. Baseline GPU Architecture and Execution Model

We model the baseline GPU architecture on the NVIDIA GK110 Kepler architecture, so we adopt NVIDIA and CUDA terminology. However, the basic ideas and resulting analysis are applicable to other GPU architectures. In Figure 1, a GPU is connected to the host CPU by the interconnection bus and accepts operation commands such as memory copies and kernel launches from the CPU. In the CUDA programming model, programmers express a GPU program as a kernel function that specifies a 1D-3D array of thread blocks, called cooperative thread arrays (CTAs), with all threads executing the same code. Thread blocks are scheduled and executed on the GPU computation units called Streaming Multiprocessors (SMXs), each of which features multiple cores, registers, and a scratch memory space used as L1 cache or shared memory. The GPU connects to the L2 cache and then to the DRAM through the memory controller and uses it as global memory.

2.2. Host-side Kernel Launching and Scheduling

The CPU launches GPU kernels by dispatching kernel launching commands. Kernel parameters are passed from the CPU to the GPU at kernel launch time and stored in GPU global memory with the necessary alignment. The parameter addresses are part of the launching command, along with other kernel information such as the dimension configuration and the entry PC address. All launching commands are passed to the GPU through software stream queues (e.g. CUDA streams). Kernels from different streams are independent of each other and may be executed concurrently, while kernels from the same stream are executed in the order in which they are launched. The streams are mapped to Hardware Work Queues (HWQs) in the GPU that create hardware-managed connections between the CPU and the GPU. The current generation of NVIDIA GPUs introduces Hyper-Q [26], a technique which constructs multiple HWQs and maps individual streams to each HWQ to realize concurrency. However, if the number of software streams exceeds the number of HWQs, some of them will be combined and serialized. In our baseline GPU architecture, the Kernel Management Unit (KMU) manages the multiple HWQs by inspecting and dispatching the kernels at the head of each queue to the Kernel Distributor. Once the head kernel is dispatched, the corresponding HWQ stops being inspected by the KMU until the head kernel completes. The KMU also manages all the kernels dynamically launched or suspended by an SMX (e.g. through use of the CUDA Dynamic Parallelism feature) as discussed in Section 2.4.

The Kernel Distributor holds all the active kernels ready for execution. The number of entries in the Kernel Distributor is the same as the number of HWQs (32 in the GK110 architecture), as this is the maximum number of independent kernels that can be dispatched by the KMU. Each entry manages a set of registers that record the kernel status, including the kernel entry PC, grid and thread block dimension information, parameter addresses, and the number of thread blocks left to complete. The SMX scheduler takes one entry from the Kernel Distributor in first-come-first-serve (FCFS) order and sets up the SMX control registers according to the kernel status. It then distributes the thread blocks of the kernel to each SMX, limited by the maximum number of resident thread blocks, threads, registers, and shared memory space per SMX.

During execution, a thread block is partitioned into groups of 32 threads called warps, the basic thread groups executed on a 32-lane SIMD unit. SMXs maintain the warp contexts for the lifetime of the thread block. The warp scheduler selects a warp with no unresolved dependency from all the resident warps on the SMX according to a scheduling policy (e.g. round-robin) and then issues its next instruction to the cores. By interleaving warps, an SMX is able to achieve hardware multithreading and hide memory latency. All the threads in a warp execute the same instruction in lock-step fashion.

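As a concrete, deliberately simplified model of the serialization behavior described in Section 2.2, the sketch below maps software streams onto a fixed pool of HWQs. The modulo mapping and the function names are illustrative assumptions, not the hardware's documented policy; the point is only that streams colliding on one queue lose their concurrency.

```cpp
// Hypothetical sketch (not the hardware's actual algorithm): software
// streams are mapped onto a fixed pool of Hardware Work Queues (HWQs).
// When streams outnumber HWQs, several streams share one queue and their
// kernels are serialized within it.
constexpr int kNumHWQ = 32;  // 32 HWQs on GK110 with Hyper-Q (from the text)

// Map a stream ID to an HWQ slot; streams that collide share a queue.
int hwqForStream(int streamId) { return streamId % kNumHWQ; }

// Kernels from streams mapped to distinct HWQs may run concurrently;
// kernels whose streams share an HWQ are serialized against each other.
bool maySerializeWith(int streamA, int streamB) {
    return streamA != streamB && hwqForStream(streamA) == hwqForStream(streamB);
}
```

With 33 or more streams, at least two of them necessarily share a queue, which is exactly the oversubscription case the section describes.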
When threads in a warp take different paths at a branch, the execution of the threads on the different paths is serialized. This is referred to as control flow divergence and results in low SIMD lane utilization. Our baseline architecture uses a PDOM reconvergence stack [13] to track and reconverge the threads that take different branches. Memory accesses generated by the 32 threads in a warp for consecutive word addresses are coalesced into one memory transaction. Otherwise, memory instructions are replayed multiple times, which may increase the memory access latency for the entire warp. This pattern of irregular memory accesses is referred to as memory divergence [22]. The SMX scheduler keeps updating its control registers and the registers in each Kernel Distributor entry to reflect the number of thread blocks that remain to be scheduled as well as those still being executed. When all the thread blocks of a kernel finish, the Kernel Distributor releases the corresponding kernel entry to accept the next kernel from the KMU.

2.3. Concurrent Kernel Execution

Concurrent kernel execution is realized by distributing thread blocks from different kernels across one or more SMXs. If one kernel does not occupy all SMXs, the SMX scheduler takes the next kernel and distributes its thread blocks to the remaining SMXs. When a thread block finishes, the corresponding SMX notifies the SMX scheduler to distribute a new thread block, either from the same kernel, or from the next kernel entry in the Kernel Distributor if the current kernel has no remaining thread blocks to distribute and the SMX has enough resources available to execute a thread block of the next kernel. Therefore, multiple thread blocks from different kernels can execute on the same SMX [11]. Large kernels, which either have many thread blocks or use a large amount of resources, are not likely to be executed concurrently. On the other hand, if the kernels in the Kernel Distributor use only a very small amount of SMX resources, the SMXs may not be fully occupied even after all the kernels in the Kernel Distributor are distributed, which results in under-utilization of the SMXs.

2.4. Device-Side Kernel Launch

Recent advances in the GPU programming model and architecture support device-side kernel launches through CUDA Dynamic Parallelism [28], which provides the capability of launching kernels dynamically from the GPU. In this model, parent kernels are able to launch nested child kernels by invoking several device-side API calls to specify the child kernel configuration, set up parameters, and dispatch kernels through device-side software streams to express dependencies. Child kernels can start any time after they are launched by their parents. However, parent kernels may request explicit synchronization with their children, which results in an immediate yield from parent to child. Device-side kernel launching preserves the original memory model used in the GPU: global memory is visible to both parent and children, while shared memory and local memory are private and cannot be passed between them.

In our baseline GPU architecture, there is a path from each SMX to the KMU so that all the SMXs are able to issue new kernel launching commands to the KMU. As with host-launched kernels, parameters are stored in global memory and the address is passed to the KMU with all the other configuration. When a parent decides to yield to a child kernel, the SMX suspends the parent kernel and notifies the KMU to hold the suspended kernel information. The KMU dispatches device-launched or suspended kernels to the Kernel Distributor in the same manner as host-launched kernels. Therefore, device-launched kernels also take advantage of the concurrent kernel execution capability.

Current architectural support for device-side kernel launching comes with non-trivial overhead. Wang et al. [36] analyze the CDP overhead on the Tesla K20c GPU and show that the total kernel launching time scales with the number of child kernels, decreasing performance by an average of 36.1%. The launching time is composed of the time spent allocating the parameters, issuing a new launching command from the SMX to the KMU, and dispatching a kernel from the KMU to the Kernel Distributor.

Device-side kernel launches also require a substantial global memory footprint. A parent may generate many child kernels which can be pending for a long time before being executed, requiring the GPU to reserve a fair amount of memory for storing the associated information of the pending kernels. In addition, the device runtime has to save the state of parent kernels when they are suspended to yield to child kernels at explicit synchronization points.

3. Dynamic Parallelism

We propose to extend the BSP execution model to effectively and efficiently support dynamic parallelism. This section introduces the major characteristics of dynamic parallelism and our extension to the BSP model.

3.1. Parallelism in Irregular Applications

Emerging data intensive applications are increasingly irregular, operating on unstructured data such as trees, graphs, relational data, and adaptive meshes. These applications have inherent time-varying, workload-dependent, and unpredictable memory and control flow behavior that may cause severe workload imbalance, poor memory system performance, and eventually low GPU utilization. Despite these characteristics, such applications still exhibit Dynamically Formed pockets of structured data Parallelism (DFP) that can locally exploit the GPU compute and memory bandwidth effectively. For example, typical GPU implementations of the vertex expansion operation commonly used in graph problems assign one thread to expand each vertex in the vertex frontier with a loop that iterates over all of its neighbors. Since the number of neighbors can vary from vertex to vertex, the implementation may suffer from poor workload balance across threads.

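The flat vertex-expansion pattern discussed in Section 3.1 can be made concrete with a small sketch. The CSR layout and the names (CSRGraph, rowOffsets, colIndices, expandVertex) are illustrative, not from the paper; the point is that the per-thread loop trip count equals the vertex degree, which is the source of the workload imbalance.

```cpp
#include <vector>

// Illustrative sketch of flat vertex expansion over a CSR graph: one
// (logical) thread walks the neighbor list of one frontier vertex, so a
// thread holding a high-degree vertex does far more work than its peers.
struct CSRGraph {
    std::vector<int> rowOffsets;  // size |V|+1; neighbors of v live in
    std::vector<int> colIndices;  // colIndices[rowOffsets[v] .. rowOffsets[v+1])
};

// Work performed by the thread assigned to vertex v. In the flat GPU
// kernel this loop length varies per thread, which is the imbalance (and
// divergence) the text describes; returns the per-thread trip count.
int expandVertex(const CSRGraph& g, int v, std::vector<int>& nextFrontier) {
    int begin = g.rowOffsets[v], end = g.rowOffsets[v + 1];
    for (int e = begin; e < end; ++e) nextFrontier.push_back(g.colIndices[e]);
    return end - begin;  // == degree of v
}
```

Because the neighbors of a vertex occupy consecutive colIndices slots, turning this inner loop into its own parallel unit (one thread per neighbor) yields uniform control flow and naturally coalesced accesses, which is the DFP opportunity the section identifies.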
However, DFP exists within each thread, as the neighbor exploration can be implemented as a structured parallel loop in which each neighbor is examined independently.

The introduction of device-side kernel launching in GPUs enables an implementation scheme in which new child kernels are dynamically invoked for any detected DFP in irregular applications. In the vertex expansion example, the original neighbor exploration loop can be replaced by a dynamically launched kernel that employs uniform control flow. This approach can potentially increase performance by reducing control flow divergence and memory irregularity. For some common data structures used in this problem, such as Compressed Sparse Row (CSR), where the neighbor IDs of each vertex are stored at consecutive addresses, parallel neighbor exploration may also generate coalesced memory accesses. Prior studies have identified several characteristics of such implementations [36]:

High Kernel Density: Depending on the problem size, DFP in irregular applications can show substantially high density, where a large number of device kernels are launched. For example, a typical breadth first search (BFS) iteration can have about 3K device kernel launches. The high DFP density results in high kernel launching overhead and memory footprint.

Low Compute Intensity: Device kernels launched for DFP are usually fine-grained and have relatively low degrees of parallelism. Measurements across several irregular applications show that the average number of threads in each device-launched kernel is around 40, which is close to the warp size.

Workload Similarity: As DFP may exist within each thread in a kernel, and all threads are identical, the operations performed by each dynamically launched kernel are usually similar, although their instantiations may have different degrees of parallelism. As per the nature of DFP, most device-launched kernels invoke the same kernel function but can have different configurations and parameter data.

Low Concurrency and Scheduling Efficiency: DFP generated by different threads is independent and is implemented by launching device kernels through different software streams to enable concurrency. Because of the low compute intensity of these device kernels, and thereby the low resource usage when executed on an SMX, multiple device kernels are able to execute concurrently. However, the current kernel scheduling strategy on the GPU imposes a limit on kernel-level concurrency, as the number of independent HWQs determines the maximum number of kernels that can be executed concurrently, which is 32 in the GK110 architecture. As fine-grained device kernels can only be scheduled and executed concurrently up to this limit, there may not be enough warps to fully occupy the SMXs. Consider an example where the device kernels have only 64 threads (2 warps) each and 32 device kernels are concurrently scheduled to the GPU, resulting in 64 warps running on the SMXs simultaneously. This is only 1/13 of the maximum number of resident warps on a Tesla K20c GPU (13 SMXs, each with a maximum of 64 resident warps). The limited warp concurrency can cause either low utilization (if some SMXs are not assigned any warps) or poor memory latency hiding (if SMXs are assigned only a small number of warps).

3.2. Light Weight Thread Blocks

While the current support for device-side kernel launch on the GPU provides substantial productivity for handling DFP, the major issues of kernel launching overhead, large memory footprint, and less efficient kernel scheduling prevent the performance-effective utilization of this functionality.

We propose DTBL to extend the current GPU execution model so that thread blocks, rather than entire kernels, can be dynamically launched from a GPU thread. Thread blocks (TBs) can be viewed as lightweight versions of a kernel. A kernel can make nested TB calls on demand to locally exploit small pockets of parallel work as they occur in a data dependent manner. When a GPU kernel thread launches a TB, it is queued up for execution along with the TBs that were initially created by the kernel launch. Using DTBL, the set of TBs that comprise a kernel is no longer fixed at launch time but can vary dynamically over the lifetime of the kernel.

As we will demonstrate, the dynamic creation of thread blocks can effectively increase SMX occupancy, leading to higher GPU utilization. The dynamic TB launch overhead as well as the memory footprint are significantly lower than those of a kernel launch. Thus, DTBL enables more efficient support of irregular applications by introducing a lightweight mechanism to dynamically spawn and control parallelism.

4. Dynamic Thread Block Launch

In this section, we define the execution model of DTBL, propose the architecture extensions, and analyze the potential benefits.

4.1. DTBL Execution Model

The execution model of DTBL allows new TBs to be dynamically launched from GPU threads and coalesced with existing kernels for scheduling efficiency. We extend the current GPU BSP execution model with several new concepts and terms to support these new features.

Figure 2 shows the execution model and thread hierarchy of DTBL. Any thread in a GPU kernel can launch multiple TBs with a single device API call (described later in this section). These TBs compose a single aggregated group, with a three-dimensional organization similar to that of a native kernel. An aggregated group is then coalesced with a kernel: this simply means the TBs in the aggregated group are added to the existing pool of TBs remaining to be scheduled and executed for that kernel. An aggregated group may be coalesced with the kernel of the parent thread (Figure 2a) or with another kernel (Figure 2b). In either case, the newly generated aggregated group executes the same function code as the kernel with which it is coalesced.

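The warp-concurrency arithmetic in the Section 3.1 example can be written out directly. The numbers (32 kernels of 64 threads each, versus 13 SMXs with 64 resident warps on a Tesla K20c) come from the text; the helper names are ours.

```cpp
// Working through the concurrency-limit example: 32 concurrent device
// kernels x 2 warps each = 64 live warps, against 13 x 64 = 832
// resident-warp slots on a K20c, i.e. 1/13 of the warp capacity.
constexpr int kWarpSize = 32;

// Warps needed by one kernel (ceiling division by the warp size).
int warpsPerKernel(int threadsPerKernel) {
    return (threadsPerKernel + kWarpSize - 1) / kWarpSize;
}

// Warps live on the GPU when the HWQ limit caps kernel concurrency.
int liveWarps(int concurrentKernels, int threadsPerKernel) {
    return concurrentKernels * warpsPerKernel(threadsPerKernel);
}

// Total resident-warp slots across all SMXs.
int residentWarpCapacity(int numSMX, int warpsPerSMX) {
    return numSMX * warpsPerSMX;
}
```

The same helpers make it easy to see why the limit hurts only fine-grained kernels: with 1024-thread kernels, 32 concurrent kernels would already supply 1024 warps, exceeding the 832 slots.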
Figure 2: DTBL execution model and thread hierarchy, where (a) shows the aggregated groups launched by kernel K1 being coalesced to K1 itself and (b) shows the aggregated groups launched by kernel K2 being coalesced to another kernel K3.

Table 1: List of Device Runtime API calls for DTBL
cudaGetParameterBuffer: Reused from the original CUDA device runtime library to allocate the parameter buffer for a new aggregated group.
cudaLaunchAggGroup: A new API call introduced by the DTBL programming interface which launches a new aggregated group.

The coalesced aggregated groups may, however, have different input parameter values. Multiple aggregated groups can be coalesced to a single kernel.

In DTBL, coalescing is essential to increasing TB scheduling performance because i) TBs with the same configuration can be scheduled together to achieve the designed occupancy of the original kernel, possibly leading to higher GPU utilization, and ii) coalesced aggregated groups require only one common context setup, including kernel function loading and register and shared memory partitioning, which reduces the scheduling overhead. More details are described with the microarchitecture in Section 4.2.

The kernel that is initially launched, either by the host or by the device using the kernel launching API, is called a native kernel. The TBs that compose the native kernel are native TBs. TBs in an aggregated group are called aggregated TBs. When a native kernel is coalesced with new aggregated groups, it becomes an aggregated kernel.

The idea of DTBL can be illustrated with two examples. The first example is Adaptive Mesh Refinement (AMR), corresponding to the execution model in Figure 2a. The DTBL implementation uses a native kernel K1 for the initial grid, where each thread may launch nested aggregated groups to recursively refine the cells that it processes. All the new aggregated groups are then coalesced with K1, which becomes an aggregated kernel. The second example is BFS, corresponding to the execution model in Figure 2b, where a parent kernel K2 assigns threads to all the vertices in the vertex frontier and each parent thread may launch new TBs to expand the vertex neighbors. The kernel K3 is a native kernel previously launched by the host or the device for vertex expansion. The new TBs generated by K2 are coalesced to K3 rather than to the parent.

Thread Hierarchy Within an Aggregated TB: As in GPUs today, DTBL uses a three-dimensional thread index to identify the threads in an aggregated TB. When coalesced to a native kernel, the number of threads in each dimension of an aggregated TB must be the same as that of a native TB. Therefore, aggregated TBs use the same configuration and the same amount of resources as native TBs, minimizing the overhead when scheduled on an SMX.

Aggregated TB Hierarchy Within an Aggregated Group: An aggregated group in DTBL is analogous to a device-launched kernel. Within an aggregated group, aggregated TBs are organized into one, two, or three dimensions, identified by their three-dimensional TB indices. The value of each TB index dimension starts at zero. As when launching a device kernel, programmers supply data addresses through parameters and use the TB indices within an aggregated group as well as the thread indices within an aggregated TB to index the data values used by each thread.

Synchronization: DTBL uses the same synchronization semantics as the current GPU execution model, i.e., threads within an aggregated TB can be synchronized explicitly by calling a barrier function. However, as in the base programming model, no explicit barrier is valid across native or aggregated TBs. Unlike the parent-child synchronization semantics of the device kernel launching model in CDP, aggregated groups cannot be explicitly synchronized by their invoking kernel. Therefore, it is the programmer's responsibility to ensure the correctness of the program without any assumption on the execution order of aggregated groups. When we implemented various irregular applications using device kernel launching, we avoided any explicit synchronization between children and parents because of the high overhead of saving the parent state to global memory, so these applications can be easily adapted to the new DTBL model. A more thorough analysis of the usage of explicit synchronization is left as future work.

Memory Model: DTBL also preserves the current GPU memory model, i.e., global memory, constant memory, and texture memory storage are visible to all native and aggregated TBs. Shared memory is private to each thread block and local memory is private to each thread. No memory ordering, consistency, or coherence is guaranteed across different native or aggregated TBs.

Programming Interface: DTBL defines two device runtime API calls on top of the original CUDA Device Runtime Library for CDP, listed in Table 1. The API call cudaGetParameterBuffer is the same as in the original device runtime library and is used to allocate parameter space for an aggregated group. The second API call, cudaLaunchAggGroup, is newly defined for dynamically launching an aggregated group. Programmers pass the kernel function pointer when calling this API to specify the kernel to be executed by, and possibly coalesced with, the new TBs. Similar to the device kernel launching API call cudaLaunchDevice in CDP, cudaLaunchAggGroup configures the new aggregated group with the thread and TB numbers in each dimension, the shared memory size, and the parameters.

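The indexing scheme described under "Aggregated TB Hierarchy Within an Aggregated Group" can be pictured with a small sketch: a thread's flat data index is formed from its TB index within the aggregated group and its thread index within the TB, in the same spirit as blockIdx/threadIdx arithmetic in ordinary CUDA code. The Dim3 struct and the row-major flattening below are assumptions for illustration, not an API defined by the paper.

```cpp
// Sketch (illustrative, not the paper's API): combine an aggregated-group
// TB index and a per-TB thread index into one flat data index.
struct Dim3 { int x, y, z; };

// Row-major flattening of a 3D index against its 3D extent.
int linearize(Dim3 idx, Dim3 dim) {
    return (idx.z * dim.y + idx.y) * dim.x + idx.x;
}

// Flat element index for a thread: its TB's position within the aggregated
// group, scaled by the TB size, plus its position within the TB. TB index
// dimensions start at zero, as stated in the text.
int globalThreadIndex(Dim3 tbIdx, Dim3 tbDim, Dim3 groupDim, Dim3 tIdx) {
    int threadsPerTB = tbDim.x * tbDim.y * tbDim.z;
    return linearize(tbIdx, groupDim) * threadsPerTB + linearize(tIdx, tbDim);
}
```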
While each device-side kernel launch is configured with an implicit or explicit software stream to express dependency on other kernels, the aggregated thread groups are automatically guaranteed to be independent of each other. We show example code segments for both CDP and DTBL implementations in Figure 3, where a parent kernel launches child kernels in CDP and corresponding aggregated groups in DTBL. The similarity between the two code segments demonstrates that DTBL introduces minimal extensions to the programming interface.

Figure 3: Example code segments for (a) CDP and (b) DTBL

  (a) CDP:
      __global__ void parent() {
          cudaStream_t s;
          cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
          void *buf = cudaGetParameterBuffer();
          // fill the buf with data
          cudaLaunchDevice(child, buf, grDim, tbDim, sharedMem, s);
      }
      __global__ void child(/* ... */) { /* ... */ }

  (b) DTBL:
      __global__ void parent() {
          void *buf = cudaGetParameterBuffer();
          // fill the buf with data
          cudaLaunchAggGroup(child, buf, aggDim, tbDim, sharedMem);
      }
      __global__ void child(/* ... */) { /* ... */ }

4.2. Architecture Extensions and SMX Scheduling

To support the new DTBL execution model, the GPU microarchitecture is extended to process the new aggregated groups that are launched from GPU threads. The baseline microarchitecture maintains several data structures to keep track of deployed kernels and the TBs that comprise them. These data structures are extended to keep track of dynamically formed aggregated groups and to associate them with active kernels. This is achieved in a manner that is transparent to the warp schedulers, the control divergence mechanism, and the memory coalescing logic. Figure 4 illustrates the major microarchitecture extensions to support DTBL. With the extended data structures and the SMX scheduler, new aggregated groups are launched from the SMXs, coalesced with existing kernels in the Kernel Distributor, and scheduled to execute on the SMXs together with all other TBs of the coalesced kernel. The detailed procedure and the functionality of each data structure extension are described as follows.

[Figure 4: Microarchitecture Flow for DTBL. The extensions comprise the Kernel Distributor Entries (PC, Dim, Param, ExeBL, NAGEI, LAGEI fields), the SMX scheduler with its FCFS controller and control registers (KDEI, AGEI, NextBL), the Aggregated Group Table (Next, AggDim, Param, ExeBL fields), per-SMX Thread Block Control Registers (KDEI, AGEI, BLKID fields), and the memory controller/DRAM.]

Launching Aggregated Groups
This is the first step, taken when the aggregated-group launching API is invoked by one or more GPU threads. Similar to the device kernel launching command, DTBL introduces a new aggregation operation command in the microarchitecture. This command is issued when the aggregated-group launching API calls are invoked simultaneously by one or more threads within the same warp; these aggregated group launches are combined to be processed by a single aggregation operation command.

For each newly formed aggregated group, the SMX allocates global memory blocks through the memory controller ① to store the parameters and configuration information ②. The request procedure is the same as that of a device-side kernel launch. After the parameters are loaded into the parameter buffer, the SMX passes the aggregation operation command to the SMX scheduler along with the information for each aggregated group ③.

Thread Blocks Coalescing
In this step, the SMX scheduler receives the aggregation operation command and attempts to match the newly launched aggregated groups with the existing kernels in the Kernel Distributor Entries (KDE) for TB coalescing, based on the aggregated group configurations. If the coalescing is successful, the SMX scheduler pushes the new TBs into a scheduling TB pool for the corresponding kernel. The scheduling pool is implemented with several microarchitecture registers that form a linked-list data structure for efficient TB scheduling. The process is implemented as a new part of the DTBL scheduling policy ④, which is illustrated in Figure 5 and described in the following.

For each aggregated group, the SMX scheduler first searches the KDE to locate any existing eligible kernels that can accept the new TBs ⑤. Eligible kernels must have the same entry PC address and TB configuration as the aggregated group. If none are found, the aggregated group is launched as a new device kernel. Our experiments show that an aggregated group is able to match an eligible kernel on average 98% of the time; mismatches typically occur early, before newly generated device kernels fill the KDE.

If an eligible kernel is found, the SMX scheduler allocates an entry in the Aggregated Group Table (AGT) with the three-dimensional aggregated group size and the parameter address ⑥. The AGT is composed of multiple Aggregated Group Entries (AGE) and serves to track all the aggregated groups. Aggregated groups that are coalesced to the same eligible kernel are linked together through the Next field of the AGE ⑦. The AGT is stored on chip for fast access, with a limit on the number of entries. When the SMX scheduler searches for a free entry in the AGT, it uses a simple hash function to generate the search index instead of a brute-force search. The hash function is defined as ind = hw_tid & (AGT_size - 1), where hw_tid is the hardware thread index on each SMX and AGT_size is the size of the AGT.
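The single-probe lookup above can be sketched in a few lines. The following Python model is ours, not the paper's hardware: the class shape, the dictionary used as group information, and the None fallback (standing in for the global-memory pointer path) are all illustrative assumptions.

```python
# Illustrative model of the AGT free-entry search described above.
# A launch hashes the hardware thread id into the table with a
# power-of-two mask; on a miss the group information stays in
# global memory (modeled here by returning None).

AGT_SIZE = 1024  # must be a power of two for the mask to work

class AGT:
    def __init__(self, size=AGT_SIZE):
        assert size & (size - 1) == 0, "size must be a power of two"
        self.size = size
        self.entries = [None] * size  # None marks a free AGE

    def allocate(self, hw_tid, group_info):
        """Single-probe allocation: ind = hw_tid & (AGT_size - 1)."""
        ind = hw_tid & (self.size - 1)
        if self.entries[ind] is None:
            self.entries[ind] = group_info
            return ind   # the AGEI recorded in the KDE
        return None      # caller falls back to a global-memory pointer

agt = AGT()
agei = agt.allocate(hw_tid=2077, group_info={"dim": (4, 1, 1)})
print(agei)   # 2077 & 1023 == 29
clash = agt.allocate(hw_tid=29, group_info={"dim": (8, 1, 1)})
print(clash)  # slot 29 already taken -> None
```

The single probe keeps the search at one cycle, at the cost of occasional spills to global memory on index collisions, mirroring the trade-off discussed above.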

The intuition behind the hash function is that all threads on an SMX have the same probability of launching a new aggregated group. The SMX scheduler allocates an entry if the entry indexed by ind in the AGT is free, records the aggregated group there, and uses the corresponding AGT entry index (AGEI); otherwise it records the pointer to the global memory where the aggregated group information is stored ②.

Now that an eligible kernel and the corresponding KDE entry have been found, the TBs of the new aggregated group are added to the set of TBs of the eligible kernel waiting to be executed. The AGEI, or the global-memory pointer to the new aggregated group information, is used to update the two KDE registers Next AGEI (NAGEI) and Last AGEI (LAGEI) if necessary ⑧. NAGEI indicates the next aggregated group to be scheduled in the kernel; it is initialized when a kernel is newly dispatched to the Kernel Distributor, to indicate that no aggregated groups exist for the kernel yet. LAGEI indicates the last aggregated group coalesced to this kernel.

All the kernels in the Kernel Distributor are marked by the FCFS controller ⑨ with a single bit when they are queued to be scheduled and unmarked when all their TBs have been scheduled. We extend the FCFS controller with an extra bit indicating whether this is the first time the kernel has been marked. This is useful when the SMX scheduler attempts to update NAGEI under two different scenarios.

In the first scenario, when a new aggregated group is generated, the corresponding eligible kernel may have all its TBs already scheduled to SMXs, be unmarked by the FCFS controller, and only be waiting for its TBs to finish execution. In this case, NAGEI is updated with the new aggregated group and the kernel is marked again by the FCFS controller, so that the new aggregated group can be scheduled the next time the kernel is selected by the SMX scheduler.

In the second scenario, the eligible kernel is still marked by the FCFS controller, as it is either waiting in the FCFS queue or currently being scheduled by the SMX scheduler. In this case, there are still TBs of the eligible kernel to be scheduled, and NAGEI is updated only when the new aggregated group is the first aggregated group to be coalesced to this kernel.

Unlike NAGEI, LAGEI is updated every time a new aggregated group is generated for the kernel, to reflect the last aggregated group to be scheduled. With NAGEI, LAGEI and the Next field of the AGE, all the aggregated groups coalesced to the same kernel are linked together to form a scheduling pool.

[Figure 5: DTBL scheduling procedure in the SMX scheduler — the flowchart searches for an eligible kernel (launching the group as a new device kernel if none is found), obtains a free AGT entry or falls back to the aggregated group information in global memory, marks the kernel in the FCFS controller if necessary, and updates NAGEI (for a first aggregated group) and LAGEI.]

Aggregated Thread Blocks Scheduling on SMX
The last step in DTBL scheduling schedules all the aggregated TBs on the SMXs. The SMX scheduler first determines whether the native kernel or a specific aggregated group should be scheduled, according to the register values generated by the previous step for the scheduling pool. It then distributes the TBs of the kernel or the aggregated group to the SMXs, using a set of registers to track their status. As described in the following, this is implemented by updating the algorithm used by the baseline GPU microarchitecture to distribute and execute the native TBs.

When the SMX scheduler receives a kernel from the Kernel Distributor, it checks whether this is the first time the kernel has been marked by the FCFS controller. If so, the SMX scheduler starts distributing the native TBs, followed by the aggregated TBs pointed to by NAGEI (if any). Otherwise it directly starts distributing the aggregated TBs pointed to by NAGEI, since the native TBs were already scheduled when the kernel was previously dispatched by the FCFS controller. Another possibility is that new aggregated groups are coalesced to a kernel that is currently being scheduled; the SMX scheduler then continues to distribute the new aggregated groups after finishing the TBs of the native kernel or of the current aggregated group. The SMX scheduler updates NAGEI each time it finishes scheduling the current aggregated group and starts the next aggregated group indicated by the Next field of the AGE pointed to by NAGEI.

Once the SMX scheduler determines the native kernel or aggregated group to schedule, it records the corresponding KDE index (KDEI) and AGEI in its control registers (SSCR) ⑩. The SSCR also has a NextBL field that stores the index of the next TB to be distributed to the SMXs. Note that since the TBs of the native kernel and of the aggregated groups have the same configuration and resource usage, as constrained by the DTBL execution model, the SMX scheduler can use a static resource partitioning strategy for both the native and the aggregated TBs, saving scheduling cost.

The SMX scheduler then distributes TBs to each SMX. The Thread Block Control Register (TBCR) ⑪ on each SMX is updated with the same KDEI and AGEI values as in the SSCR, recording the kernel index in the Kernel Distributor and the aggregated group index in the AGT, so that the SMX can locate the function entry and parameter address correctly for each scheduled TB. The BLKID field records the corresponding TB index within a kernel or an aggregated group. Once a TB finishes execution, the SMX notifies the SMX scheduler to update the ExeBL fields in the KDE ⑫ and the AGE ⑬, which track the number of TBs in execution.
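The NAGEI/LAGEI bookkeeping can be sketched as a linked list of AGEs hanging off a KDE entry. The following Python model is a hypothetical reconstruction of the pool behavior described above; the class and method names are ours, not the paper's.

```python
# Illustrative model of the per-kernel scheduling pool: aggregated
# groups coalesced to one kernel are chained through the Next field
# of their AGEs, delimited by the KDE registers NAGEI and LAGEI.

class AGE:
    def __init__(self, agei):
        self.agei = agei
        self.next = None  # Next field: following group in the pool

class KDEEntry:
    def __init__(self):
        self.nagei = None  # next aggregated group to schedule
        self.lagei = None  # last aggregated group coalesced so far

    def coalesce(self, age):
        if self.lagei is None:
            self.nagei = age       # first group: NAGEI is (re)set
        else:
            self.lagei.next = age  # append through the Next field
        self.lagei = age           # LAGEI always tracks the tail

kde = KDEEntry()
for i in range(3):
    kde.coalesce(AGE(i))

# Walk the pool the way the SMX scheduler follows NAGEI and Next.
pool, cur = [], kde.nagei
while cur:
    pool.append(cur.agei)
    cur = cur.next
print(pool)  # [0, 1, 2]
```

The point of the two registers is visible here: NAGEI only moves when the scheduler consumes a group, while LAGEI moves on every coalescing, so the hardware never needs to rescan the table to find the tail.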

When all the TBs of the last aggregated group, marked by LAGEI, have been distributed to an SMX, the SMX scheduler notifies the FCFS controller to unmark the current kernel, finishing its scheduling. The corresponding entries in the Kernel Distributor and the AGT are released once all the TBs complete execution.

4.3. Overhead Analysis

The hardware overhead is caused by the extra data structures introduced by the architectural extensions (the shaded boxes in Figure 4). The new fields in the KDE (NAGEI and LAGEI), the FCFS controller (the flag indicating whether the kernel has been previously dispatched), the SSCR (AGEI) and the SMX TBCR (AGEI) together take 1096 bytes of on-chip SRAM. The size of the AGT determines how many pending aggregated groups can be held on chip for fast access. A 1024-entry AGT takes 20KB of on-chip SRAM (20 bytes per entry), which constitutes the major hardware overhead (about 0.5% of the area taken by the shared memory and registers on all SMXs). In Section 5.2-D we analyze the sensitivity of performance to the size of the AGT.

The major timing overhead of launching aggregated groups comprises the time spent allocating parameters, searching the KDE, and requesting free AGT entries. As discussed before, launches from the threads within a warp are grouped together as a single command, so the overhead is evaluated on a per-warp basis. The procedure for allocating a parameter buffer for an aggregated group is the same as that for a device-launched kernel, so we use the measurement taken directly from a K20c GPU. The search for an eligible KDE entry can be pipelined across all the simultaneous aggregated-group launches in a warp, taking a maximum of 32 cycles (1 cycle per entry). Searching for a free AGT entry takes only one cycle per aggregated group with the hash function. If a free entry is found, the SMX scheduler incurs zero cost to load the aggregated group information when the group is scheduled; otherwise it has to load the information from global memory, with an overhead dependent on the global memory traffic. Note that allocating the parameter buffer and searching the KDE/AGT can happen in parallel, so the slower of the two determines the overall time overhead of an aggregated-group launch.

An alternative to the proposed microarchitecture extension is to increase the number of KDE entries so that each aggregated group can be scheduled independently from the KDE, potentially saving the hardware overhead introduced by the AGT. However, this approach has some major side effects. First, since aggregated groups would be scheduled independently, they would not be coalesced, so TBs with different configurations would be more likely to execute on the same SMX. Consequently, the designed occupancy of the original kernels would be less likely to be achieved and the execution efficiency could decrease. For the same reason, the context setup overhead, such as kernel function loading and resource allocation across SMXs, could increase; this overhead is expected to scale with the number of aggregated groups scheduled from the KDE. Second, the hardware complexity and scheduling latency in the KMU and the FCFS controller scale with the number of KDE entries. For example, the number of HWQs would have to increase to sustain kernel concurrency, and the overhead for the FCFS controller to track and manage the status of each aggregated group would also increase linearly.

4.4. Benefits of DTBL

DTBL is beneficial primarily for three reasons. First, compared with device-side kernel launching, dynamic TB launches have lower overhead. Instead of processing a device-side kernel launch command along the long path from the SMX to the KMU and then to the Kernel Distributor, TBs are directly grouped with active kernels in the Kernel Distributor by the SMX scheduler. For irregular applications that may generate a large amount of dynamic workload, reducing the launch overhead can effectively improve the overall performance.

Second, due to the similarity of the dynamic workloads in irregular applications, dynamically generated TBs are very likely to be coalesced to the same kernel, which enables more TBs to execute concurrently. Recall that the concurrent execution of fine-grained device kernels is limited by the size of the Kernel Distributor. DTBL scheduling breaks this limit, as aggregated TBs are coalesced into a single native kernel that can take full advantage of TB-level concurrency on the SMX. This more efficient scheduling strategy can increase SMX occupancy, which benefits GPU utilization, memory latency hiding, and memory bandwidth.

Third, both the reduced launch latency and the increased scheduling efficiency help consume the dynamically launched workload faster. As the size of the reserved global memory depends on the number of pending aggregated groups, DTBL can therefore also reduce the global memory footprint.

5. Experiments and Evaluation

5.1. Methodology

We perform the experiments on the cycle-level GPGPU-Sim simulator [5], configured to model the Tesla K20c GPU as our baseline architecture. The configuration parameters are shown in Table 2. We also modify the SMX scheduler to support concurrent kernel execution on the same SMX. The warp scheduler is configured to use the greedy-then-oldest scheduling policy [32]. As discussed before, our proposed microarchitecture extension is transparent to the warp scheduler, so DTBL can take advantage of any warp-scheduling optimization that is useful to the baseline GPU architecture.

Table 2: GPGPU-Sim Configuration Parameters
  SMX Clock Freq.                           706 MHz
  Memory Clock Freq.                        2600 MHz
  # of SMX                                  13
  Max # of Resident Thread Blocks per SMX   16
  Max # of Resident Threads per SMX         2048
  # of 32-bit Registers per SMX             65536
  L1 Cache / Shared Mem Size per SMX        16KB / 48KB
  Max # of Concurrent Kernels               32

To support the device-side kernel launch capability (CDP on the K20c), we extend the device runtime of GPGPU-Sim with implementations of the corresponding API calls. We model the latency of these API calls, which is part of the kernel launching overhead, by performing measurements on the K20c GPU with the clock() function and using the average cycle values from 1,000 measurements across all the evaluated benchmarks. According to our measurements, the APIs cudaGetParameterBuffer and cudaLaunchDevice follow a linear latency model on a per-warp basis, denoted Ax + b, where b is the initialization latency for each warp, A is the latency for each API called by one thread in the warp, and x is the number of threads calling the API in a warp. Note that the execution of the device API calls is interleaved across warps, so some portion of the introduced latency can be hidden by the interleaving, similar to memory latency hiding. Besides the API latency, there is also a kernel dispatching latency (from the KMU to the Kernel Distributor). We measure this using the average time difference between the end of a first kernel and the start of a second kernel that depends on the first.

Table 3: Latency Modeling for CDP and DTBL (Unit: cycles)
  cudaStreamCreateWithFlags (CDP only)    7165
  cudaGetParameterBuffer (CDP and DTBL)   b: 8023,  A: 129
  cudaLaunchDevice (CDP only)             b: 12187, A: 1592
  Kernel dispatching                      283

Table 4: Benchmarks used in the experimental evaluation.
  Application                              Input Data Set
  Adaptive Mesh Refinement (AMR)           Combustion Simulation [18]
  Barnes Hut Tree (BHT) [8]                Random Data Points
  Breadth-First Search (BFS) [23]          Citation Network [4]; USA Road Network [4]
  Graph Coloring (CLR) [10]                Citation Network [4]; Graph 500 Logn20 [4]; Cage15 Sparse Matrix [4]
  Regular Expression Match (REGX) [37]     DARPA Network Packets [21]; Random String Collection
  Product Recommendation (PRE) [25]        Movie Lens [16]
  Relational Join (JOIN) [12]              Uniform Distributed Data; Gaussian Distributed Data
  Single Source Shortest Path (SSSP) [19]  Citation Network [4]; Flight Network [1]; Cage15 Sparse Matrix [4]
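Plugging the Table 3 coefficients into the per-warp model gives concrete launch costs. The arithmetic below is ours, assuming a fully active warp of 32 threads; the helper name is illustrative.

```python
# Per-warp API latency model from Section 5.1: latency = A*x + b,
# where x is the number of threads in the warp calling the API.
# Coefficients are the measured K20c values reported in Table 3.

def api_cycles(A, b, x):
    assert 1 <= x <= 32, "x is a thread count within one warp"
    return A * x + b

get_param = api_cycles(A=129, b=8023, x=32)    # cudaGetParameterBuffer
launch    = api_cycles(A=1592, b=12187, x=32)  # cudaLaunchDevice
print(get_param, launch)  # 12151 63131
```

Since DTBL pays only the parameter-buffer cost while CDP additionally pays for cudaLaunchDevice, a full warp of launches differs by roughly 63K cycles under this model, which is consistent with the launch-overhead argument made in Section 4.4.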
We verify the accuracy of the simulation, especially the kernel launching overhead, by running all the benchmarks both on the K20c GPU and on the simulator, using the correlation computation method provided by GPGPU-Sim. We use the same data sets on the GPU and the simulator and run each application from beginning to end, except for regx: we break regx into several sections, manually populate the memory in GPGPU-Sim, and run only the computation kernels. We then trace all the computation kernels of the benchmarks to generate the performance data. We also implement the proposed architecture extension for DTBL with the overhead assignment described in Section 4.3, where the latency for parameter buffer allocation is the same as that of the cudaGetParameterBuffer API call and all other aggregated-group launching latency is modeled directly by the microarchitecture extension. The latency numbers used in the simulator are shown in Table 3.

We select 8 irregular applications with different input data sets, as shown in Table 4. The source code of these applications comes from the latest benchmark suites or is implemented as described in recently published papers. We refer to the original CUDA implementations as flat implementations, since the nested algorithmic structure is flattened and effectively serialized within each thread. An exception is the bfs implementation [23], where dynamic parallelism for DFP (dynamically formed pockets of parallelism) is implemented by employing TB-level and warp-level vertex expansion techniques; for this application, our implementation uses CDP device kernels or DTBL aggregated groups to replace the TB- or warp-level vertex expansion.

We implemented the benchmarks with CDP such that a device kernel is launched for any DFP with sufficient parallelism available. The same methodology applies to DTBL, except that a device kernel is replaced with an aggregated group, for a fair comparison. Note that the data structures and algorithms of the original implementations are not changed in the CDP/DTBL implementations, also for fairness. The proposed DTBL model for dynamic parallelism is orthogonal to many optimizations, e.g. using a worklist for work pulling and pushing to achieve high-level workload balance, as these can be applied in either the flat or the nested implementations.

DTBL only uses the device runtime API for thread block launching and does not introduce any new instructions or syntax. Therefore, we use the CUDA compiler NVCC 6.5 directly to compile the benchmarks. Extending the syntax to support a higher-level programming interface, similar to the CUDA kernel launch annotation "<<<>>>", is left as future work.

5.2. Result and Analysis

In this section we report the evaluation and analysis of the benchmarks in various performance aspects.

A. Control Flow and Memory Behavior

We evaluate the control flow behavior using the warp activity percentage, defined as the average percentage of active threads in a warp, as shown in Figure 6, and the memory behavior using the DRAM efficiency, computed as dram_efficiency = (n_rd + n_write) / n_activity, where n_rd and n_write are the numbers of memory read and write commands issued by the memory controller and n_activity is the number of active cycles in which there is a pending memory request.
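The metric can be sketched directly from its definition; the counter values below are invented for illustration.

```python
# DRAM efficiency as defined above: read+write commands issued by
# the memory controller, normalized by the cycles during which a
# memory request is pending.

def dram_efficiency(n_rd, n_write, n_activity):
    assert n_activity > 0, "need at least one active cycle"
    return (n_rd + n_write) / n_activity

# Hypothetical counters: 300 reads + 100 writes over 1000 active cycles.
print(dram_efficiency(300, 100, 1000))  # 0.4
```

More coalesced accesses raise the command count serviced per pending cycle, which is why the metric tracks effective bandwidth utilization in the discussion that follows.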

DRAM efficiency reveals the memory bandwidth utilization and increases when there are more coalesced memory accesses in a given period of time, as shown in Figure 7.

[Figure 6: Average Percentage of Active Threads in a Warp]
[Figure 7: DRAM Efficiency]

On average, the warp activity percentage of both CDP and DTBL increases by 10.7% over the flat implementations, and the DRAM efficiency increases by 0.029 (1.14x) for CDP and by 0.053 (1.27x) for DTBL, demonstrating that one important benefit of both CDP and DTBL is to dynamically generate parallel workloads for DFP that have more regular control flow and more coalesced memory accesses.

Since both DTBL and CDP launch dynamic parallel workloads, they fundamentally behave the same in reducing control flow divergence and obtain the same increase in warp activity percentage. Some benchmarks, such as amr and join_gaussian, have highly irregular computation workloads and severe imbalance across threads in their flat implementations, and they achieve the most substantial increases in warp activity percentage (45.3% and 21.3%). We also observe warp activity increases for the two bfs benchmarks: although the baseline bfs implementation already utilizes TB-level and warp-level vertex expansion to handle dynamic parallelism, CDP and DTBL are able to use variable TB sizes to achieve even better workload balance. The benchmark clr_cage15 does not show obvious changes, and clr_graph500 even shows a slight drop (-5.9%), because the cage15 and graph500 graphs already have relatively small variance in vertex degree and generate balanced workloads even in the flat implementation. Launching dynamic parallel workloads for some vertices but not for others may break the original balance and cause more control flow divergence. This is consistent with the understanding that CDP and DTBL are intended to work well on unbalanced workloads.

The benchmarks clr_cage15 and sssp_cage15 achieve the highest DRAM efficiency increases. In their flat implementations, as the cage15 graph has a distributed neighbor list, threads access vertex data far away from each other in memory, resulting in more non-coalesced memory accesses and memory transactions. In comparison, CDP and DTBL implement the DFP such that threads are more likely to access consecutive memory addresses. Memory irregularity is significantly reduced in this case, which is demonstrated by the increased DRAM efficiency.

B. Scheduling Performance

By coalescing dynamically generated TBs to existing kernels on the fly, DTBL is able to increase the TB-level concurrency for fine-grained parallel workloads. The lower launching latency of DTBL also contributes to the increase in the number of available TBs that can be scheduled concurrently by the SMX scheduler. Therefore, DTBL is able to outperform CDP by increasing SMX occupancy. We evaluate the SMX occupancy by measuring the average number of active warps per cycle on all of the SMXs, divided by the maximum number of resident warps per SMX. We isolate the influences of the scheduling strategy and the launching latency by comparing the measurements with and without launching latency. The results are shown in Figure 8, where the SMX occupancy achieved by CDP and DTBL without modeling the launching latency is denoted CDP-Ideal (CDPI) and DTBL-Ideal (DTBLI), respectively.

[Figure 8: SMX Occupancy]

DTBLI shows an average increase of 17.9 (1.24x) over CDPI. The benchmark bht achieves the highest occupancy increase (24.6 or 1.38x) since it generates many fine-grained parallel workloads (the average number of threads in a device kernel or an aggregated group is 33.4, which is close to the warp size). Therefore, in its CDP implementation, the limited kernel-level concurrency on the GPU causes only a few threads to be active, which results in low SMX occupancy and utilization. DTBL, on the other hand, is able to aggregate all these fine-grained TBs together to fully utilize the SMXs with a higher occupancy. Other benchmarks that generate dynamic workloads with higher parallelism (coarse-grained workloads) show a less significant increase in SMX occupancy, represented by pre (0.46 or 1.01x) with an average of 1527.9 threads in a device kernel or an aggregated group.
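The occupancy metric follows from the Table 2 limits (2048 resident threads per SMX and 32 threads per warp give 64 resident warps). The sketch below is ours; the sampled warp counts are invented for illustration.

```python
# SMX occupancy as measured above: average number of active warps
# per cycle divided by the maximum number of resident warps per SMX,
# expressed as a percentage as in Figure 8.
# From Table 2: 2048 resident threads / 32 threads per warp = 64 warps.

MAX_RESIDENT_WARPS = 2048 // 32  # = 64

def smx_occupancy(active_warp_samples):
    avg = sum(active_warp_samples) / len(active_warp_samples)
    return 100.0 * avg / MAX_RESIDENT_WARPS

# Hypothetical per-cycle samples of active warps on one SMX.
print(smx_occupancy([64, 48, 32, 48]))  # 75.0
```

Under this metric, coalescing many small aggregated groups into one resident kernel raises the average active-warp count directly, which is the mechanism behind the DTBLI gains reported above.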

If the launching latency is included for both CDP and DTBL, the SMX occupancy decreases from both CDPI and DTBLI (on average -10.7 or -13.5% for CDP and -5.2 or -7.6% for DTBL). The launching latency for a device kernel is higher than that for an aggregated group and causes a larger drop in SMX occupancy for CDP. In regx_string, DFP occurs frequently and generates a large number of dynamic parallel workloads; while DTBLI outperforms CDPI in SMX occupancy (11.2 or 1.14x) because of the increased thread block concurrency, the launching latency further enlarges the gap (25.4 or 1.48x). The increased SMX occupancy of DTBL also improves the DRAM efficiency, as shown in Figure 7 (on average 0.022 or 1.08x higher than CDP), because of the memory latency hiding capability.

We further evaluate the DTBL scheduling efficiency by comparing the average waiting time (the time between launch and the start of execution) and the memory footprint for dynamically generated kernels or aggregated groups, as shown in Figure 9 and Figure 10, respectively. Again, we compare the waiting time with and without launching latency. On average, DTBLI reduces the waiting time by 18.8% from CDPI, while DTBL reduces the waiting time by 24.1% and the memory footprint by 25.6% from CDP. Similar to the SMX occupancy behavior, pre and join_uniform show little change in average waiting time between DTBLI and CDPI because they generate coarse-grained dynamic workloads. The benchmark regx_string has the highest DFP occurrence, so it benefits most from DTBL, showing the largest waiting-time decrease (-41.8%) and a significant memory footprint reduction (-51.2%). The benchmark clr_graph500 does not show much change in average waiting time when the launching latency is included, even though its SMX occupancy is significantly affected by that latency, because all the dynamically launched kernels or aggregated groups are forced to wait for other kernels to complete and release resources before they can be executed, which takes much longer than the launching latency. For the same reason, this benchmark also shows no memory footprint reduction, as the information of all the aggregated groups has to be saved while they are pending. One solution to this problem is to enable spatial sharing between the native kernels and the aggregated thread groups using the software techniques introduced in [35] or the hardware preemption introduced in [2]. This way, the aggregated groups are able to execute on the SMXs soon after they are generated, and the memory reserved for holding their information can be released for new aggregated groups.

[Figure 9: Average Waiting Time for a Kernel or an Aggregated Group]
[Figure 10: Memory Footprint Reduction of DTBL from CDP]

C. Overall Performance

We show the overall speedup of CDPI, DTBLI, CDP and DTBL over the flat implementations in Figure 11; note that the data transfer time between the CPU and the GPU is excluded. As CDPI and DTBLI decrease control flow divergence and increase memory efficiency, they achieve average ideal speedups of 1.43x and 1.63x, respectively. However, the non-trivial kernel launching overhead negates the CDP performance gain, resulting in an average 1.16x slowdown from the flat implementations. DTBL, on the other hand, shows an average 1.21x speedup over the flat implementations and 1.40x over CDP, which demonstrates that DTBL preserves the capability of CDP to increase control flow and memory regularity for irregular applications while using a more efficient scheduling strategy with lower launching overhead to increase the overall performance.

The benchmarks bfs_usa_road and sssp_flight show very little change in the DTBL speedup. The reason is that most vertices in their input graphs have very low degree, so DFP rarely occurs in these two benchmarks and very few device kernels or aggregated groups are launched; both CDP and DTBL therefore have very limited effect on the overall performance. In fact, these two benchmarks also show limited changes in the other characteristics evaluated above. Two benchmarks show slowdowns instead of speedups: clr_graph500 (0.97x) and regx_string (0.95x). The benchmark clr_graph500 operates on the graph500 input data set, which has very balanced vertex degrees; the flat implementation therefore already has good control flow and memory behavior, and using CDP or DTBL does not reduce the control flow or memory irregularity but introduces extra launching overhead. For regx_string, while the large number of dynamically generated aggregated groups brings significant ideal speedup (2.73x for CDPI and 3.10x for DTBLI), it also introduces substantial launching overhead, even for DTBL, which negates the performance gains.
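As a rough consistency sketch of the reported averages (each figure is an average over the benchmark set, so multiplicative composition is only approximate, and the check below is our own arithmetic, not the paper's):

```python
# DTBL is reported at 1.21x over flat; CDP at a 1.16x slowdown
# (i.e. flat is 1.16x faster than CDP). Composing the two roughly
# reproduces the reported 1.40x advantage of DTBL over CDP.

dtbl_over_flat = 1.21
flat_over_cdp  = 1.16
dtbl_over_cdp  = dtbl_over_flat * flat_over_cdp
print(round(dtbl_over_cdp, 2))  # 1.4
```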

D. Sensitivity to AGT Size

As the major architecture extension, the size of the AGT determines the hardware overhead as well as the application performance. A larger AGT can increase the number of aggregated groups stored on chip, and thereby the scheduling efficiency, at the cost of more on-chip SRAM. We identify the trade-off by investigating the performance change over different AGT sizes, as shown in Figure 12. On average, decreasing the AGT size from 1024 to 512 entries causes a 1.31x slowdown, while increasing it to 2048 entries yields a 1.20x speedup. Benchmarks that use a relatively high number of dynamic aggregated groups, such as bht and regx, are more sensitive to the AGT size.

[Figure 11: Overall Performance in terms of Speedup over Flat Implementation]
[Figure 12: Performance Sensitivity to AGT Size Normalized to 1024 Entries]

6. Related Work

Characterization of various benchmarks shows that implementations of irregular GPU applications mainly suffer from workload imbalance and scattered memory accesses, which cause control flow and memory irregularity [7][9]. Researchers have been investigating more efficient solutions for these irregular applications by redesigning data structures and reorganizing memory accesses through algorithm, compiler and runtime optimizations [33][40][23][38] to harness the GPU's capability.

Increasing the performance of irregular applications by handling nested parallelism is an important and challenging question in the GPGPU programming community. Bergstrom and Reppy [6] implement the NESL language on the GPU, and Lee et al. [20] propose an auto-tuning framework that efficiently maps nested patterns in GPU applications. Yang et al. [39] propose a compiler technique that can activate or disable threads at runtime to handle nested parallelism in GPU applications. Compared with their work, our methodology generates workloads dynamically for any nested parallel pattern.

Gupta et al. [15] introduce the persistent-threads programming style on GPUs, where enough thread blocks to occupy all the SMXs are launched initially and stay on the GPU for the lifetime of the kernel; these thread blocks dynamically generate tasks that are appended to a globally visible software queue while persistently consuming tasks. Steffen et al. [34] propose a dynamic micro-kernel architecture for global rendering algorithms that supports dynamically spawning threads as a new warp to execute a subsection of the parent thread's code. Orr et al. [29] design a task aggregation framework on GPUs based on the channel abstraction proposed by Gaster et al. [14]: each channel is a finite queue in virtual memory (global memory space visible to both the CPU and the GPU) whose elements are dynamically generated tasks that execute the same kernel function. Our approach is embedded in the basic GPU execution model and can accommodate the preceding optimizations, providing uniformity in optimizing all of the aspects addressed in these schemes; thus, we argue, it has a wider range of applicability.

7. Conclusions

In this paper, we propose a new extension to the current GPU execution model that enables dynamic thread block launching and coalescing to existing kernels on the fly. The proposed model is specifically designed to provide a more efficient solution for executing dynamically formed pockets of parallelism in irregular applications. We define the execution model, propose minimal modifications to the programming interface, and discuss the microarchitecture extensions. Through experimental evaluation on various irregular CUDA applications, we demonstrate that by increasing GPU scheduling efficiency and decreasing launch overhead, the proposed model achieves an average 1.21x speedup over the original flat implementations and an average 1.40x speedup over implementations using device-side kernel launch functionality.

Acknowledgement

This research was supported by the National Science Foundation under grant CCF 1337177 and by an NVIDIA Graduate Fellowship. We would also like to acknowledge the detailed and constructive comments of the reviewers.

References

[1] "Global flight network." [Online]. Available: http://www.visualizing.org/datasets/global-flights-network
[2] J. Adriaens, K. Compton, N. S. Kim, and M. Schulte, "The case for GPGPU spatial multitasking," in High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, 2012.
[3] J. A. Anderson, C. D. Lorenz, and A. Travesset, "General purpose molecular dynamics simulations fully implemented on graphics processing units," Journal of Computational Physics, vol. 227, no. 10, pp. 5342–5359, 2008.

[4] D. A. Bader, H. Meyerhenke, P. Sanders, and D. Wagner, "10th dimacs implementation challenge: Graph partitioning and graph clustering," 2011.
[5] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, "Analyzing cuda workloads using a detailed gpu simulator," in 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2009), April 2009, pp. 163–174.
[6] L. Bergstrom and J. Reppy, "Nested data-parallelism on the gpu," in ACM SIGPLAN Notices, vol. 47, no. 9. ACM, 2012, pp. 247–258.
[7] M. Burtscher, R. Nasre, and K. Pingali, "A quantitative study of irregular programs on gpus," in Workload Characterization (IISWC), 2012 IEEE International Symposium on. IEEE, 2012, pp. 141–151.
[8] M. Burtscher and K. Pingali, "An efficient cuda implementation of the tree-based barnes hut n-body algorithm," GPU Computing Gems Emerald Edition, p. 75, 2011.
[9] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, "Pannotia: Understanding irregular gpgpu graph applications," in Workload Characterization (IISWC), 2013 IEEE International Symposium on. IEEE, 2013, pp. 185–195.
[10] J. Cohen and P. Castonguay, "Efficient graph matching and coloring on the gpu," in GPU Technology Conference, 2012.
[11] B. W. Coon, J. R. Nickolls, J. E. Lindholm, R. J. Stoll, N. Wang, and J. H. Choquette, "Thread group scheduler for computing on a parallel thread processor," US Patent 8,732,713, 2014.
[12] G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili, "Relational algorithms for multi-bulk-synchronous processors," in 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'13), February 2013.
[13] W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt, "Dynamic warp formation and scheduling for efficient gpu control flow," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2007, pp. 407–420.
[14] B. R. Gaster and L. Howes, "Can gpgpu programming be liberated from the data-parallel bottleneck?" Computer, vol. 45, no. 8, pp. 42–52, 2012.
[15] K. Gupta, J. A. Stuart, and J. D. Owens, "A study of persistent threads style gpu programming for gpgpu workloads," in Innovative Parallel Computing (InPar), 2012. IEEE, 2012, pp. 1–14.
[16] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl, "An algorithmic framework for performing collaborative filtering," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
[17] Khronos, "The opencl specification version 2.0," 2014.
[18] A. Kuhl, "Thermodynamic states in explosion fields," in 14th International Symposium on Detonation, Coeur d'Alene Resort, ID, USA, 2010.
[19] M. Kulkarni, M. Burtscher, C. Casçaval, and K. Pingali, "Lonestar: A suite of parallel irregular programs," in ISPASS '09: IEEE International Symposium on Performance Analysis of Systems and Software, 2009.
[20] H. Lee, K. Brown, A. Sujeeth, T. Rompf, and K. Olukotun, "Locality-aware mapping of nested parallel patterns on gpus," in the 47th International Symposium on Microarchitecture (MICRO '14), 2014.
[21] J. McHugh, "Testing intrusion detection systems: a critique of the 1998 and 1999 darpa intrusion detection system evaluations as performed by lincoln laboratory," ACM Transactions on Information and System Security, vol. 3, no. 4, pp. 262–294, 2000.
[22] J. Meng, D. Tarjan, and K. Skadron, "Dynamic warp subdivision for integrated branch and memory divergence tolerance," in ACM SIGARCH Computer Architecture News, vol. 38, no. 3, 2010, pp. 235–246.
[23] D. Merrill, M. Garland, and A. Grimshaw, "Scalable gpu graph traversal," in 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'12), 2012.
[24] J. Mosegaard and T. S. Sørensen, "Real-time deformation of detailed geometry based on mappings to a less detailed physical simulation on the gpu," in Proceedings of the 11th Eurographics Conference on Virtual Environments. Eurographics Association, 2005, pp. 105–111.
[25] C. H. Nadungodage, Y. Xia, J. J. Lee, M. Lee, and C. S. Park, "Gpu accelerated item-based collaborative filtering for big-data applications," in Big Data, 2013 IEEE International Conference on. IEEE, 2013, pp. 175–180.
[26] NVIDIA, "Hyperq sample," 2012.
[27] ——, "Cuda c programming guide version 6.5," 2014.
[28] ——, "Cuda dynamic parallelism programming guide," 2014.
[29] M. S. Orr, B. M. Beckmann, S. K. Reinhardt, and D. A. Wood, "Fine-grain task aggregation and coordination on gpus," in Proceeding of the 41st Annual International Symposium on Computer Architecture, ser. ISCA '14, 2014.
[30] S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison et al., "Optix: a general purpose ray tracing engine," in ACM Transactions on Graphics (TOG), vol. 29, no. 4. ACM, 2010, p. 66.
[31] V. Podlozhnyuk, "Black-scholes option pricing," NVIDIA, 2007.
[32] T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-conscious wavefront scheduling," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.
[33] S. Solomon and P. Thulasiraman, "Performance study of mapping irregular computations on gpus," in Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on. IEEE, 2010, pp. 1–8.
[34] M. Steffen and J. Zambreno, "Improving simt efficiency of global rendering algorithms with architectural support for dynamic micro-kernels," in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '43, 2010.
[35] I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero, "Enabling preemptive multiprogramming on gpus," in Proceeding of the 41st Annual International Symposium on Computer Architecture, 2014.
[36] J. Wang and S. Yalamanchili, "Characterization and analysis of dynamic parallelism in unstructured gpu applications," in 2014 IEEE International Symposium on Workload Characterization, October 2014.
[37] L. Wang, S. Chen, Y. Tang, and J. Su, "Gregex: Gpu based high speed regular expression matching engine," in Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2011 Fifth International Conference on. IEEE, 2011, pp. 366–370.
[38] H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili, "Kernel weaver: Automatically fusing database primitives for efficient gpu computation," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.
[39] Y. Yang and H. Zhou, "Cuda-np: Realizing nested thread-level parallelism in gpgpu applications," in Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014.
[40] E. Z. Zhang, Y. Jiang, Z. Guo, K. Tian, and X. Shen, "On-the-fly elimination of dynamic irregularities for gpu computing," in ACM SIGARCH Computer Architecture News, vol. 39, no. 1, 2011, pp. 369–380.
