Cache-Conscious Wavefront Scheduling (CCWS) was presented at MICRO 2012 by Timothy G. Rogers and Tor M. Aamodt of the University of British Columbia, together with Mike O'Connor of AMD Research. Tor Aamodt is a professor in the Department of Electrical and Computer Engineering at the University of British Columbia, where he has been a faculty member since 2006. At its core, CCWS dynamically determines how many wavefronts are allowed to access the memory system and which wavefronts those should be.
The paper studies the effects of hardware thread scheduling on cache management in GPUs, a topic it shares with related work such as thread block compaction for efficient SIMT control flow, barrier-aware warp scheduling for throughput processors, alternative thread block scheduling for improving GPGPU resource utilization, and software/hardware co-managed cache architectures. The authors propose cache-conscious wavefront scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.
The full citation is T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Wavefront Scheduling," in Proceedings of the 45th IEEE/ACM International Symposium on Microarchitecture (MICRO-45), 2012. The primary contribution of this work is a cache-conscious wavefront scheduling (CCWS) system that uses locality information from the memory system to shape future memory accesses through hardware thread scheduling. The motivation is that current GPUs handle bursts of long-latency memory accesses poorly because of their simplistic warp scheduling, which leaves performance on the table when many wavefronts contend for cache capacity. Follow-on and related work builds on the same observation, including continuation analysis tasks for GPU task scheduling, architectural support for address translation and virtual memory on GPUs (the subject of Bharath Subramanian Pichai's thesis at Rutgers, The State University of New Jersey), and cache-locality-aware scheduling to improve GPGPU performance.
Pairing CCWS with other mechanisms has also been evaluated, for example measuring TLBs per shader core with and without cache-conscious wavefront scheduling and with and without thread block compaction, and combining it with alternative thread block scheduling to improve GPGPU resource utilization (Minseok Lee, Seokwoo Song, Joosik Moon, and John Kim). While current GPUs employ a per-warp (per-wavefront) stack to manage divergent control flow, that approach loses efficiency for applications with nested, data-dependent control flow. CCWS itself uses a novel lost intra-wavefront locality detector (LLD) to update an adaptive locality scoring system, and it improves the performance of highly cache-sensitive (HCS) workloads by 63% over existing wavefront schedulers.
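The paper specifies the lost-locality detector and scoring logic as microarchitecture; the following Python sketch is only a behavioral approximation of that idea. The class names, the victim-tag array size, and the score bump, decay, and cutoff constants are all illustrative assumptions, not values from the paper.

```python
"""Behavioral sketch (not RTL-accurate) of CCWS-style wavefront throttling.

Assumptions: the tag-array size, the score constants, and all class and
function names are illustrative only.
"""
from collections import deque

NUM_WAVEFRONTS = 8
VICTIM_TAGS_PER_WF = 16     # small per-wavefront victim tag array (assumed size)
SCORE_BUMP = 32             # added on a detected lost-locality hit (assumed)
SCORE_DECAY = 1             # subtracted every scheduling step (assumed)
BASE_SCORE = 16             # every wavefront's minimum score (assumed)
CUMULATIVE_CUTOFF = BASE_SCORE * NUM_WAVEFRONTS  # total "cache points" available

class LostLocalityDetector:
    """Per-wavefront victim tags: an L1 miss that hits here means this
    wavefront recently held that line and lost it to contention."""
    def __init__(self):
        self.tags = {wf: deque(maxlen=VICTIM_TAGS_PER_WF)
                     for wf in range(NUM_WAVEFRONTS)}

    def on_l1_eviction(self, wf, line_addr):
        self.tags[wf].append(line_addr)

    def on_l1_miss(self, wf, line_addr):
        """Return True if this miss is lost intra-wavefront locality."""
        return line_addr in self.tags[wf]

class LocalityScoring:
    """Adaptive locality scoring: wavefronts that keep losing locality get
    larger scores; stacking the scores against a fixed cutoff limits how
    many wavefronts may issue loads."""
    def __init__(self):
        self.score = {wf: BASE_SCORE for wf in range(NUM_WAVEFRONTS)}

    def lost_locality(self, wf):
        self.score[wf] += SCORE_BUMP

    def decay(self):
        for wf in self.score:
            self.score[wf] = max(BASE_SCORE, self.score[wf] - SCORE_DECAY)

    def can_issue_load(self):
        """Stack wavefronts from highest to lowest score; those whose share
        still fits under the cutoff may issue memory instructions."""
        allowed, total = set(), 0
        for wf in sorted(self.score, key=self.score.get, reverse=True):
            total += self.score[wf]
            if total <= CUMULATIVE_CUTOFF:
                allowed.add(wf)
        return allowed

# Toy usage: wavefront 3 keeps thrashing its own data in the L1.
lld, scoring = LostLocalityDetector(), LocalityScoring()
lld.on_l1_eviction(3, 0xBEEF)
if lld.on_l1_miss(3, 0xBEEF):
    scoring.lost_locality(3)
scoring.decay()
print("wavefronts allowed to issue loads:", sorted(scoring.can_issue_load()))
```

The property this sketch preserves is the one the paper relies on: a wavefront caught losing its own locality grows its score, which pushes the lowest-scored wavefronts out of the set allowed to issue loads and effectively hands their cache capacity to the wavefront that needs it.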
The motivation for such schedulers is that the GPU cache is inefficient due to a mismatch between the throughput-oriented execution model and the modest capacity of the on-chip caches. Divergence-Aware Warp Scheduling (DAWS), a follow-up to CCWS, attempts to shift the burden of locality management from software to hardware, increasing the performance of simpler and more portable code. By contrast, the cooperative-thread-array (CTA) schedulers employed by current GPGPUs greedily issue CTAs to GPU cores as soon as resources become available, in pursuit of higher thread-level parallelism.
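As a point of contrast with the locality-aware schedulers, the greedy CTA issue policy described above can be sketched in a few lines of Python; the per-core resource amounts and all names below are illustrative, not those of any particular GPU.

```python
"""Minimal sketch of a greedy CTA (thread block) scheduler: issue a CTA to
any core as soon as that core has enough free resources. All quantities and
names are illustrative assumptions."""
from dataclasses import dataclass, field

@dataclass
class Core:
    regs_free: int = 65536      # assumed register file size per core
    smem_free: int = 49152      # assumed shared memory per core (bytes)
    slots_free: int = 16        # assumed maximum resident CTAs
    resident: list = field(default_factory=list)

    def fits(self, cta):
        return (self.regs_free >= cta["regs"] and
                self.smem_free >= cta["smem"] and
                self.slots_free >= 1)

    def issue(self, cta):
        self.regs_free -= cta["regs"]
        self.smem_free -= cta["smem"]
        self.slots_free -= 1
        self.resident.append(cta["id"])

def greedy_schedule(pending_ctas, cores):
    """Greedily place each pending CTA on the first core with room,
    maximizing occupancy without regard to cache locality."""
    for cta in pending_ctas:
        for core in cores:
            if core.fits(cta):
                core.issue(cta)
                break

cores = [Core() for _ in range(4)]
pending = [{"id": i, "regs": 16384, "smem": 12288} for i in range(20)]
greedy_schedule(pending, cores)
for i, core in enumerate(cores):
    print(f"core {i}: resident CTAs {core.resident}")
```

Greedy issue maximizes occupancy, which is exactly the behavior the locality-aware schedulers above push back against when too many resident threads thrash the L1.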
Like traditional attempts to optimize cache replacement and insertion policies, CCWS attempts to predict when cache lines will be reused; unlike those policies, it acts through the thread scheduler. The article version of the work studies a set of economically important server applications and presents the CCWS hardware mechanism, which uses feedback from the memory system to guide the issue-level thread scheduler and shape the access pattern seen by the first-level cache. These ideas are developed at length in Timothy Glenn Rogers' doctoral thesis, "Locality and Scheduling in the Massively Multithreaded Era" (B.Eng., McGill University, 2005), submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering in the Faculty of Graduate and Postdoctoral Studies at The University of British Columbia. Rogers' follow-up work on Divergence-Aware Warp Scheduling uses online characterization to create a cache footprint prediction for each warp. The background for all of these schedulers is the SIMT execution model: the hardware groups threads into warps (wavefronts) and executes them in lockstep, dubbed single-instruction, multiple-thread (SIMT) by NVIDIA.
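To make the lockstep execution and the per-wavefront divergence stack concrete, here is a toy Python model of a single warp running an invented five-instruction program with one data-dependent branch. Real hardware reconverges at immediate post-dominators and keeps the stack in dedicated per-warp storage; the program, its encoding, and every name here are illustrative.

```python
"""Toy model of per-warp SIMT divergence handling with a reconvergence stack.
The five-instruction program and its encoding are invented for illustration."""

WARP_SIZE = 8

def next_pc(pc):
    # pc 2 and pc 3 are the two sides of the branch; both jump to pc 4.
    return 4 if pc in (2, 3) else pc + 1

def run_warp(threads):
    trace = []                                    # (pc, active thread ids) per lockstep step
    stack = [(0, frozenset(threads), None)]       # entries: (pc, active_mask, reconvergence_pc)
    while stack:
        pc, mask, rpc = stack.pop()
        while mask and pc != rpc and pc < 5:
            if pc == 1:                           # data-dependent branch: odd tids take it
                taken = frozenset(t for t in mask if t % 2)
                not_taken = mask - taken
                stack.append((4, mask, rpc))      # resume here once both sides finish
                if not_taken:
                    stack.append((3, not_taken, 4))   # else path, stops at the reconvergence pc
                if taken:
                    stack.append((2, taken, 4))       # then path
                break
            trace.append((pc, sorted(mask)))      # one lockstep SIMT step for all active lanes
            pc = next_pc(pc)
    return trace

for pc, active in run_warp(range(WARP_SIZE)):
    print(f"pc={pc} active={active}")
```

The printed trace shows all eight lanes executing together before and after the branch, and only half of them active on each divergent path, which is the efficiency loss the text above attributes to nested, data-dependent control flow.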
Other schedulers pursue related goals, for example maximizing memory-level parallelism for GPUs with coordinated warp and fetch scheduling. DAWS uses its cache footprint predictions to schedule warps such that the data reused by active scalar threads is unlikely to exceed the capacity of the L1 data cache.
Thus, the benefits of the GPU's high computing ability are reduced dramatically by poor cache management and warp scheduling, which limit system performance and energy efficiency. Unlike prior work on cache-conscious wavefront scheduling, which makes reactive scheduling decisions based on detected cache thrashing, DAWS makes proactive scheduling decisions based on cache usage predictions.
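A rough sketch of that proactive policy follows, assuming each warp entering a loop with locality has a footprint proportional to its number of active (non-divergent) lanes. The L1 size, the per-lane footprint, and the preference for smaller footprints are illustrative assumptions rather than details taken from the DAWS paper.

```python
"""Sketch of DAWS-style proactive throttling: estimate each warp's cache
footprint (scaled by how many of its lanes are active, i.e. by divergence)
and keep warps scheduled only while the combined prediction fits in the L1
data cache. Sizes, the selection heuristic, and names are illustrative."""

L1_CAPACITY_LINES = 512          # assumed: 32 KB L1 with 64 B lines
LINES_PER_ACTIVE_LANE = 4        # assumed per-lane footprint inside the loop

def predicted_footprint(active_lanes):
    """Divergence-aware prediction: fewer active lanes means a smaller footprint."""
    return active_lanes * LINES_PER_ACTIVE_LANE

def daws_select(warps):
    """warps: dict warp_id -> active lanes in the loop it is entering.
    Returns the warp ids allowed to run so predictions stay within L1."""
    allowed, used = [], 0
    # Illustrative heuristic: admit smaller predicted footprints first.
    for wid, lanes in sorted(warps.items(), key=lambda kv: predicted_footprint(kv[1])):
        need = predicted_footprint(lanes)
        if used + need <= L1_CAPACITY_LINES:
            allowed.append(wid)
            used += need
    return allowed, used

warps = {0: 32, 1: 32, 2: 8, 3: 32, 4: 16, 5: 32}  # active lanes per warp
allowed, used = daws_select(warps)
print(f"scheduled warps: {allowed} (predicted {used} of {L1_CAPACITY_LINES} lines)")
```

The contrast with the CCWS sketch earlier is the timing of the decision: here the footprint is predicted before the warp issues its loads, instead of reacting after lost locality is detected.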
A similar scheduling trade-off appears at the software level in wavefront-pattern computations: scheduling each F[i][j] element calculation separately is prohibitively expensive. A good solution is to aggregate the elements into contiguous blocks and process the contents of a block serially. The blocks have the same dependence pattern as the individual elements, but at a block scale, so scheduling overheads can be amortized over blocks; a sketch follows below.
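The sketch below assumes a dynamic-programming-style recurrence in which each F[i][j] depends on its west, north, and north-west neighbours, which the surrounding text does not actually define; the block size and the stand-in recurrence are illustrative. Blocks on the same anti-diagonal are mutually independent, so only one small set of blocks has to be dispatched per step.

```python
"""Wavefront-pattern blocking sketch. Assumes (illustratively) that each
F[i][j] depends on F[i-1][j], F[i][j-1], and F[i-1][j-1]. Instead of
scheduling every element, blocks on the same anti-diagonal are scheduled
together and each block is processed serially, amortizing overhead."""

N, B = 8, 4                      # matrix size and block size (illustrative)

def process_block(F, bi, bj):
    """Serially compute all elements of block (bi, bj)."""
    for i in range(bi * B, min((bi + 1) * B, N)):
        for j in range(bj * B, min((bj + 1) * B, N)):
            west = F[i][j - 1] if j else 0
            north = F[i - 1][j] if i else 0
            north_west = F[i - 1][j - 1] if i and j else 0
            F[i][j] = max(west, north, north_west) + 1   # stand-in recurrence

def wavefront_blocks(F):
    """Blocks on one anti-diagonal are independent of each other, so the
    scheduler only dispatches one small set of blocks per diagonal."""
    nb = (N + B - 1) // B
    for diag in range(2 * nb - 1):
        ready = [(bi, diag - bi) for bi in range(nb) if 0 <= diag - bi < nb]
        for bi, bj in ready:          # these could be dispatched in parallel
            process_block(F, bi, bj)

F = [[0] * N for _ in range(N)]
wavefront_blocks(F)
print(F[N - 1][N - 1])               # corner value of the stand-in recurrence (15 here)
```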
GPGPU computing improves performance through massive parallelism, and the paper shows that the proposed CCWS mechanism can be implemented with no changes to the cache replacement policy. Rogers' stated research interest, hardware architectures and software systems that enable programmer productivity in a performant and energy-efficient manner, extends beyond scheduling for locality. Patent-style disclosures on continuation analysis tasks (CATs) for GPU task scheduling describe systems, apparatuses, and methods in which a continuation packet is referenced directly by a first task; when the first task completes, it enqueues the continuation packet, and in one embodiment a system implements hardware acceleration of CATs to manage the dependencies and scheduling of an application composed of multiple tasks.
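A software analogy of that flow is sketched below; the packet fields, the queue, and all names are invented for illustration, whereas the disclosed mechanism performs this in hardware as part of GPU task scheduling.

```python
"""Software analogy of continuation analysis tasks (CATs): a task directly
references a continuation packet, and when the task completes it enqueues
that packet so a dependent task can be scheduled. All structures and names
here are illustrative, not from the disclosure itself."""
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ContinuationPacket:
    follow_up: Callable[[], None]        # work to launch once the producer finishes

@dataclass
class Task:
    name: str
    body: Callable[[], None]
    continuation: Optional[ContinuationPacket] = None   # referenced directly by the task

ready_queue: deque = deque()

def run(task: Task):
    task.body()
    if task.continuation is not None:
        # On completion, the task enqueues its continuation packet; a
        # scheduler (hardware in the disclosure, this loop here) picks it up.
        ready_queue.append(task.continuation)

first = Task(
    name="producer_kernel",
    body=lambda: print("producer kernel finished"),
    continuation=ContinuationPacket(follow_up=lambda: print("dependent kernel launched")),
)

run(first)
while ready_queue:                        # drain continuations, launching dependents
    ready_queue.popleft().follow_up()
```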
The same line of work considers mitigating GPU memory divergence for data-intensive applications, and evaluations typically compare static wavefront limiting (SWL), cache-conscious wavefront scheduling (CCWS), and memory-aware scheduling. The stated goal of the MICRO 2012 work is to understand the relationship between warp (wavefront) schedulers and locality behavior; SWL, the simplest point in that design space, is sketched below.
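A minimal sketch of SWL follows; the cap value and the oldest-first eligibility rule are illustrative assumptions, and unlike CCWS nothing here adapts at run time.

```python
"""Static wavefront limiting (SWL) sketch: a fixed, per-benchmark cap on how
many wavefronts the issue stage may choose from, with the rest held back."""

def swl_issue_pool(ready_wavefronts, limit=6):
    """Only the `limit` lowest-numbered (oldest, by assumption) wavefronts
    are eligible to issue; the cap itself is chosen offline per benchmark."""
    return sorted(ready_wavefronts)[:limit]

print(swl_issue_pool(range(32)))   # with a cap of 6: [0, 1, 2, 3, 4, 5]
```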