Item Details

Radix Sort for Stream Architectures

Merrill, Duane; Grimshaw, Andrew
Format
Report
Author
Merrill, Duane
Grimshaw, Andrew
Abstract
This report presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Our radix sorting methods demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). Compared to the state-of-the-art, our implementations exhibit speedup of at least 2x for all fully-programmable generations of NVIDIA GPUs, and up to 3.7x for current GT200-based models. For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture. We obtain our sorting performance by using a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi- scan). These abstractions allow us to improve the overall utilization of memory and computational resources while maintaining the flexibility of a reusable component. We require 38% fewer bytes to be moved through the global memory subsystem and a 64% reduction in the number of thread-cycles needed for computation. As part of this work, we demonstrate a method for encoding multiple compaction problems into a single, composite parallel scan. This technique provides our local sorting strategies with a 2.5x speedup over bitonic sorting networks for small problem instances, i.e., sequences that can be entirely sorted within the shared memory local to a single GPU core.
Language
English
Date Received
20121029
Published
University of Virginia, Department of Computer Science, 2010
Published Date
2010
Collection
Libra Open Repository
Logo for In CopyrightIn Copyright

Availability

Access Online