PPQSort

PPQSort (Parallel Pattern QuickSort)

Parallel Pattern Quicksort (PPQSort) is a efficient implementation of parallel quicksort algorithm, written by using only the C++20 features without using third party libraries (such as Intel TBB). PPQSort draws inspiration from pdqsort, BlockQuicksort and cpp11sort and adds some further optimizations.

Focus on ease of use: Using only C++20 features, header only implementation, user-friendly API.
Comprehensive test suite: Ensures correctness and robustness through extensive testing.
Benchmarks shows great performance: Achieves impressive sorting times on various machines.

Integration

PPQSort is header only implementation. All the files needed are in include directory.

Add to existing CMake project using CPM.cmake:

include(cmake/CPM.cmake)
CPMAddPackage(
NAME PPQSort
GITHUB_REPOSITORY GabTux/PPQSort
VERSION 1.0.6 # change this to latest commit or release tag
)
target_link_libraries(YOUR_TARGET PPQSort::PPQSort)

Alternatively use FetchContent or just checkout the repository and add the include directory to the linker flags.

Usage

PPQSort has similiar API as std::sort, you can use ppqsort::execution::<policy> policies to specify how the sort should run.

// run parallel
ppqsort::sort(ppqsort::execution::par, input.begin(), input.end());

// Specify number of threads
ppqsort::sort(ppqsort::execution::par, input.begin(), input.end(), 16);

// provide custom comparator
ppqsort::sort(ppqsort::execution::par, input.begin(), input.end(), cmp);

// force branchless variant
ppqsort::sort(ppqsort::execution::par_force_branchless, input_str.begin(), input_str.end(), cmp);

PPQSort will by default use C++ threads, but if you prefer, you can link it with OpenMP and it will use OpenMP as a parallel backend. However you can still enforce C++ threads parallel backend even if linked with OpenMP:

#define FORCE_CPP
#include <ppqsort.h>
// ... rest of the code ...

Benchmark

We compared PPQSort with various parallel sorts. Benchmarks shows, that the PPQSort is one of the fastest parallel sorting algorithms across various input data and different machines.

Name	Algorithm	Memory usage	External dependencies	Highlight
PPQSort	Quicksort	in-place	None	parallel pattern quicksort algorithm
GCC BQS	Quicksort	in-place	OpenMP	allocating threads proportionally to subtask sizes
cpp11sort	Quicksort	in-place	None	Header-only, C++11 compliant
oneTBB parallel_sort	quicksort	out-place	oneTBB	Splits input to small tasks
poolSTL sort	Quicksort	in-place	None	Header-only, C++17 compliant
Boost block_indirect_sort	merging algorithm	out-place	Boost	Upper bounded small memory usage
AQsort	Quicksort	in-place	OpenMP	Allows the sorting of multiple datasets at once
MPQsort	Quicksort	in-place	OpenMP	Multiway Parallel Quicksort
IPS4o	Samplesort	in-place	oneTBB	Divides data into buckets and sort them recursively

Running on ARM cluster

Fujitsu A64FX CPU
NUMA architecture, 48 cores (4CPUs x 12cores)

Results for INT, input size was 2e9 (2 billions):

Algorithm	Random	Ascending	Descending	Rotated	OrganPipe	Heap	Total	Rank
PPQSort C++	5.84s	1.84s	4.55s	1.38s	2.96s	5.58s	22.15s	1
GCC BQS	13.72s	4.18s	19.11s	49.89s	8.24s	13.78s	108.92s	6
oneTBB	43.66s	0.09s	8.62s	13.84s	8.12s	43.9s	118.23s	9
poolSTL	34.63s	5.61s	7.23s	14.78s	7.81s	46.88s	116.94s	7
MPQsort	13.35s	5.74s	5.77s	4.67s	7.71s	12.87s	50.11s	5
cpp11sort	9.58s	2.47s	2.66s	5.47s	3.42s	9.9s	33.5s	3
AQsort	24.72s	3.66s	23.14s	21.83s	22.6s	25.31s	121.26s	8
Boost	8.2s	3.0s	4.26s	13.96s	6.97s	7.92s	44.31s	4
IPS$^4$o	4.8s	0.19s	5.97s	5.21s	5.59s	4.91s	26.67s	2

Summary

Extended benchmarks (detailed in forthcoming paper) shows that IPS4o (https://github.com/ips4o) often surpasses PPQSort in raw speed. However, IPS4o relies on the external library oneTBB (https://github.com/oneapi-src/oneTBB) introducing integration complexities. PPQSort steps up as a compelling alternative due to its:

Competitive Speed: Delivers performance comparable to IPS4o on most machines.
Hardware Agnostic: Maintains strong performance across various hardware, potentially surpassing IPS4o on specific systems, especially ARM platforms.
Dependency-Free: No external libraries are required, simplifying integration. For applications demanding a fast, dependency-free parallel sorting solution, PPQSort is an excellent choice.

PPQSort is also used in tracy project.

Running Tests and Benchmarks

Bash script for running or building specific components:

$ scripts/build.sh all
...
$ scripts/run.sh standalone
...

Note that the benchmark's CMake file will by default download sparse matrices (around 26GB).

Implementation

Details about the implementation and benchmark results can be found in the following resources:

IT SPY Poster
- This is a brief summary poster.
Full master thesis text
- a very detailed text, it also includes additional information not directly related to PPQSort, but related to parallel sorting.
Conference paper
- Presented on PPAM (Parallel Processing & Applied Mathematics)
Another forthcoming publication in a Concurrency and Computation: Practice and Experience journal.