This kernel has been a bit of a pain to optimize since it is excessively register bound. To reduce register pressure we use shared memory to help offload some of this pressure. Annoyingly, the optimal approach for CUDA 8.0 is not the same as CUDA 7.5, so implementation is compiler version dependent. The CUDA 8.0 optimal code runs 10x slower on 7.5, though the 7.5 code runs fine on 8.0. More...

#include <cstdio>
#include <cstdlib>
#include <cuda.h>
#include <tune_quda.h>
#include <gauge_field.h>
#include <cassert>
#include <jitify_helper.cuh>
#include <kernels/clover_deriv.cuh>

Include dependency graph for clover_deriv_quda.cu:

Go to the source code of this file.

Namespaces
	quda

Functions
void	quda::cloverDerivative (cudaGaugeField &force, cudaGaugeField &gauge, cudaGaugeField &oprod, double coeff, QudaParity parity)
	Compute the derivative of the clover matrix in the direction mu,nu and compute the resulting force given the outer-product field. More...

Detailed Description

This kernel has been a bit of a pain to optimize since it is excessively register bound. To reduce register pressure we use shared memory to help offload some of this pressure. Annoyingly, the optimal approach for CUDA 8.0 is not the same as CUDA 7.5, so implementation is compiler version dependent. The CUDA 8.0 optimal code runs 10x slower on 7.5, though the 7.5 code runs fine on 8.0.

CUDA >= 8.0

Used shared memory for force accumulator matrix
Template mu / nu to prevent register spilling of indexing arrays
Force the computation routine to inline

CUDA <= 7.5

Used shared memory for force accumulator matrix
Keep mu/nu dynamic and use shared memory to store indexing arrays
Do not inline computation routine

For the shared-memory dynamic indexing arrays, we use chars, since the array is 4-d, a 4-d coordinate can be stored in a single word which means that we will not have to worry about bank conflicts, and the shared array can be passed to the usual indexing routines (getCoordsExtended and linkIndexShift) with no code changes. This strategy works as long as each local lattice coordinate is less than 256.

Definition in file clover_deriv_quda.cu.

Namespaces

Functions

Detailed Description