QUDA
1.0.0
#include <cstdio>
#include <cstdlib>
#include <cuda.h>
#include <tune_quda.h>
#include <gauge_field.h>
#include <cassert>
#include <jitify_helper.cuh>
#include <kernels/clover_deriv.cuh>
Namespaces
quda
Functions
void quda::cloverDerivative (cudaGaugeField &force, cudaGaugeField &gauge, cudaGaugeField &oprod, double coeff, QudaParity parity)
Compute the derivative of the clover matrix in the direction mu,nu and compute the resulting force given the outer-product field.
This kernel has been difficult to optimize because it is heavily register bound. To reduce register pressure, we use shared memory to offload some of it. Annoyingly, the optimal approach for CUDA 8.0 is not the same as for CUDA 7.5, so the implementation is compiler-version dependent: the code that is optimal on CUDA 8.0 runs 10x slower on 7.5, though the 7.5 code runs fine on 8.0.
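The compiler-version dependence described above could be expressed as a compile-time switch. The following is only a minimal sketch, not the code in clover_deriv_quda.cu: `DerivPath`, `selectPath`, `compiledPath` and `DERIV_PATH_CUDA8` are hypothetical names introduced for illustration. The one assumption taken from the CUDA toolkit is the `CUDA_VERSION` macro from cuda.h, which encodes the version as major*1000 + minor*10 (7050 for CUDA 7.5, 8000 for CUDA 8.0).

```cpp
#include <cassert>

// CUDA_VERSION is provided by <cuda.h>. When that header is not included the
// macro is undefined, so this sketch falls back to the CUDA <= 7.5 path.
#if defined(CUDA_VERSION) && (CUDA_VERSION >= 8000)
#define DERIV_PATH_CUDA8 1
#else
#define DERIV_PATH_CUDA8 0
#endif

// Hypothetical tag naming the two code paths; not a QUDA identifier.
enum class DerivPath { Cuda75, Cuda8 };

// Pure function, so the selection rule can be checked for any version number.
constexpr DerivPath selectPath(int cudaVersion) {
  return cudaVersion >= 8000 ? DerivPath::Cuda8 : DerivPath::Cuda75;
}

// The path this translation unit was actually compiled for.
constexpr DerivPath compiledPath() {
  return DERIV_PATH_CUDA8 ? DerivPath::Cuda8 : DerivPath::Cuda75;
}
```

Because the 7.5 code is reported to run fine on 8.0 while the reverse is 10x slower, falling back to the 7.5 path when the version is unknown is the conservative choice in this sketch.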
CUDA >= 8.0
CUDA <= 7.5
The shared-memory dynamic indexing arrays use chars. Because each array is 4-d, a 4-d coordinate fits in a single 32-bit word, so we do not have to worry about bank conflicts, and the shared array can be passed to the usual indexing routines (getCoordsExtended and linkIndexShift) with no code changes. This strategy works as long as each local lattice coordinate is less than 256.
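The packing argument can be checked on the host. This is a hedged sketch rather than QUDA code: `Coord4`, `packCoord`, `asWord` and `coord` are hypothetical names, and `unsigned char` is an assumption made here so that every value below 256 is representable (the description above says only "chars").

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical container for one 4-d lattice coordinate; not a QUDA type.
struct Coord4 {
  unsigned char x[4];  // one byte per lattice dimension
};
static_assert(sizeof(Coord4) == sizeof(std::uint32_t),
              "four chars fill exactly one 32-bit word");

// Narrow a 4-d integer coordinate to bytes. Each component must be in
// [0, 256), which is the limit stated for the local lattice extent.
inline Coord4 packCoord(const int c[4]) {
  Coord4 p;
  for (int d = 0; d < 4; d++) {
    assert(c[d] >= 0 && c[d] < 256);
    p.x[d] = static_cast<unsigned char>(c[d]);
  }
  return p;
}

// View the four bytes as the single word they occupy in memory.
inline std::uint32_t asWord(const Coord4 &p) {
  std::uint32_t w;
  std::memcpy(&w, p.x, sizeof(w));
  return w;
}

// Read one component back as an int, as an indexing routine would.
inline int coord(const Coord4 &p, int d) { return p.x[d]; }
```

Since the components are still addressed as `x[0]` to `x[3]`, code written for an `int[4]` coordinate array can read them unchanged, which is the property the description relies on.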
Definition in file clover_deriv_quda.cu.