|
QUDA
0.9.0
|
This kernel has been a bit of a pain to optimize since it is excessively register bound. To reduce register pressure we use shared memory to help offload some of this pressure. Annoyingly, the optimal approach for CUDA 8.0 is not the same as CUDA 7.5, so implementation is compiler version dependent. The CUDA 8.0 optimal code runs 10x slower on 7.5, though the 7.5 code runs fine on 8.0. More...
#include <cstdio>#include <cstdlib>#include <cuda.h>#include <quda_internal.h>#include <tune_quda.h>#include <gauge_field.h>#include <gauge_field_order.h>#include <quda_matrix.h>#include <index_helper.cuh>#include <cassert>
Go to the source code of this file.
Namespaces | |
| quda | |
Macros | |
| #define | DYNAMIC_MU_NU |
| #define | SHARED_ACCUMULATOR |
| #define | DECLARE_LINK(U) |
| #define | LINK real* |
| #define | DECLARE_ARRAY(d, idx) int d[4] = {0, 0, 0, 0}; |
Functions | |
| template<typename real , typename Link > | |
| __device__ void | quda::axpy (real a, const real *x, Link &y) |
| template<typename real , typename Link > | |
| __device__ void | quda::operator+= (real *y, const Link &x) |
| template<typename real , typename Link > | |
| __device__ void | quda::operator-= (real *y, const Link &x) |
| void | quda::cloverDerivative (cudaGaugeField &force, cudaGaugeField &gauge, cudaGaugeField &oprod, double coeff, QudaParity parity) |
| Compute the derivative of the clover matrix in the direction mu,nu and compute the resulting force given the outer-product field. More... | |
This kernel has been a bit of a pain to optimize since it is excessively register bound. To reduce register pressure we use shared memory to help offload some of this pressure. Annoyingly, the optimal approach for CUDA 8.0 is not the same as CUDA 7.5, so implementation is compiler version dependent. The CUDA 8.0 optimal code runs 10x slower on 7.5, though the 7.5 code runs fine on 8.0.
CUDA >= 8.0
CUDA <= 7.5
For the shared-memory dynamic indexing arrays, we use chars, since the array is 4-d, a 4-d coordinate can be stored in a single word which means that we will not have to worry about bank conflicts, and the shared array can be passed to the usual indexing routines (getCoordsExtended and linkIndexShift) with no code changes. This strategy works as long as each local lattice coordinate is less than 256.
Definition in file clover_deriv_quda.cu.
Definition at line 133 of file clover_deriv_quda.cu.
| #define DECLARE_LINK | ( | U | ) |
Definition at line 64 of file clover_deriv_quda.cu.
| #define DYNAMIC_MU_NU |
Definition at line 43 of file clover_deriv_quda.cu.
| #define LINK real* |
Definition at line 73 of file clover_deriv_quda.cu.
| #define SHARED_ACCUMULATOR |
Definition at line 47 of file clover_deriv_quda.cu.
1.8.14