QUDA  0.9.0
Namespaces | Macros | Functions
clover_deriv_quda.cu File Reference

This kernel has been a bit of a pain to optimize since it is excessively register bound. To reduce register pressure we use shared memory to help offload some of this pressure. Annoyingly, the optimal approach for CUDA 8.0 is not the same as CUDA 7.5, so implementation is compiler version dependent. The CUDA 8.0 optimal code runs 10x slower on 7.5, though the 7.5 code runs fine on 8.0. More...

#include <cstdio>
#include <cstdlib>
#include <cuda.h>
#include <quda_internal.h>
#include <tune_quda.h>
#include <gauge_field.h>
#include <gauge_field_order.h>
#include <quda_matrix.h>
#include <index_helper.cuh>
#include <cassert>
Include dependency graph for clover_deriv_quda.cu:

Go to the source code of this file.

Namespaces

 quda
 

Macros

#define DYNAMIC_MU_NU
 
#define SHARED_ACCUMULATOR
 
#define DECLARE_LINK(U)
 
#define LINK   real*
 
#define DECLARE_ARRAY(d, idx)   int d[4] = {0, 0, 0, 0};
 

Functions

template<typename real , typename Link >
__device__ void quda::axpy (real a, const real *x, Link &y)
 
template<typename real , typename Link >
__device__ void quda::operator+= (real *y, const Link &x)
 
template<typename real , typename Link >
__device__ void quda::operator-= (real *y, const Link &x)
 
void quda::cloverDerivative (cudaGaugeField &force, cudaGaugeField &gauge, cudaGaugeField &oprod, double coeff, QudaParity parity)
 Compute the derivative of the clover matrix in the direction mu,nu and compute the resulting force given the outer-product field. More...
 

Detailed Description

This kernel has been a bit of a pain to optimize since it is excessively register bound. To reduce register pressure we use shared memory to help offload some of this pressure. Annoyingly, the optimal approach for CUDA 8.0 is not the same as CUDA 7.5, so implementation is compiler version dependent. The CUDA 8.0 optimal code runs 10x slower on 7.5, though the 7.5 code runs fine on 8.0.

CUDA >= 8.0

CUDA <= 7.5

For the shared-memory dynamic indexing arrays, we use chars, since the array is 4-d, a 4-d coordinate can be stored in a single word which means that we will not have to worry about bank conflicts, and the shared array can be passed to the usual indexing routines (getCoordsExtended and linkIndexShift) with no code changes. This strategy works as long as each local lattice coordinate is less than 256.

Definition in file clover_deriv_quda.cu.

Macro Definition Documentation

◆ DECLARE_ARRAY

#define DECLARE_ARRAY (   d,
  idx 
)    int d[4] = {0, 0, 0, 0};

Definition at line 133 of file clover_deriv_quda.cu.

◆ DECLARE_LINK

#define DECLARE_LINK (   U)
Value:
extern __shared__ int s[]; \
real *U = (real*)s; \
{ \
const int tid = (threadIdx.z*blockDim.y + threadIdx.y)*blockDim.x + threadIdx.x; \
const int block = blockDim.x * blockDim.y * blockDim.z; \
for (int i=0; i<18; i++) force[i*block + tid] = 0.0; \
}
dim3 dim3 blockDim
for(int s=0;s< param.dc.Ls;s++)

Definition at line 64 of file clover_deriv_quda.cu.

◆ DYNAMIC_MU_NU

#define DYNAMIC_MU_NU

Definition at line 43 of file clover_deriv_quda.cu.

◆ LINK

#define LINK   real*

Definition at line 73 of file clover_deriv_quda.cu.

◆ SHARED_ACCUMULATOR

#define SHARED_ACCUMULATOR

Definition at line 47 of file clover_deriv_quda.cu.