This kernel has been a bit of a pain to optimize since it is excessively register bound. To reduce register pressure we use shared memory to help offload some of this pressure. Annoyingly, the optimal approach for CUDA 8.0 is not the same as CUDA 7.5, so implementation is compiler version dependent. The CUDA 8.0 optimal code runs 10x slower on 7.5, though the 7.5 code runs fine on 8.0. More...

#include <cstdio>
#include <cstdlib>
#include <cuda.h>
#include <quda_internal.h>
#include <tune_quda.h>
#include <gauge_field.h>
#include <gauge_field_order.h>
#include <quda_matrix.h>
#include <index_helper.cuh>
#include <cassert>

Include dependency graph for clover_deriv_quda.cu:

Go to the source code of this file.

Namespaces
	quda

Macros
#define	DYNAMIC_MU_NU

#define	SHARED_ACCUMULATOR

#define	DECLARE_LINK(U)

#define	LINK real*

#define	DECLARE_ARRAY(d, idx) int d[4] = {0, 0, 0, 0};

Functions
template<typename real , typename Link >
__device__ void	quda::axpy (real a, const real *x, Link &y)

template<typename real , typename Link >
__device__ void	quda::operator+= (real *y, const Link &x)

template<typename real , typename Link >
__device__ void	quda::operator-= (real *y, const Link &x)

void	quda::cloverDerivative (cudaGaugeField &force, cudaGaugeField &gauge, cudaGaugeField &oprod, double coeff, QudaParity parity)
	Compute the derivative of the clover matrix in the direction mu,nu and compute the resulting force given the outer-product field. More...

Detailed Description

This kernel has been a bit of a pain to optimize since it is excessively register bound. To reduce register pressure we use shared memory to help offload some of this pressure. Annoyingly, the optimal approach for CUDA 8.0 is not the same as CUDA 7.5, so implementation is compiler version dependent. The CUDA 8.0 optimal code runs 10x slower on 7.5, though the 7.5 code runs fine on 8.0.

CUDA >= 8.0

Used shared memory for force accumulator matrix
Template mu / nu to prevent register spilling of indexing arrays
Force the computation routine to inline

CUDA <= 7.5

Used shared memory for force accumulator matrix
Keep mu/nu dynamic and use shared memory to store indexing arrays
Do not inline computation routine

For the shared-memory dynamic indexing arrays, we use chars, since the array is 4-d, a 4-d coordinate can be stored in a single word which means that we will not have to worry about bank conflicts, and the shared array can be passed to the usual indexing routines (getCoordsExtended and linkIndexShift) with no code changes. This strategy works as long as each local lattice coordinate is less than 256.

Definition in file clover_deriv_quda.cu.

Macro Definition Documentation

◆ DECLARE_ARRAY

#define DECLARE_ARRAY	(	d,
		idx
	)	int d[4] = {0, 0, 0, 0};

Definition at line 133 of file clover_deriv_quda.cu.

◆ DECLARE_LINK

#define DECLARE_LINK ( U )

Value:

extern __shared__ int s[];            \
  real *U = (real*)s;             \
  {                 \
    const int tid = (threadIdx.z*blockDim.y + threadIdx.y)*blockDim.x + threadIdx.x; \
    const int block = blockDim.x * blockDim.y * blockDim.z;   \
    for (int i=0; i<18; i++) force[i*block + tid] = 0.0;    \
  }

Definition at line 64 of file clover_deriv_quda.cu.

◆ DYNAMIC_MU_NU

#define DYNAMIC_MU_NU

Definition at line 43 of file clover_deriv_quda.cu.

◆ LINK

#define LINK real*

Definition at line 73 of file clover_deriv_quda.cu.

◆ SHARED_ACCUMULATOR

#define SHARED_ACCUMULATOR

Definition at line 47 of file clover_deriv_quda.cu.

Namespaces

Macros