This is the implementation of the color-spinor halo packer for an arbitrary field. This implementation uses the fine-grained accessors and should support all field types regardless of precision, number of colors or spins, etc.

#include <color_spinor_field.h>
#include <tune_quda.h>
#include <jitify_helper.cuh>
#include <kernels/color_spinor_pack.cuh>


Classes
class	quda::GenericPackGhostLauncher< Float, block_float, Ns, Ms, Nc, Mc, Arg >

struct	quda::precision_spin_color_mapper< T, G, nSpin, nColor_ >

struct	quda::precision_spin_color_mapper< T, G, 1, nColor_ >

struct	quda::precision_spin_color_mapper< float, short, 4, nColor_ >

struct	quda::precision_spin_color_mapper< float, char, 4, nColor_ >

struct	quda::precision_spin_color_mapper< double, double, 1, nColor_ >

struct	quda::precision_spin_color_mapper< double, double, 2, nColor_ >

struct	quda::precision_spin_color_mapper< double, double, 4, nColor_ >

struct	quda::spin_order_mapper< nSpin, order_ >

struct	quda::spin_order_mapper< 2, QUDA_FLOAT4_FIELD_ORDER >

struct	quda::spin_order_mapper< 1, QUDA_FLOAT4_FIELD_ORDER >

struct	quda::non_native_precision_mapper< typename >

struct	quda::non_native_precision_mapper< double >

struct	quda::non_native_precision_mapper< float >

struct	quda::non_native_precision_mapper< short >

struct	quda::non_native_precision_mapper< char >

struct	quda::float4_precision_mapper< T >

struct	quda::float4_precision_mapper< double >

struct	quda::float4_precision_mapper< short >

struct	quda::float4_precision_mapper< char >

Namespaces
	quda

Functions
void	quda::genericPackGhost (void **ghost, const ColorSpinorField &a, QudaParity parity, int nFace, int dagger, MemoryLocation *destination=nullptr)
	Generic ghost packing routine.

Detailed Description

This is the implementation of the color-spinor halo packer for an arbitrary field. This implementation uses the fine-grained accessors and should support all field types regardless of precision, number of colors or spins, etc.

Using a different precision for the field and for the halo is supported, though only QUDA_SINGLE_PRECISION fields with QUDA_HALF_PRECISION or QUDA_QUARTER_PRECISION halos are instantiated. When an integer format is requested for the halos, block-float format is used.
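To make the block-float idea concrete, the sketch below quantizes the real components of one site to 16-bit integers that share a single per-site scale (the site's maximum absolute component). It is a minimal host-side illustration under stated assumptions, not QUDA's kernel: the names `BlockFloatSite`, `pack_block_float` and `unpack_block_float` are hypothetical, and the normalization to 32767 is an assumed choice for a `short` halo.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical block-float representation of one site: a shared scale plus
// integer components normalized by that scale (not QUDA's actual layout).
struct BlockFloatSite {
  float scale;               // max |component| over the whole site
  std::vector<int16_t> data; // components mapped to [-32767, 32767]
};

// Quantize all components of a site against the site-wide maximum.
BlockFloatSite pack_block_float(const std::vector<float> &site) {
  float max_abs = 0.0f;
  for (float v : site) max_abs = std::max(max_abs, std::fabs(v));
  BlockFloatSite out{max_abs, {}};
  out.data.reserve(site.size());
  // Guard against an all-zero site, which would otherwise divide by zero.
  const float inv = max_abs > 0.0f ? 32767.0f / max_abs : 0.0f;
  for (float v : site)
    out.data.push_back(static_cast<int16_t>(std::lround(v * inv)));
  return out;
}

// Recover approximate float components from the shared scale.
std::vector<float> unpack_block_float(const BlockFloatSite &s) {
  std::vector<float> out;
  out.reserve(s.data.size());
  for (int16_t q : s.data) out.push_back(q * s.scale / 32767.0f);
  return out;
}
```

Because the scale is a maximum over every component of the site, computing it requires all of those components to be visible to the same group of cooperating threads, which is the origin of the thread-block constraint described below.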

As well as tuning basic block sizes, the autotuner also tunes the number of threads assigned across dimensions. E.g., dim_thread=1 means we have one thread for all dimensions, while dim_thread=4 means we have four threads (e.g., one per dimension). We always use separate threads for the forwards and backwards directions. Dimension, direction and parity are assigned to the z thread dimension.

When using block-float format, all spin and color components of a given site have to reside in the same thread block so that we can compute the maximum element. We therefore override the autotuner to keep the z thread dimension in the grid rather than the block, and allow smaller tuning increments of the thread block dimension in x to ensure that a site always fits within a single thread block. It is this constraint that gives rise to the need to cap the limit for block-float support, e.g., MAX_BLOCK_FLOAT_NC.

At present we launch a volume of threads (actually multiples thereof for direction / dimension), and thus we have coalesced reads but not coalesced writes. A more efficient implementation would launch a surface of threads for each halo, giving coalesced writes.
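The read/write asymmetry of a volume-wide launch can be seen with a simplified index model. The sketch below assumes a full 4D lattice in lexicographic order (x0 fastest) with no even-odd checkerboarding, a halo depth of one, and a ghost zone ordered lexicographically over the three remaining coordinates; all of these are simplifying assumptions, and `coords_from_index`, `ghost_index` and `writer_threads` are hypothetical helpers. Thread `t` reads site `t` (contiguous, hence coalesced), but only threads on the backward face of `dim` write.

```cpp
#include <array>
#include <vector>

using Coords = std::array<int, 4>;

// Lexicographic site index -> coordinates, x0 fastest (no checkerboarding).
Coords coords_from_index(int idx, const Coords &X) {
  Coords x{};
  for (int d = 0; d < 4; d++) {
    x[d] = idx % X[d];
    idx /= X[d];
  }
  return x;
}

// Index into the backward ghost zone of `dim` (depth 1): lexicographic over
// the three coordinates other than `dim`.
int ghost_index(const Coords &x, const Coords &X, int dim) {
  int idx = 0, stride = 1;
  for (int d = 0; d < 4; d++) {
    if (d == dim) continue;
    idx += x[d] * stride;
    stride *= X[d];
  }
  return idx;
}

// Model a volume-wide launch: for each ghost slot, record which thread
// (i.e. which site index) writes it.
std::vector<int> writer_threads(const Coords &X, int dim) {
  const int volume = X[0] * X[1] * X[2] * X[3];
  std::vector<int> writer(volume / X[dim], -1);
  for (int t = 0; t < volume; t++) {
    Coords x = coords_from_index(t, X);
    if (x[dim] == 0) writer[ghost_index(x, X, dim)] = t;
  }
  return writer;
}
```

On a 4^4 lattice the face of the outermost dimension (dim 3) is written by threads 0, 1, 2, ... in order, but for dim 0 only every fourth thread writes (thread 4k fills slot k), so most lanes of a warp are idle during the write. A surface launch with one thread per ghost slot would avoid this.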

Definition in file color_spinor_pack.cu.
