QUDA
1.0.0
|
This is the implementation of the color-spinor halo packer for an arbitrary field. This implementation uses the fine-grained accessors and should support all field types reqgardless of precision, number of color or spins etc. More...
#include <color_spinor_field.h>
#include <tune_quda.h>
#include <jitify_helper.cuh>
#include <kernels/color_spinor_pack.cuh>
Go to the source code of this file.
Namespaces | |
quda | |
Functions | |
void | quda::genericPackGhost (void **ghost, const ColorSpinorField &a, QudaParity parity, int nFace, int dagger, MemoryLocation *destination=nullptr) |
Generic ghost packing routine. More... | |
This is the implementation of the color-spinor halo packer for an arbitrary field. This implementation uses the fine-grained accessors and should support all field types reqgardless of precision, number of color or spins etc.
Using a different precision of the field and of the halo is supported, though only QUDA_SINGLE_PRECISION fields with QUDA_HALF_PRECISION or QUDA_QUARTER_PRECISION halos are instantiated. When an integer format is requested for the halos then block-float format is used.
As well as tuning basic block sizes, the autotuner also tunes for the dimensions to assign to each thread. E.g., dim_thread=1 means we have one thread for all dimensions, dim_thread=4 means we have four threads (e.g., one per dimension). We always uses seperate threads for forwards and backwards directions. Dimension, direction and parity are assigned to the z thread dimension.
If doing block-float format, since all spin and color components of a given site have to reside in the same thread block (to allow us to compute the max element) we override the autotuner to keep the z thread dimensions in the grid and not the block, and allow for smaller tuning increments of the thread block dimension in x to ensure that we can always fit within a single thread block. It is this constraint that gives rise for the need to cap the limit for block-float support, e.g., MAX_BLOCK_FLOAT_NC.
At present we launch a volume of threads (actually multiples thereof for direction / dimension) and thus we have coalesced reads but not coalesced writes. A more optimal implementation will launch a surface of threads for each halo giving coalesced writes.
Definition in file color_spinor_pack.cu.