#include <tune_quda.h>

Public Member Functions
	Tunable ()
virtual	~Tunable ()
virtual TuneKey	tuneKey () const =0
virtual void	apply (const cudaStream_t &stream)=0
virtual void	preTune ()
virtual void	postTune ()
virtual int	tuningIter () const
virtual std::string	paramString (const TuneParam &param) const
virtual std::string	perfString (float time) const
virtual void	initTuneParam (TuneParam &param) const
virtual void	defaultTuneParam (TuneParam &param) const
virtual bool	advanceTuneParam (TuneParam &param) const
Protected Member Functions
virtual long long	flops () const
virtual long long	bytes () const
virtual int	sharedBytesPerThread () const =0
virtual int	sharedBytesPerBlock () const =0
virtual bool	advanceGridDim (TuneParam &param) const
virtual bool	advanceBlockDim (TuneParam &param) const
virtual bool	advanceSharedBytes (TuneParam &param) const

Detailed Description

Definition at line 66 of file tune_quda.h.

Constructor & Destructor Documentation

Tunable::Tunable ( ) [inline]

Definition at line 133 of file tune_quda.h.

virtual Tunable::~Tunable ( ) [inline, virtual]

Definition at line 134 of file tune_quda.h.

Member Function Documentation

virtual bool Tunable::advanceBlockDim ( TuneParam & param ) const [inline, protected, virtual]

Reimplemented in DslashCuda.

Definition at line 91 of file tune_quda.h.

virtual bool Tunable::advanceGridDim ( TuneParam & param ) const [inline, protected, virtual]

Reimplemented in DslashCuda and CloverCuda< sFloat, cFloat >.

Definition at line 78 of file tune_quda.h.

virtual bool Tunable::advanceSharedBytes ( TuneParam & param ) const [inline, protected, virtual]

The goal is to throttle the number of thread blocks per SM by over-allocating shared memory (for example, to improve L2 utilization). Note that:

On Fermi, requesting more than 16 KB of shared memory switches the cache configuration, so requests are restricted to 16 KB for now.
On GT200 and older, kernel arguments are passed via shared memory, so the available space may be smaller than 16 KB. The function therefore requests the smallest amount of dynamic shared memory that guarantees throttling to a given number of blocks, which leaves some leeway.

Definition at line 113 of file tune_quda.h.
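
The throttling arithmetic described for advanceSharedBytes can be illustrated with a small standalone sketch. This is not the code from tune_quda.h: the helper name nextThrottleBytes, its parameters, and the default cap of 16 resident blocks per SM are assumptions made for illustration. Only the 16 KB limit comes from the description above. Given the current shared memory request, the helper returns the smallest request that no longer lets the current number of blocks fit on one SM.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper, not part of tune_quda.h.
// Returns the smallest dynamic shared memory request (bytes per block) that
// prevents the current number of blocks from being resident on one SM, or -1
// once any larger request would exceed max_shared.
// max_blocks_per_sm = 16 is an assumed cap; real limits depend on the device.
inline int nextThrottleBytes(int current_bytes,
                             int max_shared = 16384,
                             int max_blocks_per_sm = 16)
{
  assert(current_bytes >= 0 && max_shared > 0 && max_blocks_per_sm > 0);

  // Number of blocks that fit today, judged by shared memory alone.
  int blocks = max_shared / std::max(current_bytes, 1);
  blocks = std::min(blocks, max_blocks_per_sm);
  if (blocks == 0) return -1; // current request already exceeds the limit

  // floor(max_shared / blocks) bytes still lets `blocks` blocks fit;
  // one byte more is the smallest request that does not.
  int next = max_shared / blocks + 1;
  return next > max_shared ? -1 : next;
}
```

Under these assumptions, starting from a request of 0 bytes the helper yields 1025 bytes (at most 15 blocks fit in 16 KB), then 1093 bytes (at most 14 blocks), and so on until a single block remains, at which point it returns -1. Requesting only the minimum at each step is what leaves the leeway mentioned above for devices where kernel arguments also occupy shared memory.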

virtual bool Tunable::advanceTuneParam ( TuneParam & param ) const [inline, virtual]

Definition at line 176 of file tune_quda.h.

virtual void Tunable::apply ( const cudaStream_t & stream ) [pure virtual]

Implemented in BlasCuda< FloatN, M, writeX, writeY, writeZ, writeW, InputX, InputY, InputZ, InputW, OutputX, OutputY, OutputZ, OutputW, Functor >, CopyCuda< FloatN, N, Output, Input >, WilsonDslashCuda< sFloat, gFloat >, CloverDslashCuda< sFloat, gFloat, cFloat >, TwistedDslashCuda< sFloat, gFloat >, DomainWallDslashCuda< sFloat, gFloat >, StaggeredDslashCuda< sFloat, fatGFloat, longGFloat >, CloverCuda< sFloat, cFloat >, TwistGamma5Cuda< sFloat >, and ReduceCuda< doubleN, ReduceType, ReduceSimpleType, FloatN, M, writeX, writeY, writeZ, InputX, InputY, InputZ, InputW, InputV, Reducer, OutputX, OutputY, OutputZ >.