The Parallel Thread Execution (PTX) programming model is explicitly parallel: a PTX 
program specifies the execution of a given thread of a parallel thread array.  A cooperative 
thread array, or CTA, is an array of threads that execute a kernel concurrently or in parallel. 
Threads within a CTA can communicate with each other.  To coordinate the communication 
of the threads within the CTA, one can specify synchronization points where threads wait 
until all threads in the CTA have arrived. 
Each thread has a unique thread id within the CTA.  Programs use a data parallel 
decomposition to partition inputs, work, and results across the threads of the CTA.  Each 
CTA thread uses its thread id to determine its assigned role, assign specific input and output 
positions, compute addresses, and select work to perform.  The thread id is a three-element 
vector tid (with elements tid.x, tid.y, and tid.z) that specifies the thread’s position within a 
1D, 2D, or 3D CTA.  Each thread id component ranges from 0 up to, but not including, the 
number of thread ids in that CTA dimension. 
Each CTA has a 1D, 2D, or 3D shape specified by a three-element vector ntid (with 
elements ntid.x, ntid.y, and ntid.z). The vector ntid specifies the number of threads in each 
CTA dimension. 
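
To make the tid/ntid relationship concrete, here is a minimal host-side sketch in Python (a model only, not GPU code). The helper names `cta_thread_ids` and `flat_index` are made up for illustration, and the x-fastest linearization is an assumption borrowed from the usual CUDA convention; the excerpt itself does not define a flat index.

```python
from itertools import product


def cta_thread_ids(ntid):
    """Yield every (x, y, z) thread id in a CTA of shape ntid = (nx, ny, nz)."""
    nx, ny, nz = ntid
    # Iterate with z slowest and x fastest, so ids come out in flat-index order.
    for z, y, x in product(range(nz), range(ny), range(nx)):
        yield (x, y, z)


def flat_index(tid, ntid):
    """Linearize a three-element tid, x varying fastest (assumed convention)."""
    x, y, z = tid
    nx, ny, nz = ntid
    # Each component must lie in 0 .. ntid-1 for its dimension.
    if not (0 <= x < nx and 0 <= y < ny and 0 <= z < nz):
        raise ValueError(f"tid {tid} is outside CTA shape {ntid}")
    return x + y * nx + z * nx * ny
```

For a 2D CTA with ntid = (4, 2, 1), `cta_thread_ids` yields 8 ids, and `flat_index((3, 1, 0), (4, 2, 1))` gives 7, the last thread. A thread could use such an index to pick the input element it is responsible for.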
Threads within a CTA execute in SIMT (single-instruction, multiple-thread) fashion in 
groups called warps.  A warp is a maximal subset of threads from a single CTA, such that the 
threads execute the same instructions at the same time.  Threads within a warp are 
sequentially numbered.  The warp size is a machine-dependent constant.  Typically, a warp 
has 32 threads.  Some applications may be able to maximize performance with knowledge of 
the warp size, so PTX includes a run-time immediate constant, WARP_SZ, which may be 
used in any instruction where an immediate operand is allowed.

http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/2.0/ptx_isa_12.pdf 
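
As a rough illustration of the warp description in the excerpt above, the sketch below models in Python (host side, not GPU code) how a CTA might be split into warps. It assumes a warp size of 32, the "typical" value from the excerpt, and assumes that consecutive flat thread indices (x varying fastest) share a warp, as in the usual CUDA convention. The function names are hypothetical.

```python
WARP_SZ = 32  # assumed; the real value is a machine-dependent constant


def warp_count(ntid, warp_sz=WARP_SZ):
    """Number of warps needed for a CTA of shape ntid = (nx, ny, nz)."""
    nx, ny, nz = ntid
    total_threads = nx * ny * nz
    # Round up: the last warp may be only partially filled.
    return (total_threads + warp_sz - 1) // warp_sz


def warp_and_lane(tid, ntid, warp_sz=WARP_SZ):
    """Return (warp index, position within the warp) for a thread id."""
    x, y, z = tid
    nx, ny, _ = ntid
    flat = x + y * nx + z * nx * ny  # assumed x-fastest linearization
    return divmod(flat, warp_sz)
```

Under these assumptions a 16x16 CTA has 8 full warps, while a 10x10 CTA needs 4 warps with the last one only partly occupied. This is one reason code that knows the warp size can choose CTA shapes that are multiples of it.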



What is this even saying; my rough translation:

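An array of threads that execute in parallel.
Threads in a CTA can communicate with each other. Synchronization points can be specified, and each thread waits there until all the threads in the CTA have arrived.
Each thread has a thread id, which it uses to determine its assigned role.
The tid is three-dimensional, though: x, y, z.
ntid (x, y, z) is the size of the CTA: the number of threads in each dimension.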
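Threads in a CTA execute in SIMT fashion, grouped into warps. A warp is a maximal subset of threads from a single CTA that execute the same instructions at the same time. Threads in a warp are numbered sequentially, and the warp size differs from machine to machine; typically 32 threads make up a warp. Some apps can improve performance if they know the warp size, so PTX includes a run-time immediate constant, WARP_SZ. (It is a machine-dependent constant that only reports the warp size; it cannot be used to change the warp size.)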



 

A “cooperative thread array,” or “CTA,” is a group of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique thread identifier assigned at thread launch time that controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Different threads of the CTA are advantageously synchronized at appropriate points during CTA execution using a barrier synchronization technique in which barrier instructions in the CTA program are detected and used to suspend execution of some threads until a specified number of other threads also reach the barrier point. 
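
The barrier behavior described above can be modeled on the host with ordinary Python threads. This is only a sketch under stated assumptions: `run_cta` is a hypothetical helper, Python threads stand in for CTA threads, and `threading.Barrier` plays a role loosely analogous to a PTX barrier instruction such as `bar.sync`. Each thread writes an intermediate result, waits at the barrier, and only then reads a neighbor's value.

```python
import threading


def run_cta(num_threads):
    """Toy 'CTA': write, wait at a barrier, then read a neighbor's result."""
    shared = [None] * num_threads   # stands in for memory shared by the CTA
    result = [None] * num_threads
    # The barrier releases only once num_threads threads have arrived.
    barrier = threading.Barrier(num_threads)

    def body(tid):
        shared[tid] = tid * tid             # phase 1: produce intermediate result
        barrier.wait()                      # suspend until every thread has arrived
        neighbor = (tid + 1) % num_threads
        result[tid] = shared[neighbor]      # phase 2: neighbor's write is now visible

    threads = [threading.Thread(target=body, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Without the `barrier.wait()` call, a thread could read its neighbor's slot before the neighbor has written it and see `None`. With the barrier, `run_cta(4)` returns `[1, 4, 9, 0]`.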






















Posted by Leo 리오