Cooperative Thread Array (CTA)

The Parallel Thread Execution (PTX) programming model is explicitly parallel: a PTX program specifies the execution of a given thread of a parallel thread array. A cooperative thread array, or CTA, is an array of threads that execute a kernel concurrently or in parallel. Threads within a CTA can communicate with each other. To coordinate the communication of the threads within the CTA, one can specify synchronization points where threads wait until all threads in the CTA have arrived.
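
To make the role of a synchronization point concrete, here is a minimal host-side C++ sketch, not PTX or device code. It models one CTA as a pair of loops: each loop iteration stands in for one thread, and the gap between the two loops stands in for the barrier. The function name `reverse_in_cta`, the `shared` buffer, and the one-thread-per-element layout are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side model of one CTA: each "thread" is one loop
// iteration, and the point between the two loops plays the role of a barrier.
std::vector<int> reverse_in_cta(const std::vector<int>& input) {
    const std::size_t ntid = input.size();  // assumption: one thread per element
    std::vector<int> shared(ntid);          // stands in for memory shared by the CTA
    std::vector<int> output(ntid);

    // Phase 1: every thread publishes its own element.
    for (std::size_t tid = 0; tid < ntid; ++tid) {
        shared[tid] = input[tid];
    }

    // Barrier: no thread may start phase 2 until all threads finished phase 1.
    // Running the loops back to back enforces that ordering in this model.

    // Phase 2: every thread reads an element that a different thread wrote.
    for (std::size_t tid = 0; tid < ntid; ++tid) {
        output[tid] = shared[ntid - 1 - tid];
    }
    return output;
}
```

Without the wait between the phases, a thread in phase 2 could read a slot that its partner thread has not written yet, which is the situation the synchronization point exists to rule out.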

Each thread has a unique thread id within the CTA. Programs use a data parallel decomposition to partition inputs, work, and results across the threads of the CTA. Each CTA thread uses its thread id to determine its assigned role, assign specific input and output positions, compute addresses, and select work to perform. The thread id is a three-element vector tid (with elements tid.x, tid.y, and tid.z) that specifies the thread’s position within a 1D, 2D, or 3D CTA. Each thread id component ranges from 0 up to, but not including, the number of thread ids in that CTA dimension.

Each CTA has a 1D, 2D, or 3D shape specified by a three-element vector ntid (with elements ntid.x, ntid.y, and ntid.z). The vector ntid specifies the number of threads in each CTA dimension.
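
As a rough illustration of how tid and ntid fit together, the host-side C++ sketch below checks that a thread id lies inside the CTA shape and flattens it to a single index. The `Dim3` struct and the helpers `in_cta` and `linear_tid` are hypothetical names, not part of PTX. The x-fastest ordering is the convention CUDA uses when it numbers threads within a block, and it is assumed here.

```cpp
#include <cstdint>

// Hypothetical three-component vector, mirroring the tid / ntid layout.
struct Dim3 {
    std::uint32_t x, y, z;
};

// True when every component of tid lies in [0, ntid - 1].
bool in_cta(const Dim3& tid, const Dim3& ntid) {
    return tid.x < ntid.x && tid.y < ntid.y && tid.z < ntid.z;
}

// Flatten a 3D thread id to one index, with x varying fastest (assumed order).
std::uint32_t linear_tid(const Dim3& tid, const Dim3& ntid) {
    return tid.x + tid.y * ntid.x + tid.z * ntid.x * ntid.y;
}
```

For a CTA with ntid = (4, 2, 3), the last thread has tid = (3, 1, 2), and its flattened index is 3 + 1*4 + 2*8 = 23, one less than the 24 threads in the CTA.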

Threads within a CTA execute in SIMT (single-instruction, multiple-thread) fashion in groups called warps. A warp is a maximal subset of threads from a single CTA, such that the threads execute the same instructions at the same time. Threads within a warp are sequentially numbered. The warp size is a machine-dependent constant. Typically, a warp has 32 threads. Some applications may be able to maximize performance with knowledge of the warp size, so PTX includes a run-time immediate constant, WARP_SZ, which may be used in any instruction where an immediate operand is allowed.
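
The relationship between a thread and its warp can be sketched on the host in C++. The example assumes the threads of a CTA have already been numbered with one flat index (0, 1, 2, ...) and that consecutive indices are packed into the same warp. `kWarpSize`, `WarpPosition`, `warp_position`, and `warps_per_cta` are hypothetical names. The value 32 is only the typical size mentioned above, so real code should take it from the target (WARP_SZ in PTX) rather than hard-code it.

```cpp
#include <cstdint>

// Assumed warp size for this sketch; real code should read it from the target.
constexpr std::uint32_t kWarpSize = 32;

struct WarpPosition {
    std::uint32_t warp;  // which warp of the CTA the thread falls in
    std::uint32_t lane;  // the thread's position inside that warp
};

// Split a flat thread index into (warp, lane).
WarpPosition warp_position(std::uint32_t linear_tid) {
    return WarpPosition{linear_tid / kWarpSize, linear_tid % kWarpSize};
}

// Number of warps needed to cover a CTA; the last warp may be partly filled.
std::uint32_t warps_per_cta(std::uint32_t threads_in_cta) {
    return (threads_in_cta + kWarpSize - 1) / kWarpSize;
}
```

Under these assumptions, a CTA of 100 threads occupies 4 warps, and the last warp has only 4 active threads. This is one reason applications that know the warp size often choose CTA sizes that are multiples of it.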

http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/2.0/ptx_isa_12.pdf

~~What is this even saying; my clumsy translation follows~~

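An array of threads that execute in parallel.
Threads in a CTA can communicate with each other. Synchronization points can be specified, and a thread that reaches one must wait until all the other threads in the CTA have arrived.
Each thread has a thread id, which it uses to work out its assigned role.
Note that tid is three-dimensional: x, y, z.
ntid (x, y, z) is the size of the CTA.
Threads in a CTA execute in SIMT fashion, grouped into warps. A warp is a maximal subset of threads from a single CTA that execute the same instructions at the same time. The threads of a warp are numbered sequentially, and the warp size depends on the machine; typically, 32 threads make up one warp. Some applications can improve performance if they know the warp size, so PTX includes a run-time immediate constant, WARP_SZ. (It only reports the warp size of the target machine; it does not let you change the warp size.)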

A “cooperative thread array,” or “CTA,” is a group of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique thread identifier assigned at thread launch time that controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Different threads of the CTA are advantageously synchronized at appropriate points during CTA execution using a barrier synchronization technique in which barrier instructions in the CTA program are detected and used to suspend execution of some threads until a specified number of other threads also reach the barrier point.
