Let uint32_t * Data be an array of length n = 2^30 storing numbers from the set S = {0, ..., 9} and uint32_t * Counts be an array of length m = 10 storing the number of occurrences of each element of S in Data.
(i) Implement a histogram kernel where each thread reads an entry from Data and atomically increments the corresponding slot in Counts.
(ii) Improve that kernel by computing local histograms per CUDA thread block in shared memory and subsequently merging the partial histograms into Counts using atomic operations.
(iii) Provide a register-only variant where each thread independently increments m = 10 registers. Subsequently, the counts stored in registers have to be accumulated using warp intrinsics. Finally, the block-local histograms are written atomically to Counts.
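The three variants can be sketched as follows. This is one possible starting point rather than a reference solution. The kernel names, the constant M, the grid-stride loops, the launch configuration, and the host helper run_histogram are all assumptions made for illustration. The sketch also assumes that every entry of Data is smaller than 10 and that the block size is a multiple of 32, which the full-warp shuffle mask in variant (iii) relies on. In variant (iii), lane 0 of each warp writes its totals directly to Counts, so the final merge happens per warp and not per block. Merging per block would need an extra combination step across the warps of a block.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

constexpr uint32_t M = 10;  // number of bins, |S| = 10 (assumed name)

// (i) One global atomic increment per input element.
__global__ void hist_global(const uint32_t *Data, uint32_t *Counts, uint64_t n) {
    const uint64_t stride = (uint64_t)blockDim.x * gridDim.x;
    for (uint64_t i = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&Counts[Data[i]], 1u);
}

// (ii) Block-local histogram in shared memory, merged with M global atomics per block.
__global__ void hist_shared(const uint32_t *Data, uint32_t *Counts, uint64_t n) {
    __shared__ uint32_t local[M];
    for (uint32_t b = threadIdx.x; b < M; b += blockDim.x) local[b] = 0;
    __syncthreads();  // all bins are zeroed before any thread counts

    const uint64_t stride = (uint64_t)blockDim.x * gridDim.x;
    for (uint64_t i = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[Data[i]], 1u);
    __syncthreads();  // the block-local histogram is complete

    for (uint32_t b = threadIdx.x; b < M; b += blockDim.x)
        atomicAdd(&Counts[b], local[b]);
}

// (iii) Per-thread counters, reduced within each warp by shuffles.
// Assumes blockDim.x is a multiple of 32 so that every lane of the mask exists.
__global__ void hist_register(const uint32_t *Data, uint32_t *Counts, uint64_t n) {
    uint32_t c[M] = {0};
    const uint64_t stride = (uint64_t)blockDim.x * gridDim.x;
    for (uint64_t i = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        const uint32_t v = Data[i];
        // Compile-time indices after unrolling let the compiler keep c[] in registers.
        #pragma unroll
        for (uint32_t b = 0; b < M; ++b) c[b] += (v == b);
    }
    #pragma unroll
    for (uint32_t b = 0; b < M; ++b) {
        // Tree reduction: afterwards lane 0 holds the warp total for bin b.
        for (int offset = 16; offset > 0; offset >>= 1)
            c[b] += __shfl_down_sync(0xffffffffu, c[b], offset);
    }
    if (threadIdx.x % 32 == 0) {
        #pragma unroll
        for (uint32_t b = 0; b < M; ++b) atomicAdd(&Counts[b], c[b]);
    }
}

// Hypothetical host helper: runs variant 1, 2 or 3 and copies the M counts to h_counts.
void run_histogram(int variant, const uint32_t *h_data, uint64_t n, uint32_t *h_counts) {
    uint32_t *d_data = nullptr, *d_counts = nullptr;
    cudaMalloc(&d_data, n * sizeof(uint32_t));
    cudaMalloc(&d_counts, M * sizeof(uint32_t));
    cudaMemcpy(d_data, h_data, n * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemset(d_counts, 0, M * sizeof(uint32_t));

    const int threads = 256, blocks = 128;  // arbitrary launch configuration
    if (variant == 1)      hist_global<<<blocks, threads>>>(d_data, d_counts, n);
    else if (variant == 2) hist_shared<<<blocks, threads>>>(d_data, d_counts, n);
    else                   hist_register<<<blocks, threads>>>(d_data, d_counts, n);

    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_counts, d_counts, M * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_counts);
}
```

With n = 2^30 no bin can exceed 2^30, so 32-bit counters do not overflow. Error checking of the CUDA calls is omitted here for brevity and would be needed in a real submission.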
Measure the execution times. Which approach performs best?
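One way to take the measurements is with CUDA events, as in the minimal sketch below. The alias HistKernel, the function time_kernel_ms, and the placeholder kernel hist_baseline are hypothetical names. Any kernel with the same parameter list can be passed in its place. The sketch times a single launch. A more careful measurement would add a warm-up launch and average over several runs. Which variant is fastest depends on the GPU and the launch configuration, so the answer has to come from the measured numbers.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Assumed common signature of the histogram kernels to be compared.
using HistKernel = void (*)(const uint32_t *, uint32_t *, uint64_t);

// Placeholder kernel so the sketch is self-contained: one global atomic per element.
__global__ void hist_baseline(const uint32_t *Data, uint32_t *Counts, uint64_t n) {
    const uint64_t stride = (uint64_t)blockDim.x * gridDim.x;
    for (uint64_t i = (uint64_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&Counts[Data[i]], 1u);
}

// Times one launch of `kernel` on device pointers d_data and d_counts.
// Returns the elapsed time in milliseconds, or -1.0f if CUDA reported an error.
float time_kernel_ms(HistKernel kernel, const uint32_t *d_data, uint32_t *d_counts,
                     uint64_t n, int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemset(d_counts, 0, 10 * sizeof(uint32_t));  // reset the m = 10 bins

    cudaEventRecord(start);
    kernel<<<blocks, threads>>>(d_data, d_counts, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event are done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    const cudaError_t err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return err == cudaSuccess ? ms : -1.0f;
}
```

A call such as time_kernel_ms(hist_baseline, d_data, d_counts, n, 128, 256) measures kernel time only. The host-to-device transfer of Data is excluded because it is identical for all three variants.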