Implementing Low Level GPU
Hans-Kristian Arntzen
Munich 2019
Problem space
It’s just triangles, how hard can it be, right?
Old-school?
• Late 90s, early 2000s console graphics hardware is quirky
• Does not look anything like a modern GPU
• Understanding how legacy tech works is fun
• Nintendo 64 RDP as an example
• Reimplementing them is also “fun”
• Goal here is accurate software rendering
• … but on Vulkan compute!
• … because why not
High-level emulation
• Reinterpreting the intent of an application
• Almost exclusively used for N64 and beyond
• Higher internal resolution
• Rasterization and fragment pipeline
• Sacrifices accuracy for speed and “bling”
• Many challenges, but not what this talk is about
Low-level emulation
N64 was known to be blurry for a reason …
Old-school rasterization
[Figure: span equations, complex primitive shapes]
Triangle setup and CPU vertex processing
• Poly-counts were generally low
• Good use case for programmable co-processor / DSP
• GTE on PS1, RSP on N64, VU0/1 on PS2, etc …
• SW lighting, Gouraud shading
• A low-level emulation will usually consume triangle setup
• Precomputed interpolation equations
• Usually fixed point
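To make this concrete, here is a minimal GLSL sketch of evaluating such a precomputed fixed-point equation per pixel; the struct layout, field names and 16.16 format are invented for illustration and do not correspond to any particular console's setup.

// Hypothetical fixed-point attribute equation produced by triangle setup.
struct AttributeSetup
{
    int base; // value at the reference pixel, e.g. 16.16 fixed point
    int dadx; // step per pixel in X
    int dady; // step per pixel in Y
};

// Evaluate at (x, y). All math stays in integers so precision and rounding
// can be made to match the original hardware bit-exactly.
int interpolate_attribute(AttributeSetup setup, ivec2 coord, ivec2 reference)
{
    ivec2 d = coord - reference;
    return setup.base + setup.dadx * d.x + setup.dady * d.y;
}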
Rasterization pipeline problems
• Rasterization is one of the last major fixed function blocks in modern GPUs
• Hi-Z and early-ZS is key for high performance, well suited for fixed HW
• Rasterization rules
• Workaround: manual test -> discard -> late-Z everything
• Blending
• Programmable blending is inefficient on desktop – fine on mobile though!
• Depth/stencil buffer
• Programmable depth/stencil testing is even worse – and terrible on mobile as well
• Anti-aliasing
• This gets extremely weird …
• Memory aliasing
• fragment_shader_interlock is the current “solution”
• Throwing performance out the window is not fun
Anti-aliasing problems
• AA in N64 is notoriously weird
• Fixed function MSAA does not map to anything
• N64 feeds coverage data into post-AA in video scanout
[Diagram: coverage counter + scanout filter]
Correlated coverage guides post-AA
The fragment interlock hammer
• We can have programmable everything in fragment with interlocks (see the sketch after this list)
• Take a mutex per pixel with some hardware assistance
• Lock must be in API order => ordered interlock
• Can be extremely slow …
• In order semantics + locks => recipe for performance horror
• Spotty support
• Generally considered an extremely obscure feature
• Alternatives?
• Atomic linked list of pixels, incredibly difficult to pull off in emulation use case
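For reference, a minimal sketch of the interlock pattern in a GLSL fragment shader, using GL_ARB_fragment_shader_interlock (exposed to Vulkan via VK_EXT_fragment_shader_interlock); the framebuffer format, source color and blend rule are placeholders.

#version 450
#extension GL_ARB_fragment_shader_interlock : require

// Ordered interlock: the critical section executes in API/primitive order per pixel.
layout(pixel_interlock_ordered) in;
layout(set = 0, binding = 0, r32ui) uniform uimage2D framebuffer;

uint blend(uint src, uint dst)
{
    return src + dst; // placeholder blend rule
}

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    uint src = 0x7fffu; // placeholder shaded color
    beginInvocationInterlockARB();
    uint dst = imageLoad(framebuffer, coord).x;
    imageStore(framebuffer, coord, uvec4(blend(src, dst), 0u, 0u, 0u));
    endInvocationInterlockARB();
}

Everything between begin and end is the per-pixel critical section, which is exactly why performance collapses as that section grows.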
Rethinking the problem
• Fragment shader
• Start with fast, fixed function API interfaces
• Add a million hacks and workarounds to make it work
• Compute shader
• Abandon fixed function restrictions
• Build from ground up, full implementation flexibility
• Tune for relevant constraints
• More future looking?
Unsolvable problems?
• Any multi-threaded emulation will fail on these, not just GPU
• Cycle accuracy
• Texture cache behavior
• N64 has a programmer-maintained 4 KiB texture cache though …
• Generally not important for correctness
• Obscure cycle timing on CPU is a thing
• Not really on GPU
• Arbitrary cross-pixel dependencies
• Aliasing color / depth not 1:1
• Is this a thing?
Compute shader rasterization
Gotta use those teraflops for something
The render loop
• Basic CPU software rendering is super easy
• foreach (prim in primitives) render_all_pixels_in(prim)
• Need to feed compute with a massive number of threads
• Naïve CPU loops are not MT friendly
• Common solution is going tile based
• foreach (tile) foreach (prim covering tile) render(tile, prim)
• Lots of techniques for compute in this domain
The naïve compute ubershader
int x = coord_for_thread().x;
int y = coord_for_thread().y;
Color color = load_color_framebuffer(x, y);
Depth depth = load_depth_framebuffer(x, y);
// Slide callout: very few pixels pass this (coverage) test.
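Filled out, the naive ubershader has roughly this shape; this is a simplified sketch, and the primitive layout and helper bodies are placeholders rather than actual working code.

#version 450
layout(local_size_x = 16, local_size_y = 16) in; // one workgroup per 16x16 tile

struct Primitive { int dummy; /* span equations, attributes, render state ... */ };
layout(std430, set = 0, binding = 0) readonly buffer Primitives { Primitive prims[]; };
layout(set = 0, binding = 1, rgba8) uniform image2D color_buffer;

bool prim_covers_pixel(Primitive prim, ivec2 coord) { return false; }       // placeholder coverage test
vec4 shade_and_blend(Primitive prim, ivec2 coord, vec4 dst) { return dst; } // placeholder shading + ROP

void main()
{
    ivec2 coord = ivec2(gl_GlobalInvocationID.xy);
    vec4 color = imageLoad(color_buffer, coord);

    // foreach (prim covering tile) render(tile, prim)
    for (int i = 0; i < prims.length(); i++)
    {
        if (!prim_covers_pixel(prims[i], coord)) // very few pixels pass this test
            continue;
        color = shade_and_blend(prims[i], coord, color);
    }

    imageStore(color_buffer, coord, color);
}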
Better bitscan loops
int num_prims_32 = (num_prims + 31) / 32;
int num_prims_1024 = (num_prims_32 + 31) / 32;
// findLSB / ctz / etc. is a single instruction.
// All the loops are dynamically uniform. GPU is happy ☺
// Skip over huge batches of primitives.
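A sketch of what those loops could look like in full, assuming the binning pass wrote a coarse bitmask (one bit per group of 32 primitives) and a fine bitmask (one bit per primitive) for the current tile; the buffer names and layout are invented here.

layout(std430, set = 0, binding = 2) readonly buffer CoarseMask { uint coarse_mask[]; };
layout(std430, set = 0, binding = 3) readonly buffer FineMask { uint fine_mask[]; };

void process_primitive(int prim_index) { /* coverage test, shading, etc. */ }

void scan_binned_primitives(int num_prims_1024)
{
    // Dynamically uniform outer loop over groups of 1024 primitives.
    for (int i = 0; i < num_prims_1024; i++)
    {
        uint coarse = coarse_mask[i];
        while (coarse != 0u)
        {
            int word = findLSB(coarse); // single instruction
            coarse &= coarse - 1u;      // clear lowest set bit
            int fine_index = 32 * i + word;
            uint fine = fine_mask[fine_index];
            while (fine != 0u)
            {
                int bit = findLSB(fine);
                fine &= fine - 1u;
                process_primitive(32 * fine_index + bit);
            }
        }
    }
}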
Splitting up the ubermonster
• What if the per-tile kernel consumed pre-shaded tiles?
• Reduce the ubershader to only deal with blending and depth-testing
• Distribute chunks of work tagged with (tile + primitive index)
• One vkCmdDispatchIndirect per variant
• Can assign different specialized shaders to deal with different render state
• Specialization constants are perfect here (see the sketch after this list)
• Need to allocate storage space for color/depth/etc
• Intense on bandwidth, but gotta use that GDDR6 for something, right?
• Could be a reasonable tradeoff for 240p / 480p content
• Callable shaders would be nice …
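As an illustration of the specialization-constant angle, render state can be baked into each variant roughly like this; the constant IDs and the state encoding are made up for the sketch.

// One compute pipeline per render-state variant, specialized at pipeline creation.
layout(constant_id = 0) const int BLEND_MODE = 0;
layout(constant_id = 1) const bool DEPTH_TEST = true;
layout(constant_id = 2) const bool DEPTH_WRITE = true;

vec4 rop_blend(vec4 src, vec4 dst)
{
    // These branches fold away at compile time in each specialized variant.
    if (BLEND_MODE == 0)
        return src;
    else if (BLEND_MODE == 1)
        return src * src.a + dst * (1.0 - src.a);
    else
        return dst;
}

Each variant then gets its own vkCmdDispatchIndirect over the chunks of work tagged with that render state.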
Split architecture
[Diagram: Shading -> VRAM]

Subgroups
• Represents a SIMD unit
• The ISPC paradigm

in intNx32_t gl_VertexIndex; // conceptually, one 32-bit value per SIMD lane
void main()
{
    floatNx32_t v = floatNx32_t(gl_VertexIndex);
}
Wrapping your head around subgroups
• A “thread” -> “lane in a SIMD vector register”
• Branches -> “make lanes active or inactive”
• All values are the same in subgroup -> “scalar register”
• Shading languages do not express this well
• Many recent graphics talks will mention it
• Console optimization
• https://www.khronos.org/blog/vulkan-subgroup-tutorial
• https://www.khronos.org/assets/uploads/developers/library/2018-vulkan-devday/06-subgroups.pdf
subgroupBallot()
[Generated Navi ISA shown on slide]
bool primitive_is_binned = test_primitive(thread_index);
// Reduce a boolean test across all threads to a bitmap, nice!
uvec4 ballot = subgroupBallot(primitive_is_binned);
if (subgroupElect())
{
// Only one thread needs to write.
if (gl_SubgroupSize == 32)
write_u32_mask(ballot.x);
else if (gl_SubgroupSize == 64)
write_u32x2_mask(ballot.xy);
}
VK_EXT_subgroup_size_control
• Not all GPUs have a fixed subgroup size
• Compilers can vary subgroup sizes
• Intel (8, 16 or 32) - May even vary within a dispatch
• AMD pre-Navi (64 only)
• AMD Navi (32 or 64)
• NVIDIA (32 only)
• gl_SubgroupSize builtin is fixed, but lanes might disappear
• This is totally fine for many use cases of subgroups, but …
• ALLOW_VARYING_SUBGROUP_SIZE_BIT, REQUIRE_FULL_SUBGROUPS_BIT and requiredSubgroupSize
• Critical extension to use subgroups well on Intel
Subgroup atomic amortization
• Distributing work on GPU typically means atomic increments
• Atomics are expensive
• Amortize atomics overhead
• Most compilers do this when incrementing a fixed address by 1
• May or may not when incrementing by != 1
Arithmetic subgroup operations
// Merge atomic adds
// Reduce in registers
uint total_bit_count = subgroupAdd(bit_count);
uint offset = 0u;
if (subgroupElect())
offset = atomicAdd(counts, total_bit_count);
offset = subgroupBroadcastFirst(offset);
offset += subgroupExclusiveAdd(bit_count);
// Legacy?
// Prefix sum in shared memory and then atomic.
// Write result to shared, then broadcast.
8-bit and 16-bit storage
• Storing intermediate data in 32-bit all the time would be wasteful
• Color might fit in uint8 * 4 (or pack manually in uint)
• Depth might fit in 16 bpp
• Coverage/AUX state in 8 bpp
• YMMV
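For instance, intermediate tile data could be declared and packed with the 8-bit / 16-bit storage extensions roughly as below; the payload layout is a guess for illustration.

#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require
#extension GL_EXT_shader_explicit_arithmetic_types_int16 : require
layout(local_size_x = 64) in;

// Hypothetical pre-shaded tile payload: 8-bit per channel color,
// 16-bit depth, 8-bit coverage / AUX state.
layout(std430, set = 0, binding = 4) buffer TileColor { u8vec4 color[]; };
layout(std430, set = 0, binding = 5) buffer TileDepth { uint16_t depth[]; };
layout(std430, set = 0, binding = 6) buffer TileAux { uint8_t coverage[]; };

void main()
{
    uint index = gl_GlobalInvocationID.x;
    vec4 shaded = vec4(1.0, 0.5, 0.25, 1.0); // placeholder shaded result
    color[index] = u8vec4(uvec4(round(shaded * 255.0)));
    depth[index] = uint16_t(0xffffu);
    coverage[index] = uint8_t(7u);
}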
Mip-mapping / derivatives
• Simple
• Each thread works on a pixel group
• Good luck with that register pressure …
• Shared memory and barriers?
• Ugh
• subgroupShuffleXor (see the sketch after this list)
• NV_compute_shader_derivatives
• Widely supported on desktop, should be EXT!
• dFdx/dFdy/fwidth and implicit LOD all in compute
• Enables subgroupQuad operations
• Linear ID or 2x2 ID grid
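As a sketch of the subgroupShuffleXor route: if the thread layout is arranged so that a 2x2 pixel quad occupies consecutive lanes (an assumption of this sketch, not something the spec guarantees), derivatives fall out of two shuffles.

#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_shuffle : require

// Assumes lanes differing in bit 0 are horizontal neighbors and lanes
// differing in bit 1 are vertical neighbors within the 2x2 quad.
vec2 manual_derivatives(float value)
{
    float horiz = subgroupShuffleXor(value, 1u);
    float vert  = subgroupShuffleXor(value, 2u);
    float dx = ((gl_SubgroupInvocationID & 1u) == 0u) ? (horiz - value) : (value - horiz);
    float dy = ((gl_SubgroupInvocationID & 2u) == 0u) ? (vert - value) : (value - vert);
    return vec2(dx, dy);
}

With NV_compute_shader_derivatives and a 2x2 thread layout, this becomes unnecessary: dFdx/dFdy and implicit-LOD texturing work directly in compute.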
Quad operations in compute is kinda nice
Async compute
• Pipeline is split in two
• Binning (+ shading if not using ubershader)
• COMPUTE queue
• Final ROP stage doing depth / blending
• GRAPHICS queue
• Only in-order part of pipeline
• Overlaps bandwidth-intensive work with ALU-intensive work
• Need to serialize if using off-screen rendering and consuming result next pass
Test implementation results
A fake retro GPU - RetroWarp
• Sandbox environment
• https://github.com/Themaister/RetroWarp
• Uses Granite as Vulkan backend
• Span equation based
• Bilinear / trilinear texture filtering
• Implemented manually, no texture()
• Perspective correct
• Trivial texture combiner
• 16-bit color w/ dither, 16-bit depth
• Fully fixed point
• ... except attribute interpolation (might need 64-bit ints)
Test scene
• Sponza
• 1280x720
• Overkill, 640x480 is more plausible
• 97309 primitives
• Post clipping
• Post back face culling
• N64 had ~5k primitive budget
• Everything is alpha blended
• Why not
• Didn’t sort triangles, looks funny
Results – GTX 1660 Ti

Options                            Ubershader (16x16 tiles)  Ubershader (8x8 tiles)  Split shader (16x16 tiles)  Split shader (8x8 tiles)
Subgroups ON, async compute OFF    7.7 ms                    9.5 ms                  9.0 ms                      9.9 ms
Subgroups ON, async compute ON     7.2 ms                    8.8 ms                  8.2 ms                      9.5 ms
Subgroups OFF, async compute OFF   10.6 ms                   10.2 ms                 9.3 ms                      10.5 ms
Subgroups OFF, async compute ON    9.6 ms                    9.2 ms                  8.4 ms                      9.5 ms
Results – RX 5700 XT – RADV (LLVM)

Options                            Ubershader (16x16 tiles)  Ubershader (8x8 tiles)  Split shader (16x16 tiles)  Split shader (8x8 tiles)
Subgroups ON, async compute OFF    11.6 ms                   9.9 ms                  6.3 ms                      5.9 ms
Subgroups ON, async compute ON     11.3 ms                   9.3 ms                  5.2 ms                      5.1 ms
Subgroups OFF, async compute OFF   14.6 ms                   10.6 ms                 6.4 ms                      6.4 ms
Subgroups OFF, async compute ON    14.3 ms                   9.9 ms                  5.5 ms                      5.6 ms
Conclusion
• Compute shaders are a viable alternative
• Allows accurate rendering in real-time
• Subgroup operations can be useful in unexpected places
• Data sharing without barriers is very nice
• Async compute + graphics queue compute is a thing
• Radeon GPU Analyzer is useful
• Verifying assumptions with ISA is great
Thanks!
@Themaister
themaister.net/blog
arntzen-software.no
© Hans-Kristian Arntzen