
Implementing Low Level GPU Hans Kristian Munich 2019

Hans-Kristian Arntzen discusses reimplementing the graphics pipelines of older console hardware like the Nintendo 64 using Vulkan compute shaders. He outlines some of the challenges in mapping the fixed-function rasterization of older GPUs to programmable compute shaders. These include implementing depth and blending operations, handling anti-aliasing, and dealing with arbitrary dependencies between pixels. The talk explores techniques like using compute shaders to rasterize primitives into tiles and splitting the rendering work across multiple specialized shaders.


Implementing old-school graphics chips with Vulkan compute
Hans-Kristian Arntzen
Khronos Meetup – Munich
2019-10-11
Content
• The problem space
• Compute shader rasterization
• Optimizing with Vulkan subgroups
• Test implementation

© Hans-Kristian Arntzen 2
Problem space
It’s just triangles, how hard can it be, right?

Old-school?
• Late 90s, early 2000s console graphics hardware is quirky
• Does not look anything like a modern GPU
• Understanding how legacy tech works is fun
• Nintendo 64 RDP as an example
• Reimplementing them is also “fun”
• Goal here is accurate software rendering
• … but on Vulkan compute!
• … because why not

High-level emulation
• Reinterpreting the intent of an application
• Almost exclusively used for N64 and beyond
• Higher internal resolution
• Rasterization and fragment pipeline
• Sacrifices accuracy for speed and “bling”
• Many challenges, but not what this talk is about

Low-level emulation
(N64 was known to be blurry for a reason …)
• Focus on recreating exact behavior
• Emulate what the GPU is doing in detail
• Usually only reserved for a CPU reference renderer
• Slow!
• Very specific “look and feel”

Old-school rasterization

Complex shapes from span equations

Triangle setup and CPU vertex processing
• Poly-counts were generally low
• Good use case for programmable co-processor / DSP
• GTE on PS1, RSP on N64, VU0/1 on PS2, etc …
• SW lighting, Gouraud shading
• A low-level emulation will usually consume triangle setup
• Precomputed interpolation equations
• Usually fixed point
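A minimal C sketch of what consuming precomputed fixed-point interpolation equations can look like. The 16.16 format and all names here are illustrative, not the RDP's actual formats:

```c
#include <stdint.h>

// Hypothetical 16.16 fixed-point plane equation: value = a*x + b*y + c.
// Real consoles use their own formats; this only illustrates the idea of
// per-primitive interpolants precomputed during triangle setup.
typedef struct {
    int64_t a, b, c; // 16.16 fixed-point coefficients
} PlaneEq;

static int32_t to_fixed(float f) { return (int32_t)(f * 65536.0f); }

// Derive a plane equation from an attribute's value at three vertices.
static PlaneEq setup_plane(float x0, float y0, float v0,
                           float x1, float y1, float v1,
                           float x2, float y2, float v2)
{
    float det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
    float da = ((v1 - v0) * (y2 - y0) - (v2 - v0) * (y1 - y0)) / det;
    float db = ((v2 - v0) * (x1 - x0) - (v1 - v0) * (x2 - x0)) / det;
    float dc = v0 - da * x0 - db * y0;
    PlaneEq eq = { to_fixed(da), to_fixed(db), to_fixed(dc) };
    return eq;
}

// Evaluate at integer pixel coordinates; result is 16.16 fixed-point.
static int64_t eval_plane(const PlaneEq *eq, int x, int y)
{
    return eq->a * x + eq->b * y + eq->c;
}
```

The rasterizer then only ever evaluates these integer equations per pixel, which is why emulation can consume triangle setup directly.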

Rasterization pipeline problems
• Rasterization is one of the last major fixed function blocks in modern GPUs
• Hi-Z and early-ZS is key for high performance, well suited for fixed HW
• Rasterization rules
• Workaround: manual test -> discard -> late-Z for everything
• Blending
• Programmable blending is inefficient on desktop – fine on mobile though!
• Depth/stencil buffer
• Programmable depth/stencil testing is even worse – and terrible on mobile as well
• Anti-aliasing
• This gets extremely weird …
• Memory aliasing
• fragment_shader_interlock is the current “solution”
• Throwing performance out the window is not fun 

Anti-aliasing problems
• AA in N64 is notoriously weird
• Fixed function MSAA does not map to anything
• N64 feeds coverage data into post-AA in video scanout

Coverage counter + scanout filter

Correlated coverage guides post-AA

The fragment interlock hammer
• We can have programmable everything in fragment with interlocks
• Take a mutex per pixel with some hardware assistance
• Lock must be in API order => ordered interlock
• Can be extremely slow …
• In order semantics + locks => recipe for performance horror
• Spotty support
• Generally considered an extremely obscure feature
• Alternatives?
• Atomic linked list of pixels, incredibly difficult to pull off in emulation use case
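A single-threaded C model of the atomic linked-list alternative mentioned above. On a GPU the head update would be an atomicExchange and the allocation an atomicAdd; all names and sizes here are illustrative:

```c
#include <stdint.h>

// Per-pixel fragment linked list (as used in order-independent techniques).
// Single-threaded model: shows the data structure, not the atomics.
#define MAX_FRAGMENTS 1024
#define TILE_DIM 64

typedef struct {
    uint32_t color;
    uint32_t depth;
    int32_t  next;   // index of next fragment for the same pixel, -1 = end
} Fragment;

typedef struct {
    Fragment pool[MAX_FRAGMENTS];
    int32_t  count;
    int32_t  head[TILE_DIM * TILE_DIM]; // one list head per pixel
} FragmentList;

static void fl_init(FragmentList *fl)
{
    fl->count = 0;
    for (int i = 0; i < TILE_DIM * TILE_DIM; i++)
        fl->head[i] = -1;
}

// Append a fragment for pixel (x, y), pushing onto the pixel's chain.
static void fl_push(FragmentList *fl, int x, int y,
                    uint32_t color, uint32_t depth)
{
    int32_t idx = fl->count++;                   // atomicAdd on GPU
    fl->pool[idx].color = color;
    fl->pool[idx].depth = depth;
    fl->pool[idx].next  = fl->head[y * TILE_DIM + x]; // atomicExchange on GPU
    fl->head[y * TILE_DIM + x] = idx;
}

// Walk one pixel's chain to count fragments stored for it.
static int fl_depth_complexity(const FragmentList *fl, int x, int y)
{
    int n = 0;
    for (int32_t i = fl->head[y * TILE_DIM + x]; i != -1; i = fl->pool[i].next)
        n++;
    return n;
}
```

The difficulty in the emulation use case is that the lists must later be resolved in exact API order per pixel, and worst-case memory is unbounded.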

Rethinking the problem
• Fragment shader
• Start with fast, fixed function API interfaces
• Add a million hacks and workarounds to make it work
• Compute shader
• Abandon fixed function restrictions
• Build from ground up, full implementation flexibility
• Tune for relevant constraints
• More future looking?

Unsolvable problems?
• Any multi-threaded emulation will fail on these, not just GPU
• Cycle accuracy
• Texture cache behavior
• N64 has a programmer-maintained 4 KiB texture cache though …
• Generally not important for correctness
• Obscure cycle timing on CPU is a thing
• Not really on GPU
• Arbitrary cross-pixel dependencies
• Aliasing color / depth not 1:1
• Is this a thing?

Compute shader rasterization
Gotta use those teraflops for something
The render loop
• Basic CPU software rendering is super easy
• foreach (prim in primitives) render_all_pixels_in(prim)
• Need to feed compute with a massive number of threads
• Naïve CPU loops are not MT friendly
• Common solution is going tile based
• foreach (tile) foreach (prim covering tile) render(tile, prim)
• Lots of techniques for compute in this domain
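The tile-based loop above can be sketched as scalar C. The axis-aligned "primitive" and all names are simplifications for illustration; on the GPU each tile would map to a workgroup:

```c
#include <stdint.h>

// Scalar model of "foreach (tile) foreach (prim covering tile) render(tile, prim)".
// Tiles are independent, which is what makes this friendly to massive threading.
#define TILE 8
#define FB_W 32
#define FB_H 32

typedef struct { int x0, y0, x1, y1; uint32_t color; } Prim; // AABB stand-in

static int prim_covers_tile(const Prim *p, int tx, int ty)
{
    int x0 = tx * TILE, y0 = ty * TILE;
    int x1 = x0 + TILE - 1, y1 = y0 + TILE - 1;
    return p->x0 <= x1 && p->x1 >= x0 && p->y0 <= y1 && p->y1 >= y0;
}

static void render_tiles(uint32_t *fb, const Prim *prims, int num_prims)
{
    for (int ty = 0; ty < FB_H / TILE; ty++)
        for (int tx = 0; tx < FB_W / TILE; tx++)
            for (int i = 0; i < num_prims; i++) {
                if (!prim_covers_tile(&prims[i], tx, ty))
                    continue;
                // Render only this tile's pixels covered by the primitive.
                for (int y = ty * TILE; y < (ty + 1) * TILE; y++)
                    for (int x = tx * TILE; x < (tx + 1) * TILE; x++)
                        if (x >= prims[i].x0 && x <= prims[i].x1 &&
                            y >= prims[i].y0 && y <= prims[i].y1)
                            fb[y * FB_W + x] = prims[i].color;
            }
}
```

Note that within one tile, primitives are still visited in submission order, which preserves API ordering per pixel.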

The naïve compute ubershader

int x = coord_for_thread().x;
int y = coord_for_thread().y;
Color color = load_color_framebuffer(x, y);
Depth depth = load_depth_framebuffer(x, y);

for (auto &setup : primitive_setups)
{
    // Very few pixels pass this test.
    if (!test_rasterization_coverage(setup, x, y))
        continue;

    // Expect more like 2000 lines of code here.
    Color rgba = interpolate_rgba(setup, x, y);
    UV uv = interpolate_uv(setup, x, y);
    Color samp = sample_texture(setup.texinfo, uv);
    Color combined_color = combiner(rgba, samp, setup.comb);

    // Fully programmable blending and ROP, in GPU registers!
    depth_stage(depth, interpolate_z(setup, x, y));
    blend_stage(color, combined_color);
}

// HOST_VISIBLE buffer if CPU and GPU need to be coherent (N64 >_<) ...
store_color_framebuffer(x, y, color);
store_depth_framebuffer(x, y, depth);
Tile binning to bitmasks
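One possible C sketch of binning into hierarchical bitmasks (one bit per primitive, plus one coarse bit per 32-bit word), matching the bitscan loop on the next slide. Sizes and names are illustrative:

```c
#include <stdint.h>

// Hierarchical bitmask binning: mask[] has one bit per primitive,
// coarse_mask[] has one bit per 32-bit word of mask[], so empty batches
// of 32 (and 1024) primitives can be skipped wholesale while iterating.
#define MAX_PRIMS 16384

typedef struct {
    uint32_t mask[MAX_PRIMS / 32];           // 1 bit per primitive
    uint32_t coarse_mask[MAX_PRIMS / 1024];  // 1 bit per mask[] word
} TileBin;

static void bin_primitive(TileBin *bin, int prim_index)
{
    int word = prim_index / 32;
    bin->mask[word] |= 1u << (prim_index & 31);
    bin->coarse_mask[word / 32] |= 1u << (word & 31);
}

// Visit every binned primitive via the two-level bitscan
// (__builtin_ctz is the CPU analogue of findLSB). Returns the count.
static int count_binned(const TileBin *bin, int num_prims)
{
    int count = 0;
    int num_words = (num_prims + 31) / 32;
    int num_coarse = (num_words + 31) / 32;
    for (int c = 0; c < num_coarse; c++) {
        uint32_t cmask = bin->coarse_mask[c];
        while (cmask) {
            int word = c * 32 + __builtin_ctz(cmask);
            cmask &= cmask - 1; // clear lowest set bit
            uint32_t fmask = bin->mask[word];
            while (fmask) {
                int prim = word * 32 + __builtin_ctz(fmask);
                fmask &= fmask - 1;
                (void)prim; // a renderer would test and render here
                count++;
            }
        }
    }
    return count;
}
```

The bitmask form keeps per-tile storage fixed regardless of how many primitives land in a tile, unlike a flat triangle-ID list.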

Better bitscan loops

int num_prims_32 = (num_prims + 31) / 32;
int num_prims_1024 = (num_prims_32 + 31) / 32;

// All the loops are dynamically uniform. GPU is happy ☺
// Skip over huge batches of primitives.
for (coarse_index in num_prims_1024)
{
    foreach_bit(coarse_bit, coarse_mask[tile][coarse_index])
    {
        int mask_index = coarse_bit + coarse_index * 32;
        foreach_bit(bit, mask[tile][mask_index])
        {
            int primitive_index = bit + mask_index * 32;
            // Now we test and render.
        }
    }
}

// foreach_bit maps to findLSB / ctz / etc – a single instruction per step.
// Alternative: flat list of triangle IDs, but massive VRAM requirement
// in the worst-case scenario.
To ubershade or not to ubershade
• Even ancient GPUs have a lot of state bits
• Primitive type?
• Texture combiner state?
• Blending modes?
• Depth modes?
• And more …
• Compute kernel to render a tile needs to handle everything
• Insane register pressure => poor occupancy => poor performance
• At least the branches are dynamically uniform ☺
• Might be best solution if configuration space is small
• Bindless resources

Splitting up the ubermonster
• What if the per-tile kernel consumed pre-shaded tiles?
• Reduce the ubershader to only deal with blending and depth-testing
• Distribute chunks of work tagged with (tile + primitive index)
• One vkCmdDispatchIndirect per variant
• Can assign different specialized shaders to deal with different render state
• Specialization constants are perfect here
• Need to allocate storage space for color/depth/etc
• Intense on bandwidth, but gotta use that GDDR6 for something, right?
• Could be a reasonable tradeoff for 240p / 480p content
• Callable shaders would be nice …

Split architecture

(Diagram: low-res binning → binning → shading → blend / depth, writing to VRAM)

• Position data (SSBO) – 16k entries
• Attribute data (SSBO) – 16k entries
• Primitive state descriptor index (UBO) – 16k entries
• State descriptors (UBO) – 1k entries
Advanced Vulkan features
Now this is where it gets interesting
Subgroup operations
• New feature in Vulkan 1.1
• Share data between threads efficiently
• Peeks into a world where threads execute in lock-step in a SIMD fashion
• Alternative is going through shared memory + barrier()
in int gl_VertexIndex;
void main()
{
    float v = float(gl_VertexIndex);
}
// Isolated threads: shader code is nice and scalar.

in intNx32_t gl_VertexIndex;
void main()
{
    floatNx32_t v = floatNx32_t(gl_VertexIndex);
}
// Subgroups: represents a SIMD unit – the ISPC paradigm.
Wrapping your head around subgroups
• A “thread” -> “lane in a SIMD vector register”
• Branches -> “make lanes active or inactive”
• All values are the same in subgroup -> “scalar register”
• Shading languages do not express this well
• Many recent graphics talks will mention it
• Console optimization
• https://www.khronos.org/blog/vulkan-subgroup-tutorial
• https://www.khronos.org/assets/uploads/developers/library/2018-vulkan-devday/06-subgroups.pdf
subgroupBallot()

bool primitive_is_binned = test_primitive(thread_index);
// Reduce a boolean test across all threads to a bitmap, nice!
uvec4 ballot = subgroupBallot(primitive_is_binned);

if (subgroupElect())
{
    // Only one thread needs to write.
    if (gl_SubgroupSize == 32)
        write_u32_mask(ballot.x);
    else if (gl_SubgroupSize == 64)
        write_u32x2_mask(ballot.xy);
}

// Legacy way? atomicOr on shared memory or reduction passes <_<
VK_EXT_subgroup_size_control
• Not all GPUs have a fixed subgroup size
• Compilers can vary subgroup sizes
• Intel (8, 16 or 32) - May even vary within a dispatch
• AMD pre-Navi (64 only)
• AMD Navi (32 or 64)
• NVIDIA (32 only)
• gl_SubgroupSize builtin is fixed, but lanes might disappear
• This is totally fine for many use cases of subgroups, but …
• VARYING_SIZE_BIT, FULL_GROUP_BIT and subgroupSizeRequirement
• Critical extension to use subgroups well on Intel

Subgroup atomic amortization
• Distributing work on GPU typically means atomic increments
• Atomics are expensive
• Amortize atomics overhead
• Most compilers do this when incrementing a fixed address by 1
• May or may not when incrementing by != 1

Arithmetic subgroup operations

// Merge atomic adds
uint bit_count = bitCount(binned_mask);

// Reduce in registers
uint total_bit_count = subgroupAdd(bit_count);
uint offset = 0u;
if (subgroupElect())
    offset = atomicAdd(counts, total_bit_count);
offset = subgroupBroadcastFirst(offset);
offset += subgroupExclusiveAdd(bit_count);

// Legacy?
// Prefix sum in shared memory and then atomic.
// Write result to shared, then broadcast.
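The same amortization pattern, modeled in scalar C: one "atomic" per subgroup plus an exclusive prefix sum gives each lane its own slot. Names are illustrative:

```c
#include <stdint.h>

// Instead of one atomicAdd per lane, reduce the per-lane counts,
// do a single atomicAdd of the total, then hand each lane
// base + exclusive_prefix(count) as its private offset.
#define SUBGROUP_SIZE 32

static uint32_t amortized_alloc(uint32_t *global_counter,
                                const uint32_t lane_count[SUBGROUP_SIZE],
                                uint32_t lane_offset[SUBGROUP_SIZE])
{
    // subgroupAdd: total across all lanes.
    uint32_t total = 0;
    for (int lane = 0; lane < SUBGROUP_SIZE; lane++)
        total += lane_count[lane];

    // subgroupElect + atomicAdd: one atomic for the whole subgroup.
    uint32_t base = *global_counter;
    *global_counter += total;

    // subgroupExclusiveAdd: each lane's offset within the batch.
    uint32_t prefix = 0;
    for (int lane = 0; lane < SUBGROUP_SIZE; lane++) {
        lane_offset[lane] = base + prefix;
        prefix += lane_count[lane];
    }
    return total;
}
```

With a 32-wide subgroup this turns up to 32 contended atomics into one, which is exactly the compiler optimization the slide says may not happen automatically for increments other than 1.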
8-bit and 16-bit storage
• Storing intermediate data in 32-bit all the time would be wasteful
• Color might fit in uint8 * 4 (or pack manually in uint)
• Depth might fit in 16 bpp
• Coverage/AUX state in 8 bpp
• YMMV
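As one concrete example of 16-bit color storage, packing 8-bit-per-channel color into RGBA5551 (the N64's 16-bit framebuffer layout is 5/5/5/1). A sketch, ignoring dithering:

```c
#include <stdint.h>

// Pack 8-bit channels into a 16-bit RGBA5551 word: RRRRRGGGGGBBBBBA.
static uint16_t pack_rgba5551(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 3) << 6) |
                      ((b >> 3) << 1) | (a >> 7));
}

static void unpack_rgba5551(uint16_t c, uint8_t *r, uint8_t *g,
                            uint8_t *b, uint8_t *a)
{
    // Replicate top bits into the low bits so 0x1f expands to 0xff.
    uint8_t r5 = (c >> 11) & 0x1f;
    uint8_t g5 = (c >> 6) & 0x1f;
    uint8_t b5 = (c >> 1) & 0x1f;
    *r = (uint8_t)((r5 << 3) | (r5 >> 2));
    *g = (uint8_t)((g5 << 3) | (g5 >> 2));
    *b = (uint8_t)((b5 << 3) | (b5 >> 2));
    *a = (c & 1u) ? 0xff : 0;
}
```

Keeping intermediate tile storage in formats like this (or uint8 coverage) roughly halves the bandwidth of the split-shader architecture versus 32-bit everywhere.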

Mip-mapping / derivatives
• Simple
• Each thread works on a pixel group
• Good luck with that register pressure …
• Shared memory and barriers?
• Ugh
• subgroupShuffleXor
• NV_compute_shader_derivatives
• Widely supported on desktop, should be EXT!
• dFdx/dFdy/fwidth and implicit LOD all in compute
• Enables subgroupQuad operations
• Linear ID or 2x2 ID grid

Quad operations in compute are kinda nice

Async compute
• Pipeline is split in two
• Binning (+ shading if not using ubershader)
• COMPUTE queue
• Final ROP stage doing depth / blending
• GRAPHICS queue
• Only in-order part of pipeline
• Overlaps bandwidth-intensive work with ALU-intensive work
• Need to serialize if using off-screen rendering and consuming result next pass

Test implementation results
A fake retro GPU - RetroWarp
• Sandbox environment
• https://github.com/Themaister/RetroWarp
• Uses Granite as Vulkan backend
• Span equation based
• Bilinear / trilinear texture filtering
• Implemented manually, no texture()
• Perspective correct
• Trivial texture combiner
• 16-bit color w/ dither, 16-bit depth
• Fully fixed point
• ... except attribute interpolation (might need 64-bit ints)

Test scene
• Sponza
• 1280x720
• Overkill, 640x480 is more plausible
• 97309 primitives
• Post clipping
• Post back face culling
• N64 had ~5k primitive budget
• Everything is alpha blended
• Why not
• Didn’t sort triangles, looks funny

Results – GTX 1660 Ti
Options                            Ubershader (16x16)  Ubershader (8x8)  Split shader (16x16)  Split shader (8x8)
Subgroups ON, async compute OFF    7.7 ms              9.5 ms            9.0 ms                9.9 ms
Subgroups ON, async compute ON     7.2 ms              8.8 ms            8.2 ms                9.5 ms
Subgroups OFF, async compute OFF   10.6 ms             10.2 ms           9.3 ms                10.5 ms
Subgroups OFF, async compute ON    9.6 ms              9.2 ms            8.4 ms                9.5 ms

- Async compute helps
- Subgroup ops help a lot with the ubershader
- The ubershader is not uber enough to trigger issues
Results – RX 5700 XT - Windows
Options                            Ubershader (16x16)  Ubershader (8x8)  Split shader (16x16)  Split shader (8x8)
Subgroups ON, async compute OFF    5.5 ms              5.3 ms            6.3 ms                6.2 ms
Subgroups ON, async compute ON     4.6 ms              4.1 ms            5.0 ms                5.3 ms
Subgroups OFF, async compute OFF   7.8 ms              5.7 ms            6.3 ms                5.9 ms
Subgroups OFF, async compute ON    7.0 ms              4.5 ms            5.0 ms                5.0 ms

- Very similar story here, but 8x8 tiles win here.


- Gain in perf over GTX 1660 Ti correlates well with peak TFlops.

© Hans-Kristian Arntzen 39
Results – RX 5700 XT – RADV (LLVM)
Options                            Ubershader (16x16)  Ubershader (8x8)  Split shader (16x16)  Split shader (8x8)
Subgroups ON, async compute OFF    11.6 ms             9.9 ms            6.3 ms                5.9 ms
Subgroups ON, async compute ON     11.3 ms             9.3 ms            5.2 ms                5.1 ms
Subgroups OFF, async compute OFF   14.6 ms             10.6 ms           6.4 ms                6.4 ms
Subgroups OFF, async compute ON    14.3 ms             9.9 ms            5.5 ms                5.6 ms

- Here we see catastrophic failure when ubershader is too large.


- Happens eventually on any compiler …
- LLVM compiler got some catching up to do.
© Hans-Kristian Arntzen 40
Results – UHD 620 – Mesa ANV
Options                            Ubershader (16x16)  Ubershader (8x8)  Split shader (16x16)  Split shader (8x8)
Subgroups ON, async compute OFF    175 ms              155 ms            115 ms                113 ms
Subgroups ON, async compute ON     No change           No change         No change             No change
Subgroups OFF, async compute OFF   493 ms              216 ms            136 ms                131 ms
Subgroups OFF, async compute ON    No change           No change         No change             No change

- This is too much for integrated graphics.
- A more realistic resolution and geometry complexity improves it a lot.
- Subgroup ops help a lot.
Other implementations?
• High-Performance Software Rasterization on GPUs (2011)
• https://research.nvidia.com/sites/default/files/pubs/2011-08_High-Performance-Software-Rasterization/laine2011hpg_paper.pdf
• CUDA, highly optimized
• Has similar idea of coarse-then-fine binning
• paraLLEl-RDP (2016)
• Earlier attempt to emulate N64 RDP in Vulkan compute
• PCSX2 (2014)
• OpenCL
• https://github.com/PCSX2/pcsx2/pull/302

Conclusion
• Compute shaders are a viable alternative
• Allows accurate rendering in real-time
• Subgroup operations can be useful in unexpected places
• Data sharing without barriers is very nice
• Async compute + graphics queue compute is a thing
• Radeon GPU Analyzer is useful
• Verifying assumptions with ISA is great

Thanks!
@Themaister
themaister.net/blog
arntzen-software.no

