
Implementing Low Level GPU Hans Kristian Munich 2019

Hans-Kristian Arntzen discusses reimplementing the graphics pipelines of older console hardware like the Nintendo 64 using Vulkan compute shaders. He outlines some of the challenges in mapping the fixed-function rasterization of older GPUs to programmable compute shaders. These include implementing depth and blending operations, handling anti-aliasing, and dealing with arbitrary dependencies between pixels. The talk explores techniques like using compute shaders to rasterize primitives into tiles and splitting the rendering work across multiple specialized shaders.


Implementing old-school graphics chips with Vulkan compute
Hans-Kristian Arntzen
Khronos Meetup – Munich
2019-10-11
Content
• The problem space
• Compute shader rasterization
• Optimizing with Vulkan subgroups
• Test implementation

© Hans-Kristian Arntzen 2
Problem space
It’s just triangles, how hard can it be, right?

Old-school?
• Late 90s, early 2000s console graphics hardware is quirky
• Does not look anything like a modern GPU
• Understanding how legacy tech works is fun
• Nintendo 64 RDP as an example
• Reimplementing them is also “fun”
• Goal here is accurate software rendering
• … but on Vulkan compute!
• … because why not

High-level emulation
• Reinterpreting the intent of an application
• Almost exclusively used for N64 and beyond
• Higher internal resolution
• Rasterization and fragment pipeline
• Sacrifices accuracy for speed and “bling”
• Many challenges, but not what this talk is about

Low-level emulation
(N64 was known to be blurry for a reason …)
• Focus on recreating exact behavior
• Emulate what the GPU is doing in detail
• Usually only reserved for a CPU reference renderer
• Slow!
• Very specific “look and feel”

Old-school rasterization

Complex shapes from span equations

Triangle setup and CPU vertex processing
• Poly-counts were generally low
• Good use case for programmable co-processor / DSP
• GTE on PS1, RSP on N64, VU0/1 on PS2, etc …
• SW lighting, Gouraud shading
• A low-level emulation will usually consume triangle setup
• Precomputed interpolation equations
• Usually fixed point
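A minimal C sketch of what consuming precomputed fixed-point interpolation equations can look like. The 16.16 format and all names here are illustrative, not the RDP's actual formats:

```c
#include <stdint.h>

// Hypothetical 16.16 fixed-point plane equation: value = a*x + b*y + c.
// Real consoles use their own formats; this only illustrates the idea of
// per-primitive interpolants precomputed during triangle setup.
typedef struct {
    int64_t a, b, c; // 16.16 fixed-point coefficients
} PlaneEq;

static int32_t to_fixed(float f) { return (int32_t)(f * 65536.0f); }

// Derive a plane equation from an attribute's value at three vertices.
static PlaneEq setup_plane(float x0, float y0, float v0,
                           float x1, float y1, float v1,
                           float x2, float y2, float v2)
{
    float det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
    float da = ((v1 - v0) * (y2 - y0) - (v2 - v0) * (y1 - y0)) / det;
    float db = ((v2 - v0) * (x1 - x0) - (v1 - v0) * (x2 - x0)) / det;
    float dc = v0 - da * x0 - db * y0;
    PlaneEq eq = { to_fixed(da), to_fixed(db), to_fixed(dc) };
    return eq;
}

// Evaluate at integer pixel coordinates; result is 16.16 fixed-point.
static int64_t eval_plane(const PlaneEq *eq, int x, int y)
{
    return eq->a * x + eq->b * y + eq->c;
}
```

The rasterizer then only ever evaluates these integer equations per pixel, which is why emulation can consume triangle setup directly.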

Rasterization pipeline problems
• Rasterization is one of the last major fixed function blocks in modern GPUs
• Hi-Z and early-ZS is key for high performance, well suited for fixed HW
• Rasterization rules
• Workaround: manual test -> discard -> late-Z for everything
• Blending
• Programmable blending is inefficient on desktop – fine on mobile though!
• Depth/stencil buffer
• Programmable depth/stencil testing is even worse – and terrible on mobile as well
• Anti-aliasing
• This gets extremely weird …
• Memory aliasing
• fragment_shader_interlock is the current “solution”
• Throwing performance out the window is not fun 

Anti-aliasing problems
• AA in N64 is notoriously weird
• Fixed function MSAA does not map to anything
• N64 feeds coverage data into post-AA in video scanout

Coverage counter + scanout filter

Correlated coverage guides post-AA

The fragment interlock hammer
• We can have programmable everything in fragment with interlocks
• Take a mutex per pixel with some hardware assistance
• Lock must be in API order => ordered interlock
• Can be extremely slow …
• In order semantics + locks => recipe for performance horror
• Spotty support
• Generally considered an extremely obscure feature
• Alternatives?
• Atomic linked list of pixels, incredibly difficult to pull off in emulation use case
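A single-threaded C model of the atomic linked-list alternative mentioned above. On a GPU the head update would be an atomicExchange and the allocation an atomicAdd; all names and sizes here are illustrative:

```c
#include <stdint.h>

// Per-pixel fragment linked list (as used in order-independent techniques).
// Single-threaded model: shows the data structure, not the atomics.
#define MAX_FRAGMENTS 1024
#define TILE_DIM 64

typedef struct {
    uint32_t color;
    uint32_t depth;
    int32_t  next;   // index of next fragment for the same pixel, -1 = end
} Fragment;

typedef struct {
    Fragment pool[MAX_FRAGMENTS];
    int32_t  count;
    int32_t  head[TILE_DIM * TILE_DIM]; // one list head per pixel
} FragmentList;

static void fl_init(FragmentList *fl)
{
    fl->count = 0;
    for (int i = 0; i < TILE_DIM * TILE_DIM; i++)
        fl->head[i] = -1;
}

// Append a fragment for pixel (x, y), pushing onto the pixel's chain.
static void fl_push(FragmentList *fl, int x, int y,
                    uint32_t color, uint32_t depth)
{
    int32_t idx = fl->count++;                   // atomicAdd on GPU
    fl->pool[idx].color = color;
    fl->pool[idx].depth = depth;
    fl->pool[idx].next  = fl->head[y * TILE_DIM + x]; // atomicExchange on GPU
    fl->head[y * TILE_DIM + x] = idx;
}

// Walk one pixel's chain to count fragments stored for it.
static int fl_depth_complexity(const FragmentList *fl, int x, int y)
{
    int n = 0;
    for (int32_t i = fl->head[y * TILE_DIM + x]; i != -1; i = fl->pool[i].next)
        n++;
    return n;
}
```

The difficulty in the emulation use case is that the lists must later be resolved in exact API order per pixel, and worst-case memory is unbounded.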

Rethinking the problem
• Fragment shader
• Start with fast, fixed function API interfaces
• Add a million hacks and workarounds to make it work
• Compute shader
• Abandon fixed function restrictions
• Build from ground up, full implementation flexibility
• Tune for relevant constraints
• More future looking?

Unsolvable problems?
• Any multi-threaded emulation will fail on these, not just GPU
• Cycle accuracy
• Texture cache behavior
• N64 has a programmer-maintained 4 KiB texture cache though …
• Generally not important for correctness
• Obscure cycle timing on CPU is a thing
• Not really on GPU
• Arbitrary cross-pixel dependencies
• Aliasing color / depth not 1:1
• Is this a thing?

Compute shader rasterization
Gotta use those teraflops for something
The render loop
• Basic CPU software rendering is super easy
• foreach (prim in primitives) render_all_pixels_in(prim)
• Need to feed compute with a massive number of threads
• Naïve CPU loops are not MT friendly
• Common solution is going tile based
• foreach (tile) foreach (prim covering tile) render(tile, prim)
• Lots of techniques for compute in this domain
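The tile-based loop above can be sketched as scalar C. The axis-aligned "primitive" and all names are simplifications for illustration; on the GPU each tile would map to a workgroup:

```c
#include <stdint.h>

// Scalar model of "foreach (tile) foreach (prim covering tile) render(tile, prim)".
// Tiles are independent, which is what makes this friendly to massive threading.
#define TILE 8
#define FB_W 32
#define FB_H 32

typedef struct { int x0, y0, x1, y1; uint32_t color; } Prim; // AABB stand-in

static int prim_covers_tile(const Prim *p, int tx, int ty)
{
    int x0 = tx * TILE, y0 = ty * TILE;
    int x1 = x0 + TILE - 1, y1 = y0 + TILE - 1;
    return p->x0 <= x1 && p->x1 >= x0 && p->y0 <= y1 && p->y1 >= y0;
}

static void render_tiles(uint32_t *fb, const Prim *prims, int num_prims)
{
    for (int ty = 0; ty < FB_H / TILE; ty++)
        for (int tx = 0; tx < FB_W / TILE; tx++)
            for (int i = 0; i < num_prims; i++) {
                if (!prim_covers_tile(&prims[i], tx, ty))
                    continue;
                // Render only this tile's pixels covered by the primitive.
                for (int y = ty * TILE; y < (ty + 1) * TILE; y++)
                    for (int x = tx * TILE; x < (tx + 1) * TILE; x++)
                        if (x >= prims[i].x0 && x <= prims[i].x1 &&
                            y >= prims[i].y0 && y <= prims[i].y1)
                            fb[y * FB_W + x] = prims[i].color;
            }
}
```

Note that within one tile, primitives are still visited in submission order, which preserves API ordering per pixel.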

The naïve compute ubershader

int x = coord_for_thread().x;
int y = coord_for_thread().y;
Color color = load_color_framebuffer(x, y);
Depth depth = load_depth_framebuffer(x, y);

for (auto &setup : primitive_setups)
{
    // Very few pixels pass this test.
    if (!test_rasterization_coverage(setup, x, y))
        continue;

    // Expect more like 2000 lines of code here.
    Color rgba = interpolate_rgba(setup, x, y);
    UV uv = interpolate_uv(setup, x, y);
    Color samp = sample_texture(setup.texinfo, uv);
    Color combined_color = combiner(rgba, samp, setup.comb);

    // Fully programmable blending and ROP, in GPU registers!
    depth_stage(depth, interpolate_z(setup, x, y));
    blend_stage(color, combined_color);
}

// HOST_VISIBLE buffer if CPU and GPU need to be coherent (N64 >_<) ...
store_color_framebuffer(x, y, color);
store_depth_framebuffer(x, y, depth);
Tile binning to bitmasks
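One possible C sketch of binning into hierarchical bitmasks (one bit per primitive, plus one coarse bit per 32-bit word), matching the bitscan loop on the next slide. Sizes and names are illustrative:

```c
#include <stdint.h>

// Hierarchical bitmask binning: mask[] has one bit per primitive,
// coarse_mask[] has one bit per 32-bit word of mask[], so empty batches
// of 32 (and 1024) primitives can be skipped wholesale while iterating.
#define MAX_PRIMS 16384

typedef struct {
    uint32_t mask[MAX_PRIMS / 32];           // 1 bit per primitive
    uint32_t coarse_mask[MAX_PRIMS / 1024];  // 1 bit per mask[] word
} TileBin;

static void bin_primitive(TileBin *bin, int prim_index)
{
    int word = prim_index / 32;
    bin->mask[word] |= 1u << (prim_index & 31);
    bin->coarse_mask[word / 32] |= 1u << (word & 31);
}

// Visit every binned primitive via the two-level bitscan
// (__builtin_ctz is the CPU analogue of findLSB). Returns the count.
static int count_binned(const TileBin *bin, int num_prims)
{
    int count = 0;
    int num_words = (num_prims + 31) / 32;
    int num_coarse = (num_words + 31) / 32;
    for (int c = 0; c < num_coarse; c++) {
        uint32_t cmask = bin->coarse_mask[c];
        while (cmask) {
            int word = c * 32 + __builtin_ctz(cmask);
            cmask &= cmask - 1; // clear lowest set bit
            uint32_t fmask = bin->mask[word];
            while (fmask) {
                int prim = word * 32 + __builtin_ctz(fmask);
                fmask &= fmask - 1;
                (void)prim; // a renderer would test and render here
                count++;
            }
        }
    }
    return count;
}
```

The bitmask form keeps per-tile storage fixed regardless of how many primitives land in a tile, unlike a flat triangle-ID list.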

Better bitscan loops

int num_prims_32 = (num_prims + 31) / 32;
int num_prims_1024 = (num_prims_32 + 31) / 32;

// All the loops are dynamically uniform. GPU is happy ☺
// Skip over huge batches of primitives.
for (coarse_index in num_prims_1024)
{
    foreach_bit(coarse_bit, coarse_mask[tile][coarse_index])
    {
        int mask_index = coarse_bit + coarse_index * 32;
        foreach_bit(bit, mask[tile][mask_index])
        {
            int primitive_index = bit + mask_index * 32;
            // Now we test and render.
        }
    }
}

// foreach_bit maps to findLSB / ctz / etc – a single instruction per step.
// Alternative: flat list of triangle IDs, but massive VRAM requirement
// in the worst-case scenario.
To ubershade or not to ubershade
• Even ancient GPUs have a lot of state bits
• Primitive type?
• Texture combiner state?
• Blending modes?
• Depth modes?
• And more …
• Compute kernel to render a tile needs to handle everything
• Insane register pressure => poor occupancy => poor performance
• At least the branches are dynamically uniform ☺
• Might be best solution if configuration space is small
• Bindless resources

Splitting up the ubermonster
• What if the per-tile kernel consumed pre-shaded tiles?
• Reduce the ubershader to only deal with blending and depth-testing
• Distribute chunks of work tagged with (tile + primitive index)
• One vkCmdDispatchIndirect per variant
• Can assign different specialized shaders to deal with different render state
• Specialization constants are perfect here
• Need to allocate storage space for color/depth/etc
• Intense on bandwidth, but gotta use that GDDR6 for something, right?
• Could be a reasonable tradeoff for 240p / 480p content
• Callable shaders would be nice …

Split architecture

(Diagram: low-res binning → binning → shading → blend / depth, writing to VRAM)

• Position data (SSBO) – 16k entries
• Attribute data (SSBO) – 16k entries
• Primitive state descriptor index (UBO) – 16k entries
• State descriptors (UBO) – 1k entries
Advanced Vulkan features
Now this is where it gets interesting
Subgroup operations
• New feature in Vulkan 1.1
• Share data between threads efficiently
• Peeks into a world where threads execute in lock-step in a SIMD fashion
• Alternative is going through shared memory + barrier()
in int gl_VertexIndex;
void main()
{
    float v = float(gl_VertexIndex);
}
// Isolated threads: shader code is nice and scalar.

in intNx32_t gl_VertexIndex;
void main()
{
    floatNx32_t v = floatNx32_t(gl_VertexIndex);
}
// Subgroups: represents a SIMD unit – the ISPC paradigm.
Wrapping your head around subgroups
• A “thread” -> “lane in a SIMD vector register”
• Branches -> “make lanes active or inactive”
• All values are the same in subgroup -> “scalar register”
• Shading languages do not express this well
• Many recent graphics talks will mention it
• Console optimization
• https://www.khronos.org/blog/vulkan-subgroup-tutorial
• https://www.khronos.org/assets/uploads/developers/library/2018-vulkan-devday/06-subgroups.pdf
subgroupBallot()

bool primitive_is_binned = test_primitive(thread_index);
// Reduce a boolean test across all threads to a bitmap, nice!
uvec4 ballot = subgroupBallot(primitive_is_binned);

if (subgroupElect())
{
    // Only one thread needs to write.
    if (gl_SubgroupSize == 32)
        write_u32_mask(ballot.x);
    else if (gl_SubgroupSize == 64)
        write_u32x2_mask(ballot.xy);
}

// Legacy way? atomicOr on shared memory or reduction passes <_<
VK_EXT_subgroup_size_control
• Not all GPUs have a fixed subgroup size
• Compilers can vary subgroup sizes
• Intel (8, 16 or 32) - May even vary within a dispatch
• AMD pre-Navi (64 only)
• AMD Navi (32 or 64)
• NVIDIA (32 only)
• gl_SubgroupSize builtin is fixed, but lanes might disappear
• This is totally fine for many use cases of subgroups, but …
• VARYING_SIZE_BIT, FULL_GROUP_BIT and subgroupSizeRequirement
• Critical extension to use subgroups well on Intel

Subgroup atomic amortization
• Distributing work on GPU typically means atomic increments
• Atomics are expensive
• Amortize atomics overhead
• Most compilers do this when incrementing a fixed address by 1
• May or may not when incrementing by != 1

Arithmetic subgroup operations

// Merge atomic adds
uint bit_count = bitCount(binned_mask);

// Reduce in registers
uint total_bit_count = subgroupAdd(bit_count);
uint offset = 0u;
if (subgroupElect())
    offset = atomicAdd(counts, total_bit_count);
offset = subgroupBroadcastFirst(offset);
offset += subgroupExclusiveAdd(bit_count);

// Legacy?
// Prefix sum in shared memory and then atomic.
// Write result to shared, then broadcast.
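The same amortization pattern, modeled in scalar C: one "atomic" per subgroup plus an exclusive prefix sum gives each lane its own slot. Names are illustrative:

```c
#include <stdint.h>

// Instead of one atomicAdd per lane, reduce the per-lane counts,
// do a single atomicAdd of the total, then hand each lane
// base + exclusive_prefix(count) as its private offset.
#define SUBGROUP_SIZE 32

static uint32_t amortized_alloc(uint32_t *global_counter,
                                const uint32_t lane_count[SUBGROUP_SIZE],
                                uint32_t lane_offset[SUBGROUP_SIZE])
{
    // subgroupAdd: total across all lanes.
    uint32_t total = 0;
    for (int lane = 0; lane < SUBGROUP_SIZE; lane++)
        total += lane_count[lane];

    // subgroupElect + atomicAdd: one atomic for the whole subgroup.
    uint32_t base = *global_counter;
    *global_counter += total;

    // subgroupExclusiveAdd: each lane's offset within the batch.
    uint32_t prefix = 0;
    for (int lane = 0; lane < SUBGROUP_SIZE; lane++) {
        lane_offset[lane] = base + prefix;
        prefix += lane_count[lane];
    }
    return total;
}
```

With a 32-wide subgroup this turns up to 32 contended atomics into one, which is exactly the compiler optimization the slide says may not happen automatically for increments other than 1.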
8-bit and 16-bit storage
• Storing intermediate data in 32-bit all the time would be wasteful
• Color might fit in uint8 * 4 (or pack manually in uint)
• Depth might fit in 16 bpp
• Coverage/AUX state in 8 bpp
• YMMV
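As one concrete example of 16-bit color storage, packing 8-bit-per-channel color into RGBA5551 (the N64's 16-bit framebuffer layout is 5/5/5/1). A sketch, ignoring dithering:

```c
#include <stdint.h>

// Pack 8-bit channels into a 16-bit RGBA5551 word: RRRRRGGGGGBBBBBA.
static uint16_t pack_rgba5551(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 3) << 6) |
                      ((b >> 3) << 1) | (a >> 7));
}

static void unpack_rgba5551(uint16_t c, uint8_t *r, uint8_t *g,
                            uint8_t *b, uint8_t *a)
{
    // Replicate top bits into the low bits so 0x1f expands to 0xff.
    uint8_t r5 = (c >> 11) & 0x1f;
    uint8_t g5 = (c >> 6) & 0x1f;
    uint8_t b5 = (c >> 1) & 0x1f;
    *r = (uint8_t)((r5 << 3) | (r5 >> 2));
    *g = (uint8_t)((g5 << 3) | (g5 >> 2));
    *b = (uint8_t)((b5 << 3) | (b5 >> 2));
    *a = (c & 1u) ? 0xff : 0;
}
```

Keeping intermediate tile storage in formats like this (or uint8 coverage) roughly halves the bandwidth of the split-shader architecture versus 32-bit everywhere.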

Mip-mapping / derivatives
• Simple
• Each thread works on a pixel group
• Good luck with that register pressure …
• Shared memory and barriers?
• Ugh
• subgroupShuffleXor
• NV_compute_shader_derivatives
• Widely supported on desktop, should be EXT!
• dFdx/dFdy/fwidth and implicit LOD all in compute
• Enables subgroupQuad operations
• Linear ID or 2x2 ID grid

Quad operations in compute are kinda nice

Async compute
• Pipeline is split in two
• Binning (+ shading if not using ubershader)
• COMPUTE queue
• Final ROP stage doing depth / blending
• GRAPHICS queue
• Only in-order part of pipeline
• Overlaps bandwidth-intensive work with ALU-intensive work
• Need to serialize if using off-screen rendering and consuming result next pass

Test implementation results
A fake retro GPU - RetroWarp
• Sandbox environment
• https://github.com/Themaister/RetroWarp
• Uses Granite as Vulkan backend
• Span equation based
• Bilinear / trilinear texture filtering
• Implemented manually, no texture()
• Perspective correct
• Trivial texture combiner
• 16-bit color w/ dither, 16-bit depth
• Fully fixed point
• ... except attribute interpolation (might need 64-bit ints)

Test scene
• Sponza
• 1280x720
• Overkill, 640x480 is more plausible
• 97309 primitives
• Post clipping
• Post back face culling
• N64 had ~5k primitive budget
• Everything is alpha blended
• Why not
• Didn’t sort triangles, looks funny

Results – GTX 1660 Ti
Options                            Ubershader (16x16)  Ubershader (8x8)  Split shader (16x16)  Split shader (8x8)
Subgroups ON, async compute OFF    7.7 ms              9.5 ms            9.0 ms                9.9 ms
Subgroups ON, async compute ON     7.2 ms              8.8 ms            8.2 ms                9.5 ms
Subgroups OFF, async compute OFF   10.6 ms             10.2 ms           9.3 ms                10.5 ms
Subgroups OFF, async compute ON    9.6 ms              9.2 ms            8.4 ms                9.5 ms

- Async compute helps
- Subgroup ops help a lot with the ubershader
- The ubershader is not uber enough to trigger issues
Results – RX 5700 XT - Windows
Options                            Ubershader (16x16)  Ubershader (8x8)  Split shader (16x16)  Split shader (8x8)
Subgroups ON, async compute OFF    5.5 ms              5.3 ms            6.3 ms                6.2 ms
Subgroups ON, async compute ON     4.6 ms              4.1 ms            5.0 ms                5.3 ms
Subgroups OFF, async compute OFF   7.8 ms              5.7 ms            6.3 ms                5.9 ms
Subgroups OFF, async compute ON    7.0 ms              4.5 ms            5.0 ms                5.0 ms

- Very similar story here, but 8x8 tiles win here.


- Gain in perf over GTX 1660 Ti correlates well with peak TFlops.

© Hans-Kristian Arntzen 39
Results – RX 5700 XT – RADV (LLVM)
Options                            Ubershader (16x16)  Ubershader (8x8)  Split shader (16x16)  Split shader (8x8)
Subgroups ON, async compute OFF    11.6 ms             9.9 ms            6.3 ms                5.9 ms
Subgroups ON, async compute ON     11.3 ms             9.3 ms            5.2 ms                5.1 ms
Subgroups OFF, async compute OFF   14.6 ms             10.6 ms           6.4 ms                6.4 ms
Subgroups OFF, async compute ON    14.3 ms             9.9 ms            5.5 ms                5.6 ms

- Here we see catastrophic failure when ubershader is too large.


- Happens eventually on any compiler …
- LLVM compiler got some catching up to do.
© Hans-Kristian Arntzen 40
Results – UHD 620 – Mesa ANV
Options                            Ubershader (16x16)  Ubershader (8x8)  Split shader (16x16)  Split shader (8x8)
Subgroups ON, async compute OFF    175 ms              155 ms            115 ms                113 ms
Subgroups ON, async compute ON     No change           No change         No change             No change
Subgroups OFF, async compute OFF   493 ms              216 ms            136 ms                131 ms
Subgroups OFF, async compute ON    No change           No change         No change             No change

- This is too much for integrated graphics.
- A more realistic resolution and geometry complexity improves it a lot.
- Subgroup ops help a lot.
Other implementations?
• High-Performance Software Rasterization on GPUs (2011)
• https://research.nvidia.com/sites/default/files/pubs/2011-08_High-Performance-Software-Rasterization/laine2011hpg_paper.pdf
• CUDA, highly optimized
• Has similar idea of coarse-then-fine binning
• paraLLEl-RDP (2016)
• Earlier attempt to emulate N64 RDP in Vulkan compute
• PCSX2 (2014)
• OpenCL
• https://github.com/PCSX2/pcsx2/pull/302

Conclusion
• Compute shaders are a viable alternative
• Allows accurate rendering in real-time
• Subgroup operations can be useful in unexpected places
• Data sharing without barriers is very nice
• Async compute + graphics queue compute is a thing
• Radeon GPU Analyzer is useful
• Verifying assumptions with ISA is great

Thanks!
@Themaister
themaister.net/blog
arntzen-software.no

