10 Years of PhyreEngine™

Almost exactly 10 years ago, at a global SCE R&D meeting following the SIGGRAPH graphics conference, the decision was made to start work on a graphics engine for PlayStation 3.

In the beginning

Initially codenamed Pootle, the aim was to provide a graphics engine for the upcoming PlayStation 3 with the intention of being used as both an engine for games development and a technical reference for future PlayStation 3 developers. Research began in SCEI’s European R&D team on the best ways to take advantage of the platform’s unique features such as the Synergistic Processing Units (SPUs – blindingly fast processors that need to be well treated to obtain maximum performance).

From the start there were several goals for the project:

  • It would be given away for free to all licensed PlayStation 3 developers to ensure all developers could access it.
  • It would be provided as source, allowing developers to see how it works for research and debugging, to take any parts they need, or to add new functionality.
  • It would be cross-platform, at least in terms of licensing, so that developers could use it without having to exclusively target PlayStation 3 (to this day developers are still surprised we allow this).

To allow this last feature, the source was written to support multiple platforms including PC. Providing support for PC also meant that developers could use familiar PC tools to get started, transitioning to PlayStation-specific tools to further tailor their title to the platform. Later we also decided to include 'game templates': more fully featured example code, representative of particular game genres, with all of the artwork included to show how it's done.

Introduced to the world

PhyreEngine was first shown to the wider world at the 2006 Develop conference in Brighton, at that stage still referred to as "PSSG" (named in line with other SDK components), where I presented the development process of one of our game templates to an audience mostly populated with the PhyreEngine faithful.

It next surfaced at the Game Developers Conference (GDC) (slides), where we re-launched the engine with a new, more memorable name and a lot of enhancements. PhyreEngine took PSSG and extended it from a graphics engine to a game engine by adding common gameplay features such as physics, audio, and terrain. This coincided nicely with Chris Eden's presentation "Making Games for the PlayStation Network", combining the low cost of PlayStation 3 debug kits with the free code and examples that you need to get started writing a game.

The following year, PhyreEngine returned to GDC to announce PhyreEngine 2.4.0. This was the first time we were able to gather a long list of very happy PhyreEngine users to share their experiences with the engine. Along with the developers using PhyreEngine directly for their titles, we also heard from CTOs who used PhyreEngine for reference. This was highlighted in Matt Swoboda's talk "Deferred Lighting and Post Processing on PlayStation 3", showing advanced research on the use of the SPUs to accelerate post-processing techniques – techniques which are now the backbone of many other rendering engines.

New platforms

2010 saw the release of PhyreEngine for PSP, bringing the engine to that platform in response to interest from the PSP development community. Matt came back to GDC in 2011 to introduce PhyreEngine 3.0. This newer version was better designed for modern multicore architectures, focusing on PS Vita and laying the groundwork for PlayStation 4, while taking the best parts of the PlayStation 3 support from PhyreEngine 2. The presentation also dived deep into some of the new technology and showed our latest game template, an indoor game using the new navigation and AI scripting features, with rendering techniques that allowed us to reproduce the same beautiful image on PlayStation 3 and PS Vita.

At GDC 2013 this March we announced PhyreEngine 3.5. This was the third release of PhyreEngine 3 to support PlayStation 4, and our cross-platform approach meant that any developers already using PhyreEngine 3 on PlayStation 3 or PS Vita could take their title to PlayStation 4 with minimal changes. We were lucky to have worked in collaboration with other SDK teams, which allowed us both to provide feedback and to develop early versions of PhyreEngine that could be used by other developers with early access to the platform.

We’ve also been working with the PS.First team to provide PhyreEngine to universities and academic groups. This is a video from Birmingham University’s GamerCamp from the past year using PhyreEngine.

The numbers so far

At the time of writing, PhyreEngine has been used in at least 130 games released on PlayStation consoles alone. I say “at least” because the PhyreEngine license is very liberal, and developers don’t even have to tell us they’re using it, let alone include a credit. These 130 titles come from at least 58 different studios, and more than 11 of those studios have released 4 or more games. There’s also a fair split between retail and digital, with 61% of titles being digital-only. This does not include any titles from developers who have taken advantage of our open approach and utilised components of PhyreEngine in their own engines. These games cover a wide range of genres and platforms (indeed, many of the titles appear on multiple platforms), and we’re proud of the tiny role we’ve had in each and every one of them.

The future

PhyreEngine supported PS4 from one of the earliest pre-release SDKs, which allowed it to form the graphical base for the IDU (interactive display unit) software that will be used in shops around the world to showcase launch games, as well as for at least six games being released during the initial launch window. One already announced is Secret Ponchos from Switchblade Monkeys – hopefully we’ll be able to introduce more of them sometime soon! We currently estimate another 50 titles are in development, so we expect to be busy for quite a while.

Thanks

We’d like to thank our developer community for all the great games they’ve made with PhyreEngine over the years, and we hope to see many more in the future. You guys are awesome – and probably a little bit crazy – and we love you all.

HPG 2013

This year HPG took place in Anaheim on July 19th-21st, co-located with SIGGRAPH and running just before it. The program is here.

Friday July 19

Advanced Rasterization

Moderator: Charles Loop, Microsoft Research

Theory and Analysis of Higher-Order Motion Blur Rasterization Site  Slides
Carl Johan Gribel, Lund University; Jacob Munkberg, Intel Corporation; Jon Hasselgren, Intel Corporation; Tomas Akenine-Möller, Lund University/Intel Corporation

The conference started with a return to Intel’s work on higher-order motion blur rasterization. The presentation highlighted that motion is typically curved rather than linear and is therefore better represented by quadratics. The next part showed how to change the common types of traversal to handle this curved motion. The presenter demonstrated interval- and tile-based methods and how to extend them to handle quadratic motion. This section introduced Subdividable Linear Efficient Function Enclosures (SLEFEs), which I’d not heard of before. SLEFEs give tighter bounds on a function over an interval than the convex hull of the control points that you’d typically use – definitely something to look at later.
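
For reference, and assuming the standard Bézier form rather than whatever parameterisation the paper actually uses, quadratic per-vertex motion over the shutter interval can be written as:

```latex
% Quadratic motion of a vertex over the shutter interval t in [0,1], in Bezier form.
% Linear motion is the special case q_1 = (q_0 + q_2) / 2.
\mathbf{p}(t) \;=\; (1-t)^2\,\mathbf{q}_0 \;+\; 2t(1-t)\,\mathbf{q}_1 \;+\; t^2\,\mathbf{q}_2
```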

PixelPie: Maximal Poisson-disk Sampling with Rasterization Paper Slides (should be)
Cheuk Yiu Ip, University of Maryland, College Park; M. Adil Yalçi, University of Maryland, College Park; David Luebke, NVIDIA Research; Amitabh Varshney, University of Maryland, College Park

All Poisson-disk sampling talks start with a discussion of the basic dart-throwing and rejection based implementation first put forward in 1986, before going into the details of their own implementation. The contribution of this talk was the idea of using rasterization to maintain the minimum distance requirement. This is handled by rendering disks which will occlude each other if overlapping, where overlapping means too close – simple but effective. Of course there’s a couple of issues. Firstly there’s some angular bias due to the rasterization if the radius is small because of the projection of the disk’s edge to the pixels. The other problem was that even once you have a good set of initial points, there’s extra non-rasterization compute work to handle the empty space via stream compaction. One extra feature you get cheaply is support for importance sampling since you can change the size of each disk based on some additional input. This was shown by using the technique to select points that map to features on images – something I’d not seen before.
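
Since every talk in this area measures itself against the same 1986-style dart-throwing baseline, a minimal sketch of that baseline may help frame the contribution. The grid acceleration and all names below are mine, not the paper’s:

```python
import random
import math

def dart_throwing(width, height, radius, max_attempts=100000):
    """Classic rejection sampling: throw random darts and keep a dart only if
    it is at least `radius` away from every accepted point. A background grid
    with cells of size radius/sqrt(2) keeps each lookup to a small
    neighbourhood instead of a scan over all accepted points."""
    cell = radius / math.sqrt(2.0)
    cols, rows = int(width / cell) + 1, int(height / cell) + 1
    grid = [[None] * cols for _ in range(rows)]
    points = []

    for _ in range(max_attempts):
        x, y = random.uniform(0, width), random.uniform(0, height)
        cx, cy = int(x / cell), int(y / cell)
        # Check the 5x5 neighbourhood of cells for a point that is too close.
        ok = True
        for gy in range(max(0, cy - 2), min(rows, cy + 3)):
            for gx in range(max(0, cx - 2), min(cols, cx + 3)):
                p = grid[gy][gx]
                if p is not None and (p[0] - x) ** 2 + (p[1] - y) ** 2 < radius ** 2:
                    ok = False
                    break
            if not ok:
                break
        if ok:
            grid[cy][cx] = (x, y)
            points.append((x, y))
    return points

print(len(dart_throwing(100.0, 100.0, 2.0)))
```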

Out-of-Core Construction of Sparse Voxel Octrees Paper Slides
Jeroen Baert, Department of Computer Science, KU Leuven; Ares Lagae, Department of Computer Science, KU Leuven; Philip Dutré, Department of Computer Science, KU Leuven

The fundamental contribution from this talk was the use of Morton ordering when partitioning the mesh to minimize the amount of local memory needed when voxelising. One interesting side effect of this memory reduction is improved locality, resulting in faster voxelization – in the example cases, the tests with 128MB were quicker than those with 1GB or 4GB. The laid-back nature of the presenter and the instant results made it feel like you could go and implement it right now, but then the source was made available, taking the fun out of that idea!
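
For anyone who hasn’t met Morton (Z-order) codes before, they are just bit interleaving; a minimal 3D sketch (the bit-twiddling constants are the standard ones, the function names are mine) looks like this:

```python
def part1by2(x):
    """Spread the lowest 10 bits of x so there are two zero bits between
    each original bit (the standard trick for building 3D Morton codes)."""
    x &= 0x000003FF
    x = (x | (x << 16)) & 0xFF0000FF
    x = (x | (x << 8))  & 0x0300F00F
    x = (x | (x << 4))  & 0x030C30C3
    x = (x | (x << 2))  & 0x09249249
    return x

def morton3d(x, y, z):
    """Interleave three 10-bit grid coordinates into a 30-bit Morton code;
    sorting triangles by this code walks the grid along the Z-order curve."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# Nearby cells get nearby codes, which is where the locality win comes from.
print(morton3d(1, 0, 0), morton3d(0, 1, 0), morton3d(0, 0, 1), morton3d(1, 1, 1))
```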

Shadows

Moderator: Samuli Laine, NVIDIA Research

Screen-Space Far-Field Ambient Obscurance Slides Site including source Paper (Video)
Ville Timonen, Åbo Akademi University

The first thing to note is the difference between occlusion and obscurance; obscurance includes a falloff term such as a distance weight. The aim is to find a technique that can operate over greater distances, and the talk highlighted the issues with previous techniques: direct sampling misses important values, while the alternative of mipmapping the average, minimum or maximum depth results in either flattening, or over- or under-occlusion. The contribution of this talk was to focus on the details important for AO by scanning the depth map in multiple directions. This information is then converted into prefix sums to easily get the range of important height samples across a sector. The results of the technique were shown to be closer to ray traces of a depth buffer than the typical mipmap technique. One other thing I noticed was the use of a 10% guard band, so from 1280×720 (921600 pixels) to 1536×864 (1327104), a 44% increase in pixels! Another useful result was a comment from the presenter that it’s better to treat possibly occluding surfaces as a thin shell rather than a full volume, since the eye notices incorrect shadowing before incorrect lighting.
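
That 44% figure checks out: a 10% guard band on each side scales each dimension by 1.2, so

```latex
1280 \times 1.2 = 1536, \qquad 720 \times 1.2 = 864, \qquad
\frac{1536 \times 864}{1280 \times 720} \;=\; 1.2^2 \;=\; 1.44 .
```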

Imperfect Voxelized Shadow Volumes Paper
Chris Wyman, NVIDIA; Zeng Dai, University of Iowa

The aim of this paper was interactive performance or better when generating a shadow volume per virtual point light (VPL) on an area light. The initial naive method, one voxelized shadow volume per point light, ran at less than 1 FPS. The problem is how to handle many VPLs. The first part of the solution is imperfect shadow maps (ISMs), a technique for calculating and storing lots of small shadow maps generated from point splats within the scene, with the gaps filled in (area lights are actually described as another application in the ISM paper). After creating an ISM, each shadow sub-map can be processed in parallel. The results looked good with a lot of maps, and there’s the ability to balance the number of maps against their required size in the ISM. For example, a sharper point light could use the entire ISM space for a single sharp map, but a more diffuse light with many samples could pack many smaller maps into the ISM.

Panel: High-Performance Graphics in Film

Moderator: Matt Pharr

Dreamworks, Eric Tabellion; Weta Digital, Luca Fascione; Disney Animation, David Adler / Rasmus Tamstorf; Solid Angle, Thiago Ize / Marcos Fajardo

Introductions:

Disney
Use OpenGL in some preview tools
Major GPU challenges are development and deployment
They are interested in the use of compute and are hiring a research scientist

PDI Dreamworks
OpenGL display pipeline for tools
Useful for early iterations
Also mentioned Amorphous – An OpenGL Sparse Volume Renderer

Weta
Highlighted that the production flow included kickback loop where everything fed back to an earlier stage
Not seeing GPU as an option

Arnold
Long code life – can’t be updated to each new language/driver
Reuse of hardware too
Highlighted that 6GB GPUs cost $2k (and I was thinking a PS4 was much less than that and had more memory)
Preview lighting must be accurate including errors of final render

Questions: (replies annotated with speaker/company where possible)

How much research is reused in Film?
Tabellion: The relevant research is used.
Disney: Other research used, not just rendering i.e. physics
Thiago: Researchers need access to data
Kayvon: Providing content to researchers has come up before. And the access to the environment too – lots of CPUs.
Tabellion: Feels that focus on research may be more towards games at HPG
Need usable licenses and no patents
Lots of work focused on polys and not on curves
Need to consider performance and memory usage of larger solutions

Convergence between films and games
Tabellion: Content production – game optimize for scene, film is many artists in parallel with no optimisation
Rasmus: Both seeing complexity increase
Weta: More tracking than convergence. Games have to meet hard limit of frame time

Discussion of Virtual Production
Real time preview of mocap in scene
With moveable camera tracked in the stage

Separate preview renderer?
Have to maintain 2 renderers
Using same [huge] assets – sometimes not just slow to render but load too
Difficult to match final in real time now moving to GI and ray tracing

Work to optimise management of data
Lots of render nodes want the same data
Disney: Just brute forces it
Weta: Don’t know of scheduler that knows about the data required. Can solve abstractly but not practically. Saw bittorrent-like example.

What about exploiting coherence?
Some renders could take 6-10 hours, but need the result next day so can’t try putting two back-to-back

Do you need all of the data all of the time? Could you tile the work to be done?
Not in Arnold – need all of the data for possible intersections
Needs pipeline integration, render management

Example of non-watertight geometry – the Arnold solution was posted to JCGT (Robust BVH Ray Traversal)
Missed ray intersections can add minutes of pre-processing and gigs of memory

Double precision?
Due to some hacks when using floats, you could have done it just as fast in double instead
Arnold: Referred to JCGT paper
Disney: Don’t have to think when using doubles
Tabellion: Work in camera space or at focal point
Expand bvh by double precision – fail – look up JCGT paper

Saturday July 20

Keynote 1: Michael Shebanow (Samsung): An Evolution of Mobile Graphics Slides

Not a lot to report here and the slides cover a lot of what was said.

Fast Interactive Systems

Moderator: Timo Aila, NVIDIA Research

Lazy Incremental Computation for Efficient Scene Graph Rendering Slides Paper
Michael Wörister, VRVis Research Center; Harald Steinlechner, VRVis Research Center; Stefan Maierhofer, VRVis Research Center; Robert F. Tobler, VRVis Research Center

The problem being addressed here was the cost of scene graph traversal. The aim was to reduce this cost by maintaining an external optimized structure and propagating changes from the scene graph to that structure. Most of the content covered the different techniques for minimizing the cost of keeping the two structures synchronized. Overall, using the caching did improve performance since it enabled a set of optimizations. Despite the relatively small amount of additional memory required, I did note that a 50% increase in startup time was mentioned.
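
As a rough illustration of the general idea (and only that – the paper’s actual mechanism is more sophisticated than this), a pull-based dirty-flag scheme that keeps a flat cache in sync with a scene graph might look like the following; all names and the toy ‘transform’ are mine:

```python
class Node:
    """A tiny scene-graph node: a local transform plus children."""
    def __init__(self, name, local=0.0):
        self.name, self.local, self.children = name, local, []
        self.dirty = True            # needs its cached world transform refreshed
        self.parent = None

    def add(self, child):
        child.parent = self
        self.children.append(child)
        child.mark_dirty()
        return child

    def set_local(self, value):
        self.local = value
        self.mark_dirty()

    def mark_dirty(self):
        # Propagate the flag down; an ancestor change invalidates the subtree.
        self.dirty = True
        for c in self.children:
            c.mark_dirty()

class RenderCache:
    """External optimized structure: a flat name -> world-transform table.
    Only dirty nodes are re-evaluated when the renderer asks for an update."""
    def __init__(self, root):
        self.root, self.world = root, {}

    def update(self):
        def visit(node, parent_world):
            if node.dirty or node.name not in self.world:
                self.world[node.name] = parent_world + node.local  # toy 'transform concat'
                node.dirty = False
            for c in node.children:
                visit(c, self.world[node.name])
        visit(self.root, 0.0)

root = Node("root")
arm = root.add(Node("arm", 1.0))
hand = arm.add(Node("hand", 0.5))
cache = RenderCache(root)
cache.update()            # full build on first use
arm.set_local(2.0)        # edit the graph...
cache.update()            # ...only the dirty subtree is recomputed
print(cache.world)        # {'root': 0.0, 'arm': 2.0, 'hand': 2.5}
```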

Real-time Local Displacement using Dynamic GPU Memory Management Site
Henry Schäfer, University of Erlangen-Nuremberg; Benjamin Keinert, University of Erlangen-Nuremberg; Marc Stamminger, University of Erlangen-Nuremberg

The examples for this paper were footsteps in terrain, sculpting and vector displacement. The displacements are stored in a buffer dynamically allocated from a larger memory area and then sampled when rendering. The storage of the displacement is based on earlier work by the same authors: Multiresolution Attributes for Tessellated Meshes. The memory management part of the work seems quite familiar after the many presentations on partially resident textures. The major advantage is that the management can take place GPU-side, rather than needing a CPU to update memory mapping tables.

Real-Time High-Resolution Sparse Voxelization with Application to Image Based Modeling (similar site)
Charles Loop, Microsoft Research; Cha Zhang, Microsoft Research; Zhengyou Zhang, Microsoft Research

This presentation introduced an MS Research project using multiple cameras to generate a voxel representation of a scene that could be textured. The aim was possible future use as a visualization of remote scenes for something like teleconferencing. The voxelization is performed on the GPU from the camera images, and the results appear very plausible, with only minor issues in common problem areas such as hair. Going on the videos of the testers using it, it looks like fun too.

Building Acceleration Structures for Ray Tracing

Moderator: Warren Hunt, Google

Efficient BVH Construction via Approximate Agglomerative Clustering Slides Paper
Yan Gu, Carnegie Mellon University; Yong He, Carnegie Mellon University; Kayvon Fatahalian, Carnegie Mellon University; Guy Blelloch, Carnegie Mellon University

This work extends the agglomerative clustering described in Bruce Walter et al’s 2008 paper Fast Agglomerative Clustering for Rendering to improve performance by exposing additional parallelism. The parallelism comes from partitioning the primitives to allow multiple instances of the agglomeration to run in their own local partitions. This provides a greater win at the lower levels, where most of the time is typically spent. The sizing of the partitions and the number of clusters in each partition lead to parameters that can be tweaked to trade speed against quality.

Fast Parallel Construction of High-Quality Bounding Volume Hierarchies Slides Page
Tero Karras, NVIDIA; Timo Aila, NVIDIA

This presentation started with the idea of effective performance, based on the number of rays traced per unit of rendering time; rendering time includes the time to build your bounding volume hierarchy as well as the time to intersect rays with that hierarchy, so you need to balance the speed and quality of the BVH build. This work takes the idea of building a fast, low-quality BVH (from the same presenter at last year’s HPG – Maximizing Parallelism in the Construction of BVHs, Octrees, and kd Trees) and then improves the BVH by optimizing treelets, small subtrees of internal nodes. Perfect optimization of these treelets is NP-hard in the treelet size, so instead they iterate 3 times over treelets with a maximum size of 7 nodes – which actually have around 10K possible layouts! This gives a good balance between performance and diminishing returns. The presentation also covers a practical implementation of splitting triangles whose bounding boxes are a poor approximation to the underlying triangle.
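
The 10K figure is easy to sanity-check if the treelet size refers to its number of leaves (my assumption here): the number of distinct rooted binary tree topologies over n leaves is (2n-3)!!, so

```latex
(2 \cdot 7 - 3)!! \;=\; 11!! \;=\; 11 \cdot 9 \cdot 7 \cdot 5 \cdot 3 \cdot 1 \;=\; 10395 \;\approx\; 10\,\mathrm{K}.
```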

On Quality Metrics of Bounding Volume Hierarchies Slides Page
Timo Aila, NVIDIA; Tero Karras, NVIDIA; Samuli Laine, NVIDIA

This presentation started with an overview of the Surface Area Heuristic (SAH), which gives great results despite the questionable assumptions on which it rests. To check how well the SAH actually correlates with performance, they tested multiple top-down BVH builders and calculated how well the surface area heuristic predicted the ray intersection performance of each builder’s BVH across multiple scenes. A lot of the results correlated well, but the San Miguel and Hairball scenes typically showed a loss of correlation, which indicated that maybe SAH doesn’t give a complete picture of performance. Reconsidering the work done in ray tracing, an additional End Point Overlap metric was introduced to handle the points at each end of the ray, which appears to improve the correlation. This was then further supplemented with another possible contribution to the cost, leaf variability, introduced to account for how the resulting BVH affects SIMD traversal. This paper reminded me of the Power Efficiency for Software Algorithms running on Graphics Processors paper from the previous year, leading us to question the basis for how we evaluate our work.
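
For context, the usual greedy top-down form of the SAH cost being questioned here, for splitting a node P into children L and R (with C_t the traversal cost, C_i the intersection cost, A the surface area and N the primitive count), is:

```latex
C(P \rightarrow L, R) \;=\; C_t \;+\; \frac{A(L)}{A(P)}\, N_L\, C_i \;+\; \frac{A(R)}{A(P)}\, N_R\, C_i
```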

Hot3D

Michael Mantor, Senior Fellow Architect (AMD): The Kabini/Temash APU: bridging the gap between tablets, hybrids and notebooks
Marco Salvi (Intel): Haswell Processor Graphics
John Tynefield & Xun Wang (NVIDIA): GPU Hardware and Remote Interaction in the Cloud

Hot3D is a session that typically gives a lot of low-level details about the latest hardware or tech. AMD started by introducing the Kabini/Temash APU. This was the most technical of the talks, discussing the HD 8000 GPU, which features their Graphics Core Next (GCN) architecture and asynchronous compute engines – all of which seems quite familiar really. Intel were next, discussing Haswell and covering some of the mechanisms used for lowering power usage and allowing better power control, such as moving the voltage regulator off the motherboard. Marco also mentioned the new Pixel Sync feature of Haswell, which was covered many times during HPG and SIGGRAPH. NVIDIA were last in this section and presented some of their cloud computing work.

Sunday July 21st

Keynote 2: Steve Seitz (U. Washington (and Google)): A Trillion Photos (Slides very similar to EPFL 2011)

Very similar to Alexei’s presentation from EGSR last year (Big Data and the Pursuit of Visual Realism), Steve wowed the audience with the possibilities available when you have the entirety of the images from Flickr to hand and know the techniques you need to match them up. Scale-invariant feature transform (SIFT) was introduced first. This (apparently patented) technique detects local features in images and then uses this description to identify similar features in other images; the description of the features was described as a histogram of edges. This was shown applied to images from the NASA Mars Rover to match locations across images. Next Steve introduced Structure from Motion, which allows the reconstruction of an approximate 3D environment from multiple 2D images. This enabled the Building Rome in a Day project, which reconstructed the landmarks of Rome from the million photos of Rome on Flickr in 24 hours! This was later followed by a Rome on a Cloudless Day project that produced much denser geometry and appearance information. Steve also referenced other work by Yasutaka Furukawa on denser geometry generation, such as Towards Internet-scale Multi-view Stereo, which later led to the tech for GL maps in Google Maps. One of the last examples was a 3D Wikipedia that could cross-reference text with a 3D reconstruction of a scene from photos, where auto-detected keywords could be linked to locations in the scene.

Ray Tracing Hardware and Techniques

Moderator: Philipp Slusallek, Saarland University

SGRT: A Mobile GPU Architecture for Real-Time Ray Tracing Slides Page
Won-Jong Lee, SAMSUNG Advanced Institute of Technology; Youngsam Shin, SAMSUNG Advanced Institute of Technology; Jaedon Lee, SAMSUNG Advanced Institute of Technology; Jin-Woo Kim, Yonsei University; Jae-Ho Nah, University of North Carolina at Chapel Hill; Seokyoon Jung, SAMSUNG Advanced Institute of Technology; Shihwa Lee, SAMSUNG Advanced Institute of Technology; Hyun-Sang Park, National Kongju University; Tack-Don Han, Yonsei University

Similar to last year’s talk, the reasoning behind aiming for mobile real-time ray tracing was better quality for augmented reality, which also reminds me of Jon Olick’s keynote from last year and his AR results. The solution presented was the same hybrid CPU/GPU solution, with updates from the SIGGRAPH Asia presentation Parallel-pipeline-based Traversal Unit for Hardware-accelerated Ray Tracing, which showed performance improvements with coherent rays by splitting the pipeline into separate parts, such as AABB or leaf tests, to allow rays to be iteratively processed in one part without needing to occupy the entire pipeline.

An Energy and Bandwidth Efficient Ray Tracing Architecture Slides Page
Daniel Kopta, University of Utah; Konstantin Shkurko, University of Utah; Josef Spjut, University of Utah; Erik Brunvand, University of Utah; Al Davis, University of Utah

This presentation was based on TRaX (TRaX: A Multi-Threaded Architecture for Real-Time Ray Tracing from 2009) and investigated how to reduce energy usage without reducing performance. Most of the energy usage is in data movement, so the main aim is to change the pipeline to use macro instructions which perform multiple operations without needing to write intermediate operands back to the register file. The new system is also treelet based, since treelets can be streamed in and remain in the L1 cache. The result was a 38% reduction in power with no major loss of performance.

Efficient Divide-And-Conquer Ray Tracing using Ray Sampling Slides Page
Kosuke Nabata, Wakayama University; Kei Iwasaki, Wakayama University/UEI Research; Yoshinori Dobashi, Hokkaido University/JST CREST; Tomoyuki Nishita, UEI Research/Hiroshima Shudo University

Following last year’s SIGGRAPH presentation Naive Ray Tracing: A Divide-And-Conquer Approach by Benjamin Mora, this research focuses on problems discovered with the initial implementation. These problems stem from inefficiencies when splitting geometry without considering the coherence of the rays, and from low-quality filtering during ray division, which can result in only a few rays being filtered against the geometry. The fix is to select some sample rays, generate partitioning candidates to create bins for the triangles, then use the selected samples to calculate inputs for a cost function to minimize. While discussing this cost metric, they mentioned the poor estimates the SAH metric gives with non-uniform ray distributions, which seemed timely given Timo’s earlier presentation. The samples can also indicate which child bounding box to traverse first. The results look good, although the technique appears to work best with incoherent rays, which still have plenty of applications in ray tracing once the primary rays have been dealt with.

Megakernels Considered Harmful: Wavefront Path Tracing on GPUs Slides Page
Samuli Laine, NVIDIA; Tero Karras, NVIDIA; Timo Aila, NVIDIA

A megakernel is a ray tracer with all of the code in a single kernel, which is bad for several reasons: instruction cache thrashing, low occupancy due to register consumption, and divergence. In the case of this paper, one of the materials shown is a beautiful 4-layer car paint whose shader appeared as white and green specks of code on a PowerPoint slide. A pooling mechanism (maintaining something like a million paths) is used to allow the ray tracer to queue similar work to be batch processed by smaller kernels performing path generation or material intersection, reducing the amount of code and registers required and minimizing divergence. The whole thing sounds very similar to the work queuing performed in hardware by GPUs until there is sufficient work to kick off a wavefront, nicely described by Fabian Giesen in his Graphics Pipeline posts. It would be good to know what the hardware ray tracing guys think of these results, since the separation of the pipeline appears similar to Won-Jong Lee’s parallel pipeline traversal unit.
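
To make the structure concrete, here’s a toy CPU model of the queueing idea as I understood it – emphatically not NVIDIA’s implementation, just the shape of it, with all names and numbers made up:

```python
import random

# Instead of one megakernel that generates, extends, and shades a path in a
# single loop body, paths live in a pool and each stage drains its own queue
# in a batch. On a GPU each drain would be one small kernel launch over a
# compacted queue of path indices.

POOL_SIZE = 1 << 10          # the paper keeps on the order of a million paths
pool = [{"depth": 0, "alive": False} for _ in range(POOL_SIZE)]

generate_q, extend_q, shade_q = list(range(POOL_SIZE)), [], []

def generate_kernel(queue):
    """Start a fresh camera path in every free slot."""
    for i in queue:
        pool[i].update(depth=0, alive=True)
        extend_q.append(i)
    queue.clear()

def extend_kernel(queue):
    """Trace the next ray segment; terminate some paths (toy random choice)."""
    for i in queue:
        if random.random() < 0.3 or pool[i]["depth"] > 4:
            pool[i]["alive"] = False
            generate_q.append(i)     # slot is recycled for a new path
        else:
            shade_q.append(i)
    queue.clear()

def shade_kernel(queue):
    """Evaluate the material and queue the continuation ray."""
    for i in queue:
        pool[i]["depth"] += 1
        extend_q.append(i)
    queue.clear()

for _ in range(16):              # the scheduler just keeps the queues busy
    generate_kernel(generate_q)
    extend_kernel(extend_q)
    shade_kernel(shade_q)

print(sum(p["alive"] for p in pool), "paths still in flight")
```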

Panel: Hardware/API Co-evolution

Moderator: Peter Glaskowsky (replies annotated with speaker/company where possible)

ARM: Tom Olson, Intel: Michael Apodaca, Microsoft: Chas Boyd, NVIDIA: Neil Trevett, Qualcomm: Vineet Goel, Samsung: Michael Shebanow

Introduction – Thoughts on API HW Evolution
AMD: deprecate cost of API features
Tom Olson: Is TBDR dead with tessellation? Is tessellation dead?
Intel: Memory is just memory. Bindless and precompiled states.
Microsoft: API as convergence.
NVIDIA: Power and more feedback to devs
Qualcomm: Showed GPU use cases
Samsung: Reiterated that APIs are power inefficient as mentioned in keynote

Power usage?
AMD: Good practice. Examples of power use.
ARM: We need better IHV tools
Intel, Microsoft, NVIDIA: Agree
NVIDIA: OpenGL 4 efficient hinting is difficult
Qualcomm: Offers tile based hints
Samsung: Need to stop wasting work

Charles Loop: Tessellation not dead. Offers advantages, geometry internal to GPU, don’t worry about small tris and rasterise differently – derivatives
? Possibly poor tooling
? Opensubdiv positive example of work being done
Tom: Not broken but needs tweaking

Expose query objects as first class?
Chas: typically left to 3rd parties
Not really hints but required features

When will we see tessellation in mobile? Eg on 2W rather than 200W
Qualcomm: Mobile content different
Neil: Tessellation can save power
Chas: quality will grow
Tom: Mobile evolving differently due to ratios

Able to get info on what happens below driver?
? Very complex scheduling

What about devs that don’t want to save power?
Tom: It doesn’t matter to $2 devs, but AAA
Chas: Devs will become more sensitive

Ray tracing in hardware? Current API
Chas: Don’t know but could add minor details to gpus
Samsung: RT needs all the geometry

SOC features affect usage?
Qualcomm: Heterogenous cores to be exposed to developers

Shared/unified memory?
AMD: Easy to use power
Neil: Yes we need more tools

What about lower level access?

Best Paper Award

All 3 places went to the NVIDIA raytracing team:

1st: On Quality Metrics of Bounding Volume Hierarchies Timo Aila, Tero Karras, Samuli Laine
2nd: Megakernels Considered Harmful: Wavefront Path Tracing on GPUs Samuli Laine, Tero Karras, Timo Aila
3rd: Fast Parallel Construction of High-Quality Bounding Volume Hierarchies Tero Karras, Timo Aila

Next year

HPG 2014 is currently expected to be in Lyon during the week of 23-27 June. Hope to see you there!

Blog Hiatus 1

So it’s been a while since I last published a post – actually, it’s four and a half months since CCC and DefCon Videos – Part 1 in March. I knew it was going to happen, for several major reasons:

  1. First came GDC, an incredibly busy time of preparation, then meetings and then following up those meetings. And this was a great GDC for Phyre, with the 3.5 announcement and more positive quotes from our users than we could include in that blog post.
  2. An amazing Vita line-up, especially in May and July. Thomas Was Alone, Velocity Ultra, Rymdkapsel, and Soul Sacrifice in May, and then Hotline Miami and Stealth Inc in July. Every one of these I’d recommend to anyone with a Vita.
  3. HPG 2013 and SIGGRAPH 2013 in Anaheim. More planning and preparation and a whole lot of busy. And now, pages and pages of notes to transcribe.

Other than these distractions, there have been some things that I’ve been looking at for new posts:

  1. More DefCon videos. I’ve already got a list of great videos which is longer than the first post.
  2. Dabbling in Android. Thanks to vs-android, I’ve been able to stay in my happy place with Visual Studio while developing for Android. I wanted to see how hard it would be to a) run something on my new Nexus 10 and b) then port and run something with several hundred thousand lines of code.
  3. Spherical Harmonics. Having read the Robin Green paper several times until I understood it, then the Peter-Pike Sloan doc, then anything else I could find, I’ve come to the conclusion that there’s quite a gulf between the simplicity of the underlying maths – as exposed by a library like the DirectXSH library from Chuck Walbourn or the D3DXSH maths functions – and the literature that describes how to apply it. This makes me want to write an “SH for dummies” style post that would let me verify that I understand what is needed to integrate basic SH lighting and then highlight what I’ve missed (a first taste of the sort of thing I mean is sketched after this list).
  4. Actually writing up my HPG 2013 and SIGGRAPH 2013 experiences – that normally takes a few weeks!
  5. Updates to my HTML5/JS experiments. I still pick them up from time to time. I just love the simplicity of JS and HTML for prototyping, but I miss the exclusive control over the device that you get from console development.
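
For the SH post mentioned in item 3, the kind of minimal sketch I have in mind is something like the following: evaluate the band-0/1 basis (the constants are the standard ones from the Robin Green notes) and project a spherical function with Monte Carlo integration. All of the structure and names here are mine, and it’s illustration only:

```python
import math, random

# Real spherical-harmonic basis, bands 0 and 1, for a unit direction (x, y, z).
def sh_basis(x, y, z):
    return [
        0.282095,            # Y_0^0
        0.488603 * y,        # Y_1^-1
        0.488603 * z,        # Y_1^0
        0.488603 * x,        # Y_1^1
    ]

def project(f, samples=20000):
    """Monte Carlo projection of a spherical function f(x, y, z) onto the
    4 basis functions above: c_i = (4*pi/N) * sum of f(w_j) * y_i(w_j) over
    uniformly distributed directions w_j."""
    coeffs = [0.0] * 4
    for _ in range(samples):
        z = random.uniform(-1.0, 1.0)            # uniform sampling of the sphere
        phi = random.uniform(0.0, 2.0 * math.pi)
        r = math.sqrt(max(0.0, 1.0 - z * z))
        x, y = r * math.cos(phi), r * math.sin(phi)
        value = f(x, y, z)
        for i, b in enumerate(sh_basis(x, y, z)):
            coeffs[i] += value * b
    return [c * 4.0 * math.pi / samples for c in coeffs]

# Example: 'light from +z', i.e. a clamped cosine lobe max(z, 0).
print(project(lambda x, y, z: max(z, 0.0)))
```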

Give me time, the next post isn’t too far away. (And the 1 in the title is for the fact that this isn’t the last time this is going to happen!)

CCC and DefCon Videos – Part 1

Recently, @rygorous tweeted a link to a video called Writing a Thumbdrive from Scratch presented at 29c3. YouTube’s recommended videos list then led me through some other C3 talks and on to DefCon videos. DefCon is described as “the world’s longest running and largest underground hacking conference.” C3 in this case is the Chaos Communication Congress, and the about page says that the events blog is maintained by members of the Chaos Computer Club.

For me, a lot of the enjoyment of watching these videos comes from learning something from an area completely outside of my comfort zone. I’ve only been to Games and Graphics conferences and even at those, it’s always the talks about subjects wildly different to what I know well that I learn the most from. Hacking and security topics are something I find interesting despite a lack of exposure to that kind of content. There’s also the fact that some of the presenters are entertaining and the material can be quite funny.

The Videos

Writing a Thumbdrive from Scratch – Travis Goodspeed (29c3)

This talk was about the possibility of designing your own USB drive from scratch. Using something like a Facedancer (prototyped by the presenter) you can prototype your own USB mass storage device. However, since you’re able to program the behaviour of the device, beyond handling standard storage requests you can also add your own behaviour based on how the device is used. For example, based on the different ways in which operating systems access the drive, you can decide whether to allow access to the drive, expose different data or just not work at all. Similarly, by detecting block copy operations you can tell when the drive is being copied, possibly for later forensic examination, and this means you could respond by returning something else or just destroying the drive.

There’s more information about Travis’s work on his blog.

Trolling reverse engineers with math – frank^2  (DefCon 18)

Initially the title confused me but the underlying principle is about thinking of a different way to obfuscate your code to obstruct any reverse engineering efforts. This is based on remapping the code into memory based on a lookup function. The example uses a sine wave to distribute chunks of code so that you have multiple ops or basic blocks evenly distributed over memory and someone using a disassembler (or if they’re unlucky, a debugger) will have to track all of the jumping back and forth. Further obfuscation involved adding prologues and epilogues to each op or basic block with extra branches not taken and setting states to be picked up after the jump.

Looking from the obfuscator’s point of view at complicating the lives of those disassembling their work makes this quite a funny presentation.

And that’s how I lost my eye: Exploring Emergency Data Destruction – Shane Lawson, Senior Security Engineer (DefCon 19)

A simple premise: how do you destroy a hard disk in 60 seconds? There are a couple of gotchas, like limited physical space, not setting off alarms (smoke, seismic etc.) and not killing any sysadmins or other humans nearby. The range of explored options is entertaining and the final solution is surprising. Any discussion of the talk will give away some of the highlights, so just go watch it instead.

My life as a spyware developer – Garry Pejski (DefCon 18)

The story of a guy who picked up a shady job from Craigslist and ended up writing spyware. Since this happened several years ago, the application was designed to work as an IE plugin and was delivered by a custom installer using an exploit. The installer meant that there was an element of legitimacy, since the user was partially involved with the install, despite it being almost impossible to exit – there was even a set of terms and conditions! The application itself performed affiliate link redirection (changing affiliate IDs to its own) and displayed popups based on intercepted searches. However, it sounds like the affiliate/popup monetization strategy didn’t make as much as using the application to install other people’s spyware, leading to more hilarious stories.

At the time, state-of-the-art malware removal was quite basic and sounds easy to defeat. I noticed that the presenter raised the idea of needing to whitelist applications in the future, which made me think of the discussion of whitelisting in the Windows Store.

Worth a watch, especially for the view from the Dark Side from one of the guys building the Death Star.

The Art of Trolling (Slides) – Matt ‘openfly’ Joyce (DefCon 19)

A comedy piece covering examples of trolls through history and examples of the different types of trolling. With all of the anecdotes in this talk and the underlying material, it’s difficult to know what’s true. You won’t learn much, but you might have a laugh.

How I met your girlfriend – Samy Kamkar (DefCon 18)

The presenter starts by introducing himself as someone previously banned from using computers for a little misbehaviour. The talk is based on penetrating someone’s Facebook account, and starts by focusing heavily on the quality of the elements that form the basis for hashed passwords in PHP. Reducing the entropy of each of these elements results in something that limits the possible range of values to brute force. Ironically this doesn’t affect Facebook, because they use their own HipHop system, and since this vulnerability was found PHP has been patched too – along with the obvious recommendation that you use your own mechanism for seeding the random numbers.

This talk also introduced me to NAT pinning, enabling forwarding of a port from a user’s computer through their router. This is based on submitting invisible HTML forms which I think scared me even more. Getting these invisible forms talking to an IRC server behind the user’s back further escalates the fear. The final link in the chain was establishing location which is easy thanks to Google’s roving mappers having grabbed the location of a lot of routers based on their MAC address.

Definitely worth watching to learn a few new things and for the entertainment value too.

The Dark Side of Crime-fighting, Security, and Professional Intelligence – Richard Thieme ThiemeWorks (DefCon 19)

I assumed this would be funny anecdotes, but it’s something darker. The presenter started by highlighting his history with the conference, as a father figure it sounds like he’s seen it all, and as the talk goes on you realize that his experience means that he knows a lot of people. However his stories from the dark side are less comedic and more about reiterating the scary state of affairs in the professional intelligence community.

Overall, a sober and honest look at where we were in 2011.

Practical Cellphone Spying – Chris Paget (DefCon 18)

This is one I watched wanting to find out how simple it actually was. Apparently very simple. Spend a couple of thousand dollars on a laptop and USRP (Universal Software Radio Peripheral). You start by spoofing a network for phones to connect to – easy enough with well known IDs for the major networks.

Of course you’ll be thinking it’s all encrypted, and it might be, but you can ask the phone to turn off encryption. Never mind, you’d say, no-one can turn off my encryption without asking – but in fact they can, and a lot of phones ship with the disable-encryption warning turned off. This is thanks to countries that need it off by default (for example, India) and not wanting to confuse consumers when they get told that it’s being disabled. Even more worrying is the idea that 2G’s security compared to 3G’s is like HTTP compared to HTTPS: when you’re so grateful for any kind of connection, you don’t typically think of the security implications.

The whole presentation makes the whole thing seem incredibly easy. Slightly scary, but interesting.

Physical Security: You’re Doing It Wrong – A.P. Delchi (DefCon 18)

This presentation covers considerations for physical security, and the fact that physical security implemented poorly is useless – or, more realistically, funny presentation content. The presentation starts with the 5 As:

  1. Assessment – where and what to protect.
  2. Assignment – prioritize what to protect.
  3. Arrangement – how to protect.
  4. Approval – get it signed off.
  5. Action – install it.

Starting at 21 minutes is the “what could possibly go wrong” section: a discussion of the management and vendor level problems and how to handle them. I do like the example of talking to the construction workers, as they know what’s actually going on. The last few minutes cover users’ and HR’s greatest hits. I also learnt Spafford’s Law of Security: “If you have responsibility for security, but no authority to make changes, then you’re just there to take the blame when something goes wrong.”

A good balance of very practical advice and comedy things to be aware of.

You spent all that money and you still got owned – Joseph McCray (DefCon 18)

In this, the presenter tells a good story all about his experiences penetration testing and mechanisms for actually performing the testing. There’s discussion of the different tools and scripts to help with things like load balancer detection, handling intrusion prevention and detection systems, and discovering web application firewalls – all things that cost money and in the examples, are all providing only the illusion of security. The talk then goes on to what you can do when you’re in.

This was probably the first of the talks I saw that tells you what’s available for this kind of work and how easy the tools are to use. And be warned, the presenter uses NSFW language, and you may be offended if you like Ruby.

Steal Everything, Kill Everyone, Cause Total Financial Ruin! – Jayson E. Street, CIO of Stratagem 1 Solutions (DefCon 19)

A good follow-on from Physical Security: You’re Doing It Wrong, this starts with the history of the presenter’s work, with entertaining stories from actual penetration tests of offices by physically entering them. The presentation is split into 3 parts based on the title:

  1. Examples of stealing – what you can find lying about in an office.
  2. How to kill everyone – due to a lack of security in a hotel allowing access to the kitchen or plant room.
  3. Financial thievery – such as grabbing the paper in the shred bin.

Start to end, this is an entertaining talk that will show you what a focused intruder can achieve, and while you’re thinking that couldn’t happen where you work, I hope you’re also thinking about how you’d stop it.

Pwned By the Owner: What happens when you steal a hacker’s computer (DefCon 18)

This is the story of the consequences of using a hacker’s stolen computer – with a bonus 5-minute story while the equipment was being set up. Starting with the theft, it’ll make you think about your own security situation. But after that it’s an exciting story about what you can do with a back door into the computer you used to own, and you’ll be glad not to be the new guy using it.

One of the shorter presentations but worth watching. The only thought that came to mind was that maybe the presenter wasn’t picking on the thief but on a new owner – either way, it was his kit being used.

Nmap: Scanning the Internet – Fyodor, Hacker, insecure.org (DefCon 16)

Nmap is a tool I’ve always wondered about – never having had to use it or really understanding what it does. This talk gives a lot of examples of how to use it and then tips on more advanced usage. The examples show the epic command lines you use to drive the thing and it’s quite obvious that the presenter is the author of the tool. The presentation also shows a nicer GUI frontend to Nmap with extra features like a graph of connectivity between nodes.

Interesting stuff if you know very little about Nmap.

Jackpotting Automated Teller Machines Redux – Barnaby Jack (DefCon 18)

To be honest, I confused myself with the title, assuming it was something to do with fruit machines, but even more intriguingly it’s about the gritty internals of ATMs, focusing on the simple boxes you find in a small shop or petrol station. Although he must have looked suspicious buying and transporting his own ATMs, the presenter has taken the time to investigate what’s inside. Starting with reverse engineering, he moved on to writing tools to remotely access the ATMs (Dillinger) and a rootkit to install on them (Scrooge).

Although most of the real life excitement of experimenting with the machines happens off screen, the rest of the talk is fascinating enough to make this worth watching.

SIGGRAPH 2012 – Tuesday

Beyond Programmable Shading (Slides, SIGGRAPH Page)

Five Major Challenges in Real-Time Rendering http://bps12.idav.ucdavis.edu/talks/02_johanAndersson_5MajorChallenges_bps2012.pdf
After the introduction to the BPS course, Johan led straight into a follow-up to his talk from the 2010 course. The fact is that the 5 challenges from 2 years ago are still there. However, each of these challenges can be broken down into different areas, resulting in a total of roughly 25 different topics spread over the 5 major groups:

  1. Cinematic Image Quality – Types of aliasing, anti-aliasing and blur.
  2. Illumination – Dynamic GI, shadows, reflections
  3. Programmability – Exposing a common front-end shader language for different backends (for example HSAIL, the Heterogeneous Systems Architecture Intermediate Language), supporting the GPU generating its own work, improving coherency between tasks with things like queues, and simpler CPU-GPU collaboration. Also programmable blending (as is being exposed by Apple in iOS 6 via APPLE_shader_framebuffer_fetch, discussed here).
  4. Production costs – Reduce iteration time. A great renderer won’t reduce costs, need to reduce cost of creating content.
  5. Scaling – Power vs resolution.

I do know that prior to the talk Johan was soliciting feedback from others in the industry, which meant that you knew the challenges listed were more than one man’s personal list, and it was good to see the names of the contributors at the end of the talk.

Intersecting Lights with Pixels: Reasoning about Forward and Deferred Rendering http://bps12.idav.ucdavis.edu/talks/03_lauritzenIntersectingLights_bps2012.pdf

This was an interesting overview of the state of the art for generating the lists of lights to apply to pixels in a forward or deferred renderer, without going into a forward-vs-deferred discussion beyond some bandwidth comparisons, which were nearly inconclusive since the numbers came out approximately equal.

There were 2 main things that I thought made useful takeaways:
1) Doing per-tile checks against the bounds of each tile performs a lot of the same tests multiple times. Instead you should use something like a hierarchical quad-tree, which should reduce the number of tests required and avoid redundancy.
2) The suggestion of using bimodal Z clustering rather than the higher number of buckets used in Clustered Deferred and Forward Shading by Olsson et al. at HPG. While reading the Olsson paper, I thought that heavily subdividing the Z range was overkill, and some Twitter discussion around that time highlighted that in some games there’s limited content between the end of the sights and the nearest wall, in which case a split into just two Z ranges would help (a toy sketch of the idea follows below).
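
To illustrate what I mean by that second point, here’s a toy, CPU-side sketch of a single tile with two Z clusters; the split heuristic, the light representation and all names are mine, not anything from the talk:

```python
# For one screen tile: instead of many depth buckets, split the tile's depth
# samples into two clusters (e.g. 'weapon/near geometry' and 'far wall') and
# build a light list per cluster.

def split_depths(depths):
    """Split a tile's depth samples into two clusters around the largest gap."""
    s = sorted(depths)
    gaps = [(s[i + 1] - s[i], i) for i in range(len(s) - 1)]
    _, i = max(gaps)
    return (s[0], s[i]), (s[i + 1], s[-1])      # (near cluster, far cluster)

def cull_lights(lights, z_range):
    """Keep the lights whose [z - r, z + r] span overlaps the cluster's range."""
    zmin, zmax = z_range
    return [l for l in lights if l["z"] + l["r"] >= zmin and l["z"] - l["r"] <= zmax]

tile_depths = [0.5, 0.6, 0.55, 9.0, 9.2, 8.8]        # bimodal: sights vs far wall
lights = [{"z": 0.7, "r": 0.3}, {"z": 5.0, "r": 0.5}, {"z": 9.1, "r": 0.4}]

near, far = split_depths(tile_depths)
print(cull_lights(lights, near))   # only the near light survives
print(cull_lights(lights, far))    # only the far light survives
```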

I’d recommend taking the time to read through the slides if you didn’t attend.

Dynamic Sparse Voxel Octrees for Next-Gen Real-Time Rendering http://bps12.idav.ucdavis.edu/talks/04_crassinVoxels_bps2012.pdf

With all of the buzz regarding Sparse Voxel Octrees based on their usage in Unreal 4 (covered in the Advances in Real-Time Rendering course), I thought it would be good to see a presentation from Cyril Crassin who has been working hard and presenting a lot of the work in this area. Most of the talk was an overview of how the system works and how it fits in versus a polygonal geometry representation, comparing the advantages that you can get from cone tracing an SVO.

Yet again, this was another SVO talk where processing cost and storage space were skated over, but since the slides are now available you can see the numbers: 9 levels of SVO come in at 200MB to 1GB, and the initial construction for the GL Sponza demo takes 70ms, with an update cost of 4-5ms per frame for animated data. The performance figures were slightly confused later when it was said that the GL demo from the new OpenGL Insights book (released chapter here) can build Sponza in 15.44ms. However, since the technique is being used in Unreal Engine 4, it was expected that The Technology Behind the “Unreal Engine 4 Elemental Demo” presentation on the following day would show where you could reduce the time and space requirements.

I do like SVOs for the fact that they provide a scalable solution to the visibility part of a GI system and their use in Unreal shows that they can have a runtime implementation. If only they didn’t cost so much to generate and store!

Power Friendly GPU Programming http://bps12.idav.ucdavis.edu/talks/05_ribblePowerRendering_bps2012.pdf

This was a Snapdragon-based presentation on general optimizations that could be applied to save power. Unfortunately the generality of the optimizations – compress textures, draw front to back, and consider your render target changes – brought very little to the power saving discussion. Once frame limiting was recommended as the best way to save power I did wonder how helpful the content would be. I missed the end of the talk with an aim to return for the panel.

From Publication to Product: How Recent Graphics Research has (and has not) Shaped the Industry

This panel was led by Kayvon Fatahalian and discussed the relationship between research departments and industry, with a set of industry luminaries on the panel.

Some of the things that I took note of:

  • No one wants to learn a new language – every time someone comes up with a new language, it’s unlikely to get a lot of adoption due to the languages already in use.
  • Papers need to use realistic workloads and industry needs to provide better workloads to facilitate this. Researchers working closely with industry typically get the most relevant workloads due to the requirements of the research.
  • The HLSL language was not expected to last this long – they thought it might run for 5 years or so.

Light Rays Technical Papers http://s2012.siggraph.org/attendees/sessions/100-59

Naïve Ray Tracing: A Divide-And-Conquer Approach ACM Digital Library Version

This presentation started with a back-to-basics description of ray tracing – intersect a bunch of rays with a bunch of geometry. A lot of optimization has gone into the geometry intersection using techniques such as bounding volume hierarchies, which add time to build, memory to store and complexity to create and intersect, and which typically ignore the distribution of the rays and can even increase cost with dynamic scenes. This new technique is based on recursively splitting the set of rays and the set of geometry until you can perform a naive set of intersection tests. It’s a simple algorithm, so most of the rest of the talk consisted of results. The performance looked good, and they said that the major limit was bandwidth as they reordered the rays and geometry.
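
To make the recursion concrete, here’s a toy 1D version of the divide-and-conquer idea: points on a line stand in for triangles, and signed distance stands in for ray-primitive intersection. The split rule, the naive cutoff and all names are mine, so treat it as a sketch of the shape of the algorithm rather than the paper’s method:

```python
NAIVE_LIMIT = 4

def reaches(ray, lo, hi):
    """Can this ray possibly hit anything inside the interval [lo, hi]?"""
    _, (o, d) = ray
    return o <= hi if d > 0 else o >= lo

def naive(rays, points, hits):
    """Brute force: record the nearest point in front of each ray."""
    for ri, (o, d) in rays:
        for p in points:
            t = (p - o) * d                 # signed distance along the ray
            if t > 0:
                hits[ri] = min(hits[ri], t)

def dacrt(rays, points, lo, hi, hits):
    # Small enough (or a degenerate interval): just do the naive tests.
    if len(rays) * len(points) <= NAIVE_LIMIT * NAIVE_LIMIT or hi - lo < 1e-6:
        naive(rays, points, hits)
        return
    mid = 0.5 * (lo + hi)
    halves = (([p for p in points if p < mid], lo, mid),
              ([p for p in points if p >= mid], mid, hi))
    for sub_points, a, b in halves:
        sub_rays = [r for r in rays if reaches(r, a, b)]   # ray filtering step
        if sub_points and sub_rays:
            dacrt(sub_rays, sub_points, a, b, hits)

rays = list(enumerate([(0.0, +1), (10.0, -1), (3.5, +1)]))   # (origin, direction)
points = [1.0, 2.0, 4.0, 6.0, 9.0, 9.5]
hits = [float("inf")] * len(rays)
dacrt(rays, points, min(points), max(points), hits)
print(hits)   # nearest hit distance per ray: [1.0, 0.5, 0.5]
```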

This was an interesting presentation since it could lead to a rethink in the way that ray tracing is performed. I imagined the tests to partition the sets of rays and geometry would be prohibitively expensive but it sounds like it could be a good win. It’ll be interesting to see what comes of this technique and any further research. There’s a good write up here too.

Manifold Exploration: Rendering Scenes With Difficult Specular Transport Site PDF

This talk centered on a new way of dealing with specular that can be applied to Markov Chain Monte Carlo (MCMC) based rendering. My understanding is that once you know which points you want to transport light between and which surface to reflect from or refract through, you can make multiple paths that fulfil that transport. This was best expressed by the images in the slides. There was also an extension to account for roughness based on the area around the manifold. This works well in the highly reflective and rough scenes that they showed.

The source is available and includes an implementation of Veach’s Metropolis Light Transport which is notoriously difficult to implement – this meant an extra round of applause for the presenter.

Bidirectional Lightcuts http://www.cs.cornell.edu/~kb/publications/SIG12BidirLC.pdf

Continuing the theme of bias reduction in Virtual Point Light (VPL) systems, Bidirectional Lightcuts extends multidimensional lightcuts to use the same tracing mechanism that adds VPLs to add Virtual Sensor Points (VPSs). The paper introduces weighting mechanisms for the VPL/VPS pairings to allow the use of more advanced features such as gloss, subsurface scattering and anisotropic volumes.

Virtual Ray Lights for Rendering Scenes With Participating Media Site PDF

Virtual Ray Lights are intended to fix the issues that arise when rendering participating media with a VPL technique (firefly-like singularities around each VPL). The ray lights are the paths traced through the volume when adding VPLs and they can be integrated with the camera ray when rendering to add their contribution which gives very convincing results.

As mentioned before, this paper was superseded by beam lights (Progressive Virtual Beam Lights, as seen at EGSR), since the ray lights change the firefly singularities into bright streaks.

Fun with Video http://s2012.siggraph.org/attendees/sessions/100-56

Video Deblurring for Hand-Held Cameras Using Patch-Based Synthesis http://cg.postech.ac.kr/research/video_deblur/

This paper discussed a method for fixing the motion blur that remains even after image stabilization. The algorithm is based on finding periods with sharp frames (since shaky videos contain both sharper and blurrier frames) and then applying a patch-based process that finds neighbours and blends them. They estimate blur kernels to match the blur and use them when matching the patches.

I’d not realized how bad the remaining motion blur could be, but this was a very interesting presentation and covered a lot of previous work.

Eulerian Video Magnification for Revealing Subtle Changes in the World http://people.csail.mit.edu/mrub/vidmag/

The fast-forward version of this paper showed 2 example videos that demonstrated detecting a person’s heart rate and an infant’s breathing by amplifying changes in videos – the main reasons I wanted to see this. The previous work section referenced Motion Magnification from SIGGRAPH 2005. For this technique, the video is spatially decomposed, then they calculate the luminance and apply a temporal filter that smooths the trace. Many equations are used to explain why this works, and the slides refer you to the paper for more details.
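
As a toy illustration of the temporal half of the idea (just a single pixel’s luminance trace with made-up numbers – the real system works per level of a spatial decomposition of the whole frame), band-pass filtering around the frequency of interest and adding an amplified copy back looks like this:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fps = 30.0
t = np.arange(0.0, 10.0, 1.0 / fps)                  # 10 seconds of 'video'
pulse = 0.002 * np.sin(2 * np.pi * 1.2 * t)          # tiny 1.2 Hz variation (a heartbeat)
luma = 0.5 + pulse + 0.0005 * np.random.randn(t.size)

def temporal_bandpass(signal, fps, low_hz, high_hz, order=2):
    """Butterworth band-pass over the time axis, run forwards and backwards."""
    b, a = butter(order, [low_hz / (fps / 2.0), high_hz / (fps / 2.0)], btype="band")
    return filtfilt(b, a, signal)

# Amplify only the band around the heartbeat and add it back to the original.
amplified = luma + 50.0 * temporal_bandpass(luma, fps, 0.8, 1.6)

# The barely visible 0.002 wobble becomes an obvious swing after amplification.
print(np.ptp(luma), np.ptp(amplified))
```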

More interesting results included detecting Bruce Wayne’s pulse in a Batman film when he’s supposed to be asleep, and a running demo where the user moved slightly and his eyes didn’t, which gave scary results.

Selectively De-Animating Video http://graphics.berkeley.edu/papers/Bai-SDV-2012-08/

The presentation demonstrated a user-driven tool that could warp parts of a video to make some elements appear stationary while leaving the motion of other parts intact. These can be looped to create a cinemagraph. The features marked by the user are tracked through the video and used to define the required warp, and the warped version is composited with the original to remove the last of the motion. A major advantage was that the user input required seemed minimal.

The main example was a beer being poured into a glass (a common SIGGRAPH video source) where the video was warped such that the glass looked stationary. Other examples were a roulette wheel where the motion of the ball was faked or the wheel was held still, a video of a guitar player where the guitar was kept still, and a model who was kept still while her hair and eyes continued to move.

SIGGRAPH Dailies – http://s2012.siggraph.org/attendees/siggraph-dailies

SIGGRAPH Dailies is an end of day session made up of many one-minute presentations on a wide variety of topics. My memory of the event was that it was very artist driven, with a very vocal set of presenters from Texas A&M. The length of the presentations means that they have a very strong visual component, and it’s the artists presenting their work that I remember most.

SIGGRAPH 2012 – Monday

Pointed Illumination session http://s2012.siggraph.org/attendees/sessions/100-119

Progressive Lightcuts for GPU (Progressive Lightcuts for GPU Homepage – ACM digital library version)

The first presentation of the session was about offloading the lightcuts process (explained the previous day) to the GPU and producing a progressive result, since it converges to the final image quite quickly. The obvious idea of caching where the tree was cut per pixel was thrown out due to the massive amount of memory required to store that state. Instead they average several Lightcuts images based on different sets of VPLs. The system is limited to using a heapless traversal on the GPU and schedules CPU work if the depth of the cut is too great. This presentation also demonstrated a new way to clamp the lighting contribution that helps define the number of iterations required.

It seemed that the major contribution was the new clamping method, which was skated over quite quickly with a reference to “Mathematica happens”.

SGRT: A Scalable Mobile GPU Architecture Based on Ray Tracing (ACM digital library version)

In this presentation, Won Jong Lee presented a ray tracing based GPU. He started by showing that ray tracing requires more FLOPs than available on current mobile GPUs and the underlying process doesn’t map well to multithreaded SIMD operations due to the incoherence between rays. Their first question was whether to go fixed function or programmable. Fixed function is lower power but programmable elements are required for things like ray generation and surface shaders.

The underlying hardware is split into several parts. The fixed function traversal and intersection system is based on T&I Engine: Traversal and Intersection Engine for Hardware Accelerated Ray Tracing presented at SIGGRAPH Asia last year. Internally the system supports optionally restarting traversal based on storing a short stack of kd-tree nodes. The Ray Accumulation Unit gathers rays that hit the same cache line but are still waiting for that data.

The numbers given for 2 test scenes rendered at 800×480 were as follows:
* Fairy – 255M rays/sec, 87 fps
* Ferrari – 170M rays/sec, 67 fps
These compare favorably with Kepler ray tracing figures of 156-317M rays/sec.

Overall, the system looks interesting, although there are several points worth noting.
* The example was based on Samsung’s reconfigurable processor which represented the bulk of the posters at HPG (for example A Scalable GPU Architecture based on Dynamically Reconfigurable Embedded Processor). It appears to be an area of research investment at Samsung that is being reused in many different hardware projects.
* The BVH is currently generated CPU side and they’ve not investigated too far into that so there’s no support for dynamic scenes.
* The clock rate being discussed was 1 GHz, which would result in burnt fingers if the battery lasts long enough.
* The presenter asserted many times that this was the first known hardware implementation but I thought Imagination (of PowerVR fame) had already been looking at something called Caustic and I managed to find the Imagination Caustic PowerVR ray-tracing hardware reference platform as announced at SIGGRAPH this year.

Point-Based Global Illumination Directional Importance Mapping

This presentation by Eric Tabellion showed 2 applications of importance mapping to Point-Based Global Illumination (PBGI). For a quick recap, PBGI involves creating point samples of a scene, clustering them, and then sampling the clusters based on a solid angle error metric to gather global illumination. Eric showed some live demos of the points selected for rendering from the point of view of a position being illuminated, which gave a good idea of the blockiness of the point sprites used when rasterizing the cubemap that represents the GI.

The first use of importance mapping showed that more important directions need higher quality and the BRDF bounds a cone of normals based on the roughness of the surface. This means that the solid angle error metric can be varied proportional to the roughness in order to rasterize finer points for smoother surfaces.

The second use was similar, but was based on the use of high dynamic range environment maps. Although the HDR envmap is already mipmapped for quick lookups, they also updated the system to base the solid angle error on the luminance sampled from the envmap.
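
A hedged sketch of how I would imagine the first tweak slotting into a PBGI gather loop: the cluster-acceptance threshold is simply scaled by the receiver’s roughness so that smoother surfaces force a finer cut. The data layout and the linear scaling are my assumptions, not anything shown in the talk.

```cpp
#include <cmath>

// Hypothetical PBGI cluster node: a representative position plus the total
// area of the points it aggregates (children omitted from the sketch).
struct Cluster {
    float position[3];
    float area;
};

// The baseline solid-angle threshold is scaled by roughness so that glossier
// (smoother) receivers force a finer cut; the linear scaling is an assumption.
inline float solidAngleThreshold(float baseThreshold, float roughness) {
    return baseThreshold * roughness;
}

// Returns true if the cluster subtends a small enough solid angle to be used
// directly; otherwise the traversal should descend into its children.
inline bool acceptCluster(const Cluster& c, const float receiver[3],
                          float baseThreshold, float roughness) {
    const float dx = c.position[0] - receiver[0];
    const float dy = c.position[1] - receiver[1];
    const float dz = c.position[2] - receiver[2];
    const float distSq = dx * dx + dy * dy + dz * dz;
    const float solidAngle = c.area / (distSq + 1e-6f); // approximate subtended angle
    return solidAngle < solidAngleThreshold(baseThreshold, roughness);
}
```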

Following from the PBGI presentations at HPG/EGSR I found this a really interesting talk with a takeaway that could be applied by anyone with a working PBGI implementation. The results also highlighted the time that can be saved when working with importance mapping, cutting a 4 hour render to 90 minutes.

Ill-Loom-inating Handmade Fabric in “Brave”

This presentation demonstrated the techniques used to solve the complex requirements of the fabric in the film Brave.

The first attempt was a ray marcher in tangent space against a distance field representing the fibers of the fabric (as soon as I heard distance field I thought it was possibly due to Inigo Quilez having joined Pixar, given his history of ray marching demos). The example shown was highly detailed, going so far as being able to see the fibers separate when the material was compressed. The first problem was a lack of a silhouette on curved areas, so the marching was changed to bend the ray (rather than the distance field) based on the curvature, as is usual in distance field renderers. One advantage of the distance field system is that it can be used to calculate local ambient occlusion. However, there were several disadvantages, such as the lack of anti-aliasing, shadows and support for the existing lighting model.

The second attempt was based on the use of RiCurves in Pixar’s RenderMan and was named Loom. The system procedurally generated the fabric based on subdivision surfaces. At render time, the aim was to generate only the geometry required for one face at a time plus its neighbors, in an effort to reduce memory usage. However, there were still problems with memory usage as well as geometry LODding and shadows. This technique was used a lot in the final film, with the presenter focussing on a tapestry which appeared to be heavily used in the film.

As part of the Q&A, someone asked about how his first technique compared with Relief Mapping. The presenter highlighted that he had only a procedural description of the surface and Relief Mapping typically works from a 2.5D description.

Although not as practical as the preceding PBGI talk, this presentation was more of an inspiration and a review of a set of different ways of achieving the same goal.

Virtual Texturing in Software and Hardware http://s2012.siggraph.org/attendees/sessions/virtual-texturing-software-and-hardware

All of the slides for this course are on J.M.P. van Waveren’s site.

The first time I actually understood exactly how Sparse Virtual Textures worked was at Sean Barrett’s (@nothings) GDC 2008 talk (site here) which was performed from a set of slides stored in a megatexture.

The session started with an overview of the technique, reusing some of the recognizable images from the GDC 2008 talk and you could read either version to catch up. The session then moved on to the practical use of software to implement virtual textures in Rage.

Most of the Rage content has also been covered before in J.M.P. van Waveren’s id Tech 5 Challenges talk at SIGGRAPH 09 with this year’s talk mostly focussing on the work addressing the virtual texture, and the issues with filtered texture sampling from virtual textures, especially in the case of anisotropic filtering which can apparently be artifact-y.

The meat of the talk was AMD’s presentation of how their Partial Resident Texture (PRT) system works. In a similar way that Sean Barrett defined sparse virtual texturing in terms of Virtual Memory (VM), the presenter started with an overview of the hardware Virtual Memory subsystem on a modern GPU, and a description of how every texture sampling operation uses the VM to find the physical address for the texture data. PRT works by allowing you to allocate virtual address space and only map the pages of physical memory you want to access for your texture, meaning that your actual memory usage can be lower, or you can use the indirection to relocate blocks of your texture. However, this means that when you sample your texture, it’s possible to hit unmapped pages; this error condition needs to be returned to the caller of the texture lookup and, in theory, handled in some way. This section of the talk finished by describing the kind of work required inside a driver to expose this functionality.

Next up was the AMD_sparse_texture extension to OpenGL which is AMD’s method for exposing this functionality (covered in some detail here). The extension adds:

1) glTexStorageSparseAMD() as a simple replacement for glTexStorage2D() when you want to allocate virtual memory space to use as a PRT.
2) New queries for texture page sizes since they vary with format (glGetInternalformativ() with GL_VIRTUAL_PAGE_SIZE_X/Y/Z_AMD) and GL_MIN_SPARSE_LEVEL_AMD to get the level at which mips are packed.
3) An extra texture parameter, GL_MIN_WARNING_LOD_AMD, defining the low water mark for mips – which I assume means the point at which you need higher res mipmaps, since I’m not sure of the value of the alternative. If the watermark is hit, a warning is returned to the shader, but at least the data is valid. (It would be nice if this were stored to a flag on the texture).
4) GLSL sparseTexture() sampling methods which return flags and additional methods to check the flags. The previous texture() methods still work, but they have undetermined results if the texture is not resident, so you still need to manage that externally.

You are also able to use the sparse storage for render targets, where writes to unmapped memory are just discarded. A list of possible issues, such as running out of virtual or physical memory, or GL texture size limitations (for example 4K×4K), was also covered, although in the case of GL, you could allocate large texture arrays or volumes. As future extensions, the most interesting was extending this VM support to other GL types such as the ubiquitous buffers.
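
To make the allocation flow concrete, here is roughly how I would expect points 1 and 2 above to fit together in host code. This is a sketch pulled together from my notes, assuming a current GL context with AMD_sparse_texture loaded via an extension loader; check the extension spec for the exact tokens and signature before relying on it.

```cpp
// Assumes a current GL context with AMD_sparse_texture available and its
// enums/entry points loaded via an extension loader; the token names are as
// noted above, but check the extension spec before trusting the details.
GLuint createSparsePageTexture()
{
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);

    // Page sizes vary with format, so query them first (point 2 above).
    GLint pageX = 0, pageY = 0;
    glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8, GL_VIRTUAL_PAGE_SIZE_X_AMD, 1, &pageX);
    glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8, GL_VIRTUAL_PAGE_SIZE_Y_AMD, 1, &pageY);

    // Reserve a large virtual texture without committing physical pages
    // (point 1 above); residency is then driven by the application's own
    // streaming/feedback logic, and shaders see the failure flag via the
    // sparseTexture() lookups from point 4.
    glTexStorageSparseAMD(GL_TEXTURE_2D, GL_RGBA8, 16384, 16384, 1, 1,
                          GL_TEXTURE_STORAGE_SPARSE_BIT_AMD);
    return tex;
}
```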

As part of the Q&A session, the question was one I’d already noted down to ask: what happens when filtering if one of the samples is unavailable? The lookup fails.

The last part of the session was an example of Rage running with PRTs which was based on running a version of the engine that could swap between software and hardware virtual texturing. Once the required assets were cached (once for each) you could swap between the SW and HW versions. This was the first chance to see the artifacts from the anisotropic sampling, thanks to the super-magnifying glass that they applied to the SW version to find a microscopic line of pixel fail – something I’d never seen before on the PS3 version. They also demonstrated the better texture sampling result when using the HW since they could increase the anisotropy level when sampling higher frequency textures such as tire tracks.

A couple of things came up here and in the final Q&A:

1) Rage still uses a prepass to detect required pages, rather than using the errors returned when texture sampling.
2) Sampling from a lower mip on lookup fail requires looping in the shader.
3) There’s no real LRU support in the PRT system – you’d still need to go back to the feedback pass method.
4) In the case of the 3 textures per surface in Rage, for SW virtual textures, there’s one VM (texture) lookup before sampling, but in the HW version, there’s a VM (HW) lookup per texture. (Of course the SW version has 3 HW VM lookups too).

Overall, I’m glad I understand how it works, but I’m not sure of the practical use. My only ideas so far:

1) The obvious megatexturing.
2) Streaming of higher resolution mips.

I’m not sure of the mileage you’d get trying to apply PRTs to sparse voxel routines, since all it would really add is a high-level error return. I’m looking forward to seeing more practical applications.

Surf & Turf http://s2012.siggraph.org/attendees/sessions/100-121

From a Calm Puddle to a Stormy Ocean: Rendering Water in Uncharted (NaughtyDog) (ACM digital library version – similar to the GDC 2012 presentation)

The presentation started with the water tech from the earlier versions of Uncharted based on multiple layers, such as refraction based on depth, soft shadows, and foam. The system blended 2 textures advected by flow and offset in phase (very similar to Valve’s Water Flow in Portal 2 presented at SIGGRAPH 2010). For the triangle mesh representing the surface, they also mentioned that they move the vertices in circles.
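
The two-layer flow blend (the Valve-style approach the talk compared itself to) is simple enough to sketch. This is my own restatement of the general flow-map formulation, not Naughty Dog’s code; the names and the half-cycle phase offset are the standard version of the trick.

```cpp
#include <cmath>

// My restatement of the two-layer flow-advected blend the talk referenced
// (the Valve Portal 2 style water), not Naughty Dog's actual shader code.
struct FlowSample {
    float uv0[2];
    float uv1[2];
    float blend; // weight of the second fetch: lerp(tex(uv0), tex(uv1), blend)
};

inline FlowSample flowMapUVs(const float uv[2], const float flow[2], float time) {
    // Two phases offset by half a cycle; whenever one layer is about to wrap
    // (and would pop), the other layer carries all of the weight.
    const float phase0 = time - std::floor(time);
    const float phase1 = (time + 0.5f) - std::floor(time + 0.5f);

    FlowSample s;
    s.uv0[0] = uv[0] + flow[0] * phase0;
    s.uv0[1] = uv[1] + flow[1] * phase0;
    s.uv1[0] = uv[0] + flow[0] * phase1;
    s.uv1[1] = uv[1] + flow[1] * phase1;
    s.blend = std::fabs(2.0f * phase0 - 1.0f); // 1 at the wrap, 0 mid-cycle
    return s;
}
```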

The talk then moved on to the ocean in Uncharted 3. Rather than use Gerstner or Tessendorf, they wanted something simpler and settled on Wave Particles presented at SIGGRAPH 2007 which offered a lot of advantages, although the only ones I recall were SPU friendliness and tile-ability. Another example from Uncharted 3 was the flood in the corridors of the ocean liner. The flood itself was based on a simulation in Houdini and was rendered based on a skin with a set of joints (apparently enough joints to reach the limits of the animation system). The particle effects during the flood were actually placed by hand. They also covered the forces applied to the objects in the water which was simply based on an average position and normal.

The whole talk showed how Naughty Dog are a team that combines artistic vision with a strong technical implementation.

What if the Earth Was Flat: The Globe UI System in SSX (EA) (ACM digital library version)

This talk presented the mechanisms used to render the globe in SSX, which came with a specific set of limitations: it should be available from anywhere in the game and it could only use a minimal amount of memory. This meant that they needed to go procedural, basically rendering a quad and populating it with a heavy shader – apparently it ended up being a 600 line shader, was described as “not pretty”, and in the middle of development it was 17ms on 360 and 23ms on PS3.

The mountains were rendered with the Relaxed Cone Stepping for Relief Mapping technique. The actual memory used was limited to 2 textures, which contained the diffuse image of the earth’s surface, a cloud channel, the cone information and the height. Other effects were based on simple tricks using this information, such as using the height map to generate a specular value (i.e. low is ocean).

Other random things that came up:
* They missed the perspective transform of the globe when rendering the quad – similar to what is covered here.
* The quad was changed to an octagon to avoid rendering pixels that didn’t represent the globe – a trick mentioned elsewhere at SIGGRAPH this year by Emil Persson.
* Many hacks, such as passing through cloud, were applied when transitioning from globe to close-up in order to avoid artifacts.

Although the problem being solved was interesting, none of the techniques used were particularly new or revolutionary; this was more a presentation of producing nice effects within a very constrained budget.

Adaptive Level-of-Detail System for End of Nations (Petroglyph Games) (ACM digital library version)

This presentation was about automatic control of the use of level of detail (LOD) in an RTS environment where there’s a high object count and varying load with a large possible number of players (56 in this case). The game includes a frame rate monitor that is able to manipulate various detail settings in an attempt to maintain at least a minimum frame rate. When the system was implemented, the developers informed the QA and art teams so that they would be aware of possible quality drops in response to low frame rates. The changes made to the levels of detail are based on voting and a hysteresis is applied to avoid toggling the changes each frame in response to the frame rate improvement.
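
A minimal sketch of the kind of frame-rate-driven governor described, with hysteresis so detail doesn’t toggle every frame; the thresholds, the single integer detail level and the frame counts are invented for illustration rather than taken from the talk, which used a voting scheme across multiple settings.

```cpp
// Frame-rate-driven LOD governor with hysteresis; the thresholds, the single
// integer detail level and the frame counts are invented for illustration.
class LodGovernor {
public:
    LodGovernor(float targetFps, int maxLevel)
        : targetFps_(targetFps), maxLevel_(maxLevel) {}

    // Call once per frame with the measured frame time in seconds.
    // Returns the detail level to use (0 = full detail, maxLevel = lowest).
    int update(float frameSeconds) {
        const float fps = 1.0f / frameSeconds;
        if (fps < targetFps_) {
            // Below target: drop detail immediately and reset the recovery count.
            framesAboveTarget_ = 0;
            if (level_ < maxLevel_) ++level_;
        } else if (fps > targetFps_ * 1.2f) {
            // Hysteresis: only restore detail after being comfortably above
            // target for a couple of seconds, to avoid toggling every frame.
            if (++framesAboveTarget_ > 120 && level_ > 0) {
                --level_;
                framesAboveTarget_ = 0;
            }
        }
        return level_;
    }

private:
    float targetFps_;
    int maxLevel_;
    int level_ = 0;
    int framesAboveTarget_ = 0;
};
```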

The finale of the talk was an example of the LOD control system being applied in a frantic scene with a lot of units mutually attacking each other, overlaid with various effects. In my eyes, the effects of the LOD system couldn’t be seen until they got to the point of their “nuclear option”, which was toggling the list of units to be rendered each frame, resulting in flickering units. I think this is because most of the content in an RTS beyond units and projectiles is arbitrary eye candy, and its loss doesn’t affect someone playing the game.

Screen Space Decals in Warhammer 40,000: Space Marine (Relic) (ACM digital library version)

I missed this due to needing to leave as it began.

Electronic Theater

The Electronic Theater is one of my best sources of inspiration at the SIGGRAPH conference. Some of my favourites:

For The Remainder
I found this animation beautiful and wished I was one of the programmers involved, since the credits mention NPR software and tools programming.

How to Eat Your Apple
This was an intricate video which reminded me of Lil and Laarg from Escape Plan.

Mac ‘n’ Cheese
A frenetic chase with great animation.

Wanted Melody (Extract (NSFW))
A weird one that had the audience going ‘huh’ at the start and laughing like drains by the end.

SIGGRAPH 2012 – Sunday

SIGGRAPH started with the Fundamentals seminar which I skipped since the Level was Introductory and the Intended Audience was raw beginners.

My SIGGRAPH education started at Optimizing Realistic Rendering With Many-Light Methods (course notes here) which was heavily focussed on Virtual Point Light methods.

Instant radiosity

Alexander Keller started with a retrospective look at Instant Radiosity (1997 – original slides) from which the VPL concept could be said to originate. This presentation introduced 2 major ideas: the idea of lighting a scene from many individual points and accumulating the results; and the idea of a singularity due to the inverse square distance component of the lighting equation – for example, when the distance from the light is less than 1.0 units, the inverse square will increase the contribution of the light.

The original technique just used OpenGL to rasterize the scene multiple times with different lights enabled. The initial example sampled the light positions from an area light and the second example added virtual point lights to simulate a light bounce. At that time OpenGL would clamp the lighting to 1.0 when writing to the render target, which clamped the singularity away. This was the first time the course highlighted the need to compensate for the energy lost by clamping to avoid overdarkening in scenes, and this is where a lot of VPL research has focussed in the last few years.
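
As a minimal illustration of where the singularity and the clamp come from (my own diffuse-only simplification, not code from the course): the geometric term contains a 1/d² that explodes for VPLs very close to the shaded point, so practical implementations bound it and accept the lost energy.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 operator-(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 normalize(Vec3 v) {
    const float len = std::sqrt(dot(v, v));
    return { v.x / len, v.y / len, v.z / len };
}

// Diffuse-only VPL contribution with the classic clamp on the geometric term.
// Without the clamp, 1/d^2 explodes for VPLs near the shaded point (the
// singularity); with it, energy is lost, which is what bias compensation
// later tries to put back. Visibility is omitted for brevity.
inline float vplContribution(Vec3 p, Vec3 n, Vec3 vplPos, Vec3 vplNormal,
                             float vplIntensity, float clampMax) {
    const Vec3 toLight = vplPos - p;
    const float distSq = dot(toLight, toLight);
    const Vec3 l = normalize(toLight);
    const float cosReceiver = std::max(0.0f, dot(n, l));
    const float cosEmitter  = std::max(0.0f, -dot(vplNormal, l));
    float geometric = (cosReceiver * cosEmitter) / distSq;
    geometric = std::min(geometric, clampMax); // clamp away the singularity
    return vplIntensity * geometric;
}
```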

The last part of the talk went on to how to generate light paths but appeared to be an advert for the Advanced (Quasi) Monte Carlo Methods for Image Synthesis course later in the week.

Overall, the talk explained the original technique and its limitations but was heavy on equations with minimal descriptions, which will make the slides (which lack notes) somewhat harder to read – an audience member even raised a query about a missing equation during the Q&A.

Virtual Spherical Lights (originally presented at SIGGRAPH Asia)

Virtual Spherical Lights is an extension to Virtual Point Lights to better handle glossy surfaces with pointed BRDF lobes – the example given being a kitchen with specular responses on all of the materials. The presenter reiterated that clamping the contribution to avoid the singularity loses energy and VSLs intend to reduce the need for clamping. Instead of integrating single points that can generate the singularities, VSLs integrate over the surface of a virtual spherical light (like a pseudo area light). The results look much better in the kitchen scene although in a contrived scene, “Anisotropic Tableau,” which contains strong directional lights and an anisotropic metallic plane, convergence requires a lot of lights.

The talk highlighted the reuse of the work in participating media extending ray lights (presented at SIGGRAPH this year as Virtual Ray Lights for Rendering Scenes with Participating Media) to beam lights (Progressive Virtual Beam Lights as seen at EGSR).

During the Q&A, someone asked how you pick the radius. The presenter said it’s based on the 10 nearest lights. The original paper also mentions a user specified constant.

Overall, the technique seems to give better results than VPLs at a minor cost increase, but the random choice of local light count and user specified constant feels a bit too voodoo.

Improved VPL distribution

3 techniques for distributing VPLs were covered in this talk, each with its own example scene. The examples for this talk were a simple box with an object in it, a room shaped like an S curve with the light source at one end of the S and the camera at the other, and another kitchen with reflective walls. In each case, VPLs were needed in specific places to improve their contribution to the resulting image.

1) Rejection of VPLs based on their contribution versus the average result (Simple and Robust Iterative Importance Sampling of Virtual Point Lights by Iliyan Georgiev, Philipp Slusallek).

The average result is generated from an initial set of pilot VPLs. The presenter said that there’s no need to be accurate and the mechanism is cheap and simple to implement (a rough sketch of this rejection step follows this list). However, this distribution is not good for local interreflections due to the sparse distribution of pilot VPLs. This mechanism is used in Autodesk’s 360 product, covered later.

2) Build light paths from the light to the camera and select the 2nd vertex from the camera as the VPL.

Based on something like Metropolis Instant Radiosity you can mutate the paths using the Metropolis-Hastings algorithm to get a range of values. Although this technique works better than rejection in more complex cases, implementing Metropolis-Hastings is notoriously hard.

3) Sample VPLs from camera

Appearing in Bidirectional Instant Radiosity and Combining Global and Local Virtual Lights for Detailed Glossy Illumination, this method generates local VPLs which influence local tiles, rejecting those with zero contribution. This produces a result that can compensate for the earlier clamping when calculating the global VPL contribution. When compared to VSLs, VSLs lose highlights, but local VPLs lose shadow definition.
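
As promised above, a rough sketch of the rejection idea from the first method. The acceptance rule here (probability proportional to a candidate’s estimated contribution relative to the pilot average, capped at 1) is my assumption of how such a scheme could look, not the paper’s exact formulation.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Each VPL carries whatever the renderer needs plus a cheap estimate of its
// contribution to the image (how that estimate is formed is not shown here).
struct Vpl {
    float estimatedContribution;
    // position, normal, flux, ...
};

std::vector<Vpl> distributeVpls(const std::vector<Vpl>& candidates,
                                const std::vector<Vpl>& pilots,
                                std::mt19937& rng) {
    // The pilot set gives a cheap reference level for "useful" contribution.
    float pilotAverage = 0.0f;
    for (const Vpl& p : pilots) pilotAverage += p.estimatedContribution;
    pilotAverage /= static_cast<float>(pilots.size());

    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    std::vector<Vpl> accepted;
    for (const Vpl& c : candidates) {
        // Strong contributors are always kept; weak ones survive in proportion
        // to their contribution. A full implementation would reweight accepted
        // VPLs by 1/acceptProb to keep the estimate unbiased.
        const float acceptProb =
            std::min(1.0f, c.estimatedContribution / pilotAverage);
        if (uniform(rng) < acceptProb)
            accepted.push_back(c);
    }
    return accepted;
}
```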

Overall, the rejection method seemed to give the best results for the complexity of implementation, although the scenes in the talk and in the paper are both relatively simple. Looking at the S curve room, my first thought was: how is any real-time technique going to light that? To be fair, since that scene came from the Metropolis example, that was the only method that solved it. Personally I think it should be used by all VPL techniques as a new Cornell box, but it’s the kind of space I’d expect to be better lit anyway.

Scalability with many lights I

This presentation introduced LightCuts and later Multidimensional LightCuts.

The basic principle is that each of the contributions from thousands or millions of VPLs is minimal, and therefore the contribution of multiple VPLs can be evaluated at a time. To support this, the VPLs in the scene are recursively clustered, where each cluster picks a dominant VPL as its representative. These clusters form a hierarchy called a light tree.

Once the light tree is built, you can light each pixel in the scene by creating a cut through the tree, where the cut is the list of best representative lights that you will evaluate to light that pixel. Starting with the root of the light tree, you evaluate the error in lighting the pixel with that node and then use the child nodes if the error would be too great. The quoted error threshold is 2%, based on Weber’s Law, which quantitatively describes perceptible differences – 2% implies that transitions should not be visible.

Multidimensional LightCuts extends light cuts into additional dimensions, such as those handled by higher order rasterization – time, antialiasing samples, sampling the aperture of the camera, and even participating media were provided as examples. The Multidimensional extension discretizes the points at which you’ll be gathering lighting contributions and then clusters those into a gather tree. The cartesian product of the light and gather trees creates a product graph which represents the combinations of cuts through both trees. Rather than creating the huge graph, selecting cuts in both trees achieves the same results. The error also needs to be evaluated between the gather cluster and light cluster rather than point to cluster in classical light cuts.
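
A minimal sketch of the per-pixel cut refinement as I understood it; the representative-light shading and the conservative per-cluster error bound (the part doing the real work in the paper) are assumed to be provided elsewhere, and the 2% default is the Weber-based threshold quoted above.

```cpp
#include <cstddef>
#include <vector>

// Binary light tree node; each node holds a representative light for the
// cluster it aggregates (details omitted here).
struct LightNode {
    const LightNode* left = nullptr;
    const LightNode* right = nullptr;
};

// Assumed provided elsewhere: the exact contribution of a cluster's
// representative light at the shading point, and a conservative bound on the
// error of using that representative for the whole cluster.
float shadeWithRepresentative(const LightNode& node);
float clusterErrorBound(const LightNode& node);

float shadeWithLightcut(const LightNode& root, float relativeError = 0.02f) {
    std::vector<const LightNode*> cut = { &root };
    float estimate = shadeWithRepresentative(root);

    for (std::size_t i = 0; i < cut.size(); ) {
        const LightNode* node = cut[i];
        const bool isLeaf = !node->left && !node->right;
        if (!isLeaf && clusterErrorBound(*node) > relativeError * estimate) {
            // Refine: replace the cluster with its children and update the estimate.
            estimate -= shadeWithRepresentative(*node);
            estimate += shadeWithRepresentative(*node->left);
            estimate += shadeWithRepresentative(*node->right);
            cut[i] = node->left;     // revisit this slot with the left child
            cut.push_back(node->right);
        } else {
            ++i;                     // this node stays in the cut
        }
    }
    return estimate;
}
```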

Although I’d heard of the LightCuts technique, I’d not looked into its implementation, but this presentation made it easy to understand. There’s an obvious parallel with the Point Based Global Illumination techniques that use clustering of the points and then select points for lighting based on a solid angle error metric, but I do prefer the more accurate error metric of LightCuts.

Scalability with many lights II

This presentation demonstrated alternatives to the LightCuts method.

The first example was based on interpreting the lights-to-points mapping as a matrix and evaluating several rows and columns (Matrix Sampling For Global Illumination). Rows are randomly selected and lit, and then the columns within the selected rows are clustered based on cost minimization. Once a set of clustered columns is selected, we can use those columns to light all of the other pixels. Examples demonstrated good results except where there were lots of independent lighting contributions which the clustering couldn’t handle.

The technique can also be extended to support animation by having a 2D matrix per frame and generating a 3D matrix. In this case, the sampled rows become horizontal slices and the clusters are smaller 3D matrices that are split based on highest cost with the split being based on time or lights depending on which is the better alternative.

The talk also introduced LightSlice: Matrix Slice Sampling for the Many-Lights Problem which chooses different clusters for large slices of the matrix. Another alternative implementation is clustering the visibility rather than the shading as part of the previously mentioned Combining Global and Local Virtual Lights for Detailed Glossy Illumination. The talk mentioned another clustered visibility method Real-time Indirect Illumination with Clustered Visibility which was presented by Dong in 2009.

Working with the problem as a matrix was an interesting way of looking at it. However, the first method, being fundamentally based on the selection of the initial rows, has an obvious parallel with randomly selecting points from the scene. It’s this randomness that makes this another technique that feels like voodoo. The system does highlight the minimal contribution from a lot of the VPLs, which is something techniques need to consider for scalability.

Real-time many-light rendering

This talk was presented by Carsten Dachsbacher who, as far as I am aware, has previously worked in the demo scene and has a good idea of what compromises are required to get from hell slow (a term commonly used in this presentation to describe the performance of an offline implementation) to real-time.

The major problem with the real-time implementation of VPLs is visibility computation between each point and each light. In real-time this means focussing more on rasterization than raytracing and reducing the number of VPLs, looking at thousands rather than a million.

To start with, the simplest way to generate VPLs is rasterization from the point of view of the light into something like a Reflective Shadow Map (PDF) (which is similar to the first stage of Light Propagation Volumes). Once the VPLs have been generated, there are lots of optimizations for using many VPLs with deferred shading, but the real bottleneck is the visibility check. Using shadow maps to accelerate this was the focus of the next part of the talk. The majority of the cost of the tiny shadow map required per VPL is in the rasterization stage. The suggested mechanism was to store a precalculated set of random points in the scene (as {triangle ID, barycentric coordinate} pairs to allow dynamic scenes) and use those to render the shadow maps – resulting in a “crappy map” due to the sparsity of the points leaving holes in the map. A technique called pull-push can reconstruct the missing parts of the map and can be applied to all maps at once if stored in the same texture. An alternative to rendering a set of points like this is to render something like a QSplat hierarchy which was implemented in the Micro-Rendering for Scalable, Parallel Final Gathering paper using CUDA.
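
The {triangle ID, barycentric coordinate} point representation is easy to picture in code; here is a small sketch of resolving such points each frame before splatting them into the per-VPL “crappy maps”. The structure names are mine, and the splatting and pull-push reconstruction steps are omitted.

```cpp
#include <vector>

// Random scene points stored as {triangle index, barycentric coordinates} so
// they track dynamic geometry; they are resolved to world space each frame
// before being splatted into the tiny per-VPL shadow maps. The splatting and
// pull-push reconstruction steps are omitted here.
struct Vec3 { float x, y, z; };
struct Triangle { Vec3 v0, v1, v2; };

struct SurfacePoint {
    unsigned triangleId;
    float b0, b1; // barycentric coordinates (b2 = 1 - b0 - b1)
};

Vec3 resolveSurfacePoint(const SurfacePoint& sp, const std::vector<Triangle>& tris) {
    const Triangle& t = tris[sp.triangleId];
    const float b2 = 1.0f - sp.b0 - sp.b1;
    return { sp.b0 * t.v0.x + sp.b1 * t.v1.x + b2 * t.v2.x,
             sp.b0 * t.v0.y + sp.b1 * t.v1.y + b2 * t.v2.y,
             sp.b0 * t.v0.z + sp.b1 * t.v1.z + b2 * t.v2.z };
}
```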

The next part of the course covered achieving high quality results which focused on bias compensation and participating media. Reformulating the lighting equation, you can arrive at a version that bounds the transport to avoid singularities and then a residual which you iteratively accumulate in screen space. Only a few iterations are required, with more iterations needed at discontinuities which can be found by downsampling G buffers and creating a discontinuity buffer. As with all screen space techniques, this suffers from the lack of non-visible information that hasn’t been captured in a G buffer.

The participating media example referred to Approximate Bias Compensation for Rendering Scenes with Heterogeneous Participating Media. Handling participating media needs smarter bias compensation since VPLs generate sparkly points and sampling along a ray from the camera is an additional expense however you handle participating media. The underlying technique adds some real time approximations to Raab et al’s Unbiased Global Illumination with Participating Media.

The most important part of the presentation was the agreement that the visibility problem is the hardest part of the equation, something repeated from the PBGI work and something that needs an additional scene mechanism to handle, such as rasterization, raytracing or a voxel scene representation. I’m also glad to know that researchers are looking at the problem from a real-time point of view, although 8 fps isn’t sufficiently real-time yet. This presentation also left me with a large number of references to read!

Autodesk 360

Although very similar to the 360 presentation at HPG, this focused more on how the previously presented techniques from the course were integrated into and extended for Autodesk’s 360 cloud rendering service. They went with Multidimensional LightCuts since they describe it as scalable, uniform and supporting advanced effects. There were 3 main areas in which they improved the algorithm:

1) Eye Ray Splitting to better handle specular effects on glossy objects.
2) VPL targeting to support good lighting when rendering small subsets of large architectural models.
3) Directional Variant Light support since many Autodesk products expose them (and they are now part of bidirectional lightcuts).

The presentation covered 5 major advantages, all explained in depth on the slides themselves.

My Wrapup

By the end of the course it appeared that there were some important takeaways:

1) VPLs are an active area of research mostly due to limitations with the bias and extensions to include other features.
2) Bias compensation is incredibly important due to the clamping to avoid problems with the inverse distance squared relation. Participating media makes this even scarier due to the generation of sparkly points local to the VPL.
3) Specular reflections are incredibly complex with VPL techniques.
4) Real time usage is limited due to the visibility term in the equation.

Technical Papers Fast Forward

A staple of the SIGGRAPH experience, the Technical Papers Fast Forward session has all of the technical papers cued up and ready to show with less than a minute per presenter to cover the salient points of their paper and attempt to coerce you to see their presentation. The video preview was made available prior to SIGGRAPH and was also shown in the Electronic Theater sessions. Typically songs, funny videos or overdramatic voice overs are used to promote the papers, but this year it was mostly haikus.

These are the main papers that caught my eye:

Theory, Analysis and Applications of 2D Global Illumination
Based on the idea of understanding 2D GI better and extending that understanding to applications in 3D GI.

Reconstructing the Indirect Light Field for Global Illumination
This reconstruction method appears to improve image quality based on sparse samples which could be a possibility for low resolution rendering or sampling.

Animating bubble interactions in a liquid foam (from http://www.cse.ohio-state.edu/~tamaldey/papers.html)
The videos for this talk made it appear interesting and approachable, based on what looks like a simple variant of Voronoi diagrams.

Eulerian Video Magnification for Revealing Subtle Changes in the World
The 2 example videos for this paper demonstrated detecting a person’s heart rate and an infant’s breathing based on amplifying changes in videos.

Tools for Placing Cuts and Transitions in Interview Video (Video)
This paper appeared interesting for its possible applications to evil by recutting any interview clips to appear linear. Fortunately a later paper appeared to be investigating techniques for forensic detection of modifications to video which could lead to an interesting arms race.

Resolution Enhancement by Vibrating Displays (Video)
I was surprised by the novel use of vibration to increase the perceived resolution of an image.

Design of Self-supporting Surfaces
This seemed interesting due to the comparisons to the low technology methods used by architects such as Gaudi – hang weights from strings and you can model your building upside down!

Position-Correcting Tools for 2D Digital Fabrication
The video showed a milling tool manipulated by hand that seemed to autocorrect as the user followed the template guide.

Additional Links:

The Self Shadow blog is doing a fantastic job of collecting all of the links to complement Ke-Sen Huang’s Technical Papers page.