A short while ago, Adobe set loose the newest version of the Flash player, along with the development SDK.
Among the advertised benefits of this new player is native support for drawing three-dimensional objects. Nothing on the scale of Papervision3D, though: the programming interface is extremely limited in scope and, although a great step up from what Flash supported natively in earlier versions, it’s still a long way from what desktop 3D programmers have grown accustomed to. We can only hope that existing engine developers (or new teams) will harness the low-level primitives of Flash Player 10 into high-level drawing primitives.
Graphics Processing Units
I suspect that many readers, especially those with a Flash/Flex background, are not familiar with the modern graphics pipeline found in the hardware of most reasonably recent computers, so a quick primer on the subject could be helpful.
The CPU in a typical computer is a general-purpose piece of hardware: it handles 3D about as well as it handles spreadsheets or web browsing, which is to say without any particular optimization for it. This led to the creation of a new class of hardware specifically designed to perform computations related to 3D: it could only do these computations, but it did them darn fast. So, video games and CAD programs started delegating some processing to this hardware, which slowly evolved over time into our modern GPU hardware.
Early on, the GPU could only handle elementary functions such as projecting 3D triangles on a 2D screen, but it quickly gained new features, such as mapping textures on triangles or handling lights (this is what the TnL revolution was: the availability of Transform and Lighting in hardware), and more recently the ability to apply an arbitrary number of operations independently to every pixel drawn on the screen. Today’s GPUs are massive beasts, able to handle thousands of pixels in parallel where a CPU could only handle one or two.
How does it work?
The CPU collects or creates the data that should be displayed. This is generally composed of 3D data (triangles that are to be drawn) and 2D data (textures to be mapped onto triangles). Most of the time, these are sent to the GPU during an initialization step and stored in the GPU memory. Sometimes, new data is computed at runtime and is sent asynchronously (that is, the GPU receives the new data while drawing the old data, so that no time is wasted).
Once the required data is present on the GPU, the CPU sends rendering instructions. These instructions are usually quite small (“Draw object #3 at position (x,y,z) using textures #5 and #7”) but result in massive amounts of work (object #3 contains thousands of triangles and fills millions of pixels on the screen). As a result, the majority of data transfers happen between the GPU local memory and the GPU local processor, using high-speed connections on the video card.
The result is that the CPU can send a few dozen cheap rendering instructions, then handle whatever it needs to handle (gathering user input, performing physics simulations, handling the network connections), while the GPU toils away at running the provided instructions on thousands of polygons and millions of pixels with amazing performance.
A common technique to improve performance is batching. Once the multiprocessors start working on a new instruction, they work fast, but starting a new task takes some time. As such, drawing 600 polygons in one instruction is significantly faster than drawing 300 polygons twice with two instructions. Batching consists of grouping as many polygons as possible in a single batch that can be drawn with a single instruction, and is the backbone of performance improvement.
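To make the batching argument concrete, here is a toy cost model, written in Python rather than ActionScript so that it can be run anywhere. The function name and both cost constants are made up for illustration; they are not measurements of any real GPU.

```python
def draw_cost(batch_sizes, call_overhead=100, per_polygon=1):
    """Estimated cost of issuing one draw instruction per batch.

    call_overhead and per_polygon are hypothetical units of time:
    every instruction pays the fixed overhead once, then a small
    cost for each polygon it draws.
    """
    return sum(call_overhead + per_polygon * size for size in batch_sizes)


one_batch = draw_cost([600])          # a single 600-polygon instruction
two_batches = draw_cost([300, 300])   # the same polygons, two instructions
tiny_batches = draw_cost([2] * 300)   # the same polygons, 300 instructions

print(one_batch, two_batches, tiny_batches)
```

Under these assumed constants the fixed overhead dominates as soon as batches get small, which is why the number of instructions matters more than the number of polygons.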
The exact details of what happens on the GPU have changed over time, with the addition of textures, lighting, instancing, bump-mapping, arbitrary pixel shaders and vertex shaders, 64-bit numbers, integer support, the unified shader model of DirectX 10, and a whole lot of other esoteric names for esoteric features. The core processes delegated to the GPU, however, have not really changed:
- Vertex operations involve gathering 3D data from various sources (usually, local memory that has been initialized with data from the CPU) and then transforming it according to certain sets of rules, so that it ends up as a 2D shape projected onto the screen. Various additional operations at this stage involve:
  - Lighting: for every point of the 3D shape, the illumination is computed and stored.
  - Instancing: creating many copies of a single piece of 3D data, such as soldiers in an army or trees in a forest.
  - Tessellation: transforming non-polygonal curved surfaces into polygons.
- Pixel operations involve filling the projected 3D data with on-screen pixels, using predefined rules for deciding what the color will be or whether the pixel is to be drawn. A large number of additional operations at this stage can be performed:
  - Blending: instead of replacing existing color on the screen, a mix is computed to achieve transparency or illumination effects.
  - Textures: data for every pixel is collected from one or more textures.
  - Lighting: using the illumination stored in the 3D shape, as well as other sources (normal maps, bump maps) illumination is computed for every pixel.
  - Depth: if the pixel being drawn is behind the pixel present at the same location on the screen, it’s not drawn.
  - Stencils: using a screen-sized boolean texture, rendering can be limited to only certain areas.
The result of these operations can be displayed or kept in an off-screen buffer for use as a source of 2D data later on.
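As a rough illustration of two of the pixel operations listed above (depth and blending), here is a minimal software sketch in Python. The buffer layout, the function name and the "smaller depth is nearer" convention are assumptions made for the example; real hardware does this in parallel and with many more options.

```python
def draw_pixel(color_buffer, depth_buffer, x, y, color, depth, alpha=1.0):
    """Write one pixel, applying a depth test and then alpha blending.

    Assumed convention: a smaller depth value is closer to the viewer.
    Returns True if the pixel passed the depth test and was drawn.
    """
    if depth >= depth_buffer[y][x]:
        return False  # hidden behind what is already on the screen
    old = color_buffer[y][x]
    # Blend the incoming color with the existing one instead of replacing it.
    color_buffer[y][x] = tuple(
        alpha * new + (1.0 - alpha) * previous
        for new, previous in zip(color, old)
    )
    depth_buffer[y][x] = depth
    return True


WIDTH, HEIGHT = 4, 3
color_buffer = [[(0.0, 0.0, 0.0)] * WIDTH for _ in range(HEIGHT)]
depth_buffer = [[float("inf")] * WIDTH for _ in range(HEIGHT)]

# Red pixel, fairly far away: drawn, since the buffer is empty.
drew_red = draw_pixel(color_buffer, depth_buffer, 1, 1, (1.0, 0.0, 0.0), 5.0)
# Blue pixel, even further away: rejected by the depth test.
drew_blue = draw_pixel(color_buffer, depth_buffer, 1, 1, (0.0, 0.0, 1.0), 9.0)
# Green pixel, nearer and half transparent: blended over the red one.
drew_green = draw_pixel(color_buffer, depth_buffer, 1, 1, (0.0, 1.0, 0.0), 2.0, alpha=0.5)
```

Note that the order in which the three pixels arrive does not matter for the depth test itself, which is exactly what makes a depth buffer convenient.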

Back to Flash Player 10
Flash Player 10 allows the programmer to control the pixel operations level. That is, the programmer provides the 2D data and the already-projected shapes, and the Flash player performs the required pixel operations and outputs the resulting pixels to a Graphics object. The function to perform this is the raw Graphics.drawTriangles(), or the combination of Graphics.drawGraphicsData() and the GraphicsTrianglePath class.
I see two problems with this:
There is no Depth Buffering
This is the biggest problem, in my opinion: drawing projected data does not handle the depth of pixels at all, which makes it impossible to tell whether a certain triangle stands in front of another. As a consequence, the Graphics.drawTriangles() function assumes that the triangles it is asked to draw are ordered from the furthest to the nearest using the Painter’s Algorithm. Since triangles have no reason to be in the correct order to begin with, the programmer is forced to sort triangles by hand. This has two obvious performance-related consequences:
- No caching of the polygon data on the GPU is possible: since the order of polygons might change from one frame to the next, the CPU-GPU transfer lines are used to send the polygon data every frame, which decreases performance.
Even worse, general batching is impossible: if polygons from two batches are interleaved, then the batches must be split to respect the order of the polygons. This means that instead of sending 600-polygon batches to the GPU like a video game would, the Flash Player will be sending a whole lot of 2-polygon or 4-polygon batches instead.
Of course, smart programmers might still achieve batching as a custom trick in specific situations, but common rendering techniques which automatically achieve high amounts of batching tend to rely on depth buffering to work, and as such might not be carried over as easily.
- Instead of being done on the GPU (often in a highly optimized fashion), sorting has to happen on the CPU. Polygon-based sorting is either incorrect (sorting polygon centers might result in overlap problems with polygons that are not facing the eye), or complex (using a BSP tree to determine optimal rendering order). Not to mention the problem of overlapping or crossing polygons in the Painter’s Algorithm which can only be solved by slicing the inconvenient polygons beforehand.
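The two consequences above can be sketched in a few lines of Python. This is the simple (and sometimes incorrect) variant that sorts by average depth; the function names, the material labels and the "larger z is further away" convention are assumptions made for the example.

```python
from itertools import groupby


def average_depth(triangle):
    """Mean z of the three (x, y, z) vertices of a view-space triangle."""
    return sum(z for (_x, _y, z) in triangle) / 3.0


def painter_order(items):
    """Sort (material, triangle) pairs from furthest to nearest.

    Assumed convention: larger z means further from the eye. Sorting by
    average depth can misorder overlapping or intersecting polygons.
    """
    return sorted(items, key=lambda item: average_depth(item[1]), reverse=True)


def count_batches(items):
    """Count draw instructions when only consecutive triangles that
    share a material can be grouped into one batch."""
    return sum(1 for _ in groupby(items, key=lambda item: item[0]))


def flat_triangle(z):
    """Hypothetical helper: a small triangle lying at constant depth z."""
    return ((0.0, 0.0, z), (1.0, 0.0, z), (0.0, 1.0, z))


scene = [
    ("brick", flat_triangle(1.0)),
    ("brick", flat_triangle(3.0)),
    ("grass", flat_triangle(2.0)),
    ("grass", flat_triangle(4.0)),
]
ordered = painter_order(scene)
print(count_batches(scene), count_batches(ordered))
```

Grouped by material, the scene needs two batches; once the depth order interleaves the two materials, every triangle ends up in a batch of its own.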
Vertex operations remain on the CPU
In order to know where to render things, one needs to project 3D shapes onto a 2D screen. This task is traditionally performed by the GPU in classic desktop applications, but has to be done on the CPU in Flash 10. This is usually done by constructing a view-and-projection matrix, and then using Matrix3D.transformVector() to transform vectors individually, then manually extracting the x and y coordinates of the resulting vectors and plugging them into an array that is compatible with Graphics.drawTriangles().
Resist the temptation of using Matrix3D.transformVectors() to transform an array of vectors in one shot: while this may result in higher performance, it is not equivalent: if your matrix incorporates projection (which is required for projecting 3D shapes on 2D screens) that projection will not be applied, only the affine transform will be. Besides, you will still have to manually extract the x and y coordinates.
My choice was to use Matrix3D.transformVectors() to apply the entire affine transform, then apply the projection manually using arithmetic operators:
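// Apply the affine transform to all vertices in one call.
matrix.transformVectors(triangles, transformed);
for (var i : int = 0; i < vertices; ++i) {
    // Perspective divide by z, then shift and scale to screen coordinates.
    var z : Number = transformed[3*i+2];
    projected[2*i] = ((transformed[3*i] / z) + 1) * width;
    projected[2*i+1] = ((transformed[3*i+1] / z) + 1) * height;
}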
This choice was dictated by another fun fact: unlike OpenGL or Direct3D, there is no simple and obvious way of creating a projection matrix in Flash 10. In fact, when I tried to use Matrix3D.rawData to create a projection matrix from its mathematical definition, the program complained that the matrix was not invertible (which is often the point of a projection matrix).
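For reference, here is what a projection matrix built from its mathematical definition looks like, sketched in plain Python rather than with Matrix3D. It follows the OpenGL gluPerspective convention (camera looking down the negative z axis, column vectors), which is an assumption of this example; the function names are hypothetical, and whether Matrix3D.rawData accepts such a matrix is not something this sketch checks.

```python
import math


def perspective_matrix(fov_y_degrees, aspect, near, far):
    """4x4 perspective matrix as a list of rows (gluPerspective layout)."""
    f = 1.0 / math.tan(math.radians(fov_y_degrees) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def project(matrix, point):
    """Multiply (x, y, z, 1) by the matrix, then divide by w.

    The division is the step an affine-only transform leaves out.
    Returns normalized device coordinates in [-1, 1] for visible points.
    """
    x, y, z = point
    vector = (x, y, z, 1.0)
    cx, cy, cz, cw = (sum(m * v for m, v in zip(row, vector)) for row in matrix)
    return (cx / cw, cy / cw, cz / cw)


def to_screen(ndc, width, height):
    """Map x and y from [-1, 1] to pixel coordinates, with y pointing down."""
    return ((ndc[0] + 1.0) * 0.5 * width, (1.0 - ndc[1]) * 0.5 * height)


matrix = perspective_matrix(90.0, 1.0, 1.0, 100.0)
ndc = project(matrix, (1.0, 1.0, -2.0))
pixel = to_screen(ndc, 640, 480)
near_plane = project(matrix, (0.0, 0.0, -1.0))
far_plane = project(matrix, (0.0, 0.0, -100.0))
```

With this convention, points on the near plane land at depth -1 and points on the far plane at depth +1 after the divide.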
A step forward
Reading the disappointments above, it would be easy to conclude that Flash Player 10 is a worthless technology with only marginal improvements over previous versions and limited lip service to the resurrection of browser-based 3D. This is not the case, and there are massive benefits to Flash Player 10. Even if there is still no batching or GPU-based geometry transformation, and even if the user has to sort polygons by hand, there is a high-performance way to map a texture using texture coordinates, which should definitely improve the performance of several existing 3D engines from the Flash Player 9 era within a matter of weeks. An active engine undergoing continued development efforts, such as Papervision3D, should be able to integrate the new features quite easily.
Besides, Flash 10 is not aimed at rivaling high-end 3D video games (or at least, I hope it isn’t) but rather at applying simple 3D effects to DisplayObjects, thereby leveraging existing 2D codebases in a fresh new direction. And this, it does quite well: sorting works quite simply (as long as you have no overlaps), transforming is handled internally using matrices, and so on.
Still, a better interface could have been chosen to support general 3D. I could understand that depth buffering cannot work on every possible piece of phone hardware, but sacrificing what is a de facto standard on desktop and laptop computers seems strange. The same goes for vertex and index buffers (sent to the GPU). These can be trivially emulated in software, yet provide tremendous performance benefits by lowering the transfers from CPU to GPU.
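As an illustration of the idea behind vertex and index buffers (store each vertex once, then refer to it by index), here is a minimal software sketch in Python. The function name is hypothetical and the example only shows the data layout, not any transfer to a GPU.

```python
def build_indexed_mesh(triangle_soup):
    """Deduplicate the vertices of a list of triangles.

    Returns (vertices, indices): vertices holds each distinct (x, y, z)
    tuple once, indices holds three entries per triangle.
    """
    vertices = []
    lookup = {}
    indices = []
    for triangle in triangle_soup:
        for vertex in triangle:
            if vertex not in lookup:
                lookup[vertex] = len(vertices)
                vertices.append(vertex)
            indices.append(lookup[vertex])
    return vertices, indices


# A quad made of two triangles that share one edge.
quad = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)),
    ((0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)),
]
vertices, indices = build_indexed_mesh(quad)
print(len(vertices), indices)
```

The six corners of the two triangles collapse to four stored vertices, so only four positions need to be transformed or transferred; the saving grows with meshes where many triangles share each vertex.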
In conclusion, don’t expect to write the next Crysis or Half-Life 2 using Flash Player 10, but the technology should be able to equal older 3D game graphics without too much difficulty (Crash Bandicoot, The Sims, Ragnarok Online, Second Life), and enhancing the look of existing Flex applications with 3D effects should be a breeze.
Hi. I'm Victor Nicollet,
A nice read.
Small comment:
“(this is what the TnL revolution was: Texture and Lighting)”
TnL is an acronym for Transform and Lighting not Texture and Lighting.
You’re right. Corrected the above post.
After hearing about Graphics.drawTriangles, I was pretty stoked, but once I realized it only took x and y, I was really disappointed.
The complexity of ordering triangles seems likely to outweigh the benefits of using the routines! Sorrow!
Maybe flash 11 will add a drawTrianglez method…
An interesting and informative article, thank you
I am puzzled as to why you chose to write your own projection algorithm rather than use Utils3D.projectVectors() which is part of the new Graphics API.
Surely the in-built projector is faster, being written directly into the flash player, as opposed to actionscript code which has to be interpreted and doesn’t have the benefit of low-level memory management, or other techniques for improving algorithm speed.
Perhaps i’m wrong, you seem to know a lot more about this than me.
I suspect the Utils3D.projectVectors() is faster, indeed. My problem when evaluating the library had been the construction of the projection matrix to be applied, not the application itself (as I was attempting to construct a specific projection matrix from its 4×4 coordinates, and encountered several difficulties).
I’ve never considered constructing the matrix using the raw data.. I don’t see what advantage it would give over using the Matrix3D transformation methods.
I’m trying to build a 3d engine using the new flash stuff, but i’m unsure of how to handle the depth sorting or occlusion culling. Since there is no access to the GPU it needs to be a fast software only solution. I’ve read about BSP trees and understand that they don’t handle dynamic changes all that well, which is important. I also read that BSP can be used for static polygons and something else for movable objects.
Given what’s available do you think you could recommend the fastest method or algorithm for depth and culling? No problem if not, just thought i’d ask.
Thank you.
I’m a hacker at heart, so I like composing my own component-wise matrices like in the good old days.
BSP tends to be good for static data, as you mention, but constructing an optimal BSP tree is a mess that you don’t want to do at runtime. Other tree structures, such as ABT or Octrees, can be useful as well, especially if you combine them with simple culling procedures without pre-computations (such as antiportals). Two issues can get quite annoying, though:
- You have to decide how your non-scene objects occlude each other (self-occlusion can be handled through an appropriate sort algorithm): are they drawn as a whole, in order, or can two objects occlude each other?
- You have to cull all polygons that are behind the camera. This can be quite annoying, because it has to be done on a per-polygon basis, so you want to rely on backface culling as much as you can.
All of this points towards splitting your objects into convex or concave sub-sections that are handled as a whole (to avoid handling individual polygons) so that mutual occlusion does not happen and backface culling is enough to eliminate backfaces if you can see only part of a sub-section.
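The two per-polygon tests mentioned above can be sketched as follows, in Python. The conventions are assumptions made for the example: the camera looks along positive z, front faces are counter-clockwise once projected, and a triangle with any vertex at or behind the near plane is simply dropped rather than clipped. The function names are hypothetical.

```python
def signed_area(a, b, c):
    """Twice the signed area of a 2D triangle.

    Positive when a, b, c turn counter-clockwise (y axis pointing up).
    """
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])


def visible_triangles(triangles, near=0.1):
    """Keep only view-space triangles that are in front of the camera
    and facing it."""
    kept = []
    for triangle in triangles:
        # Behind-camera test: reject if any vertex is at or behind the
        # near plane (a conservative shortcut; proper clipping would
        # slice the triangle instead).
        if any(z <= near for (_x, _y, z) in triangle):
            continue
        # Backface test on the projected 2D triangle.
        a, b, c = ((x / z, y / z) for (x, y, z) in triangle)
        if signed_area(a, b, c) > 0.0:
            kept.append(triangle)
    return kept


front = ((0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 2.0))
back = (front[0], front[2], front[1])  # same triangle, opposite winding
behind = ((0.0, 0.0, -1.0), (1.0, 0.0, -1.0), (0.0, 1.0, -1.0))
straddling = ((0.0, 0.0, 2.0), (1.0, 0.0, -1.0), (0.0, 1.0, 2.0))

result = visible_triangles([front, back, behind, straddling])
```

Only the front-facing triangle in front of the camera survives; handling whole convex sub-sections at once is what lets you skip most of these per-polygon checks.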
Wow that’s a lot of food for thought. ABTs and Octrees seem useful, and right now i’m uncertain as to whether I should use them over BSP. Portals seem to require application at design level which i’m not fond of.
BSP for static environments along with z-buffering for dynamic moving objects seems to be the simplest approach, but perhaps not the fastest, i’m not sure. AFAIK the quake engine used a similar approach.
The library is going to be pretty generic with nothing fancy like particle systems and only elementary lighting/shading somewhere down the line. I’m aiming at original playstation or N64 graphics, if that, as I don’t think flash needs or can handle anything more.
I want it to handle pretty much any scene I throw at it, i.e. indoor/outdoor, landscapes, high buildings etc. Octrees or ABTs seem suited to most of this whereas BSPs are apparently best suited only for indoor type scenes; and then i’ve read about combinations of the two.. having BSP trees at the leaf nodes of octrees.
What a headache! I suppose i’ll just pick one solution and run with it after a bit more reading. The problem is a lot of literature includes using these methods in conjunction with the GPU, and i’m dealing with software only.
Thanks for the pointers.
Careful. Antiportals are not portals: they’re just large opaque areas that can be used to perform occlusion culling faster. They don’t have to be placed at design level, you can incorporate them in your octree construction simply by coloring those octree cell walls that are fully contained within opaque geometry (with the added benefit that you can cull antiportals with other antiportals this way by culling the corresponding cells).