# Pathfinder, a Fast GPU-based Font Rasterizer in Rust

Ever since some initial discussions with Raph Levien (author of `font-rs`) at RustConf last September, I’ve been thinking about ways to improve vector graphics rendering using modern graphics hardware, specifically for fonts. These ideas began to come together in December, and over the past couple of months I’ve been working on actually putting them into a real, usable library. They’ve proved promising, and now I have some results to show.

Today I’m pleased to announce Pathfinder, a Rust library for OpenType font rendering. The goal is nothing less than to be the fastest vector graphics renderer in existence, and the results so far are extremely encouraging. Not only is it very fast according to the traditional metric of raw rasterization performance, it’s practical, featuring very low setup time (end-to-end time superior to the best CPU rasterizers), best-in-class rasterization performance even at small glyph sizes, minimal memory consumption (both on CPU and GPU), compatibility with existing font formats, portability to most graphics hardware manufactured in the past five years (DirectX 10 level), and security/safety.

## Performance

To illustrate what it means to be both practical and fast, consider these two graphs:

(Click each graph for a larger version.)

The first graph is a comparison of Pathfinder with other rasterization algorithms with all vectors already prepared for rendering (and uploaded to the GPU, in the case of the GPU algorithms). The second graph is the total time taken to prepare and rasterize a glyph at a typical size, measured from the point right after loading the OTF file in memory to the completion of rasterization. Lower numbers are better. All times were measured on a Haswell Intel Iris Pro (mid-2015 MacBook Pro).

From these graphs, we can see two major problems with existing GPU-based approaches:

1. Many algorithms aren’t that fast, especially at small sizes. Algorithms aren’t fast just because they run on the GPU! In general, we want rendering on the GPU to be faster than rendering on the CPU; that’s often easier said than done, because modern CPUs are surprisingly speedy. (Note that, even if the GPU is somewhat slower at a task than the CPU, it may be a win for CPU-bound apps to offload some work; however, this makes the use of the algorithm highly situational.) It’s much better to have an algorithm that actually beats the CPU.

2. Long setup times can easily eliminate the speedup of algorithms in practice. This is known as the “end-to-end” time, and real-world applications must carefully pay attention to it. One of the most common use cases for a font rasterizer is to open a font file, rasterize a character set from it (Latin-1, say) at one pixel size for later use, and throw away the file. With Web fonts now commonplace, this use case becomes even more important, because Web fonts are frequently rasterized once and then thrown away as the user navigates to a new page. Long setup times, whether the result of tessellation or more exotic approaches, are real problems for these scenarios, since what the user cares about is the document appearing quickly. Faster rasterization doesn’t help if it regresses that metric.

(Of the two problems mentioned above, the second is often totally ignored in the copious literature on GPU-based vector rasterization. I’d like to see researchers start to pay attention to it. In most scenarios, we don’t have the luxury of inventing our own GPU-friendly vector format. We’re not going to get the world to move away from OpenType and SVG.)

## Vector drawing basics

In order to understand the details of the algorithm, some basic knowledge of vector graphics is necessary. Feel free to skip this section if you’re already familiar with Bézier curves and fill rules.

OpenType fonts are defined in terms of resolution-independent Bézier curves. TrueType outlines contain lines and quadratic Béziers only, while OpenType outlines can contain lines, quadratic Béziers, and cubic Béziers. (Right now, Pathfinder only supports quadratic Béziers, but extending the algorithm to support cubic Béziers should be straightforward.)

In order to fill vector paths, we need a fill rule. A fill rule is essentially a test that determines, for every point, whether that point is inside or outside the curve (and therefore whether it should be filled in). OpenType’s fill rule is the winding rule, which can be expressed as follows:

1. Pick a point that we want to determine the color of. Call it P.

2. Choose any point outside the curve. (This is easy to determine since any point outside the bounding box of the curve is trivially outside the curve.) Call it Q.

3. Let the winding number be 0.

4. Trace a straight line from Q to P. Every time we cross a curve going clockwise, add 1 to the winding number. Every time we cross a curve going counterclockwise, subtract 1 from the winding number.

5. The point is inside the curve (and so should be filled) if and only if the winding number is not zero.

## How it works, conceptually

The basic algorithm that Pathfinder uses is the by-now-standard trapezoidal pixel coverage algorithm pioneered by Raph Levien’s libart (to the best of my knowledge). Variations of it are used in FreeType, `stb_truetype` version 2.0 and up, and `font-rs`. These implementations differ as to whether they use sparse or dense representations for the coverage buffer. Following `font-rs`, and unlike FreeType and `stb_truetype`, Pathfinder uses a dense representation for coverage. As a result, Raph’s description of the algorithm applies fairly well to Pathfinder as well.

There are two phases to the algorithm: drawing and accumulation. During the draw phase, Pathfinder computes coverage deltas for every pixel touching (or immediately below) each curve. During the accumulation phase, the algorithm sweeps down each column of pixels, computing winding numbers (fractional winding numbers, since we’re antialiasing) and filling pixels appropriately.

The most important concept to understand is that of the coverage delta. When drawing high-quality antialiased curves, we care not only about whether each pixel is inside or outside the curve but also how much of the pixel is inside or outside the curve. We treat each pixel that a curve passes through as a small square and compute how much of the square the curve occupies. Because we break down curves into small lines before rasterizing them, these coverage areas are always trapezoids or triangles, and so and so we can use trapezoidal area expressions to calculate them. The exact formulas involved are somewhat messy and involve several special cases; see Sean Barrett’s description of the `stb_truetype` algorithm for the details.

Rasterizers that calculate coverage in this way differ in whether they calculate winding numbers and fill at the same time they calculate coverage or whether they fill in a separate step after coverage calculation. Sparse implementations like FreeType and `stb_truetype` usually fill as they go, while dense implementations like `font-rs` and Pathfinder fill in a separate step. Filling in a separate step is attractive because it can be simplified to a prefix sum over each pixel column if we store the coverage for each pixel as the difference between the coverage of the pixel and the coverage of the pixel above it. In other words, instead of determining the area of each pixel that a curve covers, for each pixel we determine how much additional area the curve covers, relative to the coverage area of the immediately preceding pixel.

This modification has the very attractive property that all coverage deltas both inside and outside the curve are zero, since points completely inside a curve contribute no additional area (except for the first pixel completely inside the curve). This property is key to Pathfinder’s performance relative to most vector texture algorithms. Calculating exact area coverage is slow, but calculating coverage deltas instead of absolute coverage essentially allows us to limit the expensive calculations to the edges of the curve, reducing the amount of work the GPU has to do to a fraction of what it would have to do otherwise.

In order to fill the outline and generate the final glyph image, we simply have to sweep down each column of pixels, calculating the running total of area coverage and writing pixels according to the winding rule. The formula to determine the color of each pixel is simple and fast: `min(|coverage total so far|, 1.0)` (where 0.0 is a fully transparent pixel, 1.0 is a fully opaque pixel, and values in between are different shades). Importantly, all columns are completely independent and can be calculated in parallel.

## Implementation details

With the advanced features in OpenGL 4.3, this algorithm can be straightforwardly adapted to the GPU.

1. As an initialization step, we create a coverage buffer to hold delta coverage values. This coverage buffer is a single-channel floating-point framebuffer. We always draw to the framebuffer with blending enabled (`GL_FUNC_ADD`, both source and destination factors set to `GL_ONE`).

2. We expand the TrueType outlines from the variable-length compressed `glyf` format inside the font to a fixed-length, but still compact, representation. This is necessary to be able to operate on vertices in parallel, since variable-length formats are inherently sequential. These outlines are then uploaded to the GPU.

3. Next, we draw each curve as a patch. In a tessellation-enabled drawing pipeline like the one that Pathfinder uses, rather than submitting triangles directly to the GPU, we submit abstract patches which are converted into triangles in hardware. We use indexed drawing (`glDrawElements`) to take advantage of the GPU’s post-transform cache, since most vertices belong to two curves.

4. For each path segment that represents a Bézier curve, we tessellate the Bézier curve into a series of small lines on the GPU. Then we expand all lines out to screen-aligned quads encompassing their bounding boxes. (There is a complication here when these quads overlap; we may have to generate extra one-pixel-wide quads here and strip them out with backface culling. See the comments inside the tessellation control shader for details.)

5. In the fragment shader, we calculate trapezoidal coverage area for each pixel and write it to the coverage buffer. This completes the draw step.

6. To perform the accumulation step, we attach the coverage buffer and the destination texture to images. We then dispatch a simple compute shader with one invocation per pixel column. For each row, the shader reads from the coverage buffer and writes the total coverage so far to the destination texture. The `min(|coverage total so far, 1.0)` expression above need not be computed explicitly, because our unsigned normalized atlas texture stores colors in this way automatically.

The performance characteristics of this approach are excellent. No CPU preprocessing is needed other than the conversion of the variable-length TrueType outline to a fixed-length format. The number of draw calls is minimal—any number of glyphs can be rasterized in one draw call, even from different fonts—and the depth and stencil buffers remain unused. Because the tessellation is performed on the fly instead of on the CPU, the amount of data uploaded to the GPU is minimal. Area coverage is essentially only calculated for pixels on the edges of the outlines, avoiding expensive fragment shader invocations for all the pixels inside each glyph. The final accumulation step has ideal characteristics for GPU compute, since branch divergence is nonexistent and cache locality is maximized. All pixels in the final buffer are only painted at most once, regardless of the number of curves present.

## Compatibility concerns

For any GPU code designed to be shipping to consumers, especially OpenGL 3.0 and up, compatibility and portability are always concerns. As Pathfinder is designed for OpenGL 4.3, released in 2012, it is no exception. Fortunately, the algorithm can be adapted in various ways depending on the available functionality.

• When compute shaders are not available (OpenGL 4.2 or lower), Pathfinder uses OpenCL 1.2 instead. This is the case on the Mac, since Apple has not implemented any OpenGL features newer than OpenGL 4.2 (2011). The compute-shader crate abstracts over the subset of OpenGL and OpenCL necessary to access GPU compute functionality.

• When tessellation shaders are not available (OpenGL 3.3 or lower), Pathfinder uses geometry shaders, available in OpenGL 3.2 and up.

(Note that it should be possible to avoid both geometry shaders and tessellation shaders, at the cost of performing that work on the CPU. This turns out to be quite fast. However, since image load/store is a hard requirement, this seems pointless: both image load/store and geometry shaders were introduced in DirectX 10-level hardware.)

Although these system requirements may seem high at first, the integrated graphics found in any Intel Sandy Bridge (2011) CPU or later meet them.

## Future directions

The immediate next step for Pathfinder is to integrate into WebRender as an optional accelerated path for applicable fonts on supported GPUs. Beyond that, there are several features that could be added to extend Pathfinder itself.

1. Support vector graphics outside the font setting. As Pathfinder is a generic vector graphics rasterizer, it would be interesting to expose an API allowing it to be used as the backend for e.g. an SVG renderer. Rendering the entire SVG specification is outside of the scope of Pathfinder itself, but it could certainly be the path rendering component of a full SVG renderer.

2. Support CFF and CFF2 outlines. These have been seen more and more over time, e.g. in Apple’s new San Francisco font. Adding this support involves both parsing and extracting the CFF2 format and adding support for cubic Bézier curves to Pathfinder.

3. Support WOFF and WOFF2. In the case of WOFF2, this involves writing a parser for the transformed `glyf` table.

4. Support subpixel antialiasing. This should be straightforward.

5. Support emoji. The Microsoft `COLR` and Apple `sbix` extensions are straightforward, but the Google `SVG` table allows arbitrary SVGs to be embedded into a font. Full support for SVG is probably out of scope of Pathfinder, but perhaps the subset used in practice is small enough to support.

6. Optimize overlapping paths. It would be desirable to avoid antialiasing edges that are covered by other paths. The fill rule makes this trickier than it initially sounds.

7. Support hinting. This is low-priority since it’s effectively obsolete with high-quality antialiasing, subpixel AA, and high-density displays, but it might be useful to match the system rendering on Windows.

## Conclusion

Pathfinder is available on GitHub and should be easily buildable using the stable version of Rust and Cargo. Please feel free to check it out, build it, and report bugs! I’m especially interested in reports of poor performance, crashes, or rendering problems on a variety of hardware. As Pathfinder does use DirectX 10-level hardware features, some amount of driver pain is unavoidable. I’d like to shake these problems out as soon as possible.

Finally, I’d like to extend a special thanks to Raph Levien for many fruitful discussions and ideas. This project wouldn’t have been possible without his insight and expertise.

# Drawing CSS Box Shadows in WebRender

I recently landed a change in WebRender to draw CSS box shadows using a specialized shader. Because it’s an unusual approach to drawing shadows, I thought I’d write up how it works.

Traditionally, browsers have drawn box shadows in three passes: (1) draw the unblurred box (or a nine-patch corner/edge for one); (2) blur in the horizontal direction; (3) blur in the vertical direction. This works because a Gaussian blur is a separable filter: it can be computed as the product of two one-dimensional convolutions. This is a reasonable approach, but it has downsides. First of all, it has a high cost in memory bandwidth; for a standard triple box blur on the CPU, every pixel is touched 6 times, and on the GPU every pixel is touched `$$6 sigma$$` times (or `$$3 sigma$$` times if a common linear interpolation trick is used), not counting the time needed to draw the unblurred image in the first place. (`$$sigma$$` here is half the specified blur radius.) Second, the painting of each box shadow requires no fewer than three draw calls including (usually) one shader switch, which are expensive, especially on mobile GPUs. On the GPU, it’s often desirable to use parallel algorithms that reduce the number of draw calls and state changes, even if those algorithms have a large number of raw floating-point operations—simply because the GPU is a stream processor that’s designed for such workloads.

The key trick used in WebRender is to take advantage of the fact that we’re blurring a (potentially rounded) box, not an ordinary image. This allows us to express the Gaussian blur in (in the case of an unrounded box) a closed form and (in the case of a rounded box) a closed form minus a sum computed with a small loop. To draw a box shadow, WebRender runs a shader implementing this logic and caches the results in a nine-patch image mask stored in a texture atlas. If the page contains multiple box shadows (even those with heterogeneous sizes and radii), the engine batches all the invocations of this shader into one draw call. This means that, no matter how many box shadows are in use, the number of draw calls and state changes remains constant (as long as the size of the texture atlas isn’t exhausted). Driver overhead and memory bandwidth are minimized, and the GPU spends as much time as possible in raw floating-point computation, which is exactly the kind of workload it’s optimized for.

The remainder of this post will be a dive into the logic of the fragment shader itself. The source code may be useful as a reference.

For those unfamiliar with OpenGL, per-pixel logic is expressed with a fragment shader (sometimes called a pixel shader). A fragment shader (in this case) is conceptually a function that maps arbitrary per-pixel input data to the RGB values defining the resulting color for that pixel. In our case, the input data for each pixel simply consists of the `$$x$$` and `$$y$$` coordinates for that pixel. We’ll call our function `$$RGB(u,v)$$` and define it as follows:

``````$$RGB(u,v) = sum_{y=-oo}^{oo} sum_{x=-oo}^{oo}G(x-u)G(y-v)RGB_{"rounded box"}(x,y)$$
``````

Here, `$$RGB_{"rounded box"}(x,y)$$` is the color of the unblurred, possibly-rounded box at the coordinate `$$(x,y)$$`, and `$$G(x)$$` is the Gaussian function used for the blur:

``````$$G(x)=1/sqrt(2 pi sigma^2) e^(-x^2/(2 sigma^2))$$
``````

A Gaussian blur in one dimension is a convolution that maps each input pixel to an average of the pixels adjacent to it weighted by `$$G(x)$$`, where `$$x$$` is the distance from the output pixel. A two-dimensional Gaussian blur is simply the product of two one-dimensional Gaussian blurs, one for each dimension. This definition leads to the formula for `$$RGB(x,y)$$` above.

Since CSS box shadows blur solid color boxes, the color of each pixel is either the color of the shadow (call it `$$RGB_{"box"}$$`) or transparent. We can rewrite this into two functions:

``````$$RGB(x,y) = RGB_{"box"}C(x,y)$$
``````

and

``````$$C(u,v) = sum_{y=-oo}^{oo} sum_{x=-oo}^{oo}G(x-u)G(y-v)C_{"rounded box"}(x,y)$$
``````

where `$$C_{"rounded box"}(x,y)$$` is 1.0 if the point $$(x,y)$$ is inside the unblurred, possibly-rounded box and 0.0 otherwise.

Now let’s start with the simple case, in which the box is unrounded. We’ll call this function `$$C_{"blurred box"}$$`:

``````$$C_{"blurred box"}(u,v) = sum_{y=-oo}^{oo} sum_{x=-oo}^{oo}G(x-u)G(y-v)C_{"box"}(x,y)$$
``````

where `$$C_{"box"}(x,y)$$` is 1.0 if the point $$(x,y)$$ is inside the box and 0.0 otherwise.

Let `$$x_{"min"}, x_{"max"}, y_{"min"}, y_{"max"}$$` be the left, right, top, and bottom extents of the box respectively. Then `$$C_{"box"}(x,y)$$` is 1.0 if `$$x_{"min"} <= x <= x_{"max"}$$` and `$$y_{"min"} <= y <= y_{"max"}$$` and 0.0 otherwise. Now let’s rearrange `$$C_{"blurred box"}(x,y)$$` above:

``````$$C_{"blurred box"}(u,v) = (sum_{y=-oo}^{y_{"min"} - 1} sum_{x=-oo}^{x=oo} G(x-u)G(y-v)C_{"box"}(x,y)) + (sum_{y=y_{"min"}}^{y_{"max"}} (sum_{x=-oo}^{x_{"min"}-1} G(x-u)G(y-v)C_{"box"}(x,y)) + (sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v)C_{"box"}(x,y)) + (sum_{x=x_{"max"}+1}^{x=oo} G(x-u)G(y-v)C_{"box"}(x,y))) + (sum_{y=y_{"max"} + 1}^{oo} sum_{x=-oo}^{x=oo} G(x)G(y)C_{"box"}(x,y))$$
``````

We can now eliminate several of the intermediate sums, along with `$$C_{"box"}(x,y)$$`, using its definition and the sum bounds:

``````$$C_{"blurred box"}(u,v) = sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v)$$
``````

Now let’s simplify this expression to a closed form. To begin with, we’ll approximate the sums with integrals:

``````$$C_{"blurred box"}(u,v) ~~ int_{y_{"min"}}^{y_{"max"}} int_{x_{"min"}}^{x_{"max"}} G(x-u)G(y-v) dxdy$$

$$= int_{y_{"min"}}^{y_{"max"}} G(y-v) int_{x_{"min"}}^{x_{"max"}} G(x-u) dxdy$$
``````

Now the inner integral can be evaluated to a closed form:

``````$$int_{x_{"min"}}^{x_{"max"}}G(x-u)dx = int_{x_{"min"}}^{x_{"max"}}1/sqrt(2 pi sigma^2) e^(-(x-u)^2/(2 sigma^2))dx = 1/2 "erf"((x_{"max"}-u)/(sigma sqrt(2))) - 1/2 "erf"((x_{"min"}-u)/(sigma sqrt(2)))$$
``````

`$$"erf"(x)$$` here is the Gauss error function. It is not found in GLSL (though it is found in `<math.h>`), but it does have the following approximation suitable for evaluation on the GPU:

``````$$"erf"(x) ~~ 1 - 1/((1+a_1x + a_2x^2 + a_3x^3 + a_4x^4)^4)$$
``````

where `$$a_1$$` = 0.278393, `$$a_2$$` = 0.230389, `$$a_3$$` = 0.000972, and `$$a_4$$` = 0.078108.

Now let’s finish simplifying `$$C(u,v)$$`:

``````$$C_{"blurred box"}(u,v) ~~ int_{y_{"min"}}^{y_{"max"}} G(y-v) int_{x_{"min"}}^{x_{"max"}} G(x-u) dxdy$$

$$= int_{y_{"min"}}^{y_{"max"}} G(y-v) (1/2 "erf"((x_{"max"}-u)/(sigma sqrt(2))) - 1/2 "erf"((x_{"min"}-u)/(sigma sqrt(2)))) dy$$

$$= 1/2 "erf"((x_{"max"}-u)/(sigma sqrt(2))) - 1/2 "erf"((x_{"min"}-u)/(sigma sqrt(2))) int_{y_{"min"}-v}^{y_{"max"}} G(y-v) dy$$

$$= 1/4 ("erf"((x_{"max"}-u)/(sigma sqrt(2))) - "erf"((x_{"min"}-u)/(sigma sqrt(2)))) ("erf"((y_{"max"}-v)/(sigma sqrt(2))) - "erf"((y_{"min"}-v)/(sigma sqrt(2))))$$
``````

And this gives us our closed form formula for the color of the blurred box.

Now for the real meat of the shader: the handling of nonzero border radii. CSS allows boxes to have elliptical radii in the corners, with separately defined major axis and minor axis lengths. Each corner can have separately defined radii; for simplicity, we only consider boxes with identical radii on all corners in this writeup, although the technique readily generalizes to heterogeneous radii. Most border radii on the Web are circular and homogeneous, but to handle CSS properly our shader needs to support elliptical heterogeneous radii in their full generality.

As before, the basic function to compute the pixel color looks like this:

``````$$C(u,v) = sum_{y=-oo}^{oo} sum_{x=-oo}^{oo}G(x-u)G(y-v)C_{"rounded box"}(x,y)$$
``````

where `$$C_{"rounded box"}(x,y)$$` is 1.0 if the point $$(x,y)$$ is inside the box (now with rounded corners) and 0.0 otherwise.

Adding some bounds to the sums gives us:

``````$$C(u,v) = sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v) C_{"rounded box"}(x,y)$$
``````

`$$C_{"rounded box"}(x,y)$$` is 1.0 if `$$C_{"box"}(x,y)$$` is 1.0—i.e. if the point `$$(x,y)$$` is inside the unrounded box—and the point is either inside the ellipse defined by the value of the `border-radius` property or outside the border corners entirely. Let `$$C_{"inside corners"}(x,y)$$` be 1.0 if this latter condition holds and 0.0 otherwise—i.e. 1.0 if the point `$$(x,y)$$` is inside the ellipse defined by the corners or completely outside the corner area. Graphically, `$$C_{"inside corners"}(x,y)$$` looks like a blurry version of this:

Then, because `$$C_{"box"}(x,y)$$` is always 1.0 within the sum bounds, `$$C_{"rounded box"}(x,y)$$` reduces to `$$C_{"inside corners"}(x,y)$$`:

``````$$C(u,v) = sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v) C_{"inside corners"}(x,y)$$
``````

Now let `$$C_{"outside corners"}(x,y)$$` be the inverse of `$$C_{"inside corners"}(x,y)$$`—i.e. `$$C_{"outside corners"}(x,y) = 1.0 - C_{"inside corners"}(x,y)$$`. Intuitively, `$$C_{"outside corners"}(x,y)$$` is 1.0 if `$$(x,y)$$` is inside the box but outside the rounded corners—graphically, it looks like one shape for each corner. With this, we can rearrange the formula above:

``````$$C(u,v) = sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v) (1.0 - C_{"outside corners"}(x,y))$$

$$= sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v) - G(x-u)G(y-v)C_{"outside corners"}(x,y)$$

$$= (sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v)) - sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v)C_{"outside corners"}(x,y)$$

$$= C_{"blurred box"}(u,v) - sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v)C_{"outside corners"}(x,y)$$
``````

We’ve now arrived at our basic strategy for handling border corners: compute the color of the blurred unrounded box, then “cut out” the blurred border corners by subtracting their color values. We already have a closed form formula for `$$C_{"blurred box"}(x,y)$$`, so let’s focus on the second term. We’ll call it `$$C_{"blurred outside corners"}(x,y)$$`:

``````$$C_{"blurred outside corners"}(u,v) = sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v)C_{"outside corners"}(x,y)$$
``````

Let’s subdivide `$$C_{"outside corners"}(x,y)$$` into the four corners: top left, top right, bottom right, and bottom left. This is valid because every point belongs to at most one of the corners per the CSS specification—corners cannot overlap.

``````$$C_{"blurred outside corners"}(u,v) = sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v)(C_{"outside TL corner"}(x,y) + C_{"outside TR corner"}(x,y) + C_{"outside BR corner"}(x,y) + C_{"outside BL corner"}(x,y))$$

$$= (sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v)(C_{"outside TL corner"}(x,y) + C_{"outside TR corner"}(x,y))) + sum_{y=y_{"min"}}^{y_{"max"}} sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)G(y-v)(C_{"outside BR corner"}(x,y) + C_{"outside BL corner"}(x,y))$$

$$= (sum_{y=y_{"min"}}^{y_{"max"}} G(y-v) ((sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)C_{"outside TL corner"}(x,y)) + sum_{x=x_{"min"}}^{x_{"max"}} G(x-u) C_{"outside TR corner"}(x,y))) + sum_{y=y_{"min"}}^{y_{"max"}} G(y-v) ((sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)C_{"outside BL corner"}(x,y)) + sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)C_{"outside BR corner"}(x,y))$$
``````

Let `$$a$$` and `$$b$$` be the horizontal and vertical border radii, respectively. The vertical boundaries of the top left and top right corners are defined by `$$y_min$$` on the top and `$$y_min + b$$` on the bottom; `$$C_{"outside TL corner"}(x,y)$$` and `$$C_{"outside TR corner"}(x,y)$$` will evaluate to 0 if `$$y$$` lies outside this range. Likewise, the vertical boundaries of the bottom left and bottom right corners are `$$y_max - b$$` and `$$y_max$$`.

(Note, again, that we assume all corners have equal border radii. The following simplification depends on this, but the overall approach doesn’t change.)

Armed with this simplification, we can adjust the vertical bounds of the sums in our formula:

``````$$C_{"blurred outside corners"}(u,v) = (sum_{y=y_{"min"}}^{y_{"min"} + b} G(y-v) ((sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)C_{"outside TL corner"}(x,y)) + sum_{x=x_{"min"}}^{x_{"max"}} G(x-u) C_{"outside TR corner"}(x,y))) + sum_{y=y_{"max"} - b}^{y_{"max"}} G(y-v) ((sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)C_{"outside BL corner"}(x,y)) + sum_{x=x_{"min"}}^{x_{"max"}} G(x-u)C_{"outside BR corner"}(x,y))$$
``````

And, following similar logic, we can adjust their horizontal bounds:

``````$$C_{"blurred outside corners"}(u,v) = (sum_{y=y_{"min"}}^{y_{"min"} + b} G(y-v) ((sum_{x=x_{"min"}}^{x_{"min"} + a} G(x-u)C_{"outside TL corner"}(x,y)) + sum_{x=x_{"max"} - a}^{x_{"max"}} G(x-u) C_{"outside TR corner"}(x,y))) + sum_{y=y_{"max"} - b}^{y_{"max"}} G(y-v) ((sum_{x=x_{"min"}}^{x_{"min"} + a} G(x-u)C_{"outside BL corner"}(x,y)) + sum_{x=x_{"max"} - a}^{x_{"max"}} G(x-u)C_{"outside BR corner"}(x,y))$$
``````

At this point, we can work on eliminating all of the `$$C_{"outside corner"}$$` functions from our expression. Let’s look at the definition of `$$C_{"outside TR corner"}(x,y)$$`. `$$C_{"outside TR corner"}(x,y)$$` is 1.0 if the point `$$(x,y)$$` is inside the rectangle formed by the border corner but outside the ellipse that defines that corner. That is, `$$C_{"outside TR corner"}(x,y)$$` is 1.0 if `$$y_{"min"} <= y <= y_{"min"} + b$$` and `$$E_{"TR"}(y) <= x <= x_{"max"}$$`, where `$$E_{"TR"}(y)$$` defines the horizontal position of the point on the ellipse with the given `$$y$$` coordinate. `$$E_{"TR"}(y)$$` can easily be derived from the equation of an ellipse centered at `$$(x_0, y_0)$$:

``````$$(x-x_0)^2/a^2 + (y-y_0)^2/b^2 = 1$$

$$(x-x_0)^2 = a^2(1 - (y-y_0)^2/b^2)$$

$$x = x_0 + sqrt(a^2(1 - (y-y_0)^2/b^2))$$

$$E_{"TR"}(y) = x_0 + a sqrt(1 - ((y-y_0)/b)^2)$$
``````

Parallel reasoning applies to the other corners.

Now that we have bounds within which each `$$C_{"outside corner"}$$` function evaluates to 1.0, we can eliminate all of these functions from the definition of `$$C_{"blurred outside corners"}$$`:

``````$$C_{"blurred outside corners"}(u,v) = (sum_{y=y_{"min"}}^{y_{"min"} + b} G(y-v) ((sum_{x=x_{"min"}}^{E_{"TL"}(y)} G(x-u)) + sum_{x=E_{"TR"}(y)}^{x_{"max"}} G(x-u))) + sum_{y=y_{"max"} - b}^{y_{"max"}} G(y-v) ((sum_{x=x_{"min"}}^{E_{"BL"}(y)} G(x-u)) + sum_{x=E_{"BR"}(y)}^{x_{"max"}} G(x-u))$$
``````

To simplify this a bit further, let’s define an intermediate function:

``````$$E(y, y_0) = a sqrt(1 - ((y - y_0)/b)^2)$$
``````

And rewrite `$$C_{"blurred outside corners"}(x,y)$$` as follows:

``````$$C_{"blurred outside corners"}(u,v) = (sum_{y=y_{"min"}}^{y_{"min"} + b} G(y-v) ((sum_{x=x_{"min"}}^{x_{"min"} + a - E(y, y_{"min"} + b)} G(x-u)) + sum_{x=x_{"max"} - a + E(y, y_{"min"} + b)}^{x_{"max"}} G(x-u))) + (sum_{y=y_{"max" - b}}^{y_{"max"}} G(y-v) ((sum_{x=x_{"min"}}^{x_{"min"} + a - E(y, y_{"max"} - b)} G(x-u)) + sum_{x=x_{"max"} - a + E(y, y_{"max"} - b)}^{x_{"max"}} G(x-u)))$$
``````

Now we simply follow the procedure we did before for the box. Approximate the inner sums with integrals:

``````$$C_{"blurred outside corners"}(u,v) ~~ (sum_{y=y_{"min"}}^{y_{"min"} + b} G(y-v) ((int_{x_{"min"}}^{x_{"min"} + a - E(y, y_{"min"} + b)} G(x-u)dx) + int_{x_{"max"} - a + E(y, y_{"min"} + b)}^{x_{"max"}} G(x-u)dx)) + (sum_{y=y_{"max" - b}}^{y_{"max"}} G(y-v) ((int_{x_{"min"}}^{x_{"min"} + a - E(y, y_{"max"} - b)} G(x-u)dx) + int_{x_{"max"} - a + E(y, y_{"max"} - b)}^{x_{"max"}} G(x-u)dx))$$
``````

Replace `$$int G(x)dx$$` with its closed-form solution:

``````$$C_{"blurred outside corners"}(u,v) ~~ (sum_{y=y_{"min"}}^{y_{"min"} + b} G(y-v) (1/2 "erf"((x_{"min"} - u + a - E(y, y_{"min"} - v + b)) / (sigma sqrt(2))) - 1/2 "erf"((x_{"min"} - u) / (sigma sqrt(2))) + (1/2 "erf"((x_{"max"} - u) / (sigma sqrt(2))) - 1/2 "erf"((x_{"max"} - u - a + E(y, y_{"min"} - v + b)) / (sigma sqrt(2)))))) + sum_{y=y_{"max"} - b}^{y_{"max"}} G(y-v) (1/2 "erf"((x_{"min"} - u + a - E(y, y_{"max"} - v - b)) / (sigma sqrt(2))) - 1/2 "erf"((x_{"min"} - u) / (sigma sqrt(2))) + (1/2 "erf"((x_{"max"} - u) / (sigma sqrt(2))) - 1/2 "erf"((x_{"max"} - u - a + E(y, y_{"max"} - v - b)) / (sigma sqrt(2)))))$$

$$= 1/2 (sum_{y=y_{"min"}}^{y_{"min"} + b} G(y-v) ("erf"((x_{"min"} - u + a - E(y, y_{"min"} - v + b)) / (sigma sqrt(2))) - "erf"((x_{"min"} - u) / (sigma sqrt(2))) + ("erf"((x_{"max"} - u) / (sigma sqrt(2))) - "erf"((x_{"max"} - u - a + E(y, y_{"min"} - v + b)) / (sigma sqrt(2)))))) + sum_{y=y_{"max"} - b}^{y_{"max"}} G(y-v) ("erf"((x_{"min"} - u + a - E(y, y_{"max"} - v - b)) / (sigma sqrt(2))) - "erf"((x_{"min"} - u) / (sigma sqrt(2))) + ("erf"((x_{"max"} - u) / (sigma sqrt(2))) - "erf"((x_{"max"} - u - a + E(y, y_{"max"} - v - b)) / (sigma sqrt(2)))))$$
``````

And we’re done! Unfortunately, this is as far as we can go with standard mathematical functions. Because the parameters to the error function depend on `$$y$$`, we have no choice but to evaluate the inner sum numerically. Still, this only results in one loop in the shader.

The current version of the shader implements the algorithm basically as described here. There are several further improvements that could be made:

1. The Gauss error function approximation that we use is accurate to `$$5 xx 10^-4$$`, which is way more accurate than we need. (Remember that the units here are 8-bit color values!) The approximation involves computing `$$x, x^2, x^3, " and " x^4$$`, which is expensive since we evaluate the error function many times for each pixel. It could be a nice speedup to replace this with a less accurate, faster approximation. Or we could use a lookup table.

2. We should not even compute the amount to subtract from the corners if the pixel in question is more than `$$3 sigma$$` pixels away from them.

3. `$$C_{"blurred outside corners"}(x,y)$$` is a function of sigmoid shape. It might be interesting to try approximating it with a logistic function to avoid the loop in the shader. It might be possible to do this with a few iterations of least squares curve fitting on the CPU or with some sort of lookup table. Unfortunately, the parameters to the approximation will have to be per-box-shadow, because `$$C_{"blurred outside corners"}$$` depends on `$$a$$`, `$$b$$`, `$${x, y}_{"min, max"}$$`, and `$$sigma$$`.

4. Evaluating `$$G(x)$$` could also be done with a lookup table. There is already a GitHub issue filed on this.

Finally, it would obviously be nice to perform some comprehensive benchmarks of this rendering algorithm, fully optimized, against the standard multiple-pass approach to drawing box shadows. In general, WebRender is not at all GPU bound on most hardware (like most accelerated GPU vector graphics rasterizers), so optimizing the count of GPU raster operations has not been a priority so far. When and if this changes (which I suspect it will as the rendering pipeline gets more and more optimized), it may be worth going back and optimizing the shader to reduce the load on the ALU. For now, however, this technique seems to perform quite well in basic testing, and since WebRender is so CPU bound it seems likely to me that the reduction in draw calls and state changes will make this technique worth the cost.

# Revamped Parallel Layout in Servo

Over the past week I’ve submitted a series of pull requests that significantly revamp the way parallel layout works in Servo. Originally I did this work to improve performance, but it’s also turned out to be necessary to implement more advanced CSS 2.1 features. As it’s a fairly novel algorithm (as far as I’m aware) I’d like to take some time to explain it. I’ll start with where we are in Servo head and explain how it evolved into what’s in my branch. This post assumes a little knowledge about how browser engines work, but little else.

## Overview

Servo’s layout operates on a flow tree, which is similar to the render tree in WebKit or Blink and the frame tree in Gecko. We call it a flow tree rather than a render tree because it consists of two separate data types: flows, which are organized in a tree, and boxes, which belong to flows and are organized in a flat list. Roughly speaking, a flow is an object that can be laid out in parallel with other flows, while a box is a box that must be laid out sequentially with other boxes in the same flow. If you’re familiar with WebKit, you can think of a box as a `RenderObject`, and if you’re familiar with Gecko, you can think of a box as a `nsFrame`. We want to lay out boxes in parallel as much as possible in Servo, so we group boxes into flows that can be laid out in parallel with one another.

Here’s a simple example. Suppose we have the following document:

``````<html>
<body>
<p>Four score and seven years ago our <b>fathers</b> brought forth on this
continent, a new nation, conceived in Liberty, and dedicated to the
proposition that all men are created equal.</p>
<p>Now we are engaged in a great civil war, testing whether that nation,
or any nation so conceived and so dedicated, can long endure. We are met
on a great battle-field of that war. We have come to <i>dedicate</i> a
portion of that field, as a final resting place for those who here gave
their lives that that nation might live. It is altogether fitting and
proper that we should do this.</p>
</body>
</html>
``````

This would result in the following flow tree:

Notice that there are three inline boxes under each `InlineFlow`. We have multiple boxes for each because each contiguous sequence of text in the same style—known as a text run—needs a box. During layout, the structure of the flow tree remains immutable, but the boxes get cut up into separate lines, so there will probably be many more boxes after layout (depending on the width of the window).

One neat thing about this two-level approach is that boxes end up flattened into a flat list instead of a linked data structure, improving cache locality and memory usage and making style recalculation faster because less needs to be allocated. Another benefit (and in fact the original impetus for this data structure) is that the line breaking code need not traverse trees in order to do its work—it only needs to traverse a single array, making the code simpler and improving cache locality.

Now that we know how the flow tree looks, let’s look at how Servo currently performs layout to figure out where boxes should go.

## The current algorithm

The current algorithm for parallel layout in Servo (i.e. what’s in the master branch before my changes) consists of three separate passes over the flow tree.

1. Intrinsic width calculation or `bubble_widths` (bottom-up). This computes the minimum width and preferred width for each flow. There are no sequential hazards here and this can always be computed in parallel. Note that this is information is not always needed during layout, and eventually we will probably want to implement optimizations to avoid computation of this information for subtrees in which it is not needed.

2. Actual width calculation or `assign_widths` (top-down). This computes the width of each flow, along with horizontal margins and padding values.

3. Height calculation or `assign_heights` (bottom-up). This computes the height of each flow. Along the way, line breaking is performed, and floats are laid out. We also compute vertical margins and padding, including margin collapse.

Within each flow, boxes are laid out sequentially—this is necessary because, in normal text, where to place the next line break depends on the lines before it. (However, we may be able to lay boxes out in parallel for `white-space: nowrap` or `white-space: pre`.)

For simple documents that consist of blocks and inline flows, Servo achieves excellent parallel wins, in line with Leo Meyerovich’s “Fast and Parallel Webpage Layout”, which implemented this simple model.

## The problem with floats

Unfortunately, in the real world there is one significant complication: floats. Here’s an example of a document involving floats that illustrates the problem:

``````<div style="float: right">
I shot the sheriff.
But I swear it was in self-defense.
I shot the sheriff.
And they say it is a capital offense.
</div>
<div>
I shot the sheriff
But I didn't shoot no deputy.
</div>
<div>
All around in my home town,
They're tryin' to track me down;
They say they want to bring me in guilty
For the killing of a deputy,
For the life of a deputy.
</div>
``````

Rendered with a little extra formatting added, it looks like this:

I shot the sheriff. But I swear it was in self-defense. I shot the sheriff. And they say it is a capital offense.
I shot the sheriff But I didn’t shoot no deputy.
All around in my home town, They’re tryin’ to track me down; They say they want to bring me in guilty For the killing of a deputy, For the life of a deputy.

The flow tree for this document might look like this:

Notice how the float in green (“I shot the sheriff…”) affects where the line breaks go in the two blocks to its left and below it (“I shot the sheriff…” and “All around…”). Line breaking is performed during the height assignment phase in Servo, because where the line breaks go determines the height of each flow.

This has important implications for the parallel algorithm described above. We don’t know how tall the float is until we’ve laid it out, and its height determines where to place the line breaks in the blocks next to it, so we have to lay out the float before laying out the blocks next to it. This means that we have to lay out the float before laying out any blocks that it’s adjacent to. But, more subtly, floats prevent us from laying out all the blocks that they impact in parallel as well. The reason is that we don’t know how many floats “stick out” of a block until we know its height, and in order to perform line breaking for a block we have to know how many floats “stuck out” of all the blocks before it. Consider the difference between the preceding document and this one:

I shot the sheriff. But I swear it was in self-defense. I shot the sheriff. And they say it is a capital offense.
I shot the sheriff But I didn’t shoot no deputy. I shot the sheriff But I didn’t shoot no deputy.
All around in my home town, They’re tryin’ to track me down; They say they want to bring me in guilty For the killing of a deputy, For the life of a deputy.

The only difference between the first document and the this one is that the first unfloated block (“I shot the sheriff…”) is taller. But this impacts the height of the second block (“All around…”), by affecting where the lines get broken. So the key thing to note here is that, in general, floats force us to sequentialize the processing of the blocks next to them.

The way this was implemented in Servo before my pull requests is that any floats in the document caused all unfloated blocks to be laid out sequentially. (The floats themselves could still be laid out in parallel, but all other blocks in the page were laid out in order.) Unsurprisingly, this caused our parallel gains to evaporate on most real-world Web pages. The vast majority of modern Web pages do use floats in some capacity, as they’re one of the most popular ways to create typical layouts. So losing our parallel gains is quite unfortunate.

Can we do better? It turns out we can.

## Clearing floats

As most designers know, the `float` property comes with a very useful companion property: `clear`. The `clear` property causes blocks to be shifted down in order to avoid impacting floats in one or both directions. For example, the document above with `clear: right` added to the second block looks like this:

I shot the sheriff. But I swear it was in self-defense. I shot the sheriff. And they say it is a capital offense.
I shot the sheriff But I didn’t shoot no deputy.
All around in my home town, They’re tryin’ to track me down; They say they want to bring me in guilty For the killing of a deputy, For the life of a deputy.

This property is widely used on the Web to control where floats can appear, and we can take advantage of this to gain back parallelism. If we know that no floats can impact a block due to the use of the `clear` property, then we can lay it out in parallel with the blocks before it. In the document above, the second block (“All around…”) can be laid out at the same time as the float and the first block.

My second pull request implements this optimization in this way: During flow construction, which is a bottom-up traversal, we keep track of a flag, `has_floated_descendants`, and set it on each flow if it or any of its descendants are `FloatFlow` instances. (Actually, there are two such flags—`has_left_floated_descendants` and `has_right_floated_descendants`—but for the purposes of this explanation I’ll just treat it as one flag.) During width computation, we iterate over our children and set two flags: `impacted_by_floats`. (Again, there are actually two such flags—`impacted_by_left_floats` and `impacted_by_right_floats`.) `impacted_by_floats` is true for a flow if and only if any of the following is true:

1. The parent flow is impacted by floats.

2. The flow has floated descendants.

3. Any previous sibling flow is impacted by floats, unless the appropriate `clear` property has been set between this flow and that sibling.

Only subtrees that have `impacted_by_floats` set to true are laid out sequentially, in order. The remaining subtrees can be laid out in parallel.

With this optimization implemented, documents above can be laid out in parallel as much as possible. It helps many real-world Web pages, as `clear` is a very commonly-used property.

At this point, two questions arise: “Can we do even more?” and “Is this algorithm enough to properly handle CSS?” As you might expect, the answer to the first is “yes”, and the answer to the second is “no”. To understand why, we need dive into the world of block formatting contexts.

## Block formatting contexts

The behavior of `overflow: hidden` is subtle. Consider this document, which is identical to the document we’ve been using but with `overflow: hidden` specified on the blocks adjacent to the float:

``````<div style="float: right">
I shot the sheriff.
But I swear it was in self-defense.
I shot the sheriff.
And they say it is a capital offense.
</div>
<div style="overflow: hidden">
I shot the sheriff
But I didn't shoot no deputy.
</div>
<div style="overflow: hidden">
All around in my home town,
They're tryin' to track me down;
They say they want to bring me in guilty
For the killing of a deputy,
For the life of a deputy.
</div>
</div>
``````

Rendered, it looks like this:

I shot the sheriff. But I swear it was in self-defense. I shot the sheriff. And they say it is a capital offense.
I shot the sheriff But I didn’t shoot no deputy.
All around in my home town, They’re tryin’ to track me down; They say they want to bring me in guilty For the killing of a deputy, For the life of a deputy.

Notice that, with `overflow: hidden` specified, the float makes the entire width of the block next to it smaller: all the lines have been wrapped, not just those that impact the float.

What’s going on here is that `overflow: hidden` establishes what’s known as a block formatting context in CSS jargon. In Servo, block formatting contexts make our layout algorithm significantly more complex, because they require width assignment and height assignment to be intertwined, and for height assignment to be interruptible. To see why this is, recall that the flow tree for this document looks like this:

Remember that Servo’s layout algorithm performs width calculation top-down, then height calculation bottom-up—this works under the assumption that widths never depend on heights. But with block formatting contexts adjacent to floats, this is no longer true: the width of a block formatting context depends on the height of floats next to it. This is because we don’t know whether a float, such as the green float above, is tall enough to impact a block formatting context, like those that the “I shot the sheriff…” and “All around…” above establish, until we lay out all blocks prior to the context and the float itself. And without knowing that, we cannot assign the width of the block formatting contexts.

To handle this case, my patches change Servo’s layout in several ways:

1. When we see a block formatting context during width calculation, we check the value of the `impacted_by_floats` flag. If it is on, then we don’t calculate widths for that flow or any of its descendants. Instead, we set a flag called `width_assignment_delayed`.

2. When we encounter a block formatting context child of a flow while calculating heights, if that block formatting context has the flag `width_assignment_delayed` set, we suspend the calculation of heights for that node, calculate the width of the block formatting context, and begin calculating widths and heights for that node and all of its descendants (in parallel, if possible).

3. After calculating the height of a block formatting context, we resume calculation of heights for its parent.

Let’s look at the precise series of steps that we’ll follow for the document above:

1. Calculate the width of the root flow.

2. Calculate the width of the float flow.

3. Don’t calculate the widths of the two block flows; instead, set the `width_assignment_delayed` flag.

4. Calculate the width of the float flow’s inline flow child. The main width assignment phase is now complete.

5. Begin height calculation. First, calculate the height of the float flow and its inline child.

6. Start calculating the height of the root flow by placing the float.

7. We see that we’ve hit a block formatting context that has its width assignment delayed, so we clear that flag, determine its width, and start width calculation for its descendants.

8. Calculate width for the block flow’s inline child. Now width calculation is done for that subtree.

9. Calculate the height of the block flow’s inline child, and the block flow itself. Now height calculation is done for this subtree.

10. Resume calculating the height of the root flow. We see that the next block formatting context has its width assignment delayed, so we assign its width and repeat steps 8 and 9.

11. We’ve now calculated the height of the root flow, so we’re done.

Now this particular page didn’t result in any parallel speedups. However, block formatting contexts can result in additional parallelism in some cases. For example, consider this document:

``````<div id=sidebar style="float: left">
<div>Coupons</div>
<div>Freebies</div>
<div>Great Deals</div>
</div>
<div id=main style="overflow: hidden">
<div>Deals in your area:</div>
<ul>
<li>Buy 1 lawyer, get 1 free</li>
<li>Free dental fillings</li>
</ul>
</div>
``````

Rendered, it looks like this:

Deals in your area:
• Buy 1 lawyer, get 1 free
• Free dental fillings

In this document, after we’ve laid out the sidebar, we can continue on and lay out the main part of the page entirely in parallel. We can lay out the block “Deals in your area” in parallel with the two list items “Buy 1…” and “Free dental fillings”. It turns out that this pattern is an extremely common way to create sidebars in real Web pages, so the ability to lay out the insides of block formatting contexts in parallel is a crucial optimization in practice. The upshot of all this is that block formatting contexts are a double-edged sword: they add an unfortunate dependency between heights and widths, but they enable us to recover parallelism even when blocks are impacted by floats, since we can lay out their interior in parallel.

## Conclusion

No doubt about it, CSS 2.1 is tricky—floats perhaps more than anything else. But in spite of their difficulties, we’re finding that there are unexpected places where we can take advantage of parallelism to make a faster engine. I’m cautiously optimistic that Servo’s approaching the right design here—not only to make new content faster but also to accelerate the old.

# Removing Garbage Collection From the Rust Language

I’ve been floating ways to simplify the memory management story in Rust around the core team lately. Memory management is a contentious topic, since we’ve worked hard to get to the current state of things, and with the push toward stability lately, there is a (quite reasonable!) resistance to any changes at this state. Still, I think the current memory management story in Rust is worth revisiting, as the current state of things may cause us problems down the line. Working with Dave Herman and Niko Matsakis, I’ve formulated a fairly concrete proposal at this point. The basic idea is to remove garbage collection from the core language and relegate it to the standard library, with a minimal set of language hooks in place to allow for flexible, pluggable automatic memory management.

This post is designed to explain the “why”, not the “how”—I’m leaving the concrete details of the proposed system to a future blog post or mailing list discussion. Rather, this explains the issue that I see with the current system. I think that the garbage collection story as it stands in Rust is not quite ideal, for three reasons: familiarity, simplicity, and flexibility. I’ll cover each in turn.

# Familiarity

One of the most common questions almost every Rust beginner asks is “when do I use managed pointers, and when do I use owned pointers?” Or, more simply, “what are all these `~` and `@` symbols everywhere?” Having worked on Rust for many years now, I’ve seen several reasons for the difficulty. Chief among them are:

1. The difference between the stack and the heap is a difficult concept to grasp for many programmers used to languages like Java that don’t make such a distinction. This is, unfortunately, a fundamental difficulty of working in a systems language. There’s little that can be done about this without taking control of allocation out of the hands of the programmer. Doing that, however, would compromise the goals of the language—in low-level, performance-critical programming, being able to precisely control whether allocations occur on the stack or on the heap is crucial.

2. The sigils make the code unfamiliar before the concepts are learned. Unlike the rest of the punctuation in Rust, `~` and `@` are not part of the standard repertoire of punctuation in C-like languages, and as a result the language can seem intimidating. One of the benefits of keywords is that they are self-documenting in a way that punctuation is not. This could be fixed by switching to keywords, which I prefer for this reason; however, syntactic beauty is in the eye of the beholder and so I won’t lose sleep over this not changing if the community prefers the current syntax.

3. There are two heaps, not just one, so beginners are confused as to which one to allocate into. This is a result of the “minimize sharing by default” philosophy of the concurrency system. However, the concurrency system has been part of the library rather than the language for several years now, so this seems somewhat out of place.

4. Programmers don’t know which to use, since some operations are available with `~` and some operations are available with `@`. Actually, we were confused on this point for a long time as well—it wasn’t clear whether `~` or `@` would become dominant. We debated for a long time which to present first, `~` or `@`. However, as the language and community evolved, and coding standards became more settled, a clear winner emerged: the owning pointer `~`. In practice, the rule has been that programmers should use `~` to allocate in all circumstances except when they have no way of knowing precisely when the object in question should be freed.

Point (4), to me, is the most critical. The rule that emerged—`~` over `@`—should not be surprising, in retrospect, as it is how systems software has been developed for decades. The key insight that was missing is that the owning pointer `~` is just the Rust equivalent of `malloc` and `free`. For many, probably most C programs, `malloc` and `free` are just fine (assuming you use them correctly, of course); each heap allocation is allocated in just one place and destroyed in just one place. Only when the lifetimes of objects become very complex do C and C++ programmers resort to manual reference counting to determine when an object should be freed (and many, perhaps most, C programs never get there). This is the role that has emerged for `@` in Rust programs: `@` is a replacement for manual reference counting in C programs. The `kobject` system in the Linux kernel, the `GObject` system in `glib`, and so forth, are the C equivalents of `@` in Rust.

The key point here is that these are very specialized use cases in C, and `@` has been relegated to a similarly marginal role in idiomatic Rust code. We thought for a while that many Rust programs would use `@` extensively and that it would ease the learning curve for those not used to destructor-based memory management and references. This has not, however, been the case in practice. In reality, since the libraries all use owning pointers (`~`), Rust programmers have to learn them quickly anyhow. And once Rust programmers learn how to use `~` effectively, they quickly find `@` relegated to a marginal role, if it’s used at all. `~` has so many advantages: deterministic allocation and destruction, interaction with the standard library, freedom from GC marking pauses, simpler semantics, appendability where vectors and strings are concerned, and sendability across tasks.

I think we’re better off teaching `~` as the go-to solution for most programs and relegating `@` to a specialized role. `@` has its use cases, to be sure; large, event-driven C++ programs use reference counting for a reason. But those use cases are specialized. Beginners should not be asking “should I use `~` or `@`?” The answer is almost always `~`.

In this regard relegating `@` to a library is just the natural conclusion of this approach. I feel that what beginners should be taught is that `~` is the way to allocate in Rust, and letting an `~` owning pointer go out of scope is the way you free in Rust. This is what we should be teaching in the language tutorial. As beginners become more comfortable with this and explore the libraries, they will learn about ways to achieve more dynamic memory management: tracing garbage collection with the `Gc` type, reference counting with the `Rc` type, and thread-safe reference counting with the `Arc` type. But by building only `~` into the language, we can reduce confusion by, in effect, making the language more opinionated.

# Simplicity

Although Rust didn’t start out that way, one of the most interesting applications of Rust has been very low-level programming, even down to the level of kernels. The interest in this application of Rust was something of a surprise to us, but in hindsight it makes perfect sense. Low-level control over memory management isn’t something that most applications software, especially on the server side, wants; most of that software has migrated over to languages like Java, Ruby, and JavaScript that trade control and performance for convenience by making memory management automatically, and dynamically, managed by the runtime. The remaining class of software, most of which is written in C and C++, is software that must manage memory manually in order to achieve some combination of performance, simplicity, and/or the ability to self-host. The prospect of using a new language for this class of software, which includes OS kernels, game engines, and browser engines among others, is what is fueling the growth of the nascent Rust community.

It might be possible to create a language that presents only a simple, fully automatic memory management system at first, and which surfaces the machinery of safe manual memory management* only when the programmer requires it for maximum performance. This would ease the learning curve, as programmers would be able to write many, perhaps most programs without ever learning how to manage memory at all. However, at this point I don’t think that this language exists yet, and in particular I don’t think Rust is that language. There are basically two problems here: (1) `~` owning pointers are everywhere in Rust, from the standard library to the built-in macros, making learning about them a necessity from the get-go; and (2) it is basically impossible to program Rust without at least a cursory understanding of references (a.k.a. `&` pointers) and their lifetime semantics; even `vec::each()` uses references.

Despite the fact that this might seem like a negative result, I actually think it’s quite positive for the project. It helps to define the project’s scope. I don’t think automatic memory management in Rust is ever going to be as convenient as memory management in, say, Ruby or Java, and that’s OK! The same level of control that adds cognitive overhead to memory management in Rust compared to other languages also makes Rust able to go where few other industry languages have. This space, I think, is where Rust can really shine.

In short, I think that Rust as a language should focus on roughly the same application domain as C++ does.†

Important to this effort is to have as small of a runtime as possible, just as C++ does, leaving higher-level abstractions to libraries. And, in fact, we are almost there already. The only runtime support that compiled Rust programs require are a small set of “language items”, which are magic functions or traits written in Rust that are known to the compiler. Looking at the set of language items, and disqualifying legacy items that will be removed soon such as `annihilate` and `log_type`, there are just a few categories:

1. Operator traits, like `Add` and `Sub`. These are analogous to `operator+`, `operator-`, and so forth in C++.

2. Memory primitives, like `str_eq`. These are somewhat legacy at this point and probably could be converted to LLVM intrinsics like `memcmp` without much trouble, especially after dynamically sized types happens. In any case, in most C++ compilers `memcmp` and friends are builtins.

3. Failure: `fail` and `fail_bounds_check`. This is analogous to `throw` in C++, although a Rust program that doesn’t want to use stack unwinding might want to use `abort` instead (which would be like `-fno-exceptions`) or do something more elaborate like the Linux kernel’s “oops” functionality.

4. Allocation primitives `malloc` and `free`. These have direct C++ equivalents: `operator new` and `operator delete`.

5. Garbage collection primitives.

Of these, the only language items that don’t have direct C++ equivalents are the garbage collection primitives. If those were eliminated, then Rust as a language would be every bit as freestanding as C++ is. In terms of suitability for kernel and embedded development, Rust would be on truly equal footing.

In summary: (1) all Rust programmers have to know how `~` and `&` work, despite the presence of `@`; (2) the only additional runtime primitives that Rust exposes and C++ doesn’t are those related to `@`.

# Flexibility

When it comes to memory management, there are obviously many different strategies: stack allocation, heap allocation with `malloc` and `free`, arena allocation, and garbage collection. What’s less well known is that even among garbage collection, there are many different approaches, each with advantages and disadvantages. There’s thread-local GC, thread-safe GC, incremental GC, generational GC, reference counting, thread-safe reference counting, deferred reference counting, ulterior reference counting—the list goes on and on. (For a good survey of automatic memory management techniques and how they relate to one another, check out “A Unified Theory of Garbage Collection” by Bacon et al.) A program that wants to maximize performance among some axis (latency versus throughput) and remain safe with objects with complex lifetimes may have reasons to choose one or the other.

Specifically, there’s the perennial debate between reference counting and tracing garbage collection. Many applications are better with tracing GC because of the increased throughput it provides and straightforward handling of cycles, and many applications are better with reference counting because of implementation simplicity, cache behavior, mostly-incremental operation, and promptness of deallocation. It makes sense for applications to be able to choose between the two. Even more important is the tradeoff between thread-safe and thread-local garbage collection: concurrent garbage collection is practically always more expensive than thread-local garbage collection, so it makes sense for programs to restrict concurrent GC (including atomic reference counting) to be used only when needed.

Integrating multiple tracing garbage collectors or cycle collectors into one system is a hard problem, and I don’t think Rust is going to really solve it. However, integrating reference counting into a garbage collected system is straightforward, as long as cycles are not created (and in Rust we can forbid the creation of such cycles through clever use of the type system). In practice this seems to work well: we typically use thread-local tracing GC for data with complex lifetimes within one task, and we use thread-safe reference counting for data that must be shared between tasks.

Equally important is the ability to integrate with external garbage collection systems (usually reference counted ones). This is a problem that is often overlooked, but is terribly important for client software such as mobile apps and browser engines. On Windows, apps must integrate with the reference-counted COM system in order to use DirectX and other APIs. On the Mac and on iOS, apps have to integrate with Objective-C and the closely-related Core Foundation, also reference-counted systems. On Linux, GNOME apps have to integrate with GObject, again a reference-counted system. On Android, apps have to integrate with the garbage-collected Dalvik subsystem via the JNI. All of this requires that the memory management system in the language be deeply flexible.

Because of this, I’m suspect of blessing any particular form of automatic memory management in the core language. In Rust, the `@` type is not only blessed with special syntax, but is eligible for borrowing and other operations in a way that user-defined types aren’t. Although Rust provides the facilities needed to build practically all the other forms of garbage collection, as well as those needed to integrate with external GC systems in a safe way, the resulting smart pointers feel second-class compared to `@`. A systems language designed to work in a diverse set of environments should have the flexibility to create memory management abstractions that feel first-class.

# Conclusion

For these three reasons—familiarity, simplicity, and flexibility—I’d like to propose removing `@` pointers from the language and replacing them with a small set of hooks allowing the same functionality to be implemented as a library and on user-defined types. We would ship tracing GC as part of the standard library and make it just as powerful and convenient as it is today (except for the `@` syntax). We’d gain a flexible set of abstractions, make the language easier to learn, and make Rust into a truly freestanding language environment.

* Note that the safe qualifier here disqualifies manually-built free lists in garbage-collected languages, as these manually-built free lists provide no protection against errors like double “frees”, leaks, and danging pointers. (They’re significantly worse than true manual memory management anyhow; the GC still has to trace through objects in arenas at mark time, copy the objects within out into the tenured generation when they survive a minor collection, write barrier the objects, and so forth.)

† Note that I don’t mean you shouldn’t write Web frameworks and Web sites in Rust: in fact, I think Rust would be a fantastic language for many classes of Web server software, especially that which must scale to the highest loads and squeeze every ounce of performance out of the servers on the racks.

# Safe Manual Memory Management

If there’s one feature of Rust that is probably the most unique among languages in industry, it’s safe manual memory management.

It’s easiest to explain safe manual memory management by explaining how it differs from the memory models of other languages. There are a few different models in common use in industry languages:

• Unsafe manual memory management—These languages provide very fine-grained control over memory allocation; heap memory can be explicitly allocated and deallocated. The most important examples here are C and C++. The well-known downside of this approach is that memory safety violations can only be detected at runtime with a memory safety checker such as Valgrind or Address Sanitizer. Memory safety violations that go unchecked can lead to crashes at best and exploitable security vulnerabilities at worst.

• Full garbage collection—The majority of modern languages expose a memory model that falls into this category—the space is very diverse, ranging from Java to Go to JavaScript to Ruby to Haskell. In general, such languages place all allocations into the heap instead of the stack, although escape analysis and value types may be used to reduce the number of heap allocations. Periodically, a garbage collector scans all pointers on the stack and in the heap, judges unreachable objects dead, and reclaims them. This approach has the advantage of memory safety at compile time—the language arranges for there to be no dangling pointers, wild pointers, and so forth. The downsides, however, are:

1. The garbage collector may run at an inconvenient time. This can be mitigated by explicit control over when the GC runs, although if the garbage collector must collect multiple threads’ heaps at the same time, this may be difficult to synchronize. This can also be mitigated by using manual memory pooling and free lists, although pooling has undesirable safety properties—much like unsafe manual memory management, there is no static guarantee that objects allocated from a pool are returned properly or that an object is not reachable when returned to the pool. Incremental and concurrent garbage collectors help here, but they are not free, as they typically require write and/or read barriers, reducing throughput.

2. When it runs, the garbage collector must mark all pointers to discover which ones are live, reducing throughput of the application. Essentially, the GC must discover at runtime what a C++ (say) programmer knows at compile time. Not much can typically be done about this cost in fully garbage-collected languages, short of falling back to unsafe manual memory management. Pools don’t help much here, because the GC must still trace the pointers into the pool. Even pointers into the stack generally must be traced.

• Garbage collection with value types and references—This category includes languages like C#. (I believe D falls into this category as well, although I may be mistaken.) These languages are essentially garbage-collected, but they include value types which are guaranteed to be stack-allocated if in local variables. Additionally, and most importantly, they include reference parameters (and sometimes reference locals), which allow stack-allocated values to be temporarily aliased when calling another function. Effective use of value types can reduce marking and sweeping time. In general, this system is an effective addition to a garbage-collected system, allowing a good measure of additional control without much cost in complexity and no cost in memory safety. It is not, however, typically sufficient to write programs without using the garbage collector at all; the system is too simple to statically encode anything other than the most basic memory management patterns.

Where does Rust fit in? Actually, it fits into a category all to itself among industry languages (although one shared by various research languages, like Cyclone). Rust offers safe manual memory management (although some have objected to the term “manual” here). It extends the system described above as “garbage collection with value types and references” in two important ways:

1. You can allocate memory that will not be traced by the garbage collector, and free it manually if you choose. This is the feature of Rust known as “unique pointers”. Rust will automatically free memory that is uniquely owned when its owning pointer goes out of scope. It’s also easy to write a function that acts exactly as `free` does, so you can precisely choose when your objects die. Unique pointers are not traced by the GC (unless they point to a type that transitively contains a garbage-collected pointer), so they are an effective way to cut down marking times.

2. You can return references and place references into data structures. Like other references, these references are not traced by the garbage collector. As long as the references follow a stack discipline, meaning that they point to memory that was allocated by one of the callers of the current function, the compiler allows them to be placed anywhere. This adds a great deal of expressiveness over the reference parameter approach, and it enables a large number of programs to be written without using the garbage collector at all.

In terms of safety and performance, safe manual memory management is having your cake and eating it too. You get memory safety like a garbage-collected language, but control like that of unsafe manual memory management. But this system has its downsides as well—most importantly, complexity of implementation and interface. Learning to use references and unique pointers poses a significant learning curve. But, once the system is learned, it’s remarkably flexible, with an attractive combination of performance and safety.

# Performance of Sequential Rust Programs

Although Rust is designed for parallel programs, it is important that the performance of single-threaded, sequential programs not suffer in its design. As far as Servo is concerned, sequential performance is still important in many domains that a Web browser engine must compete in.

Below are some selected single-threaded benchmarks from the Computer Language Benchmarks Game (formerly, and still informally, called the “shootout”). This is far from an ideal set. These benchmarks are showing their age quite heavily, they are too small and simplistic to extrapolate to real-world use cases, and many of them are too I/O-bound.

It is perfectly legal per the rules of the benchmarks game to use unsafe code (or calling libraries written in C, which is equivalent), and I believe it’s very difficult to precisely match C’s performance without resorting to unsafe code. (Practically speaking, one would need an extremely smart JIT, or a research language with a complex dependent type system.) As my colleague Niko pointed out, a more interesting benchmark would not allow any languages to use unsafe code and would exclude C and C++ from competing at all, except as a point of comparison—such a benchmark would be interesting to determine how much performance one has to trade for type safety in mainstream languages. But the shootout is what it is, and so the Rust versions of these benchmarks heavily use unsafe code. Over time, I hope to be able to reduce the amount of unsafe code present in the Rust versions of these benchmarks, but a couple of benchmarks will likely always remain unsafe.

Neither the C nor the Rust versions of these benchmarks use SIMD or threads. This is by design, as the goal of this test is to measure Rust’s sequential performance. Over time, as Rust gains SIMD support and the scheduler improves (both of which are active areas of development), the benchmarks will be updated to use multiple threads. But keep in mind that the C implementation tested against is not usually the top one on the shootout site; rather, I selected the fastest implementation that did not use SIMD or threads for comparison. As the Rust benchmarks are updated to use SIMD and threads, equivalent C versions will be used for comparison.

For all these reasons and more, it is important to not read too much into these benchmark results. It would be a mistake to conclude that “Rust is faster than C” because of the performance on the `k-nucleotide` benchmark. Likewise, it would be a mistake to conclude that “C is faster than Rust” because of the `fasta-redux` benchmark. The goal here is simply to demonstrate that sequential Rust can be written in a way that approaches competitive parity with equivalent C code.

Note that the benchmarks include `clang` because GCC 4.2 is a very old version. The purpose of this benchmark is not to benchmark C compilers, but rather to perform cross-implementation comparisons between two languages.

Enough disclaimers; on to the results:

These programs were tested on a 2.53 GHz Intel Core 2 Duo MacBook Pro with 4 GB of RAM, running Mac OS X 10.6 Snow Leopard. “GCC 4.2” is GCC 4.2.1, Apple build 5666; “clang 1.7” is Apple clang 1.7, based on LLVM 2.9svn; “clang 3.1” is LLVM 3.1, trunk 149587. GCC and clang were run with `-O2`, and Rust was run with `-O` (which is like `-O2`). Three runs were averaged together to produce each result. Results are normalized to GCC 4.2. Lower numbers are better.

As mentioned before, this is a selected set of benchmarks. The benchmarks that were not included are:

• `fasta` is omitted because it is similar to `fasta-redux`.

• `regexp-dna` is omitted because it consists of an uninteresting binding to PCRE.

• `binary-trees` is omitted because it is a garbage collection benchmark and the C version uses an arena, defeating the purpose (although I suspect a Rust version that did the same would do well).

• `chameneos-redux` and `threadring` are omitted because they are threading benchmarks.

You can see the changes to the Rust compiler that were made to optimize these tests, as well as the benchmark sources, on my branch of the compiler on GitHub. The goal will be to land these changes over the next few days.

# A Hard Case for Memory Safety

Quick quiz: In this C++ program, is the definition of `munge` guaranteed to be memory safe? (Assume that the definition of `increment_counter` uses only modern C++ idioms and doesn’t do anything like dereference an invalid pointer.)

``````#include <iostream>
#include <vector>

class foo {
public:
std::vector<int> indices;
int counter;

foo() : indices(), counter(0) {
indices.push_back(1);
indices.push_back(2);
indices.push_back(3);
}

void increment_counter();

int &get_first_index() {
assert(indices.size() > 0);
return indices[0];
}

void munge() {
int &first = get_first_index();
increment_counter();
std::cout << first << std::endl;
first = 20;
}
};

int main() {
foo foo;
foo.munge();
return 0;
}
``````

The answer: Even with this caveat, we can’t tell! It depends on the definition of `increment_counter`.

If `increment_counter` has this definition, the code is memory safe:

``````void foo::increment_counter() {
counter++;
}
``````

But if `increment_counter` has this definition, for example, then it isn’t:

``````void foo::increment_counter() {
indices.clear();
counter++;
}
``````

This definition would cause the `first` reference in `munge` to become a dangling reference, and the call to `std::cout` and subsequent assignment of `first` will have undefined behavior. If `first` were not an `int` but were instead an instance of a class, and `munge` attempted to perform a virtual method call on it, then this would constitute a critical security vulnerability.

The point here is that determining memory safety in C++ requires non-local reasoning. Any analysis that tries to determine safety of C++ code, whether performed by a machine or performed by a human auditor, has to analyze many functions all at once, rather than one function at a time, to determine whether the code is memory safe. As this example illustrates, sticking to modern C++ coding styles, even with bounds checks, is not enough to prevent this.

There are a few ways around this:

• For each function call, analyze the source to the called function to determine whether it’s memory safe in the context of the caller. This doesn’t always work, though: it’s hard or impossible when function pointers or virtual methods are involved (which function ends up being called?), and it’s hard with separately compiled code (what if the called function is in a DLL that you don’t have source for?)

• Change the type of `indices` to `std::vector<std::shared_ptr<int>>`; i.e. use reference counting to keep the pointer alive. This has a runtime cost.

• Inline the body of `increment_counter`, so that the memory safety of `munge` is immediately clear.

• Make `increment_counter` a class method (or just a function) instead of an instance method, and have it take `counter` by reference. The idea here is to prevent the possibility that `increment_counter` could mess with `indices` in any way by shutting off its access to it.

What does this have to do with Rust? In fact, this error corresponds to a borrow check error that Brian Anderson hit when working on the scheduler. In Rust, the corresponding code looks something like this:

``````impl Foo {
fn get_first_index(&'a mut self) -> &'a mut int {
assert!(self.indices.len() > 0);
return &mut indices[0];
}

fn munge(&mut self) {
let first = self.get_first_index();
self.increment_counter(); // ERROR
println(first.to_str());
*first = 20;
}
}
``````

This causes a borrow check error because the `first` reference conflicts with the call to `increment_counter`. The reason the borrow check complains is that the borrow check only checks one function at a time, and it could tell (quite rightly!) that the call to `increment_counter` might be unsafe. The solution is to make `increment_counter` a static method that only has access to counter; i.e. to rewrite the `self.increment_counter()` line as follows:

``````Foo::increment_counter(&mut self.counter);
``````

Since the borrow check now sees that `increment_counter` couldn’t possibly destroy the `first` reference, it now accepts the code.

Fortunately, such borrow check errors are not as common anymore, with the new simpler borrow check rules. But it’s interesting to see that, when they do come up, they’re warning about real problems that affect any language with manual memory management. In the C++ code above, most programmers probably wouldn’t notice the fact that the memory safety of `munge` depends on the definition of `increment_counter`. The challenge in Rust, then, will be to make the error messages comprehensible enough to allow programmers to understand what the borrow checker is warning about and how to fix any problems that arise.

# An Overview of Memory Management in Rust

One of the key features of Rust that sets it apart from other new languages is that its memory management is manual—the programmer has explicit control over where and how memory is allocated and deallocated. In this regard, Rust is much more like C++ than like Java, Python, or Go, to name a few. This is an important design decision that makes Rust able to function in performance-critical domains that safe languages previously haven’t been able to—top-of-the line games and Web browsers, for example—but it adds a nontrivial learning curve to the language.

For programmers familiar with modern C++, this learning curve is much shallower, but for those who are used to other languages, Rust’s smart pointers can seem confusing and complex. In keeping with the systems-oriented nature of Rust, this post is designed to explain how Rust’s memory management works and how to effectively use it.

## Smart pointers

In many languages with manual memory management, like C, you directly allocate and free memory with calls to special functions. For example:

``````void f() {
int *x = malloc(sizeof(int));  /* allocates space for an int on the heap */
*x = 1024;                     /* initialize the value */
printf("%d\n", *x);            /* print it on the screen */
free(x);                       /* free the memory, returning it to the heap */
}
``````

C gives you a great deal of control over where memory is allocated and deallocated. Memory is allocated with a special function `malloc`, and it is freed with a special function `free`. After the call to `free`, it is an error to attempt to use `x`, as it is a dangling pointer. A dangling pointer points to invalid memory, but the C compiler makes no attempt to prevent you from using it; it’s your responsibility to avoid touching it after freeing the memory it points to.

Rust gives you the same level of control over memory, but it works somewhat differently. Let’s see how the same piece of code looks in Rust:

``````fn f() {
let x: ~int = ~1024;          // allocate space and initialize an int
// on the heap
println(fmt!("%d", *x));      // print it on the screen
} // <-- the memory that x pointed at is automatically freed here
``````

There are three main differences to notice here:

1. In C, you allocate memory first (with the call to `malloc`), and then you initialize it (in the example above, with the `*x = 1024` assignment). Rust fuses the two operations together into the `~` allocation operator, so that you don’t accidentally forget to initialize memory before you use it.

2. In C, the call to `malloc` returns a plain pointer, `int *`. In Rust, the `~` operator, which allocates memory, returns a special smart pointer to an int. Because this type of smart pointer is so common, its name is just a single character, `~`—thus the type of this smart pointer is written as `~int`.

3. You don’t call `free` manually in Rust. Rather, the compiler automatically frees the memory for you when a smart pointer goes out of scope.

As it turns out, points (2) and (3) are very intertwined, and together they form the cornerstone of Rust’s memory management system. Here’s the idea: Unlike C, allocation functions in Rust don’t return a raw pointer to the space they allocate. Instead, they return a smart pointer to the space. A smart pointer is a special kind of value that controls when the object is freed. Like a raw pointer in C, you can access the data that a smart pointer refers to with `*`. But unlike a raw pointer, when the smart pointer to an allocation goes out of scope, that allocation is automatically freed. In this way, smart pointers are “smart” because they not only track where an object is but also track how to clean it up.

Unlike C, in Rust you never call `free` directly. Instead, you rely on smart pointers to free all allocations. The most basic reason for this is that smart pointers make it harder to forget to free memory. In C, if you forget to call `free`, you have a memory leak, which means that the memory will not be cleaned up until the program exits. However, in Rust, the compiler will automatically insert the code necessary to free the memory for you when the smart pointer pointing to your data goes out of scope.

Rust has multiple types of smart pointers, corresponding to the different strategies that programs use to reclaim memory. Some smart pointers, namely `~` and `@` (which we will cover shortly), have special names known to the compiler, because they’re so common. (Having to type long names like `unique_ptr` all the time would be a burden.) Other smart pointers, such as `ARC` (which allows you to share read-only data between threads), are in the standard library and are not built into the compiler.

The pointer covered above is known as the unique smart pointer `~`. We call it “unique” because there is always only one smart pointer pointing to each allocation. The other type of smart pointer built into the language is the managed smart pointer, which allows multiple smart pointers to point to the same allocation and uses garbage collection to determine when to free it. Here’s an example of a managed smart pointer in use:

``````fn foo() {
let x: @int = @1024;     // allocate space and initialize an int
// on the heap
bar(x);                  // pass it to `bar`
println(fmt!("%d", *x)); // print it on the screen
} // <-- the memory can be freed here

fn bar(x: @int) {
let y: @int = x;         // make a new smart pointer to `x`
} // <-- despite `y` going out of scope, the memory is *not* freed here
``````

The key difference between `~` and `@` is that `@` allows multiple smart pointers to point to the same data, and the data is cleaned up only after the last such smart pointer disappears. Notice that, in this example, the memory pointed at by `y` (which is the same as the memory pointed at by `x`) is not freed at the end of the function `bar`, because `x` is still in use and also points to the same data. The fact that `@` allows multiple smart pointers to the same data, as well as the fact that the allocation is freed only when all of those pointers go out of scope, make managed smart pointers very useful. However, they can be less efficient than unique smart pointers, as they require garbage collection at runtime.

## References

Recall that a smart pointer is a pointer that automatically frees the memory that it points to when it goes out of scope. Perhaps surprisingly, it often turns out that it’s useful to have a kind of pointer that doesn’t free the memory that it points to. Consider this code:

``````struct Dog {
name: ~str    // a unique smart pointer to a string
}

fn dogshow() {
let dogs: [~Dog * 3] = [        // create an array of Dog objects
~Dog { name: ~"Spot"   },   // use unique smart pointers to
// allocate
~Dog { name: ~"Fido"   },
~Dog { name: ~"Snoopy" },
];
for dogs.each |dog| {
println(fmt!("Say hello to %s", dog.name));
}
} // <-- all dogs destroyed here
``````

Suppose that we wanted to single Fido out as the winner of the dog show. We might try this code:

``````fn dogshow() {
let dogs: [~Dog * 3] = [
~Dog { name: ~"Spot"   },
~Dog { name: ~"Fido"   },
~Dog { name: ~"Snoopy" },
];
let winner: ~Dog = dogs[1];
for dogs.each |dog| {
println(fmt!("Say hello to %s", dog.name));
}
println(fmt!("And the winner is: %s!", winner.name));
} // <-- all dogs, and `winner`, destroyed here
``````

But this code won’t compile. The reason is that, if it did, Fido would be destroyed twice. Remember that unique smart pointers free the allocations they point to when they go out of scope. The code attempts to make a second smart pointer to Fido at the time it executes the line `let winner: ~Dog = dogs[1];` If the compiler allowed this to proceed, then at the end of the block, the program would attempt to free Fido twice—once when it frees the original smart pointer embedded within the `dogs` array, and once when it frees `winner`.

What we really want is for `winner` to be a pointer that doesn’t free the allocation that it points to. In fact, what we want isn’t a smart pointer at all; we want a reference. Here’s the code rewritten to use one:

``````fn dogshow() {
let dogs: [~Dog * 3] = [
~Dog { name: ~"Spot"   },
~Dog { name: ~"Fido"   },
~Dog { name: ~"Snoopy" },
];
let winner: &Dog = dogs[1];  // note use of `&` to form a reference
for dogs.each |dog| {
println(fmt!("Say hello to %s", dog.name));
}
println(fmt!("And the winner is: %s!", winner.name));
} // <-- all dogs destroyed here
``````

This code will now compile. Here, we convert `winner` into a reference, notated in Rust with `&`. You can take a reference to any smart pointer type in Rust by simply assigning it to a value with a reference type, as the `let winner: &Dog = dogs[1]` line does.

References (also known as borrowed pointers) don’t cause the compiler to free the data they refer to. However, they don’t prevent the compiler from freeing anything either. They have no effect on what smart pointers will do; regardless of how many references you have, a unique smart pointer will always free the data that it points to when it goes out of scope, and a managed smart pointer will always free its data when all managed smart pointers to the same allocation go out of scope.

This is important to keep in mind. Code like this will not compile:

``````fn foo() {
let y: &int;
{
let x: ~int = ~2048;
y = x;
} // <-- x freed here
println(fmt!("Your lucky number is: %d", *y)); // ERROR: accesses freed data!
}
``````

In languages like C++, code like this could cause faults from attempting to access invalid memory. As it turns out, however, this piece of code won’t compile—the Rust compiler can and does prevent you from writing code like this at compile time. Essentially, the Rust compiler tracks where each reference came from and reports an error if a reference persists longer than the allocation it points into. This means that, generally speaking, you can use references all you like and have the confidence that they won’t result in hard-to-diagnose errors at runtime.

## Conclusion

These ideas—smart pointers and references—form the basis of memory management in Rust. If you’re a C++ programmer, most of this will (hopefully!) simply have been an exercise in learning different syntax. For other programmers, these concepts are likely more foreign. But using these tools, you can write code with fine-grained control over memory, with improved safety over languages like C.

# Which Pointer Should I Use?

Deciding whether to use a managed `@` pointer or an owned `~` pointer to allocate memory is one of the most frequent sources of confusion for newcomers to Rust. There are two main angles to consider when deciding whether to use an `@` pointer or a `~` pointer in Rust: memory management and concurrency. I’ll cover each in turn.

Note that this tutorial only presents the basic system. There are many extensions to the system—borrowing, library smart pointers, cells, and so on—that allow the various limitations described here to be overcome. But this is the core system that needs to be understood first.

# Memory management

One of the most important features of Rust from a systems programming perspective is that garbage collection is optional. What this means is that there are safe ways to allocate memory that do not require bookkeeping at runtime to determine when it is safe to free that memory.

What makes it possible for Rust programs to avoid runtime garbage collection is the notion of ownership of a particular allocation. Under this scheme, when the single owner of an allocation goes out of scope, the allocation is freed. Owned pointers in Rust are notated with `~`. Here’s an example of their use:

``````struct Point {
x: int,
y: int,
}

fn f() {
let x: ~Point = ~Point { x: 10, y: 20 };  // allocate a Point on the heap
}  // <-- x is freed here
``````

Here, `x` is the single owner of the `Point` on the heap. Because there is only a single owner, Rust can throw away the memory pointed to by `x` at the end of the function.

The compiler enforces that there is only a single owner. Assigning the pointer to a new location transfers ownership (known as a move for short). Consider this program:

``````fn g() {
let a: ~Point = ~Point { x: 10, y: 20 }; // allocate a Point on the heap
let b = a;                               // now b is the owner
println(b.x.to_str());                   // OK
println(a.x.to_str());                   // ERROR: use of moved value
} // <-- b is freed here
``````

When compiling this program, the compiler produces the error “use of moved value”. This is because assigning an owned pointer transfers ownership, making the old variable dead. Because the compiler knows precisely which variables are dead at all times, it can avoid having to determine at runtime whether to free the memory that a variable points to, and it can prevent you from accidentally accessing dead variables. However, this comes at a price: you are limited to using a single variable to refer to an `~` allocation.

By contrast, `@` pointers do not have this limitation. We think of memory that is allocated with `@` as owned by the garbage collector. You can make as many pointers to `@` memory as you would like. There is a cost in runtime performance, but this cost comes with a great deal of flexibility. For example, the code above will compile with an `@` pointer:

``````fn h() {
let a: @Point = @Point { x: 10, y: 20 }; // allocate a Point on the heap
let b = a;                               // a and b share a reference
println(b.x.to_str());                   // OK
println(a.x.to_str());                   // also OK
}
``````

So, in short: `@` pointers require garbage collection, but allow multiple pointers to the same location. `~` pointers avoid this GC overhead, but they don’t allow multiple pointers to the same location.

# Concurrency

Another equally important aspect to the distinction between `@` and `~` is that it ensures that concurrent Rust tasks don’t race on shared memory. To illustrate this, here’s an example of broken code that doesn’t compile:

``````struct Counter {
count: int
}

fn f() {
// Allocate a mutable counter.
let counter: @mut Counter = @mut Counter { count: 0 };
do spawn {               // spawn a new thread
// Increment the counter.
counter.count += 1;  // ERROR: attempt to capture an `@` value
}
println(counter.count.to_str()); // print the value
}
``````

This code contains a classic race—if this code compiled, then the value printed would be either 0 or 1, depending on whether the `counter.count += 1` line executed first or the `println` executed first. The key here is that two threads—the spawned thread and the main thread—are both simultaneously attempting to access the `counter` object. To prevent these errors, Rust prevents multiple threads from accessing the same memory at the same time.

Recall from the previous section that there can be any number of pointers to memory allocated with `@`. But there can be only one pointer to memory allocated with `~`. This suggests a way to forbid multiple threads from accessing the same data: restrict the types of pointers that can be sent between threads to `~` pointers. And this is exactly what Rust does.

For any piece of `~`-allocated memory, there is only one pointer to it, and that pointer is owned by exactly one thread. So there can be no races, since any other threads simply don’t have access to that memory. Let’s rewrite our example above using `~` to illustrate this:

``````fn g() {
// Allocate a mutable counter.
let mut counter: ~Counter = ~Counter { count: 0 };
do spawn {               // spawn a new thread
counter.count += 1;  // increment the counter
}
println(counter.count.to_str()); // ERROR: use of moved value
}
``````

What’s going on here is that, by referring to `counter` inside the `spawn` block, the new thread takes ownership of the `counter` variable, and the `counter` variable becomes dead everywhere outside that block. Essentially, the main thread loses access to `counter` by giving it away to the thread it spawns. So the attempt to print the value on the screen from the main thread will fail. By contrast, this code will work:

``````fn h() {
// Allocate a mutable counter.
let mut counter: ~Counter = ~Counter { count: 0 };
do spawn {               // spawn a new thread
counter.count += 1;  // increment the counter
println(counter.count.to_str()); // OK: `counter` is owned by this thread
}
}
``````

Notice that the data race is gone: this code always prints `1`, because the printing happens in the thread that owns the `Counter` object.

The resulting rule is pretty simple. In short: `@` pointers may not be sent from thread to thread. `~` pointers may be sent, and are owned by exactly one thread at a time. Therefore, if you need data to be sent, do not allocate it with `@`.

# Conclusion (TL;DR)

So the distinction between `@` and `~` is often confusing to newcomers, but it’s really quite simple. There are two main rules to remember:

1. `~` only supports one pointer to each allocation, so if you need multiple pointers to the same data, use `@`. But `@` requires garbage collection overhead, so if this is important to your application, use `~` wherever possible.

2. Don’t use `@` pointers if you need to send data between multiple threads. Use `~` instead.

Finally, I should note again that, if these rules are too restrictive for you (for example, if you need multiple pointers but can’t tolerate garbage collection pauses), there are more advanced solutions: borrowing, safe smart pointers, and unsafe code. But this simple system works well for many programs and forms the foundation of Rust’s approach to memory management.

# The New Borrow Check in a Nutshell

If you’ve used Rust for any period of time, you’ve probably been bitten by the mysterious borrow check—the compiler pass responsible for preventing iterator invalidation, as well as a few other dangling pointer scenarios. The current iteration of the borrow check enforces a fairly complex set of rules. Because the rules were hard to understand and ruled out too many valid programs, we were never really satisfied with the analysis; without a simple set of rules to follow, programmers will get frustrated and give up. To remedy this, Niko has proposed a revamp of the borrow checker known as “Imagine Never Hearing the Phrase ‘Aliasable, Mutable’ Again”. This has mostly been implemented in a pull request now, so I’d like to take the opportunity to explain the new rules. I’m particularly excited about this change because now the entire set of borrow check rules are simple enough to boil down to one principle.

Here’s the rule that the new borrow check is in charge of enforcing: Whenever you take a pointer to an object, you may not modify that object as long as that pointer exists, except through that pointer.

(Strictly speaking, this is not all the new borrow check enforces, but the other errors the pass can produce are generally straightforward and simple dangling pointer errors. Also, I’m omitting the rules related to `&const`, as this rarely-used type of pointer is likely to be removed.)

For unique pointers (`~`) and borrowed pointers (`&`), this rule is enforced at compile time, without any runtime overhead. Here’s an example:

``````let mut the_magic_word = Some(~"zap");
match the_magic_word {
None => {}
Some(ref word) {
the_magic_word = None; // ERROR
io::println(*word);
}
}
``````

Here, the line marked `ERROR` produces the error “assigning to mutable local variable prohibited due to outstanding loan”. This happens because we violated the rule above—the line `the_magic_word = None` mutates the value `the_magic_word` while there exists a pointer to it (`word`).

Another example:

``````struct Foo {
array: ~[int]
}

impl Foo {
fn bar(&mut self) {
for self.array.each |i| {
self.array = ~[];  // ERROR
io::println(i.to_str());
}
}
}
``````

Again, the error is “assigning to mutable field prohibited due to outstanding loan”. As before, it’s traceable to a violation of the mutation rule: the line `self.array = ~[]` mutates the `self.array` field while a pointer (`i`) into it exists.

This example is interesting for a couple of reasons. First of all, it illustrates the way the Rust compiler can catch iterator invalidation issues without runtime overhead in many cases: here the compiler is able to detect that the `i` iterator, which has type `&int`, was invalidated, and rejects the program instead of permitting undefined behavior at runtime. Second, this example illustrates something not possible under the current borrow check regime that the new borrow check allows: namely, taking an immutable pointer to a field accessible through a `&mut` pointer. (An immutable pointer is needed to call the `each` method to prevent iterator invalidation.) More than any other, this restriction probably led to the greatest number of borrow check errors in practice, since it prevented iterating over any collections reachable from `&mut` pointers.

Now all of this works fine for `&` and `~` pointers, but what about managed boxes (`@`)? It turns out that immutable `@` boxes are easy to deal with; since they can’t be mutated at all, the borrow checker doesn’t have to do anything to enforce the no-mutation rule. However, for `@mut` boxes, the situation is more complicated. For `@mut` boxes, the new borrow checker inserts runtime checks to enforce the pointer rules. Attempting to mutate an `@mut` box while a pointer to its contents exists results in task failure at runtime, unless the mutation is done through that pointer.

Interestingly, this is similar to the way various debug or safe STL implementations (for example, Microsoft’s) guard against iterator invalidation. The differences are: (1) in Rust, the checks are automatically inserted by the compiler instead of built into each collection by hand; and (2) the checks are only needed for garbage collected data, as the compiler can perform the checks at compile time for other types of data.

There is one gotcha here, however. As implemented, if any pointer exists to any part of an `@mut` box, then the entire box cannot be mutated while that pointer exists. This means that this example will fail:

``````struct Dungeon {
monsters: ~[Monster],
total_gold: int
}

impl Dungeon {
fn count_gold(@mut self) { // note `@mut self`, not `&mut self`
self.total_gold = 0;
for self.monsters.each |monster| { // pointer created here
self.total_gold += monster.gold;
}
}
}
``````

Note that the iterator variable `monster` has type `&Monster`. This is a pointer to the inside of `Dungeon`, so the assignment to `self.total_gold` violates the mutation rule. Unfortunately, the compiler does not currently catch this, so the program will fail at runtime.

There are a couple of workarounds. The simplest way is to change `@mut self` to `&mut self`. Since there is no need to give out the `@mut` pointer for this operation, this is safe. Roughly speaking, the compile-time checks operate on a per-field basis, while the runtime checks operate on a per-box basis. So this change makes the operation succeed. Another possibility is to make `total_gold` into a local variable and assign to the field after the `for` loop.

Despite the fact that this error is easy to fix, I’m concerned about the fact that the compiler won’t catch this kind of thing at compile time. So I think we should introduce a set of warnings that looks for common violations of this rule. It’s impossible to make the warnings catch all failures—that’s the reason why the check is done at runtime in the first place. (In general, trying to make the compiler reason about `@` boxes is hard, since the compiler has no idea how many references to them exist.) But I suspect that we could make the analysis good enough to catch the majority of these errors in practice.

In any case, the take-away from all of this is that the borrow checker should be much easier and more transparent with this change. There’s essentially just one straightforward rule to remember.