# 24 days of Rust - rayon

Let me start this article with a confession: *I suck at writing parallel code.*
I have only a vague idea about mutexes, locks and other low-level primitives
for sharing data between threads. And since I work mostly with Python,
I tend to use `multiprocessing` (if I bother to parallelize at all) rather
than threads.

However, one of the selling points of Rust is its approach to concurrent/parallel programming. Mozilla recently released a video encouraging people to write parallel code that is safe and approachable. Then I heard about `rayon` - a data parallelism library that is itself very approachable and lightweight.

If you have an iterator chain that computes something on every
element, you can stick some variation of `par_iter()` into the chain and now
your code is parallel. So simple!

Niko Matsakis, the author of `rayon`, has a blog post
describing the motivation behind this crate and some implementation details.
Worth checking out if you're curious how `rayon` works under the hood.

## How much is π?

Let's assume we don't know the value of π. How would we calculate it? One
approach is to measure the area of a unit circle (a circle with radius r = 1).
Since the area of a circle is πr² and r = 1, the area of a unit circle
equals π.

If we plot the circle as a graph of a function, the area under the graph can be calculated as a definite integral. Let's represent our circle function in Rust:

```rust
fn semicircle(x: f64, r: f64) -> f64 {
    (r * r - x * x).sqrt()
}
```

Note: this is actually a semicircle, because a full circle doesn't represent a
function (it maps an `x` value to two `y` values).
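As a quick sanity check (my own toy example, not from the original post): at x = 0 the semicircle reaches its radius, at x = r it touches the axis, and outside the circle the square root has no real value.

```rust
fn semicircle(x: f64, r: f64) -> f64 {
    (r * r - x * x).sqrt()
}

fn main() {
    assert_eq!(semicircle(0.0, 1.0), 1.0);
    assert_eq!(semicircle(1.0, 1.0), 0.0);
    // Outside the circle the expression under the root is negative,
    // so sqrt() returns NaN - the function is only defined for |x| <= r.
    assert!(semicircle(2.0, 1.0).is_nan());
}
```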

There are various techniques for numerical integration. One of those is known as the trapezoidal rule. Here's a naive Rust implementation:

```rust
fn integrate<F>(f: F, points: usize) -> f64
    where F: Fn(f64) -> f64
{
    let delta = 1f64 / points as f64;
    (0..points)
        .map(|i| {
            let a = i as f64 * delta;
            delta * (f(a) + f(a + delta)) / 2f64
        })
        .sum()
}
```

We cut the area under the function into narrow strips of width `delta`
and sum up their areas. Don't be afraid if you don't follow all the math here.
The important thing is: every strip is independent of the others. This is an
embarrassingly parallel
task! Let's see how `rayon` can help us.
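Before parallelizing, it's worth convincing ourselves the trapezoidal sum is right. A quick toy check (my own, not from the post): the trapezoidal rule is exact for straight lines, so integrating f(x) = x over [0, 1] should give 0.5 regardless of the number of strips.

```rust
// The same trapezoidal sum as above, reproduced so the snippet
// is self-contained.
fn integrate<F>(f: F, points: usize) -> f64
    where F: Fn(f64) -> f64
{
    let delta = 1f64 / points as f64;
    (0..points)
        .map(|i| {
            let a = i as f64 * delta;
            delta * (f(a) + f(a + delta)) / 2f64
        })
        .sum()
}

fn main() {
    // Trapezoids fit a straight line exactly, so the strip count
    // shouldn't matter (up to floating-point rounding).
    assert!((integrate(|x| x, 7) - 0.5).abs() < 1e-12);
    assert!((integrate(|x| x, 1_000) - 0.5).abs() < 1e-12);
    println!("ok");
}
```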

```rust
extern crate rayon;
use rayon::prelude::*;

fn parallel_integrate<F>(f: F, points: usize) -> f64
    where F: Fn(f64) -> f64 + Sync
{
    let delta = 1f64 / points as f64;
    (0..points)
        .into_par_iter()
        .map(|i| {
            let a = i as f64 * delta;
            delta * (f(a) + f(a + delta)) / 2f64
        })
        .sum()
}
```

See any changes? Apart from the imports, the *only* things that changed are
the trait bounds on the function (we need it to be `Sync`) and the range being
converted into a parallel iterator. That's it!
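One detail worth spelling out: `integrate` only sums strips over [0, 1], which covers a quarter of the unit circle, hence the factor of 4 in the calls below:

```latex
\int_0^1 \sqrt{1 - x^2}\, dx = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi = 4 \int_0^1 \sqrt{1 - x^2}\, dx
```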

```rust
println!("sequential: {}", 4f64 * integrate(semicircle, 10_000_000));
println!("parallel:   {}", 4f64 * parallel_integrate(semicircle, 10_000_000));
```

```text
sequential: 3.1415926535533574
parallel:   3.141592653552613
```

We're pretty close to the value of π!

## Monte Carlo method

Imagine you're throwing darts at a square board which has a circle drawn on it. If you're as poor at darts as I am and throw at random, some of your hits will land inside the circle, some outside of it. The Monte Carlo method for calculating π works just like that: pick a number of points at random and count those within the circle. The ratio of this count to the total number of points approximates the ratio of the areas (which is π/4).
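To make the idea concrete without pulling in any crates, here is a toy sequential sketch (my own illustration; a small hand-rolled SplitMix64 generator stands in for the `rand` crate used below, which also makes the run deterministic):

```rust
// Toy sequential Monte Carlo estimate of pi. The SplitMix64
// generator is only here so the example needs no external crates;
// the post itself uses `rand`.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_f64(&mut self) -> f64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^= z >> 31;
        // Keep the top 53 bits to get a uniform f64 in [0, 1).
        (z >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn main() {
    let mut rng = SplitMix64(42);
    let points = 1_000_000;
    // Sample points in the square [-1, 1] x [-1, 1] and count the
    // hits that fall inside the unit circle.
    let within_circle = (0..points)
        .filter(|_| {
            let x = rng.next_f64() * 2.0 - 1.0;
            let y = rng.next_f64() * 2.0 - 1.0;
            x * x + y * y <= 1.0
        })
        .count();
    let pi = 4.0 * within_circle as f64 / points as f64;
    println!("estimate: {}", pi);
    assert!((pi - std::f64::consts::PI).abs() < 0.05);
}
```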

**Note**: this is a worse fit for parallelization, because `rand::random()`
uses a thread-local random number generator that reseeds itself from an
OS-provided entropy source.

```rust
extern crate rand;

fn monte_carlo_pi(points: usize) -> f64 {
    let within_circle = (0..points)
        .filter_map(|_| {
            let x = rand::random::<f64>() * 2f64 - 1f64;
            let y = rand::random::<f64>() * 2f64 - 1f64;
            if x * x + y * y <= 1f64 { Some(1) } else { None }
        })
        .count();
    4f64 * within_circle as f64 / points as f64
}

fn parallel_monte_carlo_pi(points: usize) -> f64 {
    let within_circle = (0..points)
        .into_par_iter()
        .filter_map(|_| {
            let x = rand::random::<f64>() * 2f64 - 1f64;
            let y = rand::random::<f64>() * 2f64 - 1f64;
            if x * x + y * y <= 1f64 { Some(1) } else { None }
        })
        .count();
    4f64 * within_circle as f64 / points as f64
}

println!("sequential MC: {}", monte_carlo_pi(1_000_000));
println!("parallel MC:   {}", parallel_monte_carlo_pi(1_000_000));
```

Again, our parallel implementation differs only in the call to `into_par_iter()`.

```text
sequential MC: 3.138308
parallel MC:   3.142376
```

The results will be different in every run, but we can see it sort of looks like π.

## Benchmarks

*Twelve o'clock and all is well*, but how do we know if the parallel versions
really perform better?
Benchmarks!

The following code works only on nightly Rust, due to the `test` feature gate.

```rust
#![feature(test)]

extern crate test;

#[cfg(test)]
mod tests {
    use super::{semicircle, integrate, parallel_integrate,
                monte_carlo_pi, parallel_monte_carlo_pi};
    use test::Bencher;

    #[bench]
    fn bench_sequential_integrate(b: &mut Bencher) {
        b.iter(|| integrate(semicircle, 100_000))
    }

    #[bench]
    fn bench_parallel_integrate(b: &mut Bencher) {
        b.iter(|| parallel_integrate(semicircle, 100_000))
    }

    #[bench]
    fn bench_sequential_monte_carlo(b: &mut Bencher) {
        b.iter(|| monte_carlo_pi(100_000))
    }

    #[bench]
    fn bench_parallel_monte_carlo(b: &mut Bencher) {
        b.iter(|| parallel_monte_carlo_pi(100_000))
    }
}
```

```text
$ cargo bench
    Finished release [optimized] target(s) in 0.0 secs
     Running target\release\rayon-bench-06c3c3f0de9581df.exe

running 4 tests
test tests::bench_parallel_integrate    ... bench:     958,000 ns/iter (+/- 457,837)
test tests::bench_parallel_monte_carlo  ... bench:   4,604,488 ns/iter (+/- 1,072,339)
test tests::bench_sequential_integrate  ... bench:   1,532,289 ns/iter (+/- 111,210)
test tests::bench_sequential_monte_carlo ... bench:  9,736,849 ns/iter (+/- 455,576)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured
```

The results above are from my dual-core Intel i5 laptop. As you can see, the
parallel versions run almost twice as fast, utilizing both CPU cores. However,
`rayon` doesn't *guarantee* parallel execution. To quote Niko:

> the decision of whether or not to use parallel threads is made dynamically, based on whether idle cores are available.

This approach is known as *potential parallelism*. Keep that in mind when using
`rayon` and always measure performance.

## Further reading

Photo by Moyan Brenn, shared under the Creative Commons Attribution 2.0 Generic License. See https://www.flickr.com/photos/aigle_dore/5952225148/