# 24 days of Rust - rayon

Let me start this article with a confession: I suck at writing parallel code. I have only a vague idea about mutexes, locks and other low level primitives for handling data between threads. And since I work mostly with Python, I tend to use `multiprocessing` (if I bother to parallelize at all) rather than threads.

However, one of the selling points of Rust is its approach to concurrent/parallel programming. Mozilla recently released a video encouraging people to write parallel code that is safe and approachable. Then I heard about rayon - a library for data parallelism that is very approachable and lightweight.

If you have an iterator chain that computes something on every element, you can stick some variation of `par_iter()` in the chain and now your code is`*` parallel. So simple!

Niko Matsakis, the author of `rayon`, has a blog post describing the motivation behind this crate and some implementation details. Worth checking out if you're curious how `rayon` works under the hood.

## How much is π?

Let's assume we don't know the value of π. How would we calculate it? One approach is to measure the area of a unit circle (a circle with radius=1). Since circle area is `πr<sup>2</sup>` and r=1, the area of a unit circle equals π.

If we plot the circle as a graph of a function, the area under the graph can be calculated as a definite integral. Let's represent our circle function in Rust:

```fn semicircle(x: f64, r: f64) -> f64 {
(r * r - x * x).sqrt()
}
```

Note: this is actually a semicircle, because a circle doesn't represent a function (it maps an `x` value to two `y` values).

There are various techniques for numerical integration. One of those is known as the trapezoidal rule. Here's a naive Rust implementation:

```fn integrate<F>(f: F, points: usize) -> f64
where F: Fn(f64) -> f64
{
let delta = 1f64 / points as f64;
(0..points)
.map(|i| {
let a = i as f64 * delta;
delta * (f(a) + f(a + delta)) / 2f64
})
.sum()
}
```

We cut the area under function into narrow strips of width `delta` and sum up their areas. Don't be afraid if you don't follow all the math here. The important thing is - every strip is independent of each other. This is an embarassingly parallel task! Let's see how `rayon` can help us.

```extern crate rayon;

use rayon::prelude::*;

fn parallel_integrate<F>(f: F, points: usize) -> f64
where F: Fn(f64) -> f64 + Sync
{
let delta = 1f64 / points as f64;
(0..points)
.into_par_iter()
.map(|i| {
let a = i as f64 * delta;
delta * (f(a) + f(a + delta)) / 2f64
})
.sum()
}
```

See any changes? Apart from the imports, the only things that changed are trait bounds on the function (we need it to be `Sync`) and the range is converted into a parallel iterator. That's it!

```println!("sequential: {}", 4f64 * integrate(semicircle, 10_000_000));
println!("parallel: {}", 4f64 * parallel_integrate(semicircle, 10_000_000));
```
```sequential: 3.1415926535533574
parallel: 3.141592653552613
```

We're pretty close to the value of π!

## Monte Carlo method

Imagine you're throwing darts at a square board which has a circle drawn on it. If you're as poor at darts as I am and throw at random, some of your hits will be inside of the circle, some outside of it. Monte Carlo method for calculating π works just like that: pick a number of points at random and count these within the circle. The ratio of this count to total number of points approximates the ratio of the areas (which is π/4).

Note: this is a worse example of parallelization because `rand::random()` uses a thread-local random generator that reseeds itself from OS-provided entropy source.

```extern crate rand;

fn monte_carlo_pi(points: usize) -> f64 {
let within_circle = (0..points)
.filter_map(|_| {
let x = rand::random::<f64>() * 2f64 - 1f64;
let y = rand::random::<f64>() * 2f64 - 1f64;
if x * x + y * y <= 1f64 { Some(1) } else { None }
})
.count();
4f64 * within_circle as f64 / points as f64
}

fn parallel_monte_carlo_pi(points: usize) -> f64 {
let within_circle = (0..points)
.into_par_iter()
.filter_map(|_| {
let x = rand::random::<f64>() * 2f64 - 1f64;
let y = rand::random::<f64>() * 2f64 - 1f64;
if x * x + y * y <= 1f64 { Some(1) } else { None }
})
.count();
4f64 * within_circle as f64 / points as f64
}

println!("sequential MC: {}", monte_carlo_pi(1_000_000));
println!("parallel MC: {}", parallel_monte_carlo_pi(1_000_000));
```

Again our parallel implementation differs only in the call to `into_par_iter()`.

```sequential MC: 3.138308
parallel MC: 3.142376
```

The results will be different in every run, but we can see it sort of looks like π.

## Benchmarks

Twelve o'clock and all is well, but how do we know if the parallel versions really perform better? Benchmarks!

The following code works only on nightly, due to `test` feature gate.

```#![feature(test)]

extern crate test;

#[cfg(test)]
mod tests {
use super::{semicircle, integrate, parallel_integrate, monte_carlo_pi, parallel_monte_carlo_pi};
use test::Bencher;

#[bench]
fn bench_sequential_integrate(b: &mut Bencher) {
b.iter(|| integrate(semicircle, 100_000))
}

#[bench]
fn bench_parallel_integrate(b: &mut Bencher) {
b.iter(|| parallel_integrate(semicircle, 100_000))
}

#[bench]
fn bench_sequential_monte_carlo(b: &mut Bencher) {
b.iter(|| monte_carlo_pi(100_000))
}

#[bench]
fn bench_parallel_monte_carlo(b: &mut Bencher) {
b.iter(|| parallel_monte_carlo_pi(100_000))
}
}
```
```\$ cargo bench
Finished release [optimized] target(s) in 0.0 secs
Running target\release\rayon-bench-06c3c3f0de9581df.exe

running 4 tests
test tests::bench_parallel_integrate     ... bench:     958,000 ns/iter (+/- 457,837)
test tests::bench_parallel_monte_carlo   ... bench:   4,604,488 ns/iter (+/- 1,072,339)
test tests::bench_sequential_integrate   ... bench:   1,532,289 ns/iter (+/- 111,210)
test tests::bench_sequential_monte_carlo ... bench:   9,736,849 ns/iter (+/- 455,576)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured
```

The results above are on my dual core Intel i5 laptop. As you can see, the parallel versions run almost twice as fast, utilizing both CPU cores. However, `rayon` doesn't guarantee parallel execution. To quote Niko:

the decision of whether or not to use parallel threads is made dynamically, based on whether idle cores are available.

`*` This approach is known as potential parallelism. Keep that in mind when using `rayon` and always measure performance.