# Picking random element, with weights

Whenever I'm doing some testing I need sample data. Easiest way to do it is to generate data using some random/generate_series queries.

But what if I need specific frequencies?

For example, I need to generate 10,000,000 rows, where there will be 10% of ‘a', 20% of ‘b', and the rest will be split equally between ‘c' and ‘d'?

online user" on Discord had similar problem, so figured I'll write a function that does that.

Since we need to pass, to the function, any number of args, and weigths for them, the simplest thing I can pass it as are 2 arrays:

1. elements to randomly pick from
2. their weights

```CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] ) RETURNS ANYELEMENT language plpgsql as \$\$
DECLARE
v_no_choices INT4 := array_length(p_choices, 1);
v_no_weights INT4 := array_length(p_weights, 1);
v_weight_sum FLOAT8;
v_random     FLOAT8;
v_i          INT4;
BEGIN
IF v_no_choices IS NULL OR v_no_weights IS NULL OR v_no_choices <> v_no_weights THEN
RETURN NULL;
END IF;

SELECT sum(x) INTO v_weight_sum FROM unnest(p_weights) as q(x);

v_random := random() * v_weight_sum;

for v_i in array_lower(p_weights, 1) .. array_upper(p_weights, 1) loop
IF v_random < p_weights[v_i] THEN
RETURN p_choices[v_i];
END IF;
v_random := v_random - p_weights[v_i];
END loop;
RETURN NULL;
END;
\$\$;```

It needs two arrays, they shouldn't be empty/null, and their number of elements have to be the same.

Example use:

```=\$ SELECT weighted_random( '{a,b,c,d}'::TEXT[], '{.5,.5,.1,.9}' );
weighted_random
─────────────────
d
(1 row)```

In this example, it picks from four values:

• a with weight 0.5
• b with weight 0.5
• c with weight 0.1
• d with weight 0.9

Since the weights are any float8, we can pick virtually any number to represent weight. For example we could:

`=\$ SELECT weighted_random( '{a,b,c,d}'::TEXT[], '{1,2,3.5,3.5}' );`

to generate one of the values with probabilities from the head of this post 🙂

We can easily check if the percentages are what we wanted by doing:

```=\$ WITH x AS (
SELECT
weighted_random( '{a,b,c,d}'::TEXT[], '{1,2,3.5,3.5}' ) as v
FROM
generate_series(1,10000) i
)
SELECT
v,
count(*),
count(*)::float8 * 100 / 10000 as count_pct
FROM
x
GROUP BY
v
ORDER BY
v;
v │ count │ count_pct
───┼───────┼───────────
a │   995 │      9.95
b │  2014 │     20.14
c │  3471 │     34.71
d │  3520 │      35.2
(4 rows)```

Of course, since these are randomly picked, the percentages fluctuate a bit, but should be close enough 🙂

Hope it could be useful for someone 🙂

This site uses Akismet to reduce spam. Learn how your comment data is processed.