Set operations in shell

I had this interesting case at work. We have imports of objects. Each object in import file has its “ID" (which can be any string). Same “ID" is in database.

So the idea is pretty simple – we can/should check how many of IDs from import were in database. Unfortunately – we'd rather not really do the comparison in DB, as it is pretty loaded.

With that in mind I generated to files:

$ psql -qAt -c "select id from objects" > ids.from.database
$ cat import | cut -f1 > ids.from.import

For simplicity sake, we can assume them to contain this:

=> cat ids.from.database
1
2
3
4
5
 
=> cat ids.from.import
3
4
5
6
7

(of course real files had much more data in them).

Very important assumption is that given ID is unique in its respective file. I.e. there will be no 2 lines with “3" in ids.from.database or ids.from.import.

Now. What I need to do can be summarized as SQLs:

SELECT id FROM import EXCEPT SELECT id FROM DATABASE;

To do it, I can use this relatively simple expression:

$ ( cat ids.from.database ids.from.import | sort | uniq -u ; cat ids.from.import ) | sort | uniq -d

Which, as expected, will print:

6
7

Now, what does it really do?

Let's break it into smaller parts:

cat ids.from.database ids.from.import

It simply concatenates the files into one data stream with 10 lines containing: 1, 2, 3, 4, 5, 3, 4, 5, 6, 7.

| sort | uniq -u

This lists elements that occur in given stream only once. In our case there will be 4 such ids: 1, 2, 6, 7. Where 1 and 2 are in database only, and 6 and 7 are in import only.

; cat ids.from.import

This will print (again) content of ids.from.import. We will need it further on, but we will need it concatenated with output of previous command (cat … | sort | uniq -u), so:

( ..... ; ..... )

This makes it possible to pipe output of many commands through filter, and
tread it as one input. More technically – it launches subshell, runs all the
commands inside parentheses in the subshell, and treats output of the subshell
as input for next “pipe".

In our case, the code: ( cat ids.from.database ids.from.import | sort | uniq -u ; cat ids.from.import ), will simply return:

1
2
6
7
3
4
5
6
7

Which is pretty obvious, and expected.

And now for the final part:

| sort | uniq -d

This will return (print) only the ids that appeared at list twice in input (i.e. have duplicates).

If you don't understand why there is sort in there, let me show you simple example. Let's test this data:

=> cat test
1
1
2
1
2
2
1

If I don't use sort, uniq works only if the repeated values are next to each other:

=> cat test  | uniq
1
2
1
2
1

But if I add sort, it will work on all values, because they are not ordered, so all lines with given id are next to each other:

=> cat test  | sort | uniq
1
2

uniq and sort are very powerful tools, and let you do other stuff as well:

  • UNION: cat input_a input_b | sort | uniq
  • INTERSECT: cat input_a input_b | sort | uniq -d
  • EXCEPT (a-b): (cat input_a input_b | sort | uniq -u ; cat input_a ) | sort | uniq -d
  • EXCEPT (b-a): (cat input_a input_b | sort | uniq -u ; cat input_b ) | sort | uniq -d

All of these of course doesn't mean that you should ditch your SQL in favor of shell. It just shows that there are always many ways to do the job, and you should know them, to be able to choose the one that's better for given set of input data and other criteria.

6 thoughts on “Set operations in shell”

  1. I’d rather use:
    EXCEPT (b-a): cat input_a input_a input_b | sort | uniq -u

  2. @mp:
    thanks – didn’t know that.

    Wow. And I thought my solution is smart. But I learned a lot thanks to your (both mp and sls) comments.

Comments are closed.