I was faced with an interesting problem. Which schema, in my DB, uses the most disk space? Theoretically it's trivial, as we have a set of helpful functions:

  • pg_column_size
  • pg_database_size
  • pg_indexes_size
  • pg_relation_size
  • pg_table_size
  • pg_tablespace_size
  • pg_total_relation_size

But in some cases it becomes more of a problem. For example – when you have thousands of tables …

For my sample DB I picked a database with over a million objects in it:

$ select count(*) from pg_class;
  count
---------
 1087322
(1 row)

There are over 700 schemas, each of them contains tables.

A naive query would look like:

$ select n.nspname, sum(pg_total_relation_size(c.oid)) as total_size
from pg_class c join pg_namespace n on c.relnamespace = n.oid
where c.relkind = 'r'
group by n.nspname order by total_size desc limit 10;

The problem? Well, on my test DB, I let it run for 3 minutes, and gave up.

It takes so long because I have so many objects. But is it possible to get this information faster? Yes, it's possible. It is not pretty, but it works.

To do it, we need to dig a bit deeper, and use file access functions.

The first function I will need to use is “pg_ls_dir”. It works like this:

$ select * from pg_ls_dir('.') limit 3;
  pg_ls_dir
--------------
 pg_xlog
 global
 pg_commit_ts
(3 rows)

Now, which dir to ls? The initial idea would be “base”, but if you have many tablespaces, then you might miss some files.

So, we need to read two potential places: “./base” and “./pg_tblspc”.

We can start with this query:

$ with  all_files as (
    SELECT 'base/' || l.filename as path, x.*
    FROM
        pg_ls_dir('base/') as l (filename),
        LATERAL pg_stat_file( 'base/' || l.filename) as x
    UNION ALL
    SELECT 'pg_tblspc/' || l.filename as path, x.*
    FROM
        pg_ls_dir('pg_tblspc/') as l (filename),
        LATERAL pg_stat_file( 'pg_tblspc/' || l.filename) as x
)
SELECT * FROM all_files;

This shows the first-level elements in the base and pg_tblspc directories. Now, we just need to do a recursive descent into all the directories there are …

$ with recursive all_files as (
    SELECT 'base/' || l.filename as path, x.*
    FROM
        pg_ls_dir('base/') as l (filename),
        LATERAL pg_stat_file( 'base/' || l.filename) as x
    UNION ALL
    SELECT 'pg_tblspc/' || l.filename as path, x.*
    FROM
        pg_ls_dir('pg_tblspc/') as l (filename),
        LATERAL pg_stat_file( 'pg_tblspc/' || l.filename) as x
    UNION ALL
    SELECT
        u.path || '/' || l.filename, x.*
    FROM
        all_files u,
        lateral pg_ls_dir(u.path) as l(filename),
        lateral pg_stat_file( u.path || '/' || l.filename ) as x
    WHERE
        u.isdir
)
SELECT * FROM all_files;

This query, on the same server, returns ~ 1.1 million rows in ~ 11 seconds. Not bad. And what do the rows look like?

      path       |  size   |         access         |      modification      |         change         | creation | isdir
-----------------+---------+------------------------+------------------------+------------------------+----------+-------
 base/1          |    8192 | 2017-02-09 09:48:56+00 | 2017-02-09 09:50:03+00 | 2017-02-09 09:50:03+00 | [null]   | t
 base/12374      |    8192 | 2017-02-09 09:48:56+00 | 2017-02-09 09:48:56+00 | 2017-02-09 09:49:28+00 | [null]   | t
 base/12379      |    8192 | 2017-02-09 09:48:56+00 | 2018-02-15 18:16:44+00 | 2018-02-15 18:16:44+00 | [null]   | t
 base/16401      |    8192 | 2017-02-09 09:48:57+00 | 2017-02-09 09:50:03+00 | 2017-02-09 09:50:03+00 | [null]   | t
 base/16402      | 4485120 | 2017-02-09 09:48:59+00 | 2018-02-17 11:01:27+00 | 2018-02-17 11:01:27+00 | [null]   | t
 base/pgsql_tmp  |       6 | 2017-02-09 10:48:09+00 | 2018-02-17 12:29:24+00 | 2018-02-17 12:29:24+00 | [null]   | t
 pg_tblspc/16400 |      29 | 2015-09-14 14:52:59+00 | 2017-02-09 09:35:45+00 | 2018-02-16 15:07:52+00 | [null]   | t
 base/1/1255     |  581632 | 2017-02-09 09:48:56+00 | 2017-02-09 09:48:56+00 | 2017-02-09 09:49:28+00 | [null]   | f
 base/1/1255_fsm |   24576 | 2017-02-09 09:48:56+00 | 2017-02-09 09:48:56+00 | 2017-02-09 09:49:28+00 | [null]   | f
 base/1/1247     |   65536 | 2017-02-09 09:48:56+00 | 2017-02-09 09:48:56+00 | 2017-02-09 09:49:28+00 | [null]   | f
(10 rows)
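As an aside – the recursive CTE above is really just a filesystem walk expressed in SQL. To make the traversal easier to picture, here is a rough Python sketch of the same idea (not part of the original approach; it assumes you can read a hypothetical data directory directly as the OS user, which pg_ls_dir/pg_stat_file let you avoid):

```python
import os

def walk_pgdata(pgdata):
    """Mimic the recursive CTE: list every file under base/ and
    pg_tblspc/ with its size, as (relative_path, size) tuples."""
    results = []
    for top in ('base', 'pg_tblspc'):
        start = os.path.join(pgdata, top)
        if not os.path.isdir(start):
            continue
        # followlinks=True matters: entries in pg_tblspc are symlinks
        # pointing at the real tablespace directories
        for dirpath, _dirnames, filenames in os.walk(start, followlinks=True):
            for name in filenames:
                full = os.path.join(dirpath, name)
                results.append((os.path.relpath(full, pgdata),
                                os.path.getsize(full)))
    return results
```

You would call it with your data directory, e.g. `walk_pgdata('/var/lib/postgresql/data')` (path made up). The SQL version has the advantage of running with the server's own privileges, over a plain database connection.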

The raw listing is not all that interesting, so let's filter it, and extract what we really need.

First things first – we can only (sensibly) check files that belong to the current database – otherwise we will not be able to map the file number (for example 1255) to a table name. This is unfortunate, but (in my case) not a problem.

Second – we only need to care about data files – that is, files named like “12314” or “12314.12”. We don't care about _fsm or _vm files, because these are generally small, internal Pg things.
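To make the filtering rules concrete, here is a small Python illustration (not from the original post; the sample paths and the helper name are made up) of the same logic the regexes in the next query implement – keep only main-fork data files of the given database, and strip the directory part and the “.N” segment suffix:

```python
import re

def filenode_of(path, db_oid):
    """Return the filenode of a main-fork data file belonging to
    database db_oid, or None for anything else (fsm/vm files,
    other databases' files, directories)."""
    # keep only paths ending in "<db_oid>/12314" or "<db_oid>/12314.12"
    if not re.search(r'/%d/[0-9]+(\.[0-9]+)?$' % db_oid, path):
        return None
    fname = re.sub(r'.*/', '', path)        # strip leading directories
    return re.sub(r'\.[0-9]*$', '', fname)  # strip ".12" segment suffix
```

So “base/16402/1255” and “base/16402/1255.2” both map to filenode “1255”, while “base/16402/1255_fsm” is rejected.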

So, let's limit what we have, and also extract only the file name from each path:

$ with recursive all_elements as (
    SELECT 'base/' || l.filename as path, x.*
    FROM
        pg_ls_dir('base/') as l (filename),
        LATERAL pg_stat_file( 'base/' || l.filename) as x
    UNION ALL
    SELECT 'pg_tblspc/' || l.filename as path, x.*
    FROM
        pg_ls_dir('pg_tblspc/') as l (filename),
        LATERAL pg_stat_file( 'pg_tblspc/' || l.filename) as x
    UNION ALL
    SELECT
        u.path || '/' || l.filename, x.*
    FROM
        all_elements u,
        lateral pg_ls_dir(u.path) as l(filename),
        lateral pg_stat_file( u.path || '/' || l.filename ) as x
    WHERE
        u.isdir
), all_files as (
    SELECT path, size FROM all_elements WHERE NOT isdir
)
SELECT
    regexp_replace(
        regexp_replace(f.path, '.*/', ''),
        '\.[0-9]*$',
        ''
    ) as filename,
    sum( f.size )
FROM
    pg_database d,
    all_files f
WHERE
    d.datname = current_database() AND
    f.path ~ ( '/' || d.oid || E'/[0-9]+(\\.[0-9]+)?$' )
group BY filename;

This returns data in a bit nicer format:

 filename  |    sum
-----------+------------
 897150761 |       8192
 893855744 |          0
 830027226 |       8192
 846295375 |          0
 875288146 |      16384
 880671539 |       8192
 890834780 |       8192
 873076686 |       8192
 896836699 |      49152

These numbers refer to the relfilenode column in pg_class. So I can join with pg_class and pg_namespace, and see how it looks:

$ with recursive all_elements as (
    SELECT 'base/' || l.filename as path, x.*
    FROM
        pg_ls_dir('base/') as l (filename),
        LATERAL pg_stat_file( 'base/' || l.filename) as x
    UNION ALL
    SELECT 'pg_tblspc/' || l.filename as path, x.*
    FROM
        pg_ls_dir('pg_tblspc/') as l (filename),
        LATERAL pg_stat_file( 'pg_tblspc/' || l.filename) as x
    UNION ALL
    SELECT
        u.path || '/' || l.filename, x.*
    FROM
        all_elements u,
        lateral pg_ls_dir(u.path) as l(filename),
        lateral pg_stat_file( u.path || '/' || l.filename ) as x
    WHERE
        u.isdir
), all_files as (
    SELECT path, size FROM all_elements WHERE NOT isdir
), interesting_files as (
    SELECT
        regexp_replace(
            regexp_replace(f.path, '.*/', ''),
            '\.[0-9]*$',
            ''
        ) as filename,
        sum( f.size )
    FROM
        pg_database d,
        all_files f
    WHERE
        d.datname = current_database() AND
        f.path ~ ( '/' || d.oid || E'/[0-9]+(\\.[0-9]+)?$' )
    group BY filename
)
SELECT
    n.nspname,
    c.relname,
    c.relkind,
    f.sum as size
FROM
    interesting_files f
    join pg_class c on f.filename::oid = c.relfilenode
    join pg_namespace n on c.relnamespace = n.oid
ORDER BY
    size desc;

        nspname        |                             relname                             | relkind |    size
-----------------------+-----------------------------------------------------------------+---------+------------
 pg_toast              | pg_toast_805314153                                              | t       | 3984195584
 xxxxxxxxxxxxxxx_9053  | xxxxxxxx                                                        | r       | 3538305024
 xxxxxxxxx             | xxxxxxxxxxxxxxxxxxx                                             | r       | 3062521856
 xxxxxxxxxxxxxxx_11400 | xxxxxxxxxx                                                      | r       | 2555461632
 xxxxxxxxxxxxxxx_7860  | xxxxxxxxxxxxxxxxxxxx                                            | r       | 2443206656
 xxxxxxxxx             | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                        | i       | 2237513728
 xxxxxxxxx             | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx             | i       | 1667743744
 xxxxxxxxxxxxxxx_8371  | xxxxxxxxxxx                                                     | r       | 1553113088
 pg_toast              | pg_toast_806460704                                              | t       | 1454399488
 xxxxxxxxxxxxxxx_8371  | xxxxxxxxxxxxxxxxxxxx                                            | r       | 1329913856

Sorry for censoring, but the table names might suggest things that are not relevant to this blogpost.

The thing is that while I got sizes of all tables (relkind = ‘r') and indexes (relkind = ‘i') – I also got, separately, sizes of TOAST tables (relkind = ‘t') – which are basically secondary storage for table data. And they all live in the pg_toast schema, which doesn't suit me. I'd like to know the original schema for each TOAST table, so I can sum it appropriately.

Luckily, this can be done with a simple join. Finally, I get to this query:

$ with recursive all_elements as (
    SELECT 'base/' || l.filename as path, x.*
    FROM
        pg_ls_dir('base/') as l (filename),
        LATERAL pg_stat_file( 'base/' || l.filename) as x
    UNION ALL
    SELECT 'pg_tblspc/' || l.filename as path, x.*
    FROM
        pg_ls_dir('pg_tblspc/') as l (filename),
        LATERAL pg_stat_file( 'pg_tblspc/' || l.filename) as x
    UNION ALL
    SELECT
        u.path || '/' || l.filename, x.*
    FROM
        all_elements u,
        lateral pg_ls_dir(u.path) as l(filename),
        lateral pg_stat_file( u.path || '/' || l.filename ) as x
    WHERE
        u.isdir
), all_files as (
    SELECT path, size FROM all_elements WHERE NOT isdir
), interesting_files as (
    SELECT
        regexp_replace(
            regexp_replace(f.path, '.*/', ''),
            '\.[0-9]*$',
            ''
        ) as filename,
        sum( f.size )
    FROM
        pg_database d,
        all_files f
    WHERE
        d.datname = current_database() AND
        f.path ~ ( '/' || d.oid || E'/[0-9]+(\\.[0-9]+)?$' )
    group BY filename
)
SELECT
    n.nspname as schema_name,
    sum( f.sum ) as total_schema_size
FROM
    interesting_files f
    join pg_class c on f.filename::oid = c.relfilenode
    left outer join pg_class dtc on dtc.reltoastrelid = c.oid AND c.relkind = 't'
    join pg_namespace n on coalesce( dtc.relnamespace, c.relnamespace ) = n.oid
group BY
    n.nspname
ORDER BY
    total_schema_size desc
LIMIT 10;

This returned the 10 most disk-using schemas in less than 26 seconds.

Complicated. Not nice. Possibly still optimizable, depending on some knowledge of filesystem layout. But it works. And all done from plain SQL. I do love my PostgreSQL 🙂

5 comments

  2. # galaxy
    Feb 19, 2018

    I wonder why your naive approach query runs for so long. pg_total_relation_size internally does essentially the same thing – it stat()’s table files. Did you examine the plan for the query? Maybe it’s pg_class to pg_namespace join that is so slow?

    Also, I think your second approach is not very careful as it doesn’t seem to take into account indices, table segments (large tables are stored on disk in separate files: 12345, 12345.1, 12345.2 and so on), fsm and vm

  3. Feb 20, 2018

    @Galaxy:
    1. Sure, my solution takes into account indexes and table parts.
    2. fsm and vm forks are very small, so I purposely ignored them, and I even wrote about it in the paragraph starting with “Second – we only need to care about data files” – the same one where I mentioned I am handling datafiles with “.x” – segments of tables.

    As for the “join being a suspect” – since my final query does the same join, I fail to see how it would be relevant in the first query but not the last.

  4. # Galaxy
    Feb 20, 2018

    Sorry, I might have not read well. You are right about join, indices and I agree that vm/fsm is negligible.

    Still, I cannot understand how your second query, which does the same thing in an obviously more complex way, could outperform the first, simple query.

    Was the plan more efficient in the latter case? Could it be due to the stat() call cache (that is, your naive query warmed the cache for the subsequent queries)?

    Can you maybe try running the first and second queries on a freshly started db with the file I/O cache purged?

  5. Feb 20, 2018

    @Galaxy:

    I don’t have a test db that I could use for this. I was running this query on basically a prod server (slave, but still prod) for a system we have 1 million tables in.

    Still – you can easily repeat the test on your own if you don’t believe me.

  6. # Avinash Patil
    Mar 21, 2018

    Great Article…really helpful … 🙂
