PostgreSQL 15: Statistics collector gone? What's new?

Anyone trying the next PostgreSQL 15 may notice that one of the background processes is missing.

if we compare this with PostgreSQL 14:

Yes, the "stats collector" is missing, and it's gone for good. One of the main bottlenecks and headaches is gone forever.
What does the statistics collector do?
First time users may wonder what it is and why it is needed for PG 14 and earlier. At least a few users are confused about table-level statistics collection (ANALYZE), which is used for query planning. But it's different. PostgreSQL tracks all activity of each process to have cumulative statistics like how many times a table or index is scanned, or when the last vacuum or autovacuum was run on the table, or how many times autovacuum has been executed on a table, etc. All the information collected by the statistics collector is available through different pg_stat_* views.
What was wrong?
Since each backend of a session is an individual process in PostgreSQL, collecting statistics and reporting them is not an easy task. Each backend sends information about the activity it has performed to a single "statistics collector" process. This communication was previously done via the UDP socket. There were a lot of problems with this approach, it's not a scalable model. Users often reported different types of issues, such as 1. outdated stats, 2. stats collector not working, 3. autovacuum not working/starting, etc.

Previously, it was very difficult to figure out what was wrong if a stat collector had a problem with a specific machine.

Another negative effect of the "stat collector" is the IO it causes. if you enable DEBUG level 2, you may see messages that keep coming in the PostgreSQL log like:

This can cause considerable I/O to the mount point where your data directory is located. This is the location pointed to by the value of the stats_temp_directory parameter. On many systems this will be the pg_stat_tmp in the data directory.

On Ubuntu/Debian this will be in /var/run/postgresql, for example:
What's new in PostgreSQL 15?
Instead of using files and filesystem, statistics now use dynamic shared memory.

You can refer to the commit here by Andres Freund for a summary:

Previously, the stats collector received stats updates via UDP and shared stats data by periodically writing it to temporary files. These files can reach tens of megabytes and are written up to twice per second. This repeatedly prevented us from adding additional useful stats.

Statistics are now stored in shared memory. Variable number object statistics are stored in a hash table dshash (backed by dynamic shared memory). Fixed number statistics are stored in ordinary shared memory.

The pgstat.c header contains an architectural overview.

The stats collector is no longer needed, remove it.

Using the transactional stats suppression framework introduced in a previous commit, stats entries can no longer "leak". Previously leaked statistics have been removed by pgstat_vacuum_stat(), called from [auto-]vacuum. On systems with many small relationships, pgstat_vacuum_stat() can be quite expensive.

Now that replicas remove statistics entries for deleted objects, it is no longer necessary to reset statistics when starting from a cleanly stopped replica.

Obviously the stats_temp_directory parameter is gone. So we don't need the pg_stat_tmp directory which is created in the data directory (or other location) where all the statistics files are generated and read. However, this directory is kept so as not to break many extensions like pg_stat_statements, which depend on the directory. The directory remains empty until the extension libraries are loaded. For example, if we load the pg_stat_statements library, a file appears in the directory.

Of course, extensions are not free. They bear their own costs.

In the new architecture, most statistics updates are first accumulated locally in each process as "pending" (each backend has a backend-local hash table). "Pending" in the sense that they are accumulated but not yet subject to the shared statistics system. This is then flushed to shared memory just after a commit or by timeout.

Since statistics are updated simultaneously while someone tries to read, read consistency comes into play. Thus PostgreSQL 15 introduces a new parameter: stats_fetch_consistency which can take three values none, cache or snapshot.

"none" is the most efficient...

Technology Aug 29, 2022 0 71 Add to Reading List

PostgreSQL 15: Statistics collector gone? What's new?

Anyone trying the next PostgreSQL 15 may notice that one of the background processes is missing.

if we compare this with PostgreSQL 14:

Yes, the "stats collector" is missing, and it's gone for good. One of the main bottlenecks and headaches is gone forever.

What does the statistics collector do?

First time users may wonder what it is and why it is needed for PG 14 and earlier. At least a few users are confused about table-level statistics collection (ANALYZE), which is used for query planning. But it's different. PostgreSQL tracks all activity of each process to have cumulative statistics like how many times a table or index is scanned, or when the last vacuum or autovacuum was run on the table, or how many times autovacuum has been executed on a table, etc. All the information collected by the statistics collector is available through different pg_stat_* views.

What was wrong?

Since each backend of a session is an individual process in PostgreSQL, collecting statistics and reporting them is not an easy task. Each backend sends information about the activity it has performed to a single "statistics collector" process. This communication was previously done via the UDP socket. There were a lot of problems with this approach, it's not a scalable model. Users often reported different types of issues, such as 1. outdated stats, 2. stats collector not working, 3. autovacuum not working/starting, etc.

Previously, it was very difficult to figure out what was wrong if a stat collector had a problem with a specific machine.

Another negative effect of the "stat collector" is the IO it causes. if you enable DEBUG level 2, you may see messages that keep coming in the PostgreSQL log like:

This can cause considerable I/O to the mount point where your data directory is located. This is the location pointed to by the value of the stats_temp_directory parameter. On many systems this will be the pg_stat_tmp in the data directory.

On Ubuntu/Debian this will be in /var/run/postgresql, for example:

What's new in PostgreSQL 15?

Instead of using files and filesystem, statistics now use dynamic shared memory.

You can refer to the commit here by Andres Freund for a summary:

Previously, the stats collector received stats updates via UDP and shared stats data by periodically writing it to temporary files. These files can reach tens of megabytes and are written up to twice per second. This repeatedly prevented us from adding additional useful stats.

Statistics are now stored in shared memory. Variable number object statistics are stored in a hash table dshash (backed by dynamic shared memory). Fixed number statistics are stored in ordinary shared memory.

The pgstat.c header contains an architectural overview.

The stats collector is no longer needed, remove it.

Using the transactional stats suppression framework introduced in a previous commit, stats entries can no longer "leak". Previously leaked statistics have been removed by pgstat_vacuum_stat(), called from [auto-]vacuum. On systems with many small relationships, pgstat_vacuum_stat() can be quite expensive.

Now that replicas remove statistics entries for deleted objects, it is no longer necessary to reset statistics when starting from a cleanly stopped replica.

Obviously the stats_temp_directory parameter is gone. So we don't need the pg_stat_tmp directory which is created in the data directory (or other location) where all the statistics files are generated and read. However, this directory is kept so as not to break many extensions like pg_stat_statements, which depend on the directory. The directory remains empty until the extension libraries are loaded. For example, if we load the pg_stat_statements library, a file appears in the directory.

Of course, extensions are not free. They bear their own costs.

In the new architecture, most statistics updates are first accumulated locally in each process as "pending" (each backend has a backend-local hash table). "Pending" in the sense that they are accumulated but not yet subject to the shared statistics system. This is then flushed to shared memory just after a commit or by timeout.

Since statistics are updated simultaneously while someone tries to read, read consistency comes into play. Thus PostgreSQL 15 introduces a new parameter: stats_fetch_consistency which can take three values none, cache or snapshot.

"none" is the most efficient...