If you ever encountered “idle in transaction" connections, you most likely hate them. I know, I personally hate them. They interfere with most of “cool toys" like replication, vacuum, DDL queries.
So, when I saw them on a database I was looking on, I decided to act.
Easier to say, difficult to do. How to fix the problem?
Of course, I could kill all postgres processes that are in “idle in transaction" state, but this is not safe, and definitely it's not optimal solution.
The best option would be to fix these parts of code that generate idle in transaction connections. But how?
First, let's ask a simple question: what exactly is “idle in transaction" connection?
It's pretty simple: it is postgres backend process, handling connection, which started transaction, perhaps did something within this transaction, but now it's not doing anything, and the transaction is still open, waiting for COMMIT or ROLLBACK.
Checking pg_stat_activity will not tell us anything – the connection is idling. Of course, we will know what host/post the connection is from, what database it uses, and what user is logged. But in most of the cases it's not enough. The idle in transaction connections come from application servers, and despite the fact that I know which application caused them, I don't know which part of the application did it.
Luckily, with a bit of Perl hacking, we can get some information.
First, let's setup PostgreSQL, so it will log all queries with exact timestamps and query duration. For example using these settings:
name | setting ----------------------------+----------------- log_line_prefix | %m %u@%d %p %r log_min_duration_statement | 0
Logs from this look like this:
2008-08-28 16:58:31.978 CEST some_user@some_database 26001 10.3.5.1(45094) LOG: duration: 1.387 ms statement: SELECT ...
Interesting parts are timestamp at the beginning of like, and query duration.
First thing – the timestamp is from the end of query execution. Since it is ‘2008-08-28 16:58:31.978', and query took a bit over 1ms, we can assume the query started at ‘2008-08-28 16:58:31.977'.
This gives us both start and end of query.
Doing analysis of whole log, we can find out what are the time differences between end of one query and start of another.
So, I wrote a program that analyzes logs, and prints interesting information.
This is how it works:
=> zcat postgresql-2008-08-28_165831.log.gz | ./find-iit-connections.pl 2>/dev/null 162ms @ 2008-08-28 16:58:55.795 CEST (pid: 8570), queries: - begin - some_query - other_query - another - the query *after* long idle in transaction
It searches for all idle-in-transactions periods that are longer than 100ms – but this can be easily modified in this line:
my $MINIMAL_IDLE_TO_REPORT = 100;
Output shows how long the idle in transaction was (162ms), at what time it ended, what is the pid of postgres backend which was in this state, and list of all queries (from beginning of transaction) that lead to this state.
Having this information finding responsible part of the code can be much easier.