postfix-users October 2010 archive

intermittent hang on "postfix stop"; doesn't return "terminating on signal"

From: Scott Brown <scottwb_98_at_nospam>
Date: Fri Oct 15 2010 - 13:43:56 GMT
To: postfix-users@postfix.org

Hello,

I'm stuck with a problem where postfix is hanging sometimes when issuing a
"postfix stop" command. In my configuration, I have two domains I'm relaying
mail for with postfix. The lists of email addresses are in these virtual files,
defined with this line in main.cf:
    virtual_alias_maps =
        hash:$config_directory/usermanaged/virtual.domain1.com,
        hash:$config_directory/usermanaged/virtual.domain2.com

I have a script called update-postfix.pl that runs every half an hour which
shuts down postfix, runs postmap on the virtual alias files, and then restarts
postfix. Most of the time, the script runs without any problems. Actually,
this configuration was running on a different server previously with no problems
at all. Because the old server was too slow, I set up a new postfix
installation on a new server, and it's under the new setup that I'm running into
this intermittent hanging problem.

Usually, when the update-postfix.pl script runs, it tells Postfix to shut down
and we get a logged message that says "postfix/postfix-script: stopping the
Postfix mail system". Right after that, postfix responds with something like
"postfix/master[11211]: terminating on signal 15"

However, sometimes (once every day or so), the script runs and we get the first
message "postfix/postfix-script: stopping the Postfix mail system", but then
postfix does not respond to it and keeps running for a while, until it sees the
virtual.domain2.com.db was updated, at which point it logs
"postfix/trivial-rewrite[16529]: table
hash:/etc/postfix/usermanaged/virtual.domain2.com(0,lock|fold_fix) has changed
-- restarting", and then after that, it appears to be hung.
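The failure pattern described above can be searched for in the log. The following is a minimal sketch, not part of the original setup: a hypothetical shell function, find_unanswered_stops, that reads maillog lines on standard input and prints the timestamp of every "stopping the Postfix mail system" entry that is not followed by a "terminating on signal" entry before the next stop request. It assumes the traditional syslog layout shown in this message (timestamp in the first three fields) and that each log entry is on one line.

```shell
# Print the timestamp of each stop request that master never answered.
# Hypothetical helper; reads syslog-style lines on stdin.
find_unanswered_stops() {
    awk '
        /postfix-script: stopping the Postfix mail system/ {
            # A new stop request while the previous one is still open
            # means the previous one was never answered.
            if (pending != "") print pending
            pending = $1 " " $2 " " $3
            next
        }
        /master\[[0-9]+\]: terminating on signal/ {
            # master acknowledged the stop request.
            pending = ""
        }
        END {
            if (pending != "") print pending
        }
    '
}

# Example (the log path is an assumption):
#   find_unanswered_stops < /var/log/maillog
```

Run against a whole day of log, this should show how often the stop request goes unanswered and at what times.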

My first thought was that maybe postfix didn't have enough of an opportunity to
shut itself down before the .db was updated by the postmap command. So I put in
a "sleep 60" right after the postfix stop command, even though when the stop
command works, we see the "terminating on signal" response almost instantly.
Since the 60 seconds didn't help, I increased it to 2 minutes, but that didn't
help either.
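An alternative to a fixed sleep is to wait until the master process has actually gone away and to report a failure if it has not. The following is a minimal sketch under stated assumptions, not something the original script does: wait_for_exit is a hypothetical helper, the caller is assumed to be root (so that "kill -0" can probe any process), and the pid file location in the usage comment assumes the usual pid/master.pid under the Postfix queue directory.

```shell
# Wait until process PID no longer exists, polling once per second.
# Returns 0 once the process is gone, 1 if it is still there after
# TIMEOUT seconds. Hypothetical helper.
wait_for_exit() {
    pid=$1
    timeout=$2
    waited=0
    # "kill -0" sends no signal; it only checks that the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical usage in an update script (paths are assumptions):
#   pidfile="$(postconf -h queue_directory)/pid/master.pid"
#   read -r pid < "$pidfile"
#   postfix stop
#   wait_for_exit "$pid" 120 || { echo "master $pid still running" >&2; exit 1; }
```

With a check like this, the script can refuse to run postmap and "postfix start" when the stop did not take effect, instead of continuing after the sleep regardless.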

I found a forum post (http://www.howtoforge.com/forums/showthread.php?t=15898)
where someone had a somewhat similar problem. Someone suggested running
"newaliases". So I tried deleting both virtual.domain2.com.db and
virtual.domain1.com.db, then ran "newaliases", manually ran postmap for domain2
and domain1, and restarted postfix. But even after those steps, the problem
kept happening.

This most recent time it hung, I tried issuing another "service postfix stop",
as well as a plain "postfix stop", but neither of those caused postfix to
respond with "terminating on signal". I checked the ps aux process list, and
tried killing the postfix processes I saw. Then I tried restarting, and got the
"already running" error. So I checked ps aux again and noticed there were a
bunch of processes being run by the postfix user that were tagged with
<defunct>. I tried killing those processes but couldn't kill them.
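A <defunct> process has already exited and stays in the process table only until its parent collects its exit status, so signals sent to it have no effect; the process to look at is the parent. The following is a minimal sketch for listing defunct processes together with their parent PIDs. The helper name list_defunct is hypothetical, and the ps invocation in the usage comment assumes a procps-style ps as found on Linux.

```shell
# Filter "ps" output (columns: PID PPID STAT USER COMMAND) down to
# defunct (zombie) processes, whose STAT field starts with "Z".
# Prints: PID PPID USER COMMAND. Hypothetical helper.
list_defunct() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4, $5 }'
}

# Example (assumes a procps-style ps):
#   ps -eo pid,ppid,stat,user,comm | list_defunct
```

If the parent PID shown is that of the Postfix master process, that would suggest the master is still present but not collecting its children.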

Does anyone have any ideas on what could be wrong?

Thanks very much in advance for any suggestions!

Scott

Below is a snippet from maillog showing what happens when it hangs. You can see
that at 23:30:02, the update-postfix.pl script kicks in and tries to stop
postfix. It doesn't succeed, and one minute later, Postfix sees the .db was
updated and tries to restart. Shortly after that, the update script tries to
restart postfix but fails because it's already running. Then postfix stays in a
hung state, not accepting any incoming connections. Half an hour later, the
cron job runs again but fails to do anything because postfix is hung.

Oct 9 23:30:02 myserver postfix/postfix-script: stopping the Postfix mail system
Oct 9 23:31:04 myserver postfix/trivial-rewrite[16529]: table hash:/etc/postfix/usermanaged/virtual.domain2.com(0,lock|fold_fix) has changed -- restarting
Oct 9 23:31:14 myserver postfix/postfix-script: fatal: the Postfix mail system is already running
Oct 9 23:32:44 myserver postfix/anvil[16528]: statistics: max connection rate 1/60s for (smtp:110.36.0.252) at Oct 9 23:29:03
Oct 9 23:32:44 myserver postfix/anvil[16528]: statistics: max connection count 1 for (smtp:110.36.0.252) at Oct 9 23:29:03
Oct 9 23:32:44 myserver postfix/anvil[16528]: statistics: max cache size 2 at Oct 9 23:29:23
Oct 10 00:00:02 myserver postfix/postfix-script: stopping the Postfix mail system
Oct 10 00:01:13 myserver postfix/postfix-script: fatal: the Postfix mail system is already running

update-postfix.pl:
-------------------------
#!/usr/bin/perl
$|=1;
my $dir= "/var";
my @fulldfinfo = `df $dir`;
#/dev/da0s1f 33851580 28087462 3055992 90% /var
my ($dfdevice,$total,$used,$avail,$pct,$dfmount)=split /[\b\t ]+/,$fulldfinfo[1];
print "space available=$avail\nspace used=$used";
if ($avail > 0) {
    open(SH, "|/bin/sh");
    print SH <<"EOM";
umask 022
cd /etc/postfix
/sbin/service postfix stop
sleep 120
/usr/sbin/postmap -c /etc/postfix hash:/etc/postfix/access
/usr/sbin/postmap -c /etc/postfix hash:/etc/postfix/transport
/usr/sbin/postmap -c /etc/postfix hash:/etc/postfix/usermanaged/virtual.domain1.com
/usr/sbin/postmap -c /etc/postfix hash:/etc/postfix/usermanaged/virtual.domain2.com
/usr/sbin/postalias -c /etc/postfix hash:aliases
sleep 10
/sbin/service postfix start
#/usr/sbin/postfix -c /etc/postfix reload
EOM
    #mv /etc/postfix/usermanaged/virtual.tmp.db /etc/postfix/usermanaged/virtual.db
} else {
    print "Not enough space to generate new db files!";
}