postfix-users October 2010 archive

Re: Postfix locking up, not accepting connections / smtp not sending emails out

From: Wietse Venema <wietse_at_nospam>
Date: Fri Oct 29 2010 - 17:04:40 GMT
To: Postfix users <postfix-users@postfix.org>

Christian Rohmann:
> > Does this server run in a virtual machine?
> Yeah, the Debian Lenny (amd64) runs on VMware ESX 4.1 hosts. The guests
> themselves are VMware HW revision 7.

VMware has an entire KB article on problems with delivering timer
interrupts to guest machines, and the hoops that they are jumping
through to avoid poor performance. See
http://tech.groups.yahoo.com/group/postfix-users/message/269786

> > What is the output from "grep fatal" on today's and yesterday's maillog file?
> None, not a single line.
>
> > What is the output from "grep watchdog" on all your maillog files?
> Same as above -> nothing.
>
> I guess that rules out this issue here?

No, it confirms my suspicion that either a) you run Postfix < 2.4
and do "postfix stop" or "reload" frequently, or b) your virtual
timers are broken, or c) you used "grep" on compressed files instead
of using "zgrep" or "bzgrep".
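To illustrate point c): plain "grep" reads the compressed bytes of a rotated logfile and silently finds nothing. A minimal sketch of a search that decompresses by file extension, roughly what zgrep and bzgrep do. The names open_log and grep_logs are made up for this example, and it assumes that rotated logs end in ".gz" or ".bz2".

```python
import bz2
import gzip


def open_log(path):
    """Open a logfile as text, decompressing based on its extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def grep_logs(needle, paths):
    """Return (path, line) pairs for every line that contains needle."""
    hits = []
    for path in paths:
        with open_log(path) as log:
            for line in log:
                if needle in line:
                    hits.append((path, line.rstrip("\n")))
    return hits
```

For example, grep_logs("watchdog", ["/var/log/mail.log", "/var/log/mail.log.1.gz"]) would search both the current and the rotated file; the paths are only examples and depend on your syslog and logrotate configuration.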

All Postfix daemons including the master have an alarm(3) timer
that aborts the process when it becomes stuck.

Normally all processes reset their alarm timer frequently; when
they become stuck, they stop resetting their alarm timer. When the
timer goes off, the process logs a watchdog error and terminates.
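The watchdog idea can be sketched in a few lines. This is not Postfix code, just a Python illustration of the same pattern using the alarm timer; all function names are made up for the example. It assumes a UNIX-like system and must run in the main thread. Unlike a real daemon, the handler raises an exception instead of aborting the process, so that the effect can be observed.

```python
import signal
import time


class WatchdogTimeout(Exception):
    """Raised when the watchdog alarm fires (hypothetical name)."""


def _on_alarm(signum, frame):
    # A real daemon would log a "watchdog" error and abort here;
    # this sketch raises instead so the caller can see what happened.
    raise WatchdogTimeout("watchdog timeout")


def watchdog_start(seconds):
    """Install the handler and arm the per-process alarm timer."""
    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)


def watchdog_pat(seconds):
    """Reset the timer; a healthy process does this frequently."""
    signal.alarm(seconds)


def watchdog_stop():
    """Cancel any pending alarm."""
    signal.alarm(0)


def run(steps, step_delay, timeout=1):
    """Do `steps` units of work, resetting the watchdog after each one."""
    watchdog_start(timeout)
    try:
        for _ in range(steps):
            time.sleep(step_delay)  # stand-in for real work
            watchdog_pat(timeout)
        return "finished"
    except WatchdogTimeout:
        return "killed by watchdog"
    finally:
        watchdog_stop()
```

A call such as run(3, 0.01) keeps resetting the timer and finishes normally, while run(1, 1.5) gets stuck for longer than the one-second timeout and is caught by the watchdog. If the alarm signal is never delivered, the second case simply hangs, and nothing is logged.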

> On 10/29/2010 05:43 PM, lst_hoe02@kwsoft.de wrote:
> > Maybe another instance of this problem?
> > http://tech.groups.yahoo.com/group/postfix-users/message/269786
>
> Even though at some point postfix stopped at EPOLL_WAIT...

That does not look like the problem with "postfix stop" or "reload"
with Postfix < 2.4 which sometimes triggers a deadlock in syslog().

So we still have the possibility that your timer support is broken
such that even the per-process alarm timer is no longer working.

Postfix relies heavily on timer support to enforce sanity.

Specifically, Postfix relies on short-term timers (implemented with
poll and epoll on Linux) to enforce time limits on read/write
operations, and relies on long-term alarm timers to kill off a
process that hangs because some short-term timer failed to go off.
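As an illustration of the short-term layer only, here is a minimal Python sketch of a read with a time limit enforced by poll. It is not how Postfix implements it; the names timed_read and ReadTimeout are made up for the example, and it assumes a system where select.poll is available (Linux and most UNIX-like systems).

```python
import os
import select


class ReadTimeout(Exception):
    """Hypothetical error type for an expired short-term timer."""


def timed_read(fd, nbytes, timeout_s):
    """Read up to nbytes from fd, giving up after timeout_s seconds."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    # poll() takes milliseconds; an empty result means the time limit
    # expired before the descriptor became readable.
    if not poller.poll(int(timeout_s * 1000)):
        raise ReadTimeout(f"no data within {timeout_s}s")
    return os.read(fd, nbytes)
```

The time limit here depends entirely on the kernel waking up the poll call on schedule. If that wakeup never arrives, the call blocks until data shows up, which is exactly the case that the long-term alarm timer is supposed to catch.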

If both layers of safety fail due to broken (virtual) timer support,
then it is not possible to run Postfix reliably.

        Wietse