postfix-users October 2010 archive
Main Archive Page > Month Archives  > postfix-users archives
postfix-users: Re: Postfix locking up, not accepting connections

Re: Postfix locking up, not accepting connections / smtp not sending emails out

From: Jeroen Geilman <jeroen_at_nospam>
Date: Fri Oct 29 2010 - 18:38:32 GMT
To: postfix-users@postfix.org

On 10/29/2010 05:35 PM, Christian Rohmann wrote:
> Hey postfix-users,
>
> we are currently analyzing very strange postfix behavior which I can
> only describe as lockup or freeze.
> Honestly we reached our abilities to gather more info and to find the
> root cause of this issue.
>
> You are my last hope obi wan ... eh Wietse ....
>
>
> --- Setup / Configuration ---
> We are running three VMware ESX 4.1 guests (multiple VMware hosts) with
> Debian Lenny amd64.
> Each machine has 8 virtual CPUs and 4GB of RAM.

Wow. Seriously ? For *email* ?

> We are using the Debian
> lenny postfix package (2.5.5) and now moved up to 2.7.1 which we
> backported (problems remained unchanged).
> Each instance of postfix recieves about 200-300 mails/minute.

That's... frankly, that's nothing.
We do that volume on a single-vcpu with 512MB.

At 5% CPU utilization.

Which makes me suspect you have other issues.
Is your networking sane/stable/provably working 100% ?
Which vmware NIC are you using ?
I have seen incompatibilities in Linux with anything but the standard
Intel E1000 emulator, so I suggest you use only that.

> The setup and configuration works like a charm for hours at a time and
> all of a sudden it stops working leading to two issues (not at the same
> time):
>
> 1) First issue was that suddenly smtp stopped delivering email to that
> mutli-A record. We noticed a few thousand emails in the active queue (I
> guess all emails where in the active queue by that time). We rule out
> problems with the destination servers since the remaining postfix
> instances still delivered mail during that time. Even the last
> submission of email from the now locked up postfix finished without
> issue. There are just no more tries to reach the destination. Postfix
> also stops to deliver to both destination IPs at the same time. There
> was no logging anymore, but from anvil giving us some statistics about
> connection rates.
>
> 2) The second issue, occurring more often is that smtpd stopped
> responding or doing anything acutally. Sometimes the number of processes
> went to max (500) at other occurences it just stayed at like 30-50 but
> symptoms where still the same.
>

Again looks like a networking issue to me - however, the issue may well
be *caused* by a timer malfunction.

Timing is rather important for dependable TCP, after all.

> Looking at netstat all tcp connections to the smtpd processes went away
> after some time.We believe that clients trying to deliver email to us
> did disconnect due to reaching their timeout.
>

That means the symptom definitely surfaces in master(8), as that accepts
all connections.

> Here is a sniplet of log from before and after the issue:
> --- connect before and after lockup ---
> Oct 29 12:57:35 mailserver postfix/smtpd[12457]: connect from
> newsletter.xxx.de[xxx.162.xxx.28]
> Oct 29 13:29:02 mailserver postfix/smtpd[12457]: 826C611FA8:
> client=newsletter.xxx.de[xxx.162.xxx.28]
> Oct 29 13:29:03 mailserver postfix/smtpd[12457]: lost connection after
> RCPT from newsletter.xxx.de[xxx.162.xxx.28]
> Oct 29 13:29:03 mailserver postfix/smtpd[12457]: disconnect from
> newsletter.xxx.de[xxx.162.xxx.28]
> --- / connect before and after lockup ---
> There is not a single line of log from smtpd processes while being in
> the locked up state. At 12:57:35 there is a last connect from a client,
> no more log happens until we run "postfix reload" (! a reload is
> sufficient) about half an hour later. Then the last few lines show that
> the smtpd process is "alive" again talking to the now already
> disconnected client.
>
> Also the master process is still there but not accepting anymore
> connections until we issue "postfix reload". Here is an strace of the
> master process:
> --- strace on master-pid while locked up ---
> 10648 alarm(333) = 307
> 10648 epoll_wait(13,
> --- / strace on master-pid while locked up ---
>
>
> Another strange thing is, that some of the last log entries contain
> "UNKNOWN" instead of the IP:
> --- smtpd connects towards UNKNOWN ---
> Oct 29 13:05:02 mailserver postfix/smtpd[14196]: connect from
> unknown[unknown]
> Oct 29 13:05:02 mailserver postfix/smtpd[14196]: NOQUEUE: reject:
> CONNECT from unknown[unknown]: 550 5.7.1 Client host rejected: cannot
> find your reverse hostname, [unknown]; proto=SMTP
> Oct 29 13:05:02 mailserver postfix/smtpd[14196]: too many errors after
> CONNECT from unknown[unknown]
> Oct 29 13:05:02 mailserver postfix/smtpd[14196]: disconnect from
> unknown[unknown]
> --- / smtpd connects towards UNKNOWN ---
>
> Anvil also gives statistics about this obscure client-IP "UNKNOWN":
> --- anvil statistics for UNKNOWN ---
> Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max
> connection rate 92/60s for (smtp:unknown) at Oct 29 13:05:02
> Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max
> connection count 1 for (smtp:unknown) at Oct 29 13:05:02
> Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max cache
> size 413 at Oct 29 13:05:02
> --- / anvil statistics for UNKNOWN ---
>
> During the lock up we still see the *:25 listening socket of the master
> process. When we tried to connect ourselves we just waited forever for a
> connection or even a SMTP prompt.
>
> New incoming connections are in state "SYN_RECV":
> --- netstat while locked up---
> Proto Recv-Q Send-Q Local Address Foreign Address State
> tcp 0 0 x.x.x.143:25 xx.0.xx.217:58556 SYN_RECV
> tcp 0 0 x.x.x.143:25 xx.115.xx.166:46355 SYN_RECV
> tcp 0 0 x.x.x.143:25 xx.63.xx.197:37216 SYN_RECV
> .
> .
> .
> ---/ netstat while locked up---
>
>

It's a tossup between networking and timing, or network-related timing,
for all I can see.

I have to tell you we run a dozen Linux boxes on ESX 4.1, and we have
zero problems on any of them.

We sync the vmware hosts from NTP, and sync the local guest time from
the vmware tools - or sometimes also from NTP.

Both work fine.

> I'm hoping for suggestions on what to look for or check,
>

Configure reliable network time with ntp.
Works for me.

-- J.