A Recent Experience
Recently, during a POS (proof on site) exercise with a prospective customer, we had to run a test in which an email client would send mail to a large number of recipients through our cloud email setup, and we would capture the performance results. As regular practice, we set up the SMTP controls on our server to allow the test, ran it once from our own environment, and then asked the client to run it.
For the first time, we had enabled the SMTP scanning engines for their source IP to capture detailed information (which naturally slows down the mail flow). We found that the client could deliver a few mails, but would then give up: the progress bar would simply sit there without moving. The server logs showed no further connections from the client. As a first troubleshooting step, we removed the scanning controls and simplified the SMTP rules so that no checks were applied to their source IP address. Another round of tests gave similar results, though a few more mails went through. Throughout this phase, we could never send mail to all of their recipients. After a few mails, the system would simply do nothing and the client would eventually time out.
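A test like this is easy to instrument so that throttling shows up in the numbers rather than in a stuck progress bar. The sketch below is not the tool we used; it is a minimal, hypothetical harness that times each delivery and flags the point where per-message time suddenly jumps. The `send_one` callable and the threshold factor are assumptions for illustration:

```python
import time

def timed_bulk_send(send_one, recipients, stall_factor=5.0):
    """Call send_one(rcpt) once per recipient, timing each delivery.

    Returns (timings, stalled_at): timings is a list of
    (recipient, seconds) pairs, and stalled_at is the 1-based index of
    the first delivery that took more than stall_factor times the first
    one (or None). A sudden jump like that is a hint that something in
    the path -- a scanning engine, proxy, or transparent firewall -- is
    throttling the connection.
    """
    timings = []
    baseline = None
    stalled_at = None
    for rcpt in recipients:
        start = time.monotonic()
        send_one(rcpt)
        elapsed = time.monotonic() - start
        timings.append((rcpt, elapsed))
        if baseline is None:
            baseline = elapsed
        elif stalled_at is None and elapsed > stall_factor * max(baseline, 1e-6):
            stalled_at = len(timings)
    return timings, stalled_at

# Example wiring with smtplib (host and addresses are placeholders):
#
#   import smtplib
#   with smtplib.SMTP("mail.example.com", 25, timeout=30) as smtp:
#       send = lambda rcpt: smtp.sendmail("test@example.com", [rcpt], msg)
#       timings, stalled = timed_bulk_send(send, recipient_list)
```

Keeping the timing logic separate from the actual SMTP call also lets the same harness compare runs with and without the scanning engines enabled.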
On the face of it, all looked well in the client’s environment, since the client’s users were going about their business with no issues.
Without assuming anything, we performed the test from our office to rule out any server-side issues. Once that succeeded, we repeated the test from our environment with the client’s data, and that too went through successfully. Everything now pointed to the client’s environment!
There was obviously some firewall policy, proxy, or other transparent firewall in the network blocking the test over the given Internet link. On our request, the firewall policies were bypassed for connections to our servers, and the test went through successfully.
Real-Life Situations of Network Problems
Our help desk often receives tickets for such “intangible” network problems, which are difficult to troubleshoot because some element in the network is interfering with the normal flow. Clients find these issues hard to accept since, on the face of it, all seems well. Some real-life examples of issues we face:
- At one of our customer sites, the address book on the clients suddenly stopped working. Clients connect to the address book over LDAP port 389. We found that while telnet to the LDAP port worked fine from a random set of clients, the address book still could not reach the server over port 389. The culprit turned out to be a transparent firewall with a rate control.
- Several of our customers complain of duplicate mail. This typically happens when MS Outlook sends a mail but keeps it in the Outbox because it never receives a proper acknowledgment from the server. It then resends the mail, and may do so repeatedly until the transaction completes successfully. On the face of it, this looks like a server issue, while it is actually a network quality issue. It is difficult to prove. I’ve personally spent hours on the phone trying to convince customers to clean up their networks. One customer, after a lot of convincing, did some hygiene work on their network and the problem “magically” vanished.
- One of our customers complained that their remote outgoing mail queue was rising rapidly. We found that the Internet link’s capacity to relay mail had suddenly dropped: mails were going through, but very slowly, and hence the queues were rising. Apparently nothing had changed, as the ISP also confirmed. We were fairly convinced that an SMTP proxy with rate control or tarpit policies had been introduced somewhere in the path. To test our hypothesis, we routed the mail from our hosted servers over a different port (not port 25), and the mail flow became normal over the same Internet link. As of this writing, the ISP has still to acknowledge that there is an impediment in the path for port 25.
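In cases like the LDAP and port-25 incidents above, the rate-control hypothesis can be tested directly: open a burst of TCP connections to the suspect port and see whether the later attempts start failing or timing out while the first few succeed. The sketch below is a minimal, hypothetical probe along those lines; the host and port you point it at are up to you:

```python
import socket
import time

def probe_connect_rate(host, port, attempts=20, timeout=3.0):
    """Open TCP connections to (host, port) in quick succession.

    Returns a list of ("ok" | "fail", seconds) outcomes, one per
    attempt. A transparent firewall with a rate control typically lets
    the first few connections through and then drops or delays the
    rest, so a run of early successes followed by failures or slow
    connects is suspicious.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append(("ok", time.monotonic() - start))
        except OSError:
            results.append(("fail", time.monotonic() - start))
    return results
```

Running the same probe against two different ports on the same link (say, 25 and an alternate relay port) is one simple way to show a customer or ISP that the impediment is port-specific rather than a server problem.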
These and several more incidents show that problems in the network environment are challenging to troubleshoot and accept.
So, what next?
As a first step, the network must come into focus as an important part of the solution stack. More often than not, it is a hidden element, with apparently no issues.
We believe that while changes to the network infrastructure need to be better managed, since their impact is widespread, the need of the hour is to evolve simple troubleshooting steps that quickly zero in on the problem.