e1000_clean_tx_irq: Detected Tx Unit Hang

Since I do website hosting for customers, I have a couple of HP ProLiant DL380 servers at one of my locations.

These servers have a PCI-Express dual-NIC Intel card in them.  To be more exact, the card is an Intel Corporation 82546EB Gigabit Ethernet Controller.

When the servers received a request to download a large file, roughly 1 megabyte or larger (such as a large picture file), the download would stall on the visitor’s end and eventually time out.

After doing more research, I finally was able to see the error messages that were appearing in the server logs:

<time> <server> kernel: [1878984.981444] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
<time> <server> kernel: [1878984.981447]   Tx Queue             <0>
<time> <server> kernel: [1878984.981448]   TDH                  <b4>
<time> <server> kernel: [1878984.981449]   TDT                  <bf>
<time> <server> kernel: [1878984.981450]   next_to_use          <bf>
<time> <server> kernel: [1878984.981452]   next_to_clean        <ae>
<time> <server> kernel: [1878984.981453] buffer_info[next_to_clean]
<time> <server> kernel: [1878984.981454]   time_stamp           <10b32a4cb>
<time> <server> kernel: [1878984.981455]   next_to_watch        <b6>
<time> <server> kernel: [1878984.981456]   jiffies              <10b32a552>
<time> <server> kernel: [1878984.981457]   next_to_watch.status <0>
<time> <server> kernel: [1878986.981517] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
<time> <server> kernel: [1878986.981520]   Tx Queue             <0>
<time> <server> kernel: [1878986.981522]   TDH                  <b4>
<time> <server> kernel: [1878986.981523]   TDT                  <bf>
<time> <server> kernel: [1878986.981524]   next_to_use          <bf>
<time> <server> kernel: [1878986.981525]   next_to_clean        <ae>
<time> <server> kernel: [1878986.981527] buffer_info[next_to_clean]
<time> <server> kernel: [1878986.981528]   time_stamp           <10b32a4cb>
<time> <server> kernel: [1878986.981529]   next_to_watch        <b6>
<time> <server> kernel: [1878986.981531]   jiffies              <10b32a61a>
<time> <server> kernel: [1878986.981532]   next_to_watch.status <0>
<time> <server> kernel: [1878988.981591] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
<time> <server> kernel: [1878988.981594]   Tx Queue             <0>
<time> <server> kernel: [1878988.981595]   TDH                  <b4>
<time> <server> kernel: [1878988.981597]   TDT                  <bf>
<time> <server> kernel: [1878988.981598]   next_to_use          <bf>
<time> <server> kernel: [1878988.981599]   next_to_clean        <ae>
<time> <server> kernel: [1878988.981600] buffer_info[next_to_clean]
<time> <server> kernel: [1878988.981602]   time_stamp           <10b32a4cb>
<time> <server> kernel: [1878988.981603]   next_to_watch        <b6>
<time> <server> kernel: [1878988.981604]   jiffies              <10b32a6e2>
<time> <server> kernel: [1878988.981606]   next_to_watch.status <0>
<time> <server> kernel: [1878991.771623] e1000: eth2 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX

All of these errors occurred within a seven-second timeframe.  Sure, seven seconds doesn’t seem like a lot, but when these errors occurred, the servers looked like they were dead and wouldn’t respond to any queries across the Internet at all.
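To get a feel for how often the hang occurs on each interface, the log can be summarized with a few standard tools.  This is only a rough sketch: count_tx_hangs is a made-up helper name, and it assumes the log lines look like the excerpt above.

```shell
#!/bin/sh
# count_tx_hangs: read kernel log text on stdin and print one line per
# interface in the form "<interface> <number of hangs>".
# (Hypothetical helper; assumes the e1000 message format shown above.)
count_tx_hangs() {
    grep 'Detected Tx Unit Hang' |
        sed -n 's/.*e1000: \(eth[0-9][0-9]*\): .*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It could be fed from the kernel ring buffer with "dmesg | count_tx_hangs", or from a log file such as /var/log/kern.log (that path is an assumption; use whichever file your syslog writes kernel messages to).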

Both of the HP ProLiant DL380 servers had their own built-in Broadcom cards, but I had previously had issues with those cards and the tg3 driver as well, so I opted to buy what I thought was a very well-supported Intel card.  Ubuntu uses the e1000 driver (as noted above in the errors) for this card.

So far, after making the changes below and re-downloading the same large pictures from the web servers several times, everything seems to be working well.  Here is how I did it.

First, I needed to add an entry to the /etc/default/grub configuration file.  I added “pcie_aspm=off” to the following line:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"

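That is supposed to disable Active State Power Management (ASPM), the power management feature of PCI-Express links.  On Ubuntu, the edit only takes effect after you run “sudo update-grub” and reboot.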

After performing a restart, I tested the download of the pictures again – and it hung.  So this didn’t fix the issue by itself.
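Before ruling the kernel parameter in or out, it is worth confirming that it actually reached the running kernel.  A minimal sketch, assuming the parameter should appear in /proc/cmdline after the reboot; has_kernel_param is a hypothetical helper name:

```shell
#!/bin/sh
# has_kernel_param CMDLINE PARAM
# Succeeds if PARAM appears as a whole word in the kernel command line text.
# (Hypothetical helper, not part of any package.)
has_kernel_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}
```

On the server itself this would be called as: has_kernel_param "$(cat /proc/cmdline)" pcie_aspm=off && echo "pcie_aspm=off is active"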

The next step was to disable something called “TCP Segmentation Offload”, or TSO for short.  Many others indicated that disabling this alone would fix the problem, but I started with disabling the power management features first.

In order to disable TSO for your NIC, run the following for EACH Ethernet port (so if you have a dual-NIC card, you will have to run this command twice, changing the ‘ethX’ number each time):

ethtool -K ethX tso off

Remember to replace the “X” in “ethX” with your interface number.  For example, mine are “eth2” and “eth3” for my dual-NIC Intel card.

Now, confirm that TSO is disabled by running this command for each NIC (note the lowercase “-k”, which only displays the current settings):

ethtool -k ethX

You may have to put “sudo” in front of each command if you are not running as the root user.  After running that command, you should see that TSO is off.
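If you want to check the result from a script rather than by eye, the relevant line can be pulled out of the ethtool output.  This is a sketch only: tso_state is a made-up helper, and it assumes the line is labelled either “tcp-segmentation-offload:” (newer ethtool versions) or “tcp segmentation offload:” (older ones).

```shell
#!/bin/sh
# tso_state: read "ethtool -k <iface>" output on stdin and print "on" or "off".
# Prints nothing if no TSO line is found.
# (Hypothetical helper; accepts both the hyphenated and the spaced label.)
tso_state() {
    sed -n 's/^[[:space:]]*tcp[- ]segmentation[- ]offload:[[:space:]]*\([a-z]*\).*/\1/p' |
        head -n 1
}
```

For example: ethtool -k eth2 | tso_state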

Now, how can you set it so that TSO is automatically disabled upon startup?  Make an init script!  If you only run the command above by hand, it is effective just until the next reboot.

So, go to the /etc/init.d directory and make a new file – mine is called “disable-tso”.  Create the file and add one line for each NIC you need to disable TSO on.  This is a copy of my script:

#!/bin/sh
ethtool -K eth2 tso off
ethtool -K eth3 tso off

Of course, ensure you update “eth2” and “eth3” to match your own NIC names.
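The three-line script above is what I actually use.  Some versions of update-rc.d may complain about missing LSB information, so here is a slightly longer variant as a sketch.  The header values (runlevels, the $network dependency) and the IFACES variable are my assumptions, not something the e1000 driver requires; adjust them to your system.

```shell
#!/bin/sh
### BEGIN INIT INFO
# Provides:          disable-tso
# Required-Start:    $network
# Required-Stop:     $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Disable TCP Segmentation Offload on selected NICs
### END INIT INFO

# Interfaces to change (assumption: the two ports of the dual-NIC card).
IFACES="eth2 eth3"

# Turn TSO off on every listed interface; keep going if one of them fails.
disable_tso() {
    for iface in $IFACES; do
        ethtool -K "$iface" tso off ||
            echo "disable-tso: could not change $iface" >&2
    done
}

# Only the start-type actions do anything; every other action is a no-op.
case "${1:-}" in
    start|restart|force-reload) disable_tso ;;
esac
```

Keeping the interface names in one variable means adding a third port later is a one-word change.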

Now, you need to make the file executable, so run this from the /etc/init.d directory:

sudo chmod 755 disable-tso

And now, run this to update the startup process:

sudo update-rc.d disable-tso defaults
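To confirm that the script was registered, you can look for a start link in the runlevel directory.  A small sketch, assuming the links are named like S20disable-tso and that the default runlevel is 2 (check yours with the “runlevel” command); has_start_link is a hypothetical helper name:

```shell
#!/bin/sh
# has_start_link DIR NAME
# Succeeds if DIR contains a start link such as S20NAME.
# (Hypothetical helper, not part of any package.)
has_start_link() {
    for link in "$1"/S[0-9][0-9]"$2"; do
        # If the pattern matched nothing, $link is the literal pattern
        # and neither test below succeeds.
        if [ -e "$link" ] || [ -L "$link" ]; then
            return 0
        fi
    done
    return 1
}
```

For example: has_start_link /etc/rc2.d disable-tso && echo "disable-tso will run at boot"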

And you are done!  Each time the server reboots, it will disable TSO upon startup.  Hope this is useful to others who have the Intel Corporation 82546EB Gigabit Ethernet Controller card in their servers and use the e1000 driver in Ubuntu.