Resolved
For the past few months our WAN link has been dropping out. It had worked without issue for about 2-3 years before that, so I am wondering whether a kernel update has included a flaky driver. I've swapped the cable and that does not fix the issue. It sometimes recovers by itself, but usually I am too impatient, so I log on and run ifdown/ifup to restore it. Occasionally I can no longer log on via the LAN and have to do a cold boot.

The issue seems to be worse in the afternoons, so there may be a thermal factor as the office warms up. We are still working and it's not a sweatshop, so by "warm up" I only mean a couple of degrees, into the mid to high 20s outside the box, which is well ventilated and dust free. Perhaps that points to a card on its way out rather than a driver issue per se. I see that several years ago the r8168 driver needed patching, but I see no reports of people having issues with the r8169, other than those who also have an 8168.

System details are:
kernel
3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

lspci
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetLink BCM57780 Gigabit Ethernet PCIe (rev 01)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8169 PCI Gigabit Ethernet Controller (rev 10)

The Broadcom is on the motherboard and feeds the LAN.
The Realtek is a PCI card and feeds the WAN. The motherboard is old and does not support PCIe.

ifconfig
enp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.9.253 netmask 255.255.255.0 broadcast 192.168.9.255
inet6 fe80::223:cdff:feb0:f7f2 prefixlen 64 scopeid 0x20<link>
ether 00:23:cd:b0:f7:f2 txqueuelen 1000 (Ethernet)
RX packets 12849331 bytes 13405715567 (12.4 GiB)
RX errors 14 dropped 4486 overruns 0 frame 54
TX packets 8281056 bytes 2171991964 (2.0 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

p128p1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.10.254 netmask 255.255.255.0 broadcast 192.168.10.255
inet6 fe80::a6ba:dbff:fefd:d23a prefixlen 64 scopeid 0x20<link>
ether a4:ba:db:fd:d2:3a txqueuelen 1000 (Ethernet)
RX packets 8784127 bytes 2287276371 (2.1 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 12852918 bytes 13445835171 (12.5 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 16
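
For anyone wanting to watch these counters over time, here is a minimal sketch that pulls the RX error figures out of ifconfig output so successive readings can be compared. The function name rx_error_summary is made up for illustration, and the field positions assume the ifconfig layout shown above.

```shell
# Hypothetical helper: summarise RX error counters per interface.
# Reads ifconfig output on stdin; assumes the layout quoted in this post.
rx_error_summary() {
    awk '
        # Interface header line: name with trailing colon, then flags=...
        $2 ~ /^flags=/ { iface = $1; sub(/:$/, "", iface) }
        # Counter line: RX errors N dropped N overruns N frame N
        $1 == "RX" && $2 == "errors" {
            print iface, "errors=" $3, "dropped=" $5, "frame=" $9
        }
    '
}
```

Run as `ifconfig | rx_error_summary`. With the figures above it should print one line for enp3s0 (errors=14 dropped=4486 frame=54) and one for p128p1 (all zero), which makes it easy to see whether the WAN counters climb around the time of a dropout.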

Any thoughts or others with similar issues?
Cheers
Kim
Thursday, October 07 2021, 02:45 AM
Responses (13)
  • Accepted Answer

    Thursday, October 14 2021, 06:25 AM - #Permalink
    As an FYI for anyone else with this issue: changing to the kmod driver did not fix the problem, so I've reverted to the kernel driver.

    I have another NIC, so I'll try swapping it in and hope that, by putting it in the same slot as the current one, it gets recognised and ends up with the same ID so I don't have to tweak the config too much.

    Cheers
    Kim
  • Accepted Answer

    Saturday, October 09 2021, 10:22 AM - #Permalink
    Good work! Thanks Nick
  • Accepted Answer

    Saturday, October 09 2021, 07:35 AM - #Permalink
    I maintain kmod-r8168 and kmod-r8169 in ClearOS; all the other kmod drivers have to be installed from ElRepo. The reason I maintain these two is that the ElRepo kmod-r8168 explicitly blacklists the r8169 driver, which seems pointless to me as their kmod-r8169 driver does not support the r8168 cards. It meant that, on my old server which had both cards, I was stuffed until I found a couple of workarounds. The easiest is to comment out the blacklisting in the r8168 driver, and I now do this when I create the RPM. It does mean that, for safety, if you have an r8168 card you should install both drivers in case you ever get an r8169 card. (And possibly the other way round, but you rarely install the r8169 as, up to now, I have not heard of problems with the kernel version.) The r8169 driver is unmodified.

    Really what I should do is make each driver require the other, but I've never got that far.

    And to revert, if you want to:
    yum remove kmod-r8169
    And reboot.
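
To check which r8169 module will be picked up after installing or removing the package, something like the following may help. It is only a sketch based on the two modinfo listings quoted further down this thread: the kernel driver lives under .../kernel/drivers/ and reports "intree: Y", while the kmod one sits under weak-updates and has no intree line. The function name is made up, and other kmod packages may install to different paths.

```shell
# Hypothetical helper: classify modinfo output (on stdin) as the in-kernel
# driver or a kmod package, using the filename and intree fields only.
classify_r8169() {
    awk '
        $1 == "filename:"            { path = $2 }
        $1 == "intree:" && $2 == "Y" { intree = 1 }
        END {
            if (path == "")
                print "unknown"        # no filename field seen at all
            else if (intree || path ~ /\/kernel\/drivers\//)
                print "kernel"         # module shipped with the kernel
            else
                print "kmod"           # e.g. installed under weak-updates
        }
    '
}
```

Run as `modinfo r8169 | classify_r8169` after the reboot; it prints "kernel", "kmod" or "unknown".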
  • Accepted Answer

    Friday, October 08 2021, 11:02 PM - #Permalink
    Great thanks Nick.

    Too easy! There goes my excuse for procrastinating.

    I presume the disable process is just a yum purge kmod-r8169 + reboot, or whatever yum's remove-and-clean keyword is? I live in an Ubuntu world so I'm more used to apt, but I do have one CentOS machine which I haven't started in a while and from which I've never had to uninstall any apps.
  • Accepted Answer

    Friday, October 08 2021, 12:35 PM - #Permalink
    yum install kmod-r8169
    and reboot.

    On an RTL8111/8168/8411 system:
    yum install 'kmod-r816*'
    and reboot.
  • Accepted Answer

    Friday, October 08 2021, 12:14 PM - #Permalink
    Thanks Nick

    I'll read up on installing the kmod driver and see if anything changes. It may be a week or so before I get a chance to sort it out and report back as I'm flat strap at the moment.

    Cheers
    Kim
  • Accepted Answer

    Friday, October 08 2021, 11:03 AM - #Permalink
    The clearsyncd messages are probably a red herring. I have asked the devs about these messages before. I get them like you, but at different intervals:
    [root@server ~]# grep 'System Events: Socket hang-up' /var/log/messages
    Oct 3 05:00:10 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 3 07:28:02 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 3 08:49:57 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 3 08:49:57 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 4 05:00:10 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 4 07:28:02 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 4 08:33:48 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 4 08:33:49 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 5 05:00:09 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 5 07:28:02 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 5 08:09:23 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 5 08:09:23 server clearsyncd[32135]: System Events: Socket hang-up: 36
    Oct 5 18:17:45 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 5 18:17:45 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 6 05:00:10 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 6 07:28:02 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 6 19:47:45 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 6 19:47:45 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 7 05:00:08 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 7 07:28:02 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 7 09:50:39 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 7 09:50:39 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 7 17:52:35 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 7 17:52:35 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 8 05:00:11 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 8 07:28:02 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 8 10:13:49 server clearsyncd[28543]: System Events: Socket hang-up: 36
    Oct 8 10:13:49 server clearsyncd[28543]: System Events: Socket hang-up: 36
    I tried checking back on the last few and they did not tie back to any other normally logged messages. I do get two regular ones at 05:00 and 07:28. The 07:28 seems to tie back to a cron job which does a "/usr/sbin/clearcenter-checkin". The ones at 05:00 are not so obvious as there are about 7 cron jobs which fire then.
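
To spot the regular ones, one option is to count on how many distinct days each HH:MM shows up. This is a rough sketch, assuming the usual syslog layout seen above (month, day and time as the first three fields); the function name is just for illustration.

```shell
# Hypothetical helper: for "Socket hang-up" lines on stdin, print how many
# distinct days each HH:MM occurs on, most frequent first.
hangup_minutes() {
    grep 'System Events: Socket hang-up' |
    awk '
        {
            split($3, t, ":")          # $3 is HH:MM:SS
            key = t[1] ":" t[2]        # keep HH:MM only
            day = $1 " " $2            # month and day of month
            if (!((key, day) in seen)) {
                seen[key, day] = 1     # count each minute once per day
                count[key]++
            }
        }
        END { for (k in count) print count[k], k }
    ' |
    sort -k1,1nr -k2,2
}
```

Run as `hangup_minutes < /var/log/messages`. On the log above, 05:00 and 07:28 come out on top with 6 days each and everything else appears on a single day, which matches the two cron-driven times.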
  • Accepted Answer

    Thursday, October 07 2021, 10:04 AM - #Permalink
    Nick

    Thanks.

    I'll see how I go.

    Another possible red herring: one of the other units in this complex uses the same ISP as me and is also reporting dropouts. However, their dropouts occur at the NBN box (the interface between the copper in the building and the optocoupler out in the street feeding the fibre) and are much more frequent than mine. Is it possible that the link to the outside is dropping, perhaps only momentarily, after it leaves our internal network, and that ClearOS is not bringing the NIC back up properly after the failure? ifup enp3s0 does fix it.

    If so, is there a switch to make it more persistent, i.e. attempt a restart more often or for longer before giving up? I'd guess that if that were the case I'd see a string of failures in the logs, but all I generally see is a single entry:
    Oct 3 14:56:04 exgperfw01 clearsyncd[1257]: System Events: Socket hang-up: 31
    Oct 4 13:26:00 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 31
    Oct 4 13:26:00 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 31
    Oct 4 14:56:03 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 31
    Oct 5 14:56:03 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 31
    Oct 5 15:20:09 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 32
    Oct 5 15:20:09 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 32
    Oct 5 15:21:50 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 31
    Oct 5 15:21:51 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 31
    Oct 5 15:22:01 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 31
    Oct 5 15:22:02 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 31
    Oct 6 14:56:04 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 31
    Oct 7 14:56:03 exgperfw01 clearsyncd[1152]: System Events: Socket hang-up: 31

    I didn't notice the dropout today, so presumably it fixed itself. Most of the others needed a hand. I've just noticed that 14:56 seems to be a bad time each day.
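
As a stop-gap while the cause is unknown, the manual ifdown/ifup could be scripted. This is only a sketch and not a ClearOS feature: the function name, the IFDOWN/IFUP/RETRY_DELAY overrides and the probe target are all assumptions to adapt.

```shell
# Hypothetical watchdog: restart an interface when a probe keeps failing.
# Usage: bounce_if_down IFACE TRIES PROBE_COMMAND [ARGS...]
# IFDOWN, IFUP and RETRY_DELAY are made-up overrides for testing and tuning.
bounce_if_down() {
    iface=$1
    tries=$2
    shift 2
    attempt=0
    while [ "$attempt" -lt "$tries" ]; do
        # Any successful probe means the link is fine; leave it alone.
        if "$@" >/dev/null 2>&1; then
            echo "link ok"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"
    done
    # Every probe failed: do what is currently done by hand.
    echo "probe failed $tries times, restarting $iface"
    "${IFDOWN:-ifdown}" "$iface"
    "${IFUP:-ifup}" "$iface"
}
```

It could be run from cron as, for example, `bounce_if_down enp3s0 3 ping -c 1 -W 2 ADDRESS`, where ADDRESS is some host beyond the WAN link. Requiring several failed probes in a row avoids bouncing the interface on a single lost ping.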

    Cheers
    Kim
  • Accepted Answer

    Thursday, October 07 2021, 09:45 AM - #Permalink
    Nick

    Sorry my printout was so close to yours that I didn't think to include it.

    filename: /lib/modules/3.10.0-1160.42.2.el7.x86_64/kernel/drivers/net/ethernet/realtek/r8169.ko.xz
    firmware: rtl_nic/rtl8107e-2.fw
    firmware: rtl_nic/rtl8107e-1.fw
    firmware: rtl_nic/rtl8168h-2.fw
    firmware: rtl_nic/rtl8168h-1.fw
    firmware: rtl_nic/rtl8168g-3.fw
    firmware: rtl_nic/rtl8168g-2.fw
    firmware: rtl_nic/rtl8106e-2.fw
    firmware: rtl_nic/rtl8106e-1.fw
    firmware: rtl_nic/rtl8411-2.fw
    firmware: rtl_nic/rtl8411-1.fw
    firmware: rtl_nic/rtl8402-1.fw
    firmware: rtl_nic/rtl8168f-2.fw
    firmware: rtl_nic/rtl8168f-1.fw
    firmware: rtl_nic/rtl8105e-1.fw
    firmware: rtl_nic/rtl8168e-3.fw
    firmware: rtl_nic/rtl8168e-2.fw
    firmware: rtl_nic/rtl8168e-1.fw
    firmware: rtl_nic/rtl8168d-2.fw
    firmware: rtl_nic/rtl8168d-1.fw
    license: GPL
    softdep: pre: realtek
    description: RealTek RTL-8169 Gigabit Ethernet driver
    author: Realtek and the Linux r8169 crew <netdev@vger.kernel.org>
    retpoline: Y
    rhelversion: 7.9
    srcversion: 886F7AAD6F5FCB3A32A400E
    alias: pci:v00000001d00008168sv*sd00002410bc*sc*i*
    alias: pci:v00001737d00001032sv*sd00000024bc*sc*i*
    alias: pci:v000016ECd00000116sv*sd*bc*sc*i*
    alias: pci:v00001259d0000C107sv*sd*bc*sc*i*
    alias: pci:v00001186d00004302sv*sd*bc*sc*i*
    alias: pci:v00001186d00004300sv*sd*bc*sc*i*
    alias: pci:v00001186d00004300sv00001186sd00004B10bc*sc*i*
    alias: pci:v000010ECd00008169sv*sd*bc*sc*i*
    alias: pci:v000010FFd00008168sv*sd*bc*sc*i*
    alias: pci:v000010ECd00008168sv*sd*bc*sc*i*
    alias: pci:v000010ECd00008167sv*sd*bc*sc*i*
    alias: pci:v000010ECd00008161sv*sd*bc*sc*i*
    alias: pci:v000010ECd00008136sv*sd*bc*sc*i*
    alias: pci:v000010ECd00008129sv*sd*bc*sc*i*
    alias: pci:v000010ECd00002600sv*sd*bc*sc*i*
    alias: pci:v000010ECd00002502sv*sd*bc*sc*i*
    depends:
    intree: Y
    vermagic: 3.10.0-1160.42.2.el7.x86_64 SMP mod_unload modversions
    signer: CentOS Linux kernel signing key
    sig_key: 28:FD:E6:60:84:9F:DF:48:DE:A9:1B:48:B8:0B:17:B5:6C:E1:51:98
    sig_hashalgo: sha256
    parm: debug:Debug verbosity level (0=none, ..., 16=all) (int)

    Cheers
    Kim
  • Accepted Answer

    Thursday, October 07 2021, 09:44 AM - #Permalink
    Nick

    Sorry my printout was so close to yours that I didn't think to include it.

    Modinfo.txt hopefully attached

    Cheers
    Kim
  • Accepted Answer

    Thursday, October 07 2021, 09:33 AM - #Permalink
    Try the kmod version. If you still have problems, I can request an update from ElRepo. There is not much chance of Red Hat doing anything unless an issue is raised with them. I also don't know how to tell whether they have changed anything short of looking at their kernel sources, and I've no idea where to even find those.

    [edit]
    And remember to reboot after adding the new driver.
    [/edit]
  • Accepted Answer

    Thursday, October 07 2021, 09:21 AM - #Permalink
    Nick

    I'm using the kernel version.

    Should I try the kmod or wait until ElRepo have had a look at it?

    Of course after dropping out at least once every day this week, after posting this this morning and leaving an ssh session open, it has behaved properly. Maybe I should just lean an axe against the box to keep it on its toes:-0


    Cheers
    Kim
  • Accepted Answer

    Thursday, October 07 2021, 09:08 AM - #Permalink
    If you have not installed kmod-r8169, then you are running the version supplied in the kernel, and I have no idea if any updates have been made to that. If you use the kmod package, it has not been updated since 2018. What do you get from "modinfo r8169"? The kmod version gives:
    [root@server ~]# modinfo r8169
    filename: /lib/modules/3.10.0-1160.41.1.el7.x86_64/weak-updates/r8169/r8169.ko
    version: 6.020.00-NAPI
    license: GPL
    retpoline: Y
    rhelversion: 7.7
    srcversion: 1464E6100DEC614BF7DE95E
    alias: pci:v00001186d00004302sv*sd*bc*sc*i*
    alias: pci:v00001186d00004300sv*sd*bc*sc*i*
    alias: pci:v00001186d00004300sv00001186sd00004C00bc*sc*i*
    alias: pci:v00001186d00004302sv00001186sd00004302bc*sc*i*
    alias: pci:v00001186d00004300sv00001186sd00004300bc*sc*i*
    alias: pci:v000010ECd00008169sv*sd*bc*sc*i*
    alias: pci:v000010ECd00008167sv*sd*bc*sc*i*
    depends:
    vermagic: 3.10.0-1062.1.1.el7.x86_64 SMP mod_unload modversions
    parm: rx_copybreak:Copy breakpoint for copy-only-tiny-frames (int)
    parm: use_dac:Enable PCI DAC. Unsafe on 32 bit PCI slot. (int)
    parm: debug:Debug verbosity level (0=none, ..., 16=all) (int)
    and the kernel version gives:
    [root@microserver ~]# modinfo r8169
    filename: /lib/modules/3.10.0-1160.42.2.el7.x86_64/kernel/drivers/net/ethernet/realtek/r8169.ko.xz
    firmware: rtl_nic/rtl8107e-2.fw
    firmware: rtl_nic/rtl8107e-1.fw
    firmware: rtl_nic/rtl8168h-2.fw
    firmware: rtl_nic/rtl8168h-1.fw
    firmware: rtl_nic/rtl8168g-3.fw
    firmware: rtl_nic/rtl8168g-2.fw
    firmware: rtl_nic/rtl8106e-2.fw
    firmware: rtl_nic/rtl8106e-1.fw
    firmware: rtl_nic/rtl8411-2.fw
    firmware: rtl_nic/rtl8411-1.fw
    firmware: rtl_nic/rtl8402-1.fw
    firmware: rtl_nic/rtl8168f-2.fw
    firmware: rtl_nic/rtl8168f-1.fw
    firmware: rtl_nic/rtl8105e-1.fw
    firmware: rtl_nic/rtl8168e-3.fw
    firmware: rtl_nic/rtl8168e-2.fw
    firmware: rtl_nic/rtl8168e-1.fw
    firmware: rtl_nic/rtl8168d-2.fw
    firmware: rtl_nic/rtl8168d-1.fw
    license: GPL
    softdep: pre: realtek
    description: RealTek RTL-8169 Gigabit Ethernet driver
    author: Realtek and the Linux r8169 crew <netdev@vger.kernel.org>
    retpoline: Y
    rhelversion: 7.9
    srcversion: 886F7AAD6F5FCB3A32A400E
    alias: pci:v00000001d00008168sv*sd00002410bc*sc*i*
    alias: pci:v00001737d00001032sv*sd00000024bc*sc*i*
    alias: pci:v000016ECd00000116sv*sd*bc*sc*i*
    alias: pci:v00001259d0000C107sv*sd*bc*sc*i*
    alias: pci:v00001186d00004302sv*sd*bc*sc*i*
    alias: pci:v00001186d00004300sv*sd*bc*sc*i*
    alias: pci:v00001186d00004300sv00001186sd00004B10bc*sc*i*
    alias: pci:v000010ECd00008169sv*sd*bc*sc*i*
    alias: pci:v000010FFd00008168sv*sd*bc*sc*i*
    alias: pci:v000010ECd00008168sv*sd*bc*sc*i*
    alias: pci:v000010ECd00008167sv*sd*bc*sc*i*
    alias: pci:v000010ECd00008161sv*sd*bc*sc*i*
    alias: pci:v000010ECd00008136sv*sd*bc*sc*i*
    alias: pci:v000010ECd00008129sv*sd*bc*sc*i*
    alias: pci:v000010ECd00002600sv*sd*bc*sc*i*
    alias: pci:v000010ECd00002502sv*sd*bc*sc*i*
    depends:
    intree: Y
    vermagic: 3.10.0-1160.42.2.el7.x86_64 SMP mod_unload modversions
    signer: CentOS Linux kernel signing key
    sig_key: 28:FD:E6:60:84:9F:DF:48:DE:A9:1B:48:B8:0B:17:B5:6C:E1:51:98
    sig_hashalgo: sha256
    parm: debug:Debug verbosity level (0=none, ..., 16=all) (int)


    On systems with the RTL8111/8168/8411 NIC I always recommend installing the kmod-r8168 and kmod-r8169 drivers. You can try the kmod-r8169 driver on your system if you don't have it already.

    If you are using the kmod driver you can try removing it and you should revert to the kernel driver after a reboot.

    I also note that the kmod one is well out of date. If you are using it, please let me know and I'll ask the ElRepo people to update it.