Hi all,

My ClearOS system hangs sometimes. I can't get access via ssh and the console screen is blank.
I used to just reboot it and forget about it.

Now it occurs more frequently, and today it happened twice in one day.
Please find below the /var/log/messages excerpts that I used to work out the time frame:
Mar 10 16:36:02 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:36:02 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:36:02 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:36:02 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:36:02 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:36:02 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:36:02 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:36:02 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:36:02 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:36:02 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:36:02 home rsyslogd: action 'action 1' suspended, next retry is Sun Mar 10 16:36:32 2019 [v8.24.0 try http://www.rsyslog.com/e/2007 ]
Mar 10 16:36:02 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:36:02 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:36:04 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:36:04 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:36:06 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:36:06 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:37:06 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:37:06 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:37:08 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:37:08 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:37:10 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:37:10 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:38:10 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:38:10 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:38:12 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:38:12 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:38:14 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:38:14 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 16:38:32 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:38:32 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:38:32 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:38:32 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:38:32 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:38:32 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:38:32 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:38:32 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:38:32 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:38:32 home rsyslogd: action 'action 1' resumed (module 'builtin:omfile') [v8.24.0 try http://www.rsyslog.com/e/2359 ]
Mar 10 16:38:32 home rsyslogd: action 'action 1' suspended, next retry is Sun Mar 10 16:39:02 2019 [v8.24.0 try http://www.rsyslog.com/e/2007 ]
Mar 10 18:12:27 home journal: Runtime journal is using 8.0M (max allowed 4.0G, trying to leave 4.0G free of 62.9G available → current limit 4.0G).
Mar 10 18:12:27 home kernel: Initializing cgroup subsys cpuset
Mar 10 18:12:27 home kernel: Initializing cgroup subsys cpu
Mar 10 18:12:27 home kernel: Initializing cgroup subsys cpuacct
Mar 10 18:12:27 home kernel: Linux version 3.10.0-862.11.6.v7.x86_64 (mockbuild@build64-1.clearsdn.local) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28$
Mar 10 18:12:27 home kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-862.11.6.v7.x86_64 root=/dev/mapper/clearos_home-root ro rd.lvm.lv=clearos_home/r$
Mar 10 18:12:27 home kernel: e820: BIOS-provided physical RAM map:
Mar 10 18:12:27 home kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009bfff] usable
Mar 10 18:12:27 home kernel: BIOS-e820: [mem 0x000000000009c000-0x000000000009ffff] reserved
Mar 10 18:12:27 home kernel: BIOS-e820: [mem 0x00000000000e6000-0x00000000000fffff] reserved
Mar 10 18:12:27 home kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000dfe8ffff] usable
Mar 10 18:12:27 home kernel: BIOS-e820: [mem 0x00000000dfe9e000-0x00000000dfe9ffff] type 9
Mar 10 18:12:27 home kernel: BIOS-e820: [mem 0x00000000dfea0000-0x00000000dfeb1fff] ACPI data
Mar 10 18:12:27 home kernel: BIOS-e820: [mem 0x00000000dfeb2000-0x00000000dfedffff] ACPI NVS


And

Mar 10 19:58:34 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 19:59:02 home systemd: Removed slice User Slice of cacti.
Mar 10 19:59:02 home systemd: Stopping User Slice of cacti.
Mar 10 19:59:06 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:06 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:07 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:07 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:08 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:08 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:17 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:17 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:18 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:18 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:19 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:19 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:27 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:27 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:28 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:28 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:29 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:29 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f
Mar 10 19:59:34 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 19:59:34 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c
Mar 10 20:46:44 home journal: Runtime journal is using 8.0M (max allowed 4.0G, trying to leave 4.0G free of 62.9G available → current limit 4.0G).
Mar 10 20:46:44 home kernel: Initializing cgroup subsys cpuset
Mar 10 20:46:44 home kernel: Initializing cgroup subsys cpu
Mar 10 20:46:44 home kernel: Initializing cgroup subsys cpuacct
Mar 10 20:46:44 home kernel: Linux version 3.10.0-862.11.6.v7.x86_64 (mockbuild@build64-1.clearsdn.local) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28$
Mar 10 20:46:44 home kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.0-862.11.6.v7.x86_64 root=/dev/mapper/clearos_home-root ro rd.lvm.lv=clearos_home/r$
Mar 10 20:46:44 home kernel: e820: BIOS-provided physical RAM map:
Mar 10 20:46:44 home kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009bfff] usable
Mar 10 20:46:44 home kernel: BIOS-e820: [mem 0x000000000009c000-0x000000000009ffff] reserved
Mar 10 20:46:44 home kernel: BIOS-e820: [mem 0x00000000000e6000-0x00000000000fffff] reserved
Mar 10 20:46:44 home kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000dfe8ffff] usable
Mar 10 20:46:44 home kernel: BIOS-e820: [mem 0x00000000dfe9e000-0x00000000dfe9ffff] type 9
Mar 10 20:46:44 home kernel: BIOS-e820: [mem 0x00000000dfea0000-0x00000000dfeb1fff] ACPI data
Mar 10 20:46:44 home kernel: BIOS-e820: [mem 0x00000000dfeb2000-0x00000000dfedffff] ACPI NVS
Mar 10 20:46:44 home kernel: BIOS-e820: [mem 0x00000000dfee0000-0x00000000efffffff] reserved
Mar 10 20:46:44 home kernel: BIOS-e820: [mem 0x00000000ffe00000-0x00000000ffffffff] reserved
Mar 10 20:46:44 home kernel: BIOS-e820: [mem 0x0000000100000000-0x000000201effffff] usable
Mar 10 20:46:44 home kernel: NX (Execute Disable) protection: active
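
The two excerpts above can be read mechanically: each boot shows the kernel line "Initializing cgroup subsys cpuset", and the last timestamp before that boot gives a lower bound on when the hang started. A minimal Python sketch of that idea follows. The function names, the sample lines and the hard-coded year are assumptions for illustration only (syslog lines in this format do not carry a year), and the result only tells you when logging stopped, not why.

```python
from datetime import datetime

# Kernel line that appears at the start of every boot in the excerpts above.
BOOT_MARKER = "kernel: Initializing cgroup subsys cpuset"


def parse_ts(line, year=2019):
    """Parse the 'Mar 10 16:38:32' prefix of a syslog line (year is assumed)."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_outages(lines, year=2019):
    """Return (last_timestamp_before_boot, boot_timestamp) for each boot marker."""
    outages = []
    earlier = None  # latest timestamp strictly older than `current`
    current = None  # timestamp of the most recent parsed line
    for line in lines:
        try:
            ts = parse_ts(line, year)
        except ValueError:
            continue  # skip blank or malformed lines
        if ts != current:
            earlier, current = current, ts
        if BOOT_MARKER in line and earlier is not None:
            outages.append((earlier, ts))
    return outages


# Three lines taken from the first excerpt, shortened for the example.
SAMPLE = [
    "Mar 10 16:38:32 home rsyslogd: action 'action 1' suspended, next retry is Sun Mar 10 16:39:02 2019",
    "Mar 10 18:12:27 home journal: Runtime journal is using 8.0M",
    "Mar 10 18:12:27 home kernel: Initializing cgroup subsys cpuset",
]

for last_seen, booted in find_outages(SAMPLE):
    print(f"last log entry {last_seen}, next boot {booted}, silent for {booted - last_seen}")
```

To run it on the real file, pass the lines of /var/log/messages (for example via `open(...)`) instead of `SAMPLE`.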


I've checked the RAM and the disks without finding anything.

Any suggestion is welcome...

Taryck.
Sunday, March 10 2019, 08:19 PM
Responses (17)
  • Accepted Answer

    Friday, March 15 2019, 07:59 PM
    Thanks for the tip. I'll try it.
  • Accepted Answer

    Thursday, March 14 2019, 08:15 AM
    I got fed up with the flip-flop messages and I junk them. I think they happen when a machine is powered on and gets its IP address by DHCP. Arpwatch initially sees a NIC with an address of 0.0.0.0 then its proper IP address. The proper address is the last seen address and arpwatch is reporting that it has flipped from one to the other. You may also get the message when you first power on as it flips from a proper address back to 0.0.0.0.

    From the arpwatch manual:
    flip flop
    The ethernet address has changed from the most recently seen address to the second most recently seen address. (If either the old or new ethernet address is a DECnet address and it is less than 24 hours, the email version of the report is suppressed.)


    I have a file I created, /etc/rsyslog.d/_messages-filter.conf (it can be called anything_you_like.conf) where I put my custom filters. I have one for arpwatch:
    # arpwatch flip flop messages
    if ($programname == 'arpwatch' and $msg contains '0.0.0.0') then stop
    Restart rsyslog and the messages are gone. ;)
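
Before restarting rsyslog, it can help to preview which lines such a filter would discard. The sketch below only approximates the rule above in Python (program name equals arpwatch and the message contains 0.0.0.0); the regular expression and the function name are assumptions for illustration, not part of rsyslog.

```python
import re

# 'Mar 10 16:36:02 home arpwatch: bogon 0.0.0.0 ...' -> program name and message text
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s[\d:]{8}\s\S+\s(?P<prog>[^\s:\[]+)(?:\[\d+\])?:(?P<msg>.*)$"
)


def would_be_dropped(line):
    """Approximate: $programname == 'arpwatch' and $msg contains '0.0.0.0'."""
    match = LINE_RE.match(line)
    return bool(match) and match["prog"] == "arpwatch" and "0.0.0.0" in match["msg"]


# Lines copied from the /var/log/messages excerpts in the first post.
SAMPLE = [
    "Mar 10 16:36:02 home arpwatch: bogon 0.0.0.0 f0:82:61:86:25:5c",
    "Mar 10 19:59:06 home arpwatch: bogon 10.0.0.91 d8:cb:8a:cb:1a:6f",
    "Mar 10 19:59:02 home systemd: Removed slice User Slice of cacti.",
]

for line in SAMPLE:
    print("drop" if would_be_dropped(line) else "keep", "|", line)
```

Note that with this condition the "bogon 10.0.0.91" lines from the first post would still be logged; only the 0.0.0.0 ones are discarded.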
  • Accepted Answer

    Thursday, March 14 2019, 07:39 AM
    For the flip flop, maybe this error message on my ISP box is part of the reason:
    Les périphériques 0c:c4:7a:33:07:8b et 0c:c4:7a:33:07:8a utilisent la même adresse IP : 10.0.0.142.
    (Translation: devices 0c:c4:7a:33:07:8b and 0c:c4:7a:33:07:8a are using the same IP address: 10.0.0.142.)
  • Accepted Answer

    Wednesday, March 13 2019, 08:03 PM
    Hi,

    "/scripts/watch-cpu-temp.sh" is a script I wrote to watch the CPU temperatures and shut down the system if they go above a specified value.

    For the flip flop I understand.

    I need the 2 NICs/IPs as I've got some services delivered only on the local 10.0.0.142 and some services only on the remote 10.0.0.137.
    So my internet access grants access to 10.0.0.137 but not to 10.0.0.142.

    Here is the current configuration:
    10.0.0.137 255.255.255.252
    10.0.0.142 255.255.0.0
    So there is an overlap. Maybe that's what I should remove...

    I've had gateway errors in the past; that's why I need to define one NIC as External.

    But I understand that distinct subnets could be:
    10.0.0.137 255.255.255.252
    10.0.1.142 255.255.255.0

    And DHCP : 10.0.1.0 => 10.0.1.100 mask 255.255.0.0
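
The overlap described above can be checked with Python's standard ipaddress module. This is only a sketch using the addresses and masks quoted in this post; the helper name network_of is made up for the example.

```python
import ipaddress


def network_of(ip, mask):
    """Return the network that an address/netmask pair belongs to."""
    return ipaddress.ip_interface(f"{ip}/{mask}").network


current_wan = network_of("10.0.0.137", "255.255.255.252")   # 10.0.0.136/30
current_lan = network_of("10.0.0.142", "255.255.0.0")       # 10.0.0.0/16
proposed_lan = network_of("10.0.1.142", "255.255.255.0")    # 10.0.1.0/24
dhcp_as_written = network_of("10.0.1.0", "255.255.0.0")     # 10.0.0.0/16

print("current :", current_wan, current_lan, current_wan.overlaps(current_lan))    # True
print("proposed:", current_wan, proposed_lan, current_wan.overlaps(proposed_lan))  # False
print("dhcp    :", current_wan, dhcp_as_written, current_wan.overlaps(dhcp_as_written))  # True
```

The proposed static addresses no longer overlap. A DHCP scope handed out with mask 255.255.0.0, however, would again describe 10.0.0.0/16 and so cover the external subnet; a 255.255.255.0 mask there would presumably match the proposed LAN better.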

    For the water cooling, too much additive blocks the heat exchanger and too little leads to bacterial growth...
    I've given up. Thanks anyway.
  • Accepted Answer

    Wednesday, March 13 2019, 08:58 AM
    As suspected flip-flop is a result of both interfaces connected to the same subnet - don't do it as Nick indicates... (bridge would be the exception).

    The bacteria growth in water cooling systems is a result of mismanagement. There are various anti-fungal additives available that prevent this, as well as corrosion inhibitors to reduce hardware damage. The world is full of water cooling systems, from cooling towers to automobiles, and the technology is mature. A quick check using Google would have shown what is required - here are two excellent references...
    Part I and Part II

    That power supply is a decent one... Re. IPMI memory error - can you please quote the error verbatim...
  • Accepted Answer

    Wednesday, March 13 2019, 08:25 AM
    Even though it is not in use, please either delete the interface enp3s0 or change it to another subnet which is different from your WAN interface. If the subnets are the same you can end up with no default gateway and ClearOS will cease to communicate with the outside world.
  • Accepted Answer

    Wednesday, March 13 2019, 08:22 AM
    Taryck BENSIALI wrote:
    However, I've received mail I do not understand:
    flip flop (home.domain.xxx)
    Arpwatch <root@home.domain.xxx> Wed, Mar 13, 2019 at 12:07 AM
    To: root@home.domain.xxx
    hostname: home.domain.xxx
    ip address: 10.0.0.137
    ethernet address: 0c:c4:7a:33:07:8b
    ethernet vendor: <unknown>
    old ethernet address: 0c:c4:7a:33:07:8a
    old ethernet vendor: <unknown>
    timestamp: Wednesday, March 13, 2019 0:07:04 +0100
    previous timestamp: Tuesday, March 12, 2019 23:49:16 +0100
    delta: 17 minutes


    They are annoying. I have adjusted the line in /etc/sysconfig/arpwatch to:
    OPTIONS="-u arpwatch -e - -N"
    This stops the e-mail message.

    I have no idea how you've got a /scripts/watch-cpu-temp.sh. It must be something you've found and installed.

    If you don't want the full mail system configured, try aliasing arpwatch to root (or you) in /etc/aliases and follow the instructions in there. If you alias to root, root should then be aliased to you or another valid user. Similarly alias linux-alerts to something useful or change the watching script to e-mail to you.
  • Accepted Answer

    Wednesday, March 13 2019, 08:07 AM
    Hi,

    Network config :

    Settings
    Network Mode
    Standalone - No Firewall
    Hostname
    home.domaine.xxx
    Internet Hostname
    home.domaine.xxx
    Default Domain
    home.domaine.xxx
    DNS
    DNS Server #1
    8.8.8.8
    DNS Server #2
    8.8.4.4
    Network Interfaces
    Interface Role Type IP Address Action
    enp3s0 LAN Static 10.0.0.142
    enp4s0 External Static 10.0.0.137

    Power Supply : https://www.evga.com/products/product.aspx?pn=120-g2-1300-xr

    Water cooling has been removed for the test (as it was non-operational due to bacterial growth).
    It is now a classic heatsink with a fan.
    The bridge does not have any fan. See https://c1.neweggimages.com/NeweggImage/ProductImage/13-182-761-01.jpg
    It is the grey heatsink below the PCI slots.

    IPMI raised 512 errors on memory. I've asked Supermicro for guidance, to be sure whether this could stop the system....
  • Accepted Answer

    Wednesday, March 13 2019, 12:44 AM
    One further thought... on re-reading your appends I came across the reference to water cooling. One disadvantage of water cooling is a hotter-running motherboard. Reason? There is no CPU fan moving the air near the motherboard - especially compared with a down-facing CPU fan pushing air through the heat-sink to the CPU. The power control circuitry on the motherboard for the CPU and memory generates significant heat, which quickly builds up if there is static, stale air in that area. Some water cooling builders actually install an internal fan blowing on the motherboard to combat this problem. This also helps the memory to run cooler...
  • Accepted Answer

    Wednesday, March 13 2019, 12:23 AM

    I've detected a quite high system temperature in my IPMI monitoring, I guess related to the North & South Bridge: 61°C at idle, and under heavy load it rises up to 105°C, but the system is still responding...

    Seems too high to me - you would not want even the CPUs to run that hot... Is the case cooling working OK? Fans spinning, any filters clean, grills clean. Is there a dust build up on the fan blades reducing their efficiency... poor case location - needs to be resolved. Fan orientation such that air is pulled in from one side and expelled the other. What case, how many fans and where?

    You indicated two NICs. What is the network mode configuration? Gateway? standalone etc... connected to what.. Network addresses... static, dhcp? Details please...

    MAC addresses 0c:c4:7a:33:07:8a and 0c:c4:7a:33:07:8b are the two NICs on the ClearOS server? If so, it makes me wonder if they are on the same subnet getting DHCP addresses to flip-flop like that, but without any information on your network I can only guess...

    Make and model of the new power supply?
  • Accepted Answer

    Tuesday, March 12 2019, 11:28 PM
    I've unblocked the mail, I guess, by changing the relay in the mail server.
    I've removed 1736 mails.

    However, I've received mail I do not understand:
    flip flop (home.domain.xxx)
    Arpwatch <root@home.domain.xxx> Wed, Mar 13, 2019 at 12:07 AM
    To: root@home.domain.xxx
    hostname: home.domain.xxx
    ip address: 10.0.0.137
    ethernet address: 0c:c4:7a:33:07:8b
    ethernet vendor: <unknown>
    old ethernet address: 0c:c4:7a:33:07:8a
    old ethernet vendor: <unknown>
    timestamp: Wednesday, March 13, 2019 0:07:04 +0100
    previous timestamp: Tuesday, March 12, 2019 23:49:16 +0100
    delta: 17 minutes


    I also receive mail for every cron action, like watching the CPU temperature:
    Cron <root@home> /scripts/watch-cpu-temp.sh
    (Cron Daemon) <root@home.domain.xxx> Wed, Mar 13, 2019 at 12:20 AM
    To: root@home.domain.xxx
    CPU1 = 24 °C
    CPU2 = 24 °C

    I've disabled email for this script by adding >/dev/null to the end of the line in /etc/crontab
  • Accepted Answer

    Tuesday, March 12 2019, 08:04 PM
    Hi,

    Bought 3 years ago
    Supermicro H8SGL
    AMD G34 16 core
    DDR3 ECC 128GB
    2 x IBM m1015 converted to LSI MegaRAID SAS 9240-8i
    12 HDD
    2 NIC

    No hardware raid
    Only LVM

    No UPS used

    I also suspect the power supply, as monitoring shows only 11.86 V on the 12 V rail, dropping to 11.76 V.
    I've stressed the system for a few minutes with no crash. I used stress like this: -c 8 -i 8 -m 2 -d 1

    I've bought a new power supply (with great hopes for it), which gives 12.254 V on the 12 V rail.
    I restarted yesterday at 20:30 and this morning at 10:43 the system crashed again...
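
For context on those voltage readings, here is a small arithmetic sketch. It assumes the commonly cited ATX tolerance of ±5 % on the +12 V rail (11.40 V to 12.60 V); the constant and function names are made up for the example, and a sensor reading in IPMI is not a substitute for measuring the rail under load.

```python
NOMINAL_V = 12.0
TOLERANCE = 0.05  # assumed: +/-5 % on the +12 V rail, as in the ATX12V design guide

# Readings quoted in this post, written with a decimal comma as in the original.
READINGS = ["11,86", "11,76", "12,254"]


def parse_volts(text):
    """Convert a reading such as '11,86' (decimal comma) to a float."""
    return float(text.replace(",", "."))


def within_tolerance(volts):
    """True if the reading is inside NOMINAL_V +/- TOLERANCE."""
    return abs(volts - NOMINAL_V) <= NOMINAL_V * TOLERANCE


for reading in READINGS:
    volts = parse_volts(reading)
    deviation_pct = (volts - NOMINAL_V) / NOMINAL_V * 100
    print(f"{volts:.3f} V  {deviation_pct:+.2f} %  within tolerance: {within_tolerance(volts)}")
```

Under that assumption all three readings sit inside the allowed band, so the old supply's 11.76 V alone would not prove it was at fault.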
    No heartbeat in /var/log/syswatch :
    Tue Mar 12 10:27:08 2019  info:  system - heartbeat...
    Tue Mar 12 10:37:12 2019 info: system - heartbeat...
    Tue Mar 12 20:04:33 2019 info: system - syswatch started
    Tue Mar 12 20:04:33 2019 info: config - IP referrer tool is installed
    Tue Mar 12 20:04:33 2019 info: config - debug level - 0
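
The syswatch heartbeat makes the outage window easy to extract: in the lines above the heartbeats are about 10 minutes apart, so any much larger gap between consecutive timestamps marks a period when the system was not running. A minimal sketch, where the 15-minute threshold and the function name are assumptions:

```python
from datetime import datetime, timedelta


def heartbeat_gaps(lines, max_gap=timedelta(minutes=15)):
    """Return (before, after) timestamp pairs that are further apart than max_gap."""
    stamps = []
    for line in lines:
        try:
            # 'Tue Mar 12 10:27:08 2019' is the first 24 characters of each entry.
            stamps.append(datetime.strptime(line.strip()[:24], "%a %b %d %H:%M:%S %Y"))
        except ValueError:
            continue  # ignore blank or malformed lines
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > max_gap]


# The /var/log/syswatch lines quoted above.
SAMPLE = [
    "Tue Mar 12 10:27:08 2019  info:  system - heartbeat...",
    "Tue Mar 12 10:37:12 2019 info: system - heartbeat...",
    "Tue Mar 12 20:04:33 2019 info: system - syswatch started",
    "Tue Mar 12 20:04:33 2019 info: config - IP referrer tool is installed",
    "Tue Mar 12 20:04:33 2019 info: config - debug level - 0",
]

for before, after in heartbeat_gaps(SAMPLE):
    print(f"no entries between {before} and {after} ({after - before})")
```

On the sample this reports a single gap, from the 10:37:12 heartbeat to the 20:04:33 restart, which narrows the hang to the minutes after 10:37.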


    I did really try Ctrl+Alt+Del, but I always end up using the reset button, I guess because the system is not responding.
    The lights on the LSI cards are still blinking....

    I've detected a quite high system temperature in my IPMI monitoring, I guess related to the North & South Bridge: 61°C at idle, and under heavy load it rises up to 105°C, but the system is still responding...

    I'll ask Supermicro for some advice.

    I've quickly read: https://www.sraellis.tk/master.php?topic=crash
    However, at this stage I can't say that I've looked at all possible logs, as I do not know where to look.
    With /var/log/messages I only get a clue about the time of the crash; however, it changes all the time...

    The only thing I've added is a Seagate 12TB hard disk that did not work on ClearOS at the first try, but works fine on my PC. I've removed a lot of disks in order to lower the load on the power supply.
    Now I can't remove it as I do not have enough disks to move the data off this disk.
    I can't return it as it works on my PC.

    I've got a big maillog file :
    Mar 10 18:12:52 home postfix/qmgr[2587]: F20F4180FCF4: from=<root@home.domain.xxx>, size=950, nrcpt=1 (queue active)
    Mar 10 18:12:52 home postfix/qmgr[2587]: F4134180CB30: from=<root@home.domain.xxx>, size=950, nrcpt=1 (queue active)
    Mar 10 18:12:52 home postfix/qmgr[2587]: F2B90181A0BA: from=<root@home.domain.xxx>, size=950, nrcpt=1 (queue active)
    Mar 10 18:12:52 home postfix/qmgr[2587]: F36011810C24: from=<root@home.domain.xxx>, size=950, nrcpt=1 (queue active)
    Mar 10 18:12:52 home postfix/qmgr[2587]: F349F1827198: from=<root@home.domain.xxx>, size=950, nrcpt=1 (queue active)
    Mar 10 18:12:52 home postfix/qmgr[2587]: F1BF418243B2: from=<root@home.domain.xxx>, size=950, nrcpt=1 (queue active)
    Mar 10 18:12:52 home postfix/qmgr[2587]: F24D318208B2: from=<root@home.domain.xxx>, size=950, nrcpt=1 (queue active)
    Mar 10 18:12:56 home postfix/pickup[2586]: A34191801FE4: uid=77 from=<arpwatch>
    Mar 10 18:12:56 home postfix/cleanup[2592]: A34191801FE4: message-id=<20190310171256.A34191801FE4@home.domain.xxx>
    Mar 10 18:12:56 home postfix/qmgr[2587]: A34191801FE4: from=<arpwatch@home.domain.xxx>, size=710, nrcpt=1 (queue active)
    Mar 10 18:12:56 home postfix/cleanup[2592]: A683D1800EE1: message-id=<20190310171256.A34191801FE4@home.domain.xxx>
    Mar 10 18:12:56 home postfix/local[2611]: A34191801FE4: to=<root@home.domain.xxx>, orig_to=<root>, relay=local, delay=0.05, delays=0.04/0/0/0.02, dsn=2.0.0, status=sent (fo$
    Mar 10 18:12:56 home postfix/qmgr[2587]: A683D1800EE1: from=<arpwatch@home.domain.xxx>, size=847, nrcpt=1 (queue active)
    Mar 10 18:12:56 home postfix/qmgr[2587]: A34191801FE4: removed
    Mar 10 18:13:02 home postfix/smtp[2612]: 3882F18208A9: to=<linux-alerts@domain.xxx>, orig_to=<root>, relay=smtp.gmail.com[64.233.166.108]:465, delay=33181, delays=33171/0.0$
    Mar 10 18:13:02 home postfix/smtp[2603]: 3B7791824CF3: to=<linux-alerts@domain.xxx>, orig_to=<root>, relay=smtp.gmail.com[64.233.166.108]:465, delay=13981, delays=13971/0.0$
    Mar 10 18:13:02 home postfix/smtp[2595]: 3C6471820C6D: to=<linux-alerts@domain.xxx>, orig_to=<root>, relay=smtp.gmail.com[64.233.166.108]:465, delay=26881, delays=26871/0.0$
    Mar 10 18:13:02 home postfix/smtp[2617]: 32826182345C: to=<linux-alerts@domain.xxx>, orig_to=<root>, relay=smtp.gmail.com[64.233.166.108]:465, delay=21781, delays=21771/0.0$
    Mar 10 18:13:02 home postfix/smtp[2622]: 31F091800F38: to=<linux-alerts@domain.xxx>, orig_to=<root>, relay=smtp.gmail.com[64.233.166.108]:465, delay=51480, delays=51470/0.1$
    Mar 10 18:13:02 home postfix/error[3429]: 39B501820C74: to=<linux-alerts@domain.xxx>, orig_to=<root>, relay=none, delay=25681, delays=25671/10/0/0.01, dsn=4.4.2, status=def$
    Mar 10 18:13:02 home postfix/error[3429]: 38977180D07F: to=<linux-alerts@domain.xxx>, orig_to=<root>, relay=none, delay=103381, delays=103371/10/0/0.01, dsn=4.4.2, status=d$
    Mar 10 18:13:02 home postfix/error[3430]: 396F91810F9A: to=<linux-alerts@domain.xxx>, orig_to=<root>, relay=none, delay=39481, delays=39471/10/0/0.02, dsn=4.4.2, status=def$
    Mar 10 18:13:02 home postfix/error[3429]: 36E76180D7D8: to=<linux-alerts@domain.xxx>, orig_to=<root>, relay=none, delay=96181, delays=96171/10/0/0.02, dsn=4.4.2, status=def$
    Mar 10 18:13:02 home postfix/error[3431]: 3D324180DF01: to=<linux-alerts@domain.xxx>, orig_to=<root>, relay=none, delay=86281, delays=86271/10/0/0.02, dsn=4.4.2, status=def$

    I don't know why I've got so many mails to send and so many errors, as the test is OK when I test the mail settings in webconfig.

    I've got a big arpwatch file in /var/spool/mail/:
    From MAILER-DAEMON  Sat Jun 16 22:55:37 2018
    Return-Path: <>
    X-Original-To: arpwatch@home.domain.xxx
    Delivered-To: arpwatch@home.domain.xxx
    Received: by home.domain.xxx (Postfix)
    id 2DEFA1800E38; Sat, 16 Jun 2018 22:55:37 +0200 (CEST)
    Date: Sat, 16 Jun 2018 22:55:37 +0200 (CEST)
    From: MAILER-DAEMON@home.domain.xxx (Mail Delivery System)
    Subject: Undelivered Mail Returned to Sender
    To: arpwatch@home.domain.xxx
    Auto-Submitted: auto-replied
    MIME-Version: 1.0
    Content-Type: multipart/report; report-type=delivery-status;
    boundary="BE4871800E39.1529182537/home.domain.xxx"
    Message-Id: <20180616205537.2DEFA1800E38@home.domain.xxx>

    This is a MIME-encapsulated message.

    --BE4871800E39.1529182537/home.domain.xxx
    Content-Description: Notification
    Content-Type: text/plain; charset=us-ascii

    This is the mail system at host home.domain.xxx.

    I'm sorry to have to inform you that your message could not
    be delivered to one or more recipients. It's attached below.

    For further assistance, please send mail to postmaster.

    If you do so, please include this problem report. You can
    delete your own text from the attached returned message.

    The mail system

    <linux-alerts@domain.xxx> (expanded from <root>): host
    aspmx.l.google.com[74.125.133.26] said: 550-5.7.1 [5.51.5.30] The IP you're
    using to send mail is not authorized to send 550-5.7.1 email directly to
    our servers. Please use the SMTP relay at your 550-5.7.1 service provider
    instead. Learn more at 550 5.7.1
    https://support.google.com/mail/?p=NotAuthorizedError 198-v6si581168wml.12
    - gsmtp (in reply to end of DATA command)

    --BE4871800E39.1529182537/home.domain.xxx
    Content-Description: Delivery report
    Content-Type: message/delivery-status

    Reporting-MTA: dns; home.domain.xxx
    X-Postfix-Queue-ID: BE4871800E39
    X-Postfix-Sender: rfc822; arpwatch@home.domain.xxx
    Arrival-Date: Sat, 16 Jun 2018 22:55:36 +0200 (CEST)

    Final-Recipient: rfc822; linux-alerts@domain.xxx
    Original-Recipient: rfc822; root
    Action: failed
    Status: 5.7.1
    Remote-MTA: dns; aspmx.l.google.com
    Diagnostic-Code: smtp; 550-5.7.1 [5.51.5.30] The IP you're using to send mail
    is not authorized to send 550-5.7.1 email directly to our servers. Please
    use the SMTP relay at your 550-5.7.1 service provider instead. Learn more
    at 550 5.7.1 https://support.google.com/mail/?p=NotAuthorizedError
    198-v6si581168wml.12 - gsmtp

    --BE4871800E39.1529182537/home.domain.xxx
    Content-Description: Undelivered Message
    Content-Type: message/rfc822

    Return-Path: <arpwatch@home.domain.xxx>
    Received: by home.domain.xxx (Postfix)
    id BE4871800E39; Sat, 16 Jun 2018 22:55:36 +0200 (CEST)
    Delivered-To: root@home.domain.xxx
    Received: by home.domain.xxx (Postfix, from userid 77)
    id B15701800E38; Sat, 16 Jun 2018 22:55:36 +0200 (CEST)
    From: root@home.domain.xxx (Arpwatch)
    To: root@home.domain.xxx
    Subject: new station (local.home.domain.xxx)
    Message-Id: <20180616205536.B15701800E38@home.domain.xxx>
    Date: Sat, 16 Jun 2018 22:55:36 +0200 (CEST)

    hostname: local.home.domain.xxx
    ip address: 10.0.0.142
    ethernet address: 0c:c4:7a:33:07:8a
    ethernet vendor: <unknown>
    timestamp: Saturday, June 16, 2018 22:55:36 +0200

    --BE4871800E39.1529182537/home.domain.xxx--

    I've got 1733 mails blocked...
  • Accepted Answer

    Monday, March 11 2019, 10:44 AM
    I didn't say you had a full file system or disk - just suggesting reducing the excessive logging and consequent disk activity. As for disks, I would recommend running the long SMART tests as well. I see nothing obviously wrong with the rsyslog.conf file.

    You might get some ideas regarding the hang here. Could you please provide a short summary of the system components and their age so we know what we are dealing with? Using RAID? One prime suspect, if hardware, would be the power supply. External sources could include interference, dips in the mains power supply or a mains wiring fault. Do you use a UPS? Can you reboot after a hang using Ctrl-Alt-Del - or is it necessary to press the Reset or Power Off switch?
  • Accepted Answer

    Monday, March 11 2019, 08:52 AM
    Hi Tony,

    My issue is that the system hangs, not that a filesystem is full.
    I need guidance to troubleshoot the cause.
    I suspect a hardware issue, but which one....

    /etc/rsyslog.conf
    # rsyslog configuration file

    # For more information see /usr/share/doc/rsyslog-*/rsyslog_conf.html
    # If you experience problems, see http://www.rsyslog.com/doc/troubleshoot.html

    #### MODULES ####

    # The imjournal module bellow is now used as a message source instead of imuxsock.
    $ModLoad imuxsock # provides support for local system logging (e.g. via logger command)
    $ModLoad imjournal # provides access to the systemd journal
    #$ModLoad imklog # reads kernel messages (the same are read from journald)
    #$ModLoad immark # provides --MARK-- message capability

    # Provides UDP syslog reception
    #$ModLoad imudp
    #$UDPServerRun 514

    # Provides TCP syslog reception
    #$ModLoad imtcp
    #$InputTCPServerRun 514


    #### GLOBAL DIRECTIVES ####

    # Where to place auxiliary files
    $WorkDirectory /var/lib/rsyslog

    # Use default timestamp format
    $ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat

    # File syncing capability is disabled by default. This feature is usually not required,
    # not useful and an extreme performance hit
    #$ActionFileEnableSync on

    # Include all config files in /etc/rsyslog.d/
    $IncludeConfig /etc/rsyslog.d/*.conf

    # Turn off message reception via local log socket;
    # local messages are retrieved through imjournal now.
    $OmitLocalLogging on

    # File to store the position in the journal
    $IMJournalStateFile imjournal.state


    #### RULES ####

    # Log all kernel messages to the console.
    # Logging much else clutters up the screen.
    #kern.* /dev/console

    # Log anything (except mail) of level info or higher.
    # Don't log private authentication messages!
    *.info;mail.none;authpriv.none;cron.none;local6.none;local5.none /var/log/messages

    # The authpriv file has restricted access.
    authpriv.* /var/log/secure

    # Log all the mail messages in one place.
    mail.* -/var/log/maillog


    # Log cron stuff
    cron.* /var/log/cron

    # Everybody gets emergency messages
    *.emerg :omusrmsg:*

    # Save news errors of level crit and higher in a special file.
    uucp,news.crit /var/log/spooler

    # Save boot messages also to boot.log
    local7.* /var/log/boot.log


    # ### begin forwarding rule ###
    # The statement between the begin ... end define a SINGLE forwarding
    # rule. They belong together, do NOT split them. If you create multiple
    # forwarding rules, duplicate the whole block!
    # Remote Logging (we use TCP for reliable delivery)
    #
    # An on-disk queue is created for this action. If the remote host is
    # down, messages are spooled to disk and sent when it is up again.
    #$ActionQueueFileName fwdRule1 # unique name prefix for spool files
    #$ActionQueueMaxDiskSpace 1g # 1gb space limit (use as much as possible)
    #$ActionQueueSaveOnShutdown on # save messages to disk on shutdown
    #$ActionQueueType LinkedList # run asynchronously
    #$ActionResumeRetryCount -1 # infinite retries if host is down
    # remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional
    #*.* @@remote-host:514
    # ### end of the forwarding rule ###
    local6.* /var/log/system
    local5.* /var/log/compliance

    #TBE (fix)
    # LDAP - Log Openldap
    local4.* /var/log/openldap/slapd.log
  • Accepted Answer

    Monday, March 11 2019, 01:59 AM
    Two suggestions :-
    1) arpwatch
    Suggest at least temporarily stopping the logging of arpwatch messages by adding a file to /etc/rsyslog.d (see below $$$ ). This will stop the clutter in /var/log/messages and prevent the excessive disk activity, as well as saving disk space in the /var/log directory (nowhere near full, I hope)...
    2) rsyslog
    As I understand it the "actions" are numbered sequentially from the start of the "#### RULES ####" section in /etc/rsyslog.conf (if this is incorrect please indicate how they are numbered). Can you please show us your /etc/rsyslog.conf from "#### RULES ####" to the end (in code tags please).

    $$$ Create a file named, for example, /etc/rsyslog.d/arpwatch.conf with contents similar to the following

    # Reference = http://wiki.rsyslog.com/index.php/Filtering_by_program_name
    if $programname == 'arpwatch' then stop
    See previous discussion in these forums regarding these files...
  • Accepted Answer

    Sunday, March 10 2019, 10:40 PM
    Hi Nick,

    I've tested the ECC RAM with MemTest86+ from the Ultimate Boot CD.
    I've tested the disks with smartmontools, launching a short test on each of them.

    For the logs, I don't know where to look. I used messages to figure out the approximate time frame. But it changes all the time.

    The CPU temperature is monitored because I've got water cooling, and the additive in the fluid sometimes blocks the flow in the CPU water block. So now I monitor it with cacti in order to detect when I need to clean the block.
    Highest CPU temp is 29°C, lowest 19°C, average 21°C
    (Graph showing the 2 stops that occurred today.)
  • Accepted Answer

    Sunday, March 10 2019, 09:54 PM
    I don't like the rsyslog message but I can't find any particularly useful info on it.
    How did you test your disk and memory?
    Do any of the other logs have anything relevant?

    When you reboot, can you drop into the BIOS and check your system temperature?