
Gavin Meek
Resolved
Hi All,

I need some help/advice. I'm running ClearOS Community edition primarily as a content filter and DHCP server. I'm running it in 'invisible' mode. The machine has 2 ethernet interfaces: one is used as the WAN interface, using PPPoE via an external fibre modem, and the other is my network connection to a switch and hence the rest of the network. Most users are wifi users connecting via an access point in bridge mode (i.e. the access point is doing no NAT and the IP addresses for wifi users are provided by the DHCP server on the ClearOS server). I think I have a maximum of approximately 20 users connecting at any one time, but it may be more than this; it is hard to quantify at present.

I'm having issues with kernel panics (2 flashing lights on the keyboard). They seem to happen somewhat randomly, but I have determined or carried out the following:
The server remains stable when the WAN is disconnected.
I have done a clean install of the latest ClearOS version. The system remained stable for approximately a week and a half but has now crashed again.
I thought the issue might be down to DHCP leases: originally they were on a 12-hour lease with 50 addresses in the pool, and I thought I might be running out of addresses, so I reduced the lease time to 4 hours and increased the pool to 150 addresses. I thought this had solved the problem, as the system remained stable for a week and a half after the change, but after a period of heavier use it has crashed again today.
I don't believe I have a hardware issue, but I am a novice with Linux. If anyone can explain how to interrogate the log files to determine the cause of the crash and definitely rule out a hardware issue, that would help narrow things down.
Generally, when the server crashes no users can access the internet, I cannot ping the internal or external IP addresses, and I cannot access the webconfig page. Sometimes, though, a user can continue to access the internet while I still cannot access the webconfig page; instead I get a warning about an invalid certificate when I try to access it.

I've run out of ideas to resolve this issue and any pointers or guidance that could help would be much appreciated.

Thanks
Thursday, March 07 2019, 08:01 PM
Responses (12)
  • Accepted Answer

    Wednesday, March 13 2019, 06:02 AM - #Permalink
    The vendor ID for the NICs suggests this is a Dell PowerEdge 860 server of about 2006 vintage. Is this correct? If so, hardware that old could be developing all sorts of problems: electrolytic capacitors drying out, memory issues, a power supply starting to fail, the heat transfer paste between CPU and heat-sink going bad, fans running too slow as bearings wear, a case full of dust, a clogged CPU heat-sink, hard disk errors, etc.

    Completely agree with Nick. If you want a response then you must make it easy to get the information. I read the first screen with the NIC information; sorry, I'm not going to struggle any further. If you must use a camera/phone, then at least hold it so the lens points straight at, and is level with, the centre of the screen, so that you get a square snapshot with all of it mostly in focus, not one where the lines go out of focus and diminish in length as they get further away from the lens. A photo is necessary to capture the panic output, and this could help us. The rest should be a copy and paste as Nick detailed, between code tags.

    Have you performed some of the simple checks such as running memory tests for a few hours, or running the S.M.A.R.T disk monitoring tool long test (install the smartmontools package if not already installed) to verify the disk drive(s) are not having problems? Inspected the interior of the server?
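    The disk check mentioned above could look roughly like the sketch below. The device name /dev/sda and the helper name smart_passed are assumptions for illustration, and the summary line it looks for is the one smartctl prints for ATA/SATA disks; SAS/SCSI disks word it differently.

```shell
# Return success if smartctl's health summary (read from stdin) reports PASSED.
# Assumes an ATA/SATA disk, whose summary line reads
# "SMART overall-health self-assessment test result: PASSED".
smart_passed() {
    grep -q 'self-assessment test result: PASSED'
}

# Typical sequence as root (/dev/sda is an assumed device name):
#   yum install smartmontools        # if not already installed
#   smartctl -t long /dev/sda        # start the long self-test in the background
#   smartctl -l selftest /dev/sda    # read the self-test log once it has finished
#   smartctl -H /dev/sda | smart_passed && echo "health OK" || echo "check the disk"
```

    The long test runs inside the drive itself, so the server can stay up while it completes; smartctl prints an estimated completion time when the test is started.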
  • Accepted Answer

    Tuesday, March 12 2019, 10:49 AM - #Permalink
    I've tidied up your post. Rather than taking photos, can I suggest, if you use Windows on your LAN, that you get hold of the programs PuTTY and WinSCP. PuTTY will provide you with a remote console. You can copy text from it just by selecting it and paste into it by right-clicking. WinSCP will provide you with a graphical file manager and text editor.

    Your NIC drivers and disk space are OK.
    The dmesg output needs the timestamp on the left to show if that was during boot up or later. I don't like the errors I see, but if it is during boot up, I suspect the kernel is recovering OK.
    I can't read the last part of the last picture (the messages log?). I think you are just rebooting then but I'm not sure.
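    On the dmesg timestamps: one way to separate boot-time messages from later ones is to filter on the seconds-since-boot value at the start of each line. This is only a sketch; after_boot is a made-up helper name, and the 120-second default is an assumed cut-off for "boot has finished", not a measured one.

```shell
# Keep only kernel messages logged more than N seconds after boot.
# Assumes the default dmesg format, where each line starts with "[  123.456789]".
after_boot() {
    local threshold="${1:-120}"
    awk -v t="$threshold" '
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            ts = substr($0, 2, RLENGTH - 2)   # text between the brackets
            if (ts + 0 > t) print
        }'
}

# Example (as root on the server):
#   dmesg | after_boot 120
```

    Anything this prints was logged after the cut-off, so errors that show up here are not just boot-time noise.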

    What is the LAN subnet of your PPPoE/DSL router/modem? For example, if you delete the PPPoE interface and switch the NIC to DHCP, does it get an IP, and if so, on which subnet?
  • Accepted Answer

    Gavin Meek
    Tuesday, March 12 2019, 12:09 AM - #Permalink
    Log 3
  • Accepted Answer

    Gavin Meek
    Tuesday, March 12 2019, 12:08 AM - #Permalink
    Log 2
  • Accepted Answer

    Gavin Meek
    Tuesday, March 12 2019, 12:07 AM - #Permalink
    Log 1
  • Accepted Answer

    Gavin Meek
    Tuesday, March 12 2019, 12:04 AM - #Permalink
    Output of lspci -knn | grep Eth -A 3
  • Accepted Answer

    Gavin Meek
    Monday, March 11 2019, 11:56 PM - #Permalink
    Hi Nick,

    I have attached the output of lspci -knn | grep Eth -A 3 and the messages, system and dmesg logs. I will follow up with the syswatch log as you suggested in your last post. See if you think there is anything relevant.

    Thanks
  • Accepted Answer

    Gavin Meek
    Monday, March 11 2019, 11:33 PM - #Permalink
    Hi Nick,

    I have attached some pics, including the output of lspci -knn | grep Eth -A 3 and the messages, system and dmesg logs. Let me know if there is anything you think is relevant. I will try to run the memtest and disk checks, and post the output of the syswatch log as you suggested, in the next day or two.

    Thanks
    Gavin
  • Accepted Answer

    Friday, March 08 2019, 10:16 PM - #Permalink
    To get BIOS temps you need to reboot to get into it.

    It sounds like you are losing networking. In the local console, can you log on as the root user? Also, if you can get to the red box, you can press Alt+F2 to get to a normal terminal. From there you can log on and do everything you can usually do from a console. Do you see anything at the end of "dmesg"?

    "ifconfig" should show you your NICs and their configuration. Have they lost their IPs? Even after rebooting, the syswatch log may show some activity if you are losing your IPs. If you have lost your IPs, does a "service network restart" get them back? On the other hand, if you are losing your NICs or PCI bus then you are dead anyway.
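    The "have they lost their IPs" check can also be scripted, which helps if you only have a few seconds at the console. A rough sketch follows; missing_ipv4 is a made-up helper name, the interface names eth1 and ppp0 are assumptions (use whatever ifconfig shows on your box), and /var/log/syswatch is assumed to be where the syswatch log lives.

```shell
# Report which of the named interfaces have no IPv4 address.
# Reads `ip -4 -o addr show` output on stdin; interface names are arguments.
missing_ipv4() {
    local addrs iface
    addrs=$(awk '$3 == "inet" { print $2 }')
    for iface in "$@"; do
        printf '%s\n' "$addrs" | grep -qxF "$iface" || echo "$iface has no IPv4 address"
    done
}

# Example (as root; eth1 and ppp0 are assumed names for the LAN and PPPoE interfaces):
#   ip -4 -o addr show | missing_ipv4 eth1 ppp0
#   service network restart          # then run the check again
#   tail -n 50 /var/log/syswatch     # assumed log location
```

    If the function prints nothing, every interface you named still has an address, and the problem lies elsewhere.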
  • Accepted Answer

    Gavin Meek
    Friday, March 08 2019, 10:00 PM - #Permalink
    Hi Again Nick,

    See my previous reply above. I think I may have replied in the wrong place; I'm still getting used to the forum.

    Further to that reply, I have more information: when I went to log in to the local console, it wouldn't accept my username and password. Then I noticed that, below the red box, the configuration URL had changed to https://127.0.0.1:81/ when it is usually 192.168.0.1:81. After another reboot it returned to 192.168.0.1:81 as expected, and I can once again access the local console and the webconfig.

    Thanks
  • Accepted Answer

    Gavin Meek
    Friday, March 08 2019, 08:48 PM - #Permalink
    Hi Nick,

    Thanks for your reply, and apologies for the duplicate post; I had missed the message saying it had to be approved before it was posted.

    I will get you the results of the things you have suggested.

    In the meantime, I rebooted the server tonight after another crash.
    With regard to the question about the BIOS temps: when a crash occurs I cannot do anything. There is no response to keyboard or mouse input, and the screen is blank.
    After the reboot the server operated fine with 4 devices connected and able to access the internet with no problems, and I could connect to the webconfig page with no issues.
    However, after approximately 45 minutes I noticed the following: all devices can still access the internet, but I can no longer access the webconfig page, and when I try I get an error saying "Forbidden: You do not have permission to access / on this server."
    I think I have encountered this error before, ahead of the server crashing completely.

    Thanks
  • Accepted Answer

    Friday, March 08 2019, 08:45 AM - #Permalink
    I've deleted your duplicate post. New posters get their first couple of posts moderated so they don't appear immediately.

    I was not aware until you posted that flashing keyboard lights meant a kernel panic! It is very unlikely to be the sort of software issue you've been looking at. 50 addresses for 20 people is enough. The worst case is that new users would not get a lease, rather than DHCP (dnsmasq) crashing. To me the most likely cause is a hardware issue. Are you able to capture the panic output from the console?
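    If you want to put a number on how many clients hold a lease at once, counting the entries in the dnsmasq lease file is one option, since each lease is one line. A minimal sketch; count_leases is a made-up helper name and the default path is an assumption about where the lease file is kept, so adjust it if yours is elsewhere.

```shell
# Count current DHCP leases: dnsmasq writes one line per lease.
# The default path below is an assumption; pass another path as the first argument.
count_leases() {
    local leasefile="${1:-/var/lib/dnsmasq/dnsmasq.leases}"
    grep -c . "$leasefile"
}

# Example (as root):
#   count_leases
```

    Comparing that number with the pool size at a busy time of day shows whether the pool is anywhere near exhausted.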

    How are you for disk space?
    df -h


    When you crash are you able to get to your system temperature through your BIOS?

    Can you take the server offline and run a memory test such as memtest86?

    In terms of logs there may not be much as the crash may stop log writing. You could have a look at messages, system and dmesg (all in /var/log)
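    As a starting point for reading those logs, you can filter them for lines that commonly accompany a panic or a hardware fault. This is a sketch only: scan_log is a made-up helper name and the keyword list is an illustrative assumption, not an exhaustive one.

```shell
# Print lines from a log file that commonly accompany a kernel panic or
# hardware fault. The keyword list is illustrative, not exhaustive.
scan_log() {
    local logfile="$1"
    grep -i -E 'panic|oops|call trace|machine check|hardware error|i/o error|out of memory' "$logfile"
}

# Examples (as root on the server):
#   scan_log /var/log/messages
#   scan_log /var/log/system
#   dmesg | scan_log -
```

    An empty result does not rule out a hardware problem, because the crash may stop log writing before anything is recorded.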

    Can you install smartmontools and run some disk checks?

    Out of interest, what is the output of:
    lspci -knn | grep Eth -A 3