[Sisuite-devel] systemimager.ci.uchicago.edu down
Andrea Righi
2009-01-22 08:54:13 UTC
Dear ***@uchicago,

the host systemimager.ci.uchicago.edu seems to be down (neither ping-able
nor telnet-able).

Over the past few days we've had a lot of out-of-memory problems. We've
now configured the server to prevent OOM conditions (using a script
that restarts apache when memory is getting low), and in case the
OOM can't be prevented, the kernel automatically reboots after an OOM
trace. Unfortunately this doesn't seem to be enough...
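
For illustration only, the low-memory check described above could look roughly like the sketch below. This is not the actual script running on the server (its source is not shown in this thread); the function names, the 64 MB threshold, and the choice to count free memory plus buffers and page cache are all assumptions.

```python
# Hypothetical sketch of a low-memory check; not the real watchdog script.

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def usable_kb(info):
    """Free memory plus buffers and page cache (assumed to be reclaimable)."""
    return info.get("MemFree", 0) + info.get("Buffers", 0) + info.get("Cached", 0)


def should_restart(info, min_usable_kb=65536):
    """True when usable memory is below the threshold (assumed: 64 MB)."""
    return usable_kb(info) < min_usable_kb
```

A periodic job could feed the contents of /proc/meminfo to `parse_meminfo` and, when `should_restart` returns True, restart apache with whatever command the host uses (for example `apachectl restart`, which is also an assumption here).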

Could you please check whether the server is down for another reason
(not OOM; I mean, if there's a console, what is the message on the
screen?) and try to reboot it manually?

Many thanks,
-Andrea
--
Andrea Righi,
PhD student
Department of Information Engineering
Universita' degli Studi di Siena
Via Roma, 56 - 53100 Siena (Italy)
Brian Elliott Finley
2009-01-22 13:44:12 UTC
Andrea,

Do you believe that we're simply running out of memory? How much memory do you think we need?

Also, if I could replace the box, would it be OK if I went with a different architecture? Just considering the options...

-Brian


Andrea Righi
2009-01-25 22:02:05 UTC
Post by Brian Elliott Finley
Andrea,

Hi Brian,

Post by Brian Elliott Finley
Do you believe that we're simply running out of memory? How much memory do you think we need?
Also, if I could replace the box, would it be OK if I went with a different architecture? Just considering the options...

Hmm... judging from the log, the last 3-4 crashes were all caused by
out-of-memory conditions. I don't know the details of this latest outage,
but it doesn't look like the same OOM problem: in general, after an OOM
the server is still ping-able, and now it isn't. Probably something went
wrong during the auto-reboot triggered by the panic_on_oops setting.

In any case, a server with more memory would probably just delay the
problem rather than resolve it for good. To fix this kind of issue we
must identify all the possible causes and try to prevent them... The
check_oom.pl script reduced the crashes *a lot* (well... except in the
last few days...). The auto-reboot (panic_on_oops) seemed like a good
idea too, but only if the reboot is a reliable operation.
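
As a rough illustration of the auto-reboot dependency mentioned above: on Linux, kernel.panic_on_oops only turns an oops into a panic, and the machine reboots by itself only if kernel.panic is also nonzero (zero means the kernel waits forever after a panic). The sketch below checks that combination in sysctl-style text. The server's real settings are not shown in this thread, so the sample values and function names are assumptions.

```python
# Hypothetical check of sysctl-style settings; values here are not from the server.

def parse_sysctl(text):
    """Parse 'name = value' lines into {name: int}, skipping comments and non-integers."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        name, sep, value = line.partition("=")
        if not sep:
            continue
        try:
            settings[name.strip()] = int(value)
        except ValueError:
            continue  # ignore non-integer values
    return settings


def reboots_after_oops(settings):
    """True if an oops would cause a panic AND the panic would trigger a reboot."""
    panics = settings.get("kernel.panic_on_oops", 0) != 0
    reboots = settings.get("kernel.panic", 0) != 0  # 0 = wait forever after a panic
    return panics and reboots
```

With "kernel.panic_on_oops = 1" and "kernel.panic = 10" as input, `reboots_after_oops` reports that a reboot would be attempted; it cannot say whether the reboot itself completes, which is the open question in this thread.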

The error message on the console would really help us understand what
happened.

BTW, a box with IPMI capability, so we could reboot it remotely or even
look at the console (also remotely), would be *great*. I've no idea
whether we can get something like that, or whether someone would want to
donate one. I can ask at CINECA if they have some spare boxes with this
capability.

-Andrea
Brian Elliott Finley
2009-01-26 04:18:41 UTC
Sure.

I wouldn't mind at all if it were hosted there. If not, I do have an ia64 server system we could use. However, it would still be hosted at the same facility with the same response time if we ever need someone to physically kick the box.

Cheers,

-Brian
--
Brian Elliott Finley
Mobile: 630.631.6621

