From bill at cse.ucdavis.edu Thu Oct 1 13:08:23 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] Nvidia FERMI/gt300 GPU Message-ID: <4AC50C37.9070301@cse.ucdavis.edu> Impressive: * IEEE floating point, doubles 1/2 as fast as single precision (6 times or so faster than the gt200). * ECC * 512 cores (gt200 has 240) * 384 bit bus gddr5 (twice as fast per pin, gt200 has 512 bits) * 3 billion transistors * 64KB of L1 cache per SM, 768KB L2, cache coherent across the chip. No time line or price, but it sounds like a rather interesting GPU for CUDA and OpenCL. More details: http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932 From Craig.Tierney at noaa.gov Thu Oct 1 14:18:40 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] Nvidia FERMI/gt300 GPU In-Reply-To: <4AC50C37.9070301@cse.ucdavis.edu> References: <4AC50C37.9070301@cse.ucdavis.edu> Message-ID: <4AC51CB0.7070205@noaa.gov> Bill Broadley wrote: > Impressive: > * IEEE floating point, doubles 1/2 as fast as single precision (6 times or > so faster than the gt200). > * ECC The GDDR5 says it supports ECC, but what is the card going to do? Is it ECC just from the memory controller, or is it ECC all the way through the chip? Is it 1-bit correct, 2-bit error message? Anyone? Craig > * 512 cores (gt200 has 240) > * 384 bit bus gddr5 (twice as fast per pin, gt200 has 512 bits) > * 3 billion transistors > * 64KB of L1 cache per SM, 768KB L2, cache coherent across the chip. > > No time line or price, but it sounds like a rather interesting GPU for CUDA > and OpenCL. > > More details: > http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Craig Tierney (craig.tierney@noaa.gov) From bill at cse.ucdavis.edu Thu Oct 1 14:42:32 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] Nvidia FERMI/gt300 GPU In-Reply-To: <4AC51CB0.7070205@noaa.gov> References: <4AC50C37.9070301@cse.ucdavis.edu> <4AC51CB0.7070205@noaa.gov> Message-ID: <4AC52248.4030402@cse.ucdavis.edu> Craig Tierney wrote: > Bill Broadley wrote: >> Impressive: >> * IEEE floating point, doubles 1/2 as fast as single precision (6 times or >> so faster than the gt200). >> * ECC > > The GDDR5 says it supports ECC, but what is the card going to do? > Is it ECC just from the memory controller, or is it ECC all the way > through the chip? Is it 1-bit correct, 2-bit error message? Nvidia is pleasingly specific in their white paper: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiComputeArchitectureWhitepaper.pdf Specifically: Fermi supports Single-Error Correct Double-Error Detect (SECDED) ECC codes that correct any single bit error in hardware as the data is accessed. ... Fermi's register files, shared memories, L1 caches, L2 cache, and DRAM memory are ECC protected ... All NVIDIA GPUs include support for the PCI Express standard for CRC check with retry at the data link layer. Fermi also supports the similar GDDR5 standard for CRC check with retry (aka "EDC") during transmission of data across the memory bus. Kudos to Nvidia for being very clear.
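A quick sanity check on the memory bus, using only the figures quoted above (the actual GDDR5 clocks were not announced, so this is a ratio rather than a bandwidth spec):

  gt200 : 512 pins x 1 data transfer per pin per clock = 512 transfers/clock
  Fermi : 384 pins x 2 data transfers per pin per clock = 768 transfers/clock
  ratio : 768 / 512 = 1.5x at an equal memory clock

For reference, a gt200 board (GTX 280) delivers roughly 140 GB/s, so equal clocks would put Fermi somewhere around 210 GB/s; treat that as an extrapolation from the bullet list, not an announced number.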
From kilian.cavalotti.work at gmail.com Fri Oct 2 00:19:14 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] Nvidia FERMI/gt300 GPU In-Reply-To: <4AC50C37.9070301@cse.ucdavis.edu> References: <4AC50C37.9070301@cse.ucdavis.edu> Message-ID: On Thu, Oct 1, 2009 at 10:08 PM, Bill Broadley wrote: > No time line or price, but it sounds like a rather interesting GPU for CUDA > and OpenCL. Not only CUDA and OpenCL, but also DirectX, DirectCompute, C++, and Fortran. From a programmer's point of view, it could be a major improvement, and the only thing which still kept people from using GPUs to run their code. On a side note, it's funny to notice that this is probably the first GPU in history to be introduced by its manufacturer as a "supercomputer on a chip", rather than a graphics engine which will allow gamers to play their favorite RPS at never-reached resolutions and framerates. Reading some reviews, it seemed that the traditional audience of such events (gamers) were quite disappointed by not really seeing what the announcements could mean to them. After all, HPC is still a niche compared to the worldwide video games market, and it's impressive that NVIDIA decided to focus on this tiny fraction of its prospective buyers, rather than go for the usual my-vertex-pipeline-is-longer-than-yours. :) Cheers, -- Kilian From i.n.kozin at googlemail.com Fri Oct 2 02:01:08 2009 From: i.n.kozin at googlemail.com (Igor Kozin) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] Nvidia FERMI/gt300 GPU In-Reply-To: References: <4AC50C37.9070301@cse.ucdavis.edu> Message-ID: > Not only CUDA and OpenCL, but also DirectX, DirectCompute, C++, and > Fortran. From a programmer's point of view, it could be a major > improvement, and the only thing which still kept people from using > GPUs to run their code. all of the above but C++ can be used on the current hardware. CUDA Fortran is already available in PGI 9.0-4 (public beta). > On a side note, it's funny to notice that this is probably the first > GPU in history to be introduced by its manufacturer as a > "supercomputer on a chip", rather than a graphics engine which will > allow gamers to play their favorite RPS at never-reached resolutions > and framerates. Reading some reviews, it seemed that the traditional > audience of such events (gamers) were quite disappointed by not really > seeing what the announcements could mean to them. > After all, HPC is still a niche compared to the worldwide video games > market, and it's impressive that NVIDIA decided to focus on this tiny > fraction of its prospective buyers, rather than go for the usual > my-vertex-pipeline-is-longer-than-yours. :) yes, indeed and such a strong skew towards HPC worries me. there is almost no word how gamers are going to benefit from Fermi apart from C++ support. in the mean time RV770 offers 2.7 TFlops SP and 1/4 of that in DP which positions it pretty close to Fermi in that respect. in RV770, quick integer operations are in 24-bit and i wonder if the same still holds true for Fermi. From rpnabar at gmail.com Fri Oct 2 12:46:52 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: On Thu, Oct 1, 2009 at 1:32 AM, Beat Rubischon wrote: > > As long as a single person or a motivated team handles a datacenter, > everything looks good. 
But when spare time operators or even worse several > groups are using a spare room in the basement, bad things are common. +1 coming from a university setting. I think this is very typical amongst my peers. Not "good" , or "desirable"; but just a fact. I also suspect this is a "size of the cluster" issue. Many of the HPC-clusters I see are "small". 30-50 servers clustered together. I think these tend to be a lot more messy and hacked-together than the nice, clean, efficient setups many of you on this list might have. Mark probably deals with "huge" systems. Just a thought. -- Rahul From rpnabar at gmail.com Fri Oct 2 12:51:16 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <4AC36693.3040801@scalableinformatics.com> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> Message-ID: On Wed, Sep 30, 2009 at 9:09 AM, Joe Landman wrote: > Of course I could also talk about the SOL (serial over lan) which didn't THanks Joe! I've been reading about IPMI and also talking to my vendor about it. Sure, our machines also have IPMI support! Question: What's the difference between SOL and IPMI. Is one a subset of the other? Frankly, based on my past experience the only part of IPMI that I foresee being most useful to us is the console (keyboard + monitor) redirection. Right from the BIOS stage of the bootup cycle. All the other IPMI capabilities are of course cool and nice but that is icing on the cake for me. -- Rahul From skylar at cs.earlham.edu Fri Oct 2 20:13:49 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> Message-ID: <4AC6C16D.9050708@cs.earlham.edu> Rahul Nabar wrote: > On Wed, Sep 30, 2009 at 9:09 AM, Joe Landman > wrote: > >> Of course I could also talk about the SOL (serial over lan) which didn't >> > > THanks Joe! I've been reading about IPMI and also talking to my vendor > about it. Sure, our machines also have IPMI support! > > Question: What's the difference between SOL and IPMI. Is one a subset > of the other? > SOL is provided by IPMI v1.5+, so it's a part of IPMI itself. > Frankly, based on my past experience the only part of IPMI that I > foresee being most useful to us is the console (keyboard + monitor) > redirection. Right from the BIOS stage of the bootup cycle. All the > other IPMI capabilities are of course cool and nice but that is icing > on the cake for me. > In addition to the console, the other really useful feature of IPMI is remote power cycling. That's useful when the console itself is totally wedged. -- -- Skylar Thompson (skylar@cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20091002/29f860d9/signature.bin From rpnabar at gmail.com Sat Oct 3 09:33:48 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <4AC6C16D.9050708@cs.earlham.edu> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> Message-ID: On Fri, Oct 2, 2009 at 10:13 PM, Skylar Thompson wrote: > Rahul Nabar wrote: >> On Wed, Sep 30, 2009 at 9:09 AM, Joe Landman >> wrote: > > In addition to the console, the other really useful feature of IPMI is > remote power cycling. That's useful when the console itself is totally > wedged. > True. That's a useful feature. But that "could" be done by sending "magic packets" to a eth card as well, right? I say "can" because I don't have that running on all my servers but had toyed with that on some. I guess, just many ways of doing the same thing. -- Rahul From skylar at cs.earlham.edu Sat Oct 3 09:53:19 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> Message-ID: <4AC7817F.9090702@cs.earlham.edu> Rahul Nabar wrote: > True. That's a useful feature. But that "could" be done by sending > "magic packets" to a eth card as well, right? I say "can" because I > don't have that running on all my servers but had toyed with that on > some. I guess, just many ways of doing the same thing. > You could use Wake-on-LAN to turn a system on, but I don't think you can reset or power off the system. What ever you use should give you some authentication/authorization and hopefully encryption so that you don't have just anyone rebooting systems. IPMI 1.5+ will do this, but Wake-on-LAN does not. -- -- Skylar Thompson (skylar@cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20091003/b9f6f718/signature.bin From rpnabar at gmail.com Sat Oct 3 09:54:27 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <4AC6C16D.9050708@cs.earlham.edu> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> Message-ID: On Fri, Oct 2, 2009 at 10:13 PM, Skylar Thompson wrote: >> >> THanks Joe! I've been reading about IPMI and also talking to my vendor >> about it. Sure, our machines also have IPMI support! >> >> Question: What's the difference between SOL and IPMI. Is one a subset >> of the other? >> > > SOL is provided by IPMI v1.5+, so it's a part of IPMI itself. > Last two days I was playing with ipmitool to connect to the machines. 
Is this the typical tool or do people have any other open source sugesstions to use. ipmitool seems to have a rich set of features for querying the BMC's etc. but I didn't see many SOL sections. Is ipmitool what people use to watch a redirected console as well or anything else? I couldn't find any good SOL howtos or tutorials on google other than the vendor-specific ones. ANy pointers? I did find this one from a fellow Beowulfer but this seems quite dated now..... http://buttersideup.com/docs/howto/IPMI_on_Debian.html -- Rahul From landman at scalableinformatics.com Sat Oct 3 09:54:57 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> Message-ID: <4AC781E1.7050705@scalableinformatics.com> Rahul Nabar wrote: > On Fri, Oct 2, 2009 at 10:13 PM, Skylar Thompson wrote: >> Rahul Nabar wrote: >>> On Wed, Sep 30, 2009 at 9:09 AM, Joe Landman >>> wrote: > > >> In addition to the console, the other really useful feature of IPMI is >> remote power cycling. That's useful when the console itself is totally >> wedged. >> > > True. That's a useful feature. But that "could" be done by sending > "magic packets" to a eth card as well, right? I say "can" because I > don't have that running on all my servers but had toyed with that on > some. I guess, just many ways of doing the same thing. Hmmm... If I were building a cluster of anything more than 4 machines (not racks, machines), I would be insisting upon IPMI 2.0 with a working SOL and kvm over IP capability built in. For the 250-300 machine system you are looking at, you *want* IPMI 2.0 with KVM over IP. You *want* switched remotely accessible PDUs, for those times when IPMI itself gets wedged (rarer these days, but it does still happen). IMO you *want* this IPMI on a separate network. You *want* a serial concentrator type system to provide a redundant path in the event of an IPMI failure. Problems don't go away just because IPMI stopped working. You *need* an inexpensive crash cart that just works, and plugs into your PDUs. Understand that administration time could scale linearly with the number of nodes if you are not careful, so you want to (carefully) use tools which significantly help reduce administrative load. IPMI 2.0 is one such tool. Sending "magic" bytes to an eth won't work if the OS/machine is wedged. You are (likely) thinking of power-on when traffic shows up on LAN. This is a very different beast. If you could simply toggle power state of a server by sending "magic bytes to the eth port, lots of people would be very unhappy from the never ending denial of service attack this opens up. Take it as a given that you want functional IPMI 2.0 with operational SOL, and you really do want remote kvm over IP built in. The latter is my opinion, but it is again based on experience over the last decade+ in building/supporting these things. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. 
email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Sat Oct 3 09:55:21 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <4AC7817F.9090702@cs.earlham.edu> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> <4AC7817F.9090702@cs.earlham.edu> Message-ID: On Sat, Oct 3, 2009 at 11:53 AM, Skylar Thompson wrote: > Rahul Nabar wrote: > > You could use Wake-on-LAN to turn a system on, but I don't think you can > reset or power off the system. What ever you use should give you some > authentication/authorization and hopefully encryption so that you don't > have just anyone rebooting systems. IPMI 1.5+ will do this, but > Wake-on-LAN does not. Ah! I see, I didn't realize that difference. Thanks, Skylar. -- Rahul From landman at scalableinformatics.com Sat Oct 3 09:55:37 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> Message-ID: <4AC78209.10006@scalableinformatics.com> Rahul Nabar wrote: > ipmitool seems to have a rich set of features for querying the BMC's > etc. but I didn't see many SOL sections. Is ipmitool what people use > to watch a redirected console as well or anything else? I couldn't > find any good SOL howtos or tutorials on google other than the > vendor-specific ones. ANy pointers? ipmitool .... shell ipmitool> sol activate -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From skylar at cs.earlham.edu Sat Oct 3 10:00:03 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> Message-ID: <4AC78313.8060904@cs.earlham.edu> Rahul Nabar wrote: > On Fri, Oct 2, 2009 at 10:13 PM, Skylar Thompson wrote: > > >>> THanks Joe! I've been reading about IPMI and also talking to my vendor >>> about it. Sure, our machines also have IPMI support! >>> >>> Question: What's the difference between SOL and IPMI. Is one a subset >>> of the other? >>> >>> >> SOL is provided by IPMI v1.5+, so it's a part of IPMI itself. >> >> > > Last two days I was playing with ipmitool to connect to the machines. > Is this the typical tool or do people have any other open source > sugesstions to use. > > ipmitool seems to have a rich set of features for querying the BMC's > etc. but I didn't see many SOL sections. Is ipmitool what people use > to watch a redirected console as well or anything else? 
I couldn't > find any good SOL howtos or tutorials on google other than the > vendor-specific ones. ANy pointers? > > I did find this one from a fellow Beowulfer but this seems quite dated now..... > > http://buttersideup.com/docs/howto/IPMI_on_Debian.html > > That's what we use. You'd do something like "ipmitool -a -H hostname -U username -I lanplus sol activate" will do the trick. -- -- Skylar Thompson (skylar@cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20091003/490c20c2/signature.bin From rpnabar at gmail.com Sat Oct 3 10:04:41 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <4AC781E1.7050705@scalableinformatics.com> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> <4AC781E1.7050705@scalableinformatics.com> Message-ID: On Sat, Oct 3, 2009 at 11:54 AM, Joe Landman wrote: > If I were building a cluster of anything more than 4 machines (not racks, > machines), I would be insisting upon IPMI 2.0 with a working SOL and kvm > over IP capability built in. Thanks for those tips Joe. I am already convinced by all the posts on the list that IPMI is a must. No other way. All you guys seem pretty unanimous about that much! > For the 250-300 machine system you are looking at, you *want* IPMI 2.0 with > KVM over IP. ?You *want* switched remotely accessible PDUs, for those times > when IPMI itself gets wedged (rarer these days, but it does still happen). > ?IMO you *want* this IPMI on a separate network. You *want* a serial > concentrator type system to provide a redundant path in the event of an IPMI > failure. ?Problems don't go away just because IPMI stopped working. ?You > *need* an inexpensive crash cart that just works, and plugs into your PDUs. I see, thanks for disabusing me of my notion of "ipmi" as one monolithic all-or-none creature. From what you write (and my online reading) it seems there are several discrete parts: IMPI 2.0 switched remotely accessible PDUs "serial concentrator type system " Correct me if I am wrong but these are all "options" and varying vendors and implementations will offer parts or all or none of these? Or is it that when one says "IPMI 2" it includes all these features. I did read online but these implementation seem vendor specific so its hard to translate jargon across vendors. e.g. for Dell they are called DRAC's etc. Finally, what's a"serial concentrator"? Isn't that the same as the SOL that Skylar was explaining to me? Or is that something different too? -- Rahul From landman at scalableinformatics.com Sat Oct 3 10:13:58 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? 
In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> <4AC781E1.7050705@scalableinformatics.com> Message-ID: <4AC78656.90301@scalableinformatics.com> Rahul Nabar wrote: > On Sat, Oct 3, 2009 at 11:54 AM, Joe Landman > wrote: > >> If I were building a cluster of anything more than 4 machines (not racks, >> machines), I would be insisting upon IPMI 2.0 with a working SOL and kvm >> over IP capability built in. > > Thanks for those tips Joe. I am already convinced by all the posts on > the list that IPMI is a must. No other way. All you guys seem pretty > unanimous about that much! > >> For the 250-300 machine system you are looking at, you *want* IPMI 2.0 with >> KVM over IP. You *want* switched remotely accessible PDUs, for those times >> when IPMI itself gets wedged (rarer these days, but it does still happen). >> IMO you *want* this IPMI on a separate network. You *want* a serial >> concentrator type system to provide a redundant path in the event of an IPMI >> failure. Problems don't go away just because IPMI stopped working. You >> *need* an inexpensive crash cart that just works, and plugs into your PDUs. > > I see, thanks for disabusing me of my notion of "ipmi" as one > monolithic all-or-none creature. From what you write (and my online > reading) it seems there are several discrete parts: > > IMPI 2.0 > switched remotely accessible PDUs > "serial concentrator type system " > > Correct me if I am wrong but these are all "options" and varying > vendors and implementations will offer parts or all or none of these? Yes. > Or is it that when one says "IPMI 2" it includes all these features. I IPMI 2.0 includes * local power control (on-off switch in software) * Serial-over-lan * system sensor inspection It *may* contain kvm over IP (the clusters we build do). > did read online but these implementation seem vendor specific so its > hard to translate jargon across vendors. e.g. for Dell they are called > DRAC's etc. IPMI 2.0 at minimum is a must. DRAC has levels which also provide kvm over IP, though at additional cost. > > Finally, what's a"serial concentrator"? Isn't that the same as the > SOL that Skylar was explaining to me? Or is that something different > too? Something different. A serial concentrator is a machine you can ssh into providing N serial ports. It is different than the IPMI SOL capability. It is a second non-IPMI management channel. For large systems, I'd recommend multiple administrative paths ... > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Sat Oct 3 10:18:02 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <4AC78313.8060904@cs.earlham.edu> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> <4AC78313.8060904@cs.earlham.edu> Message-ID: On Sat, Oct 3, 2009 at 12:00 PM, Skylar Thompson wrote: > Rahul Nabar wrote: >> On Fri, Oct 2, 2009 at 10:13 PM, Skylar Thompson wrote: >> >> >> > That's what we use. 
You'd do something like "ipmitool -a -H hostname -U > username -I lanplus sol activate" will do the trick. > > Thanks Skylar. I just found I have bigger problems. I thought I was done since ipmitool did a happy make; make install. But nope: ./src/ipmitool -I open chassis status Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory Error sending Chassis Status command I don't think I have the impi devices visible. From googling this seems a bigger project needing insertion of some kernel modules. There goes my weekend! :) -- Rahul From skylar at cs.earlham.edu Sat Oct 3 10:18:06 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> <4AC781E1.7050705@scalableinformatics.com> Message-ID: <4AC7874E.5030606@cs.earlham.edu> Rahul Nabar wrote: > I see, thanks for disabusing me of my notion of "ipmi" as one > monolithic all-or-none creature. From what you write (and my online > reading) it seems there are several discrete parts: > > IMPI 2.0 > switched remotely accessible PDUs > "serial concentrator type system " > These actually are different beasts. IPMI you'll find on a host motherboard. Switched PDUs tend to provide a telnet/ssh/web interface, but you should also make sure you can switch outlets using SNMP to make scripting easier. The serial concentrator is an appliance that has a bunch of serial ports that you can connect to serial ports on your systems. You'll ssh into the concentrator and be able to select a port to connect to. These are really nice for switches, so you can make disruptive changes without worrying about the network change cutting off your telnet or ssh session to the switch. > Correct me if I am wrong but these are all "options" and varying > vendors and implementations will offer parts or all or none of these? > Or is it that when one says "IPMI 2" it includes all these features. I > did read online but these implementation seem vendor specific so its > hard to translate jargon across vendors. e.g. for Dell they are called > DRAC's etc. > I think IPMI defines the way different components talk to each other, but it doesn't mandate that a given implementation use all the components in the specification. There's just mandates for how it'll authenticate, talk to sensors, connect to the serial port, etc if it chooses to provide those features. -- -- Skylar Thompson (skylar@cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20091003/ec6d2213/signature.bin From skylar at cs.earlham.edu Sat Oct 3 10:26:13 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? 
In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> <4AC78313.8060904@cs.earlham.edu> Message-ID: <4AC78935.2010906@cs.earlham.edu> Rahul Nabar wrote: > Thanks Skylar. I just found I have bigger problems. I thought I was > done since ipmitool did a happy make; make install. > > But nope: > > ./src/ipmitool -I open chassis status > Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: > No such file or directory > Error sending Chassis Status command > > > I don't think I have the impi devices visible. From googling this > seems a bigger project needing insertion of some kernel modules. There > goes my weekend! :) > Yeah. I've run into that problem too. You do need IPMI modules loaded if you're connecting locally over the IPMI bus. Here's the modules I see loaded on one of my RHEL5 Dell systems: ipmi_devintf 44753 0 ipmi_si 77453 0 ipmi_msghandler 72985 2 ipmi_devintf,ipmi_si If you can't get the IPMI devices working even after loading those modules, you might try looking at configuring your system's IPMI network interface manually. You should be able to do this during the boot process on any system (look for a device called "Service Processor" or "Baseboard Management Controller" after POST and before the OS boots). Some systems also have their own non-IPMI ways of configuring IPMI. If you're on Dell you can use OpenManage's omconfig command-line tool. Older x86 Sun systems like the v40z and v20z would let you key in the network information from the front panel, while newer Sun systems let you connect over a serial port to configure it. -- -- Skylar Thompson (skylar@cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20091003/48d4f3fd/signature.bin From tomislav.maric at gmx.com Sat Oct 3 10:41:28 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf Message-ID: <4AC78CC8.5060500@gmx.com> Hi everyone, I've finally gathered all the hardware I need for my home beowulf. I'm thinking of setting up RAID 5 for the /home partition (that's where my simulation data will be and RAID 1 for the system / partitions without the /boot. 1) Does this sound reasonable? 2) I want to put the /home at the beginning of the disks go get faster write/seek speeds, if the partitions are the same, software RAID doesn't care where they are? 3) I'll leave the /boot partition on one of the 3 disks and it will NOT be included in the RAID array, is this ok? 4) I've read about setting up parallel swaping via priority given to swap partitions in fstab, but also how it would be ok to create RAID 1 array of swap partitions for the HA of the cluster. What should I choose? I've gone through all the software raid how-tos, FAQs and similar, but they are not quite new (date at least 3 years) and there's no reference to clusters. Any pointers regarding this? Thank you in advance, Tomislav I'm starting with this: 3 x Asus P5Q-VM motherboards 2 x Intel Quad Core Q8200 2.33GHz (2 nodes with 4 cores) 1 x Intel Core 2 Duo E6300 2.6GHz (master node) 3 x Seagate Barracuda 320GB SATA 2 HDDs gigabyte Eth switch with 8 ports .... etc ... 
Best regards, Tomislav From skylar at cs.earlham.edu Sat Oct 3 10:51:57 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC78CC8.5060500@gmx.com> References: <4AC78CC8.5060500@gmx.com> Message-ID: <4AC78F3D.3030707@cs.earlham.edu> Tomislav Maric wrote: > Hi everyone, > > I've finally gathered all the hardware I need for my home beowulf. I'm > thinking of setting up RAID 5 for the /home partition (that's where my > simulation data will be and RAID 1 for the system / partitions without > the /boot. > > 1) Does this sound reasonable? > It depends on your workload. RAID5 is good for large sequential writes, but sucks at small sequential writes because for every write it has to do a read to compare parity. > 2) I want to put the /home at the beginning of the disks go get faster > write/seek speeds, if the partitions are the same, software RAID doesn't > care where they are? > I don't think this will buy you much performance. There probably is a measurable difference, but I don't think it's enough to worry about. > 3) I'll leave the /boot partition on one of the 3 disks and it will NOT > be included in the RAID array, is this ok? > Sure, but /boot is actually trivial to mirror. Just make sure your boot loader is on each disk in the mirror and that each disk is partitioned identically, and all you have to do if a drive dies is change the device you boot off of if a drive dies. > 4) I've read about setting up parallel swaping via priority given to > swap partitions in fstab, but also how it would be ok to create RAID 1 > array of swap partitions for the HA of the cluster. What should I choose? > Any swapping at all will kill performance. I would get enough RAM to make sure you don't swap. > I've gone through all the software raid how-tos, FAQs and similar, but > they are not quite new (date at least 3 years) and there's no reference > to clusters. Any pointers regarding this? If you're using a Red Hat-based distro, kickstart can handle software RAID. I don't know about other distros though. -- -- Skylar Thompson (skylar@cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20091003/53d2a3b3/signature.bin From hearnsj at googlemail.com Sat Oct 3 11:36:44 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> Message-ID: <9f8092cc0910031136t6587e79br6a08af7d001b3c85@mail.gmail.com> 2009/10/3 Rahul Nabar : > > > ipmitool seems to have a rich set of features for querying the BMC's > etc. but I didn't see many SOL sections. Is ipmitool what people use > to watch a redirected console as well or anything else? I couldn't > find any good SOL howtos or tutorials on google other than the > vendor-specific ones. ANy pointers? 
Remember that a) the system must be running a serial console, ie, in inittab there is a getty associated with the serial port b) the BIOS is set to redirect serial port to IPMI From Greg at keller.net Sat Oct 3 11:37:08 2009 From: Greg at keller.net (Greg Keller) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster In-Reply-To: <200910031718.n93HIlWD006256@bluewest.scyld.com> References: <200910031718.n93HIlWD006256@bluewest.scyld.com> Message-ID: <4E1364FA-3AC0-4135-BEC0-033E8505B278@Keller.net> > > Message: 6 > Date: Sat, 3 Oct 2009 12:18:02 -0500 > From: Rahul Nabar > Subject: Re: [Beowulf] recommendation on crash cart for a cluster > room: full cluster KVM is not an option I suppose? > To: Skylar Thompson > Cc: Beowulf Mailing List > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > On Sat, Oct 3, 2009 at 12:00 PM, Skylar Thompson > wrote: >> Rahul Nabar wrote: >>> On Fri, Oct 2, 2009 at 10:13 PM, Skylar Thompson >> > wrote: >>> >>> >>> >> That's what we use. You'd do something like "ipmitool -a -H >> hostname -U >> username -I lanplus sol activate" will do the trick. >> >> > > Thanks Skylar. I just found I have bigger problems. I thought I was > done since ipmitool did a happy make; make install. > > But nope: > > ./src/ipmitool -I open chassis status > Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: > No such file or directory > Error sending Chassis Status command > > > I don't think I have the impi devices visible. From googling this > seems a bigger project needing insertion of some kernel modules. There > goes my weekend! :) > > -- > Rahul > Rahul, Loading the modules is usually as easy as: /etc/init.d/ipmi start HTH, Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091003/1eca7112/attachment.html From a28427 at ua.pt Sat Oct 3 11:44:36 2009 From: a28427 at ua.pt (Tiago Marques) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] Nvidia FERMI/gt300 GPU In-Reply-To: References: <4AC50C37.9070301@cse.ucdavis.edu> Message-ID: On Fri, Oct 2, 2009 at 10:01 AM, Igor Kozin wrote: >> Not only CUDA and OpenCL, but also DirectX, DirectCompute, C++, and >> Fortran. From a programmer's point of view, it could be a major >> improvement, and the only thing which still kept people from using >> GPUs to run their code. > > all of the above but C++ can be used on the current hardware. > CUDA Fortran is already available in PGI 9.0-4 (public beta). > >> On a side note, it's funny to notice that this is probably the first >> GPU in history to be introduced by its manufacturer as a >> "supercomputer on a chip", rather than a graphics engine which will >> allow gamers to play their favorite RPS at never-reached resolutions >> and framerates. Reading some reviews, it seemed that the traditional >> audience of such events (gamers) were quite disappointed by not really >> seeing what the announcements could mean to them. >> After all, HPC is still a niche compared to the worldwide video games >> market, and it's impressive that NVIDIA decided to focus on this tiny >> fraction of its prospective buyers, rather than go for the usual >> my-vertex-pipeline-is-longer-than-yours. :) > > yes, indeed and such a strong skew towards HPC worries me. there is > almost no word how gamers are going to benefit from Fermi apart from > C++ support. 
in the mean time RV770 offers 2.7 TFlops SP and 1/4 of > that in DP which positions it pretty close to Fermi in that respect. > in RV770, quick integer operations are in 24-bit and i wonder if the > same still holds true for Fermi. Integers are 32 and 64 bit in Fermi. AMD just wants to trick you, running programs that try to reach those 2.7 TFLOPs (like Furmark, which AMD calls power viruses) results in the card throttling itself down to preserve a healthy function of the VRM. This has also been the case with the RV770, but in the RV870 it's done in hardware. When you reached top processing speed on the RV770 the cards would frequently shut down. It remains to be seen what they will do with Firestream cards, as I don't know whether the same happens there. It will if they quote similar TDPs. Best regards, Tiago Marques > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From a28427 at ua.pt Sat Oct 3 12:10:26 2009 From: a28427 at ua.pt (Tiago Marques) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: <200909091900.n89J07U6031683@bluewest.scyld.com> <4AA926B8.6060702@scalableinformatics.com> Message-ID: Hi Rahul, I implemented a custom NFS solution based on Gentoo Linux for a cluster some time ago, which has been going fine until now but it is a very small cluster. It's an 8 node machine which will be upgraded to 16 this year. Still, some codes used there write a lot to disk so the NFS link could be easily saturated without much effort. There are no funds for 10GbE, FC or Infiniband, so I decided to do NFS + local disk for all compute nodes. This was done mainly to have a way to keep all compute nodes updated with little effort and not to increase I/O performance. It's also easier to maintain than Lustre, so I went for it. It goes something like this: - NFS server is the entry node and has RAID 1. It stores the base install which is not bootable and a small copy of installation files that must be writable (/etc and /var) for each node, with the rest being bind mounted. I export those directories to the nodes. - Nodes boot a kernel image by PXE and mount the exported filesystem as / and then write some files (not much data) to /etc and /var at boot, the rest is read only with the exception of /tmp and /home (also some swap for safety reasons) which are running on a single SAS disk on the node. Typically scratch files run on either /home or /tmp, which keeps the pressure off the single link to the NFS server. I have dedicated a single GbE port on each blade to serve/access the NFS shares, leaving the other one for MPI, which we aren't using either way because it's too slow for the codes run there. - All configurations and user management are done in the base install which are then rsync'd to all the other installations' /etc and /var, which is a fast procedure by now and that can run on-the-fly without problems for the compute nodes. Backup is also easy, it's just a backup of the base install which is always in an "unbootable" state, with no redundant files. So far it has been working great and scaling nodes is very easy. I would say something like this is feasible for 300 nodes due to the lack of pressure put on the network.
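A minimal sketch of the kind of export/mount layout Tiago describes, for one node (hostnames, addresses and paths here are invented for illustration; the real tree will differ):

  # /etc/exports on the entry/NFS node (10.1.0.1): shared read-only base
  # install plus a small writable tree per compute node
  /srv/base                10.1.0.0/24(ro,no_root_squash,async)
  /srv/nodes/node01        10.1.0.0/24(rw,no_root_squash,async)

  # pxelinux append line handing node01 its read-only NFS root
  append ip=dhcp root=/dev/nfs nfsroot=10.1.0.1:/srv/base,ro

  # /etc/fstab as seen by node01: writable /etc and /var from its own tree,
  # scratch and swap on the local disk
  10.1.0.1:/srv/nodes/node01/etc  /etc   nfs   rw,nolock  0 0
  10.1.0.1:/srv/nodes/node01/var  /var   nfs   rw,nolock  0 0
  /dev/sda1                       /tmp   ext3  defaults   0 0
  /dev/sda2                       /home  ext3  defaults   0 0
  /dev/sda3                       none   swap  sw         0 0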
They only basically go load the executable file and shared libraries at the start of a job and that's it. I can provide the scrips I have set up to do this if you want to take a look at them. Best regards, Tiago Marques On Thu, Sep 24, 2009 at 10:54 PM, Rahul Nabar wrote: > On Thu, Sep 10, 2009 at 11:18 AM, Joe Landman > wrote: > >> >> root@dv4:~# mpirun -np 4 ./io-bm.exe -n 32 -f /data2/test/file -r -d ?-v > > In order to see how good (or bad) my current operating point is I was > trying to replicate your test. But what is "io-bm.exe"? Is that some > proprietary code or could I have it to run a similar test? > > -- > Rahul > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From skylar at cs.earlham.edu Sat Oct 3 12:52:52 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <9f8092cc0910031136t6587e79br6a08af7d001b3c85@mail.gmail.com> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> <9f8092cc0910031136t6587e79br6a08af7d001b3c85@mail.gmail.com> Message-ID: <4AC7AB94.7000909@cs.earlham.edu> John Hearns wrote: > 2009/10/3 Rahul Nabar : > >> ipmitool seems to have a rich set of features for querying the BMC's >> etc. but I didn't see many SOL sections. Is ipmitool what people use >> to watch a redirected console as well or anything else? I couldn't >> find any good SOL howtos or tutorials on google other than the >> vendor-specific ones. ANy pointers? >> > > Remember that > a) the system must be running a serial console, ie, in inittab there > is a getty associated with the serial port > b) the BIOS is set to redirect serial port to IPMI > Also, c) Your boot loader (e.g. grub) must be pointed at the serial console. d) Make sure the bit rates for a)-c) are set the same, and that the bit rates actually work. For instance, Dell BMCs claim to do up through 115kbps but ipmitool gives garbage at anything above 56.6kbps. -- -- Skylar Thompson (skylar@cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20091003/1bfef306/signature.bin From tomislav.maric at gmx.com Sat Oct 3 13:48:58 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC78F3D.3030707@cs.earlham.edu> References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> Message-ID: <4AC7B8BA.4040507@gmx.com> First of all: thank you very much for the advice, Skylar. :) So, all I need to do is to create the same partitions on three disks and set up a RAID 5 on /home since I'll be doing CFD simulations (long sequential writes) and use RAID 1 for other (system) partitions, to account for recovery of the system in case of disk failure because log writes are sequential and small in volume. I was reading about RAID 0, but I'm not sure how safe is to use it for storing computed data and how much speed would I get compared to RAID 5. 
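For the mdadm step itself, a sketch of the layout just described, assuming the three disks appear as /dev/sda, /dev/sdb and /dev/sdc with identical partition tables (the partition numbers are invented):

  # three-way RAID1 for the system, RAID5 across the third partitions for /home
  mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sda2 /dev/sdb2 /dev/sdc2
  mdadm --create /dev/md1 --level=5 --chunk=64 --raid-devices=3 /dev/sda3 /dev/sdb3 /dev/sdc3
  mkfs.ext3 /dev/md0
  mkfs.ext3 /dev/md1                                 # this one becomes /home
  mdadm --detail --scan >> /etc/mdadm/mdadm.conf     # assemble at boot
  cat /proc/mdstat                                   # watch the initial resync

RAID 5 needs at least three members, so /home spans all three disks here; the RAID 1 set is a three-way mirror, which costs capacity but survives two drive failures.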
Sorry for the totally newbish questions. I'm using Ubuntu, and after I install it, I'll try to configure the RAID manually. How do I make sure that the boot loader is on all disks? I mean, isn't RAID going to make the OS look at the /boot partition that's spread over 3 HDDs as a single mount point? Best regards, Tomislav Skylar Thompson wrote: > Tomislav Maric wrote: >> Hi everyone, >> >> I've finally gathered all the hardware I need for my home beowulf. I'm >> thinking of setting up RAID 5 for the /home partition (that's where my >> simulation data will be and RAID 1 for the system / partitions without >> the /boot. >> >> 1) Does this sound reasonable? >> > It depends on your workload. RAID5 is good for large sequential writes, > but sucks at small sequential writes because for every write it has to > do a read to compare parity. > >> 2) I want to put the /home at the beginning of the disks go get faster >> write/seek speeds, if the partitions are the same, software RAID doesn't >> care where they are? >> > > I don't think this will buy you much performance. There probably is a > measurable difference, but I don't think it's enough to worry about. > >> 3) I'll leave the /boot partition on one of the 3 disks and it will NOT >> be included in the RAID array, is this ok? >> > > Sure, but /boot is actually trivial to mirror. Just make sure your boot > loader is on each disk in the mirror and that each disk is partitioned > identically, and all you have to do if a drive dies is change the device > you boot off of if a drive dies. > >> 4) I've read about setting up parallel swaping via priority given to >> swap partitions in fstab, but also how it would be ok to create RAID 1 >> array of swap partitions for the HA of the cluster. What should I choose? >> > > Any swapping at all will kill performance. I would get enough RAM to > make sure you don't swap. > >> I've gone through all the software raid how-tos, FAQs and similar, but >> they are not quite new (date at least 3 years) and there's no reference >> to clusters. Any pointers regarding this? > > If you're using a Red Hat-based distro, kickstart can handle software > RAID. I don't know about other distros though. > From hahn at mcmaster.ca Sat Oct 3 14:01:42 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> <4AC781E1.7050705@scalableinformatics.com> Message-ID: > monolithic all-or-none creature. From what you write (and my online > reading) it seems there are several discrete parts: > > IMPI 2.0 > switched remotely accessible PDUs > "serial concentrator type system " I think Joe was going a bit belt-and-suspenders-and-suspenders here. ipmi normally provides out-of-band access to the system's I2C bus (which lets one power on/off, reset, and read the sensors.) it also normally provides some form of console access: usually this is by serial redirection (serial output can be redirected through the BMC and onto the net). independent of this (but usually also provided) is a bios feature which scrapes the video character array onto serial, thus giving access to bios output (and also technically independent but also provided is lan->bmc->serial->bios "keyboard" input.) 
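To make the redirection path just described concrete, a minimal sketch of the pieces on a typical Linux node plus the matching out-of-band commands (serial device, speed, BMC hostname and credentials are all assumptions; check which port your BIOS actually redirects):

  # on the node: console on the serial port the BMC redirects (often ttyS1)
  #  grub:    serial --unit=1 --speed=115200
  #           terminal --timeout=5 serial console
  #  kernel:  console=tty0 console=ttyS1,115200n8
  #  inittab: S1:2345:respawn:/sbin/agetty -L ttyS1 115200 vt100

  # from the admin host, everything below goes over the LAN to the BMC
  ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis power status
  ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis power cycle
  ipmitool -I lanplus -H node01-bmc -U admin -P secret sdr list
  ipmitool -I lanplus -H node01-bmc -U admin -P secret sol activate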
some people also configure systems with network-aware PDUs (power bars): APC is a common provider of these, and they provide a backup if IPMI doesn't work for some reason (network problems, hung BMC, etc). I do not personally think they are worthwhile because I rarely see IPMI problems - admittedly perhaps due to the fairly narrow range of parts my organization has. smart PDUs sometimes also provide power montoring, which might be useful, though I would actually prefer to see IPMI merely provide current sensors via I2C (in addition to volts). (having both socket power and motherboard power might be amusing, though, since you could calculate your PSU's efficiency - potentially even its load-efficiency curve. most vendors now quote 92-93% efficiency, but it's unclear what load range that's for...) finally, I think Joe is advocating another layer of backups - serial concentrators that would connect to the console serial port on each node to collect output if IPMI SOL isn't working. this is perhaps a matter of taste, but I don't find this terribly useful. I thought it would be for my first cluster, but never actually set it up. but again, that's because IPMI works well in my experience. I think Joe's right in the sense that you _don't_ want a cluster without working power control, and working post/console redirection is pretty valuable as well. both become more critical with larger cluster sizes, mainly because the chances grow of hitting a problem where you need power/reset/console control. whether you need backup systems past IPMI is unclear - depends on whether your IPMI works well. > Correct me if I am wrong but these are all "options" and varying > vendors and implementations will offer parts or all or none of these? > Or is it that when one says "IPMI 2" it includes all these features. I I interpreted Joe as saying that you need IPMI2 (remote power/reset/console) as well as backup mechanisms for IPMI failures. > hard to translate jargon across vendors. e.g. for Dell they are called > DRAC's etc. vendors provide IPMI features, usually with added proprietary nonsense. sometimes they sacrifice parts of IPMI in favor of the proprietary crap... > Finally, what's a"serial concentrator"? Isn't that the same as the > SOL that Skylar was explaining to me? Or is that something different > too? a network-accessible box into which many serial ports plug. some let you transform a serial port into a syslog stream, for instance. From skylar at cs.earlham.edu Sat Oct 3 14:06:03 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC7B8BA.4040507@gmx.com> References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> <4AC7B8BA.4040507@gmx.com> Message-ID: <4AC7BCBB.9050401@cs.earlham.edu> Tomislav Maric wrote: > First of all: thank you very much for the advice, Skylar. :) > > So, all I need to do is to create the same partitions on three disks and > set up a RAID 5 on /home since I'll be doing CFD simulations (long > sequential writes) and use RAID 1 for other (system) partitions, to > account for recovery of the system in case of disk failure because log > writes are sequential and small in volume. I was reading about RAID 0, > but I'm not sure how safe is to use it for storing computed data and how > much speed would I get compared to RAID 5. > Cool. If you have zippy processors the overhead of calculating parity probably isn't going to be too high, so RAID 5 and RAID 0 will be comparable. 
Sequential reads and writes are ideal for RAID 5. > Sorry for the totally newbish questions. > > I'm using Ubuntu, and after I install it, I'll try to configure the RAID > manually. How do I make sure that the boot loader is on all disks? I > mean, isn't RAID going to make the OS look at the /boot partition that's > spread over 3 HDDs as a single mount point? > > You'd mount /boot using the /dev/md? device, but point your boot loader at one of the underlying /dev/sd? or /dev/hd? devices. This means updates get mirrored, but the boot loader itself only looks at one of the mirrors. -- -- Skylar Thompson (skylar@cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20091003/0b015133/signature.bin From hahn at mcmaster.ca Sat Oct 3 14:11:03 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <4AC78935.2010906@cs.earlham.edu> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> <4AC78313.8060904@cs.earlham.edu> <4AC78935.2010906@cs.earlham.edu> Message-ID: >> ./src/ipmitool -I open chassis status >> Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: that's local ipmi, which to me is quite beside the point. ipmi is valuable primarily for its out-of-band-ness - that is, you can get to it when the host is off or wedged. > you're connecting locally over the IPMI bus. Here's the modules I see > loaded on one of my RHEL5 Dell systems: > > ipmi_devintf 44753 0 > ipmi_si 77453 0 > ipmi_msghandler 72985 2 ipmi_devintf,ipmi_si > > If you can't get the IPMI devices working even after loading those > modules, you might try looking at configuring your system's IPMI network > interface manually. You should be able to do this during the boot > process on any system (look for a device called "Service Processor" or > "Baseboard Management Controller" after POST and before the OS boots). > Some systems also have their own non-IPMI ways of configuring IPMI. If on our dl145's, we don't normally have local ipmi enabled at all, since it's inferior to remote. but modprobe ipmi_devintf;modprobe ipmi_si loads it, which can be useful for something like ipmitool user set password 3 foobar or ipmitool mc reset > you're on Dell you can use OpenManage's omconfig command-line tool. IMO, proprietary tools are evil. using them encourages vendors to diverge from open standards and hurts everyone, and in the long-term. demand standards and just say "no" to non-standards, especially when venors claim that they're supra-standard features. if we as computer people have learned anything at all from our own history, it is that open standards drive everything in the end. 
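As a sketch of what that local access is typically for -- getting the BMC onto the management network once, so that everything afterwards can be done out-of-band (the addresses, channel number and user ID are assumptions; they vary by vendor):

  modprobe ipmi_msghandler
  modprobe ipmi_devintf
  modprobe ipmi_si
  # the LAN channel is usually 1, sometimes 2; 'ipmitool lan print' shows it
  ipmitool lan set 1 ipsrc static
  ipmitool lan set 1 ipaddr 10.0.1.42
  ipmitool lan set 1 netmask 255.255.255.0
  ipmitool lan set 1 defgw ipaddr 10.0.1.1
  ipmitool user set password 2 somethingbetter
  ipmitool user enable 2
  ipmitool lan print 1        # confirm before leaving the machine room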
From tomislav.maric at gmx.com Sat Oct 3 14:13:13 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC7BCBB.9050401@cs.earlham.edu> References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> <4AC7B8BA.4040507@gmx.com> <4AC7BCBB.9050401@cs.earlham.edu> Message-ID: <4AC7BE69.4020708@gmx.com> Skylar Thompson wrote: > Tomislav Maric wrote: >> First of all: thank you very much for the advice, Skylar. :) >> >> So, all I need to do is to create the same partitions on three disks and >> set up a RAID 5 on /home since I'll be doing CFD simulations (long >> sequential writes) and use RAID 1 for other (system) partitions, to >> account for recovery of the system in case of disk failure because log >> writes are sequential and small in volume. I was reading about RAID 0, >> but I'm not sure how safe it is to use it for storing computed data and how >> much speed would I get compared to RAID 5. >> > > Cool. If you have zippy processors the overhead of calculating parity > probably isn't going to be too high, so RAID 5 and RAID 0 will be > comparable. Sequential reads and writes are ideal for RAID 5. > >> Sorry for the totally newbish questions. >> >> I'm using Ubuntu, and after I install it, I'll try to configure the RAID >> manually. How do I make sure that the boot loader is on all disks? I >> mean, isn't RAID going to make the OS look at the /boot partition that's >> spread over 3 HDDs as a single mount point? >> >> > > You'd mount /boot using the /dev/md? device, but point your boot loader > at one of the underlying /dev/sd? or /dev/hd? devices. This means > updates get mirrored, but the boot loader itself only looks at one of > the mirrors. > Thanks a lot! There's just one thing left: to make it work in real life.... :)) Best regards, Tomislav From a.travis at abdn.ac.uk Sat Oct 3 14:16:41 2009 From: a.travis at abdn.ac.uk (Tony Travis) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC78CC8.5060500@gmx.com> References: <4AC78CC8.5060500@gmx.com> Message-ID: <4AC7BF39.8030504@abdn.ac.uk> Tomislav Maric wrote: > Hi everyone, > > I've finally gathered all the hardware I need for my home beowulf. I'm > thinking of setting up RAID 5 for the /home partition (that's where my > simulation data will be) and RAID 1 for the system / partitions without > the /boot. > > 1) Does this sound reasonable? Hello, Tomislav. So far so good, but RAID5 trades capacity for performance... > 2) I want to put the /home at the beginning of the disks to get faster > write/seek speeds, if the partitions are the same, software RAID doesn't > care where they are? Actually, it does - Read about the 'stride' of an ext3 filesystem: http://wiki.centos.org/HowTos/Disk_Optimization You also need to be aware that RAID5 is not so good when writing to the disk, because parity has to be calculated and written to the disk. In fact this performance penalty has led to a campaign against RAID5: http://www.baarf.com/ > 3) I'll leave the /boot partition on one of the 3 disks and it will NOT > be included in the RAID array, is this ok? I think you'd be better off putting your system on one of your three disks, and making a RAID1 for /home from the other two. This will give you a performance gain because RAID1 writes do not involve generating parity, and you will decouple disk access between 'system' and /home. You can backup your system disk to the RAID1, or reinstall if it fails. 
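If you do go the md route for /home, the setup is only a few commands - a rough sketch, assuming the two disks set aside for /home are /dev/sdb and /dev/sdc with one partition each (device names are made up; adjust them to your system):

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
    mkfs.ext3 /dev/md0
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # so the array is assembled at boot
    mount /dev/md0 /home

The same mdadm --create with --level=5 --raid-devices=3 would give the RAID5 layout instead, and that is where the ext3 'stride' option mentioned above starts to matter.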
> 4) I've read about setting up parallel swapping via priority given to > swap partitions in fstab, but also how it would be ok to create RAID 1 > array of swap partitions for the HA of the cluster. What should I choose? You're going to have to decide between performance and HA: Sorry, you can't have both with only three disks. I've built systems with four SATA disks each with five partitions: /boot on ext2, swap on RAID1, / and /home on RAID5, /backup on RAID5 for much the same reasons you are considering doing it too. It works, and you do get HA, but performance is not good. I'm now simplifying and upgrading the systems by fitting a 3ware 8006-2 hardware RAID1 controller and two extra disks for /, swap and /backups. I'm using the original four disks with a single partition now as a software RAID5 for an ext3 /home filesystem using appropriate 'stride' and directory indexing. > I've gone through all the software raid how-tos, FAQs and similar, but > they are not quite new (date at least 3 years) and there's no reference > to clusters. Any pointers regarding this? One thing you need to bear in mind in relation to HA is that software RAID does not support hot-swap - That's why I chose the 3ware 8006-2, which is not very expensive. It doesn't automatically detect new disks and rebuild the RAID, but it does support hot-swap and has a web GUI: http://www.3ware.com/products/serial_ata8000.asp Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis@abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From hahn at mcmaster.ca Sat Oct 3 14:19:30 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC78F3D.3030707@cs.earlham.edu> References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> Message-ID: > It depends on your workload. RAID5 is good for large sequential writes, > but sucks at small sequential writes because for every write it has to > do a read to compare parity. well, it's bad at small random writes. small _sequential_ writes would be able to avoid reads for all but the first transaction. IMO, raid5 is often unappealing because raid10 avoids the write penalty, and raid6 is a lot more survivable. ultimately it depends on your taste in trading off performance, space efficiency, risk. >> 2) I want to put the /home at the beginning of the disks to get faster >> write/seek speeds, if the partitions are the same, software RAID doesn't >> care where they are? > I don't think this will buy you much performance. There probably is a > measurable difference, but I don't think it's enough to worry about. inner tracks are normally about 60% of the speed of outer tracks - that's for a normal density-optimized disk, not a latency-optimized (and therefore inherently small) "enterprise" disk. >> 3) I'll leave the /boot partition on one of the 3 disks and it will NOT >> be included in the RAID array, is this ok? > Sure, but /boot is actually trivial to mirror. Just make sure your boot > loader is on each disk in the mirror and that each disk is partitioned > identically, and all you have to do if a drive dies is change the device > you boot off of. or better yet, don't bother booting off the local disk. simply make your head/admin/master server reliable and net-boot. 
it's likely that nodes won't be functional without the master server anyway, and net-booting doesn't mean you can't use the local disk for swap/scratch/... >> 4) I've read about setting up parallel swapping via priority given to >> swap partitions in fstab, but also how it would be ok to create RAID 1 >> array of swap partitions for the HA of the cluster. What should I choose? > Any swapping at all will kill performance. I would get enough RAM to > make sure you don't swap. well, using swap space is harmless as long as you're not actually swapping _in_ any nontrivial amount. unless you have some very extreme parameters (uncheckpointable long jobs, flakey hardware or power, banking-level reliability expectations), I wouldn't bother raiding swap. From tomislav.maric at gmx.com Sat Oct 3 14:51:36 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC7BF39.8030504@abdn.ac.uk> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> Message-ID: <4AC7C768.5070705@gmx.com> Tony Travis wrote: > Tomislav Maric wrote: >> Hi everyone, >> >> I've finally gathered all the hardware I need for my home beowulf. I'm >> thinking of setting up RAID 5 for the /home partition (that's where my >> simulation data will be) and RAID 1 for the system / partitions without >> the /boot. >> >> 1) Does this sound reasonable? > > Hello, Tomislav. > > So far so good, but RAID5 trades capacity for performance... > >> 2) I want to put the /home at the beginning of the disks to get faster >> write/seek speeds, if the partitions are the same, software RAID doesn't >> care where they are? > > Actually, it does - Read about the 'stride' of an ext3 filesystem: > > http://wiki.centos.org/HowTos/Disk_Optimization OK, thanks, nice to know, I've been reading about it in Tanenbaum's book (this home cluster stuff really made me widen my horizon :) ). I've seen Centos mentioned a lot in connection to HPC, am I making a mistake with Ubuntu?? > > You also need to be aware that RAID5 is not so good when writing to the > disk, because parity has to be calculated and written to the disk. In > fact this performance penalty has led to a campaign against RAID5: > > http://www.baarf.com/ Okaay. :) There's a war going on against it. > >> 3) I'll leave the /boot partition on one of the 3 disks and it will NOT >> be included in the RAID array, is this ok? > > I think you'd be better off putting your system on one of your three > disks, and making a RAID1 for /home from the other two. This will give > you a performance gain because RAID1 writes do not involve generating > parity, and you will decouple disk access between 'system' and /home. > You can backup your system disk to the RAID1, or reinstall if it fails. Yeah, but isn't RAID1 used for disk mirroring? How then would I get any speedup? From what I've read so far, data striping is where I get the performance boost when using RAID: there's no real parallel writing/seeking applied to single data stream in RAID1... >> 4) I've read about setting up parallel swapping via priority given to >> swap partitions in fstab, but also how it would be ok to create RAID 1 >> array of swap partitions for the HA of the cluster. What should I choose? > > You're going to have to decide between performance and HA: Sorry, you > can't have both with only three disks. 
I've built systems with four SATA > disks each with five partitions: /boot on ext2, swap on RAID1, / and > /home on RAID5, /backup on RAID5 for much the same reasons you are > considering doing it too. It works, and you do get HA, but performance > is not good. I'm now simplifying and upgrading the systems by fitting a > 3ware 8006-2 hardware RAID1 controller and two extra disks for /, swap > and /backups. I'm using the original four disks with a single partition > now as a software RAID5 for an ext3 /home filesystem using appropriate > 'stride' and directory indexing. > >> I've gone through all the software raid how-tos, FAQs and similar, but >> they are not quite new (date at least 3 years) and there's no reference >> to clusters. Any pointers regarding this? > > One thing you need to bear in mind in relation to HA is that software > RAID does not support hot-swap - That's why I chose the 3ware 8006-2, > which is not very expensive. It doesn't automatically detect new disks > and rebuild the RAID, but it does support hot-swap and has a web GUI: > > http://www.3ware.com/products/serial_ata8000.asp > Thanks, my only problem is that I've reached my financial limits for my home project so I have to work with what I have. :) I'll definitely save this e-mail in my "importants" folder. Best, Tomislav > Bye, > > Tony. From tomislav.maric at gmx.com Sat Oct 3 15:02:27 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> Message-ID: <4AC7C9F3.4050301@gmx.com> Mark Hahn wrote: >> It depends on your workload. RAID5 is good for large sequential writes, >> but sucks at small sequential writes because for every write it has to >> do a read to compare parity. > > well, it's bad at small random writes. small _sequential_ writes would > be able to avoid reads for all but the first transaction. > > IMO, raid5 is often unappealing because raid10 avoids the write penalty, > and raid6 is a lot more survivable. ultimately it depends on your taste > in trading off performance, space efficiency, risk. I can't wrap my mind around the RAID config, because I'm using software RAID: it supports linear mode, and RAID level 0,1,4 or 5. Since it acts as if a partition is a device, this gives me way too much freedom (more to think about). :)) So, maybe the bold question to ask would be: what would be the best RAID config for 3 HDDS and a max 6 node HPC cluster? Should I just use RAID 1 for the system partitions on one disk, and RAID 0 for the simulation data placed on the same partitions on other two disks: after post-processing, the data is gone anyway... and with a good backup strategy, I don't have to worry about RAID0 not recovering from a disk fail... > >>> 2) I want to put the /home at the beginning of the disks go get faster >>> write/seek speeds, if the partitions are the same, software RAID doesn't >>> care where they are? >> I don't think this will buy you much performance. There probably is a >> measurable difference, but I don't think it's enough to worry about. > > inner tracks are normally about 60% of the speed of outer tracks - > that's for a normal density-optimized disk, not a latency-optimized > (and therefore inherently small) "enterprise" disk. > >>> 3) I'll leave the /boot partition on one of the 3 disks and it will NOT >>> be included in the RAID array, is this ok? >> Sure, but /boot is actually trivial to mirror. 
Just make sure your boot >> loader is on each disk in the mirror and that each disk is partitioned >> identically, and all you have to do if a drive dies is change the device >> you boot off of. > > or better yet, don't bother booting off the local disk. simply make your > head/admin/master server reliable and net-boot. it's likely that nodes > won't be functional without the master server anyway, and net-booting > doesn't mean you can't use the local disk for swap/scratch/... > Well, I want to configure the net boot for all diskless nodes and use the master node and its RAID for a performance gain with writing CFD simulation data against network communication and to be able to scale more easily. >>> 4) I've read about setting up parallel swapping via priority given to >>> swap partitions in fstab, but also how it would be ok to create RAID 1 >>> array of swap partitions for the HA of the cluster. What should I choose? >> Any swapping at all will kill performance. I would get enough RAM to >> make sure you don't swap. > > well, using swap space is harmless as long as you're not actually swapping > _in_ any nontrivial amount. > > unless you have some very extreme parameters (uncheckpointable long jobs, > flakey hardware or power, banking-level reliability expectations), > I wouldn't bother raiding swap. Excellent, thank you very much! Best regards, Tomislav From james.p.lux at jpl.nasa.gov Sat Oct 3 15:19:40 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC7B8BA.4040507@gmx.com> Message-ID: On 10/3/09 1:48 PM, "Tomislav Maric" wrote: > First of all: thank you very much for the advice, Skylar. :) > > So, all I need to do is to create the same partitions on three disks and > set up a RAID 5 on /home since I'll be doing CFD simulations (long > sequential writes) and use RAID 1 for other (system) partitions, to > account for recovery of the system in case of disk failure because log > writes are sequential and small in volume. I was reading about RAID 0, > but I'm not sure how safe it is to use it for storing computed data and how > much speed would I get compared to RAID 5. > > Sorry for the totally newbish questions. > So why are you using RAID at all on a home cluster, other than to gain experience with using/configuring/managing RAID? I would think that a decent batch backup scheme might actually be more cost effective. That is, if you stop your job every hour and mirror it, at most you lose an hour of computing time, which is presumably cheap compared to extra disk drives. For that matter, if you're doing software raid, the reduced performance from doing the raid might be an even bigger effect than the potential loss of an hour or two. From niftyompi at niftyegg.com Sat Oct 3 15:20:59 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <4AC2EEC6.4000902@scalableinformatics.com> Message-ID: <20091003222059.GB12320@compegg> On Wed, Sep 30, 2009 at 07:48:28AM -0500, Rahul Nabar wrote: > > > Thanks! This is exactly what I will shop for then. 
I used the term > "crash cart" in a more generic sense. I've seen crash-carts parked in > cluster rooms before and they do look unwieldy. What you suggest > seems a better option. It does help to have a single "crash cart"; i.e. Display, Keyboard, mouse on wheels to debug systems that do not respond as expected to console servers, KVM etc... Toss an old desktop computer on the cart and you can also connect to the networked KVM/terminal server and have a finger on the reset button or power cable at the same time. You can also have a small old style unswitched network hub and snoop on the net that a problem system is camped on to debug MAC address collisions, DHCP, kickstart and other odds and ends. When things are 'normal' you can stay at your desk 99% of the time but when that does not work you may need a real keyboard, mouse and display. Some errors never make it to the terminal server or log system but the video card may still show the last message as long as it is still powered on. YMMV. -- T o m M i t c h e l l Found me a new hat, now what? From a.travis at abdn.ac.uk Sat Oct 3 15:30:39 2009 From: a.travis at abdn.ac.uk (Tony Travis) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC7C768.5070705@gmx.com> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> Message-ID: <4AC7D08F.9040806@abdn.ac.uk> Tomislav Maric wrote: > [...] > I've seen Centos mentioned a lot in connection to HPC, am I making a > mistake with Ubuntu?? Hello, Tomislav. [Just let me put my flame-proof trousers on...] I know a lot of HPC people on this list use RH-based distros, but I use Ubuntu for HPC and I think it's very good. In fact I started a thread on the Ubuntu forums about EasyUbuntuClustering: http://ubuntuforums.org/showthread.php?t=1030849 I used RH6-9, and Fedora core2, but I switched to Debian and now Ubuntu. >> You also need to be aware that RAID5 is not so good when writing to the >> disk, because parity has to be calculated and written to the disk. In >> fact this performance penalty has led to a campaign against RAID5: >> >> http://www.baarf.com/ > > Okaay. :) There's a war going on against it. This campaign really made me think twice about what I was doing using RAID5. I lied to you (a bit) because I've bought more 3ware 8006-2's to put /home on RAID10 for our Beowulf servers. I must admit that hot-swap is one of the main reasons, but BAARF did come into it as well. >[...] > Yeah, but isn't RAID1 used for disk mirroring? How then would I get any > speedup? From what I've read so far, data striping is where I get the > performance boost when using RAID: there's no real parallel > writing/seeking applied to single data stream in RAID1... You don't get a speedup when writing, but you avoid the performance penalty of writing to RAID5. Writing to a RAID1 is essentially the same speed as writing to a single disk. However, you do get a performance benefit when reading from RAID1, and you decouple disk access between the 'system' disk and /home on the RAID1 if you follow my suggestion. On COTS motherboards the main bottleneck is the PCI bus anyway, not the SATA disks. Have you benchmarked the disk i/o performance that your hardware is capable of? > [...] > Thanks, my only problem is that I've reached my financial limits for my > home project so I have to work with what I have. :) I'll definitely save > this e-mail in my "importants" folder. I set out with similar ideas to yours, but in the end you get what you pay for. 
My four-disk software RAID systems work fine and they survive single disk failures without crashing or losing any data. However, we've had a couple of near double disk failures so I decided to put the system and /backups on hardware RAID1 instead. I'm still using software RAID5 for /home, and I think this is a reasonable compromise between cost, HA and performance. Good luck! Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis@abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From tomislav.maric at gmx.com Sat Oct 3 15:40:15 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: References: Message-ID: <4AC7D2CF.3010404@gmx.com> Lux, Jim (337C) wrote: > > > On 10/3/09 1:48 PM, "Tomislav Maric" wrote: > >> First of all: thank you very much for the advice, Skylar. :) >> >> So, all I need to do is to create the same partitions on three disks and >> set up a RAID 5 on /home since I'll be doing CFD simulations (long >> sequential writes) and use RAID 1 for other (system) partitions, to >> account for recovery of the system in case of disk failure because log >> writes are sequential and small in volume. I was reading about RAID 0, >> but I'm not sure how safe is to use it for storing computed data and how >> much speed would I get compared to RAID 5. >> >> Sorry for the totally newbish questions. >> > > So why are you using RAID at all on a home cluster, other than to gain > experience with using/configuring/managing RAID? I would think that a > decent batch backup scheme might actually be more cost effective. That is, > if you stop your job every hour and mirror it, at most you lose an hour of > computing time, which is presumably cheap compared to extra disk drives. > For that matter, if you're doing software raid, the reduced performance from > doing the raid might be even a bigger effect than the potential loss of an > hour or two. > I'm using RAID because I want to learn it and I'm learning it because this small cluster has a chance of scaling if I do the job right and the machine manages to serve it's purpose. :) Besides that, I'm running CFD simulations, and I would like to run decent ones. Every motherboard can hold 16GBs of RAM so I've figured that using RAID for parallelized writing of the data would enhance the speed of the max sized runs on the machine. Thanks, Tomislav From tomislav.maric at gmx.com Sat Oct 3 15:48:06 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:57 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC7D08F.9040806@abdn.ac.uk> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC7D08F.9040806@abdn.ac.uk> Message-ID: <4AC7D4A6.6060404@gmx.com> Tony Travis wrote: > Tomislav Maric wrote: >> [...] >> I've seen Centos mentioned a lot in connection to HPC, am I making a >> mistake with Ubuntu?? > > Hello, Tomislav. > > [Just let me put my flame-proof trousers on...] > > I know a lot of HPC people on this list use RH-based distros, but I use > Ubuntu for HPC and I think it's very good. In fact I started a thread on > the Ubuntu forums about EasyUbuntuClustering: > > http://ubuntuforums.org/showthread.php?t=1030849 > > I used RH6-9, and Fedora core2, but I switched to Debian and now Ubuntu. > So it can be done. 
:) Great, I love Ubuntu. :) >>> You also need to be aware that RAID5 is not so good when writing to the >>> disk, because parity has to be calculated and written to the disk. In >>> fact this performance penalty has lead to a campaign against RAID5: >>> >>> http://www.baarf.com/ >> Okaay. :) There's war going on against it. > > This campaign really made me think twice about what I was doing using > RAID5. I lied to you (a bit) because I've bought more 3ware 8006-2's to > put /home on RAID10 for our Beowulf servers. I must admit that hot-swap > is one of the main reasons, but BAARF did come into it as well. > >> [...] >> Yeah, but isn't RAID1 used for disk mirroring? How then would I get any >> speedup? From what I've read so far, data stripping is where I get the >> performance boost when using RAID: there's no real parallel >> writing/seeking applied to single data stream in RAID1... > > You don't get a speedup when writing, but you avoid the performance > penalty of writing to RAID5. Writing to a RAID1 is essentially the same > speed as writing to a single disk. However, you do get a performance > benefit when reading from RAID1, and you decouple disk access between > the 'system' disk and /home on the RAID1 if you follow my suggestion. > OK, thanks, I think I'm getting the hang of this... I guess I'll have to balance some goals and play around with the configurations. > On COTS motherboards the main bottleneck is the PCI bus anyway, not the > SATA disks. Have you benchmarked the disk i/o performance that your > hardware is capable of? > I'm assembling and configuring something like this for the first time ever. So the answer is: not yet, haven't thought of that, thank you very much for the advice. :) >> [...] >> Thanks, my only problem is that I've reached my financial limits for my >> home project so I have to work with what I have. :) I'll definitely save >> this e-mail in my "importants" folder. > > I set out with similar ideas to yours, but in the end you get what you > pay for. My four-disk software RAID systems work fine and they survive > single disk failures without crashing or losing any data. However, we've > had a couple of near double disk failures so I decided to put the system > and /backups on hardware RAID1 instead. I'm still using software RAID5 > for /home, and I think this is a reasonable compromise between cost, HA > and performance. > I figured it's something like that... hardware RAID will have to wait for a while, definitely... Thanks, Tomislav > Good luck! > > Tony. From hahn at mcmaster.ca Sat Oct 3 15:49:20 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC7BF39.8030504@abdn.ac.uk> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> Message-ID: > because parity has to be calculated and written to the disk. In fact this > performance penalty has lead to a campaign against RAID5: > > http://www.baarf.com/ bah. redundancy costs - your only choice is what kind of redundancy you want. raid5 has its issues, but really what's needed is redundancy managed by the filesystem. doing it block-level is what introduces the pain. >> 3) I'll leave the /boot partition on one of the 3 disks and it will NOT >> be included in the RAID array, is this ok? > > I think you'd be better off putting your system one one of your three disks, > and making a RAID1 for /home from the other two. This will give you a I disagree. 
for one, I strongly advise against partition-o-philia: the somewhat traditional practice of putting lots of separate partitions on a system. but for 3 disks, linux MD has a nice mode where each block is stored twice. so you get 2/3 space efficiency and potentially better performance. > It works, and you do get HA, but performance is not good. I'm now simplifying > and upgrading the systems by fitting a 3ware 8006-2 hardware RAID1 controller my experience with old 3ware boards was quite poor performance - much slower than MD. >= 95xx are respectable, though. > One thing you need to bear in mind in relation to HA is that software RAID > does not support hot-swap - That's why I chose the 3ware 8006-2, which is not > very expensive. It doesn't automatically detect new disks and rebuild the > RAID, but it does support hot-swap and has a web GUI: > > http://www.3ware.com/products/serial_ata8000.asp MD software raid certainly can support HS - it's mainly a feature of the controller. (this is not to say that most controllers do HS well.) From hahn at mcmaster.ca Sat Oct 3 15:55:34 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC7C768.5070705@gmx.com> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> Message-ID: > I've seen Centos mentioned a lot in connection to HPC, am I making a > mistake with Ubuntu?? distros differ mainly in their desktop decoration. for actually getting cluster-type work done, the distro is as close to irrelevant as imaginable. a matter of taste, really. it's not as if the distros provide the critical components - they merely repackage the kernel, libraries, middleware, utilities. wiring yourself to a distro does affect when you can or have to upgrade your system, though. consider, for instance, that there's no reason for a compute node to run whatever distro you choose for your login node. yes, you'd like to keep some synchronization in libc and middleware libraries. but you could configure a compute node with only the basics: kernel, shell, minimal /sbin utilities, single rc script, sshd (in addition to the probably few libraries needed by jobs - compiler runtimes, probably MPI, probably acml/mkl) From hahn at mcmaster.ca Sat Oct 3 16:01:27 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC7C9F3.4050301@gmx.com> References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> <4AC7C9F3.4050301@gmx.com> Message-ID: > So, maybe the bold question to ask would be: what would be the best RAID > config for 3 HDDs and a max 6 node HPC cluster? Should I just use RAID 1 do you mean for each node? > for the system partitions on one disk, and RAID 0 for the simulation > data placed on the same partitions on other two disks: after > post-processing, the data is gone anyway... and with a good backup > strategy, I don't have to worry about RAID0 not recovering from a disk > fail... you're going to back up a raid0? in any case, I think you should consider net-booting and using the node disks as a 3x raid0. if the local files are really transient, then your startup script can just reinitialize the local disks every boot. (which would leave you with a working node even after a disk failure or two!) that's assuming you need or can benefit from the capacity or bandwidth. >> or better yet, don't bother booting off the local disk. 
simply make your >> head/admin/master server reliable and net-boot. it's likely that nodes >> won't be functional without the master server anyway, and net-booting >> doesn't mean you can't use the local disk for swap/scratch/... > Well, I want to configure the net boot for all diskless nodes and use > the master node and its RAID for a performance gain with writing CFD > simulation data against network communication and to be able to scale > more easily. I'm not sure I parse that. net booting is orthogonal to whether or not you store data locally or over the net. but yes, gigabit is somewhat slower than a single modern disk, so local IO will win. From hahn at mcmaster.ca Sat Oct 3 16:06:26 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC7D08F.9040806@abdn.ac.uk> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC7D08F.9040806@abdn.ac.uk> Message-ID: > http://ubuntuforums.org/showthread.php?t=1030849 > > I used RH6-9, and Fedora core2, but I switched to Debian and now Ubuntu. sorry, I read that thread and https://wiki.ubuntu.com/EasyUbuntuClustering but nothing really jumped out at me besides a little mac-envy. could you describe what concrete issues you think make Ubuntu more appealing for clustering than other distros? From a.travis at abdn.ac.uk Sat Oct 3 16:29:38 2009 From: a.travis at abdn.ac.uk (Tony Travis) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC7D08F.9040806@abdn.ac.uk> Message-ID: <4AC7DE62.20600@abdn.ac.uk> Mark Hahn wrote: >> http://ubuntuforums.org/showthread.php?t=1030849 >> >> I used RH6-9, and Fedora core2, but I switched to Debian and now Ubuntu. > > sorry, I read that thread and https://wiki.ubuntu.com/EasyUbuntuClustering > but nothing really jumped out at me besides a little mac-envy. could you > describe what concrete issues you think make Ubuntu more appealing for > clustering than other distros? Hello, Mark. I think Debian is well established as a reliable server operating system, and Ubuntu *is* Debian. However, I'm not asserting that Debian or Ubuntu are better for clustering than other distros. The thread I started is about using Kerrighed under Ubuntu. Most of the Kerrighed development has been done under Mandriva, but I want to use Ubuntu. What I am saying is that Ubuntu users are interested in clustering. I use Ubuntu on our Beowulf for several reasons but, in particular, we use Bio-Linux: http://nebc.nox.ac.uk/nebc/tools/bio-linux I ported openMosix to Ubuntu 6.06, but I'm now planning to replace it with Kerrighed. Much of the bioinformatics and modelling work we do is embarrassingly parallel and SSI is appropriate for that. We access the system via NX desktops, and Ubuntu presents a much more polished user interface than Debian, which is why I switched from Debian to Ubuntu. There have been discussions here before about why the APT system is popular. As a matter of fact, I used APT for RPM when I used RH-based systems. Although I am, clearly, advocating Ubuntu for HPC it is more from the perspective of advocating HPC for Ubuntu. I understand that people using HPC in other areas would equally want to use e.g. CentOS for similar reasons that I want to use Ubuntu. What we all want to do is use HPC in a familiar environment that supports our work. 
I don't think *any* distro is 'best' for HPC, but many groups want to use a particular distro because their favourite software is available. Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis@abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From a.travis at abdn.ac.uk Sat Oct 3 16:46:41 2009 From: a.travis at abdn.ac.uk (Tony Travis) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> Message-ID: <4AC7E261.6040201@abdn.ac.uk> Mark Hahn wrote: >[...] > bah. > redundancy costs - your only choice is what kind of redundancy you want. > raid5 has its issues, but really what's needed is redundancy managed > by the filesystem. doing it block-level is what introduces the pain. Hello, Mark. You're right ;-) I've just been reading about the Hadoop file system. That recommends not using RAID at all, and just relying on configurable levels of redundancy in the filesystem. However, I've got a complete mental block about using the Hadoop filesystem 'shell'. I'll persevere, though because it does look very interesting. I've been trying to learn how to use it for ages! >>> 3) I'll leave the /boot partition on one of the 3 disks and it will NOT >>> be included in the RAID array, is this ok? >> I think you'd be better off putting your system one one of your three disks, >> and making a RAID1 for /home from the other two. This will give you a > > I disagree. for one, I strongly advice against partition-o-philia: > the somewhat traditional practice of putting lots of separate partitions > on a system. but for 3 disks, linux MD has a nice mode where each block > is stored twice. so you get 2/3 space efficiency and potentially better > performance. I'm not quite sure I understand what you're saying here: I have my system organised in such a way that my 3ware RAID's provide a bare-metal recovery backup using just three partitions: sda1 / 40GB sda2 swap 8GB sda3 /backup rest of disk The 'md' RAID5 is four disks with a single partition: sd[abcd]1 RAID autodetect How is this "partition-o-philia"? >[...] > MD software raid certainly can support HS - it's mainly a feature of the > controller. (this is not to say that most controllers do HS well.) I'm interested in that - It was only alpha release when I looked, and not recommended for production use. I did some experiments, and settled on hot-swap using the 3ware 8006-2's, with 'md' software RAID10. . Bye, Tony -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis@abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From a.travis at abdn.ac.uk Sat Oct 3 16:51:51 2009 From: a.travis at abdn.ac.uk (Tony Travis) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC7E261.6040201@abdn.ac.uk> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7E261.6040201@abdn.ac.uk> Message-ID: <4AC7E397.3090005@abdn.ac.uk> Tony Travis wrote: > [...] > The 'md' RAID5 is four disks with a single partition: > > sd[abcd]1 RAID autodetect Sorry, I mean: sd[bcde]1 RAID autodetect Bye, Tony. -- Dr. 
A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis@abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From a.travis at abdn.ac.uk Sat Oct 3 17:05:10 2009 From: a.travis at abdn.ac.uk (Tony Travis) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC7D08F.9040806@abdn.ac.uk> Message-ID: <4AC7E6B6.10906@abdn.ac.uk> Mark Hahn wrote: >> http://ubuntuforums.org/showthread.php?t=1030849 >> >> I used RH6-9, and Fedora core2, but I switched to Debian and now Ubuntu. > > sorry, I read that thread and https://wiki.ubuntu.com/EasyUbuntuClustering > but nothing really jumped out at me besides a little mac-envy. could you > describe what concrete issues you think make Ubuntu more appealing for > clustering than other distros? Hello, Mark. Read your comment again: I've already got an iMac, and it's not my wiki! The parts we've contributed to are: http://wiki.ubuntu.com/EasyUbuntuClustering?action=AttachFile&do=view&target=build.sh http://wiki.ubuntu.com/EasyUbuntuClustering/UbuntuKerrighedClusterGuide Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis@abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From tomislav.maric at gmx.com Sun Oct 4 04:08:14 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> <4AC7C9F3.4050301@gmx.com> Message-ID: <4AC8821E.60605@gmx.com> Mark Hahn wrote: >> So, maybe the bold question to ask would be: what would be the best RAID >> config for 3 HDDs and a max 6 node HPC cluster? Should I just use RAID 1 > > do you mean for each node? No, the nodes are diskless. I plan to scale the cluster and 1TB of storage is quite enough, even if I use 6 nodes, or 2x6 nodes. That's actually what I know from my small experience in running CFD codes on a 96-core cluster. That's the reason for thinking about RAID in the first place: create stable and good performing centralized storage for the future number of the nodes (i.e. 12 nodes with 4 cores and 16 GB of RAM each). > >> for the system partitions on one disk, and RAID 0 for the simulation >> data placed on the same partitions on other two disks: after >> post-processing, the data is gone anyway... and with a good backup >> strategy, I don't have to worry about RAID0 not recovering from a disk >> fail... > > you're going to back up a raid0? From your question, I sense it's a bad idea... :) I have no clue, this is the first time I'm doing this. > in any case, I think you should consider net-booting and using the node > disks as a 3x raid0. if the local files are really transient, then > your startup script can just reinitialize the local disks every boot. > (which would leave you with a working node even after a disk failure or two!) > that's assuming you need or can benefit from the capacity or bandwidth. 
> OK, I want a net boot because the nodes are diskless, the remaining question is how to use 3 HDDs with RAID to get a performance boost where I need it (like the /home where the data is written) and HA for the / dir., in case of disk fail. Is this the right way of thinking? >>> or better yet, don't bother booting of the local disk. simply make your >>> head/admin/master server reliable and net-boot. it's likley that nodes >>> won't be functional without the master server anyway, and net-booting >>> doesn't mean you can't use the local disk for swap/scratch/... >> Well, I want to configure the net boot for all diskless nodes and use >> the master node and it's RAID for a performance gains with writing CFD >> simulation data against network communication and to be able to scale >> more easily. > > I'm not sure I parse that. net booting is orthogonal to whether or not > you store data locally or over the net. but yes, gigabit is somewhat > slower than a single modern disk, so local IO will win. > Thanks. From tomislav.maric at gmx.com Sun Oct 4 04:08:27 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> Message-ID: <4AC8822B.6000603@gmx.com> Mark Hahn wrote: >> I've seen Centos mentioned a lot in connection to HPC, am I making a >> mistake with Ubuntu?? > > distros differ mainly in their desktop decoration. for actually > getting cluster-type work done, the distro is as close to irrelevant > as imaginable. a matter of taste, really. it's not as if the distros > provide the critical components - they merely repackage the kernel, > libraries, middleware, utilities. wiring yourself to a distro does > affect when you can or have to upgrade your system, though. > > consider, for instance, that there's no reason for a compute node to > run whatever distro you choose for your login node. yes, you'd like > to keep some synchronization in libc and middleware libraries. but > you could configure a compute node with only the basics: kernel, shell, > minimal /sbin utilities, single rc script, sshd (in addition to the > probably few libraries needed by jobs - compiler runtimes, probably MPI, > probably acml/mkl) > Thanks, that's exactly what I thought: the software components of a beowulf mentioned in rgb's book and on the net, are simply utilities that are used upon any linux distribution. From tomislav.maric at gmx.com Sun Oct 4 05:56:33 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] ATX on switch Message-ID: <4AC89B81.3040608@gmx.com> Hi again, where can I get an "on/off" and "reset" switch for ATX motherboard without buying and ripping apart a case? Should I make one? I'm planning on having up to 12 mobos: should I use software for powering them off and reseting them (i.e. over LAN), or make a bunch of switches and place them in a case? Any suggestions? I'm afraid to use a screwdriver and short circuit the pins. :) Thanks, Tomislav From a.travis at abdn.ac.uk Sun Oct 4 06:08:23 2009 From: a.travis at abdn.ac.uk (Tony Travis) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] ATX on switch In-Reply-To: <4AC89B81.3040608@gmx.com> References: <4AC89B81.3040608@gmx.com> Message-ID: <4AC89E47.6080407@abdn.ac.uk> Tomislav Maric wrote: > Hi again, > > where can I get an "on/off" and "reset" switch for ATX motherboard > without buying and ripping apart a case? 
> > Should I make one? I'm planning on having up to 12 mobos: should I use > software for powering them off and reseting them (i.e. over LAN), or > make a bunch of switches and place them in a case? Any suggestions? > > I'm afraid to use a screwdriver and short circuit the pins. :) Hello, Tomislav. You could set the BIOS to power state to "ON after AC loss". I do this because I have lots of COTS PC's in tower cases on industrial shelving in our computer room and it's awkward to go round pressing all the on/off switches ;-) Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis@abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From tomislav.maric at gmx.com Sun Oct 4 06:37:02 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] ATX on switch In-Reply-To: References: <4AC89B81.3040608@gmx.com> <4AC89E47.6080407@abdn.ac.uk> Message-ID: <4AC8A4FE.2060000@gmx.com> Hi Dmitri and Tony, thank you both very much for your answers. I'm on my way to rip out a switch from an old computer case so I can start the master node for the first time (hopefully without calling the firemen and an ambulance :) ). I'll setup the BIOS as you've told me: "ON after AC loss". Does that mean that next time it will be turned on(off) when the PS is turned on(off)? I know I only have 2 nodes and I can use the switches, but if this works, I'll be expanding it, so I want this problem covered (12 on/off switches... I'm too lazy for that). :) If I do this, I can power the master with PS on switch, and use the switch I have for adding another node with the same setup: set it up to run diskless, set it's BIOS the same way, turn it off, disconnect the switch and use it for adding another node. Is this OK? Best regards, Tomislav From a.travis at abdn.ac.uk Sun Oct 4 07:09:19 2009 From: a.travis at abdn.ac.uk (Tony Travis) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] ATX on switch In-Reply-To: <4AC8A4FE.2060000@gmx.com> References: <4AC89B81.3040608@gmx.com> <4AC89E47.6080407@abdn.ac.uk> <4AC8A4FE.2060000@gmx.com> Message-ID: <4AC8AC8F.3070904@abdn.ac.uk> Tomislav Maric wrote: > Hi Dmitri and Tony, > > thank you both very much for your answers. I'm on my way to rip out a > switch from an old computer case so I can start the master node for the > first time (hopefully without calling the firemen and an ambulance :) ). > > I'll setup the BIOS as you've told me: "ON after AC loss". Does that > mean that next time it will be turned on(off) when the PS is turned on(off)? Hello, Tomislav. Many PC BIOS's default to PSU in the "previous state" when AC power is restored, but you can force the state to "on" if you want, which means that when the AC power is restored, the BIOS will switch the PSU back on automatically. This is not desirable for a server, but is fine for a compute node - It's safer to leave a server off after AC power loss. > If I do this, I can power the master with PS on switch, and use the > switch I have for adding another node with the same setup: set it up to > run diskless, set it's BIOS the same way, turn it off, disconnect the > switch and use it for adding another node. Is this OK? Or, learn to short the pins with a screwdriver like the rest of us ;-) Bye, Tony. -- Dr. 
A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis@abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From tomislav.maric at gmx.com Sun Oct 4 08:08:08 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] ATX on switch In-Reply-To: <9f8092cc0910040710r233d9cd2nfc5d73dc1a55004e@mail.gmail.com> References: <4AC89B81.3040608@gmx.com> <4AC89E47.6080407@abdn.ac.uk> <4AC8A4FE.2060000@gmx.com> <9f8092cc0910040710r233d9cd2nfc5d73dc1a55004e@mail.gmail.com> Message-ID: <4AC8BA58.8030305@gmx.com> @John Hearns Thank you! I've been looking around and I knew there must be some kind of power supply for multiple motherboards. That's exactly what I'll need when the time comes for scaling. @Tony Travis Thanks, I've sawed off a switch from an old box. :) It's doing the job so far. There were no flames, smoke or electricity bursts to the eyes. :) I have another problem with the "chassis intrusion" detection. The jumper is in default place on the mobo, but I still get the error and my system gets halted. I've been going through the post on ASUS forum pages, and I don't like the answers. Has anyone here had any experience with this? Best regards, Tomislav From tomislav.maric at gmx.com Sun Oct 4 08:10:36 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <9f8092cc0910040715r6745bc21wdcbc0c9055ce8e8e@mail.gmail.com> References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> <4AC7C9F3.4050301@gmx.com> <4AC8821E.60605@gmx.com> <9f8092cc0910040715r6745bc21wdcbc0c9055ce8e8e@mail.gmail.com> Message-ID: <4AC8BAEC.20305@gmx.com> John Hearns wrote: > 2009/10/4 Tomislav Maric : >> >> No, the nodes are diskless. I plan to scale the cluster and 1TB of >> storage is quite enough, even if I use 6 nodes, or 2x6 nodes. That's >> actually what I know from my small experience in running CFD codes on a >> 96-core cluster. > 1 Tbyte? Are you sure... depends on your workload of course, but I > would plan for a bit more! Yes, definitely, I'm removing the results after postprocessing, and I'm the only user. :) For three nodes, it will be more than enough, but after scaling and with adding more users, I'll definitely need more. > > That's the reason for thinking about RAID in the first >> place: create stable and good performing centralized storage for the >> future number of the nodes (i.e. 12 nodes with 4 cores and 16 GB of RAM >> each). > That's a good policy. > You should be looking at running one benchmark case, and getting as > much information from it as possible - ie. how often it writes a > solution file, how big that file is, and how long it takes. That's the plan, I just need to get it to work first. ;) > > It might definitely be worth looking at a parallel filesystem also - > ie. keep your main storage on a 'conventional' RAID and have a few > Tbytes of scratch (temporary) storage on the faster filesystem for the > solution files. OK, thank you very much, I'll keep that in mind: I'll leave some space on the disks for that. 
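On the benchmarking point above: a crude dd test against whatever array ends up holding /home is enough to get the baseline numbers - a rough sketch only, with made-up sizes and paths; the test file should be a few times larger than the machine's RAM so the page cache doesn't flatter the result:

    # streaming write; conv=fdatasync makes dd flush to disk before reporting the rate
    dd if=/dev/zero of=/home/ddtest bs=1M count=8192 conv=fdatasync

    # drop the page cache (as root), then streaming read
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/home/ddtest of=/dev/null bs=1M
    rm /home/ddtest

bonnie++ or iozone give more detail, but this already shows whether the disks or the PCI bus are the bottleneck.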
Best regards, Tomislav From jorg.sassmannshausen at strath.ac.uk Sat Oct 3 12:18:43 2009 From: jorg.sassmannshausen at strath.ac.uk (Jörg Saßmannshausen) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] Re: RAID for home beowulf In-Reply-To: <200910031845.n93IjXuO008654@bluewest.scyld.com> References: <200910031845.n93IjXuO008654@bluewest.scyld.com> Message-ID: <200910032018.43712.jorg.sassmannshausen@strath.ac.uk> Hi Tomislav, I agree with what Skylar wrote. However, ask yourself what are you going to do with the cluster? For example, I am doing quite a lot of molecular modelling, which requires plenty of RAM and also scratch space. So for the machines at the old University, I set up /boot and / as RAID1. Why? Failover is the answer. In case one disc dies, I have a degraded, but working machine and the next possible time I can sort out the broken hdd. Actually, I had to do that a few times as somehow the IDE hdd seem to be a bit dodgy and broke quickly. Fortunately, within one day everything was back in working order (I swapped both discs on that occasion, so I had to mirror twice). There is a HowTo to set up a failover boot as well. ( http://www200.pair.com/mecham/raid/raid1.html ) Also, if what you are doing needs plenty of scratch space, I would recommend a RAID0 and xfs. Best to do that as a hardware raid as it is faster than a software raid. I hope that helps a bit. All the best Jörg On Saturday 03 October 2009, beowulf-request@beowulf.org wrote: > Message: 2 > Date: Sat, 03 Oct 2009 19:41:28 +0200 > From: Tomislav Maric > Subject: [Beowulf] RAID for home beowulf > To: beowulf@beowulf.org > Message-ID: <4AC78CC8.5060500@gmx.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Hi everyone, > > I've finally gathered all the hardware I need for my home beowulf. I'm > thinking of setting up RAID 5 for the /home partition (that's where my > simulation data will be) and RAID 1 for the system / partitions without > the /boot. > > 1) Does this sound reasonable? > 2) I want to put the /home at the beginning of the disks to get faster > write/seek speeds, if the partitions are the same, software RAID doesn't > care where they are? > 3) I'll leave the /boot partition on one of the 3 disks and it will NOT > be included in the RAID array, is this ok? > 4) I've read about setting up parallel swapping via priority given to > swap partitions in fstab, but also how it would be ok to create RAID 1 > array of swap partitions for the HA of the cluster. What should I choose? > > I've gone through all the software raid how-tos, FAQs and similar, but > they are not quite new (date at least 3 years) and there's no reference > to clusters. Any pointers regarding this? > > Thank you in advance, > Tomislav > > I'm starting with this: > > 3 x Asus P5Q-VM motherboards > 2 x Intel Quad Core Q8200 2.33GHz (2 nodes with 4 cores) > 1 x Intel Core 2 Duo E6300 2.6GHz (master node) > 3 x Seagate Barracuda 320GB SATA 2 HDDs > > gigabit Eth switch with 8 ports .... etc ... > > Best regards, > Tomislav -- ************************************************************* Jörg Saßmannshausen Research Fellow University of Strathclyde Department of Pure and Applied Chemistry 295 Cathedral St. Glasgow G1 1XL email: jorg.sassmannshausen@strath.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. 
See http://www.gnu.org/philosophy/no-word-attachments.html From dmitri.chubarov at gmail.com Sun Oct 4 06:23:08 2009 From: dmitri.chubarov at gmail.com (Dmitri Chubarov) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] ATX on switch In-Reply-To: <4AC89E47.6080407@abdn.ac.uk> References: <4AC89B81.3040608@gmx.com> <4AC89E47.6080407@abdn.ac.uk> Message-ID: Hello, Tomislav, Anthony is right in pointing out that the "On after AC loss" or a similar feature would help on multiple occasions of sudden loss of power, however a switch would probably be handy in your setup. Basically any microswitch from an electronics shop would do. It might be more difficult to attach the wires to the pinout on the motherboards. Also note that in most ATX cases the wires are run through a ferrite bead to reduce high frequency interference. Best regards, Dima On Sun, Oct 4, 2009 at 8:08 PM, Tony Travis wrote: > Tomislav Maric wrote: > >> Hi again, >> >> where can I get an "on/off" and "reset" switch for ATX motherboard >> without buying and ripping apart a case? >> >> Should I make one? I'm planning on having up to 12 mobos: should I use >> software for powering them off and resetting them (i.e. over LAN), or >> make a bunch of switches and place them in a case? Any suggestions? >> >> I'm afraid to use a screwdriver and short circuit the pins. :) >> > > Hello, Tomislav. > > You could set the BIOS to power state to "ON after AC loss". > > I do this because I have lots of COTS PC's in tower cases on industrial > shelving in our computer room and it's awkward to go round pressing all the > on/off switches ;-) > > Bye, > > Tony. > -- > Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition > and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK > tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk > mailto:a.travis@abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From tomislav.maric at gmx.com Sun Oct 4 09:42:26 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <9f8092cc0910040836sb2db253y4b7d60f909b7b6c3@mail.gmail.com> References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> <4AC7C9F3.4050301@gmx.com> <4AC8821E.60605@gmx.com> <9f8092cc0910040715r6745bc21wdcbc0c9055ce8e8e@mail.gmail.com> <4AC8BAEC.20305@gmx.com> <9f8092cc0910040836sb2db253y4b7d60f909b7b6c3@mail.gmail.com> Message-ID: <4AC8D072.30005@gmx.com> John Hearns wrote: > 2009/10/4 Tomislav Maric : >> J >> Yes, definitely, I'm removing the results after postprocessing, and I'm >> the only user. :) > > Aha. I see.... Just wait for those other users to come along. will > they remove the files immediately after postprocessing? Hmmm??? My > advice - buy a cattle prod. > > Seriously though, I think you have the deletion of files in your > scripts. Just make sure everyone else uses the scripts. > How about though if you want to re-run a case? > I've told you in another response: I'll definitely expand the storage, but right now I'm just concerned at making the machine work properly. 
:) If there will ever be more users, then I'll have to expand the storage. In the mean time I have the time for exploring parallel file systems and similar stuff to speed up the IO. :) For my own use: I have pyFoam scripts that clear the unnecessary result data while keeping the input for the simulations intact. :) Anyway, I'll worry about this one when I come to it. Thank you very much for the advice, I know that the storage operation is going to change the way the machine will work depending on the way it's used. From tomislav.maric at gmx.com Sun Oct 4 09:44:40 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] ATX on switch In-Reply-To: <9f8092cc0910040834u7e8e01bdnaec4e53f11d33ad6@mail.gmail.com> References: <4AC89B81.3040608@gmx.com> <4AC89E47.6080407@abdn.ac.uk> <4AC8A4FE.2060000@gmx.com> <9f8092cc0910040710r233d9cd2nfc5d73dc1a55004e@mail.gmail.com> <4AC8BA58.8030305@gmx.com> <9f8092cc0910040834u7e8e01bdnaec4e53f11d33ad6@mail.gmail.com> Message-ID: <4AC8D0F8.6070508@gmx.com> John Hearns wrote: > 2009/10/4 Tomislav Maric : >> @John Hearns >> Thank you! I've been looking around and I new there must be some kind of >> power supply for multiple motherboards. That's exactly what I'll need >> when the time comes for scaling. > > Tomislav, this is not a true power supply for multiple motherboards. > There was a company in the UK, Workstations UK, which used to build > clusters from multipl emotherboards in an enclosure, which had a shard > power supply. > These days people buy blade enclosures! > > However, do please use a mains PDU with these IEC plugs for any installation. > Plugging systems into many, many wall outlets is ugly. > > > Re. the chassis intrusion, this simply MUST be an option in your BIOS. > My advice, and this goes for anyone, take a monitor and spend a half > an hour going through all BIOS options. > I just spent half an hour in BIOS. :) Still, I couldn't get to it before I've reset the RTC, for some strange reason I got the chassis error. Now it's solved. Thanks. I'll use the PDU as soon as I gather up some money. ;) Still a student. :) From tomislav.maric at gmx.com Sun Oct 4 09:50:12 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] Re: RAID for home beowulf In-Reply-To: <200910032018.43712.jorg.sassmannshausen@strath.ac.uk> References: <200910031845.n93IjXuO008654@bluewest.scyld.com> <200910032018.43712.jorg.sassmannshausen@strath.ac.uk> Message-ID: <4AC8D244.8020302@gmx.com> Hi J?rg, thanks for the info. I'm converging to the solution regarding RAID now. Thank you a lot for the link I'll be needing it. :) Well, I don't need too much scratch space, the important part of the disk is the /home with the results. What file system should I use for it, ext3? Best regards, Tomislav J?rg Sa?mannshausen wrote: > Hi Tomislav > > I agree with what Skylar wrote. However, ask yourself what are you going to do > with the cluster? > For example, I am doing quite a lot of molecular modelling, which requires > plenty of RAM and also scratch space. > So for the machines at the old University, I set up /boot and / as RAID1. Why? > Failover is the answere. In case one disc dies, I have degraded, but working > machine and the next possible time I can sort out the broken hdd. Actually, I > had to do that a few times as somehow the IDE hdd seem to be a bit dogdy and > broke quickly. 
Fortunately, within one day everything was back in working > order (I swapped both discs in that occassion, so I had to mirror twice). > There is a HowTo setup a failover boot as well. > ( http://www200.pair.com/mecham/raid/raid1.html ) > > Also, if what you are doing needs plenty of scratch space, I would recommend a > RAID0 and xfs. Best to do that as a hardware raid as it is faster than a > software raid. > > I hope that helps a bit. > > All the best > > J?rg > > > Am Samstag 03 Oktober 2009 schrieb beowulf-request@beowulf.org: >> Message: 2 >> Date: Sat, 03 Oct 2009 19:41:28 +0200 >> From: Tomislav Maric >> Subject: [Beowulf] RAID for home beowulf >> To: beowulf@beowulf.org >> Message-ID: <4AC78CC8.5060500@gmx.com> >> Content-Type: text/plain; charset=ISO-8859-1 >> >> Hi everyone, >> >> I've finally gathered all the hardware I need for my home beowulf. I'm >> thinking of setting up RAID 5 for the /home partition (that's where my >> simulation data will be and RAID 1 for the system / partitions without >> the /boot. >> >> 1) Does this sound reasonable? >> 2) I want to put the /home at the beginning of the disks go get faster >> write/seek speeds, if the partitions are the same, software RAID doesn't >> care where they are? >> 3) I'll leave the /boot partition on one of the 3 disks and it will NOT >> be included in the RAID array, is this ok? >> 4) I've read about setting up parallel swaping via priority given to >> swap partitions in fstab, but also how it would be ok to create RAID 1 >> array of swap partitions for the HA of the cluster. What should I choose? >> >> I've gone through all the software raid how-tos, FAQs and similar, but >> they are not quite new (date at least 3 years) and there's no reference >> to clusters. Any pointers regarding this? >> >> Thank you in advance, >> Tomislav >> >> I'm starting with this: >> >> 3 x Asus P5Q-VM motherboards >> 2 x Intel Quad Core Q8200 2.33GHz (2 nodes with 4 cores) >> 1 x Intel Core 2 Duo E6300 2.6GHz (master node) >> 3 x Seagate Barracuda 320GB SATA 2 HDDs >> >> gigabyte Eth switch with 8 ports .... etc ... >> >> Best regards, >> Tomislav From tomislav.maric at gmx.com Sun Oct 4 16:04:26 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> <4AC7C9F3.4050301@gmx.com> <4AC8821E.60605@gmx.com> Message-ID: <4AC929FA.6060201@gmx.com> Mark Hahn wrote: >>>> So, maybe the bold question to ask would be: what would be the best RAID >>>> config for 3 HDDS and a max 6 node HPC cluster? Should I just use RAID 1 >>> do you mean for each node? >> No, the nodes are diskless. I plan to scale the cluster and 1TB of >> storage is quite enough, even if I use 6 nodes, or 2x6 nodes. That's >> actually what I know from my small experience in running CFD codes on 96 >> cores cluster. That's the reason for thinking about RAID in the first >> place: create stable and good performing centralized storage for the >> future number of the nodes (i.e. 12 nodes with 4 cores and 16 GB of RAM >> each). > > perhaps I'm confusing multiple threads, but didn't you say earlier > that your workload was disk-intensive? from that I'd assume you'd > want the speedup you'd get from having each node do IO to local disks. > No, I've said that I have centralized storage and I want to make it fast and avaliable. 
:) And I've said that this is the first time I'm doing this and that didn't benchmark it yet, it's a two node machnine, that will be scaled after benchmarking if it shows promise. :) >>>> for the system partitions on one disk, and RAID 0 for the simulation >>>> data placed on the same partitions on other two disks: after >>>> post-processing, the data is gone anyway... and with a good backup >>>> strategy, I don't have to worry about RAID0 not recovering from a disk >>>> fail... >>> you're going to back up a raid0? >>> From your question, I sense it's a bad idea... :) I have no clue, this >> is the first time I'm doing this. > > well, using raid0 is a declaration that the files are purely transient > (have no meaning or value when the job or job-step is over.) so backing > them up is somewhat strange. or from the other direction: backing up > means that the files are valuable, and you don't want to use raid0 for > that. (unless, I suppose, the files are read-only...) > Thank you for the info, I only know about RAID what I was able to read from the short and out of date how-tos. I'm only thinking about it through the information I've got so far: RAID0 for speed, RAID1 for mirroring, RAID5 a hybrid that causes much discussion. That's it, so far. What does it mean that they don't have any value? They are at the disk and I'm the sole user for now, and after I've done with them, I can dispose of them and leave the processed simulation data on a safe place. I admit I'm getting lost here. >>> in any case, I think you should consider net-booting and using the node >>> disks as a 3x raid0. if the local files are really transient, then >>> your startup script can just reinitialize the local disks every boot. >>> (which would leave you with a working node even after a disk failure or two!) >>> that's assuming you need or can benefit from the capacity or bandwidth. >>> >> OK, I want a net boot because the nodes are diskless, the remaining >> question is how to use 3 HDDs with RAID to get a performance boost where > > since you're talking about IO over the net, which is gigabit, right, > it hardly matters. gigabit is 100 MB/s at best, and that's less than > one disk's worth of throughput. 3-disk raid10 will give you 50% space > efficiency but fast writes; raid5 will give you 66% space but relatively > slow writes. OK, that's correct. I guess I'll have to see how the writing is going to take place when a bunch of nodes (maybe 6 of them) send a lot of data (even over GigEth) to the master node. I was thinking ahead, with the RAID or whatever other option for expanding the storage, making it fast (parallel file system?) and reliable. I guess now that you've explained it, I'll have no problems with a few nodes, since the eth is the bottleneck, as it will probably always be. I just wanted to make sure that the nodes don't wait for the master to write down the data and that my OS and other installation is safe from disk failure. That's the strategy, but obviously my knowledge can't follow it at all. :) > >> I need it (like the /home where the data is written) and HA for the / >> dir., in case of disk fail. Is this the right way of thinking? > > you should always strive for simpler systems. don't introduce differences > in config unless they're clearly necessary. don't partition disks unless > necessary. > > How do you mean differences in config? I'm configuring the master, and the other nodes are to be diskless.. I have separated these partitions: /swap /boot / /var and /home. Is this ok? 
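A rough way to check where that network-versus-disk bottleneck actually sits, before settling on a RAID level -- the hostname, path and transfer size below are placeholders:

    iperf -s                     # on the master: start a throughput listener
    iperf -c master              # on a node: reports achievable TCP throughput over the GigE
    # raw write rate of the array on the master, bypassing the page cache:
    dd if=/dev/zero of=/home/ddtest bs=1M count=4096 oflag=direct
    rm /home/ddtest

If the iperf number comes in well below the dd number, the network is the limit and the finer points of RAID write speed matter less.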
Thank you very much for the advice. Best regards, Tomislav >>>>> or better yet, don't bother booting of the local disk. simply make your >>>>> head/admin/master server reliable and net-boot. it's likley that nodes >>>>> won't be functional without the master server anyway, and net-booting >>>>> doesn't mean you can't use the local disk for swap/scratch/... >>>> Well, I want to configure the net boot for all diskless nodes and use >>>> the master node and it's RAID for a performance gains with writing CFD >>>> simulation data against network communication and to be able to scale >>>> more easily. >>> I'm not sure I parse that. net booting is orthogonal to whether or not >>> you store data locally or over the net. but yes, gigabit is somewhat >>> slower than a single modern disk, so local IO will win. >>> >> Thanks. >> >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > From tomislav.maric at gmx.com Sun Oct 4 16:07:16 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> Message-ID: <4AC92AA4.4080008@gmx.com> Mark Hahn wrote: >>>> I've seen Centos mentioned a lot in connection to HPC, am I making a >>>> mistake with Ubuntu?? >>> distros differ mainly in their desktop decoration. for actually >>> getting cluster-type work done, the distro is as close to irrelevant >>> as imaginable. a matter of taste, really. it's not as if the distros >>> provide the critical components - they merely repackage the kernel, >>> libraries, middleware, utilities. wiring yourself to a distro does >>> affect when you can or have to upgrade your system, though. >>> >>> consider, for instance, that there's no reason for a compute node to >>> run whatever distro you choose for your login node. yes, you'd like >>> to keep some synchronization in libc and middleware libraries. but >>> you could configure a compute node with only the basics: kernel, shell, >>> minimal /sbin utilities, single rc script, sshd (in addition to the >>> probably few libraries needed by jobs - compiler runtimes, probably MPI, >>> probably acml/mkl) >>> >> Thanks, that's exactly what I thought: the software components of a >> beowulf mentioned in rgb's book and on the net, are simply utilities >> that are used upon any linux distribution. > > that's not actually what I meant. yes, of course, a beowulf can be > based on any distro you like. but my point was actually that there's > no reason to have any distro on a compute node. what needs to be on > a compute node is quite minimal - so much so that it could be managed > outside of any distro. it's not as if the distros do any hard work - > they just recompile code managed by other people. and you can get that > code yourself. a compute node really only needs a dozen or so packages > on it to support running your job. > OK thanks, I understand. Where can I learn to do that? There are many diskless nodes how-tos and, as in RAID they seem to be a bit out of date, and confusing for someone with limited experience, like me. I would be really grateful for a point in the right direction. 
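One rough recipe for that kind of minimal, distro-light diskless node, assuming the master runs DHCP/TFTP (dnsmasq in this sketch) and exports a root tree over NFS; every path, address and option below is illustrative rather than a recommendation from the thread:

    mkdir -p /srv/nfsroot /srv/tftp
    # populate /srv/nfsroot with the dozen-odd pieces mentioned above:
    # kernel modules, libc, a shell, sshd, an rc script, MPI and compiler runtimes
    echo '/srv/nfsroot 192.168.1.0/24(ro,no_root_squash,async)' >> /etc/exports
    exportfs -ra
    cp /usr/lib/syslinux/pxelinux.0 /srv/tftp/      # location varies by distro
    # dnsmasq.conf entries for DHCP + TFTP on the node network:
    #   dhcp-range=192.168.1.100,192.168.1.200,12h
    #   dhcp-boot=pxelinux.0
    #   enable-tftp
    #   tftp-root=/srv/tftp
    # pxelinux.cfg/default, with a kernel (or initramfs) built for NFS root:
    #   append ip=dhcp root=/dev/nfs nfsroot=192.168.1.1:/srv/nfsroot ro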
Best regards, Tomislav From tomislav.maric at gmx.com Sun Oct 4 16:08:15 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] Re: RAID for home beowulf In-Reply-To: References: <200910031845.n93IjXuO008654@bluewest.scyld.com> <200910032018.43712.jorg.sassmannshausen@strath.ac.uk> <4AC8D244.8020302@gmx.com> Message-ID: <4AC92ADF.5060609@gmx.com> Mark Hahn wrote: >> disk is the /home with the results. What file system should I use for >> it, ext3? > > it doesn't matter much. ext3 is a reasonable, conservative choice; > ext4 is the modern upgrade, though considered too-new-to-be-stable by some. > xfs is prefered as a matter of taste by others; > Thanks, I've used ext3. :) Guess I'm conservative then. :) From niftyompi at niftyegg.com Sun Oct 4 17:21:16 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC8822B.6000603@gmx.com> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> Message-ID: <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> On Sun, Oct 04, 2009 at 01:08:27PM +0200, Tomislav Maric wrote: > Mark Hahn wrote: > >> I've seen Centos mentioned a lot in connection to HPC, am I making a > >> mistake with Ubuntu?? > > > > distros differ mainly in their desktop decoration. for actually > > getting cluster-type work done, the distro is as close to irrelevant > > as imaginable. a matter of taste, really. it's not as if the distros > > provide the critical components - they merely repackage the kernel, > > libraries, middleware, utilities. wiring yourself to a distro does > > affect when you can or have to upgrade your system, though. > > An interesting perspective is to have a cluster built on the same distro as your desktop because all the work (compile/edit-->a.out) can just run on your cluster. If you have a personal bias to having Ubuntu under your fingers and mouse then a cluster based on the same version of Ubuntu can make sense. This can apply to the 32/64 bit choice too. For this reason alone it can make a lot of sense for a small personal cluster to match the desktop. Other distros like RHEL/CentOS/ScientificLinux do have some advantages in things that matter to big groups like kickstart support package management and server specific tools and support for the latest higher end hardware. For large clusters the personal station biases are all over the map for the user community so the compass points to server distros. Another perspective is compiler support. If you have a superior compiler for your applications then the cluster distro should be "compatible" with the compiler. Some compilers can improve run time by 40% and quickly tip the balance. Your benchmarks will tell.... I like Ubuntu because it facilitates my WiFi and graphics support better than some others. It is also quite current in image tools so it is the distro I connect my camera to more often than my other systems. I can also ssh into my CentOS boxen to other types of work that do not depend on eye candy. -- T o m M i t c h e l l Found me a new hat, now what? 
From mdidomenico4 at gmail.com Sun Oct 4 17:56:17 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] ATX on switch In-Reply-To: <4AC8D0F8.6070508@gmx.com> References: <4AC89B81.3040608@gmx.com> <4AC89E47.6080407@abdn.ac.uk> <4AC8A4FE.2060000@gmx.com> <9f8092cc0910040710r233d9cd2nfc5d73dc1a55004e@mail.gmail.com> <4AC8BA58.8030305@gmx.com> <9f8092cc0910040834u7e8e01bdnaec4e53f11d33ad6@mail.gmail.com> <4AC8D0F8.6070508@gmx.com> Message-ID: You can get a full complement of switches and leds for under 10 bucks. Just search froogle for "atx power switch". I've purchased the StarTech kits in the past. On Sun, Oct 4, 2009 at 12:44 PM, Tomislav Maric wrote: > John Hearns wrote: >> 2009/10/4 Tomislav Maric : >>> @John Hearns >>> Thank you! I've been looking around and I new there must be some kind of >>> power supply for multiple motherboards. That's exactly what I'll need >>> when the time comes for scaling. >> >> Tomislav, this is not a true power supply for multiple motherboards. >> There was a company in the UK, Workstations UK, which used to build >> clusters from multipl emotherboards in an enclosure, which had a shard >> power supply. >> These days people buy blade enclosures! >> >> However, do please use a mains PDU with these IEC plugs for any installation. >> Plugging systems into many, many wall outlets is ugly. >> >> >> Re. the chassis intrusion, this simply MUST be an option in your BIOS. >> My advice, and this goes for anyone, take a monitor and spend a half >> an hour going through all BIOS options. >> > > I just spent half an hour in BIOS. :) Still, I couldn't get to it before > ?I've reset the RTC, for some strange reason I got the chassis error. > Now it's solved. Thanks. > > I'll use the PDU as soon as I gather up some money. ;) Still a student. :) > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From a28427 at ua.pt Sun Oct 4 19:07:52 2009 From: a28427 at ua.pt (Tiago Marques) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] XEON power variations In-Reply-To: <4AAFEED9.6040302@pa.msu.edu> References: <4AAFEED9.6040302@pa.msu.edu> Message-ID: Hi Tom, On Tue, Sep 15, 2009 at 8:45 PM, Tom Rockwell wrote: > Hi, > > Intel assigns the same power consumption to different clockspeeds of L, E, X > series XEON. ?All L series have the same rating, all E series etc. ?So, > taking their numbers, the fastest of each type will always have the best > performance per watt. ?And there is no power consumption penalty for buying > the fastest clockspeed of each type. ?Vendor's power calculators reflect > this (dell.com/calc for example). ?To me this seems like marketing > simplification... ?Anybody know any different, e.g. have you seen other > numbers from vendors or tested systems yourself? > > Is the power consumption of a system with an E5502 CPU really the same as > one with an E5540? No but the consumption and the E5520 may be higher than the E5540, although I wouldn't expect variations of more than 20% on a 80W CPU. Don't take anything as granted. When most processors had a fixed voltage input per generation, you almost certainly could expect them to have higher power consumption based on the clockspeed since voltage was the same for all models and it only varied slightly due to current leakage differences between batches of CPUs. 
With current process shrinks, current leakage became a bigger problem then back then and it now varies much more depending on CPU batches. After the power wall became a problem, apparently manufacturers started binning processors by leakage which affects TDP. Intel currently employs a voltage range(for the higher p-state voltage, not just between p-states) for processor lines, which they use to adjust the processors TDP depending on requirements and not just the clockspeed target. They can lower voltage other than the higher default value in the range to help make up for leakage but it needs extra testing to ensure stability then, testing that takes more time and money. On the consumer market it's common to find CPUs that are capable of 3GHz but are sold at 2.5GHz and have sometimes higher power consumption than a 3GHz part if you overclock it just by clock multiplier means, effectively making it an equal sample by all measures. The part was obviously capable of a bigger clock but had it's voltage and clock reduced to fit the TDP, often even barely. This results in strange situations where the higher clocked, sometimes even higher voltage part, consumes less power than a lower clocked one with the same power grade. Intel's priority is to make money, so they don't care how much power the processor needs as long as it fits the TDP goal. I expect the situation in servers is the same, although with a (slightly) higher degree of thoroughness, hence the 60-80-95-130 grades and not just 65-95-130 grades as in desktops. Best regards, Tiago Marques > > Thanks, > Tom Rockwell > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From landman at scalableinformatics.com Sun Oct 4 21:31:29 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] XEON power variations In-Reply-To: <4AAFEED9.6040302@pa.msu.edu> References: <4AAFEED9.6040302@pa.msu.edu> Message-ID: <4AC976A1.7080104@scalableinformatics.com> Tom Rockwell wrote: > Hi, > > Intel assigns the same power consumption to different clockspeeds of L, > E, X series XEON. All L series have the same rating, all E series etc. Not quite. They provide the maximum power consumption/dissipation, and quite possibly bin these numbers over a range of parts. > So, taking their numbers, the fastest of each type will always have the > best performance per watt. And there is no power consumption penalty Well ... against the TDP bin, not necessarily in actuality. > for buying the fastest clockspeed of each type. Vendor's power > calculators reflect this (dell.com/calc for example). To me this seems These numbers come from Intel in most cases. > like marketing simplification... Anybody know any different, e.g. have binning. > you seen other numbers from vendors or tested systems yourself? We have a 2.93GHz Nehalem and a 3.2 GHz Nehalem that are almost identical. We could put a power meter on them if you'd like (I bought one a while ago for this purpose). I've found from a design perspective, that its always a better idea not to design for actual power, but to design for maximim power consumed, and then add some margin. Its far better to overestimate your power and cooling needs than to underestimate it. > Is the power consumption of a system with an E5502 CPU really the same > as one with an E5540? 
It's worth noting that the E5502 CPU is a 1.86GHz part, while E5540 is a 2.53GHz part. The power consumed in a CPU is a function of frequency. A linear dependence upon frequency would have the E5540 being 36% more power than the E5502. However, the 5502 is a dual core part, and the 5540 is a quad core part, so you get 2x the number of FLOPs out of the 5540. This would bias the performance per watt calculation if it weren't factored in correctly. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Sun Oct 4 23:28:52 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC929FA.6060201@gmx.com> References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> <4AC7C9F3.4050301@gmx.com> <4AC8821E.60605@gmx.com> <4AC929FA.6060201@gmx.com> Message-ID: > the other nodes are to be diskless.. I have separated these partitions: > /swap /boot / /var and /home. Is this ok? I don't believe there is much value in separating partitions like this. for instance, a swap partition has no advantage over a swap file, and the latter is generally more convenient. separating /boot is largely a vestige of quite old bioses which could not do 48b LBA addressing (ie, deal with big disks.) separating /, /var and /home is largely a matter of taste: - separate /home means that you can blast / and /var when you upgrade or change distros. - separating / and /var means that something like syslog or mail can never fill up the root partition. which is generally not a serious issue anyway: if anything, this argues for putting /tmp in a separate partition, but leaving /var inside /... - more partitions (filesystems) means more time spent waiting for disk heads to seek. remember that write activity on any fs (modern, journaling) requires some synchronous writes, and therefore can't be lazy-ified or elevator-scheduled when you have multiple partitions. I think it's always a good idea to minimize partitions, though we can probably argue about how many is best. I suppose one factor in favor of separate, purpose-specific filesystems is that they might be split into different raid levels. for instance, /tmp might be raid0, perhaps /var a >2 disk raid10 for write speed, and /home left on /, both under raid6. but given that you don't want multiple filesystems competing for access to the same disk(s), I'm skeptical of this approach... in short, it's perfectly OK to use one big filesystem. or you can just use the distro's default partitioning scheme. in the end it's not going to make much difference. regards, mark hahn. From hearnsj at googlemail.com Sun Oct 4 23:35:12 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> Message-ID: <9f8092cc0910042335j1d3a9e7cr75bc8e135ff9b379@mail.gmail.com> 2009/10/5 Nifty Tom Mitchell : > I like Ubuntu because it facilitates my WiFi and graphics support better than > some others. 
?It is also quite current in image tools so it is the distro > I connect my camera to more often than my other systems. ?I can also I gotta put in a reply here for SuSE. The SuSE desktop (either openSuSE or SLED) has plenty of eye candy, and works perfectly as my day-to-day desktop - plug in cameras, USB disks etc. and they're recognised, and wifi just works too. Accelerated 3D graphics support for AMD and Nvidia is fine - indeed there is a 'one click install' for the proprietary drivers. When you go onto your Beouwulf with SuSE you get compatibility with all compilers and all the ISV type codes I've ever seen. Besides that. SuSE were the first with an amd64 distribution (OK that's well in the past). However you'd be hard pushed to get anything other than SuSE running on a big Altix. (Actually, I did Google for this and people have got Debian running - as to why you would put in the time and effort to do that I don't know). From hearnsj at googlemail.com Sun Oct 4 23:44:28 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> <4AC7C9F3.4050301@gmx.com> <4AC8821E.60605@gmx.com> <4AC929FA.6060201@gmx.com> Message-ID: <9f8092cc0910042344t169dff9dh6bc24900522a4f14@mail.gmail.com> 2009/10/5 Mark Hahn : >> the other nodes are to be diskless.. I have separated these partitions: >> ?/swap /boot / ?/var and /home. Is this ok? > > I don't believe there is much value in separating partitions like this. > for instance, a swap partition has no advantage over a swap file, > and the latter is generally more convenient. ?separating /boot is largely a > vestige of quite old bioses which could not do 48b LBA addressing (ie, deal > with big disks.) ?separating /, /var and /home > is largely a matter of taste: I agree with what Mark says. Many separate partitions were necessary in the 'old days' when disks were small - so the partitioning reflects in fact using separate disks on a SCSI chain. I still vote for a separate /boot partition though - much more in my comfort zone. I was going to say that having a single /boot means that you can have a system with several distros or releases and boot into whichever root filesystem you choose. But I'm talking nonsense - I had an openSUSE dekstop with half this disk empty the other week. I put SLED 10 on there as well for tests - the installer recognised first time that there was an existing install and put its parameters in the grub menu! A common layout is to have a separate /home. Then some smart-assed user filling his/her home filesystem can't drag the rest of the system down. For CFD applications you definitely need some larger area of disk which is fast storage. I would definitely NOT put this on the system disk, no matter how big it is, for reasons that Mark says. You either want NFS on a pretty well performing server, with a fat network pipe and striped disks, or a parallel filesystem. From hahn at mcmaster.ca Mon Oct 5 00:14:42 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] re: power prediction and planning In-Reply-To: <4AC976A1.7080104@scalableinformatics.com> References: <4AAFEED9.6040302@pa.msu.edu> <4AC976A1.7080104@scalableinformatics.com> Message-ID: in reference to recent mention of TDP and power planning, what do you think the trend in the next few years will be? 
5 years ago, people were talking about exponential increases in power and power density - afaikt with straight faces. that clearly didn't happen. in fact, dissipation has clearly gone down (on a per-node basis - power per flop has gone through the floor!). I'm looking at current nodes and seeing substantial benefits to improvements in cpu power, lower voltage/power ram, more sensible chipsets, more efficient power supplies. we have a lot of nodes bought ~4 years ago - dual-socket HP DL145G2's. cpu power was around 100W/socket, and I suspect PSUs were probably ~80% efficient. now it would be hard to miss 92-93% efficient PSUs and CPUs in the 60-80W range. I'd be surprised if support chips haven't improved as well. I know even disks are dissipating a lot less, even though they never were a large component. I've even seen some vendors brag about how their fans are more efficient (which resonates with me because when a DL145G2 gets warm, the ten (10) fans ramp up dissipate noticably more power, which doesn't help the room temperature at all...) I'm also not really talking about heat density - I generally assume that we'll get an average of 2 nodes per 1U - ~80/rack. Intel seems to be agressively pushing fab tech as well - .32 nm is due this year, and they claim significant speed/power advances. SO: do you expect pretty much constant per-node dissipation? if higher, why? if lower, savings due to what? we've planned with 300W/node for a long time, but current nodes are noticably cooler than that. I imagine next-gen nodes will be lower power, not higher, unless something else changes (4 socket becomes common, or every node needs a Fermi or Larrabee card...) also, does anyone have thoughts about machineroom features that would support multiple generations, preferably without a lot of reno? for instance, we have mostly L6-30 power, which seems reasonably safe for commodity systems. but how about rack-back water cooling systems? what are the chances of getting multiple generations out of that (at least the heat-rejection part, if not the pipes or server-side heat-exchangers.) is anyone working on commodity-ish systems with water/coolant going to each node? (that is, skip or minimize the use of air) thanks, mark hahn. From hearnsj at googlemail.com Mon Oct 5 00:45:25 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] re: power prediction and planning In-Reply-To: References: <4AAFEED9.6040302@pa.msu.edu> <4AC976A1.7080104@scalableinformatics.com> Message-ID: <9f8092cc0910050045p72a66bedhbc7488309738081d@mail.gmail.com> 2009/10/5 Mark Hahn : > also, does anyone have thoughts about machineroom features that would > support multiple generations, preferably without a lot of reno? > for instance, we have mostly L6-30 power, which seems reasonably safe for > commodity systems. ?but how about rack-back water cooling systems? what are > the chances of getting multiple generations out of that (at least the > heat-rejection part, if not the pipes or server-side heat-exchangers.) Oooh - good question. The SGI rack-back water cooled system is excellent, however the racks are SGI-size. You would be able to get multiple generations of SGI kit in the same racks - their IRU units are the same size, so you could go IA64/ICE/Ultraviolet. But that's not really multi-vendor equipment! On the cooling side, I heard that Imperial College went for a CO2 based cooling system - similar rack-back cooling radiators, but you pump CO2 through them. 
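Back-of-envelope numbers for the planning figures above, assuming 208 V on the L6-30 circuits and the usual 80% continuous-load derating (both assumptions, not data from the thread):

    echo $((300 * 80))           # 24000 W per rack at 300 W/node and 80 nodes/rack
    echo "208 * 30 * 0.8" | bc   # ~4992 VA usable per L6-30 circuit
    # so roughly five such circuits per rack at the planning figure,
    # and closer to three or four if real nodes draw nearer 200 W

which is one way to quantify how much headroom the 300 W/node rule leaves as nodes keep getting more efficient.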
From tomislav.maric at gmx.com Mon Oct 5 03:02:11 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> Message-ID: <4AC9C423.10003@gmx.com> Nifty Tom Mitchell wrote: > On Sun, Oct 04, 2009 at 01:08:27PM +0200, Tomislav Maric wrote: >> Mark Hahn wrote: >>>> I've seen Centos mentioned a lot in connection to HPC, am I making a >>>> mistake with Ubuntu?? >>> distros differ mainly in their desktop decoration. for actually >>> getting cluster-type work done, the distro is as close to irrelevant >>> as imaginable. a matter of taste, really. it's not as if the distros >>> provide the critical components - they merely repackage the kernel, >>> libraries, middleware, utilities. wiring yourself to a distro does >>> affect when you can or have to upgrade your system, though. >>> > > An interesting perspective is to have a cluster built > on the same distro as your desktop because all the > work (compile/edit-->a.out) can just run on your cluster. > If you have a personal bias to having Ubuntu under your > fingers and mouse then a cluster based on the same version > of Ubuntu can make sense. This can apply to the 32/64 bit > choice too. > > For this reason alone it can make a lot of sense for a > small personal cluster to match the desktop. > > Other distros like RHEL/CentOS/ScientificLinux do have > some advantages in things that matter to big groups > like kickstart support package management and server specific > tools and support for the latest higher end hardware. > For large clusters the personal station biases are all > over the map for the user community so the compass points > to server distros. > > Another perspective is compiler support. If you have a superior > compiler for your applications then the cluster distro should be > "compatible" with the compiler. Some compilers can improve > run time by 40% and quickly tip the balance. Your benchmarks will tell.... > > I like Ubuntu because it facilitates my WiFi and graphics support better than > some others. It is also quite current in image tools so it is the distro > I connect my camera to more often than my other systems. I can also > ssh into my CentOS boxen to other types of work that do not depend on eye candy. > > > OK, I guess then Ubuntu will suffice for a 12 node Cluster. :) Anyway, I'll try it and see. Thanks! Best regards, Tomislav From tomislav.maric at gmx.com Mon Oct 5 03:13:24 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] ATX on switch In-Reply-To: References: <4AC89B81.3040608@gmx.com> <4AC89E47.6080407@abdn.ac.uk> <4AC8A4FE.2060000@gmx.com> <9f8092cc0910040710r233d9cd2nfc5d73dc1a55004e@mail.gmail.com> <4AC8BA58.8030305@gmx.com> <9f8092cc0910040834u7e8e01bdnaec4e53f11d33ad6@mail.gmail.com> <4AC8D0F8.6070508@gmx.com> Message-ID: <4AC9C6C4.7050003@gmx.com> Thanks Michael! If I'm going to use switches, I'll definitely get this kit. :) Best regards, Tomislav Michael Di Domenico wrote: > You can get a full complement of switches and leds for under 10 bucks. > Just search froogle for "atx power switch". I've purchased the > StarTech kits in the past. 
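For the software route to powering nodes on and off over the LAN (as opposed to wiring up switches), a small sketch -- the MAC address and hostname are placeholders, and Wake-on-LAN has to be enabled in each board's BIOS first:

    wakeonlan 00:11:22:33:44:55      # power a node on by its NIC's MAC (ether-wake also works)
    ssh node01 'poweroff'            # clean shutdown over the network

This covers power-on and orderly power-off, but not a hard reset of a hung board, which is where a physical switch, a switched PDU or IPMI still earns its keep.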
> > On Sun, Oct 4, 2009 at 12:44 PM, Tomislav Maric wrote: >> John Hearns wrote: >>> 2009/10/4 Tomislav Maric : >>>> @John Hearns >>>> Thank you! I've been looking around and I new there must be some kind of >>>> power supply for multiple motherboards. That's exactly what I'll need >>>> when the time comes for scaling. >>> Tomislav, this is not a true power supply for multiple motherboards. >>> There was a company in the UK, Workstations UK, which used to build >>> clusters from multipl emotherboards in an enclosure, which had a shard >>> power supply. >>> These days people buy blade enclosures! >>> >>> However, do please use a mains PDU with these IEC plugs for any installation. >>> Plugging systems into many, many wall outlets is ugly. >>> >>> >>> Re. the chassis intrusion, this simply MUST be an option in your BIOS. >>> My advice, and this goes for anyone, take a monitor and spend a half >>> an hour going through all BIOS options. >>> >> I just spent half an hour in BIOS. :) Still, I couldn't get to it before >> I've reset the RTC, for some strange reason I got the chassis error. >> Now it's solved. Thanks. >> >> I'll use the PDU as soon as I gather up some money. ;) Still a student. :) >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > From tomislav.maric at gmx.com Mon Oct 5 03:19:01 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: References: <4AC78CC8.5060500@gmx.com> <4AC78F3D.3030707@cs.earlham.edu> <4AC7C9F3.4050301@gmx.com> <4AC8821E.60605@gmx.com> <4AC929FA.6060201@gmx.com> Message-ID: <4AC9C815.7030105@gmx.com> Mark Hahn wrote: >> the other nodes are to be diskless.. I have separated these partitions: >> /swap /boot / /var and /home. Is this ok? > > I don't believe there is much value in separating partitions like this. > for instance, a swap partition has no advantage over a swap file, > and the latter is generally more convenient. separating /boot is > largely a vestige of quite old bioses which could not do 48b LBA > addressing (ie, deal with big disks.) separating /, /var and /home > is largely a matter of taste: > > - separate /home means that you can blast / and /var when > you upgrade or change distros. > > - separating / and /var means that something like syslog or > mail can never fill up the root partition. which is generally > not a serious issue anyway: if anything, this argues for putting > /tmp in a separate partition, but leaving /var inside /... > > - more partitions (filesystems) means more time spent waiting > for disk heads to seek. remember that write activity on any fs > (modern, journaling) requires some synchronous writes, and > therefore can't be lazy-ified or elevator-scheduled when you have > multiple partitions. > > I think it's always a good idea to minimize partitions, though we can > probably argue about how many is best. I suppose one factor in favor > of separate, purpose-specific filesystems is that they might be split > into different raid levels. for instance, /tmp might be raid0, perhaps > /var a >2 disk raid10 for write speed, and /home left on /, both under > raid6. but given that you don't want multiple filesystems competing > for access to the same disk(s), I'm skeptical of this approach... > > in short, it's perfectly OK to use one big filesystem. 
or you can just > use the distro's default partitioning scheme. in the end it's not going > to make much difference. > > regards, mark hahn. > Thank you very much for the info. There so much of this stuff that's totally new to me. I think I'm going to feel very lucky if I ever get this thing to work properly. Best regards, Tomislav From atp at piskorski.com Mon Oct 5 03:52:24 2009 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] ATX on switch In-Reply-To: <4AC89B81.3040608@gmx.com> References: <4AC89B81.3040608@gmx.com> Message-ID: <20091005105224.GA68320@piskorski.com> On Sun, Oct 04, 2009 at 02:56:33PM +0200, Tomislav Maric wrote: > where can I get an "on/off" and "reset" switch for ATX motherboard > without buying and ripping apart a case? I no longer remember where I bought mine, but googling for "ATX power switch" or the like will find you lots of options, e.g.: http://www.google.com/products/catalog?q=ATX+power+switch&hl=en&cid=2592395134572032536&sa=title#p http://www.pccables.com/cgi-bin/orders6.cgi?action=Showitem&partno=03201&rsite=f.03201 http://www.directron.com/atxswitch.html?gsear=1 http://www.axiontech.com/prdt.php?item=78201 -- Andrew Piskorski http://www.piskorski.com/ From tjrc at sanger.ac.uk Mon Oct 5 05:45:23 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC9C423.10003@gmx.com> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> <4AC9C423.10003@gmx.com> Message-ID: <1384B3AB-A2D8-4A38-AA75-C4458CF4E1C1@sanger.ac.uk> On 5 Oct 2009, at 11:02 am, Tomislav Maric wrote: > OK, I guess then Ubuntu will suffice for a 12 node Cluster. :) Anyway, > I'll try it and see. Thanks! We run Debian on our clusters, so you're definitely not the only person using a Debian-based distro for your cluster. Debian does have a kickstart-like deployment mechanism, debconf preseeding, but it's fiendish to debug. I use FAI instead (Fully Automated Install) but that's broken with recent Ubuntu releases, unfortunately. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From rpnabar at gmail.com Mon Oct 5 12:08:28 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> <4AC78313.8060904@cs.earlham.edu> <4AC78935.2010906@cs.earlham.edu> Message-ID: On Sat, Oct 3, 2009 at 4:11 PM, Mark Hahn wrote: > that's local ipmi, which to me is quite beside the point. ?ipmi is valuable > primarily for its out-of-band-ness - that is, you can get to it when the > host is off or wedged. True. I was just testing it locally before I tested remote. >> you're on Dell you can use OpenManage's omconfig command-line tool. I am on Dell. R410 and SC1435's > demand standards and just say "no" to non-standards, especially when venors > claim that they're supra-standard features. ?if we as computer > people have learned anything at all from our own history, it is that open > standards drive everything in the end. I agree. No arguments with that. 
Open tends to be cheaper too, nothing sucks so much as being tied to a particular vendors software or implementation. -- Rahul From dzaletnev at yandex.ru Mon Oct 5 04:19:37 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] Home Beowulf Message-ID: <58241254741577@webmail83.yandex.ru> Hello, list, I just thinking about a 2-nodes Beowulf for a CFD application that reads/writes small amounts of data in random areas of RAM (non-cached translation). What would be the difference in overall performance (I suppose, due to the memory access time) between Q9650/ 8 GB DDR2-800 and i7 920/ 6 GB DDR3-1866. By the way, are local variables of methods/functions stored in L2 cache? The code will use all the system memory for data storage, excluding used by OS, no swapping supposed. I'm going to buy motherboards next month. Thank you in advance for any advice, -- Dmitry Zaletnev From rpnabar at gmail.com Tue Oct 6 09:22:40 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <4AC36693.3040801@scalableinformatics.com> <4AC6C16D.9050708@cs.earlham.edu> <4AC781E1.7050705@scalableinformatics.com> Message-ID: On Sat, Oct 3, 2009 at 4:01 PM, Mark Hahn wrote: >> monolithic all-or-none creature. From what you write (and my online >> reading) it seems there are several discrete parts: >> >> IMPI 2.0 >> switched remotely accessible PDUs >> "serial concentrator type system " > > I think Joe was going a bit belt-and-suspenders-and-suspenders here. > > finally, I think Joe is advocating another layer of backups - serial > concentrators that would connect to the console serial port on each node > to collect output if IPMI SOL isn't working. ?this is perhaps a matter of > taste, but I don't find this terribly useful. ?I thought it would be for my > first cluster, but never actually set it up. ?but again, that's > because IPMI works well in my experience. Thanks Mark and Joe.Mark, I think I agree with your assessment of multiple layers being an overkill. It's up to individual judgment but for our needs I think IPMI will suffice. Besides the servers I'm tending to buy do come with IPMI support so that's lucky for. The chore is just getting it all configured correctly now! Thanks for all the leads guys! -- Rahul From dnlombar at ichips.intel.com Tue Oct 6 09:42:48 2009 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] ATX on switch In-Reply-To: <4AC89B81.3040608@gmx.com> References: <4AC89B81.3040608@gmx.com> Message-ID: <20091006164248.GB5702@nlxdcldnl2.cl.intel.com> On Sun, Oct 04, 2009 at 05:56:33AM -0700, Tomislav Maric wrote: > Hi again, > > where can I get an "on/off" and "reset" switch for ATX motherboard > without buying and ripping apart a case? Some stores sell a small package of switches and LEDs with the proper connectors. I've used those for playing about with boards outside a case. An electronics store may have the connectors and switches you want. Finally, you could rip apart an old case for the wire/connectors and replace the switches with standard momentary push buttons. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. 
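On the "getting IPMI configured correctly" chore mentioned above, the commands tend to look roughly like the following; the channel number, user ID, addresses and password are assumptions for this sketch, and the details differ slightly between BMC vendors:

    # locally on each node, point the BMC at the management network:
    ipmitool lan set 1 ipsrc static
    ipmitool lan set 1 ipaddr 10.1.0.101
    ipmitool lan set 1 netmask 255.255.255.0
    ipmitool user set password 2 'newpass'
    ipmitool lan print 1                     # verify the settings took
    # then out-of-band from the head node, even when the host OS is wedged:
    ipmitool -I lanplus -H 10.1.0.101 -U root -P 'newpass' chassis power status
    ipmitool -I lanplus -H 10.1.0.101 -U root -P 'newpass' chassis power cycle
    ipmitool -I lanplus -H 10.1.0.101 -U root -P 'newpass' sol activate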
From tom.elken at qlogic.com Tue Oct 6 11:14:51 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] Home Beowulf In-Reply-To: <73111254851774@webmail20.yandex.ru> References: <58241254741577@webmail83.yandex.ru> <35AAF1E4A771E142979F27B51793A4888702AE6C7D@AVEXMB1.qlogic.org> <73111254851774@webmail20.yandex.ru> Message-ID: <35AAF1E4A771E142979F27B51793A4888702AE6CC1@AVEXMB1.qlogic.org> > > > On Behalf Of Dmitry Zaletnev >>> What would be the difference in overall performance > (I suppose, due to the memory access time) between Q9650/ 8 GB DDR2- > 800 > > > and i7 920/ 6 GB DDR3-1866. > > you probably mean "DDR3-1066". > Tom, thank you for your answer, here I mean 1866 MHz, OCZ3P1866C7LV6GK > kits. I expect their availability must grow up till the end of the > winter when I'm going to finalize the construction of my home Beowulf. OK. I didn't know that desktop memory was so much faster than the top-end server memory of DDR3-1333 we have in our cluster. BTW, we've had a lot of dimm and/or dimm slot issues at this speed, so I would hesitate using the "bleeding edge" memory speed if reliability is important to you. For the rest of the Beowulf list, I had meant to CC you on my response to Dmitry. Here it is the rest of my response: " Concerning overall CFD performance, I would go for the i7 920. For CFD performance data, one good resource are the FLUENT benchmark pages: http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/index.htm click on one of the 6 test datasets there and get tables of benchmark performance. For example the Aricraft_2m model (with at: http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/problems/aircraft_2m.htm The Nehalem (same architecture as i7 920) generation: INTEL WHITEBOX (INTEL_X5570_NHM4,2930,RHEL5)* FLUENT rating at 8 cores: = 784.9 The Harpertown generation (same architecture as Q9650, I think) INTEL WHITEBOX (INTEL_X5482_HTN4,3200,RHEL5) FLUENT rating at 8 cores: = 307.6 * To help decode these brief descriptions of processors, systems & interconnects, you can often find more details at this page: http://www.fluent.com/software/fluent/fl6bench/new.htm THe FLUENT rating is a "bigger is better" metric relating to the number of jobs you can run in 24 hours, IIRC. So with the newer generation, on these FLUENT benchmarks anyway, you get approximately 2x the performance. > By the way, are local variables of > methods/functions stored in L2 cache? There are 3 levels of cache in the i7, 2 levels in the Q9650. All your problems are likely to use all levels of cache. You are right that data you tend to reuse frequently will tend to live in the lower cache levels longer, but there are no hard and fast rules. Best of luck, Tom " > I'm slightly confused: the test results from the Corsair site tell > about significant difference in performance between i7 with 1333 MHz > memory and 1866 MHz memory, from other side Tom's hardware forum tells > that the only parameter that matters is the latency and there's no > difference between a low-latency 1333 MHz memory and 1600 MHz memory. > But I'm afraid that chosing 1333 MHz memory now will eliminate the > sense of future upgrade to Xeon (motherboard supports it) or Fermi-in- > SLI. > > Best regards, > Dmitry From rpnabar at gmail.com Tue Oct 6 12:22:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] scheduler recommendations for a HPC cluster Message-ID: Any strong / weak recommendations for / against schedulers? 
For a long time we have worked happily with a Torque + Maui system. It isn't perfect but works (and is free!). But rarely does a chance present itself to go for something "newer and better" on a in-production system since people hate changes and outages. This time as we shop for a new cluster it presents me the opportunity to change if something better exists. Any comments? What are other users using out there? Any horror stories? Or any super good finds? I shy against LSF etc since those cost a lot of money. Especially as they, and similar systems are mostly licensed per server per year so the costs do add up. I have been a user on a LSF systems for a long time and I think it is an awesome scheduler but have never been at the admin end of LSF. One thing that the Torque+Maui option is not the best is that it is not monolithic. Oftentimes it is hard to know which component to blame for a problem or more relevant which config file to use to fix a problem. Torque or Maui. On the other hand , can't get rid of Maui since Fairshare policies etc. are important to us and those seem to be in the Maui domain. (all our jobs are MPI jobs in case that is relevant. We haven't been doing checkpointing yet) Of course, there is MOAB these days, but I am not sure if that is worth the money since I have not used it. I appreciate any comments or words of wisdom you guys might have! -- Rahul From bill at cse.ucdavis.edu Tue Oct 6 12:51:23 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] DRAM error rates: Nightmare on DIMM street Message-ID: <4ACB9FBB.7060509@cse.ucdavis.edu> Somewhat of a follow up of the rather large study of disk drive reliability that google published awhile back: http://blogs.zdnet.com/storage/?p=638 PDF on which the article is based on: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf From tomislav.maric at gmx.com Tue Oct 6 12:59:14 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] diskless how to Message-ID: <4ACBA192.2060108@gmx.com> Hi, I'm browsing through the web and there's multiple options for me to choose between prepared programs that set up diskless nodes. Still for the first two of them, I would like to learn how it's done. Can someone tell me what I need to do or point me to a "manual" tutorial for the diskless nodes? I just need good pointers so that I can avoid dead ends on my way to home beowulf. Thanks, Tomislav From dzaletnev at yandex.ru Tue Oct 6 10:56:14 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] Home Beowulf In-Reply-To: <35AAF1E4A771E142979F27B51793A4888702AE6C7D@AVEXMB1.qlogic.org> References: <58241254741577@webmail83.yandex.ru> <35AAF1E4A771E142979F27B51793A4888702AE6C7D@AVEXMB1.qlogic.org> Message-ID: <73111254851774@webmail20.yandex.ru> > > On Behalf Of Dmitry Zaletnev > > > > I just thinking about a 2-nodes Beowulf for a CFD application that > > reads/writes small amounts of data in random areas of RAM (non-cached > > translation). What would be the difference in overall performance (I > > suppose, due to the memory access time) between Q9650/ 8 GB DDR2-800 > > and i7 920/ 6 GB DDR3-1866. > you probably mean "DDR3-1066". Tom, thank you for your answer, here I mean 1866 MHz, OCZ3P1866C7LV6GK kits. I expect their availability must grow up till the end of the winter when I'm going to finalize the construction of my home Beowulf. 
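On the memory-speed question in this thread, measuring usually settles it faster than spec sheets; a minimal sketch, assuming stream.c from the STREAM benchmark and lmbench are available on both candidate systems:

    gcc -O3 -fopenmp stream.c -o stream   # sustained bandwidth
    ./stream
    lat_mem_rd 128 512                    # lmbench: latency over a 128 MB range, 512-byte stride

Bandwidth-bound kernels track the STREAM numbers, while small random accesses of the kind described earlier track lat_mem_rd, which is closer to the "only latency matters" argument quoted in this thread.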
I'm slightly confused: the test results from the Corsair site tell about significant difference in performance between i7 with 1333 MHz memory and 1866 MHz memory, from other side Tom's hardware forum tells that the only parameter that matters is the latency and there's no difference between a low-latency 1333 MHz memory and 1600 MHz memory. But I'm afraid that chosing 1333 MHz memory now will eliminate the sense of future upgrade to Xeon (motherboard supports it) or Fermi-in-SLI. Best regards, Dmitry From wrankin at duke.edu Tue Oct 6 15:39:14 2009 From: wrankin at duke.edu (Bill Rankin) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] scheduler recommendations for a HPC cluster In-Reply-To: References: Message-ID: <4ACBC712.4050507@duke.edu> Hi Rahul, If you are looking at the commercial offerings, you may want to also consider PBS Professional from Altair Engineering. http://www.pbspro.com/ In the name of full disclosure, I work for the PBS division of Altair. Thanks, -bill On 10/06/2009 03:22 PM, Rahul Nabar wrote: > Any strong / weak recommendations for / against schedulers? From dag at sonsorol.org Tue Oct 6 16:04:58 2009 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] scheduler recommendations for a HPC cluster In-Reply-To: References: Message-ID: Platform LSF one of the best available offerings if you consider the overall administrative burden, APIs, quality of documentation and quality of support. Basically for some cases the cost of the commercial software license more than pays for itself in having a product that is stable, well documented and has a very low admin/ operational burden. Trying to save money on an open source product and then needing to hire additional people to keep it from falling over is a mistake I've seen at more than one site. That said, I'm personally a Grid Engine zealot these days and use/ deploy it often. Open source and commercial flavors, amazing support community and a high level of product acceptance & market share in the life sciences which is where I do most of my work. In the interest of full disclosure I do Grid Engine consulting & training so I'm not totally unbiased here. When it comes to PBS variants I'd avoid the pure open source versions of PBS/Torque - I don't think I've ever been in an openPBS or Torque shop that has not altered the source code or otherwise dug deeply into the product. For people considering the PBS route I always recommend checking in with the pbspro people first. Just my $.02 -Chris On Oct 6, 2009, at 9:22 PM, Rahul Nabar wrote: > Any strong / weak recommendations for / against schedulers? For a long > time we have worked happily with a Torque + Maui system. It isn't > perfect but works (and is free!). But rarely does a chance present > itself to go for something "newer and better" on a in-production > system since people hate changes and outages. This time as we shop for > a new cluster it presents me the opportunity to change if something > better exists. > > Any comments? What are other users using out there? Any horror > stories? Or any super good finds? > > I shy against LSF etc since those cost a lot of money. Especially as > they, and similar systems are mostly licensed per server per year so > the costs do add up. I have been a user on a LSF systems for a long > time and I think it is an awesome scheduler but have never been at the > admin end of LSF. > > One thing that the Torque+Maui option is not the best is that it is > not monolithic. 
Oftentimes it is hard to know which component to blame > for a problem or more relevant which config file to use to fix a > problem. Torque or Maui. On the other hand , can't get rid of Maui > since Fairshare policies etc. are important to us and those seem to be > in the Maui domain. (all our jobs are MPI jobs in case that is > relevant. We haven't been doing checkpointing yet) > > Of course, there is MOAB these days, but I am not sure if that is > worth the money since I have not used it. > > I appreciate any comments or words of wisdom you guys might have! > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rpnabar at gmail.com Wed Oct 7 10:27:51 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] scheduler recommendations for a HPC cluster In-Reply-To: <4ACBC712.4050507@duke.edu> References: <4ACBC712.4050507@duke.edu> Message-ID: On Tue, Oct 6, 2009 at 5:39 PM, Bill Rankin wrote: > Hi Rahul, > > If you are looking at the commercial offerings, you may want to also > consider PBS Professional from Altair Engineering. > > http://www.pbspro.com/ > > In the name of full disclosure, I work for the PBS division of Altair. Thanks Bill! Maybe I will try out PBSPro. Commercial is OK so long as the cost is justifiable. The names are confusing : I thought PBS, Torque and PBSPro were the same pedigree and developer. I'll get into touch with you off list. -- Rahul From rpnabar at gmail.com Wed Oct 7 13:22:09 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] scheduler recommendations for a HPC cluster In-Reply-To: <4ACBC712.4050507@duke.edu> References: <4ACBC712.4050507@duke.edu> Message-ID: On Tue, Oct 6, 2009 at 5:39 PM, Bill Rankin wrote: > Hi Rahul, > > If you are looking at the commercial offerings, you may want to also > consider PBS Professional from Altair Engineering. > > http://www.pbspro.com/ > > In the name of full disclosure, I work for the PBS division of Altair. > How does one compare different schedulers, anyways? Is it mostly "word of mouth" and reputation. Feature sets are good to look at but it's not really a quantitative metric. Are there any third party comparisons of various schedulers? Do they have a niche that one scheduler outperforms another? Perhaps my quest for a quantitative metric is stupid. Maybe this is one of the many areas of technology where things are more qualitative than quantitative anyways. Price/ performance is always hard to define but for schedulers this seems impossible. The other issue seems to be per core licensing. To me it seems as an admin the amount of time and effort one puts in configuring a scheduler for a 50 core system and a 2000 core system is not grossly too different (maybe I am wrong?). The license cost on the other hand scales with cores. That makes the justification even harder. -- Rahul From hahn at mcmaster.ca Wed Oct 7 16:52:27 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] scheduler recommendations for a HPC cluster In-Reply-To: References: <4ACBC712.4050507@duke.edu> Message-ID: > How does one compare different schedulers, anyways? Is it mostly "word > of mouth" and reputation. Feature sets are good to look at but it's > not really a quantitative metric. 
Are there any third party > comparisons of various schedulers? Do they have a niche that one > scheduler outperforms another? I've never seen a useful comparison or even discussion of schedulers. as far as I can tell, part of the problem is that the conceptual domain is not developed enough to permit real, general tool-ness. I don't mean that schedulers aren't useful. it's dead simple to throw together a package that lets users submit/control jobs, arbitrate a queue, matches jobs to resources, fire them up, etc. they all do it, and you can write such a system from scratch in literally a few programmer-days if you know what you're doing. it's the details that matter, and that's where existing schedulers, even though they are functional, are not good tools. to me, a good tool has a certain patina of conceptual integrity about it. for instance, what it takes to make a good compiler is well known: we are all familiar with the two interfaces: source language and machine code. there are differences in quality, but for the most part, compilers all work alike. we all are at least somewhat aware of the long history of compilers, littered with the kind of mistakes that bring learning. you know what to expect from a screwdriver or drillpress, though they may differ in size or power. your programming language may be more torx than phillips, or you may prefer a multibit screwdriver. but we all know it needs a comfortable handle, a certain size and rigidity, blade for fitting the screw, etc. schedulers are more like an insanely ramified swiss army knife: feature complete is sometimes detrimental, and extreme featurefulness sometimes means it's guaranteed to not do what you want, only something vaguely in that direction. I think there's a physics-of-software principle here, that features always lead to less flexibility. (that doesn't deny that techniques like refactoring help, but they _only_ introduce a discontinuity between regions of featuritis. if nothing else, the dragging weight of back-compatibility is piecewise-montonic...) > Perhaps my quest for a quantitative metric is stupid. Maybe this is > one of the many areas of technology where things are more qualitative > than quantitative anyways. Price/ performance is always hard to define > but for schedulers this seems impossible. nah. it's a domain problem: what a scheduler should do and how is simply not well-defined, so scheduler companies just go for quantity to win you over. > The other issue seems to be per core licensing. To me it seems as an well, or licensing at all. it would be different if you were paying SchedulerCo to "make everything work the way I want", AND they could actually do it. instead, you pay for the thing they want to sell, and then spend huge amounts of your time fighting it and ultimately erecting shims and scaffolds around it to get it closer to "right". > admin the amount of time and effort one puts in configuring a > scheduler for a 50 core system and a 2000 core system is not grossly > too different (maybe I am wrong?). depends on what you want. if you're undemanding and merely want to hit the "users can submit jobs which eventually run somehow" milestone, then there's certainly no reason to pay for anything, and can expect it to take a competent sysadmin a few hours to set up. ie, the goal is "keep the machineroom warm". maybe my organization is uniquely picky, but I would say that after 5 years elapsed, and probably > 2 staff-years manhandling it, our (commercial) scheduler does maybe 65% of what it should. 
we're in the process of switching to another (mostly commercial) scheduler, and I expect that with 1-2 staff-years of investment, we'll have it close to 65 or 70%. > The license cost on the other hand > scales with cores. That makes the justification even harder. per-core licensing is just asinine: vendors seem to think that since the government taxes everyone, that's a fine revenue model for them too. they don't consider it from the other direction: that the amount of development and support effort is pretty close to constant per installation (ie, specifically _not_ a function of core count). we can argue about tax/politics/society over beer sometime, but "soak the rich" is not optimal in the marketplace. From sabujp at gmail.com Tue Oct 6 15:52:13 2009 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] scheduler recommendations for a HPC cluster In-Reply-To: <4ACBC712.4050507@duke.edu> References: <4ACBC712.4050507@duke.edu> Message-ID: Torque for a simple scheduler. Maui if you want something more complex but something which can handle resources more effectively, especially if you're going to have multiple groups accessing the cluster. >> Any strong / weak recommendations for / against schedulers? From rich at nd.edu Wed Oct 7 07:52:02 2009 From: rich at nd.edu (Rich Sudlow) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] Best Practices SOL vs Cyclades ACS Message-ID: <4ACCAB12.8030104@nd.edu> In the past we've used cyclades console servers for serial interfaces into our cluster nodes. We're replacing 360 nodes which couldn't do SOL with 360 which could. Now that we can do SOL is that a better to use that instead of the Cyclades? Thoughts? Rich -- Rich Sudlow University of Notre Dame Center for Research Computing 128 Information Technology Center PO Box 539 Notre Dame, IN 46556-0539 (574) 631-7258 office phone (574) 631-9283 office fax From landman at scalableinformatics.com Wed Oct 7 18:36:04 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] Best Practices SOL vs Cyclades ACS In-Reply-To: <4ACCAB12.8030104@nd.edu> References: <4ACCAB12.8030104@nd.edu> Message-ID: <4ACD4204.4080603@scalableinformatics.com> Rich Sudlow wrote: > In the past we've used cyclades console servers for serial > interfaces into our cluster nodes. > > We're replacing 360 nodes which couldn't do SOL with 360 > which could. > > Now that we can do SOL is that a better to use that instead of the > Cyclades? > > Thoughts? Every now and then IPMI gets wedged. We have seen it on all IPMI stacks. When IPMI gets wedged, SOL stops working. I recommend redundant administrative pathways ... make sure you can get to and control the machine in the event of a problem. Some pathways may not be as cost effective at scale than others. > > Rich > > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From beckerjes at mail.nih.gov Wed Oct 7 19:24:48 2009 From: beckerjes at mail.nih.gov (Jesse Becker) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] scheduler recommendations for a HPC cluster In-Reply-To: References: Message-ID: <20091008022448.GD15181@mail.nih.gov> On Tue, Oct 06, 2009 at 03:22:14PM -0400, Rahul Nabar wrote: >Any strong / weak recommendations for / against schedulers? 
For a long I'm a happy SGE user, and have been for 7+ years. A basic install does simple FIFO queuing (just like Torque, from what I've heard). It is fairly easy to add various "fairness" mechanisms to make sure that a single user doesn't take over the cluster, as well as define what you thing "fair" means. SGE can handle both interactive and non-interactive jobs, manage basic job dependencies, and has a more advanced interface through the DRMAA API. SGE also provides resource management and load balancing (e.g. handling software licenses and making sure compute nodes aren't oversubscribed). There's a wealth of documentation, a very helpful mailing list, active development, and responsive developers. Oh, and it's free (as in beer and speech), although I think you can throw money at Sun for support if you want. -- Jesse Becker NHGRI Linux support (Digicon Contractor) From bcostescu at gmail.com Thu Oct 8 03:08:04 2009 From: bcostescu at gmail.com (Bogdan Costescu) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] scheduler recommendations for a HPC cluster In-Reply-To: References: <4ACBC712.4050507@duke.edu> Message-ID: On Wed, Oct 7, 2009 at 10:22 PM, Rahul Nabar wrote: > How does one compare different schedulers, anyways? Is it mostly "word > of mouth" and reputation. Feature sets are good to look at but it's > not really a quantitative metric. I would say that, in the spirit of other benchmarks and comparisons discussed on this list, the best way is to try as many different ones as possible and make your own decision based on objective (how much it fits the needs) and subjective (how much you like its interfaces, how comfortable you feel maintaining it) criteria. But of course this is just wishful thinking... most resource managers and schedulers are way too complicated for a simple "let's just install it and run it" scenario, so it takes an excessive amount of time to set them up and to read documentation, pull hair over documented but not working features, contact support or write to mailing lists for help, getting something working only to find out that the users or, even worse, some administrative powers want changes and... start the cycle all over again. One can argue that taking care of a resource manager and scheduler can be a full-time job on a medium/large cluster, somehow similar to a DBA. IMHO, features sets are not a good way of comparing schedulers - much more important is how they map to what you want to achieve and how easy is to interact with them. I can take a lot of time to configure and test (very important part if the admin actually cares about the users ;-)) a medium to complex setup (f.e. several queues, serving several types of nodes with different limits of using the hardware or what users are allowed to do) and often there are several ways of achieving the same result which means that it's quite easy for a beginner to shoot him/her-self in the foot by trying to set too many things at once. The interaction with the scheduler is crucial because usually some results have to be extracted from it (efficiency of usage of the hardware, what ratio of time each user/group has consumed, how well fairshare works, etc.) and sometimes new settings have to be put into effect to change those results; let alone adding new nodes or making changes to accommodate a particular user or software... Another issue is that, although many features are advertised, not all work as you think. 
At configuration stage and especially while testing (or latest in production...), you can find about all kinds of limitations or implementation details which raise barriers between the resource manager or scheduler and your goals - especially interactions with various other components like MPI libraries (which need to know how to start a process on a remote node, which processors or cores to use, etc.), ISV-provided start-up scripts, existing infrastructure (Kerberos support, username length, number of secondary groups a user can belong to, etc.), but sometimes also inside the resource manager or scheduler itself (how good is it a cleaning up after a failed job, what's the maximum number of queues, etc.). I agree with Mark Hahn about selling a solution that solves a particular site's resource management and scheduling problem and not a generic solution that has to be customized by people who might have a hard time understanding the concepts and maybe limitations of the offered solution. Of course, some academic sites might not want this... to reduce costs, to not depend on a particular vendor, to allow for new ideas to be implemented, while many companies which require resource management and scheduling probably have their staff trained for this particular task. But I see a real need in small academic groups or small companies which don't have enough manpower to dedicate to this... Sorry for the rather negative message, but one would expect that after more than 20 years such an important piece of middleware would reach maturity and be easy to deploy and configure. Cheers, Bogdan From tomislav.maric at gmx.com Thu Oct 8 04:21:17 2009 From: tomislav.maric at gmx.com (Tomislav Maric) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] diskless how to In-Reply-To: <571f1a060910072052x4862d93bhdf7193d73b367036@mail.gmail.com> References: <4ACBA192.2060108@gmx.com> <571f1a060910072052x4862d93bhdf7193d73b367036@mail.gmail.com> Message-ID: <4ACDCB2D.2030402@gmx.com> @Greg Kurtzer & Jess Cannata Thank you both very much for the information and advice! I'm digging through it. :) Best regards, Tomislav From prentice at ias.edu Thu Oct 8 10:18:45 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Wed Nov 25 01:08:58 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: <4ACE1EF5.1020304@ias.edu> Rahul Nabar wrote: > Any good recommendation on a crash cart for a cluster room? My last > cluster was small and we had the luxury of having a KVM + SIP > connecting to each compute node. > I use an Anthro adjustable-height Zido (25" wide) with extension tube, flat panel monitor mount, and metal bin for storage. It works very well, and doesn't take up much space in the server room http://www.anthro.com/ppage.aspx?pmid=66 -- Prentice From prentice at ias.edu Thu Oct 8 10:24:39 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: fullcluster KVM is not an option I suppose? In-Reply-To: References: <4AC2EEC6.4000902@scalableinformatics.com> <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4ACE2057.70108@ias.edu> Rahul Nabar wrote: > On Wed, Sep 30, 2009 at 3:27 AM, Hearns, John wrote: > >> And Rahul, are you not talking to vendors who are telling you about >> their remote management and >> node imaging capabilities? 
By vendors, I do not mean your local Tier 1 >> salesman, who sells servers to normal businesses >> and corporations. > > > Thanks for all the suggestions guys! I am aware of IPMI and my > hardware does support it (I think). Its just that I've never had much > use for it all my past clusters being very small. > Even with IPMI, you still need a crash cart of some type to initially set up IPMI in the system's BIOS. At the minimum, you need to set the IP address that the IMPI interface will listen on (if it's a shared NIC port), and the password. You definitely don't want to leave the default password. -- Prentice From mathog at caltech.edu Thu Oct 8 12:40:23 2009 From: mathog at caltech.edu (David Mathog) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? Message-ID: Prentice Bisbal wrote: > Even with IPMI, you still need a crash cart of some type to initially > set up IPMI in the system's BIOS. At the minimum, you need to set the IP > address that the IMPI interface will listen on (if it's a shared NIC > port), and the password. You definitely don't want to leave the default > password. That could more easily be done on a bench before racking the system. I have a really, really long PS/2 KVM cable that runs from a monitor/keyboard/mouse on the one table in the machine room to whichever machine needs work. I don't recall its length, 30 feet maybe, but these sorts of cables work up to 100 ft or so. The single long cable works here because the machine room is small. In the two steps forward, one step back category, many machines now have no PS/2 connectors, and the one place where USB has a problem is with long cables, which max out at 5 meters. This can be increased with more cables and hubs, but it stops being simple at 5 meters. If you just want to plug in a VGA monitor into such a system then a 100 ft single cable is still an option. Anyway, for me it is a lot easier to snake the one long cable across the floor than it would be to roll a cart around. YMMV. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rpnabar at gmail.com Thu Oct 8 14:01:29 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: On Thu, Oct 8, 2009 at 2:40 PM, David Mathog wrote: > Prentice Bisbal wrote: > > Anyway, for me it is a lot easier to snake the one long cable across the > floor than it would be to roll a cart around. YMMV. Yup; true. I totally agree. -- Rahul From buccaneer at rocketmail.com Thu Oct 8 14:23:31 2009 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: Message-ID: <294779.96781.qm@web30601.mail.mud.yahoo.com> - > > Anyway, for me it is a lot easier to snake the one > long cable across the > > floor than it would be to roll a cart around. YMMV. > > Yup; true. I totally agree. We use serial port concentrators which we use to handle most of our problems. If I have to get up close and personal, we have a bunch of carts we just roll over and plug in. 
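For what it's worth, once the BMCs are reachable the SOL side needs nothing more exotic than ipmitool from any admin box. A minimal sketch -- the BMC hostname and credentials below are made up:

    ipmitool -I lanplus -H node001-bmc -U admin -P secret sol activate        # attach the serial console
    ipmitool -I lanplus -H node001-bmc -U admin -P secret sel list            # hardware event log
    ipmitool -I lanplus -H node001-bmc -U admin -P secret chassis power cycle

The default escape to drop out of the SOL session is "~." (same idea as ssh).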
From hahn at mcmaster.ca Thu Oct 8 15:01:37 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: fullcluster KVM is not an option I suppose? In-Reply-To: <4ACE2057.70108@ias.edu> References: <4AC2EEC6.4000902@scalableinformatics.com> <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> <4ACE2057.70108@ias.edu> Message-ID: > Even with IPMI, you still need a crash cart of some type to initially > set up IPMI in the system's BIOS. At the minimum, you need to set the IP > address that the IMPI interface will listen on (if it's a shared NIC afaik, not really. here's what I prefer: cluster nodes normally come out of the box with BIOS configured to try booting over the net before local HD. sometimes this is conditional on the local HD having no active partition. great: so they boot from a special PXE image I set up as a catchall. (dhcpd lets you define a catchall for any not nodes which lack a their own MAC-specific stanza.) when nodes are in that state, I like to auto-configure the cluster's knowlege of them: collect MAC, add to dhcpd.conf, etc. at this stage, you can also use local (open) ipmi on the node itself to configure the IPMI LAN interface: ipmitool lan 2 set password pa55word ipmitool lan 2 set defgw ipaddr 10.10.10.254 ipmitool lan 2 set ipsrc dhcp none of this precludes tricks like frobing the switch to find the port-MAC mappings of course - the point is simply that if you let unconfigured nodes autoboot into a useful image, that image can help you automate more of the config. for a while I had a sort of 'borg' cluster that would autoconfigure anything that PXEd on its LAN. (well, assuming it would boot at least an ia32 image - that image would add it to the cluster and arrange for it to reboot into a more specific (eg x86_64) image.) I never even bothered to mess with the BIOS boot-order on those nodes - they would have tried to boot from the local disk before PXE, except that I left it unpartitioned. local filesystem on /def/hda (ie, not hda1). swapfile. From rpnabar at gmail.com Thu Oct 8 15:08:54 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: <294779.96781.qm@web30601.mail.mud.yahoo.com> References: <294779.96781.qm@web30601.mail.mud.yahoo.com> Message-ID: On Thu, Oct 8, 2009 at 4:23 PM, Buccaneer for Hire. wrote: > > - >> > Anyway, for me it is a lot easier to snake the one >> long cable across the >> > floor than it would be to roll a cart around. YMMV. >> >> Yup; true. I totally agree. > > We use serial port concentrators which we use to handle most of our problems. If I have to get up close and personal, we have a bunch of carts we just roll over and plug in. In past experience the only time I need non-ssh access is for initial BIOS setup and MAC address extraction. Nowadays, the vendors will ship with the BIOS settings I ask for and also provide the MAC address externally (or I can autoextract from the initial PXE discovery packets). After that its PXE etc. and once I have a shell window everything is OK. In the rare cases that I can't come to the point of a shell even terminal access (via IMPI etc.) rarely helps since at that point a motherboard swap or a dead HDD etc. is involved. 
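(For anyone following along, the catchall Mark describes is just a subnet-level PXE default in dhcpd.conf, with a host stanza appended once a node's MAC has been harvested. A rough sketch -- the addresses and names are made up:

    subnet 10.10.10.0 netmask 255.255.255.0 {
        range dynamic-bootp 10.10.10.200 10.10.10.250;   # unknown nodes land here
        next-server 10.10.10.1;                          # TFTP server
        filename "pxelinux.0";
    }

    host node001 {
        hardware ethernet 00:25:90:aa:bb:cc;
        fixed-address 10.10.10.11;
    }

New hardware boots the catchall "discovery" image, reports its MAC, and the next reboot picks up its node-specific stanza.)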
So, hopefully I will have very little need for IPMI (remote power cycling is great though; those network aware power distribution stics seem great). -- Rahul From rpnabar at gmail.com Thu Oct 8 15:10:58 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: fullcluster KVM is not an option I suppose? In-Reply-To: References: <4AC2EEC6.4000902@scalableinformatics.com> <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> <4ACE2057.70108@ias.edu> Message-ID: On Thu, Oct 8, 2009 at 5:01 PM, Mark Hahn wrote: >> Even with IPMI, you still need a crash cart of some type to initially >> set up IPMI in the system's BIOS. At the minimum, you need to set the IP >> address that the IMPI interface will listen on (if it's a shared NIC > > afaik, not really. ?here's what I prefer: cluster nodes normally come out of > the box with BIOS configured to try booting over the net before local HD. > sometimes this is conditional on the local HD having no active partition. Thanks Mark. I am getting more and more of the feeling that if I cannot make a node behave and present a usable command line via PXE it is likely so far gone that a trip to the server room is inevitible and IPMI is unlikely to save the day. But that's just my workflow. -- Rahul From hahn at MCMASTER.CA Thu Oct 8 15:34:50 2009 From: hahn at MCMASTER.CA (Mark Hahn) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: References: <294779.96781.qm@web30601.mail.mud.yahoo.com> Message-ID: > So, hopefully I will have very little need for IPMI (remote power > cycling is great though; those network aware power distribution stics > seem great). power-cycling like this a machine is clearly harder on the PSU than reset or soft-off/on. I think the only justification for net/smart PDUs is for power monitoring or as a last-chance backup for proper (IPMI) power control. I would certainly chose IPMI before net-PDU if I could only have one (and indeed my organization sticks to IPMI.) I have only rarely experienced problems with "stuck" IPMI, and some fraction of those can be solved by sshing to the machine and using "ipmitool mc reset cold" locally. From lindahl at pbm.com Thu Oct 8 15:55:03 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: References: <294779.96781.qm@web30601.mail.mud.yahoo.com> Message-ID: <20091008225503.GD27035@bx9.net> On Thu, Oct 08, 2009 at 05:08:54PM -0500, Rahul Nabar wrote: > So, hopefully I will have very little need for IPMI (remote power > cycling is great though; those network aware power distribution stics > seem great). You haven't mentioned the other things you can use IPMI for. 1) Console logging. Your machine just crashed. No clue in /var/log/messages. "I wonder if it printed something on the console?" Answer: ipmi and conman (available in an rpm in Red Hat distros). 2) Monitoring. Temp, fan speeds, power supply state, events. Answers the "why is the little red light on the front of the case lit?" question. You can get some of this via other software (lm_sensors), but I find ipmitool to suck less, and ipmitool accurately answers the red light question -- lm_sensors can only guess. 
-- g From buccaneer at rocketmail.com Thu Oct 8 15:57:53 2009 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: Message-ID: <107529.66590.qm@web30603.mail.mud.yahoo.com> > In past experience the only time I need non-ssh access is > for initial BIOS setup and MAC address extraction. Nowadays, > the vendors will ship with the BIOS settings I ask for and > also provide the MAC address externally (or I can autoextract > from the initial PXE discovery packets). > We have found that vendors are very helpful there. Setting the node to PXE is a big help, specially for new types of nodes. We require the latest firmware on the box (so I don't have to spend countless hours upgrading firmware), and a spreadsheet with Mac address/Asset information. We use a script to upload the info the DB. Connect power, gigE,10gE,serial. Then on box and it starts installing... So if I can't ssh or rsh to the box. I try the serial console. If that does not work I look at the history of what's been going on (serial buffer). If I have to reboot at this point I can try the ILO/LOM/ROAMER/DRAC, etc. If that does not work I can remote cut the power. If that does not work, THEN I get off my all-too-comfy chair and head to the cluster room. From csamuel at vpac.org Thu Oct 8 17:33:03 2009 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] scheduler recommendations for a HPC cluster In-Reply-To: Message-ID: <1898708397.4238301255048383349.JavaMail.root@mail.vpac.org> ----- "Rahul Nabar" wrote: > One thing that the Torque+Maui option is not the best is > that it is not monolithic. Actually from our point of view the really good part of Torque is that the scheduler is pluggable and you can have the very simple pbs_sched, Maui or Moab or even write your own if you want using the examples! > Oftentimes it is hard to know which component to blame > for a problem or more relevant which config file to use > to fix a problem. Torque or Maui. We try and keep Torque *really* simple (just some queues to let a couple of applications select a walltime) and do all the smarts in Maui/Moab. For what we do we have to use Moab, Maui didn't have some of the capabilities we needed. One thing we *really* like is the fact that Torque's pbs_mom can run a health check script and then if that reports an error (say "ERROR /tmp full") then it gets passed back to the pbs_server and Moab will mark the node as down until that error clears. This keeps a node with problems from taking jobs meaning you can get to work on it sooner. Ours checks everything from SMART errors, MCEs, disk space through to if the node needs rebooting for a kernel upgrade. If you're not using Moab then you can instead simply get your health check script to run pbsnodes to mark the node offline (remembering to use the -N message set an appropriate message). cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From hahn at mcmaster.ca Thu Oct 8 22:44:58 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? 
In-Reply-To: <107529.66590.qm@web30603.mail.mud.yahoo.com> References: <107529.66590.qm@web30603.mail.mud.yahoo.com> Message-ID: > We have found that vendors are very helpful there. Setting the node to PXE >is a big help, specially for new types of nodes. We require the latest >firmware on the box (so I don't have to spend countless hours upgrading >firmware), and a spreadsheet with Mac address/Asset information. We use a >script to upload the info the DB. Connect power, gigE,10gE,serial. Then on >box and it starts installing... donno. I can see that it would be easier to gain vendor cooperation at the beginning, before they have the cash, or soon enough after acceptance that they can still remember it ;) OTOH, I think we all need _ongoing_ mechanisms to handle these issues, since they WILL crop up again during the cluster's lifespan. I'd probably not burn my "vendor capital" on this stuff. then again, maybe it's use it or lose it... my first wish would be for some way to automate BIOS settings. node properties like MAC and SNs are easy enough to gather yourself. flashing BIOS versions is easy enough too via pxe - but again, only as long as the flash doesn't fubar your settings... From hearnsj at googlemail.com Thu Oct 8 22:54:44 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: <20091008225503.GD27035@bx9.net> References: <294779.96781.qm@web30601.mail.mud.yahoo.com> <20091008225503.GD27035@bx9.net> Message-ID: <9f8092cc0910082254p19829468s9e5c0db6873d4109@mail.gmail.com> 2009/10/8 Greg Lindahl : > You haven't mentioned the other things you can use IPMI for. > > 1) Console logging. Your machine just crashed. No clue in > /var/log/messages. "I wonder if it printed something on the console?" > Answer: ipmi and conman (available in an rpm in Red Hat distros). > > 2) Monitoring. Temp, fan speeds, power supply state, events. Answers > the "why is the little red light on the front of the case lit?" > question. At the risk of getting a reputation in these parts, both come as standard on the SGI ICE cluster. Console logging via IPMI/conman on the rack leader for all nodes, which is then mounted across to the admin node. Temp, fan speed etc. logged on the rack leaders and reported via ESP monitoring. Ganglia implemented too. From lynesh at Cardiff.ac.uk Fri Oct 9 01:23:06 2009 From: lynesh at Cardiff.ac.uk (Huw Lynes) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: <20091008225503.GD27035@bx9.net> References: <294779.96781.qm@web30601.mail.mud.yahoo.com> <20091008225503.GD27035@bx9.net> Message-ID: <1255076586.2299.5.camel@w609.insrv.cf.ac.uk> On Thu, 2009-10-08 at 15:55 -0700, Greg Lindahl wrote: > On Thu, Oct 08, 2009 at 05:08:54PM -0500, Rahul Nabar wrote: > > > So, hopefully I will have very little need for IPMI (remote power > > cycling is great though; those network aware power distribution stics > > seem great). > > You haven't mentioned the other things you can use IPMI for. > > 1) Console logging. Your machine just crashed. No clue in > /var/log/messages. "I wonder if it printed something on the console?" > Answer: ipmi and conman (available in an rpm in Red Hat distros). > IPMI is also handy for initiating crash dumps via NMI. 
Useful for finding out why that machine went zombie when there is nothing in the SOL console logging. -- Huw Lynes | Advanced Research Computing HEC Sysadmin | Cardiff University | Redwood Building, Tel: +44 (0) 29208 70626 | King Edward VII Avenue, CF10 3NB From buccaneer at rocketmail.com Fri Oct 9 05:07:15 2009 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: Message-ID: <247171.48388.qm@web30607.mail.mud.yahoo.com> --- On Thu, 10/8/09, Mark Hahn wrote: > From: Mark Hahn > Subject: Re: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? > To: "Beowulf Mailing List" > Date: Thursday, October 8, 2009, 10:44 PM > > We have found that vendors are > very helpful there. Setting the node to PXE > >is a big help, specially for new types of nodes. We > require the latest > >firmware on the box (so I don't have to spend countless > hours upgrading > >firmware), and a spreadsheet with Mac address/Asset > information. We use a > >script to upload the info the DB.? Connect power, > gigE,10gE,serial.? Then on > >box and it starts installing... > > donno.? I can see that it would be easier to gain vendor cooperation > at the beginning, before they have the cash, or soon enough after > acceptance that they can still remember it ;) > > > OTOH, I think we all need _ongoing_ mechanisms to handle these > issues, since they WILL crop up again during the cluster's lifespan. > I'd probably not burn my "vendor capital" on this stuff.? then > again, maybe it's use it or lose it... > > > my first wish would be for some way to automate BIOS settings, > node properties like MAC and SNs are easy enough to gather yourself. > flashing BIOS versions is easy enough too via pxe - but again, > only as long as the flash doesn't fubar your settings... While agree sometimes we have to get on the same page, I see a vendor as a partner in my success. If the vendor has a different idea, we don't have a problem seeking new vendors. If that vendor can not help you reach your goals, why squander the company's money? It is also best to discuss this during the sales cycle. If we buy X number of nodes, I don't think I need to be spending 2,3 or 4 hours on each of them to upgrade all the firmware-before I can put them to work. I consider that absurd. We have also never had an issue with a vendor getting this information to us. They have to capture it anyway. In fact, we send then a spreadsheet with hostname and they fill in the rest of the information. We get the spreadsheet, I run a script for the that vendor and the DB is populated-a second. After that, as the nodes are placed in the rack and connected, when power is applied it starts to install!!! On paper. :) But is works way more than it doesn't. From rpnabar at gmail.com Fri Oct 9 09:47:23 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: <247171.48388.qm@web30607.mail.mud.yahoo.com> References: <247171.48388.qm@web30607.mail.mud.yahoo.com> Message-ID: On Fri, Oct 9, 2009 at 7:07 AM, Buccaneer for Hire. wrote: > We have also never had an issue with a vendor getting this information to us. ?They have to capture it anyway. 
In fact, we send then a spreadsheet with hostname and they fill in the rest of the information. We get the spreadsheet, I run a script for the that vendor and the DB is populated-a second. ?After that, as the nodes are placed in the rack and connected, when power is applied it starts to install!!! ?On paper. :) > That's a simple but efficient idea! Why did I not think of sending my vendor a spreadsheet before! :) Maybe I ought to automate service-tag collection as well! -- Rahul From john.hearns at mclaren.com Fri Oct 9 10:08:22 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: References: <247171.48388.qm@web30607.mail.mud.yahoo.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D76868C@milexchmb1.mil.tagmclarengroup.com> Maybe I ought to automate service-tag collection as well! That is easily done. Streamline do it for Dell or HP kit - they just taken a handheld barcode reader, and read the barcodes from all the servers in a rack. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From rpnabar at gmail.com Fri Oct 9 10:17:59 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: <20091008225503.GD27035@bx9.net> References: <294779.96781.qm@web30601.mail.mud.yahoo.com> <20091008225503.GD27035@bx9.net> Message-ID: On Thu, Oct 8, 2009 at 5:55 PM, Greg Lindahl wrote: > > > 1) Console logging. Your machine just crashed. No clue in > /var/log/messages. "I wonder if it printed something on the console?" > Answer: ipmi and conman (available in an rpm in Red Hat distros). I was "planning" on using kdump and a crash-kernel for that. Note the emphasis on "planning". I never got that working correctly. I got started on kdump+kexec when exactly the same "node crashes for unkown reasons and I have no output" problem. Maybe IPMI gives you the same functionality. Interesting point for me though: What's the pros and cons of IPMI-console-logging versus kdump in such crash scenarios. Are they competitors? Is one better / easier than the other? > 2) Monitoring. Temp, fan speeds, power supply state, events. Answers > the "why is the little red light on the front of the case lit?" > question. You can get some of this via other software (lm_sensors), > but I find ipmitool to suck less, and ipmitool accurately answers the > red light question -- lm_sensors can only guess. I see. Yes, you read me correctly: I was putting full faith in lm_sensors to do this. Currently I have lm_sensors feedign Temperatures to my nagios monitoring setup and has been working fine. But I didn't grasp a practical point about lm_sensors sucking more than IPMI. THat's interesting again: Aren't they taking data from the same bus or counters? Or is this because the sensor details tend to be proprietary so lm_sensors lags behind the Vendor implementations of IPMI? Because if open-source IPMI is also trying to log sensor stats its in competition with open source lm_sensors (not to say this is bad or un heard of for multiple open source projects getting the same thing done!) 
-- Rahul From buccaneer at rocketmail.com Fri Oct 9 10:50:46 2009 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: Message-ID: <809204.38947.qm@web30607.mail.mud.yahoo.com> The first thing they ask for, when you call the vendor, is for an asset tag or serial number. We keep it in the DB so I can always verify it. It is even better when the vendor makes is easier to get it through dmidecode. --- On Fri, 10/9/09, Rahul Nabar wrote: > From: Rahul Nabar > Subject: Re: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? > To: "Buccaneer for Hire." > Cc: "Beowulf Mailing List" , "Mark Hahn" > Date: Friday, October 9, 2009, 9:47 AM > On Fri, Oct 9, 2009 at 7:07 AM, > Buccaneer for Hire. > > wrote: > > We have also never had an issue with a vendor getting > this information to us. ?They have to capture it anyway. In > fact, we send then a spreadsheet with hostname and they fill > in the rest of the information. We get the spreadsheet, I > run a script for the that vendor and the DB is populated-a > second. ?After that, as the nodes are placed in the rack > and connected, when power is applied it starts to install!!! > ?On paper. :) > > > > That's a simple but efficient idea! Why did I not think of > sending my > vendor a spreadsheet before! :) > > Maybe I ought to automate service-tag collection as well! > > -- > Rahul > From rpnabar at gmail.com Fri Oct 9 10:46:12 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: References: <107529.66590.qm@web30603.mail.mud.yahoo.com> Message-ID: On Fri, Oct 9, 2009 at 12:44 AM, Mark Hahn wrote: > my first wish would be for some way to automate BIOS settings. node > properties like MAC and SNs are easy enough to gather yourself. flashing > BIOS versions is easy enough too via pxe - but again, > only as long as the flash doesn't fubar your settings... Well, it's embarrassing to admit but I have never flashed a BIOS firmware since I am mortally scared of messing things and ending up with a $3000 expensive paperweight. :) Is BIOS firmware-flashing routine for you guys? Is it easy or error-prone. Are there any such "paperweight servers" lurking in your racks and server-rooms? Or is this a bugaboo of the ancient dark-ages? -- Rahul From rpnabar at gmail.com Fri Oct 9 10:57:48 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: <809204.38947.qm@web30607.mail.mud.yahoo.com> References: <809204.38947.qm@web30607.mail.mud.yahoo.com> Message-ID: On Fri, Oct 9, 2009 at 12:50 PM, Buccaneer for Hire. wrote: > The first thing they ask for, when you call the vendor, is for an asset tag or serial number. We keep it in the DB so I can always verify it. It is even better when the vendor makes is easier to get it through dmidecode. Yup. That is what they always ask for. The helpdesk engineers won't even talk to me (justified!) unless I assure them I have a valid-service tag and my BIOS firmware is up-to-date (more flexible!) 
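(The collection itself is only a few one-liners per node, assuming the vendor actually populated the DMI tables -- eth0 and the choice of fields here are just an example:

    dmidecode -s system-serial-number
    dmidecode -s baseboard-serial-number
    dmidecode -s bios-version
    cat /sys/class/net/eth0/address

Run from the PXE discovery image and appended to the same spreadsheet or DB, that covers most of what a support call asks for.)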
(it raises an interesting thought I had: I wish the sales guys sometimes gave me a "dummy" service tag while I am evaluating hardware options to buy a new cluster. Often times one doesn't really want to try-and-buy each option (then I would get a service tag!) but yet one has questions that are not answered in the spec sheets or the manuals are hazy. The support-engineers are often the most knowledgeable about these nuts-and-bolts aspects I've found. But in the absence of a service tag it's hard to get access to them. ) Keeping the BIOS firmware up-to-date is a chore though. When we are choosing Linux distros in the interest of stability a frequently updated firmware gets irritating. I guess it's a tradeoff to enjoy bug-fixes feature-additions so maybe I shouldn't be cribbing...... -- Rahul From buccaneer at rocketmail.com Fri Oct 9 11:00:58 2009 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: Message-ID: <395922.43089.qm@web30607.mail.mud.yahoo.com> --- On Fri, 10/9/09, Rahul Nabar wrote: > That's a simple but efficient idea! Why did I not think of > sending my vendor a spreadsheet before! :) > > Maybe I ought to automate service-tag collection as well! Our relationship with our vendors should not be adversarial. A vendor should be part of your support system. IF your vendor refuses to give you the support you need, then you need a new vendor. Unless you like pain. I am a stickler for keeping track of things. And as I get older I don't want it to be my brain. We sometimes have mini-clusters that are not in the cluster room, and I build an id tag for these nodes so we know who is responsible for them. Someone is always holding the bag for these puppies. From buccaneer at rocketmail.com Fri Oct 9 11:11:43 2009 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: Message-ID: <255432.27220.qm@web30604.mail.mud.yahoo.com> --- On Fri, 10/9/09, Rahul Nabar wrote: > (it raises an interesting thought I had: I wish the sales > guys sometimes gave me a "dummy" service tag while I am > evaluating hardware options to buy a new cluster. Often > times one doesn't really want to try-and-buy each option > (then I would get a service tag!) but yet one > has questions that are not answered in the spec sheets or > the manuals are hazy. The support-engineers are often the most > knowledgeable about these nuts-and-bolts aspects I've found. > But in the absence of a service tag it's hard to get access > to them. ) > > Keeping the BIOS firmware up-to-date is a chore though. When > we are choosing Linux distros in the interest of stability a > frequently updated firmware gets irritating. I guess it's a > tradeoff to enjoy bug-fixes feature-additions so maybe I shouldn't > be cribbing...... IF you go through a vetting process like we do, it takes a while to vett changes. We can't just jump to new firmware just so they will do the support we paid them for. When they come to give you the dog-and-pony show, that's when I jump in and ask for things. They are here to solve your problems-that's the first thing they will tell you. If they help you succeed, you will stay in bed with them. That's a win-win situation. A node on a pallet waiting for you do upgrade firmware is absolutely worthless. 
1000 of them even more so. From dnlombar at ichips.intel.com Fri Oct 9 11:39:47 2009 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: <809204.38947.qm@web30607.mail.mud.yahoo.com> References: <809204.38947.qm@web30607.mail.mud.yahoo.com> Message-ID: <20091009183947.GF5702@nlxdcldnl2.cl.intel.com> On Fri, Oct 09, 2009 at 10:50:46AM -0700, Buccaneer for Hire. wrote: > The first thing they ask for, when you call the vendor, is for an asset > tag or serial number. We keep it in the DB so I can always > verify it. It is even better when the vendor makes is easier to get > it through dmidecode. Just be aware that dmidecode output varies considerably in quality and quantity. You may found the info you need in fixed locations, but it may also vary over time and vendor. Use it when it works, but don't be surprised if it doesn't work as well as you would hope. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From lindahl at pbm.com Fri Oct 9 11:44:06 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: References: <294779.96781.qm@web30601.mail.mud.yahoo.com> <20091008225503.GD27035@bx9.net> Message-ID: <20091009184406.GB9695@bx9.net> > > 1) Console logging. Your machine just crashed. No clue in > > /var/log/messages. "I wonder if it printed something on the console?" > > Answer: ipmi and conman (available in an rpm in Red Hat distros). > > I was "planning" on using kdump and a crash-kernel for that. Which is complicated enough to set up that I've never tried. IPMI doesn't get you the same functionality as kdump: you can't do further debugging without the dump. But you do get the oops with ipmi/conman, which is about the same as getting the stacktrace when a program segfaults. Personally I'm not really going to debug in the kernel more than staring hard at the oops, and the oops is the preferred way of filing a bug against the kernel. > I see. Yes, you read me correctly: I was putting full faith in > lm_sensors to do this. lm_sensors isn't going to tell you about something that happened between scans. ipmi gives you access to the event log, which will show you all transient events. The two do look at the same bus and counters. lm_sensors works in systems which are missing ipmi. -- greg From lindahl at pbm.com Fri Oct 9 11:53:14 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: <9f8092cc0910082254p19829468s9e5c0db6873d4109@mail.gmail.com> Message-ID: <20091009185314.GD9695@bx9.net> On Fri, Oct 09, 2009 at 06:54:44AM +0100, John Hearns wrote: > At the risk of getting a reputation in these parts, You've already gotten one. You know, when you chime in and say, "Hey, SGI supports that feature", I think every time it's been the sort of feature that all full-service cluster vendors have. So what is your point? -- greg From hahn at mcmaster.ca Fri Oct 9 13:36:54 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? 
In-Reply-To: References: <107529.66590.qm@web30603.mail.mud.yahoo.com> Message-ID: > Is BIOS firmware-flashing routine for you guys? Is it easy or > error-prone. Are there any such "paperweight servers" lurking in your > racks and server-rooms? Or is this a bugaboo of the ancient dark-ages? I don't believe I've ever bricked a server, though I don't flash any more than I have too. my experience is that even interrupted bios flashes seem to be re-flashable. I'm guessing that they isolate the main bios from the code that performs the flash (which perhaps they never update). my organization has > 2500 nodes and most have been flashed a time or two; I think we've had a couple failed flashes that worked on second try. we might have actually bricked a machine or two over 4 years: call it 2 bricks in 5k flashes... (all this flashing is of non-UPS nodes, done with PXE-booted floppy images.) From hahn at mcmaster.ca Fri Oct 9 14:02:35 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: <255432.27220.qm@web30604.mail.mud.yahoo.com> References: <255432.27220.qm@web30604.mail.mud.yahoo.com> Message-ID: > A node on a pallet waiting for you do upgrade firmware is absolutely worthless. 1000 of them even more so. I'm not sure why you keep going on about bios updates. they're easy and automatable, so not a real issue for clusters. no offense! From hahn at mcmaster.ca Fri Oct 9 14:09:46 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: <20091009183947.GF5702@nlxdcldnl2.cl.intel.com> References: <809204.38947.qm@web30607.mail.mud.yahoo.com> <20091009183947.GF5702@nlxdcldnl2.cl.intel.com> Message-ID: > Just be aware that dmidecode output varies considerably in quality and > quantity. You may found the info you need in fixed locations, but it > may also vary over time and vendor. Use it when it works, but don't be > surprised if it doesn't work as well as you would hope. sure. dmidecode only shows you what's been programmed into the bios. HP nodes (perhaps just replacement motherboards) seem to come with nulled serial numbers in bios. I had a good head-scratch the other day with a replaced MB that didn't have the default IPMI account/passwords. happily, I could use the trick of booting it and running local ipmitool. come to think of it, before I knew about dmidecode, I used to run strings on /dev/mem ;) but one could probably glean other data by collecting POST messages on SOL. From rpnabar at gmail.com Fri Oct 9 14:14:21 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: References: <255432.27220.qm@web30604.mail.mud.yahoo.com> Message-ID: On Fri, Oct 9, 2009 at 4:02 PM, Mark Hahn wrote: >> A node on a pallet waiting for you do upgrade firmware is absolutely >> worthless. ?1000 of them even more so. > > I'm not sure why you keep going on about bios updates. ?they're easy and > automatable, so not a real issue for clusters. ?no offense! Sorry, I just went off on a tangent when I was thinking about "what the service engineers ask" comment from "Buccaneer for Hire". I agree they aren't that important of a topic. Not even sure they are very necessary. 
On the other hand, thanks for your comments, I did not realize they were automable and relatively risk free. -- Rahul From lindahl at pbm.com Fri Oct 9 14:27:45 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: References: <255432.27220.qm@web30604.mail.mud.yahoo.com> Message-ID: <20091009212745.GH4382@bx9.net> On Fri, Oct 09, 2009 at 05:02:35PM -0400, Mark Hahn wrote: > I'm not sure why you keep going on about bios updates. they're easy and > automatable, so not a real issue for clusters. no offense! Mark, I don't think you've had a look at that many vendors, right? Mostly a single vendor? I recently updated hundreds of SuperMicro nodes, because the existing BIOS had a bug where a node with 64 gbytes of ram wouldn't boot. We were upgrading memory to 64g. Our reseller has some magic tools that let them generate a bios-flashing boot disk with a non-standard default setting. But this tool does not allow AHCI to be turned on. I had to manually manipulate every single node after flashing. My crash cart came in quite handy. On another note, dmidecode running against SuperMicro's Phoenix BIOS correctly indicates whether ECC is turned on in the BIOS -- I found 2 nodes which were incorrectly configured this way. But a newer node with Nehalem has an AMI Bios, and dmidecode always reports ECC off. Oh, well. Neither one captures serial numbers, but, I use the mac addr as a mobo serial number. -- greg From hahn at mcmaster.ca Fri Oct 9 14:27:09 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: References: <294779.96781.qm@web30601.mail.mud.yahoo.com> <20091008225503.GD27035@bx9.net> Message-ID: > in such crash scenarios. Are they competitors? Is one better / easier > than the other? they're different. and I've never actually seen kdump in use. but logging (remote syslog, syslog-ng, netconsole, ipmi-sol, etc) is something everyone does to varying degrees and more is better. >> 2) Monitoring. Temp, fan speeds, power supply state, events. Answers >> the "why is the little red light on the front of the case lit?" >> question. You can get some of this via other software (lm_sensors), >> but I find ipmitool to suck less, and ipmitool accurately answers the >> red light question -- lm_sensors can only guess. > > I see. Yes, you read me correctly: I was putting full faith in > lm_sensors to do this. Currently I have lm_sensors feedign > Temperatures to my nagios monitoring setup and has been working fine. lm_sensors is in-band, in that it consumes cycles on your node, and doesn't help you if your node isn't working right. IPMI is OOB, has no performance effect and works regardless of power state, panic, etc. > But I didn't grasp a practical point about lm_sensors sucking more > than IPMI. THat's interesting again: Aren't they taking data from the > same bus or counters? Or is this because the sensor details tend to be > proprietary so lm_sensors lags behind the Vendor implementations of > IPMI? lm_sensors is _more_ flexible because it can, for instance, probe components that the BMC doesn't know about. I'm thinking of SPD on dimms, which is gettable over the I2C bus, but I've never seen an IPMI mess with it. there are other I2C devices too (some video cards?). the more your monitoring can be OOB, the better. 
not just from a perturbation standpoint, but also fragility. From gmkurtzer at gmail.com Wed Oct 7 20:52:17 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] diskless how to In-Reply-To: <4ACBA192.2060108@gmx.com> References: <4ACBA192.2060108@gmx.com> Message-ID: <571f1a060910072052x4862d93bhdf7193d73b367036@mail.gmail.com> Have you looked at Perceus? It does all of the node management (stateless/diskless) plus some other very helpful features. http://www.perceus.org/ Good luck! Greg On Tue, Oct 6, 2009 at 12:59 PM, Tomislav Maric wrote: > Hi, > > I'm browsing through the web and there's multiple options for me to > choose between prepared programs that set up diskless nodes. Still for > the first two of them, I would like to learn how it's done. Can someone > tell me what I need to do or point me to a "manual" tutorial for the > diskless nodes? I just need good pointers so that I can avoid dead ends > on my way to home beowulf. > > Thanks, > Tomislav > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Greg M. Kurtzer Chief Technology Officer HPC Systems Architect Infiscale, Inc. - http://www.infiscale.com From Daniel.Kidger at bull.co.uk Fri Oct 9 04:08:19 2009 From: Daniel.Kidger at bull.co.uk (Daniel.Kidger@bull.co.uk) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Best Practices SOL vs Cyclades ACS Message-ID: >Rich Sudlow wrote: >> In the past we've used cyclades console servers for serial >> interfaces into our cluster nodes. >> >> We're replacing 360 nodes which couldn't do SOL with 360 >> which could. >> >> Now that we can do SOL is that a better to use that instead of the >> Cyclades? >> >> Thoughts? > >Every now and then IPMI gets wedged. We have seen it on all IPMI >stacks. When IPMI gets wedged, SOL stops working. > >I recommend redundant administrative pathways ... make sure you can get >to and control the machine in the event of a problem. Some pathways may >not be as cost effective at scale than others. I suggest that if you are already comfortable with Cyclades Terminal servers and already have them configured plus all the cables are already there, then why not continue to use them. I guess you already use the feature where they can write the console logs to a NFS mounted filesystem? Redundant pathways are always a bonus. However here you might have problems in having effectively 2 simulatenous serial consoles. Daniel -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091009/3befec50/attachment.html From buccaneer at rocketmail.com Fri Oct 9 15:14:51 2009 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: Message-ID: <873606.1358.qm@web30606.mail.mud.yahoo.com> --- On Fri, 10/9/09, Mark Hahn wrote: > From: Mark Hahn > Subject: Re: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? > To: "Buccaneer for Hire." > Cc: "Beowulf Mailing List" > Date: Friday, October 9, 2009, 2:02 PM > > A node on a pallet waiting for > you do upgrade firmware is absolutely worthless.? 1000 > of them even more so. 
> > I'm not sure why you keep going on about bios > updates. they're easy and > automatable, so not a real issue for clusters. no > offense! > No offense taken... Because the node should be ready for prime time when we get it. But it is not the BIOS where the problems are, but all the little sophisticated pieces of the node with their own firmware where you have to plug in and hit keys, and wait, and hit keys. This may be OK if you have a small cluster, not OK when you have tons of nodes. From mm at yuhu.biz Fri Oct 9 16:54:48 2009 From: mm at yuhu.biz (Marian Marinov) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Best Practices SOL vs Cyclades ACS In-Reply-To: References: Message-ID: <200910100254.49024.mm@yuhu.biz> On Friday 09 October 2009 14:08:19 Daniel.Kidger@bull.co.uk wrote: > >Rich Sudlow wrote: > >> In the past we've used cyclades console servers for serial > >> interfaces into our cluster nodes. > >> > >> We're replacing 360 nodes which couldn't do SOL with 360 > >> which could. > >> > >> Now that we can do SOL is that a better to use that instead of the > >> Cyclades? > >> > >> Thoughts? > > > >Every now and then IPMI gets wedged. We have seen it on all IPMI > >stacks. When IPMI gets wedged, SOL stops working. > > > >I recommend redundant administrative pathways ... make sure you can get > >to and control the machine in the event of a problem. Some pathways may > >not be as cost effective at scale than others. > > > > I suggest that if you are already comfortable with Cyclades Terminal > servers and already have them configured plus all the cables are already > there, then why not continue to use them. > I guess you already use the feature where they can write the console logs > to a NFS mounted filesystem? > > Redundant pathways are always a bonus. However here you might have > problems in having effectively 2 simulatenous serial consoles. > > Daniel We have more than 400 machines. Every month there is one machine that we can not reboot using IPMI or the SOL is not working. I also confirm what Daniel and Joe said: if you already have the infrastructure and you are used to it, it is best to keep it. We shifted to IPMI for easier management since our admins didn't like the Cyclades, but unfortunately we lost the total power control that we had with them :( -- Best regards, Marian Marinov From hahn at mcmaster.ca Fri Oct 9 21:57:34 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Re: recommendation on crash cart for a cluster room:fullcluster KVM is not an option I suppose? In-Reply-To: <20091009212745.GH4382@bx9.net> References: <255432.27220.qm@web30604.mail.mud.yahoo.com> <20091009212745.GH4382@bx9.net> Message-ID: >> I'm not sure why you keep going on about bios updates. they're easy and >> automatable, so not a real issue for clusters. no offense! > > Mark, > > I don't think you've had a look at that many vendors, right? Mostly a > single vendor? completely true - besides our quite a few dl145 and dl145g2 nodes, I've only done floppy-pxe bios updates on some oem tyan systems. > Our reseller has some magic tools that let them generate a > bios-flashing boot disk with a non-standard default setting. But this well, I didn't say that the vendor provided these pxe images - we normally have to fiddle with dosemu and with the dos boot files to make them work non-interactively. > tool does not allow AHCI to be turned on. I had to manually manipulate > every single node after flashing. My crash cart came in quite handy. ugh.
do you have SOL or similar, that might be scripted with expect to do the AHCI enabling? also, this is AHCI for sata, right? I guess I had the impression that AHCI could be selected by the driver without bios help... From hahn at mcmaster.ca Fri Oct 9 22:09:45 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Best Practices SOL vs Cyclades ACS In-Reply-To: <200910100254.49024.mm@yuhu.biz> References: <200910100254.49024.mm@yuhu.biz> Message-ID: > We have more than 400 machines. Every month there is one machine that we can > not reboot using IPMI or the SOL is not working. we have something like 2500 nodes, mostly HP dl145g2's, and have a BMC-wedge probably 6-12 times/year. can I ask what brand/model has such flakey IPMI? if you run "ipmi mc reset" on the node, does it resolve the problem? I wonder whether flakiness might also correspond to some config or usage pattern. (ours dhcp from a local server - actually all the traffic is local.) From mm at yuhu.biz Fri Oct 9 22:33:23 2009 From: mm at yuhu.biz (Marian Marinov) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Best Practices SOL vs Cyclades ACS In-Reply-To: References: <200910100254.49024.mm@yuhu.biz> Message-ID: <200910100833.24322.mm@yuhu.biz> On Saturday 10 October 2009 08:09:45 Mark Hahn wrote: > > We have more than 400 machines. Every month there is one machine that we > > can not reboot using IPMI or the SOL is not working. > > we have something like 2500 nodes, mostly HP dl145g2's, and have a > BMC-wedge probably 6-12 times/year. can I ask what brand/model has such > flakey IPMI? if you run "ipmi mc reset" on the node, does it resolve the > problem? I wonder whether flakiness might also correspond to some config or > usage pattern. (ours dhcp from a local server - actually all the traffic > is local.) These are only Dell machines used for shared hosting. Usually these problems appear when there is DoS/DDoS or very high system resource usage (for example, load over 100 on a machine with 4 cores). Our problem is that in such situations IPMI is sometimes unreliable: you can neither connect over serial nor reboot the machine. -- Best regards, Marian Marinov From tiagomnm at gmail.com Sat Oct 10 04:36:24 2009 From: tiagomnm at gmail.com (Tiago Marques) Date: Wed Nov 25 01:08:59 2009 Subject: [Beowulf] Home Beowulf In-Reply-To: <73111254851774@webmail20.yandex.ru> References: <58241254741577@webmail83.yandex.ru> <35AAF1E4A771E142979F27B51793A4888702AE6C7D@AVEXMB1.qlogic.org> <73111254851774@webmail20.yandex.ru> Message-ID: Hi, I've run some tests with an i7 920 and OCZ DDR-1600 CAS7 and the results were very good for bandwidth-friendly benchmarks, and I definitely recommend you stay away from the usual stock memory of DDR3-1066@CAS9. About performance with CFD I have no idea. VASP particularly liked the extra bandwidth and yielded an extra 15% performance for RAM that cost about the same. Other benchmarks may perform better and others won't care. I'm somewhat new to this world but haven't found codes for which latency matters. Games, as you might find on those websites, are the main programs that benefit from reduced latency, and there's only one exception to the rule, which is UT3-based games. They heavily benefit from bandwidth, which tells me they did some serious optimization to benefit from it, which is the trend the industry is taking of late: more bandwidth no matter the latency (at least when it comes to memory technology only).
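If you want to see how much of that is raw memory bandwidth rather than anything clever, McCalpin's STREAM is the quick check. A rough sketch of one way to run it (untuned, flags and thread count illustrative only):

  wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
  # bump the array-size #define in stream.c well past the total cache first
  gcc -O3 -fopenmp stream.c -o stream
  OMP_NUM_THREADS=4 ./stream          # one thread per core; compare the Triad MB/s

Comparing the Triad number for DDR3-1066 against DDR3-1600 on the same board puts an upper bound on what a bandwidth-bound code like VASP can gain from the faster DIMMs.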
Best regards On 10/6/09, Dmitry Zaletnev wrote: > > >> > On Behalf Of Dmitry Zaletnev >> > >> > I just thinking about a 2-nodes Beowulf for a CFD application that >> > reads/writes small amounts of data in random areas of RAM (non-cached >> > translation). What would be the difference in overall performance (I >> > suppose, due to the memory access time) between Q9650/ 8 GB DDR2-800 >> > and i7 920/ 6 GB DDR3-1866. >> you probably mean "DDR3-1066". > Tom, thank you for your answer, here I mean 1866 MHz, OCZ3P1866C7LV6GK kits. > I expect their availability must grow up till the end of the winter when I'm > going to finalize the construction of my home Beowulf. I'm slightly > confused: the test results from the Corsair site tell about significant > difference in performance between i7 with 1333 MHz memory and 1866 MHz > memory, from other side Tom's hardware forum tells that the only parameter > that matters is the latency and there's no difference between a low-latency > 1333 MHz memory and 1600 MHz memory. But I'm afraid that chosing 1333 MHz > memory now will eliminate the sense of future upgrade to Xeon (motherboard > supports it) or Fermi-in-SLI. > > Best regards, > Dmitry > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From niftyompi at niftyegg.com Sun Oct 11 13:29:29 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC9C423.10003@gmx.com> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> <4AC9C423.10003@gmx.com> Message-ID: <20091011202929.GA3032@tosh2egg.ca.sanfran.comcast.net> On Mon, Oct 05, 2009 at 12:02:11PM +0200, Tomislav Maric wrote: > Nifty Tom Mitchell wrote: > > On Sun, Oct 04, 2009 at 01:08:27PM +0200, Tomislav Maric wrote: > >> Mark Hahn wrote: > >>>> I've seen Centos mentioned a lot in connection to HPC, am I making a > >>>> mistake with Ubuntu?? > >>> distros differ mainly in their desktop decoration. for actually > >>> getting cluster-type work done, the distro is as close to irrelevant > >>> as imaginable. a matter of taste, really. it's not as if the distros > >>> provide the critical components - they merely repackage the kernel, > >>> libraries, middleware, utilities. wiring yourself to a distro does > >>> affect when you can or have to upgrade your system, though. > >>> > > > > An interesting perspective is to have a cluster built > > on the same distro as your desktop because.... ..... > > OK, I guess then Ubuntu will suffice for a 12 node Cluster. :) Anyway, > I'll try it and see. Thanks! I think it will work just fine for you. As you structure thiings make it clear in your mind where important data lives. Also, that compute nodes commonly do not contain data long term and that in most cluster admin environments they get re-imaged at the drop of a hat and upgraded as a group so all the nodes in a cluster always "look the same" to your programs. N.B. This is not strictly true when using a cluster based file system. The common practice of reimaging nodes all the same except the IP address will have to also address the distributed FS components. And, Perhaps overkill for a personal cluster. 
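For the "manual" route asked about earlier in the diskless thread, the moving parts are small enough to sketch. Package names and paths below are Debian/Ubuntu-flavoured guesses from memory, not a recipe:

  # on the head node: DHCP + TFTP + pxelinux + an NFS-exported root
  apt-get install dhcp3-server tftpd-hpa syslinux nfs-kernel-server
  cp /usr/lib/syslinux/pxelinux.0 /var/lib/tftpboot/
  mkdir -p /var/lib/tftpboot/pxelinux.cfg /srv/nfsroot

  # populate a root filesystem for the nodes and export it read-only
  debootstrap lenny /srv/nfsroot
  echo "/srv/nfsroot 10.10.10.0/24(ro,no_root_squash,async)" >> /etc/exports
  exportfs -ra

  # default PXE entry: node kernel plus an initrd built with NFS-root support
  cat > /var/lib/tftpboot/pxelinux.cfg/default <<'EOF'
  DEFAULT diskless
  LABEL diskless
    KERNEL vmlinuz
    APPEND initrd=initrd.img root=/dev/nfs nfsroot=10.10.10.1:/srv/nfsroot ip=dhcp ro
  EOF

Plus a dhcpd.conf that hands out addresses and points next-server/filename at the TFTP server; Perceus and friends automate exactly this plumbing.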
While youd are tinkering and researching look at the ROCKS solution to clustering. ROCKS is best described as a management package that lives on top of RHEL or CentOS. It includes all the necessary components for a larger (midsized) central managed multi user MPI compute cluster, user management, batch system, kickstart, management database etc. If you download it and CentOS then deploy it on your compute nodes for a test drive you may learn a bit about all the moving parts needed to manage a midsized cluster. Their solution has a very specific set of goals and if they match your needs it is way cool. If not it is not flexible and might be painful. Do download the documentation to scan. Also bookmark "http://www.clustermonkey.net/"! -- T o m M i t c h e l l Found me a new hat, now what? From a.travis at abdn.ac.uk Mon Oct 12 04:43:50 2009 From: a.travis at abdn.ac.uk (Tony Travis) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC9C423.10003@gmx.com> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> <4AC9C423.10003@gmx.com> Message-ID: <4AD31676.2060303@abdn.ac.uk> Tomislav Maric wrote: > [...] > OK, I guess then Ubuntu will suffice for a 12 node Cluster. :) Anyway, > I'll try it and see. Thanks! Hello, Tomislav. Ubuntu is based on Debian, which is a well established server OS, and I think you should consider Ubuntu LTS as a 'serious' option. It does more than just 'suffice' for our 90-node Beowulf: It meets our requirements. Bye, Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:a.travis@abdn.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From rpnabar at gmail.com Mon Oct 12 10:05:02 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] One time passwords and two factor authentication for a HPC setup (might be offtopic? ) Message-ID: In all the tiny clusters I've managed so far I've had primitive (I think) access control by strong [sic] passwords. How practical is it for a small HPC setup to think about rolling out a two-factor, one-time-password system? [I apologize if this might be somewhat offtopic for HPC;it could be termed a generic Linux logon problem but I couldn't find many leads in my typical linux.misc group.] I've used RSA type cards in the past for accessing larger supercomputing environments and they seem fairly secure but I suspect that kind of setup is too large (expensive, proprietary, complicated) for us. Are there any good open source alternatives? The actual time-seeded random-number generation key fobs seem pretty cheap (less than $20 a piece e.g. http://www.yubico.com/products/yubikey/ ). So the hardware is OK but I still need the backend software to tie it in to /etc/passwd or PAM or some such mechanism. The software I found was either Win-based or catered to apache or email etc. I did find VASCO and CryptoCard but am not sure they are the right fit. I looked around at open source but couldn't find much. Are other sys-admins using some form of OTP. What options do I have? Of course, I know that OTP and two-factor is not some magic bullet that makes my security watertight; but I still think its more secure than static user passwords. 
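For the yubikey specifically there is an open-source PAM module (pam_yubico), so the glue can be fairly thin. The sketch below is from memory and untested; the package name, the authfile format, and the module options should be checked against the pam_yubico documentation rather than taken as-is:

  # install the yubico PAM module (packaged in some distros, build from source otherwise),
  # then map local usernames to the 12-character public ID each key emits
  # (the username and ID below are made up):
  cat >> /etc/yubikey_mappings <<'EOF'
  rahul:ccccccbchvth
  EOF
  # finally stack it ahead of the normal password check, e.g. in /etc/pam.d/sshd:
  #   auth required pam_yubico.so authfile=/etc/yubikey_mappings id=<api-client-id>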
-- Rahul From john.hearns at mclaren.com Mon Oct 12 10:08:08 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Ahoy shipmates Message-ID: <68A57CCFD4005646957BD2D18E60667B0D768E45@milexchmb1.mil.tagmclarengroup.com> Flicking through New Scientist dated 3rd October, an article on green technologies suggests sending data centres out to sea. The idea is to use raw seawater for cooling, and you could use wave power for generating electricity. Seemingly this guy's idea: http://cseweb.ucsd.edu/~vahdat/ So to be a Beowulf guru of the future you'll have to have your sea legs as well as your IPMI fingers. Just as long as there is a rum ration I'll take the King's shilling! The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From john.hearns at mclaren.com Mon Oct 12 10:27:24 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] One time passwords and two factor authentication for aHPC setup (might be offtopic? ) In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B0D768E57@milexchmb1.mil.tagmclarengroup.com> for us. Are there any good open source alternatives? The actual time-seeded random-number generation key fobs seem pretty cheap (less than $20 a piece e.g. http://www.yubico.com/products/yubikey/ ). the back-end for cards like these is a Radius server (if I'm not wrong - last time I looked at stuff like this was a frighteningly long time ago). Look for Openradius. I looked around at open source but couldn't find much. Are other sys-admins using some form of OTP. What options do I have? I opie any use to you? http://wiki.linuxquestions.org/wiki/Opie The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From james.p.lux at jpl.nasa.gov Mon Oct 12 11:00:39 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] One time passwords and two factor authentication for a HPC setup (might be offtopic? ) In-Reply-To: Message-ID: On 10/12/09 10:05 AM, "Rahul Nabar" wrote: > In all the tiny clusters I've managed so far I've had primitive (I > think) access control by strong [sic] passwords. How practical is it > for a small HPC setup to think about rolling out a two-factor, > one-time-password system? > > [I apologize if this might be somewhat offtopic for HPC;it could be > termed a generic Linux logon problem but I couldn't find many leads in > my typical linux.misc group.] > > I've used RSA type cards in the past for accessing larger > supercomputing environments and they seem fairly secure but I suspect > that kind of setup is too large (expensive, proprietary, complicated) > for us. Probably cheaper than you think. They're certainly used a lot. I checked about a year ago, and I seem to recall it's about $50/user (for the token) plus some annual fee of comparable magnitude for the server side. Google "RSA SecurID Cost" Are there any good open source alternatives? The actual > time-seeded random-number generation key fobs seem pretty cheap (less > than $20 a piece e.g. 
http://www.yubico.com/products/yubikey/ ). So > the hardware is OK but I still need the backend software to tie it in > to /etc/passwd or PAM or some such mechanism. The software I found was > either Win-based or catered to apache or email etc. I did find VASCO > and CryptoCard but am not sure they are the right fit. This is what going with RSA buys you.. They have basically turnkey solutions for every operating system known. Yes, you're beholden to a proprietary solution, but think of it like being beholden to Intel or AMD. Any "authentication" mechanism should use standard interfaces, so if you decide to go to some other authentication scheme, it's transparent. I use a SecureID token every day at work.. It's not a big pain for me, BUT, there are really lame implementations of the basic concept that I've heard about, particularly ones that require physical connection to the key (usually using a Chip&PIN style access card and a reader) I've also not ever lost my key.. Lose that key and work comes to a grinding halt until you get a new one, so make sure you have the needed support infrastructure in place to accommodate your particular availability needs. Also, one big advantage is that you can, if you make it part of an overall security infrastructure, do away with the need to remember eighty bazillion different passwords, each with different expiration cycles and different rules for including "strong" characters (the latter which I believe is aimed at a threat that doesn't really dominate the risk spectrum any more..brute force attacks on passwd files or something). These days it's compromise by social engineering, or sniffing the wire, shoulder surfing kinds of things. Modern schemes for simple passwords (try 3 times and get locked out for some long time, etc.) pretty much defeat the brute force at the user interface approach.. You can't get enough tries in a reasonable time to crack the password, especially if you've got something that watches the logs and says "hey, someone has tried 100 different times to get into account XYZ in the last 24 hours!" It's the "I've got a copy of passwd and I'm taking it to my lair under the volcano to process it on my 2048 node beowulf cluster of FPGAs programmed to crack 3DES" thing that you have to guard against (by other means). From rpnabar at gmail.com Mon Oct 12 11:43:58 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] One time passwords and two factor authentication for a HPC setup (might be offtopic? ) In-Reply-To: References: Message-ID: Thanks for the comments Jim and John! On Mon, Oct 12, 2009 at 1:00 PM, Lux, Jim (337C) wrote: > > This is what going with RSA buys you.. They have basically turnkey solutions > for every operating system known. >? ?Yes, you're beholden to a proprietary > solution, but think of it like being beholden to Intel or AMD. ?Any > "authentication" mechanism should use standard interfaces, so if you decide > to go to some other authentication scheme, it's transparent. True. But I get the feeling that so far as my needs are "linux logins" *only*, is the whole project so complicated to merit a "turnkey" solution? All one needs is a random number generator synced to a very accurate clock isn't it? I can understand using RSA type solutions for a full security initiative; but here is just one standalone server needing OTPs. Or am I oversimplyfying the situation. Maybe there are hidden things to be taken care of that aren't obvious to me. 
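The "random number generator synced to a clock" part really is that small; one standard way to do it (HOTP, RFC 4226, driven by a clock-derived counter) is just an HMAC of the current 30-second interval. A rough bash illustration with a made-up hex secret (needs openssl and xxd); the hard part is key distribution and the PAM/validation plumbing, not this arithmetic:

  #!/bin/bash
  SECRET=3132333435363738393031323334353637383930   # made-up shared secret (hex)
  STEP=$(( $(date +%s) / 30 ))                       # 30-second time counter
  COUNTER=$(printf '%016x' "$STEP")                  # as an 8-byte big-endian value
  HMAC=$(printf '%s' "$COUNTER" | xxd -r -p | \
        openssl dgst -sha1 -mac HMAC -macopt hexkey:$SECRET -r | cut -d' ' -f1)
  OFFSET=$(( 0x${HMAC: -1} ))                        # RFC 4226 dynamic truncation
  printf '%06d\n' $(( (0x${HMAC:$((OFFSET*2)):8} & 0x7fffffff) % 1000000 ))

Both ends compute the same six digits, so the only state to protect is the shared secret and a sane clock.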
I'm not saying "I can hack this myself" but am curious how come an open source project hasn't come along.....OTOH maybe it is such a specialized need still that not enough developers feel the need to put their time into it. > > I use a SecureID token every day at work.. It's not a big pain for me, BUT, > there are really lame implementations of the basic concept that I've heard > about, particularly ones that require physical connection to the key > (usually using a Chip&PIN style access card and a reader) Yess! For sure. I used to work in a corporate stup where this was a nightmare. Some "smart" sys admin had decided to use the same magnetic ID for door-access, photoID, and computer OTP authentication. You needed the ID-card inserted into a reader connected via USB to the computer. The moment you pulled the card out the computer logged off. Result was that each time you went for a coffee or a restroom break you had a logged out machine. Now there might be some who may say that's the point but I just think it was a pain. > I've also not ever lost my key.. Lose that key and work comes to a grinding > halt until you get a new one, so make sure you have the needed support > infrastructure in place to accommodate your particular availability needs. Aren't there overrides? Let's say PersonA loses his card. Can't I set the system to accept just a plain-old-PW for the 4 days till he gets a new hardware key? > Also, one big advantage is that you can, if you make it part of an overall > security infrastructure, do away with the need to remember eighty bazillion > different passwords, I just have a standalone HPC server I admin. > It's the "I've got a copy of passwd and I'm taking it to my lair under the > volcano to process it on my 2048 node beowulf cluster of FPGAs programmed to > crack 3DES" thing that you have to guard against (by other means). Now that, I have no idea how to prevent? "other means"? Short of, I/P based blocking I have no clue how I could outwit a foe of such resources and talent. -- Rahul From nixon at nsc.liu.se Mon Oct 12 12:30:53 2009 From: nixon at nsc.liu.se (Leif Nixon) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] One time passwords and two factor authentication for a HPC setup (might be offtopic? ) In-Reply-To: (Rahul Nabar's message of "Mon\, 12 Oct 2009 12\:05\:02 -0500") References: Message-ID: Rahul Nabar writes: > Are there any good open source alternatives? The actual > time-seeded random-number generation key fobs seem pretty cheap (less > than $20 a piece e.g. http://www.yubico.com/products/yubikey/ ). So > the hardware is OK but I still need the backend software to tie it in > to /etc/passwd or PAM or some such mechanism. The software I found was > either Win-based or catered to apache or email etc. I did find VASCO > and CryptoCard but am not sure they are the right fit. Uhm, you did find the open-source PAM libs, validation servers, etc, for the yubikey? -- / Swedish National Infrastructure for Computing Leif Nixon - Security officer < National Supercomputer Centre \ Nordic Data Grid Facility From nixon at nsc.liu.se Mon Oct 12 12:35:03 2009 From: nixon at nsc.liu.se (Leif Nixon) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] One time passwords and two factor authentication for a HPC setup (might be offtopic? ) In-Reply-To: (Rahul Nabar's message of "Mon\, 12 Oct 2009 13\:43\:58 -0500") References: Message-ID: Rahul Nabar writes: > Yess! For sure. I used to work in a corporate stup where this was a > nightmare. 
Some "smart" sys admin had decided to use the same magnetic > ID for door-access, photoID, and computer OTP authentication. You > needed the ID-card inserted into a reader connected via USB to the > computer. The moment you pulled the card out the computer logged off. > Result was that each time you went for a coffee or a restroom break > you had a logged out machine. That's the point. > Now there might be some who may say that's the point but I just think > it was a pain. Sure it's a pain. Hopefully there had been a proper risk analysis carried out that indicated that it still was the right thing to do. -- / Swedish National Infrastructure for Computing Leif Nixon - Security officer < National Supercomputer Centre \ Nordic Data Grid Facility From mm at yuhu.biz Mon Oct 12 13:33:08 2009 From: mm at yuhu.biz (Marian Marinov) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Ahoy shipmates In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D768E45@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D768E45@milexchmb1.mil.tagmclarengroup.com> Message-ID: <200910122333.09534.mm@yuhu.biz> On Monday 12 October 2009 20:08:08 Hearns, John wrote: > Flicking through New Scientist dated 3rd October, an article on green > technologies suggests > sending data centres out to sea. The idea is to use raw seawater for > cooling, and you could use wave power for > generating electricity. Seemingly this guy's idea: > http://cseweb.ucsd.edu/~vahdat/ > > > So to be a Beowulf guru of the future you'll have to have your sea legs > as well as your IPMI fingers. > Just as long as there is a rum ration I'll take the King's shilling! > > The contents of this email are confidential and for the exclusive use of > the intended recipient. If you receive this email in error you should not > copy it, retransmit it, use it or disclose its contents but should return > it to the sender immediately and delete your copy. > > I don't know about the wave power but the cooling power of the ocean or sea water is pretty good idea to look at. And add to that the Container based DC designs, you can have a very efficient Underwater DC. The only problem will be physically accessing the DC. Or you can simply build your DC near the shore and have pipes directly to the ocean/sea(however this approach will not be as effective as an underwater DC. Maybe one thing that we are missing here is how this will effect the environment of the ocean/sea shores. We will effectively add a few hundred water heaters. -- Best regards, Marian Marinov From poral at fct.unl.pt Sun Oct 11 09:10:50 2009 From: poral at fct.unl.pt (Paulo Afonso Lopes) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Best Practices SOL vs Cyclades ACS In-Reply-To: References: <200910100254.49024.mm@yuhu.biz> Message-ID: <32182.89.180.64.142.1255277450.squirrel@webmail.fct.unl.pt> >> We have more then 400 machines. Every month there is one machine that we >> can >> not reboot using IPMI or the SOL is not working. > > we have something like 2500 nodes, mostly HP dl145g2's, and have a > BMC-wedge > probably 6-12 times/year. can I ask what brand/model has such flakey > IPMI? > if you run "ipmi mc reset" on the node, does it resolve the problem? > I wonder whether flakiness might also correspond to some config or usage > pattern. (ours dhcp from a local server - actually all the traffic is > local.) Mark, Do you have SOL on the HP DL145-G2 ? 
I also have these nodes, and although I can use most ipmi functions (including remote access power up/cycle), I can not get SOL to work. Also, i have noticed that the kipmi0 daemon does consume a little bit, e.g., 45 minutes for 9 days uptime (with the top default refresh, it shows up every 4 screens or so). (CentOS 5.3) Regards, paulo -- Paulo Afonso Lopes | Tel: +351- 21 294 8536 Departamento de Inform?tica | 294 8300 ext.10702 Faculdade de Ci?ncias e Tecnologia | Fax: +351- 21 294 8541 Universidade Nova de Lisboa | e-mail: poral@fct.unl.pt 2829-516 Caparica, PORTUGAL From poral at fct.unl.pt Sun Oct 11 10:23:07 2009 From: poral at fct.unl.pt (Paulo Afonso Lopes) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Swaping dual by quad-core Opterons in 4-year old mobo Message-ID: <32423.89.180.64.142.1255281787.squirrel@webmail.fct.unl.pt> I have been in a HP seminar where the AMD invited presenter told us that quad-core Opterons are socket-compatible with older dual cores. He further said that there will be a small performance drop in motherboards that do not have the split-power capabality. Has anybody successfully conducted such a swap? In HP DL145-G2 ? Regards, paulo PS: I'm still awaiting HP reply (> 2 weeks) -- Paulo Afonso Lopes | Tel: +351- 21 294 8536 Departamento de Inform?tica | 294 8300 ext.10702 Faculdade de Ci?ncias e Tecnologia | Fax: +351- 21 294 8541 Universidade Nova de Lisboa | e-mail: poral@fct.unl.pt 2829-516 Caparica, PORTUGAL From gmkurtzer at gmail.com Mon Oct 12 11:50:10 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <20091011202929.GA3032@tosh2egg.ca.sanfran.comcast.net> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> <4AC9C423.10003@gmx.com> <20091011202929.GA3032@tosh2egg.ca.sanfran.comcast.net> Message-ID: <571f1a060910121150q47382afbwa2be2194404b7544@mail.gmail.com> On Sun, Oct 11, 2009 at 1:29 PM, Nifty Tom Mitchell wrote: > On Mon, Oct 05, 2009 at 12:02:11PM +0200, Tomislav Maric wrote: >> Nifty Tom Mitchell wrote: >> > On Sun, Oct 04, 2009 at 01:08:27PM +0200, Tomislav Maric wrote: >> >> Mark Hahn wrote: >> >>>> I've seen Centos mentioned a lot in connection to HPC, am I making a >> >>>> mistake with Ubuntu?? >> >>> distros differ mainly in their desktop decoration. ?for actually >> >>> getting cluster-type work done, the distro is as close to irrelevant >> >>> as imaginable. ?a matter of taste, really. ?it's not as if the distros >> >>> provide the critical components - they merely repackage the kernel, >> >>> libraries, middleware, utilities. ?wiring yourself to a distro does >> >>> affect when you can or have to upgrade your system, though. >> >>> I seem to not have the original sources to this thread, but this is something that I thought I should chime in on. The underlying components that make up a distribution are in-fact an important component to an HPC system in its entirety. There are many reasons for this, but I will focus on just a few that I hope don't strike too much of a religious chord with people while at the same time letting me rant a bit. ;-) 1) HPC people are quite familiar with building their scientific apps with optimized compilers and libraries. 
If an application is linking against any OS libraries (yes, including the C library) it would probably make sense to make sure those have been compiled with an optimal build environment. Most distributions do not do this, as for a single standalone system the results may or not even be noticeable. I have been part of large benchmark projects to evaluate the differences of the distributions. In a nutshell, differences become more obvious at scale. 2) Distributions focused on non-HPC targets may not include tools, libraries or even functions that would be beneficial for HPC. And in addition to that, they may not be included in a way that makes it very usable. For example, Just because a distribution contains a package does not mean that is what people should use. For example, using a distribution supplied version of Open MPI would be an injustice to the majority of cluster users, but many distributions consider themselves HPC ready because they have some HPC capable libs. It is more important to have a solution for creating a suitable HPC environment. For example, in Caos NSA the core OS is RPM based but we also utilize a source based "ports-like" tree for building scientific packages, and then we integrate with Environment Modules to make them available for the users. So one can do: # cd /usr/src/cports/packages/openmpi/1.3.3 # make install COMPILERS=intel # make clean # make install COMPILERS=gcc # su - user $ module load openmpi/1.3.3-intel $ mpicc -show icc ........ $ module unload openmpi $ module load openmpi/1.3.3-gcc $ mpicc -show gcc ........ 3) HPC distributions should be focused on being lightweight and efficient. Bloat free stateless environments are important for keeping node operating systems quiet and supportive of HPC code. Lightweight and bloat free does not mean an ancient and featureless core environment either. 4) Even the kernel for most distributions is tuned for desktop use which tries to give the fairest share of CPU time to all processes (obviously not HPC supportive). 5) Clusters are not worth BETA quality code. Unstable environments with no long term plans for upstream support makes it a ridiculous solution for anybody trying to build a production environment. The number of unsuitable solutions that we had to "rescue" because they were running Fedora and totally unmaintainable by the people that integrated them is just silly. It really is a shame that religion plays such a big part of what OS someone would use because each OS (even non-Linux) are good for certain things. I can understand wanting to leverage an economy of scale with a homogenous environment, but there is a particular point where the economy of scale is no longer justified when shoe-horning a non-suitable solution onto a cluster. Where that line is really depends on the admins, users, the size of the system, and what they baseline their benchmark for success at. Just my $0.02 for what it's worth. Greg -- Greg M. Kurtzer Chief Technology Officer HPC Systems Architect Infiscale, Inc. - http://www.infiscale.com From dimitrios.v.gerasimatos at jpl.nasa.gov Mon Oct 12 13:08:47 2009 From: dimitrios.v.gerasimatos at jpl.nasa.gov (Gerasimatos, Dimitrios V (343K)) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Disappointing floating point performance for X5560 versus SPECmarks? Message-ID: <6F127CF61C0FE143B5E46BF06036094F952A29B82F@ALTPHYEMBEVSP30.RES.AD.JPL> According to SPECfp2006, the X5560 should blow the doors off of the E5430. The X5560 scores 36 while the E5430 scores about 18. 
However, our own benchmarking using nbench, unixbench, and a home-grown utility (twobod) all show that any differences are attributed to clock speed.

                        X5560@2.80GHz   E5430@2.66GHz   X5560/E5430   X5560/E5430@2.80
 (twobod)    time:         27.6 s          27.8 s          1.01           0.96
 (nbench)    FP INDEX:     34.294          30.767          1.11           1.06
             INT INDEX:    14.974          14.835          1.01           0.96
 (unixbench) SCORE:        1325.5          868.3           1.53           1.45
             reg. Dhry:    997.6           844.6           1.18           1.12
             DP Whet:      393.1           359.8           1.09           1.04

Has anyone else seen this? Any ideas to explain why this is? I am thinking of going with the 3.2 GHz version of the Harpertown chip clocked over the Nehalem-based 5500 series. Dimitri -- Dimitrios Gerasimatos dimitrios.gerasimatos@jpl.nasa.gov Section 343 Jet Propulsion Laboratory 4800 Oak Grove Dr. Mail Stop 264-820 Pasadena, CA 91109 Voice: 818.354.4910 FAX: 818.393.7413 Cell: 818.726.8617 From rpnabar at gmail.com Mon Oct 12 14:23:53 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] One time passwords and two factor authentication for a HPC setup (might be offtopic? ) In-Reply-To: References: Message-ID: On Mon, Oct 12, 2009 at 2:35 PM, Leif Nixon wrote: > Rahul Nabar writes: > > That's the point. > >> Now there might be some who may say that's the point but I just think >> it was a pain. > > Sure it's a pain. Hopefully there had been a proper risk analysis > carried out that indicated that it still was the right thing to do. > Fair enough! Thanks Leif, for the view from the "other" side! -- Rahul From landman at scalableinformatics.com Mon Oct 12 14:26:08 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Disappointing floating point performance for X5560 versus SPECmarks? In-Reply-To: <6F127CF61C0FE143B5E46BF06036094F952A29B82F@ALTPHYEMBEVSP30.RES.AD.JPL> References: <6F127CF61C0FE143B5E46BF06036094F952A29B82F@ALTPHYEMBEVSP30.RES.AD.JPL> Message-ID: <4AD39EF0.2040105@scalableinformatics.com> Gerasimatos, Dimitrios V (343K) wrote: > According to SPECfp2006, the X5560 should blow the doors off of the E5430. > The X5560 scores 36 while the E5430 scores about 18. > > > However, our own benchmarking using nbench, unixbench, and a home-grown > utility (twobod) all show that any differences are attributed to clock speed. I wouldn't use specfp**** ratios as a realistic guide for performance comparison. As always, use your own code. If your code shows 2x, great. If not, then attribute the specfp**** to what they are (marketing numbers, with some grounding in reality, but not a firm comparison metric for dissimilar apps). -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From niftyompi at niftyegg.com Mon Oct 12 14:29:49 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Disappointing floating point performance for X5560 versus SPECmarks? In-Reply-To: <6F127CF61C0FE143B5E46BF06036094F952A29B82F@ALTPHYEMBEVSP30.RES.AD.JPL> References: <6F127CF61C0FE143B5E46BF06036094F952A29B82F@ALTPHYEMBEVSP30.RES.AD.JPL> Message-ID: <20091012212949.GA3212@compegg> On Mon, Oct 12, 2009 at 01:08:47PM -0700, Gerasimatos, Dimitrios V (343K) wrote: > > > According to SPECfp2006, the X5560 should blow the doors off of the E5430. > The X5560 scores 36 while the E5430 scores about 18.
> > > However, our own benchmarking using nbench, unixbench, and a home-grown > utility (twobod) all show that any differences are attributed to clock speed. Memory, compiler? -- T o m M i t c h e l l Found me a new hat, now what? From tom.elken at qlogic.com Mon Oct 12 14:53:08 2009 From: tom.elken at qlogic.com (Tom Elken) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Disappointing floating point performance for X5560 versus SPECmarks? In-Reply-To: <4AD39EF0.2040105@scalableinformatics.com> References: <6F127CF61C0FE143B5E46BF06036094F952A29B82F@ALTPHYEMBEVSP30.RES.AD.JPL> <4AD39EF0.2040105@scalableinformatics.com> Message-ID: <35AAF1E4A771E142979F27B51793A4888702AE7148@AVEXMB1.qlogic.org> > > Gerasimatos, Dimitrios V (343K) wrote: > > According to SPECfp2006, the X5560 should blow the doors off of the > E5430. > > The X5560 scores 36 while the E5430 scores about 18. > > > > > > However, our own benchmarking using nbench, unixbench, and a home- > grown > > utility (twobod) all show that any differences are attributed to > clock speed. > > I wouldn't use specfp**** ratios as a realistic guide for performance > comparison. As always, use your own code. What Joe says has a lot of merit. But that said, if your applications have advanced along with the capabilities of modern CPUs, SPECfp2006 is a lot better metric than SPECfp2000, which is a better metric than SPECfp95. Each succeeding generation of SPEC CPU benchmark grows substantially in memory footprint and in the memory bandwidth performance required of a CPU/memory system. I note that nbench and unixbench were last developed around 1996-1997 putting them in the same era of benchmarks as SPECfp95. I see that the application components of SPECfp2006 for which X5560 blows the doors off E5430 (by 2x or more) are: 410.bwaves (CFD), 433.milc (QCD), 450.soplex (Linear Programming), 459.GemsFDTD (Computational Electromagnetics), 470.lbm (CFD). These are CFP2006 applications which I remember as having the most memory bandwidth demand from my days on the CPU committee. Since the Nehalem X5560 can do McaCalpin's STREAM (memory bandwidth, OpenMP 8-thread version) benchmark at about 3x the rate of Harpertown E5430, that explains a lot of the difference. If you applications have a small memory footprint, or have great cache re-use, you probably don't need the newer generation of CPU. -Tom > If your code shows 2x, > great. If not, then attribute the specfp**** to what they are > (marketing numbers, with some grounding in reality, but not a firm > comparison metric for dissimilar apps). > > > > > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics Inc. > email: landman@scalableinformatics.com > web : http://scalableinformatics.com > http://scalableinformatics.com/jackrabbit > phone: +1 734 786 8423 x121 > fax : +1 866 888 3112 > cell : +1 734 612 4615 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From bill at cse.ucdavis.edu Mon Oct 12 14:56:47 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] One time passwords and two factor authentication for a HPC setup (might be offtopic? 
) In-Reply-To: References: Message-ID: <4AD3A61F.10903@cse.ucdavis.edu> Rahul Nabar wrote: > [I apologize if this might be somewhat offtopic for HPC;it could be > termed a generic Linux logon problem but I couldn't find many leads in > my typical linux.misc group.] How to secure a valuable network resource like a cluster sounds on topic to me. > I've used RSA type cards in the past for accessing larger > supercomputing environments and they seem fairly secure but I suspect Seemed kinda silly to me. My minimum level for security is something like ssh with certs. So various attacks, things like DNS spoofing, network sniffing, and man-in-the-middle attacks, don't work (assuming users with a clue). > that kind of setup is too large (expensive, proprietary, complicated) > for us. Are there any good open source alternatives? The actual > time-seeded random-number generation key fobs seem pretty cheap (less > than $20 a piece e.g. http://www.yubico.com/products/yubikey/ ). So > the hardware is OK but I still need the backend software to tie it in > to /etc/passwd or PAM or some such mechanism. The software I found was > either Win-based or catered to apache or email etc. I did find VASCO > and CryptoCard but am not sure they are the right fit. Sounds reasonable, so sure you get a one time password, the hard part is making sure nobody sees that password except the intended recipient. So if you buy the yubikey then what? PAM module? ssh client hack? Some webified openid setup? Apparently there are even yubikey emulators out there; I can't see any other way to let someone log in from a smart phone (that lacks a powered usb port). > I looked around at open source but couldn't find much. Are other > sys-admins using some form of OTP. What options do I have? > > Of course, I know that OTP and two-factor is not some magic bullet > that makes my security watertight; but I still think its more secure > than static user passwords. I'd agree there, but more secure than ssh with a valid known host (knowing the key of the server you are logging into) and a certificate... not so sure. Ideally an auth mechanism would handle:
* man in the middle (via attacker upstream, or via dns spoofing).
* sniffing
* compromised client desktop
* brute force
The one approach I've been considering is a smart phone with an out of band connection (wifi or cellular). The server knows a public key associated with a user's smart phone. The user's smartphone knows the public key associated with the server. When you go to log in, a challenge is sent to your smart phone (encrypted with the phone's public key), a cute dialog pops up on the cell phone asking the user if they accept the connection, and the response is sent to the server (encrypted with the server's public key). It would be single-factor authentication (something you have), but much better than an average password which doesn't have much entropy and with which you have to trust the local client. Well, that's based on the idea that a smart phone is much less likely to get hacked than the average desktop.
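The crypto in that round trip is nothing exotic; roughly, in openssl terms (toy key names, no signing and no session binding, so read it as the message flow rather than a design):

  # one-off: each side publishes a public key (phone.pub / server.pub)
  openssl genrsa -out phone.pem 2048 && openssl rsa -in phone.pem -pubout -out phone.pub
  openssl genrsa -out server.pem 2048 && openssl rsa -in server.pem -pubout -out server.pub

  # server: generate a one-time challenge, encrypt it to the phone
  openssl rand -hex 16 > challenge.txt
  openssl rsautl -encrypt -pubin -inkey phone.pub -in challenge.txt -out challenge.enc

  # phone: decrypt, pop the "accept this login?" dialog, answer back encrypted to the server
  openssl rsautl -decrypt -inkey phone.pem -in challenge.enc -out reply.txt
  openssl rsautl -encrypt -pubin -inkey server.pub -in reply.txt -out reply.enc

  # server: the login proceeds only if the decrypted reply matches what it sent
  openssl rsautl -decrypt -inkey server.pem -in reply.enc | cmp - challenge.txt && echo approve login

The hard parts are the ones already listed: pairing the phone's key with the account in the first place, and making sure the dialog can't be replayed or absent-mindedly clicked through.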
From lindahl at pbm.com Mon Oct 12 16:55:02 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <571f1a060910121150q47382afbwa2be2194404b7544@mail.gmail.com> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> <4AC9C423.10003@gmx.com> <20091011202929.GA3032@tosh2egg.ca.sanfran.comcast.net> <571f1a060910121150q47382afbwa2be2194404b7544@mail.gmail.com> Message-ID: <20091012235502.GA19616@bx9.net> On Mon, Oct 12, 2009 at 11:50:10AM -0700, Greg Kurtzer wrote: > The underlying components that make up a distribution are in-fact an > important component to an HPC system in its entirety. There are many > reasons for this, but I will focus on just a few that I hope don't > strike too much of a religious chord with people while at the same > time letting me rant a bit. ;-) Well, allow me to point out the cow turd you may have stepped in ;-) > 1) HPC people are quite familiar with building their scientific apps > with optimized compilers and libraries. If an application is linking > against any OS libraries (yes, including the C library) it would > probably make sense to make sure those have been compiled with an > optimal build environment. Most distributions do not do this, If the library isn't significantly cpu intensive, then you're better off sticking to the best-tested option for your compiler. For gcc, that's -O2. Does gcc even bootstrap at -O3 on your favorite platform? And pass all the test suites? It's not worth risk unless there's a significant performance gain. Bah, humbug, Gentoo, pthui. -- greg From john.hearns at mclaren.com Tue Oct 13 01:19:18 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Ahoy shipmates In-Reply-To: <200910122333.09534.mm@yuhu.biz> References: <68A57CCFD4005646957BD2D18E60667B0D768E45@milexchmb1.mil.tagmclarengroup.com> <200910122333.09534.mm@yuhu.biz> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D768F71@milexchmb1.mil.tagmclarengroup.com> I don't know about the wave power but the cooling power of the ocean or sea water is pretty good idea to look at. And add to that the Container based DC designs, you can have a very efficient Underwater DC. The only problem will be physically accessing the DC. Or you can simply build your DC near the shore and have pipes directly to the ocean/sea(however this approach will not be as effective as an underwater DC. Truth of course being stranger than fiction, there WAS a data centre on Sealand. Sealand was one of the Second World War forts established on the sands of the Channel, which was later inhabited by an eccentric character who claimed it was a separate nation. http://en.wikipedia.org/wiki/HavenCo The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. 
From tjrc at sanger.ac.uk Tue Oct 13 01:42:28 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Ahoy shipmates In-Reply-To: <200910122333.09534.mm@yuhu.biz> References: <68A57CCFD4005646957BD2D18E60667B0D768E45@milexchmb1.mil.tagmclarengroup.com> <200910122333.09534.mm@yuhu.biz> Message-ID: <258C92FC-E449-40F3-BE43-96EEDCA763FC@sanger.ac.uk> On 12 Oct 2009, at 9:33 pm, Marian Marinov wrote: > I don't know about the wave power but the cooling power of the ocean > or sea > water is pretty good idea to look at. Isn't sea water fairly corrosive? You get severe electrolytic corrosion problems on boats, hence the big lump of zinc on a yacht's propshaft... > And add to that the Container based DC designs, you can have a very > efficient > Underwater DC. The only problem will be physically accessing the DC. > Or you > can simply build your DC near the shore and have pipes directly to the > ocean/sea(however this approach will not be as effective as an > underwater DC. You can combine this with what they used to do at Fawley oil-fired power station on the side of the Solent - the warmed seawater encouraged oyster growth, so there were commercial oyster beds around the outlet. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From prentice at ias.edu Tue Oct 13 07:34:32 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: fullcluster KVM is not an option I suppose? In-Reply-To: References: <4AC2EEC6.4000902@scalableinformatics.com> <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> <4ACE2057.70108@ias.edu> Message-ID: <4AD48FF8.1070408@ias.edu> Mark Hahn wrote: >> Even with IPMI, you still need a crash cart of some type to initially >> set up IPMI in the system's BIOS. At the minimum, you need to set the IP >> address that the IMPI interface will listen on (if it's a shared NIC > > afaik, not really. here's what I prefer: cluster nodes normally come > out of the box with BIOS configured to try booting over the net before > local HD. > sometimes this is conditional on the local HD having no active partition. > > great: so they boot from a special PXE image I set up as a catchall. > (dhcpd lets you define a catchall for any not nodes which lack a their own > MAC-specific stanza.) when nodes are in that state, I like to > auto-configure > the cluster's knowlege of them: collect MAC, add to dhcpd.conf, etc. at > this stage, you can also use local (open) ipmi on the node itself to > configure the IPMI LAN interface: > ipmitool lan 2 set password pa55word > ipmitool lan 2 set defgw ipaddr 10.10.10.254 > ipmitool lan 2 set ipsrc dhcp > > none of this precludes tricks like frobing the switch to find the port-MAC > mappings of course - the point is simply that if you let unconfigured nodes > autoboot into a useful image, that image can help you automate more of the > config. My cluster nodes' IPMI share their physical port with the primary NIC. Before using IPMI, I had to enable it in the BIOS and then assign it an IP address in the BIOS, too. I didn't think of using ipmitool. I wonder if I could do all that using ipmitool, without enabling IPMI in the BIOS first. Anyone know for sure? 
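Reports vary by vendor, but the pure-ipmitool route looks roughly like this through the local /dev/ipmi0 interface; whether it flips the same "enabled" bit your BIOS menu shows is board-specific, and the channel/user numbers (1 and 2 here) differ between BMCs, so treat it as a sketch rather than a guarantee:

  modprobe ipmi_si
  modprobe ipmi_devintf
  ipmitool lan print 1                      # see what the BMC currently thinks
  ipmitool lan set 1 ipsrc static
  ipmitool lan set 1 ipaddr 10.10.10.101
  ipmitool lan set 1 netmask 255.255.255.0
  ipmitool lan set 1 defgw ipaddr 10.10.10.254
  ipmitool lan set 1 access on              # enable the LAN channel itself
  ipmitool user set name 2 admin
  ipmitool user set password 2 pa55word
  ipmitool channel setaccess 1 2 callin=on ipmi=on link=on privilege=4
  ipmitool user enable 2
  ipmitool mc reset cold                    # some BMCs only pick the changes up after this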
-- Prentice From djholm at fnal.gov Tue Oct 13 08:33:05 2009 From: djholm at fnal.gov (Don Holmgren) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: fullcluster KVM is not an option I suppose? In-Reply-To: <4AD48FF8.1070408@ias.edu> References: <4AC2EEC6.4000902@scalableinformatics.com> <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> <4ACE2057.70108@ias.edu> <4AD48FF8.1070408@ias.edu> Message-ID: On Tue, 13 Oct 2009, Prentice Bisbal wrote: > Mark Hahn wrote: >>> Even with IPMI, you still need a crash cart of some type to initially >>> set up IPMI in the system's BIOS. At the minimum, you need to set the IP >>> address that the IMPI interface will listen on (if it's a shared NIC >> >> afaik, not really. here's what I prefer: cluster nodes normally come >> out of the box with BIOS configured to try booting over the net before >> local HD. >> sometimes this is conditional on the local HD having no active partition. >> >> great: so they boot from a special PXE image I set up as a catchall. >> (dhcpd lets you define a catchall for any not nodes which lack a their own >> MAC-specific stanza.) when nodes are in that state, I like to >> auto-configure >> the cluster's knowlege of them: collect MAC, add to dhcpd.conf, etc. at >> this stage, you can also use local (open) ipmi on the node itself to >> configure the IPMI LAN interface: >> ipmitool lan 2 set password pa55word >> ipmitool lan 2 set defgw ipaddr 10.10.10.254 >> ipmitool lan 2 set ipsrc dhcp >> >> none of this precludes tricks like frobing the switch to find the port-MAC >> mappings of course - the point is simply that if you let unconfigured nodes >> autoboot into a useful image, that image can help you automate more of the >> config. > > My cluster nodes' IPMI share their physical port with the primary NIC. > Before using IPMI, I had to enable it in the BIOS and then assign it an > IP address in the BIOS, too. I didn't think of using ipmitool. I wonder > if I could do all that using ipmitool, without enabling IPMI in the BIOS > first. Anyone know for sure? > > -- > Prentice With all (and various) flavors of IPMI and BMC hardware on Intel, Supermicro, and Asus systems since about 2004, we've been able to use ipmitool or equivalent software in Linux to setup IPMI LAN and other parameters. Consistently, though, for systems with serial over LAN, we've always had to configure serial port redirect settings in the BIOS. The latter is why we've always tried to get vendors to provide a mechanism for replicating BIOS settings from machine to machine without using a crash cart (not always successfully, unfortunately). Don Holmgren Fermilab From mathog at caltech.edu Tue Oct 13 09:06:09 2009 From: mathog at caltech.edu (David Mathog) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates Message-ID: A sprinkler leak in a computer room is bad, but just imagine the damage that seawater would do. Even a fine mist would be dreadful, as it would be sucked through the cases and the droplets would either short things out immediately, or lead inevitably to corroded metal throughout the machine. To use seawater to cool a room safely it would have to be well isolated, exchanging heat with a loop of much more innocuous fluid that actually enters the room. 
David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From gerry.creager at tamu.edu Tue Oct 13 09:11:22 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] recommendation on crash cart for a cluster room: fullcluster KVM is not an option I suppose? In-Reply-To: References: <4AC2EEC6.4000902@scalableinformatics.com> <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> <4ACE2057.70108@ias.edu> <4AD48FF8.1070408@ias.edu> Message-ID: <4AD4A6AA.6000804@tamu.edu> Don Holmgren wrote: > > On Tue, 13 Oct 2009, Prentice Bisbal wrote: > >> Mark Hahn wrote: >>>> Even with IPMI, you still need a crash cart of some type to initially >>>> set up IPMI in the system's BIOS. At the minimum, you need to set >>>> the IP >>>> address that the IMPI interface will listen on (if it's a shared NIC >>> >>> afaik, not really. here's what I prefer: cluster nodes normally come >>> out of the box with BIOS configured to try booting over the net before >>> local HD. >>> sometimes this is conditional on the local HD having no active >>> partition. >>> >>> great: so they boot from a special PXE image I set up as a catchall. >>> (dhcpd lets you define a catchall for any not nodes which lack a >>> their own >>> MAC-specific stanza.) when nodes are in that state, I like to >>> auto-configure >>> the cluster's knowlege of them: collect MAC, add to dhcpd.conf, etc. at >>> this stage, you can also use local (open) ipmi on the node itself to >>> configure the IPMI LAN interface: >>> ipmitool lan 2 set password pa55word >>> ipmitool lan 2 set defgw ipaddr 10.10.10.254 >>> ipmitool lan 2 set ipsrc dhcp >>> >>> none of this precludes tricks like frobing the switch to find the >>> port-MAC >>> mappings of course - the point is simply that if you let unconfigured >>> nodes >>> autoboot into a useful image, that image can help you automate more >>> of the >>> config. >> >> My cluster nodes' IPMI share their physical port with the primary NIC. >> Before using IPMI, I had to enable it in the BIOS and then assign it an >> IP address in the BIOS, too. I didn't think of using ipmitool. I wonder >> if I could do all that using ipmitool, without enabling IPMI in the BIOS >> first. Anyone know for sure? >> >> -- >> Prentice > > > With all (and various) flavors of IPMI and BMC hardware on Intel, > Supermicro, and Asus systems since about 2004, we've been able to use > ipmitool or equivalent software in Linux to setup IPMI LAN and other > parameters. > > Consistently, though, for systems with serial over LAN, we've always had > to configure serial port redirect settings in the BIOS. The latter is > why we've always tried to get vendors to provide a mechanism for > replicating BIOS settings from machine to machine without using a crash > cart (not always successfully, unfortunately). We've been working with one vendor to get IPMI working better and more predictably with DHCP. Manual config with a crash cart has been a bit problemmatical for us, too. SuperMicro has been real good about custom BIOS mods to help us out in that regard. 
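The dhcpd side of this is plain ISC dhcpd configuration; a minimal sketch of the catchall-pool-plus-registered-hosts layout described above (subnet, filenames and MAC addresses here are made up for illustration):

  subnet 10.10.10.0 netmask 255.255.255.0 {
      option routers 10.10.10.254;
      next-server 10.10.10.1;            # TFTP server holding the catchall PXE image
      filename "pxelinux.0";
      range 10.10.10.200 10.10.10.240;   # pool for MACs not yet registered
  }

  # once a node's MACs have been harvested, pin them down:
  host node001 {
      hardware ethernet 00:30:48:aa:bb:01;
      fixed-address 10.10.10.11;
  }
  host node001-bmc {
      hardware ethernet 00:30:48:aa:bb:02;
      fixed-address 10.10.10.111;
  }

An unknown node (or a BMC set to ipsrc dhcp) lands in the catchall range and boots the registration image; after its MACs are added to dhcpd.conf it comes back with a fixed address.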
gerry From niftyompi at niftyegg.com Tue Oct 13 09:15:11 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <571f1a060910121150q47382afbwa2be2194404b7544@mail.gmail.com> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> <4AC9C423.10003@gmx.com> <20091011202929.GA3032@tosh2egg.ca.sanfran.comcast.net> <571f1a060910121150q47382afbwa2be2194404b7544@mail.gmail.com> Message-ID: <20091013161511.GA6183@compegg> All good points for a larger project, yet less interesting for a personal cluster when the budget is small and the personal desktop and personal work environment is Ubuntu. I do suspect that in 18 months or so the original poster will be looking to update his environment and your good summary will then be apropos. Especially the bit "a source based "ports-like" tree for building scientific packages, and then we integrate with Environment Modules to make them available for the users.....". See module-assistant in Ubuntu and friends. I just deleted a longish comparison of GenToo and British sports cars. Since the OP is not a mechanic and does not have a staff, mechanic or chauffeur to maintain his roadster. I suspect he needs a Toyota, Ford or Chevy with an automatic transmission that he can maintain himself (for now). Oh wait I almost retyped it all.... ;-) On Mon, Oct 12, 2009 at 11:50:10AM -0700, Greg Kurtzer wrote: > On Sun, Oct 11, 2009 at 1:29 PM, Nifty Tom Mitchell > wrote: > > On Mon, Oct 05, 2009 at 12:02:11PM +0200, Tomislav Maric wrote: > >> Nifty Tom Mitchell wrote: > >> > On Sun, Oct 04, 2009 at 01:08:27PM +0200, Tomislav Maric wrote: > >> >> Mark Hahn wrote: > >> >>>> I've seen Centos mentioned a lot in connection to HPC, am I making a > >> >>>> mistake with Ubuntu?? > >> >>> distros differ mainly in their desktop decoration. ?for actually > >> >>> getting cluster-type work done, the distro is as close to irrelevant > >> >>> as imaginable. ?a matter of taste, really. ?it's not as if the distros > >> >>> provide the critical components - they merely repackage the kernel, > >> >>> libraries, middleware, utilities. ?wiring yourself to a distro does > >> >>> affect when you can or have to upgrade your system, though. > >> >>> > > I seem to not have the original sources to this thread, but this is > something that I thought I should chime in on. > > The underlying components that make up a distribution are in-fact an > important component to an HPC system in its entirety. There are many > reasons for this, but I will focus on just a few that I hope don't > strike too much of a religious chord with people while at the same > time letting me rant a bit. ;-) > > > > 1) HPC people are quite familiar with building their scientific apps > with optimized compilers and libraries. If an application is linking > against any OS libraries (yes, including the C library) it would > probably make sense to make sure those have been compiled with an > optimal build environment. Most distributions do not do this, as for a > single standalone system the results may or not even be noticeable. I > have been part of large benchmark projects to evaluate the differences > of the distributions. In a nutshell, differences become more obvious > at scale. > > 2) Distributions focused on non-HPC targets may not include tools, > libraries or even functions that would be beneficial for HPC. 
And in > addition to that, they may not be included in a way that makes it very > usable. For example, Just because a distribution contains a package > does not mean that is what people should use. For example, using a > distribution supplied version of Open MPI would be an injustice to the > majority of cluster users, but many distributions consider themselves > HPC ready because they have some HPC capable libs. > > It is more important to have a solution for creating a suitable HPC > environment. For example, in Caos NSA the core OS is RPM based but we > also utilize a source based "ports-like" tree for building scientific > packages, and then we integrate with Environment Modules to make them > available for the users. So one can do: > > # cd /usr/src/cports/packages/openmpi/1.3.3 > # make install COMPILERS=intel > # make clean > # make install COMPILERS=gcc > # su - user > $ module load openmpi/1.3.3-intel > $ mpicc -show > icc ........ > $ module unload openmpi > $ module load openmpi/1.3.3-gcc > $ mpicc -show > gcc ........ > > > 3) HPC distributions should be focused on being lightweight and > efficient. Bloat free stateless environments are important for keeping > node operating systems quiet and supportive of HPC code. Lightweight > and bloat free does not mean an ancient and featureless core > environment either. > > 4) Even the kernel for most distributions is tuned for desktop use > which tries to give the fairest share of CPU time to all processes > (obviously not HPC supportive). > > 5) Clusters are not worth BETA quality code. Unstable environments > with no long term plans for upstream support makes it a ridiculous > solution for anybody trying to build a production environment. The > number of unsuitable solutions that we had to "rescue" because they > were running Fedora and totally unmaintainable by the people that > integrated them is just silly. > > > It really is a shame that religion plays such a big part of what OS > someone would use because each OS (even non-Linux) are good for > certain things. I can understand wanting to leverage an economy of > scale with a homogenous environment, but there is a particular point > where the economy of scale is no longer justified when shoe-horning a > non-suitable solution onto a cluster. Where that line is really > depends on the admins, users, the size of the system, and what they > baseline their benchmark for success at. > > Just my $0.02 for what it's worth. > > Greg > > > -- > Greg M. Kurtzer > Chief Technology Officer > HPC Systems Architect > Infiscale, Inc. - http://www.infiscale.com -- T o m M i t c h e l l Found me a new hat, now what? From john.hearns at mclaren.com Tue Oct 13 09:46:23 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B0D7FC083@milexchmb1.mil.tagmclarengroup.com> To use seawater to cool a room safely it would have to be well isolated, exchanging heat with a loop of much more innocuous fluid that actually enters the room. Dave, I really think that is how you would do it. I was a bit loose in terminology. You don't pump the raw cooling water from your private bit of sea (or, ahem, lake) round your computer room. The contents of this email are confidential and for the exclusive use of the intended recipient. 
If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From gmkurtzer at gmail.com Mon Oct 12 18:27:25 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <20091012235502.GA19616@bx9.net> References: <4AC78CC8.5060500@gmx.com> <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> <4AC9C423.10003@gmx.com> <20091011202929.GA3032@tosh2egg.ca.sanfran.comcast.net> <571f1a060910121150q47382afbwa2be2194404b7544@mail.gmail.com> <20091012235502.GA19616@bx9.net> Message-ID: <571f1a060910121827s457d6f04l54e2bc032e86bc71@mail.gmail.com> Hi Greg, On Mon, Oct 12, 2009 at 4:55 PM, Greg Lindahl wrote: > On Mon, Oct 12, 2009 at 11:50:10AM -0700, Greg Kurtzer wrote: > >> The underlying components that make up a distribution are in-fact an >> important component to an HPC system in its entirety. There are many >> reasons for this, but I will focus on just a few that I hope don't >> strike too much of a religious chord with people while at the same >> time letting me rant a bit. ;-) > > Well, allow me to point out the cow turd you may have stepped in ;-) Yes. Well aware of the mine field of cow pies all around this area which was why my response was somewhat carefully constructed instead of spoken frankly. haha > >> 1) HPC people are quite familiar with building their scientific apps >> with optimized compilers and libraries. If an application is linking >> against any OS libraries (yes, including the C library) it would >> probably make sense to make sure those have been compiled with an >> optimal build environment. Most distributions do not do this, > > If the library isn't significantly cpu intensive, then you're better > off sticking to the best-tested option for your compiler. For gcc, > that's -O2. Does gcc even bootstrap at -O3 on your favorite platform? > And pass all the test suites? It's not worth risk unless there's a > significant performance gain. Yes, I think we are in total agreement just saying it differently. Generally there is the core library stack and the rest... The core stack should NOT be rebuilt by the end users especially with non-standard compilers and/or optimizations as that will make for a very difficult to support system. This should be done by the distribution maintainers themselves such that it is maintainable, tested and optimized for the target platform. Some binary based distributions spend much more time on the tool-chains they use to build the OS then others and this can have a *noticeable* performance impact. > Bah, humbug, Gentoo, pthui. While Gentoo may have some possible,... or potential,... or advertised performance benefits I personally do not run it on the systems that I architect because of the inconsistencies with different builds and installs (I prefer to use identical packages/binaries from one system to another via RPM guarantying consistency). Regards, Greg -- Greg M. Kurtzer Chief Technology Officer HPC Systems Architect Infiscale, Inc. 
- http://www.infiscale.com From dzaletnev at yandex.ru Tue Oct 13 06:22:31 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Home Beowulf In-Reply-To: <35AAF1E4A771E142979F27B51793A4888702AE6C7D@AVEXMB1.qlogic.org> References: <58241254741577@webmail83.yandex.ru> <35AAF1E4A771E142979F27B51793A4888702AE6C7D@AVEXMB1.qlogic.org> Message-ID: <74791255440151@webmail48.yandex.ru> Hi, may be somebody will find interesting this CFD perfomance tests for different desktop i7/memory configs: http://techreport.com/articles.x/15967/4 Sincerely, Dmitry > > On Behalf Of Dmitry Zaletnev > > > > I just thinking about a 2-nodes Beowulf for a CFD application that > > reads/writes small amounts of data in random areas of RAM (non-cached > > translation). What would be the difference in overall performance (I > > suppose, due to the memory access time) between Q9650/ 8 GB DDR2-800 > > and i7 920/ 6 GB DDR3-1866. > you probably mean "DDR3-1066". > Concerning overall CFD performance, I would go for the i7 920. > For CFD performance data, one good resource are the FLUENT benchmark pages: > http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/index.htm > click on one of the 6 test datasets there and get tables of benchmark performance. For example the Aricraft_2m model (with at: > http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/problems/aircraft_2m.htm > The Nehalem (same architecture as i7 920) generation: > INTEL WHITEBOX (INTEL_X5570_NHM4,2930,RHEL5)* > FLUENT rating at 8 cores: = 784.9 > The Harpertown generation (same architecture as Q9650, I think) > INTEL WHITEBOX (INTEL_X5482_HTN4,3200,RHEL5) > FLUENT rating at 8 cores: = 307.6 > * To help decode these brief descriptions of processors, systems & interconnects, you can often find more details at this page: http://www.fluent.com/software/fluent/fl6bench/new.htm > THe FLUENT rating is a "bigger is better" metric relating to the number of jobs you can run in 24 hours, IIRC. > So with the newer generation, on these FLUENT benchmarks anyway, you get approximately 2x the performance. > > By the way, are local variables of > > methods/functions stored in L2 cache? > There are 3 levels of cache in the i7, 2 levels in the Q9650. > All your problems are likely to use all levels of cache. You are right that data you tend to reuse frequently will tend to live in the lower cache levels longer, but there are no hard and fast rules. > Bets of luck, > Tom > > The code will use all the system > > memory for data storage, excluding used by OS, no swapping supposed. > > I'm going to buy motherboards next month. > > > > Thank you in advance for any advice, > > -- > > Dmitry Zaletnev > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin > > Computing > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > ??????? ????? ????????? ?????: http://mail.yandex.ru/promo/new/colors From atchley at myri.com Tue Oct 13 09:58:52 2009 From: atchley at myri.com (Scott Atchley) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates In-Reply-To: References: Message-ID: <47805E61-5F00-4A1A-9687-0F87B4CEF407@myri.com> On Oct 13, 2009, at 12:06 PM, David Mathog wrote: > A sprinkler leak in a computer room is bad, but just imagine the > damage > that seawater would do. 
Even a fine mist would be dreadful, as it > would > be sucked through the cases and the droplets would either short things > out immediately, or lead inevitably to corroded metal throughout the > machine. It is worse than that. You are in a 100% humidity environment. The seawater is cooler than the air temp, so you get condensation everywhere below deck or anywhere the chilled water pipes run. Scott From rgb at phy.duke.edu Tue Oct 13 10:40:22 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D7FC083@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D7FC083@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Tue, 13 Oct 2009, Hearns, John wrote: > To use seawater to cool a room safely it would have to be well isolated, > exchanging heat with a loop of much more innocuous fluid that actually > enters the room. Check out the Duke Marine Lab website. They basically do this, via geothermal exchange units (but they're on an island -- the ocean is basically the ultimate heat sink). I think that the technology is straightforward and readily available these days. The only real hassle is that ocean water is corrosive as hell, so the heat exchangers have to be stainless steel and/or barnacle proof unless you bury the exchangers and rely on slow diffusive convection back to the ocean. rgb > > Dave, I really think that is how you would do it. I was a bit loose in > terminology. > You don't pump the raw cooling water from your private bit of sea (or, > ahem, lake) round > your computer room. > > > The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From prentice at ias.edu Tue Oct 13 12:51:13 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D7FC083@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4AD4DA31.1020705@ias.edu> Robert G. Brown wrote: > On Tue, 13 Oct 2009, Hearns, John wrote: > >> To use seawater to cool a room safely it would have to be well isolated, >> exchanging heat with a loop of much more innocuous fluid that actually >> enters the room. > > Check out the Duke Marine Lab website. They basically do this, via > geothermal exchange units (but they're on an island -- the ocean is > basically the ultimate heat sink). > > I think that the technology is straightforward and readily available > these days. The only real hassle is that ocean water is corrosive as > hell, so the heat exchangers have to be stainless steel and/or barnacle > proof unless you bury the exchangers and rely on slow diffusive > convection back to the ocean. > Pex tubing might work. It's plastic, so it shouldn't be as corrosive as metal. 
It is used in radiant heating applications so it must be a decent heat conductor, for a plastic. I'm sure you would eventually still get some kind of buildup on the walls, affecting the condictivity. -- Prentice From hahn at mcmaster.ca Tue Oct 13 13:19:55 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Disappointing floating point performance for X5560 versus SPECmarks? In-Reply-To: <6F127CF61C0FE143B5E46BF06036094F952A29B82F@ALTPHYEMBEVSP30.RES.AD.JPL> References: <6F127CF61C0FE143B5E46BF06036094F952A29B82F@ALTPHYEMBEVSP30.RES.AD.JPL> Message-ID: > However, our own benchmarking using nbench, unixbench, and a home-grown > utility (twobod) all show that any differences are attributed to clock speed. do you have any sense for whether these are entirely in-cache benchmarks? that's the most obvious explanation. nehalem is all about the new memory interface - the cores are basically just core2's (with HT added and some smallish tweaks to caches, etc). basically 2x 2-wide GP flops/cycle/core. From hahn at mcmaster.ca Tue Oct 13 13:47:44 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Best Practices SOL vs Cyclades ACS In-Reply-To: <32182.89.180.64.142.1255277450.squirrel@webmail.fct.unl.pt> References: <200910100254.49024.mm@yuhu.biz> <32182.89.180.64.142.1255277450.squirrel@webmail.fct.unl.pt> Message-ID: > Do you have SOL on the HP DL145-G2 ? to be honest, I've never tried it. our machines came with HP's XC distro, which includes a console logging/terminal command. it appears to use the telnet interface, though, not ipmi sol. > I also have these nodes, and although I can use most ipmi functions > (including remote access power up/cycle), I can not get SOL to work. my naive attempt to use sol gets me "Insufficient privilege level" (that's for the admin account, which has "OEM" priv level. over local/open "ipmi sol info 2" gets me: Error requesting SOL parameter 'Set In Progress (0)': Invalid command ) > Also, i have noticed that the kipmi0 daemon does consume a little bit, that's a thread for the local/open ipmi interface. you don't need it at all if you use the lan interface. From lindahl at pbm.com Tue Oct 13 17:35:13 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <571f1a060910121827s457d6f04l54e2bc032e86bc71@mail.gmail.com> References: <4AC7BF39.8030504@abdn.ac.uk> <4AC7C768.5070705@gmx.com> <4AC8822B.6000603@gmx.com> <20091005002116.GA3130@tosh2egg.ca.sanfran.comcast.net> <4AC9C423.10003@gmx.com> <20091011202929.GA3032@tosh2egg.ca.sanfran.comcast.net> <571f1a060910121150q47382afbwa2be2194404b7544@mail.gmail.com> <20091012235502.GA19616@bx9.net> <571f1a060910121827s457d6f04l54e2bc032e86bc71@mail.gmail.com> Message-ID: <20091014003513.GB11663@bx9.net> On Mon, Oct 12, 2009 at 06:27:25PM -0700, Greg Kurtzer wrote: > Yes, I think we are in total agreement just saying it differently. > > Generally there is the core library stack and the rest... The core > stack should NOT be rebuilt by the end users especially with > non-standard compilers and/or optimizations as that will make for a > very difficult to support system. This should be done by the > distribution maintainers themselves such that it is maintainable, > tested and optimized for the target platform. But all distros aren't alike. If Red Hat wants to build a certain way, they have enough QA and testing that I'll trust the result. 
If Joe Small Distro Guy builds the entire distro with gcc -O3, and does a small amount of QA on the result, the first thought that springs to mind is "Gentoo! Run!!!" -- greg From csamuel at vpac.org Tue Oct 13 19:37:31 2009 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] RAID for home beowulf In-Reply-To: <4AC929FA.6060201@gmx.com> Message-ID: <840724075.4493611255487851904.JavaMail.root@mail.vpac.org> ----- "Tomislav Maric" wrote: > How do you mean differences in config? I'm configuring the master, > and the other nodes are to be diskless.. I have separated these > partitions: > /swap /boot / /var and /home. Is this ok? Yes, that's fine, there are 2 reasons to stick with partitions in these days of large drives, and both are really questions of preference rather than solid technical reasons. 1) You can run different filesystems on them, so you might want to be conservative with your root filesystem and use ext3 but something faster for your home directory and use XFS or ext4 (or visa versa). For instance my work laptop has been using btrfs as /home since February and I make sure I back that up just in case. To date not had any issues though. 2) If something goes bonkers and fills /home then you don't run out of space on vital parts of the filesystem like /var so your logging or the apt-get dist-upgrade / yum upgrade you were in the middle of is still running OK. But a single partition (well, with a separate swap) is going to be OK.. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Tue Oct 13 19:59:21 2009 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: RAID for home beowulf In-Reply-To: <1588045044.4493721255488203298.JavaMail.root@mail.vpac.org> Message-ID: <1162019079.4494461255489161584.JavaMail.root@mail.vpac.org> ----- "Tomislav Maric" wrote: > Well, I don't need too much scratch space, the important part of the > disk is the /home with the results. What file system should I use for > it, ext3? A couple of things to bear in mind.. 1) If you are using LVM on top of RAID you will need to tell the mkfs command how your RAID is constructed to get optimal performance (though LVM seems to impose a penalty in its own right). 2) If you are using XFS directly on top of an MD partition it should be able to learn the layout automagically and it'll be good. If you're using ext3/ext4 (and presumably others) then you'll need to again specify the layout for best perf. 3) Be aware that the kernel you are using will make a big difference and that the semantics for ext3 have changed a while ago so that it no longer uses data=ordered but instead uses data=writeback by default (you can override that in your /etc/fstab or your kernel config if you want). cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. 
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From hahn at mcmaster.ca Tue Oct 13 22:41:13 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: RAID for home beowulf In-Reply-To: <1162019079.4494461255489161584.JavaMail.root@mail.vpac.org> References: <1162019079.4494461255489161584.JavaMail.root@mail.vpac.org> Message-ID: > 3) Be aware that the kernel you are using will make a big > difference and that the semantics for ext3 have changed > a while ago so that it no longer uses data=ordered but > instead uses data=writeback by default (you can override AFAIKT, it's just an option for the default mount config now that wasn't offered before. major distros (I only checked fedora) still do data=ordered, and the latest kernel snapshot still has data=ordered as the default. the kernel config help text points to this: http://ext4.wiki.kernel.org/index.php/Ext3_Data%3DOrdered_vs_Data%3DWriteback_mode which makes a useful point about app-level consistency still requiring fsyncs (which operates above any fs-level consistency guarantees.) also, I find swapfiles more convenient than partitions. some installers always force a swap partition to the tail of the disk, which is pessimal. regards, mark hahn. From eagles051387 at gmail.com Tue Oct 13 23:12:14 2009 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Home Beowulf In-Reply-To: <74791255440151@webmail48.yandex.ru> References: <58241254741577@webmail83.yandex.ru> <35AAF1E4A771E142979F27B51793A4888702AE6C7D@AVEXMB1.qlogic.org> <74791255440151@webmail48.yandex.ru> Message-ID: this is quite an interesting read. it talks about the entry level i5 and compares them to the i7s. granted its comparing the 2 you will learn alot about the i7s as well. http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/22151-intel-lynnfield-core-i5-750-core-i7-870-processor-review.html 2009/10/13 Dmitry Zaletnev > > Hi, > may be somebody will find interesting this CFD perfomance tests for > different desktop i7/memory configs: > > http://techreport.com/articles.x/15967/4 > > Sincerely, > Dmitry > > > > On Behalf Of Dmitry Zaletnev > > > > > > I just thinking about a 2-nodes Beowulf for a CFD application that > > > reads/writes small amounts of data in random areas of RAM (non-cached > > > translation). What would be the difference in overall performance (I > > > suppose, due to the memory access time) between Q9650/ 8 GB DDR2-800 > > > and i7 920/ 6 GB DDR3-1866. > > you probably mean "DDR3-1066". > > Concerning overall CFD performance, I would go for the i7 920. > > For CFD performance data, one good resource are the FLUENT benchmark > pages: > > http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/index.htm > > click on one of the 6 test datasets there and get tables of benchmark > performance. 
For example the Aricraft_2m model (with at: > > > http://www.fluent.com/software/fluent/fl6bench/fl6bench_6.4.x/problems/aircraft_2m.htm > > The Nehalem (same architecture as i7 920) generation: > > INTEL WHITEBOX (INTEL_X5570_NHM4,2930,RHEL5)* > > FLUENT rating at 8 cores: = 784.9 > > The Harpertown generation (same architecture as Q9650, I think) > > INTEL WHITEBOX (INTEL_X5482_HTN4,3200,RHEL5) > > FLUENT rating at 8 cores: = 307.6 > > * To help decode these brief descriptions of processors, systems & > interconnects, you can often find more details at this page: > http://www.fluent.com/software/fluent/fl6bench/new.htm > > THe FLUENT rating is a "bigger is better" metric relating to the number > of jobs you can run in 24 hours, IIRC. > > So with the newer generation, on these FLUENT benchmarks anyway, you get > approximately 2x the performance. > > > By the way, are local variables of > > > methods/functions stored in L2 cache? > > There are 3 levels of cache in the i7, 2 levels in the Q9650. > > All your problems are likely to use all levels of cache. You are right > that data you tend to reuse frequently will tend to live in the lower cache > levels longer, but there are no hard and fast rules. > > Bets of luck, > > Tom > > > The code will use all the system > > > memory for data storage, excluding used by OS, no swapping supposed. > > > I'm going to buy motherboards next month. > > > > > > Thank you in advance for any advice, > > > -- > > > Dmitry Zaletnev > > > _______________________________________________ > > > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin > > > Computing > > > To change your subscription (digest mode or unsubscribe) visit > > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > ??????? ????? ????????? ?????: http://mail.yandex.ru/promo/new/colors > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091014/144d4154/attachment.html From csamuel at vpac.org Wed Oct 14 00:31:42 2009 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: RAID for home beowulf In-Reply-To: <1471705539.4498101255501887399.JavaMail.root@mail.vpac.org> Message-ID: <1273447343.4498251255505502725.JavaMail.root@mail.vpac.org> Hi Mark, ----- "Mark Hahn" wrote: > AFAIKT, it's just an option for the default mount > config now that wasn't offered before. The kernel config option CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set by default in Linus's git tree. The changes were introduced in 2.6.30. That release also sets the default atime settings (for all filesystems I believe) to relatime, you need a mount command that understands the strictatime option to be able to change that one. > major distros (I only checked fedora) still do > data=ordered, and the latest kernel snapshot > still has data=ordered as the default. Sounds like they've changed the kernel defaults. > the kernel config help text points to this: > http://ext4.wiki.kernel.org/index.php/Ext3_Data%3DOrdered_vs_Data%3DWriteback_mode > which makes a useful point about app-level consistency still requiring > fsyncs (which operates above any fs-level consistency guarantees.) 
Oh indeed - Ted T'so apologised for his part in ext3's data=ordered mode at the Linux Storage and Filesystem workshop 2009. http://lwn.net/Articles/327601/ Look for "Rename, fsync, and ponies". :-) cheers! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From rgb at phy.duke.edu Wed Oct 14 09:33:32 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates In-Reply-To: <4AD4DA31.1020705@ias.edu> References: <68A57CCFD4005646957BD2D18E60667B0D7FC083@milexchmb1.mil.tagmclarengroup.com> <4AD4DA31.1020705@ias.edu> Message-ID: On Tue, 13 Oct 2009, Prentice Bisbal wrote: > I'm sure you would eventually still get some kind of buildup on the > walls, affecting the condictivity. Yeah, the problem with barnacles is that you start getting significant encrustation in as little as a month of warm seawater exposure. The little suckers are ubiquitous and attach themselves in hours, grow in weeks. We dropped a lawn chair in the water by accident off of our dock last summer and hooked it and reeled it back in a month later, already partially covered with barnacles. There are paints and chemical compounds and so on that are "resistant", but dealing with them is still an ongoing battle of boat owners etc. rgb > > -- > Prentice > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From gus at ldeo.columbia.edu Wed Oct 14 10:47:53 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D7FC083@milexchmb1.mil.tagmclarengroup.com> <4AD4DA31.1020705@ias.edu> Message-ID: <4AD60EC9.7080905@ldeo.columbia.edu> Robert G. Brown wrote: > On Tue, 13 Oct 2009, Prentice Bisbal wrote: > >> I'm sure you would eventually still get some kind of buildup on the >> walls, affecting the condictivity. > > Yeah, the problem with barnacles is that you start getting significant > encrustation in as little as a month of warm seawater exposure. The > little suckers are ubiquitous and attach themselves in hours, grow in > weeks. We dropped a lawn chair in the water by accident off of our dock > last summer and hooked it and reeled it back in a month later, already > partially covered with barnacles. Yes, barnacles stick to hydrophone streamers, which move at 5 knots or so. Just as they do to the bodies of whales. They cause a lot of trouble for seismic imaging geophysical surveys, increasing the water drag on streamers, reducing the signal to noise ratio on hydrophones, etc. Some pictures: http://www.csiro.au/resources/Biofouling.html http://cmst.curtin.edu.au/news/e&p/e&pbarni.jpg http://www.glaucus.org.uk/Barnacles.htm http://www.learner.org/jnorth/images/graphics/u-z/gwhale_Barnacles.wmv Streamers need to be cleaned on a regular basis, during the surveys. 
There are devices to clean them up, lubricants to reduce the barnacle capacity to stick, etc: http://www.patentstorm.us/patents/7145833/description.html http://www.freepatentsonline.com/y2006/0144286.html OTOH, just to stay on the topic, seismic survey vessels have beefed up computer labs (or small data centers) on board, for data acquisition and QC, data processing (seismic imaging), etc. Not cooled by seawater, but not significantly affected by corrosion either. Gus Correa > > There are paints and chemical compounds and so on that are "resistant", > but dealing with them is still an ongoing battle of boat owners etc. > > rgb > >> >> -- >> Prentice >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Wed Oct 14 11:23:38 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates In-Reply-To: <4AD60EC9.7080905@ldeo.columbia.edu> References: <68A57CCFD4005646957BD2D18E60667B0D7FC083@milexchmb1.mil.tagmclarengroup.com> <4AD4DA31.1020705@ias.edu> <4AD60EC9.7080905@ldeo.columbia.edu> Message-ID: <20091014182338.GD29808@bx9.net> You guys all read Slashdot, right? http://hardware.slashdot.org/article.pl?sid=08/09/06/1755216 and http://arstechnica.com/hardware/news/2008/01/new-startup-looking-to-set-up-floating-data-centers.ars -- greg From amacater at galactic.demon.co.uk Wed Oct 14 12:53:53 2009 From: amacater at galactic.demon.co.uk (Andrew M.A. Cater) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates In-Reply-To: <20091014182338.GD29808@bx9.net> References: <68A57CCFD4005646957BD2D18E60667B0D7FC083@milexchmb1.mil.tagmclarengroup.com> <4AD4DA31.1020705@ias.edu> <4AD60EC9.7080905@ldeo.columbia.edu> <20091014182338.GD29808@bx9.net> Message-ID: <20091014195353.GA18906@galactic.demon.co.uk> On Wed, Oct 14, 2009 at 11:23:38AM -0700, Greg Lindahl wrote: > You guys all read Slashdot, right? > > http://hardware.slashdot.org/article.pl?sid=08/09/06/1755216 > > and > > http://arstechnica.com/hardware/news/2008/01/new-startup-looking-to-set-up-floating-data-centers.ars > > -- greg > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf Isn't there someone from one of the US Navy facilities who posts here occasionally. I seem to remember talk of fitting a Beowulf in the "spare space" in a submarine torpedo room or similar cramped quarters. Hmmm - does water cooled in a Polaris count :) AndyC From rgb at phy.duke.edu Wed Oct 14 15:25:52 2009 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates In-Reply-To: <20091014182338.GD29808@bx9.net> References: <68A57CCFD4005646957BD2D18E60667B0D7FC083@milexchmb1.mil.tagmclarengroup.com> <4AD4DA31.1020705@ias.edu> <4AD60EC9.7080905@ldeo.columbia.edu> <20091014182338.GD29808@bx9.net> Message-ID: On Wed, 14 Oct 2009, Greg Lindahl wrote: > You guys all read Slashdot, right? > > http://hardware.slashdot.org/article.pl?sid=08/09/06/1755216 Good luck with the patents. Har har har. They'll be in line, there, and in e.g. Europe they'll laugh at them. In the US too if we have any sense. rgb > http://arstechnica.com/hardware/news/2008/01/new-startup-looking-to-set-up-floating-data-centers.ars > > -- greg > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From tomislav.maric at gmx.com Thu Oct 15 04:47:31 2009 From: tomislav.maric at gmx.com (tomislav.maric@gmx.com) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] RAID for home beowulf Message-ID: <20091015115855.248810@gmx.com> -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091015/5ba5cb98/attachment.html From rpnabar at gmail.com Thu Oct 15 09:56:02 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] One time passwords and two factor authentication for a HPC setup (might be offtopic? ) In-Reply-To: <4AD3A61F.10903@cse.ucdavis.edu> References: <4AD3A61F.10903@cse.ucdavis.edu> Message-ID: Thanks Bill; great comments. But they only make my decision harder. :) On Mon, Oct 12, 2009 at 4:56 PM, Bill Broadley wrote: > Seemed kinda silly to me. ?My minimum level for security is something like ssh > with certs. Of course, we use ssh too. Are certs. the same as a public-key private-key exchange? >So various attacks don't work, things like spoofing dns, network > sniffing, and man in the middle attacks don't work (assuming users with a clue). What can "users with a clue" do to defeat these kinds of attacks? Yes, users can choose smart passwords and protect them but how do users factor into protecting against spoofing, network sniffing etc? Of course, users are important for social engineering sort of attacks but in these other more advanced hacking strategies how can users protect themselves? I thought these were more ripe for sys admin level security solutions. > > Sounds reasonable, so sure you get a one time password, >the hard part is > making sure nobody sees that password except the intended recipient. Yes, agreed. But that problem exists whenever you use *any* password exchange. OTP's just reduce the risk of an intercepted P/W being continuously reused, correct? > ?So if > you buy the yubikey then what? ?Pam module? I usually put my trust in PAM. Seems a large enough and Open Source Project that in general ought to be vulnerability proof. >?ssh client hack? I wouldn't trust a ssh hack unless it was widely adopted and proven secure. >Some webified > openid setup? Dones't sound too appealing. Webifying what can otherwise be done via the command line is usally making things more insecure. 
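If the token does end up being driven through PAM, the cluster-side plumbing is usually only a few lines; a hand-wavy sketch follows -- pam_otp_example.so is a placeholder name for whatever module the token vendor or OTP package actually ships, not a real module:

  # /etc/pam.d/sshd  (sketch only)
  auth    required    pam_otp_example.so
  auth    required    pam_unix.so

  # /etc/ssh/sshd_config -- let sshd hand the challenge to PAM
  UsePAM yes
  ChallengeResponseAuthentication yes
  PasswordAuthentication no

The sshd options above are standard OpenSSH; the interesting (and risky) part is entirely inside whichever OTP module gets plugged in.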
>I can't see any other way > to let someone login from a smart phone (that lacks a powered usb port). I was just shopping for a H/W token that generated OTPs It doesn't have to actually connect to the login machine. The user could just type in the OTP in response to a command line challenge. > I'd agree there, but more secure than ssh with a valid known host (knowing the > key of the server you are logging into) and a certificate... not so sure. "Valid known hosts" are great but the reality is that many times users travel and would like to log in from a Laptop or off-site login PC that doesn't always have a static I/P etc. > The one approach I've been considering is a smart phone with an out of band > connection (wifi or cellular). ?The server knows a public key associated with > a user's smart phone. ?The user's smartphone knows the public key associated > with the server. ?When you go to login a challenge is sent to your smart phone > (encrypted in it's public key), a cute dialog pops on the cell phone asking > the user if they accept the connection, and the response is sent to the server > (encrypted in it's public key). Great idea. But we'd have to impliment that from scratch, correct? There is no app that does this for us automatically. I'd hate to write a secure app. More than writing just about any app something that needs to be secure requires a set of programming skills that I just doubt we have in-house. Besides, unless it has a large base using and trying to break it one would be unsure of its robustness. > It would be single factor authentication (something you have), but much better > than an average password which doesn't have much entropy and with which you > have to trust the local client. ?Well that's based on the idea that a smart > phone is much less likely to get hacked then the average desktop. Agreed. -- Rahul From hahn at mcmaster.ca Thu Oct 15 21:52:26 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] One time passwords and two factor authentication for a HPC setup (might be offtopic? ) In-Reply-To: References: <4AD3A61F.10903@cse.ucdavis.edu> Message-ID: >> Seemed kinda silly to me.  My minimum level for security is something like ssh >> with certs. > > Of course, we use ssh too. Are certs. the same as a public-key > private-key exchange? I think Bill meant the publickey mode of ssh, presumably encrypted ones (that is, passphrases.) technically 'cert' normally refers to X.509 certificates (as in SSL), which are somewhat more involved than ssh PKs. >> So various attacks don't work, things like spoofing dns, network >> sniffing, and man in the middle attacks don't work (assuming users with a clue). > > What can "users with a clue" do to defeat these kinds of attacks? Yes, > users can choose smart passwords and protect them but how do users > factor into protecting against spoofing, network sniffing etc? your password has to either be disabled (in favor of PK) or else unguessable. > Of course, users are important for social engineering sort of attacks > but in these other more advanced hacking strategies how can users > protect themselves? I thought these were more ripe for sys admin level > security solutions. ssh takes care of the connection, so there are still two vulnerabilities. if you're still using a password over ssh, it can be sniffed (visually, etc). but more fundamentally, the machine you're sitting in front of is your main vulnerability, so don't connect from any machine you don't trust. 
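For the publickey mode being recommended here, the usual sequence is short; a sketch using stock OpenSSH defaults (the host name is made up):

  # on the client: generate a keypair, protected by a passphrase
  ssh-keygen -t rsa -b 2048

  # push the public half to the login node's authorized_keys
  ssh-copy-id user@login.cluster.example

  # once key logins work, disable passwords on the server side
  # by setting "PasswordAuthentication no" in /etc/ssh/sshd_config
  ssh user@login.cluster.example

With passwords disabled there is nothing reusable to sniff; the passphrase only ever unlocks the private key on the client machine.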
using PKs means your password doesn't get sniffed, but you're still sunk of your ssh client machine is compromised. (be careful with agent forwarding, as well...) >> Sounds reasonable, so sure you get a one time password, >> the hard part is >> making sure nobody sees that password except the intended recipient. > > Yes, agreed. But that problem exists whenever you use *any* password > exchange. OTP's just reduce the risk of an intercepted P/W being > continuously reused, correct? since the password is one-time, it doesn't do any good to sniff it. > "Valid known hosts" are great but the reality is that many times users > travel and would like to log in from a Laptop or off-site login PC > that doesn't always have a static I/P etc. not relevant - it's the ssh client that wants to verify the hostkey of the server it's connecting to. From jellogum at gmail.com Fri Oct 16 12:40:01 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D7FC083@milexchmb1.mil.tagmclarengroup.com> <4AD4DA31.1020705@ias.edu> Message-ID: Mid 1990s, trying to learn about Linux, I was influenced the most be a news report of a Linux box working more than a year at sea without down time... for an oceanography project. Hull work is just the average day routine, average chore, work for a mariner is a way of life... Scraping benthic organisms from the hull is more work than applying protectant. Having cut thick sheets of alum to cut through the hull, to swap props for jet turbines to fix a design flaw, was a great deal of work. Engineering mistakes early will be revealed downstream during maintenance. I imagine the desired chassis for the data island might be modular with redundant parts, the fewer unique parts the better. I wonder what their flag will look like. Baker On Wed, Oct 14, 2009 at 9:33 AM, Robert G. Brown wrote: > On Tue, 13 Oct 2009, Prentice Bisbal wrote: > > I'm sure you would eventually still get some kind of buildup on the >> walls, affecting the condictivity. >> > > Yeah, the problem with barnacles is that you start getting significant > encrustation in as little as a month of warm seawater exposure. The > little suckers are ubiquitous and attach themselves in hours, grow in > weeks. We dropped a lawn chair in the water by accident off of our dock > last summer and hooked it and reeled it back in a month later, already > partially covered with barnacles. > > There are paints and chemical compounds and so on that are "resistant", > but dealing with them is still an ongoing battle of boat owners etc. > > rgb > > >> -- >> Prentice >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20091016/f4191de3/attachment.html From cbergstrom at pathscale.com Tue Oct 13 13:01:47 2009 From: cbergstrom at pathscale.com (=?ISO-8859-1?Q?=22C=2E_Bergstr=F6m=22?=) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Ahoy shipmates In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D7FC083@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4AD4DCAB.3060205@pathscale.com> Robert G. Brown wrote: > On Tue, 13 Oct 2009, Hearns, John wrote: > >> To use seawater to cool a room safely it would have to be well isolated, >> exchanging heat with a loop of much more innocuous fluid that actually >> enters the room. > > Check out the Duke Marine Lab website. They basically do this, via > geothermal exchange units (but they're on an island -- the ocean is > basically the ultimate heat sink). Wait.. so global warming hasn't taught us anything yet and now we're off to boil the oceans :P ./Christopher From deadline at eadline.org Mon Oct 19 14:02:00 2009 From: deadline at eadline.org (Douglas Eadline) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] SC09 Help Message-ID: <37441.192.168.1.213.1255986120.squirrel@mail.eadline.org> An opportunity has come up for me to demonstrate some progress in my Limulus project at SC09. I have negotiated a corner of a trade show booth in which to demonstrate some hardware. Think of this hardware: http://limulus.basement-supercomputing.com/wiki/NorbertLimulus placed in a case that measures 22x20x9 (inches) with a single PS (and some other goodies). The booth opportunity came late in the game after I had already rented myself out as a "booth geek" to various vendors. I am interested if there are individual(s) who could stand next to the machine and answer questions. Any amount of time would be appreciated. You will get filled in on the design before hand. I will provide a badge if needed and you will also get a Limulus t-shirt and of course immediate fame. If you know of a local Linux users group or students who might be interested, please forward this email to them. Thanks - Oh and look for an announcement about the Monday night Beobash/LECCIBG real soon. (I'm not organizing it so has a good chance of being successful) -- Doug From carsten.aulbert at aei.mpg.de Tue Oct 20 01:28:53 2009 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Recent experience with Tyan MoBo? Message-ID: <200910201028.53881.carsten.aulbert@aei.mpg.de> Hi all, we have a large number of Supermicro based systems and got a couple of offers based on Tyan systems. Is there someone here in the "cloud" who has good or bad experience with recent motherboards from Tyan especially the S7025 one? Cheers and thanks a lot in advance Carsten From deadline at eadline.org Tue Oct 20 05:12:03 2009 From: deadline at eadline.org (Douglas Eadline) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Free Webinar Thursday: Dynamic Provisioning In-Reply-To: <200910201028.53881.carsten.aulbert@aei.mpg.de> References: <200910201028.53881.carsten.aulbert@aei.mpg.de> Message-ID: <56241.192.168.1.213.1256040723.squirrel@mail.eadline.org> I'll be moderating a free webinar on Thursday called Cluster 3.0: Dynamic Provisioning with MOAB and XCAT where the idea of dynamic provisioning will be discussed. i.e. booting nodes into different software environments through the scheduler. 
I find this processes interesting because rather than running various environments on top of a virtual layer, dynamic provisioning can boot what ever you want on the nodes -- when the node is allocated to your job. More info and to sign-up http://www.linux-mag.com/id/7565 -- Doug From rpnabar at gmail.com Wed Oct 21 19:18:59 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? Message-ID: On Thu, Oct 8, 2009 at 5:01 PM, Mark Hahn wrote: > > great: so they boot from a special PXE image I set up as a catchall. > (dhcpd lets you define a catchall for any not nodes which lack a their own > MAC-specific stanza.) when nodes are in that state, I like to > auto-configure > the cluster's knowlege of them: collect MAC, add to dhcpd.conf, etc. at > this stage, you can also use local (open) ipmi on the node itself to > configure the IPMI LAN interface: > ipmitool lan 2 set password pa55word > ipmitool lan 2 set defgw ipaddr 10.10.10.254 > ipmitool lan 2 set ipsrc dhcp > Thanks to all the great responses I got for my last post I now have a working IPMI solution! Thanks again guys! This Beowulf list really rocks! :) In case it helps anyone else I thought it might be good to post some points that had me stumped: #####load the required drivers modprobe ipmi_devintf modprobe ipmi_si modprobe ipmi_msghandler ###turn ipmi-over-lan on and set user and passwd. Set it to get its IP over DHCP ipmitool lan set 1 access on ipmitool lan set 1 ipsrc dhcp ipmitool user set password 2 foopasswd ####use this to get the MAC corresponding to the BMC ipmitool lan print ##put MAC into central dhcpd.conf [of course, maybe some of my settings are naiive or erronous; feel free to correct me] I can pretty much remote monitor logs, stats, remote power reset etc. The only two things I cannot: (1) Can't do a Serial-on-LAN (SOL); My Dell server needs a special card (read more money) for this function. That is unfortunate. (2)Can't mod the BIOS settings or even dump them. Is there a way to modify the BIOS settings via. IPMI (in general) I did find the "bootdev" option. So I can change the boot seq. but not the other BIOS settings. Any "industry standard" to change BIOS settings unattended. Preferably over LAN? -- Rahul From rpnabar at gmail.com Wed Oct 21 19:26:48 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: Message-ID: On Wed, Oct 21, 2009 at 9:18 PM, Rahul Nabar wrote: > I did find the "bootdev" option. So I can change the boot seq. but not > the other BIOS settings. > > Any "industry standard" to change BIOS settings unattended. Preferably over LAN? I guess another way to pose the question is: What's the converse command (if at all exists) for "dmidecode". dmidecode (or biosdecode) pops the BIOS settings to console. What can push settings into the BIOS? -- Rahul From hahn at mcmaster.ca Wed Oct 21 20:26:43 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:00 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? 
In-Reply-To:
References:
Message-ID:

> #####load the required drivers
> modprobe ipmi_devintf
> modprobe ipmi_si
> modprobe ipmi_msghandler

IMO, you shouldn't: when permanently active, the local ipmi interface seems to consume noticeable cycles (kipmi thread). just modprobe when you need it and rmmod after...

> (2) Can't mod the BIOS settings or even dump them. Is there a way to
> modify the BIOS settings via IPMI (in general)?

definitely not in general. a vendor could certainly provide IPMI extensions that manipulate bios settings. but the market doesn't seem to find this kind of HPC-mostly concern worthwhile :( dunno, maybe the really big players (google, etc) get satisfaction.

I wonder, though, whether it would kill vendors just to publish the source for their bios.

bios-serial-redirection and serial-over-lan can let you do it manually.

From rpnabar at gmail.com Thu Oct 22 07:40:23 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed Nov 25 01:09:00 2009
Subject: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it?
In-Reply-To: <4AE00AFE.2030707@diamond.ac.uk>
References: <4AE00AFE.2030707@diamond.ac.uk>
Message-ID:

On Thu, Oct 22, 2009 at 2:34 AM, Gregory wrote:
> Rahul Nabar wrote:
>>
>> (1) Can't do Serial-over-LAN (SOL); my Dell server needs a special
>> card (read: more money) for this function. That is unfortunate.
>
> are you sure? have you tried using lanplus instead of lan? (e.g. use
> ipmitool -I lanplus) Never had this problem with our bog-standard dells.
>
>

I get this error:

ipmitool -H 10.0.0.26 -I lanplus -U root sol activate
Password:
Error loading interface lanplus

Any clue what this might be? How can I get the lanplus interface loaded? Is this a driver issue, or does this mean that my BMC does not have a lanplus interface at all?

The LAN interface does work fine, but apparently "sol activate" needs lanplus:

ipmitool -H 10.0.0.26 -I lan -U root sol activate
Error: This command is only available over the lanplus interface

--
Rahul

From greg.matthews at diamond.ac.uk Thu Oct 22 08:11:52 2009
From: greg.matthews at diamond.ac.uk (Gregory Matthews)
Date: Wed Nov 25 01:09:01 2009
Subject: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it?
In-Reply-To:
References: <4AE00AFE.2030707@diamond.ac.uk>
Message-ID: <4AE07638.4070801@diamond.ac.uk>

Rahul Nabar wrote:
> On Thu, Oct 22, 2009 at 2:34 AM, Gregory wrote:
>> Rahul Nabar wrote:
>>> (1) Can't do Serial-over-LAN (SOL); my Dell server needs a special
>>> card (read: more money) for this function. That is unfortunate.
>> are you sure? have you tried using lanplus instead of lan? (e.g. use
>> ipmitool -I lanplus) Never had this problem with our bog-standard dells.
>>
>>
> I get this error:
>
> ipmitool -H 10.0.0.26 -I lanplus -U root sol activate
> Password:
> Error loading interface lanplus

did you compile ipmitool yourself? it might be the libcrypto linking issue described not very satisfactorily here:

http://www.mail-archive.com/ipmitool-devel@lists.sourceforge.net/msg01104.html

perhaps you compiled it without ssl support? It certainly looks like a client-side issue rather than hardware.

GREG

>
> Any clue what this might be? How can I get the lanplus interface
> loaded? Is this a driver issue, or does this mean that my BMC does not
> have a lanplus interface at all?
>
> The LAN interface does work fine, but apparently "sol activate" needs lanplus.
> > ipmitool -H 10.0.0.26 -I lan -U root sol activate > Error: This command is only available over the lanplus interface > -- Greg Matthews 01235 778658 Senior Computer Systems Administrator Diamond Light Source, Oxfordshire, UK From stewart at serissa.com Thu Oct 22 08:14:54 2009 From: stewart at serissa.com (Larry Stewart) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Nahalem / PCIe I/O write latency? Message-ID: <77b0285f0910220814x3d7163cawffb2afa4bcd6c1a@mail.gmail.com> Does anyone know, or know where to find out, how long it takes to do a store to a device register on a Nahelem system with a PCIexpress device? Also, does write combining work with such a setup? I recall that the QLogic Infinipath uses such features to get good short message performance, but my memory of it is pre- Nahelem. Question 2 - if the device stores to an address on which a core is spinning, how long does it take for the core to return the new value? These questions are to inform my thinking on PCIe device design. -Larry (Apologies if a duplicate with misspellings appears, I sent an earlier version from an off-list email.) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091022/104b3bdf/attachment.html From djholm at fnal.gov Thu Oct 22 08:22:22 2009 From: djholm at fnal.gov (Don Holmgren) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: <4AE00AFE.2030707@diamond.ac.uk> Message-ID: On Thu, 22 Oct 2009, Rahul Nabar wrote: > On Thu, Oct 22, 2009 at 2:34 AM, Gregory wrote: >> Rahul Nabar wrote: >>> >>> (1) Can't do a Serial-on-LAN (SOL); My Dell server needs a special >>> card (read more money) for this function. That is unfortunate. >> >> are you sure? have you tried using lanplus instead of lan? (e.g. use >> ipmitool -I lanplus) Never had this problem with our bog-standard dells. >> >> > I get this error: > > ipmitool -H 10.0.0.26 -I lanplus -U root sol activate > Password: > Error loading interface lanplus > > Any clue what this might be? How can I get the lanplus interface > loaded? Is this a driver issue or does this mean that my BMC does not > have a lanplus interface at all. > > The LAN interface does work file but apparantly "sol activate" needs lanplus. > > ipmitool -H 10.0.0.26 -I lan -U root sol activate > Error: This command is only available over the lanplus interface It sounds like your ipmitool was built without lanplus support. Try ipmitool -h and it will list under "Interfaces:" all of the protocols that it will support. You may need to explicitly enable lanplus when you configure the build. Don Holmgren Fermilab From hahn at mcmaster.ca Thu Oct 22 09:11:45 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: <4AE00AFE.2030707@diamond.ac.uk> Message-ID: > ipmitool -H 10.0.0.26 -I lanplus -U root sol activate > Password: > Error loading interface lanplus > > Any clue what this might be? How can I get the lanplus interface > loaded? sounds like your version of ipmitool is compiled without lanplus (or misconfigured in some way) > Is this a driver issue or does this mean that my BMC does not > have a lanplus interface at all. there's no driver here: ipmitool is a user-level network client that sends packets to the bmc. 
if the bmc isn't running a protocol like lanplus, ipmitool will just fail to connect (timeout): [hahn@req773 ~]$ ipmitool -U admin -P whatever -I lan -H cp-req1 chassis status Get Session Challenge command failed Error: Unable to establish LAN session ipmi_lan_send_cmd failed to open intf Error sending Chassis Status command > The LAN interface does work file but apparantly "sol activate" needs lanplus. > > ipmitool -H 10.0.0.26 -I lan -U root sol activate > Error: This command is only available over the lanplus interface IMO, they misuse the term 'interface' there - they mean protocol. From patrick at myri.com Thu Oct 22 10:17:35 2009 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Nahalem / PCIe I/O write latency? In-Reply-To: <77b0285f0910220814x3d7163cawffb2afa4bcd6c1a@mail.gmail.com> References: <77b0285f0910220814x3d7163cawffb2afa4bcd6c1a@mail.gmail.com> Message-ID: <4AE093AF.3030707@myri.com> Hey Larry, Larry Stewart wrote: > Does anyone know, or know where to find out, how long it takes to do a > store to a device register on a Nahelem system with a PCIexpress device? Are you asking for latency or throughput ? For latency, it depends on the distance between the core and the IOH (each QuickPath hop can be ~100 ns if I remember well) and if there are PCIe switches before the device. For throughput, it is limited by the PCIe bandwidth (~75% efficiency of link rate) but you can reach it with 64 Bytes writes. > Also, does write combining work with such a setup? Sure, write-combining works on all Intel CPUs since PentiumIII. It only burts at 64 bytes though, anything else is fragmented at 8 bytes. AMD chips do flush WC at 16, 32 and 64 bytes. And don't assume that because you have WC enabled you will only have 64 bytes writes. Sometimes, specially when there is an interrupt, the WC buffer can be flushed early. And don't assume order between the resulting multiple 8-byte writes either. > I recall that the QLogic Infinipath uses such features to get good short > message performance, but my memory of it is pre- Nahelem. Nehalem just add NUMA overhead, and a lot more memory bandwidth. > Question 2 - if the device stores to an address on which a core is > spinning, how long does it take for the core to return the new value? On NUMA, it depends if you write on the socket where you are spinning. If it's same socket, the cache is immediately invalidated. If you busy-poll on a different socket, cache coherency gets involved and it's more expensive. On the local socket, I would say between 150 and 250ns, assuming no PCIe switch in between (those can cost 150ns more). Patrick From Greg at keller.net Thu Oct 22 13:31:49 2009 From: Greg at keller.net (Greg Keller) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: <200910221720.n9MHKIEK017978@bluewest.scyld.com> References: <200910221720.n9MHKIEK017978@bluewest.scyld.com> Message-ID: <633A1F6C-63E5-47DB-9C31-F495CF032BBD@Keller.net> Hi Rahul, > > [of course, maybe some of my settings are naiive or erronous; feel > free to correct me] OK > > I can pretty much remote monitor logs, stats, remote power reset etc. > The only two things I cannot: > > (1) Can't do a Serial-on-LAN (SOL); My Dell server needs a special > card (read more money) for this function. That is unfortunate. What server? If you have Power Control via IPMI SoL is there (at least anything less than 3 years old. 
Hopefully it's just ipmitool recompile as others suggested... all the prebuilt packages I've ever used had it. > > (2)Can't mod the BIOS settings or even dump them. Is there a way to > modify the BIOS settings via. IPMI (in general) > Dell provides for cmd line changing of almost all BIOS/RAID/Remote Access Card settings via their own tool, omconfig. Proprietary, yes. Works, yes. Avoid all the gui stuff during the installation with a couple install flags and the command line "omconfig", "omreport" are very scriptable. Industry standard, if you're Dell :) Googling for either will get you straight to the command line reference: http://support.dell.com/support/edocs/software/svradmin/1.9/en/cli/cli_cc7c.htm#1093458 Cheers! Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091022/194cf426/attachment.html From rpnabar at gmail.com Thu Oct 22 13:52:01 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: Message-ID: On Wed, Oct 21, 2009 at 10:26 PM, Mark Hahn wrote: >> #####load the required drivers >> modprobe ipmi_devintf >> modprobe ipmi_si >> modprobe ipmi_msghandler > > IMO, you shouldn't: when permanently active, the local ipmi interface seems > to consume noticable cycles (kipmi thread). just modprobe when you need it > and rmmod after... Good point. I will do that. The first time around local impi is useful to change the password and DHCP etc. THat's why I had to load those drivers. > definitely not in general. ?a vendor could certainly provide IPMI > extensions that manipulate bios settings. ?but the market doesn't seem to > find this kind of HPC-mostly concern worthwhile :( Hmm, the lack of a BIOS-mod standard is unfortunate. Maybe it will be there sometime!...shouldn't this be an enterprise-level concern not just HPC? I mean what if you have 200 Windows machines and wanted to turn hyperthreading off in the BIOS. It still boils down to some sort of standardized interface to change it remotely and automatically. Unless the admins change those manually..... > bios-serial-redirection and serial-over-lan can let you do it manually. I know about SOL but what's the "bios-serial-redirection". Is that the same or a different system? -- Rahul From rpnabar at gmail.com Thu Oct 22 13:53:37 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: <4AE00AFE.2030707@diamond.ac.uk> Message-ID: On Thu, Oct 22, 2009 at 10:22 AM, Don Holmgren wrote: > > It sounds like your ipmitool was built without lanplus support. ?Try > ? ipmitool -h > and it will list under "Interfaces:" all of the protocols that it will > support. ?You may need to explicitly enable lanplus when you configure > the build. Thanks Don! You are correct. ipmitool -h Interfaces: open Linux OpenIPMI Interface [default] imb Intel IMB Interface lan IPMI v1.5 LAN Interface Let me try to recompile my IPMI. -- Rahul From rpnabar at gmail.com Thu Oct 22 13:56:11 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? 
In-Reply-To: <633A1F6C-63E5-47DB-9C31-F495CF032BBD@Keller.net> References: <200910221720.n9MHKIEK017978@bluewest.scyld.com> <633A1F6C-63E5-47DB-9C31-F495CF032BBD@Keller.net> Message-ID: On Thu, Oct 22, 2009 at 3:31 PM, Greg Keller wrote: > What server? ?If you have Power Control via IPMI SoL is there (at least > anything less than 3 years old. ?Hopefully it's just ipmitool recompile as > others suggested... all the prebuilt packages I've ever used had it. I'm trying two servers here in my cluster: SC1435 R410 On digging deeper with the helpdesk one of the things is that my R410 comes with a "DRAC Express". The "DRAC Enterprise" seems to have some more features and some say that SOL is one of those. But I still am not sure and will give it a shot. > Dell provides for cmd line changing of almost all BIOS/RAID/Remote Access > Card settings via their own tool, omconfig. ?Proprietary, yes. Works, yes. > ?Avoid all the gui stuff during the installation with a couple install flags > and the command line "omconfig", "omreport" are very scriptable. ?Industry > standard, if you're Dell :) ?Googling for either will get you straight to > the command line reference: > http://support.dell.com/support/edocs/software/svradmin/1.9/en/cli/cli_cc7c.htm#1093458 > Thanks Greg! I'm gonna give that a shot. -- Rahul From hahn at mcmaster.ca Thu Oct 22 14:08:13 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: Message-ID: > there sometime!...shouldn't this be an enterprise-level concern not > just HPC? I mean what if you have 200 Windows machines and wanted to > turn hyperthreading off in the BIOS. first, you don't really depend solely on the bios for HT. next, those 200 boxes are probably spread out, and possibly not remote-managable in the first case. next, business IT support normally has a drone class, whose existence is predicated on doing this sort of thing. at least at my HPC organization, there are no drones (flatter hierarchy - not that we're just superior ;) I would love a documented bios setting interface, but I don't think it's salient on the radar of the mass market. >> bios-serial-redirection and serial-over-lan can let you do it manually. > > I know about SOL but what's the "bios-serial-redirection". Is that the > same or a different system? bioses can have a setting which enables interaction with the bios (at post time) through a serial port. that serial port may also be redirected over the lan. these are semi-independent features (the first mainly concerns the bios; the latter mainly the bmc that implements ipmi.) From lindahl at pbm.com Thu Oct 22 14:15:52 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: Message-ID: <20091022211552.GA4119@bx9.net> On Wed, Oct 21, 2009 at 11:26:43PM -0400, Mark Hahn wrote: > definitely not in general. a vendor could certainly provide IPMI > extensions that manipulate bios settings. but the market doesn't seem to > find this kind of HPC-mostly concern worthwhile :( Well, it's not just for HPC anymore, all the warehouse computing guys (cloud) want this, too. 
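Worth noting while the write side stays vendor-specific: the read side is scriptable today with stock tools. A minimal sketch, assuming passwordless root ssh to the nodes, a plain hostfile, and a dmidecode dump taken earlier from one known-good node; dmidecode only reports what the firmware exposes over SMBIOS, so this catches drift between nodes rather than every setup option.

#!/bin/sh
# Audit BIOS/firmware info across the cluster against a reference dump.
# Assumptions: one hostname per line in $HOSTFILE, passwordless root ssh,
# and $REFERENCE generated on a known-good node with the same dmidecode call.
REFERENCE=/root/bios-reference.txt
HOSTFILE=/root/nodes.txt
while read node; do
    ssh "$node" "dmidecode -t bios -t processor" > "/tmp/bios-$node.txt" 2>/dev/null
    diff -q "$REFERENCE" "/tmp/bios-$node.txt" >/dev/null \
        || echo "$node: BIOS/CPU info differs from reference"
done < "$HOSTFILE"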
I believe someone explained to me a long time ago that there is a standard way to read and write (but not interpret) the BIOS flash saved state, but the state is now too large to fit into the standard-sized area. > I wonder though, whether it would kill vendors just to publish the source > for their bios. Yes, it would. Not only is the source owned by AMI & Phoenix, and not mobo vendors, but it includes licensed stuff, and also secret workarounds to hardware bugs in lots of devices. Even IBM's BIOS (still used in a few high-end IBM x86 servers) is probably polluted that way. -- greg From rpnabar at gmail.com Thu Oct 22 14:41:50 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: Message-ID: On Thu, Oct 22, 2009 at 4:08 PM, Mark Hahn wrote: > the first case. ?next, business IT support normally has a drone class, > whose existence is predicated on doing this sort of thing. ?at least at my > HPC organization, there are no drones (flatter hierarchy - not that we're > just superior ;) Can I have some drones too, Mark? ;-) I'd love some. I think here I'm drone+worker+soldier+queen bee combined into one. Cheers! -- Rahul From rpnabar at gmail.com Thu Oct 22 14:43:01 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: <20091022211552.GA4119@bx9.net> References: <20091022211552.GA4119@bx9.net> Message-ID: On Thu, Oct 22, 2009 at 4:15 PM, Greg Lindahl wrote: > mobo vendors, but it includes licensed stuff, and also secret > workarounds to hardware bugs in lots of devices. Why does a workaround to a hardware bug have to be secret? Just curious.....Aren't hardware bugs published in errata regularly? -- Rahul From lindahl at pbm.com Thu Oct 22 15:00:35 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: <20091022211552.GA4119@bx9.net> Message-ID: <20091022220035.GA15836@bx9.net> > > mobo vendors, but it includes licensed stuff, and also secret > > workarounds to hardware bugs in lots of devices. > > Why does a workaround to a hardware bug have to be secret? Just > curious.....Aren't hardware bugs published in errata regularly? It's not that it has to be a secret, it's that vendors don't want the public to know. Ideally they'd all appear in errata. In practice, vendors tell some of the bugs only to M$ and BIOS vendors under NDA. And then Linux people wonder why the device doesn't work very well under Linux. Some devices don't have any public errata, some have partial lists, for some you think the public list is complete, but... -- greg From landman at scalableinformatics.com Thu Oct 22 15:03:34 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: <20091022211552.GA4119@bx9.net> Message-ID: <4AE0D6B6.4070204@scalableinformatics.com> Rahul Nabar wrote: > Why does a workaround to a hardware bug have to be secret? Just > curious.....Aren't hardware bugs published in errata regularly? 
[don't shoot the messenger] Because some hardware bugs are scary, and could/would have profound marketing/sales impacts. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Thu Oct 22 15:34:24 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: <20091022211552.GA4119@bx9.net> References: <20091022211552.GA4119@bx9.net> Message-ID: >> definitely not in general. a vendor could certainly provide IPMI >> extensions that manipulate bios settings. but the market doesn't seem to >> find this kind of HPC-mostly concern worthwhile :( > > Well, it's not just for HPC anymore, all the warehouse computing guys > (cloud) want this, too. sorry, I thought it was understood that "Cloud" is just a new name for the degenerate, low-end component of HPC, formerly known as "serial farming". ;) yes, I wouldn't be surprised if Google/Amazon/etc have good traction. > I believe someone explained to me a long time > ago that there is a standard way to read and write (but not interpret) > the BIOS flash saved state, but the state is now too large to fit into > the standard-sized area. indeed, there is a definition for PC-AT-vintage settings. but that doesn't really include anything interesting (ie, does include floppy config, but does not include ECC scrub settings...) >> I wonder though, whether it would kill vendors just to publish the source >> for their bios. > > Yes, it would. Not only is the source owned by AMI & Phoenix, and not > mobo vendors, my understanding is that AMI/Phoenix sell essentially a bios SDK, and that the board vendor assembles their own "app" based on that. > but it includes licensed stuff, and also secret > workarounds to hardware bugs in lots of devices. well, I'm really talking only about the basic MB bios, and really only the POST portion. so not PXE-related stuff, or code for add-in devices. is the memory count-up licensed? the display-splash-image code secret? for CPU/chipset workarounds, I'm skeptical about the secrecy, since AMD and Intel both disclose quite a lot in their chip eratta and bios developer's guides. for instance, I'm pretty sure the AMD doc is detailed enough to fully configure dram detect/config, HT and IO config, SMP bringup, etc. > Even IBM's BIOS > (still used in a few high-end IBM x86 servers) is probably polluted > that way. I wonder how hard it is to actually read the raw bios. for instance, under linux, could there not be a /dev/flash device driver that handled the access interface (I2C, I guess)? having at least a read/write char dev would open up some possibilities, like diffing the flash image when you change one setting. (hell, just being able to write the flash image from linux with a single, simple, non-proprietary tool would be a huge step...) From djholm at fnal.gov Thu Oct 22 16:02:57 2009 From: djholm at fnal.gov (Don Holmgren) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Re: Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: Message-ID: On Wed, 21 Oct 2009, Rahul Nabar wrote: > On Wed, Oct 21, 2009 at 9:18 PM, Rahul Nabar wrote: >> I did find the "bootdev" option. 
So I can change the boot seq. but not >> the other BIOS settings. >> >> Any "industry standard" to change BIOS settings unattended. Preferably over LAN? > > I guess another way to pose the question is: > > What's the converse command (if at all exists) for "dmidecode". > > dmidecode (or biosdecode) pops the BIOS settings to console. What can > push settings into the BIOS? > A long answer, sorry - read at your own risk. Maybe there will be some useful information. Since you have Dell equipment maybe their OpenManage stuff will do exactly what you want. The short answer, unfortunately, is no. Check with your system vendor to see if they can provide a utility. There is an industry standard - the "ISA System Architecture" - that defined among many other things the required interfaces to the real time clock chip (Motorola MC146818) that included a small amount of NVRAM - usually called the "CMOS memory" - that was kept non-volatile by means of a battery on your motherboard. The ISA standard includes the specifications for how some of those memory locations are to be used to store, for example, information about your hard disk drive and your floppy drives (that is, the BIOS settings of that time). Pretty much every x86 PC that you buy today, AFAIK (I've never seen an exception) still includes this RTC and NVRAM area. Motherboards still have a lithium battery that is used to keep this alive, although for a time in competing RTC chips these batteries were integrated into the integrated circuit (a very dumb idea, since systems are often used beyond the lifetime of the batteries). In Linux, /dev/nvram lets you access this area, and in 2.6 kernels (maybe in 2.4 also??) you can generally `cat /proc/driver/nvram` to see a dump of the ISA-specified data stored there (you may have to `modprobe nvram` first). On some motherboards I've used over the years, the BIOS settings were all stored in this nvram area, and so to replicate the BIOS settings from one system to a second system with the same motherboard, you could readout all of the data and write it back to the second system. There is a checksum kept in the nvram that has to be updated appropriately, otherwise at the next power on self test the system would complain about corrupted CMOS and would restore the default settings. The Linux nvram driver handles the checksum updating for you. (Not really - keep reading). The interface to access the RTC nvram area is very crude if you are writing your own code (as opposed to using the Linux nvram driver): use the "out" instruction to write the byte offset (the address) to I/O port 0x70, and then either read ('in') or write ('out') a byte (or word) from I/O port 0x71. See http://wiki.osdev.org/CMOS http://www.bioscentral.com/misc/cmosmap.htm http://bochs.sourceforge.net/techspec/CMOS-reference.txt and http://www.pixelbeat.org/docs/bios/ for discussions. In the ISA standard, the offset is limited to the range 0-0x80 (128 bytes). The bad news is that motherboards typically have larger nvram areas than this 128-byte ISA region, and so some of the BIOS settings you care about may be stored elsewhere (maybe even in EEPROM??). Worse, they may protect the settings (in the sense of detecting corruption) with additional checksums or CRCs, and unless you know which bytes or even bit ranges are used (and in which order) to compute the CRC, you won't be able to change individual settings w/out causing the POST to declare corruption, even if you could figure out which bits and/or bytes corresponded to a given BIOS setting. 
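To poke at the ISA-defined area described above on a live node, a small sketch (assuming a kernel with the nvram driver available; this only exposes the legacy bytes behind ports 0x70/0x71, minus the RTC clock registers, and none of the vendor-specific areas):

# decoded view of the classic fields (checksum status, drive types, memory size)
modprobe nvram
cat /proc/driver/nvram
# raw bytes, handy for diffing two "identically configured" nodes
dd if=/dev/nvram bs=1 2>/dev/null | od -A d -t x1 > /tmp/cmos-$(hostname).hex
# copy the dump over from a reference node and compare, e.g.
#   diff /tmp/cmos-node001.hex /tmp/cmos-node002.hex

As the discussion above makes clear, a matching dump is not necessarily enough to clone settings, since vendor-specific bytes can live outside this window.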
On either the Intel "Torrey Pines" or "Buckner" motherboards (can't remember now), there was a second 128-byte region that was accessed via I/O ports 0x72 and 0x73 in the same way that 0x70 and 0x71 are used for the ISA nvram region. All of the BIOS settings were in the two 128-byte regions, and we were able to replicate BIOS settings in Linux by copying all the bytes from both regions from one motherboard to another. Sweet - saved tons of "crash cart" work on a very big cluster. Unfortunately, I've never gotten this same thing to work on motherboards since those Intel boards. Often the POST detects corruption (although this can be useful - more below). So, perhaps larger or other nvram resources are probably being used, and who knows (without reading the BIOS source code, which AMI, Pheonix, etc. will probably never give you) what the interface to that memory might be on a given computer. On a large cluster of Intel Lancewood motherboards many years ago, we found a brute force solution for forcing BIOS settings (these motherboards had the annonying habit of deciding that the 2nd processor was dead, and you could only resurrect SMP operations by going into the BIOS and telling the system to reset the processor timer test; our cluster was remote and we needed to do this using serial redirect, and so we needed to force the BIOS into a state where the redirect worked). We used a PCI bus analyzer (made by VMetro) to record all I/O bus activity following executing the "Save Settings and Reboot" menu option in the BIOS. All activity to ports like 0x70 and 0x71 above was captured. Through trial and error we parsed out of this huge trace the raw I/O commands that were sufficient to put the BIOS settings into a known good state. When a Lancewood wiped the CMOS settings, in Linux we "replayed" this magic set of I/O instructions. On request, various manufacturers (Intel, Asus, Supermicro) have provided us utilities (always for DOS) for reading out and writing BIOS settings - you can always ask your vendor to see if such a utility is available for your hardware. It's a bit of a pain to have to go into DOS to do this, but it's better than nothing (we usually include a DOS partition on our system disks that can be used for flashing the BIOS, BMC, gigE, and other firmware). For one cluster, the motherboard manufacturer provided us with a tool from the BIOS vendor (AMI?) that allowed us, in a Windows application, to alter the default BIOS settings. These then get loaded when during the power on self test CMOS corruption is detected; to corrupt our CMOS, in Linux we just scribbled in the RTC nvram region. Hope this was useful, or at least maybe a mildly interesting read. It would be absolutely terrific if BIOS vendors would always provide the software to manipulate, or at least replicate, BIOS settings. Somehow it seems like well into the 2nd decade of Linux clusters this particular issue should have been solved once and for all. Don Holmgren Fermilab From rpnabar at gmail.com Thu Oct 22 16:56:44 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: <4AE07638.4070801@diamond.ac.uk> References: <4AE00AFE.2030707@diamond.ac.uk> <4AE07638.4070801@diamond.ac.uk> Message-ID: On Thu, Oct 22, 2009 at 10:11 AM, Gregory Matthews wrote: > Rahul Nabar wrote: >> > did you compile ipmitool yourself? 
it might be the libcrypto linking issue
> described not very satisfactorily here:
>
> http://www.mail-archive.com/ipmitool-devel@lists.sourceforge.net/msg01104.html
>
> perhaps you compiled it without ssl support? It certainly looks like a
> client side issue rather than hardware.
>

Yes. That's right. ./configure lists:

Interfaces
  lan     : yes
  lanplus : no
  open    : yes
  free    : no
  imb     : yes
  bmc     : no
  lipmi   : no

The relevant log snippet is:

checking for EVP_aes_128_cbc in -lcrypto... no
checking for MD5_Init in -lcrypto... no
checking for MD2_Init in -lcrypto... no
** The lanplus interface requires an SSL library with EVP_aes_128_cbc defined.

Any clues what I am doing wrong? I used: "./configure --enable-intf-lanplus"

I checked that I have openssl, libgcrypt and libgcrypt-devel installed.

I even found the file "libcrypt.so.1" and created the link:
"ln -s /lib/libcrypt.so.1 /usr/lib/libcrypto.so" (as per the link Greg had sent)

Which one is the "SSL library with EVP"? Ideas?

--
Rahul

From rpnabar at gmail.com Thu Oct 22 17:56:16 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed Nov 25 01:09:01 2009
Subject: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
Message-ID:

I wanted to get some opinions about whether watchdog timers are a good idea or not. I came across watchdogs again when reading through my IPMI manual. In principle it sounds neat: if the system hangs, then get it to reboot after, say, 5 minutes automatically. But, in practice, maybe it is a terrible idea.

Of course, one might say, a well-configured HPC compute-node shouldn't be getting to a hung point anyways; but in practice I see a few nodes every month that can be resurrected by a simple reboot. Admittedly these nodes are quite senile.

The danger, seems to me: what if a node kept crashing (due to, say, a bad HDD or something)? Then a watchdog would merely keep rebooting this node a hundred times. Not such a good thing.

Have you guys used watchdog timers? Maybe there is a way to build a circuit-breaker around the principle so that if a node reboots automatically more than 3 times then the watchdog gives up?

If one had to do the watchdogging, should one do the resets locally using the IPMI local interface (hogs cpu cycles) or via a central Nagios-like system that could issue such a command? Many scenarios seem possible. The prospect of an automated system doing a reboot at 3am seems more tempting than me having to do this manually.

--
Rahul

From rpnabar at gmail.com Thu Oct 22 18:20:00 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed Nov 25 01:09:01 2009
Subject: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it?
In-Reply-To:
References: <4AE00AFE.2030707@diamond.ac.uk>
Message-ID:

On Thu, Oct 22, 2009 at 11:11 AM, Mark Hahn wrote:
> sounds like your version of ipmitool is compiled without lanplus
> (or misconfigured in some way)

It works now (in a rudimentary way!). Thanks, guys, for the tips. I found ipmitool in a CentOS repo. I gave up on compiling my own.

/usr/bin/ipmitool -I lanplus -U root -H 10.0.0.26 sol activate
Password:
[SOL Session operational. Use ~? for help]

Now, if I can only get it to display the remote screen! :)

--
Rahul

From lindahl at pbm.com Thu Oct 22 21:37:05 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed Nov 25 01:09:01 2009
Subject: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it?
In-Reply-To: References: <4AE00AFE.2030707@diamond.ac.uk> Message-ID: <20091023043705.GB19092@bx9.net> On Thu, Oct 22, 2009 at 08:20:00PM -0500, Rahul Nabar wrote: > Now, if I can only get it to display the remote screen! :) SOL = serial over lan. No concept of a screen. If you want to look at what's on the text console, cat /dev/vcs Or use conman (which is available as a package) to record and log everything that comes through. -- greg From john.hearns at mclaren.com Fri Oct 23 01:57:16 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] using watchdog timers to reboot a hung systemautomagically: Good idea or bad? In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B0D970F6F@milexchmb1.mil.tagmclarengroup.com> Of course, one might say, a well configured HPC compute-node shouldn't be getting to a hung point anyways; but in-practice I see a few nodes every month that can be resurrected by a simple reboot. Admittedly these nodes are quite senile. I think that this is an interesting concept - and don't want to dismiss it. You could imagine jobs which checkpoint often, and automatically restart themselves from a checkpoint if a machine fails like this. My philosophy though would be to leave a machine down till the cause of the crash is established. Now that you have IPMI and serial consoles you should be looking at the IPMI logs and your /var/log/mcelog to see if there are uncorrectable ECC errors, and enabling crash dumps and the Magic Sysrq keys. Any cluster should be designed with a few extra nodes, which will normally be idle but will be used when one or two nodes are off on the Pat and Mick. However, this doesn't help when a large parallel run is brought down when a single node fails - advice here is checkpoint the jobs often. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From diep at xs4all.nl Fri Oct 23 06:38:28 2009 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Nahalem / PCIe I/O write latency? In-Reply-To: <4AE093AF.3030707@myri.com> References: <77b0285f0910220814x3d7163cawffb2afa4bcd6c1a@mail.gmail.com> <4AE093AF.3030707@myri.com> Message-ID: On Oct 22, 2009, at 7:17 PM, Patrick Geoffray wrote: > Hey Larry, > > Larry Stewart wrote: >> Does anyone know, or know where to find out, how long it takes to >> do a store to a device register on a Nahelem system with a >> PCIexpress device? > > Are you asking for latency or throughput ? For latency, it depends > on the distance between the core and the IOH (each QuickPath hop > can be ~100 ns if I remember well) and if there are PCIe switches > before the device. For throughput, it is limited by the PCIe > bandwidth (~75% efficiency of link rate) but you can reach it with > 64 Bytes writes. > The practice is a bit different with respect to latency. In practice Larry will be busy with existing hardware components i guess? The hardware in between for latency is even more important than the pci-e latency. Maybe. To most Nvidia gpgpu cards you'll have about some dozens to hundreds of microseconds latency or so for example from processor to hardware device over the pci-e. For custom FPGA (Xilinx based) someone had designed already pci-x 133Mhz had a limit there of 100k per second. About 10 us that is. 
You really stress in such case the hardware a lot. So the latency itself 'eats' so to speak a considerable amount (if not majority) of system time at such big stresstests. The question is whether you want to do that. For most devices that you can benchmark in this manner the latency of around some dozens to some hundreds of microseconds is a returning thing, already for many years. Of course there is a considerable distance at the mainboard between the devices and the processor. Will it ever improve? A lot better is co-processors (obviously). Vincent >> Also, does write combining work with such a setup? > > Sure, write-combining works on all Intel CPUs since PentiumIII. It > only burts at 64 bytes though, anything else is fragmented at 8 > bytes. AMD chips do flush WC at 16, 32 and 64 bytes. > > And don't assume that because you have WC enabled you will only > have 64 bytes writes. Sometimes, specially when there is an > interrupt, the WC buffer can be flushed early. And don't assume > order between the resulting multiple 8-byte writes either. > >> I recall that the QLogic Infinipath uses such features to get good >> short message performance, but my memory of it is pre- Nahelem. > > Nehalem just add NUMA overhead, and a lot more memory bandwidth. > >> Question 2 - if the device stores to an address on which a core is >> spinning, how long does it take for the core to return the new value? > > On NUMA, it depends if you write on the socket where you are > spinning. If it's same socket, the cache is immediately > invalidated. If you busy-poll on a different socket, cache > coherency gets involved and it's more expensive. On the local > socket, I would say between 150 and 250ns, assuming no PCIe switch > in between (those can cost 150ns more). > > Patrick > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From tomislav.maric at gmx.com Fri Oct 23 06:59:31 2009 From: tomislav.maric at gmx.com (tomislav.maric@gmx.com) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] BIOS & monitor power saving Message-ID: <20091023140154.248790@gmx.com> -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091023/1729c73d/attachment.html From prentice at ias.edu Fri Oct 23 07:42:56 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: <4AE00AFE.2030707@diamond.ac.uk> <4AE07638.4070801@diamond.ac.uk> Message-ID: <4AE1C0F0.5050700@ias.edu> > The relevant log snippet is: > > checking for EVP_aes_128_cbc in -lcrypto... no > checking for MD5_Init in -lcrypto... no > checking for MD2_Init in -lcrypto... no > ** The lanplus interface requires an SSL library with EVP_aes_128_cbc defined. > > Any clues what I am doing wrong? I used: "./configure --enable-intf-lanplus" > > I checked that I have openssl, libgcrypt and libgcrypt-devel installed. > > I even found the file "ibcrypt.so.1" and created the link: > "ln -s /lib/libcrypt.so.1 /usr/lib/libcrypto.so" (as per the link Greg had sent) > > Which one is the "SSL library with EVP"? Ideas? > The proper way to deal with this is to set LDFLAGS to include the correct lib directories in the correct order before running configure. 
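Concretely, for the ipmitool rebuild discussed above, something along these lines; the library and include paths are the usual 64-bit RHEL/CentOS locations and are an assumption here, and the openssl-devel package must be installed so the EVP headers are present:

# point configure at the real OpenSSL instead of hand-made symlinks
LDFLAGS="-L/usr/lib64 -L/usr/lib" \
CPPFLAGS="-I/usr/include" \
  ./configure --enable-intf-lanplus
make && make install
# verify that lanplus is now listed
ipmitool -h 2>&1 | grep -A8 "Interfaces:"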
While symlinks work, it can lead to a mess if done too often. On 64-bit RHEL-based systems, I've found that /usr/lib64 and lib64 are often overlooked and must be explicitly specified in LDFLAGS. I don't know why; it could be poorly written configure scripts. The proper search path for include directories should be defined in CPPFLAGS, too.

--
Prentice

From hahn at mcmaster.ca Fri Oct 23 10:35:55 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed Nov 25 01:09:01 2009
Subject: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D970F6F@milexchmb1.mil.tagmclarengroup.com>
References: <68A57CCFD4005646957BD2D18E60667B0D970F6F@milexchmb1.mil.tagmclarengroup.com>
Message-ID:

> You could imagine jobs which checkpoint often, and automatically restart
> themselves from a checkpoint if a machine fails like this.

I find that apps (custom or commercial) normally need some help to restart (some need to be pointed at the checkpoint to start with, others need to be told it's a restart rather than from-scratch, etc). and I expect that anything but a dedicated, single-person cluster will also be running a scheduler, which means that the app would, upon starting, need to queue a restart of itself as a dependency.

we actually have one group that does this, but their main script already contains multiple iterations of gromacs (apparently to force re-load-balance.) their code contains intelligence about catching crashed processes, finding where to pick up, etc. (this group also tends to be one that asks interesting questions about, for instance, when attributes propagate across NFS vs Lustre filesystems. I had never looked very closely, but various NFS clients have quite different behavior, including some oldish versions that will cache stale attrs *indefinitely*.)

anyway, we strongly encourage checkpointing, and usually say that you should checkpoint as frequently as you can without inducing a significant IO overhead. our main clusters have Lustre filesystems that can sustain several GB/s, so I usually rule-of-thumb it as "a couple times a day, and more often with higher node-count". fortunately our node failure rate is fairly low, so we don't push very hard. it's easy to imagine a large-scale job needing to checkpoint ~hourly, though: if your spontaneous node failure rate is 1/node-year, then a 365-node job is 1/day, and that's not a very big job...

> My philosophy though would be to leave a machine down till the cause of
> the crash is established.

absolutely. this is not an obvious principle to some people, though: it depends on whether your model of failures involves luck or causation ;) and having decent tools (IPMI SEL for finding UC ECCs/overheating/etc, console logging for panics) is what lets you rule out bad juju...

From rpnabar at gmail.com Fri Oct 23 11:01:05 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed Nov 25 01:09:01 2009
Subject: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad?
In-Reply-To:
References: <68A57CCFD4005646957BD2D18E60667B0D970F6F@milexchmb1.mil.tagmclarengroup.com>
Message-ID:

On Fri, Oct 23, 2009 at 12:35 PM, Mark Hahn wrote:
?this is not an obvious principle to some people, though: > it depends on whether your model of failures involves luck or causation ;) > and having decent tools (IPMI SEL for finding UC ECCs/overheating/etc, > console logging for panics) is what lets you rule out bad juju... Other factors that sometimes make me violate this principle of "always establish a crash cause": 1. Manpower to debug. Let's say the error has a cause but is relatively infrequent. I might achieve a higher uptime by a simple reboot until I get the time to fight this particular fire. People feel nicer to have a crashed node humming away as soon as possible rather than waiting for me to get the time to have a look at it and come to a definite diagnosis. Forensics takes time. 2. Some errors are hardware precipitated. Aging, out-of-warranty aging, hardware can sometimes need such a reboot compromise for one-off random errors. Maybe all the "nice" clusters out there never have this issue but for me it is fairly common. Just confessing. -- Rahul From lindahl at pbm.com Fri Oct 23 11:23:17 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] using watchdog timers to reboot a hung systemautomagically: Good idea or bad? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D970F6F@milexchmb1.mil.tagmclarengroup.com> Message-ID: <20091023182317.GB30655@bx9.net> On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote: > 2. Some errors are hardware precipitated. Aging, out-of-warranty > aging, hardware can sometimes need such a reboot compromise for > one-off random errors. > > Maybe all the "nice" clusters out there never have this issue but for > me it is fairly common. Just confessing. Why, exactly, are you assuming that your freezes are one-off random errors due to aging hardware? Sounds like you're either guessing, or you _are_ doing forensics, but aren't calling it forensics. -- greg From oper.ml at gmail.com Wed Oct 21 11:27:41 2009 From: oper.ml at gmail.com (Tony Miranda) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Build a Beowulf Cluster Message-ID: <4ADF529D.2020109@gmail.com> Hi everyone, anyone could help me explaning how to build a beowulf cluster? An web site, a list of parameters anything updated. Cause i only found in the internet posts that are really old. Thanks a lot. Tony Miranda. From robertkubrick at gmail.com Thu Oct 22 18:25:30 2009 From: robertkubrick at gmail.com (Robert Kubrick) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] eth-mlx4-0/15 Message-ID: <08C3678A-7DEA-4BD8-863C-28A00AA53FBD@gmail.com> I noticed my machine has 16 drivers in the /proc/interrupts table marked as eth-mlx4-0 to 15, in addition to the usual mlx-async and mlx-core drivers. The server runs Linux Suse RT, has an infiniband interface, OFED 1.1 drivers, and 16 Xeon MP cores , so I'm assuming all these eth-mlx4 drivers are supposed to do "something" with each core. I've never seen these irq managers before. When I run infiniband apps the interrupts go to both mlx-async and eth-mlx4-0 (just 0, all the other drivers don't get any interrupts). Also the eth name part looks suspicious. I can't find any reference online, any idea what these drivers are about? From akshar.bhosale at gmail.com Thu Oct 22 18:50:25 2009 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad? 
In-Reply-To: References: Message-ID: hi rahul, same thing happens at our side.node gets reboot due to asr and it doesnt crash.can u suggest any remedy? On Fri, Oct 23, 2009 at 6:26 AM, Rahul Nabar wrote: > I wanted to get some opinions about if watchdog timers are a good idea > or not. I came across watchdogs again when reading through my IPMI > manual. In principle it sounds neat: If the system hangs then get it > to reboot after, say, 5 minutes automatically. But, in practice, maybe > it is a terrible idea. > > Of course, one might say, a well configured HPC compute-node > shouldn't be getting to a hung point anyways; but in-practice I see a > few nodes every month that can be resurrected by a simple reboot. > Admittedly these nodes are quite senile. > > The danger, seems to me: What if a node kept crashing (due to say, a > bad HDD or something). Then a watchdog would merely keep rebooting > this node a hundred times. Not such a good thing. > > Have you guys used watchdog timers? Maybe there is a way to build a > circuit-breaker around the principle so that if a node reboots > automatically more than 3 times then watchdog gives up? > > If one had to do the watchdogging should one do the resets locally > using the IPMI local interface (hogs cpu cycles) or a central > Nagios-like system that could issue such a command. Many scenarios > seem possible. The prospect of a automated system doing a reboot at > 3am seems more tempting than me having to do this manually. > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091023/1ca9fba2/attachment.html From kabbey at biomaps.rutgers.edu Thu Oct 22 19:03:58 2009 From: kabbey at biomaps.rutgers.edu (Kevin Abbey) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad? In-Reply-To: References: Message-ID: <4AE10F0E.4000907@biomaps.rutgers.edu> I tried this on a Supermicro board and a Sun box. On both systems the system would reboot randomly so I tuned it off. This is a serious problem of false positives. In a cluster, you may need to notify the scheduler in someway when a node reboots. Can someone elaborate on this? Specifically for torque, PBS and Sun GE. Regarding this: Have you guys used watchdog timers? Maybe there is a way to build a circuit-breaker around the principle so that if a node reboots automatically more than 3 times then watchdog gives up? It would far simpler to request the vendor to program thier firmware to log each reboot and set a limitation there as well as event notifications via email, snmp or other means. The more configurable the better. Kevin Rahul Nabar wrote: > I wanted to get some opinions about if watchdog timers are a good idea > or not. I came across watchdogs again when reading through my IPMI > manual. In principle it sounds neat: If the system hangs then get it > to reboot after, say, 5 minutes automatically. But, in practice, maybe > it is a terrible idea. > > Of course, one might say, a well configured HPC compute-node > shouldn't be getting to a hung point anyways; but in-practice I see a > few nodes every month that can be resurrected by a simple reboot. > Admittedly these nodes are quite senile. 
> > The danger, seems to me: What if a node kept crashing (due to say, a > bad HDD or something). Then a watchdog would merely keep rebooting > this node a hundred times. Not such a good thing. > > Have you guys used watchdog timers? Maybe there is a way to build a > circuit-breaker around the principle so that if a node reboots > automatically more than 3 times then watchdog gives up? > > If one had to do the watchdogging should one do the resets locally > using the IPMI local interface (hogs cpu cycles) or a central > Nagios-like system that could issue such a command. Many scenarios > seem possible. The prospect of a automated system doing a reboot at > 3am seems more tempting than me having to do this manually. > > -- Kevin C. Abbey System Administrator Rutgers University - BioMaPS Institute Email: kabbey@biomaps.rutgers.edu Hill Center - Room 259 110 Frelinghuysen Road Piscataway, NJ 08854 Phone and Voice mail: 732-445-3288 Wright-Rieman Laboratories Room 201 610 Taylor Rd. Piscataway, NJ 08854-8087 Phone: 732-445-2069 Fax: 732-445-5958 From working at surfingllama.com Thu Oct 22 22:12:11 2009 From: working at surfingllama.com (Carl Thomas) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Mature open source hierarchical storage management Message-ID: HI all, We are currently in the midst of planning a major refresh of our existing HPC cluster. It is expected that our storage will consist of a combination of fast fibre channel and SATA based disk and we would like to implement a system whereby user files are automatically migrated to and from slow storage depending on frequency of usage. Initial investigations seem to indicate that larger commercial hierarchical storage management systems vastly exceed our budget. Is there any mature open source alternatives out there? How are other organisations dealing with transparently presenting different tiers of storage to non technical scientists? Cheers, Carl. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091023/be44a482/attachment.html From ed92626 at gmail.com Fri Oct 23 09:01:28 2009 From: ed92626 at gmail.com (ed in 92626) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad? In-Reply-To: References: Message-ID: On Thu, Oct 22, 2009 at 5:56 PM, Rahul Nabar wrote: > I wanted to get some opinions about if watchdog timers are a good idea > or not. I came across watchdogs again when reading through my IPMI > manual. In principle it sounds neat: If the system hangs then get it > to reboot after, say, 5 minutes automatically. But, in practice, maybe > it is a terrible idea. > > Of course, one might say, a well configured HPC compute-node > shouldn't be getting to a hung point anyways; but in-practice I see a > few nodes every month that can be resurrected by a simple reboot. > Admittedly these nodes are quite senile. > > Some BIOS's have a setting for this, times to reboot before quitting. > The danger, seems to me: What if a node kept crashing (due to say, a > bad HDD or something). Then a watchdog would merely keep rebooting > this node a hundred times. Not such a good thing. > > Have you guys used watchdog timers? Maybe there is a way to build a > circuit-breaker around the principle so that if a node reboots > automatically more than 3 times then watchdog gives up? > You could also do something at the system level to prevent it. 
If the system boots and the previous_uptime is less that one hour shut down the system. The WD timer will not wake it up. > > If one had to do the watchdogging should one do the resets locally > using the IPMI local interface (hogs cpu cycles) or a central > Nagios-like system that could issue such a command. Many scenarios > seem possible. The prospect of a automated system doing a reboot at > 3am seems more tempting than me having to do this manually. > > Also almost all systems that can do this also send out a page and an email on the event, so someone will know about it. Ed > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091023/a4ca8da7/attachment.html From ed92626 at gmail.com Fri Oct 23 09:16:25 2009 From: ed92626 at gmail.com (ed in 92626) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] BIOS & monitor power saving In-Reply-To: <20091023140154.248790@gmx.com> References: <20091023140154.248790@gmx.com> Message-ID: I believe ANALOG is just your monitor telling you it's receiving analog input, as opposed to digital. Power saving mode is telling you the monitor is in PS mode and if you tell the BIOS to sleep the monitor is setup it comply with the request. Ed On Fri, Oct 23, 2009 at 6:59 AM, wrote: > Hi everyone, > > > I have assembled my home beowulf and I'm trying to install Rocks + Centos > on it. As I enter the graphical installation my screen signals power saving > mode: > > > ANALOG > > POWER SAVING MODE > > > I have been digging through BIOS, but I can't find any power saving options > for graphics/monitor, just for the system. I'm using P5Q-VM Asus mobo, with > integrated graphics: Integrated Intel Graphics Media Accelerator X4500HD . > Any advice? > > > Tomislav > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091023/6169cbcc/attachment.html From rpnabar at gmail.com Fri Oct 23 12:44:27 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] using watchdog timers to reboot a hung systemautomagically: Good idea or bad? In-Reply-To: <20091023182317.GB30655@bx9.net> References: <68A57CCFD4005646957BD2D18E60667B0D970F6F@milexchmb1.mil.tagmclarengroup.com> <20091023182317.GB30655@bx9.net> Message-ID: On Fri, Oct 23, 2009 at 1:23 PM, Greg Lindahl wrote: > On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote: > >> 2. Some errors are hardware precipitated. Aging, out-of-warranty >> aging, hardware can sometimes need such a reboot compromise for >> one-off random errors. >> >> Maybe all the "nice" clusters out there never have this issue but for >> me it is fairly common. Just confessing. > > Why, exactly, are you assuming that your freezes are one-off random > errors due to aging hardware? Sounds like you're either guessing, or > you _are_ doing forensics, but aren't calling it forensics. Greg. You are right. My bad. In hindsight, that doesn't make much sense. Sorry. 
-- Rahul From gerry.creager at tamu.edu Fri Oct 23 13:42:38 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] using watchdog timers to reboot a hung systemautomagically: Good idea or bad? In-Reply-To: <20091023182317.GB30655@bx9.net> References: <68A57CCFD4005646957BD2D18E60667B0D970F6F@milexchmb1.mil.tagmclarengroup.com> <20091023182317.GB30655@bx9.net> Message-ID: <4AE2153E.7030804@tamu.edu> Greg Lindahl wrote: > On Fri, Oct 23, 2009 at 01:01:05PM -0500, Rahul Nabar wrote: > >> 2. Some errors are hardware precipitated. Aging, out-of-warranty >> aging, hardware can sometimes need such a reboot compromise for >> one-off random errors. >> >> Maybe all the "nice" clusters out there never have this issue but for >> me it is fairly common. Just confessing. > > Why, exactly, are you assuming that your freezes are one-off random > errors due to aging hardware? Sounds like you're either guessing, or > you _are_ doing forensics, but aren't calling it forensics. *MY* aging hardware usually just falls over dead when it's done with its useful life. Too many intermittent errors/failures causes me to do sufficient diagnostics to repair the node (if it's cheap and easy enough) or drop it in the latest surplus run. -- Gerry Creager AATLT, Texas A&M University Tel: 979.862.3982 1700 Research Pkwy, Ste 160 Fax: 979.862.3983 College Station, TX Cell 979.229.5301 77843-3139 http://mesonet.tamu.edu From jlforrest at berkeley.edu Fri Oct 23 13:56:17 2009 From: jlforrest at berkeley.edu (Jon Forrest) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Mature open source hierarchical storage management In-Reply-To: References: Message-ID: <4AE21871.8080508@berkeley.edu> Carl Thomas wrote: > HI all, > > We are currently in the midst of planning a major refresh of our > existing HPC cluster. > It is expected that our storage will consist of a combination of fast > fibre channel and SATA based disk and we would like to implement a > system whereby user files are automatically migrated to and from slow > storage depending on frequency of usage. Initial investigations seem to > indicate that larger commercial hierarchical storage management systems > vastly exceed our budget. About 15 years ago I did A LOT of work with various HSMs on Unix. They were all very fragile, but I think this was mostly due to one prevailing problem. This is that, at the time, the OSs didn't have hooks in the places necessary for an HSM system to do the right thing. So HSM vendors had to make custom mods to the kernel, or else perform other heroics, to fake out the file system to be able to do migrations transparently to and from slow storage and fast storage. When you look at current HSM implementations I would suggest you look at this issue to see how this issue is handled today. Depending on the implementation of the HSM system, it might use a data base to keep track of where things on the slower media are kept. You might want to make sure that the slower media is is organized in a way so that the data is self describing in case the database becomes corrupt. Otherwise, you'd have just a pile of bits if this happens. Ironically, the system I was using used a University Ingres database for this. When the database became corrupt this was very embarrassing since I was using this HSM system for work that the Postgres database group was doing. This was the same group that had originally developed Ingres. I hope HSMs have become better since then. 
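For what it's worth, one low-tech way to keep the slow tier self-describing, whatever HSM sits on top, is to write a checksum manifest next to everything that gets staged out. A rough sketch, with made-up paths:

# stage a project to the archive tier together with a manifest, so the
# namespace can be rebuilt even if the HSM's database is lost
tar -cf /archive/project42.tar -C /fast project42
( cd /fast/project42 && find . -type f -exec md5sum {} + ) > /archive/project42.manifest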
Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest@berkeley.edu From lindahl at pbm.com Fri Oct 23 14:08:41 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] using watchdog timers to reboot a hung system automagically: Good idea or bad? In-Reply-To: References: Message-ID: <20091023210841.GL30655@bx9.net> On Fri, Oct 23, 2009 at 09:01:28AM -0700, ed in 92626 wrote: > You could also do something at the system level to prevent it. If the system > boots and the previous_uptime is less that one hour shut down the system. > The WD timer will not wake it up. You have 2 power failures 15 minutes apart. Your entire cluster shuts down. It's turtles, all the way down. -- greg From gus at ldeo.columbia.edu Fri Oct 23 15:45:27 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Build a Beowulf Cluster In-Reply-To: <4ADF529D.2020109@gmail.com> References: <4ADF529D.2020109@gmail.com> Message-ID: <4AE23207.7000809@ldeo.columbia.edu> Hi Tony Check these: http://www.rocksclusters.org/wordpress/ http://www.rocksclusters.org/roll-documentation/base/5.2/ http://www.clustermonkey.net/ http://www.phy.duke.edu/~rgb/Beowulf/beowulf.php http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php If you search the list archives you will find tons of advice: http://www.beowulf.org/archive/index.html My $0.02 Gus Correa Tony Miranda wrote: > Hi everyone, > > anyone could help me explaning how to build a beowulf cluster? > An web site, a list of parameters anything updated. Cause i only found > in the internet posts that are really old. > > Thanks a lot. > > Tony Miranda. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowul From lindahl at pbm.com Fri Oct 23 17:14:45 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Mature open source hierarchical storage management In-Reply-To: <4AE21871.8080508@berkeley.edu> References: <4AE21871.8080508@berkeley.edu> Message-ID: <20091024001445.GA26898@bx9.net> On Fri, Oct 23, 2009 at 01:56:17PM -0700, Jon Forrest wrote: > They were all very fragile, but I think this was mostly due to > one prevailing problem. This is that, at the time, the OSs didn't > have hooks in the places necessary for an HSM system to do the > right thing. The hooks thing you're thinking of is the DMAPI standard. I think it's only supported in a subset of filesystems, things like xfs, jfs, GPFS, etc. I don't know if there are any free HSMs that use it out there. Googling turned up a CERN HSM project, CASTOR. It isn't a transparent interface, but that avoids a lot of OS hassle. -- greg From rpnabar at gmail.com Fri Oct 23 18:26:46 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Any industry-standards that allow automated BIOS modifications and dumping? IPMI cannot do it, can it? In-Reply-To: References: <4AE00AFE.2030707@diamond.ac.uk> Message-ID: On Thu, Oct 22, 2009 at 11:31 PM, Don Holmgren wrote: > > > You need to modify /etc/inittab to add a getty for your serial > line (to give you a login prompt). > > If in the BIOS you redirected COMn, the corresponding Linux serial > port is ttyS(n-1). 
?Add this line to inittab to to run agetty on COM1, baud > rate 115200: > > S0:12345:respawn:/sbin/agetty ttyS0 115200 > > or, for COM2, baud rate 19200: > > S1:12345:respawn:/sbin/agetty ttyS1 19200 > Thanks to Don's config instructions I finally got SOL working! It took a while before I stumbled upon the help menu which said that SOL won't work over COM1. I had to use COM2. It'll even allow me to reach the BIOS *iff* I get the timing correctly. Of course on a reboot it goes down the instant eth0 gets pulled down. -- Rahul From hearnsj at googlemail.com Sat Oct 24 01:15:38 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Build a Beowulf Cluster In-Reply-To: <4ADF529D.2020109@gmail.com> References: <4ADF529D.2020109@gmail.com> Message-ID: <9f8092cc0910240115o436b2d10ja9ffe9003f4ca9db@mail.gmail.com> 2009/10/21 Tony Miranda : > Hi everyone, > > anyone could help me explaning how to build a beowulf cluster? Tony, there are also some books out there, some of which are on my bookshelf! There is also an online book by Robert Brown : http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php It might help if you tell us what hardware you have available. There is a bootable CD somewhere, which will enable you to try out a Beowulf cluster without doing any physical installs. From Craig.Tierney at noaa.gov Sat Oct 24 08:00:38 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Mature open source hierarchical storage management In-Reply-To: References: Message-ID: <4AE31696.1020902@noaa.gov> Carl Thomas wrote: > HI all, > > We are currently in the midst of planning a major refresh of our > existing HPC cluster. > It is expected that our storage will consist of a combination of fast > fibre channel and SATA based disk and we would like to implement a > system whereby user files are automatically migrated to and from slow > storage depending on frequency of usage. Initial investigations seem > to indicate that larger commercial hierarchical storage management > systems vastly exceed our budget. > > Is there any mature open source alternatives out there? How are other > organisations dealing with transparently presenting different tiers of > storage to non technical scientists? > Sun opensourced SamFS last year: http://opensolaris.org/os/project/samqfs/sourcecode/ I don't know what the state of the project is, but it is a place to start. The way we do at our NOAA site is to let the users migrate their data to/from the HSM system manually. There are several reasons for this. With the exception of GPFS, there are no filesystems that are really good for HPC clusters that also allow for automatic migration. I wouldn't want to use CXFS, StorNext, or QFS as the HPC filesystem across dozens or hundreds of nodes. There is another practical reason I wouldn't want to do it, even with GPFS. I want to prevent users from doing stupid things. Having the HSM try and archive a source code directory (not tarred) would be one of them. I know that many of these systems have policies for implementing containers or controlling if/when files get migrated, but for a general user community like HPC typically has, I think it is better to educate them on the proper way to archive files when the user decides data should be archived. That also reduces unneeded archives as well (tapes do get expensive). Craig > Cheers, > > Carl. 
> ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From hahn at mcmaster.ca Sat Oct 24 14:21:02 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Build a Beowulf Cluster In-Reply-To: <4ADF529D.2020109@gmail.com> References: <4ADF529D.2020109@gmail.com> Message-ID: > anyone could help me explaning how to build a beowulf cluster? find or set up some networked linux machines, preferably with a shared filesystem. install a scheduler to manage jobs. run some tests. done. > An web site, a list of parameters anything updated. Cause i only found in the > internet posts that are really old. that's because none of the fundamentals have changed. today's tight-coupled MPI system will be based on IB; otherwise usually just gigabit. for memory-bandwidth-intensive applications, choose nehalem. software stacks are not much changed (pbs/torque/sge, maybe some advantage for openmpi over mpich). there's no beowulf-based reason to choose one distro over another. From hahn at mcmaster.ca Sat Oct 24 14:49:14 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Mature open source hierarchical storage management In-Reply-To: References: Message-ID: > It is expected that our storage will consist of a combination of fast fibre > channel and SATA based disk and we would like to implement a system whereby why do you think that's a good design? consider bandwidth: you will spend >5x as much on FC, but can easily obtain the same or higher bandwidth with SATA given that it's so much cheaper. (not to mention the fact that bandwidth is rpm * recording density, and SATA is consistently a generation or two ahead in density. that means that a 10k rpm FC disk may sustain 120 MB/s, but a 7200 rpm SATA will do 130 over the same size (and slow down to 90 or so on inner tracks - that is, at 3x the capacity of the FC disk...)) IMO, you'd do better to provide seeky workloads with their own filesystems, and put everyone else on plain old SATA (made to scale to arbitrary bandwidth and concurrency goals with, eg Lustre). that way someone dumping a 5TB checkpoint won't trash performance for people doing metadata or seek-intensive stuff. and as a side-benefit, the costliness of md/seek stuff becomes visible to those doing it... From rpnabar at gmail.com Sat Oct 24 15:13:19 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? Message-ID: Now that I have remote-IPMI and SOL working my next step is to try and crash Linux to see if there might be "pathological crash cases" where I will end up having to go to the server room. So far, whatever I do I'm pleasantly surprised that "chassis power cycle" always seems to work! I tried: `echo "c" > /proc/sysrq-trigger` to produce kernel panic. The node still reboots on its IPMI interface. What surprised me was that even if I take down my eth interface with a ifdown the IPMI still works. How does it do that? I mean I am using the shared NIC approach and I was expecting the IPMI to clam up the moment the OS took a port down. 
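For anyone repeating the exercise, the remote side of the test is plain ipmitool over the lanplus interface. A sketch, with purely illustrative credentials and the same BMC address as in the ping test below:

BMC=10.0.0.26 ; U=admin ; P=secret    # illustrative values only
# does the BMC still answer while the host is wedged?
ipmitool -I lanplus -H $BMC -U $U -P $P chassis power status
# watch the console over Serial-over-LAN
ipmitool -I lanplus -H $BMC -U $U -P $P sol activate
# and the big hammer, once the host is confirmed hung
ipmitool -I lanplus -H $BMC -U $U -P $P chassis power cycle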
On Sept 30 Joe Landman said: >After years of configuring and helping run/manage both, we recommend strongly *against* the shared physical connector approach. The extra cost/hassle of the extra cheap >switch and wires is well worth the money. >Why do we take this view? Many reasons, but some of the bigger ones are (I know Joe Landman and others had warned me against this but I tried to start with configuring a single shared NIC and then go for two NICs. Just keeping things simple to start with.) But my single shared NIC results seem good enough already. Which is why I was trying to see if there are any worse possibilities of crashes that will render contacting the IPMI impossible. On Sept 30 Joe Landman said: >a) when the OS takes the port down, your IPMI no longer responds to arp requests. Which means ping, and any other service (IPMI) will fail without a continuous updating of the >arp tables, or a forced hardwire of those ips to those mac addresses. Another point that surprises me is how the IPMI kept working even after CentOS took the port down. I definitely see Joe Landman's arguments about why it shouldn't be responding to ARP's any more (unless I did something special). That's why I am a bit surprised that my IPMI I/P continues to respond to the pings even after the primary I/P is dead. #Ping primary I/P address ping 10.0.0.25 [no response] #Ping IPMI IP address ping 10.0.0.26 PING 10.0.0.26 (10.0.0.26) 56(84) bytes of data. 64 bytes from 10.0.0.26: icmp_seq=1 ttl=64 time=0.574 ms 64 bytes from 10.0.0.26: icmp_seq=2 ttl=64 time=0.485 ms Interestingly arp shows the primary IP as incomplete but the secondary IP resolves to the correct IP. This means that the BMC continues to respond to the second MAC even after the OS took the eth port down. How exactly does this "magic" happen. I'm just curious. node25 (incomplete) bond0 10.0.0.26 ether 00:24:E8:63:D6:9E C bond0 Another mysterious observation was this: Whenever I took eth down via the OS there is a latent period when the IPMI stops responding but then somehow it magically resurrects itself and starts working again. Just making sure this isn't a fluke case......Any comments or more disaster scenario simulations are welcome! -- Rahul From rpnabar at gmail.com Sat Oct 24 15:38:23 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Build a Beowulf Cluster In-Reply-To: <4ADF529D.2020109@gmail.com> References: <4ADF529D.2020109@gmail.com> Message-ID: On Wed, Oct 21, 2009 at 1:27 PM, Tony Miranda wrote: > Hi everyone, > > anyone could help me explaning how to build a beowulf cluster? > An web site, a list of parameters anything updated. Cause i only found in > the internet posts that are really old. You can find some such links where others who built a cluster document their efforts: https://wiki.fysik.dtu.dk/niflheim/ Not all might be suitable for your use but some of their tricks and choices might help. This was old but quite helpful too: http://tldp.org/HOWTO/Beowulf-HOWTO/ Of course, warning: I'm a newbiee too! :) -- Rahul From gerry.creager at tamu.edu Sun Oct 25 06:17:33 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: Message-ID: <4AE44FED.40703@tamu.edu> Oh, it doesn't always work, but even with the Dell hardware, it USUALLY does. 
As recently as Friday, I had to ssh into an IPMI module and reboot busybox (this was on a supermicro system, not a dell) because IPMI had gotten stupid. When I did, it regained its intelligence and has performed properly ever since. NO ideas what confused it, though, which is a bit disconcerting. gerry Rahul Nabar wrote: > Now that I have remote-IPMI and SOL working my next step is to try and > crash Linux to see if there might be "pathological crash cases" where > I will end up having to go to the server room. So far, whatever I do > I'm pleasantly surprised that "chassis power cycle" always seems to > work! > > I tried: > > `echo "c" > /proc/sysrq-trigger` to produce kernel panic. The node > still reboots on its IPMI interface. > > What surprised me was that even if I take down my eth interface with a > ifdown the IPMI still works. How does it do that? I mean I am using > the shared NIC approach and I was expecting the IPMI to clam up the > moment the OS took a port down. > > On Sept 30 Joe Landman said: > >> After years of configuring and helping run/manage both, we recommend strongly *against* the shared physical connector approach. The extra cost/hassle of the extra cheap >switch and wires is well worth the money. >> Why do we take this view? Many reasons, but some of the bigger ones are > > > (I know Joe Landman and others had warned me against this but I tried > to start with configuring a single shared NIC and then go for two > NICs. Just keeping things simple to start with.) > > But my single shared NIC results seem good enough already. Which is > why I was trying to see if there are any worse possibilities of > crashes that will render contacting the IPMI impossible. > > On Sept 30 Joe Landman said: > >> a) when the OS takes the port down, your IPMI no longer responds to arp requests. Which means ping, and any other service (IPMI) will fail without a continuous updating of the >arp tables, or a forced hardwire of those ips to those mac addresses. > > Another point that surprises me is how the IPMI kept working even > after CentOS took the port down. I definitely see Joe Landman's > arguments about why it shouldn't be responding to ARP's any more > (unless I did something special). That's why I am a bit surprised that > my IPMI I/P continues to respond to the pings even after the primary > I/P is dead. > > #Ping primary I/P address > ping 10.0.0.25 > [no response] > > #Ping IPMI IP address > ping 10.0.0.26 > PING 10.0.0.26 (10.0.0.26) 56(84) bytes of data. > 64 bytes from 10.0.0.26: icmp_seq=1 ttl=64 time=0.574 ms > 64 bytes from 10.0.0.26: icmp_seq=2 ttl=64 time=0.485 ms > > Interestingly arp shows the primary IP as incomplete but the secondary > IP resolves to the correct IP. This means that the BMC continues to > respond to the second MAC even after the OS took the eth port down. > How exactly does this "magic" happen. I'm just curious. > > node25 (incomplete) bond0 > 10.0.0.26 ether 00:24:E8:63:D6:9E C bond0 > > Another mysterious observation was this: Whenever I took eth down via > the OS there is a latent period when the IPMI stops responding but > then somehow it magically resurrects itself and starts working again. > > Just making sure this isn't a fluke case......Any comments or more > disaster scenario simulations are welcome! > From rpnabar at gmail.com Sun Oct 25 07:25:53 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? 
In-Reply-To: <4AE44FED.40703@tamu.edu> References: <4AE44FED.40703@tamu.edu> Message-ID: On Sun, Oct 25, 2009 at 8:17 AM, Gerry Creager wrote: > Oh, it doesn't always work, but even with the Dell hardware, it USUALLY > does. ?As recently as Friday, I had to ssh into an IPMI module and reboot How do you ssh into IPMI? Or do you mean ssh to the box and reset IPMI using the local interface? In real scenarios where I expect to use IPMI I work under the assumption that ssh will be dead. -- Rahul From niftyompi at niftyegg.com Sun Oct 25 16:53:56 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Mature open source hierarchical storage management In-Reply-To: References: Message-ID: <20091025235356.GA29096@hpegg.wr.niftyegg.com> On Fri, Oct 23, 2009 at 04:12:11PM +1100, Carl Thomas wrote: > > HI all, > We are currently in the midst of planning a major refresh of our existing > HPC cluster. > It is expected that our storage will consist of a combination of fast > fibre channel and SATA based disk and we would like to implement a system > whereby user files are automatically migrated to and from slow storage > depending on frequency of usage. Initial investigations seem to indicate > that larger commercial?hierarchical storage management systems vastly exceed our budget. > > Is there any mature open source alternatives out there? How are other organisations dealing with transparently presenting different tiers of storage to non technical scientists? > Just curious -- how large and how big are the deltas in the hierarchy? I ask because the new generation of 2TB SATA disks appear to be establishing the groundwork for a list of new storage options including cluster file systems that run circles around NFS and large storage RAIDS. From landman at scalableinformatics.com Sun Oct 25 17:29:41 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Mature open source hierarchical storage management In-Reply-To: <20091025235356.GA29096@hpegg.wr.niftyegg.com> References: <20091025235356.GA29096@hpegg.wr.niftyegg.com> Message-ID: <4AE4ED75.8090406@scalableinformatics.com> Nifty Tom Mitchell wrote: >> Is there any mature open source alternatives out there? How are >> other organisations dealing with transparently presenting different >> tiers of storage to non technical scientists? >> > > Just curious -- how large and how big are the deltas in the > hierarchy? > > I ask because the new generation of 2TB SATA disks appear to be > establishing the groundwork for a list of new storage options > including cluster file systems that run circles around NFS and large > storage RAIDS. HFS's and tiering in general make sense when the cost of the high performance storage per GB or per TB is so large as to make it impractical to keep all of the data on disk. As Tom points out, this really isn't the case anymore. Petabytes of very high speed, very reliable storage can be had for far less money than in the past. This doesn't mean that HFSes don't make sense for some cases. Though those cases are diminishing in number over time. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. 
email: landman@scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From eugen at leitl.org Mon Oct 26 04:26:33 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Tilera targets Intel, AMD with 100-core processor Message-ID: <20091026112633.GH27331@leitl.org> http://www.goodgearguide.com.au/article/323692 Tilera targets Intel, AMD with 100-core processor Tilera hopes its new chips either replace or work alongside chips from Intel and AMD Agam Shah (IDG News Service) 26/10/2009 15:07:00 Tags: Intel, CPUs, amd Tilera on Monday announced new general-purpose CPUs, including a 100-core chip, as it tries to make its way into the server market dominated by Intel and Advanced Micro Devices. The two-year-old startup's Tile-GX series of chips are targeted at servers and appliances that execute Web-related functions such as indexing, Web search and video search, said Anant Agarwal, cofounder and chief technology officer of Tilera, which is based in San Jose, California. The chips have the attributes of a general-purpose CPU as they can run the Linux OS and other applications commonly used to serve Web data. "You can run us as an adjunct to something else, though the intent is to be able to run it stand-alone," Agarwal said. The chips could serve as co-processors alongside x86 chips, or potentially replace the chips in appliances and servers. Chip makers are continuously adding cores as a way to boost application performance. Most x86 server chips today come with either four or six cores, but Intel is set to release the Nehalem-EX chip, an x86 microprocessor with eight cores. AMD will shortly follow with a 12-core Opteron chip code-named Magny Cours. Graphics processors from companies like AMD and Nvidia include hundreds of cores to run high-performance applications, though the chips are making their way into PCs. The Gx100 100-core chip will draw close to 55 watts of power at maximum performance, Agarwal said. The 16-core chip will draw as little as 5 watts of power. Tilera's chips have an advantage in performance-per watt compared to x86 chips, but some will be skeptical as the chips are not yet established, said Will Strauss, principal analyst at Forward Concepts. "I don't think an average person is going to run out to buy a computer with Tilera in it," Strauss said. Intel has the advantage of being an incumbent, and even if Tilera offered something comparable to Intel's chips, it would take years to catch up. But to start, Tilera is focusing the chips on specific applications that can scale in performance across a large number of cores. It has ported certain Linux applications commonly used in servers, like the Apache Web server, MySQL database and Memcached caching software, to the Tilera architecture. "The reason we have target markets is not because of any technological limitations or other stuff in the chip. It is simply because, you know, you have to market your processor [to a] target audience. As a small company we can't boil the ocean," Agarwal said. The company's strategy is to go after lucrative markets where parallel-processing capability has a quick payout, Strauss said. Tilera could expand beyond the Web space to other markets where low-power chips are needed. It helps that applications can be programmed in C as with an Intel processor, but programmers are needed to write and port the applications, Strauss said. 
"How easy is it to port Windows or Linux also remains to be seen," he said. Applications like Apache and MySQL already run on x86 chips and can be ported to run on Tilera chips, company executives said. In a co-processor environment, x86 processors will run legacy applications, while the Tilera will do the Web-specific applications, he said. "As a smaller company, we can focus in on a couple of applications, drive those, and over time as we grow, we can expand," said Bob Doud, director of marketing at Tilera. The company didn't talk about the markets it would like to go into in the future. However, industry analysts say that application performance either levels off or even deteriorates as more cores are added to chips. Part of the performance relies on how the cores are assembled, said Agarwal, who is also a professor of electrical engineering and computer science at the Massachusetts Institute of Technology. For faster data exchange, Tilera has organized parallelized cores in a square with multiple points to receive and transfer data. Each core has a switch for faster data exchange. Chips from Intel and AMD rely on crossbars, but as the number of cores expands, the design could potentially cause a gridlock that could lead to bandwidth issues, he said. "You can have three or four streets coming in but ... it's hard to imagine 30 streets coming into an intersection," Agarwal said. The mesh architecture used in Tilera chips is expandable as the square gets bigger, he said. In addition to additional cores, the new Tilera chips include many upgrades from their predecessors. The chips are speedier, running at up to 1.5GHz, with support for 64-bit processing. The chips will be made using the 40-nanometer process, which make them smaller and more power-efficient. Earlier chips were made using the 90-nm process. The chips will start shipping next year, with the 100-core chip scheduled to ship in early 2011. Volume pricing for the chips will range from US$400 to $1,000. From bcostescu at gmail.com Mon Oct 26 06:11:47 2009 From: bcostescu at gmail.com (Bogdan Costescu) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: Message-ID: On Sat, Oct 24, 2009 at 11:13 PM, Rahul Nabar wrote: > What surprised me was that even if I take down my eth interface with a > ifdown the IPMI still works. How does it do that ? The IPMI traffic is IP (UDP) based and by inspecting the IP header one can make a difference between packets with the same MAC and different IPs. > That's why I am a bit surprised that my IPMI I/P continues to respond to the pings even after the primary I/P is dead. Generally speaking, an ARP or IP reply comes from the networking stack - if the port is ifdown-ed, the stack doesn't see any packets coming in and has no reason to send a reply. When the primary (system) IP is taken down, it's the Linux networking stack that doesn't see any packet coming in, however the BMC's network stack will still be active. That's the whole point of the BMC being a separate entity from the main system, so that its functionality remains undisturbed when something bad happens to the main system. > Another mysterious observation was this: Whenever I took eth down via the OS there is a latent period when the IPMI stops > responding but then somehow it magically resurrects itself and starts working again. 
Without claiming that this is the best explanation: it's possible that the Linux driver talks to the hardware and takes down the link at the physical level. The BMC driver then detects this and brings the link back up so that it can continue to receive the IPMI packets. -- Bogdan From Craig.Tierney at noaa.gov Mon Oct 26 06:48:28 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed Nov 25 01:09:01 2009 Subject: [Beowulf] Tilera targets Intel, AMD with 100-core processor In-Reply-To: <20091026112633.GH27331@leitl.org> References: <20091026112633.GH27331@leitl.org> Message-ID: <4AE5A8AC.4020208@noaa.gov> Anyone ever played with the current generation of chip? What I saw from the website for the current generation was: - No Fortran - No Floating point - In its fastest configuration, a 2-socket Nehalem has about the same memory bandwidth So unless your application sits in the on-core cache, I am wondering where the real benefit is going to be (ignoring the fact that the processor is still PCI-e connected). Craig Eugen Leitl wrote: > http://www.goodgearguide.com.au/article/323692 > > Tilera targets Intel, AMD with 100-core processor > > Tilera hopes its new chips either replace or work alongside chips from Intel > and AMD > > Agam Shah (IDG News Service) 26/10/2009 15:07:00 > > Tags: Intel, CPUs, amd > > Tilera on Monday announced new general-purpose CPUs, including a 100-core > chip, as it tries to make its way into the server market dominated by Intel > and Advanced Micro Devices. > > The two-year-old startup's Tile-GX series of chips are targeted at servers > and appliances that execute Web-related functions such as indexing, Web > search and video search, said Anant Agarwal, cofounder and chief technology > officer of Tilera, which is based in San Jose, California. The chips have the > attributes of a general-purpose CPU as they can run the Linux OS and other > applications commonly used to serve Web data. > > "You can run us as an adjunct to something else, though the intent is to be > able to run it stand-alone," Agarwal said. The chips could serve as > co-processors alongside x86 chips, or potentially replace the chips in > appliances and servers. > > Chip makers are continuously adding cores as a way to boost application > performance. Most x86 server chips today come with either four or six cores, > but Intel is set to release the Nehalem-EX chip, an x86 microprocessor with > eight cores. AMD will shortly follow with a 12-core Opteron chip code-named > Magny Cours. Graphics processors from companies like AMD and Nvidia include > hundreds of cores to run high-performance applications, though the chips are > making their way into PCs. > > The Gx100 100-core chip will draw close to 55 watts of power at maximum > performance, Agarwal said. The 16-core chip will draw as little as 5 watts of > power. > > Tilera's chips have an advantage in performance-per watt compared to x86 > chips, but some will be skeptical as the chips are not yet established, said > Will Strauss, principal analyst at Forward Concepts. > > "I don't think an average person is going to run out to buy a computer with > Tilera in it," Strauss said. Intel has the advantage of being an incumbent, > and even if Tilera offered something comparable to Intel's chips, it would > take years to catch up. > > But to start, Tilera is focusing the chips on specific applications that can > scale in performance across a large number of cores. 
It has ported certain > Linux applications commonly used in servers, like the Apache Web server, > MySQL database and Memcached caching software, to the Tilera architecture. > > "The reason we have target markets is not because of any technological > limitations or other stuff in the chip. It is simply because, you know, you > have to market your processor [to a] target audience. As a small company we > can't boil the ocean," Agarwal said. > > The company's strategy is to go after lucrative markets where > parallel-processing capability has a quick payout, Strauss said. Tilera could > expand beyond the Web space to other markets where low-power chips are > needed. > > It helps that applications can be programmed in C as with an Intel processor, > but programmers are needed to write and port the applications, Strauss said. > "How easy is it to port Windows or Linux also remains to be seen," he said. > > Applications like Apache and MySQL already run on x86 chips and can be ported > to run on Tilera chips, company executives said. In a co-processor > environment, x86 processors will run legacy applications, while the Tilera > will do the Web-specific applications, he said. > > "As a smaller company, we can focus in on a couple of applications, drive > those, and over time as we grow, we can expand," said Bob Doud, director of > marketing at Tilera. The company didn't talk about the markets it would like > to go into in the future. > > However, industry analysts say that application performance either levels off > or even deteriorates as more cores are added to chips. Part of the > performance relies on how the cores are assembled, said Agarwal, who is also > a professor of electrical engineering and computer science at the > Massachusetts Institute of Technology. > > For faster data exchange, Tilera has organized parallelized cores in a square > with multiple points to receive and transfer data. Each core has a switch for > faster data exchange. Chips from Intel and AMD rely on crossbars, but as the > number of cores expands, the design could potentially cause a gridlock that > could lead to bandwidth issues, he said. > > "You can have three or four streets coming in but ... it's hard to imagine 30 > streets coming into an intersection," Agarwal said. The mesh architecture > used in Tilera chips is expandable as the square gets bigger, he said. > > In addition to additional cores, the new Tilera chips include many upgrades > from their predecessors. The chips are speedier, running at up to 1.5GHz, > with support for 64-bit processing. The chips will be made using the > 40-nanometer process, which make them smaller and more power-efficient. > Earlier chips were made using the 90-nm process. The chips will start > shipping next year, with the 100-core chip scheduled to ship in early 2011. > Volume pricing for the chips will range from US$400 to $1,000. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From rpnabar at gmail.com Mon Oct 26 08:50:26 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? 
In-Reply-To: References: Message-ID: On Mon, Oct 26, 2009 at 8:11 AM, Bogdan Costescu wrote: > On Sat, Oct 24, 2009 at 11:13 PM, Rahul Nabar wrote: >> What surprised me was that even if I take down my eth interface with a >> ifdown the IPMI still works. How does it do that ? > > The IPMI traffic is IP (UDP) based and by inspecting the IP header one > can make a difference between packets with the same MAC and different > IPs. Actually, the MAC is different too. I have one NIC but it responds to two MACs. I guess one is transparent to the OS and the other is handled by the BMC. > taken down, it's the Linux networking stack that doesn't see any > packet coming in, however the BMC's network stack will still be > active. That's the whole point of the BMC being a separate entity from > the main system, so that its functionality remains undisturbed when > something bad happens to the main system. I see. So I assume the BMC's network stack is something that's hardware or firmware implemented. It's funny that in spite of this the IPMI gets hung sometimes (like Gerry says in his reply). I guess I can just attribute that to bad firmware coding in the BMC. >> Another mysterious observation was this: Whenever I took eth down via the OS there is a latent period when the IPMI stops >> responding but then somehow it magically resurrects itself and starts working again. > > Without claiming that this is the best explanation: it's possible that > the Linux driver talks to the hardware and takes down the link at the > physical level. The BMC driver then detects this and brings the link > back up so that it can continue to receive the IPMI packets. You are probably right. THe explanation sounds reasonable to me. A similar observation is for accessing the BIOS as well. The BMC stack is not responsive right from the power-up. It does become responsive for a bit but then the system drags it down (maybe when the BIOS hands over to PXE). If I manage to "ipmitool sol activate" within this correct window then I am able to see the BIOS. But that's pretty much trial and error. -- Rahul From hahn at mcmaster.ca Mon Oct 26 09:23:11 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: Message-ID: > IPMI gets hung sometimes (like Gerry says in his reply). I guess I can > just attribute that to bad firmware coding in the BMC. I think it's better to think of it as a piece of hw (the nic) trying to be managed by two different OSs: host and BMC. it's surprising that it works at all, since there's no real standard for sharing the hardware. I think all this adds up to a good argument for non-shared nics. (that doesn't necessitate a completely separate BMC-only fabric of switches, though.) From hahn at mcmaster.ca Mon Oct 26 09:32:04 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Tilera targets Intel, AMD with 100-core processor In-Reply-To: <4AE5A8AC.4020208@noaa.gov> References: <20091026112633.GH27331@leitl.org> <4AE5A8AC.4020208@noaa.gov> Message-ID: > So unless your application sits in the on-core cache, I am wondering where > the real benefit > is going to be (ignoring the fact that the processor is still PCI-e "serve Web data" seems to be the target, as mentioned in the release. that seems pretty fair, since webservers tend to have pretty small footprints, and more specialized protocols can be quite compact as well. 
the WIMP paper (CMU, recent) makes a pretty nice argument for web services based on lots of p2p-organized low-power processors with content in dram and/or flash. that paper was based on (IIRC) geode-level embedded cpus; the tilera thing would probably compare quite well agains them. someone _has_ to write a webserver for Cuda/OpenCL soon! not because it makes sense, of course, but think of the bumper sticker: my webserver has 1600 cores. From bcostescu at gmail.com Mon Oct 26 09:34:23 2009 From: bcostescu at gmail.com (Bogdan Costescu) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: Message-ID: On Mon, Oct 26, 2009 at 4:50 PM, Rahul Nabar wrote: > I see. So I assume the BMC's network stack is something that's > hardware or firmware implemented. The BMC is a CPU running some firmware. It's a low power one though, as it doesn't usually have to do too many things and it should not consume significant power while the main system is off. Some BMCs even run a ssh or http daemon to allow an easier interaction. > It's funny that in spite of this the > IPMI gets hung sometimes (like Gerry says in his reply). I guess I can > just attribute that to bad firmware coding in the BMC. Sometimes the BMC can simply become overloaded. I've been told that some BMCs can't cope with a high network load, especially with broadcast packets. I have always considered the BMC as a blackbox or appliance, good for only one thing, so maybe someone with a better understanding of its inner architecture can provide some more details... -- Bogdan From rpnabar at gmail.com Mon Oct 26 09:48:53 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: Message-ID: On Mon, Oct 26, 2009 at 11:34 AM, Bogdan Costescu wrote: > On Mon, Oct 26, 2009 at 4:50 PM, Rahul Nabar wrote: > The BMC is a CPU running some firmware. It's a low power one though, > as it doesn't usually have to do too many things and it should not > consume significant power while the main system is off. Some BMCs even > run a ssh or http daemon to allow an easier interaction. To me the additional services seem one of the root causes of problems. Complexity just means more places for stuff to go wrong at. It may not be super difficult to impliment ARP and IPMI but when you start adding ssh and http you are pretty much writing some pretty complex daemons I suppose. I just discovered that my BMC will also serve out http pages if I point a browser to its I/P. While this might be "cool" it just increases the load on the BMC and leaves more scope for coding bugs. >> It's funny that in spite of this the >> IPMI gets hung sometimes (like Gerry says in his reply). I guess I can >> just attribute that to bad firmware coding in the BMC. > > Sometimes the BMC can simply become overloaded. I've been told that > some BMCs can't cope with a high network load, especially with > broadcast packets. I see. Any pointers to relieve this load? Any tweaks? Or precautions? > I have always considered the BMC as a blackbox or > appliance, good for only one thing, so maybe someone with a better > understanding of its inner architecture can provide some more > details... Yes, it's a black box for me too! 
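A quick way to see just how much that black box exposes is to probe its address from the head node. A sketch, assuming nmap and curl are installed, reusing the BMC IP from the earlier example:

# RMCP+/IPMI itself listens on UDP 623; any extra daemons (ssh, telnet,
# http, remote KVM) show up as open TCP ports
nmap -sU -p 623 10.0.0.26
nmap -p 22,23,80,443,5900 10.0.0.26
# and fetch the web UI's front page directly
curl -k https://10.0.0.26/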
-- Rahul From bcostescu at gmail.com Mon Oct 26 09:53:11 2009 From: bcostescu at gmail.com (Bogdan Costescu) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: Message-ID: On Mon, Oct 26, 2009 at 5:23 PM, Mark Hahn wrote: > it's surprising that it works at all, since there's no real > standard for sharing the hardware. I don't think that a standard is actually needed... My naive understanding is that the NIC firmware does packet inspection (no need for deep packet inspection, only MAC and IP headers are enough) and doesn't allow the main system to see that there is a packet if the packet had the BMC as the destination and doesn't allow the BMC to see that there is a packet if the packet had the main system as the destination. Then both the main system and the BMC can use the same way of accessing the NIC, as documented by the NIC vendor - no standard required. Hmmm, actually re-reading what I just wrote, sounds too good to be true so I might miss something :-) -- Bogdan From Greg at keller.net Mon Oct 26 09:57:11 2009 From: Greg at keller.net (Greg Keller) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Re: Beowulf Digest, Vol 68, Issue 44 In-Reply-To: <200910261555.n9QFtP8J015452@bluewest.scyld.com> References: <200910261555.n9QFtP8J015452@bluewest.scyld.com> Message-ID: <2B3A548E-2751-4922-B559-FDBB5CBFC37B@Keller.net> On Oct 26, 2009, at 10:55 AM, beowulf-request@beowulf.org wrote: > Message: 6 > Date: Mon, 26 Oct 2009 10:50:26 -0500 > From: Rahul Nabar > Subject: Re: [Beowulf] any creative ways to crash Linux?: does a > shared NIC IMPI always remain responsive? > To: Bogdan Costescu > Cc: Beowulf Mailing List > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > On Mon, Oct 26, 2009 at 8:11 AM, Bogdan Costescu > wrote: >> On Sat, Oct 24, 2009 at 11:13 PM, Rahul Nabar >> wrote: >>> What surprised me was that even if I take down my eth interface >>> with a >>> ifdown the IPMI still works. How does it do that ? >> >> The IPMI traffic is IP (UDP) based and by inspecting the IP header >> one >> can make a difference between packets with the same MAC and different >> IPs. > > Actually, the MAC is different too. I have one NIC but it responds to > two MACs. I guess one is transparent to the OS and the other is > handled by the BMC. Correct. In some blades they used to share the mac, but I don't think anyone does that anymore. The BMC MAC/IP is hidden and functional regardless of the OS state. IPMI drivers can talk to the chip through the OS if need be by starting the appropriate service or kernel modules, but that's usually only fun for configuring the card, since you'll use the Network interface in most situations. > > > >> taken down, it's the Linux networking stack that doesn't see any >> packet coming in, however the BMC's network stack will still be >> active. That's the whole point of the BMC being a separate entity >> from >> the main system, so that its functionality remains undisturbed when >> something bad happens to the main system. > > I see. So I assume the BMC's network stack is something that's > hardware or firmware implemented. It's funny that in spite of this the > IPMI gets hung sometimes (like Gerry says in his reply). I guess I can > just attribute that to bad firmware coding in the BMC. 
"A Rich feature set" includes these issues :) > > >>> Another mysterious observation was this: Whenever I took eth down >>> via the OS there is a latent period when the IPMI stops >>> responding but then somehow it magically resurrects itself and >>> starts working again. >> >> Without claiming that this is the best explanation: it's possible >> that >> the Linux driver talks to the hardware and takes down the link at the >> physical level. The BMC driver then detects this and brings the link >> back up so that it can continue to receive the IPMI packets. > > You are probably right. THe explanation sounds reasonable to me. A > similar observation is for accessing the BIOS as well. The BMC stack > is not responsive right from the power-up. It does become responsive > for a bit but then the system drags it down (maybe when the BIOS hands > over to PXE). If I manage to "ipmitool sol activate" within this > correct window then I am able to see the BIOS. But that's pretty much > trial and error. You will probably also notice the BMC only brings the link up at 100Mb but the OS brings it up to 1Gb. Switches can add some lag here too, if Spanning tree is enabled. Turning off Spanning Tree or turning on "Port Fast" will help. Otherwise there is a period of up to about 40 seconds that the link is "up" but the switch hasn't started passing traffic (as it checks to make sure there's no ethernet loop). This has caused many Cluster Deployments hours of head banging. Cheers! Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091026/3ed8ef55/attachment.html From john.hearns at mclaren.com Mon Oct 26 10:01:16 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> On Mon, Oct 26, 2009 at 11:34 AM, Bogdan Costescu wrote: > On Mon, Oct 26, 2009 at 4:50 PM, Rahul Nabar wrote: > The BMC is a CPU running some firmware. It's a low power one though, > as it doesn't usually have to do too many things and it should not > consume significant power while the main system is off. Some BMCs even > run a ssh or http daemon to allow an easier interaction. To me the additional services seem one of the root causes of problems. Complexity just means more places for stuff to go wrong at. It may not be super difficult to impliment ARP and IPMI but when you start adding ssh and http you are pretty much writing some pretty complex daemons I suppose. On the original Opteron sun servers which had BMCs and the two Ethernet interfaces which you could daisy-chain, the BMCs were Motorola single board computers running Linux. So ssh and http access were already there with whichever Linux distro they ran (you could look around in /proc for instance) http access was dead useful - I remember using it, and also on the silver coloured xNNNN series which followed them. As I recall, it was a nice interface to list the hardware error logs. Regarding how IPMI and eth0 co-exist on one interface, I though there was a simple bridge chip on the motherboard. The contents of this email are confidential and for the exclusive use of the intended recipient. 
If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From john.hearns at mclaren.com Mon Oct 26 10:03:27 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B0DA0944E@milexchmb1.mil.tagmclarengroup.com> I don't think that a standard is actually needed... My naive understanding is that the NIC firmware does packet inspection As I just said, I thought there was a bridge chip before the NIC, and I agree there is packet filtering. Look up my tortuous examination of what happens when you run a lot of 'rsh'sessions on a cluster, and clash with the IPMI ports (ie your rsh session disappears down a big hole). And before anyone says it - more modern kernels raise the sun rpc min reserved port to above the IPMI port(s) The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From rpnabar at gmail.com Mon Oct 26 10:15:01 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0DA0944E@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0DA0944E@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Mon, Oct 26, 2009 at 12:03 PM, Hearns, John wrote: > I don't think that a standard is actually needed... My naive > understanding is that the NIC firmware does packet inspection > > As I just said, I thought there was a bridge chip before the NIC, > and I agree there is packet filtering. > Look up my tortuous examination of what happens when you run a lot of > 'rsh'sessions on a cluster, John, you talked about this problem once before but I couldn't find the relevant thread. I was curious to read about it. Do you have a link to the thread or can you redump your analysis in this one? Thanks! -- Rahul From rpnabar at gmail.com Mon Oct 26 10:15:49 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Mon, Oct 26, 2009 at 12:01 PM, Hearns, John wrote: > > > > On the original Opteron sun servers > which had BMCs and the two Ethernet interfaces which you could > daisy-chain, > the BMCs were Motorola single board computers running Linux. > So ssh and http access were already there with whichever Linux distro > they > ran (you could look around in /proc for instance) Wow! I didn't realize that the BMC was again running a full blown Linux distro! -- Rahul From john.hearns at mclaren.com Mon Oct 26 11:18:33 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? 
In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0DA094FE@milexchmb1.mil.tagmclarengroup.com> Wow! I didn't realize that the BMC was again running a full blown Linux distro! Only on those Sun servers - a friend who used to work for Sun showed me this. Those sun BMCs did more than act as IPMI controllers (ie you could have virtual CD drivers etc, though of course other IPMI type controllers can do this) For Dell, Supermicro, HP etc. etc. I don't know what the Drac/IPMI cards run. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From hahn at mcmaster.ca Mon Oct 26 11:23:28 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> Message-ID: >> the BMCs were Motorola single board computers running Linux. >> So ssh and http access were already there with whichever Linux distro >> they >> ran (you could look around in /proc for instance) > > Wow! I didn't realize that the BMC was again running a full blown Linux distro! sigh. the simplest unix distro is a kernel and a single /sbin/init in the initrd. remember, what you see as a conventional desktop/server OS is layered, mainly by the runlevel/init.d mechanism, then by X-related stuff. a kiosk running linux, for instance, might well avoid runlevels and have exactly one process alive. it's entirely possible to add ssh and its dependencies and still wind up with something very small: consider the firmware stack you find in media players and cable/wireless gateways. (or, for that matter, managed IB switches.) still a distros in the technical sense, but "full blown" as you mean it. several of the ssh-equipped firmwares I've interacted with (BMC-like or else storage controllers) have appeared to have custom command interpreters rather than a conventional shell (even of the busybox kind). From dnlombar at ichips.intel.com Mon Oct 26 11:53:54 2009 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0DA094FE@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0DA094FE@milexchmb1.mil.tagmclarengroup.com> Message-ID: <20091026185354.GA18478@nlxdcldnl2.cl.intel.com> On Mon, Oct 26, 2009 at 11:18:33AM -0700, Hearns, John wrote: > > Wow! I didn't realize that the BMC was again running a full blown Linux > distro! Well, "running Linux" is very different from "running a full blown Linux". For example, I have a kernel, initrd, dhcp, shell, ability to mount file systems, kexec, ssh, and a boot loader in about 1.28MiB that I can flash onto a platform flash. BTW, that userland is provided by busybox, kexec-tools, and Dropbear-SSH. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. 
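To give a feel for how such an image goes together, here is a minimal busybox-only initramfs sketch. It is not the actual build described above; the dhcp client is busybox's udhcpc, and the dropbear/kexec pieces are left out:

# pack a statically linked busybox into a tiny initramfs
mkdir -p img/bin img/proc img/sys img/dev img/etc
cp busybox img/bin/ && ln -s busybox img/bin/sh
cat > img/init <<'EOF'
#!/bin/sh
/bin/busybox --install -s /bin
mount -t proc proc /proc
mount -t sysfs sysfs /sys
udhcpc -i eth0          # busybox dhcp client
exec /bin/sh            # this is where dropbear, kexec, etc. would start
EOF
chmod +x img/init
( cd img && find . | cpio -o -H newc | gzip ) > initramfs.gz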
From gerry.creager at tamu.edu Mon Oct 26 12:22:02 2009 From: gerry.creager at tamu.edu (Gerald Creager) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4AE5F6DA.5090403@tamu.edu> Mark Hahn wrote: >>> the BMCs were Motorola single board computers running Linux. >>> So ssh and http access were already there with whichever Linux distro >>> they >>> ran (you could look around in /proc for instance) >> >> Wow! I didn't realize that the BMC was again running a full blown >> Linux distro! > > sigh. the simplest unix distro is a kernel and a single /sbin/init in > the initrd. remember, what you see as a conventional desktop/server OS > is layered, mainly by the runlevel/init.d mechanism, then by X-related > stuff. > a kiosk running linux, for instance, might well avoid runlevels and have > exactly one process alive. it's entirely possible to add ssh and its > dependencies and still wind up with something very small: consider the > firmware stack you find in media players and cable/wireless gateways. > (or, for that matter, managed IB switches.) still a distros in the > technical > sense, but "full blown" as you mean it. several of the ssh-equipped > firmwares I've interacted with (BMC-like or else storage controllers) have > appeared to have custom command interpreters rather than a conventional > shell > (even of the busybox kind). SuperMicro uses Winbond IPMI modules. They're a pretty full-featured BusyBox implementation. gerry From rpnabar at gmail.com Mon Oct 26 14:35:30 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: <20091026185354.GA18478@nlxdcldnl2.cl.intel.com> References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0DA094FE@milexchmb1.mil.tagmclarengroup.com> <20091026185354.GA18478@nlxdcldnl2.cl.intel.com> Message-ID: On Mon, Oct 26, 2009 at 1:53 PM, David N. Lombard wrote: > On Mon, Oct 26, 2009 at 11:18:33AM -0700, Hearns, John wrote: > > Well, "running Linux" is very different from "running a full blown Linux". ?For > example, I have a kernel, initrd, dhcp, shell, ability to mount file systems, > kexec, ssh, and a boot loader in about 1.28MiB that I can flash onto a platform > flash. True. I just thought that if my BMC is running a webserver it cannot be all that stripped down. Maybe I am wrong and it is possible to write a compact webserver. Maybe there's ways to serve out those html pages without even having a webserver..... -- Rahul From hearnsj at googlemail.com Mon Oct 26 14:45:24 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0DA094FE@milexchmb1.mil.tagmclarengroup.com> <20091026185354.GA18478@nlxdcldnl2.cl.intel.com> Message-ID: <9f8092cc0910261445x7ce5f2d4m9c59e5906c8baf77@mail.gmail.com> 2009/10/26 Rahul Nabar : > > True. I just thought that if my ?BMC is running a webserver it cannot > be all that stripped down. Maybe I am wrong and it is possible to > write a compact webserver. 
Google for Perl one-liner webserver Heck, your mobile phone is more powerful than the mainframes of yesteryear. I first programmed on a mainframe with 4 megabytes of main memory. And it ran virtual machines!
From hearnsj at googlemail.com Mon Oct 26 14:51:29 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: <9f8092cc0910261445x7ce5f2d4m9c59e5906c8baf77@mail.gmail.com> References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0DA094FE@milexchmb1.mil.tagmclarengroup.com> <20091026185354.GA18478@nlxdcldnl2.cl.intel.com> <9f8092cc0910261445x7ce5f2d4m9c59e5906c8baf77@mail.gmail.com> Message-ID: <9f8092cc0910261451m8f67385i372622d5580d1666@mail.gmail.com> 2009/10/26 John Hearns : > 2009/10/26 Rahul Nabar : >> >> True. I just thought that if my BMC is running a webserver it cannot >> be all that stripped down. Maybe I am wrong and it is possible to >> write a compact webserver. The address of the world's first webserver lives on at http://info.cern.ch/ The hardware is in a glass case at CERN. http://guides.macrumors.com/NeXT_Cube "The NeXT Cube ran a 25 MhZ 68030 processor, and came with a mammoth 8 MB of RAM. "
From ed92626 at gmail.com Fri Oct 23 12:29:30 2009 From: ed92626 at gmail.com (ed in 92626) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Build a Beowulf Cluster In-Reply-To: <4ADF529D.2020109@gmail.com> References: <4ADF529D.2020109@gmail.com> Message-ID: Here's something. http://www.mcsr.olemiss.edu/bookshelf/articles/how_to_build_a_cluster.html On Wed, Oct 21, 2009 at 11:27 AM, Tony Miranda wrote: > Hi everyone, > > anyone could help me explaning how to build a beowulf cluster? > An web site, a list of parameters anything updated. Cause i only found in > the internet posts that are really old. > > Thanks a lot. > > Tony Miranda. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091023/40f44b35/attachment.html
From ed92626 at gmail.com Fri Oct 23 12:30:58 2009 From: ed92626 at gmail.com (ed in 92626) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Build a Beowulf Cluster In-Reply-To: <4ADF529D.2020109@gmail.com> References: <4ADF529D.2020109@gmail.com> Message-ID: another one. http://www.cacr.caltech.edu/beowulf/tutorial/tutorial.html On Wed, Oct 21, 2009 at 11:27 AM, Tony Miranda wrote: > Hi everyone, > > anyone could help me explaning how to build a beowulf cluster? > An web site, a list of parameters anything updated. Cause i only found in > the internet posts that are really old. > > Thanks a lot. > > Tony Miranda. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20091023/bdb2cb38/attachment.html From oper.ml at gmail.com Fri Oct 23 14:22:16 2009 From: oper.ml at gmail.com (Tony Miranda) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Build a Beowulf Cluster In-Reply-To: <20091023191258.248790@gmx.com> References: <20091023191258.248790@gmx.com> Message-ID: <4AE21E88.90904@gmail.com> Hi Tomislav, thanks for the reply, and thanks to ed in 92626, too, for the url's provided. So, Tomislav, i got a cluster running, i dont know if it was i the beowuld default configurations, but i got it running with serveral howtos that i found o the web. I got it running with LAM-MPI and PVM. I had 3 machines, an Intel Core 2 Duo and two Pentium 4 HT, all with intel motherboard and 2GB RAM each. The test that i made with some binaries of LAM-MPI and PVM got good results but couldnt get the performance result, i didnt made any test for that. I was pretending to build a cluster with high number of processors and a lot of memory, so i could build a XEN server inside this cluster, and create a virtual machine to use all this resource, but seems that this is not possible. I cant get the xen to work with distributed processor, can i? Byt the way, thanks a lot! Thank you all! Tony Miranda. On 10/23/2009 04:56 PM, tomislav.maric@gmx.com wrote: > > Hello Tony, > > > I'm building my own at home with 2 nodes. This is a padawan's humble > info, so please, don't kill me if I get something wrong, I mean only well. > > > 1) You need to figure out what application are you going to run on the > cluster, in parallel. I have managed to follow only general guidelines > so far, since I'm running OpenFOAM CFD (Computational Fluid Dynamics) > simulations that burden the memory and the communications between the > nodes. This will tell you about the hardware: should you use > multiprocessor motherboards or like me, quad core single processor > ones. Actually, this is kind of a gradual process of examining your > application on a cluster, and it's kind of a closed circle: you need > to know which hardware suits your applications, and to know that, you > need to have a cluster to examine your app on. From my padawan point > of view (I may be totally wrong), people usually test their > application on a remote machine (if that't possible, from some company > that provides the hardware), or they buy just a few motherboards that > fit the price/performance conditions and are IN GENERAL, OK for a > targeted application. After they assemble the tiny cluster, they scale > it, praying as they do it, that the behaviour of the program won't > completely change on the bigger machine. > > > This is how I've dealt with this problem: > > > - I need a fast processor and loads of RAM: I've bought P5Q-VM > motherboard because it's not so expensive and you get 4 slots for RAM > (up to 16GB of RAM) per node, and I've put Intel Quad core processors > on the computing nodes. I could be totally wrong, but I don't think > so: anyway, the application will tell me, and then I can sell the > electronics and buy something different (I hope to God, this won't > happen). > > - I bought a GigEth switch (High Speed Interconnect I can't afford as > a graduate student because one HSI NIC kosts like 800 dollars) > > - I bought myself some hard disks > > - a monitor > > - another GigEth nic for the frontend (master) node to be able to > access Internet > > - some PSUs > > - ... 
etc > > > If you want to build a small home beowulf, check out Limulus project: > > > http://limulus.basement-supercomputing.com/ > > > When I figure out the casing for my cluster, it will actully become > kind of a Limulus cluster, i hope. > > > > 3) Don't worry about the "old" literature, I have read this: > > - prof. Robert G. Brown's book > > - how to build a beowulf cluster > > - Building Cluster Linux Systems > > - .... huge amount of googling... > > > The protocols are the same, and not much has changed from 2004. > for a small home beowulf, I have learned much by reading these > books/articles. Also, go to www.clustermonkey.net, and read until you > go blind. > > > I don't know much, but what I know I'll tell you (don't kill me if I'm > wrong, just correct me so that I can learn also :) ), feel free to ask > me on my email (or post here, which is more efficient because here > you'll get your answer from a well respected bunch of > HPC/Linux/Admin/Programming/... masters). > > > Best regards and godspeed, > > Tomislav > > > > ----- Original Message ----- > > From: Tony Miranda > > Sent: 10/21/09 08:27 pm > > To: beowulf@beowulf.org > > Subject: [Beowulf] Build a Beowulf Cluster > > Hi everyone, > > anyone could help me explaning how to build a beowulf cluster? > An web site, a list of parameters anything updated. Cause i only found > in the internet posts that are really old. > > Thanks a lot. > > Tony Miranda. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From brice.goglin at gmail.com Fri Oct 23 14:37:14 2009 From: brice.goglin at gmail.com (Brice Goglin) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] eth-mlx4-0/15 In-Reply-To: <08C3678A-7DEA-4BD8-863C-28A00AA53FBD@gmail.com> References: <08C3678A-7DEA-4BD8-863C-28A00AA53FBD@gmail.com> Message-ID: <4AE2220A.1070507@gmail.com> Robert Kubrick wrote: > I noticed my machine has 16 drivers in the /proc/interrupts table > marked as eth-mlx4-0 to 15, in addition to the usual mlx-async and > mlx-core drivers. > The server runs Linux Suse RT, has an infiniband interface, OFED 1.1 > drivers, and 16 Xeon MP cores , so I'm assuming all these eth-mlx4 > drivers are supposed to do "something" with each core. I've never seen > these irq managers before. When I run infiniband apps the interrupts > go to both mlx-async and eth-mlx4-0 (just 0, all the other drivers > don't get any interrupts). Also the eth name part looks suspicious. > > I can't find any reference online, any idea what these drivers are about? These are multiple interrupt queues. The driver probably setups one per core (or one send and one recv queue per core) for each physical NIC. Brice From worringen at googlemail.com Sun Oct 25 03:12:40 2009 From: worringen at googlemail.com (Joachim Worringen) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] eth-mlx4-0/15 In-Reply-To: <08C3678A-7DEA-4BD8-863C-28A00AA53FBD@gmail.com> References: <08C3678A-7DEA-4BD8-863C-28A00AA53FBD@gmail.com> Message-ID: <981e81f00910250312k15f600fdx9becfa6448502fbe@mail.gmail.com> I assume these are MSI-X interrupts of the one Mellanox driver instance. This feature allows to spread interrupts more or less evenly across CPUs, in conjunction with multiple send/recv queues. Each PCI device has a single driver (unless we talk about virtualized I/O, which does not apply here). 
But a single driver can serve any number of interrupts. Joachim On Fri, Oct 23, 2009 at 2:25 AM, Robert Kubrick wrote: > I noticed my machine has 16 drivers in the /proc/interrupts table marked as > eth-mlx4-0 to 15, in addition to the usual mlx-async and mlx-core drivers. > The server runs Linux Suse RT, has an infiniband interface, OFED 1.1 > drivers, and 16 Xeon MP cores , so I'm assuming all these eth-mlx4 drivers > are supposed to do "something" with each core. I've never seen these irq > managers before. When I run infiniband apps the interrupts go to both > mlx-async and eth-mlx4-0 (just 0, all the other drivers don't get any > interrupts). Also the eth name part looks suspicious. > > I can't find any reference online, any idea what these drivers are about? > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091025/ddea7a8e/attachment.html From robertkubrick at gmail.com Sun Oct 25 10:48:14 2009 From: robertkubrick at gmail.com (Robert Kubrick) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] eth-mlx4-0/15 In-Reply-To: <981e81f00910250312k15f600fdx9becfa6448502fbe@mail.gmail.com> References: <08C3678A-7DEA-4BD8-863C-28A00AA53FBD@gmail.com> <981e81f00910250312k15f600fdx9becfa6448502fbe@mail.gmail.com> Message-ID: Thanks, but I am not entirely clear on why the interrupts flow to both the mlx-core driver and eth-mlx4-0. This is what my /proc/interrupts table look like. Interrupts go to CPU0 for mlx4_core and CPU6 for eth-mlx4-0: 4319: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-15 4320: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-14 4321: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-13 4322: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-12 4323: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-11 4324: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-10 4325: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-9 4326: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-8 4327: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-7 4328: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-6 4329: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-5 4330: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-4 4331: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-3 4332: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-2 4333: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-1 4334: 34 0 0 0 0 34 97347 0 0 0 0 0 0 0 0 0 PCI-MSI-edge eth- mlx4-0 4335: 3197 0 152 4 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge mlx4_core(async) On Oct 25, 2009, at 6:12 AM, Joachim Worringen wrote: > I assume these are MSI-X interrupts of the one Mellanox driver > instance. This feature allows to spread interrupts more or less > evenly across CPUs, in conjunction with multiple send/recv queues. > > Each PCI device has a single driver (unless we talk about > virtualized I/O, which does not apply here). But a single driver > can serve any number of interrupts. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20091025/4b920590/attachment.html From working at surfingllama.com Sun Oct 25 22:04:58 2009 From: working at surfingllama.com (Carl Thomas) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Mature open source hierarchical storage management In-Reply-To: <4AE4ED75.8090406@scalableinformatics.com> References: <20091025235356.GA29096@hpegg.wr.niftyegg.com> <4AE4ED75.8090406@scalableinformatics.com> Message-ID: 2009/10/26 Joe Landman > Just curious -- how large and how big are the deltas in the > >> hierarchy? > At the start of the year we were seeing an average delta of about 10GB/day, currently we are seeing an average delta of 70GB/day. There are still a number of unknowns, but we are expecting that with next generation DNA sequencing and mass-spectrometry coming on board early next year, that the delta is likely to jump up to ~500GB/day. > I ask because the new generation of 2TB SATA disks appear to be >> establishing the groundwork for a list of new storage options >> including cluster file systems that run circles around NFS and large >> storage RAIDS. >> > > HFS's and tiering in general make sense when the cost of the high > performance storage per GB or per TB is so large as to make it impractical > to keep all of the data on disk. > > As Tom points out, this really isn't the case anymore. Petabytes of very > high speed, very reliable storage can be had for far less money than in the > past. > > This doesn't mean that HFSes don't make sense for some cases. Though those > cases are diminishing in number over time. > Its definitely good to see the cost of large storage coming down, unfortunately in our organisation amount of data that the machines are generating is increasing faster than the storage. The people driving the machines would like to see the raw data held indefinitely, but with approximately 10TB of data for an Illumina run its likely that we will only be able retain it until the intial processing is completed. Cheers, Carl. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091026/523be088/attachment.html From daniel.kidger at bull.co.uk Mon Oct 26 08:35:35 2009 From: daniel.kidger at bull.co.uk (Daniel Kidger) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] mpirun and line buffering In-Reply-To: <20091026112633.GH27331@leitl.org> References: <20091026112633.GH27331@leitl.org> Message-ID: <4AE5C1C7.5050502@bull.co.uk> Folks, I have an benchmark code that uses printf on each MPI process to print performance figures. On the system that the original author's developed on, the output from each process must have been line buffereed. On the QDR IB system I have testing on, the printf from each process is not line buffered and so the output is largely unreadable as all the lines have been jumbled together. Is this a known issue with a workaround? or does everyone else only print from just one MPI process? Daniel -- Bull, Architect of an Open World TM Dr. Daniel Kidger, HPC Technical Consultant daniel.kidger@bull.co.uk +44 (0) 7966822177 From rpnabar at gmail.com Mon Oct 26 15:33:05 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? 
In-Reply-To: <9f8092cc0910261451m8f67385i372622d5580d1666@mail.gmail.com> References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0DA094FE@milexchmb1.mil.tagmclarengroup.com> <20091026185354.GA18478@nlxdcldnl2.cl.intel.com> <9f8092cc0910261445x7ce5f2d4m9c59e5906c8baf77@mail.gmail.com> <9f8092cc0910261451m8f67385i372622d5580d1666@mail.gmail.com> Message-ID: On Mon, Oct 26, 2009 at 4:51 PM, John Hearns wrote: > 2009/10/26 John Hearns : >> 2009/10/26 Rahul Nabar : >>> >>> True. I just thought that if my ?BMC is running a webserver it cannot >>> be all that stripped down. Maybe I am wrong and it is possible to >>> write a compact webserver. > > The address of the world's first webserver lives on at http://info.cern.ch/ > > The hardware is in a glass case at CERN. http://guides.macrumors.com/NeXT_Cube > "The NeXT Cube ran a 25 MhZ 68030 processor, and came with a mammoth 8 > MB of RAM. " Nice! Very impressive! And to think of all the bloat we accumulated over the years...... -- Rahul From gerry.creager at tamu.edu Mon Oct 26 15:34:47 2009 From: gerry.creager at tamu.edu (Gerald Creager) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: Message-ID: <4AE62407.9090105@tamu.edu> Mark Hahn wrote: >> IPMI gets hung sometimes (like Gerry says in his reply). I guess I can >> just attribute that to bad firmware coding in the BMC. > > I think it's better to think of it as a piece of hw (the nic) > trying to be managed by two different OSs: host and BMC. > it's surprising that it works at all, since there's no real > standard for sharing the hardware. I think all this adds up to a good > argument for non-shared nics. (that doesn't necessitate > a completely separate BMC-only fabric of switches, though.) Great in theory, and mostly in practice, but the one I was specifically referencing which lost its mind, was not sharing a NIC, or at least, not directly. As Mark says, there's not a good, standard, way to share a NIC. Folks (read, managers and vendors) who think this is a good idea usually don't have to fight the results of their musings. They leave it to someone like me (or, for that matter, most of the folks reading this list) to figure out with hints and dribs of information, make sense of it, and fix. THEY think it's cool to have eliminated another network. THEY don't have to trace the thing back, but instead, look at their bottom line and tell us how much they did to improve our lives. I've gotten to where I'd rather not use IPMI than to share a network between IPMI and "normal" network traffic. And that's assuming we're talking about a private network that's safely isolated from folks who'd do you ill. I'll not go into my initial thoughts about someone who'd expose a suite of IPMI hosts to a public network. gerry From secr at onthemove-conferences.org Mon Oct 26 16:11:17 2009 From: secr at onthemove-conferences.org (OnTheMove Federated Conferences) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] OnTheMove 2009, Vilamoura, Algarve: 5 top Keynote Speakers Message-ID: <1975DA4DACA64E5197984955DD28D7AF@Daniel> Call for Participation OnTheMove (OTM) 2009 is a federated event that counts 4 conferences and 10 workshops, co-located in the week of Nov.1 to 6, in the Tivoli Marina Conference Center and Hotel overlooking the pleasant fishing and yacht harbour of Vilamoura, in the Portuguese Algarve. 
All workshops and conferences gravitate around the themes and aspects of distributed, heterogenous and ubiquitous computing on the internet. There will be plenty of opportunities to mingle with researchers in related domains, or become informed of new developments in your area. More about OnTheMove 2009 can be found at http://www.onthemove-conferences.org/ We draw your attention to our prestigious keynote speaker program which by OTM tradition is shared among all workshops and conferences, so *any* registration gives access to *all* keynote talks! Keynotes: - Wolfgang Prinz - Santosh Shrivastava - Kai Hwang - Alejandro Buchmann - Claude Feliot We, the General co-Chairs of OTM'09 hope we may welcome you for a week of top professional enjoyment in this charming part of Portugal. Advance registration rates remain in effect if registration is done through our website until 31 Oct 2009. Pilar Herrero Tharam Dillon Robert Meersman OnTheMove Federated Conferences & Workshops 2009 www.onthemove-conferences.org -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091027/83f1aa3a/attachment.html From hahn at mcmaster.ca Mon Oct 26 22:24:05 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] mpirun and line buffering In-Reply-To: <4AE5C1C7.5050502@bull.co.uk> References: <20091026112633.GH27331@leitl.org> <4AE5C1C7.5050502@bull.co.uk> Message-ID: > On the QDR IB system I have testing on, the printf from each process is not setlinebuf(stdout)? maybe also due to change in mpi flavor? From h-bugge at online.no Tue Oct 27 00:17:31 2009 From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] mpirun and line buffering In-Reply-To: <4AE5C1C7.5050502@bull.co.uk> References: <20091026112633.GH27331@leitl.org> <4AE5C1C7.5050502@bull.co.uk> Message-ID: <77AE5E93-4F25-40B5-869D-80C4FC61F639@online.no> Use an MPI where you can control the buffering. Line of character buffering would suffice in your case. Setting the line buffering in the MPI process itself might not be sufficient, depending on how the stdout/err is transported back to your (assumed) terminal. H?kon On Oct 26, 2009, at 16:35 , Daniel Kidger wrote: > Folks, > > I have an benchmark code that uses printf on each MPI process to > print performance figures. > On the system that the original author's developed on, the output > from each process must have been line buffereed. > > On the QDR IB system I have testing on, the printf from each process > is not line buffered and so the output is largely unreadable as all > the lines have been jumbled together. > > Is this a known issue with a workaround? or does everyone else only > print from just one MPI process? > > Daniel > > -- > Bull, Architect of an Open World TM > > Dr. 
Daniel Kidger, HPC Technical Consultant > daniel.kidger@bull.co.uk > +44 (0) 7966822177 > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > Mvh., H?kon Bugge h-bugge@online.no +47 924 84 514 From ashley at pittman.co.uk Tue Oct 27 02:37:44 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] mpirun and line buffering In-Reply-To: References: <20091026112633.GH27331@leitl.org> <4AE5C1C7.5050502@bull.co.uk> Message-ID: <1256636264.3577.14.camel@alpha> On Tue, 2009-10-27 at 01:24 -0400, Mark Hahn wrote: > > On the QDR IB system I have testing on, the printf from each process is not > > setlinebuf(stdout)? > maybe also due to change in mpi flavor? In my experience the resource manager is what controls this, the same binary with the same settings will produce good output on some clusters and bad on others. The only resource manager which seems to reliably not mess up output is orte, that and RMS of course. I believe most people take the route of getting rank[0] to do all the printing. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From john.hearns at mclaren.com Tue Oct 27 02:59:41 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Infiniband range extenders Message-ID: <68A57CCFD4005646957BD2D18E60667B0DA09709@milexchmb1.mil.tagmclarengroup.com> My Google-fu is failing me today. If I wanted to run an Infiniband link over a distance of 200 or 300 metres, what would the options be? There is LX fibre in place, and the connection would be between a switch and a card in a remote machine. John Hearns The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From nixon at nsc.liu.se Tue Oct 27 06:13:02 2009 From: nixon at nsc.liu.se (Leif Nixon) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: (Rahul Nabar's message of "Sat\, 24 Oct 2009 17\:13\:19 -0500") References: Message-ID: Rahul Nabar writes: > (I know Joe Landman and others had warned me against this but I tried > to start with configuring a single shared NIC and then go for two > NICs. Just keeping things simple to start with.) Using shared NICs is the *complicated* alternative. You are exposing yourself to a lot of potential weird effects. I prefer having the IPMI stuff on a separate NIC that is not only dedicated, but also totally unreachable from the host OS. If somebody roots a node, they shouldn't get access to the IPMI network. 
-- / Swedish National Infrastructure for Computing Leif Nixon - Security officer < National Supercomputer Centre \ Nordic Data Grid Facility From stewart at serissa.com Tue Oct 27 06:33:07 2009 From: stewart at serissa.com (Larry Stewart) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] mpirun and line buffering In-Reply-To: <4AE5C1C7.5050502@bull.co.uk> References: <20091026112633.GH27331@leitl.org> <4AE5C1C7.5050502@bull.co.uk> Message-ID: <77b0285f0910270633nbe1b9adqaf0eaab300cf50f0@mail.gmail.com> On Mon, Oct 26, 2009 at 11:35 AM, Daniel Kidger wrote: > Folks, > > I have an benchmark code that uses printf on each MPI process to print > performance figures. > On the system that the original author's developed on, the output from each > process must have been line buffereed. > > As Ashley says, the job control system deals with this. I'm familiar with slurm, for example, which line-buffers output from each rank and can prepend output with the rank ID. Slurm can also direct output from each rank to an individual file. What job control system are you using? Check the man page for the job launch command to see if there are switches to control output. Other ideas - there are probably environment variables to tell the rank which rank it is, to find out see the documentation or run a two-rank "printenv". Then you could redirect each rank's output separately as program >output-from-rank-$RANK or whatever. -L -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20091027/7512642f/attachment.html From mdidomenico4 at gmail.com Tue Oct 27 07:26:39 2009 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Infiniband range extenders In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0DA09709@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0DA09709@milexchmb1.mil.tagmclarengroup.com> Message-ID: If you're talking about pushing just a workstation 200meters away from the core network, then Obisidian longbow's are probably your best bet http://www.obsidianresearch.com/products/c-series.html We've used them on the SCINet network at SuperComputing for the past few years without issue. On Tue, Oct 27, 2009 at 5:59 AM, Hearns, John wrote: > My Google-fu is failing me today. > If I wanted to run an Infiniband link over a distance of 200 or 300 > metres, what would the > options be? > > There is LX fibre in place, and the connection would be between a switch > and a card in > a remote machine. > > John Hearns > > The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From gerry.creager at tamu.edu Tue Oct 27 07:28:52 2009 From: gerry.creager at tamu.edu (Gerald Creager) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? 
In-Reply-To: <9f8092cc0910261451m8f67385i372622d5580d1666@mail.gmail.com> References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0DA094FE@milexchmb1.mil.tagmclarengroup.com> <20091026185354.GA18478@nlxdcldnl2.cl.intel.com> <9f8092cc0910261445x7ce5f2d4m9c59e5906c8baf77@mail.gmail.com> <9f8092cc0910261451m8f67385i372622d5580d1666@mail.gmail.com> Message-ID: <4AE703A4.7060102@tamu.edu> Ye GODS, I remember the NeXT Cube. They were so cute, and neat looking, compared to the Sun I had next to my desk. John Hearns wrote: > 2009/10/26 John Hearns : >> 2009/10/26 Rahul Nabar : >>> True. I just thought that if my BMC is running a webserver it cannot >>> be all that stripped down. Maybe I am wrong and it is possible to >>> write a compact webserver. > > The address of the world's first webserver lives on at http://info.cern.ch/ > > The hardware is in a glass case at CERN. http://guides.macrumors.com/NeXT_Cube > "The NeXT Cube ran a 25 MhZ 68030 processor, and came with a mammoth 8 > MB of RAM. " > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mwill at penguincomputing.com Tue Oct 27 11:44:40 2009 From: mwill at penguincomputing.com (Michael Will) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Immediate need for multiple HPC Sys Admins (contract basis) Message-ID: <20091027184440.GA7059@miwi.penguincomputing.com> Let me know if Job postings are off-topic for the beowulf mailinglist, I don't remember having seen many of those. However in these economic times I presume most people would be happy to find that... Penguin has a Need for HPC Sys Admins (contract basis) We have an opportunity to put a team together to support a strategic customer for the world-wide installation of their HPC application. We are looking to contract with 10 or more HPC cluster/application engineers for a 4 month engagement starting immediately. Extensive travel is required, potentially anywhere in the world (Europe, Asia, Americas). Skill set includes Linux administration, file systems, network engineering, facilities planning (power+cooling), scientific background (Life Sciences a plus), familiarity with standard x86 64 hardware plus integration with existing corporate infrastructures. This is ASAP, starting ASAP. Competitive compensation is DOE. (enough acronyms ? :) ) If you are interested, please contact Tom Coull with a meaningful resume and references under tcoull@penguincomputing.com immediately. Best of luck - Michael Will From niftyompi at niftyegg.com Tue Oct 27 18:02:03 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Mature open source hierarchical storage management In-Reply-To: References: Message-ID: <20091028010203.GA26039@hpegg.wr.niftyegg.com> these On Fri, Oct 23, 2009 at 04:12:11PM +1100, Carl Thomas wrote: > Date: Fri, 23 Oct 2009 16:12:11 +1100 > We are currently in the midst of planning a major refresh of our existing > HPC cluster. Carl, Do add "PowerFile" to your research list. http://www.powerfile.com/ My back of the email envelope view of what you are doing should have quick cluster disks for binary objects, swap and libs /scratch /tmp and a largish NFS RAID based filesystem with an archival back end. 
Perhaps a large slow spinning disk staging RAID in the middle or off to the side too. There are multiple "delta equations" that you need to evaluate. I know I missed some - delta file change (GB/day). - performance delta at each layer. - cost delta at each layer. - management cost delta - operational cost delta - cost of compliance -- what the law requires, by method. - cost of physical storage on and off site, include handling and shipping. - cost of user training delta. - cost of expansion delta. - cost of necessary bandwidth, by layer. Clusters are unique in that they have the potential of hosting their own distributed RAID (lustre, gluster, zfs) and with a sufficient archival backend life could be good. Thus select systems that you can add a second disk to. Choice of filesystem can help too (see dmapi and friends). Have fun. mitch From greg.matthews at diamond.ac.uk Wed Oct 28 03:39:43 2009 From: greg.matthews at diamond.ac.uk (Gregory Matthews) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: <9f8092cc0910261445x7ce5f2d4m9c59e5906c8baf77@mail.gmail.com> References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0DA094FE@milexchmb1.mil.tagmclarengroup.com> <20091026185354.GA18478@nlxdcldnl2.cl.intel.com> <9f8092cc0910261445x7ce5f2d4m9c59e5906c8baf77@mail.gmail.com> Message-ID: <4AE81F6F.7050302@diamond.ac.uk> John Hearns wrote: > 2009/10/26 Rahul Nabar : >> True. I just thought that if my BMC is running a webserver it cannot >> be all that stripped down. Maybe I am wrong and it is possible to >> write a compact webserver. > > Google for Perl onleliner webserver and my favourite - the awk webserver... > -- Greg Matthews 01235 778658 Senior Computer Systems Administrator Diamond Light Source, Oxfordshire, UK From niftyompi at niftyegg.com Wed Oct 28 11:29:39 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Tilera targets Intel, AMD with 100-core processor In-Reply-To: References: <20091026112633.GH27331@leitl.org> <4AE5A8AC.4020208@noaa.gov> Message-ID: <20091028182939.GA7926@hpegg.wr.niftyegg.com> On Mon, Oct 26, 2009 at 12:32:04PM -0400, Mark Hahn wrote: > >> So unless your application sits in the on-core cache, I am wondering >> where the real benefit >> is going to be (ignoring the fact that the processor is still PCI-e > > "serve Web data" seems to be the target, as mentioned in the release. > that seems pretty fair, since web servers tend to have pretty small Add deep packet inspection to the list. Their literature is full of example data points for this. However with this many processing cores other things seem possible including encryption and decryption so firewalls and network management tools make sense as embedded targets. Their compiler and tool chain is apparently open source so when they open up that package a lot more will be known. PCI-e seems to be an issue but there are some other links on the part that might let large memory high internal bandwidth systems be built. If it can appear to be a general purpose system it might have value in the markets that Azul was targeting for their multiprocessor boxes where the application sitting in in core is a Java interpreter. Too bad the economic/investment environment is so upside down. This looks much more interesting than 10,000 lines of php and teen age web eyeball counts. 
I have not looked at the power budget but 10U of 100 cores per U is a lot of processor elements to harness. Interesting stuff. From hahn at mcmaster.ca Wed Oct 28 11:31:59 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: <4AE81F6F.7050302@diamond.ac.uk> References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0DA094FE@milexchmb1.mil.tagmclarengroup.com> <20091026185354.GA18478@nlxdcldnl2.cl.intel.com> <9f8092cc0910261445x7ce5f2d4m9c59e5906c8baf77@mail.gmail.com> <4AE81F6F.7050302@diamond.ac.uk> Message-ID: >> Google for Perl onleliner webserver > > and my favourite - the awk webserver... or inetd and sh. but really, my favorite webserver was fingerd ;) From hahn at mcmaster.ca Wed Oct 28 11:57:27 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Tilera targets Intel, AMD with 100-core processor In-Reply-To: <20091028182939.GA7926@hpegg.wr.niftyegg.com> References: <20091026112633.GH27331@leitl.org> <4AE5A8AC.4020208@noaa.gov> <20091028182939.GA7926@hpegg.wr.niftyegg.com> Message-ID: > open up that package a lot more will be known. PCI-e > seems to be an issue but there are some other links > on the part that might let large memory high internal bandwidth > systems be built. indeed. it's interesting to consider a small board with one tilera chip, 4x 64b ddr3/2133 dimms, and 8x 10G links (let's just suppose it's an 8d hypercube with sfp+ copper links). 6400 cores in a box at <4kw? the 100x chips is 2011 though: the upcoming chip is only 36-core, which makes it an interesting comparison to intel larrabee (rumored to also have 4x64b memory, and 32-48 cores in 2010.) From dnlombar at ichips.intel.com Thu Oct 29 08:37:16 2009 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] any creative ways to crash Linux?: does a shared NIC IMPI always remain responsive? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0DA09445@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0DA094FE@milexchmb1.mil.tagmclarengroup.com> <20091026185354.GA18478@nlxdcldnl2.cl.intel.com> Message-ID: <20091029153716.GB18478@nlxdcldnl2.cl.intel.com> On Mon, Oct 26, 2009 at 02:35:30PM -0700, Rahul Nabar wrote: > On Mon, Oct 26, 2009 at 1:53 PM, David N. Lombard > wrote: > > On Mon, Oct 26, 2009 at 11:18:33AM -0700, Hearns, John wrote: > > > > > Well, "running Linux" is very different from "running a full blown Linux". ?For > > example, I have a kernel, initrd, dhcp, shell, ability to mount file systems, > > kexec, ssh, and a boot loader in about 1.28MiB that I can flash onto a platform > > flash. > > True. I just thought that if my BMC is running a webserver it cannot > be all that stripped down. Maybe I am wrong and it is possible to > write a compact webserver. Maybe there's ways to serve out those html > pages without even having a webserver..... Busybox includes a web server. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. 
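Since "how small can a webserver be" keeps coming up in this thread (the Perl one-liner, the awk webserver, fingerd, busybox httpd), here is a rough sense of scale in C: a toy, single-connection HTTP responder that answers every request with one fixed status page. It is a sketch for illustration only, not any vendor's BMC firmware; the port number and page body are invented for the example.

/* Toy HTTP responder in the spirit of the one-liner webservers above:
 * no config, no files, no threads; every request gets the same page. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    const char *reply =
        "HTTP/1.0 200 OK\r\n"
        "Content-Type: text/html\r\n\r\n"
        "<html><body><h1>BMC status: OK</h1></body></html>\r\n";

    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8080);       /* arbitrary example port */

    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(srv, 8) < 0) {
        perror("bind/listen");
        return 1;
    }

    for (;;) {
        int c = accept(srv, NULL, NULL);
        if (c < 0)
            continue;
        char req[1024];
        read(c, req, sizeof(req));            /* ignore the request details */
        write(c, reply, strlen(reply));
        close(c);
    }
}

Stripped, something like this comes to a handful of kilobytes, which is why even a 1.28MiB flash image has room for a web interface alongside ssh.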
From dzaletnev at yandex.ru Wed Oct 28 22:40:55 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] OpenMPI and MPICH-1.2.5 Message-ID: <76671256794855@webmail90.yandex.ru> Hi, all During my postgraduate studies I take part in testing and, in future may be, in development of CFD-suite of my professor. I know how to start daemon ring and run an application on it in MPICH2, but: 1. I cannot do this on my PS3-cluster running YellowDogLinux because there's OpenMPI only. 2. I cannot reproduce bugs in our suite when running MPICH-1.2.5 on EM64T because I don't know how to install, start daemon ring running, and run an application on it in MPICH-1.2.5. So I'm interested in any advices how to install, run daemon ring, and run application in OpenMPI and MPICH-1.2.5 when running any x64 Linux (CentOS, Debian, OpenSuSE, etc.) or links on suitable documentation. Thank you in advance. -- Dmitry From gus at ldeo.columbia.edu Thu Oct 29 13:51:48 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] OpenMPI and MPICH-1.2.5 In-Reply-To: <76671256794855@webmail90.yandex.ru> References: <76671256794855@webmail90.yandex.ru> Message-ID: <4AEA0064.7080609@ldeo.columbia.edu> Hi Dmitry As far as I know, MPICH-1 doesn't use the mpd daemon. http://www.mcs.anl.gov/research/projects/mpi/mpich1/ http://www.mcs.anl.gov/research/projects/mpi/mpich1/docs/mpichman-chp4/node69.htm#Node69 Why do you want to use MPICH-1? It is not maintained anymore, and it uses the old P4 communication mechanism that often hangs when running on current Linux kernels. http://marc.info/?l=npaci-rocks-discussion&m=123175012813683&w=2 The MPICH developers recommend using MPICH2 instead: "With the exception of users requiring the communication of heterogeneous data, we strongly encourage everyone to consider switching to MPICH2. Researchers interested in using using MPICH as a base for their research into MPI implementations should definitely use MPICH2. " (excerpt from http://www.mcs.anl.gov/research/projects/mpi/mpich1/) OpenMPI and MPICH2 have more recent and more efficient communication channels. They will run your CFD code more efficiently, and will keep you out of trouble: http://www.open-mpi.org/ http://www.mcs.anl.gov/research/projects/mpich2/ My two cents. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Dmitry Zaletnev wrote: > Hi, all > > During my postgraduate studies I take part in testing and, in future may be, in development of CFD-suite of my professor. I know how to start daemon ring and run an application on it in MPICH2, but: > > 1. I cannot do this on my PS3-cluster running YellowDogLinux because there's OpenMPI only. > 2. I cannot reproduce bugs in our suite when running MPICH-1.2.5 on EM64T because I don't know how to install, start daemon ring running, and run an application on it in MPICH-1.2.5. > > So I'm interested in any advices how to install, run daemon ring, and run application in OpenMPI and MPICH-1.2.5 when running any x64 Linux (CentOS, Debian, OpenSuSE, etc.) or links on suitable documentation. > > Thank you in advance. From robh at dongle.org.uk Fri Oct 30 08:26:31 2009 From: robh at dongle.org.uk (Robert Horton) Date: Wed Nov 25 01:09:02 2009 Subject: [Beowulf] Storage recommendations? 
Message-ID: <1256916391.6856.225.camel@moelwyn.maths.qmul.ac.uk> Hi, I'm looking for some recommendations for a new "scratch" file server for our cluster. Rough requirements are:
- Around 20TB of storage
- Good performance with multiple NFS writes (it's quite a mixed workload, so hard to characterise further)
- Data security not massively important as it's just for scratch / temporary data.
It'll just be a single server serving NFS; I'm not looking to go down the Lustre / PVFS route. I'm thinking of:
- 3Ware 9690 24-port controller with
- 20 1TB SATA disks in RAID5 for the data
- 2 500GB SATA disks in RAID1 for the OS and
- 2 64GB SSDs in RAID1 for the journal for the data fs
with 2 quad-core 3.3GHz Xeon processors and 32GB of RAM. Anyone have any thoughts / suggestions on this? Thanks, Rob
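One cheap way to sanity-check whichever box is chosen is a crude parallel-write test run against the scratch mount from one or more clients, since "multiple NFS writes, hard to characterise" is at heart an aggregate-throughput question. The sketch below is a minimal C example, not a replacement for a proper tool such as fio or IOzone; the writer count, block size, and the /scratch/iotest path are arbitrary assumptions, and the target directory must already exist.

/* Rough parallel-write smoke test for a scratch NFS mount: spawns
 * NWRITERS processes, each streaming BLOCKS x BLOCKSIZE bytes to its
 * own file, then reports the aggregate rate. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/time.h>

#define NWRITERS   8
#define BLOCKSIZE  (1 << 20)      /* 1 MiB per write() */
#define BLOCKS     1024           /* 1 GiB per writer  */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    const char *dir = (argc > 1) ? argv[1] : "/scratch/iotest";
    double t0 = now();

    for (int i = 0; i < NWRITERS; i++) {
        if (fork() == 0) {
            char path[4096];
            char *buf = malloc(BLOCKSIZE);
            memset(buf, 0xAB, BLOCKSIZE);
            snprintf(path, sizeof(path), "%s/writer.%d", dir, i);
            int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0) { perror(path); _exit(1); }
            for (int b = 0; b < BLOCKS; b++)
                if (write(fd, buf, BLOCKSIZE) != BLOCKSIZE) { perror("write"); _exit(1); }
            fsync(fd);            /* make sure it really reached the server */
            close(fd);
            _exit(0);
        }
    }
    for (int i = 0; i < NWRITERS; i++)
        wait(NULL);

    double secs = now() - t0;
    double mib  = (double)NWRITERS * BLOCKS * BLOCKSIZE / (1 << 20);
    printf("%d writers, %.0f MiB total, %.1f s, %.1f MiB/s aggregate\n",
           NWRITERS, mib, secs, mib / secs);
    return 0;
}

Running it from several clients at once, and again with a much smaller block size, gives a rough feel for how the RAID5 array with an external SSD journal behaves under the mixed load described.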