From john.hearns at streamline-computing.com Tue Feb 1 09:34:32 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:44 2009 Subject: [Beowulf] Re: real hard drive failures In-Reply-To: References: Message-ID: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com> On Mon, 2005-01-31 at 14:14 -0500, Mark Hahn wrote: > > on that note, though - does anyone have comments about booting > machines from flash? > I've booted a mini-ITX system from flash, the distribution in question was a wireless access point. All you need is a CF to IDE adapter. It's common for firewall distributions, such as ipcop, to boot from flash. http://www.ipcop.org/1.4.0/en/install/html/mkflash.html I believe one wrinkle is to either log to a remote host, or if you log locally to log to a ramdisk and only write to the CF card at infrequent intervals. John Hearns ps. >sounds like putting mudflaps and a cattle bar on a city-SUV. Called Chelsea Tractors in the part of the world I live in.
From hahn at physics.mcmaster.ca Tue Feb 1 10:07:37 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:44 2009 Subject: [Beowulf] Re: real hard drive failures In-Reply-To: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com> Message-ID: > > on that note, though - does anyone have comments about booting > > machines from flash? > > > I've booted a mini-ITX system from flash, > the distribution in question was a wireless access point. > All you need is a CF to IDE adapter. I don't really see those much at all. perhaps I'm not using the right search terms. have you looked into booting from usb-flash? that would be very much dependent on bios, of course, but far more accessible. thanks, mark hahn.
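[Editorial note: John's "log to a ramdisk" wrinkle is typically done with tmpfs mounts, so the read-mostly CF card is never written during normal operation. A minimal /etc/fstab sketch; the device name, sizes and mount points here are illustrative assumptions, not taken from ipcop or any particular distribution:]

```
# illustrative fstab for a CF-booted node: root read-only, volatile paths in RAM
/dev/hda1  /         ext2   ro,noatime          0 0   # the CF card, behind a CF-to-IDE adapter
tmpfs      /tmp      tmpfs  size=16m,mode=1777  0 0
tmpfs      /var/log  tmpfs  size=16m            0 0   # or ship logs to a remote syslog host instead
tmpfs      /var/run  tmpfs  size=4m             0 0
```

Remounting read-write for an occasional flush of logs to flash (mount -o remount,rw /) then becomes a deliberate, infrequent act rather than a constant trickle of writes.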
From James.P.Lux at jpl.nasa.gov Tue Feb 1 10:32:35 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:44 2009 Subject: [Beowulf] Re: real hard drive failures References: Message-ID: <6.1.1.1.2.20050201103123.0417a670@mail.jpl.nasa.gov> At 09:34 AM 2/1/2005, John Hearns wrote: >On Mon, 2005-01-31 at 14:14 -0500, Mark Hahn wrote: > > > > on that note, though - does anyone have comments about booting > > machines from flash? > > >I've booted a mini-ITX system from flash, >the distribution in question was a wireless access point. >All you need is a CF to IDE adapter. > >Its common to have firewall distributions, such as ipcop, >to boot from flash. >http://www.ipcop.org/1.4.0/en/install/html/mkflash.html > >I believe one wrinkle is to either log to a remote host, >or if you log locally to log to a ramdisk and only write to >the CF card at infrequent intervals. > >John Hearns > >ps. > >sounds like putting mudflaps and a cattle bar on a city-SUV. > >Called Chelsea Tractors in the part of the world I live in. > I boot mini-ITX systems from flash, and also via PXE, both wireless and wired. As John says, you need a CF to IDE adapter, which in my case is combined with the unregulated 12VDC to ATX power supply, a watchdog timer, and some other hardware. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From James.P.Lux at jpl.nasa.gov Tue Feb 1 10:50:28 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:44 2009 Subject: [Beowulf] Re: real hard drive failures References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com> Message-ID: <6.1.1.1.2.20050201104425.041f1938@mail.jpl.nasa.gov> At 10:07 AM 2/1/2005, Mark Hahn wrote: > > > on that note, though - does anyone have comments about booting > > > machines from flash? 
> > > > > I've booted a mini-ITX system from flash, > > the distribution in question was a wireless access point. > > All you need is a CF to IDE adapter. > >I don't really see those much at all. perhaps I'm not using >the right search terms. Try JKMicrodevices or ituner.com or www.mini-itx.com or www.damnsmalllinux.org or www.logicsupply.com or www.epiacenter.com (google for "compact flash mini-itx" ) They run about $15-$20, depending on configuration, and there's nothing special about them for Mini-ITX.. they should work on anything. There ARE rumored to be "difficulties" with how the CF is formatted in some contexts, but I don't know any details. Maybe it has to do with whether the BIOS supports the "virtual" head, track, sector details? I've also heard that one cannot boot "Win xx" from CF, but I see no reason why this would be so (it's a disk drive, after all...) Maybe with a PCI<>CF adapter it's a problem? >have you looked into booting from usb-flash? that would be very >much dependent on bios, of course, but far more accessible. Oooooh... that didn't work so well for me on the various machines I tried it on. The IDE/CF is essentially bios independent (to the BIOS, it just looks like another IDE drive). The USB drive has to have all the USB stuff up and running first. >thanks, mark hahn. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875
From alvin at Mail.Linux-Consulting.com Tue Feb 1 13:58:51 2005 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:44 2009 Subject: [Beowulf] Re: real hard drive failures In-Reply-To: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com> Message-ID: On Tue, 1 Feb 2005, John Hearns wrote: > > on that note, though - does anyone have comments about booting > > machines from flash?
> > > I've booted a mini-ITX system from flash, > the distribution in question was a wireless access point. > All you need is a CF to IDE adapter. ANY system can be booted from CF ... and for an AP, you'd probably want to boot off a usb stick since those are presumably hotswappable whereas CF is not. there are lots of "CF - ide adaptors" ... pcengine.ch makes 'em and resells to the folks in the list jim posted ( ituner(mini-box), logicsupply, etc ... ) they also have those that plug the CF into the ide port on the motherboard - but, i haven't seen any hotswap cf-ide adaptors yet though > Its common to have firewall distributions, such as ipcop, > to boot from flash. > http://www.ipcop.org/1.4.0/en/install/html/mkflash.html installing to a 128MB or 256MB CF implies that you install the minimum packages ( glibc + networking ) and have the rest of your binaries on nfs-server:/usr/local/cluster-stuff which gets automounted onto the CF-based nodes it'd be good to keep a master CF node ( minimal system install ) so that it can be updated and patched as needed in one place, and those patch files also make it into the next CF release for the other nodes -- or don't patch the cf after it's made :-) > I believe one wrinkle is to either log to a remote host, > or if you log locally to log to a ramdisk and only write to > the CF card at infrequent intervals. writing to CF is good and bad ...
since it has limited write capabilities, but there's not much writing that needs to be done, and even if there is, one can write all the system data to /dev/ramdisk instead of CF. the CF can be mounted read-only c ya alvin
From john.hearns at streamline-computing.com Wed Feb 2 08:38:38 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:44 2009 Subject: [Beowulf] Re: real hard drive failures In-Reply-To: References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com> Message-ID: <33904.143.167.3.70.1107362318.squirrel@webmail.streamline-computing.com> >> > on that note, though - does anyone have comments about booting >> > machines from flash? >> > > have you looked into booting from usb-flash? that would be very > much dependent on bios, of course, but far more accessible. > Indeed, as Alvin says any system can be booted from a CF. Some mini-ITX cases come with a little slot, which makes changing the CF card easy. I agree with the USB comment - I always travel with a USB stick which has Stresslinux on it. www.stresslinux.org This is a little distro which has lm_sensors, cpu_burn etc. on it, plus memtest. Invaluable for the roaming engineer :-)
From list-beowulf at onerussian.com Thu Feb 3 19:20:05 2005 From: list-beowulf at onerussian.com (Yaroslav Halchenko) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] NFS over TCP or smth else... WHAT I've done wrong? Message-ID: <20050204032005.GB2444@washoe.rutgers.edu> Dear Beowulfers, Today is a sad day for our 25-node cluster: I decided to improve its performance and as a result I crippled it quite a lot. The story is that for some reason many nodes started losing connection with the NFS server node, so I started looking for a solution and decided to try NFS over TCP.
After I've adjusted configs across the cluster (cfengine rulez), even rebooted the nodes (besides the main one) for the sake of it, and put a slight load on the cluster (occupied 6 nodes with intensive I/O reading/writing data from the NFS server), pretty much all of the 60 nfsd instances started occupying CPU on the main node, so load reached around 20 or 30, which is an astronomical number... The main node (NFS server) started to behave unresponsively and started killing applications with the reason "running out of memory". So what is wrong with the following config: vana:/raid /raid nfs defaults,tcp,hard,rw,nosuid,wsize=8192,rsize=8192 ? Later I adjusted it with bg,timeo=60,noatime to reduce the load, but it didn't quite help. details about cluster: 23 active nodes at the moment running 2.6.8.1 SMP, main node with 8GB, RPCNFSDCOUNT=70, nfs-kernel-server What would be the best NFS config for it if we provide two directories from the NFS server: /raid as rw,sync and /share/apps as ro,async Thank you in advance P.S. BTW - here is the dump from the "killing mess":
Fixed up OOM kill of mm-less task
oom-killer: gfp_mask=0xd0
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
Free pages: 2969440kB (2966528kB HighMem)
Active:506964 inactive:611412 dirty:461 writeback:0 unstable:0 free:742360 slab:193835 mapped:269296 pagetables:2983
DMA free:1048kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB protections[]: 8 476 732
Normal free:1864kB min:936kB low:1872kB high:2808kB active:32632kB inactive:21288kB present:901120kB protections[]: 0 468 724
HighMem free:2966528kB
min:512kB low:1024kB high:1536kB active:1995096kB inactive:2424488kB present:7471104kB protections[]: 0 0 256
DMA: 0*4kB 15*8kB 10*16kB 8*32kB 2*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1048kB
Normal: 14*4kB 2*8kB 0*16kB 38*32kB 1*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1864kB
HighMem: 0*4kB 0*8kB 0*16kB 48126*32kB 16915*64kB 2081*128kB 109*256kB 57*512kB 20*1024kB 0*2048kB 0*4096kB = 2966528kB
Swap cache: add 538373, delete 522525, find 54148646/54172304, race 0+5
Out of Memory: Killed process 17465 (gnome-settings-).
-- .-. =------------------------------ /v\ ----------------------------= Keep in touch // \\ (yoh@|www.)onerussian.com Yaroslav Halchenko /( )\ ICQ#: 60653192 Linux User ^^-^^ [175555] Key http://www.onerussian.com/gpg-yoh.asc GPG fingerprint 3BB6 E124 0643 A615 6F00 6854 8D11 4563 75C0 24C8
-------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20050203/a3da36ad/attachment.bin
From wytsang at clustertech.com Tue Feb 1 02:42:37 2005 From: wytsang at clustertech.com (Clotho) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] IntelMPITEST-1.0 compiled with icc in heterogeneous environment Message-ID: <41FF5D1D.50903@clustertech.com> Hi, I would like to ask a question about using icc to compile IntelMPITEST-1.0 and run the program in a heterogeneous environment. I have an i386 node and a x86_64 node. I have configured and compiled the IntelMPITEST-1.0 testsuite on the i386 node. I run the testsuite on the i386 node, and use "mpirun -machinefile" to run the binary on both nodes. I have tried the test with the gcc and pgi compilers, and they work.
But for icc8, I have encountered an error in c/blocking/functional/MPI_Ssend_ator The error message is very long, but has a pattern like:
MPITEST error (3): i=0, long double value= -0.0000000000, expected 0.0000000000
MPITEST error (3): 10 errors in buffer (3,0,13) len 8 commsize 4 commtype -10 data_type 13 root 3
MPITEST error (3): Send/Receive lengths differ - Sender(node/length)=0/8, Receiver(node/length)=3/-32766
MPITEST error (3): i=0, long double value= -0.0000000000, expected 0.0000000000
MPITEST error (3): 117 errors in buffer (4,0,13) len 83 commsize 4 commtype -10 data_type 13 root 3
MPITEST error (3): Send/Receive lengths differ - Sender(node/length)=0/83, Receiver(node/length)=3/-32766
All the errors are related to data_type 13 and 14. This error does not happen when I run the tests on 2 i386 nodes. Do you have any idea about the problem? Thank you. PS. I find that the error message is produced from "libmpitest.c" line 2361. And I find that one of the many compilation warnings is related to that line: ./libmpitest.c(2361): warning #181: argument is incompatible with corresponding format string conversion i, ((derived1 *)buffer)[i].LongDouble[k], Maybe it's related; I am not sure.
From denis.che at gmail.com Tue Feb 1 07:18:22 2005 From: denis.che at gmail.com (Denis) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Re: Max common block size, global array size on ia32 Message-ID: >A more involved fix is to change the location of the shared >libraries in memory by changing kernel. Look for the variable >__PAGE_OFFSET in the kernel header files. How exactly do you go about doing this? I know how to compile/recompile a kernel, but I have no idea as to how to implement this fix... I have a similar machine... Dual Xeon 2.2-GHz with 2GB RAM and exactly the same problem with mem limitations for a single fixed-size array...
Thanks
From Kris.Boutilier at scrd.bc.ca Tue Feb 1 10:23:05 2005 From: Kris.Boutilier at scrd.bc.ca (Kris Boutilier) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Re: real hard drive failures Message-ID: There's quite an elegant set of scripts available at http://gate-bunker.p6.msu.ru/~berk/router.html to tweak a standard debian installation to boot from an IDE device and run entirely from tmpfs from that point on, thereby avoiding the 'worn out' compact flash problem. Targeted at router applications but certainly useful for other semi-embedded applications. > -----Original Message----- > From: John Hearns [SMTP:john.hearns@streamline-computing.com] > Sent: Tuesday, February 01, 2005 9:35 AM > To: beowulf@beowulf.org > Subject: Re: [Beowulf] Re: real hard drive failures > > On Mon, 2005-01-31 at 14:14 -0500, Mark Hahn wrote: > > > > on that note, though - does anyone have comments about booting > > machines from flash? > > > I've booted a mini-ITX system from flash, > the distribution in question was a wireless access point. > All you need is a CF to IDE adapter. > > Its common to have firewall distributions, such as ipcop, > to boot from flash. > http://www.ipcop.org/1.4.0/en/install/html/mkflash.html > {clip}
From dwu at swales.com Tue Feb 1 11:18:05 2005 From: dwu at swales.com (Dominic Wu) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Re: real hard drive failures References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com> <6.1.1.1.2.20050201104425.041f1938@mail.jpl.nasa.gov> Message-ID: <003701c50892$c2876df0$69704e89@jpl.nasa.gov> It is motherboard dependent, and if your BIOS supports USB boot (most newer ones do), there should be no problem in theory. That said, booting from CF (or any solid-state/microdrive device) via an IDE interface is still probably easier, with fewer drivers to load. > > >have you looked into booting from usb-flash? that would be very > >much dependent on bios, of course, but far more accessible.
> > Oooooh... that didn't work so well for me on the various machines I tried > it on. The IDE/CF is essentially bios independent (to the BIOS, it just > looks link another IDE drive). The USB drive has to have all the USB stuff > up and running first. > > > >thanks, mark hahn. > > James Lux, P.E. > Spacecraft Radio Frequency Subsystems Group > Flight Communications Systems Section > Jet Propulsion Laboratory, Mail Stop 161-213 > 4800 Oak Grove Drive > Pasadena CA 91109 > tel: (818)354-2075 > fax: (818)393-6875 > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > >
From Glen.Gardner at verizon.net Tue Feb 1 17:20:00 2005 From: Glen.Gardner at verizon.net (Glen Gardner) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Re: real hard drive failures References: Message-ID: <42002AC0.3090806@verizon.net> USB flash is really slow. Regular CF (@ 128 KB/s writes) on a cf to ide adapter is a lot faster (particularly write speed) than USB flash (@ 64 KB/s write speed) "thumb drives". I've had good luck with IBM microdrives, but CF is getting cheaper than microdrives. Of course, the microdrives are a lot faster (@ 1MB/s R/W) than CF on write. But CF is pretty fast on read (10 MB/s ??). CF has a limited number of writes before it fails, anywhere from 100K to 1M write cycles. The time for write cycles is typically anywhere from 300 milliseconds to 500 milliseconds for a 32 KB chunk for regular CF. Typically you write a chunk of CF at once in each write cycle, and 32KB is a typical figure for that (but it varies with the particular memory chips used). This is why CF is so awfully slow when writing. Using serial CF makes it even worse, which is one reason why USB thumb drives are even slower than regular CF cards.
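[Editorial note: Glen's throughput figures are easy to sanity-check with a bit of arithmetic. A sketch using only the write rates quoted above; the 512 MB image size is just an example, and real cards vary:]

```python
# Rough time to write a 512 MB disk image at the sustained write
# rates quoted in the message above (regular CF vs. USB "thumb drive").
IMAGE_KB = 512 * 1024  # 512 MB image, in KB

def minutes_to_write(rate_kb_per_s):
    """Minutes to write the whole image at a given sustained rate."""
    return IMAGE_KB / rate_kb_per_s / 60.0

cf_min = minutes_to_write(128)   # CF on a CF-to-IDE adapter
usb_min = minutes_to_write(64)   # USB flash

print(f"CF:  ~{cf_min:.0f} minutes")   # ~68 min, i.e. about an hour
print(f"USB: ~{usb_min:.0f} minutes")  # ~137 min, roughly twice that
```

The results line up with the install-time estimates in the message: about an hour for a 512 MB CF image, and about twice that over USB.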
CF is okay for booting a system from, but things like /tmp, /var are best mounted in a memory file system and only written to cf when shutting down. Swap partitions and /home need to be mounted via NFS. At present, I have two nodes of a 14 node cluster booting from CF, and /home is mounted on another machine with a proper hard drive via NFS. Ten of the nodes are booting from microdrives, and two nodes have ata 133 hard drives for /home, development and backups. /var, /tmp and swap are actually mounted on the cf card, and I'm waiting to see how long before the cf actually expires. These nodes have been up 24/7 for over a month now, with no problems. I have not tried to force the nodes to swap. For saving power and reducing heat, CF is going to be the best you can get. Microdrives are almost as good, laptop drives are pretty good, and a regular IDE drive is a pig in comparison. I use a USB thumb drive with a bootable OS on it as an emergency boot drive. It comes in handy when installing a node. Since I use microdrives, all I do is shut down the node, plug the new microdrive into the cf adapter and the usb thumb drive into the usb port, and turn the node on; it boots from USB so I can then install a system image stored on the development node onto the new microdrive via an NFS mount. It takes about 5 minutes to install and configure a new node in this fashion. Writing the disk image to a 512 MB cf card is going to take up to an hour, and plan on at least twice that to write a disk image to a 512 MB USB flash. (CF is just plain slow) Glen Mark Hahn wrote: >>>on that note, though - does anyone have comments about booting >>>machines from flash? >>> >>> >>> >>I've booted a mini-ITX system from flash, >>the distribution in question was a wireless access point. >>All you need is a CF to IDE adapter. >> >> > >I don't really see those much at all. perhaps I'm not using >the right search terms. > >have you looked into booting from usb-flash?
that would be very >much dependent on bios, of course, but far more accessible. > >thanks, mark hahn. > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > -- Glen E. Gardner, Jr. AA8C AMSAT MEMBER 10593 Glen.Gardner@verizon.net http://members.bellatlantic.net/~vze24qhw/index.html From award at andorra.ad Tue Feb 1 23:25:01 2005 From: award at andorra.ad (Alan Ward i Koeck) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Re: real hard drive failures References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com> <6.1.1.1.2.20050201104425.041f1938@mail.jpl.nasa.gov> Message-ID: <4200804D.66B07502@andorra.ad> Jim Lux wrote: > > At 10:07 AM 2/1/2005, Mark Hahn wrote: > > >have you looked into booting from usb-flash? that would be very > >much dependent on bios, of course, but far more accessible. > > Oooooh... that didn't work so well for me on the various machines I tried > it on. The IDE/CF is essentially bios independent (to the BIOS, it just > looks link another IDE drive). The USB drive has to have all the USB stuff > up and running first. Done that, though I had to use a kernel diskette with USB et al compiled in. My BIOS could only boot from a USB external hard drive/CD, not flash. Alan Ward > >thanks, mark hahn. > > James Lux, P.E. 
> Spacecraft Radio Frequency Subsystems Group > Flight Communications Systems Section > Jet Propulsion Laboratory, Mail Stop 161-213 > 4800 Oak Grove Drive > Pasadena CA 91109 > tel: (818)354-2075 > fax: (818)393-6875 > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
From mtpratol at cs.sfu.ca Wed Feb 2 12:54:28 2005 From: mtpratol at cs.sfu.ca (Matthew Pratola) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends Message-ID: Hi all, Can anyone recommend a simple web frontend for submitting SGE jobs? Thanks, Matthew Pratola M.Sc. Candidate Dept. of Statistics and Actuarial Science Simon Fraser University Vancouver, BC, CANADA
From diep at xs4all.nl Wed Feb 2 19:53:27 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies Message-ID: <3.0.32.20050203045323.01002100@pop.xs4all.nl> Good morning! With the intention to run my chessprogram on a beowulf to be constructed here (starting with 2 dual-k7 machines here) i'd better get some good advice on which network to buy. The only interesting thing is how fast each node can read out 64 bytes randomly from the RAM of some remote cpu. All nodes do that simultaneously. The faster this can be done, the better the algorithmic speedup for parallel search in a chess program (property of YBW, see publications in the journal of the icga: www.icga.org). This speedup is exponential (or rather, you get punished exponentially compared to single cpu performance). Which network cards with the lowest latencies can be used, considering my small budget? quadrics/dolphin seems a bit out of my price range. Myrinet is like 684 euro per card when i altavista'ed online, and i wonder how to get more than 2 nodes to work without a switch. Perhaps there are low-cost switches with reasonable low latency?
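[Editorial note: for anyone wanting to put a number on "low latency" before buying cards, a tiny round-trip probe is easy to write. The sketch below is illustrative only: it is Python over TCP loopback, so it measures the software echo path rather than a real NIC or an MPI stack; pointing the client at another host would give a crude bound, with the one-way latency approximated as half the round trip:]

```python
# Minimal round-trip latency probe for 64-byte messages.
# Loopback-only sketch: a server thread echoes each 64-byte message,
# and the client times N round trips.
import socket
import threading
import time

N = 1000
MSG = b"x" * 64  # the 64-byte payload discussed in the thread

def recv_exact(sock, n):
    """Read exactly n bytes (TCP recv may return short reads)."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed early")
        data += chunk
    return data

srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

def echo_server():
    conn, _ = srv.accept()
    with conn:
        for _ in range(N):
            conn.sendall(recv_exact(conn, 64))

t = threading.Thread(target=echo_server)
t.start()

cli = socket.create_connection(("127.0.0.1", port))
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # don't batch tiny writes

start = time.perf_counter()
for _ in range(N):
    cli.sendall(MSG)
    recv_exact(cli, 64)
elapsed = time.perf_counter() - start

cli.close()
t.join()
srv.close()

rtt_us = elapsed / N * 1e6
print(f"avg round trip: {rtt_us:.1f} us, one-way bound ~{rtt_us / 2:.1f} us")
```

This is the same "multiply pingpong time by 2" bookkeeping discussed later in the thread, just with the measurement loop written out.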
Please note MPI is probably what i'll use, though i keep finding online information about 'gamma'. Does that have lower latency than MPI implementations? Note: normal 1Gbit cards handle the normal network traffic. Each node is an SMP or NUMA node, and not only multiprocessor but also multithreaded. I welcome any advice, Best regards, Vincent Vincent Diepeveen
From rhamann at uccs.edu Wed Feb 2 23:56:13 2005 From: rhamann at uccs.edu (R Hamann) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] MPICH2: Handle Limit? Message-ID: I've been having some strange problems with a program using the MPICH2 library. When I added some new datatypes for ghost cell exchange, the program would hang. I figured out that any number of handles over 84 would cause this. Fortunately, I could delete some handles that I no longer needed, but it still seemed strange. Are my calculations correct that for each process there is an 84 handle limit? Or am I seeing some other problem? Ron
From maurice at harddata.com Thu Feb 3 16:05:19 2005 From: maurice at harddata.com (Maurice Hilarius) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Re: Booting from flash ( was Re: Re: real hard drive failures) In-Reply-To: <200501311938.j0VJc3lt003632@bluewest.scyld.com> References: <200501311938.j0VJc3lt003632@bluewest.scyld.com> Message-ID: <4202BC3F.4070401@harddata.com> >From: Mark Hahn >Subject: Re: [Beowulf] Re: real hard drive failures >... > >on that note, though - does anyone have comments about booting >machines from flash? > > Compact Flash (CF) IS an ATA device, and requires no specific drivers other than the standard kernel ATA driver. CF slot reader/writers are now under $25, and as a matter of fact we offer this as an option on both our tower workstations and in our rack chassis. Recent prices on CF are at $50 or less for 512MB, so a "CD sized" boot image flash device is now trivial. If you look inside a Force10 network switch you will see the OS and firmware are loaded on a flash card.
You can even buy CF packaged in a device that is a 40 pin female "dongle" that plugs directly to the motherboard HD IDE slot. These go for around $100 for 512MB Other flash types, like SD, XD, Memory stick, &c do not have the AT interface built in, so a chip and driver are needed to use them, pretty well ruling them out as useful for boot devices, unless you write the driver into BIOS, on, for example, LinuxBIOS. With our best regards, Maurice W. Hilarius Telephone: 01-780-456-9771 Hard Data Ltd. FAX: 01-780-456-9772 11060 - 166 Avenue email:maurice@harddata.com Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 This email, message, and content, should be considered confidential, and is the copyrighted property of Hard Data Ltd., unless stated otherwise. From rgb at phy.duke.edu Fri Feb 4 03:55:48 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: References: Message-ID: On Wed, 2 Feb 2005, Matthew Pratola wrote: > Hi all, > > Can anyone recommend a simple web frontend for submitting SGE jobs? http://www.globus.org/ One stop shopping. rgb > > Thanks, > > Matthew Pratola > M.Sc. Candidate > Dept. of Statistics and Actuarial Science > Simon Fraser University > Vancouver, BC, CANADA > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Fri Feb 4 05:20:48 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: References: Message-ID: <420376B0.7000107@scalableinformatics.com> Robert G. 
Brown wrote: > On Wed, 2 Feb 2005, Matthew Pratola wrote: > > >>Hi all, >> >>Can anyone recommend a simple web frontend for submitting SGE jobs? > > > http://www.globus.org/ > > One stop shopping. Did I miss something? Was a tongue planted in cheek with this reply? As far as I know there are very few web interfaces for running SGE (or LSF, or ...) jobs. If I am wrong, please do provide links/references. Globus is not a web interface (last I checked), but a large group of middleware to manage something that looks a lot closer to the definition of a grid than SGE. SGE is a job scheduler (with a name "engineered" to make you think it is a one-stop-shop as a grid-in-a-box). My company is interested in (and we are developing) web portals for end-user cluster work, so if you know of any, we would like to hear about them. Good open-source platforms that are current/supported could be worth looking at (and will save us time/development effort). There seem to be lots of bits of abandonware in the grid portal/user-interface area. We don't want to re-invent wheels, but at the same time, we don't want to adopt abandoned ones either. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615
From rene at renestorm.de Fri Feb 4 03:12:39 2005 From: rene at renestorm.de (rene) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: References: Message-ID: <200502041212.39511.rene@renestorm.de> Hi Matthew, i was working on/thinking about that a year ago, and my opinion is: "You don't want to do that." But there are some questions: Do you want to go public with that little webpage? Do you want to execute common sge jobs or is it just one application? Do these jobs have input data? How complex is your authorization hierarchy? What do you do with the next sge release?
How do you share the results and the status with the users? There are web frontends for cluster applications out there, e.g. NCBI's BLAST, but I've never heard of one for common jobs. Cya > Hi all, > > Can anyone recommend a simple web frontend for submitting SGE jobs? > > Thanks, > > Matthew Pratola > M.Sc. Candidate > Dept. of Statistics and Actuarial Science > Simon Fraser University > Vancouver, BC, CANADA > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Rene Storm @Cluster
From diep at xs4all.nl Fri Feb 4 04:35:22 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies Message-ID: <3.0.32.20050204133518.01007860@pop.xs4all.nl> At 00:29 4-2-2005 -0800, Bill Broadley wrote: >On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: >> Good morning! >> >> With the intention to run my chessprogram on a beowulf to be constructed >> here (starting with 2 dual-k7 machines here) i better get some good advice >> on which network to buy. Only interesting thing is how fast each node can >> read out 64 bytes randomly from RAM of some remote cpu. All nodes do that >> simultaneously. >Is there any way to do this less often with a larger transfer? >If you >wrote a small benchmark that did only that (send 64 bytes randomly >from a large array in memory) and make it easy to download, build, run, >and report results, I suspect some people would. One way pingpong with 64 bytes will do great. Shared memory examples i have plenty, but one-way pingpong approximates it excellently. Just multiply the time by 2 and one knows the bound :) >> The faster this can be done the better the algorithmic speedup for parallel >> search in a chess program (property of YBW, see publications in journal of >> icga: www.icga.org).
This speedup is exponential (or better you get >> punished exponential compared to single cpu performance). >> >> Which network cards considering my small budget are having lowest latencies >> can be used? > >Define small budget. For more than 2 nodes myrinet needs a switch. >Do you expect to be totally network latency bound? How low is enough >to keep the processors busy? CPU's are 100% busy, and after i know how many requests a second the network can handle in theory, i will do more probes per second to the hashtable. The more probes i can do, the better for the game tree search. >> quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per >> card when i altavista'ed online and i wonder how to get more than 2 nodes >> to work without switch. Perhaps there is low cost switches with reasonable >> low latency? >Do you know that gigabit is too high latency? The few one-way pingpong times i can find online for gigabit cards are not exactly promising, to say it very politely. Something in the order of 50 us one-way pingpong time i don't even consider worth taking a look at. Each year cpu's get faster. For small networks 10 us really is the upper limit. >Can't you send enough >work, like say search 3 moves ahead on the head node, then for each legal >move send that search tree to a different node? Each node would reply with >the highest ranked moves when done. Let's not discuss the parallel chess algorithm too much in depth. 100 different algorithms/enhancements get combined with each other. They are not the biggest latency problem. The latency problem is caused by the hashtable. The hashtable is a big cache. The bigger the better. It avoids re-searching the same tree again. In games like chess, and in every search terrain (even simulated flight), you can get back to the same spot by different means, causing a transposition. Like suppose you start the game with 1.e4,e5 2.d4; that leads to the same position as 1.d4,e5 2.e4.
So if we have already searched 1.e4,e5 2.d4, we store that position P into a large cache. Other CPUs first want to know whether we already searched that position. Those hashtable positions get created quite quickly. Deep Blue created them at 100 million positions a second and simply didn't store the vast majority in the hashtable (that would be hard, as it was in hardware). That's one of the reasons why it searched only 10-12 ply; already in 1999 that was no longer spectacular, when 4-processor PCs showed up at the world champs. At a PC with a shared hashtable nowadays I get 10-12 ply (a ply is a half move; a full move is when both sides make a move) in a few seconds, searching 100000 positions per second per CPU. So before we start searching every node (= position) we quickly want to find out whether other CPUs already searched it. At the Origin3800 at 512 processors I used a 115 GB hashtable (I started the search at 460 processors). Simply because the machine has 512GB RAM. So in short you take everything you can get. The search works with internal iterative deepening, which means we first search 1 ply, then 2 ply, then 3 ply and so on. The time it takes to get to the next iteration I hereby define as the branching factor (Knuth has a different definition, as he took into account just 1 algorithm; today's definition looks more appropriate). In order to search 1 ply deeper it's obviously important to maintain a good branching factor. I'm very bad at writing out mathematical proofs, but it's obvious that the more memory we use, the more we can reduce the number of legal moves in this position P, as in the next few ply it might be in the hashtable, which trivially makes the time needed to search 1 ply deeper shorter. Storing closer to the root (the position where we started searching) is of course more important than near the leafs of the search tree.
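A minimal sketch of the internal iterative deepening loop described above. The "search" here is a hypothetical stand-in with a geometric cost per depth, not a real chess search; only the driver loop is the point.

```python
import time

def iterative_deepening(search, max_depth, budget_seconds):
    """Search depth 1, then 2, then 3, ... until the depth or time
    budget runs out.  The ratio of successive per-iteration times is
    the effective branching factor in the sense used in the post."""
    best, times = None, []
    start = time.perf_counter()
    for depth in range(1, max_depth + 1):
        t0 = time.perf_counter()
        best = search(depth)           # result of the deepest finished iteration
        times.append(time.perf_counter() - t0)
        if time.perf_counter() - start > budget_seconds:
            break                      # out of time: keep the last full result
    return best, times

def fake_search(depth):
    """Placeholder whose cost grows like a branching factor of ~3."""
    for _ in range(3 ** depth):
        pass
    return depth

best, times = iterative_deepening(fake_search, 12, 0.5)
```

A good hashtable lowers the cost ratio between iterations, which is exactly what buys the extra ply within the same time budget.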
When, for example, not storing in the hashtable the last 10 ply near the leafs, in an overnight experiment the search depth at 460 processors dropped from 20 ply to 13 ply. Of course each supercomputer processor is dead slow for game tree search (it's branchy, 100% integer work, completely knocking down the caches), so compared to PCs you already start at a disadvantage of a factor of 16 or so, before you start searching (in the case of TERAS I had to fight with outdated 500MHz MIPS processors against Opterons and high-clocked quad Xeons), so upgrading my own network cards is more clever. Yet getting yourself a network, even between a few nodes, as quick as those supercomputers is not so easy... Additionally, you can first decently test on your own Beowulf network before playing a tournament, and without good testing on the machine you play at in tournaments you have a hard 0% chance that it plays well. The only thing in software that matters is testing. >-- >Bill Broadley >Computational Science and Engineering >UC Davis From ilumb at platform.com Fri Feb 4 06:06:14 2005 From: ilumb at platform.com (Ian Lumb) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends Message-ID: <4AB0624F069DAD4E90F18B13A818EEFE016B50A6@catoexm04.noam.corp.platform.com> Open Source GridPort (www.gridport.net) merits consideration. It interfaces with SGE, LSF, PBS, etc., via Globus. And NICE EnginFrame (http://www.enginframe.com) is a commercial offering which already has customizations for the Life Sciences. For the record, we provide our own Web GUI with Platform LSF, and make use of these portals as required. -Ian -----Original Message----- From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org]On Behalf Of Joe Landman Sent: Friday, February 04, 2005 8:21 AM To: Robert G. Brown Cc: Matthew Pratola; beowulf@beowulf.org Subject: Re: [Beowulf] SGE web frontends Robert G.
Brown wrote: > On Wed, 2 Feb 2005, Matthew Pratola wrote: > > >>Hi all, >> >>Can anyone recommend a simple web frontend for submitting SGE jobs? > > > http://www.globus.org/ > > One stop shopping. Did I miss something? Was a tongue planted in cheek with this reply? As far as I know there are very few web interfaces to running SGE (or LSF, or ...) jobs. If I am wrong please do provide links/references. Globus is not a web interface (last I checked), but a large group of middleware to manage something that looks a lot closer to the definition of a grid than SGE. SGE is a job scheduler (with a name "engineered" to make you think it is a one-stop-shop as a grid-in-a-box). My company is interested in (and we are developing) web portals for end user cluster work, so if you know of any, we would like to hear about them. Good open-source platforms that are current/supported could be worth looking at (and will save us time/development effort). There seem to be lots of bits of abandonware in the grid portal/user-interface area. We don't want to re-invent wheels, but at the same time, we don't want to adopt abandoned ones either. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From laurence at scalablesystems.com Fri Feb 4 06:08:09 2005 From: laurence at scalablesystems.com (Laurence Liew) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: <420376B0.7000107@scalableinformatics.com> References: <420376B0.7000107@scalableinformatics.com> Message-ID: <420381C9.808@scalablesystems.com> Hi all We have the SGE web interface. It integrates into our Rocks cluster management web interface. 
That is you NEED to use ROCKS (www.rockscluster.org) You can download RxC from www.scalablesystems.com It is free for non-commercial, academic use. It provides web based: - SGE management - SGE job submission - some basic reporting - and of course managing a Rocks cluster via a web interface. Have fun. Laurence Joe Landman wrote: > > > Robert G. Brown wrote: > >> On Wed, 2 Feb 2005, Matthew Pratola wrote: >> >> >>> Hi all, >>> >>> Can anyone recommend a simple web frontend for submitting SGE jobs? >> >> >> >> http://www.globus.org/ >> >> One stop shopping. > > > Did I miss something? Was a tongue planted in cheek with this reply? > > As far as I know there are very few web interfaces to running SGE (or > LSF, or ...) jobs. If I am wrong please do provide links/references. > > Globus is not a web interface (last I checked), but a large group of > middleware to manage something that looks a lot closer to the definition > of a grid than SGE. SGE is a job scheduler (with a name "engineered" to > make you think it is a one-stop-shop as a grid-in-a-box). > > My company is interested in (and we are developing) web portals for end > user cluster work, so if you know of any, we would like to hear about > them. Good open-source platforms that are current/supported could be > worth looking at (and will save us time/development effort). There seem > to be lots of bits of abandonware in the grid portal/user-interface > area. We don't want to re-invent wheels, but at the same time, we don't > want to adopt abandoned ones either. > > Joe > -- Laurence Liew, CTO Email: laurence@scalablesystems.com Scalable Systems Pte Ltd Web : http://www.scalablesystems.com (Reg. 
No: 200310328D) 7 Bedok South Road Tel : 65 6827 3953 Singapore 469272 Fax : 65 6827 3922 From brian at cmrl.wustl.edu Fri Feb 4 07:14:02 2005 From: brian at cmrl.wustl.edu (Brian Henerey) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: <420376B0.7000107@scalableinformatics.com> References: <420376B0.7000107@scalableinformatics.com> Message-ID: <4203913A.1030202@cmrl.wustl.edu> I don't mean to hijack this thread, but I'd also be interested to know if there are any open source web frontends for launching jobs on clusters. I've mostly written my own anyway, but if something's out there I'd like to know. Thanks, Brian Henerey Joe Landman wrote: > > > Robert G. Brown wrote: > >> On Wed, 2 Feb 2005, Matthew Pratola wrote: >> >> >>> Hi all, >>> >>> Can anyone recommend a simple web frontend for submitting SGE jobs? >> >> >> >> http://www.globus.org/ >> >> One stop shopping. > > > Did I miss something? Was a tongue planted in cheek with this reply? > > As far as I know there are very few web interfaces to running SGE (or > LSF, or ...) jobs. If I am wrong please do provide links/references. > > Globus is not a web interface (last I checked), but a large group of > middleware to manage something that looks a lot closer to the definition > of a grid than SGE. SGE is a job scheduler (with a name "engineered" to > make you think it is a one-stop-shop as a grid-in-a-box). > > My company is interested in (and we are developing) web portals for end > user cluster work, so if you know of any, we would like to hear about > them. Good open-source platforms that are current/supported could be > worth looking at (and will save us time/development effort). There seem > to be lots of bits of abandonware in the grid portal/user-interface > area. We don't want to re-invent wheels, but at the same time, we don't > want to adopt abandoned ones either. > > Joe > From rgb at phy.duke.edu Fri Feb 4 10:02:48 2005 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: <420376B0.7000107@scalableinformatics.com> References: <420376B0.7000107@scalableinformatics.com> Message-ID: On Fri, 4 Feb 2005, Joe Landman wrote: > > > Robert G. Brown wrote: > > On Wed, 2 Feb 2005, Matthew Pratola wrote: > > > > > >>Hi all, > >> > >>Can anyone recommend a simple web frontend for submitting SGE jobs? > > > > > > http://www.globus.org/ > > > > One stop shopping. > > Did I miss something? Was a tongue planted in cheek with this reply? Actually, it was a reply I snapped off on my way out the door on the edge of late for teaching. Let me reconsider my answer. You don't like yes/globus, how about "no". At least if you mean really really simple by simple. I would argue that a cluster designed to run primarily embarrassingly parallel jobs, fronted by a web portal/interface, is a not uncommon form of a grid, although perhaps the definition is large enough to include a union of such clusters or some more general structure (certainly access to other kinds of resources than strictly "a cluster"). So I read this question as "I want to make my local cluster into a grid, so users faraway with no direct LAN accounts or access can submit jobs into my local SGE queue after being properly authenticated". And, of course, be notified (with messages) when the jobs crash or terminate normally, facilitate data transfer and resource allocation requests, etc. Not exactly simple... Globus TK is as I understand it a toolkit from which one can build a web interface for generalized remote task submission to "a grid". It has to have lots of moving parts to do that well -- just AUTHENTICATING data transfer and job execution via a web interface isn't really terribly "simple", becaues to do it decently generally requires e.g. stuff like kerberos, ssl, ssh that aren't terribly simple either. So I definitely failed on the "simple" bit. 
However, simple or not, I believe that Globus does contain the components to do what you want -- provide a very generic web interface for people far away who don't share any LAN components such as mounted filespace, authentication/userid mappings, etc to transfer data and job execution instructions to a system. That system, if it is a front end running SGE and/or stuff like condor (policy, load balance, batch job tools) can then put the job into a queue, run it, and let globus know when it is finished so it can tell the original user. If you look over just their security layer (GSI -- Grid Security Infrastructure) you rapidly come to realize that to run any sort of remote job execution service you NEED most of its components -- authentication (including a Certificate Authority CA), encryption (public/private key, managed with certificates), permissions, etc. Some grid designs I've seen use just this component of Globus and use other tools (like PBS or SGE or custom designed stuff) for other components. Ian Foster seems to have a list of at least some of the major grid projects around the world -- enough to be able to google on them by name -- here: http://www-fp.mcs.anl.gov/~foster/grid-projects/ Perhaps you can find a reusable interface at one of their project websties. You can also check out e.g. the Grid Portal Development Kit: http://doesciencegrid.org/projects/GPDK/ or The Grid Portal Kit: https://gridport.npaci.edu/ or the Open Grid Computing Environment: http://www.ogce.org/index.php all of which I believe use globus as at least part of their middleware for e.g. authentication etc. Some of these are (e.g. the DOE's GPDK) currently unsupported although still available and possibly still reasonably functional. I don't really know the status of the rest of them, and I doubt that this is all of them. 
So you're right, I should have answered "no" because it isn't simple to offer a web interface to any active service, ESPECIALLY one that permits a remote user to upload arbitrary programs for execution on arbitrary data of arbitrary size where authentication, encryption, data transport, and remote job management become absolutely essential components of the solution. AFAIK, Globus is one of the if not the only middleware toolkits of choice for people who run the big grids -- they probably write their own actual web portal, but they use Globus to do at least some of the heavy lifting that goes on behind the scenes. Maybe one of the "portal projects" above (all open source) will be of use in setting up a "simple" portal to your cluster, but be aware that the problem itself is far from simple. However, I could be wrong and as always cherish being corrected. rgb > > As far as I know there are very few web interfaces to running SGE (or > LSF, or ...) jobs. If I am wrong please do provide links/references. > > Globus is not a web interface (last I checked), but a large group of > middleware to manage something that looks a lot closer to the definition > of a grid than SGE. SGE is a job scheduler (with a name "engineered" to > make you think it is a one-stop-shop as a grid-in-a-box). > > My company is interested in (and we are developing) web portals for end > user cluster work, so if you know of any, we would like to hear about > them. Good open-source platforms that are current/supported could be > worth looking at (and will save us time/development effort). There seem > to be lots of bits of abandonware in the grid portal/user-interface > area. We don't want to re-invent wheels, but at the same time, we don't > want to adopt abandoned ones either. > > Joe > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From mwill at penguincomputing.com Fri Feb 4 10:36:51 2005 From: mwill at penguincomputing.com (Michael Will) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Information Reseach Lab In-Reply-To: <200502010848.19789.bvanhaer@sckcen.be> References: <200502010848.19789.bvanhaer@sckcen.be> Message-ID: <200502041036.52020.mwill@penguincomputing.com> It depends on the GIS software used. There was some work to mpi-enable GRASS modules a while back, no idea where it went. Here is something about a parallel version of s.surf.rst: http://skagit.meas.ncsu.edu/~helena/grasswork/grasscontrib/ And of course if you program against the GIS api's you might be able to take advantage of a cluster as well. There is a paper that mentiones they used MPI for paralelizing their GIS/EM4 software on http://www.colorado.edu/research/cires/banff/pubpapers/104/ Michael On Monday 31 January 2005 11:48 pm, Ben Vanhaeren wrote: > On Monday 31 January 2005 11:46, Ziad Shaaban wrote: > > Dear All, > > > > I am planning to have an information lab in our faculty built of: Dell, > > Linux, Oracle and GIS. > > > > Can I use Beowulf to analyze GIS Data and display them on the web using > > ArcIMS, all three vendors said yes, but can I use Beowulf? > > > I think you should read the Beowulf FAQ: > http://www.beowulf.org/overview/faq.html#1 > Beowulf is a concept not a piece of software. > > I don't think you are going to need a beowulf cluster for the kind of > application you want to run (analyzing GIS data). If you want to guarantee > availability of your GIS data or do loadbalancing (distribute the load to > several servers) you should take a look at linux HA project: > http://www.linux-ha.org/ > Apache loadbalancing with mod_backhand: > http://www.backhand.org/ApacheCon2001/US/backhand_course_notes.pdf > and Oracle Real Application Clusters (RAC). 
> > -- Michael Will, Linux Sales Engineer Tel: 415-954-2822 Toll Free: 888-PENGUIN Fax: 415-954-2899 www.penguincomputing.com Visit us at LinuxWorld 2005! Hynes Convention Center, Boston, MA February 15th-17th, 2005 Booth 609 From rgb at phy.duke.edu Fri Feb 4 10:37:02 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: <4203913A.1030202@cmrl.wustl.edu> References: <420376B0.7000107@scalableinformatics.com> <4203913A.1030202@cmrl.wustl.edu> Message-ID: On Fri, 4 Feb 2005, Brian Henerey wrote: > > I don't mean to hijack this thread, but I'd also be interested to know > if there are any open source web frontends for launching jobs on > clusters. I've mostly written my own anyway, but if something's out > there I'd like to know. Same topic. The issue is having a web "portal" that manages stuff like authentication, data transport, job submission/status etc. Running the submissions through SGE rather than something else is just a detail. rgb > > Thanks, > Brian Henerey > > > Joe Landman wrote: > > > > > > Robert G. Brown wrote: > > > >> On Wed, 2 Feb 2005, Matthew Pratola wrote: > >> > >> > >>> Hi all, > >>> > >>> Can anyone recommend a simple web frontend for submitting SGE jobs? > >> > >> > >> > >> http://www.globus.org/ > >> > >> One stop shopping. > > > > > > Did I miss something? Was a tongue planted in cheek with this reply? > > > > As far as I know there are very few web interfaces to running SGE (or > > LSF, or ...) jobs. If I am wrong please do provide links/references. > > > > Globus is not a web interface (last I checked), but a large group of > > middleware to manage something that looks a lot closer to the definition > > of a grid than SGE. SGE is a job scheduler (with a name "engineered" to > > make you think it is a one-stop-shop as a grid-in-a-box). 
> > > > My company is interested in (and we are developing) web portals for end > > user cluster work, so if you know of any, we would like to hear about > > them. Good open-source platforms that are current/supported could be > > worth looking at (and will save us time/development effort). There seem > > to be lots of bits of abandonware in the grid portal/user-interface > > area. We don't want to re-invent wheels, but at the same time, we don't > > want to adopt abandoned ones either. > > > > Joe > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Fri Feb 4 11:33:30 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: References: <420376B0.7000107@scalableinformatics.com> Message-ID: There are web interfaces to SGE, and there are web interfaces to grids ... I think the important aspect of this is the marketing use of the term "Grid" in a name. Way back in high school, they used to teach us that what was in a name was exactly opposite of what it really was... A bit cynical, but amazingly effective at cutting through marketing. Globus is glue. Middleware. There are portals atop globus. SGE (despite its name) is a job scheduler. As is LSF. And others. The short version of things are that in order to get a web interface to SGE, one need not go through the joy of Globus, especially as Globus will not in and of itself get you where you want to go. GridPort I knew of. The other I did not. 
Joe From josip at lanl.gov Fri Feb 4 11:57:28 2005 From: josip at lanl.gov (Josip Loncaric) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050204133518.01007860@pop.xs4all.nl> References: <3.0.32.20050204133518.01007860@pop.xs4all.nl> Message-ID: <4203D3A8.4070702@lanl.gov> Vincent Diepeveen wrote: > At 00:29 4-2-2005 -0800, Bill Broadley wrote: >> >>Do you know that gigabit is too high latency? Gigabit Ethernet adapters often need tweaking to deliver reasonable latency, bandwidth, and CPU utilization. For example, if your system uses the e1000 driver (Intel's gigabit Ethernet), the default setting is "dynamic Interrupt Throttle Rate" -- which means that the card will delay interrupting the CPU by up to about 130 microseconds after receiving a packet. Moreover, the "dynamic" part causes the network chip microcode to vary this delay in multiples of about 16 microseconds, so that different packets will generally experience different receive delays. For the e1000 driver, https://lists.dulug.duke.edu/pipermail/dulug/2004-August/015415.html recommends using "options e1000 InterruptThrottleRate=80000" (add this line to /etc/modules.conf). Users of this driver may also want to check Intel's parameters for e1000 listed at http://www.intel.com/support/network/sb/cs-009209.htm#parameters -- just don't assume that the default values are appropriate for cluster use. Other gigabit Ethernet adapters have similar interrupt mitigation strategies, all designed to gracefully cope with high packet rates at high network speeds. For cluster use, adjustments are usually advisable. The basic Rx interrupt mitigation scheme is this: the receiver's CPU won't be interrupted until at least N packets have arrived or M microseconds have elapsed (whichever comes first). This clearly adds up to M microseconds to network latency. BTW, one often sees N=6 (otherwise NFS performance can seriously degrade) and M>=16. 
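Josip's N-packets-or-M-microseconds rule is easy to model. The function below is a toy simulation of that rule, using his example values N=6 and M=16; it is a sketch of the scheme as described, not any driver's actual microcode.

```python
def rx_added_latency(arrivals_us, n_packets, m_us):
    """Added latency of the FIRST packet in a burst, under the rule:
    the interrupt fires when n_packets have arrived or m_us microseconds
    have passed since the first arrival, whichever comes first.

    arrivals_us: arrival times of subsequent packets, relative to the
    first packet (microseconds)."""
    later = sorted(t for t in arrivals_us if t <= m_us)
    if len(later) + 1 >= n_packets:
        return later[n_packets - 2]   # the n-th packet triggers the IRQ
    return m_us                       # the timer expires first

# A lone packet (e.g. a small-message pingpong) eats the full timer delay:
assert rx_added_latency([], 6, 16) == 16
# A dense burst gets interrupted as soon as the 6th packet is in:
assert rx_added_latency([1, 2, 3, 4, 5, 30], 6, 16) == 5
```

The lone-packet case is exactly the latency-benchmark scenario, which is why interrupt mitigation tuned for throughput looks so bad in pingpong numbers.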
Other variants of this basic scheme are possible; but they all mean increased latencies. Finally, don't forget the Tx side interrupt mitigation, or else the sending CPU might not be told promptly that it's OK to send more. The default Tx settings are probably fine for full size packets, but if your applications send lots of small packets, tweaking your network driver's Tx settings may help. Sincerely, Josip From atp at piskorski.com Fri Feb 4 12:20:23 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050203045323.01002100@pop.xs4all.nl> References: <3.0.32.20050203045323.01002100@pop.xs4all.nl> Message-ID: <20050204202023.GA32459@piskorski.com> On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: > Please note MPI is probably what i'll use, though i keep finding > online information about 'gamma'. Is that faster latency than MPI > implementations? http://www.disi.unige.it/project/gamma/ Gamma is a non-TCP/IP Linux 2.6.x network driver for Intel Pro/1000 gigabit ethernet cards, for use with MPI. It offers much better latency (11 us or so) than TCP/IP over ethernet (maybe 60 or 100 us), but worse than the specialized HPC interconnects (maybe 3 us). The attraction of GAMMA, is that Intel Pro/1000 cards can be had for $11 to $60 or so each (depending on exact model, etc.), and gigabit switches are also pretty cheap, while SCI or Myrinet is somewhere in the $500 to $1500 per node range (I don't keep track). So if your application can benefit from lower latency, but you want something really cheap, GAMMA should be well worth trying. 
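To connect these figures back to the probe-rate question earlier in the thread: with fully blocking, non-overlapped probes, one-way latency caps the per-CPU remote probe rate at roughly 1/(2 x latency). A back-of-envelope sketch, using the latency numbers quoted above as assumptions:

```python
def max_blocking_probes_per_sec(one_way_latency_us):
    """Upper bound on synchronous remote probes per second for one CPU:
    each probe costs a full round trip of 2 * one-way latency."""
    return 1e6 / (2 * one_way_latency_us)

# Latency figures as quoted in this thread (approximate):
for name, lat in [("TCP/IP gigE", 60), ("GAMMA", 11), ("HPC interconnect", 3)]:
    print(f"{name:18s} {lat:3d} us one-way -> "
          f"{max_blocking_probes_per_sec(lat):8.0f} probes/s per CPU")
```

Overlapping communication with computation raises these bounds, at which point the NIC's message issue rate matters more than raw latency.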
-- Andrew Piskorski http://www.piskorski.com/ From lindahl at pathscale.com Fri Feb 4 12:50:34 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <20050204202023.GA32459@piskorski.com> References: <3.0.32.20050203045323.01002100@pop.xs4all.nl> <20050204202023.GA32459@piskorski.com> Message-ID: <20050204205034.GA18717@greglaptop.internal.keyresearch.com> On Fri, Feb 04, 2005 at 03:20:23PM -0500, Andrew Piskorski wrote: > On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: > > > Please note MPI is probably what i'll use, though i keep finding > > online information about 'gamma'. Is that faster latency than MPI > > implementations? > > http://www.disi.unige.it/project/gamma/ In addition to gamma, there's also MVAPICH from LBL, and at least two commercial products, one from Scali, and one from the Cluster Competence Center. -- greg From ctierney at HPTI.com Fri Feb 4 14:13:19 2005 From: ctierney at HPTI.com (Craig Tierney) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <20050204202023.GA32459@piskorski.com> References: <3.0.32.20050203045323.01002100@pop.xs4all.nl> <20050204202023.GA32459@piskorski.com> Message-ID: <1107555198.2916.4.camel@localhost.localdomain> On Fri, 2005-02-04 at 13:20, Andrew Piskorski wrote: > On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: > > > Please note MPI is probably what i'll use, though i keep finding > > online information about 'gamma'. Is that faster latency than MPI > > implementations? > > http://www.disi.unige.it/project/gamma/ > > Gamma is a non-TCP/IP Linux 2.6.x network driver for Intel Pro/1000 > gigabit ethernet cards, for use with MPI. It offers much better > latency (11 us or so) than TCP/IP over ethernet (maybe 60 or 100 us), > but worse than the specialized HPC interconnects (maybe 3 us). 
See Josip's post on tweaking interrupts on gigE drivers, but I have a small system with Intel gigE cards and a Dell gigE switch. Latency between two nodes through the switch is 30 us. This is typical of what I see for other gigE cards. A latency of 60-100 us is a bit high. Avoiding TCP/IP is still a big improvement. Craig > > The attraction of GAMMA, is that Intel Pro/1000 cards can be had for > $11 to $60 or so each (depending on exact model, etc.), and gigabit > switches are also pretty cheap, while SCI or Myrinet is somewhere in > the $500 to $1500 per node range (I don't keep track). > > So if your application can benefit from lower latency, but you want > something really cheap, GAMMA should be well worth trying. From john.hearns at streamline-computing.com Sat Feb 5 00:58:49 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: References: <420376B0.7000107@scalableinformatics.com> <4203913A.1030202@cmrl.wustl.edu> Message-ID: <1107593929.5504.1.camel@Vigor45> On Fri, 2005-02-04 at 13:37 -0500, Robert G. Brown wrote: > On Fri, 4 Feb 2005, Brian Henerey wrote: > > > > > I don't mean to hijack this thread, but I'd also be interested to know > > if there are any open source web frontends for launching jobs on > > clusters. I've mostly written my own anyway, but if something's out > > there I'd like to know. > > Same topic. The issue is having a web "portal" that manages stuff like > authentication, data transport, job submission/status etc. Running the > submissions through SGE rather than something else is just a detail. I know that the London E-science centre do work in that area. Have a look at GridSAM http://www.lesc.ic.ac.uk/gridsam/index.html Have never used it myself, mind - it was only out in beta last week! And sadly: "The DRMConnector for launching to Grid Engine resource using DRMAA is currently in development and not yet released.
" Also, you could ask the same question on the SGE list. http://gridengine.sunsource.net/project/gridengine/maillist.html From kus at free.net Fri Feb 4 09:06:01 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home Beowulf - NIC latencies Message-ID: >Good morning! >With the intention to run my chessprogram on a beowulf to be >constructed here (starting with 2 dual-k7 machines here) i better get some good advice >on which network to buy. Only interesting thing is how fast each node >can read out 64 bytes randomly from RAM of some remote cpu. All nodes >do that simultaneously. I'm very glad that parallelised chessprograms are developed, but I'm regretted that chess programs don't have coarse-grained parallelizm ... :-( I thought that every processor can handle some big part of moves tree. Unfortunatelly I can't win Deep Fritz 8 also w/o parallelization :-) >The faster this can be done the better the algorithmic speedup for >parallel search in a chess program (property of YBW, see publications >in journal of >icga: www.icga.org). This speedup is exponential (or better you get >punished exponential compared to single cpu performance). >Which network cards considering my small budget are having lowest >latencies can be used? >quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro > per card when i altavista'ed online and i wonder how to get more than >2 nodes to work without switch. Perhaps there is low cost switches >with reasonable low latency? One idea for "low price & low latency interconnect infrastructure" may be ATOLL (//www.atoll-net.de), because it has no "external" switches. But I don't know about commercial availability of ATOLL hardware just now. >Please note MPI is probably what i'll use, though i keep finding online >information about 'gamma'. Is that faster latency than MPI >implementations? You can use MPI over GAMMA having more low latencies. 
Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow >Note normal 1Gbit cards for normal network traffic. >Each node is a SMP or NUMA node and not only multiprocessor also >multithreaded. >I welcome any advice, >Best regards, >Vincent Vincent Diepeveen From nj at hemeris.com Fri Feb 4 09:21:24 2005 From: nj at hemeris.com (Nicolas Jungers) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050203045323.01002100@pop.xs4all.nl> References: <3.0.32.20050203045323.01002100@pop.xs4all.nl> Message-ID: <1107537685.6224.12.camel@lcube.bxl.jungers.net> On Thu, 2005-02-03 at 04:53 +0100, Vincent Diepeveen wrote: > Good morning! > > With the intention to run my chessprogram on a beowulf to be > constructed > here (starting with 2 dual-k7 machines here) i better get some good > advice > on which network to buy. Only interesting thing is how fast each node > can > read out 64 bytes randomly from RAM of some remote cpu. All nodes do > that > simultaneously. > > The faster this can be done the better the algorithmic speedup for > parallel > search in a chess program (property of YBW, see publications in > journal of > icga: www.icga.org). This speedup is exponential (or better you get > punished exponential compared to single cpu performance). > > Which network cards considering my small budget are having lowest > latencies > can be used? > > quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro > per > card when i altavista'ed online and i wonder how to get more than 2 > nodes > to work without switch. Perhaps there is low cost switches with > reasonable > low latency? > > Please note MPI is probably what i'll use, though i keep finding > online > information about 'gamma'. Is that faster latency than MPI > implementations? gamma bypass tcp/ip, then shaving most of the latency. 
Unfortunately it's not very actively developed, though they "recently"
(last year) updated their stack to the e1000 (Intel gigabit Ethernet) NIC.

I know that (some at) CERN use their own communication stack on e1000,
similar to GAMMA, with impressive results. I dunno if it's widely
available.

Nicolas

From ashley at quadrics.com Fri Feb 4 09:31:01 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050204133518.01007860@pop.xs4all.nl>
References: <3.0.32.20050204133518.01007860@pop.xs4all.nl>
Message-ID: <1107538261.13957.10.camel@localhost.localdomain>

On Fri, 2005-02-04 at 13:35 +0100, Vincent Diepeveen wrote:
> At 00:29 4-2-2005 -0800, Bill Broadley wrote:
> >On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
> >> Good morning!
> >>
> >> With the intention to run my chessprogram on a beowulf to be constructed
> >> here (starting with 2 dual-k7 machines here) i better get some good advice
> >> on which network to buy. Only interesting thing is how fast each node can
> >> read out 64 bytes randomly from RAM of some remote cpu. All nodes do that
> >> simultaneously.
> >
> >Is there any way to do this less often with a larger transfer?
> >If you
> >wrote a small benchmark that did only that (send 64 bytes randomly
> >from a large array in memory) and make it easy to download, build, run,
> >and report results, I suspect some people would.
>
> One way pingpong with 64 bytes will do great.

pingpong is not really the same, adding a random element can slow down
comms and ideally it sounds like you want a one-sided operation.
Perhaps you should look at tabletoy (cray shmem) or gups (MPI) as a
benchmark.

> CPU's are 100% busy and after i know how many times a second the network
> can handle in theory requests i will do more probes per second to the
> hashtable. The more probes i can do the better for the game tree search.
Are you overlapping comms and compute or doing blocking reads? If you
are overlapping then the issue rate for reads is more important than the
raw latency.

> >> quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per
> >> card when i altavista'ed online and i wonder how to get more than 2 nodes
> >> to work without switch. Perhaps there is low cost switches with reasonable
> >> low latency?
> >Do you know that gigabit is too high latency?
>
> The few one way pingpong times i can find online from gigabit cards are not
> exactly promising, to say it very polite. Something in the order of 50 us
> one way pingpong time i don't even consider worth taking a look at at the
> picture.
>
> Each years cpu's get faster. For small networks 10 us really is the upper
> limit.

10us is easily achievable, I've just measured a read time of a little
over 3us and an issue rate of 1.33us.

> So before we start searching every node (=position) we quickly want to find
> out whether other cpu's already searched it.
>
> At the origin3800 at 512 processors i used a 115 GB hashtable (i started
> search at 460 processors). Simply because the machine has 512GB ram.
>
> So in short you take everything you can get.

So is this a parallel algorithm or simply a big "memory farm" you are
after? You don't hear much of clusters being used for the latter but in
some cases it's an eminently sensible thing to do.

Ashley,

From monang at gmail.com Fri Feb 4 11:23:13 2005
From: monang at gmail.com (Monang Setyawan)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Newbie Question
Message-ID: <5dc04bbf050204112319fe7fbf@mail.gmail.com>

Hi. I'm a newbie in this parallel computing thing.
(sorry for my bad english, I'm Indonesian)

My current project is software that analyzes DNA/protein sequence data
and needs high performance. I plan to deploy this software on a network
of workstations (mm, may be just about 10 PCs on the network). Am I in
the wrong place now?
I am going to use the message passing paradigm (MPI) to write the
software. I've read that there are several choices of MPI
implementation. The problem is, I'm bad at both C and Fortran (I
usually use Java as my favorite language). Some sources say that Java
(or its MPI wrapper, or a pure-Java MPI implementation) isn't good
enough to implement a parallel computing solution. Is that right?

My third question is, is there any pdf/ps/one-file version of
"Engineering a Beowulf-style Compute Cluster"?

Thanks in advance.

--
For the sake of time..

From rhamann at uccs.edu Fri Feb 4 11:31:01 2005
From: rhamann at uccs.edu (R Hamann)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] MPICH2: Handle Limit?
In-Reply-To: 
References: 
Message-ID: 

Rob,

I thought any limit would be weird, let alone something like 84 (7 x
12?). Anyway, I thought it was based on the number of MPI variables
declared (data_types, windows, requests) because every time I added new
declarations, it would hang on Fedora core 2, but run to completion on
Scyld (but with erroneous results). If I deleted unused MPI
declarations, it would start to work again. I counted all my handles
and came up with 84.

However, after deleting two 26 element arrays of handles, I thought it
would work. When I added more handles, it bombed again. I started to
try other things. I added 4 junk ints. I didn't use the variables I
declared, but it still bombed. When I converted them to chars, it
started working again. Very strange.

Have you ever encountered this before? I'm doing a 3D cellular
automaton, so I need a lot of datatypes for exchange of ghost cells.
It's obviously some strange error I've made that's manifesting itself
in MPI instead of a runtime or syntax error. I'm gonna try looking for
any buffer overruns now, but other than that I'm stumped.

GCC on Fedora Core 2 and on Scyld Beowulf
MPICH 2 1.0

Thanks,

R

On Fri, 4 Feb 2005 12:16:17 -0600 (CST) Rob Ross wrote:
> Hi Ron,
>
> There should not be an 84 handle limit.
>
> Can you tell me what version of MPICH2 this is, and what architecture
> and OS you're running on? Do you have a simple test that exhibits the
> problem?
>
> Thanks,
>
> Rob
> ---
> Rob Ross, Mathematics and Computer Science Division, Argonne National Lab
>
>
> On Thu, 3 Feb 2005, R Hamann wrote:
>
>> I've been having some strange problems with a program using the MPICH2
>> library. When I added some new datatypes for ghost cell exchange, the
>> program would hang. I figured out that any number of handles over 84
>> would cause this. Fortunately, I could delete some handles that I no
>> longer needed, but it still seemed strange. Are my calculations
>> correct that for each process there is an 84 handle limit? or am I
>> seeing some other problem?
>>
>> Ron

From rodmur at maybe.org Fri Feb 4 12:29:55 2005
From: rodmur at maybe.org (Dale Harris)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] scyld's beorun
Message-ID: <20050204202955.GS32046@maybe.org>

Hey, I was looking at some web page talking about schedulers and
Scyld's beowulf, and using the beorun command. I'm not able to find
much of any documentation out there about what this command is, or
does. Anyone familiar with it?

--
Dale Harris rodmur@maybe.org /.-)

From diep at xs4all.nl Fri Feb 4 11:39:12 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <3.0.32.20050204203911.01006630@pop.xs4all.nl>

At 17:31 4-2-2005 +0000, Ashley Pittman wrote:
>On Fri, 2005-02-04 at 13:35 +0100, Vincent Diepeveen wrote:
>> At 00:29 4-2-2005 -0800, Bill Broadley wrote:
>> >On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
>> >> Good morning!
>> >>
>> >> With the intention to run my chessprogram on a beowulf to be constructed
>> >> here (starting with 2 dual-k7 machines here) i better get some good advice
>> >> on which network to buy.
>> >> Only interesting thing is how fast each node can
>> >> read out 64 bytes randomly from RAM of some remote cpu. All nodes do that
>> >> simultaneously.
>> >
>> >Is there any way to do this less often with a larger transfer?
>> >If you
>> >wrote a small benchmark that did only that (send 64 bytes randomly
>> >from a large array in memory) and make it easy to download, build, run,
>> >and report results, I suspect some people would.
>>
>> One way pingpong with 64 bytes will do great.

>pingpong is not really the same, adding a random element can slow down
>comms and ideally it sounds like you want a one-sided operation.
>Perhaps you should look at tabletoy (cray shmem) or gups (MPI) as a
>benchmark.

Thank you for your answer, I have indeed investigated quadrics cards
intensively. Ask your colleague Daniel Kidger. The shmem is an ideal
solution for what searching algorithms are doing. Regrettably, it seems
no one is willing to sell old quadrics cards (QM400).

>> CPU's are 100% busy and after i know how many times a second the network
>> can handle in theory requests i will do more probes per second to the
>> hashtable. The more probes i can do the better for the game tree search.
>
>Are you overlapping comms and compute or doing blocking reads? If you
>are overlapping then the issue rate for reads is more important than the
>raw latency.

A node (chessposition in this case) eats on average 10 us. Sometimes
that's 50us, other times it's 1us. That's the time a cpu is busy
calculating, from human chess patterns, a chess-technical value for how
good the position is. This is called the evaluation function in the
search world.

Before applying the evaluation function one does a lookup to the cache
to see whether one already searched this position. In case of a 2 node
beowulf that means you have 50% odds that this position is in local
memory and a 50% chance it's a remote lookup.
The reason for this is simple to see by explaining the hash function,
which gets used in a lot of different software too (not only search,
but also encryption, string matching, and all types of caches).

For each piece at each square take a random value ( long long
randomvalue[12][64] ). XOR all values with each other and you have what
is called a Zobrist hash of the position. Very effective. Nothing beats
the speed of Zobrist, as you can do it incrementally.

Now suppose we use the lower 20 bits to look up into 1 million entries.
So we AND the hash number with 2^20 - 1 and look up at that address in
the hashtable. Obviously such a cache is distributed across the nodes,
each node having an equal share of the global transposition table, as
it is called officially.

Trivially, doing this every 10 us would put too much stress on the
network. So usually one doesn't do it at the leaves themselves (the
quiescence search). That means such a lookup only gets tried in 20% of
the nodes; that's already on average once every 100 us. The slower the
network card, the fewer remote hashtable lookups one tries, obviously.
Finding for each cluster the optimum search depth at which to try it is
of course not so difficult to figure out.

1 lookup reads 64 bytes, and that's 4 entries where the position could
be stored. 1 entry is 16 bytes and stores quite some information: apart
from a lot of bits to identify the chessposition, there is the score
(20 bits) and the best move found in this position.

>> >> quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per
>> >> card when i altavista'ed online and i wonder how to get more than 2 nodes
>> >> to work without switch. Perhaps there is low cost switches with reasonable
>> >> low latency?
>> >Do you know that gigabit is too high latency?
>
>> The few one way pingpong times i can find online from gigabit cards are not
>> exactly promising, to say it very polite.
>> Something in the order of 50 us
>> one way pingpong time i don't even consider worth taking a look at at the
>> picture.
>>
>> Each years cpu's get faster. For small networks 10 us really is the upper
>> limit.
>
>10us is easily achievable, I've just measured a read time of a little
>over 3us and an issue rate of 1.33us.

Suppose an 8 node quadrics setup, so with a switch in the middle, and
all processors trying to read nonstop over the network from a random
remote processor. Each processor reading out of the 64MB on the card,
so never from the physical memory of a processor, just from the remote
network cards. What speed is achievable then to read 64 bytes?

SGI with their supercomputers never get better than 5.8 us there
(that's reading 8 bytes) on average (origin3800) when the numaflex
routers kick in. Altix3000 is way worse there. More bandwidth optimized
i guess.

>> So before we start searching every node (=position) we quickly want to find
>> out whether other cpu's already searched it.
>>
>> At the origin3800 at 512 processors i used a 115 GB hashtable (i started
>> search at 460 processors). Simply because the machine has 512GB ram.
>>
>> So in short you take everything you can get.
>
>So is this a parallel algorithm or simply a big "memory farm" you are
>after? You don't hear much of clusters being used for the latter but in
>some cases it's a eminently sensible thing to do.

I take care the cpu's get nearly 100% load, and say i am prepared to
sacrifice 10% of the scaling at a network to read/write latency to the
hashtable. So i just figure out how many reads i can do in 10% system
time and fill that with reads. The other 90% of system time it has to
evaluate chesspositions and be busy with the real stuff. By putting the
depth in the search at which it is allowed to read higher or lower, i
can manually adjust the traffic over the network.
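[The transposition-table scheme described in this thread (Zobrist keys, the AND with 2^20 - 1, 16-byte entries packed four to a 64-byte line, and the 10% remote-read budget) is compact enough to sketch. The following is illustrative code, not DIEP's actual implementation; the piece encoding, the entry layout, and the budget figures are assumptions taken from the description above.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PIECES     12   /* 6 piece types x 2 colours (assumed encoding) */
#define SQUARES    64
#define TABLE_BITS 20   /* 2^20 buckets, as in the example above */

static uint64_t zobrist[PIECES][SQUARES];

/* Build a 64-bit key from four 16-bit draws of rand(). */
static uint64_t rand64(void)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        r = (r << 16) ^ (uint64_t)(rand() & 0xffff);
    return r;
}

static void init_zobrist(void)
{
    for (int p = 0; p < PIECES; p++)
        for (int s = 0; s < SQUARES; s++)
            zobrist[p][s] = rand64();
}

/* Full hash: XOR the key of every piece on its square (-1 = empty). */
static uint64_t full_hash(const int piece_on[SQUARES])
{
    uint64_t h = 0;
    for (int s = 0; s < SQUARES; s++)
        if (piece_on[s] >= 0)
            h ^= zobrist[piece_on[s]][s];
    return h;
}

/* Incremental update for a quiet move: XOR the piece out of 'from'
 * and into 'to'.  This is why nothing beats Zobrist for speed. */
static uint64_t move_hash(uint64_t h, int piece, int from, int to)
{
    return h ^ zobrist[piece][from] ^ zobrist[piece][to];
}

/* Bucket index: AND with 2^20 - 1, exactly as described. */
static unsigned bucket_index(uint64_t h)
{
    return (unsigned)(h & ((1u << TABLE_BITS) - 1));
}

/* One 16-byte entry; four of them fill one 64-byte bucket, so a single
 * remote read fetches all four candidate slots.  Field sizes are an
 * assumed layout, not DIEP's. */
typedef struct {
    uint64_t key;    /* verification bits of the hash */
    int32_t  score;  /* the post mentions 20 bits in use */
    uint16_t move;   /* best move found */
    uint8_t  depth;
    uint8_t  flags;
} tt_entry;

/* The 10% budgeting rule reduces to one line of arithmetic: remote
 * probes per second that fit in a given fraction of cpu time, for a
 * given cost per probe in microseconds. */
static double probes_per_second(double budget_fraction, double probe_us)
{
    return budget_fraction * 1.0e6 / probe_us;
}
```

[The invariant worth checking is that the incremental update always matches a full recompute (captures would additionally XOR out the captured piece's key). The budget arithmetic gives 10,000 probes/s per cpu at 10 us per read, dropping to 1,250 at GigE's ~80 us, which is why the depth threshold for remote lookups has to move up on slower networks.]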
Best regards,
Vincent

>Ashley,

From diep at xs4all.nl Fri Feb 4 12:39:47 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <3.0.32.20050204213943.010127d0@pop.xs4all.nl>

At 11:38 4-2-2005 -0800, Bill Broadley wrote:
>> >> One way pingpong with 64 bytes will do great.
>> >
>A very similar number. I build a circularly linked list and read a value,
>add 1 to it, and send it to the next host, with a GigE network:
>
>compute-0-8.local compute-0-7.local compute-0-2.local compute-0-4.local
>compute-0-8.local compute-0-7.local compute-0-2.local compute-0-4.local
>size= 10, 131072 hops, 8 nodes in 5.30 sec ( 40.4 us/hop) 966 KB/sec
>
>Oh, you said 64 (I'm sending INTs, so 16):
>size= 16, 131072 hops, 8 nodes in 5.35 sec ( 40.8 us/hop) 1531 KB/sec

I'm amazed you get it to 40.8 us. Probably you tested on an idle
network? How fast is it when the cpu's are 100% busy doing integer work?

>> CPU's are 100% busy and after i know how many times a second the network
>> can handle in theory requests i will do more probes per second to the
>> hashtable. The more probes i can do the better for the game tree search.
>
>With a gigE network that sounds like 40us or so. With Myrinet or IB
>it's in the 4-6us range. If you bought dual opterons with the special

At the quadrics and dolphin homepages they both claim 12+ us for
Myrinet. For example:
http://www.dolphinics.com/pdf/datasheet/Dolphin_socket_4p.pdf

>hypertransport slot you could get it down to 1.5us or so. SGI
>altix machines can get that down again to around 1.0us. Of course
>speed isn't cheap.

Altix3000 has worse latency than origin3800, if i interpret the results
well. Altix3000 is 3-4 us one way pingpong at 64 processors, which
origin3800 gets at 512 processors. For the 64 processor numbers see the
extensive benchmarking by prof Aad v/d Steen for dutch government
organisations. His results are at www.sara.nl in pdf format. Look for
his presentation of 1 july 2003.
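[As a sanity check, the figures in Bill's ring test quoted above are internally consistent if "size=" counts 4-byte ints, so size=16 is exactly the 64-byte payload under discussion. A quick verification using only the reported numbers; the benchmark itself is Bill's and is not reproduced here.]

```c
#include <assert.h>

/* Per-hop latency in microseconds from total time and hop count. */
static double us_per_hop(double seconds, long hops)
{
    return seconds * 1.0e6 / (double)hops;
}

/* Throughput in KB/s, taking "size" as a count of 4-byte ints. */
static double kb_per_sec(int size_ints, long hops, double seconds)
{
    return (double)size_ints * 4.0 * (double)hops / 1024.0 / seconds;
}

/* Tiny absolute-value helper, to compare against the rounded figures. */
static double absd(double x) { return x < 0.0 ? -x : x; }
```

[Both runs check out: 5.35 s over 131072 hops is 40.8 us/hop, and 64 bytes per hop gives the reported 1531 KB/sec.]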
When i ran my latency tests (using shared memory) at a limited number
of cpu's, the origin3800 really is a lot faster in latency than
altix3000. A problem of the altix3000 design is that scheduling is of
course very hard thanks to the complex routing, as each brick is
connected to 2 routers which each connect to other parts of the
machine. This causes immense scheduling problems when there are,
normally speaking, 150 users simultaneously on the machine. Those
problems are not there when you can benchmark an entirely empty machine
with just 1 user.

>> The few one way pingpong times i can find online from gigabit cards are not
>> exactly promising, to say it very polite. Something in the order of 50 us
>> one way pingpong time i don't even consider worth taking a look at at the
>> picture.
>>
>> Each years cpu's get faster. For small networks 10 us really is the upper
>> limit.
>
>Okay, so dolphin, myrinet, or IB.

Do you have URLs where IB is buyable without needing to buy an entire
system?

>> Let's not discuss parallel chess algorithm too much in depth. 100 different
>> algorithms/enhancements get combined with each other. They are not the
>> biggest latency problem. The latency problem is caused by the hashtable.
>> Hashtable is a big cache. The bigger the better. It avoids researching the
>> same tree again.
>
>Okay, so my question is, which would be better:
>* 8 4GB caches that you could query 80 million times a second?

This one by far. Actually for the top searches such big caches are not
needed. Locally i may allocate 200-400MB per cpu for cache, but a
shared cache can easily be as low as 4MB per cpu, no problem. Could get
it even down to less than that if needed. 99% of all nodes
(chesspositions) that get searched are near the leaves. So if i move
the variable for when it may also look up at remote cpu's up from 0 to
2, then already 99% of all nodes don't get looked up remotely.

>* 1 64GB cache that you could query 200,000 times a second?
>> In games like chess and every search terrain (even simulated flight) you
>> can get back to the same spot by different means causing a transposition.
>> Like suppose you start the game with 1.e4,e5 2.d4 that leads to the same
>> position like 1.d4,e5 2.e4. So if we have searched already 1.e4,e5 2.d4
>> that position P we store into a large cache. Other cpu's first want to know
>> whether we already searched that position.
>
>Right. But if you can calculate a few Billion operations per second
>sometimes it is faster to recalculate than wait 10-20us for an answer.

Looking 1 ply deeper in search is exponential. At a 460 cpu search
(origin3800), moving the variable from 1 (the default, so it was
already not storing/looking up the leaves remotely) to 10 lost me 7 ply
of search depth. That's about 3^7 = a factor of 2187.

To answer the question: YES, 1 fast pc processor would in such a case
outsearch a 512 processor supercomputer hands down. Supercomputers are
of course notorious here. It takes a year or so to deliver them, and
the processor chosen at the time of buying already wasn't the fastest,
so by the time they finally work fine for users the processors are at
least 2 times slower than pc processors (for integer work). Clusters
are far superior in that respect.

>> Those hashtable positions get created quite quickly. Deep Blue created them
>> at a 100 million positions a second and simply didn't store the vast majority
>> in hashtable (would be hard as it was in hardware). That's one of the
>> reasons why it searched only 10-12 ply, already in 1999 that was no longer
>> spectacular when 4 processor pc's showed up at world champs.
>
>Indeed, better algorithms can allow a 4 cpu to compete with a 2000.

The Sheikh (one of the princes of the united arab emirates, see
www.hydrachess.com) plans on building a 1024 processor chess computer,
he told me over MSN. He has bad advisors IMHO. He's using myrinet and a
bad parallel search (speedup less than the square root of the total
number of cpu's).
Objectivity and desert sand are a bad combination.

>> At a PC with a shared hashtable nowadays i get 10-12 ply (ply = half move,
>> full move is when both sides make a move) in a few seconds, searching a
>> 100000 positions per second a cpu.
>>
>> So before we start searching every node (=position) we quickly want to find
>> out whether other cpu's already searched it.
>
>So that operation will cost around 80us with GigE, and 10-16us with IB
>or Myri.

80 us is what i read elsewhere too, yes, for GigE. Is it so hard to
make a card with lower latency for a few dollars? I mean, if i buy a
cpu for 135 euro i can get myself an opteron 1.4Ghz or something. If i
buy for 1000 euro i get myself say a 2.4Ghz opteron. Less than a factor
2 faster. If you buy a network card for 135 euro it is 80 us. When you
buy a highend network card it's a factor 10 faster from the user
viewpoint. That's quite a lot!

>> At the origin3800 at 512 processors i used a 115 GB hashtable (i started
>> search at 460 processors). Simply because the machine has 512GB ram.
>
>The origin 3800 has a very healthy interconnect, shared memory lookups
>are in the few 100 ns range, and MPI with the newest libraries are
>in the 1-2us range.

If the interconnects (hubs) of the origin are fine, then they must use
really slow routers. A shared memory lookup is 5.8 us on average at a
460 processor origin3800, with no one else on the system (looking up 8
bytes). 3-4 us one way pingpong. That machine is equipped with so
called 35ns routers. Lookup to local memory is 280 ns by the way, at
both itanium2 as well as origin. Of course everything is randomized;
it's complete TLB thrashing.

>> So in short you take everything you can get.
>
>Of course.
>
>> The search works with internal iterative deepening which means we first
>> search 1 ply, then 2 ply, then 3 ply and so on.
>>
>> The time it takes to get to the next iteration i hereby define as the
>> branching factor (Knuth has a different definition as he just took into
>> account 1 algorithm, the 'todays' definition looks more appropriate).
>>
>> In order to search 1 ply deeper obvious it's important to maintain a good
>> branching factor. I'm very bad in writing out mathematical proofs, but it's
>> obvious that the more memory we use, the more we can reduce the number of
>> legal moves in this position P as next few ply it might be in hashtable,
>> which trivially makes the time needed to search 1 ply deeper shorter.
>>
>> Storing closer to the root (position where we started searching) is of
>> course more important than near the leafs of the search tree.
>>
>> When for example not storing in hashtable last 10 ply near the leafs in an
>> overnight experiment the search depth dropped at 460 processors from 20 ply
>> to 13 ply.
>>
>> Of course each processor of supercomputers is deadslow for game tree search
>> (it's branchy 100% integer work completely knocking down the caches), so
>> compared to pc's you already start at a disadvantage of a factor 16 or so
>> very quickly, before you start searching (in case of TERAS i had to fight
>> with outdated 500Mhz MIPS processors against opterons and high clocked quad
>> Xeons), so upgrading my own networkcards is more clever.
>
>Interesting. Of course the Origin 3800 is quite dated, not that the
>Itanium is an opteron killer, but it is much more competitive, and has
>much larger caches.

An Itanium 1.3Ghz, using 24 hours of PGO and after i figured out all
kinds of options in the compiler so it does not take shortcuts by
default, is the same speed as a 1.3Ghz opteron for DIEP. I understand
why governments buy them. They are good on paper and have no real weak
spots. Horror & co to program for, those itaniums.

L3 cache sizes are not important for diep. See the extensive
benchmarking of my program at the different hardware sites.
For example by Johan de Gelas, or:
Aceshardware : http://www.aceshardware.com/read.jsp?id=60000259
Sudhian : http://www.sudhian.com/showdocs.cfm?aid=635&pid=2403
Soon also tested at www.anandtech.com !

>> Yet getting yourself a network even between a few nodes as quick as those
>> supercomputers is not so easy...
>
>Quadrics and Pathscale's infinipath have networks available that are in the
>same ballpark as the SGI origin. Even dolphin although I'm not very
>familiar with them.

I am very impressed by the quadrics and dolphin cards. Probably by
infinipath too when i check them out. Will do. I'm not so impressed yet
by myrinet actually, but if cluster builders can earn a couple of
hundred dollars more on each node i'm sure they'll do it.

>> Additional your own beowulf network you can first decently test at before
>> playing at a tournament, and without good testing at the machine you play
>> at in tournaments you have a hard 0% chance that it plays well.
>>
>> The only thing in software that matters is testing.
>
>Indeed, good luck, thanks for the overview. I'm planning on a cluster
>with a very fast (sub 2.5us network), but I won't have it for a few months.
>
>I had some infiniband hardware on loan, but I had to return it.
>
>--
>Bill Broadley
>Computational Science and Engineering
>UC Davis

From diep at xs4all.nl Fri Feb 4 13:33:47 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <3.0.32.20050204223345.01009170@pop.xs4all.nl>

Thanks for your deep insight, this is very helpful!

Vincent
www.diep3d.com

At 12:57 4-2-2005 -0700, Josip Loncaric wrote:
>Vincent Diepeveen wrote:
>> At 00:29 4-2-2005 -0800, Bill Broadley wrote:
>>>
>>>Do you know that gigabit is too high latency?
>
>Gigabit Ethernet adapters often need tweaking to deliver reasonable
>latency, bandwidth, and CPU utilization.
> >For example, if your system uses the e1000 driver (Intel's gigabit >Ethernet), the default setting is "dynamic Interrupt Throttle Rate" -- >which means that the card will delay interrupting the CPU by up to about >130 microseconds after receiving a packet. Moreover, the "dynamic" part >causes the network chip microcode to vary this delay in multiples of >about 16 microseconds, so that different packets will generally >experience different receive delays. > >For the e1000 driver, >https://lists.dulug.duke.edu/pipermail/dulug/2004-August/015415.html >recommends using "options e1000 InterruptThrottleRate=80000" (add this >line to /etc/modules.conf). Users of this driver may also want to check >Intel's parameters for e1000 listed at >http://www.intel.com/support/network/sb/cs-009209.htm#parameters -- just >don't assume that the default values are appropriate for cluster use. > >Other gigabit Ethernet adapters have similar interrupt mitigation >strategies, all designed to gracefully cope with high packet rates at >high network speeds. For cluster use, adjustments are usually advisable. > >The basic Rx interrupt mitigation scheme is this: the receiver's CPU >won't be interrupted until at least N packets have arrived or M >microseconds have elapsed (whichever comes first). This clearly adds up >to M microseconds to network latency. BTW, one often sees N=6 >(otherwise NFS performance can seriously degrade) and M>=16. Other >variants of this basic scheme are possible; but they all mean increased >latencies. > >Finally, don't forget the Tx side interrupt mitigation, or else the >sending CPU might not be told promptly that it's OK to send more. The >default Tx settings are probably fine for full size packets, but if your >applications send lots of small packets, tweaking your network driver's >Tx settings may help. 
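[Josip's e1000 advice condenses to a one-line module option. This is a sketch for the 2.4/2.6-era kernels under discussion; the rate value of 80000 is the one from the post he links, and should be checked against Intel's parameter documentation for your driver version.]

```shell
# /etc/modules.conf (2.4 kernels) or /etc/modprobe.conf (2.6 kernels):
# pin the e1000 interrupt throttle instead of the "dynamic" default,
# which can delay Rx interrupts by up to ~130 us
options e1000 InterruptThrottleRate=80000

# reload the driver so the option takes effect (drops the link briefly):
#   rmmod e1000 && modprobe e1000
```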
> >Sincerely, >Josip >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From rross at mcs.anl.gov Fri Feb 4 10:16:17 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] MPICH2: Handle Limit? In-Reply-To: References: Message-ID: Hi Ron, There should not be an 84 handle limit. Can you tell me what version of MPICH2 this is, and what architecture and OS you're running on? Do you have a simple test that exhibits the problem? Thanks, Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab On Thu, 3 Feb 2005, R Hamann wrote: > I've been having some strange problems with a program using the MPICH2 > library. When I added some new datatypes for ghost cell exchange, the > program would hang. I figured out that any number of handles over 84 > would cause this. Fortunately, I could delete some handles that I no > longer needed, but it still seemed strange. Are my calculations > correct that for each process there is an 84 handle limit? or am I > seeing some other problem? > > Ron From rross at mcs.anl.gov Sat Feb 5 08:05:04 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] MPICH2: Handle Limit? In-Reply-To: References: Message-ID: Hi Ron, Well there *is* a limit, because the handles are represented by an integer, but from a practical perspective you should never have to worry about it. I have not ever encountered this before. I wrote most of that code, so I would very much like to figure out what is happening in your case. I tend to agree that it is probably some sort of buffer overrun. We test on IA32 with gcc as our primary environment. What exactly is happening when it "bombs"? Are you getting a segfault? Is this something where you could capture a core file and get a stack trace? Are there any errors reported? 
Will the problem manifest itself with a single-process run? If so, you
could try valgrind.

Actually, while we're discussing it, why do you need "lots" of
datatypes to exchange ghost cells? There might be a way to simplify
that too.

Regards,

Rob

On Fri, 4 Feb 2005, R Hamann wrote:

> I thought any limit would be weird, let alone something like 84 (7 x
> 12?). Anyway, I thought it was based on the number of MPI variables
> declared (data_types, windows, requests) because every time I added
> new declarations, it would hang on Fedora core 2, but run to
> completion on Scyld (but with erroneous results). If I deleted unused
> MPI declarations, it would start to work again. I counted all my
> handles and came up with 84.
>
> However, after deleting two 26 element arrays of handles, I thought it
> would work. When I added more handles, it bombed again. I started to
> try other things. I added 4 junk ints. I didn't use the variables I
> declared, but it still bombed. When I converted them to chars, it
> started working again. Very strange.
>
> Have you ever encountered this before? I'm doing a 3D cellular
> automaton, so I need a lot of datatypes for exchange of ghost cells.
> It's obviously some strange error I've made that's manifesting itself
> in MPI instead of a runtime or syntax error. I'm gonna try looking for
> any buffer overruns now, but other than that I'm stumped.
>
> GCC on Fedora Core 2 and on Scyld Beowulf
> MPICH 2 1.0
>
> Thanks,
>
> R

From h.jasak at wikki.co.uk Fri Feb 4 11:13:04 2005
From: h.jasak at wikki.co.uk (Hrvoje Jasak)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] OpenFOAM
Message-ID: <4203C940.1000402@wikki.co.uk>

Hi Mike,

I've just found your post on OpenFOAM. I am one of the (two) main
authors/developers of FOAM and have been using it since 1993. Linux is
these days the main and most important parallel platform for FOAM and
it is regularly used for large-scale simulations (especially LES).
I am still developing the code and doing research/working with students etc. with it - if you've got any questions or would like to get involved in keeping FOAM alive, please feel free to contact me. Regards, Hrvoje Jasak -- Dr. Hrvoje Jasak Wikki Ltd. 10 Palmerston House, Tel: +44 (0)20 7221 9815 60 Kensington Place, E-mail: H.Jasak@wikki.co.uk London W8 7PU, United Kingdom From mprinkey at aeolusresearch.com Fri Feb 4 13:00:02 2005 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <20050204205034.GA18717@greglaptop.internal.keyresearch.com> Message-ID: On Fri, 4 Feb 2005, Greg Lindahl wrote: > On Fri, Feb 04, 2005 at 03:20:23PM -0500, Andrew Piskorski wrote: > > On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: > > > > > Please note MPI is probably what i'll use, though i keep finding > > > online information about 'gamma'. Is that faster latency than MPI > > > implementations? > > > > http://www.disi.unige.it/project/gamma/ > > In addition to gamma, there's also MVAPICH from LBL, and at least two > commercial products, one from Scali, and one from the Cluster > Competence Center. > > -- greg Greg, I think you mean MVICH at LBL. It and MVIA are all but dead, AFAICT: http://old-www.nersc.gov/research/FTG/mvich/index.html Mike From fant at pobox.com Fri Feb 4 13:27:47 2005 From: fant at pobox.com (Andrew D. Fant) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: <4203913A.1030202@cmrl.wustl.edu> References: <420376B0.7000107@scalableinformatics.com> <4203913A.1030202@cmrl.wustl.edu> Message-ID: <4203E8D3.9030509@pobox.com> Brian Henerey wrote: > > I don't mean to hijack this thread, but I'd also be interested to know > if there are any open source web frontends for launching jobs on > clusters. I've mostly written my own anyway, but if something's out > there I'd like to know. 
>
> Thanks,
> Brian Henerey

Most of this is admittedly not open-source, but it is what I can think of off the top of my head for web/GUI cluster front-end tools. I think Platform explored a web front end for LSF after they killed off the xlsf tools. The tool I have seen lately that I would be most interested in seeing more of is Auger from the Jefferson Laboratory in Norfolk. Technically it's not a web front end, because it's a Java front-end tool, but it looks nice in any case. Most of the true web front ends for cluster jobs that I have seen are application-specific portals. NCSA has some examples, and PNL has a nice distributed web front end for computational chemistry applications as well.

Andy

From nix at petelancashire.com Sat Feb 5 10:07:56 2005
From: nix at petelancashire.com (Pete Lancashire)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] real hard drive failures
In-Reply-To:
References:
Message-ID: <1107626876.3794.17.camel@l1.pdxeng.com>

The nice thing, and about the only nice thing, about using a fan is that in this case the failure of a fan is not going to kill you. If your motherboard has an unused 3-wire fan port, you can have it report failure. In the past I've built a simple failure detector using an 8-pin Microchip part. I would think with some imagination you could take a 555 + transistor + piezo buzzer and create a simple alarm.

Another thing to use, though I've not seen it as an individual item, is a heat sink. The Sun SPUD brackets come with a plate that attaches to the bottom of the drive; the plate has been punched with, hmm... louvers?

-pete

"ah the days of so many fans you could not hear yourself talk"

On Tue, 2005-01-25 at 14:26, Mark Hahn wrote:
> > > I'm only partially interested in the thread "Cooling vs HW replacement" but
> > > the problem with drive failures is a real pain for me. So, I thought I'd
> > > share some of my experience.
> > > > i'd add 1 or 2 cooling fans per ide disk, esp if its 7200rpm or 10,000 rpm > > disks > > I'm pretty dubious of this: adding two 50Khour moving parts to > improve the airflow around a 1Mhour moving part which only dissipates > 10W in the first place? designing the chassis for proper airflow > with minimum fanage is obviously smarter and probably safer. > > > - if downtime is important, and should be avoidable, than raid > > is the worst thing, since it's 4x slower to bring back up than > > a single disk failure > > eh? you have a raid which is not operational while rebuilding? > > > - raid will NOT prevent your downtime, as that raid box > > will have to be shutdown sooner or later > > ( shutting down sooner ( asap ) prevents data loss ) > > huh? hotspares+hotplug=zero downtime. > > but yes, treating whole servers as your hotspare+hotplug element is > a nice optimization, since hotplug ethernet is pretty cheap vs > $50 hotplug caddies for each and every disk ;) > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From john.hearns at streamline-computing.com Sun Feb 6 00:55:21 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Newbie Question In-Reply-To: <5dc04bbf050204112319fe7fbf@mail.gmail.com> References: <5dc04bbf050204112319fe7fbf@mail.gmail.com> Message-ID: <1107680121.28574.5.camel@Vigor45> On Sat, 2005-02-05 at 02:23 +0700, Monang Setyawan wrote: > Hi. I'm a newbie in this parallel computing thing. > (sorry for my bad english, I'm Indonesian) > > My current project is a software that analyze DNA/Protein sequence > data that needs high performance aspect on it. I plan to deploy this > software on network of workstations (mm, may be just about 10 PCs on > the network). Am I in wrong place now? 
>

You could start by looking at the BioBrew Linux distribution. It probably has a lot of the tools you want for this work.
http://bioinformatics.org/biobrew

From john.hearns at streamline-computing.com Sun Feb 6 01:07:05 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050204213943.010127d0@pop.xs4all.nl>
References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl>
Message-ID: <1107680825.28574.14.camel@Vigor45>

On Fri, 2005-02-04 at 21:39 +0100, Vincent Diepeveen wrote:
> >
> >So that operation will cost around 80us with GigE, and 10-16us with IB
> >or Myri.
>
> 80 us is what i read elsewhere too yes for GigE.
>
> Is it so hard to make a card with lower latency for a few dollar?
>
> I mean if i buy for 135 euro a cpu i can get myself an opteron 1.4Ghz or
> something. If i buy for 1000 euro i get myself say a 2.4Ghz opteron.

We supply turnkey clusters with the SCore environment, which gives excellent latency figures using standard gigabit ethernet NICs.

If you are looking for different hardware, Google for 'TOE' - TCP Offload Engine. These are claimed to offer lower latency than onboard adapters. But caveats apply: I've no idea how these work with MPI-type applications, as they're probably aimed at high-bandwidth applications, and it is probably more cost-effective to go Myrinet/Quadrics/IB.

Actually, it would be worth having the list's opinions on TOE adapters. My guess is that they really don't do much for the latency, but would be very good on web servers and database servers.
> (sorry for my bad english, I'm Indonesian) > > My current project is a software that analyze DNA/Protein sequence > data that needs high performance aspect on it. I plan to deploy this > software on network of workstations (mm, may be just about 10 PCs on > the network). Am I in wrong place now? > > I am going to use message passing paradigm (MPI) to write the > software. I've read that there are several choice of MPI > implementation. The problem is, I'm bad in both C or Fortran (I > usually use Java as my favorite language). Some source said that Java > (or it's MPI wrapper or pure MPI implementation) isn't good enough to > implement a parallel computing solution. Is that right? > > My third question is, is there any pdf/ps/one file version of > "Engineering a Beowulf-style Compute Cluster''? On my personal website, on brahma, both. Follow the links for beowulf and beowulf book on my personal page, or use google with "beowulf book pdf" to go right there. Also, there are images for both US letter and Euro A4 there, as you might have either kind of printer/paper. rgb > > Thanks in advance. > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From patrick at myri.com Sat Feb 5 18:27:57 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> Message-ID: <420580AD.5050003@myri.com> Hi Vincent, Vincent Diepeveen wrote: >>>CPU's are 100% busy and after i know how many times a second the network >>>can handle in theory requests i will do more probes per second to the >>>hashtable. The more probes i can do the better for the game tree search. >> >>With a gigE network that sounds like 40us or so. With Myrinet or IB >>it's in the 4-6us range. 
If you bought dual opterons with the special > > > At the quadrics and dolphin homepage they both claim 12+ us for Myrinet. Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), that includes fibers and a switch in the middle: Length Latency(us) Bandwidth(MB/s) 0 2.684 0.000 1 2.874 0.336 2 2.898 0.690 4 2.978 1.343 8 2.965 2.699 16 2.993 5.347 32 3.409 9.388 64 3.563 17.960 128 3.977 32.185 256 5.699 44.916 Quadrics would be lower by a 1.5 us, I don't know about Dolphin, I didn't hear about noticeable SCI clusters in a long time. > I am very impressed by the quadrics and dolphin cards. Probably by > infinipath too when i check them out. Will do. > > I'm not so impressed yet by myrinet actually, but if cluster builders can > earn a couple of hundreds of dollars more on each node i'm sure they'll do it. I don't think Myrinet would be the cheapest, I am sure you can get a better deal from desperate interconnect vendors. What does not impress you in Myrinet ? Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From diep at xs4all.nl Sat Feb 5 19:36:20 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies Message-ID: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> At 21:27 5-2-2005 -0500, Patrick Geoffray wrote: >Hi Vincent, > >Vincent Diepeveen wrote: >>>>CPU's are 100% busy and after i know how many times a second the network >>>>can handle in theory requests i will do more probes per second to the >>>>hashtable. The more probes i can do the better for the game tree search. >>> >>>With a gigE network that sounds like 40us or so. With Myrinet or IB >>>it's in the 4-6us range. If you bought dual opterons with the special >> >> >> At the quadrics and dolphin homepage they both claim 12+ us for Myrinet. 
> >Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), >that includes fibers and a switch in the middle: > > Length Latency(us) Bandwidth(MB/s) > 0 2.684 0.000 > 1 2.874 0.336 > 2 2.898 0.690 > 4 2.978 1.343 > 8 2.965 2.699 > 16 2.993 5.347 > 32 3.409 9.388 > 64 3.563 17.960 > 128 3.977 32.185 > 256 5.699 44.916 > >Quadrics would be lower by a 1.5 us, I don't know about Dolphin, I >didn't hear about noticeable SCI clusters in a long time. > >> I am very impressed by the quadrics and dolphin cards. Probably by >> infinipath too when i check them out. Will do. >> >> I'm not so impressed yet by myrinet actually, but if cluster builders can >> earn a couple of hundreds of dollars more on each node i'm sure they'll do it. > >I don't think Myrinet would be the cheapest, I am sure you can get a >better deal from desperate interconnect vendors. > >What does not impress you in Myrinet ? Thanks for your kind answer Patrick, Obviously i mentionned that number because i read it elsewhere. Well a number of points bother my mind from which majority is true for others as well. But first let me note that i'm not against myrinet in general. I am just trying to solve a very specific case. For that specific case i'm not so impressed. Note that so far i didn't find any desperate vendor. For sure quadrics doesn't look desperate to me, they aren't even selling old cards anymore though they must have still thousands of them lying at home from returned upgraded networks. Finding second hand highend cards seems to be very seldom. First of all i'm interested in how quick i can get 4-64 bytes from remote memory. So not from some kind of network card cache, as myrinet doesn't have some megabytes on chip, but just a few tens of kilobytes. The memory has to come therefore from the remote nodes main memory, at a random adress in the main memory. No streaming at all happens. that 400 ns extra that the TLB gives is definitely not the problem i guess. 
The problem for me is to understand: "how do you get that memory at a cluster?" A latency on paper says of course nothing when you can't actually get it within that time. "Paper supports everything." Arturo Ochoa (Caracas, Venezuela) I hope everyone realizes that an important consequence from beowulf clusters is that you actually want to *use* all those cpu's you have to your avail. So every cpu has a program running that eats 100% system time. Because if it wouldn't use 100% system time, you wouldn't need a cluster! >From that 100% system time obviously you must be prepared to give away some to serve other nodes as quickly as possible doing a read. All latencies i see quoted at all hardware sites, it is very hard to figure for me out whether that's a latency that is supported by paper, or whether it's a practical latency i can take into account as a programmer with all software layers overhead when each cpu is 100% running a program. Secondly, but as i'm not a cluster expert i don't know how to avoid that, it's of course a big LOSS in sequential speed if my program each few instructions must check whether there is some MPI message to get handled. If i check a lot that will slow down my program 20 times. If i don't check a lot, other cpu's will have to wait longer and that defeats the purpose of a fast network card. Factor 20 is about the slowdown of the average 'old' supercomputer chessprograms which use MPI type solutions. Zugzwang (Paderborn-Siemens), P.Conners (Paderborn-Siemens), cilkchess (MIT). I've been playing with my own eyes against those programs in world champs and despite that it has happened that i played at the same hardware with a similar amount of cpu's and a program having factor 100 more chessknowledge (which slows down the program *considerable*), the actual speed at which the program searches nodes was up to factor 5-10 faster. 
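[Editorial aside: Vincent's polling dilemma above - check for incoming messages too often and the sequential search slows down, check too rarely and remote nodes stall waiting for their reads to be served - can be made concrete with a toy model. This is a sketch under stated assumptions; the per-poll cost and CPU rate below are illustrative numbers, not measurements of any real NIC or chess program.]

```python
# Toy model of the poll-frequency trade-off described above.
# All constants are illustrative assumptions, not measurements.

POLL_COST = 200       # instructions burned per poll (assumed)
CPU_RATE = 1e9        # instructions/second (assumed ~1 GHz-class K7)

def overhead_fraction(poll_interval):
    """Fraction of CPU time spent polling when we poll every
    `poll_interval` instructions of useful work."""
    return POLL_COST / (poll_interval + POLL_COST)

def mean_response_us(poll_interval):
    """Average extra wait a remote request sees before the host
    notices it: half the polling period, in microseconds."""
    return (poll_interval / 2) / CPU_RATE * 1e6

for n in (1_000, 100_000, 10_000_000):
    print(f"poll every {n:>10} instr: "
          f"{overhead_fraction(n):6.1%} overhead, "
          f"{mean_response_us(n):8.1f} us added latency")
```

The two failure modes Vincent describes fall straight out of the model: a tight polling interval wastes a large fraction of the CPU, while a loose one adds milliseconds of service latency that no fast NIC can buy back.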
Now a few years ago this was not a major problem because for example Cilkchess which obviously ran factor 20-40 times slower than it could, used 1800 processors for example in world champs 1995 (Hong kong) and 512 processors in world champs 1999 (Paderborn). Of course because 1 processor was real real fast compared to the speed of 1 pc processor in those days, they practical were searching a lot deeper than pc programs (and both played excellent for its days, especially Don Dailey needs to get a big compliment for that). However if i show up with 2 pc's and 2 network cards, then it sure matters when i lose a lot of speed. Obviously for embarassingly parallel software this is no issue, but usually for embarrassingly parallel software all you need is gigabit ethernet. There is so many MPI applications which are not exactly embarassingly parallel from which you see that a decent programmer single cpu would be doing that 20 times faster. Or to quote someone who has been doing such rewriting work for some physical applications that run here and there: "I didn't blink my eyes when i managed to speedup an application factor 1000". So it is very interesting for us all and me especially to understand how *fast* you can get that memory under full load of all the logical cpu's. Third each pc has 2 cheapo k7 processors which are a lot slower than opterons. Second problem i have is that i can get easily dual k7 pc's from chessplayers and they can get bought cheap still. Dual k7 is practical same speed like a dual xeon 3.06Ghz Northwood with all memory slots filled with 2-2-2 DIMMS for DIEP. So just compare the price of such a system with a cheapo dual k7 with registered cas3 RAM. Those dual k7's have 64 bits 66Mhz slots, not pci-x as far as i know and also those who do have A64's or P4's usually don't have pci-x onboard either. 
Sure there is boards that have them and i'm sure that if you make a network Dolphin can deliver 'bytes' they say at their homepage in 3.3 us at MPX mainboards and claim somewhere a paper latency of 1.x us. What is the achieved read speed to remote memory myrinet gets at 64 bits / 66Mhz in software, so ready to use 4-64 bytes for applications? I'm not asking it to be accurate within 400ns, as that's the delay you'll have from TLB trashing the remote node. But accuracy within 1.5 us would be quite nice. First of all for integer intensive applications i'm doing fastest processor is opteron, k7 comes second and P4 comes third. Exception is a P4 machine equipped with the most expensive stuff (2-2-2 ram and all banks filled) good mainboard and northwoods and overclocked at the mainboard. However for that price a dual opteron can get bought and it just blows away that P4 bigtime. Every year that new software gets released of course that P4 gets slower, because newer software only gets more and more complex with more options and will fit less perfectly in P4's small tiny caches, let alone when we get a lot of 64 bits programs. They won't fit at all in those tiny slow caches. So until the dual core opterons arrive at low cost, obviously you can make dual k7 nodes for just a few hundreds of dollar a node. When adding new nodes which in the future no doubt are dual opteron, you still run further with those dual k7 nodes and want to mix them obviously with dual opterons. Is that possible? >Patrick >-- > >Patrick Geoffray >Myricom, Inc. >http://www.myri.com > > From diep at xs4all.nl Sun Feb 6 07:10:39 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Newbie Question Message-ID: <3.0.32.20050206161035.01013bd0@pop.xs4all.nl> At 02:23 5-2-2005 +0700, Monang Setyawan wrote: >Hi. I'm a newbie in this parallel computing thing. 
>(sorry for my bad english, I'm Indonesian) > >My current project is a software that analyze DNA/Protein sequence >data that needs high performance aspect on it. I plan to deploy this >software on network of workstations (mm, may be just about 10 PCs on >the network). Am I in wrong place now? >I am going to use message passing paradigm (MPI) to write the >software. I've read that there are several choice of MPI >implementation. The problem is, I'm bad in both C or Fortran (I >usually use Java as my favorite language). Some source said that Java >(or it's MPI wrapper or pure MPI implementation) isn't good enough to >implement a parallel computing solution. Is that right? You definitely want to write it in C. Basically protein research, which might touch a field which is forbidden to research in EU countries, but not forbidden to research in USA, Israel and i must admit i'm amazed that's legal in Indonesia usually is heavily floating point oriented. Just calculating what i would classify as matrix invariants to determine origins and consequences of modifications. In C there is superb libraries you want to consider. Certain calculations can get speeded up bigtime by FFT, but not always, as sometimes you just want accurate results and not approximations. C is ideal because it's easier to use SSE2 for it which is what you need of course. Please note both P4 and A64/Opteron have that functionality and Opteron is 2 times faster than P4 there, but perhaps you can get the P4 hardware factor 2 cheaper, which would make it very attractive for such a cluster. In all cases such software is embarassingly parallel. gigabit ethernet is more than sufficient. Yet taking care the pc's have relative fast floating point possibilities is very relevant. Cheapest gflop per dollar might be probably surprising hardware. A beowulf definitely is ideal for this type of software. >My third question is, is there any pdf/ps/one file version of >"Engineering a Beowulf-style Compute Cluster''? 
>
>Thanks in advance.
>
>--
>For the sake of time..
>_______________________________________________
>Beowulf mailing list, Beowulf@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

From wytsang at clustertech.com Sun Feb 6 19:24:43 2005
From: wytsang at clustertech.com (Clotho)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] ifort MPI_FILE_OPEN err with romio testsuite
Message-ID: <4206DF7B.3040705@clustertech.com>

In MPICH-1.2.6, in the romio directory, there is a test program called "fcoll_test.f". The test program runs successfully when compiled with gcc. However, with the ifort (8.0/8.1) compiler, the program fails. After debugging, I find that the function MPI_FILE_OPEN fails (ierr is non-zero). But changing the size of the character array from 1024 to 200 solves the problem.

I have found other people with a similar experience: (in Chinese)
http://www.lasg.ac.cn/cgi-bin/forum/view.cgi?forum=4&topic=2519

Here is the full program:
http://clustertech.com/~wytsang/fcoll_test.f

Here is a simpler version of the program.

      program main
      implicit none
      include 'mpif.h'
      integer nprocs
      integer mynod
      integer fh, ierr
      character*1024 str        ! used to store the filename
c     character*200 str        ! this will work
      integer writebuf(1)

      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mynod, ierr)
      str = 'test'
      writebuf(1) = 0
      call MPI_FILE_OPEN(MPI_COMM_WORLD, str,
     &     MPI_MODE_CREATE+MPI_MODE_RDWR, MPI_INFO_NULL, fh, ierr)
      print *, ierr
      call MPI_FINALIZE(ierr)
      stop
      end

From patrick at myri.com Mon Feb 7 00:11:58 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl>
References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl>
Message-ID: <420722CE.5010408@myri.com>

Vincent,

Vincent Diepeveen wrote:
> Thanks for your kind answer Patrick,
>
> Obviously i mentionned that number because i read it elsewhere.

I know, I have seen worse.

> Note that so far i didn't find any desperate vendor. For sure quadrics
> doesn't look desperate to me, they aren't even selling old cards anymore
> though they must have still thousands of them lying at home from returned
> upgraded networks. Finding second hand highend cards seems to be very seldom.

Tip: desperate companies are usually young and spend a lot of VC money on marketing. Quadrics does not fit, I am afraid, they have been around too long :-) Furthermore, selling old hardware is not very cost-effective for a vendor: compatibility troubles with newer machines, the need to support old hardware in new drivers and new middleware, tapping into inventory reserved for replacement parts, etc.

> First of all i'm interested in how quick i can get 4-64 bytes from remote
> memory. So not from some kind of network card cache, as myrinet doesn't
> have some megabytes on chip, but just a few tens of kilobytes. The memory
> has to come therefore from the remote nodes main memory, at a random adress
> in the main memory. No streaming at all happens. that 400 ns extra that the
> TLB gives is definitely not the problem i guess.
Myrinet has 2 MB of SRAM in standard, used by firmware code, data and buffers. What you want to do basically is a Get. In practice, the origin of the Get will send a small packet with a virtual address or a RDMA handle and an offset, the NIC on the target side converts it in a physical address, fetches the data by DMA and sends it back to the origin side. > All latencies i see quoted at all hardware sites, it is very hard to figure > for me out whether that's a latency that is supported by paper, or whether > it's a practical latency i can take into account as a programmer with all > software layers overhead when each cpu is 100% running a program. No, it's not likely to fit your usage. Vendors quote MPI latency on pingpong. That's pretty much the cost of sending/receiving an MPI message from user space to user space. Often, this is also with only 2 nodes, optimal conditions and everybody holding their breath. You want RMA Get. The latency for a Get is larger than for a MPI send. For 64 bytes, it is basically the MPI latency for 0 bytes (for the Get request) + the latency for 64 bytes (for the reply). Assuming that you don't Get all over the host memory, the virtual/physical translation will be hot in the target NIC so the translation cost will be very small. You want less than 3us per Get of 64 Bytes ? I don't know if even Quadrics can do it. The good news is that you can pipeline it very well. So it may cost more than 3 us for one Get, but you may complete a Get every 0.5 us if you post a bunch of them. > Secondly, but as i'm not a cluster expert i don't know how to avoid that, > it's of course a big LOSS in sequential speed if my program each few > instructions must check whether there is some MPI message to get handled. If you want perfect overlap and if you are ready to go as low level as possible, one-sided communication are for you (no host CPU involved on the target side). 
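[Editorial aside: Patrick's cost model for a 64-byte Get, and the payoff from pipelining, can be sanity-checked with a few lines of arithmetic. The sketch below plugs in the MX/F-card figures he quoted earlier in the thread (2.684 us at 0 bytes, 3.563 us at 64 bytes) and his suggested 0.5 us per completed Get under pipelining; the 0.5 us figure is his ballpark, not a measured number.]

```python
# Sketch of the Get cost model described above, using the MX/F-card
# pingpong numbers quoted earlier in the thread. The 0.5 us per-Get
# completion rate under pipelining is an illustrative assumption.

lat_0B = 2.684    # us, MPI latency for 0 bytes (the Get request leg)
lat_64B = 3.563   # us, MPI latency for 64 bytes (the reply leg)

single_get = lat_0B + lat_64B          # one blocking 64-byte Get
print(f"one blocking 64-byte Get: ~{single_get:.2f} us")

# With many Gets in flight, cost per Get approaches the issue rate:
pipelined_rate = 0.5                   # us per completed Get (assumed)
n = 1000
blocking_total = n * single_get
pipelined_total = single_get + (n - 1) * pipelined_rate
print(f"{n} blocking Gets : {blocking_total / 1000:.2f} ms")
print(f"{n} pipelined Gets: {pipelined_total / 1000:.2f} ms")
```

The point of the model is Patrick's: a single Get is expensive (request latency plus reply latency), but posting many concurrently amortizes that cost by an order of magnitude.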
All low level communication interfaces support one-sided communications (not yet released for MX on Myrinet, but GM has it). > However if i show up with 2 pc's and 2 network cards, then it sure matters > when i lose a lot of speed. > > Obviously for embarassingly parallel software this is no issue, but usually > for embarrassingly parallel software all you need is gigabit ethernet. If you can and know how to overlap, latency is irrelevant. It's hard to do on complex irregular codes, but you can usually do it if you can use one-sided communications. Don't put your communications in the critical path. Post them early and post many of them concurrently, pipelining will hide the latency of the critical path. That's why desperate vendors use pipelined pingpong to get better curves. > There is so many MPI applications which are not exactly embarassingly > parallel from which you see that a decent programmer single cpu would be > doing that 20 times faster. Or to quote someone who has been doing such Most of the times, you go parallel to go bigger, not faster. If the problem size fits in one node, don't use a cluster, use a multi-processor nodes. You will have more bangs for your bucks. > So it is very interesting for us all and me especially to understand how > *fast* you can get that memory under full load of all the logical cpu's. Using one-sided communications, there is little difference if the CPUs are loaded or not on the target side. > Third each pc has 2 cheapo k7 processors which are a lot slower than opterons. IO bus is more important for the communications part. I don't know of cheapo k7 machines with a decent PCI bus. However, for 64 bytes, even a cheesy PCI will not slow things down that much. > Dolphin can deliver 'bytes' they say at their homepage in 3.3 us at MPX > mainboards and claim somewhere a paper latency of 1.x us. How long can you hold your breath ? 
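[Editorial aside: the advice to post communications early and keep them off the critical path boils down to overlapping each step's transfer with useful compute. A toy timeline model - all times are made-up assumptions for illustration - shows why latency becomes nearly irrelevant once overlap works.]

```python
# Toy model: per-iteration compute vs. communication cost, with and
# without overlap. Times are made-up assumptions, not measurements.

compute_us = 100.0   # work per iteration (assumed)
comm_us = 40.0       # e.g. a ghost-cell exchange per iteration (assumed)
iters = 1000

serial = iters * (compute_us + comm_us)          # comm in the critical path
overlapped = iters * max(compute_us, comm_us)    # comm hidden by compute

print(f"serial    : {serial / 1000:.1f} ms")
print(f"overlapped: {overlapped / 1000:.1f} ms")
# As long as comm_us <= compute_us, the network time vanishes from
# the critical path entirely - which is why one-sided operations
# posted early can make the wire latency almost invisible.
```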
> What is the achieved read speed to remote memory myrinet gets at 64 bits /
> 66Mhz in software, so ready to use 4-64 bytes for applications?

I have no idea, I am not even sure that I have a 64-bit/66 MHz machine around to measure it. With GM, I would say at least 10 us. Certainly more.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

From patrick at myri.com Mon Feb 7 01:48:22 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com>
Message-ID: <42073966.7090009@myri.com>

Hi Duncan,

duncan.roweth@quadrics.com wrote:
> This example reports the average time for 1000
> blocking get calls. Patrick's description of the
> mechanism is essentially correct, apart from the
> detail that we have a fast path for short operations
> that avoids the need to set up a DMA.

How can you do one-sided operations without a DMA on the target side?!? The only way that I can think of is to map the host virtual memory into the NIC memory space and let all memory writes generate PIO writes to actually modify the NIC memory. Surely, you must be talking about another DMA.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

From john.hearns at streamline-computing.com Mon Feb 7 01:50:58 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <420722CE.5010408@myri.com>
References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> <420722CE.5010408@myri.com>
Message-ID: <1107769858.12606.57.camel@Vigor45>

On Mon, 2005-02-07 at 03:11 -0500, Patrick Geoffray wrote:
> Vincent,
>
> Tip: desperate companies are usually young and spend a lot of VC money
> on marketing.
Quadrics does not fit, I am afraid, they have been around > too long :-) Furthermore, selling old hardware is not very cost > effective for a vendor: compatibility troubles with newer machines, > require to support old hardware in new drivers and new middlewares, tap > in inventory reserved for replacement parts, etc. Why would Quadrics have old/second hand hardware to sell anyway? If they have older model cards unsold they would be holding them as spares for customers who are still running those models, as Patrick says. Clusters which have been upgraded or scrapped are unlikely to be returned to Quadrics/Myricom. Clusters are usually bought as completely integrated systems, from companies such as ourselves. We install and configure the Myrinet networking for customers - they don't buy direct from Myricom. And, like many companies on this list, we provide continuing support and advice. So I'd say there is no conspiracy against you - if you are seeking second hand high performance networking gear, look on eBay or ask nicely on this list. I was surprised recently to see small fibre channel switches go very cheaply on eBay - not so long ago you would pay $$$$ for them. From jcownie at etnus.com Mon Feb 7 08:26:29 2005 From: jcownie at etnus.com (James Cownie) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: Message from Patrick Geoffray of "Mon, 07 Feb 2005 04:48:22 EST." <42073966.7090009@myri.com> References: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com> <42073966.7090009@myri.com> Message-ID: <20050207162629.C478D1C826@amd64.cownie.net> > Patrick Geoffray wrote: > duncan.roweth@quadrics.com wrote: > > This example reports the average time for 1000 > > blocking get calls. Patrick's description of the > > mechanism is essentially correct, apart from the > > detail that we have a fast path for short operations > > that avoids the need to set up a DMA. 
> > How can you do one-sided operations without a DMA on the target side ?!? > > The only way that I can think of is to map the host virtual memory > into the NIC memory space and let all memory writes generates PIO > writes to actually modify the NIC memory. Surely, you must be talking > about another DMA. I think you're talking at cross-purposes. Patrick is right that in the target machine there is a DMA operation initiated by the NIC. However Duncan is saying that Quadrics don't send a DMA request packet over their network, but have a more optimised less general request that they can issue without having to build a full DMA descriptor in the host machine and transfer it to the target. Therefore in Quadrics' terms no DMA operation is sent over the net, whereas from Patricks' viewpoint a DMA operation _does_ occur. -- -- Jim -- James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com From duncan.roweth at quadrics.com Mon Feb 7 01:29:15 2005 From: duncan.roweth at quadrics.com (duncan.roweth@quadrics.com) Date: Wed Nov 25 01:03:45 2009 Subject: [Beowulf] Home beowulf - NIC latencies Message-ID: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com> Patrick, Vincent Some input into your discussion. Here is the data on get latency for Elan4 in an Opteron cluster quorumi: prun -N2 pgping -f get 0 64 1: 4 bytes 2.36 uSec 1.69 MB/s 1: 8 bytes 2.35 uSec 3.40 MB/s 1: 16 bytes 2.38 uSec 6.73 MB/s 1: 32 bytes 2.37 uSec 13.50 MB/s 1: 64 bytes 2.43 uSec 26.30 MB/s This example reports the average time for 1000 blocking get calls. Patrick's description of the mechanism is essentially correct, apart from the detail that we have a fast path for short operations that avoids the need to set up a DMA. You can probably do a bit better on the very fastest nodes, but this is what I see on the system we have in the office. > You want less than 3us per Get of 64 Bytes ? I don't > know if even Quadrics can do it. Yes we can! 
> The good news is that you can pipeline it very well.

Indeed. There is lots of parallelism in the hardware so you can be
processing multiple requests at the same time. In this sequence of
short jobs I measure the average time for 8 byte gets 2 at a time,
4 at a time etc.

quorumi: prun -N2 pgping -f get -b2 8
1:     8 bytes    1.32 uSec    6.07 MB/s
quorumi: prun -N2 pgping -f get -b4 8
1:     8 bytes    1.04 uSec    7.66 MB/s
quorumi: prun -N2 pgping -f get -b8 8
1:     8 bytes    0.84 uSec    9.47 MB/s
quorumi: prun -N2 pgping -f get -b16 8
1:     8 bytes    0.82 uSec    9.79 MB/s
quorumi: prun -N2 pgping -f get -b32 8
1:     8 bytes    0.79 uSec   10.18 MB/s

The limiting factor is the rate at which the remote NIC can read data
over the PCI bus.

Best Wishes

Duncan Roweth
Quadrics Limited

P.S. Clearly our sales people focus on the current product (Elan4
NICs) but we will be supporting the installed base of Elan3 systems
for some years yet. Most of the big systems have extended warranties,
so we keep a stock of spares, but there are a few hundred adapters and
associated switches. Drop us some mail if you are interested.

From duncan.roweth at quadrics.com Mon Feb 7 02:01:26 2005
From: duncan.roweth at quadrics.com (duncan.roweth@quadrics.com)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F627D505E@exch01.quadrics.com>

Patrick

Thanks for your mail.

> How can you do one-sided operations without a DMA on the
> target side ?!?

Gets are done by telling the remote adapter to perform a put back to
the source. This can be a request to start a DMA (for large transfers)
or it can be a request to the Short Transaction ENgine (STEN). The
STEN is a fast path for short puts that can be used from either the
main CPU or from the adapter. It can generate network packets from a
stream of commands and data written either by the main CPU (as PIO
writes) or directly by the adapter.
There are more details in the "Hot Chips" paper that we wrote with
Fabrizio Petrini of Los Alamos.

http://www.c3.lanl.gov/~fabrizio/papers/hot03.pdf

Best Wishes

Duncan Roweth
Quadrics Limited

From rcmanglekar at rediffmail.com Mon Feb 7 06:17:05 2005
From: rcmanglekar at rediffmail.com (Rahul Manglekar)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] How-TO Mysql on Lam-cluster?
Message-ID: <20050207141705.26471.qmail@webmail29.rediffmail.com>

hi all..,

I have set up a LAM-MPI cluster on 3 machines for testing.

I want to put MySQL on the cluster, such that if MySQL needs more
processor power, it can use the processor power of all the nodes that
are present in the cluster.

I am using MySQL-4.0.

Can you guide me please..

Thank you in advance..

-- Rahul..
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20050207/7f5ae763/attachment.html

From mark.westwood at ohmsurveys.com Mon Feb 7 06:39:42 2005
From: mark.westwood at ohmsurveys.com (Mark Westwood)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Newbie Question
In-Reply-To: <5dc04bbf050204112319fe7fbf@mail.gmail.com>
References: <5dc04bbf050204112319fe7fbf@mail.gmail.com>
Message-ID: <42077DAE.8020806@ohmsurveys.com>

Hi Monang

Here's my contribution to your decision about which language you
program in for your cluster:

Suppose that you know Java well, but not C. Suppose that it will take
you 6 months to learn C well enough to be able to write your programs
in it. In those 6 months you can do an awful lot of computing in Java.
If your project is intended to last, say, 9 months, then you might
decide that you will program in Java because you will get more
computing done that way than by learning a new language.

If your project will last much longer then you might decide that
learning C will be of benefit, because each program will be faster in
C than in Java.
If you're doing some calculations then I'd suggest that you allow C to be 5 times faster than Java on average for cluster-type computing. Some will tell you that it is more than 10 times as fast (and it is for some types of computation), others that it is no faster (which is true for some types of computation). Another issue (or problem if you look at things that way ) with Java is that the implementations of MPI for Java are non-standard and not as widely used as the implementations for C. You might find it difficult, therefore, to get good support from groups such as this one, for a Java / MPI program. To sum up: If you can write good Java programs to solve your problems on your cluster then you should prefer that to writing bad C (or Fortran) programs. If you find that your Java program is not fast enough then you might think about rewriting parts of it in C (or another compiled language) to achieve specific performance improvements. Hope this helps Mark Monang Setyawan wrote: > Hi. I'm a newbie in this parallel computing thing. > (sorry for my bad english, I'm Indonesian) > > My current project is a software that analyze DNA/Protein sequence > data that needs high performance aspect on it. I plan to deploy this > software on network of workstations (mm, may be just about 10 PCs on > the network). Am I in wrong place now? > > I am going to use message passing paradigm (MPI) to write the > software. I've read that there are several choice of MPI > implementation. The problem is, I'm bad in both C or Fortran (I > usually use Java as my favorite language). Some source said that Java > (or it's MPI wrapper or pure MPI implementation) isn't good enough to > implement a parallel computing solution. Is that right? > > My third question is, is there any pdf/ps/one file version of > "Engineering a Beowulf-style Compute Cluster''? > > Thanks in advance. 
>

--
Mark Westwood
Parallel Programmer
OHM Ltd
The Technology Centre
Offshore Technology Park
Claymore Drive
Aberdeen
AB23 8GD
United Kingdom

+44 (0)870 429 6586
www.ohmsurveys.com

From deadline at clusterworld.com Mon Feb 7 07:34:44 2005
From: deadline at clusterworld.com (Douglas Eadline, Cluster World Magazine)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <20050204202023.GA32459@piskorski.com>
Message-ID: 

On Fri, 4 Feb 2005, Andrew Piskorski wrote:

> On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
>
> > Please note MPI is probably what i'll use, though i keep finding
> > online information about 'gamma'. Is that faster latency than MPI
> > implementations?
>
> http://www.disi.unige.it/project/gamma/
>
> Gamma is a non-TCP/IP Linux 2.6.x network driver for Intel Pro/1000
> gigabit ethernet cards, for use with MPI. It offers much better
> latency (11 us or so) than TCP/IP over ethernet (maybe 60 or 100 us),
> but worse than the specialized HPC interconnects (maybe 3 us).

The "60-100 us" is incorrect. With proper tuning an e1000 can get 25us
latency (using netpipe). (See Josip's post about tuning parameters.)
Oh, and by the way, this was using a 32-bit PCI desktop card. A low
latency number is not the whole story, however; processor load is
another issue. The point is that tuning can make a difference. Default
values are usually set for maximum throughput and low CPU overhead. It
all depends on what your application needs. If you need GAMMA, then
that is a good choice, but many applications may work well with proper
tuning of NIC parameters.

As an aside, Netgear used to sell a low cost desktop NIC
(GA302T-tigon3/Broadcom) which had very good numbers as well. I
profiled this NIC in the first issue of ClusterWorld.
Doug

> The attraction of GAMMA, is that Intel Pro/1000 cards can be had for
> $11 to $60 or so each (depending on exact model, etc.), and gigabit
> switches are also pretty cheap, while SCI or Myrinet is somewhere in
> the $500 to $1500 per node range (I don't keep track).
>
> So if your application can benefit from lower latency, but you want
> something really cheap, GAMMA should be well worth trying.
>
> --

----------------------------------------------------------------
Editor-in-chief
ClusterWorld Magazine
Desk: 610.865.6061
Fax: 610.865.6618
www.clusterworld.com

From rokrau at yahoo.com Mon Feb 7 09:01:38 2005
From: rokrau at yahoo.com (Roland Krause)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] memory allocation on x86_64 returning huge addresses
Message-ID: <20050207170138.72473.qmail@web52907.mail.yahoo.com>

I am trying to dynamically allocate memory for a Fortran-77 code that
is supposed to run in I4 R4 mode on an x86_64 running SuSE-9.2 with a
kernel.org 2.6.9 kernel. The machine has 8GB memory and memory has to
be allocated in one large chunk.

The problem is that malloc returns an address that is way beyond
8 billion, which is not what I had expected.

Does anybody know why Linux gives me an address that is outside the
physical memory range?

Does anybody know whether there are any kernel parameters that affect
this behavior?

Any pointers to some good reading about the Linux VM would also be
appreciated.

Regards
Roland

__________________________________
Do you Yahoo!?
Yahoo! Mail - now with 250MB free storage. Learn more.
http://info.mail.yahoo.com/mail_250

From kus at free.net Mon Feb 7 09:37:17 2005
From: kus at free.net (Mikhail Kuzminsky)
Date: Wed Nov 25 01:03:45 2009
Subject: [Beowulf] How-TO Mysql on Lam-cluster?
In-Reply-To: <20050207141705.26471.qmail@webmail29.rediffmail.com>
Message-ID: 

In message from "Rahul Manglekar" (7 Feb 2005 14:17:05 -0000):
>
>hi all..,
>
>i have setup up LAM-MPI cluster on 3 machine for testing.
>
>i want do put mysql on cluster..,,
>such that if mysql need more processor power ,
>it can use processor power of all nodes that are present in cluster.

No. Usual MySQL isn't capable of using cluster nodes in parallel.
But you may work w/special software which allows you to split your
database between cluster nodes. You may find the corresponding
information at the MySQL site, or also search the Beowulf maillist
archive (if I remember right, there was some discussion of databases
in clusters here).

Yours
Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow

>
>i am using MySQL-4.0.
>
>can u guide me please..
>
>thank you in advance..
>
>
>-- Rahul..

From James.P.Lux at jpl.nasa.gov Mon Feb 7 09:55:34 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] Newbie Question
In-Reply-To: <42077DAE.8020806@ohmsurveys.com>
References: <5dc04bbf050204112319fe7fbf@mail.gmail.com> <42077DAE.8020806@ohmsurveys.com>
Message-ID: <6.1.1.1.2.20050207093914.027efd38@mail.jpl.nasa.gov>

At 06:39 AM 2/7/2005, Mark Westwood wrote:
>Hi Monang
>
>Here's my contribution to your decision about which language you program
>in for your cluster:
>
>Suppose that you know Java well, but not C. Suppose that it will take you
>6 months to learn C well enough to be able to write your programs in
>it. In those 6 months you can do an awful lot of computing in Java. If
>your project is intended to last, say, 9 months, then you might decide
>that you will program in Java because you will get more computing done
>that way than by learning a new language.
>
>If your project will last much longer then you might decide that learning
>C will be of benefit, because each program will be faster in C than in
>Java. If you're doing some calculations then I'd suggest that you allow C
>to be 5 times faster than Java on average for cluster-type
>computing.
>Some will tell you that it is more than 10 times as fast (and
>it is for some types of computation), others that it is no faster (which
>is true for some types of computation).

I would agree with Mark. I've been faced by a similar decision: do we
do the calculations in Excel using Visual Basic for Applications (VBA),
Visual Basic, or C++, or Matlab, or something else? Various pieces of
the puzzle exist in all of these, so the problem is do we translate
(for example) the VB into C, or glue it all together with scripts, or
rewrite from scratch? Complicating this is that the people available to
work on it have various skill sets which don't map well to any of the
approaches (how many people do YOU know who are equally facile in VBA,
C++, and Matlab??). In our case, the goal was to demonstrate that a
particular capability can exist at all, versus making it really fly,
so we went with the cobbled-together scripts. It might turn out, after
all, that the speed of the software isn't the "rate determining"
factor, but that availability of staff is.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875

From rene at renestorm.de Mon Feb 7 09:53:02 2005
From: rene at renestorm.de (rene)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] mpich future
Message-ID: <200502071853.02126.rene@renestorm.de>

Hi folks,

there are many MPI implementations out there, but which one is "the
best"? As far as I know, there are commercial products which support
different hardware in one library (e.g. Myrinet + Ethernet), which is
a nice feature.

Is there a working MPICH which unites the common channels? SCore did
that once, but it's been a year since I worked with it.

In addition to that, I've run into trouble with the different
standards (1.2, 2.0). It seems to me that OpenMPI is gaining more
influence. Is that right?
I don't feel like putting 20 different preprocessor variables in my
applications, like

#if MPI_VERSION > 1

for each of those implementations.

So my question is: in which direction is MPI going tomorrow?

Cu
--
Rene Storm
@Cluster

From daniel.kidger at quadrics.com Mon Feb 7 10:18:33 2005
From: daniel.kidger at quadrics.com (daniel.kidger@quadrics.com)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] memory allocation on x86_64 returning huge addresses
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F62812104@exch01.quadrics.com>

Roland,

Sigh! :-) malloc can return any address it so wishes. Don't forget
that this is a *virtual* address and so is not bounded by physical
memory. A 64-bit O/S with say 8GB RAM can easily have stack addresses
in the window 1TB - 2TB, and heap addresses even higher (!)

I guess your real problem is that you are porting a (Fortran) program
whose authors did not understand that it might ever run on a 64-bit
machine. Your code does a malloc and then tries to store this in an I4
Fortran integer. This would only be guaranteed to work on a 32-bit
architecture like say a Pentium.

So, solutions?
1. Since this is x86_64, simply compile your program with a 32-bit
compiler. You can still run under the 64-bit O/S.
2. Mend your application to store addresses in I8 variables, but keep
I4 for other stuff if you wish.
3. (dubious) Only save the lower 32 bits of the addresses in your I4
variables, and then when they are used add the known offset to yield
the original 64-bit address. The offset is likely to be constant for
all variables in your programs, but ymmv.
4. Port your code away from using malloc() altogether. Recently (well,
make that 15 years), Fortran has had its own dynamic memory
allocation - the ALLOCATE statement.

Hope this helps,

Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.
daniel.kidger@quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- > -----Original Message----- > From: Roland Krause [mailto:rokrau@yahoo.com] > Sent: 07 February 2005 17:02 > To: beowulf@beowulf.org > Subject: [Beowulf] memory allocation on x86_64 returning huge > addresses > > > I am trying to dynamically allocate memory for a Fortran-77 code that > is supposed to run in I4 R4 mode on an x86_64 running SuSE-9.2 with a > kernel.org 2.6.9 kernel. The machine has 8GB memory and memory has to > be allocated in one large chunk. > > The problem is that malloc returns an address that is way beyond > 8billion which is not what I had expected. > > Does anybody why Linux gives me an address that is outside > the physical > memory range? > > Does anybody whether there are any kernel parameter that affect this > behavior? > > Any pointers to some good reading about the Linux VM would also be > appreciated. > > > Regards > Roland > > > > > __________________________________ > Do you Yahoo!? > Yahoo! Mail - now with 250MB free storage. Learn more. > http://info.mail.yahoo.com/mail_250 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > From daniel.kidger at quadrics.com Mon Feb 7 10:34:28 2005 From: daniel.kidger at quadrics.com (daniel.kidger@quadrics.com) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Home beowulf - NIC latencies Message-ID: <30062B7EA51A9045B9F605FAAC1B4F62812105@exch01.quadrics.com> Duncan wrote (in reply to Patrick) > > The good news is that you can pipeline it very well. > > Indeed. There is lots of parallelism in the hardware > so you can me processing multiple requests at the same > time. In this sequence of short jobs I measure the > average time for 8 byte gets 2 at a time, 4 at a time > etc. 
>
> quorumi: prun -N2 pgping -f get -b2 8
> 1:     8 bytes    1.32 uSec    6.07 MB/s
> quorumi: prun -N2 pgping -f get -b4 8
> 1:     8 bytes    1.04 uSec    7.66 MB/s
> quorumi: prun -N2 pgping -f get -b8 8
> 1:     8 bytes    0.84 uSec    9.47 MB/s
> quorumi: prun -N2 pgping -f get -b16 8
> 1:     8 bytes    0.82 uSec    9.79 MB/s
> quorumi: prun -N2 pgping -f get -b32 8
> 1:     8 bytes    0.79 uSec   10.18 MB/s

Or for those that distrust quoting pure powers of two in benchmarks
and/or know too much bash:

[dan@quorumi]$ for ((i=1,j=1;$i<999;i=$i+$j,j=$i)) ;do echo -ne "pipelining \t$i:\t"; prun -N2 pgping -f get -b$i 64|cut -c20-35; done
pipelining 	1:	2.39 uSec
pipelining 	2:	1.38 uSec
pipelining 	3:	1.24 uSec
pipelining 	5:	1.05 uSec
pipelining 	8:	0.92 uSec
pipelining 	13:	0.91 uSec
pipelining 	21:	0.86 uSec
pipelining 	34:	0.80 uSec
pipelining 	55:	0.78 uSec
pipelining 	89:	0.79 uSec
pipelining 	144:	0.77 uSec
pipelining 	233:	0.73 uSec
pipelining 	377:	0.77 uSec
pipelining 	610:	0.78 uSec
pipelining 	987:	0.77 uSec

Note that the above is for 64 *byte* reads, which iirc is what Vincent
was targeting.

Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.
daniel.kidger@quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK
0117 915 5505
----------------------- www.quadrics.com --------------------

From lindahl at pathscale.com Mon Feb 7 10:41:08 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] memory allocation on x86_64 returning huge addresses
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F62812104@exch01.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F62812104@exch01.quadrics.com>
Message-ID: <20050207184108.GA1364@greglaptop.internal.keyresearch.com>

> The problem is that malloc returns an address that is way beyond
> 8billion which is not what I had expected.

This e-vile hack makes it produce something lower in memory. What it
does is turn off the glibc malloc algorithm's feature that has it
mmap() large malloc()s.
Stuff into a .c, link the .o into your application.

-- greg

#include <malloc.h>
#include <stdlib.h>

static void mem_init_hook(void);
static void *mem_malloc_hook(size_t, const void *);
static void *(*glibc_malloc)(size_t, const void *);
void (*__malloc_initialize_hook)(void) = mem_init_hook;

static void mem_init_hook(void)
{
  mallopt (M_MMAP_MAX, 0);
}

From rross at mcs.anl.gov Mon Feb 7 11:07:43 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] mpich future
In-Reply-To: <200502071853.02126.rene@renestorm.de>
References: <200502071853.02126.rene@renestorm.de>
Message-ID: 

Hi Rene,

You are right that there are a decent number of MPI implementations
out there, all with their pros and cons. There is no "best"
implementation; in fact I would say that the existence of multiple
implementations is helpful to the community by providing (a) multiple
takes on how to build these libraries, and (b) competition between the
implementations to be the "best" at what they think is most important.

I'm not sure what you mean by "trouble with the different standards"?
All implementations should at this point be striving for complete 2.0
compliance, and there are very few things from 1.x that won't work in
a 2.0 compliant system (the group defining the standard went to great
pains, as do the developers, to maintain this compatibility). So you
shouldn't need those preprocessor variables. What functionality are
you finding that you need to test for?

I would say that at this time MPICH2 has as much influence as any
implementation, because it is being used as the basis for multiple
Cray platform implementations, the IBM BG/L implementation, the OSU IB
implementation, and of course as-is on Windows, OS X, and Linux
clusters. Of course I am part of the MPICH2 team, so I am biased :).
OpenMPI will undoubtedly be an influential member of the MPI community
once the software is made widely available.
That group also has a collection of developers with very good track records in this area, and I look forward to being able to compare and contrast the designs and resulting performance. The big buzz in the MPI world right now is fault tolerance. I think this topic is going to be a hot one for some time, and there are definitely differences of opinion on how the MPI implementation should deal with faults and to what degree and how users should be made aware of failures, both transient and catastrophic. Less visible, but at least as important, is figuring out how best to implement the one-sided (RMA) operations that are part of MPI 2.0. My colleague Rajeev Thakur has (in my opinion) done an excellent job of these, building in part on concepts from the BSP system of old. Figuring out how to make collectives as efficient as possible on new, very large machines is also extremely important for those that have access to these new machines. Gheorghe Almasi from IBM had an excellent paper discussing collectives on the BG/L machine in last year's EuroPVM/MPI conference. Rolf Rabenseifner and Jesper Traff both presented improvements to collective algorithms as well. These two were iterative improvements I'd say, so less exciting in some sense, but it is critical that we make these algorithms as efficient as possible, given the scale of upcoming systems. If you are really interested in what is happening in MPI, the best place by far to look is the EuroPVM/MPI series of conferences and their proceedings. This is where everyone that is serious about MPI implementations is publishing and going to talk with colleagues, and every year the conference attendee list is literally a list of the most knowledgable MPI developers in the world (and hangers-on such as myself). Regards, Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab On Mon, 7 Feb 2005, rene wrote: > there are many mpi implementations out there, but which one ist "the best"? 
> As far as I know, there are commercial prodcuts which support different > hardware in one library (eg myrinet + ethernet). Which is a nice feature. > > Is there a working mpich which unites the common channels? > Score did that once, but it's a year ago, since I've worked with it. > > In addition to that I've ran into trouble with the different standarts (1.2, > 2.0). > It seems to me that Openmpi gets more influence. Is that right? > > I dont feel like put 20 different preprocessor variables on my applications, > like > #if MPI_VERSION > 1 > for each of that implementation. > > So my question is: > In which direction goes mpi tomorrow? > > Cu > > -- > Rene Storm > @Cluster From mwill at penguincomputing.com Mon Feb 7 11:11:46 2005 From: mwill at penguincomputing.com (Michael Will) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] How-TO Mysql on Lam-cluster? In-Reply-To: References: Message-ID: <4207BD72.1070505@penguincomputing.com> MySQL-4.1 has cluster support according to http://dev.mysql.com/downloads/cluster/ but I have not checked out how and what. In any case I would expect it to NOT use MPI for anything. Michael Mikhail Kuzminsky wrote: > In message from "Rahul Manglekar" (7 Feb > 2005 14:17:05 -0000): > >> >> hi all.., >> >> i have setup up LAM-MPI cluster on 3 machine for testing. >> >> i want do put mysql on cluster..,, such that if mysql need more >> processor power , it can use processor power of all nodes that are >> present in cluster. > > No. Usual MySQL isn't capable to use cluster nodes in parallel. > But you may work w/special software which allows to split your > database between cluster nodes. You may find the corresponding > information at mysql site or also search Beowulf maillist archive > (if I remember right, it was some discussion of databases in cluster > here). > > Yours > Mikhail Kuzminsky > Zelinsky Institute of Organic Chemistry > Moscow > >> >> i am using MySQL-4.0. >> >> can u guide me please.. >> >> thank you in advance.. 
>>
>>
>> -- Rahul..
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From rokrau at yahoo.com Mon Feb 7 12:18:00 2005
From: rokrau at yahoo.com (Roland Krause)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] memory allocation on x86_64 returning huge addresses
In-Reply-To: <20050207184108.GA1364@greglaptop.internal.keyresearch.com>
Message-ID: <20050207201800.97502.qmail@web52909.mail.yahoo.com>

Greg,
thanks a lot for this hint. I will try it.

Quick question: so this will let me sbrk all the available memory
then? Is there a way to tell it to allocate all available memory with
mmap instead?

I used to hack the kernel and change TASK_UNMAPPED_BASE in order to
get all memory from the box in one large chunk. I guess I should have
raised the value instead of lowering it. I really would like to find
some docs about this...

Again thanks!
Roland

--- Greg Lindahl wrote:

> > The problem is that malloc returns an address that is way beyond
> > 8billion which is not what I had expected.
>
> This e-vile hack makes it produce something lower in memory. What it
> does
> is turns off glibc's malloc algorithm's feature that has it mmap()
> large
> malloc()s. Stuff into a .c, link the .o into your application.
>
> -- greg
>
> #include <malloc.h>
> #include <stdlib.h>
>
> static void mem_init_hook(void);
> static void *mem_malloc_hook(size_t, const void *);
> static void *(*glibc_malloc)(size_t, const void *);
> void (*__malloc_initialize_hook)(void) = mem_init_hook;
>
> static void mem_init_hook(void)
> {
> mallopt (M_MMAP_MAX, 0);
> }
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

__________________________________
Do you Yahoo!?
Yahoo!
Mail - Helps protect you from nasty viruses.
http://promotions.yahoo.com/new_mail

From rene at renestorm.de Mon Feb 7 13:56:27 2005
From: rene at renestorm.de (rene)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] How-TO Mysql on Lam-cluster?
In-Reply-To: <4207BD72.1070505@penguincomputing.com>
References: <4207BD72.1070505@penguincomputing.com>
Message-ID: <200502072256.27614.rene@renestorm.de>

Hi,

> MySQL-4.1 has cluster support according to
> http://dev.mysql.com/downloads/cluster/

As far as I know they use the ndb daemon to generate the db nodes, but
every node has full access to the database; it isn't shared over the
disks. Just in case you have a really huge db.

http://www.emicnetworks.com/ has its own implementation too.

--
Rene Storm
@Cluster

From rene at renestorm.de Mon Feb 7 15:37:05 2005
From: rene at renestorm.de (rene)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] mpich future
In-Reply-To: 
References: <200502071853.02126.rene@renestorm.de> <200502072246.02830.rene@renestorm.de>
Message-ID: <200502080037.05409.rene@renestorm.de>

Hi Rob,

> The MQbench project does look interesting. Sort of a GUI version of
> SkaMPI?

It's something like the Pallas benchmark, but not all MPI calls are
implemented yet. It's nice to choose one bunch of nodes and then a
second one in the same application and see the differences. It's
ordinary C MPI code surrounded by a C++ Qt GUI.

> If you write an MPI program, it should work with all MPI implementations
> (modulo missing MPI-2 features). It will not necessarily cleanly link
> with any arbitrary library, in the same way that a C program will not
> dynamically link with any arbitrary C library.
>
> So there is always going to be an issue of recompilation; is that your
> second concern?

Yes it is. It's probably possible to make software packages available
for common Linux distributions, but if you have to consider several
MPI implementations as well, that could be a lot of packages.
If you have written a major application like LS-DYNA you can say: take
this Linux, this compiler and this MPI and you get our compiled
version. But nobody will alter their cluster for an add-on program
like an MPI copy tool. So the only choice is to go open source. But in
some areas it is important that (small or large) companies can make
professional, supported software available, and this isn't easy with
MPI.

Regards,
Rene

From hasan at grant.phys.subr.edu Mon Feb 7 18:15:25 2005
From: hasan at grant.phys.subr.edu (Saleem Hasan)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] Newbie question on mpich2 installation
Message-ID: 

Hello all,

I apologise for what may be a very simple issue but is giving me
trouble. I would really appreciate some advice.

For learning the setup of a cluster, I have installed mpich2 on a
Linux machine with Red Hat 8.0. I have a second machine, also RH 8.0.
w2 is the master and w1 is the slave. I have installed mpich2 on w2
(/home/mpi) and used NFS to share /home with w1. I have also set up
passwordless ssh between w1 and w2. I am able to bring up mpd on the
local machine (w2) and do mpdtrace and mpdallexit. I am following the
installation procedure from the MPICH2 home page.

I am unable to boot mpd on the slave. The first time I ran
mpdboot -n 2 -f /home/mpi/mpd.hosts, I got the message that there was
no mpd.conf file on w1 and that this could be a reason for the mpd not
coming up on the slave. I added an mpd.conf (secretword) to /etc on
the slave also.
Now I get a different message:

[root@w2 mpich2-1.0]# mpdboot -n 2 -f /home/mpi/mpd.hosts
mpdboot_w2.maverick.net_0 (mpdboot 357): error trying to start mpd(boot) at 1 w1.maverick.net; output:
mpdboot_w1_1 (err_exit 379): mpd failed to start correctly on w1
reason: 1: invalid msg from mpd :{}:
mpdboot_w1_1 (err_exit 385): contents of mpd logfile in /tmp:
logfile for mpd with pid 1654
mpdboot_w2.maverick.net_0 (err_exit 379): mpd failed to start correctly on w2.maverick.net

Even though the message says mpd failed to start correctly on w2 (last
line), mpdtrace gives w2.

The log file on w1 (slave) states the following:

logfile for mpd with pid 1654
w1_1060 failed ; cause: unable to obtain socket for rhs in ring
traceback: [('/home/mpi/mpich2-install/bin/mpd.py', '1192', '_enter_existing_ring'),
('/home/mpi/mpich2-install/bin/mpd.py', '173', '_mpd_init'),
('/home/mpi/mpich2-install/bin/mpd.py', '1374', '?')]

Thank you very much.

Saleem Hasan

From list-beowulf at onerussian.com Tue Feb 8 06:16:19 2005
From: list-beowulf at onerussian.com (Yaroslav Halchenko)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] cheap 48 port gigabit ethernet switch w/ jumbo frames?
In-Reply-To: <41EFF701.60905@pa.msu.edu>
References: <200501201700.j0KH0PfQ032360@bluewest.scyld.com> <41EFF701.60905@pa.msu.edu>
Message-ID: <20050208141619.GM2996@washoe.rutgers.edu>

In my latest research on switches I've found the SMC8648T (48 ports),
which does support 9K jumbo frames, costs $2400, and is managed.

Does anyone have experience with this switch, or should I also check
out Nortel switches, which are approx 50% more expensive?

--
Yarik

On Thu, Jan 20, 2005 at 01:22:57PM -0500, Tom Rockwell wrote:

> Hi,
> I'm looking for a switch that will be used for NFS traffic on a cluster
> of about 40 nodes. The nodes will have Broadcom 5704 ethernet. From
> what I've read, jumbo frames is important for getting the best NFS
> performance over gigabit ethernet.
> D-link and Netgear have newer 48 port switches priced below managed > switches. The D-link is model DGS-1248T > http://dlink.com/products/?sec=2&pid=367 and the Netgear is model GS748T > http://netgear.com/products/details/GS748T.php. Each are about $1200 or > so. I'm unable to find info on their websites specifying whether these > switches support jumbo frames. Anyone know? > Thanks, > Tom Rockwell > Michigan State University > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- .-. =------------------------------ /v\ ----------------------------= Keep in touch // \\ (yoh@|www.)onerussian.com Yaroslav Halchenko /( )\ ICQ#: 60653192 Linux User ^^-^^ [175555] Key http://www.onerussian.com/gpg-yoh.asc GPG fingerprint 3BB6 E124 0643 A615 6F00 6854 8D11 4563 75C0 24C8 -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20050208/759c457c/attachment.bin From rossen at VerariSoft.Com Tue Feb 8 06:23:57 2005 From: rossen at VerariSoft.Com (Rossen Dimitrov) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> Message-ID: <4208CB7D.6070309@verarisoft.com> Vincent, Your questions related to the actual cost (in terms of processor overhead) of achieving the latency numbers that are posted by the network vendors are very interesting and have important aspects, which are often overlooked or paid little attention to. Warning: This posting is long and may be boring. 
The ping-pong tests that are often used for measuring communication latency (from user level) exercise an extreme and often unrealistic mode of operation of the parallel system. Moving bytes across the software layers and over the network is fundamental to fast computation, but without looking at the cost, and the likelihood (as Patrick put it, "crossing the fingers"), of actually producing the best quoted latencies, you don't get the whole picture. Besides the network hardware/firmware, the implementation (and use) of the low-level network messaging layer (GM, ELAN, VERBS, etc.) and the MPI library are also of great importance. The design space of parallel applications (size of messages, frequency of messages, regularity in space and time, synchrony, communication pattern, etc.) is too large to hope that any single mode of the entire system would always be optimal. In this regard, the ping-pong latency test, exercising only one of these modes, gives you insufficient information for predicting the behavior of the communication sub-system in realistic scenarios. To address this issue, our MPI/Pro implementation (plug!) has long had different modes of using the network and the low-level messaging layer for all major high-speed networks as well as for TCP/IP communication. We usually support at least two modes: one that optimizes short-message latency (as many other MPI implementations do) at the expense of increased CPU overhead, and one that trades some latency (communication overhead) for low CPU overhead, higher predictability, and much better opportunity for overlapping and pipelining. We have carried out studies quantifying the degree of overlap that these different modes can achieve (using only our MPI implementation, i.e., comparing apples to apples) and have obtained some interesting results.
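[Archive note: the value of the second, low-CPU-overhead mode comes from overlap. If communication can proceed while the CPU computes, each step costs roughly max(compute, comm) plus a small post/completion overhead instead of compute + comm. A toy model of that effect; all numbers are hypothetical illustrations, not MPI/Pro measurements:]

```python
# Toy cost model: per-iteration time without vs. with overlap of
# computation and communication. All timings are made-up examples.

def step_time(compute_us, comm_us, overhead_us, overlap):
    """Time for one compute+communicate step, in microseconds."""
    if overlap:
        # comm proceeds in the background; pay only post/complete overhead
        return max(compute_us, comm_us) + overhead_us
    return compute_us + comm_us

compute, comm = 80.0, 60.0          # hypothetical per-step costs
blocking = step_time(compute, comm, 0.0, overlap=False)   # 140.0
overlapped = step_time(compute, comm, 5.0, overlap=True)  # 85.0
print("speedup: %.2fx" % (blocking / overlapped))         # speedup: 1.65x
```

[The worst case Rossen mentions corresponds to comm not actually progressing in the background, in which case the overlapped time degrades back toward the blocking figure.]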
When you combine all of the complexities of the communication sub-system (network hardware/firmware, messaging layer, MPI library), the application, and the OS (let's only take the virtual memory system, process/thread scheduling, and interrupt/signal handling) you get a highly probabilistic system, which is hard to quantify and predict by a single ping-pong latency number. Our experiments have shown that using a different MPI/Pro mode on the same application code, executed on the same parallel system, can yield sometimes substantially different performance results. This shows that the implementation and the use of the middleware alone can have a substantial impact on your performance and scalability. Further, the application code can be written (not always but often) to take advantage of asynchrony, pipelining, and overlapping. Implementing these mechanisms in your code (using MPI) often doesn't cost much, but can speed up your application quite a bit on many parallel systems (running middleware with the right design) and in the worst case give you no benefit (on systems that don't provide adequate support for these mechanisms). So, if you really want to optimize the use of your cluster resources, in addition to the network and compute nodes, you will need to also consider the communication middleware and the design of your application and how they all work together. -- Rossen Dimitrov Verari Systems Software, Inc. http://www.verarisoft.com Vincent Diepeveen wrote: > At 21:27 5-2-2005 -0500, Patrick Geoffray wrote: > >>Hi Vincent, >> >>Vincent Diepeveen wrote: >> >>>>>CPU's are 100% busy and after i know how many times a second the network >>>>>can handle in theory requests i will do more probes per second to the >>>>>hashtable. The more probes i can do the better for the game tree search. >>>> >>>>With a gigE network that sounds like 40us or so. With Myrinet or IB >>>>it's in the 4-6us range. 
If you bought dual opterons with the special
>>>
>>>
>>> At the quadrics and dolphin homepage they both claim 12+ us for Myrinet.
>>
>> Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X),
>> that includes fibers and a switch in the middle:
>>
>>    Length   Latency(us)   Bandwidth(MB/s)
>>         0      2.684            0.000
>>         1      2.874            0.336
>>         2      2.898            0.690
>>         4      2.978            1.343
>>         8      2.965            2.699
>>        16      2.993            5.347
>>        32      3.409            9.388
>>        64      3.563           17.960
>>       128      3.977           32.185
>>       256      5.699           44.916
>>
>> Quadrics would be lower by 1.5 us, I don't know about Dolphin, I
>> didn't hear about noticeable SCI clusters in a long time.
>>
>>> I am very impressed by the quadrics and dolphin cards. Probably by
>>> infinipath too when i check them out. Will do.
>>>
>>> I'm not so impressed yet by myrinet actually, but if cluster builders can
>>> earn a couple of hundreds of dollars more on each node i'm sure they'll
>
> do it.
>
>> I don't think Myrinet would be the cheapest, I am sure you can get a
>> better deal from desperate interconnect vendors.
>>
>> What does not impress you in Myrinet ?
>
> Thanks for your kind answer Patrick,
>
> Obviously i mentioned that number because i read it elsewhere.
>
> Well, a number of points bother me, the majority of which are true for
> others as well. But first let me note that i'm not against myrinet in
> general. I am just trying to solve a very specific case. For that specific
> case i'm not so impressed.
>
> Note that so far i didn't find any desperate vendor. For sure quadrics
> doesn't look desperate to me; they aren't even selling old cards anymore,
> though they must still have thousands of them lying around from returned,
> upgraded networks. Second hand high-end cards seem to be found very seldom.
>
> First of all i'm interested in how quickly i can get 4-64 bytes from remote
> memory. So not from some kind of network card cache, as myrinet doesn't
> have megabytes on chip, but just a few tens of kilobytes.
The memory > has to come therefore from the remote nodes main memory, at a random adress > in the main memory. No streaming at all happens. that 400 ns extra that the > TLB gives is definitely not the problem i guess. > > The problem for me is to understand: "how do you get that memory at a > cluster?" > > A latency on paper says of course nothing when you can't actually get it > within that time. > > "Paper supports everything." > Arturo Ochoa (Caracas, Venezuela) > > I hope everyone realizes that an important consequence from beowulf > clusters is that you actually want to *use* all those cpu's you have to > your avail. > > So every cpu has a program running that eats 100% system time. Because if > it wouldn't use 100% system time, you wouldn't need a cluster! > >>From that 100% system time obviously you must be prepared to give away some > to serve other nodes as quickly as possible doing a read. > > All latencies i see quoted at all hardware sites, it is very hard to figure > for me out whether that's a latency that is supported by paper, or whether > it's a practical latency i can take into account as a programmer with all > software layers overhead when each cpu is 100% running a program. > > Secondly, but as i'm not a cluster expert i don't know how to avoid that, > it's of course a big LOSS in sequential speed if my program each few > instructions must check whether there is some MPI message to get handled. > If i check a lot that will slow down my program 20 times. If i don't check > a lot, other cpu's will have to wait longer and that defeats the purpose of > a fast network card. > > Factor 20 is about the slowdown of the average 'old' supercomputer > chessprograms which use MPI type solutions. Zugzwang (Paderborn-Siemens), > P.Conners (Paderborn-Siemens), cilkchess (MIT). 
I've been playing with my > own eyes against those programs in world champs and despite that it has > happened that i played at the same hardware with a similar amount of cpu's > and a program having factor 100 more chessknowledge (which slows down the > program *considerable*), the actual speed at which the program searches > nodes was up to factor 5-10 faster. > > Now a few years ago this was not a major problem because for example > Cilkchess which obviously ran factor 20-40 times slower than it could, used > 1800 processors for example in world champs 1995 (Hong kong) and 512 > processors in world champs 1999 (Paderborn). Of course because 1 processor > was real real fast compared to the speed of 1 pc processor in those days, > they practical were searching a lot deeper than pc programs (and both > played excellent for its days, especially Don Dailey needs to get a big > compliment for that). > > However if i show up with 2 pc's and 2 network cards, then it sure matters > when i lose a lot of speed. > > Obviously for embarassingly parallel software this is no issue, but usually > for embarrassingly parallel software all you need is gigabit ethernet. > > There is so many MPI applications which are not exactly embarassingly > parallel from which you see that a decent programmer single cpu would be > doing that 20 times faster. Or to quote someone who has been doing such > rewriting work for some physical applications that run here and there: "I > didn't blink my eyes when i managed to speedup an application factor 1000". > > So it is very interesting for us all and me especially to understand how > *fast* you can get that memory under full load of all the logical cpu's. > > Third each pc has 2 cheapo k7 processors which are a lot slower than opterons. > > Second problem i have is that i can get easily dual k7 pc's from > chessplayers and they can get bought cheap still. 
Dual k7 is practical same > speed like a dual xeon 3.06Ghz Northwood with all memory slots filled with > 2-2-2 DIMMS for DIEP. So just compare the price of such a system with a > cheapo dual k7 with registered cas3 RAM. > > Those dual k7's have 64 bits 66Mhz slots, not pci-x as far as i know and > also those who do have A64's or P4's usually don't have pci-x onboard > either. Sure there is boards that have them and i'm sure that if you make a > network > > Dolphin can deliver 'bytes' they say at their homepage in 3.3 us at MPX > mainboards and claim somewhere a paper latency of 1.x us. > > What is the achieved read speed to remote memory myrinet gets at 64 bits / > 66Mhz in software, so ready to use 4-64 bytes for applications? > > I'm not asking it to be accurate within 400ns, as that's the delay you'll > have from TLB trashing the remote node. But accuracy within 1.5 us would be > quite nice. > > First of all for integer intensive applications i'm doing fastest processor > is opteron, k7 comes second and P4 comes third. Exception is a P4 machine > equipped with the most expensive stuff (2-2-2 ram and all banks filled) > good mainboard and northwoods and overclocked at the mainboard. However for > that price a dual opteron can get bought and it just blows away that P4 > bigtime. > > Every year that new software gets released of course that P4 gets slower, > because newer software only gets more and more complex with more options > and will fit less perfectly in P4's small tiny caches, let alone when we > get a lot of 64 bits programs. They won't fit at all in those tiny slow > caches. > > So until the dual core opterons arrive at low cost, obviously you can make > dual k7 nodes for just a few hundreds of dollar a node. > > When adding new nodes which in the future no doubt are dual opteron, you > still run further with those dual k7 nodes and want to mix them obviously > with dual opterons. Is that possible? 
>
>
>> Patrick
>> --
>>
>> Patrick Geoffray
>> Myricom, Inc.
>> http://www.myri.com
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From josip at lanl.gov Tue Feb 8 08:54:07 2005
From: josip at lanl.gov (Josip Loncaric)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <4208CB7D.6070309@verarisoft.com>
References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> <4208CB7D.6070309@verarisoft.com>
Message-ID: <4208EEAF.105@lanl.gov>

Rossen Dimitrov wrote:
>
> So, if you really want to optimize the use of your cluster resources, in
> addition to the network and compute nodes, you will need to also
> consider the communication middleware and the design of your application
> and how they all work together.

Are there any projects that would expand the ability of MPI application programmers to provide performance hints to the MPI library? For example, hints indicating that certain messages are latency sensitive whereas others need optimal bandwidth and low CPU overhead? One can already obtain some MPI performance data through the PMPI mechanism, and Rossen is helping develop MPI PERUSE (http://www.mpi-peruse.org/), intended to provide even more detail. I'm asking about the other direction of information flow, i.e. performance hints from the application to the MPI layer... Ideally, such hints would be propagated fairly close to the actual hardware, e.g. application hints would guide the MPI library in selecting improved interrupt mitigation strategies used by the network interfaces (assuming that a suitable API exists for the underlying hardware).
Sincerely, Josip

From twilcox at terrascale.com Tue Feb 8 09:35:27 2005
From: twilcox at terrascale.com (Tim Wilcox)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] Call for participation StorCloud at SC2005
Message-ID: <011001c50e04$9757f830$a201a8c0@deepthoughthp>

Hi all, The StorCloud applications and Challenge submission form is now online at http://www.sc-submissions.org and we are currently accepting submissions; the instructions are posted at http://www.vtksolutions.com/StorCloud/2005/StorCloudAppFormHelp.html. The deadline is March 31st. Tim Wilcox Applications Challenge Committee

From natorro at fisica.unam.mx Tue Feb 8 10:27:39 2005
From: natorro at fisica.unam.mx (Carlos Lopez Nataren)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] G5 beowulf cluster
Message-ID: <1107887259.2124.16.camel@natorro>

Hello, we at the physics institute in Mexico have four Xserve G5s and we would like to use them as a Beowulf. My first question is about what operating system to use: we've been using Linux for our other clusters, even a G4 one, but I haven't seen anything about the G5. Is there a Linux distribution that runs well on this type of machine, or would I be better off using the operating system they came with? And is there any documentation out there outlining the way they should be configured to be used as a Beowulf? Thank you very much for any help. natorro
-- Carlos Lopez Nataren Instituto de Fisica, UNAM

From dag at sonsorol.org Tue Feb 8 10:51:25 2005
From: dag at sonsorol.org (Chris Dagdigian)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] G5 beowulf cluster
In-Reply-To: <1107887259.2124.16.camel@natorro>
References: <1107887259.2124.16.camel@natorro>
Message-ID: <42090A2D.4090204@sonsorol.org>

The Mac OS X Server OS that came with your Xserve G5s is quite good, and you'll find that the developer tools, compilers, and cluster scheduler systems like Grid Engine or LSF are all working and well supported. The scientific community of G5/Xserve users is growing quite rapidly.
If you are familiar with Linux the learning curve for OS X is not all that bad. http://www.apple.com/science/ -- may help http://www.apple.com/server/macosx/ -- also You may want to make the OS decision based on what physics apps you need to run and how *they* are supported on OS X vs Linux. I can't help you there as I'm a life sciences / biology person. In our lab we've installed both Gentoo Linux as well as Yellow Dog on Xserve G5s. Both seemed to install and run smoothly but for production clustering work we still use the OS X Server OS. -Chris Carlos Lopez Nataren wrote: > Hello, we, at the physics institute in Mexico have got four Xserve G5 > and we would like to use them as a beowulf, my first doubt is about what > operating system to use, we've been using linux for our other clusters, > even a G4 one, but I haven't seen anything about G5, is there a linux > distribution that runs well on this type of machines? or do I better use > the operating system they came with? or are there any documentation out > there outlining the way they should be configured to be used as a > beowulf? > > Thank you very much for any help. > natorro > -- Chris Dagdigian, BioTeam - Independent life science IT & informatics consulting Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193 PGP KeyID: 83D4310E iChat/AIM: bioteamdag Web: http://bioteam.net From hahn at physics.mcmaster.ca Tue Feb 8 11:26:32 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] G5 beowulf cluster In-Reply-To: <1107887259.2124.16.camel@natorro> Message-ID: > Hello, we, at the physics institute in Mexico have got four Xserve G5 > and we would like to use them as a beowulf, my first doubt is about what > operating system to use, we've been using linux for our other clusters, I would strongly encourage you to try both Linux and Mac OS/X because doing so would permit a VERY interesting and useful comparison. 
the comparison is interesting because Mac OS/X is tuned very differently from Linux - even more than the differences it inherits from its *BSD heritage. for instance, if you run LMBench on the two machines, you'll see that certain syscalls are drastically different in speed. obviously, Macophiliacs and Apple sales reps would be scandalized at this idea. but the truth is that the Xserve hardware is reasonably competitive with dual-xeon alternatives, and in a cluster, no one really cares about PDF imaging models or other traditional Apple qualities. what matters is things like TCP stack efficiency, syscall overheads, etc.

> even a G4 one, but I haven't seen anything about G5, is there a linux
> distribution that runs well on this type of machines? or do I better use

I've heard of yellowdog linux; there are probably many other flavors (perhaps even a fedora version?). ultimately, the distro is almost irrelevant to a cluster, since it's the kernel, booting and FS that matter, not .999 of userspace. incidentally, I've measured the power consumption of a ppc970fx (90nm, 2.0 GHz) system, under load, and found it to be marginally cooler than, say, a similar-speed HP DL145 (dual-opteron). we're talking 200 vs 220W. this is old news; what's new is that up-coming 90nm Opterons appear to change the picture fairly dramatically, since they drop the TDP from 89 to 65W. and of course, for those of you who are cache-friendly, dual-core opterons at 95W TDP are rather attractive. (ie, dual ppc970/2.0's with 3.2 GB/s apiece vs four DC opteron/2.2's with 3.2 GB/s apiece. 100W/p vs maybe 60W/p, hmmm.) regards, mark hahn.
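[Archive note: the watts-per-processor figures at the end are just whole-system draw divided by processor count. The 240 W figure below for a four-core dual-DC-Opteron box is an assumed extrapolation to match the "maybe 60W/p" guess, not a measurement from the post:]

```python
# Per-processor power from whole-system draw. 200 W is the measured
# dual-ppc970fx figure from the post; 240 W for a 4-core Opteron box
# is a hypothetical extrapolation.

def watts_per_proc(system_watts, procs):
    return system_watts / procs

print(watts_per_proc(200, 2))   # 100.0 W/p  (dual ppc970fx under load)
print(watts_per_proc(240, 4))   # 60.0 W/p   (assumed 4-core Opteron box)
```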
From dtj at uberh4x0r.org Tue Feb 8 11:35:49 2005 From: dtj at uberh4x0r.org (Dean Johnson) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] G5 beowulf cluster In-Reply-To: <1107887259.2124.16.camel@natorro> References: <1107887259.2124.16.camel@natorro> Message-ID: <1107891349.5042.10.camel@terra> On Tue, 2005-02-08 at 12:27 -0600, Carlos Lopez Nataren wrote: > Hello, we, at the physics institute in Mexico have got four Xserve G5 > and we would like to use them as a beowulf, my first doubt is about what > operating system to use, we've been using linux for our other clusters, > even a G4 one, but I haven't seen anything about G5, is there a linux > distribution that runs well on this type of machines? or do I better use > the operating system they came with? or are there any documentation out > there outlining the way they should be configured to be used as a > beowulf? > The native OSX should be fine. It sort depends on the applications that you intend on using. Lots of the major apps seem to have efforts to make them work well on the altivec machines. There was a problem, something about semaphores I believe, that caused problems with MPI apps. I ran into it trying to get benchmark numbers for Amber and Gromacs on 4 G5 towers that I was playing with. -Dean From idooley at isaacdooley.com Tue Feb 8 13:55:12 2005 From: idooley at isaacdooley.com (Isaac Dooley) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] G5 beowulf cluster In-Reply-To: <200502082000.j18K09iS022564@bluewest.scyld.com> References: <200502082000.j18K09iS022564@bluewest.scyld.com> Message-ID: <42093540.6010801@isaacdooley.com> I've used the new ~600 node G5 Xserve cluster named turing: http://www.cse.uiuc.edu/turing/ It works, and is using OSX 10.3. I've used YellowDog Linux personally on a few G3 and G4 machines, and have had good experiences. If you want to do very fine grained parallel computation, one important thing to do is to disable all unneeded system daemons. 
There are a bunch of these in YDL and OSX. Also, depending on your needs for 64-bit addressing, you may need YDL until OSX 10.4 is released(it is available to developers now if you really want it). I'm not sure if you can disable the GUI for OSX, which may be a minor resource waster. Also you may want to consider Darwin without OSX. Darwin is the open source kernel used by OSX. One thing we've noticed with our OSX is that connect() sometimes takes too long to complete. Hopefully I can figure out why this is. Isaac Dooley >Hello, we, at the physics institute in Mexico have got four Xserve G5 >and we would like to use them as a beowulf, my first doubt is about what >operating system to use, we've been using linux for our other clusters, >even a G4 one, but I haven't seen anything about G5, is there a linux >distribution that runs well on this type of machines? or do I better use >the operating system they came with? or are there any documentation out >there outlining the way they should be configured to be used as a >beowulf? > > From ole at scali.com Wed Feb 9 04:24:15 2005 From: ole at scali.com (Ole W. Saastad) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <200502070951.j179oSDB010742@bluewest.scyld.com> References: <200502070951.j179oSDB010742@bluewest.scyld.com> Message-ID: <1107951855.5682.31.camel@pc-2.office.scali.no> Dear all, this thread reminded us, that we promised to post HPCC numbers depicting differences between interconnects, not interconnects and software stacks in combination. The numbers below stems from a fairly old system (400MHz FSB, PCI-X, etc.) and does not reflect the absolute performance achievable on modern hardware. Similar, the NICs used are _not_ the latest and greatest. The intent is simply to show the effect of different interconnects, on the four simple (excluding PTRANS etc) communication metrics measured by HPCC. (see web page http://icl.cs.utk.edu/hpcc/) Gigabit Eth. 
                            SCI     Myrinet   InfiniBand
Max Ping Pong Latency :    36.32      4.44       8.65        7.36
Min Ping Pong Bandw.  :   117.01    121.31     245.31      359.21
Random Ring Bandw.    :    37.59     47.70      69.30       18.02
Random Ring Latency   :    42.17      8.91      19.02        9.94

Latency in microseconds and bandwidth in MBytes/s (1e6 bytes/s); the four columns are Gigabit Ethernet, SCI, Myrinet and InfiniBand. The HPCC version is 0.8 and the very same binary (and Scali MPI Connect library) is used for all interconnects (change of interconnect is done by -net tcp|sci|gm0|ib0 on the command line).

Cluster information:
16 x Dell PowerEdge 2650 2.4 GHz
Dell PowerConnect 5224 GBE switch
Mellanox HCA, Infinicon InfiniIO 3000
Myrinet 2000
Dolphin SCI 4x4 Torus
Scali MPI Connect version : scampi-3.3.7-2.rhel3
Mellanox IB driver version : thca-linux-3.2-build-024
GM version : 2.0.14

--
Ole W. Saastad, Dr.Scient.
Manager Cluster Expert Center
dir. +47 22 62 89 68 fax. +47 22 62 89 51 mob. +47 93 05 74 87
ole@scali.com Scali - www.scali.com High Performance Clustering

From joachim at ccrl-nece.de Thu Feb 10 01:15:54 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <420580AD.5050003@myri.com>
References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> <420580AD.5050003@myri.com>
Message-ID: <420B264A.7050004@ccrl-nece.de>

Patrick Geoffray wrote:
> Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X),
> that includes fibers and a switch in the middle:
>
> Length Latency(us) Bandwidth(MB/s)
> 0 2.684 0.000
[...]

Nice work, Patrick - but such numbers are of little value if the benchmark used to get them is not stated. I'd recommend mpptest (from MPICH). Plus, the compiler etc. is also of interest when it comes to latencies.
Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From joachim at ccrl-nece.de Thu Feb 10 01:28:27 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <4208EEAF.105@lanl.gov> References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> <4208CB7D.6070309@verarisoft.com> <4208EEAF.105@lanl.gov> Message-ID: <420B293B.9060604@ccrl-nece.de> Josip Loncaric wrote: > Are there any projects that would expand the ability of MPI application > programmers to provide performance hints to the MPI library? For > example, hints indicating that certain messages are latency sensitive > whereas others need optimal bandwidth and low CPU overhead? MPI offers a lot of different send modes already. If you use a ready send, the MPI library can assume that you are interested in low-latency delivery; if you use a non-blocking send, it should be o.k. for the library to assume that you are interested in overlapping computation and communication and so on. On the receiving side, a hybrid polling-blocking approach for receiving can be applied. I do not think that there is serious demand for more explicit "steering" of the MPI library. User's make much to little use of the existing ways (that I described above). But, if you really want to do such stuff, you could use (implementation-specific) attributes which you assign to different communicators, one for "low-latency" delivery and one for "low-cpu", or whatever. But this has more effect on the sending side than on the receiving side. I wouldn't invest work into this unless you have very good reasons. Esp. as this would be non-portable, few users would ever take notice. 
Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From landman at scalableinformatics.com Thu Feb 10 13:40:52 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] A thread-safe PRNG for an OpenMP progra Message-ID: <420BD4E4.4080401@scalableinformatics.com> Hi folks: I need to get a thread-safe pseudo-random number generator. All I have found online was SPRNG which is set up for MPI. Anyone have a quick pointer to their favorite thread safe PRNG that works well in OpenMP? Thanks. Joe -- Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com From maurice at harddata.com Thu Feb 10 12:35:06 2005 From: maurice at harddata.com (Maurice Hilarius) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> Message-ID: <420BC57A.5060007@harddata.com> ---------------------------------------------------------------------- >Message: 1 >Date: Thu, 10 Feb 2005 10:15:54 +0100 >From: Joachim Worringen >Subject: Re: [Beowulf] Home beowulf - NIC latencies > >Patrick Geoffray wrote: > > >>Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), >>that includes fibers and a switch in the middle: >> >> Length Latency(us) Bandwidth(MB/s) >> 0 2.684 0.000 >> >> >[...] > >Nice work, Patrick - but such numbers are of little value if the >benchmark used to get them is not stated. I'd recommend mpptest (from >MPICH). Plus, the compiler etc. is also of interest when it comes to >latencies. > > Joachim > > > True, but it does not change the facts. Further, all of these lovely benchmarks lack one really important detail: Comparisons between different interfaces and drivers MUST show CPU usage while running them. 
If I have a fantastic device that uses infinitely small time (latency) and moves huge amounts of data (bandwidth) but in doing so it takes 80% of a CPU, we do not have a useful solution.. That is where Myrinet and Quadrics shine, and also this is the detail that the various OB vendors carefully dance around. All the communications performance in the world does not matter if it consumes a large amount of CPU cycles. A further test that some vendors artfully avoid is the actual latency of all nodes in a cluster across the switching device. I have seen a number of "benchmarks" showing great numbers, but on looking closer a great number of them are either on two computers, directly connected, or are on switching networks that use a number of small switches, and they do not show the worst case latency across all the switches, on the greater number of hops. So, your points are excellent, Joachim, but I have to say that even greater degrees of information are needed before any meaningful conclusions may be drawn. What we all need is some form of useful standardized benchmarks that looks like real world code from a number of different disciplines, that we can use to test the hardware, so we may compare results in a meaningful manner. With our best regards, Maurice W. Hilarius Telephone: 01-780-456-9771 Hard Data Ltd. FAX: 01-780-456-9772 11060 - 166 Avenue email:maurice@harddata.com Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20050210/feb53a3e/attachment.html From lindahl at pathscale.com Thu Feb 10 18:36:20 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420BC57A.5060007@harddata.com> References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> <420BC57A.5060007@harddata.com> Message-ID: <20050211023619.GB5174@greglaptop.internal.keyresearch.com> On Thu, Feb 10, 2005 at 01:35:06PM -0700, Maurice Hilarius wrote: > Further, all of these lovely benchmarks lack one really important detail: > Comparisons between different interfaces and drivers MUST show CPU usage > while running them. No. If you want to look at that, run a real application and watch the wall time. It's extremely hard to get a good estimate of cpu usage out of a microbenchmark, and running "top" or /bin/time to do it is definitely bogus. > If I have a fantastic device that uses infinitely small time (latency) > and moves huge amounts of data (bandwidth) but in doing so it takes 80% > of a CPU, we do not have a useful solution.. If large cpu usage is a problem, it will show up nicely in real application benchmarks. > What we all need is some form of useful standardized benchmarks that > looks like real world code from a number of different disciplines, that > we can use to test the hardware, so we may compare results in a > meaningful manner. Amen. So use the MM5 t3a benchmark, maybe even SPEC HPC, the canned benchmarks for Amber, Charmm, DL_POLY, etc. The NAS Parallel Benchmarks are also good, they are much closer to real apps than microbenchmarks. 
-- greg From eugen at leitl.org Fri Feb 11 06:06:43 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] more details on Cell emerge Message-ID: <20050211140642.GV1404@leitl.org> http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT021005084318&mode=print By: David T. Wang (dwang@realworldtech.com) Updated: 02-10-2005 Back to Basics The fundamental task of a processor is to manage the flow of data through its computational units. However in the past two decades, each successive generation of processors for personal computers has added more transistors dedicated to increasing the performance of spaghetti-like integer code. For example, it is well known that typical integer codes are branchy and that branch mispredict penalties are expensive; in an effort to minimize the impact of branch instructions, transistors were used to develop highly accurate branch predictors. Aside from branch predictors, sophisticated cache hierarchies with large tag arrays and predictive cache prefetch units attempt to hide the complexity of data movement from the software, and further increase the performance of single threaded applications. The pursuit of single threaded performance can be observed in recent years in the proposal of extraordinarily deeply pipelined processors designed primarily to increase the performance of single threaded applications, at the cost of higher power consumption and larger transistor budgets. The fundamental idea of the CELL processor project is to reverse this trend and give up the pursuit of single threaded performance, in favor of allocating additional hardware resources to perform parallel computations. That is, minimal resources are devoted toward the execution of single threaded workloads, so that multiple DSP-like processing elements can be added to perform more parallelizable multimedia-type computations. 
In the examination of the first implementation of the CELL processor, the theme of the shift in focus from the pursuit of single threaded integer performance to the pursuit of multithreaded, easily parallelizable multimedia-type performance is repeated throughout. CELL Basics The CELL processor is a collaboration between IBM, Sony and Toshiba. The CELL processor is expected by this consortium to provide computing power an order of magnitude above and beyond what is currently available to its competitors. The International Solid-State Circuits Conference (ISSCC) 2005 was chosen by the group as the location to describe the basic hardware architecture of the processor and announce the first incarnation of the CELL processor family. Members of the CELL processor family share basic building blocks, and depending on the requirement of the application, specific versions of the CELL processor can be quickly configured and manufactured to meet that need. The basic building blocks shared by members of the CELL family of processors are the following: * The PowerPC Processing Element (PPE) * The Synergistic Processing Element (SPE) * The L2 Cache * The internal Element Interconnect Bus (EIB) * The shared Memory Interface Controller (MIC) and * The FlexIO interface Each SPE is in essence a private system-on-chip (SoC), with the processing unit connected directly to 256KB of private Load Store (LS) memory. The PPE is a dual threaded (SMT) PowerPC processor connected to the SPEs through the EIB. The PPE and SPE processing elements access system memory through the MIC, which is connected to two independent channels of Rambus XDR memory, providing 25.6 GB/s of memory bandwidth. The connection to I/O is done through the FlexIO interface, also provided by Rambus, providing 44.8 GB/s of raw outbound bandwidth and 32 GB/s of raw inbound bandwidth for a total I/O bandwidth of 76.8 GB/s. 
At ISSCC 2005, IBM announced that the first implementation of the CELL processor has been tested to operate at frequencies above 4 GHz. In the CELL processor, each SPE is capable of sustaining 4 FMADD operations per cycle. At an operating frequency of 4 GHz, the CELL processor is thus capable of achieving a peak throughput rate of 256 GFlops from the 8 SPEs. Moreover, the PPE can contribute some amount of additional compute power with its own FP and VMX units. Processor Overview Figure 1 - Die photo of CELL processor with block diagram overlay Figure 1 shows the die photo of the first CELL processor implementation with 8 SPEs. The sample processor tested was able to operate at a frequency of 4 GHz with Vdd of 1.1V. The power consumption characteristics of the processor were not disclosed by IBM. However, estimates in the range of 50 to 80 Watts @ 4 GHz and 1.1 V were given. One unconfirmed report claims that at the extreme end of the frequency/voltage/power spectrum, one sample CELL processor was observed to operate at 5.6 GHz with 1.4 V Vdd and consumed 180 W of power. As described previously, the CELL processor with 8 SPEs operating at 4 GHz has a peak throughput rate of over 256 GFlops. To provide the proper balance between processing power and data bandwidth, an enormously capable system interconnect and memory system interface are required for the CELL processor. For that task, the CELL processor was designed as a Rambus sandwich, with the Redwood Rambus ASIC Cell (RRAC) acting as the system interface on one end of the CELL processor, and the XDR (formerly Yellowstone) high bandwidth DRAM memory system interface on the other end. Finally, the CELL processor has 2954 C4 contacts to the 3-2-3 organic package, and the BGA package is 42.5 mm by 42.5 mm in size. The BGA package contains 1236 contacts, 506 of which are signal interconnects and the remainder are devoted to power and ground interconnects. 
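The peak figures quoted above follow from simple arithmetic and are easy to sanity-check. A quick sketch (assuming, as is conventional, that one FMADD counts as two floating point operations):

```python
# Sanity check of the quoted CELL peak numbers.
# Assumption: one FMADD (fused multiply-add) counts as 2 flops.

spes = 8                 # SPEs in the first implementation
fmadds_per_cycle = 4     # sustained FMADDs per SPE per cycle
clock_ghz = 4.0          # demonstrated operating frequency

peak_gflops = spes * fmadds_per_cycle * 2 * clock_ghz
print(peak_gflops)       # 256.0, matching the article's figure

# FlexIO raw bandwidth: 12 byte lanes at 6.4 GB/s each, split 7 out / 5 in.
outbound_gbs = 7 * 6.4   # ~44.8 GB/s
inbound_gbs = 5 * 6.4    # ~32.0 GB/s
print(round(outbound_gbs, 1), round(inbound_gbs, 1),
      round(outbound_gbs + inbound_gbs, 1))
```

The same kind of back-of-envelope check applies to the memory interface numbers later in the article.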
Logic Depth, Circuit Design, Die Size and Process Shrink Figure 2 - Per stage circuit delay depth of 11 FO4 often left only 5~8 FO4 for logic flow The first incarnation of the CELL processor is implemented in a 90nm SOI process. IBM claims that while the logic complexity of each pipeline stage is roughly comparable to other processors with a per stage logic depth of 20 FO4, aggressive circuit design, efficient layout and logic simplification enabled the circuit designers of the CELL processor to reduce the per stage circuit delay to 11 FO4 throughout the entire design. The design methodology deployed for the CELL processor project provides an interesting contrast to that of other IBM processor projects in that the first incarnation of the CELL processor makes use of a fully custom design. Moreover, the full custom design includes the use of dynamic logic circuits in critical data paths. In the first implementation of the CELL processor, dynamic logic was deployed both for area minimization and for performance enhancement, to reach the aggressive goal of 11 FO4 circuit delay per stage. Figure 2 shows that with the circuit delay depth of 11 FO4, oftentimes only 5~8 FO4 are left for inter-latch logic flow. The use of dynamic logic presents an interesting issue in that dynamic logic circuits rely on the capability of logic transistors to retain a capacitive load as temporary storage. The decreasing capacitance and increasing leakage of each successive process generation means that dynamic logic design becomes more challenging with each successive process generation. In addition, dynamic circuits are reportedly even more challenging on SOI based process technologies. However, circuit design engineers from IBM believe that the use of dynamic logic will not present an issue in the scalability of the CELL processor down to 65 nm and below. 
The argument was put forth that since the CELL processor is a full custom design, the task of process porting with dynamic circuits is no more and no less challenging than the task of process porting on a design without dynamic circuits. That is, since the full custom design requires the re-examination and re-optimization of transistor and circuit characteristics for each process generation, if a given set of dynamic logic circuits becomes impractical for specific functions at a given process node, that set of circuits can be replaced with static circuits as needed. The process portability of the CELL processor design is an interesting topic because the prototype CELL processor is a large device that occupies 221 mm2 of silicon area on the 90 nm process. Comparatively, the IBM PPC970FX processor has a die size of 62 mm2 on the 90 nm process. The natural question then arises as to whether Sony will choose to reduce the number of SPEs to 4 for the version of the CELL processor to appear in the next generation Playstation, or keep the 8 SPEs and wait for the 65 nm process before it ramps up the production of the next generation Playstation. Although no announcements or hints have been given, IBM's belief in regard to the process portability of the CELL Figure 6 - SPE pipeline diagram Table 1 - Unit latencies for SPE instructions. Figure 6 shows the pipeline diagram of the SPE and Table 1 shows the unit latencies of the SPE. Figure 6 shows that the SPE pipeline makes heavy use of the forward-and-delay concept to avoid the latency of a register file access in the case of dependent instructions that flow through the pipeline in rapid succession. One interesting aspect of the floating point pipeline is that the same arrays are used for floating point computation as well as integer multiplication. 
As a result, integer multiplies are sent to the floating point pipeline, and the floating point pipeline bypasses the FP handling and computes the integer multiply. SPE Schmoo Plot Figure 7 - Schmoo plot for the SPE Figure 7 shows the schmoo plot for the SPE. The schmoo plot shows that the SPE can comfortably operate at a frequency of 4 GHz with Vdd of 1.1 V, consuming approximately 4 W. The schmoo plot also reveals that due to the careful segmentation of signal path lengths, the design is far from being wire delay limited. Frequency scaling relative to voltage continues past 1.3 V. This schmoo plot also contributes to the plausibility of the unconfirmed report that the CELL processor could operate at upwards of 5.6 GHz. "Unknown" Functional Units: ATO and RTB Oftentimes when a paper relating to a complex project is written collaboratively by a group of people, details are lost. Still, it appeared rather humorous that of the six design engineers and architects from the CELL processor project present at Tuesday evening's chat session, no one could recall what the acronyms ATO and RTB stood for. ATO and RTB are functional blocks labeled in the floorplan of the SPE. However, the functionality of these blocks and the meaning of the acronyms were neither noted on the floorplan, nor explained in the paper, nor mentioned in the technical presentation. In an effort to cover all the corners, this author placed the question on a list of questions to be asked of the CELL project team members. Hilarity thus ensued as slightly embarrassed CELL project members stared blankly at each other in an attempt to recall the functionality or definition of the acronyms. In all fairness, since the SPE was presented on Monday and the CELL processor itself was presented on Tuesday, CELL project members responsible for the SPE were not present for Tuesday evening's chat sessions. 
As a result, the team members responsible for the overall CELL processor and internal system interconnects were asked to recall the meaning of acronyms of internal functional units within the SPE. Hence, the task was unnecessarily complicated by the absence of key personnel who would have been able to provide the answer faster than the CELL processor can rotate a million triangles by 12 degrees about the Z axis. After some discussion (and more wine), it was determined that the ATO unit is most likely the Atomic (memory) unit responsible for coherency observation/interaction with dataflow on the EIB. Then, after the injection of more liquid refreshments (CH3CH2OH), it was theorized that the RTB most likely stood for some sort of Register Translation Block whose precise functionality was unknown to those outside of the SPE. However, this theory would turn out to be incorrect. Finally, after sufficient numbers of hydrocarbon bonds had been broken down into H-OH on Wednesday, a member of the CELL processor team tracked down the relevant information, and he writes: The R in RTB is an internal 1 character identifier that denotes that the RTB block is a unit in the SPE. The TB in RTB stands for "Test Block". It contains the ABIST (Array Built In Self Test) engines for the Local Store and other arrays in the SPE, as well as other test related control functions for the SPE. Element Interconnect Bus The Element Interconnect Bus is the on-chip interconnect that ties together all of the processing, memory, and I/O elements on the CELL processor. The EIB is implemented as a set of four concentric rings routed through portions of the SPE, where each ring is a 128 bit wide interconnect. To reduce coupling noise, the wires are arranged in groups of four and interleaved with ground and power shields. To further reduce coupling noise, the direction of data flow alternates between each adjacent ring pair. 
Data travels on the EIB through staged buffer/repeaters at the boundaries of each SPE. That is, data is driven by one set of staged buffers and latched by the buffer at the next stage every clock cycle. Data moving from one SPE through other SPEs requires the use of repeaters in the intermediary SPEs for the duration of the transfer. Independently from the buffer/repeater elements, separate data on/off ramps exist in the BIU of the SPE, as data targeted for the LS unit of a given SPE can be off-loaded at the BIU. Similarly, outgoing data can be placed onto the EIB by the BIU. Figure 8 - Counter rotational rings of the EIB - 4 SPEs shown The design of the EIB is specifically geared toward the scalability of the CELL processor. That is, signal path lengths on the EIB do not change regardless of the number of SPEs in a given CELL processor configuration. Since the data travels no more than the width of one SPE, more SPEs on a given CELL processor simply means that the data transport latency increases by the number of additional hops through those SPEs. Data transfer through the EIB is controlled by the EIB controller, and the EIB controller works with the DMA engine and the channel controllers to reserve the buffer drivers for a certain number of cycles for each data transfer request. The data transfer algorithm works by reserving channel capacity for each data transfer, thus providing support for real time applications. Finally, the design and implementation of the EIB has a curious side effect in that it limits the current version of the CELL processor to expand only along the horizontal axis. Thus, the EIB enables the CELL processor to be highly configurable: SPEs can be quickly and easily added or removed along the horizontal axis, and the maximum number of SPEs that can be added is set by the maximum width of the chip allowable by the reticle size of the fabrication equipment. 
The POWERPC Processing Element Neither microarchitectural details nor the performance characteristics of the POWERPC Processing Element were disclosed by IBM during ISSCC 2005. However, what is known is that the PPE processor core is a new core that is fully compliant with the POWERPC instruction set, the VMX instruction set extension inclusive. Additionally, the PPE core is described as a two issue, in-order, 64 bit processor that supports 2 way SMT. The L1 caches of the PPE are reported to be 32 KB each, and the unified L2 cache is 512 KB in size. Furthermore, the lineage of the PPE can be traced to a research project commissioned by IBM to examine high speed processor design with aggressive circuit implementations. The results of this research project were published by IBM first in the Journal of Solid-State Circuits (JSSC) in 1998, then again at ISSCC 2000. The paper published in JSSC in 1998 described a processor implementation that supported a subset of the POWERPC instruction set, and the paper published at ISSCC 2000 described a processor that supported the complete POWERPC instruction set and operated at 1 GHz on a 0.25 µm process technology. The microarchitecture of the research processor was disclosed in some detail in the ISSCC 2000 paper. However, that processor was a single issue processor whose design goal was to reach high operating frequency by limiting pipestage delay to 13 FO4, and power consumption limitations were not considered. For the PPE, several major changes in the design goals dictated changes in the microarchitecture relative to the research processor disclosed at ISSCC in 2000. Firstly, to further increase frequency, the per stage circuit delay design target was lowered from 13 FO4 to 11 FO4. Secondly, limiting power consumption and minimizing leakage current were added as high priority design goals for the PPE. Collectively, these changes limited the per stage logic depth, and the pipeline was lengthened as a result. 
The addition of SMT and the two issue design goal completed the metamorphosis of the research processor into the PPE. The result is a processing core that operates at a high frequency with relatively low power consumption, and perhaps relatively poorer scalar performance compared to the beefy POWER5 processor core. Rambus XDR Memory System Figure 9 - The two channel XDR Memory System To provide machine balance and support the peak rating of more than 256 SP GFlops (or 25-30 DP GFlops), the CELL processor requires an enormously capable memory system. For that reason, two channels of Rambus XDR memory are used to obtain 25.6 GB/s of memory bandwidth. In the XDR memory system, each channel can support a maximum of thirty-six devices connected to the same command and address bus. The data bus for each device connects to the memory controller through a set of bi-directional point-to-point connections. In the XDR memory system, addresses and commands are sent on the address and command bus at a rate of 800 Mbits per second (Mbps), and the point-to-point interface operates at a data rate of 3.2 Gbps. Using DRAM devices with 16 bit wide data busses, each channel of XDR memory can sustain a maximum bandwidth of 102.4 Gbps (2 x 16 x 3.2), or 12.8 GB/s. The CELL processor can thus achieve a maximum bandwidth of 25.6 GB/s with a 2 channel, 4 device configuration. The obvious advantage of the XDR memory system is the bandwidth that it provides to the CELL processor. However, in the configuration illustrated in figure 9, the maximum of 4 DRAM devices means that the CELL processor is limited to 256 MB of memory, given that the highest capacity XDR DRAM device is currently 512 Mbits. Fortunately, XDR DRAM devices could in theory be reconfigured in such a way that up to 36 XDR devices can be connected to the same 36 bit wide channel and provide a 1 bit wide data bus each to the 36 bit wide point-to-point interconnect. 
In such a configuration, a two channel XDR memory system can support upwards of 16 GB of ECC protected memory with 256 Mbit DRAM devices, or 32 GB of ECC protected memory with 512 Mbit DRAM devices. As a result, the CELL processor could in theory address a large amount of memory if the price premium of XDR DRAM devices could be minimized. IBM did not release detailed information about the configuration of the XDR memory system. One feature to watch for in the future is ECC support in the DRAM memory system. Since ECC support is clearly not a requirement for a processor to be used in a game machine, the presence of ECC support would likely indicate IBM's ambition to promote the use of CELL processors in applications that require superior reliability, availability and serviceability, such as HPC, workstation or server systems. Incidentally, Toshiba is a manufacturer of XDR DRAM devices. Presumably it brought the XDR memory controller and memory system design expertise to the table, and could ramp up production of XDR DRAM devices as needed. FlexIO System Interface At ISSCC 2005, Rambus presented a paper on the FlexIO interface used on the CELL processor. However, the presentation was limited to describing the physical layer interconnect. Specifically, the difficulties of implementing the Redwood Rambus ASIC Cell on IBM's 90 nm SOI process were examined in some detail. While circuit level issues regarding the challenges of designing high speed I/O interfaces on an SOI based process are in their own right extremely intriguing topics, the focus of this article is geared toward the architectural implications of the high bandwidth interface. As a result, the circuit level details will not be covered here. Interested readers are encouraged to seek out details on Rambus's Redwood technology separately. What is known about the system interface of the CELL processor is that the FlexIO consists of 12 byte lanes. 
Each byte lane is a set of 8 bit wide, source synchronous, unidirectional, point-to-point interconnects. The FlexIO makes use of differential signaling to achieve a data rate of 6.4 Gb per second per signal pair, and that data rate in turn translates to 6.4 GB/s per byte lane. The 12 byte lanes are asymmetric in configuration. That is, 7 byte lanes are outbound from the CELL processor, while 5 byte lanes are inbound to the CELL processor. The 12 byte lanes thus provide 44.8 GB/s of raw outbound bandwidth and 32 GB/s of raw inbound bandwidth for a total I/O bandwidth of 76.8 GB/s. Furthermore, the byte lanes are arranged into two groups of ports: one group of ports is dedicated to non-coherent off-chip traffic, while the other group is usable for coherent off-chip traffic. It seems clear that Sony itself is unlikely to make use of a coherent, multiple CELL processor configuration for Playstation 3. However, the fact that the PPE and the SPEs can snoop traffic transported through the EIB, and that coherency traffic can be sent to other CELL processors via a coherent interface, means that the CELL processor can indeed be an interesting processor. If nothing else, the CELL processor should enable startups that propose to build FlexIO based coherency switches to garner immediate interest from venture capitalists. Summary The CELL processor presents an intriguing alternative in its pursuit of performance. It seems to be a foregone conclusion that the CELL processor will be an enormously successful product, and that millions of CELL processors will be sold as the processors that power the next generation Sony Playstation. However, IBM has designed some features into the CELL processor that clearly reveal its ambition in seeking new applications for the CELL processor. At ISSCC 2005, much fanfare was generated by the rating of 256 GFlops @ 4 GHz for the CELL processor. 
However, it is the little-mentioned double precision capability and the yet-undisclosed system level coherency mechanism that appear to be the most intriguing aspects that could enable the CELL processor to find success not just inside the Playstation, but outside of it as well. References [1] J. Silberman et al., "A 1.0-GHz Single-Issue 64-Bit PowerPC Integer Processor," IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, Nov. 1998. [2] P. Hofstee et al., "A 1 GHz Single-Issue 64b PowerPC Processor," International Solid-State Circuits Conference Technical Digest, Feb. 2000. [3] N. Rohrer et al., "PowerPC in 130nm and 90nm Technologies," International Solid-State Circuits Conference Technical Digest, Feb. 2004. [4] B. Flachs et al., "A Streaming Processing Unit for a CELL Processor," International Solid-State Circuits Conference Technical Digest, Feb. 2005. [5] D. Pham et al., "The Design and Implementation of a First-Generation CELL Processor," International Solid-State Circuits Conference Technical Digest, Feb. 2005. [6] J. Kuang et al., "A Double-Precision Multiplier with Fine-Grained Clock-Gating Support for a First-Generation CELL Processor," International Solid-State Circuits Conference Technical Digest, Feb. 2005. [7] S. Dhong et al., "A 4.8 GHz Fully Pipelined Embedded SRAM in the Streaming Processor of a CELL Processor," International Solid-State Circuits Conference Technical Digest, Feb. 2005. [8] K. Chang et al., "Clocking and Circuit Design for a Parallel I/O on a First-Generation CELL Processor," International Solid-State Circuits Conference Technical Digest, Feb. 2005. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net 
From mathog at mendel.bio.caltech.edu Fri Feb 11 08:17:34 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] cooling question: cfm per rack? Message-ID: In designing a computer room two key factors are: 1. Power in (electricity) 2. Power out (A/C) The second term really has two parts: A. the amount of air moved B. the reduction in temperature of that air across the A/C unit The latter part is specified in tons. The A/C guys I've spoken with recently utilize some more or less standard relationship between cubic feet per minute (cfm) and A/C tons for the units they maintain. These run off the campus cold water supply, so it makes sense that heat out is proportional to flow across, assuming that the cold water has a very large heat capacity. However, in terms of cooling the units themselves, the amount of air flow through the racks is also important. That flow is also in cfm. Ideally cfm through the racks would equal cfm through the A/C, i.e., all air goes once through the racks and then directly through the A/C. Even more ideally, cfm through _each_ rack could be modulated somehow, since some racks move much more air than others, and putting a low flow rack next to a high flow rack might drive the air the wrong way through the low flow unit. How does one calculate an optimal cfm through a rack? For a specific example with round numbers, let's say it's a 25U rack, it dissipates 10kW, and it has a single 50 cfm output fan per 1U node. (I.e., all air out must go through that path.) There seem to be a bunch of variables that are hard to deal with. For instance, adding up the exhaust fans gives 50*25 = 1250 cfm. Is that all there is to it? But that type of fan only runs at the stated flow rate if the pressures are exactly as specified. 
Without incredibly careful balancing of the pressure across the rack it won't generally run at 50 cfm. Is cfm the key unit here, or should one think in terms of pressure at various points in the room? Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From joachim at ccrl-nece.de Fri Feb 11 10:49:48 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050211023619.GB5174@greglaptop.internal.keyresearch.com> References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> <420BC57A.5060007@harddata.com> <20050211023619.GB5174@greglaptop.internal.keyresearch.com> Message-ID: <420CFE4C.6050003@ccrl-nece.de> Greg Lindahl wrote: > On Thu, Feb 10, 2005 at 01:35:06PM -0700, Maurice Hilarius wrote: >>If I have a fantastic device that uses infinitely small time (latency) >>and moves huge amounts of data (bandwidth) but in doing so it takes 80% >>of a CPU, we do not have a useful solution.. > > If large cpu usage is a problem, it will show up nicely in real > application benchmarks. True. I always wonder what the low-CPU-usage advocates want the MPI process to do while, e.g., an MPI_Send() is executed. For small messages (which are critical for many applications), it's somewhat like requesting that a local memory write has to show low CPU usage. Of course, I can think of scenarios in which data transfers w/o CPU usage do promise advantages, and I have implemented and evaluated such techniques myself. But in the end (for the application), it always boiled down to latency and bandwidth, as most applications don't honor "true" asynchronous communication. The latest unsuccessful case of uncoupling computation and MPI communication I read about was BG/L when using the second CPU as a message processor. 
Maybe Myrinet MX will behave differently by making the MPI itself more concurrent on hardware level (is this a correct description, Patrick?) - but it will need matching applications, too. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From rgb at phy.duke.edu Fri Feb 11 11:02:23 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] cooling question: cfm per rack? In-Reply-To: References: Message-ID: On Fri, 11 Feb 2005, David Mathog wrote: > In designing a computer room two key factors are: > > 1. Power in (electricity) > 2. Power out (A/C) > > The second term really has two parts: > > A. the amount of air moved > B. the reduction in temperature of that air across the A/C unit > > The latter part is specified in tons. The A/C guys I've spoken > with recently utilize some more or less standard relationship > between cubic feet per minute (cfm) and A/C tons for the units they > maintain. These run off the campus cold water supply, so > it makes sense that heat out is proportional to flow across, assuming > that the cold water has a very large heat capacity. > > However, in terms of cooling the units themselves, the amount of > air flow through the racks is also important. That flow is > also in cfm. Ideally cfm through the racks would be equal to cfm > through the A/C, ie, all air goes once through the racks and then > directly through the A/C. Even more ideally cfm through _each_ rack > could be modulated somehow, since some racks move much more > air than others and putting a low flow rack next to a high flow rack > might drive the air the wrong way through the low flow unit. > > How does one calculate an optimal cfm through a rack? > > For a specific example with round numbers, let's say it's a > 25U rack, dissipates 10kW, and has a single 50 cfm per minute output > fan per 1U node. (Ie, all air out must go through that path.) 
> > There seem to be a bunch of variables that are hard to deal with. > For instance, adding the exhaust fans would be 50*25 = 1250 cfm. > Is that all there is to it? But that type of fan only runs at > the stated flow rate if the pressures are exactly as specified. > Without incredibly careful balancing of the pressure across the > rack it won't generally run at 50 cfm. > > Is cfm the key unit here or should one think in terms of pressure > at various points in the room? I can't answer all your questions here, but you've pointed out a lot of the problems. You have to arrange for the blower to deliver chilled air to the right places in the room, and you ALSO have to arrange for a warm air return that picks up the warmed air (after it has passed through the systems and cooled them) and returns it to be cooled and cycled again. The overall airflow is determined by those two things -- cool air being delivered at an overpressure, warm air being returned at an underpressure, and the intermediate pressure gradient (interacting with intervening obstacles such as the racks full of equipment) determining the flow pattern. That flow pattern needs to avoid things like "hot spots" that are isolated from the overall cooling flow, especially hot spots that ultimately feed rack intake, and flow that feeds the warmed exhaust from one or more units back into the cool air intake of others. Ultimately, this is a nonlinear problem with turbulence and other factors, and hence difficult to make pronouncements on without knowing the geometry of your rack layout and other stuff. One reason that raised floor designs are popular is that they make establishing a clean circulation pattern a bit simpler -- feed cold air from the bottom right into the rack intakes, vent the warmed air from their outflow directly into a warm air return. The cooling air mixes minimally with ambient room air and is relatively easy to balance. 
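To put one first-order number on the "how does one calculate an optimal cfm" question: ignoring all of the pressure-balancing issues, conservation of energy alone fixes the airflow needed for a given heat load and air temperature rise. A rough sketch, assuming sea-level air; the 14 C rise below is simply the value implied by the example numbers in the question, not a recommendation:

```python
# Airflow required to carry away a heat load, from Q = rho * V * cp * dT.
# Assumptions: sea-level dry air, rho ~ 1.2 kg/m^3, cp ~ 1005 J/(kg K);
# real rooms need margin for bypass, leakage and hot spots.

def cfm_required(watts, delta_t_c, rho=1.2, cp=1005.0):
    """Volume flow in cfm to remove `watts` of heat with an air
    temperature rise of `delta_t_c` degrees Celsius across the rack."""
    m3_per_s = watts / (rho * cp * delta_t_c)  # energy balance
    return m3_per_s * 35.31 * 60               # m^3/s -> ft^3/min

# The example 10 kW rack with 1250 cfm of exhaust fans corresponds,
# working backwards, to roughly a 14 C air temperature rise:
print(round(cfm_required(10000, 14)))  # roughly 1250 cfm
```

So the stated fan capacity and heat load are mutually consistent only if the rack actually sustains that temperature rise; a smaller allowed rise demands proportionally more airflow, which is where the pressure-balancing problems come back in.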
In a simpler overhead cold air delivery, warm air return system you'll need to be able to balance the cold air delivery at several points in the room, perhaps blowing it down directly into the front (intake) faces of opposing racks, while letting the warm air get pulled along the ceiling to one or more major return vents. That way you can get a delivered cold-air-down, in-through-rack, out-from-rack, warm-air-up, warm-air-along-ceiling and returned sort of pattern established that is consistent and balanceable among the delivery registers throughout the room. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From idooley at isaacdooley.com Fri Feb 11 12:39:29 2005 From: idooley at isaacdooley.com (Isaac Dooley) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <200502112000.j1BK0DNm021457@bluewest.scyld.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> Message-ID: <420D1801.9090206@isaacdooley.com> >True. I always wonder what the low-CPU-usage-advocates want the MPI >process to do while i.e. an MPI_Send() is executed. > They don't want the process to do anything when they call MPI_Send, however, carefully using asynchronous or non-blocking messaging would ideally not use the CPU. Using MPI_ISend() allows programs to not waste CPU cycles waiting on the completion of a message transaction. This is critical for some tightly coupled fine grained applications. Also it allows for overlapping computation and communication, which is beneficial.
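[The overlap pattern described above is the classic MPI_Isend()/MPI_Wait() idiom. A toy stand-in, with a plain Python thread playing the role of the NIC or progress engine — purely illustrative, not a real MPI binding — shows the shape of it:]

```python
# Toy model of MPI_Isend / compute / MPI_Wait overlap. A background
# thread stands in for the NIC moving bytes, so the main thread's
# computation genuinely proceeds while the "transfer" is in flight.
# Illustrative only -- not an MPI binding; names are made up.
import threading
import time

def isend(payload, wire_time=0.05):
    """Start an asynchronous 'transfer'; return a request handle."""
    done = threading.Event()
    def progress():
        time.sleep(wire_time)   # stand-in for time on the wire
        done.set()
    threading.Thread(target=progress, daemon=True).start()
    return done

def wait(request):
    """Block until the transfer completes (the MPI_Wait analogue)."""
    request.wait()

request = isend(b"halo-exchange payload")   # returns immediately
acc = sum(i * i for i in range(200_000))    # useful work meanwhile
wait(request)                               # now ensure it finished
# With overlap the wall time is roughly max(compute, wire_time),
# not their sum -- which is the whole point of the non-blocking call.
print("transfer complete, partial sum =", acc)
```

[Whether a real implementation actually makes progress in the background is, as the rest of this thread notes, implementation-dependent.]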
Isaac Dooley From lindahl at pathscale.com Fri Feb 11 13:03:35 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420CFE4C.6050003@ccrl-nece.de> References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> <420BC57A.5060007@harddata.com> <20050211023619.GB5174@greglaptop.internal.keyresearch.com> <420CFE4C.6050003@ccrl-nece.de> Message-ID: <20050211210335.GE1256@greglaptop.internal.keyresearch.com> On Fri, Feb 11, 2005 at 07:49:48PM +0100, Joachim Worringen wrote: > The latest unsuccessful case of uncoupling computation and MPI > communication I read about was BG/L when using the second CPU as a > message processor. Yep, "offload" that improves performance is more complicated than it seems. The new InfiniPath adapter aims at raw latency and bandwidth excellence, because this is always helpful. It's also frequently helpful to be able to send directly out of cache, for medium-sized packets, instead of using send dma, which has to flush cache to main memory. Memory bandwidth isn't free. Getting more concurrency, by the way, is as much a hardware issue as a software issue. InfiniPath's hardware is dumb, but highly pipelined. Most offload engines seem to have less pipelining. And cpu software overhead generally scales nicely with additional cpus... -- greg From lindahl at pathscale.com Fri Feb 11 13:21:38 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420D1801.9090206@isaacdooley.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> Message-ID: <20050211212137.GA2278@greglaptop.internal.keyresearch.com> On Fri, Feb 11, 2005 at 02:39:29PM -0600, Isaac Dooley wrote: > Using MPI_ISend() allows programs to not waste > CPU cycles waiting on the completion of a message transaction. 
This is > critical for some tightly coupled fine grained applications. We do pretty much the same thing for MPI_Send and MPI_ISend for small packets: they're nearly on the wire when the routine returns, and the subsequent MPI_Wait is a no-op. This is actually pretty common among MPI implementations. The problem with trying to generalize about what MPI calls do is that different implementations do different things with them. Reading the standard won't teach you much about implementations. -- greg From rross at mcs.anl.gov Fri Feb 11 13:47:39 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420D1801.9090206@isaacdooley.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> Message-ID: On Fri, 11 Feb 2005, Isaac Dooley wrote: > >True. I always wonder what the low-CPU-usage-advocates want the MPI > >process to do while i.e. an MPI_Send() is executed. > > > They don't want the process to do anything when the call MPI_Send, > however carefully using asynchronous or non-blocking messaging ideally > would not use the CPU. Unless your code is multi-threaded, why do you care what the CPU utilization is during MPI_Send()? Saving on the power bill? When you call MPI_Send() semantically you've said "Hey, send this, and btw I can't do anything else until you are done." Likewise for MPI_Recv(). So the implementation will be built to get things done as quickly as possible. Often the path to lowest latency leads to polling, which leads to the high CPU utilization. Same issue with interrupt mitigation, as mentioned earlier in the thread; you can save CPU by coalescing, or you can get better performance. > Using MPI_ISend() allows programs to not waste CPU cycles waiting on the > completion of a message transaction. No, it allows the programmer to express that it wants to send a message but not wait for it to complete right now. 
The API doesn't specify the semantics of CPU utilization. It cannot, because the API doesn't have knowledge of the hardware that will be used in the implementation. > This is critical for some tightly coupled fine grained applications. What exactly is critical for tightly coupled, fine grained applications? I would think that extremely low latency communication would be the most important factor, not whether or not we crank on the CPU to get that. > Also it allows for overlapping computation and communication, which is > beneficial. Sure! Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab From rbw at ahpcrc.org Fri Feb 11 14:11:14 2005 From: rbw at ahpcrc.org (Richard Walsh) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050211212137.GA2278@greglaptop.internal.keyresearch.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <20050211212137.GA2278@greglaptop.internal.keyresearch.com> Message-ID: <420D2D82.5050609@ahpcrc.org> Greg Lindahl wrote: >On Fri, Feb 11, 2005 at 02:39:29PM -0600, Isaac Dooley wrote: > > > >>Using MPI_ISend() allows programs to not waste >>CPU cycles waiting on the completion of a message transaction. This is >>critical for some tightly coupled fine grained applications. >> >> > >We do pretty much the same thing for MPI_Send and MPI_ISend for small >packets: they're nearly on the wire when the routine returns, and >the subsequent MPI_Wait is a no-op. This is actually pretty common >among MPI implementations. > >The problem with trying to generalize about what MPI calls do is that >different implementations do different things with them. Reading >the standard won't teach you much about implementations. 
> >-- greg >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > Right. Small messages are where latency matters anyway. As the message size dwindles, the remaining overhead is mostly intrinsic to the subroutine call and unavoidable. What is to be done? The only choice is to squeeze out the subroutine call itself with a different programming model (say UPC) and a memory and instruction set architecture that supports single instruction (preferably pipelined with a block/vector length and stride option to hide latency) remote memory addressing. Additions like the STEN on the Quadrics Elan4 and Hypertransport directly from remote processor cache are cluster hardware morphs taking things in the direction of GAS systems like the Cray X1 and SGI Altix. rbw From mathog at mendel.bio.caltech.edu Fri Feb 11 14:59:55 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:46 2009 Subject: Fw: Re: [Beowulf] cooling question: cfm per rack? Message-ID: Mike, I've been trying to pick the brains of other folks on the beowulf list who have computer rooms with modern equipment. One problem with the existing A/C, with regards to future expansion, is apparently the total amount of air that it can move. This is all horrendously complicated and needs to be looked at carefully by an HVAC consultant. Pretty sure we have enough tons and flow for now, meaning my rack and Deshaies and everything else I know is going in there in a couple of months. More and more convinced that we don't have enough to handle multiple full racks of the next generation of computers. Jim Lux from JPL answered my questions as attached after my signature. His back of the envelope calculations for a 5kW rack (roughly equal to what I have now) give a requirement for 1800 cfm flow through the rack.
The current A/C is, according to the A/C guy who was here, good for only 5500 cfm. However, since I don't know what the inlet or outlet temperatures on the rack are going to be (ie, the temperature of the air the A/C returns to the room and how hot the air is coming out the back) the required cfm may be quite different from this. Hmm, let me go borrow a thermometer and measure it, 22 C in, 32 C out the back, on the node in the middle of the rack. So there's 10 degrees across my rack and he assumes 15. Anyway, a safer estimate for the next generation is 10kW, and there are people who predict 20kW, so total airflow through the A/C seems unlikely to be sufficient a few years down the road. Assuming that people put this equipment in the room. Sorry to be vague, there are just so many unknowns. I also talked to Darryl Willick, who runs a bunch of machine rooms on campus for Chemistry and some of Rees, Bjorkman and Mayo's stuff. His main room is about at capacity now with 6 full racks and a few odds and ends. He has 2 x 250A panels in there and apparently only a 45kW A/C unit. That second number is really odd because they aren't usually rated that way, but that's the number he remembered. If he's right that's 45000/3500 ≈ 13 tons, roughly the same as the unit currently in the Rees area. He said his had to be serviced recently because they were having overheating problems, but only a belt was changed. Unknown how many cfm it is. He has a small workstation area that is somehow or other connected to his machine room ventilation-wise, and apparently when they prop the door open in the workstation area it causes problems in the machine room. So maybe it would make sense to put a small separate A/C unit in the proposed classroom to avoid those sorts of complications in the future. Or maybe it can tap off building air. Darryl did say something interesting though, he said that for some units the A/C people can increase the capacity by changing the pulleys around.
Apparently this blows more air, and the cold water isn't limiting, so it effectively upgrades the unit without changing very much. Darryl said that this was done at some point for Mayo's computer room in the subbasement of the BI. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech ------------- Forwarded message follows ------------- At 08:17 AM 2/11/2005, you wrote: >In designing a computer room two key factors are: > >1. Power in (electricity) >2. Power out (A/C) > >The second term really has two parts: > > A. the amount of air moved > B. the reduction in temperature of that air across the A/C unit > >The latter part is specified in tons. The A/C guys I've spoken >with recently utilize some more or less standard relationship >between cubic feet per minute (cfm) and A/C tons for the units they >maintain. These run off the campus cold water supply, so >it makes sense that heat out is proportional to flow across, assuming >that the cold water has a very large heat capacity. > >However, in terms of cooling the units themselves, the amount of >air flow through the racks is also important. That flow is >also in cfm. Ideally cfm through the racks would be equal to cfm >through the A/C, ie, all air goes once through the racks and then >directly through the A/C. Even more ideally cfm through _each_ rack >could be modulated somehow, since some racks move much more >air than others and putting a low flow rack next to a high flow rack >might drive the air the wrong way through the low flow unit. > >How does one calculate an optimal cfm through a rack? Decide on a maximum outlet temperature (say, 30C) Find your inlet air temperature (say, 15C) You know your dissipation.. (say, 5kW) Calculate how much air you need to move using the specific heat of air. (about 1 kJ/(kg K)) 5 kJ/sec means you'd need 5 kg/sec for a 1 degree rise, but here, with a 15 degree rise, you can get by with .33 kg/sec. Turn the kg/sec into cfm... 
.33 kg * 1.3 m3/kg = .43 cubic meters/sec. There's about 35 cubic feet in a cubic meter, so we need about 15 cubic feet per second. Multiply by 60 and you get a bit more than 900 cfm. Now.. that's idealized, so double it. 1800 cfm or so. Step 2: How big is the duct? Generally, you don't want to go any faster than 1000 linear feet per minute, so your duct will need to be about 2 square feet. (you begin to see why you don't want some little 6" diameter blower...) >For a specific example with round numbers, let's say it's a >25U rack, dissipates 10kW, and has a single 50 cfm per minute output >fan per 1U node. (Ie, all air out must go through that path.) > >There seem to be a bunch of variables that are hard to deal with. >For instance, adding the exhaust fans would be 50*25 = 1250 cfm. >Is that all there is to it? But that type of fan only runs at >the stated flow rate if the pressures are exactly as specified. >Without incredibly careful balancing of the pressure across the >rack it won't generally run at 50 cfm. This is precisely the case. And, of course, the actual circumstances will be nothing like what the design specs are. >Is cfm the key unit here or should one think in terms of pressure >at various points in the room? Trying to come up with an accurate aerodynamic model is a worthy challenge for a very large cluster (computational challenge, not thermal). It's all done by rules of thumb and adding lots of margin. Use the rough sizing technique to get an approximate air flow. Use reasonable sized ducts and air speeds. Measure the actual outlet temperatures. Actually, what most people do is a rough sizing, then call in someone who actually does this for a living (a HVAC contractor) and use their rough sizing to validate what the contractor tells you you should have. 
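[The sizing steps above can be sketched as a short script. The constants are assumptions: cp of air ~1.005 kJ/(kg·K) and density ~1.2 kg/m³ (specific volume ~0.83 m³/kg) at room conditions; with those standard values the ideal figure for the 5 kW, 15-degree example comes out nearer 600 cfm than 900 before the 2x margin, which agrees with the CFM = btu/hr / (1.08 * dT) rule of thumb quoted elsewhere in the thread.]

```python
# Hedged sketch of the back-of-envelope rack airflow sizing above.
# Assumed constants: cp of air ~1.005 kJ/(kg*K); density ~1.2 kg/m^3
# (i.e. specific volume ~0.83 m^3/kg) at ~20 C and sea level.

def rack_cfm(power_kw, inlet_c, outlet_c, margin=2.0):
    """Airflow needed to carry power_kw of heat at the given temperature rise."""
    cp = 1.005                         # kJ/(kg*K), specific heat of air
    rho = 1.2                          # kg/m^3, air density (assumed)
    dt = outlet_c - inlet_c            # temperature rise across the rack
    kg_per_s = power_kw / (cp * dt)    # mass flow of air to carry the heat
    m3_per_s = kg_per_s / rho          # volume flow
    cfm_ideal = m3_per_s * 35.31 * 60  # m^3/s -> cubic feet per minute
    return cfm_ideal * margin          # "that's idealized, so double it"

def duct_area_ft2(cfm, max_fpm=1000.0):
    """Duct cross-section keeping velocity under ~1000 linear ft/min."""
    return cfm / max_fpm

flow = rack_cfm(5.0, 15.0, 30.0)       # the 5 kW, 15 C rise example
print(f"~{flow:.0f} cfm with margin, duct ~{duct_area_ft2(flow):.1f} ft^2")
```

[As the rest of the thread stresses, this is rough sizing only; an HVAC contractor's numbers should override it.]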
>Thanks, > >David Mathog >mathog@caltech.edu >Manager, Sequence Analysis Facility, Biology Division, Caltech >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From mathog at mendel.bio.caltech.edu Fri Feb 11 15:06:49 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Oops Message-ID: Sorry about that message addressed to "Mike", it wasn't supposed to go to the list. Please cancel it if that's possible. Apologies otherwise. David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rross at mcs.anl.gov Fri Feb 11 18:47:22 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420D54DA.8000904@uiuc.edu> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> Message-ID: Hi Isaac, On Fri, 11 Feb 2005, Isaac Dooley wrote: > >>Using MPI_ISend() allows programs to not waste CPU cycles waiting on the > >>completion of a message transaction. > > > >No, it allows the programmer to express that it wants to send a message > >but not wait for it to complete right now. The API doesn't specify the > >semantics of CPU utilization. It cannot, because the API doesn't have > >knowledge of the hardware that will be used in the implementation. > > > That is partially true. The context for my comment was under your > assumption that everyone uses MPI_Send(). These people, as I stated > before, do not care about what the CPU does during their blocking calls. 
I think that it is completely true. I made no assumption about everyone using MPI_Send(); I'm a late-comer to the conversation. I was not trying to say anything about what people making the calls care about; I was trying to clarify what the standard does and does not say. However, I agree with you that it is unlikely that someone calling MPI_Send() is too worried about what the CPU utilization is during the call. > I was trying to point out that programs utilizing non-blocking IO may > have work that will be adversely impacted by CPU utilization for > messaging. These are the people who care about CPU utilization for > messaging. This I hope answers your prior question, at least partially. I agree that people using MPI_Isend() and related non-blocking operations are sometimes doing so because they would like to perform some computation while the communication progresses. People also use these calls to initiate a collection of point-to-point operations before waiting, so that multiple communications may proceed in parallel. The implementation has no way of really knowing which of these is the case. Greg just pointed out that for small messages most implementations will do the exact same thing as in the MPI_Send() case anyway. For large messages I suppose that something different could be done. In our implementation (MPICH2), to my knowledge we do not differentiate. You should understand that the way MPI implementations are measured is by their performance, not CPU utilization, so there is pressure to push the former as much as possible at the expense of the latter. > Perhaps your applications demand low latency with no concern for the CPU > during the time spent blocking. That is fine. But some applications > benefit from overlapping computation and communication, and the cycles > not wasted by the CPU on communication can be used productively.
I wouldn't categorize the cycles spent on communication as "wasted"; it's not like we code in extraneous math just to keep the CPU pegged :). Regards, Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab From hahn at physics.mcmaster.ca Fri Feb 11 21:41:41 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] cooling question: cfm per rack? In-Reply-To: Message-ID: > The second term really has two parts: > A. the amount of air moved > B. the reduction in temperature of that air across the A/C unit > > The latter part is specified in tons. The A/C guys I've spoken well, I usually think of temperature as a side-effect of the more direct measure, movement of energy. hence, I always think of the tidy relation of 3.517 KW = 1 ton. I usually skip any BTUs... > with recently utilize some more or less standard relationship > between cubic feet per minute (cfm) and A/C tons for the units they > maintain. CFM and delta-t across the machine-to-be-cooled are convolved to give you how much heat you're extracting. no doubt both pressure and humidity are involved to some degree as well, and I don't have a good equation for this. the good thing is that turning down the temperature can partly mitigate minor airflow problems. some reasonable discussion from Intel, (a bit axe-grinding, though): http://www.7x24nw.org/Presentations-folder/Air%20Cooling%20in%20Servers%20and%20IT%20Facilities.pdf a dell 1855 blade chassis spec's 400 CFM for ~4KW. they're talking 6 of those chassis in a rack (24 KW!). 
then again, that's assuming an unrealistic power-per blade (>400W), which sounds like corporate CYA to me: http://www.dell.com/downloads/global/products/pedge/en/PowerEdge%201855%20DC%20Whitepaper.pdf this is a good overall discussion, though perhaps a bit pessimistic about "typical" machinerooms: http://www.chatsworth.com/uploads/pdf/best_practices_cooling_wp.pdf http://www.chatsworth.com/uploads/pdf/increase_computerrm_cooling_wp.pdf sun recommends 21-23C, 45-50%. 35% min, ESD critical at 30%: http://www.sun.com/products-n-solutions/hardware/docs/html/817-4137-10/2__EnvReq.html to complicate matters, HVAC folk always bring up the issue of "sensible load". as near as I can tell, this is just a way of saying that if you try to impose too much delta-T on humid air, you wind up wasting a lot of energy dehumidifying it... tiles between 500-2000 CFM: http://h200005.www2.hp.com/bc/docs/support/SupportManual/c00064724/c00064724.pdf that also gives: CFM = btu/hr / (1.08 * dT) so for 1 ton = 12000 BTU/hr and 70->90, 555 CFM per ton of cooling. HVAC folk also tend to say 1 tile/ton, which seems about right. > These run off the campus cold water supply, so > it makes sense that heat out is proportional to flow across, assuming > that the cold water has a very large heat capacity. our experience with CW has been disastrous, but we made the huge mistake of not using precision/machineroom chillers (fancoils, actually). our old/existing machineroom, for instance, is supposed to have 2x8ton fancoils, but combined they never moved more than about 20 KW (should be 56). unless you have pretty extreme assurances about CW quality (flow, temp), I would only consider using dual-cool machineroom chillers (DX + CW, usually adds about 15% to price.) > directly through the A/C.
Even more ideally cfm through _each_ rack > could be modulated somehow, since some racks move much more > air than others and putting a low flow rack next to a high flow rack > might drive the air the wrong way through the low flow unit. well, the stuff in racks does probably have quite a few fans, which could ideally modulate themselves. my current-gen clusters certainly don't do that, but I'd be quite happy if next-gen did... > How does one calculate an optimal cfm through a rack? > > For a specific example with round numbers, let's say it's a > 25U rack, dissipates 10kW, and has a single 50 cfm per minute output > fan per 1U node. (Ie, all air out must go through that path.) that sounds reasonable to me - 10KW is ~3 tons, and the formula above relates your 1250 CFM total to about 3 tons as well. for my 10KW racks, I'm hoping to push the temperature down a bit (60-65), keep the humidity low to avoid "sensible" wastage, and hope for the best with our tiles. > Is cfm the key unit here or should one think in terms of pressure > at various points in the room? I think the answer is yes. with a good raised floor, you seem to be able to expect fairly even pressure distribution. we turned on our new machineroom yesterday, and the pressure feels similar everywhere (16" raised floor, though with some conduits down there, and 3x30T Liebert deluxe system 3's.) if your pressure is reasonably even, the same tiles should flow the same CFM. I'd LOVE to find some way to measure airflow, since I'd actually consider doing things like adding patches of duct tape to the underside of too-high-flow tiles. I suppose that the empiricist approach is just to sample all your system temperatures, and if some are too high, reduce the airflow to racks which are "too cool". From james.p.lux at jpl.nasa.gov Sat Feb 12 05:45:02 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] cooling question: cfm per rack? 
References: Message-ID: <000401c51109$0df9a6d0$19f29580@LAPTOP152422> > CFM and delta-t across the machine-to-be-cooled are convolved to give > you how much heat you're extracting. no doubt both pressure and humidity > are involved to some degree as well, and I don't have a good equation > for this. Indeed.. there is no "nice simple" equation for the general case, because of the problem with humidity. You really need to be worrying about enthalpy, etc., and with any sort of significant temperature change, it's neither constant pressure, nor constant volume, not to mention mechanical turbulence, etc.. All that icky thermodynamics stuff. I once spent several weeks trying to figure out if one could make theatrical fog without using liquid nitrogen. They do it by having a big tank of water about half full at around 160-180F, and then they inject liquid nitrogen into the headspace above the water. Turns out that the heat of vaporization of the LN2 is almost exactly balanced by the heat of condensation of the saturated water vapor, and that the volume of nitrogen gas produced, etc, works out to the outlet stream being around 38F, with the water droplets at the same temperature. Very, very tough to do this with mechanical refrigeration for a variety of reasons. So, as you say, unless you're air-conditioning a huge building (where the cost of excess capacity is significant, and where there are all those hot, water-exhaling people inside), you can just do some quasi-worst case approximating. > the good thing is that turning down the temperature can partly > mitigate minor airflow problems. > > to complicate matters, HVAC folk always bring up the issue of "sensible > load". as near as I can tell, this is just a way of saying that if you try > to impose too much delta-T on humid air, you wind up wasting a lot of energy > dehumidifying it... Yes.. this is especially true if you're not recirculating, but chilling fresh air from "outside".
If you've got a reasonably closed system and there are no people inside, it's less of an issue. > > tiles between 500-2000 CFM: > http://h200005.www2.hp.com/bc/docs/support/SupportManual/c00064724/c00064724.pdf > that also gives: > CFM = btu/hr / (1.08 * dT) > so for 1 ton = 12000 BTU/hr and 70->90, 555 CFM per ton of cooling. > HVAC folk also tend to say 1 tile/ton, which seems about right. > > > These run off the campus cold water supply, so > > it makes sense that heat out is proportional to flow across, assuming > > that the cold water has a very large heat capacity. Yes, in a theoretical sense. However, there are two factors to be aware of: 1) run the air too fast past the coils and it doesn't have time to exchange the heat; 2) run the air too fast and you consume power (and make heat) in compressing it to overcome the pressure drop. There's also a practical limit on just how much delta T you can get in one pass through the chiller coils. > > our experience with CW has been disastrous, but we made the huge mistake > of not using precision/machineroom chillers (fancoils, actually). > our old/existing machineroom, for instance, is supposed to have 2x8ton > fancoils, but combined they never moved more than about 20 KW (should be 56). > > unless you have pretty extreme assurances about CW quality (flow, temp), > I would only consider using dual-cool machineroom chillers (DX + CW, usually > adds about 15% to price.) > > > directly through the A/C. Even more ideally cfm through _each_ rack > > could be modulated somehow, since some racks move much more > > air than others and putting a low flow rack next to a high flow rack > > might drive the air the wrong way through the low flow unit. > > > > Is cfm the key unit here or should one think in terms of pressure > > at various points in the room? > > > if your pressure is reasonably even, the same tiles should flow the > same CFM.
I'd LOVE to find some way to measure airflow, since I'd > actually consider doing things like adding patches of duct tape to > the underside of too-high-flow tiles. I suppose that the empiricist > approach is just to sample all your system temperatures, and if some > are too high, reduce the airflow to racks which are "too cool". Hie thee to a company called Dwyer, who make equipment specifically designed to measure airflow. There are several approaches.. One is using a pitot tube with a Magnehelic differential pressure gauge. Another is to measure the pressure drop across a calibrated orifice (again, using a sensitive pressure gauge). http://www.dwyer-inst.com/ Another is to use an airspeed probe (looks like a wand with a little fan in a hole on the end). The fancy ones will average a bunch of readings over an opening and do the calculation to turn area*average speed into CFM. You can find Magnehelic gauges surplus all the time.. keep your eyes open and when one turns up for $15-20, grab it. They're handy devices that can measure fairly small pressures (few inches of water column), and come with all sorts of weird scales (including some already calibrated in feet per minute or m/sec, all ready for use with a pitot tube). Interesting to measure the pressure in a room (or your house) and see what happens when the heater turns on, or the kids open and close the doors, etc. Some time spent with the McMaster-Carr catalog (http://www.mcmaster.com/) or the Grainger catalog (http://www.grainger.com/) is worthwhile (both are large suppliers of stuff mechanical, materials, etc.. everyone should have a copy of the several thousand page yellow McMaster-Carr catalog on their desk...). Omega (usually associated with temperature measuring) has a fair number of airspeed and volume measuring devices. http://www.omega.com However, your empirical approach of reducing the flow through the coldest racks is probably as good as anything.
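[For the pitot-and-Magnehelic approach above, the conversion from a velocity-pressure reading to airspeed is v = sqrt(2*dp/rho), which for standard air with dp in inches of water column collapses to the familiar 4005*sqrt(dp) feet-per-minute rule. A sketch — the 1/16" reading and 2 ft^2 tile below are made-up numbers, not measurements from this thread:]

```python
# Pitot tube velocity pressure -> airspeed -> volume flow.
# v = sqrt(2*dp/rho); for standard air (~0.075 lb/ft^3) with dp in
# inches of water column this reduces to v[fpm] = 4005 * sqrt(dp).
import math

def airspeed_fpm(vp_in_h2o):
    """Airspeed in ft/min from velocity pressure in inches w.c. (standard air)."""
    return 4005.0 * math.sqrt(vp_in_h2o)

def flow_cfm(vp_in_h2o, area_ft2):
    """Volume flow through an opening of the given area (area * average speed)."""
    return airspeed_fpm(vp_in_h2o) * area_ft2

# e.g. a hypothetical 1/16" w.c. reading averaged over a 2 ft^2 floor tile:
print(f"{airspeed_fpm(0.0625):.0f} fpm, {flow_cfm(0.0625, 2.0):.0f} cfm")
```

[In practice you'd average several probe readings across the opening, as the fancier instruments do, before multiplying by the area.]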
Jim Lux From rgb at phy.duke.edu Sat Feb 12 06:36:17 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:46 2009 Subject: Fw: Re: [Beowulf] cooling question: cfm per rack? In-Reply-To: References: Message-ID: On Fri, 11 Feb 2005, David Mathog wrote: > Sorry, to be vague, there are just so many unknowns. Always.:-) > > I also talked to Darryl Willick, who runs a bunch of machine rooms > on campus for Chemistry and some of Rees, Bjorkman and Mayo's > stuff. His main room is about at capacity now with > 6 full racks and a few odds and ends. He has 2 x 250A panels > in there and apparently only a 45kW A/C unit. That second > number is really odd because they aren't usually rated that > way, but that's the number he remembered. If he's right that's > 45000/3500=12 tons, roughly the same as the unit currently > in the Rees area. He said his had to be serviced > recently because they were having overheating problems, but only > a belt was changed. Unknown how many cfm it is. He has a small > workstation area that is somehow or other connected to his machine > room ventilation wise, and apparently when they prop the door open > in the workstation area it causes problems in the machine room. > So maybe it would make sense to put a small separate A/C unit > in the proposed classroom to avoid those sorts of complications > in the future. Or maybe it can tap off building air. > > Darryl did say something interesting though, he said that for > some units the A/C people can increase the capacity by changing > the pulleys around. Apparently this blows more air, and the > cold water isn't limiting, so it effectively upgrades the unit > without changing very much. Darryl said that this was done > at some point for Mayo's computer room in the subbasement > of the BI. I'm sure you probably remember this from my posts on this topic before, but there are lots of bad experiences we and others on the list have had with AC that you can profit from. 
Don't forget things like: * Kill switch for room for the day the AC fails altogether at 2:30 a.m. * Automated monitoring and (if you've got one) a call cycle so that maybe somebody can get there in time to shut things down before the kill switch kicks in EVEN at 2:30 a.m. * The fact that at many places, the physical plant people have this annoying tendency to try to save energy by throttling down the A/C to a standby mode (where the chilled water is allowed to warm up to maybe 18C) in the winter because hey, it's cold outside, right? Often this is done automatically, without human thought or control. Often this triggers events for which the first two interventions are required when it does. This may not apply to you in your generally warm clime (compared to here, anyway) but is worth checking, for sure. * When computing the cost/benefit of power vs AC, be aware (to put into words what you're working toward anyway) that the true optimum is going to be biased towards an excess of AC capacity. This is for several reasons, once you think about it. The most important one is that adding new/additional power is relatively cheap whenever you do it; adding new/additional AC capacity later can be VERY expensive -- as expensive as adding AC at all in the first place. * Surplus capacity can also keep room ambient colder (generally better) while operating in the normal load range and may be cheaper in terms of operating efficiency, as AC COP depends on temperature differentials between delivery and returned chiller water (although the blowers and pumps draw too -- don't know how this all works out in the wash). * Redundancy is good, if you've got the space. If one blower out of three goes, the remaining two may be able to keep the space operational while service is performed, or at least keep it cool enough to avoid an involuntary kill or midnight call. 
* As you note -- it really helps to get professional advice on this from an engineer or architect who specializes in server room infrastructure design and support. Not that you shouldn't educate yourself in it too -- it's just that they SHOULD have a broad base of personal professional experience to draw on as well as some classroom education on the issues to be faced. Worth paying for. As you note, it is very difficult to know exactly where future power requirements and node densities will go per rack. Maybe blades will take over the universe, and racks will suddenly become very hot indeed. Some non-blade racks can achieve close to double the standard node/CPU densities in terms of floorspace footprint (e.g. Rackable, IIRC). Multiple core CPUs are at the threshold of appearing, and although they also look like they might be power/clock limited BECAUSE of the heat problem, there is still going to be some sort of scaling of power per compute capacity per cubic foot of rack space as the latter goes up. Alternatively, some room designs might install the DUCTWORK now that can support a (say) doubling of future AC capacity in the future and reserve space for the local units to drive this capacity in the facility but leave that space empty. Then you can (eventually) add the units without having to necessarily rip everything apart. This probably works best with raised floor designs (where you just duct per rack location) but one would expect that they could manage it for other kinds of ducted delivery and return if they try. In any infrastructure project, it really pays to think about this stuff ahead of time, as you are. rgb > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > > ------------- Forwarded message follows ------------- > > At 08:17 AM 2/11/2005, you wrote: > >In designing a computer room two key factors are: > > > >1. Power in (electricity) > >2. 
Power out (A/C) > > > >The second term really has two parts: > > > > A. the amount of air moved > > B. the reduction in temperature of that air across the A/C unit > > > >The latter part is specified in tons. The A/C guys I've spoken > >with recently utilize some more or less standard relationship > >between cubic feet per minute (cfm) and A/C tons for the units they > >maintain. These run off the campus cold water supply, so > >it makes sense that heat out is proportional to flow across, assuming > >that the cold water has a very large heat capacity. > > > >However, in terms of cooling the units themselves, the amount of > >air flow through the racks is also important. That flow is > >also in cfm. Ideally cfm through the racks would be equal to cfm > >through the A/C, ie, all air goes once through the racks and then > >directly through the A/C. Even more ideally cfm through _each_ rack > >could be modulated somehow, since some racks move much more > >air than others and putting a low flow rack next to a high flow rack > >might drive the air the wrong way through the low flow unit. > > > >How does one calculate an optimal cfm through a rack? > > Decide on a maximum outlet temperature (say, 30C) > Find your inlet air temperature (say, 15C) > You know your dissipation.. (say, 5kW) > > Calculate how much air you need to move using the specific heat of air. > (about 1 kJ/(kg K)) > > 5 kJ/sec means you'd need 5 kg/sec for a 1 degree rise, but here, with a 15 > degree rise, you can get by with .33 kg/sec. Turn the kg/sec into cfm... > .33 kg * 1.3 m3/kg = .43 > cubic meters/sec. There's about 35 cubic feet in a cubic meter, so we need > about 15 cubic feet per second. Multiply by 60 and you get a bit more than > 900 cfm. > > Now.. that's idealized, so double it. 1800 cfm or so. > > > Step 2: How big is the duct? Generally, you don't want to go any faster > than 1000 linear feet per minute, so your duct will need to be about 2 > square feet. 
(you begin to see why you don't want some little 6" diameter > blower...) > > > > >For a specific example with round numbers, let's say it's a > >25U rack, dissipates 10kW, and has a single 50 cfm per minute output > >fan per 1U node. (Ie, all air out must go through that path.) > > > >There seem to be a bunch of variables that are hard to deal with. > >For instance, adding the exhaust fans would be 50*25 = 1250 cfm. > >Is that all there is to it? But that type of fan only runs at > >the stated flow rate if the pressures are exactly as specified. > >Without incredibly careful balancing of the pressure across the > >rack it won't generally run at 50 cfm. > > > This is precisely the case. And, of course, the actual circumstances will > be nothing like what the design specs are. > > > >Is cfm the key unit here or should one think in terms of pressure > >at various points in the room? > > Trying to come up with an accurate aerodynamic model is a worthy challenge > for a very large cluster (computational challenge, not thermal). > > It's all done by rules of thumb and adding lots of margin. > > Use the rough sizing technique to get an approximate air flow. Use > reasonable sized ducts and air speeds. Measure the actual outlet > temperatures. > > Actually, what most people do is a rough sizing, then call in someone who > actually does this for a living (a HVAC contractor) and use their rough > sizing to validate what the contractor tells you you should have. > > > > >Thanks, > > > >David Mathog > >mathog@caltech.edu > >Manager, Sequence Analysis Facility, Biology Division, Caltech > >_______________________________________________ > >Beowulf mailing list, Beowulf@beowulf.org > >To change your subscription (digest mode or unsubscribe) visit > >http://www.beowulf.org/mailman/listinfo/beowulf > > James Lux, P.E. 
> Spacecraft Radio Frequency Subsystems Group > Flight Communications Systems Section > Jet Propulsion Laboratory, Mail Stop 161-213 > 4800 Oak Grove Drive > Pasadena CA 91109 > tel: (818)354-2075 > fax: (818)393-6875 > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sat Feb 12 06:49:03 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] cooling question: cfm per rack? In-Reply-To: References: Message-ID: > if your pressure is reasonably even, the same tiles should flow the > same CFM. I'd LOVE to find some way to measure airflow, since I'd > actually consider doing things like adding patches of duct tape to > the underside of too-high-flow tiles. I suppose that the empiricist > approach is just to sample all your system temperatures, and if some > are too high, reduce the airflow to racks which are "too cool". Relative airflow can probably be measured with a kid's toy -- one of the little pinwheels -- and counting revolutions with a stopwatch. Normalizing that to absolute airflow in CFM is a bit tricky (since the result depends to some extent on the resistance imposed by the measuring apparatus) but somebody out there may have designed a version of this with a real fan and magnets set so that the counting is done electronically. In fact, I could build something to do this out of OTC parts if I had any way to normalize the count. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu

From james.p.lux at jpl.nasa.gov Sat Feb 12 07:47:17 2005
From: james.p.lux at jpl.nasa.gov (Jim Lux)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] cooling question: cfm per rack?
References: Message-ID: <001401c5111a$3c447b30$32a8a8c0@LAPTOP152422>

Sure, one could build it.. but one can probably buy it cheaper/easier.

Omega: http://www.omega.com/ppt/pptsc.asp?ref=HHF82&Nav=grec06 $89. They have others.

Similar devices abound:
http://www.nkhome.com/ww/1000/1000.html
http://www.windandweather.com/store/Weather_Instruments___Wind_Gauges?Args=&page_number=1 (check out the first one.. $49)

Your local sporting goods place (REI, Sport Chalet, Big 5) might have something like this too. So might Sharper Image or Brookstone, or one of those gadget stores. Heck, Harbor Freight Salvage, a big retailer of inexpensive moderate-quality imported stuff, might have them.. next time you're down buying cheap imported Chinese machine tools, check that bargain bin next to the register.

Other approaches.. a small propeller on a small DC motor run as a generator (only works for fairly fast flows, >several m/sec) run to a DVM. A small propeller and magnet/reed switch driving a counter (as in your inexpensive DMM) (this is what the commercial units are). The challenge in home fabrication of such devices is getting them to be reasonably orientation insensitive, which implies pretty good balance, and to work in very low flows (<1 m/sec), which implies fairly low friction. I imagine, if you had a LOT of time on your hands, you could probably modify the heated film/wire sensor from an automotive mass air flow sensor for this purpose.

(I spent the better part of a year trying to come up with a low-budget way to measure velocity profiles across large (decameter-scale) artificial tornadoes.. We eventually settled on a pitot tube rake with water manometers, using video to do data logging.)
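The rough-sizing arithmetic Jim walked through earlier in the thread (heat load over specific heat times temperature rise, converted to volume flow) can be written out directly. The constants below are the round numbers used in the thread itself:

```c
/* Rack-cooling rough sizing, following the arithmetic in the thread.
   Constants are the round numbers quoted there. */

#define CP_AIR      1.0    /* kJ/(kg K), specific heat of air */
#define M3_PER_KG   1.3    /* specific volume of air used in the thread */
#define FT3_PER_M3  35.3   /* cubic feet per cubic metre */

/* Required airflow in CFM for a given heat load (kW) and allowed
   air temperature rise (degrees C) across the rack. */
double cfm_required(double kw, double delta_t_c)
{
    double kg_per_s = kw / (CP_AIR * delta_t_c);   /* mass flow */
    double m3_per_s = kg_per_s * M3_PER_KG;        /* volume flow */
    return m3_per_s * FT3_PER_M3 * 60.0;           /* -> ft^3/min */
}
```

cfm_required(5.0, 15.0) comes out just under 920, matching the "bit more than 900 cfm" figure above; doubling the load (or, as suggested, doubling for margin) lands near the ~1800 cfm working number.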
----- Original Message ----- From: "Robert G. Brown" To: "Mark Hahn" Cc: "David Mathog" ; Sent: Saturday, February 12, 2005 6:49 AM Subject: Re: [Beowulf] cooling question: cfm per rack? > > if your pressure is reasonably even, the same tiles should flow the > > same CFM. I'd LOVE to find some way to measure airflow, since I'd > > actually consider doing things like adding patches of duct tape to > > the underside of too-high-flow tiles. I suppose that the empiricist > > approach is just to sample all your system temperatures, and if some > > are too high, reduce the airflow to racks which are "too cool". > > Relative airflow can probably be measured with a kid's toy -- one of the > little pinwheels -- and counting revolutions with a stopwatch. > Normalizing that to absolute airflow in CFM is a bit tricky (since the > result depends to some extent on the resistance imposed by the measuring > apparatus) but somebody out there may have designed a version of this > with a real fan and magnets set so that the counting is done > electronically. In fact, I could build something to do this out of OTC > parts if I had any way to normalize the count. > > rgb From Toufeeq_Hussain at infosys.com Thu Feb 10 20:01:55 2005 From: Toufeeq_Hussain at infosys.com (Toufeeq Hussain) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Porting lam-7.1 to Cygwin (Win 2K) Message-ID: <557E17BE74D22143B7BE70EB60E33E9915BD81F8@shlmsg01.ad.infosys.com> Hi, Trying to compile lam-7.1 on Cygwin. 
Make fails at this point:

make[2]: Entering directory `/lam-7.1.1/otb/lamgrow'
/bin/bash ../../libtool --mode=link gcc -O3 -o lamgrow.exe lamgrow.o ../../share/liblam/liblam.la ../../share/libltdl/libltdlc.la -lutil
gcc -O3 -o lamgrow.exe lamgrow.o ../../share/liblam/.libs/liblam.a ../../share/libltdl/.libs/libltdlc.a -lutil
../../share/liblam/.libs/liblam.a(ssi_boot_slurm.o)(.text+0x3c8):ssi_boot_slurm.c: undefined reference to `_inet_ntop'
collect2: ld returned 1 exit status
make[2]: *** [lamgrow.exe] Error 1
make[2]: Leaving directory `/lam-7.1.1/otb/lamgrow'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/lam-7.1.1/otb'
make: *** [all-recursive] Error 1

Is there a Cygwin port available? Any suggestions on the above problem?

Regards,
Toufeeq Hussain

From rwm at absoft.com Fri Feb 11 07:00:00 2005
From: rwm at absoft.com (Rodney Mach)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] Re: thread safe PRNG
In-Reply-To: <200502111409.j1BE8vAY013737@bluewest.scyld.com>
References: <200502111409.j1BE8vAY013737@bluewest.scyld.com>
Message-ID: <420CC870.6020307@absoft.com>

> Hi folks:
>
> I need to get a thread-safe pseudo-random number generator. All I
> have found online was SPRNG which is set up for MPI. Anyone have a
> quick pointer to their favorite thread safe PRNG that works well in
> OpenMP?
>
> Thanks.
>
> Joe
>

Hey Joe,

Intel MKL has various thread-safe PRNGs that will work with OpenMP. IMSL also has thread-safe PRNGs, as does IBM ESSL; ditto for AMD ACML.

-Rod

From henry.gabb at intel.com Fri Feb 11 07:04:31 2005
From: henry.gabb at intel.com (Gabb, Henry)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] RE: A thread-safe PRNG for an OpenMP program
Message-ID:

Hi Joe,

The Intel Math Kernel Library (specifically the Vector Statistical Library within MKL) contains thread-safe random number functions. The following web site has a full description: http://www.intel.com/software/products/mkl/features/vsl.htm.
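Besides the vendor libraries mentioned in this thread, the simplest route to thread safety under OpenMP is to give every thread its own generator state. A minimal sketch of that pattern, using POSIX rand_r as a stand-in generator (rand_r's statistical quality is modest; a library generator, or parameterized streams as in SPRNG, substitute the same way):

```c
#include <stdlib.h>

/* Thread-safe random numbers under OpenMP by giving every thread
   private generator state -- here just a rand_r() seed word.  The
   same pattern works with any PRNG whose entire state you can
   replicate per thread. */

#define MAX_THREADS 64

static unsigned int seeds[MAX_THREADS];

void prng_init(unsigned int base_seed)
{
    /* distinct seed per thread id; the offset constant is arbitrary */
    for (int i = 0; i < MAX_THREADS; i++)
        seeds[i] = base_seed + 1000003u * (unsigned int)i;
}

/* Each thread passes its own id (omp_get_thread_num() under OpenMP),
   so no two threads ever touch the same state word. */
double prng_uniform(int tid)
{
    return rand_r(&seeds[tid]) / (double)RAND_MAX;
}
```

Inside a parallel region a thread would call prng_uniform(omp_get_thread_num()). Note the per-thread streams are only as independent as the underlying generator allows; rigorously parameterized streams are what SPRNG-style packages provide.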
There's an article "Making the Monte Carlo Approach Even Easier and Faster" on Intel Developer Services that describes how to use VSL functions with OpenMP. It's available here: http://www.intel.com/cd/ids/developer/asmo-na/eng/95573.htm.

Best regards,
Henry Gabb
Intel Parallel Applications Center

> Hi folks:
>
> I need to get a thread-safe pseudo-random number generator. All I
> have found online was SPRNG which is set up for MPI. Anyone have a
> quick pointer to their favorite thread safe PRNG that works well in OpenMP?
>
> Thanks.
>
> Joe
>
> --
> Scalable Informatics LLC,
> email: landman@scalableinformatics.com
> web : http://www.scalableinformatics.com

From diep at xs4all.nl Fri Feb 11 08:59:56 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] A thread-safe PRNG for an OpenMP program
Message-ID: <3.0.32.20050211175956.0102c6c0@pop.xs4all.nl>

Perhaps use a local PRNG, as it can serve each CPU a number in roughly 2 nanoseconds. Here is what I modified to 64 bits; it's really fast on processors that are 64-bit and have a rotate instruction (Itanium doesn't have one, but is still faster than the K7 here, as it's 64-bit). Even on Itanium you can consider this a fast PRNG.
/* define parameters (R1 and R2 must be smaller than the integer size): */ #define UNIX 1 // otherwise windows #if UNIX #include #define FORCEINLINE __inline /* UNIX and such this is 64 bits unsigned variable: */ #define BITBOARD unsigned long long #else #define FORCEINLINE __forceinline /* in WINDOWS we also want to be 64 bits: */ #define BITBOARD unsigned _int64 #endif #define KK 17 #define JJ 10 #define R1 5 #define R2 3 /* global variables Ranrot */ BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers */ 0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239, 0x195e36fe715fad23, 0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,0x6de386f196bc1b7b, 0x5db2d651a7bdf825, 0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,0xfb2629812da87693,0xf3088fedb657f9dd, 0x00d47d10ffdc8a9f, 0xd9e323088121da71,0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47, 0x43d64ed75a9ad5d9 }; int r_p1, r_p2; /* indexes into history buffer */ /******************************************************** AgF 1999-03-03 * * Random Number generator 'RANROT' type B * * by Agner Fog * * * * This is a lagged-Fibonacci type of random number generator with * * rotation of bits. The algorithm is: * * X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b * * * * The last k values of X are stored in a circular buffer named * * randbuffer. * * * * This version works with any integer size: 16, 32, 64 bits etc. * * The integers must be unsigned. The resolution depends on the integer * * size. 
* * * * Note that the function RanrotAInit must be called before the first * * call to RanrotA or iRanrotA * * * * The theory of the RANROT type of generators is described at * * www.agner.org/random/ranrot.htm * * * * Optimized for 64 bits usage by Vincent Diepeveen * * diep@xs4all.nl * *************************************************************************/

FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return((x<<r)|(x>>(64-r)));}

/* returns a random number of 64 bits unsigned */
FORCEINLINE BITBOARD RanrotA(void) {
  /* generate next random number */
  BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) + rotl(randbuffer[r_p1], R2);
  /* rotate list pointers */
  if( --r_p1 < 0 ) r_p1 = KK - 1;
  if( --r_p2 < 0 ) r_p2 = KK - 1;
  return x;
}

/* this function initializes the random number generator. */
void RanrotAInit(void) {
  int i;
  /* one can fill the randbuffer with other values here */
  randbuffer[0] = 0x92930cb295f24000 | (BITBOARD)ProcessNumber;
  randbuffer[1] = 0x0d2f2c860b000215 | ((BITBOARD)ProcessNumber<<12);
  /* initialize pointers to circular buffer */
  r_p1 = 0;
  r_p2 = JJ;
  /* randomize */
  for( i = 0; i < 3000; i++ )
    (void)RanrotA();
}

At 16:40 10-2-2005 -0500, Joe Landman wrote:
>Hi folks:
>
> I need to get a thread-safe pseudo-random number generator. All I
>have found online was SPRNG which is set up for MPI. Anyone have a
>quick pointer to their favorite thread safe PRNG that works well in OpenMP?
>
> Thanks.
>
>Joe
>
>--
>Scalable Informatics LLC,
>email: landman@scalableinformatics.com
>web : http://www.scalableinformatics.com
>
>_______________________________________________
>Beowulf mailing list, Beowulf@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

From wrankin at ee.duke.edu Fri Feb 11 09:51:42 2005
From: wrankin at ee.duke.edu (Bill Rankin)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] cooling question: cfm per rack?
In-Reply-To: References: Message-ID: <1108144302.3042.27.camel@localhost.localdomain> > Is cfm the key unit here or should one think in terms of pressure > at various points in the room? The other factor in heat removal (both within the rack as well as within the air chiller) is the intake air temps. The larger the temperature difference the more efficient the heat transfer becomes. Essentially, 50cfm of 20C air cools a lot better than 50cfm of 30C air. Also (as we are currently experiencing) the air handlers are much more efficient at cooling really HOT air, versus warm air. -bill -- bill rankin, ph.d. ........ director, cluster and grid technology group wrankin@ee.duke.edu .......................... center for computational duke university ...................... science engineering and medicine http://www.ee.duke.edu/~wrankin .............. http://www.csem.duke.edu From maurice at harddata.com Fri Feb 11 11:15:05 2005 From: maurice at harddata.com (Maurice Hilarius) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Re: Re: Home beowulf - NIC latencies (Greg Lindahl) In-Reply-To: <200502111409.j1BE8vAY013737@bluewest.scyld.com> References: <200502111409.j1BE8vAY013737@bluewest.scyld.com> Message-ID: <420D0439.3000304@harddata.com> Greg Lindahl wrote: >Amen. So use the MM5 t3a benchmark, maybe even SPEC HPC, the canned >benchmarks for Amber, Charmm, DL_POLY, etc. The NAS Parallel >Benchmarks are also good, they are much closer to real apps than >microbenchmarks. > >-- greg > Double Amen. ( is that a long Amen??) ;-) Now if we could only get all those benchmarks to agree with each other a bit! It's classic. Pick your arch, chipset, amount of RAM, clockspeed, NIC, switch, and so on, and you can make a selective case for almost anything.. Although on SMP the Opterons are mainly kicking butt lately due to the fact that their SMP performance is so superior.. And that brings up another can 'o worms: SMP or uni ? 
One can make a great performance case for either/both depending on your goals. With our best regards, Maurice W. Hilarius Telephone: 01-780-456-9771 Hard Data Ltd. FAX: 01-780-456-9772 11060 - 166 Avenue email:maurice@harddata.com Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 This email, message, and content, should be considered confidential, and is the copyrighted property of Hard Data Ltd., unless stated otherwise. From rbbrigh at valeria.mp.sandia.gov Fri Feb 11 12:14:11 2005 From: rbbrigh at valeria.mp.sandia.gov (Ron Brightwell) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420CFE4C.6050003@ccrl-nece.de> References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> <420BC57A.5060007@harddata.com> <20050211023619.GB5174@greglaptop.internal.keyresearch.com> <420CFE4C.6050003@ccrl-nece.de> Message-ID: <20050211201411.GA10732@ratbert.mp.sandia.gov> On Fri Feb 11, 2005 11:49:48... Joachim Worringen wrote > Greg Lindahl wrote: > >On Thu, Feb 10, 2005 at 01:35:06PM -0700, Maurice Hilarius wrote: > >>If I have a fantastic device that uses infinitely small time (latency) > >>and moves huge amounts of data (bandwidth) but in doing so it takes 80% > >>of a CPU, we do not have a useful solution.. > > > >If large cpu usage is a problem, it will show up nicely in real > >application benchmarks. > > True. I always wonder what the low-CPU-usage-advocates want the MPI > process to do while i.e. an MPI_Send() is executed. For small messages > (which are critical for many applications), it's somewhat like > requesting that a local memory-write has to show low CPU usage. For blocking operations with short messages, low CPU usage shouldn't be the main concern. Measuring latency relative to CPU usage doesn't make much sense. > > Of course, I can think of scenarios in which data transfers w/o CPU > usage do promise advantages, and I have implemented and evaluated such > techniques myself. 
> But in the end (for the application), it always
> boiled down to latency and bandwidth as most applications don't honor
> "true" asynchronous communication.

Yep. We seem to have several micro-benchmarks that determine what the overlap potential of the network is, but I've never seen anything that determines what the overlap potential of an application is. It would be interesting to see what the overlap potential of real applications is.

>
> The latest unsuccessful case of uncoupling computation and MPI
> communication I read about was BG/L when using the second CPU as a
> message processor. Maybe Myrinet MX will behave differently by making
> the MPI itself more concurrent on hardware level (is this a correct
> description, Patrick?) - but it will need matching applications, too.
>

BG/L is unique in many ways. For example, using the second processor for communications doesn't actually help with progress -- the application still has to make MPI library calls to make progress on outstanding posted operations. So, even if the application was coded to take advantage of overlap, it probably wouldn't gain much by using the second processor.

MX should be able to provide overlap and progress, like Quadrics and a few other technologies do.

-Ron

From bushnell at ultra.chem.ucsb.edu Fri Feb 11 15:57:27 2005
From: bushnell at ultra.chem.ucsb.edu (John Bushnell)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] cooling question: cfm per rack?
In-Reply-To: Message-ID:

A few comments below...

On Fri, 11 Feb 2005, David Mathog wrote:

> Mike,
>
> I've been trying to pick the brains of other folks on the
> beowulf list who have computer rooms with modern equipment.
>
> One problem with the existing air, with regards to future
> expansion, is apparently the total amount of air that the
> current A/C can move. This is all horrendously complicated
> and needs to be looked at carefully by an HVAC consultant.
> Pretty sure we have enough tons and flow for now, meaning
> my rack and Deshaies and everything else I know is going in there
> in a couple of months. More and more convinced that we don't
> have enough to handle multiple full racks of the next generation
> of computers.

We learned about this after putting in a big new A/C (adding to an old but still functioning one) in our server room. The problem was mitigated by having the vents on the old AC replaced with flanges attached to large flexible vents. They hang near the top/front of two racks, and this has helped quite a bit. Air flow is important!

> Jim Lux from JPL answered my questions as attached after
> my signature. Thanks go out to Jim for the useful numbers.

> Darryl did say something interesting though, he said that for
> some units the A/C people can increase the capacity by changing
> the pulleys around. Apparently this blows more air, and the
> cold water isn't limiting, so it effectively upgrades the unit
> without changing very much. Darryl said that this was done
> at some point for Mayo's computer room in the subbasement
> of the BI.

Sounds like a pretty cheap upgrade. It would certainly be nice if we could do that here, as we've been running on the edge in terms of cooling for some time now. Our industrial chilled water loop runs at around 16C, so obviously the chilled water is simply acting as a reservoir for dumping heat from a compressor rather than being the direct source of cooling. So the limiting factor is likely the compressor/fluid/heat exchanger with the chilled water, rather than the chilled water itself. I wonder what "changing pulleys around" is really doing?
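The standard first-order answer to the pulley question is given by the fan affinity laws: for the same fan and duct system, airflow scales with shaft rpm, static pressure with rpm squared, and shaft power with rpm cubed. A sketch of that textbook scaling (the 20% example is illustrative, not a number from this thread):

```c
/* Fan affinity laws (textbook scaling for the same fan/duct system):
   flow ~ rpm, pressure ~ rpm^2, power ~ rpm^3.
   ratio = new_rpm / old_rpm, e.g. from swapping pulley diameters. */

double flow_scale(double ratio)     { return ratio; }
double pressure_scale(double ratio) { return ratio * ratio; }
double power_scale(double ratio)    { return ratio * ratio * ratio; }
```

So a pulley swap that spins the blower 20% faster buys roughly 20% more cfm, at the cost of about 73% more fan motor power, which is why it only works while the motor and the chilled water have headroom to spare.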
Stay cool - John

From idooley2 at uiuc.edu Fri Feb 11 16:59:06 2005
From: idooley2 at uiuc.edu (Isaac Dooley)
Date: Wed Nov 25 01:03:46 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com>
Message-ID: <420D54DA.8000904@uiuc.edu>

>>Using MPI_ISend() allows programs to not waste CPU cycles waiting on the
>>completion of a message transaction.
>>
>
>No, it allows the programmer to express that it wants to send a message
>but not wait for it to complete right now. The API doesn't specify the
>semantics of CPU utilization. It cannot, because the API doesn't have
>knowledge of the hardware that will be used in the implementation.
>

That is partially true. The context for my comment was your assumption that everyone uses MPI_Send(). These people, as I stated before, do not care about what the CPU does during their blocking calls. I was trying to point out that programs utilizing non-blocking IO may have work that will be adversely impacted by CPU utilization for messaging. These are the people who care about CPU utilization for messaging. This, I hope, answers your prior question, at least partially.

Perhaps your applications demand low latency with no concern for the CPU during the time spent blocking. That is fine. But some applications benefit from overlapping computation and communication, and the cycles not wasted by the CPU on communication can be used productively.
Isaac Dooley From rossen at VerariSoft.Com Fri Feb 11 22:52:03 2005 From: rossen at VerariSoft.Com (Rossen Dimitrov) Date: Wed Nov 25 01:03:46 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> Message-ID: <420DA793.4000909@verarisoft.com> I think that the mere definition of the term "MPI performance" and focusing too much on it can potentially have a negative impact on the overall discussion of parallel performance. Accepting the premise that all MPI can do is push individual messages between user processes as fast as possible, (as measured by ping pong) regardless of how this is achieved, unnecessarily and, I'd say, unjustifiably restricts the field of discussion. I agree that today MPI libraries are commonly measured by their ping-pong "performance" and not by their CPU utilization or other factors, but it does not necessarily make this form of performance evaluation right. I would support the idea of discussing isolated "MPI performance" but only in the context of a broader performance parameter space, at least including, communication overhead, communication bandwidth, processor overhead, and ability to perform asynchronous communication (i.e., compliance to the MPI Progress Rule). Only in such a broader evaluation space one can hope to fit the large number of combinations of processor/memory/peripheral_fabric architectures, network interconnects, system software/middleware, and application algorithms. Of course, there is always the case of running the actual application code and then evaluating the MPI performance by seeing which MPI library (or library mode) makes the application run faster. 
Unfortunately, this method for evaluating MPI often suffers from various deficiencies, some of which originate from the parallel algorithm developers, who throughout the years have sometimes adopted the most trivial ways of using MPI.

Here are a couple of arguments for why it is important to look at MPI (and the whole communication system) from different angles.

If certain MPI optimizations are achieved at the cost of excessive use of resources that otherwise could be used for computation or enabling the overall "application_progress", the actual application performance may be below its potential or even degrade. Here are some "application progress" activities that can benefit from having these resources at their disposal: OS/kernel processing, other communication, I/O operations, memory operations (prefetching, etc.), peripheral bus/fabric operations. All of these in one way or another depend on CPU processing.

Also, today's processor architectures have many independent processing units and complex memory hierarchies. When the MPI library polls for completion of a communication request, most of this specialized hardware is virtually unused (wasted). The processor architecture trends indicate that this kind of internal CPU concurrency will continue to increase, thus making the cost of MPI polling even higher. In this regard, a parallel application developer might actually very much care what is happening in the MPI library even when he makes a call to MPI_Send. If he doesn't, he probably should.

Some related topics (not covered here because of bloviating) are:

- How an MPI library that maximizes MPI's ping-pong performance alone can cause unexpected behavior and a fully functional parallel system to work far below its realistic efficiency.
- What application algorithm developers experience when they attempt to use the ever so nebulous "overlapping" with a polling MPI library, and how this experience has contributed to the overwhelming use of MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or (even better) persistent MPI calls, thus killing any hope that these codes can run faster on systems that actually facilitate overlapping.

Rossen

Rob Ross wrote:
> Hi Isaac,
>
> On Fri, 11 Feb 2005, Isaac Dooley wrote:
>
>>>> Using MPI_ISend() allows programs to not waste CPU cycles waiting on the
>>>> completion of a message transaction.
>>>
>>> No, it allows the programmer to express that it wants to send a message
>>> but not wait for it to complete right now. The API doesn't specify the
>>> semantics of CPU utilization. It cannot, because the API doesn't have
>>> knowledge of the hardware that will be used in the implementation.
>>
>> That is partially true. The context for my comment was under your
>> assumption that everyone uses MPI_Send(). These people, as I stated
>> before, do not care about what the CPU does during their blocking calls.
>
> I think that it is completely true. I made no assumption about everyone
> using MPI_Send(); I'm a late-comer to the conversation.
>
> I was not trying to say anything about what people making the calls care
> about; I was trying to clarify what the standard does and does not say.
> However, I agree with you that it is unlikely that someone calling
> MPI_Send() is too worried about what the CPU utilization is during the
> call.
>
>> I was trying to point out that programs utilizing non-blocking IO may
>> have work that will be adversely impacted by CPU utilization for
>> messaging. These are the people who care about CPU utilization for
>> messaging. This I hope answers your prior question, at least partially.
>
> I agree that people using MPI_Isend() and related non-blocking operations
> are sometimes doing so because they would like to perform some
> computation while the communication progresses. People also use these
> calls to initiate a collection of point-to-point operations before
> waiting, so that multiple communications may proceed in parallel. The
> implementation has no way of really knowing which of these is the case.
>
> Greg just pointed out that for small messages most implementations will do
> the exact same thing as in the MPI_Send() case anyway. For large messages
> I suppose that something different could be done. In our implementation
> (MPICH2), to my knowledge we do not differentiate.
>
> You should understand that the way MPI implementations are measured is by
> their performance, not CPU utilization, so there is pressure to push the
> former as much as possible at the expense of the latter.
>
>> Perhaps your applications demand low latency with no concern for the CPU
>> during the time spent blocking. That is fine. But some applications
>> benefit from overlapping computation and communication, and the cycles
>> not wasted by the CPU on communication can be used productively.
>
> I wouldn't categorize the cycles spent on communication as "wasted"; it's
> not like we code in extraneous math just to keep the CPU pegged :).
>
> Regards,
>
> Rob
> ---
> Rob Ross, Mathematics and Computer Science Division, Argonne National Lab
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From sadat_vit at yahoo.co.in Fri Feb 11 23:38:32 2005
From: sadat_vit at yahoo.co.in (sadat khan)
Date: Wed Nov 25 01:03:47 2009
Subject: [Beowulf] BEOWULF vs NORMAL CLUSTER
Message-ID: <20050212073832.28306.qmail@web8310.mail.in.yahoo.com>

I am a new addition to this mailing list. I recently got interested in the field of high performance computing. We had Mr. Anand Babu in our college recently (the creator of THUNDER, the 5th fastest supercomputer in the world), and he gave a really good talk on clustering.

First, I would like to enquire whether there is any difference between a Beowulf and a normal cluster, or is "Beowulf" just another name for a cluster? Another thing: what exactly do packages like MPI and PVM do? I would be highly grateful for the help.

Yahoo! India Matrimony: Find your life partner online.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20050212/3488d429/attachment.html

From topa_007 at yahoo.com Sat Feb 12 05:40:49 2005
From: topa_007 at yahoo.com (Toufeeq Hussain)
Date: Wed Nov 25 01:03:47 2009
Subject: [Beowulf] Problem executing programs on lam-mpi
Message-ID: <20050212134049.70098.qmail@web30209.mail.mud.yahoo.com>

Hi,

I get the following message while running an MPI program on a 2-node cluster*:

mpirun: cannot start ./a.out on n0: No such file or directory

I'm running mpirun as such:
$ mpirun C ./a.out

compiled lam as such:
./configure --without-romio --with-rsh="ssh -x"

*recon/lamboot execute successfully.
topa@debian:~$ lamboot -v hosts

LAM 7.1.1/MPI 2 C++ - Indiana University

n-1<32615> ssi:boot:base:linear: booting n0 (devian)
n-1<32615> ssi:boot:base:linear: booting n1 (debian)
n-1<32615> ssi:boot:base:linear: finished

*lamnodes gives the following output:

topa@debian:~/mpi_progs$ lamnodes
n0 devian:1:
n1 debian:1:origin,this_node

The MPI program is a simple one:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello there");
    printf("Hello world! I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Please help,
Toufeeq

=====
############################################
# ring me @ 98401-96690 #
# mail me @ toufeeq at computer dot org #
# Debian Sarge \w 2.6.10-ck5 #
############################################

From landman at scalableinformatics.com Sat Feb 12 08:52:16 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed Nov 25 01:03:47 2009
Subject: [Beowulf] RE: A thread-safe PRNG for an OpenMP program
In-Reply-To:
References:
Message-ID: <420E3440.3080108@scalableinformatics.com>

Hi Henry:

This is for two platforms that are not targets for Intel compilers. I have solved the problem by reworking tt800 a bit, and have that working nicely in OpenMP. Thanks though.

Joe

Gabb, Henry wrote:
> Hi Joe,
> The Intel Math Kernel Library (specifically the Vector Statistical
> Library within MKL) contains threadsafe random number functions. The
> following web site has a full description:
> http://www.intel.com/software/products/mkl/features/vsl.htm. There's an
> article "Making the Monte Carlo Approach Even Easier and Faster" on
> Intel Developer Services that describes how to use VSL functions with
> OpenMP. It's available here:
> http://www.intel.com/cd/ids/developer/asmo-na/eng/95573.htm.
> > Best regards, > > Henry Gabb > Intel Parallel Applications Center > > > >>Hi folks: >> >> I need to get a thread-safe pseudo-random number generator. All I > > >>have found online was SPRNG which is set up for MPI. Anyone have a >>quick pointer to their favorite thread safe PRNG that works well in > > OpenMP? > >> Thanks. >> >>Joe >> >>-- >>Scalable Informatics LLC, >>email: landman@scalableinformatics.com >>web : http://www.scalableinformatics.com > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From dtj at uberh4x0r.org Sat Feb 12 08:53:47 2005 From: dtj at uberh4x0r.org (Dean Johnson) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] cooling question: cfm per rack? In-Reply-To: References: Message-ID: <1108227228.3853.8.camel@terra> On Sat, 2005-02-12 at 09:49 -0500, Robert G. Brown wrote: > > Relative airflow can probably be measured with a kid's toy -- one of the > little pinwheels -- and counting revolutions with a stopwatch. > Normalizing that to absolute airflow in CFM is a bit tricky (since the > result depends to some extent on the resistance imposed by the measuring > apparatus) but somebody out there may have designed a version of this > with a real fan and magnets set so that the counting is done > electronically. In fact, I could build something to do this out of OTC > parts if I had any way to normalize the count. > Could you not use one of those cheapish wind speed devices that amateur weather folks use? That would give you a rating, presumably in miles per hour, and then figure backward based upon the area of the little fan thingy. 
That would likely be not too expensive and a great deal easier, and more accurate, to deal with than counting a pinwheel. ;-)

-Dean

From atp at piskorski.com Sat Feb 12 13:02:54 2005
From: atp at piskorski.com (Andrew Piskorski)
Date: Wed Nov 25 01:03:47 2009
Subject: [Beowulf] cooling question: cfm per rack?
In-Reply-To: <1108227228.3853.8.camel@terra>
References: <1108227228.3853.8.camel@terra>
Message-ID: <20050212210254.GA66503@piskorski.com>

On Sat, Feb 12, 2005 at 10:53:47AM -0600, Dean Johnson wrote:
> On Sat, 2005-02-12 at 09:49 -0500, Robert G. Brown wrote:
> >
> > Relative airflow can probably be measured with a kid's toy -- one of the
> Could you not use one of those cheapish wind speed devices that amateur
> weather folks use? That would give you a rating, presumably in miles per

When I asked Jack Wathey (architect of the Ammonite cluster) about the small hand-held anemometers intended for hikers and such, what he said was:

On Wed, Nov 10, 2004 at 10:58:16AM -0800, Jack Wathey wrote:
> What I found most useful was the Kestrel 2000, which measures wind speed
> and temperature. The Kestrel 1000 is cheaper ($80 vs $100) and just
> measures windspeed. The Kestrel was the only windmeter I could find that
> was sensitive and accurate enough for measuring the flowrate at ammonite's
> filters (typically in the 120 to 200 feet per minute range). They are
> EXTREMELY DELICATE though! You can wreck the sapphire bearing just by
> blowing on it hard (yes, I discovered this the hard way). But the bearing
> and impeller are replaceable for about $15, so it's not a disaster.
>
> http://www.kestrelmeters.com

A while back, I purchased one here: http://store.botachtactical.com/ke20pothwime.html

-- Andrew Piskorski http://www.piskorski.com/

From emac at cybergps.net Sat Feb 12 11:30:17 2005
From: emac at cybergps.net (Eric Machala)
Date: Wed Nov 25 01:03:47 2009
Subject: [Beowulf] Home Beowulf Intial Startup Question
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com>
Message-ID: <003501c51139$496fa9f0$6e45a8c0@masstivy>

Hi all, I'm semi-new to Beowulfs but very knowledgeable about computer and network technologies. I am building a cluster of 20 Dell OptiPlex nodes (1.9 GHz, 256 MB RAM). First off, I am wondering whether the master/control node is recommended to be the same as or better than the compute nodes, and which Linux OS is recommended (Red Hat, Mandrake, etc.), or anyone's recommendations. I'm also looking for links or resources for tools and software, like parallel kernel upgrades and monitoring tools; anything for setting up a Linux Beowulf to make this go smoothly.

Eric M
Network Admin/CF
Emac@cybergps.net

From steve_heaton at ozemail.com.au Sat Feb 12 15:41:22 2005
From: steve_heaton at ozemail.com.au (Fringe Dweller)
Date: Wed Nov 25 01:03:47 2009
Subject: [Beowulf] cooling question - dedicated infrastructure
In-Reply-To: <200502122000.j1CK096k019160@bluewest.scyld.com>
References: <200502122000.j1CK096k019160@bluewest.scyld.com>
Message-ID: <420E9422.7080002@ozemail.com.au>

An enlightening discussion re aircon, peoples. Thanks. A couple of "war stories" :)

I think RGB touched on problems with "default" aircon behaviour in cooler climes. We have similar problems in the warmer parts of this blue marble. Your typical default behaviour is to put the aircon into standby overnight, even in the middle of summer. I mean, nobody's there and it's cooler overnight anyway, right? Well, yes, but if you're pushing your IT hard overnight... you can see the consequences.
Make sure you "own" your aircon :)

Another reason to ensure independence from anything related to the "building" is power. I had a customer in a very large building whose UPS would always trip every weekday morning at 6am and 6:30. Why? 6am => aircon up! 6:30 => lift motors up! The current draw for those two events is staggering. That's why you spend big bucks on the supporting infrastructure =)

Stevo

From hahn at physics.mcmaster.ca Sun Feb 13 12:41:51 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Wed Nov 25 01:03:47 2009
Subject: [Beowulf] Home Beowulf Intial Startup Question
In-Reply-To: <003501c51139$496fa9f0$6e45a8c0@masstivy>
Message-ID:

> netowrk technologies i am building a 20 node dell optiplex 1.9 ghz 256 ram

kinda low on ram there, but for a learning cluster, that's plenty. (actually 20 is kinda big for such a cluster...)

> blah blah nodes wondering first off if master control is recommened to be
> same or better than nodes and what is recommened Linux O/s redhat or
> mandrake etc... or anyones recommendations

distros don't matter - none of them are significantly different, and they all work. people who care about distros are more interested in desktop decor than getting work done ;) admittedly, I am not a never-reinvent-the-wheel person. NRTW is worse than NIH, IMO. (some wheels desperately need reinvention; all progress comes from reinvention, etc.)

> Im also looking for some links or resources for tools aka software like
> parallel kernel upgrades moniter tools anything for setting up Linux
> beowulf to make this go smoothly

to me, "smooth" means "no extra load per node". I strongly prefer net-booting, or at least net-root setups. people will tell you that using NFS for this is horribly inefficient, dangerous and causes warts. but it works extremely well, at least for clusters of <= 96 nodes, based on my experience so far. things might be different if you're doing retrocomputing based on a half-duplex 10mbps network or have large IO loads.
the benefit is that your cluster acts like you have just one slave node. the cost is that you have to do a pretty minor amount of work to hack something like Fedora to boot diskless (small changes to the initrd.) and of course, it does mean that "incidental" file IO will cause network traffic. it's not clear to me that this is a problem, though, since:

- nodes are normally configured to be fairly minimal
- you don't have 30 user logins on each one, with people running ls/bash/netscape/gcc all the time.
- NFS is not that bad at caching, and you can help this out by upping the per-mount cache parameters a bit.
- it's awfully nice to have a nearly fully functional node even after its disk dies.
- my "diskless" nodes actually do have local swap and /tmp. disks are cheap and handy, just don't *depend* on them.
- you can easily imagine a hybrid system that boots somehow (PXE or from disk), and does an rsync or rpm/yum/systemimager equivalent. I don't really see the point though.
- having your root FS exported read-only is also kind of nice: good security is layered security...

From mathog at mendel.bio.caltech.edu Sun Feb 13 13:50:34 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Wed Nov 25 01:03:47 2009
Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?
Message-ID:

There are a series of white papers by APC here:

http://www.apc.com/tools/mytools/index.cfm?action=wp

where they discuss various power and cooling factors. They note a disconnect between the higher densities achieved by blades and similar high-density racks and the practicality of actually cooling these beasts. Basically it comes down to: you save space on the rack and then give it all back on the cooling system. Think of it minimally in these terms - to move enough cfm at less than 30 feet per minute starts to require a duct larger than the rack itself!
In terms of TCO, at the moment, APC rejects the notion that these ultra high density machines are cost effective because they are so very difficult to cool. It seems to me that at a certain power point the racks are going to have to resort to water cooling. Long ago the ECL mainframes were cooled this way, but it's been a long time since most of us have seen water pipes running into the computers in a machine room. Cooling a 10 kW rack well looks to be extremely tough with air, and going much above that would seem to require something approaching a dedicated wind tunnel. Any opinions on how high the power dissipation in racks will go before the manufacturers throw in the air cooling towel and start shipping them with water connections? If you were designing a computer room today (which I am) what would you allow for the maximum power dissipation per rack _to_be_handled_ by_the_room_A/C. The assumption being that in 8 years if somebody buys a 40kW (heaven forbid) rack it will dump its heat through a separate water cooling system. Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rgb at phy.duke.edu Sun Feb 13 15:47:22 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? In-Reply-To: References: Message-ID: On Sun, 13 Feb 2005, David Mathog wrote: > There are a series of white papers by APC here: > > http://www.apc.com/tools/mytools/index.cfm?action=wp That link doesn't work for me (apc's website barfs on it) but I googled and worked through their gatekeeper to get access. After "logging in" (yuck) I'm going to try to download: WP-5 Cooling Imperatives for Data Centers and Network Rooms Effective next generation data centers and network rooms must address the known needs and problems relating to current and past designs. 
This paper presents a categorized and prioritized collection of cooling needs and problems as obtained through systematic user interviews.

which I'm hoping is the one you are referring to above.

> where they discuss various power and cooling factors. They note
> a disconnect between the higher densities achieved by blades and
> similar high density racks and the practicality of actually
> cooling these beasts. Basically it comes down to you save space
> on the rack and then give it all back on the cooling system. Think
> of it minimally in these terms - to move enough cfm at less than 30
> feet per minute starts to require a duct larger than the rack itself!
>
> In terms of TCO, at the moment, APC rejects the notion that
> these ultra high density machines are cost effective because they
> are so very difficult to cool.

From what I learned of bladed systems back when I reviewed them for my own purposes, this isn't terribly surprising, but it is really valuable to have a well-researched document that explains how and why. 10 KW (think 100 100W light bulbs) in what, 2 m^3 -- that's a lot of energy to get rid of, and almost by definition you're removing it from components that are packed as tightly as possible.

> It seems to me that at a certain power point the racks are going to
> have to resort to water cooling. Long ago the ECL mainframes were
> cooled this way, but it's been a long time since most of us have
> seen water pipes running into the computers in a machine room.
>
> Cooling a 10 kW rack well looks to be extremely tough with air,
> and going much above that would seem to require something approaching
> a dedicated wind tunnel. Any opinions on how high the power
> dissipation in racks will go before the manufacturers throw
> in the air cooling towel and start shipping them with water
> connections?

I think you're within a factor of 2 or so of the SANE threshold at 10KW. A rack full of 220 W Opterons is there already (~40 1U enclosures).
I'd "believe" that you could double that with a clever rack design, e.g. Rackable's, but somewhere in this ballpark...it stops being sane. > If you were designing a computer room today (which I am) what would > you allow for the maximum power dissipation per rack _to_be_handled_ > by_the_room_A/C. The assumption being that in 8 years if somebody > buys a 40kW (heaven forbid) rack it will dump its heat through > a separate water cooling system. This is a tough one. For a standard rack, ballpark of 10 KW is accessible today. For a Rackable rack, I think that they can not quite double this (but this is strictly from memory -- something like 4 CPUs per U, but they use a custom power distribution which cuts power and a specially designed airflow which avoids recycling used cooling air). I don't know what bladed racks achieve in power density -- the earlier blades I looked at had throttled back CPUs but I imagine that they've cranked them up at this point (and cranked up the heat along with them). Ya pays your money and ya takes your choice. An absolute limit of 25 (or even 30) KW/rack seems more than reasonable to me, but then, I'd "just say no" to rack/serverroom designs that pack more power than I think can sanely be dissipated in any given volume. Note that I consider water cooled systems to be insane a priori for all but a small fraction of server room or cluster operations, "space" generally being cheaper than the expense associated with achieving the highest possible spatial density of heat dissipating CPUs. I mean, why stop at water? Liquid Nitrogen. Liquid Helium. If money is no option, why not? OTOH, when money matters, at some point it (usually) gets to be cheaper to just build another cluster/server room, right? rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From james.p.lux at jpl.nasa.gov Sun Feb 13 16:06:06 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? References: Message-ID: <001a01c51228$fb48ef20$1af69580@LAPTOP152422> ----- Original Message ----- From: "David Mathog" To: Sent: Sunday, February 13, 2005 1:50 PM Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? > There are a series of white papers by APC here: > > http://www.apc.com/tools/mytools/index.cfm?action=wp > > where they discuss various power and cooling factors. They note > a disconnect between the higher densities achieved by blades and > similar high density racks and the practicality of actually > cooling these beasts. Basically it comes down to you save space > on the rack and then give it all back on the cooling system. Think > of it minimally in these terms - to move enough cfm at less than 30 > feet per minute starts to require a duct larger than the rack itself! I think that's 30 ft/second.. 1800 lfpm would be a reasonable duct speed... 30 lfpm is really really slow (that's 1/2 ft/sec, which is a pretty darn gentle breeze) > > In terms of TCO, at the moment, APC rejects the notion that > these ultra high density machines are cost effective because they > are so very difficult to cool. > > It seems to me that at a certain power point the racks are going to > have to resort to water cooling. Long ago the ECL mainframes were > cooled this way, but it's been a long time since most of us have > seen water pipes running into the computers in a machine room. High power density devices (like power electronics or high power vacuum tubes) have always resorted to liquid cooling. It's so much more efficient than trying to cool with air. For a variety of reasons, but primarily because it separates the problem of physical device and radiator surface. 
Consider liquid vs air cooled internal combustion engines. Really high power density often uses some sort of phase change (ebullient) cooling, although the design challenges are significant. Even some laptops have used liquid or phase change cooling (heat pipes) to move the heat from the CPU to the case. An interesting exception to liquid cooling for high power devices is big generators, which are cooled with hydrogen gas (low viscosity and density, so low aerodynamic drag) But liquid cooling, per se, isn't a crippling thing to work with. And, it actually allows certain design economies: no more do you have to constrain the design for air flow, or conduction through the boards, nor do you have to fool with an array of CPU fans, video card fans, etc. > > Cooling a 10 kW rack well looks to be extremely tough with air, > and going much above that would seem to require something approaching > a dedicated wind tunnel. Any opinions on how high the power > dissipation in racks will go before the manufacturers throw > in the air cooling towel and start shipping them with water > connections? Consider that 10kW is 5-10 times the power dissipation of a hair dryer. Other solutions that might turn up are an internal cooling loop to move heat from inside to a big heatsink on the surface. Modern rack mounted PCs aren't particularly designed for efficient thermal transfer with minimal air flow. (there's no economic incentive for it) There are economies of scale to a common chiller, though, because when you get to large HVAC, cold water is what you get, rather than cold air, because moving cold air is a LOT more expensive than moving cold water. > > If you were designing a computer room today (which I am) what would > you allow for the maximum power dissipation per rack _to_be_handled_ > by_the_room_A/C. The assumption being that in 8 years if somebody > buys a 40kW (heaven forbid) rack it will dump its heat through > a separate water cooling system. 
There are such things as individual rack chillers, which you would bolt to a rack and then hook up to a centralized cold water source. > > Thanks, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From james.p.lux at jpl.nasa.gov Sun Feb 13 19:33:50 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? References: Message-ID: <000401c51246$00751150$26f49580@LAPTOP152422> > > > > I think you're within a factor of 2 or so of the SANE threshold at 10KW. > A rack full of 220 W Opterons is there already (~40 1U enclosures). I'd > "believe" that you could double that with a clever rack design, e.g. > Rackable's, but somewhere in this ballpark...it stops being sane. > > > If you were designing a computer room today (which I am) what would > > you allow for the maximum power dissipation per rack _to_be_handled_ > > by_the_room_A/C. The assumption being that in 8 years if somebody > > buys a 40kW (heaven forbid) rack it will dump its heat through > > a separate water cooling system. > > This is a tough one. For a standard rack, ballpark of 10 KW is > accessible today. For a Rackable rack, I think that they can not quite > double this (but this is strictly from memory -- something like 4 CPUs > per U, but they use a custom power distribution which cuts power and a > specially designed airflow which avoids recycling used cooling air). I > don't know what bladed racks achieve in power density -- the earlier > blades I looked at had throttled back CPUs but I imagine that they've > cranked them up at this point (and cranked up the heat along with them). > > Ya pays your money and ya takes your choice. 
An absolute limit of 25 > (or even 30) KW/rack seems more than reasonable to me, but then, I'd > "just say no" to rack/serverroom designs that pack more power than I > think can sanely be dissipated in any given volume. Note that I consider > water cooled systems to be insane a priori for all but a small fraction > of server room or cluster operations, "space" generally being cheaper > than the expense associated with achieving the highest possible spatial > density of heat dissipating CPUs. I mean, why stop at water? Liquid > Nitrogen. Liquid Helium. If money is no option, why not? OTOH, when > money matters, at some point it (usually) gets to be cheaper to just > build another cluster/server room, right? The speed of light starts to set another limit for the physical size, if you want real speed. There's a reason why the old Crays are compact and liquid cooled. It's that several nanoseconds per foot propagation delay. Once you get past a certain threshold, you're actually better off going to very dense form factors and liquid cooling, in many areas. I think that most clusters haven't reached the performance point where it's worth liquid cooling the processors, but it's probably pretty close to the threshold. Adding machine room space is expensive for other reasons. You've already got to have the water chillers for any sort of major sized cluster (to cool the air), so the incremental cost to providing an appropriate interface to the racks and starting to build racks in liquid cooled configurations can't be far away. Liquid cooling is MUCH more efficient than air cooling: better heat transfer, better life (more even temperatures), less real estate required, etc. The hangup now is that nobody makes liquid cooled PCs as a commodity, mass production item. What you'll find is liquid cooling retrofits that don't take advantage of what liquid cooling can get you. 
If you look at high performance radar or sonar processors and such that use liquid cooling, the layout and physical configuration is MUCH different (partly driven by the fact that the viscosity of liquid is higher than air). Wouldn't YOU like to have, say, 1000 processors in one rack, with a 2-3" flexible pipe to somewhere else? Especially if it was perfectly quiet? And could sit next to your desk? (1000 processors*100W each is 100kW).

From rene at renestorm.de Sat Feb 12 20:29:58 2005
From: rene at renestorm.de (rene)
Date: Wed Nov 25 01:03:47 2009
Subject: [Beowulf] Block send mpi
Message-ID: <200502130529.58915.rene@renestorm.de>

Hi folks,

I know this isn't an MPI forum; even so, allow me a question about block sending. I got some(times) nice SIGSEGVs with that code (C++ implementation). Did I code something totally wrong? I really don't understand this function.

// int MPI_Buffer_attach( void *buffer, int size )

int packsize;
MPI_Pack_size (bit, MPI_INT, newcomm, &packsize);
int bufsize = packsize + (MPI_BSEND_OVERHEAD);
void *buf = new (void (*[packsize]) ());
MPI_Buffer_attach (buf, bufsize);
ierr = MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0, newcomm);
MPI_Buffer_detach (&buf, &bufsize);

Thanks,
--
Rene Storm
@Cluster

From maurice at harddata.com Sun Feb 13 11:30:43 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Wed Nov 25 01:03:47 2009
Subject: [Beowulf] cooling question: cfm per rack?
In-Reply-To: <200502122000.j1CK096j019160@bluewest.scyld.com>
References: <200502122000.j1CK096j019160@bluewest.scyld.com>
Message-ID: <420FAAE3.9070108@harddata.com>

Dean Johnson wrote:
> On Sat, 2005-02-12 at 09:49 -0500, Robert G. Brown wrote:
>
>>> Relative airflow can probably be measured with a kid's toy -- one of the
>>> little pinwheels -- and counting revolutions with a stopwatch.
>>> Normalizing that to absolute airflow in CFM is a bit tricky (since the
>>> result depends to some extent on the resistance imposed by the measuring
>>> apparatus) but somebody out there may have designed a version of this
>>> with a real fan and magnets set so that the counting is done
>>> electronically. In fact, I could build something to do this out of OTC
>>> parts if I had any way to normalize the count.
>
> Could you not use one of those cheapish wind speed devices that amateur
> weather folks use? That would give you a rating, presumably in miles per
> hour, and then figure backward based upon the area of the little fan
> thingy. That would likely be not too expensive and a great deal easier,
> and more accurate, to deal with than counting a pinwheel. ;-)
>
> -Dean

One can also go to an auto wreckers and, from many newer models of cars, get a Mass Air Flow sensor (MAF) from the throttle body. Modern cars use these, in conjunction with an O2 sensor on the exhaust, to manage fuel injection. The MAF returns a variable DC voltage, usually in the range of 0 to 5V (depending on air speed). Make a tube, mount the MAF with the probe end in the tube, attach it to the back of the device being measured. Supply 12V DC, connect to the output for measurement. Obviously this would have to be calibrated, but it is cheap, very accurate, and very reliable.

If you want to make it more useful, a lot of modern cars also use a barometric pressure sensor, and the calculations can be done using both outputs. This helps a lot, as things like current weather conditions and altitude have a large bearing on air pressure. Measuring flow by speed only, and ignoring pressure, is a fairly inaccurate method.

Lastly, one can measure the humidity, as this also has a pretty large influence on the cooling capacity of the air being moved. For around $25 one can cannibalize the parts and cabling from a modern car wreck.
All that is left is to provide a DC 12V source, a computer with a 4 channel A/D chip on a proto board, and some calibration. The calibration will be the toughest challenge as you will need accurate precalibrated instruments for a test session, but at least this is one time, and may be borrowed.. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20050213/f138ce20/attachment.html From landman at scalableinformatics.com Sun Feb 13 21:42:21 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Block send mpi In-Reply-To: <200502130529.58915.rene@renestorm.de> References: <200502130529.58915.rene@renestorm.de> Message-ID: <42103A3D.8020605@scalableinformatics.com> Rene: More data. Where exactly does it SEGV? At the void *buf line? at the Pack? or the Bsend? Did you compile with -g? Do you have a core dump? Joe rene wrote: > Hi folks, > > i know, this isn't a mpi forum, even so allow me a question about block > sending. > > i got some(times) nice SIGSEGVs with that code (C++ implementation). > Did I code something totally wrong? > I really don't understand this function. > // int MPI_Buffer_attach( void *buffer, int size ) > > int packsize; > MPI_Pack_size (bit, MPI_INT, newcomm, &packsize); > int bufsize = packsize + (MPI_BSEND_OVERHEAD); > void *buf = new (void (*[packsize]) ()); > MPI_Buffer_attach (buf, bufsize); > ierr =MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0, newcomm); > MPI_Buffer_detach (&buf, &bufsize); > > Thanks, -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Sun Feb 13 22:31:26 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] cooling question: cfm per rack? 
References: <200502122000.j1CK096j019160@bluewest.scyld.com> <420FAAE3.9070108@harddata.com> Message-ID: <002301c5125e$ede1ef40$32a8a8c0@LAPTOP152422> ----- Original Message ----- From: Maurice Hilarius

> Relative airflow can probably be measured with a kid's toy -- one of the > little pinwheels -- and counting revolutions with a stopwatch. > Normalizing that to absolute airflow in CFM is a bit tricky (since the > result depends to some extent on the resistance imposed by the measuring > apparatus) but somebody out there may have designed a version of this > with a real fan and magnets set so that the counting is done > electronically. In fact, I could build something to do this out of OTC > parts if I had any way to normalize the count.

> Could you not use one of those cheapish wind speed devices that amateur weather folks use? That would give you a rating, presumably in miles per hour, and then figure backward based upon the area of the little fan thingy. That would likely be not too expensive and a great deal easier, and more accurate, to deal with than counting a pinwheel. ;-) -Dean

One can also go to an auto wreckers and, from many newer models of cars, get a Mass Air Flow sensor (MAF) from the throttle body. Modern cars use these, in conjunction with an O2 sensor on the exhaust, to manage fuel injection. The MAF returns a variable DC voltage, usually in the range of 0 to 5V (depending on air speed). Make a tube, mount the MAF with the probe end in the tube, and attach it to the back of the device being measured. Supply 12V DC, and connect to the output for measurement. Obviously this would have to be calibrated. It is cheap, very accurate, and very reliable. If you want to make it more useful, a lot of modern cars also use a barometric pressure sensor, and the calculations can be done using both outputs. This helps a lot, as things like current weather conditions and altitude have a large bearing on air pressure.
Measuring flow by speed only, ignoring pressure, is a fairly inaccurate method. Lastly, one can measure the humidity, as this also has a pretty large influence on the cooling capacity of the air being moved. For around $25 one can cannibalize the parts and cabling from a modern car wreck. All that is left is to provide a DC 12V source, a computer with a 4 channel A/D chip on a proto board, and some calibration. The calibration will be the toughest challenge as you will need accurate precalibrated instruments for a test session, but at least this is one time, and may be borrowed.

----

The problem with automotive mass air flow sensors is sensitivity at low flows. Consider, for a moment, a 1.8 liter engine turning over at 1800 rpm (call it 30 rev/sec). A four-stroke engine draws its displacement every two revolutions, so that's 1.8*15 liters/sec of air (27 liters/sec), being drawn through a tube some 5-10 cm in diameter (call it 60 cm2). That's 450 cm/sec, or 4.5 m/sec -- roughly 885 linear ft/minute, a fairly fast airflow in HVAC terms. And that's the bottom of the range for the automotive sensor. From rgb at phy.duke.edu Mon Feb 14 03:18:27 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? In-Reply-To: <000401c51246$00751150$26f49580@LAPTOP152422> References: <000401c51246$00751150$26f49580@LAPTOP152422> Message-ID: On Sun, 13 Feb 2005, Jim Lux wrote: > > I think you're within a factor of 2 or so of the SANE threshold at 10KW. > > A rack full of 220 W Opterons is there already (~40 1U enclosures). I'd > > "believe" that you could double that with a clever rack design, e.g. > > Rackable's, but somewhere in this ballpark...it stops being sane. > > > > > If you were designing a computer room today (which I am) what would > > > you allow for the maximum power dissipation per rack _to_be_handled_ > > > by_the_room_A/C.
The assumption being that in 8 years if somebody > > > buys a 40kW (heaven forbid) rack it will dump its heat through > > > a separate water cooling system. > > > > This is a tough one. For a standard rack, ballpark of 10 KW is > > accessible today. For a Rackable rack, I think that they can not quite > > double this (but this is strictly from memory -- something like 4 CPUs > > per U, but they use a custom power distribution which cuts power and a > > specially designed airflow which avoids recycling used cooling air). I > > don't know what bladed racks achieve in power density -- the earlier > > blades I looked at had throttled back CPUs but I imagine that they've > > cranked them up at this point (and cranked up the heat along with them). > > > > Ya pays your money and ya takes your choice. An absolute limit of 25 > > (or even 30) KW/rack seems more than reasonable to me, but then, I'd > > "just say no" to rack/serverroom designs that pack more power than I > > think can sanely be dissipated in any given volume. Note that I consider > > water cooled systems to be insane a priori for all but a small fraction > > of server room or cluster operations, "space" generally being cheaper > > than the expense associated with achieving the highest possible spatial > > density of heat dissipating CPUs. I mean, why stop at water? Liquid > > Nitrogen. Liquid Helium. If money is no option, why not? OTOH, when > > money matters, at some point it (usually) gets to be cheaper to just Keyword: ^^^^^^^ > > build another cluster/server room, right? Sure, I agree with everything below, for bleeding edge work. Or if you're building a cluster in your Manhattan office, where for whatever reason you have to work with a space the size of a broom closet (but where you miraculously have access to a stream of chilled water, or liquid nitrogen, or liquid helium). 
This just (IMO) pushes you over some sort of magic threshold that (while arbitrary and existing perhaps only in my fevered imagination) separates "COTS clusters" from a "big iron supercomputer". I have a hard time seeing liquid cooled clusters as being a beowulf in the sense I have grown to know and love. COTS clusters have always been about being ABLE to DIY, and while I can (if my life depends on it) do plumbing, it just seems like there would be some highly nonlinear cost and hassle thresholds in there. Also, I just cannot see COTS systems being built with copper pipes and coupling valves where you hook them into your household or office chilled water supply at your desk. I suspect that COTS desktops and even server mobos will continue to be engineered to be air cooled in the foreseeable future. Now your observation that racks themselves may start coming with a pair of copper pipes and couplings for a built-in blower and heat exchanger -- so the rack itself is in some sense "liquid cooled", while the actual nodes within are still COTS mobos cooled by air -- I don't know what the cost and volume trade-offs are of this solution. Cooling the air in the rack bases (more likely at the top of each rack and ducting the cold air down to the base) vs. cooling the air in a big Liebert and piping the cool air around to the bases in a raised floor -- hmmm. One thing to remember (and I think it was brought up one of the last times this issue was raised on list): I know from bitter experience that water couplings are a PITA to reliably get, and keep, tight under pressure. When they leak ("when" because of Murphy), they're going to make God's Own mess and potentially ruin many tens of thousands of dollars worth of hardware.
Heat exchangers at the tops of racks also increase the probability that humidity will be a problem -- I also know from bitter experience that overhead cold air ducting has a tendency to sweat unless carefully insulated, and the sweat in a humid climate like NC will inevitably drip into whatever is below. Heat exchangers at the bottom make it harder to move the warm air exhausted at the rack tops back to the bottom for recooling, as you're working against an air pressure/density convective flow differential and not with it. Finally, there are likely to be Human Resources and state regulatory issues with liquid cooled electronics -- systems and network engineers somehow are viewed as being competent to manage end-stage electronics from the plug point on even by the unions in all but the most rabid of union shops (although I have heard of places where you have to call a union employee in to do any major plugging or unplugging of certain kinds of hardware). That simply won't be the case with liquid cooled hardware. I may be able to work on my household plumbing (and wiring), but if I set my hand to plumbing at Duke the HR Gods and the State would get Angry, and if anything went wrong (like a leak causing a short and a fire) I would be Held Liable. This adds another project-staffing human notch to the TCO -- likely a fairly significant one, as the heat exchanger/blowers in EACH rack might well need servicing and inspection 1-2x a year (as the room unit does now). None of these things are insurmountable difficulties, and as you note there are certain big, expensive pieces of hot hardware (big lasers, giant magnets, automobile engines) that one DOES plug right into a chilled water loop. With the exception of car engines they tend to be components with 6-8 figure price tags, though, where tacking on a full or part time FTE for managing the plumbing etc. is a small fraction of the total marginal cost of operation.
I'd expect this to make sense only for clusters in this same category -- really large, already expensive clusters shooting for bleeding edge performance (top 10 of top 500) at very high density someplace where a) physical space is very "expensive" (justifying the trade off economically); or b) speed of light and/or interconnect lengths are indeed an issue. Note that fixing the latter will likely rely as much on moving out of the COTS arena for the cluster interconnect as it does on cooling alone. High end cluster interconnects are again almost by definition engineered on the assumption of air-cooled node densities, and internode latencies are specified by worst-case assumptions and protocol, not speed of light in the sense that interconnect length is an important parameter in the overall latency. As in: 1 usec is pretty good latency for a modern interconnect IIRC, and a light-usecond is 3x10^8 x 10^-6 = 300 meters. I'd guess that very little of the internode latency over fiber is due to speed of light delays per se and nearly all of it is in the interconnects themselves, the switches, and the node bus interface. > The speed of light starts to set another limit for the physical size, if you > want real speed. There's a reason why the old Crays are compact and liquid > cooled. It's that several nanoseconds per foot propagation delay. Once you There's also a reason why old Crays are currently used primarily as lobby art, wherever they haven't been disassembled and bathed in mercury to recover all that gold. Several reasons, actually, but liquid cooling and the hassle and expense it entailed are a big one. Many a Cray was finally decommissioned when one could build and operate a true COTS cluster with as much or more raw horsepower for what it cost for just the infrastructure support for the Cray it supplanted.
Like it or not, Moore's Law biases cost-benefit solutions heavily towards the COTS and disposable, and wet-cooling requires a significant and sustained investment in a particular technology that is likely to remain non-mainstream, human-resource intensive, and hence nonlinearly costly in a TCO CBA. One needs significant benefit in order to make it worthwhile. > get past a certain threshold, you're actually better off going to very dense > form factors and liquid cooling, in many areas. I think that most clusters > haven't reached the performance point where it's worth liquid cooling the > processors, but it's probably pretty close to the threshold. Adding machine > room space is expensive for other reasons. You've already got to have the > water chillers for any sort of major sized cluster (to cool the air), so the > incremental cost to providing an appropriate interface to the racks and > starting to build racks in liquid cooled configurations can't be far away. > > Liquid cooling is MUCH more efficient than air cooling: better heat > transfer, better life (more even temperatures), less real estate required, > etc. The hangup now is that nobody makes liquid cooled PCs as a commodity, > mass production item. What you'll find is liquid cooling retrofits that > don't take advantage of what liquid cooling can get you. If you look at high > performance radar or sonar processors and such that use liquid cooling, the > layout and physical configuration is MUCH different (partly driven by the > fact that the viscosity of liquid is higher than air). > > Wouldn't YOU like to have, say, 1000 processors in one rack, with a 2-3" > flexible pipe to somewhere else? Especially if it was perfectly quiet? And > could sit next to your desk? (1000 processors*100W each is 100kW). If somebody else paid for and fed the whole thing, you could multiply the capacity by an order of magnitude and use liquid nitrogen for cooling instead of water and I'd simply love it. 
And as Austin Powers might add, I'd like a gold-plated potty as well -- but I'm not going to get it...;-) Alas, in the real world it isn't about what I'd "like", it is about what I can afford, about what I can convince a grant agency to pay for. High infrastructure costs come out of node count, and node count matters -- in many projects, it is the PRIMARY thing that matters. High density increases infrastructure costs, often nonlinearly, and hence decreases node count at any fixed budget. In order for liquid cooling to ever make sense for COTS clusters, it would have to BECOME COTS -- basically, to become cheap in both hardware and human terms. Might happen, might happen, but I'm not holding my breath... rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From john.hearns at streamline-computing.com Mon Feb 14 03:32:33 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Home Beowulf Initial Startup Question In-Reply-To: References: Message-ID: <1108380753.5708.0.camel@localhost.localdomain> On Sun, 2005-02-13 at 15:41 -0500, Mark Hahn wrote: > > netowrk technologies i am building a 20 node dell optiplex 1.9 ghz 256 ram Have a look at the new O'Reilly book 'High Performance Linux Clusters with OSCAR, Rocks, openMosix, and MPI'. Should be of help to you. I'm doing a review for the UKUUG newsletter.
From ashley at quadrics.com Mon Feb 14 08:23:02 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> Message-ID: <1108398183.8243.54.camel@localhost.localdomain> On Fri, 2005-02-11 at 20:47 -0600, Rob Ross wrote: > I agree that people using MPI_Isend() and related non-blocking operations > are sometimes doing so because they would like to perform some > computation while the communication progresses. People also use these > calls to initiate a collection of point-to-point operations before > waiting, so that multiple communications may proceed in parallel. The > implementation has no way of really knowing which of these is the case. Either of these reasons for using non-blocking sends is valid, and both will benefit from low CPU use in the Send call. Why would the implementation want to know the reason for using non-blocking sends? > You should understand that the way MPI implementations are measured is by > their performance, not CPU utilization, so there is pressure to push the > former as much as possible at the expense of the latter. It's relatively difficult to measure the CPU overhead of calls: some benchmarks work out the "issue rate" of sends (operations/second), and some measure how much compute (spinning) can be achieved before having a measurable effect on the latency. Both of these are valid, but the results are harder for the non-technical person to comprehend. Headline latency/bandwidth numbers are just that: headline figures that don't tell the whole story.
Ashley, From rross at mcs.anl.gov Mon Feb 14 09:04:17 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: Message-ID: Hi Mikhail, I don't know all the implementations well enough to comment on them one-by-one. I'm sure that Rossen can talk about their implementation with regards to (a) below, and others will fill in other gaps. In general, to support (a) the implementation must either spawn a thread or have support from the NIC to make progress (this is related to the "Progress Rule" that people occasionally bring up). The standard *does not* specify that progress must be made when not in an MPI_ call. MPICH/MPICH2 do not use an extra thread (for portability one cannot assume that threads are available!). Thus the only overlap that occurs in MPICH2 over TCP is through the socket buffers. Making a sequence of MPI_Isends followed by a MPI_Wait go faster than a sequence of MPI_Sends isn't hard, particularly if the messages are to different ranks. I would guess that every implementation will provide better performance in the case where the user tells the implementation about all these concurrent operations and then MPI_Waits on the bunch. Hope this helps some, Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab On Mon, 14 Feb 2005, Mikhail Kuzminsky wrote: > Let me ask some stupid's question: which MPI implementations allow > really > > a) to overlap MPI_Isend w/computations > and/or > b) to perform a set of subsequent MPI_Isend calls faster than "the > same" set of MPI_Send calls ? > > I say only about sending of large messages. 
> > I'm interesting (1st of all) in > - Gigabit Ethernet w/LAM MPI or MPICH > - Infiniband (Mellanox equipment) w/NCSA MPI or OSU MPI > > Yours > Mikhail Kuzminsky > Zelinsky Institute of Organic Chemistry > Moscow From rross at mcs.anl.gov Mon Feb 14 09:11:31 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108398183.8243.54.camel@localhost.localdomain> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> Message-ID: On Mon, 14 Feb 2005, Ashley Pittman wrote: > On Fri, 2005-02-11 at 20:47 -0600, Rob Ross wrote: > > I agree that people using MPI_Isend() and related non-blocking operations > > are sometimes doing so because they would like to perform some > > computation while the communication progresses. People also use these > > calls to initiate a collection of point-to-point operations before > > waiting, so that multiple communications may proceed in parallel. The > > implementation has no way of really knowing which of these is the case. > > Either of these reasons for using non-blocking sends is valid and both > will benefit from low CPU use in the Send call. Why would the > implementation want to know the reason for using non-blocking sends? If you used the non-blocking send to allow for overlapped communication, then you would like the implementation to play nicely. In this case the user will compute and eventually call MPI_Test or MPI_Wait (or a flavor thereof). If you used the non-blocking sends to post a bunch of communications that you are going to then wait to complete, you probably don't care about the CPU -- you just want the messaging done. In this case the user will call MPI_Wait after posting everything it wants done. One way the implementation *could* behave is to assume the user is trying to overlap comm. and comp. 
until it sees an MPI_Wait, at which point it could go into this theoretical "burn CPU to make things go faster" mode. That mode could, for example, tweak the interrupt coalescing on an ethernet NIC to process packets more quickly (I don't know off the top of my head if that would work or not; it's just an example). All of this is moot of course unless the implementation actually has more than one algorithm that it could employ... Rob From James.P.Lux at jpl.nasa.gov Mon Feb 14 09:17:00 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] some thoughts on thermal design, liquid cooling, etc. Message-ID: <6.1.1.1.2.20050214090712.0416fd68@mail.jpl.nasa.gov> It occurs to me that the real limiting factor in producing "cluster oriented thermal design" is the volume of sales. Say you want to design a custom motherboard/package for use in clusters. This is, at a guess, probably a 3-5 million dollar project (maybe down around a million if it's real close to an existing design). Say the cost of a node is around a kilobuck or 2 (in plain, non-custom, commodity trim). If you had a cluster with 1000 of those custom mobos, you're looking at adding $3K/node to the cluster. That's a bit punitive... You could buy a lot of machine room and cooling for that $3 mil. Now, on the other hand, if you had 100 people willing to each buy a cluster of this scale, then it's only adding $30-50/node, which is a lot more reasonable. Compare this to the consumer motherboard market (which, after all, is what we are really using here...) A production run of several million mobos isn't all that huge, so a Dell or HP can and do create customized motherboard designs to meet some peculiar requirement (on-board peripherals, etc.). Such customization only adds a buck to the mobo cost, and presumably, that buck is made up in cheaper packaging, shorter cables, one less manufacturing step, or somewhere. 
Somehow, I doubt that the total sales of ALL motherboards for clusters, of a given instance of motherboard design, exceeds a million units. Cluster buyers tend to want different processors, different peripherals, etc., and each configuration change would drive a whole new design cycle. There is hope on the horizon. The increasing drive to "media computers" is creating a demand for PCs that have high performance, but are quiet and have good cooling. I have a Motorola Moxi BMC9012 "set top box" at home from the cable company, and it is basically a Linux computer with an 80GB drive and a some custom video hardware. It's also hideously noisy (for something designed to sit in your living room) and dissipates >100W (all the time.. there's no on-off button). There WILL be consumer pressure to make it silent and to do better thermal management. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From steve_heaton at ozemail.com.au Sun Feb 13 21:52:27 2005 From: steve_heaton at ozemail.com.au (steve_heaton@ozemail.com.au) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] A home cluster of mobos Message-ID: <20050214055227.YEMC24369.swebmail02.mail.ozemail.net@localhost> Dear collective of great minds I'd like to humbly introduce my little Beowulf "BORG" (Boring and Old but Real Grunt). http://members.ozemail.com.au/~sheaton/lss/ -> Computing The next performance consideration will be to start and work over TCP. Maybe a jump into GAMMA for a quick squizz? We'll see how it goes. 
Cheers Stevo This message was sent through MyMail http://www.mymail.com.au From rene at renestorm.de Mon Feb 14 03:04:45 2005 From: rene at renestorm.de (rene) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Block send mpi In-Reply-To: <42103A3D.8020605@scalableinformatics.com> References: <200502130529.58915.rene@renestorm.de> <42103A3D.8020605@scalableinformatics.com> Message-ID: <200502141204.45263.rene@renestorm.de> Hi Joe, here are some output and the changes that solve the problem. I don't know why I created a void buffer and sent an int array. After creating an int buffer I was also able to delete it ;o) Tnx anyway Rene int packsize; MPI_Pack_size (bit, MPI_INT, newcomm, &packsize); int bufsize = packsize + (MPI_BSEND_OVERHEAD); // void *buf = new (void (*[packsize]) ()); int *buf = new int[packsize]; for (int az = 0; az < repeat + 1; az++) { MPI_Buffer_attach (buf, bufsize); for (int node = 1; node < rankcount; node++) { bsend->ierr = MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0, newcomm); } MPI_Buffer_detach (&buf, &bufsize); } delete [] buf; output for the old code: Program received signal SIGSEGV, Segmentation fault. 0: 0x40ad3860 in malloc_consolidate () from /lib/libc.so.6 0: (gdb) kill rank 1 in job 4 xtrem_32898 caused collective abort of all ranks exit status of rank 1: killed by signal 9 rank 0 in job 4 xtrem_32898 caused collective abort of all ranks exit status of rank 0: killed by signal 9 1: aborting job: 1: Fatal error in MPI_Recv: Other MPI error, error stack: 1: MPI_Recv(207): MPI_Recv(buf=0x8186388, count=32, MPI_INT, src=0, tag=0, comm=0x84000002, status=0xbfffee30) failed 1: MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait() 1: MPIDI_CH3I_Progress_handle_sock_event(492): 1: connection_recv_fail(1728): 1: MPIDU_Socki_handle_read(590): connection closed by peer (set=0,sock=1) On Monday 14 February 2005 06:42, Joe Landman wrote: > Rene: > > More data. Where exactly does it SEGV?
At the void *buf line? at > the Pack? or the Bsend? Did you compile with -g? Do you have a core > dump? > > Joe > > rene wrote: > > Hi folks, > > > > i know, this isn't a mpi forum, even so allow me a question about block > > sending. > > > > i got some(times) nice SIGSEGVs with that code (C++ implementation). > > Did I code something totally wrong? > > I really don't understand this function. > > // int MPI_Buffer_attach( void *buffer, int size ) > > > > int packsize; > > MPI_Pack_size (bit, MPI_INT, newcomm, &packsize); > > int bufsize = packsize + (MPI_BSEND_OVERHEAD); > > void *buf = new (void (*[packsize]) ()); > > MPI_Buffer_attach (buf, bufsize); > > ierr =MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0, newcomm); > > MPI_Buffer_detach (&buf, &bufsize); > > > > Thanks, -- Rene Storm @Cluster Linux Cluster Consultant Hamburgerstr. 42e D-22952 Luetjensee mailto:Rene@ReneStorm.de Voice-IP: Skype.com, Rene_Storm From kus at free.net Mon Feb 14 07:47:15 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: Message-ID: In message from Rob Ross (Fri, 11 Feb 2005 20:47:22 -0600 (CST)): >Hi Isaac, >On Fri, 11 Feb 2005, Isaac Dooley wrote: >> >>Using MPI_ISend() allows programs to not waste CPU cycles waiting >>on the >> >>completion of a message transaction. >> >No, it allows the programmer to express that it wants to send a >>message >> >but not wait for it to complete right now. The API doesn't specify >>the >> >semantics of CPU utilization. It cannot, because the API doesn't >>have >> >knowledge of the hardware that will be used in the implementation. >> That is partially true. The context for my comment was under your >> assumption that everyone uses MPI_Send(). These people, as I stated >> before, do not care about what the CPU does during their blocking >>calls. >I think that it is completely true. 
I made no assumption about >everyone >using MPI_Send(); I'm a late-comer to the conversation. >I was not trying to say anything about what people making the calls >care >about; I was trying to clarify what the standard does and does not >say. >However, I agree with you that it is unlikely that someone calling >MPI_Send() is too worried about what the CPU utilization is during >the >call. >> I was trying to point out that programs utilizing non-blocking IO >>may >> have work that will be adversely impacted by CPU utilization for >> messaging. These are the people who care about CPU utilization for >> messaging. This I hopes answers your prior question, at least >>partially. >I agree that people using MPI_Isend() and related non-blocking >operations >are sometimes doing so because they would like to perform some >computation while the communication progresses. People also use >these >calls to initiate a collection of point-to-point operations before >waiting, so that multiple communications may proceed in parallel. Let me ask some stupid's question: which MPI implementations allow really a) to overlap MPI_Isend w/computations and/or b) to perform a set of subsequent MPI_Isend calls faster than "the same" set of MPI_Send calls ? I say only about sending of large messages. I'm interesting (1st of all) in - Gigabit Ethernet w/LAM MPI or MPICH - Infiniband (Mellanox equipment) w/NCSA MPI or OSU MPI Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow > The >implementation has no way of really knowing which of these is the >case. > >Greg just pointed out that for small messages most implementations >will do >the exact same thing as in the MPI_Send() case anyway. For large >messages >I suppose that something different could be done. In our >implementation >(MPICH2), to my knowledge we do not differentiate. 
> >You should understand that the way MPI implementations are measured >is by >their performance, not CPU utilization, so there is pressure to push >the >former as much as possible at the expense of the latter. > >> Perhaps your applications demand low latency with no concern for the >>CPU >> during the time spent blocking. That is fine. But some applications >> benefit from overlapping computation and communication, and the >>cycles >> not wasted by the CPU on communication can be used productively. > >I wouldn't categorize the cycles spent on communication as "wasted"; >it's >not like we code in extraneous math just to keep the CPU pegged :). > >Regards, > >Rob >--- >Rob Ross, Mathematics and Computer Science Division, Argonne National >Lab > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From mathog at mendel.bio.caltech.edu Mon Feb 14 09:24:29 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? Message-ID: Robert G. Brown wrote: > In order to for liquid cooling to ever > make sense for COTS clusters, it would have to BECOME COTS -- basically, > to become cheap in both hardware and human terms. Shuttle's itty bitty computers have a heat pipe that goes out to a radiator on the back of the case. It isn't much of a step from there to replacing the back radiator with a copper block. That block could in turn mate with another copper block which itself was on a cold water line. Ie, move the radiator even further from the CPU and other heat generating parts of the computer. So a company like shuttle could relatively easily start selling liquid cooled nodes using only minor modifications to its existing hardware. 
In this sort of a system you might have to pay to have the pros install (plumb) the rack itself, but you could still work on the nodes of the rack, as is true now. It would seem to be relatively straightforward to have the nodes mate up copper block to copper block when fully inserted, so that each node is not itself part of the rack circulation system. The tricky part is that something else would have to be attached to the copper block on the back when the node was serviced on the bench. On the plus side your racks could replace the building's current hot water supply! Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From ashley at quadrics.com Mon Feb 14 09:42:42 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> Message-ID: <1108402962.8265.25.camel@localhost.localdomain> On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote: > If you used the non-blocking send to allow for overlapped communication, > then you would like the implementation to play nicely. In this case the > user will compute and eventually call MPI_Test or MPI_Wait (or a flavor > thereof). > > If you used the non-blocking sends to post a bunch of communications that > you are going to then wait to complete, you probably don't care about the > CPU -- you just want the messaging done. In this case the user will call > MPI_Wait after posting everything it wants done. > > One way the implementation *could* behave is to assume the user is trying > to overlap comm. and comp. until it sees an MPI_Wait, at which point it > could go into this theoretical "burn CPU to make things go faster" mode. 
> That mode could, for example, tweak the interrupt coalescing on an > ethernet NIC to process packets more quickly (I don't know off the top of > my head if that would work or not; it's just an example). Maybe if you were using a channel interface (sockets) and all messages were to the same remote process then it might make sense to coalesce all the sends into a single transaction and just send this in the MPI_Wait call. The latency for a bigger network transaction *might* be lower than the sum of the issue rates for smaller ones. I'd hope that a well-written application would bunch all its sends into a single larger block if this optimisation was possible, though. Given any reasonably fast network, not doing anything until the MPI_Wait call would destroy your latency. It strikes me that this isn't overlapping comms and compute, though, but rather artificially delaying comms to allow compute to finish, which seems rather pointless. If you had a bunch of sends to do to N remote processes then I'd expect you to post them in order (non-blocking) and wait for them all at the end, the time taken to do this should be (base_latency + ( (N-1) * M )) where M is the reciprocal of the "issue rate". You can clearly see here that even for a small number of batched sends (even a 2d/3d nearest neighbour matrix) the issue rate (that is, how little CPU the send call consumes) is at least as important as the raw latency. > All of this is moot of course unless the implementation actually has more > than one algorithm that it could employ... In my experience there are often dozens of different algorithms for every situation and each has its trade-offs. Choosing the right one based on the parameters given is the tricky bit. Ashley, From rgb at phy.duke.edu Mon Feb 14 10:12:09 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?
In-Reply-To: References: Message-ID: On Mon, 14 Feb 2005, David Mathog wrote: > Robert G. Brown wrote: > > > In order for liquid cooling to ever > > make sense for COTS clusters, it would have to BECOME COTS -- basically, > > to become cheap in both hardware and human terms. > > Shuttle's itty bitty computers have a heat pipe that goes out to > a radiator on the back of the case. It isn't much of a step from > there to replacing the back radiator with a copper block. That block > could in turn mate with another copper block which itself was > on a cold water line. I.e., move the radiator even further from > the CPU and other heat generating parts of the computer. So > a company like Shuttle could relatively easily start selling > liquid cooled nodes using only minor modifications to its existing > hardware. > > In this sort of a system you might have to pay to have the pros > install (plumb) the rack itself, but you could still work on the > nodes of the rack, as is true now. It would seem to be relatively > straightforward to have the nodes mate up copper block to copper > block when fully inserted, so that each node is not itself part > of the rack circulation system. The tricky part is > that something else would have to be attached to the copper block > on the back when the node was serviced on the bench. I think there are lots of tricky parts, but I agree that it can be done. In fact, Eugen found this from Rittal: http://www.enclosureinfo.com/tech/rittal/lit/pdf/LV_lcs_01_01.pdf where it IS being done, in the sense that one can get liquid cooling adjuncts for racks that accept standard ported liquid heat sinks for CPUs and maybe a couple of other parts (disks, power supplies?).
Their "mini-chiller" per rack is only around 1.3 tons (4500 "cooling watts") which seems small, and running all the supply hoses around in and out of the systems (especially MP motherboard or blade systems) inside enclosures not really designed for them seems like it would be "interesting". I just don't think of this as being mainstream. I didn't get a price from anybody on this, but I'll bet it is an option on your newborn child per rack. The external heat exchanger idea is also "interesting". I agree that better thermal management in motherboards themselves would be desirable, but it takes a biggish chunk of copper to make a heat pipe capable of moving 100 W 20-30 cm at \kappa_Cu = 385 W/(m-K) and keep the end temperature differentials in the 20-30 K range. Maybe what, 0.5 cm in radius? > > On the plus side your racks could replace the building's current > hot water supply! Not unless you permit the max T on the sink in contact with the water to get dangerously high... (taking this as a serious, rather than a wry, remark). Ditto for numerous discussions of using server room waste heat to help heat buildings -- good idea on paper, pretty difficult in practice, and then there is summer. rgb > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Mon Feb 14 10:41:01 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? In-Reply-To: References: Message-ID: On Mon, 14 Feb 2005, Robert G.
Brown wrote: > moving 100 W 20-30 cm at \kappa_Cu = 385 W/(m-K) and keep the end > temperature differentials in the 20-30 K range. Maybe what, 0.5 cm in > radius? I meant diameter. Sorry. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From lindahl at pathscale.com Mon Feb 14 10:58:43 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108402962.8265.25.camel@localhost.localdomain> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> Message-ID: <20050214185843.GA1359@greglaptop.internal.keyresearch.com> On Mon, Feb 14, 2005 at 05:42:42PM +0000, Ashley Pittman wrote: > If you had a bunch of sends to do to N remote processes then I'd expect > you to post them in order (non-blocking) and wait for them all at the > end, the time taken to do this should be (base_latency + ( (N-1) * M )) > where M is the reciprocal of the "issue rate". You can clearly see > here that even for a small number of batched sends (even a 2d/3d nearest > neighbour matrix) the issue rate (that is how little CPU the send call > consumes) is at least as important as the raw latency. Unless I completely misunderstand your formula, M is not only the CPU the send call consumes. It's easy to find situations (fast cpu, slow network) where the cpu consumed isn't a part of M at all. Even for a modern 1 GByte/sec network, cpu consumed might not be a part of M. Reducing CPU consumed can't hurt. But reasoning about it seems to be less useful than testing actual applications.
-- greg From lindahl at pathscale.com Mon Feb 14 11:07:37 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: Message-ID: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote: > Let me ask some stupid's question: which MPI implementations allow > really > > a) to overlap MPI_Isend w/computations > and/or > b) to perform a set of subsequent MPI_Isend calls faster than "the > same" set of MPI_Send calls ? > > I say only about sending of large messages. For large messages, everyone does (b) at least partly right. (a) is pretty rare. It's difficult to get (a) right without hurting short message performance. One of the commercial MPIs, at first release, had very slow short message performance because they thought getting (a) right was more important. They've improved their short message performance since, but I still haven't seen any real application benchmarks that show benefit from their approach. -- greg From joachim at ccrl-nece.de Mon Feb 14 11:18:51 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: Message-ID: <4210F99B.3040202@ccrl-nece.de> Rob Ross wrote: > Making a sequence of MPI_Isends followed by a MPI_Wait go faster than a > sequence of MPI_Sends isn't hard, particularly if the messages are to > different ranks. I would guess that every implementation will provide > better performance in the case where the user tells the implementation > about all these concurrent operations and then MPI_Waits on the bunch. In this case, the user should think about MPI_Alltoall(v) - there are MPI implementations which do this in a smarter way than Isend/Irecv/Waitall to achieve much better performance than using the naive approach. 
Especially if you go to large process numbers, some coordination can help a lot, even for a full bisection network like a single-stage full crossbar... Generally, collectives are there to let the library know what kind of communication is coming next. All speculations in the library based on monitoring and predicting non-collective communication will probably only do good in the matching micro-benchmark (my personal experience). Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From rross at mcs.anl.gov Mon Feb 14 11:49:49 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108402962.8265.25.camel@localhost.localdomain> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> Message-ID: On Mon, 14 Feb 2005, Ashley Pittman wrote: > On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote: > > If you used the non-blocking send to allow for overlapped communication, > > then you would like the implementation to play nicely. In this case the > > user will compute and eventually call MPI_Test or MPI_Wait (or a flavor > > thereof). > > > > If you used the non-blocking sends to post a bunch of communications that > > you are going to then wait to complete, you probably don't care about the > > CPU -- you just want the messaging done. In this case the user will call > > MPI_Wait after posting everything it wants done. > > > > One way the implementation *could* behave is to assume the user is trying > > to overlap comm. and comp. until it sees an MPI_Wait, at which point it > > could go into this theoretical "burn CPU to make things go faster" mode. 
> > That mode could, for example, tweak the interrupt coalescing on an > > ethernet NIC to process packets more quickly (I don't know off the top of > > my head if that would work or not; it's just an example). > > Maybe if you were using a channel interface (sockets) and all messages > were to the same remote process then it might make sense to coalesce all > the sends into a single transaction and just send this in the MPI_Wait > call. The latency for a bigger network transaction *might* be lower > than the sum of the issue rates for smaller ones. This is exactly what MPICH2 does for the one-sided calls; see Thakur et al. in EuroPVM/MPI 2004. It can be a very big win in some situations. > I'd hope that a well-written application would bunch all its sends into > a single larger block if this optimisation was possible, though. We would hope that too, but applications do not always adhere to best practice. > Given any reasonably fast network, not doing anything until the MPI_Wait > call would destroy your latency. It strikes me that this isn't > overlapping comms and compute, though, but rather artificially delaying comms > to allow compute to finish, which seems rather pointless. I agree that postponing progress until MPI_Wait for the purposes of providing lower CPU utilization would be pointless. It can be useful for coalescing purposes, as mentioned above. But certainly there will be a latency cost. > If you had a bunch of sends to do to N remote processes then I'd expect > you to post them in order (non-blocking) and wait for them all at the > end, the time taken to do this should be (base_latency + ( (N-1) * M )) > where M is the reciprocal of the "issue rate". You can clearly see > here that even for a small number of batched sends (even a 2d/3d nearest > neighbour matrix) the issue rate (that is how little CPU the send call > consumes) is at least as important as the raw latency.
Well I wasn't trying to start an argument about the importance of CPU utilization as it relates to issue rate :). The original question simply asked if there was generally an advantage to doing what you expect people to do anyway! And I think that we agree the answer is yes. > > All of this is moot of course unless the implementation actually has more > than one algorithm that it could employ... > > In my experience there are often dozens of different algorithms for > every situation and each has its trade-offs. Choosing the right one > based on the parameters given is the tricky bit. Absolutely! And which few of those dozens are applicable to a wide-enough range of situations that you want to actually implement/debug them? Rob From rross at mcs.anl.gov Mon Feb 14 13:09:30 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4210F99B.3040202@ccrl-nece.de> References: <4210F99B.3040202@ccrl-nece.de> Message-ID: On Mon, 14 Feb 2005, Joachim Worringen wrote: > Rob Ross wrote: > > Making a sequence of MPI_Isends followed by a MPI_Wait go faster than a > > sequence of MPI_Sends isn't hard, particularly if the messages are to > > different ranks. I would guess that every implementation will provide > > better performance in the case where the user tells the implementation > > about all these concurrent operations and then MPI_Waits on the bunch. > > In this case, the user should think about MPI_Alltoall(v) - there are > MPI implementations which do this in a smarter way than > Isend/Irecv/Waitall to achieve much better performance than using the > naive approach. Especially if you go to large process numbers, some > coordination can help a lot, even for a full bisection network like a > single-stage full crossbar... Yes! We don't see nearly enough of this I think. > Generally, collectives are there to let the library know what kind of > communication is coming next.
All speculations in the library based on > monitoring and predicting non-collective communication will probably > only do good in the matching micro-benchmark (my personal experience). I agree. Which is why we don't tend to try to figure out what the user is trying to do, and instead just implement an algorithm to get things done as quickly as we can. Rob From ashley at quadrics.com Mon Feb 14 13:22:19 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> Message-ID: On 14 Feb 2005, at 19:49, Rob Ross wrote: > On Mon, 14 Feb 2005, Ashley Pittman wrote: >> On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote: >> >> Maybe if you were using a channel interface (sockets) and all messages >> were to the same remote process then it might make sense to coalesce >> all >> the sends into a single transaction and just send this in the MPI_Wait >> call. The latency for a bigger network transaction *might* be lower >> than the sum of the issue rates for smaller ones. > > This is exactly what MPICH2 does for the one-sided calls; see Thakur > et al. in EuroPVM/MPI 2004. It can be a very big win in some situations. I'll look it up. Presumably the win is because of higher bandwidth achieved by larger messages over a stream. I guess the MPI_Fence call copies data out of a receive buffer. >> I'd hope that a well-written application would bunch all its sends >> into >> a single larger block if this optimisation was possible, though. > > We would hope that too, but applications do not always adhere to best > practice.
As someone who maintains an MPI library I hope people do this; it's up to us to provide the functionality and up to application writers to actually make use of it. There are often times when it may well not be worth doing this, either because of time-to-market demands or simply when experimenting with differing algorithms. >> Given any reasonably fast network, not doing anything until the >> MPI_Wait >> call would destroy your latency. It strikes me that this isn't >> overlapping comms and compute, though, but rather artificially delaying >> comms >> to allow compute to finish, which seems rather pointless. > > I agree that postponing progress until MPI_Wait for the purposes of > providing lower CPU utilization would be pointless. It can be useful > for > coalescing purposes, as mentioned above. But certainly there will be a > latency cost. So potentially there is an optimization choice to be made: do you make the "noddy" application run faster at the cost of real performance for applications tuned to the particular library? That sounds like a whole can of worms. >>> All of this is moot of course unless the implementation actually has >>> more >>> than one algorithm that it could employ... >> >> In my experience there are often dozens of different algorithms for >> every situation and each has its trade-offs. Choosing the right one >> based on the parameters given is the tricky bit. > > Absolutely! And which few of those dozens are applicable to a > wide-enough > range of situations that you want to actually implement/debug them? Implement? Most of them. Debug/support? No more than two or three seems optimal. There are some algorithms that just don't work on a given network and some that will only be best in corner cases. Then it's just a case of choosing the correct thresholds between the remaining few. For a given call, *best* is absolute; for a given application, however, trade-offs have to be made.
Ashley, From ashley at quadrics.com Mon Feb 14 13:29:09 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050214185843.GA1359@greglaptop.internal.keyresearch.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <20050214185843.GA1359@greglaptop.internal.keyresearch.com> Message-ID: <2799056564ab963d97483d0d1d926351@quadrics.com> On 14 Feb 2005, at 18:58, Greg Lindahl wrote: > On Mon, Feb 14, 2005 at 05:42:42PM +0000, Ashley Pittman wrote: >> If you had a bunch of sends to do to N remote processes then I'd >> expect >> you to post them in order (non-blocking) and wait for them all at the >> end, the time taken to do this should be (base_latency + ( (N-1) * M >> )) >> where M is the reciprocal of the "issue rate". You can clearly see >> here that even for a small number of batched sends (even a 2d/3d nearest >> neighbour matrix) the issue rate (that is how little CPU the send call >> consumes) is at least as important as the raw latency. > > Unless I completely misunderstand your formula, M is not only the CPU > the send call consumes. It's easy to find situations (fast cpu, slow > network) where the cpu consumed isn't a part of M at all. Even for a > modern 1 GByte/sec network, cpu consumed might not be a part of M. I'm talking about our (Quadrics) network here, which has a CPU offload; a Wait call is simply a few function calls, a memory read (to test completion), a mutex lock/unlock cycle and a linked list insertion, nothing more. Some CPU is used in the send call as I said but outside the two calls there is zero CPU usage although potentially reduced memory to CPU bandwidth. > Reducing CPU consumed can't hurt.
But reasoning about it seems to be > less useful than testing actual applications. I do that as well. Ashley, From eugen at leitl.org Mon Feb 14 13:31:18 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] new company, looking for people (fwd from treese@acm.org) Message-ID: <20050214213117.GQ1404@leitl.org> (I presume a single job announcement in all these years is tolerable). ----- Forwarded message from Win Treese ----- [snip] Last fall I joined a startup called SiCortex, where we're building a new Linux cluster computer. We're ramping up the software team, so if you know anyone who is really good and looking for something new, let me know. Here's the short blurb and some job descriptions; feel free to get in touch with me for more details. You can pass this along (minus headers, of course). - Win SiCortex is a new computer company developing a line of Linux cluster computers for demanding scientific and technical applications. The company is based in Maynard, Massachusetts. Senior Software Developers Software developers to work on designing, porting, and qualifying a Linux-based software stack for technical computing clusters. 
Responsibilities include: * Analyzing one or more sub-projects * Recommending overall approach (buy, port, build) for sub-project(s) * Preparing design and/or implementation plans and schedules for sub-project(s) * Executing implementation plan for sub-project(s) * Executing test and verification strategy for sub-project(s) Areas of expertise being sought include: * Linux porting, drivers, network stack * Parallel file systems * Cluster middleware (job scheduling, single system image) * Compilers and tools * Math and communications libraries * Firmware, diagnostics, and system bring-up * Technical application analysis and tuning Desired skills and experience: * 5+ years industry experience in software development * Deep knowledge of Linux (preferred) or general Unix * Exposure to technical computing * Expertise in multiple areas of software development * Track record of successful results in small teams Software Director Team leader for group of 8-10 software developers designing, porting, and qualifying a Linux-based software stack for technical computing clusters. 
Responsibilities include: * Analyzing required work and resources * Recruiting and managing team members * Qualifying, recommending, and managing potential third-party vendors (companies and/or consultants) * Preparing and monitoring team schedules * Reviewing and managing team results * Interfacing with potential and actual customers * Technical design and implementation in selected areas Software design task encompasses: * Linux operating system, including kernel, drivers, and networking * Parallel file systems * Cluster middleware * Compilers and tools * Libraries * Applications analysis * Firmware, diagnostics, and other hardware-related software Desired skills and experience: * 10+ years software development, with strong background in Linux (preferred) or Unix * Broad exposure rather than specialist experience in software development * Prior experience in software team leadership or management * Track record of successful results with constrained resources Ideal candidate will have prior exposure to startup environments, pragmatic approach to make vs buy decisions, good understanding of Open Source environments, and excellent people and leadership skills. ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net
From patrick at myri.com Mon Feb 14 13:57:03 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <420B264A.7050004@ccrl-nece.de> References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> <420580AD.5050003@myri.com> <420B264A.7050004@ccrl-nece.de> Message-ID: <42111EAF.5050709@myri.com> Hi Joachim, Joachim Worringen wrote: > Patrick Geoffray wrote: > >> Seriously, here are MPI latencies with MX on F cards on Opteron >> (PCI-X), that includes fibers and a switch in the middle: >> >> Length Latency(us) Bandwidth(MB/s) >> 0 2.684 0.000 > > [...] > > Nice work, Patrick - but such numbers are of little value if the > benchmark used to get them is not stated. I'd recommend mpptest (from > MPICH). Plus, the compiler etc. is also of interest when it comes to > latencies. Thanks. Such numbers always have little value coming from a vendor. My point was simply that >10us was not really today's ballpark. For your curiosity, it was using an in-house MPI ping-pong (one message at a time, not a bogus pipelined ping-pong used to confuse people and make big pipes look good). For very small messages, most ping-pong codes are similar, and the compiler has no impact (it was using the gcc that was installed on the machine at that time). For asymptotic bandwidth, the major difference is the way you compute 1 MB, either 1024*1024 Bytes, or 1000000 Bytes. In the networking world, it tends to be 1000000 Bytes. Patrick -- Patrick Geoffray Myricom, Inc.
http://www.myri.com From patrick at myri.com Mon Feb 14 15:09:27 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> Message-ID: <42112FA7.7010900@myri.com> Hi Greg, Greg Lindahl wrote: > On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote: > > >>Let me ask some stupid's question: which MPI implementations allow >>really >> >>a) to overlap MPI_Isend w/computations >>and/or >>b) to perform a set of subsequent MPI_Isend calls faster than "the >>same" set of MPI_Send calls ? >> >>I say only about sending of large messages. > > > For large messages, everyone does (b) at least partly right. (a) is > pretty rare. It's difficult to get (a) right without hurting short > message performance. One of the commercial MPIs, at first release, had Many believe you just need RDMA support to overlap comm and comp, but it's not enough. Zero-copy is needed because the copy is obviously a waste of host CPU (along with cache thrashing), but the real problem is matching. Ron did a lot of work in Portals to offload the matching, because it is a big synchronization point: if you send a message and you need the CPU on the receive side to find the appropriate receive buffer, you cannot tell the user that it can have the CPU between the time he posts the MPI_Irecv() and the time he checks on it with MPI_Wait(). What will happen is the matching occurs in the MPI_Wait() and overlap goes to the toilettes. There are several ways to work around it: 1) You can have a thread on the receive side and wake it up with an interrupt. If you do that for all receives, then you add ~10 us in the critical path and the small message latency goes to the same place the overlap went before. This was what I believe the commercial MPI was doing at first.
2) If you can take decisions at the NIC level, you can receive small messages eagerly (with a copy) and fire an interrupt only for large messages (you want to steal some CPU cycles for matching). This is not bad, you steal (~5 us + cost of matching) worth of CPU cycles for large messages, that's not much for most people. 3) You can have the NIC doing the matching. Obviously the NIC is not as fast as the host CPU, so it's more expensive: you don't want to do that for small messages, it will hurt your latency. But you still have to do it for all messages to keep the matching order. One solution is to still receive small messages eagerly but match them in the shadow of the NIC->host DMA just to keep the list of posted receives consistent. For large messages, you match in the NIC in the critical path and you don't need the host CPU (assuming that the matched receive is in the small number that is kept on the NIC). It's still not obvious if 3) is worth it; it's much more complex to implement and 5us per large receive is not that big. And you can reduce that overhead with MSIs (on PCIe, only the Alpha Marvel provided MSI on PCI-X, AFAIK). There are more exotic work-arounds, like using 1) and polling at the same time, and hiding the interrupt overhead with some black magic on another processor. The one with the best potential would be to use HyperThreading on Intel chips to have a polling thread burning cycles continuously; it will run in-cache, won't use the FP unit or waste memory cycles. A perfect use for the otherwise useless HT feature. I wonder why nobody went that way... > right was more important. They've improved their short message > performance since, but I still haven't seen any real application > benchmarks that show benefit from their approach.
I think it's the latter: too complicated for most. Do you know the story/joke about the Physicist and unexpected messages? Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From mprinkey at aeolusresearch.com Mon Feb 14 12:39:06 2005 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> Message-ID: Greg, based on your evaluation of the available MPI libraries, does this imply that overlapping communication and computation can really only be done by explicitly building two separate threads? Mike On Mon, 14 Feb 2005, Greg Lindahl wrote: > On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote: > > > Let me ask some stupid's question: which MPI implementations allow > > really > > > > a) to overlap MPI_Isend w/computations > > and/or > > b) to perform a set of subsequent MPI_Isend calls faster than "the > > same" set of MPI_Send calls ? > > > > I say only about sending of large messages. > > For large messages, everyone does (b) at least partly right. (a) is > pretty rare. It's difficult to get (a) right without hurting short > message performance. One of the commercial MPIs, at first release, had > very slow short message performance because they thought getting (a) > right was more important. They've improved their short message > performance since, but I still haven't seen any real application > benchmarks that show benefit from their approach. 
> > -- greg > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From mhyoung at valdosta.edu Mon Feb 14 12:50:18 2005 From: mhyoung at valdosta.edu (michael young) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Poor man's SANS Message-ID: <42110F0A.10408@valdosta.edu> Hi, Can I use beowulf or some other Linux cluster or HA Linux solution to pool harddrive space together from different computers to make a kinda "poor man's SANS"? thank you Michael From rossen at VerariSoft.Com Mon Feb 14 14:32:57 2005 From: rossen at VerariSoft.Com (Rossen Dimitrov) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> Message-ID: <42112719.4060500@verarisoft.com> Greg Lindahl wrote: > On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote: > > >>Let me ask some stupid's question: which MPI implementations allow >>really >> >>a) to overlap MPI_Isend w/computations >>and/or >>b) to perform a set of subsequent MPI_Isend calls faster than "the >>same" set of MPI_Send calls ? >> >>I say only about sending of large messages. > > > For large messages, everyone does (b) at least partly right. (a) is > pretty rare. It's difficult to get (a) right without hurting short > message performance. One of the commercial MPIs, at first release, had > very slow short message performance because they thought getting (a) > right was more important. They've improved their short message > performance since, but I still haven't seen any real application > benchmarks that show benefit from their approach. 
There is quite a bit of published data showing that, for a number of real application codes, a modest increase of MPI latency for very short messages has no impact on application performance. This can also be seen by doing traffic characterization, weighing the relative impact of the increased latency, and taking into account the computation/communication ratio. On the other hand, what you give the application developers with an interrupt-driven MPI library is a higher potential for effective overlapping, which they could choose to utilize or not, but unless they send only very short messages, they will not see a negative performance impact from using this library. There is evidence that re-coding the MPI part of an application to take advantage of overlapping and asynchrony when the MPI library (and network) supports these well actually leads to real performance benefit. There is evidence that even without changing anything in the code, just running the same code with an MPI library that plays nicer with the system leads to better application performance by improving the overall "application progress" - a loose term I use to describe all of the complex system activities that need to occur during the life-cycle of a parallel application, not only on a single node, but on all nodes collectively. The question of short message latency is connected to system scalability in at least one important scenario - running the same problem size as fast as possible by adding more processors. This will lead to smaller messages, much more sensitive to overhead, thus negatively impacting scalability. In other practical scenarios though, users increase the problem size as the cluster size grows, or they solve multiple instances of the same problem concurrently, thus keeping the message sizes away from the extremely small sizes resulting from maximum-scale runs, and limiting the impact of shortest-message latency. 
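The strong-scaling effect described above is easy to put in numbers with a crude cost model. This is an illustration only, not a benchmark: the 5 us latency and 250 MB/s bandwidth are made-up round numbers for a 2005-era interconnect, and the linear latency-plus-transfer model ignores matching, polling, and overlap entirely.

```python
# Toy strong-scaling model: t(msg) = latency + size/bandwidth.
# LATENCY and BANDWIDTH are illustrative guesses, not measurements.
LATENCY = 5e-6        # seconds of fixed per-message overhead
BANDWIDTH = 250e6     # bytes per second

def message_time(nbytes):
    """Time to move one message under the linear cost model."""
    return LATENCY + nbytes / BANDWIDTH

def latency_fraction(nbytes):
    """Share of the per-message time spent in fixed latency."""
    return LATENCY / message_time(nbytes)

# Fixed total volume of 64 MiB exchanged per step: as the process
# count P grows, each message shrinks and latency dominates.
TOTAL = 64 * 2**20
fractions = {P: latency_fraction(TOTAL // P) for P in (4, 64, 1024, 16384)}
```

At 4 processes the fixed latency is noise; at 16384 processes it is roughly a quarter of every message's cost, which is exactly the latency sensitivity of maximum-scale runs described above.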
I have seen many large clusters whose only job run across all nodes is HPL for the top500 number. After that, the system is either controlled by a job scheduler, which limits the size of jobs to about 30% of all processors (an empirically derived number that supposedly improves the overall job throughput), or it is physically or logically divided into smaller sub-clusters. All this being said, there is obviously a large group of codes that use small messages no matter what size problem they solve or what the cluster size is. For these, the lowest latency will be the most important (if not the only) optimization parameter. For these cases, users can just run the MPI library in polling mode. With regard to the assessment that every MPI library does (a) partly right I'd like to mention that I have seen behavior where attempting to overlap computation and communication can lead to no performance improvement at all, or even worse, to performance degradation. This is one example of how a particular implementation of a standard API can affect the way users code against it. I use a metric called "degree of overlapping" which for "good" systems approaches 1, for "bad" systems approaches 0, and for terrible systems becomes negative... Here goodness is measured as how well the system facilitates overlapping. Rossen > > -- greg > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rross at mcs.anl.gov Mon Feb 14 20:52:51 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: <42110F0A.10408@valdosta.edu> References: <42110F0A.10408@valdosta.edu> Message-ID: Yes! PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :). My group at ANL along with Clemson University and Ohio Supercomputer Center and others are developing this. 
It's entirely open source and open development, and is in production use at ANL, OSC, and the University of Utah CHPC, among other places. GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that RPMs are available for it now through one source or another. This used to be Sistina's product, which was subsequently bought by RedHat. I'm sure this is used in production in many business environments, and we use it at ANL also. Can someone provide a URL for this one? Lustre (www.lustre.org) is another option. This one is heavily funded by the DOE ASC laboratories and is in use on some very large parallel machines. But unless you have a relationship with CFS you can only get a crippled version of the source, so it's probably not a good option for the average joe. If they change their policy on releasing source code, this would be worth reconsidering. Regards, Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab On Mon, 14 Feb 2005, michael young wrote: > Hi, > Can I use beowulf or some other Linux cluster or HA Linux solution > to pool harddrive space together from differrent computers to make a > kinda "poor man's SANS"? > > thank you > Michael From rross at mcs.anl.gov Mon Feb 14 21:12:36 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <42112719.4060500@verarisoft.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> Message-ID: Rossen, It would be good to mention that you work for a company that sells an implementation specifically designed for facilitating overlapping, in case people don't know that. Clearly you guys have thought a lot about this. The last two Scalable OS workshops (the only two I've had a chance to attend), there was a contingent of people that are certain that MPI isn't going to last too much longer as a programming model for very large systems. 
The issue, as they see it, is that MPI simply imposes too much latency on communication, and because we (as MPI implementors) cannot decrease that latency fast enough to keep up with processor improvements, MPI will soon become too expensive to be of use on these systems. Now, I don't personally think that this is going to happen as quickly as some predict, but it is certainly an argument that we should be paying very careful attention to the latency issue, because as MPI implementors this is an argument that never seems to end. Also, there is additional overhead in the Isend()/Wait() pair over the simple Send() (two function calls rather than one, allocation of a Request structure at the least) that means that a naive attempt at overlapping communication and computation will result in a slower application. So that doesn't surprise me at all. I think that the theme from this thread should be that "it's a good thing that we have more than one MPI implementation, because they all do different things best." Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab On Mon, 14 Feb 2005, Rossen Dimitrov wrote: > There is quite a bit of published data that for a number of real > application codes modest increase of MPI latency for very short messages > has no impact on the application performance. This can also be seen by > doing traffic characterization, weighing the relative impact of the > increased latency, and taking into account the computation/communication > ratio. On the other hand, what you give the application developers with > an interrupt-driven MPI library is a higher potential for effective > overlapping, which they could chose to utilize or not, but unless they > send only very short messages, they will not see a negative performance > impact from using this library. 
> > There is evidence that re-coding the MPI part of an application to take > advantage of overlapping and asynchrony when the MPI library (and > network) supports these well actually leads to real performance benefit. > > There is evidence that even without changing anything in the code, but > by just running the same code with an MPI library that plays nicer to > the system leads to better application performance by improving the > overall "application progress" - a loose term I used to describe all of > the complex system activities that need to occur during the life-cycle > of a parallel application not only on a single node, but on all nodes > collectively. > > The question of short message latency is connected to system scalability > in at least one important scenario - running the same problem size as > fast as possible by adding more processors. This will lead to smaller > messages, much more sensitive to overhead, thus negatively impacting > scalability. > > In other practical scenarios though, users increase the problem size as > the cluster size grows, or they solve multiple instances of the same > problem concurrently, thus keeping the message sizes away from the > extremely small sizes resulting from maximum scale runs, thus limiting > the impact of shortest message latency. I have seen many large clusters > whose only job run across all nodes is HPL for the top500 number. After > that, the system is either controlled by a job scheduler, which limits > the size of jobs to about 30% of all processors (an empirically derived > number that supposedly improves the overall job throughput), or it is > physically or logically divided into smaller sub-clusters. > > All this being said, there is obviously a large group of codes that use > small messages no matter what size problem they solve or what the > cluster size is. For these, the lowest latency will be the most > important (if not the only) optimization parameter. 
For these cases, > users can just run the MPI library in polling mode. > > With regard to the assessment that every MPI library does (a) partly > right I'd like to mention that I have seen behavior where attempting to > overlap computation and communication can lead to no performance > improvement at all, or even worse, to performance degradation. This is > one example of how a particular implementation of a standard API can > affect the way users code against it. I use a metric called "degree of > overlapping" which for "good" systems approaches 1, for "bad" systems > approaches 0, and for terrible systems becomes negative... Here goodness > is measured as how well the system facilitates overlapping. > > Rossen From patrick at myri.com Mon Feb 14 22:20:52 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420DA793.4000909@verarisoft.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com> Message-ID: <421194C4.5050808@myri.com> Hi Rossen, Rossen Dimitrov wrote: > Of course, there is always the case of running the actual application > code and then evaluating the MPI performance by seeing which MPI library > (or library mode) makes the application run faster. Unfortunately, this > method for evaluating MPI often suffers from various inefficiencies some > of which originate from the parallel algorithm developers, who throughout > the years have sometimes adopted the most trivial ways of using MPI. So if you run an MPI application and it sucks, this is because the application is poorly written? You don't want to benchmark an application to evaluate MPI, you want to benchmark an application to find the best set of resources to get the job done. If the code stinks, it's not an excuse. 
Good MPI implementations are good with poorly written applications, but still let smart people do smart things if they want. > these in one way or another depend on CPU processing. Also, today's > processor architectures have many independent processing units and > complex memory hierarchies. When the MPI library polls for completion of > a communication request, most of this specialized hardware is virtually > unused (wasted). The processor architecture trends indicate that this > kind of internal CPU concurrency will continue to increase, thus making > the cost of MPI polling even higher. When you poll, you have nothing else to do: you are stuck in a Wait or in a blocking call (collectives for example). Why do you care about the lost cycles? The only way to rescue them would be to oversubscribe your processor, and hope that the cycles you recycle (no pun intended) are worth the context switches and the associated cache thrashing. I would argue that polling should be the cheapest MPI operation ever (if nothing is found). This is the case for most half-decent MPI implementations. > In this regard, a parallel application developer might actually very > much care what is actually happening in the MPI library even when he > makes a call to MPI_Send. If he doesn't, he probably should. He absolutely should not. It's one thing to work around clueless developers, but it's way more difficult to work around someone who assumes wrong things about the MPI implementation. > - What application algorithm developers experience when they attempt to > use the ever so nebulous "overlapping" with a polling MPI library and Overlapping is completely orthogonal to polling. Overlapping means that you split the communication initiation from the communication completion. Polling means that you test for completion instead of waiting for completion. You can perfectly well overlap and check for completion of the asynchronous requests by polling; nothing wrong with that. 
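That initiation/completion split can be demonstrated outside MPI. In the sketch below a plain thread stands in for asynchronous network progress; toy_isend, toy_test, and toy_wait are invented names that only mimic the shape of MPI_Isend/MPI_Test/MPI_Wait, so this is an illustration of the control flow of overlapping with a polling completion check, not real MPI.

```python
import threading
import time

def toy_isend(duration):
    """Start a fake transfer; the returned Event plays the request.
    The thread stands in for NIC/DMA progress; `duration` is an
    arbitrary stand-in for the wire time of a large message."""
    done = threading.Event()
    threading.Thread(target=lambda: (time.sleep(duration), done.set())).start()
    return done

def toy_test(request):
    """Polling completion check, like MPI_Test: never blocks."""
    return request.is_set()

def toy_wait(request):
    """Blocking completion, like MPI_Wait."""
    request.wait()

# Overlap: initiate, compute while the transfer progresses, complete.
start = time.monotonic()
req = toy_isend(0.2)         # initiation ("MPI_Isend" of a large message)
work_units = 0
while not toy_test(req):     # poll for completion while doing useful work
    work_units += 1          # a unit of computation
toy_wait(req)                # completion
elapsed = time.monotonic() - start
```

The wall time stays close to the 0.2 s "transfer" alone while computation proceeds in the gap; replacing the polling loop with an immediate toy_wait() would give the same wall time but zero recovered work, which is the orthogonality point: polling versus waiting is a completion choice, overlap comes from the split itself.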
> how this experience has contributed to the overwhelming use of > MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or > (even better) persistent MPI calls, thus killing any hope that these > codes can run faster on systems that actually facilitate overlapping. There are two reasons why developers use blocking operations rather than non-blocking ones: 1) they don't know about non-blocking operations. 2) MPI_Send is shorter than MPI_Isend(). Looking for overlapping is actually not that hard: a) look for medium/large messages, don't waste time on small ones. b) replace all MPI_Send() by a pair MPI_Isend() + MPI_Wait() c) move the MPI_Isend() as early as possible (as soon as data is ready). d) move the MPI_Wait() as late as possible (just before the buffer is needed). e) do the same for receives. Most of the time, that would speed things up quite a bit, or not change anything. I am still looking for some tuning tool to do that automatically though. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From john.hearns at streamline-computing.com Mon Feb 14 23:20:21 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: References: <42110F0A.10408@valdosta.edu> Message-ID: <33007.212.159.87.168.1108452021.squirrel@webmail.streamline-computing.com> > Yes! > > PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :). My > group at ANL along with Clemson University and Ohio Supercomputer Center > and others are developing this. It's entirely open source and open > development, and is in production use at ANL, OSC, and the University of > Utah CHPC, among other places. > > GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that > RPMs are available for it now through one source or another. This used to > be Sistina's product, who was subsequently bought by RedHat. 
I'm sure > this is used in production in many business environments, and we use it at > ANL also. Can someone provide a URL for this one? Source RPMs are of course available from RedHat, and you can get support for their version. The Scientific Linux distribution has prebuilt RPMs ftp://ftp.scientificlinux.org/linux/scientific/304/i386/SL/RPMS/ From patrick at myri.com Mon Feb 14 23:48:47 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> Message-ID: <4211A95F.2010709@myri.com> Hi Rob, Rob Ross wrote: > The last two Scalable OS workshops (the only two I've had a chance to > attend), there was a contingent of people that are certain that MPI isn't > going to last too much longer as a programming model for very large Were they advocating shared memory paradigms, one sided operations, something more "natural" to program with? I heard that before :-) > systems. The issue, as they see it, is that MPI simply imposes too much > latency on communication, and because we (as MPI implementors) cannot > decrease that latency fast enough to keep up with processor improvements, > MPI will soon become too expensive to be of use on these systems. This is just wrong. How much of the latency in high speed interconnects is due to MPI? Very, very little. The core of it is in the hardware (IO bus, NICs, crossbars and wires). Doing pure RDMA in hardware is easy for the chip designers, but it's hell for irregular applications when you actually don't know where to remotely read or write. > Also, there is additional overhead in the Isend()/Wait() pair over the > simple Send() (two function calls rather than one, allocation of a Request > structure at the least) that means that a naive attempt at overlapping > communication and computation will result in a slower application. 
So > that doesn't surprise me at all. What is the cost of one function call and an allocation in a slab? At several GHz, 50 ns? And most of the time, blocking calls are implemented on top of non-blocking routines, so the CPU overhead is the same. > I think that the theme from this thread should be that "it's a good thing > that we have more than one MPI implementation, because they all do > different things best." I would say having more than one MPI implementation is a bad thing as long as you cannot easily replace one with another. Let's define a standard MPI header and a standard API for spawning and such, and then having more than one implementation will actually be manageable. That would also remove the need for swiss-army-knife MPI implementations that want to support all interconnects with the same binary. These implementations are, IMHO, a bad thing as they work at the lowest common denominator and are in essence inefficient for all devices. While we are at it, here is my wish list for the next MPI specs: a) only non-blocking calls. If there are no blocking calls, nobody will use them. b) non-blocking calls for collectives too, there is no excuse. Yes, even an asynchronous barrier. c) ban of the ANY_SENDER wildcard: a world of optimization goes away with this convenience. d) throw away the user defined datatypes, or at least restrict them to regular strides. e) get rid of one-sided communications: if someone is serious about it, they use something like ARMCI or UPC or even low-level vendor interfaces. Rob, you are politically connected, could you make it happen, please? :-) Patrick -- Patrick Geoffray Myricom, Inc. 
http://www.myri.com From joachim at ccrl-nece.de Tue Feb 15 00:20:48 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211A95F.2010709@myri.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> Message-ID: <4211B0E0.6030007@ccrl-nece.de> Patrick Geoffray wrote: > While we are at it, here is my wish list for the next MPI specs: > > a) only non-blocking calls. If there are no blocking calls, nobody will > use them. While this makes sense technically, nobody will probably offer an MPI implementation without MPI_Send for the next 20 years for compatibility reasons, so we can just forget about it. > b) non-blocking calls for collectives too, there is no excuse. Yes, even > an asynchronous barrier. No problem here - barrier_enter() and barrier_leave() are not new. > c) ban of the ANY_SENDER wildcard: a world of optimization goes away > with this convenience. I think this could best be achieved with an assertion like those for one-sided and I/O. There are situations where ANY_SENDER is needed, or at least avoids large programming overheads. > d) throw away the user defined datatypes, or at least restrict it to > regular strides. This is nonsense: user-defined datatypes do not cause any overhead if you don't use them, there are ways to implement them very efficiently, and you can't do without them in many situations (like MPI-IO). > e) get rid of one-sided communications: if someone is serious about it, > it uses something like ARMCI or UPC or even low level vendor interfaces. Instead, I propose to rework the MPI one-sided communications for a simpler and more flexible semantic. The current definition does not match today's network capabilities, but was designed to allow a simple implementation for slow/non-RDMA networks. 
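The "regular strides" that both sides of the datatype argument treat as the tractable case can be pictured with a toy pack routine: gathering one column of a row-major matrix is a fixed-stride access, the kind of pattern a datatype engine can turn into a single strided copy with no per-element decisions. A minimal sketch with made-up dimensions:

```python
def pack_column(flat, nrows, ncols, col):
    """Gather one column of a row-major nrows x ncols matrix stored in
    the flat sequence `flat`: a fixed-stride pattern (stride = ncols),
    the 'regular' datatype case discussed in this thread."""
    return flat[col:nrows * ncols:ncols]

# 3 x 4 matrix stored row-major as [0, 1, ..., 11]; column 1 is [1, 5, 9].
matrix = list(range(12))
packed = pack_column(matrix, 3, 4, 1)
```

An irregular gather (an arbitrary index list), by contrast, forces element-by-element copies, which is the objection to fully general user-defined datatypes raised earlier in the thread.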
> Rob, you are politically connected, could you make it happen, please ? > :-) One person alone can't do this. The best place to discuss such things is the MPI users group meeting (EuroPVM/MPI, this year in Capri/Italy). Also, adding mpi.h to the standard to define an ABI is a good thing. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From joachim at ccrl-nece.de Tue Feb 15 00:53:37 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108402962.8265.25.camel@localhost.localdomain> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> Message-ID: <4211B891.6020406@ccrl-nece.de> Ashley Pittman wrote: > If you had a bunch of sends to do to N remote processes then I'd expect > you to post them in order (non-blocking) and wait for them all at the > end, the time taken to do this should be (base_latency + ( (N-1) * M )) > where M is the reciprocal of the "issue rate". You can clearly see > here that even for small numbers of batched sends (even a 2d/3d nearest > neighbour matrix) the issue rate (that is, how little CPU the send call > consumes) is at least as important as the raw latency. This is an interesting issue. If you look at what Greg mentioned about dumb NICs (like InfiniPath, or SCI) and the latency numbers Ole posted for ScaMPI on different interconnects (all(?) accessed through uDAPL), you see that the dumb interface SCI has the lowest latency for both pingpong and random, with random being about twice pingpong. In contrast, the "smart" NIC Myrinet, which has much less CPU utilization, has twice the pingpong latency, and a slightly worse random-to-pingpong ratio. Why this? 
Maybe better pipelining in SCI, because it's write-and-forget for the CPU, with 16 outstanding transactions on the network level, while Myrinet obviously behaves differently here (although GM should also be PIO-write to the NIC memory for small messages). Then there is Infiniband, which has a much better random-to-pingpong ratio, which is striking. Would be nice to see Quadrics or InfiniPath in this context. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From gmpc at sanger.ac.uk Tue Feb 15 01:22:21 2005 From: gmpc at sanger.ac.uk (Guy Coates) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: References: <42110F0A.10408@valdosta.edu> Message-ID: > PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :). My > GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that > Lustre (www.lustre.org) is another option. This one is heavily funded by You missed out GPFS from IBM. It is available at no cost to academic institutions. You can use it with or without SAN hardware. http://publib.boulder.ibm.com/clresctr/windows/public/gpfsbooks.html Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK Tel: +44 (0)1223 834244 ex 7199 From joachim at ccrl-nece.de Tue Feb 15 01:47:32 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <42111EAF.5050709@myri.com> References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> <420580AD.5050003@myri.com> <420B264A.7050004@ccrl-nece.de> <42111EAF.5050709@myri.com> Message-ID: <4211C534.7070608@ccrl-nece.de> Patrick Geoffray wrote: > For your curiosity, it was using an in-house MPI Pingpong (one message > at a time, not a bogus pipelined pingpong used to confuse people and > make big pipes look good). For very small messages, most of Pingpong 
For very small messages, most of Pingpong > codes are similar, ...but not equal and give different results. Just compare PMB and mpptest. > compiler has no impact (it was using the gcc that was > installed on the machine at that time). I experienced differences of more than 2 us depending on whether using shared or static libraries, compiler version/options etc. on both scalar and vector machines. > For asymptotic bandwidth, the > major difference is the way you compute 1 MB, either 1024*1024 Bytes, or > 1000000 Bytes. In the networking world, it tends to be 1000000 Bytes. I tend to use MB for 10^6 and MiB for 2^10. This is a somewhat official no Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From patrick at myri.com Tue Feb 15 01:48:09 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211B0E0.6030007@ccrl-nece.de> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de> Message-ID: <4211C559.8070100@myri.com> Joachim, Joachim Worringen wrote: > Patrick Geoffray wrote: > >> While we are at it, here is my wish list for the next MPI specs: >> >> a) only non-blocking calls. If there are no blocking calls, nobody >> will use them. > > > While this makes sense technically, nobody will probably offer an MPI > implementation without MPI_Send for the next 20 years for compatibility > reasons, so we can just forget about it. Throw away compatibility. If you keep the legacy API, you have no incentive for change. I don't want MPI-3, I want MPI-light. We are against a wall because the MPI spec was too rich and developers took the lazy path. The weight of legacy will make shared memory paradigms the only proposal for the next step. 
If you believe we have to support the whole MPI semantics in the next message passing standards, then we are doomed. >> c) ban of the ANY_SENDER wildcard: a world of optimization goes away >> with this convenience. > > > I think this could best be achieved with an assertion like those for > one-sided and I/O. There are situations where ANY_SENDER is needed, or > at least avoids large programming overheads. It's used because it's there; there is no other reason. If you don't know who sends you what in a message passing application, then you cannot get either performance or robustness. If you really cannot do otherwise (and I don't believe that), you can always use unexpected messages (post the receive after Probe()ing). That's ugly, but you get what you deserved :-) >> d) throw away the user defined datatypes, or at least restrict it to >> regular strides. > > > This is nonsense: user-defined datatypes do not cause any overhead if > you don't use them, there are ways to implemenent them very efficiently, > and you can't do without in many situations (like MPI-IO). I know this item would itch, you spent a lot of time working on that. If you don't use user-defined datatypes, then you don't need them and they should not be there in the first place. It's a temptation, it's too easy. No, there is no way to implement them efficiently unless they are regular, and this is what I am willing to keep: strided types with long segments. Everything else leads to memory copies. The developer should wipe his own bottom instead of asking the message passing interface to work around bad data layout. Sending a column of blocks, yes, that's a regular stride and it makes a lot of sense. Sending a non-contiguous irregular structure? As we used to say in France, $100 and a chocolate bar with that? Oh, BTW, I would gut MPI-IO and make it a separate interface. Only a small subset of applications use it and the core semantics are quite different from pure message passing. 
Man, it's not MPI, it's emacs... >> e) get rid of one-sided communications: if someone is serious about >> it, it uses something like ARMCI or UPC or even low level vendor >> interfaces. > > > Instead, I propose to rework the MPI one-sided communications for a more > simple and flexible semantic. The current definition does not match > today's network capabilities, but was designed to allow a simple > implementation for slow/non-RDMA networks. I don't know about that. I would just take it out of the Message Passing Interface because it's not message passing. There would certainly be a need for a pure RMA interface, and there is already a lot of existing work and experience to build upon. >> Rob, you are politically connected, could you make it happen, please ? >> :-) > > > One person alone can't do this. The best place to discuss such things is > the MPI users group meeting (EuroPVM/MPI, this year in Capri/Italy). Nothing that radical would ever come out of EuroPVM/MPI (I heard that Capri is a really nice place, I will definitely beg my boss) or any other users group. > Also, adding mpi.h to the standard to define an ABI is a good thing. Just achieving that would be beyond my greatest expectations. It would certainly be fun to watch. We could organize fist fights on the beach in Capri... Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Tue Feb 15 02:12:35 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211B891.6020406@ccrl-nece.de> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> Message-ID: <4211CB13.3050902@myri.com> Joachim Worringen wrote: > This is an interesting issue.
If you look at what Greg mentioned about > dumb NICs (like InfiniPath, or SCI) and the latency numbers Ole posted > for ScaMPI on different interconnects (all(?) accessed through uDAPL), > you see that the dumb interface SCI has the lowest latency for both, Which is the original hardware Scali built its MPI upon, btw. > pingpong and random, with random being about twice that of pingpong. In > contrast, the "smart" NIC Myrinet, which has much less CPU utilization, > has twice the pingpong latency, and a slightly worse random-to-pingpong > ratio. No, it's not Myrinet, it's GM/Myrinet. There are many things that come from the GM side of the equation, believe me. > Why this? Maybe better pipelining in SCI, because it's write-and-forget > for the CPU, with 16 outstanding transactions on the network level, > while Myrinet obviously behaves differently here (although GM should > also be PIO-write to the NIC memory for small messages). Nope, no PIO for small messages with GM, DMA for everything. A last remark. I really think that the argument of using the same swiss-army-knife MPI implementation such as ScaMPI or Intel MPI or even MPI/Pro to infer interconnect characteristics is even worse than looking at latency and bandwidth alone. These implementations are never going to be designed to use all hardware efficiently; their design is either historic (Scali used to provide software for SCI alone) or politically motivated (Intel is using uDapl, hummm, wonder why), or both. They are by-products of the MPI Forum's failure to make the Standard practical (compatible ABI). Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From jcownie at etnus.com Tue Feb 15 05:06:56 2005 From: jcownie at etnus.com (James Cownie) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: Message from Patrick Geoffray of "Tue, 15 Feb 2005 05:12:35 EST."
<4211CB13.3050902@myri.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> Message-ID: <20050215130656.8572F1C818@amd64.cownie.net> > A last remark. I really think that the argument of using the same > swiss-army-knife MPI implementation such as ScaMPI or Intel MPI or > even MPI/Pro to infer interconnect characteristics is even worse than > looking at latency and bandwidth alone. These implementations are > never going to be designed to use all hardware efficiently, their > design is either historic (Scali used to provide software for SCI > alone) or politically motivated (Intel is using uDapl, hummm, wonder > why), or both. They are by-products of the MPI forum's failure to make > the Standard practical (compatible ABI). As someone who was on the MPI Forum, and sat through an awful lot of meetings, I'd like to provide some justification for _why_ we didn't try to make a binary standard. 1) At the time (over ten years ago), we would have been happy to have _one_ MPI implementation on a given machine, and we weren't expecting to have multiple MPIs on the same hardware. (It was by no means a foregone conclusion that MPI would succeed). 2) We didn't expect MPI to move into a commercial environment in which the people running the code wouldn't have the sources, and wouldn't be optimising for _their_ machine, which obviously requires recompilation, making an ABI irrelevant. 3) Not having a binary interface allows optimisations in the C MPI interface (such as using macros rather than functions in some places). 4) A binary interface based on no MPI implementation experience would likely be worse than no binary interface.
5) MPI is supposed to be machine and architecture independent; specifying a binary interface under those circumstances is hard. Maybe you can do it if you leverage the C ABI; however, it's not clear that that is ideal, since that either changes with time, or suffers from poor vision of the future too (e.g. look at the required alignment of double in the x86 ABI). 6) It was a hard enough job to agree on the source level specification. If we'd tried to add an ABI we'd probably still be stuck in the Bristol Suites :-) You seem to think (maybe subconsciously) that the MPI Forum added features to the standard just to make life hard for implementors and to kill performance ;-) I can assure you that that was not the case, and that the standard was a compromise between features which users really wanted and what the implementors felt they could reasonably provide. If the standard had not provided things the users wanted (like wildcard receive), then it's quite possible that this whole discussion would be moot because MPI would by now be of only historical interest, since the user community would have ignored it. If you _really_ believe that there is so much performance benefit for your customers in having an MPI-light with the restrictions you outlined which only runs on your hardware, then no-one's stopping you from providing it. The market will decide... -- -- Jim -- James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com From rross at mcs.anl.gov Tue Feb 15 07:47:11 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: <33007.212.159.87.168.1108452021.squirrel@webmail.streamline-computing.com> References: <42110F0A.10408@valdosta.edu> <33007.212.159.87.168.1108452021.squirrel@webmail.streamline-computing.com> Message-ID: Thanks John! Rob On Tue, 15 Feb 2005, John Hearns wrote: > > Yes! > > > > PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :).
My > > group at ANL along with Clemson University and Ohio Supercomputer Center > > and others are developing this. It's entirely open source and open > > development, and is in production use at ANL, OSC, and the University of > > Utah CHPC, among other places. > > > > GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that > > RPMs are available for it now through one source or another. This used to > > be Sistina's product, who was subsequently bought by RedHat. I'm sure > > this is used in production in many business environments, and we use it at > > ANL also. Can someone provide a URL for this one? > Source RPMs are of course available from RedHat, > and ou can get support for their version. > > The Scientific Linux distribution has prebuilt RPMs > ftp://ftp.scientificlinux.org/linux/scientific/304/i386/SL/RPMS/ > > From rross at mcs.anl.gov Tue Feb 15 07:48:17 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:47 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: References: <42110F0A.10408@valdosta.edu> Message-ID: Hello Guy, I wasn't aware that IBM would give that out for use on existing systems. Does anyone know the constraints under which they will provide such a copy? Thanks, Rob On Tue, 15 Feb 2005, Guy Coates wrote: > You missed out GPFS from IBM. It is no-cost free for academic > institutions. You can use it with or without SAN hardware. > > http://publib.boulder.ibm.com/clresctr/windows/public/gpfsbooks.html > > Guy From rross at mcs.anl.gov Tue Feb 15 08:42:56 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211A95F.2010709@myri.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> Message-ID: On Tue, 15 Feb 2005, Patrick Geoffray wrote: > Rob, you are politically connected, could you make it happen, please ? 
> :-) If I had that level of connections, I'd be a DC lobbyist :). Maybe sell off some national parks to the oil industry or something. Rob From gmpc at sanger.ac.uk Tue Feb 15 08:56:10 2005 From: gmpc at sanger.ac.uk (Guy Coates) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: References: <42110F0A.10408@valdosta.edu> Message-ID: On Tue, 15 Feb 2005, Rob Ross wrote: > Hello Guy, > > I wasn't aware that IBM would give that out for use on existing systems. > Does anyone know the constraints under which they will provide such a > copy? As an academic, you sign up for it under the IBM "scholars program". It comes at no cost but unsupported (well, best-efforts support via the GPFS mailing list). http://www-306.ibm.com/software/info/university/members/faq.html If you want support or want a commercial license, then you have to pay money. The "official" GPFS hardware support matrix is pretty tight, but if you don't care about support, you should find that it will run on pretty much any sort of disk hardware. Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK Tel: +44 (0)1223 834244 ex 7199 From joachim at ccrl-nece.de Tue Feb 15 10:43:18 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211CB13.3050902@myri.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> Message-ID: <421242C6.2050800@ccrl-nece.de> Patrick Geoffray wrote: > A last remark. 
I really think that the argument of using the same > swiss-army-knife MPI implementation such as ScaMPI or Intel MPI or even > MPI/Pro to infer interconnect characteristics is even worse than > looking at latency and bandwidth alone. These implementations are never > going to be designed to use all hardware efficiently; their design is > either historic (Scali used to provide software for SCI alone) or > politically motivated (Intel is using uDapl, hummm, wonder why), or both. The two most important things done to optimise performance of an MPI implementation for a hardware platform are: - low-level pt-2-pt communication - collective operations AFAIK, Myrinet's MPI (MPICH-GM), for example, does use the standard (partly naive) collective operations of MPICH. Considering this, plus the facts - that it's not all that hard to use GM for pt-2-pt efficiently. We have done this in our MPI, too, with the same level of performance. - that you probably do not know anything about ScaMPI's current internal design (Intel is MPICH2 plus some Intel-proprietary device hacking) and little about its performance (if this is wrong, let us know) - that all code apart from the device, and also the device architecture of MPICH-GM, are more or less 10-year-old swiss-army-knife MPICH code (which is not a bad thing per se) you should maybe think again before judging the efficiency of other MPI implementations.
Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From lindahl at pathscale.com Wed Feb 16 00:05:25 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211B891.6020406@ccrl-nece.de> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> Message-ID: <20050216080525.GA3122@greglaptop.attbi.com> On Tue, Feb 15, 2005 at 09:53:37AM +0100, Joachim Worringen wrote: > This is an interesting issue. If you look at what Greg mentioned about > dumb NICs (like InfiniPath, or SCI) and the latency numbers Ole posted > for ScaMPI on different interconnects (all(?) accessed through uDAPL), > you see that the dumb interface SCI has the lowest latency for both, > pingpong and random, with random being about twice of pingpong. In > contrast, the "smart" NIC Myrinet, which has much less CPU utilization, > has twice the pingpong latency, and a slightly worse random-to-pingpong > ratio. I would make 2 comments about this: First, you should be using the best MPI for each piece of hardware. Hardware architects pick their interface with a software implementation in mind. I don't expect any 3rd party MPI to get close to PathScale's MPI latency on PathScale's hardware, unless the 3rd party is flexible enough to change a lot of code. Second, you really can't generalize about dumb NICs by looking at SCI. SCI has a unique situation: its raw latency is much lower than the MPI latency of all MPI implementations for it. I suspect no hardware designer would be out to imitate that property! Both InfiniPath and the Quadrics STEN (forgive me for classing this as dumb, I happen to think dumb is a compliment...) get this right.
Third (you knew I couldn't keep to my promise of 2), I wouldn't make any scaling generalizations based on a test with 16 nodes. Even at 128-256 nodes the picture is quite different, and that's the sweet spot that lots of today's clusters are at. So, if you want to make a scaling generalization, you should be quoting 256-512 node results. -- greg From patrick at myri.com Wed Feb 16 00:17:02 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108478089.4587.118.camel@s861954.sandia.gov> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <1108478089.4587.118.camel@s861954.sandia.gov> Message-ID: <4213017E.7060302@myri.com> Keith D. Underwood wrote: >>c) ban of the ANY_SENDER wildcard: a world of optimization goes away >>with this convenience. > > > Um, our apps guys say this is more than a convenience. Apparently, > sometimes you don't exactly know who you are going to receive from. > Would you rather them post receives from 4000 nodes and cancel the ones > that don't send to that node after a while? No, I would not post any receives and let them come in unexpected, using MPI_Probe() to post a matching receive when something shows up. It leaves the MPI implementation a way to move most of the matching to the send side for most of the messages and, if the receive is posted early enough, removes the need for the host CPU on the receive side when the application is potentially computing. And you remind me, I would ban MPI_Cancel also. It should have been item #1 :-) Patrick -- Patrick Geoffray Myricom, Inc.
http://www.myri.com From patrick at myri.com Wed Feb 16 00:39:17 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108479093.4587.132.camel@s861954.sandia.gov> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de> <4211C559.8070100@myri.com> <1108479093.4587.132.camel@s861954.sandia.gov> Message-ID: <421306B5.3080200@myri.com> Hi Keith, Keith D. Underwood wrote: > Inertia is a powerful thing. Billions of dollars have been invested in > MPI codes. Changing that will not be easy (or cheap). This is not as > simple as moving from vectors to distributed memory - there wasn't > nearly as much accumulated code then (and, it hurt back then). I would not drop the whole MPI standard, I would define a subset that is the recommended API for performance. If your code is too old, link with a legacy MPI lib. If it's coded with the subset, either link with a legacy MPI lib and it works, or link with the optimized MPI lib and see what the MPI implementation can deliver. >>It's used because it's there, there is no other reason. If you don't >>know who sends you what in a message passing application, then you >>cannot get either performance or robustness. If you really cannot do >>otherwise (and I don't believe that), you can always use unexpected >>messages (post the receive after Probe()ing). That's ugly, but you get >>what you deserve :-) > > > That just isn't true. If I don't know how many messages I will get, or > from whom, but I can bound it, then I should prepost those receives. > This is particularly true in your standard physics code that runs for > days and does thousands of time steps. (i.e. you can maintain a circular > queue of these things).
A few years back, I looked at a lot of real-world code to see if triggering the communication from the receive side could be worth it, i.e. if most of the messages did not use ANY_SENDER. I was amazed that the vast majority of the messages sent across many applications used the tag to discriminate on the sender among other things, not the source. For the couple of large codes I dissected (sorry, don't remember the names right now), there was no rationale. I guess doing bookkeeping on the source and the tag was too much for the developer(s). You can still do the receive-pull optimization and fall back on sender-push when you see a receive with ANY_SENDER, but if ANY_SENDER is the common case, that's useless. The best way to force developers to write code that can leverage optimization in the MPI lib is to remove the source of the ambiguity. So ANY_SENDER in the legacy API, not in the subset. > The user should always expose as much opportunity for optimization as > possible to the MPI layer. e.g. a load-store architecture like the X1 > (not what I am advocating for MPI performance, mind you) could do > excellent datatype processing. You would rather the user do the > gather/scatter themselves to prohibit the MPI from being able to do it? In general yes, more opportunities for optimization is better. Now, assuming that irregular datatypes can be optimized as much as regular ones is wrong. The hardware can gather/scatter better than the application for nice long strides. However, MPI libs should print insults when tiny segments are used (when the scatter/gather efficiency collapses). The developer assumes that it's fine because he does not know or he does not care. I advocate hiding the guns instead of letting the developer shoot himself in the foot. Patrick -- Patrick Geoffray Myricom, Inc.
http://www.myri.com From patrick at myri.com Wed Feb 16 02:07:27 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <421242C6.2050800@ccrl-nece.de> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> <421242C6.2050800@ccrl-nece.de> Message-ID: <42131B5F.8040100@myri.com> Joachim Worringen wrote: > AFAIK, Myrinet's MPI (MPICH-GM), for example, does use the standard > (partly naive) collective operations of MPICH. Considering this, plus > the fact Replacing the collectives from MPICH-1 was not high on the todo list because there were more important things to optimize, with more effect on applications than the scheduling of some collectives. For scaling real codes on large machines, your priority is not there, not enough bang for your time. > - that it's not all that hard to use GM for pt-2-pt efficiently. We have > done this in our MPI, too, with the same level of performance. You then have no idea how hard it is to use GM efficiently and *correctly*. Enough to run pingpong? Sure, that's a piece of cake. But how to recover from fatal errors on the wire or from resource exhaustion, how to avoid spending most of your time pinning/unpinning pages, how to not trash the translation cache on the NIC, etc.? Did you address all of these issues in your MPI? Maybe, but it requires some design characteristics that would be higher than the device layer. At one time you have to make choices, and in a Swiss-Army-Knife (SAK) implementation, you choose the common ground, or the existing ground. > - that you probably do not know anything on ScaMPI's current internal True, I know zip about ScaMPI design. This is exactly why I don't know how they use GM.
Without knowing that, how can you infer hardware characteristics from benchmark results?!? > design (Intel is MPICH2 plus some Intel-proprietary device hacking) and > little about its performance (if this is wrong, let us know) Intel MPI is MPICH2 plus some multi-device glue. Intel got something right in their design: they ask the vendor to provide the native device layers instead of doing everything themselves. That's how a SAK implementation could actually be decent. However, the reference implementation is using uDapl. That means that there is stuff above the device layers that is needed to make the MPI-over-uDapl performance decent. Some of it can be used for other devices, the rest not. The question is: if I need something above the device layer to make my stuff decent, could I have it? I would think so. Now, if it conflicts with something needed for another device, what happens? Someone makes a choice. > - that all code apart from the device, and also the device architecture > of MPICH-GM are more or less 10-year-old swiss-army-knife MPICH code > (which is not a bad thing per se) MPICH-1 is not a SAK. You cannot take an MPICH binary and run it on all of the devices on which MPICH has been ported. You can *compile* it on multiple targets, but nothing more. Furthermore, many ch2 things were not used in ch_gm. If you look at it, most of the common code of MPICH is not performance related, with the exception of the collectives (and again they are not that bad). MPICH-2 has been moving more things to the device-specific part, that's the good direction. > you should maybe think again before judging the efficiency of other > MPI implementations. I could not care less about the efficiency of other MPI implementations. None of my business. My point is that assuming that a SAK MPI implementation factorizes out the software part, so that all remaining performance differences are hardware related, is ridiculous.
As Greg pointed out, an interconnect is a software/hardware stack, all the way to the MPI lib. Throw away the native MPI lib and you have a lame duck. Compare lame ducks and you go nowhere. You don't have much choice, when you have a commercial MPI, but to support many interconnects. You cannot ask the vendors to write their part unless you are Intel, so you write it yourself. You do your best, because you need to sell your stuff, and you call it good. Is there value? Today yes, because it makes life easier to have binary compatibility. However, my second point is that binary compatibility should be addressed by the MPI community, not by commercial MPI implementations. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Wed Feb 16 02:14:45 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4212087F.6070809@verarisoft.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> <4212087F.6070809@verarisoft.com> Message-ID: <42131D15.4020305@myri.com> Rossen Dimitrov wrote: > Patrick, this is quite a broad statement. 4 years ago we had a paper > arguing that MPIs written to support many different interconnects and > messaging technologies through internal portability layers were probably > sub-optimal for at least some of the interconnects. Most of the reasons Yes, it's very logical. See my reply to Joachim: I don't criticize the existence of SAK implementations (actually, yes, a little); all commercial implementations are essentially swiss-army-knives, they have to be. My problem is with using results from one single MPI implementation to connect dots at the hardware level.
You don't know if the dots are from the MPI or the hardware, or both. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From eugen at leitl.org Wed Feb 16 02:31:53 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Mare Nostrum (not quite COTS) Message-ID: <20050216103153.GH1404@leitl.org> http://www-106.ibm.com/developerworks/library/pa-nl3-marenostrum.html Power Architecture Community Newsletter, 15 Feb 2005: MareNostrum: A new concept in Linux supercomputing developerWorks Power Architecture editors, IBM, 15 Feb 2005 The MareNostrum supercomputer at the Barcelona Supercomputing Center, ranked number four in the world in speed in November 2004, is constructed of such totally off-the-shelf parts as IBM BladeCenter JS20 servers, 64-bit 970FX PowerPC processors, TotalStorage DS4100 storage servers, and Linux 2.6. This is its story. IBM has long been a supercomputing leader -- its heritage of innovation currently and spectacularly manifested in its most powerful supercomputer, Blue Gene/L. The MareNostrum project is the latest bold experiment in supercomputing by IBM -- a small but powerful, rapidly deployed and built system that comes entirely from commercially available components. The Latin term mare nostrum means "our sea" (which to the Romans meant the Mediterranean, as familiar and available to the Italici as the air they breathed, but also the critical key to their success).
MareNostrum is one of the world's most powerful supercomputers, ranked among the top five in the prestigious TOP500 (see Resources), yet it is constructed from products available for sale to any business, lives within a relatively small footprint, and was built on a tight schedule using blade servers, a Linux operating environment, and other cost-efficient technologies. MareNostrum represents a new way of thinking about high-performance computing. Blade servers, some of the thinnest and densest machines available, which can be slid into chassis with the ability to share resources such as power and network switches, became the base components of this supercomputer design. Those familiar with the IBM BladeCenter JS20 servers' shared-resources architecture will recognize how these servers cost-effectively minimize power consumption and heat output. Running the Linux operating system, the servers exploit the capabilities of the 2.6 kernel on 64-bit PowerPC processors. MareNostrum also demonstrates something unique in its project timeline: Part of its mission was to prove the speed at which IBM Linux clusters could be implemented and unleashed. According to the IBM MareNostrum e-Science Lead, Dr. Juan Jose Porta (Open Systems Design and Development, IBM Boeblingen Laboratory): This is all about timely and focused execution. The speed at which this project was realized is important. Consider: from the initial concept in late December of 2003 to assembling the computer in Madrid took less than a year. Normally, this kind of supercomputer project takes years. To make a remarkable saga short, MareNostrum is here and will soon be put into operation by the Barcelona Supercomputer Center (BSC), a public consortium created by the Spanish Government, the Catalonian Government, and the Technical University of Catalonia (UPC), the hosts of the MareNostrum supercomputer. The Barcelona Supercomputing Center is located on the Polytechnic University of Catalonia (UPC) campus in Barcelona.
Dr. Porta added, "The supercomputer is based upon commodity technology already developed and available. We were also playing with another piece of magic -- an open environment. This has been a collaborative community effort, where we closely worked with our partners."

The name and the history

Why "MareNostrum?" In the words of Dr. Porta: MareNostrum means literally "our sea," which is also the Latin name for the Mediterranean Sea on which Barcelona is a port. It carries other apt connotations. "Our sea" refers to a sea of processors and professors who are flocking to the MareNostrum project with a deep commitment to breakthrough science. MareNostrum also refers to the fact that our supercomputer is on the shores of the Mediterranean which, in the days of old Rome, was the middle of the world. This was the center of the Roman Empire, now to become the center of European e-Science on the shores of the nice Mediterranean Sea! Thus, we are talking about an ocean of many professors and a major hub around which such facilitation will grow and thrive to empower a new generation of scientists. Another significant aspect of the name is that, being Latin, it is more culturally inclusive. Not everyone is aware that Spain actually has four official languages, and we did not want to slight anyone. Latin was a safe choice. Spain now understandably becomes the proud home to the most powerful supercomputer in Europe. We see references to its having been assembled in Madrid, but also references to its permanent home as being in Barcelona. MareNostrum is a result of the burgeoning partnership between IBM and the Spanish Government, which has also led to the creation of the Barcelona Supercomputing Center (BSC). BSC is a public consortium created by the Spanish Government, the Catalonian Government, and the Technical University of Catalonia (UPC), which will host the MareNostrum supercomputer.
Housed in a majestic 1920s chapel on the university grounds, MareNostrum serves a dual purpose: to serve as a primary high-performance computing resource for the European e-science community and to demonstrate the many benefits of Linux on POWER at scale.

Meet MareNostrum

With peak system performance of 40 teraflops for the final system configuration, and a number four spot on the TOP500 list, MareNostrum continues the IBM tradition of high-performance computing breakthroughs in the service of scientific advancement with a twist: MareNostrum is built entirely of commercially available components, including:

* 2,282 IBM eServer BladeCenter JS20 blade servers housed in 163 BladeCenter chassis
* 4,564 64-bit IBM PowerPC 970FX processors
* 140 TB of IBM TotalStorage DS4100 storage servers

The thinking behind MareNostrum's construction represents a new way of looking at these and other compute-intensive areas. Today's typical high-performance computing installation runs a large, parallel RISC-based UNIX system, with performance, rather than reliability, being of utmost importance. MareNostrum, however, is a small-footprint Linux cluster made up entirely of off-the-shelf components. With the extreme density of IBM eServer BladeCenter JS20 servers, diskless nodes, and an open system environment, MareNostrum offers superior price/performance; greater reliability, availability, and serviceability; and significant cost efficiencies -- factors that are endearing Linux-based cluster servers to more and more businesses all the time.

Distinguishing technologies

The next sections explain the hardware and software technologies that distinguish the high-performance computing strategy behind MareNostrum.

Hardware: Servers

There are 2,282 IBM eServer BladeCenter JS20 servers housed in 163 BladeCenter chassis. Each server blade has two PowerPC 970 processors running at 2.2 GHz, providing superior performance for several varieties of Linux.
The BladeCenter technology offers the highest commercially available computer density in the industry, which results in high performance with a small footprint. The BladeCenter technology allows for 84 dual-processor servers in a single 42U rack, giving more than 1.4 teraflops of compute power in a single rack. Hot-swappable JS20 servers also allow administrators to change servers without disrupting applications, maximizing availability. Its shared-resources architecture helps to minimize power consumption and heat output as well.

Hardware: Storage

MareNostrum's storage subsystem consists of 20 storage server nodes with 7 terabytes of capacity each, or 140 terabytes of total capacity. Its backbone is the IBM TotalStorage DS4100 storage server which, like the BladeCenter JS20, uses redundant hot-swappable components for high availability. IBM TotalStorage DS4100 technology enables tremendous scalability and a wide range of RAID data protection options.

Hardware: Switching

Four Myrinet switch frames, including 10 CLOS 256+256 switches and 2 Spine 1280s, together with densely bundled Myrinet cabling, enable faster parallel processing with less switching hardware. The redundant hot-swappable power supply ensures greater availability. The complete switch with 12 chassis provides 2,560 uniform ports. This uniformity simplifies the programming model, so researchers can focus on their programs and not the system interconnect architecture.

Software: The power of Linux on POWER

The Linux 2.6 kernel offers an array of enterprise and performance features that exploit the Power Architecture. The virtualization capabilities of Linux on POWER allow for more flexible partitioning, better balancing of workloads, and superior scalability should workloads increase. Dr. Porta explained, "It is the Linux 2.6 kernel which offers an array of enterprise and performance features that exploit the Power Architecture."
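The per-rack and whole-system peak figures quoted above are mutually consistent, given that the PowerPC 970FX can retire four floating-point operations per cycle (two FPUs, each capable of a fused multiply-add). The short sketch below is an editorial cross-check of that arithmetic, not part of the original article:

```python
# Cross-check of the MareNostrum peak-performance figures.
# Assumes 4 flops/cycle per PowerPC 970FX (2 FPUs x fused multiply-add).

CLOCK_HZ = 2.2e9          # 2.20 GHz, per the article
FLOPS_PER_CYCLE = 4       # two FPUs, one fused multiply-add each per cycle
CPUS_TOTAL = 4564         # 2,282 JS20 blades x 2 CPUs
CPUS_PER_RACK = 84 * 2    # 84 dual-processor blades in a 42U rack

gflops_per_cpu = CLOCK_HZ * FLOPS_PER_CYCLE / 1e9
rack_tflops = CPUS_PER_RACK * gflops_per_cpu / 1000
system_tflops = CPUS_TOTAL * gflops_per_cpu / 1000

print(f"{gflops_per_cpu:.1f} GF/CPU, {rack_tflops:.2f} TF/rack, "
      f"{system_tflops:.1f} TF system")
# Consistent with "more than 1.4 teraflops ... in a single rack"
# and the ~40 TF peak of the full system.
```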
Software: Diskless Image Management (DIM)

DIM is a prototype utility for managing the Linux distribution for the compute nodes on the storage servers, so that the compute node does not have to manage the root file system. All the files for operation are obtained through the cluster network. Because of this, blades can operate immediately without a Linux installation. This is on-demand operation. The blades do have a disk drive, but that is reserved for future application use such as checkpointing. DIM also supports the network boot environment in a highly distributed fashion.

Software: IBM Linux on POWER clustering technologies

The goal is to endow MareNostrum with the same benefits businesses in many industries derive from IBM Linux clusters, albeit on a larger scale. Benefits such as:

* Superior density and improved operating efficiency, including smaller space, power, and cooling requirements and related costs -- thanks to the BladeCenter JS20 architecture
* Record price/performance and system throughput for high-performance computing workloads, thanks to innovative POWER semiconductor technology, specifically the eight-way superscalar design of the PowerPC 970FX processor, which fully supports symmetric multi-processing (SMP)
* The leading IBM 64-bit POWER microprocessors are capable of addressing four billion times as much physical memory as traditional 32-bit processors, without resorting to complex memory-extension techniques.
* Better systems management control, thanks to embedded service processors and software image management
* Increased reliability, availability, and serviceability, as well as lower installation and maintenance costs -- provided by diskless compute nodes
* Improved functionality and performance, thanks to the Linux 2.6 kernel
* Reduced switching hardware requirements and faster parallel processing, provided by Myrinet switch cabling
* Improved storage subsystem costs and reliability, thanks to TotalStorage DS4100 storage technology

View from the crow's nest

When the power of MareNostrum is unleashed later this year, it will be at the service of scientific, engineering, and medical researchers in the Spanish and international scientific communities. Its to-do list includes issues that are familiar in the supercomputing world, such as protein folding, in silico (computer-simulated) drug screening, and enzymatic reactions. MareNostrum will be used to support basic and applied research in areas that include biology, chemistry, physics, and information-based medicine. As Dr. Porta summed up:

...[T]he very thinking that drove MareNostrum's construction is a new way of looking at compute-intensive areas, particularly in the life sciences, as we prepare new work to resolve challenging problems in information-based medicine -- including improvements in diagnostic and therapeutic treatments in hospitals. In the EU context, many of the projects will be conducted in collaboration with other leading European research institutions. We are building collaborative efforts across geographic borders and disciplines. And remember -- the name of the supercomputer is MareNostrum. Traditionally, it was the Mediterranean Sea which allowed commerce and communication to flourish in Europe and beyond.

Resources

* Visit the Project MareNostrum site, demonstrating the value of Linux clustering for science, for business, for life itself.
* MareNostrum is now at home at the Barcelona Supercomputing Center (BSC) on the Polytechnic University of Catalonia (UPC) campus in Barcelona, a prestigious public institution focused on higher education, research, and technology transfer.
* The TOP500 Supercomputer Sites project was started in 1993 to provide a reliable basis for tracking and detecting trends in high-performance computing -- twice a year, the project releases a list of the 500 sites operating the most powerful computer systems.
* See this chart for the Linpack benchmark for MareNostrum and others.
* This news article examines MareNostrum, IBM's top-ranked, off-the-shelf, blade-based supercomputer.
* Connecting two or more IBM eServer Cluster Servers can create a single, unified computing resource that will dramatically improve availability, flexibility, and adaptability for essential services.
* The IBM BladeCenter JS20 is well-suited for commercial mainstream applications and 64-bit high performance computing (HPC) environments.
* The IBM Redbook, The IBM eServer BladeCenter JS20, takes an in-depth look at the two-way Blade eServer for applications requiring 64-bit computing.
* The Linux on IBM eServer product line is Linux-enabled to deliver maximum performance, reliability, manageability, and price/performance benefits.
* See this site for more on how IBM supercomputing solutions can help remove the barriers to deployment of clustered server systems.
* IBM TotalStorage DS400 series has been enhanced with the DS4000 Storage Manager V9.10, enhanced remote mirror option, DS4100 option for larger capacity configurations, and support for EXP100 serial ATA expansion units.
* Take a look at the Myrinet switches used in MareNostrum.

About the author

The developerWorks Power Architecture editors welcome your comments on this article. E-mail them at dwpower@us.ibm.com.
--
Eugen* Leitl leitl
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

From patrick at myri.com Wed Feb 16 02:53:03 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <1108477871.4587.115.camel@s861954.sandia.gov>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com> <1108477871.4587.115.camel@s861954.sandia.gov>
Message-ID: <4213260F.5040303@myri.com>

Keith,

Keith D. Underwood wrote:
>>Looking for overlapping is actually not that hard:
>>a) look for medium/large messages, don't waste time on small ones.
>
> I contend that this particular item is bad advice. If you send a lot of
> small messages, you should use MPI_Isend there as well to give the MPI
> implementation every opportunity to do the right thing. As we go
> forward, end-to-end acknowledgments are going to become a reality. The

I agree. We are strongly considering acking at the lib level instead of at the firmware level in MX. It has many good side effects, and a few evil ones.

> last thing you want is to spend a round-trip delay on every message you
> send if you send a lot of them. Yes, the implementation can copy on the
> sending side to allow the send to complete, but that wastes memory and
> time.

If you are reliable, you need to be able to resend the data if you don't receive the ack in time.
If you don't want to do a copy, you have to wait for the ack before releasing the send buffer. For small messages, the copy is cheaper than the rtt, IMHO.

Are you saying that if someone uses Isend for sending small messages, it's a hint that avoiding the copy is worth it, because he is trying to overlap and does not care about latency? Yes, that would be logical. But then you need blocking Send to hint the reverse, and then you assume smart people will use blocking Send because they know latency matters at that place, whereas clueless people will use it because it's simpler than Isend.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

From patrick at myri.com Wed Feb 16 03:28:00 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <4212182C.60607@verarisoft.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com> <4212182C.60607@verarisoft.com>
Message-ID: <42132E40.1060001@myri.com>

Rossen,

Rossen Dimitrov wrote:
>> So if you run an MPI application and it sucks, this is because the
>> application is poorly written ?
>
> Patrick, here the argument is about whether and how you "measure" the
> "performance of MPI". I guess you may have missed some of the preceding
> postings.

No, I was pulling your leg :-) The bigger picture is that MPI has no performance in itself; it's middleware. You can only measure the way an MPI implementation enables a specific application to perform. Only benchmarking of applications is meaningful; you can argue that everything else is futile and bogus.

>> You don't want to benchmark an application to evaluate MPI, you want
>> to benchmark an application to find the best set of resources to get
>> the job done. If the code stinks, it's not an excuse.
>> Good MPI implementations are good with poorly written applications, but
>> still let smart people do smart things if they want.
>
> This is exactly my point made in my previous posting - you cannot design
> a system that is optimal in a single mode for all cases of its use when
> there are multiple parameters defining the usage and performance

I agree completely; being able to apply different assumptions to the whole code and see which one best matches the application's behavior is better than nothing. However, I believe that some tradeoffs are just too intrusive: you should not have to choose between low latency for small messages and progress by interrupt for large ones, especially when you can have both at the same time.

> I think it is fairly easy to show that overlapping and polling (or any
> kind of communication completion synchronization) are not orthogonal. If
> this was the case, you would see codes that show perfect overlapping
> running on any MPI implementation/network pair. I am sure there is
> plenty of evidence this is not the case.

I can show you codes where people sprinkled some MPI_Test()s in some loops. They don't poll to death, just a little from time to time, to improve overlap by improving progression. They poll and they overlap. They could just as well block and not overlap. Polling/blocking and overlap/no-overlap are not linked. Interrupts are useful to get overlap without help from the application, but they are not required for overlap.

> There is an important point here that needs to be clarified: when I say
> "polling" library, I assume that this library does both: polling
> completion synchronization and polling progress. There is not much room
> to define these here, but I am sure MPI developers know what they are.

I think this is where we don't understand each other. For me, polling means no interrupts. Whether you progress in the context of MPI calls or in the context of a progression thread, you pay the same CPU cycles.
If the application is providing CPU cycles to the MPI lib at the right time, you can overlap perfectly without wasting cycles.

> Here is a third one. Writing your code for overlapping with non-blocking
> MPI calls and segmentation/pipelining, testing the code, and not seeing
> any benefit of it.

Yes. This is very true. But if it's not worse than with blocking, they should stick with non-blocking, even if it's bigger and more confusing.

> stage I with communication in stage I+1. Then, there is the question how
> many segments you use to break up the message for maximum speedup. The
> pipelining theory says the more you can get the better, when they are
> of equal duration, there aren't inter-stage dependencies, and the
> stage setup time is low in proportion to the stage execution time. Also,

The more steps, the more overhead. Small pipeline stages decrease your startup overhead (when the second stage is empty) but increase the number of segments and the total cost of the pipeline. The best is to find a piece of computation long enough to hide the communication. Pipelining would be overkill, in my opinion.

> The metric I mentioned earlier "degree of overlapping" with some
> additional analysis can help designers _predict_ whether the design is
> good or not and whether it will work well or not on a particular system
> of interest (including the MPI library).

Temporal dependency between buffers and computation is the metric for overlapping. The longer you don't need a buffer, the better you can overlap a communication to/from it. Compilers could know that.

> This is however too much detail for this forum though, as most of the
> postings here discuss much more practical issues :)

I am bored with cooling questions. However, it's quite time consuming to argue by email. I don't know how RGB can go the distance :-)

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

From ashley at quadrics.com Wed Feb 16 03:26:55 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050216080525.GA3122@greglaptop.attbi.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <20050216080525.GA3122@greglaptop.attbi.com>
Message-ID: <1108553215.14604.9.camel@localhost.localdomain>

On Wed, 2005-02-16 at 00:05 -0800, Greg Lindahl wrote:
> Quadrics STEN (forgive me for classing this as
> dumb, I happen to think dumb is a compliment...) get this right.

In this context the STEN is used on the transmit side of the network as a way of doing effectively PIO writes directly into the network. On the receive side the NIC is anything but dumb and does the MPI tag matching. It almost entirely bypasses the CPU, leaving it free to do *whatever the application desires*.

Interestingly enough, the STEN is a very good example of what is being discussed here: doing a remote write (or MPI send) using the STEN is lower latency than using a DMA but uses more CPU cycles (as the STEN needs the data to be "pushed" from the main CPU, whereas a (R)DMA only needs the DMA descriptor to be "pushed" and the NIC then "pulls" the actual data).
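The PIO-vs-DMA tradeoff described above can be put into a toy cost model: PIO has low fixed latency but CPU cost grows with message size, while DMA costs the CPU only a descriptor post but carries a higher fixed latency. Every constant below is invented for illustration; these are not Quadrics (or anyone's) measurements:

```python
# Toy send-side cost model for the PIO-vs-DMA tradeoff.
# All constants are hypothetical, chosen only to illustrate the shape.

PIO_SETUP_US  = 0.3   # hypothetical per-message PIO startup
PIO_US_PER_KB = 0.5   # hypothetical CPU cost to push 1 KB through PIO
DMA_SETUP_US  = 1.5   # hypothetical descriptor post + NIC fetch latency

def pio_cpu_us(kb):
    # CPU pushes every byte into the NIC itself.
    return PIO_SETUP_US + PIO_US_PER_KB * kb

def dma_cpu_us(kb):
    # CPU only posts a descriptor; the NIC pulls the payload.
    return DMA_SETUP_US

# Below this size, the PIO path costs the CPU less in this model,
# which is why small messages tend to go PIO and large ones DMA.
crossover_kb = (DMA_SETUP_US - PIO_SETUP_US) / PIO_US_PER_KB
print(f"PIO wins on CPU cost below ~{crossover_kb:.1f} KB")
```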
Ashley,

From patrick at myri.com Wed Feb 16 03:53:43 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050215130656.8572F1C818@amd64.cownie.net>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> <20050215130656.8572F1C818@amd64.cownie.net>
Message-ID: <42133447.9050207@myri.com>

Hi James,

James Cownie wrote:
> As someone who was on the MPI Forum, and sat through an awful lot of
> meetings, I'd like to provide some justification for _why_ we didn't try
> to make a binary standard.

No, I imagine the context was very different 10 years ago. I just don't understand why dynamic spawning, one-sided communications, and MPI-IO were added to the Standard, but nobody wanted to address the mpi.h header compatibility issue. By that time, people knew that it was a problem, no?

> You seem to think (maybe subconsciously) that the MPI forum added
> features to the standard just to make life hard for implementors and
> to kill performance ;-)

Well, it was the right thing to be as exhaustive as possible to ensure the wide adoption of the standard. It was expert-friendly, but easy for the application folks to miss the points or take shortcuts. That's the cost of success. Now, I would hate to see a shared-memory paradigm emerge to progressively replace MPI because existing applications don't really try to leverage the message-passing paradigm's capabilities. Some believe it will never happen; I am not so sure.

> If you _really_ believe that there is so much performance benefit for
> your customers in having an MPI-light with the restrictions you outlined
> which only runs on your hardware, then no-one's stopping you from
> providing it.
This discussion is a beginning. It will only happen if all/most MPI implementors reach a point where it's clear that, to move forward, some semantics have to be avoided and some ambiguities cleared, and that can only be done at the API level. I would prefer that the MPI forum focus on improving the core message-passing functionality instead of adding yet another vertical dimension (what's left for MPI-3?). The urgent thing, however, is the ABI. Can we do that?

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

From ashley at quadrics.com Wed Feb 16 03:55:37 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <421306B5.3080200@myri.com>
References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de> <4211C559.8070100@myri.com> <1108479093.4587.132.camel@s861954.sandia.gov> <421306B5.3080200@myri.com>
Message-ID: <1108554937.14604.17.camel@localhost.localdomain>

On Wed, 2005-02-16 at 03:39 -0500, Patrick Geoffray wrote:
> In general yes, more opportunities for optimization is better. Now,
> assuming that irregular datatypes can be optimized as much as regular
> ones is wrong. The hardware can gather/scatter better than the
> application for nice long strides. However, MPI libs should print
> insults when tiny segments are used (when the scatter/gather efficiency
> collapses). The developer assumes that it's fine because he does not
> know or he does not care.

I have seen code that used a multi-megabyte array of 64-bit float/short pairs, effectively having 10 bytes of data and 6 bytes of "space" per pair. Changing this to a 64-bit float and two 32-bit ints removed the void space and replaced it with deliberate zero data. The "data transferred" went up, application buffer sizes remained the same, and performance was a whole lot better.
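The dead space in that float/short pair is ordinary struct padding: on common 64-bit ABIs, a struct of a double and a short is padded to 16 bytes, so 10 bytes of payload drag 6 bytes of nothing. This can be reproduced from Python with ctypes, which follows the platform C ABI (an editorial illustration; exact sizes are ABI-dependent):

```python
import ctypes

# The "64-bit float / short pair" layout described above: the compiler
# pads the struct out to a multiple of the double's alignment.
class PaddedPair(ctypes.Structure):
    _fields_ = [("value", ctypes.c_double),   # 8 bytes
                ("flag",  ctypes.c_short)]    # 2 bytes + (typically) 6 padding

# The reworked layout: no hidden padding, only deliberate zero data.
class PackedPair(ctypes.Structure):
    _fields_ = [("value", ctypes.c_double),   # 8 bytes
                ("flag",  ctypes.c_int),      # 4 bytes
                ("zero",  ctypes.c_int)]      # 4 bytes of explicit zeros

useful = ctypes.sizeof(ctypes.c_double) + ctypes.sizeof(ctypes.c_short)
print(f"padded pair: {ctypes.sizeof(PaddedPair)} bytes, {useful} useful")
print(f"packed pair: {ctypes.sizeof(PackedPair)} bytes, no dead space")
# On common 64-bit ABIs the padded pair is 16 bytes for 10 bytes of data.
```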
The application writer had used a short to "save space" and was somewhat stunned at the performance improvement. This is a situation that would be best avoided; maybe user education is the key, but it's a common problem and there are an awful lot of users. I'm not against complex datatypes in MPI, but they are hard to deal with and do get misused.

Ashley,

From patrick at myri.com Wed Feb 16 04:04:31 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <1108553215.14604.9.camel@localhost.localdomain>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <20050216080525.GA3122@greglaptop.attbi.com> <1108553215.14604.9.camel@localhost.localdomain>
Message-ID: <421336CF.5020505@myri.com>

Ashley Pittman wrote:
> Interesting enough the STEN is a very good example of what is being
> discussed here, doing a remote write (Or MPI send) using the STEN is
> lower latency than using a DMA but uses more CPU cycles (as the STEN
> needs the data to be "pushed" from the main CPU whereas a (R)DMA only
> needs the DMA descriptor to be "pushed" and the NIC then "pulls" the
> actual data).

It seems to be common practice to use PIO for small messages on the send side. MX/Myrinet does that too (whereas GM/Myrinet does not), SCI does it, Greg's IB on HT does it. I don't know of anyone who is not burning some cycles to get lower latency for small messages these days.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

From rgb at phy.duke.edu Wed Feb 16 04:17:10 2005
From: rgb at phy.duke.edu (Robert G.
Brown)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <42132E40.1060001@myri.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com> <4212182C.60607@verarisoft.com> <42132E40.1060001@myri.com>
Message-ID:

On Wed, 16 Feb 2005, Patrick Geoffray wrote:

> > This is however too much detail for this forum though, as most of the
> > postings here discuss much more practical issues :)
>
> I am bored with cooling questions. However, it's quite time consuming to
> argue by email. I don't know how RGB can keep the distance :-)
>
> Patrick

I stuck a hairpin into an electrical socket at age 2 (an "enlightening" experience, I must say) and had a large rock fall on my head from a height of almost a meter at age 8. Since then, I hardly ever get bored with cooling questions, because I cannot remember that they've been asked. What were we talking about, again?

Oh yeah, MPI and all that.

I've actually been enjoying reading the discussion and not participating, since I'm a PVM kinda guy. But SINCE my name was invoked in vain, I'll make a single comment on the code quality issue, which is that underlying the discussion of communication pattern, blocking vs non-blocking, and directives are the fundamental scaling properties of the code and algorithm itself. So on the issue of whether MPI sucks because the application sucks -- well, possibly, but it seems more likely that the application sucks because its parallel scaling properties (with the algorithm chosen) suck.

As to how "intelligent" the back-end library should be at choosing algorithm -- I would say the BASIC library should be atomic, elementary, NOT algorithm-level stuff. A thin skin on top of raw networking calls that provides the various things one always has to do oneself but not much more.
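The "thin skin on top of raw networking calls" RGB describes is essentially what raw sockets already give you: one atomic send and one atomic receive, with everything above that left to the application. A minimal loopback sketch of that primitive layer (an editorial illustration, neither MPI nor PVM code; the port and payload are invented):

```python
import socket
import threading

# A deliberately thin "communication layer": one raw send and one raw
# recv over loopback TCP, the kind of atomic primitive on top of which
# an application-tuned "collective" would be hand-built.

def echo_server(srv):
    conn, _ = srv.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)          # echo straight back, no hidden policy

srv = socket.socket()               # AF_INET / SOCK_STREAM by default
srv.bind(("127.0.0.1", 0))          # kernel picks a free port
srv.listen(1)
port = srv.getsockname()[1]

t = threading.Thread(target=echo_server, args=(srv,))
t.start()

cli = socket.create_connection(("127.0.0.1", port))
cli.sendall(b"halo exchange")       # the application decides what to send
reply = cli.recv(1024)              # ...and when to wait for it
cli.close()
t.join()
srv.close()

print(reply)
```

Everything MPI or PVM adds (matching, datatypes, progress) sits above calls of exactly this shape, which is RGB's point about where the "thick skin" begins.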
Where one gets into trouble is when one uses a command with a complex structure that doesn't fit your code without realizing it, and the reason you don't realize it is that all that detail is hidden, and isn't even uniform in RELATIVE performance across varying network hardware. In other words, to make MPI do more, either make it do less (in the form of commands that can be used to build "more" in a manner that is tuned to application and hardware) or be prepared to REALLY make it SMART behind the scenes.

This isn't just MPI, BTW. PVM suffers from the same thing. I honestly think that both are limited tools in part BECAUSE they put too thick a skin between the programmer and the network. If you want real performance and complete control over communication algorithm, you probably have to use raw/low-level networking commands, and write the appropriate "collective" operations for your particular application and hardware. Of course nobody does this -- not portable and a PITA to design/write/maintain. Or perhaps a few people DO do this, but they're programming gods. And this isn't crazy, really.

rgb
--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

From mathog at mendel.bio.caltech.edu Wed Feb 16 08:16:25 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Academic sites: who pays for the electricity?
Message-ID:

In most universities, services like electricity, water, and A/C are paid for by the school. To do so, they take "overhead" out of every grant. Partially as a consequence of this, they typically have a very poor ability to meter usage on a room by room basis.
Now somewhere between the 10-node Pentium II beowulf sitting on a lab bench and the 1000-node dual P4 Xeon beowulf in a machine room that takes up half the basement, the cost of the electricity (both for power and A/C) goes from a minor expense to a major one. Really major. For instance, in that hypothetical large machine, at 10 cents per kilowatt-hour (a round number), assuming 100 watts per CPU (another round number), that's:

  1000 (nodes) *
     2 (cpus/node) *
    .1 (kilowatts/cpu) *
    .1 (dollars/kilowatt-hour) *
   365 (days/year) *
    24 (hours/day) =
 -----------------------
 175200 dollars/year

The A/C expense is going to vary tremendously depending upon the outside temperature. It's going to be much higher for us in Southern California than for a site in Anchorage.

"Typical" lab usage is widely variable, but I'd be amazed if most biology or chemistry labs burn through even 1/10th this much for the equivalent lab area. Some physics lab running a tokamak might come close.

Anyway, the question is: have any of the universities said "enough is enough" and started charging these electricity costs directly? If so, what did they use for a cutover level, where usage was "above and beyond" overhead?

From an economic perspective, having electricity and A/C come out of overhead (without limit) grossly distorts the true cost of the project over time and can lead to choices which increase the total overall cost. For instance, the use of Xeons instead of Opterons has little effect on TCO if somebody else is picking up the electricity tab, but could change the power consumption significantly on a large project.

Regards,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From rgb at phy.duke.edu Wed Feb 16 09:22:35 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Academic sites: who pays for the electricity?
In-Reply-To:
References:
Message-ID:

On Wed, 16 Feb 2005, David Mathog wrote:

> In most universities services like electricity, water, and
> A/C are paid for by the school. To do so they take "overhead"
> out of every grant. Partially as a consequence of this they
> typically have a very poor ability to meter usage on a room
> by room basis.
>
> Now somewhere between the 10 node Pentium II beowulf sitting on
> a lab bench and the 1000 node dual P4 Xeon beowulf in a machine
> room that takes up half the basement the cost of the electricity
> (both for power and A/C) goes from a minor expense to a major
> one. Really major. For instance, in that hypothetical large machine,
> at 10 cents per kilowatt hour (a round number), assuming 100 watts
> per CPU (another round number) that's:
>
> 1000 (nodes) *
> 2 (cpus/node) *
> .1 (kilowatts/cpu) *
> .1 (dollars/kilowatt-hour) *
> 365 (days /year) *
> 24 (hours/day) =
> -----------------------
> 175200 dollars/year

I usually assume $1/watt/year (including AC), which is likely to be good within 20% or so depending on the actual cost of electricity in your area and the amount of AC required on a seasonally averaged basis. That yields an estimate of $200K in your example -- not really different, just easier to do in your head as a round number.

> The A/C expense is going to vary tremendously depending upon
> the outside temperature. It's going to be much higher for us
> in Southern California than for a site in Anchorage.
>
> "Typical" lab usage is widely variable but I'd be amazed
> if most biology or chemistry labs burn through even 1/10th this
> much for the equivalent lab area. Some physics lab running
> a tokamak might come close.
>
> Anyway, the question is, have any of the universities said "enough
> is enough" and started charging these electricity costs directly?
> If so, what did they use for a cutover level, where usage was
> "above and beyond" overhead?
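Both estimates quoted above are easy to reproduce; the sketch below simply re-runs the arithmetic (an editorial check, not part of either post):

```python
# David's itemized estimate for the hypothetical 1000-node dual-CPU cluster.
nodes, cpus_per_node = 1000, 2
kw_per_cpu = 0.1                 # 100 W per CPU
dollars_per_kwh = 0.10
hours_per_year = 365 * 24

itemized = nodes * cpus_per_node * kw_per_cpu * dollars_per_kwh * hours_per_year
print(f"itemized: ${itemized:,.0f}/year")            # $175,200/year

# RGB's $1/watt/year rule of thumb (AC included) for the same machine.
watts_total = nodes * cpus_per_node * kw_per_cpu * 1000
rule_of_thumb = watts_total * 1.0                    # $1 per watt per year
print(f"rule of thumb: ${rule_of_thumb:,.0f}/year")  # $200,000/year
```

The two figures differ by about 14%, comfortably inside the 20% slop RGB claims for the rule of thumb.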
This issue has most definitely come up at Duke, although we're still seeking a formula that will permit us to deal with it equitably. This is only one of several pieces of overhead associated with clusters that go above and beyond the assumptions that went into the original indirect-cost formulas. For example, Duke now charges grants a "recycling fee" for certain pieces of environmentally toxic end-of-life hardware (e.g. monitors, with their lead-filled screens). Then there are the really HUGE costs for physical space renovations as valuable and scarce campus space is converted for use in the burgeoning clusters. As our Dean of A&S recently remarked, if there aren't any checks and balances or cost-equity in funding and installing clusters, they may well continue to grow nearly exponentially, without bound (Duke's cluster population is doubling almost according to Moore's Law -- every couple of years).

Costs associated with those clusters, from the space to hold them, the power to run them, and the people to operate them, all grow roughly linearly with the number of nodes. This much is known. What isn't known is the details of the income stream. Each cluster (or part of a cluster) is typically connected with a specific grant-funded project and its associated income stream. Indirect costs >>are<< assessed on those grants; it may be that, on average, enough income comes from those indirect costs to easily support the clusters. This isn't crazy -- it is really a question of just what the ratio of supported people and other IC-producing expenses is to the number of cluster nodes associated with the research. I wish I knew this number -- it would be very useful in a CWM column;-) -- but I don't, and last I heard Duke still didn't know either, although they are perhaps moving slowly towards expending the energy required to find out.
Finding out isn't trivial -- it involves running down ALL the clusters on campus, figuring out whom ALL those nodes "belong" to, determining ALL the grant support associated with all those people and projects and clusters (since even research done without a cluster by a person who runs a cluster has to be considered as contributing, as the cluster may be "essential" to retaining that person), figuring out what the sum of the indirect costs is on all those grants, and finally connecting that total to the estimated cost of running all the nodes. By enabling more research projects, postdocs, laboratory operations, and other grant-funded activity to occur, their presence on campus might MAKE the university money, who knows?

Indirect cost formulas actually tend to EXCLUDE capital equipment such as clusters. If they didn't, the University would have made something on the order of 50% indirect costs on the roughly $2M the hardware in your example above would cost, and out of the resulting $1M (noting that the total grant would have had to be $3M for the hardware alone) plus overhead on the salary of the 2-3 people likely to be hired to run the 1000-node cluster, they could easily have paid for power for 3-5 years. So one proposal is to no longer exclude clusters from indirect cost assessments.

Of course this "solution" creates another problem just as big -- will granting agencies stand for this? There is a reason indirect costs aren't charged on capital equipment, and it isn't because Universities don't WANT to charge them; it is because many granting agencies flatly refuse to pay them. Some do -- IIRC, NIH is pretty tolerant about indirect costs associated with hardware, probably because in medical research they "expect" to have to support entire labs, as there is less likelihood of having a teaching stream of income to partially defray the costs. NSF does not, and I don't believe the DoD or DOE grants like to either.
Another is to just force clusters to budget and pay their own utility bills. I don't know how this would fly with grant agencies. They might be irritated if they had to pay for both the utilities and for indirect costs on the utility money (basically paying 1.5x or so of the cost of the power/AC used, so that the University would actually make another $100K in overhead in your example above), but they might hold still for the $200K/year for power alone. They almost certainly WOULD pay for utilities for clusters in places other than Universities, so this isn't so big a jump.

> From an economic perspective having electricity and A/C come out
> of overhead (without limit) grossly distorts the true cost
> of the project over time and can lead to choices which increase
> the total overall cost. For instance, the use of Xeons instead of
> Opterons has little effect on TCO if somebody else is picking
> up the electricity tab, but could change the power consumption
> significantly on a large project.

Absolutely. Or, using shelved tower units vs 2U rackmounts vs 1U rackmount nodes, when space is "scarce" and hence expensive. Or requiring each node to have remote management hardware, PXE network cards, 3 year onsite service plans -- all of these choices will be very differently made depending on how the chooser is constrained and who is paying for what.

I don't have a really perfect solution to this dilemma, and indeed I think it is a bit premature to expect one. When SOME institution does a real CBA on the total cash flow associated with grant-funded cluster-based research projects, including the more esoteric benefits such as "institutional prestige" (which is serious business, don't forget -- a weight factor that affects ALL grants submitted from an institution), perhaps we can start to think about which clever idea for recovering costs is realistic and fair.
In the meantime, budgets of the groups that actually pay these costs continue to get a wee bit strained as the number of nodes and associated costs continue to spiral upward. Maybe I'll do a column on this soon. I did a whole article on infrastructure for Linux Mag a year or two ago, but the particular aspect of infrastructure that you raise is still unresolved. I wonder if I could get Duke people to expedite collecting and assembling the data required to get the big picture on this...?

   rgb

> Regards,
>
> David Mathog
> mathog@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

--
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

From James.P.Lux at jpl.nasa.gov Wed Feb 16 10:56:03 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Academic sites: who pays for the electricity?
In-Reply-To: References: Message-ID: <6.1.1.1.2.20050216104144.07664e40@mail.jpl.nasa.gov>

At 09:22 AM 2/16/2005, Robert G. Brown wrote:
>On Wed, 16 Feb 2005, David Mathog wrote:
>
> > In most universities services like electricity, water, and
> > A/C are paid for by the school. To do so they take "overhead"
> > out of every grant. Partially as a consequence of this they
> > typically have a very poor ability to meter usage on a room
> > by room basis.
>
>I don't have a really perfect solution to this dilemma, and indeed I
>think it is a bit premature to expect one.
>When SOME institution does a real CBA on the total cash flow
>associated with grant-funded cluster-based research projects, including
>the more esoteric benefits such as "institutional prestige" (which is
>serious business, don't forget -- a weight factor that affects ALL
>grants submitted from an institution) perhaps we can start to think
>about which clever idea for recovering costs is realistic and fair. In
>the meantime, budgets of the groups that actually pay these costs
>continue to get a wee bit strained as the number of nodes and
>associated costs continue to spiral upward.

Such issues come up ALL the time in any government funded research. And, the more government oversight, the more data you have to collect on such "burden" and "overhead". An extreme might be a Defense Department (or NASA) Cost Reimbursement type contract (aka Cost Plus... note well: there are NO government contracts that are cost plus percentage of cost -- they're illegal. The fee amount is fixed, or based on award criteria, but does not depend on the amount spent, except perhaps in a negative fashion: bust a spending cap, and your award/incentive fee gets smaller). In such cases, the funding source is VERY interested in just how you calculated "cost", and therein lies much accounting.

There's a sort of pendulum-type swing back and forth for certain types of costs (and management philosophies). Do you count telephone service as an overall burden (raising your "overhead" percentage, but reducing the project's "Other direct costs (ODC)"), or do you charge the project back for the cost of the phone line, plus usage, plus some management "tax"? The latter reduces your overhead percentage, but increases the "direct costs". Same dollars flow either way, but in the latter case you WILL spend more time accounting for the other direct costs.
I suppose that in academia, the grantee might be sheltered a bit by the institutional processes, but in most other environments, it's been a reality for a long time. Different companies have different philosophies on the approach; either works, and will generally pass muster with the auditors. It does make evaluating proposals a bit trickier.

Taken to an extreme, we have the health care industry approach of "code and cost every item", so that the acetaminophen they give you after delivering a baby or having your gall bladder removed shows up on the bill as "Dispense acetaminophen, 2 tablets at 100mg" and "Administer acetaminophen, 100mg", each with separate charges near $10. Sadly, that $10 probably is a realistic cost, too, considering that some non-zero amount of time was spent to enter the transactions into a database, requiring the use of trained "medical coders" who know the procedure codes for everything, as well as the capital and operating costs of the terminal and computer they're using.

I'm sure that clusters in industry face the question of Cost/Benefit analysis, including infrastructure impact. Certainly this is the case for desktop PCs and mainframes in at least one industry where my wife is employed. Questions such as David raised are only going to become more and more common as the drive for "accountability" increases. Even within government agencies, such as NASA, the drive for "Full Cost Accounting" (which essentially imposes the same controls that have always been imposed on vendors on cost reimbursement contracts) is causing great pain, not because the costs actually change, but because it is a huge cultural and mental shift in how one plans one's work.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From rgb at phy.duke.edu Wed Feb 16 12:03:16 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl> References: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl> Message-ID: On Wed, 16 Feb 2005, Vincent Diepeveen wrote: > It is possible for algorithms to have sequential properties, in short > making it hard to scale well. Game tree search happens to have a few of > such algorithms, from which one is performing superior with a number of > enhancements having the same property that a faster blocked get latency > speeds it up exponential. > > For the basic idea why there is an exponential speedup see Knuth and search > for the algorithm 'alfabeta'. > > So the assumption that an algorithm sucks because it doesn't need bandwidth > but latency is like sticking a hairpin in an electrical socket. > > If users would JUST need a little bit of bandwidth they already can get > quite far with $40 cards. > > So optimizing MPI for low latency small messages IMHO is very relevant. I sort of agree, but I think you miss my point. Or maybe we totally agree but I misunderstand. Some algorithms will, as you note, have sequential communications times associated with them that scale with the number of nodes. BOTH bandwidth AND/OR latency can be important in minimizing those and other times in the algorithms, and which one is important can even change during the course of the computation. If I have an algorithm where lots of small messages have to go to lots of places in some order, latency becomes important. If I have an algorithm where lots of BIG messages have to go lots of places in some order, bandwidth becomes important. 
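The latency-versus-bandwidth split described above can be made concrete with the standard toy timing model T = latency + size/bandwidth. The figures below (50 microsecond latency, 100 MB/s) are invented round numbers for a circa-2005 commodity network, chosen only to show which term dominates at which message size:

```python
# Toy cost model for a point-to-point message: T = latency + size/bandwidth.
# The figures are invented round numbers, not measurements.

LATENCY_S = 50e-6        # 50 microseconds per message, regardless of size
BANDWIDTH_BPS = 100e6    # 100 MB/s wire speed, in bytes/second

def msg_time(size_bytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    return latency + size_bytes / bandwidth

# For a 64-byte message the latency term is ~99% of the total time;
# for a 10 MB message the bandwidth term is ~99.95% of it.
small, large = msg_time(64), msg_time(10_000_000)

# The crossover size, where the two terms contribute equally:
crossover_bytes = LATENCY_S * BANDWIDTH_BPS
```

With these (made-up) parameters the crossover sits at 5000 bytes: below that, shaving latency wins; above it, only bandwidth matters, and an application can cross back and forth over that line during a single run.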
There is plenty of room in between the extremes, order might or might not matter, resource contention might or might not be an issue, and there is plenty of opportunity for both big messages and small messages to be sent within a single application. Optimizing any particular MPI (or PVM) command for either extreme is then like robbing Peter to pay Paul, when Peter and Paul are a single bicephalic individual that has to pay protection money to the mob for every theft transaction (oh how I just LOVE to fold, spindle and mutilate metaphors). To illustrate some of the complexity issues and how interesting Life can be, consider the notion of a "broadcast". Node A wants to send the same message to N other nodes. Fine fine fine, happens all the time in certain kinds of parallel code. Node A therefore uses one of the collective operations (such as pvm_bcast or pvm_mcast in PVM, which is where I have more experience). Now, just what happens when this code is executed? PVM in this case promises that it will be a non-blocking send -- execution in the current thread on A resumes as soon as the "message is safely on its way to the receiving processors". It of course cannot promise that those processors will receive them, so "safely" is a matter of opinion. It also completely hides the details of how this >>difficult<< task is implemented, and whether or not it varies depending on the hardware or code context. To find out what it really does you must Use the (pvm) Source, Luke! and that is Not Easy because the actual thing it does is hidden beneath all sorts of lower level primitives. For example, the pvmd might send the message to each node in the receiving list one at a time, essentially serializing the entire message transmission (without blocking the caller, sure, but serial time is serial time). Or is it? If RDMA is used to deliver the message at the hardware device level so not even the pvmd is blocked for the full time of delivery, maybe not. 
Or, if the network device supports it, it might use an actual network broadcast or multicast. Then how efficiently it proceeds depends on whether and how broadcasts are supported by e.g. the kernel and intermediate network hubs. It could be anything from truly parallelized, so it takes a single latency hit to put the message on N lines (as an old ethernet repeater broadcast would probably do), to a de facto serialized latency hit (possibly with a much lower latency per hop) as a store and forward switch stores and then forwards to each line in turn. If that's what they do these days -- it might even vary ethernet switch to switch. Myrinet, FC, SCI all might (and probably do) do >>different<< things when one tries to optimally implement a "broadcast".

Or PVM might refuse to second guess the hardware at all. Instead it might use some sort of tree, and send the message (serially) to only some of the hosts in the receive list, who both accept the message and forward it to others, so (perhaps) you send to four hosts, the first of these sends to three more while you finish, the second of these sends to two more while you finish, the third to one while you finish, and each of THESE recipients finds still more to send to, so that you cover a lot more than four hosts in four latency hits (at the expense of involving all the intermediary systems heavily in delivery, which may or may not delay the threads running on those nodes, depending very much on the HARDWARE).
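The three strategies just described -- serialized point-to-point, a forwarding tree, and true hardware broadcast -- can be compared with a latency-counting model. This is purely a sketch (one unit per point-to-point hop, contention and loaded intermediaries ignored), not a description of what any real pvmd does:

```python
# Count latency "hits" until all n receivers have the message, under each
# of the broadcast strategies described above. A deliberately crude model:
# one unit per point-to-point hop, hardware broadcast costs one unit total.
import math

def serialized_hits(n):
    """Sender transmits to each receiver in turn: n serial latencies."""
    return n

def hardware_bcast_hits(n):
    """Repeater-style broadcast: one latency hit covers everyone."""
    return 1

def binomial_tree_hits(n):
    """Every node that already has the message forwards it each round, so
    the informed population doubles per round: ceil(log2(n+1)) rounds."""
    return math.ceil(math.log2(n + 1))

results = {n: (serialized_hits(n), binomial_tree_hits(n), hardware_bcast_hits(n))
           for n in (4, 15, 100)}
```

For 100 receivers the model gives 100 hits serialized, 7 via a binomial tree, and 1 via hardware broadcast -- the gap that makes the implementation choice matter so much, and that the API hides completely.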
Which of these it SHOULD do very likely depends strongly -- very strongly -- on some arcane mix of the task itself and its organization and communications requirements, the PARTICULAR size of the object being sent THIS time, the networking hardware in use, the number of CPUs on the node and number of CPUs on the node that are running a task thread, and (often forgotten) the global communication pattern -- whether this particular transmission is a one-way master-slave sort of thing or just one piece of some sort of global message exchange where all the nodes in the group are going to be adding their own messages to this stream while avoiding collisions.

The programmer sees none of this, so they think of none of this. They think "Gee, a broadcast. That's an easy way to send a message to many hosts with one send command. Great!" They think "this will cost just one latency hit to deliver the message to all N hosts", because that's what "broadcast" >>means<<, as they well know from watching TV and listening to the radio at the same time as a zillion other humans, in parallel. They may not (with PVM or MPI) even know what a "collision" or tree >>is<<, as there is no prior assumption of knowledge about physical networking or network algorithms in the learning or application of the toolsets. This is the Dark Side of APIs that hide detail and wrap it up in powerful black-box commands. They make the programming relatively easy, but they also keep you from coming to grips with all those little details upon which performance and sometimes even robustness or functionality depend. This is the "problem" I alluded to in the previous note.
To really do the right thing for the user (presuming there IS such a thing as a universally right thing to do, or even hoping to do the right thing nearly all the time) one either needs to write a large set of highly differentiated and differently optimized commands -- not just pvm_bcast, but pvm_bcast (presuming low latency hardware broadcast exists and is efficient), pvm_bcast_tree (presuming that NO efficient hardware broadcast exists, and possibly including additional arguments to spell out some things about the tree), pvm_bcast_tree_join (presuming that you need a tree and that each branch will both take something off a message passing through as a leaf and add back a message to join the transmission to the next leaf as a root), pvm_bcast_rr (round robin broadcasts that are optimally synchronized for no collisions between "simultaneously" broadcasting hosts), and pvm_bcast_for_other_special_cases for the ones I've forgotten or don't know about, and maybe double or treble the entire stack or add flags to be optimal for latency dominated patterns, bandwidth dominated patterns, or somewhere in between.

Alternatively, one can make just one pvm_bcast command, but put God's Own AI into it. Make smart decisions inside the hardware-aware daemons that automatically switch between all of the above and more, possibly dynamically during the course of a computation, to minimize delivery time and the load of all systems participating in the delivery. Hope that you do a good enough job of it that the result is still robust and doesn't constantly hang and crash when assumptions you built into it turn out to be incorrect or your "AI" turns out to be rather stupid.

All of this is just the opposite from the problems you encounter if you program at the raw socket (or other hardware interface) level.
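The "God's Own AI" alternative amounts to a dispatch table keyed on message size, host count, and hardware capabilities. A minimal sketch of that decision table -- every name and threshold here is hypothetical, not any real PVM or MPI internal:

```python
# A sketch of a single bcast entry point that dispatches among the
# specialized strategies enumerated above. Names and thresholds are
# invented for illustration only.

def choose_bcast_strategy(msg_bytes, nhosts, hw_bcast_ok, small_limit=4096):
    if hw_bcast_ok and msg_bytes <= small_limit:
        return "hardware_broadcast"   # one latency hit, if the fabric allows it
    if nhosts > 8:
        return "binomial_tree"        # log-depth fanout, loads intermediaries
    return "serialized_p2p"           # plain loop; fine for a handful of hosts
```

Even this three-way toy shows the fragility: every threshold is an assumption baked in by the library author, invisible to the programmer, and wrong for somebody's workload.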
There you have to work very hard to achieve the simplest of tasks -- open up a socket to a remote host, establish bidirectional ports, work out some sort of reliable message transmission sequence (which is nontrivial to do if you work with the lowest level and hence fastest networking commands, because communications is fundamentally asynchronous and unreliable and thus simple read/write commands do not suffice). However, once you get to where you CAN talk between nodes with sockets, you are forced to confront the communication and computational topology questions and the various "special" capabilities of the hardware head on. No black boxes. You want a tree, you get a tree, but YOU have to program the tree and decide what to do if a leaf dies in mid-delivery, etc. You want to synchronize communications, you go right ahead, but be prepared to figure out how to communicate synchronization information out of band. You want non-blocking communications, set the appropriate ioctl or fcntl, if you know how for your particular "file" or hardware. Learn the select call. Now things are totally controlled, but you have to be an experienced network programmer (a.k.a. a network programming God) to do anything complex. Sleep with Stevens underneath your pillow, that sort of thing. The big set, not just the single book version. And if you're THAT good, what are you doing working on a beowulf? There are big money jobs out there a-waitin', as there are for the other seventeen humans on the planet with that kind of ability...;-)

What I think a lot of people (even experienced people) end up doing is using PVM or MPI to mask out the annoying parts of raw networking -- maintaining a list of hosts in the cluster, dealing with the repetitive parts of the network code that ensure reliable delivery, adding some nice management tools to pass out-of-band information around for e.g. process control and synchronization.
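The "set the appropriate fcntl, learn the select call" machinery, in miniature (Python's setblocking() wraps the same O_NONBLOCK fcntl dance you would do in C; a socketpair stands in for a real remote connection):

```python
# Non-blocking sockets polled with select(), the raw-interface style
# described above, in its smallest possible form.
import select
import socket

a, b = socket.socketpair()
a.setblocking(False)      # equivalent to setting O_NONBLOCK via fcntl
b.setblocking(False)

# Nothing in flight yet: a zero-timeout select reports b not readable
# (a non-blocking recv here would raise instead of hanging forever).
readable, _, _ = select.select([b], [], [], 0)
assert readable == []

a.send(b"hello")                                  # small send completes at once
readable, _, _ = select.select([b], [], [], 1.0)  # now b becomes readable
data = b.recv(1024) if readable else None
a.close(); b.close()
```

Everything beyond this -- reliable delivery over lossy transports, trees, out-of-band synchronization -- is the part you have to build yourself at this level, which is exactly the argument for letting a library handle the boilerplate.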
Then they use a relatively small set of the available message passing commands, because they do not trust the more advanced black box collectives. Usually this lack of trust has a basis -- they tried them in some application or other and found that instead of speeding up, things got slower or had unexpected side effects, and they had no way of knowing what they actually DID, let alone how to fix them. They have no access to the real low level mechanism whereby those commands move data. That's what I meant about making MPI or PVM more "atomic". PVM has all sorts of pack and unpack commands for a message that permit (I suppose) typechecking and conversion to be done, where all I really want for most communications is a single send command that takes a memory address and a buffer length and a single receive command that takes a memory address and a (maximum) length. If I want to "pack" the buffer in some particular way, that's what bcopy is for. I don't want to "have" to allocate and free it every time I use it, or use a command that very likely allocates and frees a buffer I cannot see when I call it. The buffer might hold an int or int array, a double matrix, a struct, a vector of structs -- who cares? Pointers at both ends can unambiguously manage the typing and alignment -- I'm the programmer, and this is what I do. With this much simpler structure one can at least think about optimization as the problem is now much simpler. A message is a block of anonymous memory, period, with the programmer fully responsible for aligning the send address or receive address on both ends with whatever structure(s) or variable(s) are to be used there. It is very definitely less portable -- it leaves the user (or more likely a higher level command set built on top of the primitives) with the hassle of having to manage big-endian and little-endian issues as well as data size issues if they use the message passer across/between incompatible systems. 
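The "message is just a block of anonymous memory" model argued for here is easy to sketch. psend/precv below are hypothetical primitive names (a plain list stands in for the wire); the point is that the caller, not the library, owns the layout, including the endianness chore that pvm_pack would otherwise hide:

```python
# An anonymous-buffer message-passing model: a send takes raw bytes and a
# length, a receive hands raw bytes back, and the programmer interprets them.
# psend/precv and the list-based "wire" are illustrative stand-ins only.
import struct

def psend(channel, buf):
    channel.append(bytes(buf))    # ship len(buf) raw bytes; no hidden metadata

def precv(channel):
    return channel.pop(0)         # caller is responsible for interpretation

wire = []
# The sender lays the "struct" out by hand: explicit little-endian
# int32 + float64, so both ends agree regardless of host byte order.
psend(wire, struct.pack("<id", 42, 3.5))
i, x = struct.unpack("<id", precv(wire))
```

Two pointers and a length at each end unambiguously manage typing and alignment, as the text says; the cost is exactly the portability bookkeeping shown in the struct format string.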
These issues, however, were a lot more important in the past than they are today, and they add a bunch of easily avoided overhead to the vast majority of clusters where they aren't needed. With a simple set of building blocks such as this, one could then >>implement<< PVM or MPI or any other higher order, more portable message passing API on top of it. Indeed, I'd guess that this is very much what PVM really does (don't know about MPI) -- the pvm_pack routines very likely wrap up a malloc (of a struct), set metadata in struct, bcopy into data region of struct, send struct, free struct, with the inverse process on the other side driven/modified by the metadata (such as endianness) as needed. All sorts of tradeoffs of speed and total memory footprint in that sequence, many of which are not always necessary.

One could ALSO focus more energy on the higher order/collective send routines, as one could write them INSIDE the low level constructs provided so they become USER level software instead of library black boxes. With sources, modifying or tuning them would no longer involve working with either raw sockets or a hidden set of internals for the library itself. I'm not sure I'm making myself clear on all of this, and for lots of programs I'm sure it doesn't matter, but for really bleeding edge parallel performance and complex code I suspect that raw sockets (or the equivalent device interface for non-IP-socketed devices) still hold a substantial edge over the same algorithms managed through the same hardware with the message passing libraries. This was where I jumped in -- when Patrick made much the same statement. This (if true) is a shame, and is likely due to the assumptions that have gone into the implementation of the commands, some of which date back to big iron supercomputer days where the hardware was VERY DIFFERENT from today but ABSOLUTELY UNIFORM within a given hardware platform, so that "universal" tuning was indeed possible.
Maybe it's time to reassess these assumptions. I am therefore trying to suggest that instead of "fixing" the collectives to work better for optimal latency at the expense of bw or vice versa (without even MENTIONING the wide array of hardware the same command is supposed to "transparently" achieve this miracle on) it might be better to work the other way -- add some very low level primitives that do little more than encapsulate and manage the annoying aspects of raw interfaces while still permitting their "direct" use. THEN implement PVM and MPI both on top of those low level primitives -- why not? The differences are all higher order interface things -- ultimately what they do is move buffers across buses and wires, although the process would be made a lot easier if there were a shared data structure and primitives to describe and perform common tasks on a "cluster" between them. A coder could then choose to "use a compiler" (metaphorically the encapsulated primitives) for some or all of their code and accept the default optimizations, or "use an assembler" (the primitives themselves) to hand-tune critical parts of their code, without having to leave the nice safe portable bounds of their preferred parallel library. If done really well, it would accomplish the long discussed merger of PVM and MPI almost as an afterthought with teeny tweaks (perhaps) of the commands, since they would be based on the same primitives and underlying data structures, after all. Just dreaming, I guess. Possibly hallucinating. That bump on the head, y'know. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From James.P.Lux at jpl.nasa.gov Wed Feb 16 13:52:26 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl> Message-ID: <6.1.1.1.2.20050216134838.041990c8@mail.jpl.nasa.gov> At 12:03 PM 2/16/2005, Robert G. Brown wrote: >On Wed, 16 Feb 2005, Vincent Diepeveen wrote: > > >This (if true) is a shame, and is likely due to the assumptions that >have gone into the implementation of the commands, some of which date >back to big iron supercomputer days where the hardware was VERY >DIFFERENT from today but ABSOLUTELY UNIFORM within a given hardware >platform, so that "universal" tuning was indeed possible. Maybe it's >time to reassess these assumptions. I am therefore trying to suggest >that instead of "fixing" the collectives to work better for optimal >latency at the expense of bw or vice versa (without even MENTIONING the >wide array of hardware the same command is supposed to "transparently" >achieve this miracle on) it might be better to work the other way -- add >some very low level primitives that do little more than encapsulate and >manage the annoying aspects of raw interfaces while still permitting >their "direct" use. >THEN implement PVM and MPI both on top of those low level primitives -- >why not? The differences are all higher order interface things -- >ultimately what they do is move buffers across buses and wires, although >the process would be made a lot easier if there were a shared data >structure and primitives to describe and perform common tasks on a >"cluster" between them. 
A coder could then choose to "use a compiler" >(metaphorically the encapsulated primitives) for some or all of their >code and accept the default optimizations, or "use an assembler" (the >primitives themselves) to hand-tune critical parts of their code, >without having to leave the nice safe portable bounds of their preferred >parallel library. If done really well, it would accomplish the long >discussed merger of PVM and MPI almost as an afterthought with teeny >tweaks (perhaps) of the commands, since they would be based on the same >primitives and underlying data structures, after all. Isn't this what "self tuning" kinds of packages (ATLAS?) do? Or, at another level, what those horrible MAKE scripts do that attempt to address every possible instruction set, hardware, glibc, etc. variation in existence (or that some bright soul took it into his mind to come up with one weekend after getting home from a Grateful Dead concert). >Just dreaming, I guess. Possibly hallucinating. That bump on the head, >y'know. > > rgb > >-- >Robert G. Brown http://www.phy.duke.edu/~rgb/ >Duke University Dept. of Physics, Box 90305 >Durham, N.C. 27708-0305 >Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf James Lux, P.E. 
Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From lindahl at pathscale.com Wed Feb 16 16:04:55 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl> Message-ID: <20050217000455.GF2018@greglaptop.internal.keyresearch.com> On Wed, Feb 16, 2005 at 03:03:16PM -0500, Robert G. Brown wrote: > Optimizing any particular MPI (or PVM) command for either extreme is > then like robbing Peter to pay Paul, when Peter and Paul are a single > bicephalic individual that has to pay protection money to the mob for > every theft transaction (oh how I just LOVE to fold, spindle and > mutilate metaphors). Um, most MPI implementations have at least 3 algorithms, for short, long, and very long messages. So are they all breaking your rule? It's *unoptimizing* some of the cases that's at question. Most MPIs unoptimize compute/communication overlap with long messages, because it's hard work to get that right without hurting all short messages. -- greg From rgb at phy.duke.edu Thu Feb 17 07:14:48 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050217000455.GF2018@greglaptop.internal.keyresearch.com> References: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl> <20050217000455.GF2018@greglaptop.internal.keyresearch.com> Message-ID: On Wed, 16 Feb 2005, Greg Lindahl wrote: > On Wed, Feb 16, 2005 at 03:03:16PM -0500, Robert G. 
Brown wrote:
>
> > Optimizing any particular MPI (or PVM) command for either extreme is
> > then like robbing Peter to pay Paul, when Peter and Paul are a single
> > bicephalic individual that has to pay protection money to the mob for
> > every theft transaction (oh how I just LOVE to fold, spindle and
> > mutilate metaphors).
>
> Um, most MPI implementations have at least 3 algorithms, for short,
> long, and very long messages. So are they all breaking your rule?

No, as I noted later in the (yes, long:-) message. That's what there should be. Although the fact that they do isn't clear to the user, and the user has no control over it (none that I can see in the standard). They have to trust the implementation to do the right thing.

> It's *unoptimizing* some of the cases that's at question. Most MPIs
> unoptimize compute/communication overlap with long messages, because
> it's hard work to get that right without hurting all short messages.

Again, I think that we are in agreement. All I was ultimately suggesting is that message passing libraries whose complex higher level commands make optimization decisions (including the decision not to optimize) that may not be optimal for a significant number of complex cases might benefit from exposing the lower level primitives from which those complex commands are built, so users can roll their own within the library without having to resort to raw networking. You might not agree with this suggestion, but it is, as you say, the point in question.
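Greg's "at least 3 algorithms" point can be sketched as the size-based protocol switch found in many MPI implementations: an eager path for short messages and a rendezvous handshake (optionally pipelined) for long ones. The thresholds below are invented; real implementations expose theirs as tunables and the names vary:

```python
# A sketch of per-size protocol selection inside a send path. The
# eager/rendezvous split is a widely used pattern; these particular
# thresholds and strategy names are hypothetical.

def select_protocol(nbytes, eager_limit=12 * 1024, pipeline_limit=256 * 1024):
    if nbytes <= eager_limit:
        return "eager"                # copy into a preposted buffer: lowest latency
    if nbytes <= pipeline_limit:
        return "rendezvous"           # handshake first: no unexpected-message copy
    return "rendezvous_pipelined"     # chunked and overlapped: sustains bandwidth
```

This is the place where "unoptimizing" shows up: tilting the thresholds or the rendezvous logic toward long-message overlap tends to add work to the short-message path, which is exactly the tradeoff Greg describes.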
As I also said, I'm not an MPI expert by any means and therefore have to go look up commands beyond the MPI 1 standard (which look not horribly unlike the PVM command set as far as communication is concerned) and am probably shaky there, but looking them up on the mpi-forum.org site, it looks like MPI 2 adds MPI_PUT, MPI_GET, MPI_ACCUMULATE which are just exactly what I was suggesting and what I would have hoped for, especially if they are indeed the primitives from which at least some of the higher order commands are built. If so, users can either choose to use the optimized/unoptimized higher level commands provided or (if they understand their problem and hardware) roll their own. This is the distinction I was talking about. MPI originally passed messages at a high level of abstraction to wrap a variety of mechanisms in use on big supercomputers (not forgetting that it was a consortium of the vendors of such supercomputers that wrote the standard in response to pressure from the government and other major consumers who were tired of rewriting code every time a new supercomputer was released with its own internals and API for moving data between processors/processes). It (I think deliberately) avoided providing any sort of interface that might be interpreted as a "thin" wrapper to those internals that were responsible for minimal latency, maximum bandwidth movement of data. Whether this was to make the government happy (hiding the detail) or to make themselves happy (leaving a purchaser of a supercomputer with an incentive to write optimizations in their native API and hence become "hooked" on the hardware) is a moot point. PVM has a different, but related, history. It was built on top of networking from the beginning, more or less, and was deliberately designed to hide the networking primitives (specifically) from the programmer where MPI might have been hiding shared memory primitives and create a "virtual machine" where MPI was running on REAL machines. 
If anything, it went out of its way to avoid RMA-like message passing commands that "look" like a wrapper to shared memory, following instead a fairly simple reliable message transmission model, and in the end (3.x) had almost exactly the same range and general form of commands as MPI 1.x for the bulk of what a user was likely to do, with maybe a bit nicer control interface over the virtual machine and a bit less control over collective operations. Looking over (for the first time) the MPI 2 additions, I have to say that they look very nice, possibly nice enough to finally consider switching to MPI from PVM. Alternatively, it is something that should be cloned in PVM -- PVM would really benefit from PVM_GET, PVM_PUT, and some synchronization primitives. Provision of what amount to wrappers on raw RMA primitive commands (that can be/should be tuned for the hardware) and the separation of the RMA part and any synchronization components mean that a serious programmer has a lot of ability to control and optimize (assuming only that these commands truly are implemented as primitives as used to develop the higher level commands) without leaving the library, while people are able to use the higher level collectives when they are either a good match for their task or when they are beginners not yet ready to tackle lower level programming. The only thing I still don't find (on a fairly rapid lookover) is a discussion of just what e.g. broadcast does or how to make it vary what it does. Part of this of course doesn't belong in a standards document, which isn't intended to describe algorithms or implementations at that level of detail. However, one part does.
I think it matters a great deal to the programmer to know whether or not broadcast (and other commands) are indeed hardware primitives or whether they are implemented on top of point-to-point communications primitives that may or may not involve diverting intermediary processors from their running tasks (and ditto for scatter/gather type operations). This seems like it might be a programming decision point for people who really want to hand-optimize their code. Again, this is based on my experiences in PVM, where I've tried using broadcast several times in master/slave contexts expecting to reduce latency and communications times only to find that the command was de facto serialized and in fact took as long or longer than just running a loop over point to point communications calls. Perhaps MPI does it better, or differently, but it doesn't LOOK like it is anything but a black box which can swing from being good on one network to terrible on another without warning. How to implement such a thing in a standard is an open question, but from a programmer interface point of view having a set of commands that can query and set variables to control the back end behavior of collectives or determine properties of the hardware in the cluster would be very useful. Just one creative idea might be for MPI to provide an optional initialization command to run on a cluster that builds a table of quiescent-state and cpu-loaded-state latencies for short, medium, and long messages both point to point and in collective mode. The same table might hold some entries describing the selected hardware device, such as hw_bcast=TRUE, along with the broadcast latency. From this one might be able to build portable MPI programs that run optimally on Myrinet while they still run optimally on gig Ethernet, with or without e.g. a hardware RDMA command that significantly affects and redistributes the CPU loading per message. But maybe this is all too complicated, or doesn't belong in the standard per se.
It is indeed like the ATLAS thing, but then, I think that ATLAS is sheer genius although it is also cumbersome and clunky to build...;-) I just dream of the day that ATLAS-like runtime optimization isn't so clunky and is based on tools that create tables of microbenchmark numbers that ARE sufficiently accurate and rich to achieve near-optimization without running a build loop that sweeps and searches a high-dimensional space...:-) rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rossen at VerariSoft.Com Mon Feb 14 22:27:41 2005 From: rossen at VerariSoft.Com (Rossen Dimitrov) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> Message-ID: <4211965D.5080704@verarisoft.com> Rob, I agree that by now it is well understood that by providing a very flexible API with a rich set of semantics, MPI may have missed some opportunities for accelerating message passing in some constrained cases. Many of us have seen codes that not only use just the famous 6 MPI functions, but also avoid wild cards and out-of-order messages. As a result, these codes pay for services they don't use. As far as the predicted end-of-life for MPI, I wouldn't necessarily bet on it. As often happens, the technical reasons may have little to do with the issue. By now MPI has had penetration in so many long-term programs that it will be around for quite a while. Of course, this does not mean that there would not be attempts to "fix" it or replace it with something else. This might in fact be a good thing - natural evolution of technology. 
Rossen Verari Systems Software Rob Ross wrote: > Rossen, > > It would be good to mention that you work for a company that sells an > implementation specifically designed for facilitating overlapping, in case > people don't know that. Clearly you guys have thought a lot about this. > > At the last two Scalable OS workshops (the only two I've had a chance to > attend), there was a contingent of people that are certain that MPI isn't > going to last too much longer as a programming model for very large > systems. The issue, as they see it, is that MPI simply imposes too much > latency on communication, and because we (as MPI implementors) cannot > decrease that latency fast enough to keep up with processor improvements, > MPI will soon become too expensive to be of use on these systems. > > Now, I don't personally think that this is going to happen as quickly as > some predict, but it is certainly an argument that we should be paying > very careful attention to the latency issue, because as MPI implementors > this is an argument that never seems to end. > > Also, there is additional overhead in the Isend()/Wait() pair over the > simple Send() (two function calls rather than one, allocation of a Request > structure at the least) that means that a naive attempt at overlapping > communication and computation will result in a slower application. So > that doesn't surprise me at all. > > I think that the theme from this thread should be that "it's a good thing > that we have more than one MPI implementation, because they all do > different things best." > > Rob > --- > Rob Ross, Mathematics and Computer Science Division, Argonne National Lab > > > On Mon, 14 Feb 2005, Rossen Dimitrov wrote: > > >>There is quite a bit of published data that for a number of real >>application codes a modest increase of MPI latency for very short messages >>has no impact on the application performance.
This can also be seen by >>doing traffic characterization, weighing the relative impact of the >>increased latency, and taking into account the computation/communication >>ratio. On the other hand, what you give the application developers with >>an interrupt-driven MPI library is a higher potential for effective >>overlapping, which they could choose to utilize or not, but unless they >>send only very short messages, they will not see a negative performance >>impact from using this library. >> >>There is evidence that re-coding the MPI part of an application to take >>advantage of overlapping and asynchrony when the MPI library (and >>network) supports these well actually leads to real performance benefit. >> >>There is evidence that even without changing anything in the code, >>just running the same code with an MPI library that plays nicer with >>the system leads to better application performance by improving the >>overall "application progress" - a loose term I used to describe all of >>the complex system activities that need to occur during the life-cycle >>of a parallel application not only on a single node, but on all nodes >>collectively. >> >>The question of short message latency is connected to system scalability >>in at least one important scenario - running the same problem size as >>fast as possible by adding more processors. This will lead to smaller >>messages, much more sensitive to overhead, thus negatively impacting >>scalability. >> >>In other practical scenarios though, users increase the problem size as >>the cluster size grows, or they solve multiple instances of the same >>problem concurrently, thus keeping the message sizes away from the >>extremely small sizes resulting from maximum scale runs, thus limiting >>the impact of shortest message latency. I have seen many large clusters >>whose only job run across all nodes is HPL for the top500 number.
After >>that, the system is either controlled by a job scheduler, which limits >>the size of jobs to about 30% of all processors (an empirically derived >>number that supposedly improves the overall job throughput), or it is >>physically or logically divided into smaller sub-clusters. >> >>All this being said, there is obviously a large group of codes that use >>small messages no matter what size problem they solve or what the >>cluster size is. For these, the lowest latency will be the most >>important (if not the only) optimization parameter. For these cases, >>users can just run the MPI library in polling mode. >> >>With regard to the assessment that every MPI library does (a) partly >>right, I'd like to mention that I have seen behavior where attempting to >>overlap computation and communication can lead to no performance >>improvement at all, or even worse, to performance degradation. This is >>one example of how a particular implementation of a standard API can >>affect the way users code against it. I use a metric called "degree of >>overlapping" which for "good" systems approaches 1, for "bad" systems >>approaches 0, and for terrible systems becomes negative... Here goodness >>is measured as how well the system facilitates overlapping. >> >>Rossen > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From philippe.blaise at cea.fr Tue Feb 15 00:52:53 2005 From: philippe.blaise at cea.fr (Philippe Blaise) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: Message-ID: <4211B865.6050400@cea.fr> Mikhail Kuzminsky wrote: > > Let me ask a stupid question: which MPI implementations allow > really > > a) to overlap MPI_Isend w/computations > and/or b) to perform a set of subsequent MPI_Isend calls faster than > "the same" set of MPI_Send calls ?
> Dear Mikhail, sorry if it's not a direct answer to your question, but it could help. There is a potential difficulty when you try to overlap MPI_Isend with some computations: generally you do it on a cluster of SMP machines, and the performance of the overlapping should depend a lot on the placement of the processes on the SMP nodes. On one hand, if some of the pair processes that do the MPI_Isend / Irecv are on the same node, you won't be able to overlap communications with computations, but of course the communications should be faster for large messages using shared memory than using the NIC. On the other hand, if the pair processes are on different nodes, for large messages the communication time using the NIC is larger than the time for doing the same communication using shared memory, but of course if your NIC (like the quadrics one for example) is able to do some overlap you will save some time. Quadrics (again, but maybe it's true for other network technologies) provides a way to use the NIC even for the intra-node communication; but as a consequence you will share the NIC for intra and inter node communications together, and the potential benefit is not so clear. So don't expect too much by overlapping communication with computation: it's very hard to tune, and it depends a lot on the placement of your program on the SMP nodes, the NIC functionalities, and the scheme you use for the communications! If you have enough time, you could have a look at another approach by using a mixed OpenMP/MPI programming scheme. Regards, Phil. From ole at scali.com Tue Feb 15 01:28:10 2005 From: ole at scali.com (Ole W.
Saastad) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Re: Home beowulf - NIC latencies (Patrick Geoffray) In-Reply-To: <200502150010.j1F09vpb024195@bluewest.scyld.com> References: <200502150010.j1F09vpb024195@bluewest.scyld.com> Message-ID: <1108459690.25145.8.camel@pc-2.office.scali.no> Patrick Geoffray wrote: > There are more exotic work-arounds, like using 1) and polling at the > same time, and hiding the interrupt overhead with some black magic on > another processor. The one with the best potential would be to use > HyperThreading on Intel chips to have a polling thread burning cycles > continuously; it will run in-cache, won't use the FP unit or waste > memory cycles. A perfect use for the otherwise useless HT feature. I > wonder why nobody went that way... > I have tried this in order to see if we could poll a memory location for free using Intel HT. I ran a kinetics program that is a small Runge-Kutta stepping of equations simultaneously with a small loop checking the content of a memory location, then issuing a PAUSE instruction and repeating the loop. The simple finding is that the kinetics program got somewhat more than 70% of the CPU cycles and that the polling wasted close to 30% of the CPU cycles; 30% is not for free. After this test I finally decided to forever leave HT off. Ole -- Ole W. Saastad, Dr.Scient. Manager Cluster Expert Center dir. +47 22 62 89 68 fax. +47 22 62 89 51 mob. +47 93 05 74 87 ole@scali.com Scali - www.scali.com High Performance Clustering From rene at renestorm.de Tue Feb 15 02:37:23 2005 From: rene at renestorm.de (rene) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Block send mpi In-Reply-To: References: Message-ID: <200502151137.23227.rene@renestorm.de> Hi Mark, please revise this one more time. Maybe I understood it now. int packsize; MPI_Pack_size (bit, MPI_INT, newcomm, &packsize); - Calculates the memory demand (packsize) in bytes needed for count (bit) of MPI_INTs.
int bufsize = packsize + (MPI_BSEND_OVERHEAD); - Adds the overhead void *buf = new (void (*[bufsize]) ); - allocates the needed buffer in bytes. MPI_Buffer_attach (buf, bufsize); - attaches the buffer bsend->ierr = MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0, newcomm); - sends the data MPI_Buffer_detach (&buf, &bufsize); - Detaches it Regards, Rene > > int packsize; > > MPI_Pack_size (bit, MPI_INT, newcomm, &packsize); > > I would expect packsize to be counting bytes here. > > > int bufsize = packsize + (MPI_BSEND_OVERHEAD); > > // void *buf = new (void (*[packsize]) ()); > > int *buf = new (int ([packsize])); > > but here you have allocated an array of ints where the number > of elements is packsize. that means you have 4x too many bytes. > > > bsend->ierr = MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0, > > newcomm); > > bear in mind that &testdata[0] is legal but redundant - > it means the same thing as bare 'testdata'. -- Rene Storm @Cluster Linux Cluster Consultant Hamburgerstr. 42e D-22952 Luetjensee mailto:Rene@ReneStorm.de Voice-IP: Skype.com, Rene_Storm From Dries.Kimpe at cs.kuleuven.ac.be Tue Feb 15 05:06:40 2005 From: Dries.Kimpe at cs.kuleuven.ac.be (Dries Kimpe) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Block send mpi In-Reply-To: <200502141204.45263.rene@renestorm.de> References: <200502130529.58915.rene@renestorm.de> <42103A3D.8020605@scalableinformatics.com> <200502141204.45263.rene@renestorm.de> Message-ID: <4211F3E0.1060207@cs.kuleuven.ac.be> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 rene wrote: | Hi Joe, | | here is some output and changes which solve the problem. | I don't know why I created a void buffer and sent an int array. | After creating an int buffer I was also able to delete it ;o) | There are still some mistakes left: | int *buf = new (int ([packsize])); | [...] | delete buf; | The last line should be: delete[] (buf); Also, MPI_Pack_size returns the needed buffer size in bytes.
You are requesting (sizeof(int)*packsize) bytes where packsize bytes would suffice. ~ Greetings, ~ Dries -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.6 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFCEfPfv/8puanD4GoRArI9AJ0cmdT4m+Q7e9jvYhTZbbHviUDmQACglnTS CqBqJ/GqpaHJjM7jI0MGkJc= =nqLt -----END PGP SIGNATURE----- From kdunder at sandia.gov Tue Feb 15 06:31:12 2005 From: kdunder at sandia.gov (Keith D. Underwood) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <421194C4.5050808@myri.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com> Message-ID: <1108477871.4587.115.camel@s861954.sandia.gov> > Looking for overlapping is actually not that hard: > a) look for medium/large messages, don't waste time on small ones. I contend that this particular item is bad advice. If you send a lot of small messages, you should use MPI_Isend there as well to give the MPI implementation every opportunity to do the right thing. As we go forward, end-to-end acknowledgments are going to become a reality. The last thing you want is to spend a round-trip delay on every message you send if you send a lot of them. Yes, the implementation can copy on the sending side to allow the send to complete, but that wastes memory and time. Keith From kdunder at sandia.gov Tue Feb 15 06:34:50 2005 From: kdunder at sandia.gov (Keith D. Underwood) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211A95F.2010709@myri.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> Message-ID: <1108478089.4587.118.camel@s861954.sandia.gov> > c) ban of the ANY_SENDER wildcard: a world of optimization goes away > with this convenience.
Um, our apps guys say this is more than a convenience. Apparently, sometimes you don't exactly know who you are going to receive from. Would you rather have them post receives from 4000 nodes and cancel the ones that don't send to that node after a while? Keith From kdunder at sandia.gov Tue Feb 15 06:51:34 2005 From: kdunder at sandia.gov (Keith D. Underwood) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211C559.8070100@myri.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de> <4211C559.8070100@myri.com> Message-ID: <1108479093.4587.132.camel@s861954.sandia.gov> > Throw away compatibility. If you keep the legacy API, you have no > incentive for change. I don't want MPI-3, I want MPI-light. We are > against a wall because the MPI spec was too rich and developers took the > lazy path. Inertia is a powerful thing. Billions of dollars have been invested in MPI codes. Changing that will not be easy (or cheap). This is not as simple as moving from vectors to distributed memory - there wasn't nearly as much accumulated code then (and, it hurt back then). > It's used because it's there, there is no other reason. If you don't > know who sends you what in a message passing application, then you > cannot get either performance or robustness. If you really cannot do > otherwise (and I don't believe that), you can always use unexpected > messages (post the receive after Probe()ing). That's ugly, but you get > what you deserved :-) That just isn't true. If I don't know how many messages I will get, or from whom, but I can bound it, then I should prepost those receives. This is particularly true in your standard physics code that runs for days and does thousands of time steps. (i.e. you can maintain a circular queue of these things).
> If you don't use user-defined datatypes, then you don't need it and it > should not be there in the first place. It's a temptation, it's too > easy. No, there is no way to implement them efficiently unless they are > regular, and this is what I am willing to keep: strided types with long > segments. Everything else leads to memory copies. The developer should > wipe his own bottom instead of asking the message passing interface to > work around bad data layout. Sending a column of blocks, yes, that's > regular stride and it makes a lot of sense. Sending a non-contiguous > irregular structure? As we used to say in France, $100 and a chocolate > bar with that? The user should always expose as much opportunity for optimization as possible to the MPI layer. e.g. a load-store architecture like the X1 (not what I am advocating for MPI performance, mind you) could do excellent datatype processing. You would rather the user do the gather/scatter themselves to prohibit the MPI from being able to do it? Not that anyone uses irregular MPI datatypes because they were so bad for so long... but it would be nice if that were exposed to MPI. Keith From mhyoung at valdosta.edu Tue Feb 15 06:56:03 2005 From: mhyoung at valdosta.edu (michael young) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Poor man's SANS - THANK YOU!!!!!! In-Reply-To: References: <42110F0A.10408@valdosta.edu> Message-ID: <42120D83.6080804@valdosta.edu> Thank you so much to everyone who replied. All the info ya'll provided should keep me busy for some time. :) Again, thank you all very much. Michael Rob Ross wrote: >Yes! > >PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :). My >group at ANL along with Clemson University and Ohio Supercomputer Center >and others are developing this. It's entirely open source and open >development, and is in production use at ANL, OSC, and the University of >Utah CHPC, among other places.
> >GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that >RPMs are available for it now through one source or another. This used to >be Sistina's product; Sistina was subsequently bought by RedHat. I'm sure >this is used in production in many business environments, and we use it at >ANL also. Can someone provide a URL for this one? > >Lustre (www.lustre.org) is another option. This one is heavily funded by >the DOE ASC laboratories and is in use on some very large parallel >machines. But unless you have a relationship with CFS you can only get a >crippled version of the source, so it's probably not a good option for >average joe. If they change their policy on releasing source code, this >would be worth reconsidering. > >Regards, > >Rob >--- >Rob Ross, Mathematics and Computer Science Division, Argonne National Lab > > >On Mon, 14 Feb 2005, michael young wrote: > > > >>Hi, >>Can I use beowulf or some other Linux cluster or HA Linux solution >>to pool hard drive space together from different computers to make a >>kinda "poor man's SANS"? >> >>thank you >>Michael >> >> > > > From rossen at VerariSoft.Com Tue Feb 15 06:34:39 2005 From: rossen at VerariSoft.Com (Rossen Dimitrov) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211CB13.3050902@myri.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> Message-ID: <4212087F.6070809@verarisoft.com> > A last remark. I really think that the argument of using the same > swiss-army-knife MPI implementation such as ScaMPI or Intel MPI or even > MPI/Pro to infer interconnect characteristics is even worse than > looking at latency and bandwidth alone.
These implementations are never > going to be designed to use all hardware efficiently, their design is > either historic (Scali used to provide software for SCI alone) or > politically motivated (Intel is using uDapl, hummm, wonder why), or both. > They are by-products of the MPI Forum's failure to make the Standard > practical (compatible ABI). > > Patrick Patrick, this is quite a broad statement. 4 years ago we had a paper arguing that MPIs written to support many different interconnects and messaging technologies through internal portability layers were probably sub-optimal for at least some of the interconnects. Most of the reasons are obvious. At the time we were dealing with Portals, LAPI, and GM. You can easily see why having an internal portability layer for these interfaces does not seem to easily match the semantics of either one of them. We probably did something in our design to reflect this. From rossen at VerariSoft.Com Tue Feb 15 06:28:28 2005 From: rossen at VerariSoft.Com (Rossen Dimitrov) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211B0E0.6030007@ccrl-nece.de> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de> Message-ID: <4212070C.9050207@verarisoft.com> > > > One person alone can't do this. The best place to discuss such things is > the MPI users group meeting (EuroPVM/MPI, this year in Capri/Italy). > > Also, adding mpi.h to the standard to define an ABI is a good thing. > > Joachim > In a conversation with MPI and tool developers, I once mentioned that not defining a standard/mandatory mpi.h was probably a missed opportunity for improving interoperability of MPI. I was then told by a member of the MPI-1 Forum that this was done on purpose. This makes me think that we will not see an ABI definition for MPI any time soon.
Rossen From rossen at VerariSoft.Com Tue Feb 15 07:41:32 2005 From: rossen at VerariSoft.Com (Rossen Dimitrov) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <421194C4.5050808@myri.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com> Message-ID: <4212182C.60607@verarisoft.com> > > So if you run an MPI application and it sucks, this is because the > application is poorly written? Patrick, here the argument is about whether and how you "measure" the "performance of MPI". I guess you may have missed some of the preceding postings. > > You don't want to benchmark an application to evaluate MPI, you want to > benchmark an application to find the best set of resources to get the > job done. If the code stinks, it's not an excuse. Good MPI > implementations are good with poorly written applications, but still let > smart people do smart things if they want. This is exactly my point made in my previous posting - you cannot design a system that is optimal in a single mode for all cases of its use when there are multiple parameters defining the usage and performance evaluation spaces. And this is the reason why we provide both {polling synchronization/polling progress} and {interrupt-driven synchronization/independent progress} MPI modes (we have published papers defining a space based on MPI design choices). With these modes we can at least increase the chance that the user can get a better match to his scenario. >> - What application algorithm developers experience when they attempt to >> use the ever so nebulous "overlapping" with a polling MPI library and > > Overlapping is completely orthogonal to polling. Overlapping means that > you split the communication initiation from the communication > completion. Polling means that you test for completion instead of waiting > for completion.
You can perfectly overlap and check for completion of > the asynchronous requests by polling, nothing wrong with that. Well, I would probably have to say that I don't agree with this. First, I think it is fairly easy to show that overlapping and polling (or any kind of communication completion synchronization) are not orthogonal. If that were the case, you would see codes that show perfect overlapping running on any MPI implementation/network pair. I am sure there is plenty of evidence this is not the case. There is an important point here that needs to be clarified: when I say "polling" library, I assume that this library does both: polling completion synchronization and polling progress. There is not much room here to define these, but I am sure MPI developers know what they are. If polling and overlapping were orthogonal, the following would have had to be true: 1. You have a perfect network engine that takes no resources that might be used by computation when you either push bytes out or poll for completion. 2. Once you start a request (e.g., MPI_Isend), the execution of this communication request takes no CPU. 3. You can have a very cheap, bounded-duration polling operation from which you return immediately after it checks for your particular communication request. 4. You have something else to do when the completion poll reports that your request is not done. I would argue that none of these are true in practical scenarios, even including very smart polling schemes or networks with DMA engines, like Myrinet. Here I don't even bring up the cases with multithreaded applications. These are still a fairly small minority. > >> how this experience has contributed to the overwhelming use of >> MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or >> (even better) persistent MPI calls, thus killing any hope that these >> codes can run faster on systems that actually facilitate overlapping.
> > There are 2 reasons why developers use blocking operations rather than > non-blocking ones: > 1) they don't know about non-blocking operations. > 2) MPI_Send is shorter than MPI_Isend(). Here is a third one. Writing your code for overlapping with non-blocking MPI calls and segmentation/pipelining, testing the code, and not seeing any benefit of it. > > > Looking for overlapping is actually not that hard: > a) look for medium/large messages, don't waste time on small ones. > b) replace all MPI_Send() by a pair MPI_Isend() + MPI_Wait() > c) move the MPI_Isend() as early as possible (as soon as data is ready). > d) move the MPI_Wait() as late as possible (just before the buffer is > needed). > e) do same for receive. Not quite. Most of the time the message-passing segment of the code you optimize for overlapping is in the innermost loop of the algorithm - the one that is most overhead sensitive and usually most optimized. You will not see common cases where you can "pull" MPI_Send much earlier or push MPI_Wait much later than where MPI_Send is. So what you usually end up doing is introducing another loop inside the innermost one, breaking up the MPI_Send message into a number of segments and pipelining them with MPI_Isend (or even better MPI_Start) by initiating segment I+1 while computing with segment I, thus attempting to overlap computation in stage I with communication in stage I+1. Then, there is the question of how many segments you use to break up the message for maximum speedup. The pipelining theory says the more you can get the better, when the segments are of equal duration, there aren't inter-stage dependencies, and the stage setup time is low in proportion to the stage execution time. Also, the size of the segments should be such that the transmission time (not the whole latency) of the segment is as close as possible to the computation performed on the segment.
I can continue with other factors that one needs to take into account in order to write a good algorithm with overlapping. The metric I mentioned earlier, "degree of overlapping", with some additional analysis can help designers _predict_ whether the design is good or not and whether it will work well or not on a particular system of interest (including the MPI library). This is too much detail for this forum though, as most of the postings here discuss much more practical issues :)

Rossen

From steve_heaton at ozemail.com.au Wed Feb 16 17:10:15 2005
From: steve_heaton at ozemail.com.au (steve_heaton@ozemail.com.au)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Academic sites: who pays for the electricity?
Message-ID: <20050217011015.HITQ24369.swebmail02.mail.ozemail.net@localhost>

G'day all

Speaking as someone from "industry", and a Project/Programme Manager at that, I'd just like to add that I'm shocked and dismayed at the apparent lack of accountability that seems rampant in academic circles! If it was down to me I'd sack the lot of ya!! ;)

I'd strongly recommend that all good cluster folk have a good idea about operational expenditure (opex). If you get a visit from the Meanie Beanies (auditors / cost accountants etc etc) then it'll help cover your A. Not having a firm understanding of your $'s in and out is a great way to have your gig cancelled. Happens all the time in Industry, and in my job it's a sackable offence. No joke.

Do some homework and you won't need to be afraid (OK, *as* afraid of the Purple Pen People). Some things to know (ideally you should be able to quote these with as little as an hour's warning; it shows you're on top of things):

-) The amount of floor space you consume (sq ft or m) - don't worry about the cost of this one, those asking will know ;) Becomes a hot topic if you're paying rent in some form.
-) Find out how much electricity you use per hour - chances are you're on one or more dedicated circuit(s) and probably separate metering - look at the bills. Don't worry about general lighting etc. It's often rolled into the floor space calcs.
-) Ditto aircon (include your maintenance)
-) Cluster hardware maintenance (out-of-warranty stuff, cost of spares) - quoting your amazing uptime can help explain this figure
-) Service contracts (you've got a Service Level Agreement, right? Uptime % etc helps explain)
-) Staff / admin costs
-) The good ol' "anything else you can think of"

Now the fun part. Who used your cluster and for how long? Look at your job scheduling etc. Your department? Another department (do you cross-charge somehow)? Which projects? What's their contribution to cluster opex? If you can answer reasonably accurately then the Beanies will treat you with some respect :)

>>Someone, somewhere is paying your bills already.<< Know where that money is going!

Don't say I didn't warn you ;)

Cheers
Stevo

This message was sent through MyMail http://www.mymail.com.au

From rene at renestorm.de Tue Feb 15 09:55:49 2005
From: rene at renestorm.de (rene)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Block send mpi
In-Reply-To: <200502151724.44168.rene@renestorm.de>
References: <200502151724.44168.rene@renestorm.de>
Message-ID: <200502151855.49438.rene@renestorm.de>

> OK nice
>
> void *buf = new char[bufsize];
>
> would allocate a buffer of size bufsize (sizeof(char)=1) which is
> calculated by mpi packsize + the overhead.
>
> Seems to be clear but won't work and results in:
> 0 - MPI_BSEND : Insufficent space available in user-defined buffer
> [0] Aborting program !
> [0] Aborting program!
> p0_9158: p4_error: : 321

Cu
Rene

From diep at xs4all.nl Wed Feb 16 02:56:18 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
Message-ID: <3.0.32.20050216115617.01058100@pop.xs4all.nl>

At 11:07 14-2-2005 -0800, Greg Lindahl wrote:
>On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:
>
>> Let me ask some stupid's question: which MPI implementations allow
>> really
>>
>> a) to overlap MPI_Isend w/computations
>> and/or
>> b) to perform a set of subsequent MPI_Isend calls faster than "the
>> same" set of MPI_Send calls ?
>>
>> I say only about sending of large messages.
>
>For large messages, everyone does (b) at least partly right. (a) is
>pretty rare. It's difficult to get (a) right without hurting short
>message performance. One of the commercial MPIs, at first release, had
>very slow short message performance because they thought getting (a)
>right was more important. They've improved their short message
>performance since, but I still haven't seen any real application
>benchmarks that show benefit from their approach.

Perhaps no one who needed fast latency bought those NICs in the first place. A huge number of the jobs that the Dutch government's 1024-processor SGI used to handle are 4-8 processor jobs, simply because latency matters.

>-- greg
>_______________________________________________
>Beowulf mailing list, Beowulf@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From diep at xs4all.nl Wed Feb 16 04:13:34 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Mare Nostrum (not quite COTS)
Message-ID: <3.0.32.20050216131329.0106f960@pop.xs4all.nl>

That looks great, congratulations on the supercomputer!

Which Myrinet cards are in Mare Nostrum?
What one-way pingpong latency can it get from one end of the machine to the other?

Vincent

At 11:31 16-2-2005 +0100, Eugen Leitl wrote:
>
>http://www-106.ibm.com/developerworks/library/pa-nl3-marenostrum.html
>
>Power Architecture Community Newsletter, 15 Feb 2005: MareNostrum: A new
>concept in Linux supercomputing
>
>Contents:
>The name and the history
>Meet MareNostrum
>Distinguishing technologies
>View from the crow's nest
>Resources
>About the author
>
>Level: Introductory
>
>developerWorks Power Architecture editors
>IBM
>15 Feb 2005
>
> The MareNostrum supercomputer at the Barcelona Supercomputing Center,
>ranked number four in the world in speed in November 2004, is constructed of
>such totally off-the-shelf parts as IBM BladeCenter JS20 servers, 64-bit
>970FX PowerPC processors, TotalStorage DS4100 storage servers, and Linux 2.6.
>This is its story.
>
>IBM has long been a supercomputing leader -- its heritage of innovation
>currently and spectacularly manifested in its most powerful supercomputer,
>Blue Gene/L. The MareNostrum project is the latest bold experiment in
>supercomputing by IBM -- a small but powerful, rapidly deployed and built
>system that comes entirely from commercially available components. The Latin
>term mare nostrum means "our sea" (which to the Romans meant the
>Mediterranean, as familiar and available to the Italici as the air they
>breathed, but also the critical key to their success).
>
>MareNostrum is one of the world's most powerful supercomputers, ranked among
>the top five in the prestigious TOP500 (see Resources), yet it is constructed
>from products available for sale to any business, lives within a relatively
>small footprint, and was built on a tight schedule using blade servers, a
>Linux operating environment, and other cost-efficient technologies.
>MareNostrum represents a new way of thinking about high-performance
>computing.
>
>Blade servers, some of the thinnest and densest machines that can be slid into
>chassis with the ability to share resources such as power and network switches,
>became the base components of this supercomputer design. Those familiar with
>the IBM BladeCenter JS20 servers' shared-resources architecture will
>recognize how these servers cost-effectively minimize power consumption and
>heat output. Running the Linux operating system, the servers exploit the
>capabilities of the 2.6 kernel on 64-bit PowerPC processors.
>
>MareNostrum also demonstrates something unique in its project timeline:
>part of its mission was to prove the speed at which IBM Linux clusters could
>be implemented and unleashed. According to the IBM MareNostrum e-Science
>Lead, Dr. Juan Jose Porta (Open Systems Design and Development, IBM
>Boeblingen Laboratory):
>
> This is all about timely and focused execution. The speed at which this
>project was realized is important. Consider: from the initial concept in late
>December of 2003 to assembling the computer in Madrid took less than a year.
>Normally, this kind of supercomputer project takes years.
>
>To make a remarkable saga short, MareNostrum is here and will soon be put
>into operation by the Barcelona Supercomputing Center (BSC), a public
>consortium created by the Spanish Government, the Catalonian Government, and
>the Technical University of Catalonia (UPC), the hosts of the MareNostrum
>supercomputer. The Barcelona Supercomputing Center is located on the
>Polytechnic University of Catalonia (UPC) campus in Barcelona.
>
>Dr. Porta added, "The supercomputer is based upon commodity technology
>already developed and available. We were also playing with another piece of
>magic -- an open environment. This has been a collaborative community effort,
>where we closely worked with our partners."
>
>The name and the history
>Why "MareNostrum?" In the words of Dr.
Porta:
>
> MareNostrum means literally "our sea," which is also the Latin name for
>the Mediterranean Sea on which Barcelona is a port. It carries other apt
>connotations. "Our sea" refers to a sea of processors and professors who are
>flocking to the MareNostrum project with a deep commitment to breakthrough
>science. MareNostrum also refers to the fact that our supercomputer is on the
>shores of the Mediterranean which, in the days of old Rome, was the middle of
>the world. This was the center of the Roman Empire, now to become the center
>of European e-Science on the shores of the nice Mediterranean Sea! Thus, we
>are talking about an ocean of many professors and a major hub around which
>such facilitation will grow and thrive to empower a new generation of
>scientists.
>
> Another significant aspect of the name is that, being Latin, it is more
>culturally inclusive. Not everyone is aware that Spain actually has four
>official languages, and we did not want to slight anyone. Latin was a safe
>choice. Spain now understandably becomes the proud home to the most powerful
>supercomputer in Europe. We see references to its having been assembled in
>Madrid, but also references to its permanent home as being in Barcelona.
>
>MareNostrum is a result of the burgeoning partnership between IBM and the
>Spanish Government, which has also led to the creation of the Barcelona
>Supercomputing Center (BSC). BSC is a public consortium created by the
>Spanish Government, the Catalonian Government, and the Technical University
>of Catalonia (UPC), which will host the MareNostrum supercomputer.
>
>Housed in a majestic 1920s chapel on the university grounds, MareNostrum
>serves a dual purpose: to serve as a primary high-performance computing
>resource for the European e-science community and to demonstrate the many
>benefits of Linux on POWER at scale.
>
>Meet MareNostrum
>With peak system performance of 40 teraflops for the final system
>configuration, and a number four spot on the TOP500 list, MareNostrum
>continues the IBM tradition of high-performance computing breakthroughs in
>the service of scientific advancement with a twist: MareNostrum is built
>entirely of commercially available components, including:
>
> * 2,282 IBM eServer BladeCenter JS20 blade servers housed in 163
>   BladeCenter chassis
> * 4,564 64-bit IBM PowerPC 970FX processors
> * 140 TB of IBM TotalStorage DS4100 storage servers
>
>The thinking behind MareNostrum's construction represents a new way of
>looking at these and other compute-intensive areas. Today's typical
>high-performance computing installation runs a large, parallel RISC-based
>UNIX system with performance instead of reliability being of utmost
>importance. MareNostrum, however, is a small-footprint Linux cluster made up
>entirely of off-the-shelf components. With the extreme density of IBM eServer
>BladeCenter JS20 servers, diskless nodes, and an open system environment,
>MareNostrum offers superior price/performance; greater reliability,
>availability, and serviceability; and significant cost efficiencies --
>factors that are endearing Linux-based cluster servers to more and more
>businesses all the time.
>
>Distinguishing technologies
>The next sections explain the hardware and software technologies that
>distinguish the high-performance computing strategy behind MareNostrum.
>
>Hardware: Servers
>There are 2,282 IBM eServer BladeCenter JS20 servers housed in 163
>BladeCenter chassis. Each server blade has two PowerPC 970 processors
>running at 2.20GHz, providing superior performance for several varieties of
>Linux. The BladeCenter technology offers the highest commercially available
>computer density in the industry, which results in high performance with a
>small footprint.
>The BladeCenter technology allows for 84 dual-processor
>servers in a single 42U rack, giving more than 1.4 teraflops of compute
>power in a single rack.
>
>Hot-swappable JS20 servers also allow administrators to change servers
>without disrupting applications, maximizing availability. Its
>shared-resources architecture helps to minimize power consumption and heat
>output, as well.
>
>Hardware: Storage
>MareNostrum's storage subsystem consists of 20 storage server nodes with 7
>terabytes of capacity each, or 140 terabytes of total capacity. Its backbone
>is the IBM TotalStorage DS4100 storage server which, like the BladeCenter
>JS20, uses redundant hot-swappable components for high availability. IBM
>TotalStorage DS4100 technology enables tremendous scalability and a wide
>range of RAID data protection options.
>
>Hardware: Switching
>Four switch frames with Myrinet, including 10 CLOS 256+256 switches and 2
>Spine 1280s, and densely bundled Myrinet cabling enable faster parallel
>processing with less switching hardware. The redundant hot-swappable power
>supply ensures greater availability. The complete switch with 12 chassis
>provides for 2,560 uniform ports. This uniformity simplifies the programming
>model so researchers can focus on their programs and not the system
>interconnect architecture.
>
>Software: The power of Linux on POWER
>The Linux 2.6 kernel offers an array of enterprise and performance features
>that exploit the Power Architecture. The virtualization capabilities of
>Linux on POWER allow for more flexible partitioning, better balancing of
>workloads, and superior scalability should workloads increase. Dr. Porta
>explained, "It is the Linux 2.6 kernel which offers an array of enterprise
>and performance features that exploit the Power Architecture."
>
>Software: Diskless Image Management (DIM)
>DIM is a prototype utility for managing the Linux distribution for the
>compute nodes on the storage servers so that the compute node does not have
>to manage the root file system. All the files for operation are obtained
>through the cluster network. Because of this, blades can operate immediately
>without Linux installation. This is on-demand operation. The blades do have a
>disk drive, but that is reserved for future application use such as
>checkpointing. DIM also supports the network boot environment in a highly
>distributed fashion.
>
>Software: IBM Linux on POWER clustering technologies
>The goal is to endow MareNostrum with the same benefits businesses in many
>industries derive from IBM Linux clusters, albeit on a larger scale. Benefits
>such as:
>
> * Superior density and improved operating efficiency, including smaller
>   space, power, and cooling requirements and related costs -- thanks to
>   the BladeCenter JS20 architecture
> * Record price/performance and system throughput for high-performance
>   computing workloads thanks to innovative POWER semiconductor
>   technology, specifically the eight-way superscalar design of the
>   PowerPC 970FX processor, which fully supports symmetric multi-processing
>   (SMP)
> * The leading IBM 64-bit POWER microprocessors are capable of addressing
>   four billion times the amount of physical memory as traditional 32-bit
>   processors without resorting to complex memory-extension techniques.
> * Better systems management control thanks to embedded service processors
>   and software image management
> * Increased reliability, availability, and serviceability, as well as
>   lower installation and maintenance costs -- provided by diskless
>   compute nodes
> * Improved functionality and performance thanks to the Linux 2.6 kernel
> * Reduced switching hardware requirements and faster parallel processing
>   provided by Myrinet switch cabling
> * Improved storage subsystem costs and reliability thanks to TotalStorage
>   DS4100 storage technology
>
>View from the crow's nest
>When the power of MareNostrum is unleashed later this year, it will be at the
>service of scientific, engineering, and medical researchers in the Spanish
>and international scientific communities. Its to-do list includes issues that
>are familiar in the supercomputing world, such as protein folding, in silico
>(computer-generated) drug screening, and enzymatic reactions. MareNostrum will
>be used to support basic and applied research in areas that include biology,
>chemistry, physics, and information-based medicine.
>
>As Dr. Porta summed up:
>
> ...[T]he very thinking that drove MareNostrum's construction is a new way
>of looking at compute-intensive areas, particularly in the life sciences, as
>we prepare new work to resolve challenging problems in information-based
>medicine -- including improvements in diagnostic and therapeutic treatments
>in hospitals. In the EU context, many of the projects will be conducted in
>collaboration with other leading European research institutions. We are
>building collaborative efforts across geographic borders and disciplines. And
>remember -- the name of the supercomputer is MareNostrum. Traditionally, it
>was the Mediterranean Sea which allowed commerce and communication to
>flourish in Europe and beyond.
>
>Resources
>
> * Visit the Project MareNostrum site, demonstrating the value of Linux
>   clustering for science, for business, for life itself.
>
> * MareNostrum is now at home at the Barcelona Supercomputing Center (BSC)
>   on the Polytechnic University of Catalonia (UPC) campus in Barcelona, a
>   prestigious public institution focused on higher education, research,
>   and technology transfer.
>
> * The TOP500 Supercomputer Sites project was started in 1993 to provide a
>   reliable basis for tracking and detecting trends in high-performance
>   computing -- twice a year, the project releases a list of the 500 sites
>   operating the most powerful computer systems.
>
> * See this chart for the Linpack benchmark for MareNostrum and others.
>
> * This news article examines MareNostrum, IBM's top-ranked,
>   off-the-shelf, blade-based supercomputer.
>
> * Connecting two or more IBM eServer Cluster Servers can create a single,
>   unified computing resource that will dramatically improve availability,
>   flexibility, and adaptability for essential services.
>
> * The IBM BladeCenter JS20 is well-suited for commercial mainstream
>   applications and 64-bit high-performance computing (HPC) environments.
>
> * The IBM Redbook, The IBM eServer BladeCenter JS20, takes an in-depth
>   look at the two-way blade eServer for applications requiring 64-bit
>   computing.
>
> * The Linux on IBM eServer product line is Linux-enabled to deliver
>   maximum performance, reliability, manageability, and price/performance
>   benefits.
>
> * See this site for more on how IBM supercomputing solutions can help
>   remove the barriers to deployment of clustered server systems.
>
> * IBM TotalStorage DS400 series has been enhanced with the DS4000 Storage
>   Manager V9.10, enhanced remote mirror option, DS4100 option for larger
>   capacity configurations, and support for EXP100 serial ATA expansion
>   units.
>
> * Take a look at the Myrinet switches used in MareNostrum.
>
>About the author
>The developerWorks Power Architecture editors welcome your comments on this
>article. E-mail them at dwpower@us.ibm.com.
>
>--
>Eugen* Leitl leitl
>______________________________________________________________
>ICBM: 48.07078, 11.61144 http://www.leitl.org
>8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
>http://moleculardevices.org http://nanomachines.net
>
>_______________________________________________
>Beowulf mailing list, Beowulf@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From diep at xs4all.nl Wed Feb 16 05:12:23 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
Message-ID: <3.0.32.20050216141222.00924be0@pop.xs4all.nl>

At 06:28 16-2-2005 -0500, Patrick Geoffray wrote:
>Rossen,
>
>Rossen Dimitrov wrote:
>>
>>>
>>> So if you run an MPI application and it sucks, this is because the
>>> application is poorly written ?
>>
>>
>> Patrick, here the argument is about whether and how you "measure" the
>> "performance of MPI". I guess you may have missed some of the preceding
>> postings.
>
>No, I was pulling your leg :-) The bigger picture is that MPI has no
>performance in itself, it's a middleware. You can only measure the way
>an MPI implementation enable a specific application to perform. Only
>benchmarking of applications is meaningful, you can argue that
>everything else is futile and bogus.

A problem of MPI compared to DSM-type forms of parallelism has been described very well by Chrilly Donninger with respect to his chess program Hydra, which runs MPI on a few nodes. For every write:

    MPI_Isend(....)
    MPI_Test(&Reg,&flg,&Stat)
    while(!flg)
    {
      Hydra_MsgPending(); // Important, read in messages and process them while waiting on complete.
      // Otherwise the own Input-Buffer can overflow
      // and we get a deadlock.
      MPI_Test(&Reg,&flg,&Stat);
    }

The above is simply dead slow and delays the software. In a DSM model like Quadrics you don't have all these delays.

Can the Myri memory on the card (4MB, or 8MB in the $1500 version) be used to directly write to the RAM on a remote network card? If so, which library can I download for that for Myri cards?

Thanks in advance,
Vincent

>>> You don't want to benchmark an application to evaluate MPI, you want
>>> to benchmark an application to find the best set of resources to get
>>> the job done. If the code stinks, it's not an excuse. Good MPI
>>> implementations are good with poorly written applications, but still
>>> let smart people do smart things if they want.
>>
>>
>> This is exactly my point made in my previous posting - you cannot design
>> a system that is optimal in a single mode for all cases of its use when
>> there are multiple parameters defining the usage and performance
>
>I agree completely, being able to apply different assumptions for the
>whole code and see which one matches the application's behavior best is
>better than nothing. However, I believe that some tradeoffs are just too
>intrusive: you should not have to choose between low latency for small
>messages or progress by interrupt for large ones, especially when you
>can have both at the same time.
>
>> I think it is fairly easy to show that overlapping and polling (or any
>> kind of communication completion synchronization) are not orthogonal. If
>> this was the case, you would see codes that show perfect overlapping
>> running on any MPI implementation/network pair. I am sure there is
>> plenty of evidence this is not the case.
>
>I can show you codes where people sprinkled some MPI_Test()s in some
>loops. They don't poll to death, just a little from time to time to
>improve overlap by improving progression. They poll and they overlap.
>They could as well block and not overlap.
>polling/blocking and
>overlap/not are not linked. Interrupts are useful to get overlap without
>help from the application, but they are not required to overlap.
>
>> There is an important point here that needs to be clarified: when I say
>> "polling" library, I assume that this library does both: polling
>> completion synchronization and polling progress. There is not much room
>> here to define these, but I am sure MPI developers know what they are.
>
>I think this is where we don't understand each other. For me, polling
>means no interrupts. Whether you progress in the context of MPI calls
>or in the context of a progression thread, you pay for the same CPU
>cycles. If the application is providing CPU cycles to the MPI lib at the
>right time, you can overlap perfectly without wasting cycles.
>
>> Here is a third one. Writing your code for overlapping with non-blocking
>> MPI calls and segmentation/pipelining, testing the code, and not seeing
>> any benefit of it.
>
>Yes. This is very true. But if it's not worse than with blocking, they
>should stick with non-blocking, even if it's bigger and more confusing.
>
>> stage I with communication in stage I+1. Then, there is the question how
>> many segments you use to break up the message for maximum speedup. The
>> pipelining theory says the more you can get the better, when they are
>> with equal duration, there aren't inter-stage dependencies, and the
>> stage setup time is low in proportion to the stage execution time. Also,
>
>The more steps, the more overhead. Small pipeline stages decrease your
>startup overhead (when the second stage is empty) but increase the
>number of segments and the total cost of the pipeline. The best is to
>find a piece of computation long enough to hide the communication.
>Pipelining would be overkill in my opinion.
>
>> The metric I mentioned earlier "degree of overlapping" with some
>> additional analysis can help designers _predict_ whether the design is
>> good or not and whether it will work well or not on a particular system
>> of interest (including the MPI library).
>
>Temporal dependency between buffers and computation is the metric for
>overlapping. The longer you don't need a buffer, the better you can
>overlap a communication to/from it. Compilers could know that.
>
>> This is however too much detail for this forum though, as most of the
>> postings here discuss much more practical issues :)
>
>I am bored with cooling questions. However, it's quite time consuming to
>argue by email. I don't know how RGB can keep the distance :-)
>
>Patrick
>--
>
>Patrick Geoffray
>Myricom, Inc.
>http://www.myri.com
>_______________________________________________
>Beowulf mailing list, Beowulf@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From rossen at VerariSoft.Com Wed Feb 16 06:34:56 2005
From: rossen at VerariSoft.Com (Rossen Dimitrov)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <42132E40.1060001@myri.com>
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com> <4212182C.60607@verarisoft.com> <42132E40.1060001@myri.com>
Message-ID: <42135A10.3060804@verarisoft.com>

>>>
>>> So if you run an MPI application and it sucks, this is because the
>>> application is poorly written ?
>>
>> Patrick, here the argument is about whether and how you "measure" the
>> "performance of MPI". I guess you may have missed some of the
>> preceding postings.
>
> No, I was pulling your leg :-) The bigger picture is that MPI has no
> performance in itself, it's a middleware.
> You can only measure the way
> an MPI implementation enable a specific application to perform. Only
> benchmarking of applications is meaningful, you can argue that
> everything else is futile and bogus.

Actually, we have been making this exact argument for quite some time, which might sound odd as we are a commercial MPI vendor :) The whole idea is that focusing too much on microbenchmarks and then extending their results to characterize a whole parallel system does not seem to be the right thing to do (or at least not the only thing to do), but on the other hand I often see it being done.

From diep at xs4all.nl Wed Feb 16 08:44:55 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed Nov 25 01:03:48 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
Message-ID: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl>

At 07:17 16-2-2005 -0500, Robert G. Brown wrote:
>On Wed, 16 Feb 2005, Patrick Geoffray wrote:
>
>> > This is however too much detail for this forum though, as most of the
>> > postings here discuss much more practical issues :)
>>
>> I am bored with cooling questions. However, it's quite time consuming to
>> argue by email. I don't know how RGB can keep the distance :-)
>>
>> Patrick
>>
>
>I stuck a hairpin into an electrical socket at age 2 (an "enlightening"
>experience I must say) and had a large rock fall on my head from a
>height of almost a meter at age 8.
>
>Since then, I hardly ever get bored with cooling questions, because I
>cannot remember that they've been asked. What were we talking about,
>again?
>
>Oh yeah, MPI and all that.
>
>I've actually been enjoying reading the discussion and not
>participating, since I'm a PVM kinda guy. But SINCE my name was invoked
>in vain, I'll make a single comment on the code quality issue, which is
>that underlying the discussion of communication pattern, blocking vs
>non-blocking, and directives is the fundamental scaling properties of
>the code and algorithm itself.
>So on the issue of whether MPI sucks
>because the application sucks -- well, possibly, but it seems more
>likely that the application sucks because its parallel scaling
>properties (with the algorithm chosen) suck.

It is possible for algorithms to have sequential properties, which in short makes it hard to scale well. Game tree search happens to have a few such algorithms, one of which performs superbly thanks to a number of enhancements that share the property that a faster blocking 'get' latency speeds the search up exponentially. For the basic idea of why the speedup is exponential, see Knuth's analysis of the alpha-beta algorithm.

So the assumption that an algorithm sucks because it needs latency rather than bandwidth is like sticking a hairpin in an electrical socket. If users just needed a little bit of bandwidth, they could already get quite far with $40 cards. So optimizing MPI for low-latency small messages is, IMHO, very relevant.

We will get many hardware improvements in the coming years: dual core, Cell-style streaming processors, and, as more software becomes available to run in parallel, many will also try the jump to running on clusters. Obviously, anything that makes it easier from the programmer's viewpoint to parallelize their software -- like implementing short messages in a kind of single-system-image layer, or even certain algorithms -- makes sense to me. The step from shared-memory programming to MPI is currently a rather huge one.

Even if all you want is a byte which is sometimes on a remote machine and usually in your local cache, but you NEED that byte, just to know whether it's a 1 or a 0, then the last thing you want to be toying with is writing special communication code. You don't care how the result gets there, just as long as it gets there.

>As to how "intelligent" the back end library should be at choosing
>algorithm -- I would say the BASIC library should be atomic, elementary,
>NOT algorithm level stuff.
A thin skin on top of raw networking calls >that provides the various things one always has to do oneself but not >much more. Where one gets into trouble is where one uses a command that >has a complex structure that doesn't fit your code without realizing it, >and the reason you don't realize it is because all that detail is >hidden, and isn't even uniform in RELATIVE performance across varying >network hardware. > >In other words, to make MPI do more, either make it do less (in the form >of commands that can be used to build "more" in a manner that is tuned >to application and hardware) or be prepared to REALLY make it SMART >behind the scenes. > >This isn't just MPI, BTW. PVM suffers from the same thing. I honestly >think that both are limited tools in part BECAUSE they put too thick a >skin between the programmer and the network. If you want real >performance and complete control over communication algorithm, you >probably have to use raw/low level networking commands, and write the >appropriate "collective" operations for your particular application and >hardware. > >Of course nobody does this -- not portable and a PITA to >design/write/maintain. Or perhaps a few people DO do this, but they're >programming gods. And this isn't crazy, really. > > rgb > >-- >Robert G. Brown http://www.phy.duke.edu/~rgb/ >Duke University Dept. of Physics, Box 90305 >Durham, N.C. 27708-0305 >Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From bclem at rice.edu Wed Feb 16 08:48:34 2005 From: bclem at rice.edu (Brent M. Clements) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? In-Reply-To: References: Message-ID: I would suggest asking this on the Educause CIO mailing list. 
Anyone can join as long as you're from an educational entity. -Brent > A/C are paid for by the school. To do so they take "overhead" > out of every grant. Partially as a consequence of this they > typically have a very poor ability to meter usage on a room > by room basis. > > Now somewhere between the 10 node Pentium II beowulf sitting on > a lab bench and the 1000 node dual P4 Xeon beowulf in a machine > room that takes up half the basement the cost of the electricity > (both for power and A/C) goes from a minor expense to a major > one. Really major. For instance, in that hypothetical large machine, > at 10 cents per kilowatt hour (a round number), assuming 100 watts > per CPU (another round number) that's: > > 1000 (nodes) * > 2 (cpus/node) * > .1 (kilowatts/cpu) * > .1 (dollars/kilowatt-hour) * > 365 (days /year) * > 24 (hours/day) = > ----------------------- > 175200 dollars/year > > The A/C expense is going to vary tremendously depending upon > the outside temperature. It's going to be much higher for us > in Southern California than for a site in Anchorage. > > "Typical" lab usage is widely variable but I'd be amazed > if most biology or chemistry labs burn through even 1/10th this > much for the equivalent lab area. Some physics lab running > a tokamak might come close. > > > Anyway, the question is, have any of the universities said "enough > is enough" and started charging these electricity costs directly? > If so, what did they use for a cutover level, where usage was > "above and beyond" overhead? > >> From an economic perspective having electricity and A/C come out > of overhead (without limit) grossly distorts the true cost > of the project over time and can lead to choices which increase > the total overall cost. For instance, the use of Xeons instead of > Opterons has little effect on TCO if somebody else is picking > up the electricity tab, but could change the power consumption > significantly on a large project.
> > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From diep at xs4all.nl Wed Feb 16 10:08:05 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? Message-ID: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl> At 08:16 16-2-2005 -0800, David Mathog wrote: >In most universities services like electricity, water, and >A/C are paid for by the school. To do so they take "overhead" >out of every grant. Partially as a consequence of this they >typically have a very poor ability to meter usage on a room >by room basis. > >Now somewhere between the 10 node Pentium II beowulf sitting on >a lab bench and the 1000 node dual P4 Xeon beowulf in a machine >room that takes up half the basement the cost of the electricity >(both for power and A/C) goes from a minor expense to a major >one. Really major. For instance, in that hypothetical large machine, >at 10 cents per kilowatt hour (a round number), assuming 100 watts >per CPU (another round number) that's: > > 1000 (nodes) * > 2 (cpus/node) * > .1 (kilowatts/cpu) * > .1 (dollars/kilowatt-hour) * > 365 (days /year) * > 24 (hours/day) = >----------------------- > 175200 dollars/year Complete academic nonsense calculation. If you use quite some electricity the electricity gets up to factor 20-40 cheaper. Getting a factor 10 reduction in usage bill is pretty easy if you negotiate properly. However you must avoid starting machines at peaktimes. Big fines get given for that. So it's cheaper to let them run 24 hours a day than to start them in the morning after say 7 AM (depending upon local habits). 
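For what it's worth, the quoted figure is trivial to reproduce; a minimal sketch in Python, using only the round numbers from David's original post (the $0.10/kWh rate is his assumption, and negotiated bulk rates may of course differ):

```python
# Reproduce the back-of-envelope annual electricity cost quoted above.
# All inputs are the round numbers from the original post, not measured values.
nodes = 1000
cpus_per_node = 2
kw_per_cpu = 0.1           # 100 W per CPU
dollars_per_kwh = 0.10     # assumed utility rate; bulk rates may be lower
hours_per_year = 365 * 24  # 8760

annual_cost = nodes * cpus_per_node * kw_per_cpu * dollars_per_kwh * hours_per_year
print(f"${annual_cost:,.0f}/year")  # -> $175,200/year
```

Halving the rate or the per-CPU wattage scales the result linearly, which is why the only genuinely contested term in this thread is the price per kWh.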
Please note that nothing beats the price of nuclear power (as a member of the high voltage power forum i do not have an opinion on that). Electricity production costs of nuclear power are hundreds of times cheaper than producing it with oil, oil produces it roughly for 5 dollar cent a kilowatt (if memory serves me well). Coals have a CO2 problem for nations which are in Kyoto agreement (USA isn't), but also is nearly as cheap as nuclear power. So the actual price they deliver huge power for to big institutes is a very easy negotiation to get it factors down. Vincent Diepeveen ex-member of high voltage powerline forum. >The A/C expense is going to vary tremendously depending upon >the outside temperature. It's going to be much higher for us >in Southern California than for a site in Anchorage. > >"Typical" lab usage is widely variable but I'd be amazed >if most biology or chemistry labs burn through even 1/10th this >much for the equivalent lab area. Some physics lab running >a tokamak might come close. > > >Anyway, the question is, have any of the universities said "enough >is enough" and started charging these electricity costs directly? >If so, what did they use for a cutover level, where usage was >"above and beyond" overhead? > >>From an economic perspective having electricity and A/C come out >of overhead (without limit) grossly distorts the true cost >of the project over time and can lead to choices which increase >the total overall cost. For instance, the use of Xeons instead of >Opterons has little effect on TCO if somebody else is picking >up the electricity tab, but could change the power consumption >significantly on a large project. 
> >Regards, > >David Mathog >mathog@caltech.edu >Manager, Sequence Analysis Facility, Biology Division, Caltech >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From eno at dorsai.org Wed Feb 16 17:10:10 2005 From: eno at dorsai.org (Alpay Kasal) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] powering up 18 motherboards Message-ID: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net> Hello all. I have a question about powering on motherboards simultaneously. I have 18 identical mobo's right now with identical ram, cpu, and hard disk. I hooked one up to a kill-a-watt and found that it draws 140-150 watts when powering on, and stays level at about 90-100 watts afterwards. The problem is that I am setting this up at home, where I only have 10 amp circuits (and only a couple of them can be freed up). Correct me if I am wrong here please. 1 mobo = 100 watts / 115 volts = 0.87 amps each mobo while steady on 1 mobo = 150 watts / 115 volts = 1.3 amps each mobo while turning on I won't include the rest of the math, but needless to say, it'd be a pain in the arse to turn the room on piecemeal without tripping a circuit breaker. My question is: Will a heavy duty UPS aid in getting me through powering up the room? I don't mind splitting up the 18 machines with 6 outlet surge strips. Any advice? Thanks. Alpay Kasal From billk01 at metrumrg.com Wed Feb 16 04:45:48 2005 From: billk01 at metrumrg.com (billk01) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] running non-mpi programs on beowulf cluster Message-ID: <4213407C.7000605@metrumrg.com> I have a program that is called from a perl batch script.
The program is non-MPI aware so I have been using mpprun to execute the perl program. The perl program can start from 1 - x processes depending upon the arguments to the batch file. I currently call the batch file as: mpprun -no-local perl batch.p 1 2 3 & 1 2 3 cause the perl program to start processes 1, 2 and 3 in three different directories. (The different directories are necessary because of the nature of the program being run.) The result is three processes all running on one node. (Each node has two processors and there are 3 nodes for now for a total of 6 processors.) I have tried supplying the -np x option but this simply starts the same three processes on another node once the initial three processes are complete. The same thing occurs if I use the -map x:x:x option. I have also tried batching the command via the "batch now" interactive command line interface and the result is the same. Is there any way to indicate to the cluster to load balance these processes across the nodes? Or do I need to start each process with a separate mpprun command? Also, it appears that the NO_LOCAL=1 option does not work with the "Batch" command. Does that seem correct? The cluster consists of a dual processor (2 Xeons) master node with three compute nodes, each with 2 Xeon processors. Eventually we will have a number of additional nodes up but I am testing for now. Any help would be greatly appreciated. Regards, Bill From billk at metrumrg.com Thu Feb 17 05:45:11 2005 From: billk at metrumrg.com (Bill Knebel) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] sun grid engine on Scyld beowulf cluster Message-ID: <42149FE7.5060000@metrumrg.com> I am in the process of installing SGE on a Scyld beowulf cluster. As most people are aware, the Scyld cluster runs a complete OS (linux) only on the master node and the compute nodes are simply for executing. During the SGE install, it requires adding the compute nodes as execute hosts.
I do not understand how to do this given the current setup of a Scyld cluster since you can't "login" to the nodes to execute the install script. The script does exist on an NFS shared directory (cluster wide). Has anybody else run into this problem? Regards, Bill -- Bill Knebel, PharmD, Ph.D.
Principal Scientist Metrum Research Group 15 Ensign Drive Avon, CT 06001 email: billk@metrumrg.com From dag at sonsorol.org Thu Feb 17 14:12:07 2005 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] sun grid engine on Scyld beowulf cluster In-Reply-To: <4214A0E5.3010804@metrumrg.com> References: <4214A0E5.3010804@metrumrg.com> Message-ID: <421516B7.3060906@sonsorol.org> I know Grid Engine well but not Scyld, so forgive my ignorance if I say something stupid; given the level of expertise on this list I'm quite certain I'm about to make a fool of myself :) If Scyld is presenting you with a single system image (i.e. a single Linux server that can farm out tasks to all those nodes) then you would install SGE in the same way that you would install it on a big SMP box: 1. Install the SGE qmaster and scheduler on the master node 2. Install the execution host on the master node as well You will only have 1 execd per queue but each queue can be configured with N number of "job slots" which actually control how many jobs can run at the same time on the same machine. Try setting your # of job slots within your single SGE queue to the number of nodes in your cluster. This is similar to what you would do on a big SMP machine -- small number of queues each supporting a decent jobslot count. Then submit a bunch of jobs and see if SGE causes the master node to fall over under load. If not then Scyld is doing its thing behind the scenes to migrate stuff around to the other nodes. -Chris billk01 wrote: > I am in the process of installing SGE on a Scyld beowulf cluster. As > most people are aware, the Scyld cluster runs a complete OS (linux) only > on the master node and the compute nodes are simply for executing. > During the SGE install, it requires adding the compute nodes as execute > hosts. I do not understand how to do this given the current setup of a > scyld cluster since you can't "login" to the nodes to execute the > install script.
The script does exist on an NFS shared directory > (cluster wide). Has anybody else ran into this problem? > From dtj at uberh4x0r.org Thu Feb 17 14:18:40 2005 From: dtj at uberh4x0r.org (Dean Johnson) Date: Wed Nov 25 01:03:48 2009 Subject: [Beowulf] Re: Academic sites: who pays for the electricity? In-Reply-To: <20050217011015.HITQ24369.swebmail02.mail.ozemail.net@localhost> References: <20050217011015.HITQ24369.swebmail02.mail.ozemail.net@localhost> Message-ID: <1108678721.3683.16.camel@terra> I am going out on a limb here, but I suspect that a majority of the people on this list lean far more toward the casual administrator types and not professional cluster admins. Often, I suspect, the issues of power and cooling only arise in times of scarcity and not out of some anal-retentive need to quantify resources. They are very likely people doing science (or other worthy endeavour) and simply use clusters as tools. Should one quantify the ROI on each pencil and pen in ones office? I realize that the first pen/pencil in your office is an important writing device, but why do you have a second and third, given that the useful life of a pen/pencil, under normal load, is quite long? ;-) I'll be right back as soon as I justify the dozen or so pens I have under the seat of my car. Hold it! I don't use the company's crappy pens and provide my own. I guess beans don't interest me. -Dean On Thu, 2005-02-17 at 12:10 +1100, steve_heaton@ozemail.com.au wrote: > G'day all > > Speaking as someone from "industry", and a Project/Programme Manager at that, I'd just like to add that I'm shocked and dismayed at the apparent lack of accountability that seems rampant in academic circles! If it was down to me I'd sack the lot of ya!! ;) > > I'd strongly recommend that all good cluster folk have a good idea about operation expenditure (opex). If you get a visit from the Meanie Beanies (auditors / cost accountants etc etc) then it'll help cover your A. 
Not having a firm understanding of your $'s in and out is a great way to have your gig cancelled. Happens all the time in Industry, and in my job it's a sackable offence. No joke. Do some homework and you won't need to be afraid (OK, *as* afraid of the Purple Pen People). > > Some things to know - ideally you should be able to quote these with as little as an hour's warning (shows you're on top of things): > > -) The amount of floor space you consume (sq ft or m) - don't worry about the cost of this one, those asking will know ;) Becomes a hot topic if you're paying rent in some form. > -) Find out how much electricity you use per hour - chances are you're on one or more dedicated circuit(s) and probably separate metering - look at the bills. Don't worry about general lighting etc. It's often rolled into the floor space calcs. > -) Ditto aircon (include your maintenance) > -) Cluster hardware maintenance (out of warranty stuff, cost of spares) - quoting your amazing uptime can help explain this figure > -) Service contracts (you've got a Service Level Agreement right? Uptime % etc helps explain) > -) Staff / admin costs > -) The good ol' "anything else you can think of" > > Now the fun part. Who used your cluster and for how long? Look at your job scheduling etc. Your department? Another department (do you cross charge somehow)? Which projects? What's their contribution to cluster opex? > > If you answer reasonably accurately then the Beanies will treat you with some respect :) >>Someone, somewhere is paying your bills already.<< Know where that money is going!
> > Don't say I didn't warn you ;) > > Cheers > Stevo > > This message was sent through MyMail http://www.mymail.com.au > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From dtj at uberh4x0r.org Thu Feb 17 14:22:12 2005 From: dtj at uberh4x0r.org (Dean Johnson) Date: Wed Nov 25 01:03:48 2009 Subject: [SPAM] [Beowulf] powering up 18 motherboards In-Reply-To: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net> References: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net> Message-ID: <1108678933.3683.19.camel@terra> Howdy, Correct me if I am wrong. Presumably your CPUs have fan kits, and you may have case fans (or something). Do the motors on fans not draw more initial amperage at startup? Thus you would have an initial spike. -Dean On Wed, 2005-02-16 at 20:10 -0500, Alpay Kasal wrote: > Hello all. I have a question about powering on motherboards > simultaneously. > > > > I have 18 identical mobo's right now with identical ram, cpu, and hard > disk. I hooked one up to a kill-a-watt and found that it draws 140-150 > watts when powering on, and stays level at about 90-100 watts > afterwards. The problem is that I am setting this up at home, where I > only have 10 amp circuits (and only a couple of them can be freed up). > Correct me if I am wrong here please. > > > > 1 mobo = 100 watts / 115 volts = 0.87 amps each mobo while steady on > > 1 mobo = 150 watts / 115 volts = 1.3 amps each mobo while turning on > > > > I won't include the rest of the math, but needless to say, it'd be a > pain in the arse to turn the room on piecemeal without tripping a > circuit breaker. My question is: > > > > Will a heavy duty UPS aid in getting me through powering up the room? > I don't mind splitting up the 18 machines with 6 outlet surge strips. > Any advice? > > > > Thanks.
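Alpay's per-board arithmetic, plus a rough circuit-sizing step, can be sketched as follows. The 80% continuous-load derating is my assumption (a common breaker rule of thumb), not something from the thread; the wattages are his Kill-A-Watt readings:

```python
import math

# Circuit loading for the 18-board setup described above (115 V nominal).
volts = 115
steady_w, startup_w = 100, 150   # per board: steady-state vs power-on draw
boards = 18
circuit_amps = 10                # the poster's 10 A circuits

steady_a = steady_w / volts      # ~0.87 A per board while running
startup_a = startup_w / volts    # ~1.30 A per board while powering on
total_steady_a = boards * steady_a   # ~15.7 A: at least two circuits regardless

# Assumed 80% derating of the breaker rating for continuous load:
per_circuit_stagger = math.floor(0.8 * circuit_amps / steady_a)    # sized for steady draw
per_circuit_together = math.floor(0.8 * circuit_amps / startup_a)  # sized for startup draw
print(total_steady_a, per_circuit_stagger, per_circuit_together)
```

In other words, sequencing the power-on (as suggested elsewhere in the thread with time-delay relays) only changes which draw you size for, 1.3 A versus 0.87 A per board; either way, 18 boards will not fit on one 10 A circuit.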
> > Alpay Kasal > > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at mendel.bio.caltech.edu Thu Feb 17 14:54:15 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? Message-ID: At Wed, 16 Feb 2005 19:08:05 +0100 Vincent Diepeveen wrote: > Date: Wed, 16 Feb 2005 19:08:05 +0100 > From: Vincent Diepeveen > Subject: Re: [Beowulf] Academic sites: who pays for the electricity? > To: "David Mathog" , > beowulf@beowulf.org > Message-ID: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl> > Content-Type: text/plain; charset="us-ascii" > > At 08:16 16-2-2005 -0800, David Mathog wrote: > >In most universities services like electricity, water, and > >A/C are paid for by the school. To do so they take "overhead" > >out of every grant. Partially as a consequence of this they > >typically have a very poor ability to meter usage on a room > >by room basis. > > > >Now somewhere between the 10 node Pentium II beowulf sitting on > >a lab bench and the 1000 node dual P4 Xeon beowulf in a machine > >room that takes up half the basement the cost of the electricity > >(both for power and A/C) goes from a minor expense to a major > >one. Really major. For instance, in that hypothetical large machine, > >at 10 cents per kilowatt hour (a round number), assuming 100 watts > >per CPU (another round number) that's: > > > > 1000 (nodes) * > > 2 (cpus/node) * > > .1 (kilowatts/cpu) * > > .1 (dollars/kilowatt-hour) * > > 365 (days /year) * > > 24 (hours/day) = > >----------------------- > > 175200 dollars/year > > Complete academic nonsense calculation. If you use quite some electricity > the electricity gets up to factor 20-40 cheaper. Getting a factor 10 > reduction in usage bill is pretty easy if you negotiate properly. 
Well, it isn't complete nonsense, unless you care to dispute the number of days in a year, hours in a day, or cpus in a dual node computer! The only term you're complaining about is the price of electricity. I'm not privy to the electrical rates that our school pays; they may well be an order of magnitude lower. My home rates certainly aren't, but then, I don't buy as much power as the campus. It's also not at all clear that the campus would sell power to the end users at the same rate at which it pays the utility. I don't really understand your point about keeping the units running versus restarting them. Sure, it would be really bad to try to boot all 1000 nodes simultaneously; in all likelihood it wouldn't work. That's why they are typically started at N second intervals, where N depends on your hardware. Surely there is some N large enough that the peak current draw during the restart never exceeds the random fluctuations observed when all units are running normally. Or is your point that the electricity company doesn't want the facility to draw _less_ current than it uses normally at steady state? On a somewhat related note, it would be nice if rack nodes had some graceful way to conserve electricity. For instance, something along the lines of: if the CPU utilization stays below 5% for 10 seconds, ratchet the clock down by a factor of 10; when CPU usage goes above 90% for 2 seconds, move it back up again. Notebooks can do this sort of thing, but it seems not to be a "feature" of most full size motherboards. This should also lower the average temperature in the case, at the expense of increased thermal cycling. Hard to say off hand if that's a plus or a minus as far as hardware longevity goes. Certainly it would be a plus in terms of energy conservation.
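The ratcheting rule sketched in that last paragraph is just a hysteresis controller. A toy simulation (the 5%/90% thresholds and the 10 s/2 s dwell times are from the post; the one-second sampling loop and the clock values are made up for illustration):

```python
# Sketch of the clock-throttling rule described above: drop the clock by 10x
# after utilization stays below 5% for 10 s, restore it after utilization
# stays above 90% for 2 s. Clock values are illustrative, not real hardware.
FULL_CLOCK, SLOW_CLOCK = 3000, 300   # MHz

def throttle(samples, dt=1.0):
    """samples: per-interval CPU utilization in [0,1]; returns clock per sample."""
    clock = FULL_CLOCK
    idle_t = busy_t = 0.0
    out = []
    for u in samples:
        if u < 0.05:
            idle_t += dt; busy_t = 0.0
        elif u > 0.90:
            busy_t += dt; idle_t = 0.0
        else:
            idle_t = busy_t = 0.0   # middling load: reset both timers
        if clock == FULL_CLOCK and idle_t >= 10.0:
            clock = SLOW_CLOCK
        elif clock == SLOW_CLOCK and busy_t >= 2.0:
            clock = FULL_CLOCK
        out.append(clock)
    return out

trace = throttle([0.0] * 12 + [1.0] * 4)
# drops to 300 MHz at the 10th idle second, back to 3000 after 2 busy seconds
```

The two different dwell times are what keep the clock from bouncing on bursty loads: it takes sustained idleness to slow down, but only a brief burst of work to speed back up.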
Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From James.P.Lux at jpl.nasa.gov Thu Feb 17 16:17:57 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:49 2009 Subject: [SPAM] [Beowulf] powering up 18 motherboards In-Reply-To: <1108678933.3683.19.camel@terra> References: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net> <1108678933.3683.19.camel@terra> Message-ID: <6.1.1.1.2.20050217160903.07880988@mail.jpl.nasa.gov> At 02:22 PM 2/17/2005, Dean Johnson wrote: >Howdy, >Correct me if I am wrong. Presumably your cpu's have fan kits and if you >may have case fans (or something). Do the motors on fan's not draw more >initial amperage at startup? Thus you would have an initial spike. The fans probably don't draw a significant amount of startup current over their running current. Disk drives, on the other hand, do draw a lot more current when spinning up. > -Dean > >On Wed, 2005-02-16 at 20:10 -0500, Alpay Kasal wrote: > > Hello all I have a question about powering on motherboards > > simulataneously > > > > > > > > I have 18 identical mobo?s right now with identical ram, cpu, and hard > > disk. I hooked one up to a kill-a-watt and found that it draws 140-150 > > watts when powering on, and stays level at about 90-100 watts > > afterwards. The problem is that I am setting this up at home, where I > > only have 10 amp circuits (and only a couple of them can be freed up). > > Correct me if I am wrong here please > > > > > > > > 1 mobo = 100 watts / 115 volts = .87amps each mobo while steady on > > > > 1 mobo = 150 watts / 115 volts = 1.3amps each mobo while turning on You're basically right. There's also a very high current spike that your Kill-A-Watt won't see when you first turn on the supply, as the capacitors in the input section charge up. This might result in a nuisance trip of your breakers. 
This current spike will be somewhat dependent on the phase of the AC line voltage when you first close the switch. Some power supplies have inrush current limiting, others don't. A 10 amp circuit would be highly unusual in the U.S., but might be common practice elsewhere. In the U.S., a 15 amp circuit is standard. > > > > > > > > I won?t include the rest of the math, but needless to say, it?d be a > > pain in arse to turn on the room in piecemeal without tripping a > > circuit breaker. My questions is : > > > > > > > > Will a heavy duty UPS aid in getting me through powering up the room? > > I don?t mind splitting up the 18 machines with 6 outlet surge strips. > > Any advice? No, the UPS won't help. It might make things worse, because as you flip on all that load, the voltage will sag, causing the UPS to turn on, which then might trip from the overcurrent (assuming you're not out buying a 2kW UPS). What you want is some way to sequence the power, conveniently. The answer is to use some relays. You could spend some tens of dollars (brand new, much less surplus) and get some time delay relays. You could use the DC power out of the first power supply to turn on to charge a capacitor through a resistor that's hooked to a relay (12V coil, 110V contacts).. Most DC relays have a much higher pull-in than hold current/voltage. You could use the X-10 type (aka Plug n Power) remote controlled relays (don't use Lamp modules.. you need Appliance modules, which are relays inside). You could build a little power sequencing box that sends the appropriate signals to the power supplies to turn them on, one by one... I think you might be able to do this with the parallel printer port on the first mobo to fire up. I haven't looked at the control interface on an ATX power supply recently. > > > > > > > > Thanks. > > > > Alpay Kasal > > > > > > James Lux, P.E. 
Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From James.P.Lux at jpl.nasa.gov Thu Feb 17 17:01:53 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? Message-ID: <6.1.1.1.2.20050217162000.07774090@mail.jpl.nasa.gov> At 04:19 PM 2/17/2005, Jim Lux wrote: >At 02:54 PM 2/17/2005, David Mathog wrote: >>At Wed, 16 Feb 2005 19:08:05 +0100 Vincent Diepeveen wrote: >> >> > Date: Wed, 16 Feb 2005 19:08:05 +0100 >> > From: Vincent Diepeveen >> > Subject: Re: [Beowulf] Academic sites: who pays for the electricity? >> > To: "David Mathog" , >> > beowulf@beowulf.org >> > Message-ID: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl> >> > Content-Type: text/plain; charset="us-ascii" >> > >> > At 08:16 16-2-2005 -0800, David Mathog wrote: >> > >In most universities services like electricity, water, and >> > >A/C are paid for by the school. To do so they take "overhead" >> > >out of every grant. Partially as a consequence of this they >> > >typically have a very poor ability to meter usage on a room >> > >by room basis. >> > > >> > >Now somewhere between the 10 node Pentium II beowulf sitting on >> > >a lab bench and the 1000 node dual P4 Xeon beowulf in a machine >> > >room that takes up half the basement the cost of the electricity >> > >(both for power and A/C) goes from a minor expense to a major >> > >one. Really major. 
For instance, in that hypothetical large machine, >> > >at 10 cents per kilowatt hour (a round number), assuming 100 watts >> > >per CPU (another round number) that's: >> > > >> > > 1000 (nodes) * >> > > 2 (cpus/node) * >> > > .1 (kilowatts/cpu) * >> > > .1 (dollars/kilowatt-hour) * >> > > 365 (days /year) * >> > > 24 (hours/day) = >> > >----------------------- >> > > 175200 dollars/year >> > >> > Complete academic nonsense calculation. If you use quite some electricity >> > the electricity gets up to factor 20-40 cheaper. Getting a factor 10 >> > reduction in usage bill is pretty easy if you negotiate properly. Just where do you live that such negotiations are possible. Here's some real numbers from Southern California Edison. http://www.sce.com/CustomerService/RateInformation/BusinessRates/LargeBusiness/ First off, you're looking at a 200kW load for 1000 nodes, which is a hefty load, just for the computers (not counting lights, HVAC, etc.) But, no matter, we'll assume your facility is sucking at least 500kW some of the time, so that would put you in the large business TOU-8 tariff. http://www.sce.com/NR/sc3/tm2/pdf/ce54-12.pdf All the large consumer tariffs are time-of-use sensitive. I assume you wouldn't want some sort of "Critical Peak Pricing Options" or "Demand Bidding Programs" Let's assume you're being served at 240V (as opposed to having your own distribution transformers, etc., although as a 200kW consumer, that's something you should consider). Looks like the rates break down as about 0.016/kWh for the delivery, and the actual power (generation) runs somewhere between 0.04/kWh (off peak summer) to 0.12/kWh on peak summer. There's also a raft of other charges (metering, demand (runs about $10/kW of instantaneous demand), power factor, etc.) Compare this to Domestic service.. where the rates run from 0.11 to 0.18/kWh, depending on where you sit relative to baseline, season, etc. 
(I'll also point out that I was paying SCE $0.26/kWh at home in the summer of 2001, but rates are lower now.) Now, you might consider Residential to be artificially constrained for political reasons, so we take a look at GS-1 (general service..) Here, we have 0.07 for delivery and 0.085 (totalling 0.155/kWh) during the summer. The point is, there isn't a 10:1 ratio... not even close. And, rgb's ballpark of $0.10/kWh is a perfectly reasonable estimate, if a bit low for Southern California. Even as far back as 1990,on peak, large customers were paying on the order of $0.11/kWh. It is possible that if you were buying power directly (which is possible, as a large end user), you can pay "market rate" for each kWh you consume. The price is quite volatile, though... At the peak of the market failure a few years back, a kWh on the open market was something like $20 during peak times. I doubt you want to schedule your cluster ops to take advantage of electricity rate fluctuations. >>Well, it isn't complete nonsense, unless you care to dispute the >>number of days in a year, hours in a day, or cpus in a dual node >>computer! >> >>The only term you're complaining about is the price of >>electricity. I'm not privy to the electrical rates that our >>school pays, they may well be an order of magnitude lower. My >>home rates certainly aren't, but then, I don't buy as much >>power as the campus. It's also not at all clear that the >>campus would sell power to the end users at the same rate >>which it pays the utility. CalTech probably buys their power from City of Pasadena, but it's probably similar in rate structure to SCE's TOU-8. Home rates are somewhat artificially low for political reasons. The folks really getting the short end of the stick are small businesses who don't have the negotiating power that a large business does, nor the political clout of elderly pensioners dying from heat. 
I'll also note that if you start paying for kWh, you're going to want to give serious consideration to buying more nodes than you strictly need, and shutting down the cluster during on-peak times. A typical pricing strategy might be 0.15/0.07/0.05 (on/mid/off peak): Peak lasts 6 hrs (1200-1800), mid is 0800-1200 and 1800-2300, and offpeak is the rest. There are 93 hours of offpeak time out of 168 total in the week. For the pricing and schedules above, it turns out that the optimum is to shut down only during peak, but run during midpeak, for an average energy cost of 0.056/kWh, but using 22% more computers. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From James.P.Lux at jpl.nasa.gov Thu Feb 17 17:21:05 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? In-Reply-To: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl> References: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl> Message-ID: <6.1.1.1.2.20050217170816.077e17a8@mail.jpl.nasa.gov> >Please note that nothing beats the price of nuclear power Nuclear power, once all the incidental costs (often absorbed into government budgets) for things like liability cover, waste disposal, etc. are counted, is not overwhelmingly competitive with other sources. One must also consider the capital cost of the production equipment and its retirement (which is quite a bit higher than for, say, coal fired, gas fired, etc.) >Electricity production costs of nuclear power are hundreds of times cheaper >than producing it with oil, oil produces it roughly for 5 dollar cent a >kilowatt (if memory serves me well). Implying that nuclear energy generation costs are <0.0005/kWh? I find this quite hard to believe. Can you cite a reasonable source for the data?
Just the capital cost of the generating plant is more than that. (2 GW plant, 20 yr life, 3.5E11 kWh. If the plant costs $1B, you're at about $0.003/kWh) I think that nuclear power, by the time you figure in all the stuff you need to, is a bit cheaper than fossil fuel, but not hugely cheaper, and certainly not an order of magnitude. In fact, the chart at the end of http://www.uic.com.au/nip08.htm shows that they're all within a factor of 2:1, except for high cost oil fired. (this chart has cost for OECD 1990) >Coals have a CO2 problem for nations >which are in Kyoto agreement (USA isn't), but also is nearly as cheap as >nuclear power. > >So the actual price they deliver huge power for to big institutes is a very >easy negotiation to get it factors down. Now there's an interesting prospect... buy your electricity on the open traded market, and schedule your cluster computation as the price goes up and down. Don't laugh.. it's been done in other industries. >Vincent Diepeveen >ex-member of high voltage powerline forum. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From rgb at phy.duke.edu Thu Feb 17 20:31:33 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? In-Reply-To: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl> References: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl> Message-ID: On Wed, 16 Feb 2005, Vincent Diepeveen wrote: > > 1000 (nodes) * > > 2 (cpus/node) * > > .1 (kilowatts/cpu) * > > .1 (dollars/kilowatt-hour) * > > 365 (days /year) * > > 24 (hours/day) = > >----------------------- > > 175200 dollars/year > > Complete academic nonsense calculation. If you use quite some electricity > the electricity gets up to factor 20-40 cheaper. 
Getting a factor 10 > reduction in usage bill is pretty easy if you negotiate properly. > > However you must avoid starting machines at peaktimes. Big fines get given > for that. So it's cheaper to let them run 24 hours a day than to start them > in the morning after say 7 AM (depending upon local habits). > > Please note that nothing beats the price of nuclear power > > (as a member of the high voltage power forum i do not have an opinion on > that). > > Electricity production costs of nuclear power are hundreds of times cheaper > than producing it with oil, oil produces it roughly for 5 dollar cent a > kilowatt (if memory serves me well). Coals have a CO2 problem for nations > which are in Kyoto agreement (USA isn't), but also is nearly as cheap as > nuclear power. > > So the actual price they deliver huge power for to big institutes is a very > easy negotiation to get it factors down. Actually, in spite of the fact that Duke (partly) owns its own power company, I don't think that they get all that much of a discount. I'm also quite certain that the nuclear plant about 20 miles from here doesn't sell its electricity to customers across the county line (it's not a Duke Power plant but rather CP&L IIRC) for any less than Duke and Durham get it. There is a nifty map here: http://www.coaleducation.org/Ky_Coal_Facts/electricity/average_cost.htm that shows the average electricity costs throughout the USA as of 2001 -- they are almost certainly a half-cent/KW-hr higher across the board if not a whole cent higher due to the war and oil price boosts. Note that they range from $0.04 and change in major coal states to $0.11 and change per KW-hr (where I'm sure a chunk of the difference is taxes in different states). My recollection from discussions at Duke is that Duke pays around $0.06/KW-hr, not a huge discount over what I pay at home.
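Jim's peak-shedding numbers from earlier in the thread (run through mid- and off-peak only, buy ~22% more nodes, average ~$0.056/kWh) can be reproduced with a small sketch, assuming his 0.15/0.07/0.05 rates apply on weekdays only (which is what makes 93 of the week's 168 hours off-peak):

```python
# Weekly tariff from Jim's example: weekdays have 6 h on-peak and
# 9 h mid-peak; the remaining 93 h/week are off-peak.
PEAK_H, MID_H = 5 * 6, 5 * 9
OFF_H = 168 - PEAK_H - MID_H                     # 93 hours
RATE = {"peak": 0.15, "mid": 0.07, "off": 0.05}  # $/kWh

def avg_cost(hours):
    """Average $/kWh and total weekly hours for a run schedule."""
    total = sum(hours.values())
    return sum(RATE[p] * h for p, h in hours.items()) / total, total

always_on, h_all = avg_cost({"peak": PEAK_H, "mid": MID_H, "off": OFF_H})
shed_peak, h_shed = avg_cost({"mid": MID_H, "off": OFF_H})

print(round(always_on, 4))           # 0.0732 $/kWh running 24x7
print(round(shed_peak, 4))           # 0.0565 $/kWh shutting down on-peak
print(round(h_all / h_shed - 1, 2))  # 0.22 -> ~22% more nodes needed
```

Whether the ~23% energy-rate saving beats the cost of the extra hardware is exactly the trade-off Jim describes.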
Maybe major industrial consumers of electricity do better in states like Michigan, but I don't think that there is enough margin even in the coal states to drop prices to $0.01/KW-hr after paying for the fuel itself for anything but nuclear. Where David lives, in CA, electricity is about as expensive as it is anywhere. At a guess currently over ten cents/KW-hr, probably even for Universities. David? rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From jimlux at earthlink.net Thu Feb 17 21:41:39 2005 From: jimlux at earthlink.net (Jim Lux) Date: Wed Nov 25 01:03:49 2009 Subject: [SPAM] [Beowulf] powering up 18 motherboards References: <0IC300G56DM77U@mta8.srv.hcvlny.cv.net> Message-ID: <000601c5157c$855b5350$32a8a8c0@LAPTOP152422> ----- Original Message ----- From: "Alpay Kasal" To: "'Bari Ari'" ; "'Jim Lux'" Cc: "'Dean Johnson'" ; Sent: Thursday, February 17, 2005 9:26 PM Subject: RE: [SPAM] [Beowulf] powering up 18 motherboards > I think you hit the nail on the head Bari, I'm in Brooklyn, New York. So I > suppose it should be 15amp circuits but every circuit breaker in the box is > clearly a 10. This is an old house, seems like any renovations over the > years have been only for aesthetics. The wiring in the walls is probably > disintegrating - that would explain why the new looking circuit breakers are > rated for 10 amps. Ahhh yes.. New York, where some of the (mostly undocumented) distribution wiring dates from Edison himself, and dogs are electrocuted when urinating on the street from stray currents in connection boxes. The wiring is probably knob and tube. > > I think I can get use of 3 circuits which gives me some room to play with > all the nodes and hopefully the assortment of switches and power supply. I > have to figure out what the draw will be on the rest of the equip. 
Now where > the hell am I going to plug in this air conditioner???? > > Any advice on how to gang up 3 10amp circuits into a single 30amp? Sounds > like a job for an electrician? Thanks for the help guys. Why gang em up? Just run three extension cords or plug strips, one into each circuit. 6 machines on each circuit. If the startup surge is too much, stagger the spin up of the disk drives (maybe some sort of BIOS power management option?). Seriously think about cobbling together some collection of relays. The W.W.Grainger catalog is your friend, even if you don't buy your stuff from them, you'll at least have a good description to go googling for surplus places. If that gets too dicey.. Honda and Yamaha make very nice, quiet inverter generators. 2kW for about $700. Good luck... Jim. From lindahl at pathscale.com Thu Feb 17 22:44:34 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108478089.4587.118.camel@s861954.sandia.gov> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <1108478089.4587.118.camel@s861954.sandia.gov> Message-ID: <20050218064433.GA8147@greglaptop.attbi.com> On Tue, Feb 15, 2005 at 07:34:50AM -0700, Keith D. Underwood wrote: > > > c) ban of the ANY_SENDER wildcard: a world of optimization goes away > > with this convenience. > > Um, our apps guys say this is more than a convenience. Apparently, > sometimes you don't exactly know who you are going to receive from. > Would you rather them post receives from 4000 nodes and cancel the ones > that don't send to that node after a while? That's a good reason to use it. The fundamental problem is that supporting both ANY_TAG and ANY_SENDER efficiently is really annoying. So most implementers would prefer to support ANY_TAG efficiently and ANY_SENDER less efficiently. 
And then those pesky users (life is much simpler when you only run benchmarks that you write yourself, you know ;-) go use ANY_SENDER, usually because they think it's faster when say 4 nodes are sending you something at roughly the same time. Instead, we'd prefer that they use Irecv for that. -- greg From bari at onelabs.com Thu Feb 17 16:36:07 2005 From: bari at onelabs.com (Bari Ari) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] powering up 18 motherboards In-Reply-To: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net> References: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net> Message-ID: <42153877.3080101@onelabs.com> Alpay Kasal wrote: > Hello all... I have a question about powering on motherboards simultaneously... > > > > I have 18 identical mobo's right now with identical ram, cpu, and hard > disk. I hooked one up to a kill-a-watt and found that it draws 140-150 > watts when powering on, and stays level at about 90-100 watts > afterwards. The problem is that I am setting this up at home, where I > only have 10 amp circuits (and only a couple of them can be freed up). > Correct me if I am wrong here please? 10 amp circuits? What country is this in? What line voltage and frequency? What's the wire size and insulator? > 1 mobo = 100 watts / 115 volts = .87amps each mobo while steady on > > 1 mobo = 150 watts / 115 volts = 1.3amps each mobo while turning on > > > > I won't include the rest of the math, but needless to say, it'd be a > pain in arse to turn on the room in piecemeal without tripping a circuit > breaker. My question is: > > > > Will a heavy duty UPS aid in getting me through powering up the room? I > don't mind splitting up the 18 machines with 6 outlet surge strips. Any > advice? Circuit breakers are designed and initially calibrated to run with a continuous load of around 80% of their current rating. Most electrical codes limit continuous loads on branch circuits to 80% of the current rating of the conductors and current protection. (This is an oversimplification.
The NEC is far more complicated on this.) Using your rough numbers of 100W/mobo, you're just under 80% with 9 mobo's per 10A/115VAC circuit. The startup current is 17% over the rating of the circuit breaker. The breakers will hold for a short time at this load. How long is dependent on the brand and age of the circuit breaker. If you stagger the startup of the mobo's on each circuit (let's say 5, and then a min. later the other 4) it will help to keep the breakers from tripping. -Bari From bari at onelabs.com Thu Feb 17 16:59:49 2005 From: bari at onelabs.com (Bari Ari) Date: Wed Nov 25 01:03:49 2009 Subject: [SPAM] [Beowulf] powering up 18 motherboards In-Reply-To: <6.1.1.1.2.20050217160903.07880988@mail.jpl.nasa.gov> References: <0IC100EHK73A1V@mta7.srv.hcvlny.cv.net> <1108678933.3683.19.camel@terra> <6.1.1.1.2.20050217160903.07880988@mail.jpl.nasa.gov> Message-ID: <42153E05.7090801@onelabs.com> Jim Lux wrote: > A 10 amp circuit would be highly unusual in the U.S., but might be > common practice elsewhere. In the U.S., a 15 amp circuit is standard. I thought this was odd when I first read this as well. This may be a case where, to save dollars or in rehabbing old buildings, you have more than 3 current carrying conductors in one raceway and have to derate the current protection. In this case it may be that they ran more than three #14 current carrying conductors (as defined by the NEC) in the same raceway and had to derate the usual 15 amp circuit protection down to 10 amps. -Bari Ari From Glen.Gardner at verizon.net Thu Feb 17 17:12:27 2005 From: Glen.Gardner at verizon.net (Glen Gardner) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? References: Message-ID: <421540FB.2060103@verizon.net> David; It sounds to me as though you are seeking a low power beowulf as a solution.
A few people have built such machines, and it is possible to build a fast, useful beowulf cluster that uses very little electrical power and has sufficient muscle to do some serious work. I have a small 14 node cluster which I built a year ago. It uses very little power, and runs so cool that no room air conditioning is needed. In fact, my p4 machine makes more noise and heat than the beowulf cluster in my apartment. http://mini-itx.com/projects/cluster/ The above link shows the original 12 node configuration. At present, there are some motherboards available which give a very nice combination of cost, performance and low power use. The trick is to "right-size" everything for your needs and available resources. The down side is that the small low-power go-fast stuff is a little more pricey than the plain vanilla pc hardware a beowulf is usually built from, but not insanely so. Transmeta has some nice boards, and the mini-itx boards are not bad at all for the cost. Also, there are some rather nice small form factor motherboards that use AMD's geode cpu. When I start comparing cost, power use, and performance, so far the most attractive motherboards seem to be the mini-itx boards with the Nehemiah core cpu. However, with some low power geode boards now running at up to 1500 MHz, that may change. The Transmeta boards are probably the fastest of the low power boards, but the power use per MIPS is not as good as other boards if you believe the Transmeta printed specifications. Glen PS: I have also made a few comments below. David Mathog wrote: >At Wed, 16 Feb 2005 19:08:05 +0100 Vincent Diepeveen wrote: > > > >>Date: Wed, 16 Feb 2005 19:08:05 +0100 >>From: Vincent Diepeveen >>Subject: Re: [Beowulf] Academic sites: who pays for the electricity?
>>To: "David Mathog" , >> beowulf@beowulf.org >>Message-ID: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl> >>Content-Type: text/plain; charset="us-ascii" >> >>At 08:16 16-2-2005 -0800, David Mathog wrote: >> >> >>>In most universities services like electricity, water, and >>>A/C are paid for by the school. To do so they take "overhead" >>>out of every grant. Partially as a consequence of this they >>>typically have a very poor ability to meter usage on a room >>>by room basis. >>> >>>Now somewhere between the 10 node Pentium II beowulf sitting on >>>a lab bench and the 1000 node dual P4 Xeon beowulf in a machine >>>room that takes up half the basement the cost of the electricity >>>(both for power and A/C) goes from a minor expense to a major >>>one. Really major. For instance, in that hypothetical large machine, >>>at 10 cents per kilowatt hour (a round number), assuming 100 watts >>>per CPU (another round number) that's: >>> >>> For a dual p4 xeon machine at full throttle, it comes out to about 250 watts per node (or a little less) including the network adapters and switching. >>> 1000 (nodes) * >>> 2 (cpus/node) * >>> .1 (kilowatts/cpu) * >>> .1 (dollars/kilowatt-hour) * >>> 365 (days /year) * >>> 24 (hours/day) = >>>----------------------- >>> 175200 dollars/year >>> >>> >>Complete academic nonsense calculation. If you use quite some electricity >>the electricity gets up to factor 20-40 cheaper. Getting a factor 10 >>reduction in usage bill is pretty easy if you negotiate properly. >> >> > >Well, it isn't complete nonsense, unless you care to dispute the >number of days in a year, hours in a day, or cpus in a dual node >computer! > >The only term you're complaining about is the price of >electricity. I'm not privy to the electrical rates that our >school pays, they may well be an order of magnitude lower. My >home rates certainly aren't, but then, I don't buy as much >power as the campus. 
It's also not at all clear that the >campus would sell power to the end users at the same rate >which it pays the utility. > > You are forgetting the cost of cooling the cluster. Big machines make a lot of heat, and need a lot of cooling. >I don't really understand your point about keeping the units >running versus restarting them. Sure, it would be really bad >to try to boot all 1000 nodes simultaneously, in all likelihood >it wouldn't work. That's why they are typically started at N >second intervals, where N depends on your hardware. >Surely there is some N large enough so that the peak current >draw during the restart never exceeds the random fluctuations >observed when all units are running normally. Or is your >point that the electricity company doesn't want the facility >to draw _less_ current than it uses normally at >steady state? > > It is important to keep the cluster up and running, and only cycle the power when you must. The inrush currents at turnon stress components and shorten the life of the nodes considerably. Also, thermal cycling puts mechanical stresses on boards and components that can cause components and connections to fail. In a large cluster that is middle aged (@ 2 years old), you can reasonably expect to lose a couple of nodes every time you power down and come back up. After a while, this can be expensive. Shutting down a big machine is not a trivial thing. >On a somewhat related note, it would be nice if rack nodes >had some graceful way to conserve electricity. For instance, >something along the lines of: if the CPU utilization goes >below 5% for 10 seconds ratchet the clock down by a factor of 10. >When CPU usage goes above 90% ratchet for 2 seconds move it back >up again. Notebooks can do this sort of thing, but it seems not >to be a "feature" of most full size motherboards. This should >also lower the average temperature in the case, at the expense >of increased thermal cycling. 
Hard to say off hand if that's >a plus or a minus as far as hardware longevity goes. Certainly >it would be a plus in terms of energy conservation. > > A lot of modern cpu's have the ability to actually shut off unused internal circuitry. VIA CPU's, AMD's geode, Transmeta, and some intel cpus have these features. >Regards, > >David Mathog >mathog@caltech.edu >Manager, Sequence Analysis Facility, Biology Division, Caltech >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > -- Glen E. Gardner, Jr. AA8C AMSAT MEMBER 10593 Glen.Gardner@verizon.net http://members.bellatlantic.net/~vze24qhw/index.html -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20050217/be311034/attachment.html From redboots at ufl.edu Thu Feb 17 18:46:58 2005 From: redboots at ufl.edu (Paul Johnson) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] scalability of cluster paper Message-ID: <42155722.5060105@ufl.edu> I am wondering if anyone could point me to a paper on the scalability issues of NOWs clusters or Beowulf clusters using MPI. I'm curious what kind of scalability people see for clusters of fewer than 10 nodes. Any reference to a paper would be greatly appreciated. I've been doing a lot of scholar.googling but haven't found what I'm looking for yet. Thanks, Paul From eno at dorsai.org Thu Feb 17 21:43:00 2005 From: eno at dorsai.org (Alpay Kasal) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] powering up 18 motherboards In-Reply-To: <6.1.1.1.2.20050217160903.07880988@mail.jpl.nasa.gov> Message-ID: <0IC300MSUEDY78@mta10.srv.hcvlny.cv.net> James Lux, thanks for the extremely useful explanation. Btw, I'm in Brooklyn, NY. 120 volts, 60 cycles, regular AC power.
I don't know the gauge of the wiring in the walls but (as mentioned in another response just now) I suspect it is old wiring and is the reason for the strange 10amp circuit breakers. I looked at the x10 modules. Seems like it could be very useful, just script all of them from my headend. For now I'm going to try to handle the power-on sequence myself. I figured I could steal 3 10amp circuits from the house. Follow me... Turn on 4 nodes (on 1 strip) which will peak at 5.2amps. Let that settle down to a steady 3.48amps and hit another strip of 4 nodes. Total draw while the 2nd batch is starting is 8.68amps. It should steady at 6.96. I then have room to turn 1 more node on. Then one more after that. A 4 step process to get 10 nodes powered up without going over 10amps. Perform the same exact steps on a 2nd circuit. Annoying but possible without spending any more money. I was really hoping a decent $200-300 UPS would come to the rescue here. Oh well. I just had a thought... I planned on making use of wake-on-lan. I can just start sending jobs to the whole network though if all of it is asleep, I'd have to still be careful of the powerup-sequence. Grrrr. Maybe a script to perform WOL before starting any number crunching. Boy did I take nice big fat electrical lines for granted in the past! Alpay -----Original Message----- From: Jim Lux [mailto:James.P.Lux@jpl.nasa.gov] Sent: Thursday, February 17, 2005 7:18 PM To: Dean Johnson; Alpay Kasal Cc: beowulf@beowulf.org Subject: Re: [SPAM] [Beowulf] powering up 18 motherboards No, the UPS won't help. It might make things worse, because as you flip on all that load, the voltage will sag, causing the UPS to turn on, which then might trip from the overcurrent (assuming you're not out buying a 2kW UPS). You could use the X-10 type (aka Plug n Power) remote controlled relays (don't use Lamp modules.. you need Appliance modules, which are relays inside).
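The staggered start Alpay describes can be checked with a quick sketch, using his Kill-A-Watt figures (about 1.3 A per node while booting and 0.87 A once settled, at 115 V):

```python
# Simulate a staggered power-up on one 10 A circuit: already-running
# nodes draw their steady current while each new batch surges.
SURGE_A, STEADY_A, LIMIT_A = 1.3, 0.87, 10.0

def start_batches(batches):
    """Worst-case amps while starting each batch, then the steady draw."""
    running, peaks = 0, []
    for n in batches:
        peaks.append(round(running * STEADY_A + n * SURGE_A, 2))
        running += n
    return peaks, round(running * STEADY_A, 2)

peaks, steady = start_batches([4, 4, 1, 1])
print(peaks)                            # [5.2, 8.68, 8.26, 9.13]
print(steady)                           # 8.7 A for all 10 nodes
print(all(p < LIMIT_A for p in peaks))  # True: never trips the breaker
```

The fourth step (one settled batch of 9 plus one surging node) is the tightest squeeze at about 9.1 A, which is why the plan tops out at 10 nodes per 10 A circuit.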
From rgb at phy.duke.edu Fri Feb 18 04:12:09 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:49 2009 Subject: [SPAM] [Beowulf] powering up 18 motherboards In-Reply-To: <000601c5157c$855b5350$32a8a8c0@LAPTOP152422> References: <0IC300G56DM77U@mta8.srv.hcvlny.cv.net> <000601c5157c$855b5350$32a8a8c0@LAPTOP152422> Message-ID: On Thu, 17 Feb 2005, Jim Lux wrote: > > ----- Original Message ----- > From: "Alpay Kasal" > To: "'Bari Ari'" ; "'Jim Lux'" > Cc: "'Dean Johnson'" ; > Sent: Thursday, February 17, 2005 9:26 PM > Subject: RE: [SPAM] [Beowulf] powering up 18 motherboards > > > > I think you hit the nail on the head Bari, I'm in Brooklyn, New York. So I > > suppose it should be 15amp circuits but every circuit breaker in the box > is > > clearly a 10. This is an old house, seems like any renovations over the > > years have been only for aesthetics. The wiring in the walls is probably Circuit breaker? What's wrong with good old fuses? Good enough for my granddaddy.... actually, CB's suggest that the wiring has been messed with sometime in the last twenty or thirty years. That's a good thing! In my last house, the fuses DID feed directly into 14 Ga wire pairs that went through the house six inches apart in porcelain insulators and through porcelain tubes -- until I got to it and replaced it all, a circuit at a time, with 12 Gauge three wire PVC. The fabric coating those old wires DOES tend to disintegrate after 70 years or so... if your house has it it is probably a midlevel electrical fire trap. > > disintegrating - that would explain why the new looking circuit breakers > are > > rated for 10 amps. > > Ahhh yes.. New York, where some of the (mostly undocumented) distribution > wiring dates from Edison himself, and dogs are electrocuted when urinating > on the street from stray currents in connection boxes. The wiring is > probably knob and tube. 
> > > > > I think I can get use of 3 circuits which gives me some room to play with > > all the nodes and hopefully the assortment of switches and power supply. I > > have to figure out what the draw will be on the rest of the equip. Now > where > > the hell am I going to plug in this air conditioner???? > > > > Any advice on how to gang up 3 10amp circuits into a single 30amp? Sounds > > like a job for an electrician? Thanks for the help guys. > > Why gang em up? Just run three extension cords or plug strips, one into In case Jim didn't make it clear, DO NOT GANG THEM UP. In principle one can do this, IF all three of the circuits have the same phase. If they don't have the same phase (and aren't correctly connected in phase) you will observe a brief, interesting flash while the circuit breaker does bad things when you power up after trying it and see "midlevel electrical fire trap" above. However, the "right" way to gang them up is to go to the box and run brand new, clean, NEC-compliant wire from the box to your cluster location. The only thing you'll have to worry about is removing as much total amperage from the box as you add (or at least, staying within the distribution box's total capacity). Here, knowing what your house's total service capacity is, and what the total capacity of the main distribution panel is (hopefully they are the same, but one never knows), is useful. Maybe the box already has spare capacity and you just don't know it. The rule with electrical wiring is that if you don't know EXACTLY what you're doing, hire a professional electrician. That is, if you have to ask how to gang up circuits (thereby demonstrating an ignorance about 2 and 3 phase delivery, hot, neutral and cold/ground wires, ground loops, etc.) you have an unpleasantly high probability of either killing yourself or burning down your house (possibly months or years after your renovation), and are probably breaking the law besides when you do it.
Jim's suggestions below are excellent for living with what you have. Also consider stripping down the machines. A cluster node these days can be built out of a motherboard loaded with CPU and memory and with an onboard, PXE-capable NIC (or at most a PCI PXE NIC). No floppy, CD, hard drive, video card, or other peripheral stuff. A few weeks ago, there was a lovely discussion on clusters made by mounting bare motherboards on shelving and powering them off of shared large power supplies strung together with simple extensions, three motherboards per 450 W supply. This saves on both heat and power, as EACH peripheral draws a base current when turned on, including each power supply. There are pictures out there if you look. If you do this, you'll probably drop your load by as much as 30 watts per node, and this should be enough to push you safely within the capacity of your circuits at six nodes per circuit with room to spare. rgb > each circuit. 6 machines on each circuit. If the startup surge is too much, > stagger the spin up of the disk drives (maybe some sort of BIOS power > management option?). Seriously think about cobbling together some > collection of relays. The W.W.Grainger catalog is your friend, even if you > don't buy your stuff from them, you'll at least have a good description to > go googling for surplus places. > > If that gets too dicey.. Honda and Yamaha make very nice, quiet inverter > generators. 2kW for about $700. > > Good luck... > Jim. > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From atp at piskorski.com Fri Feb 18 04:45:28 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Re: Re: Re: Home beowulf - NIC latencies (Patrick Geoffray) In-Reply-To: <1108459690.25145.8.camel@pc-2.office.scali.no> References: <1108459690.25145.8.camel@pc-2.office.scali.no> Message-ID: <20050218124527.GA86169@piskorski.com> On Tue, Feb 15, 2005 at 10:28:10AM +0100, Ole W. Saastad wrote: > Patrick Geoffray wrote: > > The one with the best potential would be to use HyperThreading on > > Intel chips to have a polling thread burning cycles continuously; > The simple finding is that the kinetics program got somewhat more > than 70% of the CPU cycles and that the polling waisted close to > 30% of the CPU cycles, 30% is not for free. Using what Linux kernel? Using what feature to tell the kernel, "Please run this polling process only on the extra HyperThreaded virtual CPU, never on the real CPU." ? -- Andrew Piskorski http://www.piskorski.com/ From rgb at phy.duke.edu Fri Feb 18 05:11:17 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] powering up 18 motherboards In-Reply-To: <0IC300MSUEDY78@mta10.srv.hcvlny.cv.net> References: <0IC300MSUEDY78@mta10.srv.hcvlny.cv.net> Message-ID: On Fri, 18 Feb 2005, Alpay Kasal wrote: > James Lux, thanks for the extremely useful explanation. Btw, I'm in > Brooklyn, NY. 120volts, 60cycles, regular AC power. I don't know the gauge > of the wiring in the walls but (as mentioned in another response just now) I > suspect it is old wiring and is the reason for the strange 10amp circuit > breakers. > > I looked at the x10 modules. Seems like it could be very useful, just script > all of them from my headend. For now I'm going to try to handle the power-on > sequence myself. I figured I could steal 3 10amp circuits from the house. > > Follow me... 
Turn on 4 nodes (on 1 strip) which will peak at 5.2amps. let > that settle down to a steady 3.48amps and hit another strip of 4 nodes. > Total draw while the 2nd batch is starting is 8.68amps. It should steady at > 6.96. I then have room to turn 1 more node on. Then one more after that. A 4 > step process to get 10 nodes powered up without going over 10amps. Perform > the same exact steps on a 2nd circuit. Annoying but possible without > spending anymore money. Your problem will occur when the power goes off and comes back on when you aren't there. We have rather frequent 5-10 second powerouts down here -- without UPS's I used to go nuts in my house. > I was really hoping a decent $200-300 UPS would come to the rescue here. Oh > well. I don't think that putting one of these on per circuit is a bad idea; the real problem is that a UPS might draw more than your lines' capacity when initially charging -- I don't know for sure how much of a load they divert to charging when passing a load through. If you have >>a<< bigger circuit, you might be able to charge one fully on it, move it, plug everything in and power everything up, and use it as a line buffer of sorts. Even a couple of very cheap $50 ones that only will give you a minute or two might keep you from blowing the CBs every time the power in your area bobbles. Assuming that it does bobble -- maybe NYC never has power issues, even when dogs piss on transformers...;-)
> > Alpay > > > -----Original Message----- > From: Jim Lux [mailto:James.P.Lux@jpl.nasa.gov] > Sent: Thursday, February 17, 2005 7:18 PM > To: Dean Johnson; Alpay Kasal > Cc: beowulf@beowulf.org > Subject: Re: [SPAM] [Beowulf] powering up 18 motherboards > > No, the UPS won't help. It might make things worse, because as you flip on > all that load, the voltage will sag, causing the UPS to turn on, which then > might trip from the overcurrent (assuming you're not out buying a 2kW UPS). > > You could use the X-10 type (aka Plug n Power) remote controlled relays > (don't use Lamp modules.. you need Appliance modules, which are relays > inside). > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Fri Feb 18 05:14:39 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] scalability of cluster paper In-Reply-To: <42155722.5060105@ufl.edu> References: <42155722.5060105@ufl.edu> Message-ID: On Thu, 17 Feb 2005, Paul Johnson wrote: > I am wondering if anyone could point me to a paper on the scalability > issues of NOW clusters or > Beowulf clusters using MPI. I'm curious what kind of scalability people > see for clusters of less than 10 nodes. > Any reference to a paper would be greatly appreciated. I've been doing > a lot of scholar.googling but > haven't found what I'm looking for yet. There is a whole chapter on Amdahl's Law and scaling in my online beowulf book (and several online talks and white papers ditto). Look on http://www.phy.duke.edu/brahma or http://www.phy.duke.edu/~rgb under the Beowulf link.
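rgb's pointer aside, the qualitative answer for small clusters follows from Amdahl's Law; a minimal numeric sketch (the 90%-parallel fraction below is an illustrative assumption, not a figure from his book):

```python
# Amdahl's law: speedup on n nodes when a fraction p of the runtime
# parallelizes perfectly and the rest stays serial. The parameter
# p = 0.9 is an illustrative assumption, not a measurement.

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for n in (2, 4, 8, 10):
    print(n, round(amdahl_speedup(0.9, n), 2))
# Even with 90% parallel work, 10 nodes yield only ~5.3x -- one reason
# scaling numbers on sub-10-node clusters often disappoint.
```

As rgb notes below, real clusters are further penalized by communication and network terms that this bare serial-fraction model leaves out.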
Scalability isn't a question of number of nodes per se; it is a question of the parallel speedup you observe as a FUNCTION of the number of nodes participating in the problem, the problem "size" (really all other parameters, not just size), the network characteristics, and the system CPU/memory/bus characteristics. The presentation I give is a simplified one that lets you see the general idea, but the reality is much more complex for some classes of problem. HTH, rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From jcownie at etnus.com Fri Feb 18 06:38:23 2005 From: jcownie at etnus.com (James Cownie) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: Message from Rossen Dimitrov of "Tue, 15 Feb 2005 09:28:28 EST." <4212070C.9050207@verarisoft.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de> <4212070C.9050207@verarisoft.com> Message-ID: <20050218143823.8694B1D9B3@amd64.cownie.net> > In a conversation with MPI and tool developers I once mentioned > that not defining a standard/mandatory mpi.h was probably a missed > opportunity for improving interoperability of MPI. I was then told by > a member of the MPI-1 Forum that this was done on purpose. This makes > me think that we will not see an ABI definition for MPI any time soon. I think this is to misunderstand the process. The whole MPI process was informal. No-one gave the committee any power to create a standard; it happened because people wanted it to happen and were prepared to use the result. MPI followed the format and style of HPF, and can be thought of as an Open Source standard.
It was created as a result of user demand by people who were prepared to put in the effort to do so, and was adopted because it met a need. If an ABI for MPI is so important to you and of such value to your (and Patrick's) clients, then there's nothing to stop you from formulating such a standard, or, at least starting a project to create one. If you're right about its importance, then all the MPI implementors will follow your lead. If you're wrong, well, you wasted your time. The point here is that doing this can be an informal process which doesn't require "The MPI Forum" (whatever that is now !?) to endorse it, any more than a project on SourceForge requires endorsement by Linus if it runs on Linux ;-) (Or, if you prefer, don't keep whingeing about what the MPI Forum chose to do, but get on and fix it for yourself). -- -- Jim -- James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com From rross at mcs.anl.gov Fri Feb 18 07:56:26 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050218143823.8694B1D9B3@amd64.cownie.net> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de> <4212070C.9050207@verarisoft.com> <20050218143823.8694B1D9B3@amd64.cownie.net> Message-ID: On Fri, 18 Feb 2005, James Cownie wrote: [snip] > If an ABI for MPI is so important to you and of such value to your (and > Patrick's) clients, then there's nothing to stop you from formulating > such a standard, or, at least starting a project to create one. > > If you're right about its importance, then all the MPI implementors will > follow your lead. If you're wrong, well, you wasted your time. > > The point here is that doing this can be an informal process which > doesn't require "The MPI Forum" (whatever that is now !?) 
to endorse > it, any more than a project on SourceForge requires endorsement by Linus > if it runs on Linux ;-) > > (Or, if you prefer, don't keep whingeing about what the MPI Forum chose > to do, but get on and fix it for yourself). But keep in mind that some implementations encode meaning into the values in mpi.h, and as a result the values aren't even the same between multiple instances of the same implementation on different platforms! For example, MPICH2 encodes the size of basic datatypes in the value of the type. Saves us looking around in some table for these cases (which are the only ones that Patrick wants us to support :)!). So this is probably a larger problem than it seems. Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab From mathog at mendel.bio.caltech.edu Fri Feb 18 08:29:47 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] powering up 18 motherboards Message-ID: >I was really hoping a decent $200-300 UPS would come to > the rescue here. Oh well. APC makes Power Distribution Units that can be set to start loads at fixed intervals. If you had access to a 208V/3 phase line the AP7990 costs about $650, uses that as input, and outputs 24 120V sockets. They make quite a few of these PDUs in different configurations so maybe you can find one that does what you want? Another, much cheaper option, would be to set the slave node BIOS to use "Wake on LAN" (if it works on your systems) and NOT to start on power up. Then when power came up the headnode would boot and the others would just warm up enough to listen to their ethernet cards. I don't expect that the start up current going to that mostly off state would be very high, even for 18 computers, since neither the disks nor fans start spinning. Once the head node comes up you can boot the slaves using etherwake. 
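The etherwake step David describes is easy to script from the head node once it is up; a minimal sketch (the MAC addresses, broadcast address, and 30-second stagger are hypothetical placeholders, the stagger being there to respect the inrush-current sequencing discussed earlier in the thread):

```python
import socket
import time

# Hypothetical slave list -- substitute your nodes' real MAC addresses.
NODE_MACS = ["00:11:22:33:44:01", "00:11:22:33:44:02"]
STAGGER_SECONDS = 30  # let each node's inrush current settle first

def magic_packet(mac):
    # A WOL magic packet is 6 bytes of 0xFF followed by the target
    # MAC address repeated 16 times (102 bytes total).
    raw = bytes.fromhex(mac.replace(":", ""))
    return b"\xff" * 6 + raw * 16

def wake(mac, broadcast="255.255.255.255", port=9):
    # Broadcast the packet on UDP port 9; the mostly-off NIC watches
    # for its own MAC pattern and powers the board up.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.sendto(magic_packet(mac), (broadcast, port))
    s.close()

# Staggered wake-up, per the circuit-breaker discussion:
#   for mac in NODE_MACS:
#       wake(mac)
#       time.sleep(STAGGER_SECONDS)
```

This is functionally what etherwake does; scripting it yourself just makes the stagger between nodes explicit.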
Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From diep at xs4all.nl Fri Feb 18 08:36:44 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? Message-ID: <3.0.32.20050218173640.0109a100@pop.xs4all.nl> At 17:01 17-2-2005 -0800, Jim Lux wrote: >At 04:19 PM 2/17/2005, Jim Lux wrote: >>At 02:54 PM 2/17/2005, David Mathog wrote: >>>At Wed, 16 Feb 2005 19:08:05 +0100 Vincent Diepeveen wrote: >>> >>> > Date: Wed, 16 Feb 2005 19:08:05 +0100 >>> > From: Vincent Diepeveen >>> > Subject: Re: [Beowulf] Academic sites: who pays for the electricity? >>> > To: "David Mathog" , >>> > beowulf@beowulf.org >>> > Message-ID: <3.0.32.20050216190804.0106fcc0@pop.xs4all.nl> >>> > Content-Type: text/plain; charset="us-ascii" >>> > >>> > At 08:16 16-2-2005 -0800, David Mathog wrote: >>> > >In most universities services like electricity, water, and >>> > >A/C are paid for by the school. To do so they take "overhead" >>> > >out of every grant. Partially as a consequence of this they >>> > >typically have a very poor ability to meter usage on a room >>> > >by room basis. >>> > > >>> > >Now somewhere between the 10 node Pentium II beowulf sitting on >>> > >a lab bench and the 1000 node dual P4 Xeon beowulf in a machine >>> > >room that takes up half the basement the cost of the electricity >>> > >(both for power and A/C) goes from a minor expense to a major >>> > >one. Really major. For instance, in that hypothetical large machine, >>> > >at 10 cents per kilowatt hour (a round number), assuming 100 watts >>> > >per CPU (another round number) that's: >>> > > >>> > > 1000 (nodes) * >>> > > 2 (cpus/node) * >>> > > .1 (kilowatts/cpu) * >>> > > .1 (dollars/kilowatt-hour) * >>> > > 365 (days /year) * >>> > > 24 (hours/day) = >>> > >----------------------- >>> > > 175200 dollars/year >>> > >>> > Complete academic nonsense calculation. 
If you use quite some electricity >>> > the electricity gets up to factor 20-40 cheaper. Getting a factor 10 >>> > reduction in usage bill is pretty easy if you negotiate properly. > >Just where do you live that such negotiations are possible. Here's some You aren't going to negotiate about a single small room with a few lightbulbs obviously. We're talking about huge usage, like if all supercomputers are located at 1 central spot and the entire institute with thousands of working places gets powered in a central way, and usually electricity offering companies aren't going to put online their rates for reduced usage, as that would give them a bad negotiation starting point :) Nuclear power gets more and more exported from France to the rest of Europe. For example Italy is importing 25% of its total power, the majority from France. The Netherlands and Germany import roughly 20% of their power. That will get more and more, simply because electricity producing plants can only get built for a specific amount of time (like a 25-year contract) and then must get cleaned up. Obviously such rules make building your own producing plants impossible for the electricity producing companies. I'm not taking a political viewpoint on which type of electricity is more damaging than another and what should happen in the future in that respect. Yet closing your eyes to reality is something else. More and more power gets used. Computers take part of that power; industry takes the majority. In the USA, no new nuclear plants have been built for many years. Obviously that means that the market there is different from Europe, where the energy market seems to be more innovative than in the USA. Even though I have little respect for certain plans. Like the 150 meter high windmills they want to build in Houten, just a few hundred meters away from newly built houses where tens of thousands live; that is IMHO a wrong idea.
I don't have the details here on the rotor diameter and speed of those mills, but obviously they can only get built because the government wastes money on them, and they kill huge numbers of birds, which have near-zero chance of surviving near those mills. Yet it's another innovation in Europe. The energy market is one of the most complex financial markets, and not only because politicians prefer to close their eyes to its problems. Another major problem is who owns what in Europe. Is the government again going to own the transport infrastructure, or can independent transport companies keep doing the job? There is a difference between energy producers and energy consumer delivering companies, and so on. Yet a washing machine eats thousands of watts and nearly everybody has one at home in Europe. Certain products we daily use and just throw away get produced by throwing tens of thousands of watts into battle in heavy industry. So complaining about energy usage of computers is a good thing, but we shouldn't overreact. >real numbers from Southern California Edison. >http://www.sce.com/CustomerService/RateInformation/BusinessRates/LargeBusiness/ > >First off, you're looking at a 200kW load for 1000 nodes, which is a hefty >load, just for the computers (not counting lights, HVAC, etc.) But, no >matter, we'll assume your facility is sucking at least 500kW some of the >time, so that would put you in the large business TOU-8 tariff. >http://www.sce.com/NR/sc3/tm2/pdf/ce54-12.pdf > All the large consumer tariffs are time-of-use sensitive. I assume you >wouldn't want some sort of "Critical Peak Pricing Options" or "Demand >Bidding Programs" > >Let's assume you're being served at 240V (as opposed to having your own >distribution transformers, etc., although as a 200kW consumer, that's >something you should consider).
> >Looks like the rates break down as about 0.016/kWh for the delivery, and >the actual power (generation) runs somewhere between 0.04/kWh (off peak >summer) to 0.12/kWh on peak summer. > >There's also a raft of other charges (metering, demand (runs about $10/kW >of instantaneous demand), power factor, etc.) > >Compare this to Domestic service.. where the rates run from 0.11 to >0.18/kWh, depending on where you sit relative to baseline, season, etc. >(I'll also point out that I was paying SCE $0.26/kWh at home in the summer >of 2001, but rates are lower now.) Now, you might consider Residential to >be artificially constrained for political reasons, so we take a look at >GS-1 (general service..) >Here, we have 0.07 for delivery and 0.085 (totalling 0.155/kWh) during the >summer. > >The point is, there isn't a 10:1 ratio... not even close. And, rgb's >ballpark of $0.10/kWh is a perfectly reasonable estimate, if a bit low for >Southern California. Even as far back as 1990,on peak, large customers were >paying on the order of $0.11/kWh. > >It is possible that if you were buying power directly (which is possible, >as a large end user), you can pay "market rate" for each kWh you >consume. The price is quite volatile, though... At the peak of the market >failure a few years back, a kWh on the open market was something like $20 >during peak times. I doubt you want to schedule your cluster ops to take >advantage of electricity rate fluctuations. > > >>>Well, it isn't complete nonsense, unless you care to dispute the >>>number of days in a year, hours in a day, or cpus in a dual node >>>computer! >>> >>>The only term you're complaining about is the price of >>>electricity. I'm not privy to the electrical rates that our >>>school pays, they may well be an order of magnitude lower. My >>>home rates certainly aren't, but then, I don't buy as much >>>power as the campus. 
It's also not at all clear that the >>>campus would sell power to the end users at the same rate >>>which it pays the utility. > >CalTech probably buys their power from City of Pasadena, but it's probably >similar in rate structure to SCE's TOU-8. Home rates are somewhat >artificially low for political reasons. The folks really getting the short >end of the stick are small businesses who don't have the negotiating power >that a large business does, nor the political clout of elderly pensioners >dying from heat. > > >I'll also note that if you start paying for kWh, you're going to want to >give serious consideration to buying more nodes than you strictly need, and >shutting down the cluster during on-peak times. A typical pricing strategy >might be 0.15/0.07/0.05 (on/mid/off peak): Peak lasts 6 hrs (1200-1800), >mid is 0800-1200,1800-2300, and offpeak is the rest. There's 93 hours of >offpeak time out of 168 total in the week > >For the pricing and schedules above, it turns out that the optimum is to >shut down only during peak, but run during midpeak, for an average energy >cost of 0.056/kWh, but using 22% more computers. > > > >James Lux, P.E. >Spacecraft Radio Frequency Subsystems Group >Flight Communications Systems Section >Jet Propulsion Laboratory, Mail Stop 161-213 >4800 Oak Grove Drive >Pasadena CA 91109 >tel: (818)354-2075 >fax: (818)393-6875 > > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From rgb at phy.duke.edu Fri Feb 18 09:56:22 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] powering up 18 motherboards In-Reply-To: References: Message-ID: On Fri, 18 Feb 2005, David Mathog wrote: > >I was really hoping a decent $200-300 UPS would come to > > the rescue here. Oh well. 
> > APC makes Power Distribution Units that can be set to > start loads at fixed intervals. If you had access to a > 208V/3 phase line the AP7990 costs about $650, uses that > as input, and outputs 24 120V sockets. They make quite > a few of these PDUs in different configurations > so maybe you can find one that does what you want? Which brings to mind a safety as well as a practical point -- if you indeed DO have three phases of power on the three separate circuits of the room (not as unlikely as it sounds as in that case they may well share a single common ground wire) then your power options are a little different, because then you de facto run at a higher voltage and lower line current, and can in principle get a small boost in what the lines can safely tolerate by way of delivered power vs power dissipated in the supply lines as heat if you use such a unit. However, this then brings up a nasty, thorny issue concerning the power factor and current draw pattern of "most" (cheap) PC power supplies. Switching power supplies that are not power factor corrected tend to draw most of their current only in the middle third of each voltage half-sinusoid wave (true fact -- read about it on How Things Work website or any of several websites that discuss computer room wiring, some of which -- mirus international? -- are linked to some of my beowulf pages on brahma). This means several things: a) the peak current draw (for a given average power consumed) is much higher than you'd expect based on simple RMS considerations -- the power factor of the load is less than 1, if you know what that means. b) this causes higher order harmonics to appear in the voltage/current curves -- in particular 3*60 Hz or 180 Hz (the "edge" frequency of where the power draws switch on and off). These voltage/current ripples may make it past 60 Hz filters to reach internal components. 
c) three phases can share a neutral because three equal loads with unit power factor (resistive loads like a light bulb) cause the neutral current to >>cancel perfectly<<, and it can be shown (consider sin(wt) + sin(wt + 2\pi/3) + sin(wt + 4\pi/3) = 0) that the neutral current has a strict upper bound in this case that is within the safe limits of the neutral wire if the loads on each delivery line are themselves safe. This is NOT TRUE for switched power supplies sharing a neutral. In that case the currents delivered to the neutral wire by each phase >>add<< instead of cancelling, in three separate chunks per half cycle. The neutral current can actually approach 3I where I is the (average) current being drawn by any single line (already high relative to RMS expectations based on an assumption of unit power factor; see above).
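Both halves of point (c) are easy to verify numerically; a small sketch, modeling the non-power-factor-corrected supply crudely as a flat current pulse drawn only during the middle third of each half-cycle:

```python
import math

def phase_current(t, shift, pulsed):
    # Resistive (unit power factor) load: current tracks the voltage.
    v = math.sin(t - shift)
    if not pulsed:
        return v
    # Crude non-PFC switching supply: a flat pulse of current drawn
    # only during the middle third of each half-cycle (|sin| > sin 60deg).
    return math.copysign(1.0, v) if abs(v) > math.sin(math.pi / 3) else 0.0

def avg_abs_current(f):
    # Average |current| over one full cycle, sampled numerically.
    n = 9000
    return sum(abs(f(2 * math.pi * i / n)) for i in range(n)) / n

def neutral(t, pulsed):
    # Shared-neutral current: the sum of the three phase currents.
    return sum(phase_current(t, k * 2 * math.pi / 3, pulsed) for k in range(3))

# Sinusoidal loads: the three phase currents cancel in the neutral.
print(avg_abs_current(lambda t: neutral(t, False)))  # ~0
# Pulsed loads: the pulses interleave instead of cancelling, and the
# neutral carries ~3x the average current of any single phase.
one_phase = avg_abs_current(lambda t: phase_current(t, 0.0, True))
print(avg_abs_current(lambda t: neutral(t, True)) / one_phase)  # ~3.0
```

The pulse model is a caricature of a real supply's current waveform, but it reproduces exactly the "neutral approaches 3I" behavior rgb describes.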
Note that this state COULD ALREADY EXIST with your wiring, and that the only way to test it is to measure the hot-to-hot voltage between circuits, that is: v_a - v_b = (should be zero, could be 240 or 208 VAC) v_a - v_c = ( ditto ) v_b - v_c = ( ditto ) where v_a is multimeter probe in VAC mode inserted into hot slot of circuit a, v_b is other probe inserted into hot slot of circuit b. If these measurements are all zero, all three of your circuits have the same phase (and actually could be combined "safely" into a 30 amp circuit although you should NEVER DO THIS as there is nothing to prevent somebody at the distribution panel from moving one of the lines onto a different phase while rearranging things for some other reason so it isn't at all safe, actually). If they are all 208, you have three phase (wye) power and the power is likely being delivered from the distribution panel as a single cable with five internal wires, three of them insulated carrying one phase each, a shared insulated neutral, and a bare ground. If one pair is zero and the other two are 240, you have two phase power at the distribution box, and they are either running three separate lines (likely) or (possibly) lines with four wires, two "hot" and carrying the opposed phases, one neutral, and one ground and some other circuit in your apartment is the partner of the odd line out. They could be running a lower current limit on the lines because they exceeded the run length for the gauge of wire they used in these cables. It is certainly cheaper to re-fuse than it is to put additional primary or secondary panels with thicker wire and/or additional transformers in locations from which standard wiring can reach and be within code.
A lot of state codes prohibit this sort of thing and require a building's wiring to be brought up to code any time anything is renovated, but it wouldn't surprise me to learn that NYC either makes a general exception or that individual landlords grease their way to a local exception. It is perhaps worth your while to figure this out -- I'd certainly want to know if it were my cluster. I'd also be CERTAIN to test the phase per circuit (is the hot wire really hot and not the wire that is SUPPOSED to be the neutral wire?). There are some really "interesting" things that can happen if you cross connect devices in certain ways between two miswired circuits, where running on each circuit alone is safe enough in the sense that nothing breaks. Interesting like spattering partially vaporized liquid metal is interesting. You also very definitely want to learn about the shared neutral thing, if you're using stock power supplies. If it is three phase wye, then you could think about the APC option above and other thing. Cluster room wiring is nontrivial (even when the "room" is in your house), and where you are trying to push it to a limit, you're likely going to have to educate yourself about it. I think that I put some of this in my online book, in case this is confusing. There are also really good discussions of it in the list archives (where Jim Lux and others made some great contributions). rgb > > Another, much cheaper option, would be to set the slave > node BIOS to use "Wake on LAN" (if it works on your systems) > and NOT to start on power up. Then when power came up the > headnode would boot and the others would just warm up > enough to listen to their ethernet cards. I don't > expect that the start up current going to that mostly off > state would be very high, even for 18 computers, since neither > the disks nor fans start spinning. Once the head node > comes up you can boot the slaves using etherwake. 
> > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From gropp at mcs.anl.gov Fri Feb 18 10:29:10 2005 From: gropp at mcs.anl.gov (William Gropp) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] MPI ABI (Was Re: Re: Home beowulf - NIC latencies ) In-Reply-To: <20050218143823.8694B1D9B3@amd64.cownie.net> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de> <4212070C.9050207@verarisoft.com> <20050218143823.8694B1D9B3@amd64.cownie.net> Message-ID: <6.2.1.2.2.20050218121850.04c47cd0@pop.mcs.anl.gov> At 08:38 AM 2/18/2005, James Cownie wrote: >... > >If an ABI for MPI is so important to you and of such value to your (and >Patrick's) clients, then there's nothing to stop you from formulating >such a standard, or, at least starting a project to create one. > I wrote a paper that appeared in the EuroPVMMPI'02 meeting that discusses the issues of a common ABI. The paper is "Building Library Components That Can Use Any MPI Implementation" and is available as http://www-unix.mcs.anl.gov/~gropp/bib/papers/2002/gmpishort.pdf . This paper was relatively pragmatic and discussed an approach that allowed the user to link the same object files against two MPI libraries (MPICH and LAM/MPI were used in the example). 
There are a few remaining tricky issues for handling the MPI opaque objects (specifically, how big are they) and the size and layout of MPI_Status (different implementations use different sizes, and the user is allowed to use an array of MPI_Status). There are also some very minor tradeoffs in performance in the solution presented in the paper, but these probably aren't important in the context of clusters, and are likely to be smaller than requiring implementations to translate between internal and external representations. The web site mentioned in the paper is out-of-date, mostly because there wasn't much interest in a (nearly) common ABI at the time. I can make it available if there is interest now. Bill William Gropp http://www.mcs.anl.gov/~gropp From James.P.Lux at jpl.nasa.gov Fri Feb 18 10:30:39 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? In-Reply-To: <3.0.32.20050218173640.0109a100@pop.xs4all.nl> References: <3.0.32.20050218173640.0109a100@pop.xs4all.nl> Message-ID: <6.1.1.1.2.20050218102305.041ca4a0@mail.jpl.nasa.gov> At 08:36 AM 2/18/2005, Vincent Diepeveen wrote: > >>> > Complete academic nonsense calculation. If you use quite some >electricity > >>> > the electricity gets up to factor 20-40 cheaper. Getting a factor 10 > >>> > reduction in usage bill is pretty easy if you negotiate properly. > > > >Just where do you live that such negotiations are possible. Here's some > >You aren't going to negotiate about a single small room with a few >lightbulbs obviously. 
> >We're talking about huge usage, like if all supercomputers are located at 1 >central spot and the entire institute with thousands of working places gets >powered in a central way, and usually electricity offering companies aren't >going to put online their rates for reduced usage, as that would give them >a bad negotiation starting point :) In the U.S., at least, electricity at the retail level is fairly regulated, and anyone selling electricity must post official "tariffs" that give the rates and so forth. A significant part (perhaps 20-30%) of the "as delivered" cost of electricity is the amount you pay for the transmission and distribution system (all those HV power lines, etc.), which is, again, somewhat regulated (or at least, the past pricing data is readily available, by law and regulation). The significant exception to this might be if you co-locate with an independent generator of power, in which case there's no interconnection to the grid. But, even in this case, if your generator interties with the rest of the system, so you're not the only customer, then your supply will be affected by the fluctuations in the supply and demand on the overall grid. In general, the price that a generator is paid is almost totally unregulated (unlike retail rates, which is what caused the problems in California a few years back), and so, while you may be able to negotiate a very low price for power for some times of day, etc., you'll probably also get an "interruptible load" clause in the contract, or you'll have to pay high rates at other times. James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From hahn at physics.mcmaster.ca Fri Feb 18 20:25:28 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Mare Nostrum (not quite COTS) In-Reply-To: <20050216103153.GH1404@leitl.org> Message-ID: > ranked number four in the world in speed in November 2004, is constructed of > such totally off-the-shelf parts as IBM BladeCenter JS20 servers, 64-bit it's funny how "off the shelf" means different things to different people. I consider blades to be a qualitatively different category of hardware than, say, a Tyan motherboard in an AICPC chassis. AFAICT, blades still run at a premium vs "normal" servers from the same vendor (which is also at a premium vs whitebox.) > The thinking behind MareNostrum's construction represents a new way of > looking at these and other compute-intensive areas. Today's typical > high-performance computing installation runs a large, parallel RISC-based > UNIX® system with performance instead of reliability being of utmost > importance. egads, did the author of this really believe it?!? > computer density in the industry, which results in high performance with a > small footprint. The BladeCenter technology allows for 84 dual processor > servers in a single 42 U rack, uh, no, sorry, maybe someone should have fact-checked this. HP has both xeon and opteron-based blades which put 96 duals in a rack. it would be rather shocking if HP didn't jump on dual-core opterons, too... > giving more than 1.4 teraflops of compute > power in a single rack. fused-mul-add is really a wonderful marketing tool, isn't it?
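The arithmetic behind that per-rack figure supports Hahn's jab; a quick sanity check (my assumed numbers: 84 JS20 blades per rack, two PowerPC 970 CPUs per blade at 2.2 GHz, each retiring two fused multiply-adds, i.e. 4 flops, per cycle):

```python
# Peak flops per BladeCenter rack under the stated assumptions --
# every one of these flops is theoretical FMA peak, not sustained rate.
blades_per_rack = 84
cpus_per_blade = 2
clock_ghz = 2.2
flops_per_cycle = 4  # two FMA pipes x 2 flops per fused multiply-add

peak_gflops = blades_per_rack * cpus_per_blade * clock_ghz * flops_per_cycle
print(round(peak_gflops, 1))  # ~1478 GF: the "more than 1.4 teraflops"
```

So the marketing number is reachable only if every cycle issues two fused multiply-adds on every CPU, which is exactly Hahn's point.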
> When the power of MareNostrum is unleashed later this year, it will be at the hmm, to me, the bigger the computer, the more money is evaporating for every day between delivery and full user utilization. From hahn at physics.mcmaster.ca Fri Feb 18 21:18:31 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? In-Reply-To: Message-ID: > > room that takes up half the basement the cost of the electricity > > (both for power and A/C) goes from a minor expense to a major > > one. but it's not as if the university will fail to notice a new 1K-node cluster. so if they choose not to install metering equipment, it's on them. we recently built a machineroom for 1.5Kcpus, and sure enough, the U made us buy metering crap (IMO, the U should pay for it.) the U also made us buy harmonic mitigating PDUs because they didn't understand what PFC means in a power supply. > > "Typical" lab usage is widely variable but I'd be amazed > > if most biology or chemistry labs burn through even 1/10th this > > much for the equivalent lab area. Some physics lab running > > a tokamak might come close. I guess, I figure around 300W/sq ft is pretty typical for a new HPC machineroom. that's obviously more than a bio lab, but how about comparing to that glassblower over in arts, or someone pouring a new alloy in materials engineering? > > Anyway, the question is, have any of the universities said "enough > > is enough" and started charging these electricity costs directly? > > If so, what did they use for a cutover level, where usage was > > "above and beyond" overhead? > > This issue has most definitely come up at Duke, although we're still > seeking a formula that will permit us to deal with it equitably. This I guess I'm a little surprised - one main Canadian funding agency (CFI) has a program that provides infrastructure operating funds (you apply for this after being awarded a capital grant.) 
it pays for electricity as well as sysadmins. > As our Dean of A&S recently remarked, if there aren't any checks and > balances or cost-equity in funding and installing clusters, they may > well continue to grow nearly exponentially, without bound (Duke's I find that most faculty who have compute needs (and funding) will seriously consider buying into a shared facility instead. that's our (SHARCnet's) usual pitch: let us help you spend your grant, and we'll give you first cut at that resource, but otherwise take all the pain off your hands. most people realize that running a cluster is a pain: heat/noise, but more importantly the fact that it soaks up tons of the most expensive resource, human attention. do you want your grad students spending even 20% of their time screwing around with cluster maintenance? not to mention the fact that most computer use is bursty, and therefore very profitably pooled. a shared resource means that a researcher can burst to 200 cpus, rather than just the 20 that his grant might have bought. and after the burst, someone else can use them... > to hold them, the power to run them, and the people to operate them, all > grow roughly linearly with the number of nodes. This much is known. well, operator cost scales linearly, but that line certainly does not pass through zero, and is nearly flat (a 100p cluster takes almost the same effort as a 200p one.) > Finding out isn't trivial -- it involves running down ALL the clusters > on campus, figuring out whom ALL those nodes "belong" to, determining > ALL the grant support associated with all those people and projects and here at least, the office of research services sees all funding traffic, and so is sensitized to the value of cluster pooling. the major funding agencies have also expressed some desire to see more facility centralization, since the bad economics of a bunch of little clusters is so clear... regards, mark hahn. 
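[Mark's point about operator cost, a large intercept with a nearly flat slope, can be sketched as a toy model; the constants below are invented for illustration, not measured at any site:]

```python
# Toy model of operator cost vs. cluster size: linear, but the line
# does not pass through zero and is nearly flat.  Both constants are
# invented for illustration only.
def admin_ftes(nodes, base=1.0, per_node=0.002):
    """Full-time-equivalent admins needed for an n-node cluster."""
    return base + per_node * nodes

for n in (100, 200, 1000):
    print(f"{n:5d} nodes -> {admin_ftes(n):.2f} FTE")
```

[On this model a 200p cluster needs 1.4 FTE against 1.2 for a 100p one ("almost the same effort"), while per-node costs like power and parts scale with n from zero.]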
From srgadmin at cs.hku.hk Thu Feb 17 21:41:23 2005 From: srgadmin at cs.hku.hk (srg-admin) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] CCGrid 2005: CALL FOR PARTICIPATION Message-ID: <42158003.9040708@cs.hku.hk> Apologies if you received multiple copies of this message. ------------------------------------------ CLUSTER COMPUTING AND GRID (CCGrid 2005) http://www.cs.cf.ac.uk/ccgrid2005/ 9-12 May 2005 Cardiff, UK -------------------------------------------- ******************************************************* CALL FOR PARTICIPATION ******************************************************* SCOPE ===== Commodity-based clusters and Grid computing technologies are rapidly developing, and are key components in the emergence of a novel service-based fabric for high capability computing. Cluster-powered Grids not only provide access to cost-effective problem-solving power, but also promise to enable a more collaborative approach to the use of distributed resources, and new economic products and services. CCGrid2005, sponsored by the IEEE Computer Society is designed to bring together international leaders who are pioneering researchers, developers, and users of clusters, networks, and Grid architectures and applications. The symposium will also serve as a forum to present the latest work, and highlight related activities from around the world. 
CCGrid2005 is interested in topics including, but not limited to: o Hardware and Software (based on PCs, Workstations, SMPs or Supercomputers) o Middleware for Clusters and Grids o Dynamic Optical Network Architectures for Grid Computing o Parallel File Systems, including wide area file systems, and Parallel I/O o Scheduling and Load Balancing o Programming Models, Tools, and Environments o Performance Evaluation and Modeling o Resource Management and Scheduling o Computational, Data, and Information Grid Architectures and Systems o Grid Economies, Service Architectures, and Resource Exchange Architectures o Grid-based Problem Solving Environments o Scientific, Engineering, and Commercial Grid Applications o Portal Computing / Science Portals PROGRAMME ========= The conference will contain: o Over 75 papers in the main track (33% Acceptance Rate) o 9 workshops: Chair: Craig Lee (Aerospace Corporation, US) Workshop 1: Collaborative and Learning Applications of Grid Technology Organisers: Oscar Ardaiz-Villanueva, Technical University Catalunya, Spain Miguel L. Bote-Lorenzo, University of Valladolid, Spain Amy Apon, University of Arkansas, US Barry Wilkinson, Western Carolina University, US Workshop 2: Cluster-Sec 2005: Cluster Security -- The Paradigm Shift Organiser: William Yurcik, NCSA, US Workshop 3: Semantic Infrastructure for Grid Computing Applications Organisers: Line C. 
Pouchard, Oak Ridge National Lab, US Luc Moreau, University of Southampton, UK Valentina Tamma, University of Liverpool, UK Workshop 4: Fifth International Workshop on Global and Peer-2-Peer Computing: "Theory and Experience of Desktop Grids and P2P systems" Organisers: Franck Cappello, INRIA, France Adriana Iamnitchi, Duke University, US Mitsuhisa Sato, Tsukuba University, Japan Workshop 5: DSM 2005: Fifth International Workshop on Distributed Shared Memory Organisers: Laurent Lefevre, INRIA RESO/LIP, France Michael Schoettner, University of Ulm, Germany Workshop 6: GAN'05: Third Workshop on Grids and Advanced Networks Organisers: Laurent Lefevre, INRIA RESO/LIP, France Pascale Primet, INRIA RESO/LIP, France Workshop 7: Bio-Medical Computations on the Grid (BioGrid) Organisers: Chun-Hsi Huang, University of Connecticut, US Sanguthevar Rajasekaran, University of Connecticut, US Workshop 8: 1st International Workshop on Grid Performability Organisers: Nigel Thomas, University of Newcastle upon Tyne, UK Stephen Jarvis, University of Warwick, UK Workshop 9: Workshop on Agent-based Grid Economics (AGE-2005) Organisers: Daniel Veit, University of Karlsruhe, Germany Bjoern Schnizler, University of Karlsruhe, Germany o 4 Tutorials (31% acceptance rate -- 13 tutorial proposals were submitted) Chair: Michael Gerndt (TUM, Germany) Tutorial 1: High Performance I/O for Scientific Applications Robert Latham Mathematics and Computer Science Division, Argonne National Lab, IL, USA Tutorial 2: Practical Performance Measurement and Analysis of Parallel Programs on Clusters (and Grids) Bernd Mohr Forschungszentrum Juelich, Zentralinstitut fuer Angewandte Mathematik, 52428 Juelich, Germany Tutorial 3: Grid Computing Security - Issues, Concerns and Counter-measures Anirban Chakrabarti Grid Computing Focus Group, Software Engineering Technology Lab, Infosys Technologies, Electronics City, Bangalore, Karnataka 560100, India Tutorial 4: The Gridbus Toolkit: Creating and Managing Utility 
Grids for eScience and eBusiness Applications Rajkumar Buyya Senior Lecturer and StorageTek (USA) Fellow of Grid Computing Grid Computing and Distributed Systems (GRIDS) Lab Dept. of Computer Science and Software Engineering, The University of Melbourne, ICT Building, 111, Barry Street, Carlton, Melbourne, VIC 3053, Australia o A poster session with over 16 posters Posters Chair: Yan Huang (Cardiff University, UK) o Two General Keynote Talks: Talk 1: "e-Science, Cyberinfrastructure and Web Service Grids" Professor Tony Hey, University of Southampton/EPSRC, UK Talk 2: "Experiences with System X" Professor Srinidhi Varadarajan, Virginia Tech, US o An Industry Track Chair: Alistair Dunlop (OMII, UK) Industry Keynote: "WS-Agreement" Heiko Ludwig, IBM T.J.Watson Research Centre, US o A "Work in Progress" Session REGISTRATION DATES ================== http://www.cs.cf.ac.uk/ccgrid2005/registration.html Early bird registration : March 21, 2005 Accommodation (cut-off) date: March 21, 2005 SPECIAL EVENT ============= Conference Banquet will be at the National Museum and Galleries of Wales (within walking distance of the conference venue and hotels). More details at: http://www.nmgw.ac.uk/nmgc/ ========== Honorary Chair -------------- Tony Hey, EPSRC, UK Conference Chairs ----------------- David W. Walker, Cardiff University, UK Carl Kesselman, USC/ISI, US Programme Committee Chair ------------------------- Omer F.
Rana, Cardiff University, UK Programme Committee Vice-Chairs ------------------------------- Jack Dongarra, University of Tennessee, US Luc Moreau, University of Southampton, UK Sven Graupner, HP Labs, US Peter Sloot, University of Amsterdam, The Netherlands Craig Lee, The Aerospace Corporation, US Publications Chair: Rajkumar Buyya, University of Melbourne, Australia Workshops Chair: Craig Lee, Aerospace Corporation, US Tutorials Chair : Michael Gerndt, TU Munich, Germany Industry Track Chair: Alistair Dunlop, OMII, UK Exhibits Chair: Steven Newhouse, OMII, UK Posters Chair : Yan Huang, Cardiff University, UK Finance Chair: John Oliver, Welsh eScience Centre, UK Registration Chair : Tracey Lavis, Cardiff University, UK Local Arrangements Chair: Linda Wilson, Welsh eScience Centre, UK Publicity Chairs ---------------- Vladimir Getov, University of Westminster, UK (Europe) Marcin Paprzycki, Oklahoma State University, US (Europe) Cho-Li Wang, University of Hong Kong (Asia Pacific) Ken Hawick, Massey University, New Zealand (Asia Pacific) Manish Parashar, Rutgers University, US (America) PROGRAMME COMMITTEE ------------------- Thierry Priol, IRISA, France Seif Haridi, KTH Stockholm, Sweden Bruno Schulze, Laboratório Nacional de Computação Científica, Brazil David Abramson, Monash University, Australia Steven Willmott, Universitat Politècnica de Catalunya, Spain Xian-He Sun, Illinois Institute of Technology, US Yun-Heh (Jessica) Chen-Burger, University of Edinburgh, UK Thilo Kielmann, Vrije Universiteit, The Netherlands Brian Matthews, RAL/CCLRC and Oxford Brookes University, UK Maozhen Li, Brunel University, UK Greg Astfalk, HP Labs, US Marty Humphrey, University of Virginia, US Geoffrey Fox, Indiana University, US Martin Berzins, University of Leeds, UK Hai Jin, Huazhong University of Science and Technology, China Giovanni Chiola, Università di Genova, Italy Domenico Talia, Università della Calabria/ICAR-CNR, Italy José Cunha, Universidade Nova de Lisboa,
Portugal Ron Perrott, Queen's University Belfast, UK Ewa Deelman, ISI/USC, US Stephen Jarvis, Warwick University, UK Niclas Andersson, Linköping University, Sweden Putchong Uthayopas, Kasetsart University, Thailand John Morrison, University College Cork, Ireland Stephen Scott, Oak Ridge National Lab, US Luciano Serafini, ITC-IRST, Italy David A. Bader, University of New Mexico, US Mark Baker, University of Portsmouth, UK Emilio Luque, Universitat Autònoma de Barcelona, Spain Akhil Sahai, HP Labs, US Gregor von Laszewski, Argonne National Lab, US Fethi Rabhi, University of New South Wales, Sydney, Australia Fabrizio Petrini, Los Alamos National Lab, US Kate Keahey, Argonne National Lab, US Sergei Gorlatch, Universität Münster, Germany Brian Tierney, Lawrence Berkeley National Lab, US Rauf Izmailov, NEC Labs, US Stephen J. Turner, Nanyang Technological University, Singapore Savas Parastatidis, University of Newcastle, UK Elias Houstis, University of Thessaly, Greece -- and Purdue University, US Karl Aberer, EPFL, Switzerland Rolf Hempel, DLR, Germany Anne Elster, NTNU, Norway Artur Andrzejak, Zuse Institute Berlin, Germany Jennifer Schopf, Argonne National Laboratory, US John Gurd, University of Manchester, UK Domenico Laforenza, ISTI/CNR, Italy Wolfgang Rehm, TU Chemnitz, Germany Gabriel Antoniu, IRISA, France Beniamino di Martino, Seconda Università di Napoli, Italy Frank Z. Wang, Cranfield University, UK Daniel S. Katz, JPL/Caltech, US Nigel Thomas, University of Newcastle, UK Moustafa Ghanem, Imperial College, London, UK Kenneth Hurst, JPL/Caltech, US From eno at dorsai.org Thu Feb 17 23:25:22 2005 From: eno at dorsai.org (Alpay Kasal) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] powering up 18 motherboards Message-ID: <0IC30057AJ4KZ3@mta7.srv.hcvlny.cv.net> I think you hit the nail on the head Bari, I'm in Brooklyn, New York. So I suppose it should be 15amp circuits but every circuit breaker in the box is clearly a 10.
This is an old house; it seems like any renovations over the years have been only for aesthetics. The wiring in the walls is probably disintegrating - that would explain why the new-looking circuit breakers are rated for 10 amps. I think I can get use of 3 circuits, which gives me some room to play with all the nodes and hopefully the assortment of switches and power supply. I have to figure out what the draw will be on the rest of the equip. Now where the hell am I going to plug in this air conditioner???? Any advice on how to gang up 3 10amp circuits into a single 30amp? Sounds like a job for an electrician? Thanks for the help guys. Alpay -----Original Message----- From: Bari Ari [mailto:bari@onelabs.com] Sent: Thursday, February 17, 2005 8:00 PM To: Jim Lux Cc: Dean Johnson; Alpay Kasal; beowulf@beowulf.org Subject: Re: [SPAM] [Beowulf] powering up 18 motherboards Jim Lux wrote: > A 10 amp circuit would be highly unusual in the U.S., but might be > common practice elsewhere. In the U.S., a 15 amp circuit is standard. I thought this was odd when I first read this as well. This may be a case where, to save dollars or in rehabbing old buildings, more than 3 current-carrying conductors share one raceway and the circuit protection has to be derated. In this case it may be that they ran more than three #14 current-carrying conductors (as defined by the NEC) in the same raceway and had to derate the usual 15 amp circuit protection down to 10 amps.
-Bari Ari From emac at cybergps.net Thu Feb 17 23:36:34 2005 From: emac at cybergps.net (Eric Machala) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] powering up 18 motherboards References: <0IC300MSUEDY78@mta10.srv.hcvlny.cv.net> Message-ID: <00ed01c5158c$92df7550$6e45a8c0@masstivy> That's not true: a UPS will never draw more from the wall than its trickle-charge setting allows. The trickle rate can be set to a cost-effective level; only when the battery has been drained by an overload does the UPS draw roughly double the trickle rate, because it wants to restore a full battery before the next power loss. I could get the actual specs on this, but at your load, if you were at 50-65% of capacity, I'm sure there would be no spikes in power draw beyond the normal trickle. ----- Original Message ----- From: "Alpay Kasal" To: "'Jim Lux'" ; "'Dean Johnson'" Cc: Sent: Friday, February 18, 2005 12:43 AM Subject: RE: [Beowulf] powering up 18 motherboards > James Lux, thanks for the extremely useful explanation. Btw, I'm in > Brooklyn, NY. 120volts, 60cycles, regular AC power. I don't know the gauge > of the wiring in the walls but (as mentioned in another response just now) > I > suspect it is old wiring and is the reason for the strange 10amp circuit > breakers. > > I looked at the x10 modules. Seems like it could be very useful, just > script > all of them from my headend. For now I'm going to try to handle the > power-on > sequence myself. I figured I could steal 3 10amp circuits from the house. > > Follow me... Turn on 4 nodes (on 1 strip) which will peak at 5.2amps. let > that settle down to a steady 3.48amps and hit another strip of 4 nodes. > Total draw while the 2nd batch is starting is 8.68amps. It should steady > at > 6.96. I then have room to turn 1 more node on. Then one more after that. A > 4 > step process to get 10 nodes powered up without going over 10amps. Perform > the same exact steps on a 2nd circuit.
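[Alpay's staggered power-on plan can be sanity-checked in a few lines, using his own per-node figures (1.3 A peak during spin-up, 0.87 A steady); the assumption that every node behaves identically is mine:]

```python
# Check a staggered power-on plan against a breaker rating, using the
# figures from the thread: 1.3 A peak per node while spinning up,
# 0.87 A steady.  Assumes all nodes draw identically.
PEAK, STEADY, BREAKER = 1.3, 0.87, 10.0

def worst_transient(batches):
    """Worst current seen while powering on batches of nodes in order."""
    settled = 0      # nodes already down to steady-state draw
    worst = 0.0
    for batch in batches:
        worst = max(worst, settled * STEADY + batch * PEAK)
        settled += batch
    return worst

plan = [4, 4, 1, 1]  # the 4-step plan: 10 nodes on one circuit
print(f"worst transient: {worst_transient(plan):.2f} A")  # 9.13 A, under 10
```

[The same function shows why one big bang fails: worst_transient([10]) is 13 A, well over the breaker.]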
Annoying but possible without > spending anymore money. > > I was really hoping a decent $200-300 UPS would come to the rescue here. > Oh > well. > > I just had a thought... I planned on making use of wake-on-lan. I can just > start sending jobs to the whole network though if all of it is asleep, I'd > have to still be careful of the powerup-sequence. Grrrr. Maybe a script to > perform WOL before starting any number crunching. > > Boy did I take nice big fat electrical lines for granted in the past! > > Alpay > > > -----Original Message----- > From: Jim Lux [mailto:James.P.Lux@jpl.nasa.gov] > Sent: Thursday, February 17, 2005 7:18 PM > To: Dean Johnson; Alpay Kasal > Cc: beowulf@beowulf.org > Subject: Re: [SPAM] [Beowulf] powering up 18 motherboards > > No, the UPS won't help. It might make things worse, because as you flip > on > all that load, the voltage will sag, causing the UPS to turn on, which > then > might trip from the overcurrent (assuming you're not out buying a 2kW > UPS). > > You could use the X-10 type (aka Plug n Power) remote controlled relays > (don't use Lamp modules.. you need Appliance modules, which are relays > inside). > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From ole at scali.com Fri Feb 18 02:20:04 2005 From: ole at scali.com (Ole W. 
Saastad) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Re: Home beowulf - NIC latencies In-Reply-To: <200502162001.j1GK0vgt019156@bluewest.scyld.com> References: <200502162001.j1GK0vgt019156@bluewest.scyld.com> Message-ID: <1108722004.16480.36.camel@pc-2.office.scali.no> Hi, with all the argument about performance of so-called Swiss Army Knife (SAK) MPIs, I have uploaded four runs of the HPCC benchmark to the HPCC benchmark web site to show the performance of a single cluster with a single SAK MPI running with four different interconnects: GigaBit Ethernet (tcp), SCI, Myrinet and InfiniBand. The results can be found at: http://icl.cs.utk.edu/hpcc/ Look for the Dell PowerEdge 2650 cluster with 32 CPUs. The results show that it is possible to have a SAK MPI that shows acceptable performance for multiple interconnects. The SAK MPIs are of great value for the application vendors as they are free from the extra work involved with a new MPI implementation for every interconnect. In addition an application can be moved without changes from one cluster with one interconnect to another cluster with yet another interconnect. -- Ole W. Saastad, Dr.Scient. Manager Cluster Expert Center dir. +47 22 62 89 68 fax. +47 22 62 89 51 mob.
+47 93 05 74 87 ole@scali.com Scali - www.scali.com High Performance Clustering From vaidya.anand at gmail.com Fri Feb 18 19:25:19 2005 From: vaidya.anand at gmail.com (vaidya.anand@gmail.com) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Entertaining article on cooling hot processors on IBM developerWorks Message-ID: <200502181925.20179.ar3107@gmail.com> http://www-106.ibm.com/developerworks/library/pa-chipschall5/?ca=dnl From bari at onelabs.com Fri Feb 18 06:28:37 2005 From: bari at onelabs.com (Bari Ari) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] powering up 18 motherboards In-Reply-To: <0IC30057AJ4KZ3@mta7.srv.hcvlny.cv.net> References: <0IC30057AJ4KZ3@mta7.srv.hcvlny.cv.net> Message-ID: <4215FB95.5030908@onelabs.com> Alpay Kasal wrote: > Any advice on how to gang up 3 10amp circuits into a single 30amp? Sounds > like a job for an electrician? Thanks for the help guys. Electrical codes don't allow paralleling conductors to increase the current handling capacity for small circuits like yours. It looks like you'll have to try an idea like Jim's and use a staggered startup or install a new 30A circuit. If you look around Brooklyn you'll see lots of this done on the outside of homes in order to run large window A/C's. -Bari From camm at enhanced.com Fri Feb 18 07:26:46 2005 From: camm at enhanced.com (Camm Maguire) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <3.0.32.20050216174455.0106ea70@pop.xs4all.nl> <20050217000455.GF2018@greglaptop.internal.keyresearch.com> Message-ID: <54k6p5nb2h.fsf@intech19.enhanced.com> Greetings! "Robert G. Brown" writes: > But maybe this is all too complicated, or doesn't belong in the standard > per se. 
It is indeed like the ATLAS thing, but then, I think that ATLAS > is sheer genius although it is also cumbersome and clunky to build...;-) > I just dream of the day that ATLAS-like runtime optimization isn't so > clunky and is based on tools that create tables of microbenchmark > numbers that ARE sufficiently accurate and rich to achieve > near-optimization without running a build loop that sweeps and searches > a high-dimensional space...:-) > I'm of the opinion that the bulk of the benefit can be had by providing alternate binaries tuned for the coarse grained cpu differentials - isa extension, cache, etc. -- coupled with some smarts in ld.so to select the proper version at runtime depending on the running cpu. We do something like this, (alternate isa extension only), in the Debian atlas packages, which are now at the base of quite a large application tree -- no recompilation required. Am always appreciative of other thoughts on these matters. Take care, > rgb > > -- > Robert G. Brown http://www.phy.duke.edu/~rgb/ > Duke University Dept. of Physics, Box 90305 > Durham, N.C. 27708-0305 > Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > -- Camm Maguire camm@enhanced.com ========================================================================== "The earth is but one country, and mankind its citizens." -- Baha'u'llah From streich at uwm.edu Fri Feb 18 08:44:16 2005 From: streich at uwm.edu (streich@uwm.edu) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] A hello, and an introduction Message-ID: <1108745056.42161b60d311c@panthermail.uwm.edu> Hello all, I'm new to the list and just thought I'd introduce myself, as will probably be posting to the list a bit. I'm a system administrator for a Beowulf cluster at UW-Milwaukee. 
It's a 22-node 2.4GHz Intel cluster running Linux that is dedicated to studying clouds (using wrf and COAMPS (MPI-based software)). It's a student job, and a lot of fun. I'm a Computer Science major, and have all the Computer Science coursework done (just have a few math classes left). I'm starting to think about Grad school and Master's Thesis stuff, though that is a little way off. Along this vein, if anyone has any suggestions as to hot research topics a CS major with access to a Beowulf cluster with a few spare clock cycles might be interested in, please feel free to send them to me. ;) I've only been admin-ing the cluster for about a year, so I don't know how much I'll be able to help people with questions... But I may throw an idea out once in a while. I suppose here I may be asking more than answering the questions, as it seems a lot of you have quite a bit of experience with larger clusters. - Jeremy From mit2005 at vreme.yubc.net Fri Feb 18 08:57:09 2005 From: mit2005 at vreme.yubc.net (IPSI-2005 Italy and USA) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Invitation to Italy and USA 2005; c/bb Message-ID: <200502181657.j1IGv9uW024833@vreme.yubc.net> Dear potential Speaker: On behalf of the organizing committee, I would like to extend a cordial invitation for you to attend one or both of the upcoming IPSI BgD multidisciplinary, interdisciplinary, and transdisciplinary conferences. The first one will take place in Cambridge, Massachusetts, USA: IPSI-2005 USA Hotel@MIT, Cambridge (arrival: 7 July 05 / departure: 10 July 05) New deadlines: 20 February 05 (abstract) / 15 April 05 (full paper) The second one will take place in Loreto Aprutino, Italy: IPSI-2005 ITALY Hotel Castello Chiola (arrival: 27 July 05 / departure: 1 August 05) New deadlines: 20 February 05 (abstract) / 15 April 05 (full paper) All IPSI BgD conferences are non-profit. They bring together the elite of world science; so far, we have had seven Nobel Laureates speaking at the opening ceremonies.
The conferences always take place in some of the most attractive places of the world. All those who come to IPSI conferences once, always love to come back (because of the unique professional quality and the extremely creative atmosphere); lists of past participants are on the web, as well as details of future conferences. These conferences are in line with the newest recommendations of the US National Science Foundation and of the EU research sponsoring agencies, to stress multidisciplinary, interdisciplinary, and transdisciplinary research (M+I+T++ research). The speakers and activities at the conferences truly support this type of scientific interaction. One of the main topics of this conference is "E-education and E-business with Special Emphasis on Semantic Web and Web Datamining" Other topics of interest include, but are not limited to: * Internet * Computer Science and Engineering * Mobile Communications/Computing for Science and Business * Management and Business Administration * Education * e-Medicine * e-Oriented Bio Engineering/Science and Molecular Engineering/Science * Environmental Protection * e-Economy * e-Law * Technology Based Art and Art to Inspire Technology Developments * Internet Psychology If you would like more information on either conference, please reply to this e-mail message. If you plan to submit an abstract and paper, please let us know immediately for planning purposes. Note that you can submit your paper also to the IPSI Transactions journal. Sincerely Yours, Prof. V. Milutinovic, Chairman, IPSI BgD Conferences * * * CONTROLLING OUR E-MAILS TO YOU * * * If you would like to continue to be informed about future IPSI BgD conferences, please reply to this e-mail message with a subject line of SUBSCRIBE. If you would like to be removed from our mailing list, please reply to this e-mail message with a subject line of REMOVE. From laytonjb at charter.net Sat Feb 19 04:53:55 2005 From: laytonjb at charter.net (Jeffrey B. 
Layton) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Mare Nostrum (not quite COTS) In-Reply-To: References: Message-ID: <421736E3.3050001@charter.net> Mark Hahn wrote: >>When the power of MareNostrum is unleashed later this year, it will be at the >> >> > >hmm, to me, the bigger the computer, the more money is evaporating >for every day between delivery and full user utilization. > > This is a very interesting comment and one I agree with. Does anyone care to post numbers of some sort regarding the size of the cluster (and the size of the storage system) and the time to get the system up and stabilized? Thanks! Jeff From hahn at physics.mcmaster.ca Sat Feb 19 06:23:20 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] powering up 18 motherboards In-Reply-To: Message-ID: > Another, much cheaper option, would be to set the slave > node BIOS to use "Wake on LAN" (if it works on your systems) I really, really like lan-IPMI. not only do you get a nice way to turn on/off/reset machines remotely, but you also can query their internal sensors. not to mention that it's an open standard. my main experience with it is a cluster of HP DL145's, which are rebadged Celestica white-ish-box dual-opterons. I've heard that lan-IPMI is also available for real whitebox (tyan, supermicro), but have never managed to get hands on. IPMI-like functionality is one of those odd places where customers are often hurt by vendors who produce proprietary, less-interoperable mechanisms. sort of embrace-extend-bastardize-rebrand. product teams should not listen to marketing/business people... From rgb at phy.duke.edu Sat Feb 19 11:49:28 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? 
In-Reply-To: References: Message-ID: On Sat, 19 Feb 2005, Mark Hahn wrote: > > As our Dean of A&S recently remarked, if there aren't any checks and > > balances or cost-equity in funding and installing clusters, they may > > well continue to grow nearly exponentially, without bound (Duke's > > I find that most faculty who have compute needs (and funding) will > seriously consider buying into a shared facility instead. that's our > (SHARCnet's) usual pitch: let us help you spend your grant, and we'll > give you first cut at that resource, but otherwise take all the pain > off your hands. most people realize that running a cluster is a pain: > heat/noise, but more importantly the fact that it soaks up tons of > the most expensive resource, human attention. do you want your grad > students spending even 20% of their time screwing around with cluster > maintenance? > > not to mention the fact that most computer use is bursty, and therefore > very profitably pooled. a shared resource means that a researcher > can burst to 200 cpus, rather than just the 20 that his grant might > have bought. and after the burst, someone else can use them... A model that we have emulated at Duke, actually, and I mean emulated literally. You'll recall that we had that offline discussion about your share cluster operation a few years ago -- that was a fairly important component that I took into various discussions with provosty-level folks that led ultimately to the creation of Duke's CSEM. So we owe you a debt of gratitude, and possibly even a beer (standard compensation for cluster support services on list, is it not:-). However we, like many institutions, are not pure anything -- the high-cycle-consuming departments (including physics, but also CPS, chemistry, statistics, biology, econ) still run their own clusters, although individuals with BIG needs are encouraged to join the campus grid project and share their unused cycles. 
I feel like we're likely in a gradual and purely voluntary transition to a point where this model will ultimately dominate and perhaps even become universal because of the various economies of scale. Frankly, even where people DON'T participate in the sharing of resources there will have to be colocation, as physical space with adequate infrastructure is both scarce and expensive. So physics may become part of the campus grid, etc, although the way things look now there will continue to be a half dozen separate server-class spaces all over campus for the hardware. I actually feel that this "decentralized centralization" is a good thing -- we have historical reasons to strongly mistrust overcentralization of resources at Duke. One gets economies of scale, perhaps, but at the expense of permitting empire builders in the bureaucracy to entrench and start to dictate policy, often to groups who are a hell of a lot smarter and more cost-effective when left to their own devices. I mean, we'd still be using mainframes if it were up to the folks that run centralized mainframe compute centers. Hell, at Duke I believe we still ARE running some mainframes. Beowulfs themselves are another example of an innovation that could only have come about from the bottom up. So we're trying to operate according to a model where individual enterprise and innovation aren't squelched and "power" (planning and control) aren't totally centralized, but we still can centralize primary infrastructure (the network, the main server rooms) and help coordinate the planning and operation and sharing of cluster resources without trying to dictate them. This isn't as difficult as it might sound, at a University. All the faculty are innately mistrustful, cynical, and jealous of control, and at the same time tend to be really bright and recognize the benefits of cooperation on some issues.
We've also been blessed for the last 5-8 years with some really good people in Arts and Science Network Administrators group (from Melissa Mills as the assistant dean at the top, through the actual sysadmins down to, well, me:-), at the Provost level (leading to the birth of CSEM), and in the Office of Information Technology (OIT, which runs the campus network and student/academic computing). When you have enlightened leadership that ISN'T empire building, things are good and even if you try something and it turns out to be wrong it doesn't matter -- you just fix it or try something else and move on. > > to hold them, the power to run them, and the people to operate them, all > > grow roughly linearly with the number of nodes. This much is known. > > well, operator cost scales linearly, but that line certainly does not > pass through zero, and is nearly flat (a 100p cluster takes almost the > same effort as a 200p one.) Yeah, yeah, yeah -- it's an irregular scale with flat patches and a minimum buy-in. However, the discussion concerns the planning of a 1000 node, 2000 CPU cluster, and at that point I think you can start to talk meaningfully about the number of nodes per support person in planning discussions where you recognize that you CAN'T run 1000 nodes with the same effort as 100 nodes. The hardware support component most definitely scales per node, and that large a cluster could eat a person alive with hardware maintenance as the cluster ages or if you hit a bad patch. Hardware support is in some sense the scale limiting resource for a well-designed cluster (whether you provide it locally or contract it out), especially now that PXE, yum, etc have made it possible for a single person to install and maintain 1000 nodes on the SOFTWARE side of things. But to avoid going into all this is why I used the word "roughly". Perhaps I should have said "in the limit of a large number of machines". 
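[The per-node hardware component is easy to put rough numbers on. With invented but plausible annual failure rates per component (real AFRs vary widely by vendor, load, and room temperature), a 1000-node cluster produces a steady trickle of repair events:]

```python
# Rough estimate of hardware events on a 1000-node cluster.  The
# per-component annual failure rates below are invented for
# illustration, not measured.
afr = {"disk": 0.03, "power supply": 0.02, "fan": 0.02, "board/RAM": 0.02}

nodes = 1000
events_per_year = nodes * sum(afr.values())
print(f"~{events_per_year:.0f} hardware events/year, "
      f"~{events_per_year / 52:.1f} per week")
```

[At roughly 90 events a year, a couple of nodes a week need hands-on attention even before the machines age, which is why hardware support rather than software sets the nodes-per-person limit.]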
> > Finding out isn't trivial -- it involves running down ALL the clusters > > on campus, figuring out whom ALL those nodes "belong" to, determining > > ALL the grant support associated with all those people and projects and > > here at least, the office of research services sees all funding traffic, > and so is sensitized to the value of cluster pooling. the major funding > agencies have also expressed some desire to see more facility centralization, > since the bad economics of a bunch of little clusters is so clear... Yes, ditto here as well, with exceptions as noted above. Although granting agencies can swing both ways -- they like to see resource and cost sharing, but they are also jealous of resource "ownership" and don't want to fund project A (including cluster) and then find that that cluster has been hijacked by project B, possibly run by somebody else entirely. There's also a WIDE range of what the different agencies view as reasonable "cost sharing". What Duke has tried to do is use a thoughtful model and not a one-size-fits-all plan with mandatory participation. The one thing that I think Duke still really needs is the detailed CBA of existing cluster operations. I suspect that they'd find that they are remarkably efficient as is, but I'm not certain. As always, a pleasure to read what you write. rgb > > regards, mark hahn. > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From john.hearns at streamline-computing.com Sun Feb 20 00:53:21 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] powering up 18 motherboards In-Reply-To: References: Message-ID: <1108889601.5617.3.camel@Vigor11> On Sat, 2005-02-19 at 09:23 -0500, Mark Hahn wrote: > I really, really like lan-IPMI. not only do you get a nice way > to turn on/off/reset machines remotely, but you also can query > their internal sensors. not to mention that it's an open standard. > my main experience with it is a cluster of HP DL145's, which are > rebadged Celestica white-ish-box dual-opterons. I've heard that > lan-IPMI is also available for real whitebox (tyan, supermicro), > but have never managed to get hands on. Tyan needs an extra card, which isn't included by default, if I'm not wrong. We have MSI systems which have IPMI. Also don't forget the locate-light capability. It is SO useful if you have 500 nodes, and have to ask someone to swap a component on a problem node, for instance. From patrick at myri.com Sun Feb 20 23:46:59 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Mare Nostrum (not quite COTS) In-Reply-To: <3.0.32.20050216131329.0106f960@pop.xs4all.nl> References: <3.0.32.20050216131329.0106f960@pop.xs4all.nl> Message-ID: <421991F3.3060803@myri.com> Hi Vincent, Vincent Diepeveen wrote: > Which myrinet cards are in Mare Nostrum? Single-link PCI-X D cards, with 4 MB of SRAM to be comfortable with multiple routes between a lot of nodes. The NICs for the BladeCenter have a specific format: smaller than regular half-size PCI boards, the NIC is parallel to the mainboard, and it has no fiber transceiver (it goes to the OPM via the BladeCenter backplane). Available only via IBM. > What one way pingpong latency can it get from 1 end of the machine to the > other end of the machine? 
I don't know, I didn't work on this machine. It would depend on the number of crossbars and lengths of fiber. It uses the new switches with 32-port crossbars, so the max path should not be longer than 7 hops if my memory is right. At one time, the PCI-X on the JS-20 blades was clocked at 100 MHz; I don't know if it has been bumped to 133 MHz since. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Mon Feb 21 00:02:44 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050216141222.00924be0@pop.xs4all.nl> References: <3.0.32.20050216141222.00924be0@pop.xs4all.nl> Message-ID: <421995A4.6030801@myri.com> Vincent Diepeveen wrote: > A problem of MPI over DSM-type forms of parallelism has been described > very well by Chrilly Donninger with respect to his chess program Hydra, which > runs on a few nodes with MPI: > > For every write: > > MPI_Isend(....) > MPI_Test(&Reg,&flg,&Stat) > while(!flg) { > Hydra_MsgPending(); // Important: read in messages and process them > while waiting on completion. Otherwise our own input buffer can overflow > // and we get a deadlock. > MPI_Test(&Reg,&flg,&Stat); > } > > The above is simply dead slow and delays the software. You are effectively waiting for the send completion, and that can require synchronization with the receive side if the message size is large enough. > In a DSM model like Quadrics you don't have all these delays. You don't have these delays with message passing if you do it differently. You can post multiple sends and wait on all of them at the same time, or post a send and then compute the next step before waiting. RMA would remove the synchronization with the remote side, but you need to know where to Put the data over there. 
The memory on the NIC is not related to Remote Memory Access. The SRAM is used to host the firmware code and some data such as the routes, physical addresses, the name of the captain, whatever. More memory means that you can fit more routes (on a 7-hop topology, you need to store 8 bytes -- 7 routing bytes and a length -- for every route, and you need 8 different routes for each destination, per link, to have an effective route dispersion scheme) or do something special (you can write your own firmware if you are crazy or you know what you are doing). 2MB is fine for most cases. > If so which library can I download for that for myri cards? GM supports RMA (PUT and GET), but do not expect the same latency as Quadrics. MX does not have one-sided operations available yet, but the latency is much better. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From ashley at quadrics.com Mon Feb 21 03:22:18 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Wed Nov 25 01:03:49 2009 Subject: [Beowulf] Mare Nostrum (not quite COTS) In-Reply-To: <421736E3.3050001@charter.net> References: <421736E3.3050001@charter.net> Message-ID: <1108984938.6139.4.camel@localhost.localdomain> On Sat, 2005-02-19 at 07:53 -0500, Jeffrey B. Layton wrote: > Mark Hahn wrote: > > >>When the power of MareNostrum is unleashed later this year, it will be at the > >> > >> > > > >hmm, to me, the bigger the computer, the more money is evaporating > >for every day between delivery and full user utilization. > > > > > > This is a very interesting comment and one I agree with. > Does anyone care to post numbers of some sort regarding > the size of the cluster (and the size of the storage > system) and the time to get the system up and stabilized? In my experience it's not a lot to do with size: the first one of a given size is always a new experience, but the second and third ones onwards are fairly safe. 
Far more relevant are the exact hardware used and prior experience with it: expect as many problems making a 64-way machine with a new motherboard/file-system/integrator as you would ramping the same system up to 1024-way. Ashley, From lindahl at pathscale.com Mon Feb 21 21:08:24 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] The Case for an MPI ABI Message-ID: <20050222050824.GA2195@greglaptop.attbi.com> Those of you who were at the Open IB conference last week saw me give a talk entitled "The Case for an MPI ABI". It seems that Patrick and I have been channeling each other AGAIN; see what happens when I move to California? The first question is: Does an ABI provide enough benefit for people to care? To care enough to sit on a committee? If the answer is "yes", then I think we'll have one. The minimum technical issues revolve around the contents of and the names of shared libraries. The amount of work for MPICH or OpenMPI to support that part of an ABI is modest. If we wanted to go farther, I have a strawman proposal which addresses a generic startup procedure which would allow user applications, MPI implementations, and queue systems to all live in peace and harmony. This talk: http://www.openib.org/docs/oib_wkshp_022005/mpi-abi-pathscale-lindahl.pdf mostly talks about why we need an ABI, who wins and loses as a result of having one, and the pieces that could be in it. Please give it a look. 
-- greg From eugen at leitl.org Tue Feb 22 08:32:37 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] [Clusters_sig] The kernel and cluster issues (fwd from cherry@osdl.org) Message-ID: <20050222163237.GT1404@leitl.org> ----- Forwarded message from John Cherry ----- From: John Cherry Date: Tue, 22 Feb 2005 08:29:47 -0800 To: clusters_sig@osdl.org Cc: Subject: [Clusters_sig] The kernel and cluster issues X-Mailer: Evolution 2.0.1 LinuxWorld Conference and Expo (August 8-11) has a content track for "Kernel and Cluster Issues". http://www.linuxworldexpo.com/live/12/speakers//callforpapers Is anyone in this forum planning to present a paper for this conference? This may be a good venue to let the "cluster community" have a unified voice for proposing a minimal set of common cluster services/hooks for the kernel. OLS may be another good forum for this. John _______________________________________________ Clusters_sig mailing list Clusters_sig@lists.osdl.org http://lists.osdl.org/mailman/listinfo/clusters_sig ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050222/8b066a4a/attachment.bin From lusk at mcs.anl.gov Tue Feb 22 09:06:05 2005 From: lusk at mcs.anl.gov (Rusty Lusk) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] The Case for an MPI ABI In-Reply-To: <20050222050824.GA2195@greglaptop.attbi.com> References: <20050222050824.GA2195@greglaptop.attbi.com> Message-ID: <20050222.110605.03066059.lusk@localhost> From: Greg Lindahl Subject: [Beowulf] The Case for an MPI ABI Date: Mon, 21 Feb 2005 21:08:24 -0800 > This talk: > > http://www.openib.org/docs/oib_wkshp_022005/mpi-abi-pathscale-lindahl.pdf > > mostly talks about why we need an ABI, who wins and loses as a result > of having one, and the pieces that could be in it. Please give it a > look. One piece you include is the replacement of the non-portable mpirun with an mpistart with standard arguments. You might note that the MPI-2 forum addressed this issue with the specification of an mpiexec with standard arguments. MPICH2 implements it. Rusty From rcmanglekar at rediffmail.com Fri Feb 18 22:38:23 2005 From: rcmanglekar at rediffmail.com (Rahul Manglekar) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] How to run services/daemon on Cluster. Message-ID: <20050219063823.10268.qmail@webmail17.rediffmail.com> Hi all, I need more processing resources for services/daemons on my server machine. I have service that consumes much cpu resources, like mysqld and httpd etc., it consumes around 90-94% processor(CPU) usage. Can we build cluster, that will share all client/nodes processor usage to Server machine. So that services/daemons (eg., mysqld,apache etc.) running on server, will get more processing power. Can any/one guide me..! Thanks in advance. Regards.., -- Rahul. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20050219/864ef698/attachment.html From ipsitrans at vreme.yubc.net Sat Feb 19 00:46:10 2005 From: ipsitrans at vreme.yubc.net (IPSI Transactions Special Issues) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] Call for IPSI Transactions Special Issues in 2005/6; c/ba Message-ID: <200502190846.j1J8kArD008261@vreme.yubc.net> Dear potential Speaker: We are pleased to inform you that both IPSI Transactions journals are planning some special issues in late 2005 and early 2006, and you are welcome to submit your paper(s) until the deadlines listed below! IPSI Transactions on Internet Research: March 31, 2005 - Special Issue on E-Education: Concepts and Infrastructure June 30, 2005 - Special Issue on E-Business: Concepts and Infrastructure IPSI Transactions on Advanced Research: March 31, 2005 - Special Issue on the Research with Multidisciplinary Elements June 30, 2005 - Special Issue on the Research with Interdisciplinary Elements Each submitted paper first undergoes the editor review, and those that pass this first stage are sent to 12 external experts for a rigorous review; decisions are made after at least 6 external reviewers respond! The review is free of charge, but the authors of the accepted papers are expected to pay the publication fee of E400 per paper (if 4 or 5 or 6 pages of the TIR/TAR format), and the additional fee of E100 per page, for each extra page, up to the maximum of 10 pages. Rigorous reviewing is the major strength of IPSI journals, and is the major contributor to their high quality! Soft copies of the existing issues of TIR and TAR can be seen on the web, and hard copies can be obtained on special request by email, as indicated on the web, where you can find all information! Sincerely yours, Prof. Dr. Veljko Milutinovic, Editor-in-Chief P.S. If you need additional information, please reply to this e-mail. 
From eno at dorsai.org Sat Feb 19 02:58:59 2005 From: eno at dorsai.org (Alpay Kasal) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] powering up 18 motherboards In-Reply-To: <00ed01c5158c$92df7550$6e45a8c0@masstivy> Message-ID: <0IC500I3ZNOKNN@mta6.srv.hcvlny.cv.net> Thanks to all for the info over the past few days. I decided to write a WOL app that turns the machines on from a windows box on the network with a user-defined delay. I'll stick it on the web if anyone is interested in taking a look at it. I plan to add the ability to selectively shut machines down too (since the whole thing doesn't do much good from a cold boot). @Eric Machala and David Mathog Thanks for the info on the UPS's. I will be looking into some of the bigger refurbed APC's on eBay (I don't have 240v here though). I'll leave some safety room on the circuits for peaks; now I figure I'd want the UPS's to get me through the problems RGB was describing - I don't get many brownouts in NY with our underground cabling but that occasional power hiccup would drive me nuts. @Patrick Michael Kane I must say that I have NOT been cranking away with the 3D apps I'll be using regularly. Just trying to put load on the cpu's for my tests. I'll be properly set up to do everything right over the weekend. I needed to first finish the build-out for cooling and these power issues. I'll be sure to report back with my details. @Jim and RGB and Bari Understood. Loud and clear. No ganging of circuits. I didn't know if it was easy to run them in parallel to support bigger peaks. I won't be touching whatever is behind the circuit breakers by myself. @RGB again... VERY informative stuff about power and PS's. Thank you. 
-Alpay -----Original Message----- From: Eric Machala [mailto:emac@cybergps.net] Sent: Friday, February 18, 2005 2:37 AM To: Alpay Kasal; 'Jim Lux'; 'Dean Johnson' Cc: beowulf@beowulf.org Subject: Re: [Beowulf] powering up 18 motherboards This is not true. The UPS will never draw any more from the wall than it is set to; a UPS has a trickle-charge rate that can be set to a cost-effective level, unless it is overloaded -- in that case it is double the trickle rate, because the system wants to restore a full battery after a power loss. I could get the actual specs on this, but for your actual needed load, if you were at a 50-65% load, I'm sure there would be no spikes in power draw over the normal trickle. From maurice at harddata.com Sat Feb 19 10:00:05 2005 From: maurice at harddata.com (Maurice Hilarius) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] Re: "Off the shelf" (was: Mare Nostrum) In-Reply-To: <200502190603.j1J62SnO003210@bluewest.scyld.com> References: <200502190603.j1J62SnO003210@bluewest.scyld.com> Message-ID: <42177EA5.4070606@harddata.com> Mark Hahn wrote: >it's funny how "off the shelf" means different things to different people. >I consider blades to be a qualitatively different category of hardware >than, say, a tyan motherboard in an AICPC chassis. AFAICT, blades still >run at a premium vs "normal" servers from the same vendor (which is also >at a premium vs whitebox.) > > No Kidding!! If the "blades" use proprietary components, then the useful lifespan of the investment is what? 2 years perhaps? If, OTOH, they truly do use "off the shelf" components, they can be readily upgraded. As you mention, commodity motherboards share standard form factors, and usually power requirements. If one builds a "blade box" using these it is fairly trivial to change out motherboards and CPUs on a scheduled basis, easily doubling, and often tripling, the lifespan of the investment. 
Further, these upgrades may be done incrementally, reducing downtime to nothing in practical terms. Buying a "solution" using proprietary components is the age-old "suckers' game" that has been played by the larger vendors for decades to allow forced obsolescence. Maurice W. Hilarius Hard Data Ltd. maurice@harddata.com From billk01 at metrumrg.com Sun Feb 20 15:41:19 2005 From: billk01 at metrumrg.com (BillKnebel) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] sun grid engine on Scyld beowulf cluster In-Reply-To: <421516B7.3060906@sonsorol.org> References: <4214A0E5.3010804@metrumrg.com> <421516B7.3060906@sonsorol.org> Message-ID: <4219201F.70105@metrumrg.com> Chris, I was able to get grid engine to run on the Scyld cluster using the approach of setting the master (head) node as the submit, admin, and execute host. Unfortunately, starting a set of jobs on the cluster results in all jobs being run on the head node only (if grid engine-only commands are used), or I can integrate the grid engine "qsub" command with some of the Scyld tools to get jobs started and then migrated (to a point) over the cluster. However, I am still running into problems because all of the queueing variables for grid engine read the headnode info, and since all jobs run on the compute nodes, the headnode appears to be always free, which results in all jobs being started at once. This is not ideal. I am waiting on some feedback from Scyld/Penguin computing on some related issues that will hopefully solve some of these problems. Bill Chris Dagdigian wrote: > > I know Grid Engine well but not Scyld so forgive my ignorance if I say > something stupid, and given the level of expertise on this list I'm > quite certain I'm about to make a fool of myself :) > > If Scyld is presenting you with a single system image (ie a single > linux server that can farm out tasks to all those nodes) then you > would install SGE in the same way that you would install it on a big > SMP box: > > 1. 
Install the SGE qmaster and scheduler on the master node > 2. Install the execution host on the master node as well > > You will only have 1 execd per queue but each queue can be configured > with N number of "job slots" which actually control how many jobs can > run at the same time on the same machine. > > Try setting your # of job slots within your single SGE queue to the > number of nodes in your cluster. This is similar to what you would do > on a big SMP machine -- small number of queues each supporting a > decent jobslot count. > > Then submit a bunch of jobs and see if SGE causes the master node to > fall over under load. If not then Scyld is doing its thing behind the > scenes to migrate stuff around to the other nodes. > > -Chris > > > > billk01 wrote: > >> I am in the process of installing SGE on a Scyld beowulf cluster. As >> most people are aware, the Scyld cluster runs a complete OS (linux) only >> on the master node and the compute nodes are simply for executing. >> During the SGE install, it requires adding the compute nodes as execute >> hosts. I do not understand how to do this given the current setup of a >> scyld cluster since you can't "login" to the nodes to execute the >> install script. The script does exist on an NFS shared directory >> (cluster wide). Has anybody else run into this problem? >> > > > From cflau at clustertech.com Sun Feb 20 19:01:16 2005 From: cflau at clustertech.com (John Lau) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] MPICH question Message-ID: <1108954876.15965.16.camel@cattail.clustertech.com> Hi, I have a question on the number of processes spawned by MPICH. I am using MPICH 1.2.5.2 --with-device=ch_p4. When I use mpirun -np 4 to start an MPI program, a total of 8 processes will be spawned, but only 4 processes have load on the CPUs. I would like to know if this is the correct behavior of MPICH, and what the use of the 4 idle processes is. Thank you. 
Best regards, John Lau -- John Lau Chi Fai Center For Large-Scale Computation Tel: (852) 2994-3727 Fax: (852) 2994-2101 From brian at cypher.acomp.usf.edu Mon Feb 21 06:22:39 2005 From: brian at cypher.acomp.usf.edu (Brian R Smith) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] A hello, and an introduction In-Reply-To: <1108745056.42161b60d311c@panthermail.uwm.edu> References: <1108745056.42161b60d311c@panthermail.uwm.edu> Message-ID: <1108995759.24297.41.camel@daemon> Hey Jeremy, It's good to see another student admin at a university on here. Welcome to the list. There are a lot of top-notch people on here that you can learn a lot from. I've been admining at my univ. for about 3 years now and plan on doing so even after I graduate. With a C.S. background, you'll probably find lots of interesting things involved with Cryptography or Image/Video processing. I'm working on a video compression format right now and will likely write up a parallel encoder for AVI's into my format. Maybe my boss will post and give you some ideas on what to research as he's working on his PhD and probably has a better idea than I do on what you can accomplish on a cluster as a C.S. major. And I'm sure RGB could come up with some mind-blowers if you are really up to the task. Good luck and welcome to the list. Brian Smith On Fri, 2005-02-18 at 10:44 -0600, streich@uwm.edu wrote: > Hello all, > > I'm new to the list and just thought I'd introduce myself, as will probably be > posting to the list a bit. I'm a system administrator for a Beowulf cluster at > UW-Milwaukee. It's a 22 node 2.4GHz Intel cluster running Linux that is > dedicated to studying clouds (using wrf and COAMPS (MPI based software)). It's > a student job, and a lot of fun. I'm a Computer Science major, and have all > the Computer Science course done (just have a few math classes left). > > I'm starting to think about Grad school and Master Thesis stuff, though that is > a little way off. 
Along this vein, if anyone has any suggestions as to hot > research topics a CS major with access to a few spare clock cycles Beowulf > cluster might be interested in, please feel free to send them to me. ;) > > I've only been admin-ing the cluster for about a year, so I don't know how much > I'll be able to help people with questions... But I may throw an idea out once > in a while. I suppose here I may be asking more than answering the questions, > as it seems a lot of you have quite a bit of experience with larger clusters. > > - Jeremy > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- _______________________________________ | Brian R Smith | | Systems Administrator | | Research Computing Core Facility | | University of South Florida | | Phone: 1(813)974-1467 | | 4202 E Fowler Ave, LIB 613 | _______________________________________ From mark.westwood at ohmsurveys.com Tue Feb 22 00:23:42 2005 From: mark.westwood at ohmsurveys.com (Mark Westwood) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] The Case for an MPI ABI In-Reply-To: <20050222050824.GA2195@greglaptop.attbi.com> References: <20050222050824.GA2195@greglaptop.attbi.com> Message-ID: <421AEC0E.402@ohmsurveys.com> Greg You make a very persuasive case for an ABI. As an end-user of MPI, and with no ambitions to be anything else, many of the benefits of an ABI you suggest would be very useful. My recent experience of porting our MPI / Fortran codes to other platforms has been that getting the code to compile has been almost trivial (replace 'call flush(6)' by 'call flush_(6)' a few times, that sort of thing) but that getting to grips with the foreign environment (memory management, job submission, job start-up) is a real pain. So, to all you salesmen in the group, come back and try to sell me the ABI when it's ready. Regards Mark Greg Lindahl wrote: > Those of you who were at the Open IB conference last week saw me give > a talk entitled "The Case for an MPI ABI". It seems that Patrick and I > have been channeling each other AGAIN; see what happens when I move to > California? > > The first question is: Does an ABI provide enough benefit for people > to care? To care enough to sit on a committee? > > If the answer is "yes", then I think we'll have one. The minimum > technical issues revolve around the contents of and the names > of shared libraries. The amount of work for MPICH or OpenMPI to > support that part of an ABI is modest. > > If we wanted to go farther, I have a strawman proposal which addresses > a generic startup procedure which would allow user applications, MPI > implementations, and queue systems to all live in peace and harmony. 
> > This talk: > > http://www.openib.org/docs/oib_wkshp_022005/mpi-abi-pathscale-lindahl.pdf > > mostly talks about why we need an ABI, who wins and loses as a result > of having one, and the pieces that could be in it. Please give it a > look. > > -- greg > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > -- Mark Westwood Parallel Programmer OHM Ltd The Technology Centre Offshore Technology Park Claymore Drive Aberdeen AB23 8GD United Kingdom +44 (0)870 429 6586 www.ohmsurveys.com From cousins at limpet.umeoce.maine.edu Tue Feb 22 11:18:19 2005 From: cousins at limpet.umeoce.maine.edu (Steve Cousins) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] RAID storage: Vendor vs. parts In-Reply-To: <200502212000.j1LK0B42019265@bluewest.scyld.com> Message-ID: I'm in the process of getting a couple 16 Bay 6.4 TB RAID units. The vendors who have given me quotes have a SATA-SCSI version for around $14,000 each. I can get a similarly equipped StorCase unit for around $11,000. With the StorCase I envision that I'd be more on my own if anything went wrong. However, with the cheaper price, I'd be able to buy a spare RAID controller to have on hand in case one of them failed and still save a couple grand. All three of these (the other two are Infortrend and I believe a re-badged Jet) use the same Intel i80321 CPU on their controllers. All are configured the same (with spare PS module and Fan module) except the StorCase doesn't include a 3 year Express Swap warranty. This is why I'd also want to get the spare RAID controller to share between the two units in case one went bad. So, for price comparison, it is probably closer to $28,000 for two "Name-brand" units or $25,500 for two StorCase units with spares of most everything. Does anyone have experience with any or all of these? 
Is it worth the extra money to have a "burned-in" device supported by some company? I know this isn't a Beowulf specific question but it seems that storage is a big part of Beowulfery and I'd bet that a lot of people are looking into similar devices. I hope it is relevant. I'm not after any more quotes from vendors (please don't email or call). I'm happy with the alternatives that I have right now. Thanks, Steve ______________________________________________________________________ Steve Cousins, Ocean Modeling Group Email: cousins@umit.maine.edu Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302 From rgb at phy.duke.edu Tue Feb 22 11:56:45 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] A hello, and an introduction In-Reply-To: <1108995759.24297.41.camel@daemon> References: <1108745056.42161b60d311c@panthermail.uwm.edu> <1108995759.24297.41.camel@daemon> Message-ID: On Mon, 21 Feb 2005, Brian R Smith wrote: > Hey Jeremy, > > Its good to see another student admin at a university on here. Welcome > to the list. There are a lot of top-notch people on here that you can > learn a lot from. I've been admining at my univ. for about 3 years now > and plan on doing so even after I graduate. > > With a C.S. background, you'll probably find lots of interesting things > involved with Cryptography or Image/Video processing. I'm working on a > video compression format right now and will likely write up a parallel > encoder for AVI's into my format. > > Maybe my boss will post and give you some ideas on what to research as > he's working on his PhD and probably has a better idea than I do on what > you can accomplish on a cluster as a C.S. major. And I'm sure RGB could > come up with some mind-blowers if you are really up to the task. I don't know about mind blowers, but my column for I think April's CWM is on "things you can do with your starter cluster". 
It's far from exhaustive, but it provides a bit of direction for this perennial question. As far as RESEARCH topics are concerned, you should probably contact me off the list if you really do want any suggestions. There is a bit of difference between "generally interesting stuff you can do with a beowulf" and "computer science research you can do" with a beowulf. The former is concerned with applications and simple demonstrations -- the latter with tools, algorithms, timings, latency and so forth. I do have one idea for god's own project that I've offered up to the list a few times before (one inspired by work of Jack Dongarra and others, in case you were wondering which god:-). In a nutshell, it is to build a microbenchmarking daemon that could be included in a standard linux distribution and run as an initd-controlled task during startup. During normal operation, it would borrow idle cycles and accumulate benchmark/performance statistics and make them available via a socket interface (probably UDP). Any application, local or remote, could then query any host/node and get a performance profile -- a matrix of key microbenchmark numbers. These in turn could be used to "autotune" both serial and parallel/distributed applications. Once the daemon existed and a fairly standard set of numbers developed for a first cut at its output, one could then start thinking about e.g. rewriting ATLAS so that it autotunes from the daemon results instead of during build, so that a parallel application that partitions does so automatically to take advantage of superlinear speedups that might occur for certain partitionings, and so forth. I'd think that there were all sorts of papers in there, no? And a damn nice GPL toolset in the end that could be tremendously useful to lots of people. And as a final benefit for those seeking fame, it would obviously become THE microbenchmark tool for linux and likely other distros, as it would be the one that is built right in. 
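The socket side of such a daemon is easy to sketch. A minimal, hypothetical version in Python -- the stat names and the JSON-over-UDP wire format below are placeholders of mine for illustration, not part of any actual proposal:

```python
import json
import socket
import threading

# Placeholder numbers standing in for results the daemon would
# accumulate from idle-cycle microbenchmark runs.
BENCHMARKS = {"stream_triad_MBps": 4200.0, "dgemm_mflops": 3100.0,
              "udp_rtt_usec": 55.0}

def serve(host="127.0.0.1", port=0):
    """Answer every UDP datagram with the benchmark matrix as JSON."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))

    def loop():
        while True:
            _, client = sock.recvfrom(1024)
            sock.sendto(json.dumps(BENCHMARKS).encode(), client)

    threading.Thread(target=loop, daemon=True).start()
    return sock.getsockname()  # the (host, port) a client would query

def query(addr, timeout=2.0):
    """What an autotuning application would do: one datagram, one reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    sock.sendto(b"stats?", addr)
    reply, _ = sock.recvfrom(65536)
    return json.loads(reply)
```

UDP suits the one-datagram query/response pattern described above; a real daemon would of course refresh the numbers in the background rather than serve constants.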
In fact, the very first application that uses it would be a simple command line or GUI interface to read and plot its cumulated results... I'd be happy to direct this and maybe even contribute, if any CPS student-geeks out there find this interesting... rgb > > Good luck and welcome to the list. > > Brian Smith > > On Fri, 2005-02-18 at 10:44 -0600, streich@uwm.edu wrote: > > Hello all, > > > > I'm new to the list and just thought I'd introduce myself, as will probably be > > posting to the list a bit. I'm a system administrator for a Beowulf cluster at > > UW-Milwaukee. It's a 22 node 2.4GHz Intel cluster running Linux that is > > dedicated to studying clouds (using wrf and COAMPS (MPI based software)). It's > > a student job, and a lot of fun. I'm a Computer Science major, and have all > > the Computer Science course done (just have a few math classes left). > > > > I'm starting to think about Grad school and Master Thesis stuff, though that is > > a little way off. Along this vein, if anyone has any suggestions as to hot > > research topics a CS major with access to a few spare clock cycles Beowulf > > cluster might be interested in, please feel free to send them to me. ;) > > > > I've only been admin-ing the cluster for about a year, so I don't know how much > > I'll be able to help people with questions... But I may throw an idea out once > > in a while. I suppose here I may be asking more than answering the questions, > > as it seems a lot of you have quite a bit of experience with larger clusters. > > > > - Jeremy > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From mathog at mendel.bio.caltech.edu Tue Feb 22 12:23:45 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] S2466 Wake on Lan working, anyone? Message-ID: Does _anybody_ have wake on lan working with Tyan's S2466 motherboards? I know this has been asked before but hopefully since the last time around somebody has made it work. With: Tyan S2466M 2.6.8-1 kernel 3c59x driver v4.06 BIOS I tried putting: options 3c59x enable_wol=1 in /etc/modprobe.conf then did poweroff. This modified from the instructions here (URL may wrap): http://homepage.mac.com/felipe_alfaro/iblog/B1004527421/C1515218762/E66260423/ Unfortunately a subsequent ether-wake -D -i eth1 00:e0:81:22:ba:84 (also with -b) did nothing. The ethernet activity light is blinking on the powered off node, so there is at least power to the (on board) NIC. /var/log/messages shows these possibly relevant lines: Feb 22 11:55:04 monkey04 kernel: PCI: PCI BIOS revision 2.10 entry at 0xfd7d0, last bus=2 Feb 22 11:55:04 monkey04 kernel: 0000:02:08.0: 3Com PCI 3c905C Tornado at 0x2000. Vers LK1.1.19 Now as I understand it PCI 2.1 requires a header cable for WOL. The little blue Tyan user's manual does not indicate the location of such a header. It could be on J12 I suppose, since half the pins there are not documented. The book does document a LAN disable header which describes the on board ethernet as "3COM 3C905C." The little blue book also says that the board is a PCI 2.2 spec, and 2.2 doesn't require such a header. In any case as far as the PCI version goes there's some discrepancy between what is showing up in messages and what Tyan has documented. The BIOS appears not to have a WOL entry that can be turned on/off or a password set. 
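[As an aside, when debugging WOL it can help to take ether-wake itself out of the loop: the "magic packet" is just six 0xFF bytes followed by the target MAC repeated sixteen times, usually sent as a UDP broadcast to port 7 or 9. A Python sketch, reusing the MAC from the ether-wake command above; whether the board actually wakes still depends on the BIOS/NIC issues being discussed.]

```python
import socket

def magic_packet(mac: str) -> bytes:
    """Build a WOL magic packet: 6 x 0xFF, then the MAC 16 times (102 bytes)."""
    hw = bytes.fromhex(mac.replace(":", ""))
    if len(hw) != 6:
        raise ValueError("expected a 6-byte MAC address")
    return b"\xff" * 6 + hw * 16

def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9):
    """Broadcast the magic packet; roughly what ether-wake does."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))

print(len(magic_packet("00:e0:81:22:ba:84")))  # 102
```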
Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From alvin at Mail.Linux-Consulting.com Tue Feb 22 13:34:36 2005 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] RAID storage: Vendor vs. parts In-Reply-To: Message-ID: hi ya steve On Tue, 22 Feb 2005, Steve Cousins wrote: > I'm in the process of getting a couple 16 Bay 6.4 TB RAID units. The > vendors who have given me quotes have a SATA-SCSI version for around > $14,000 each. I can get a similarly equipped StorCase unit for around > $11,000. With the StorCase I envision that I'd be more on my own if > anything went wrong. However, with the cheaper price, I'd be able to buy > a spare RAID controller to have on hand in case one of them failed and > still save a couple grand. let's say $300 for 300GB (sata) disks ==> $4,800 for 16 disks ... - pata disk is $150-$200 for 300GB disks - i'd put 2 or 3 raid controllers in it instead of 1 .. unless there is an absolute requirement that there is only 1 disk subsystem that has the capacity of 6TB on "one disk" i'd prefer to have 2 6TB systems for the same $$$ rather than to have one name brand or commercial system backup of the 6TB or 100TB/rack of data is more important in my book than "brand name" > Does anyone have experience with any or all of these? Is it worth the > extra money to have a "burned-in" device supported by some company? a good burn-in test will last about 30 days .. of 24x7x30 continuous disk exercise ... it's not worth the "burn in time" and it'd be more important to find out how and what their rma process and timing is for replacing bad parts/subsystems ( 4hr turn around vs 4 day turn around etc ) after it passes, and it ships to you, all burn-in tests are sorta void, since the disks and cages could shift a few tenths of mm, and the cards and disks won't be seated properly again ...
things move(shift) in "cheap cases" c ya alvin From eugen at leitl.org Tue Feb 22 14:19:30 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] New MPI tutorials (fwd from d@daugerresearch.com) Message-ID: <20050222221930.GD1404@leitl.org> ----- Forwarded message from "Dr. Dean Dauger" ----- From: "Dr. Dean Dauger" Date: Tue, 22 Feb 2005 11:47:10 -0800 To: scitech@lists.apple.com Cc: "Dr. Dean Dauger" Subject: New MPI tutorials X-Mailer: Apple Mail (2.619) Hello All, I wanted to let you know about the debut of three tutorials about programming using MPI we just posted: http://daugerresearch.com/pooch/parallellife.html http://daugerresearch.com/pooch/parallelcirclepi.html http://daugerresearch.com/pooch/macmpitutorial.html featuring working source code examples using MPI and descriptions of the parallel code. We've also updated three earlier source-code tutorials on writing parallel code: http://daugerresearch.com/pooch/parallelknock.html http://daugerresearch.com/pooch/paralleladder.html http://daugerresearch.com/pooch/parallelpascalstriangle.html and updated an introduction to parallelization and an exhibition of parallel computing types: http://daugerresearch.com/pooch/parallelization.html http://daugerresearch.com/pooch/parallelzoology.html Also, Dauger Research authored a new Apple Developer Connection article about MPI: http://developer.apple.com/hardware/hpc/mpionmacosx.html All of these are linked from the Tutorials page: http://daugerresearch.com/pooch/tutorials.html Thank you very much, Dean _______________________________________________ Do not post admin requests to the list. They will be ignored. 
Scitech mailing list (Scitech@lists.apple.com) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/scitech/eugen%40leitl.org This email sent to eugen@leitl.org ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net From Ron.Jerome at nrc-cnrc.gc.ca Tue Feb 22 12:11:08 2005 From: Ron.Jerome at nrc-cnrc.gc.ca (Jerome, Ron) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] RAID storage: Vendor vs. parts Message-ID: I've been running a 16 bay Maxtronix ATA raid unit on my 80 node cluster, 24x7 for the last couple of years without a single issue. In fact I just ordered another similar SATA unit here... http://www.raidweb.com/sata.html. _________________________________________ Ron Jerome Programmer/Analyst National Research Council Canada M-2, 1200 Montreal Road, Ottawa, Ontario K1A 0R6 Government of Canada Phone: 613-993-5346 FAX: 613-941-1571 _________________________________________ > -----Original Message----- > From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On > Behalf Of Steve Cousins > Sent: Tuesday, February 22, 2005 2:18 PM > To: beowulf@beowulf.org > Subject: [Beowulf] RAID storage: Vendor vs. parts > > > I'm in the process of getting a couple 16 Bay 6.4 TB RAID units. The > vendors who have given me quotes have a SATA-SCSI version for around > $14,000 each. I can get a similarly equipped StorCase unit for around > $11,000. With the StorCase I envision that I'd be more on my own if > anything went wrong.
However, with the cheaper price, I'd be able to buy > a spare RAID controller to have on hand in case one of them failed and > still save a couple grand. > > All three of these (the other two are Infortrend and I believe a re-badged > Jet) use the same Intel i80321 CPU on their controllers. > > All are configured the same (with spare PS module and Fan module) except > the StorCase doesn't include a 3 year Express Swap warranty. This is why > I'd also want to get the spare RAID controller to share between the two > units in case one went bad. So, for price comparison, it is probably > closer to $28,000 for two "Name-brand" units or $25,500 for two StorCase > units with spares of most everything. > > Does anyone have experience with any or all of these? Is it worth the > extra money to have a "burned-in" device supported by some company? > > I know this isn't a Beowulf specific question but it seems that storage is > a big part of Beowulfery and I'd bet that a lot of people are looking into > similar devices. I hope it is relevant. > > I'm not after any more quotes from vendors (please don't email or call). > I'm happy with the alternatives that I have right now. > > Thanks, > > Steve > ______________________________________________________________________ > Steve Cousins, Ocean Modeling Group Email: cousins@umit.maine.edu > Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu > Univ. 
of Maine, Orono, ME 04469 Phone: (207) 581-4302 > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From emac at cybergps.net Tue Feb 22 12:43:55 2005 From: emac at cybergps.net (Eric Machala) Date: Wed Nov 25 01:03:50 2009 Subject: [BEOwulf] WW Fedora Student questions help Message-ID: <001d01c5191f$3ac29b40$6e45a8c0@masstivy> Hi, for my networking and Linux class I set up this Beowulf cluster using Fedora 2 and Warewulf, and it's up and running. But part of this class is also setting up monitoring and benchmarking tools and tests to measure the overall performance of my cluster, and setting up and running some parallel applications. I'm very interested in getting more into clusters, so I was wondering if anyone has any tools or scripts or anything I can set up and test on my Fedora Warewulf setup just to get experience. Also, I really need some computational or simulation-type apps, or any type of parallel cluster application, that I can play with to get the hang of them so I can move on to making my own applications. This would be a HUGE help. From canon at nersc.gov Tue Feb 22 12:54:41 2005 From: canon at nersc.gov (Shane Canon) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] RAID storage: Vendor vs. parts In-Reply-To: References: Message-ID: <421B9C11.1070204@nersc.gov> Steve, We currently have an Infortrend based box (older FC/IDE) and I'm currently testing a JetStor (AC&NC) box (FC/SATA). The Infortrend box has been in production for over a year now. It has been reasonably stable. However, in my opinion, the management interface stinks.
They clearly were thinking more about the Winders crowd than the Linux crowd. The JetStor box is pretty nice so far. You can have redundant controllers, PS, etc. The management interface is very nice and very flexible. For example, you can take one pool of disks and create multiple LUNs on the same pool with different RAID levels. This means you can create several LUNs under the 2TB limit and not waste disks on extra parity drives for example. I don't have any strong opinion on the support question though. - --Shane Steve Cousins wrote: | I'm in the process of getting a couple 16 Bay 6.4 TB RAID units. The | vendors who have given me quotes have a SATA-SCSI version for around | $14,000 each. I can get a similarly equipped StorCase unit for around | $11,000. With the StorCase I envision that I'd be more on my own if | anything went wrong. However, with the cheaper price, I'd be able to buy | a spare RAID controller to have on hand in case one of them failed and | still save a couple grand. | | All three of these (the other two are Infortrend and I believe a re-badged | Jet) use the same Intel i80321 CPU on their controllers. | | All are configured the same (with spare PS module and Fan module) except | the StorCase doesn't include a 3 year Express Swap warranty. This is why | I'd also want to get the spare RAID controller to share between the two | units in case one went bad. So, for price comparison, it is probably | closer to $28,000 for two "Name-brand" units or $25,500 for two StorCase | units with spares of most everything. | | Does anyone have experience with any or all of these? Is it worth the | extra money to have a "burned-in" device supported by some company? | | I know this isn't a Beowulf specific question but it seems that storage is | a big part of Beowulfery and I'd bet that a lot of people are looking into | similar devices. I hope it is relevant. | | I'm not after any more quotes from vendors (please don't email or call). 
| I'm happy with the alternatives that I have right now. | | Thanks, | | Steve | ______________________________________________________________________ | Steve Cousins, Ocean Modeling Group Email: cousins@umit.maine.edu | Marine Sciences, 208 Libby Hall http://rocky.umeoce.maine.edu | Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302 | | | | _______________________________________________ | Beowulf mailing list, Beowulf@beowulf.org | To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From cousins at limpet.umeoce.maine.edu Tue Feb 22 16:05:39 2005 From: cousins at limpet.umeoce.maine.edu (Steve Cousins) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] RAID storage: Vendor vs. parts In-Reply-To: Message-ID: On Tue, 22 Feb 2005, Alvin Oga wrote: > hi ya steve > > On Tue, 22 Feb 2005, Steve Cousins wrote: > > >> I'm in the process of getting a couple 16 Bay 6.4 TB RAID units. The > >> vendors who have given me quotes have a SATA-SCSI version for around > >> $14,000 each. I can get a similarly equipped StorCase unit for around > >> $11,000. With the StorCase I envision that I'd be more on my own if > >> anything went wrong. However, with the cheaper price, I'd be able to buy > >> a spare RAID controller to have on hand in case one of them failed and > >> still save a couple grand. > > > let's say $300 for 300GB (sata) disks ==> $4,800 for 16 disks ... > - pata disk is $150-$200 for 300GB disks > > - i'd put 2 or 3 raid controllers in it instead of 1 .. > unless there is an absolute requirement that there is only > 1 disk subsystem that has the capacity of 6TB on "one disk" That's what I'm shooting for.
Anybody have good luck with volumes greater than 2 TB with Linux? I think LSI SCSI cards are needed (?) and the 2.6 Kernel is needed with CONFIG_LBD=y. Any hints or notes about doing this would be greatly appreciated. Google has not been much of a friend on this unfortunately. I'm guessing I'd run into NFS limits too. Also, am I being overly cautious about having a spare RAID controller on hand? How frequent do RAID controllers go bad compared to disks, power supplies and fan modules? I'd guess that it would be very infrequent. Looking back at my own experience I think I've had to return one out of 15 in the last eight years, and that was bad as soon as I bought it. If this is too off-topic let me know and I'll move it elsewhere. Thanks, Steve From lindahl at pathscale.com Tue Feb 22 16:31:03 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] RAID storage: Vendor vs. parts In-Reply-To: References: Message-ID: <20050223003103.GB3291@greglaptop.internal.keyresearch.com> On Tue, Feb 22, 2005 at 07:05:39PM -0500, Steve Cousins wrote: > Also, am I being overly cautious about having a spare RAID controller on > hand? No. It depends on what kind of uptime you want to support. > How frequent do RAID controllers go bad compared to disks, power > supplies and fan modules? Which of these can you buy at Fry's? -- greg From alvin at Mail.Linux-Consulting.com Tue Feb 22 16:43:34 2005 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] RAID storage: Vendor vs. parts In-Reply-To: Message-ID: hi ya steve On Tue, 22 Feb 2005, Steve Cousins wrote: > That's what I'm shooting for. Anybody have good luck with volumes greater > than 2 TB with Linux? I think LSI SCSI cards are needed (?) and the 2.6 > Kernel is needed with CONFIG_LBD=y. Any hints or notes about doing this > would be greatly appreciated. Google has not been much of a friend on > this unfortunately.
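[For what it's worth, the 2 TB ceiling in question is plain 32-bit sector arithmetic: with 512-byte sectors, a 32-bit sector count tops out at 2^32 * 512 bytes, which is exactly 2 TiB; CONFIG_LBD=y is what gives a 32-bit 2.6 kernel 64-bit sector counts. A quick check of the numbers:]

```python
SECTOR_BYTES = 512           # traditional disk sector size
SECTORS_32BIT = 2 ** 32      # largest count a 32-bit sector_t can hold

limit = SECTOR_BYTES * SECTORS_32BIT
print(limit)                 # 2199023255552 bytes
print(limit == 2 * 2 ** 40)  # True: exactly 2 TiB
```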
I'm guessing I'd run into NFS limits too. for files/volumes over 2TB ... it's a question of libs, apps and kernel: everything has to work ... which is not always the case i don't play much with 2.6 kernels other than on suse-9.x boxes > Also, am I being overly cautious about having a spare RAID controller on > hand? How frequent do RAID controllers go bad compared to disks, power > supplies and fan modules? I'd guess that it would be very infrequent. it's always better to have spare parts ... ( part of my requirement ) if they expect the systems to be available 24x7 ... - more importantly, how long can they wait, when silly inexpensive things die, before it gets replaced - dead fans are $2.00 - $15 each to keep the disks cool - power supply is in the $50 range ... but if one bought an n+1 power supply then it's supposed to not be an issue anymore, but you will need to have its replacement handy - raid controllers should NOT die, nor cpu, mem, mb, nic, etc and it's not cheap to have these items floating around as spare parts - ethernet cables will go funky if random people have access to the patch panels ... ( keep the fingers away ) - ups will go bonkers too - what failure mode can one protect against and what will happen if "it" dies - best protection against downtime for users is to have a warm-swap server which is updated hourly or daily ... ( my preference - 2nd identical or bigger-disk capacity system ) > Looking back at my own experience I think I've had to return one out of 15 > in the last eight years, and that was bad as soon as I bought it. seems too high of a return rate ?? 1 out of 15 ?? > If this is too off-topic let me know and I'll move it elsewhere. ditto here 24x7x365 uptime compute environment is fun/frustrating stuff on tight budgets c ya alvin From alvin at Mail.Linux-Consulting.com Tue Feb 22 16:50:08 2005 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] RAID storage: Vendor vs.
parts In-Reply-To: <20050223003103.GB3291@greglaptop.internal.keyresearch.com> Message-ID: hi ya greg On Tue, 22 Feb 2005, Greg Lindahl wrote: > > How frequent do RAID controllers go bad compared to disks, power > > supplies and fan modules? > > Which of these can you buy at Fry's? if you're buying parts/systems from "fries" or compusa/dell/etc.. then one is in deep kaka ... - consumer grade parts are not as good as "industrial strength" that does not necessarily mean higher prices - fries et al. carry the lower grade junk of the same parts marginal mtbf parts of the same identical items, sold by higher end distributors vs retail stores my conspiracy theory about why the same Model xx from a Manufacturer are good for some and bad for others ( it'd depend on where you bought it ) - most consumer stores carry 3ware cards ... but not necessarily adaptec/lsi raid cards but some do carry all 3 of um but none carry the $1K - $20k raid controllers in stock c ya alvin From lusk at mcs.anl.gov Tue Feb 22 21:12:17 2005 From: lusk at mcs.anl.gov (Rusty Lusk) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] MPICH question In-Reply-To: <1109035829.18387.5.camel@cattail.clustertech.com> References: <1109035829.18387.5.camel@cattail.clustertech.com> Message-ID: <20050222.231217.74740023.lusk@localhost> From: John Lau Subject: [Beowulf] MPICH question Date: Tue, 22 Feb 2005 09:30:29 +0800 > Hi, > > I have a question on the number of processes spawned by MPICH. I am > using MPICH 1.2.5.2 --with-device=ch_p4. When I use mpirun -np 4 to > start a mpi process, total 8 processes will be spawned. But only 4 > processes have loadings on CPUs. I would like to know if it is the > correct behavior of MPICH? And what's the use of the 4 no-loading > process? Thank you. The other 4 processes are "listener" processes that permit dynamic creation of connections as they are needed. What you see is the correct behavior. We recommend that MPICH1 users switch to MPICH2.
Regards, Rusty Lusk From joachim at ccrl-nece.de Wed Feb 23 03:19:55 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] The Case for an MPI ABI In-Reply-To: <20050222050824.GA2195@greglaptop.attbi.com> References: <20050222050824.GA2195@greglaptop.attbi.com> Message-ID: <421C66DB.5080309@ccrl-nece.de> Greg Lindahl wrote: > The first question is: Does an ABI provide enough benefit for people > to care? To care enough to sit on a committee? Unfortunately, the value of an ABI is much reduced by the fact that the most important target platform, Linux, itself has no stable ABI (think of libc and other version nightmares). On an OS like Solaris or Windows, this is much more of a benefit. Another problem is, e.g., vendor-specific extensions that could conflict. A solution for this could be "numerical namespaces" for such extensions, but how should they be managed? And what about the different calling conventions in Fortran? Different library names for each variant? The different symbol names are also a problem, but a solvable one if a limited, but sufficient set of uppercase-lowercase-underscore permutations is defined.
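[Joachim's last point can be made concrete. For one Fortran-callable routine, the common compiler name-mangling schemes yield a small, enumerable set of link-time symbols; a hypothetical sketch follows (which schemes an ABI would actually require is exactly what a committee would have to pin down, so treat this list as illustrative, not authoritative):]

```python
def fortran_symbol_variants(name: str) -> list:
    """Plausible link-time names for a Fortran-callable routine under
    common manglings; illustrative, not an exhaustive or official list."""
    return [
        name.lower(),         # lowercase, no underscore
        name.lower() + "_",   # single trailing underscore (g77/gfortran default)
        name.lower() + "__",  # double underscore, used by some compilers for
                              # names that already contain an underscore
        name.upper(),         # all uppercase, no underscore
    ]

print(fortran_symbol_variants("MPI_Init"))
```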
-- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From ashley at quadrics.com Wed Feb 23 03:31:35 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] The Case for an MPI ABI In-Reply-To: <20050222.110605.03066059.lusk@localhost> References: <20050222050824.GA2195@greglaptop.attbi.com> <20050222.110605.03066059.lusk@localhost> Message-ID: <1109158295.9025.88.camel@localhost.localdomain> On Tue, 2005-02-22 at 11:06 -0600, Rusty Lusk wrote: > From: Greg Lindahl > Subject: [Beowulf] The Case for an MPI ABI > Date: Mon, 21 Feb 2005 21:08:24 -0800 > > This talk: > > > > http://www.openib.org/docs/oib_wkshp_022005/mpi-abi-pathscale-lindahl.pdf > > > > mostly talks about why we need an ABI, who wins and loses as a result > > of having one, and the pieces that could be in it. Please give it a > > look. > > One piece you include is the replacement of the non-portable mpirun with > an mpistart with standard arguments. You might note that the MPI-2 > forum addressed this issue with the specification of an mpiexec with > standard arguments. MPICH2 implements it. Can you provide a link to the part of the MPI-2 spec which says what these arguments are? I can't seem to find it on-line. Ashley,
I would like to know if it is the >correct behavior of MPICH? And what's the use of the 4 no-loading >process? Thank you. Those four processes are used to listen for connection requests. They are part of the ch_p4 device, which is built on top of the p4 communication layer. The p4 layer is quite venerable (it may be older than you are), and predates the wide use of threads. There is an option to use a thread instead of a process to listen for connection requests, but your best bet is to switch to MPICH2, which uses a very different (and more scalable and modern) architecture. Specific questions should be directed to mpi-maint@mcs.anl.gov (for MPICH) or mpich2-maint@mcs.anl.gov (for MPICH2). Bill >Best regards, >John Lau >-- >John Lau Chi Fai >Cluster Technology Ltd. >cflau@clustertech.com >Tel: (852) 2994-3727 >Fax: (852) 2994-2101 > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf William Gropp http://www.mcs.anl.gov/~gropp From gropp at mcs.anl.gov Wed Feb 23 05:38:03 2005 From: gropp at mcs.anl.gov (William Gropp) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] The Case for an MPI ABI In-Reply-To: <1109158295.9025.88.camel@localhost.localdomain> References: <20050222050824.GA2195@greglaptop.attbi.com> <20050222.110605.03066059.lusk@localhost> <1109158295.9025.88.camel@localhost.localdomain> Message-ID: <6.2.1.2.2.20050223073608.04f50400@pop.mcs.anl.gov> At 05:31 AM 2/23/2005, Ashley Pittman wrote: >On Tue, 2005-02-22 at 11:06 -0600, Rusty Lusk wrote: > > From: Greg Lindahl > > Subject: [Beowulf] The Case for an MPI ABI > > Date: Mon, 21 Feb 2005 21:08:24 -0800 > > > This talk: > > > > > > http://www.openib.org/docs/oib_wkshp_022005/mpi-abi-pathscale-lindahl.pdf > > > > > > mostly talks about why we need an ABI, who wins and loses as a result > > > of having one, and the pieces that could be in it.
Please give it a > > > look. > > > > One piece you include is the replacement of the non-portable mpirun with > > an mpistart with standard arguments. You might note that the MPI-2 > > forum addressed this issue with the specification of an mpiexec with > > standard arguments. MPICH2 implements it. > >Can you provide a link to the part of the MPI-2 spec which says what >these arguments are? I can't seem to find it on-line. It is under "Portable MPI Process Startup"; see http://www.mpi-forum.org/docs/mpi-20-html/node42.htm#Node42 . Bill >Ashley, >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf William Gropp http://www.mcs.anl.gov/~gropp From cousins at limpet.umeoce.maine.edu Wed Feb 23 07:36:19 2005 From: cousins at limpet.umeoce.maine.edu (Steve Cousins) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] RAID storage: Vendor vs. parts In-Reply-To: <421C0D6D.5020008@harddata.com> Message-ID: On Tue, 22 Feb 2005, Maurice Hilarius wrote: > I am not sure what you are asking here? > If you had the experience to build this yourself with confidence, then it is not a question. > And if you do not, then why the uncertainty? You will NEED the support. My question was simply: Does anyone have experience with any or all of these? Is it worth the extra money to have a "burned-in" device supported by some company? where "these" referred to Storcase, Infortrend, and Jetstor 16 Bay SATA-SCSI RAID units. I've heard good things about Infortrend and Jetstor but nothing about the Storcase unit so I primarily was interested in hearing if anyone had used these and what their impression was vs. what I'd heard about the others. Based partly on this, I will make a decision on what to buy. > Or am I missing a vital part of your question? > > BTW, those prices are WAY too high.
Further, there is a very strong > case now for a purely software RAID. Still use the RAID controllers as > disk interface controllers, such as LSI or AMCC (Was 3Ware). But use a > commodity dual CPU motherboard as the "RAID controller" under mdadm. > LOTS of people are doing that very successfully, and it is both > economical and generally higher performance than pure RAID controllers > under RAID5 or 6. I agree that you can do it cheaper this way however it does have its drawbacks. I set up a 2 TB system a year ago using a 3Ware 8506-12 card and 10 250 GB drives. It took months to get it to be rock-solid due to a combination of problems between the 3Ware driver, the 2.6 kernel using Opterons, and failing Maxtor hard drives as they worked their way to the bottom of the Bathtub curve. When a drive would fail or show signs of failing the driver would OOPS the system. We're now trying to get away from this mode of storage and transition to more of a SAN system (we may get FC versions of these boxes instead of SCSI) which will give better performance to all nodes as well as (hopefully) have less management overhead (my time). In any case, back to the main question: I haven't heard of anyone with experience with the Storcase systems. Has anyone considered them and decided to pay more for a name brand? If so, can you tell me why? Thanks, Steve From josip at lanl.gov Wed Feb 23 08:26:21 2005 From: josip at lanl.gov (Josip Loncaric) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] RAID storage: Vendor vs. parts In-Reply-To: References: Message-ID: <421CAEAD.8070207@lanl.gov> Alvin Oga wrote: > if you're buying parts/systems from "fries" or compusa/dell/etc.. > than one is in deep kaka ... 
> > - consumer grade parts is not as good as "industrial strength" > that does not necessarily mean higher prices > > - fries et.al carry the lower grade junk of the same parts > > marginal mtbf parts of the same identical items, sold by > higher end distributors vs retail store > > my conspiracy theory about why the same Model xx from Manufacturer > are good for some and bad for others > ( it'd depend on where you bought it ) My experience with boxed drives bought from retailers was better than OEM bare drives from reputable sources. Retail boxed drives often carry 3yr warranty, so there is more at stake for the manufacturer if they go bad. This is just one observation, possibly not statistically significant, but my retail store purchases worked out just fine. Sincerely, Josip P.S. You would not want to build a cluster that way (limited selection & higher prices) but for spare parts, retail stores are quick and convenient. ---------------------------------------------------------------------- "Technical data or Software Publicly Available" or "Correspondence". ---------------------------------------------------------------------- From reuti at staff.uni-marburg.de Tue Feb 22 15:23:41 2005 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:50 2009 Subject: [Beowulf] MPICH question In-Reply-To: <1109035829.18387.5.camel@cattail.clustertech.com> References: <1109035829.18387.5.camel@cattail.clustertech.com> Message-ID: <1109114621.421bbefda1ca0@home.staff.uni-marburg.de> Hi, you are not using shared memory, and so you will also see the rsh tasks. This is the usual behavior of mpich (and the other forks are most likely responsible for the async network communication between the tasks, although they are just on the same machine). 
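[The extra-process pattern Reuti describes here (and Bill Gropp explains earlier in the thread) is easy to demonstrate in miniature: each compute process forks a helper that just blocks in accept() waiting for new connections, so a process listing shows twice as many processes as ranks, half of them idle. A toy POSIX-only Python illustration; this is not MPICH code, just the shape of the ch_p4 design:]

```python
import os
import socket

def spawn_listener():
    """Fork a child that blocks in accept(), like a ch_p4 listener."""
    srv = socket.socket()             # TCP socket for incoming connections
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    pid = os.fork()
    if pid == 0:                      # child: the "no-loading" process
        conn, _ = srv.accept()        # sleeps in the kernel, uses no CPU
        conn.close()
        os._exit(0)
    return pid, srv.getsockname()[1]  # parent goes back to computing

if __name__ == "__main__":
    pid, port = spawn_listener()      # ps would now show 2 processes
    socket.create_connection(("127.0.0.1", port)).close()  # wake the listener
    os.waitpid(pid, 0)
    print("listener", pid, "exited")
```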
But try this: add the switch -comm=shared to ./configure for MPICH,
recompile (or at least relink) your program, and use a machinefile with
only one line:

myhost:4

and you will now have only 4 processes, using shared memory.

Cheers - Reuti

Quoting John Lau:

> Hi,
>
> I have a question on the number of processes spawned by MPICH. I am
> using MPICH 1.2.5.2 --with-device=ch_p4. When I use mpirun -np 4 to
> start an MPI process, a total of 8 processes will be spawned, but only
> 4 processes have load on the CPUs. I would like to know if this is the
> correct behavior of MPICH, and what is the use of the 4 no-load
> processes? Thank you.
>
> Best regards,
> John Lau
> --
> John Lau Chi Fai
> Cluster Technology Ltd.
> cflau@clustertech.com
> Tel: (852) 2994-3727
> Fax: (852) 2994-2101
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From michael at halligan.org Tue Feb 22 16:24:48 2005
From: michael at halligan.org (Michael T. Halligan)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To:
References:
Message-ID: <56919.66.150.251.142.1109118288.squirrel@mail3.bitpusher.com>

>>> I'm in the process of getting a couple 16 Bay 6.4 TB RAID units. The
>>> vendors who have given me quotes have a SATA-SCSI version for around
>>> $14,000 each. I can get a similarly equipped StorCase unit for around
>>> $11,000. With the StorCase I envision that I'd be more on my own if
>>> anything went wrong. However, with the cheaper price, I'd be able to
>>> buy a spare RAID controller to have on hand in case one of them
>>> failed and still save a couple grand.
>>
>> let's say $300 for 300GB (sata) disks ==> $4,800 for 16 disks ...
>>   - pata disk is $150-$200 for 300GB disks
>>
>> - i'd put 2 or 3 raid controllers in it instead of 1 ..
>>   unless there is an absolute requirement that there is only
>>   1 disk subsystem that has the capacity of 6TB on "one disk"
>
> That's what I'm shooting for. Anybody have good luck with volumes greater
> than 2 TB with Linux? I think LSI SCSI cards are needed (?) and the 2.6
> kernel is needed with CONFIG_LBD=y. Any hints or notes about doing this
> would be greatly appreciated. Google has not been much of a friend on
> this, unfortunately. I'm guessing I'd run into NFS limits too.
>
> Also, am I being overly cautious about having a spare RAID controller on
> hand? How frequently do RAID controllers go bad compared to disks, power
> supplies and fan modules? I'd guess that it would be very infrequent.
> Looking back at my own experience I think I've had to return one out of 15
> in the last eight years, and that was bad as soon as I bought it.
>
> If this is too off-topic let me know and I'll move it elsewhere.
>
> Thanks,

I've had no problems with RAID volumes 4-8 terabytes in size recently,
but I use the 2.6.8 kernel and only use ICP Vortex cards; I've found the
LSI cards to be sketchy at best.

-------------------
BitPusher, LLC
http://www.bitpusher.com/
1.888.9PUSHER
(415) 724.7998 - Mobile

From maurice at harddata.com Tue Feb 22 20:58:21 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <200502222000.j1MK0AEf018502@bluewest.scyld.com>
References: <200502222000.j1MK0AEf018502@bluewest.scyld.com>
Message-ID: <421C0D6D.5020008@harddata.com>

Steve Cousins wrote:

Subject: [Beowulf] RAID storage: Vendor vs. parts

> I'm in the process of getting a couple 16 Bay 6.4 TB RAID units. The
> vendors who have given me quotes have a SATA-SCSI version for around
> $14,000 each. I can get a similarly equipped StorCase unit for around
> $11,000. With the StorCase I envision that I'd be more on my own if
> anything went wrong.
> However, with the cheaper price, I'd be able to buy
> a spare RAID controller to have on hand in case one of them failed and
> still save a couple grand.
>
> All three of these (the other two are Infortrend and I believe a re-badged
> Jet) use the same Intel i80321 CPU on their controllers.
>
> All are configured the same (with spare PS module and Fan module) except
> the StorCase doesn't include a 3 year Express Swap warranty. This is why
> I'd also want to get the spare RAID controller to share between the two
> units in case one went bad. So, for price comparison, it is probably
> closer to $28,000 for two "Name-brand" units or $25,500 for two StorCase
> units with spares of most everything.
>
> Does anyone have experience with any or all of these? Is it worth the
> extra money to have a "burned-in" device supported by some company?
>
> I know this isn't a Beowulf specific question but it seems that storage is
> a big part of Beowulfery and I'd bet that a lot of people are looking into
> similar devices. I hope it is relevant.
>
> I'm not after any more quotes from vendors (please don't email or call).
> I'm happy with the alternatives that I have right now.
>
> Thanks,
>
> Steve

I am not sure what you are asking here. If you had the experience to
build this yourself with confidence, then it is not a question. And if
you do not, then why the uncertainty? You will NEED the support. Or am I
missing a vital part of your question?

BTW, those prices are WAY too high.

Further, there is a very strong case now for a purely software RAID.
Still use the RAID controllers as disk interface controllers, such as
LSI or AMCC (was 3Ware). But use a commodity dual CPU motherboard as the
"RAID controller" under mdadm. LOTS of people are doing that very
successfully, and it is both economical and generally higher performance
than pure RAID controllers under RAID5 or 6.

With our best regards,

Maurice W. Hilarius       Telephone: 01-780-456-9771
Hard Data Ltd.
                          FAX:       01-780-456-9772
11060 - 166 Avenue        email: maurice@harddata.com
Edmonton, AB, Canada      http://www.harddata.com/
T5X 1Y3

From list-beowulf at onerussian.com Tue Feb 22 21:44:08 2005
From: list-beowulf at onerussian.com (Yaroslav Halchenko)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] managing debian packages
Message-ID: <20050223054408.GS3124@washoe.rutgers.edu>

Hello to all Beowulfers,

A simple question: so we have cfengine2 to manage configs across the
hosts. But its "packages" section is quite handicapped, so there is a
question: how do you manage installing packages on nodes which sometimes
might differ a bit but most often have the same set of packages? I have
in mind the Debian packaging system.

In my case what I do is

1. Install the required package on a main node, so if it has any dialog
   which tweaks configuration, I adjust it so it fits my needs.

2. I run cfengine across the cluster, so the nodes pick up the updated
   /var/cache/debconf/config.dat

3. Using my favorite dsh, I install the same package in parallel on all
   the nodes in non-interactive mode, so it grabs answers for possible
   questions from debconf.

This way everything is kinda right.

Other ways would be: install on all the nodes from the beginning in
non-interactive mode. I don't like that option because it is quite often
that the default config has to be tweaked in a slight way suggested by
debconf, but if you don't get the dialog, you will not tweak it...

Or is my way indeed overkill?

--
                                  .-.
=------------------------------   /v\   ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\            ICQ#: 60653192
                                 ^^-^^     Linux User    [175555]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050223/d6055ea9/attachment.bin

From streich at uwm.edu Tue Feb 22 23:12:33 2005
From: streich at uwm.edu (streich@uwm.edu)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] Re: WW Fedora Student questions help
Message-ID: <1109142753.421c2ce1638f8@panthermail.uwm.edu>

*SNIP*
> but part of this class is also setting up monitoring and benchmarking
> tools and tests to get overall performance and whatnot of my cluster...

Check out MRTG and Ganglia on the net for monitoring. I know that MRTG is
available as a Red Hat package that can be installed from the distro's
install CDs, but it is probably best to get it in source code and
configure it to meet your needs. Ganglia is powerful; it is used by a lot
of clusters running Rocks.

> And setting up and running some parallel applications ... I'm very
> interested in getting more into clusters, so was wondering if anyone has
> any tools or scripts or anything I can set up and test on my Fedora
> Warewulf setup just to get experience. Also I really need some
> computational or sim-type apps, or any type of parallel cluster
> application I can play with to get the hang of them, so I can move on to
> making my own applications. This would be a HUGE help.

Hot topics that are using clusters include astronomical models, biology
(human genome stuff), and all sorts of things. Our cluster is dedicated
to atmospheric research (in particular studying fluffy white clouds most
atmospheric researchers aren't as interested in as storms and such)
running COAMPS and WRF (MPI-based programs).

I suppose the real question is: what type of problem are you interested
in?

From pauln at psc.edu Tue Feb 22 23:15:48 2005
From: pauln at psc.edu (Paul Nowoczynski)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To:
References:
Message-ID: <421C2DA4.8090608@psc.edu>

Alvin Oga wrote:

> hi ya steve
>
> On Tue, 22 Feb 2005, Steve Cousins wrote:
>
>> That's what I'm shooting for. Anybody have good luck with volumes greater
>> than 2 TB with Linux? I think LSI SCSI cards are needed (?) and the 2.6
>> kernel is needed with CONFIG_LBD=y. Any hints or notes about doing this
>> would be greatly appreciated. Google has not been much of a friend on
>> this, unfortunately. I'm guessing I'd run into NFS limits too.
>
> for files/volumes over 2TB ... it's a question of libs, apps and kernel:
> everything has to work ... which is not always the case

We've got this working at PSC without too much pain, even with SCSI
block devices >2TB. The LBD option is needed, but it doesn't solve all
the problems with large disks, especially if you have a single volume
which is larger than 2TB. The issue we ran into was that many
disk-related apps like mdadm and [s]fdisk don't support the
BLKGETSIZE64 ioctl. So even though your kernel is using 64 bits, some
needed apps are not. There are also issues with disklabels for devices
>2TB. The normal DOS-style disklabel used by Linux doesn't support them,
so you'll need a kernel patch for the "plaintext" partition table made
by Andries Brouwer. If you're interested in running this on 2.6 I can
give you the patch.

As far as cards go, I think the Adaptec U320 cards are better. I've seen
less SCSI timeout weirdness with them (this could be related to our
disks). Performance-wise the LSI and Adaptec are about the same: we see
~400MB/sec when using both channels, even with a sub-PCI-X bus. For a
couple hundred bucks a card this is really good news.

--paul

> i don't play much with 2.6 kernels other than on suse-9.x boxes
>
>> Also, am I being overly cautious about having a spare RAID controller on
>> hand? How frequently do RAID controllers go bad compared to disks, power
>> supplies and fan modules? I'd guess that it would be very infrequent.
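[For readers wondering where the 2 TB wall in this thread comes from: it is simply 32-bit sector arithmetic, i.e. a 32-bit sector count multiplied by the 512-byte sector size. A quick shell check:]

```shell
# A 32-bit sector number addresses at most 2^32 sectors of 512 bytes:
bytes=$(( (1 << 32) * 512 ))
echo "$bytes"              # 2199023255552 bytes
echo "$(( bytes >> 40 ))"  # = 2 TiB, the ceiling without CONFIG_LBD etc.
```

[Anything above that needs 64-bit block-layer support (CONFIG_LBD on 2.6) and userspace tools that use 64-bit size queries, which is exactly the mdadm/[s]fdisk problem Paul describes.]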
> it's always better to have spare parts ... ( part of my requirement ) if
> they expect the systems to be available 24x7 ...
>
>   - more importantly, how long can they wait, when silly inexpensive
>     things die, before it gets replaced
>
>   - dead fans are $2.00 - $15 each to keep the disks cool
>
>   - a power supply is in the $50 range ... but if one bought an n+1
>     power supply, then it's supposed to not be an issue anymore, but
>     you will need to have its replacement handy
>
>   - raid controllers should NOT die, nor cpu, mem, mb, nic, etc,
>     and it's not cheap to have these items floating around as spare
>     parts
>
>   - ethernet cables will go funky if random people have access
>     to the patch panels ... ( keep the fingers away )
>
>   - ups will go bonkers too
>
>   - what failure mode can one protect against, and what will happen
>     if "it" dies
>
>   - best protection against downtime for users is to have a
>     warm-swap server which is updated hourly or daily ...
>     ( my preference - a 2nd identical or bigger-disk-capacity system )
>
>> Looking back at my own experience I think I've had to return one out of 15
>> in the last eight years, and that was bad as soon as I bought it.
>
> seems too high of a return rate ?? 1 out of 15 ??
>
>> If this is too off-topic let me know and I'll move it elsewhere.
>
> ditto here
>
> a 24x7x365 uptime compute environment is fun/frustrating stuff on tight
> budgets
>
> c ya
> alvin
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From gotero at linuxprophet.com Wed Feb 23 00:48:53 2005
From: gotero at linuxprophet.com (Glen Otero)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] New BioBrew Release
Message-ID: <0b130210c33a2235b11fc4ff101381f2@linuxprophet.com>

BioBrew-3.1 for x86 is here.
BioBrew is an open source Linux cluster distribution based on the
popular Rocks (www.rocksclusters.org) cluster software and enhanced for
bioinformatics. BioBrew includes popular cluster software, e.g. MPICH,
PVM, Modules, PVFS, Myrinet GM, Sun Grid Engine, gcc, Ganglia, and
Globus, *and* popular bioinformatics software, e.g. the NCBI toolkit,
BLAST, mpiBLAST, HMMER, ClustalW, GROMACS, PHYLIP, WISE, FASTA, MrBayes,
and EMBOSS.

A BioBrew DVD ISO for x86 is freely available for download at
BioBrew.org, a Bioinformatics.org sponsored and hosted website. README
and INSTALL docs are also available on the website.

Features you'll find in this release that differ slightly from
Rocks-3.1 include:

- Infiniband support from Mellanox (/usr/mellanox)
- Myrinet support with gm-2.0.11
- Virtual Machine Interface (VMI) 2.0 (/opt/vmi-2.0-gcc)
- mpich built for VMI (/opt/mpich-vmi-2.0-gcc)

Application upgrades for 3.1:

- added modules back to the distro with modulesenv 3.1.6-2
- upgraded modulefiles to 1.0.2, including corrections to profile.d
  scripts submitted by Humberto Zuazaga
- upgraded hmmer to 2.3.2-1
- upgraded gromacs to 3.2.1-1
- upgraded EMBOSS to 2.9.0-6; EMBOSS now installs to /usr/share and not
  /opt/BioBrew
- upgraded NCBI BLAST with ncbitools 6.1.0-2 (and ncbi.tar.gz from
  10/20/04)
- upgraded mpiBLAST to 1.3; mpiBLAST lives under /opt/NCBI/6.1.0/bin
  with the rest of the binaries
- upgraded Phylip to 3.61-5 and separated it from EMBOSS

New applications:

- MrBayes 3.0-1 has been added to BioBrew

BioBrew rolls are also here: BioBrew rolls for Rocks-3.1 and Rocks-3.3
on x86 are also available on the website, as is a BioBrew roll for
Rocks-3.3 on x86_64. The rolls contain the same bio apps as the full
BioBrew release. But know that if you build a cluster with Rocks + the
BioBrew roll, you will not have access to the vmi, mpich-vmi,
Infiniband, or Myrinet packages in the full BioBrew release.
You will be relying on Rocks' Infiniband, mpich, and Myrinet support,
which is slightly different from BioBrew's, when using these rolls. The
BioBrew rolls include the SRPMS as well as the RPMS.

This is the first BioBrew release for x86_64: the BioBrew roll for
Rocks-3.3 on x86_64 includes the same apps as the x86 releases, except
for Java and EMBOSS. EMBOSS is not included because Java for
Rocks-3.3-x86_64 was not available at the time this release was built.

Future BioBrew development is likely to consist of BioBrew rolls only.

I've got more apps planned. I'm also soliciting requests. Hint, hint...

Thanks to Joe Landman of Scalable Informatics and Luc Ducazu of BioLinux
for the SRPMS they make available to the community, and from which I
borrowed liberally. They helped make this release a reality. Thanks to
the beta testers that downloaded the releases over the past few weeks
and pointed out a few problems. Thanks to Aaron Darling and Humberto
Zuazaga for their help with mpiBLAST and Modules, respectively.

Special thanks to the folks running BioBrew mirrors!

Glen Otero Ph.D.
Linux Prophet

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 3161 bytes
Desc: not available
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050223/2fbbf51d/attachment.bin

From reuti at staff.uni-marburg.de Wed Feb 23 00:23:07 2005
From: reuti at staff.uni-marburg.de (Reuti)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] sun grid engine on Scyld beowulf cluster
In-Reply-To: <4219201F.70105@metrumrg.com>
References: <4214A0E5.3010804@metrumrg.com> <421516B7.3060906@sonsorol.org> <4219201F.70105@metrumrg.com>
Message-ID: <1109146987.421c3d6b7c926@home.staff.uni-marburg.de>

Hi,

maybe this is of help: http://noel.feld.cvut.cz/magi/sge+bproc.html

Cheers - Reuti

Quoting BillKnebel:

> Chris,
>
> I was able to get grid engine to run on the Scyld cluster using the
> approach of setting the master (head) node as the submit, admin, and
> execute host. Unfortunately, starting a set of jobs on the cluster
> results in all jobs being run on the head node only (if grid-engine-only
> commands are used), or I can integrate the grid engine "qsub" command
> with some of the Scyld tools to get jobs started and then migrated (to a
> point) over the cluster. However, I am still running into problems
> because all of the queuing variables for grid engine read the head node
> info, and since all jobs run on the compute nodes, the head node appears
> to be always free, which results in all jobs being started at once. This
> is not ideal.
>
> I am waiting on some feedback from Scyld/Penguin Computing on some
> related issues that will hopefully solve some of these problems.
> Bill
>
> Chris Dagdigian wrote:
>
>> I know Grid Engine well but not Scyld, so forgive my ignorance if I say
>> something stupid, and given the level of expertise on this list I'm
>> quite certain I'm about to make a fool of myself :)
>>
>> If Scyld is presenting you with a single system image (i.e. a single
>> Linux server that can farm out tasks to all those nodes) then you
>> would install SGE in the same way that you would install it on a big
>> SMP box:
>>
>> 1. Install the SGE qmaster and scheduler on the master node
>> 2. Install the execution host on the master node as well
>>
>> You will only have 1 execd per queue, but each queue can be configured
>> with N "job slots", which actually control how many jobs can
>> run at the same time on the same machine.
>>
>> Try setting your # of job slots within your single SGE queue to the
>> number of nodes in your cluster. This is similar to what you would do
>> on a big SMP machine -- a small number of queues, each supporting a
>> decent job slot count.
>>
>> Then submit a bunch of jobs and see if SGE causes the master node to
>> fall over under load. If not, then Scyld is doing its thing behind the
>> scenes to migrate stuff around to the other nodes.
>>
>> -Chris
>>
>> billk01 wrote:
>>
>>> I am in the process of installing SGE on a Scyld beowulf cluster. As
>>> most people are aware, the Scyld cluster runs a complete OS (Linux) only
>>> on the master node, and the compute nodes are simply for executing.
>>> During the SGE install, it requires adding the compute nodes as execute
>>> hosts. I do not understand how to do this given the current setup of a
>>> Scyld cluster, since you can't "login" to the nodes to execute the
>>> install script. The script does exist on an NFS shared directory
>>> (cluster wide). Has anybody else run into this problem?
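[Chris's suggestion above — one queue whose slot count equals the node count — can be sketched with qconf. The queue name and the count of 16 are illustrative assumptions, not part of the original post:]

```shell
# Hypothetical sketch: on the SGE master, set the slot count of a single
# queue to the number of nodes in the cluster (16 here is illustrative).
qconf -mattr queue slots 16 all.q

# Verify the change took effect:
qconf -sq all.q | grep slots
```

[Each "slot" then admits one running job, so on a single-system-image master this caps concurrency at roughly one job per node, as described above.]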
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From eugen at leitl.org Wed Feb 23 12:02:46 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] [BioBrew-discuss] Re: Biobrew 3.1 (fwd from jeff@bioinformatics.org)
Message-ID: <20050223200245.GJ1404@leitl.org>

Site's hammered; no mirrors nor torrents, so won't be of much use yet.

----- Forwarded message from "J.W. Bizzaro" -----

From: "J.W. Bizzaro"
Date: Wed, 23 Feb 2005 14:42:11 -0500
To: The Virtual BioBrew Think Tank
Subject: [BioBrew-discuss] Re: Biobrew 3.1
User-Agent: Mozilla Thunderbird 1.0 (X11/20041206)
Reply-To: The Virtual BioBrew Think Tank

Hi guys. Since the DVD ISOs are > 2 GB, you need to use FTP:

ftp://ftp.bioinformatics.org/pub/biobrew/BioBrew-v3.1/x86/

Apache is probably responsible for all of the screwy errors vis-a-vis
these files.

Cheers.
Jeff

> On Feb 23, 2005, at 10:59 AM, Eugen Leitl wrote:
> The access rights on
> http://ftp.bioinformatics.org/pub/biobrew/BioBrew-v3.1/x86/BioBrew-Pro-3.1.0.i386.iso
> seem to be screwed.

--
J.W. Bizzaro
Bioinformatics Organization, Inc. (Bioinformatics.Org)
E-mail: jeff@bioinformatics.org
Phone: +1 508 890 8600
--

----- End forwarded message -----
--
Eugen* Leitl leitl
______________________________________________________________
ICBM: 48.07078, 11.61144            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE
http://moleculardevices.org         http://nanomachines.net

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050223/ca5536ec/attachment.bin

From lindahl at pathscale.com Wed Feb 23 12:04:41 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <421C66DB.5080309@ccrl-nece.de>
References: <20050222050824.GA2195@greglaptop.attbi.com> <421C66DB.5080309@ccrl-nece.de>
Message-ID: <20050223200440.GD2227@greglaptop.internal.keyresearch.com>

On Wed, Feb 23, 2005 at 12:19:55PM +0100, Joachim Worringen wrote:

> Unfortunately, the value of an ABI is much reduced by the fact that the
> most important target platform, Linux, itself has no stable ABI (think of
> libc and other version nightmares). On an OS like Solaris or Windows,
> this is much more of a benefit.

I don't think it's "much reduced" by this, but I think it's clear this
would be a matter of opinion. What you'll definitely be able to do is
run an application built on a particular Linux version with different
MPI libraries compiled for that same Linux version. You are correct that
if the MPI library was built for a wildly different Linux distro than
the app, you can't necessarily put them together.

> Another problem is, e.g., vendor-specific assertions that could conflict.
> A solution for this could be "numerical namespaces" for such extensions,
> but how should they be managed?

This is certainly something that a committee would discuss. There are
plenty of examples of this problem being solved successfully by handing
out numeric ranges.

> And what about the different calling conventions in Fortran?

The calling-convention differences (in Linux) revolve around the f2c ABI
issue, and it so happens that no MPI routines trip on this issue, as it
only affects functions that return REAL*4 or COMPLEX types. Did I miss a
function that has those return types?
-- greg

From rgb at phy.duke.edu Wed Feb 23 12:09:34 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] managing debian packages
In-Reply-To: <20050223054408.GS3124@washoe.rutgers.edu>
References: <20050223054408.GS3124@washoe.rutgers.edu>
Message-ID:

On Wed, 23 Feb 2005, Yaroslav Halchenko wrote:

> Hello to all Beowulfers,
>
> A simple question: so we have cfengine2 to manage configs across the
> hosts. But its "packages" section is quite handicapped, so there is a
> question: how do you manage installing packages on nodes which
> sometimes might differ a bit but most often have the same set of
> packages? I have in mind the Debian packaging system.

Does this mean you only want answers that apply to Debian-based
clusters? I mean, another answer for rpm-based systems might involve
kickstart and yum, with or without cfe or dsh. Another KIND of
possibility altogether is warewulf (beowulf in a non-commercial box),
which runs a single template on all nodes (so updating the template
updates everything). Still another is e.g. scyld (beowulf in a
commercial box). And this still isn't exhaustive, I'm sure.

So there are many ways to do it, but which sort of solution you look for
is likely predicated as much on the particular Linux distro you choose
for a base as anything, and beyond that on whether you choose to use a
cluster-specific packaging that manages all this with provided tools.

rgb

> In my case what I do is
>
> 1. Install the required package on a main node, so if it has any dialog
>    which tweaks configuration, I adjust it so it fits my needs.
>
> 2. I run cfengine across the cluster, so the nodes pick up the updated
>    /var/cache/debconf/config.dat
>
> 3. Using my favorite dsh, I install the same package in parallel on all
>    the nodes in non-interactive mode, so it grabs answers for possible
>    questions from debconf.
>
> This way everything is kinda right.
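[The three quoted steps could be sketched roughly as follows; the package name, dsh flags, and host list are illustrative assumptions, not Yaroslav's exact commands:]

```shell
# Step 1: on the master node, install interactively so debconf records
# the answers you give to any configuration dialogs.
apt-get install somepackage          # "somepackage" is a placeholder

# Step 2: run cfengine so the nodes pull the updated debconf database
# (/var/cache/debconf/config.dat) from the master.
cfagent -q

# Step 3: install in parallel on all nodes, non-interactively, so apt
# reuses the preseeded debconf answers instead of prompting.
dsh -a -c -- \
  env DEBIAN_FRONTEND=noninteractive apt-get -y install somepackage
```

[The dsh invocation assumes a Dancer's-shell-style dsh where -a means "all configured machines" and -c runs concurrently; other dsh implementations spell these options differently.]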
> Other ways would be: install on all the nodes from the beginning in
> non-interactive mode. I don't like that option because it is quite
> often that the default config has to be tweaked in a slight way
> suggested by debconf, but if you don't get the dialog, you will not
> tweak it...
>
> Or is my way indeed overkill?

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

From rgb at phy.duke.edu Wed Feb 23 12:15:13 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] Re: WW Fedora Student questions help
In-Reply-To: <1109142753.421c2ce1638f8@panthermail.uwm.edu>
References: <1109142753.421c2ce1638f8@panthermail.uwm.edu>
Message-ID:

On Wed, 23 Feb 2005 streich@uwm.edu wrote:

> *SNIP*
>> but part of this class is also setting up monitoring and benchmarking
>> tools and tests to get overall performance and whatnot of my cluster...
>
> Check out MRTG and Ganglia on the net for monitoring. I know that MRTG
> is available as a Red Hat package that can be installed from the
> distro's install CDs, but it is probably best to get it in source code
> and configure it to meet your needs. Ganglia is powerful; it is used by
> a lot of clusters running Rocks.

Or you can try wulfware (xmlsysd and e.g. wulfstat or wulflogger).
Depends on what you want to do with the data. wulflogger makes it very
easy to log a variety of cluster load metrics into a file at a selected
interval, in case you want to actually run programs to analyze it or
plot it with standalone tools.

It should be prebuilt for FC2, RH9 and CentOS 3.3 here:

http://www.phy.duke.edu/~rgb/Beowulf/wulfware.php

(or available in source rpm or tarball if you prefer). I've tried to
make this a yummified repository, so you can, if you wish, autoupdate
from it via yum, or of course you can just grab source or binary rpms
and put them into a local repository.
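[A hypothetical yum repo stanza for that kind of autoupdate setup; the baseurl below is a placeholder, not the real repository path — check the page rgb cites for the actual location before using it:]

```shell
# Illustrative only: point yum at a third-party repository like the one
# described above, then install from it. The baseurl is a placeholder.
cat > /etc/yum.repos.d/wulfware.repo <<'EOF'
[wulfware]
name=Wulfware cluster monitoring tools
baseurl=http://example.org/path/to/wulfware/repo/
enabled=1
gpgcheck=0
EOF

yum install xmlsysd wulfstat
```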
rgb

>> And setting up and running some parallel applications ... I'm very
>> interested in getting more into clusters, so was wondering if anyone
>> has any tools or scripts or anything I can set up and test on my
>> Fedora Warewulf setup just to get experience. Also I really need some
>> computational or sim-type apps, or any type of parallel cluster
>> application I can play with to get the hang of them, so I can move on
>> to making my own applications. This would be a HUGE help.
>
> Hot topics that are using clusters include astronomical models, biology
> (human genome stuff), and all sorts of things. Our cluster is dedicated
> to atmospheric research (in particular studying fluffy white clouds
> most atmospheric researchers aren't as interested in as storms and
> such) running COAMPS and WRF (MPI-based programs).
>
> I suppose the real question is: what type of problem are you interested
> in?
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

From landman at scalableinformatics.com Wed Feb 23 12:32:42 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] final CFP: HiPCoMB-2005
Message-ID: <421CE86A.7000707@scalableinformatics.com>

I didn't see it posted here, so I figured I would send it out. Apologies
to those with low meeting-spam tolerance.

********************************************************************************
We apologize if you received multiple copies of this Call for Papers.
Please feel free to distribute it to those who might be interested.
********************************************************************************
___________________________________________________________________

                         CALL FOR PAPERS

   1st IEEE Workshop on High Performance Computing in Medicine and
                    Biology (HiPCoMB-2005)

                   held in conjunction with
  The 11th International Conference on Parallel and Distributed
                    Systems (ICPADS 2005)

     Fukuoka Institute of Technology (FIT), Fukuoka, Japan
                       July 20-22, 2005

  HiPCoMB-2005 Home Page: http://www.pdcl.wayne.edu/HiPCoMB-2005

 (Apologies if you receive multiple copies of this Call for Papers)
_______________________________________________________________________

IMPORTANT DEADLINES:

  Paper submission:     February 28, 2005
  Author Notification:  April 07, 2005
  Camera-Ready Papers:  April 21, 2005

WORKSHOP INFORMATION:

The First Workshop on High Performance Computing in Medicine and
Biology (HiPCoMB-05), held in conjunction with ICPADS 2005 in Fukuoka
City, Japan, brings together researchers in computer science and
engineering, medicine, and biology that use high performance computing
to solve computationally expensive problems in medicine and biology.
The workshop will provide a forum for presenting and exchanging new
ideas and experiences in this area.
Topics of interest include high performance algorithms, systems, architecture, and tools for the following: (but are not limited to the following list) * Microarray Analysis * RNAi Analysis * Systems Biology * Computational Genomics * Comparative Genomics * DNA Assembly, Clustering, and Mapping * Gene identification and annotation * Computational Proteomics * Evolution and Phylogenetics * Protein Structure Predication and Modeling * Medical Image Processing * Computer Assisted Surgery * Computational Medicine Modeling * Computational Biology Modeling * Augmented Reality * Medical Informatics SUBMISSION INFORMATION: Talks will be accepted on the basis of a paper of approximately 15 single-column pages that describes the work, its significance, and the current status of the research. Submit one electronic copy of the paper in PostScript or PDF format by February 15, 2005. Please visit the workshop home page for submission instructions. Notification of acceptance will be given by March 22, 2005, and camera-ready papers will be due April 21, 2005. Accepted papers will be given guidelines in preparing and submitting the final manuscript(s) together with the notification of acceptance. All accepted papers will be presented at the workshop and included in proceedings that will be distributed at the workshop. In addition, authors of selected papers from the workshop will be invited to submit extended versions of their papers for publication in a special issue of the International Journal of Bioinformatics Research and Applications. GENERAL INFORMATION: GENERAL CO-CHAIRS: Laurence T. Yang St. 
Francis Xavier University, Canada
email: lyang@stfx.ca

Albert Zomaya
University of Sydney, Australia
email: zomaya@it.usyd.edu.au

PROGRAM CO-CHAIRS:

Vipin Chaudhary
Wayne State University, USA
email: vipin@wayne.edu

Andrei Doncescu
LAAS, National Center for Scientific Research, France
email: adoncesc@laas.fr

Yi Pan
Georgia State University, USA
email: pan@cs.gsu.edu

PROGRAM COMMITTEE MEMBERS:

David Abramson, Monash University, Australia, davida@csse.monash.edu.au
Enrique Alba, University of Malaga, Spain, eat@lcc.uma.es
Srinivas Aluru, Iowa State University, USA, aluru@iastate.edu, http://www.ece.iastate.edu/~aluru
Shahid H. Bokhari, University of Engineering & Technology, Pakistan, shb@acm.org
Vincent Breton, CNRS/IN2P3, LPC Clermont-Ferrand, France, breton@clermont.in2p3.fr
Kevin Burrage, University of Queensland, Australia, kb@maths.uq.edu.au
Amitava Datta, University of Western Australia, Australia, datta@csse.uwa.edu.au
Hans de Sterck, University of Waterloo, Canada, hdesterck@math.uwaterloo.ca
Mario Rosario Guarracino, ICAR-CNR, Italy, mario.guarracino@cps.na.cnr.it, http://pixel.dma.unina.it/~mariog/
Ryoko Hayashi, Japan Advanced Institute of Science and Technology (JAIST), Japan, ryoko@jaist.ac.jp
Matthew He, Nova Southeastern University, USA, hem@nsu.nova.edu
Alfons Hoekstra, University of Amsterdam, The Netherlands, alfons@science.uva.nl
Xiaohua (Tony) Hu, Drexel University, USA, thu@cis.drexel.edu, http://www.cis.drexel.edu/faculty/thu
Chun-Hsi Huang, University of Connecticut, USA, huang@engr.uconn.edu
Arun Krishnan, Bioinformatics Institute, Singapore, arun@bii.a-star.edu.sg
Joseph Landman, Scalable Informatics, LLC, landman@scalableinformatics.com
Wenjun Li, UT Southwestern Medical Center, USA, liwenjun2k@yahoo.com
Yiming Li, National Chiao Tung University, Taiwan, ymli@mail.nctu.edu.tw
Robert L.
Martino, National Institutes of Health, USA, Robert.Martino@nih.gov
Maria Mirto, University of Lecce, Italy, maria.mirto@unile.it
Michael Mascagni, Florida State University, USA, mascagni@fsu.edu
Martin Middendorf, University of Leipzig, Germany, middendorf@informatik.uni-leipzig.de
Giri Narasimhan, Florida International University, USA, giri@cs.fiu.edu
Jun Ni, University of Iowa, USA, jun-ni@uiowa.edu
Sergei Petoukhov, Russian Academy of Sciences, Russia, petoukhov@hotmail.com
Pascal Poullet, French West Indies University, France, Pascal.Poullet@univ-ag.fr
Youxing Qu, University of Georgia, USA, youxing@csbl.bmb.uga.edu
Nagiza Samatova, Oak Ridge National Lab, USA, samatovan@ornl.gov
Bertil Schmidt, Nanyang Technological University, Singapore, ASBSchmidt@ntu.edu.sg
Tony Solomonides, University of the West of England, UK, Tony.Solomonides@uwe.ac.uk
El-Ghazali Talbi, LIFL, France, talbi@lifl.fr
Daming Wei, University of Aizu, Japan, dm-wei@u-aizu.ac.jp
Tiffani Williams, University of New Mexico, USA, tlw@cs.unm.edu
C. M. Yang, Nankai University, China, yangchm@nankai.edu.cn
Yanqing Zhang, Georgia State University, USA, yzhang@cs.gsu.edu
Bingbing Zhou, University of Sydney, Australia, bbz@it.usyd.edu.au

Information related to ICPADS 2005 is available at the official ICPADS 2005 Web site: http://www.takilab.k.dendai.ac.jp/conf/icpads/2005/

ICPADS 2005 is co-sponsored by IEEE Computer Society TCDP and TCPP, and Fukuoka Institute of Technology (FIT), in cooperation with Fukuoka City, IPSJ (Information Processing Society of Japan) SIGDPS, IEEE Taipei Section, IEEE Hong Kong Section, SCAT, and AOARD/AOR.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615

From alvin at Mail.Linux-Consulting.com Wed Feb 23 12:43:20 2005
From: alvin at Mail.Linux-Consulting.com (Alvin Oga)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] RAID storage: Vendor vs. parts
In-Reply-To: <421CAEAD.8070207@lanl.gov>
Message-ID:

hi ya josip

On Wed, 23 Feb 2005, Josip Loncaric wrote:

> My experience with boxed drives bought from retailers was better than
> OEM bare drives from reputable sources. Retail boxed drives often carry
> 3yr warranty, so there is more at stake for the manufacturer if they go bad.

warranty is the same from retailers or wholesalers

"reputable sources" that i'm talking about are the ones where you are
required to have a reseller permit to buy from them, and some do not have
a website for consumers to buy from

the big question is: does one call ibm, or the place you bought it from,
to return the item

to me, the parts should NEVER fail within its warranty period
- exceptions are doa when it's first powered on, which has < 1% failure
- exception was the ibm deathstar disks
  ( that resulted in a class action suit against ibm )

we don't have any problems with our hardware ... other than deathstars,
and we buy 1,000's of 'em

> P.S. You would not want to build a cluster that way (limited selection
> & higher prices) but for spare parts, retail stores are quick and
> convenient.

yup .. which is why we want to buy cots parts .... and nothing specific
to the system integrator that sells their modified versions, which has
been the case where all service/support calls came from: to go fix these
"modified/customized" systems

when a part dies, you want to go to the local pc store, buy the $10 item
or the $100 replacement disk, and have people working in a matter of an
hour or three hrs .. etc ..
c ya
alvin

From gmpc at sanger.ac.uk Wed Feb 23 12:56:53 2005
From: gmpc at sanger.ac.uk (Guy Coates)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] managing debian packages
In-Reply-To: <20050223054408.GS3124@washoe.rutgers.edu>
References: <20050223054408.GS3124@washoe.rutgers.edu>
Message-ID:

On Wed, 23 Feb 2005, Yaroslav Halchenko wrote:

> Hello to all Beowulfers,
>
> A simple question: so we have cfengine2 to manage configs through the
> hosts. But its "packages" section is quite handicapped, so there is a
> question: how do you manage installing packages on nodes which
> sometimes might differ a bit but most often have the same set of
> packages? I have in mind the Debian packaging system.

For the installs, take a look at FAI; it is the automated Debian
installer. It is extremely flexible, so you can hack it about to do
pretty much anything you want.

Once the machines are up, dsh is a fine way of keeping things up to date
or installing new stuff. If you don't want to be pestered by
configuration questions during package installs, you can change that
behaviour in /etc/debconf.conf to make everything non-interactive.

There is also a rather nifty Debian package called jablicator. You run it
on a master system and it creates an empty deb file which depends on
every deb installed. If you then install that deb on another machine, it
will auto-magically install whatever it needs to make it a clone of the
master. It probably won't handle config files, but dsh/cfengine can
handle that part for you.

Cheers,

Guy

--
Dr. Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK
Tel: +44 (0)1223 834244 ex 7199

From mathog at mendel.bio.caltech.edu Wed Feb 23 15:28:07 2005
From: mathog at mendel.bio.caltech.edu (David Mathog)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] Re: S2466 Wake on Lan working, anyone?
Message-ID:

One other tidbit: both the node with enable_wol=1 and the other nodes
show exactly the same thing for "lspci -vv", which is:

02:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
        Subsystem: Tyan Computer: Unknown device 2466
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=2 PME-

As I understand this, the status should have been D3 (from some old posts
by Donald Becker). Or maybe it should be in D3 at shutdown, which of
course I can't see because, um, the system is shut down!

The node was started with this lilo entry:

image=/boot/vmlinuz
        label="linuxserial"
        root=/dev/hda3
        initrd=/boot/initrd.img
        append="devfs=mount acpi=on resume=/dev/hda2 splash=silent console=tty0 console=ttyS0,38400"
        vga=788
        read-only

How does one test that enable_wol is actually being set during a
shutdown initiated by poweroff?

Thanks,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From hvidal at tesseract-tech.com Thu Feb 24 05:28:14 2005
From: hvidal at tesseract-tech.com (H.Vidal, Jr.)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] test
Message-ID: <421DD66E.6000302@tesseract-tech.com>

sorry, had a lapsed domain and wanted to see if I am still here..
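Returning to the Wake-on-LAN question above: one hedged way to test whether enable_wol actually takes effect at poweroff is from a second machine on the same subnet: power the node down, send it a magic packet, and see if it comes back up. This assumes ethtool and ether-wake (or wakeonlan) are installed; the interface name and MAC address below are examples only.

```
# On the node, before shutdown: see what the driver reports.
# "Wake-on: g" means magic-packet wake is currently enabled.
ethtool eth0 | grep Wake-on

# Shut the node down, then from another machine on the same subnet
# send a magic packet to the node's MAC address (example address):
ether-wake 00:04:76:12:34:56    # or: wakeonlan 00:04:76:12:34:56

# If the node powers up, the WOL setting survived the shutdown path.
```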
From joachim at ccrl-nece.de Thu Feb 24 05:30:09 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <20050223200440.GD2227@greglaptop.internal.keyresearch.com>
References: <20050222050824.GA2195@greglaptop.attbi.com> <421C66DB.5080309@ccrl-nece.de> <20050223200440.GD2227@greglaptop.internal.keyresearch.com>
Message-ID: <421DD6E1.5000200@ccrl-nece.de>

Greg Lindahl wrote:
> I don't think it's "much reduced" by this, but I think it's clear this
> would be a matter of opinion. What you'll definitely be able to do is
> run an application built on a particular Linux version with different
> MPI libraries compiled for that same Linux version. You are correct
> that if the MPI library was built for a wildly different Linux distro
> than the app, you can't necessarily put them together.

This problem left aside: do you know of ISVs that would at least be
willing to think about supporting an MPI ABI no matter which
implementation and interconnect, rather than a specific MPI library?
Because this is what matters. For open source software packages alone, an
ABI is not of critical importance, as people with a TCP/IP cluster can
use pre-linked packages, and people with a high-performance interconnect
cluster typically have enough competence to compile the software
themselves.

>> Another problem is e.g. vendor-specific assertions that could conflict.
>> A solution for this could be "numerical namespaces" for such extensions,
>> but how should they be managed?
>
> This is certainly something that a committee would discuss. There are
> plenty of examples of this problem being solved successfully by
> handing out numeric ranges.

Well, for MAC addresses, PCI device ids etc., there are professional
organisations that take care of this. For MPI, there is no such
institution. ANL? Maybe.
But maybe there's another technical solution: if the linked library could
somehow know which variant of mpi.h the code was compiled against, that
would determine the meaning of all assertions beyond 1024 (or some other
limit). Something coded into MPI_Init() or its arguments might be a
way... hacky, hacky.

>> And what about the different calling conventions in Fortran?
>
> The calling conventions differences (in Linux) revolve around the
> f2c-abi issue, and it so happens that no MPI routines trip on this
> issue, as it only affects functions that return REAL*4 or COMPLEX
> types. Did I miss a function that has those return types?

I did not think of this, but more of issues like "string as an argument",
as the way the string length is passed is not standardized. Then there
are issues with getting access to global variables from COMMON blocks
etc., which are hard (if at all possible) to solve with one shared object
file for multiple compilers. We currently need to link a small extra
object file depending on the compiler used.

This does not mean that we should not continue thinking about an ABI, but
there's more than unifying mpi.h to be able to use a single shared
library.
Joachim

--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de

From ashley at quadrics.com Thu Feb 24 06:20:12 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <421DD6E1.5000200@ccrl-nece.de>
References: <20050222050824.GA2195@greglaptop.attbi.com> <421C66DB.5080309@ccrl-nece.de> <20050223200440.GD2227@greglaptop.internal.keyresearch.com> <421DD6E1.5000200@ccrl-nece.de>
Message-ID: <1109254812.12270.46.camel@localhost.localdomain>

On Thu, 2005-02-24 at 14:30 +0100, Joachim Worringen wrote:
> For open source software packages alone, an ABI is not of critical
> importance as people with a tcp/ip cluster can use pre-linked packages,
> and people with a high-performance interconnect cluster typically have
> enough competence to compile the software themselves.

It's not about competence, it's about time and effort spent. I wouldn't
need to compile every application myself if there were an ABI. It's not a
particularly difficult thing to do; it's just going through the hoops of
doing it every time you need an application. The ability to install a
cluster and type '[apt-get|yum] install pmb' would be a truly wonderful
thing indeed.

You also make the assumption that it's the high-performance vendors who
do things differently; I don't believe this is the case. Quadrics, for
example (my employer), happen to use whatever ABI MPICH (1.2.x) provides,
as we have never had a reason to modify it. I believe the same holds for
Myrinet; I've certainly run binaries compiled against Myrinet MPI on our
MPI stack without obvious problems. Having said that, though, I have
never attempted to verify binary compatibility, and we don't support such
programs but insist they are correctly compiled before support requests
get more than a cursory glance.
Ashley,

From joachim at ccrl-nece.de Thu Feb 24 07:45:27 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <1109254812.12270.46.camel@localhost.localdomain>
References: <20050222050824.GA2195@greglaptop.attbi.com> <421C66DB.5080309@ccrl-nece.de> <20050223200440.GD2227@greglaptop.internal.keyresearch.com> <421DD6E1.5000200@ccrl-nece.de> <1109254812.12270.46.camel@localhost.localdomain>
Message-ID: <421DF697.3070005@ccrl-nece.de>

Ashley Pittman wrote:
> It's not about competence, it's about time and effort spent. I wouldn't
> need to compile every application myself if there were an ABI. It's not a
> particularly difficult thing to do; it's just going through the hoops of
> doing it every time you need an application. The ability to install a
> cluster and type '[apt-get|yum] install pmb' would be a truly wonderful
> thing indeed.

My experience is that such sites want to use their (expensive,
commercial, Fortran) compiler to optimize the binaries for their
platform. In some cases, this requires source-code changes (parameters)
anyway. It's not about PMB.

> You also make the assumption that it's the high-performance vendors who
> do things differently, I don't believe this is the case. Quadrics for
> example (my employer) happen to use whatever ABI MPICH (1.2.x) provides
[...]

I can't see where I made this assumption. Indeed, most interconnect
vendors (Quadrics, various Infiniband, Myrinet, ...) happily plug their
low-level stuff into the latest MPICH and are done. So, it's most often
the cross-interconnect MPI vendors who create their own ABI for some
reason. Other cases are vendors (like us) who started to provide MPI-2
when there was no open-source MPI-2 around. We had to do our own
definitions then.
But this doesn't really matter after all; what matters is whether there
are enough parties to take part in this effort, and to understand the
related issues as much as possible and as early as possible.

Joachim

--
Joachim Worringen - NEC C&C research lab St.Augustin
fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de

From ashley at quadrics.com Thu Feb 24 08:23:47 2005
From: ashley at quadrics.com (Ashley Pittman)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <421DF697.3070005@ccrl-nece.de>
References: <20050222050824.GA2195@greglaptop.attbi.com> <421C66DB.5080309@ccrl-nece.de> <20050223200440.GD2227@greglaptop.internal.keyresearch.com> <421DD6E1.5000200@ccrl-nece.de> <1109254812.12270.46.camel@localhost.localdomain> <421DF697.3070005@ccrl-nece.de>
Message-ID: <1109262227.12272.64.camel@localhost.localdomain>

On Thu, 2005-02-24 at 16:45 +0100, Joachim Worringen wrote:
> My experience is that such sites want to use their (expensive,
> commercial, Fortran) compiler to optimize the binaries for their
> platform. In some cases, this requires source-code changes (parameters)
> anyway. It's not about PMB.

The difference here is between "want" and "need". If they want to do it,
then well done, congratulations; it is often the right thing to do for
performance reasons. In terms of setup time to get a working cluster,
though, there is a difference: having things work out of the box is a
good thing (tm). Of course there is a potential downside: if it works out
of the box then they won't play with compilers (why bother - it works?)
and won't even try to get the extra performance. This isn't a valid
reason not to have an ABI though.

> > You also make the assumption that it's the high-performance vendors who
> > do things differently, I don't believe this is the case. Quadrics for
> > example (my employer) happen to use whatever ABI MPICH (1.2.x) provides
> [...]
>
> I can't see where I made this assumption.
I was referring to this quote from earlier. I thought you were implying
that code should come pre-linked with TCP/IP MPICH (in effect a de-facto
standard) and us "exotic" people would be on our own. In this context
there is nothing exotic or unusual about our MPI. There is no
relationship between the performance of an MPI stack and how much its
interface varies. Sounds like we are on the same wavelength here though.

>> For open source software packages alone, an ABI is not of critical
>> importance as people with a tcp/ip cluster can use pre-linked
>> packages, and people with a high-performance interconnect cluster
>> typically have enough competence to compile the software themselves.

> Indeed, most interconnect
> vendors (Quadrics, various Infiniband, Myrinet, ...) happily plug their
> low-level stuff into the latest MPICH and are done. So, it's most often
> the cross-interconnect MPI vendors which create their own ABI for some
> reason. Other cases are vendors (like us) who started to provide MPI-2
> when there was no open-source MPI-2 around. We had to do our own
> definitions then.

Hhmm. Does this mean that the only reasons for not having an ABI are
historical, purely because we have never had one and there isn't the
inertia to change this? Are there valid technical/performance reasons for
the need to change mpi.h?

> But this doesn't really matter after all; what matters is if there are
> enough parties to take part in this effort, and to understand the
> related issues as much as possible and as early as possible.

Plus the fact that someone (everyone?) has to take the hit of breaking
binary compatibility with all their previous MPI releases when they make
the jump to being compliant. Any volunteers? This might not actually be
so bad with the shared library number versioning scheme; in fact it might
be possible to avoid it completely at the cost of a bit more effort in
the packaging.
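The shared-library versioning scheme mentioned above works through sonames. A hedged sketch (paths, object files, and version numbers are invented for illustration): a vendor can give the ABI-compliant build a new soname, so existing binaries linked against the old library keep resolving it while new binaries pick up the common ABI.

```
# Legacy, vendor-specific ABI keeps its old soname:
gcc -shared -Wl,-soname,libmpi.so.1 -o libmpi.so.1.0 legacy/*.o

# The ABI-compliant build gets a new soname alongside it:
gcc -shared -Wl,-soname,libmpi.so.2 -o libmpi.so.2.0 common/*.o

# ldconfig maintains the libmpi.so.1 / libmpi.so.2 symlinks; binaries with
# NEEDED libmpi.so.1 and ones with NEEDED libmpi.so.2 coexist on one system.
ldconfig
```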
Ashley,

From lindahl at pathscale.com Thu Feb 24 11:13:26 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] The Case for an MPI ABI
In-Reply-To: <421DD6E1.5000200@ccrl-nece.de>
References: <20050222050824.GA2195@greglaptop.attbi.com> <421C66DB.5080309@ccrl-nece.de> <20050223200440.GD2227@greglaptop.internal.keyresearch.com> <421DD6E1.5000200@ccrl-nece.de>
Message-ID: <20050224191325.GA1903@greglaptop.internal.keyresearch.com>

On Thu, Feb 24, 2005 at 02:30:09PM +0100, Joachim Worringen wrote:

> This problem left aside: do you know of ISVs that would at least be
> willing to think about supporting an MPI ABI no matter which
> implementation and interconnect, rather than a specific MPI library?
> Because this is what matters.

As I wrote in my talk, no. Some ISVs will balk because of the testing
issue. However, it is much easier to test against new libraries if there
is an ABI, and no recompilation means no worrying about recompilation
bugs.

> For open source software packages alone, an ABI is not of critical
> importance as people with a tcp/ip cluster can use pre-linked packages,
> and people with a high-performance interconnect cluster typically have
> enough competence to compile the software themselves.

An ABI is still useful for these people. Even if you are using ethernet
and TCP/IP, if you have a queue system and have integrated with LAM, a
pre-linked package using MPICH is not so useful. Critically important?
Probably not. Important? Yes.

> I did not think of this, but more of issues like "string as an argument"
> as the way the string length is passed is not standardized.

Yes, but on x86 and x86-64, Intel and PGI and g77 and PathScale all use
the same convention.

> Then there are issues with getting access to global variables from
> COMMON blocks

This is not an issue, because the MPI interface does not use COMMON
blocks.
The most annoying issue is command-line args, but I am about to write
(and will give away) an Amazing Universal Fortran Command Line Arg
Fetcher (AUFCLAF).

> This does not mean that we should not continue thinking about an ABI,
> but there's more than unifying mpi.h to be able to use a single shared
> library.

I agree. I wasn't claiming to have a final solution for the issues. The
important thing at this stage is not to solve all the issues, but to see
if the benefits of an ABI are compelling enough to form a committee and
work on it.

-- greg

From list-beowulf at onerussian.com Wed Feb 23 13:11:58 2005
From: list-beowulf at onerussian.com (Yaroslav Halchenko)
Date: Wed Nov 25 01:03:50 2009
Subject: [Beowulf] managing debian packages
In-Reply-To:
References: <20050223054408.GS3124@washoe.rutgers.edu>
Message-ID: <20050223211158.GQ3124@washoe.rutgers.edu>

On Wed, Feb 23, 2005 at 08:56:53PM +0000, Guy Coates wrote:

> For the installs, take a look at FAI; it is the automated debian
> installer. It is extremely flexible, so you can hack it about to do pretty
> much anything you want.

yeap - that is what I used to install all the nodes... And actually FAI
has a somewhat neat idea of classes of machines, with installed packages
depending on the class. So it resembles the cfengine approach, which is
why I actually first tried to use the FAI config file as the source of
packages to be installed, and then wrote a cf.fai config which installed
packages using FAI's functionality. The problems with that were: I needed
to manually type in the packages I want on a per-class basis, and there
were issues with the dpkg installation process not closing all FIDs,
probably, so the remote shell never returned, which annoyed me... I've
mentioned a simple trick to get around that on cfengine's wiki, but I
haven't touched this way of installing packages for a long time....

> Once the machines are up, dsh is a fine way of keeping things up to date
> or installing new stuff.
> If you don't want to be pestered by
> configuration questions during package installs, you can change that
> behaviour in /etc/debconf.conf to make everything non-interactive.

That is exactly what I'm doing, pretty much.

> There is also a rather nifty debian package called jablicator. You run it
> on a master system and it creates an empty deb file which depends on every
> deb installed.

Cool - didn't know about this one... I see limited applicability though;
it is pretty much the same as

  dpkg --get-selections | rsh remote host dpkg --set-selections

and probably it doesn't give conflicts, i.e. which packages to remove if
they are not installed on the system where you run jablicator.

My problem is that I'm not sure really what I'm looking for... FAI's way
of configuring installed packages is very appealing to me, but its
implementation is what keeps me away. Let me characterize it: I want to
have some classes, as cfengine does, and depending on which class the
machine belongs to, it gets the necessary packages installed in
non-interactive fashion, using debconf values from the machine on which
it was installed in non-interactive fashion. For instance, a similar
command would probably look like

  apt-getclass lam-nodes install libblablah-dev

which then runs an interactive installation on the first machine from the
lam-nodes netgroup, for instance, clones debconf to the related machines,
and runs a non-interactive install there. Running simply

  dsh -g @lam-nodes apt-get install libblablah-dev

will not suffice because of the possible necessity to configure things.
Hm... probably I just need to write a tiny wrapper around apt-get and
cfrun :-)

--
.-.
=------------------------------ /v\ ----------------------------=
Keep in touch                  // \\  (yoh@|www.)onerussian.com
Yaroslav Halchenko            /(   )\ ICQ#: 60653192
Linux User                     ^^-^^  [175555]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050223/acf6db44/attachment.bin

From rmiguel at usmp.edu.pe Thu Feb 24 06:50:57 2005
From: rmiguel at usmp.edu.pe (rmiguel@usmp.edu.pe)
Date: Wed Nov 25 01:03:51 2009
Subject: [Beowulf] about concept of beowulf clusters
In-Reply-To:
References:
Message-ID: <1109256657.421de9d113bb3@mail.usmp.edu.pe>

Hi, I have a doubt about the strict concept of a Beowulf cluster. Is a
cluster built with commodity hardware only? What about when I build a
cluster using some tools such as OSCAR, or ROCKS, etc. on servers, or
using some kind of high speed network? If I have two Alpha servers with
Linux and open source software connected by a high speed network, is this
a Beowulf cluster?

Thanks for your answers ..

--
[ASCII-art signature: "In a world without fences ... who needs GATES?"]

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

From apseyed at bu.edu Thu Feb 24 12:44:30 2005
From: apseyed at bu.edu (Patrice Seyed)
Date: Wed Nov 25 01:03:51 2009
Subject: [Beowulf] about concept of beowulf clusters
In-Reply-To: <1109256657.421de9d113bb3@mail.usmp.edu.pe>
References: <1109256657.421de9d113bb3@mail.usmp.edu.pe>
Message-ID: <1109277874.348029A3@bd8.dngr.org>

A Beowulf cluster is in itself a loose definition, but it is basically
Linux machines that perform some kind of HPC (high performance computing)
tasks. It could be commodity hardware or not. Also, there is no set of
software that makes it a Beowulf, but there is software that makes one
useful, like the MPI and LAM libraries for parallel computing, or
PBS/Maui and SGE as scheduler/resource managers for
batch/interactive/parallel computing in the cluster.
Rocks and OSCAR are simply toolkits that make installing and managing the
systems easier, e.g. for installing nodes, software management, and
parallel commands. Whether you're using low latency proprietary switches
or GigE, as long as the tasks are related to HPC I think it falls under
the "Beowulf" definition.

-Patrice

On Thu, 24 Feb 2005 3:27 pm, rmiguel@usmp.edu.pe wrote:

> Hi, I have a doubt about the strict concept of a Beowulf cluster. Is a
> cluster built with commodity hardware only? What about when I build a
> cluster using some tools such as OSCAR, or ROCKS, etc. on servers, or
> using some kind of high speed network?
> If I have two Alpha servers with Linux and open source software
> connected by a high speed network, is this a Beowulf cluster?
>
> Thanks for your answers ..
>
> --
> [ASCII-art signature: "In a world without fences ... who needs GATES?"]
>
> ----------------------------------------------------------------
> This message was sent using IMP, the Internet Messaging Program.
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

-Patrice

From alex at DSRLab.com Thu Feb 24 12:48:13 2005
From: alex at DSRLab.com (Alex Vrenios)
Date: Wed Nov 25 01:03:51 2009
Subject: FW: [Beowulf] about concept of beowulf clusters
Message-ID: <200502242040.j1OKeQ1R023076@bluewest.scyld.com>

> -----Original Message-----
> From: beowulf-bounces@beowulf.org
> [mailto:beowulf-bounces@beowulf.org] On Behalf Of rmiguel@usmp.edu.pe
> Sent: Thursday, February 24, 2005 7:51 AM
> To: beowulf@beowulf.org
> Subject: [Beowulf] about concept of beowulf clusters
>
> Hi, I have a doubt about the strict concept of a Beowulf
> cluster.
Is a cluster built with commodity hardware only? What about
> when I build a cluster using some tools such as OSCAR,
> or ROCKS, etc. on servers, or using some kind of high speed network?
> If I have two Alpha servers with Linux and open source
> software connected by a high speed network, is this a Beowulf
> cluster?
>
> Thanks for your answers ..

A mid-1990s paper (I'll dig it out if necessary) described differences
between Beowulf and Beowulf II systems. The initial thrust had high
ideals and essentially put Linux and Beowulf clusters in the limelight. I
didn't follow the argument very well, but my "feeling" from that
distinction was that the reality of a comparatively slow external data
network caused them to relax the restrictions when it came to networking
hardware and software. Keeping these things in balance is hard to do when
the high-end store parts are at Gigabit Ethernet and 3+ GHz processors.
Early Beowulf clusters, in my opinion, were out to see what could be done
with off-the-shelf components, and it was an inspiration to many of us
without the US Treasury behind our projects.

Alex

P.S. Have a look through Issue #100 of Linux Journal for further info. It
had follow-ups to earlier introductory articles on Beowulfs, and that
earlier issue number is referenced there.

From jrollins at ligo.mit.edu Thu Feb 24 15:20:21 2005
From: jrollins at ligo.mit.edu (Jamie Rollins)
Date: Wed Nov 25 01:03:51 2009
Subject: [Beowulf] motherboards for diskless nodes
Message-ID:

Hello. I am new to this list, and to beowulfery in general. I am working
at a physics lab and we have decided to put together a relatively small
beowulf cluster for doing data analysis. I was wondering if people on
this list could answer a couple of my newbie questions.

The basic idea of the system is that it would be a collection of 16 to 32
off-the-shelf motherboards, all booting off the network and operating
completely disklessly.
We're looking at the amd64 architecture running Debian, although we're
flexible (at least with the architecture ;). Most of my questions have to
do with diskless operation.

How netboot-capable are modern motherboards with on-board NICs? I have
experience with a couple that support PXE. However, I have been having a
hard time finding information on-line stating explicitly that a given
motherboard and/or BIOS supports netbooting. The only thing I've been
able to find so far is the Tyan K8SR, which uses the AMI BIOS 8.0. I get
the impression that most MBs that have gigabit probably support PXE
booting, but I was curious what others' impressions are.

Something else that we're looking for, which I believe is far more
esoteric and has been equally hard to find information about, is BIOS
serial console redirect, i.e. being able to control the BIOS from the
serial port. I've been getting more and more into accessing machines
through the serial port. The only thing holding me back from throwing out
the video and keyboard entirely is being able to access and control the
BIOS through the serial port as well. This would also eliminate the need
for on-board video, which can only be good. This question is obviously
related to the one in the previous paragraph about netboot, since they
are both about features of the MB and the BIOS. The Tyan K8SR and the
like with the AMI BIOS 8.0 also claim to support this feature.

If anyone has any suggestions for specific MBs that would fit these three
requirements (netboot, serial BIOS redirect, Debian), or at least some
ideas about where to look, that would be a huge help.

A related question is whether anyone has any experience with LinuxBIOS,
and if so, on what hardware. This might be too big of an issue to bring
up, but I would love to hear people's experiences using LinuxBIOS to boot
diskless cluster nodes over the network.

Thanks so much for the help, and I really look forward to becoming part
of this community.

Jamie Rollins.
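On the server side, the PXE netboot described above is just DHCP plus TFTP: the node's PXE firmware asks DHCP for an address, a "next-server" and a boot filename, then fetches that file over TFTP. A hedged sketch of an ISC dhcpd.conf fragment — the addresses are examples only, and pxelinux.0 is the PXELINUX loader shipped with the syslinux package:

```
subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.100 192.168.1.131;   # the 16-32 diskless nodes
    option routers 192.168.1.1;
    next-server 192.168.1.1;             # TFTP server holding the boot files
    filename "pxelinux.0";               # PXELINUX bootloader from syslinux
}
```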
From James.P.Lux at jpl.nasa.gov Thu Feb 24 16:19:30 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: References: Message-ID: <6.1.1.1.2.20050224161442.07749138@mail.jpl.nasa.gov> At 03:20 PM 2/24/2005, Jamie Rollins wrote: >Hello. I am new to this list, and to beowulfery in general. I am working >at a physics lab and we have decided to put together a relatively small >beowulf cluster for doing data analysis. I was wondering if people on >this list could answer a couple of my newbie questions. > >The basic idea of the system is that it would be a collection of 16 to 32 >off-the-shelf motherboards, all booting off the network and operating >completely disklessly. We're looking at amd64 architecture running >Debian, although we're flexible (at least with the architecture ;). Most >of my questions have to do with diskless operation. > >How netboot-capable are modern motherboards with on-board nics? Very capable. Easier to do netboot than almost anything else. >I >have experience with a couple that support PXE. However, I have >been having a hard time finding information on-line stating explicitly that >a given motherboard and/or bios supports netbooting. The only thing I've >been able to find so far is the Tyan K8SR that uses the AMI BIOS 8.0. I >get the impression that most MBs that have gigabit probably support PXE >booting, but I was curious what others' impressions are. One way to check is to look at the BIOS manual for your mobo (they're typically online) and see if they mention a "boot from network" option. As a practical matter, I think almost ALL mobos these days can do network boot. Now, if someone could answer whether they can netboot over an 802.11 card, I'd be real interested.
(the question is whether the bios has enough smarts to bring up the wireless interface) >Something else that we're looking for that I believe is far more esoteric >and has been equally hard to find information about is BIOS serial console >redirect, i.e. being able to control the bios from the serial port. I've >been getting more and more into accessing machines through the serial >port. That's probably a bit dicey.. While netboot is an "essential" feature for modern business environments (which drives mobo development, to a large degree), serial access is not. In fact, a lot of "legacy free" mobos have NO serial port. However, for mobos aimed at the "server" application, remote management is a big deal, so serial port management might be more likely. >A related question is whether anyone has any experience with LinuxBios and >if so, on what hardware. This might be too big of an issue to bring up, >but I would love to hear people's experiences using LinuxBios to boot >diskless cluster nodes over the network. Google for it, and you'll find several examples of LinuxBIOS and clusters. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From becker at scyld.com Thu Feb 24 16:41:22 2005 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] about concept of beowulf clusters In-Reply-To: <1109256657.421de9d113bb3@mail.usmp.edu.pe> References: <1109256657.421de9d113bb3@mail.usmp.edu.pe> Message-ID: On Thu, 24 Feb 2005 rmiguel@usmp.edu.pe wrote: > Hi, I have a doubt about the strict concept of a Beowulf cluster. Is a cluster > built with commodity hardware only?.. what's up when I build a cluster using > some tools such as OSCAR, or ROCKS, etc. on servers or using some kind of high speed > networks?.
> If I have two Alpha servers with Linux and open source software connected by a > high speed network.. is this a Beowulf cluster?

My definition of a cluster:
   independent machines combined into a unified system through software and networking

The Beowulf definition is:
   commodity machines connected by a private cluster network, running an open source software infrastructure, for scalable performance computing

Traditionally the term "Beowulf cluster" has included non-PC architectures such as the Alpha, and somewhat specialized networks such as Myrinet, but excluded purpose-built tightly coupled machines such as the Cray T3E and Digital SC.

We can come back to the "cluster" definition. We are starting with general purpose machines capable of independent operation, generally those with a broad market appeal. The goal is to make them appear to be a single machine. We start by networking them together, then we add a software layer to smooth over the ugliness caused because we couldn't custom design the hardware. To distinguish the independent machines from the aggregate machine, we call the former "nodes" and the latter the "cluster".

The Beowulf definition sets a category by excluding other important classes:

commodity machines
   We are excluding custom-built hardware, e.g. a single Altix is not a Beowulf cluster (or even a cluster by the strict definition).
connected by a cluster network
   These machines are dedicated to being a cluster, at least temporarily. This excludes cycle scavenging from NOWs and wide area grids.
running an open source infrastructure
   The core elements of the system are open source and verifiable.
for scalable performance computing
   The goal is to scale up performance over many dimensions, rather than simulate a single more reliable machine, e.g. fail-over. Ideally a cluster incrementally scales both up and down, rather than being a fixed size.

The original challenges for building clusters were very basic:
   can we build them at all?
   how can we get the nodes to communicate?
   do they do anything useful?

In the early days the answers were:
   you have to build them yourself
   by writing and improving the basic networking
   for a few applications you can use basic message passing

There were many intermediate steps, but those problems were solved half a decade ago:
   You can buy stock cluster configurations from many vendors.
   Good OS networking and libraries such as MPI are established.
   Most HPTC applications run well on small-scale clusters.

The real challenges were obvious:
   Can we remove compute density as an obstacle to adoption?
   The nodes can talk to each other; now how do we provision and manage clusters that scale in production deployments?
   How can we support essentially all applications, and solve the programming problem?

Donald Becker becker@scyld.com Scyld Software Scyld Beowulf cluster systems 914 Bay Ridge Road, Suite 220 www.scyld.com Annapolis MD 21403 410-990-9993 From dld at cmb.usc.edu Thu Feb 24 17:00:53 2005 From: dld at cmb.usc.edu (Drake Diedrich) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] Re: motherboards for diskless nodes In-Reply-To: References: Message-ID: <20050225010053.GA31456@app1.cmb.usc.edu> On Thu, Feb 24, 2005 at 06:20:21PM -0500, Jamie Rollins wrote: > > How netboot-capable are modern motherboards with on-board nics? I > have experience with a couple that support PXE. However, I have > been having a hard time finding information on-line stating explicitly that > a given motherboard and/or bios supports netbooting. The only thing I've > been able to find so far is the Tyan K8SR that uses the AMI BIOS 8.0. I > get the impression that most MBs that have gigabit probably support PXE > booting, but I was curious what others' impressions are. You probably want to buy one in advance to test how reliable it is when PXE booting. We have a 64-node cluster with local disks that have no CDROMs or floppies, and we do maintenance and installs by net booting. It isn't reliable.
We have to reboot several times to get the things to hear the PXE/DHCP replies and boot the pxelinux.0 image when attempting to reinstall a node. The motherboard is the MSI K8D Master with 2x Broadcom tg3 gigE. Many computers do netboot reliably in our environment (including laptops, older P-III Tyan boards, new Xeon Supermicro/e1000 boards, etc). Not all motherboard PXE/DHCP boot implementations are equal and up to the task for completely diskless use. If you switch to a slightly newer motherboard on deployment, all bets are off again (yes, made that mistake once, but had a friendly supplier who let us exchange parts until it all worked). If you deal with temporary files, want to suspend and swap out large low-priority jobs, etc, you probably want a local disk on each node anyway. Spending a couple gigs of that for a locally installed O/S isn't much of a drama, especially on ~16 nodes. It makes updates more reliable, as libraries/binaries that are in use remain on local disk even when dpkg replaces them, and only get deleted when no longer in use. NFS (being stateless) doesn't have this behavior, so after an update you may occasionally have jobs/daemons fail when they try to page in a file that has already been replaced. If you don't have a central fileserver yet, you can also spread your users' home directories among the disks on the nodes to avoid NFS contention (though this means no RAID unless you buy two disks per node). If one user launches some sort of cluster-wide NFS bomb, they only take out themselves and the job running on the node with their home directory. Users do this - they launch a stack of simultaneous jobs that all load lots of data off the filesystem, flattening whichever fileserver has their home directory. Building a single high-end fileserver that can survive the same load without severely impacting all other users is expensive and tough.
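The home-directory spreading Drake describes is commonly done with an automounter map, so every node sees the same /home namespace while the storage (and the load) is split across node disks. A sketch only; the autofs file locations are conventional, but all hostnames, usernames, and export paths are illustrative assumptions:

```
# /etc/auto.master -- have autofs mount /home entries on demand
/home   /etc/auto.home

# /etc/auto.home -- each user's home lives on whichever node exports it,
# so one user's I/O storm only flattens "their" node
alice   node03:/export/home/alice
bob     node07:/export/home/bob
```

The trade-off, as noted above, is that without a second disk per node there is no RAID protecting any of these homes.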
> > Something else that we're looking for that I believe is far more esoteric > and has been equally hard to find information about is BIOS serial console > redirect, i.e. being able to control the bios from the serial port. I've > been getting more and more into accessing machines through the serial > port. The only thing holding me back from throwing out the video and > keyboard entirely is being able to access and control the BIOS through the > serial port as well. We're using an Appro blade rack with devices that bring the critical ports (serial, power, and reset) back to the "blade control center" (BCC). It does more than just a serial console, and has a much smaller cable footprint. To be honest though, I never use it. We power down automatically using ACPI when the A/C fails (far too often), rather than attempt to coordinate the BCC pulling the power plug after a node shuts down. For console/BIOS, I prefer to just use a long monitor/keyboard cable and plug it directly into the node with the problems. A 64-way KVM switch with 64 cables would block airflow and access to the nodes, as well as being expensive. The long cable also works for all the other non-BCC nodes in the adjacent racks, since it can reach the whole row. There aren't really that many motherboards out there without onboard video, and anything works for a text console. Don't be afraid to be a Luddite if it works. :) You can spend the money saved on serial cables and KVM switching hardware on noise cancelling headphones. -Drake From eno at dorsai.org Thu Feb 24 18:39:31 2005 From: eno at dorsai.org (Alpay Kasal) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: Message-ID: <0ICG009WS4JWON@mta6.srv.hcvlny.cv.net> Jamie, I went with Pentium 4 with my recent purchases and I assumed any new motherboards would be able to boot off the network without issue. I was wrong - bigtime.
I was buying for small footprint & low cost and I went through about 5 motherboards before I found one that hit the server properly every time (the time wasted in finding hardware really hurt). I am using the Gigabyte 81845gvm now (but beware of BIOS revs 4, 5, or 6: they break PXE; only rev 3 works every time). I did not test EVERY BIOS version with all the mobos I tried, but learned that the same chipset and BIOS on different mobos are not all created equal. Do your research, then buy one for evaluation from a vendor who will take the hardware back. That's the only way to know for sure that your entire setup will work. Then come back here and report your findings :) PS: PXE is not the only embedded method of booting from the NIC. I forget the other method, but it's often paired with onboard realtek8139's, an old Intel standard (?) which is useless these days but still found on some new motherboards. Maybe someone else here knows what I'm referring to. Pps: I'm a Beowulf newb and I am using XP Embedded in my setup. Alpay -----Original Message----- From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Jamie Rollins Sent: Thursday, February 24, 2005 6:20 PM To: beowulf@beowulf.org Cc: debian-beowulf@lists.debian.org Subject: [Beowulf] motherboards for diskless nodes How netboot-capable are modern motherboards with on-board nics? I have experience with a couple that support PXE. However, I have been having a hard time finding information on-line stating explicitly that a given motherboard and/or bios supports netbooting. The only thing I've been able to find so far is the Tyan K8SR that uses the AMI BIOS 8.0. I get the impression that most MBs that have gigabit probably support PXE booting, but I was curious what others' impressions are.
From becker at scyld.com Thu Feb 24 19:03:59 2005 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] Re: motherboards for diskless nodes In-Reply-To: <20050225010053.GA31456@app1.cmb.usc.edu> References: <20050225010053.GA31456@app1.cmb.usc.edu> Message-ID: On Thu, 24 Feb 2005, Drake Diedrich wrote: > On Thu, Feb 24, 2005 at 06:20:21PM -0500, Jamie Rollins wrote: >> >> How netboot-capable are modern motherboards with on-board nics? I >> have experience with a couple that support PXE. PXE is the only standard way to netboot a PC. It has swept away the few other proprietary approaches (e.g. RPL). I'm not a fan of the PXE protocol and specification details. It's ugly. They picked bad semantics. They picked bad protocols. They picked exceptionally bad parameters. I'm a huge fan of PXE as a standard. It didn't need to be right. It just needed to be good enough that we can make it work reliably. PXE, and all of its ugliness, is gone in two seconds. And with 50+ million installations, it's everywhere. That Is Good. >> However, I have >> been having a hard time finding information on-line stating explicitly that >> a given motherboard and/or bios supports netbooting. Virtually every current motherboard with Ethernet supports PXE booting. A few years ago, when Gigabit Ethernet was new, some motherboards had both Fast and Gb Ethernet just because there was no PXE and WOL support for the GbE chips. > You probably want to buy one in advance to test how reliable it is when > PXE booting. We have a 64-node cluster with local disks that have no CDROMs > or floppies, and we do maintenance and installs by net booting. It isn't > reliable. We have to reboot several times to get the things to hear the > PXE/DHCP replies and boot the pxelinux.0 image when attempting to reinstall > a node. The specific problem here is very likely the PXE server implementation, not the client side.
I'm guessing that you are using the ISC DHCP server combined with one of the stand-alone TFTP servers. This can't provide true PXE service, it cannot work around more than a single version of PXE bugs, and it has significant "scalability challenges" when many machines are simultaneously booting. Almost every BIOS uses the Intel PXE client code unchanged, and it accepts DHCP responses with static PXE information. We ended up writing our own integrated PXE server to reliably boot compute nodes. A purpose-built PXE server can:
- interpret the initial request to work around different generations of PXE client bugs. The BIOS code is unlikely to be fixed, and there are some pretty ugly bugs. (What does a file name of '' mean? Use the last file requested...)
- work around the TFTP capture effect, where clients that drop a packet are squeezed out and quickly give up, leaving the machine powered on but useless.
- defer answering new requests when especially busy, but always respond before the client times out.

Just as importantly, our PXE server uses and updates the single cluster configuration file. Before writing the server we went through several rounds of writing configuration files from other configuration files, and each time we ended up with a fragile implementation that was difficult to debug. > older P-III Tyan boards, new Xeon Supermicro/e1000 boards, etc). Not all > motherboard PXE/DHCP boot implementations are equal and up to the task for > completely diskless use. If you switch to a slightly newer motherboard on > deployment, all bets are off again (yes, made that mistake once, but had a > friendly supplier who let us exchange parts until it all worked). Yup, they are using the Intel code, but with different bugs. Read notes on the web about using the ISC DHCP server: "You can work around bug #1 with canned response #1, which is incompatible with the response to fix bug #2."
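For readers following along, the stand-alone arrangement Becker is contrasting against is typically a PXE stanza in the ISC DHCP server's dhcpd.conf plus a separate TFTP daemon. A sketch of that conventional setup; all addresses and filenames are illustrative assumptions, not values from this thread:

```
# dhcpd.conf fragment -- the conventional ISC DHCP + TFTP approach
# (the kind of static configuration an integrated PXE server replaces)
subnet 10.0.0.0 netmask 255.255.255.0 {
  range 10.0.0.100 10.0.0.199;
  next-server 10.0.0.1;       # TFTP server holding the boot images
  filename "pxelinux.0";      # boot loader every PXE client will request
}
```

Note that this hands every client the same static answer, which is exactly why it cannot adapt to different generations of PXE client bugs.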
> If you deal with temporary files, want to suspend and swap out large > low-priority jobs, etc, you probably want a local disk on each node anyway. I completely agree. Local disk is the best I/O bandwidth for the buck. > Spending a couple gigs of that for a locally installed O/S isn't much of a > drama, especially on ~16 nodes. But it's the long-term administrative effort that costs, not the disk hardware. The need to maintain and update a persistent local O/S is the root of most of that cost. > It makes updates more reliable, as libraries/binaries that are > in use remain on local disk even when dpkg replaces them, and only get > deleted when no longer in use. NFS (being stateless) doesn't have > this behavior, so after an update you may occasionally have > jobs/daemons fail when they try to page in a file that has already been > replaced. A cluster has a richer versioning environment than a local machine. Simultaneously using different versions of long-running applications is something you have to consider when running a cluster system. And as you point out, NFS sometimes does do the Right Thing. But a persistent local install isn't the only way to accomplish this. We put a specialized whole-file-caching filesystem underneath our system. It's only used for libraries and executables, and it tracks them by version, not just path name. Since it retains the whole file, we don't encounter the unrecoverable problem of a page-in failure. And since the node only fully accepts a process after caching the required files, we avoid other failure points. > If you don't have a central fileserver yet, you can also spread your > users' home directories among the disks on the nodes to avoid NFS contention > (though this means no RAID unless you buy two disks per node). NFS isn't bad. Nor does it necessarily doom a server to unbearable loads.
For some types of file access, especially read-only access to small (<8KB) configuration files such as ~/.foo.conf, it's pretty close to optimal. What you don't want to use it for is:
- paging in programs and libraries, especially not in a big cluster with big applications
- writing files that are used for synchronization: NFS uses semi-synchronous writes, which kills performance, and unpredictable time-based cache flushing, which kills consistency

>> Something else that we're looking for that I believe is far more esoteric >> and has been equally hard to find information about is BIOS serial console >> redirect, i.e. being able to control the bios from the serial port. Serial consoles are pretty common today, albeit not well documented. But using them is a hardware problem. You double your connection count, with non-standard cables, large connectors and expensive serial port concentrators. Compare that to Ethernet, with standard cables, tiny but robust connectors, link beat lights, cheap switches and simple configuration rules. A much better solution is using a software system that has reliable booting. That means:
- Unchanging boot firmware on the node
- Minimal hardware configuration before contacting the boot server
- Configuration reporting as part of the boot negotiation
- No boot dependence on node configuration, e.g. file system contents
- Complete replacement of the boot software by the boot server
- Immediate status logging over the network
- Non-boot drivers and configuration controlled by the boot server

These are simple principles, but almost every system out there misses at least one. > down. For console/BIOS, I prefer to just use a long monitor/keyboard cable > and plug it directly into the node with the problems. The "crash cart" approach. I believe in it. You should only need to use the console when you have a hardware problem, and you'll need to be right by the machine anyway. Footnote b1tch: Yup, PXE is ugly and stupid.
Most are hidden, but one observable stupidity is how it tries to locate a server. The client tries for 1+2+4+16+32 seconds to locate a PXE server. A switch with spanning tree protocol enabled doesn't pass traffic for 60 seconds to avoid network loops. It should try for a slightly longer or much shorter period. And the exponential fallback is pointless. Apparently someone thought that it would be clever to use an Ethernet-like fallback, imagining it would avoid congestion. But it would take thousands of machines to saturate even 10Mbps Ethernet. It's common for the first packet or two to be dropped as the network link stabilizes, leading to a two or four second delay. Donald Becker becker@scyld.com Scyld Software Scyld Beowulf cluster systems 914 Bay Ridge Road, Suite 220 www.scyld.com Annapolis MD 21403 410-990-9993 From john.hearns at streamline-computing.com Fri Feb 25 00:16:13 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: References: Message-ID: <1109319374.6055.17.camel@Vigor45> On Thu, 2005-02-24 at 18:20 -0500, Jamie Rollins wrote: > Hello. I am new to this list, and to beowulfery in general. I am working > at a physics lab and we have decided to put together a relatively small > beowulf cluster for doing data analysis. I was wondering if people on > this list could answer a couple of my newbie questions. > > The basic idea of the system is that it would be a collection of 16 to 32 > off-the-shelf motherboards, all booting off the network and operating > completely disklessly. We're looking at amd64 architecture running > Debian, although we're flexible (at least with the architecture ;). Most > of my questions have to do with diskless operation. Jamie, why are you going diskless? IDE hard drives cost very little, and you can still do your network install. Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go.
BTW, have a look at ClusterWorld (http://www.clusterworld.com). They have a project for a low-cost cluster which is similar to your thoughts. Also, with the caveat that I work for a clustering company, why not look at a small turnkey cluster? I fully acknowledge that building a small cluster from scratch will be a good learning exercise, and you can get to grips with the motherboard, PXE etc. However, if you are spending a research grant, I'd argue that it would be cost-effective to buy a system with support from any one of the companies that do this. If you get a prebuilt cluster, the company will have done the research on PXE booting, chosen gigabit interfaces and switches which perform well, and chosen components which will last. And when your power supplies fail, or a disk fails, someone will come round to replace them. And you can get on with doing your science. From starship at mtaonline.net Fri Feb 25 00:15:30 2005 From: starship at mtaonline.net (Starship Warrior) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] where can i learn to build a cluster machine?
Message-ID: <421EDEA2.30206@mtaonline.net> I am totally new to clusters but have been a list member for some time, reading all the emails and trying to learn more - so can anyone tell me where there is a good how-to guide? I have three machines that I would like to run Linux on and cluster together just to learn more. Two are 3.0 GHz Pentiums and the last one maybe a 3.5 or 3.6, not sure yet. The two have the same ASUS MB and will use SCSI drives; the third will also have an ASUS MB, just not sure yet what I will get. Thanks for any and all information. Starship Warrior Cluster user wannabe LOL From eno at dorsai.org Fri Feb 25 00:37:33 2005 From: eno at dorsai.org (Alpay Kasal) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: <0ICG009WS4JWON@mta6.srv.hcvlny.cv.net> Message-ID: <0ICG00219L4MPX@mta3.srv.hcvlny.cv.net> Donald Becker just mentioned "RPL"; I do believe that is what I was referring to in my last response. >PS: PXE is not the only embedded method of booting from the NIC. I forget the other method but it's often paired with onboard realtek8139's, an old Intel standard (?) which is useless these days but still found on some new motherboards. Maybe someone else here knows what I'm referring to. From gotero at linuxprophet.com Fri Feb 25 00:23:31 2005 From: gotero at linuxprophet.com (Glen Otero) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] O'Reilly Clusters Book Review Message-ID: My review of O'Reilly's latest clusters book, published at HPCwire (http://www.tgc.com/hpcwire.html): > 'Crazy Talk' Clutters New Cluster Book > Glen Otero, Linux Prophet > When my colleagues and I heard that O'Reilly was releasing another > cluster book ("High Performance Linux Clusters with OSCAR, Rocks, > openMosix & MPI"), we knew it would not turn out well. One of my > colleagues even said, "It's going to be written by some guy that > doesn't know anything and [gets all excited] over clusters." > > Why such a pessimistic prediction?
> > For one, it was uttered by the same cluster expert that O'Reilly > ignored while producing their first cluster book debacle several > years > ago. When told that their first book ("Building Linux Clusters" by > David Spector) should be scrapped and rewritten, O'Reilly ignored > their reviewers. The advice only came from the knowledgeable folks at > VA Linux, *the* cluster company at that time. But what does VA Linux > know? It's O'Reilly, they obviously know better. > The first O'Reilly cluster book was a complete disaster. I wrote a > scathing review of it for Linux Journal in 2000. Completely void of > anything useful, the book and included software were simply not > finished. It was like reading a rough draft. Totally embarrassed, and > suddenly void of hubris, O'Reilly apologized to its audience and > pulled the book from print. > Not satisfied to sit around pointing fingers and complaining, I told > O'Reilly I would help them with their next cluster book attempt, if > there even was one. Before long, I signed a contract to write a > clusters book for O'Reilly. But in their infinite wisdom, they didn't > like the first few chapters that I submitted. Although I had gotten > other cluster experts to review what I had written, O'Reilly didn't > bother to get any experts to review what I was writing. They just > didn't like it, so they dismissed it out of hand. Needless to say the > "we know better" attitude was back, and that ended the contract. > > Which brings us to present day. This latest cluster book suffers from > the same brain damaged, hubris-driven process at O'Reilly. Just like > the first book, it's written by a virtual unknown in the cluster > community (Joseph D. Sloan) and comes across as having been written > in > a vacuum. > > Let's start with the book's title, "High Performance Linux Clusters > with OSCAR, Rocks, openMosix & MPI."
There's nothing high-performance > about this book because there's no discussion of using any high > performance networks like Myrinet, Infiniband, or Quadrics outside of > four paragraphs on page 40. There are so many ill-informed sweeping > generalizations made about cluster networks on that page that I threw > the book against the wall when I read them. For example, Quadrics and > Infiniband are clearly established networking technologies, not > merely > "emerging," as the author believes. Sloan obviously hasn't attended a > Supercomputing conference in the last several years. Unfortunately, > the rest of the book is rife with several inaccurate cluster > oversimplifications and incorrect definitions of terms like single > system image (SSI) and virtual machine interface (VMI). The > "beginner's guide" design of the book is no excuse for inaccuracies > and oversimplifications. > > In my eyes, this book was doomed for the trash after page 8. Sloan > states that the term "Beowulf" is a politically charged term that > would be avoided in the book. That is the most ridiculous thing I > have ever heard. It's impossible to take that comment seriously, > especially since the author doesn't even take the time to properly > define a Beowulf. For these reasons alone, I can't take this book > seriously. I've thrown back my share of adult beverages with Don > Becker, and trust me when I say that the political nature of Beowulf > has never come up. Adding to the confusion, the phrase "more > traditional Beowulf-style cluster" is then used on page 63. I hope > now > you'll understand why I think this book is schizophrenic at best. > > Defining a Beowulf shouldn't have been too difficult for Sloan. He > could have used a term that he introduced on page 10, "asymmetric > cluster."
But I guess it's too much to ask > that the Beowulf project, > Tom Sterling and Don Becker's brainchild that started the high > performance cluster phenomenon, be properly described and defined in > a > clusters book. By the way, I've never heard the term "asymmetric > architecture" used when describing clusters. And, outside this book, > you won't either. > > After page 8, it's apparent that the author has nothing original to > offer and is going to regurgitate what has already been written about > clusters. There is absolutely no value in this because the online > documentation for all of the cluster projects covered by the author > is > far more informative than what is included in the book. For example, > while screenshots of a cluster install are included in the online > Rocks documentation, they are omitted in the book. Furthermore, after > regurgitating much of the online Rocks documentation, the author > doesn't offer any additional helpful hints or troubleshooting advice. > As someone who runs a company that provides and supports cluster > software based on Rocks, I can tell you that there are plenty of > pitfalls that should have been mentioned. > > This underscores my major complaint with this book. There's nothing > new, nothing novel and no real help offered. Everything is just laid > out superficially in front of the reader for them to make the right > cluster decision. The book should guide the cluster decision-making > process, but it only offers a bunch of questions -- with no > substantial answers. > > Sloan even admits on page 91 that there is a very detailed set of > installation instructions for OSCAR, including screen shots, > available > online. So why is this book necessary again? Oh yeah, the author is > supposed to help the reader decide if OSCAR, or any cluster toolkit > for that matter, is right for the reader. Unfortunately, no help of > any kind is offered.
> > The typos and omissions weren't rampant this time, but the errors I > found on pages 76, 123, 127, 130, and 136 provided nasty flashbacks > of > the first O'Reilly book. Good thing I resigned myself to do a shot of > tequila after every typo I found. It dulled the pain this book > inflicted. > > OK. "Part I -- An Introduction to Clusters" is just inaccurate and > infuriating. "Part II -- Getting Started Quickly" contains recycled > and reformatted content easily found for free online. "Part III -- > Building Custom Clusters" isn't really about building custom > clusters, > but looks more closely at some software that was glossed over in > Parts > I & II. While I don't agree with the inclusion of the parallel > virtual > file system (PVFS) and the omission of Sun Grid Engine in Part III, > I'm sure this can be chalked up to one of the tough decisions the > author had to make, like the omission of PVM and Condor from the > book. > "Part IV -- Cluster Programming" is actually a very good introduction > to programming, debugging, and profiling MPI programs. > > It's obvious that this book has no clear identity. It's like a 5th > grader's book report: a lifeless facsimile of what's been read, > totally void of originality, wisdom or topic advancement. But it's a > quick read because it uses small words. > > Should I be this harsh? After all, cluster computing is a complex > subject where the answer to most questions is "it depends." However, > I believe that O'Reilly owed us an excellent book after their first > cluster gaffe, so I'm disappointed that O'Reilly took the easy way > out > by reorganizing and watering down documentation that is available > elsewhere. Even the content in the exemplary Part IV can be found in > several other places. It's just a lot less technical and intimidating > here. > > There are better ways to write a clusters book.
I know because I've > read several cluster book outlines by members of the cluster > intelligentsia that would have been better than this offering. So I'm > not going easy on O'Reilly, no matter how good their intentions. The > cluster community has a difficult enough time assisting people with > clusters without books like this dynamiting the proverbial cluster > well. The statement on page 28, "...benchmarking is probably a > meaningless activity and waste of time," is just plain wrong and > demonstrates a glaring lack of cluster understanding. > > If you really want to learn about clusters, pick up a copy of > Sterling's "Beowulf Cluster Computing with Linux," 2nd edition, or > check out Warewulf, Rocks, OSCAR, OpenMosix, and ClusterWorld online. > You could join a mailing list, like the Beowulf mailing list, and > subscribe to ClusterWorld Magazine. This is where the creators and > maintainers of all that is clustering hang out, announce, debate, > rant, create, lurk, help, and publish. If you want to be part of > clustering's future, then you'll check out the community's Cluster > Agenda and attend this year's ClusterWorld conference. > > ================================================= > Glen Otero received his Ph.D. in Microbiology and Immunology from > UCLA > in 1995 and immediately escaped to the more temperate climes and > better surf in San Diego. After some research on the molecular and > cellular biology of HIV and Herpes viruses at the Salk Institute for > Biological Sciences, Glen left the wet lab research bench in 1999. > Although leaving the research bench, he didn't leave science > altogether; traveling all the way across the street to the San Diego > Supercomputer Center (SDSC) for a stint at the Protein Data Bank. It > was while at SDSC that Glen had his Linux clusters and bioinformatics > epiphany. 
Soon after that illuminating event, Glen founded Linux
> Prophet, a bioinformatics consultancy specializing in the
> implementation, design, and deployment of Linux Beowulf clusters in
> the life sciences. Late in 2002 Linux Prophet evolved into Callident,
> a Linux cluster software and high performance computing company.
>
> Glen Otero Ph.D.
> Linux Prophet

From ctierney at HPTI.com Fri Feb 25 09:17:44 2005
From: ctierney at HPTI.com (Craig Tierney)
Date: Wed Nov 25 01:03:51 2009
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109319374.6055.17.camel@Vigor45>
References: <1109319374.6055.17.camel@Vigor45>
Message-ID: <1109351864.2883.22.camel@localhost.localdomain>

On Fri, 2005-02-25 at 01:16, John Hearns wrote:
> On Thu, 2005-02-24 at 18:20 -0500, Jamie Rollins wrote:
> > Hello. I am new to this list, and to beowulfery in general. I am working
> > at a physics lab and we have decided to put together a relatively small
> > beowulf cluster for doing data analysis. I was wondering if people on
> > this list could answer a couple of my newbie questions.
> >
> > The basic idea of the system is that it would be a collection of 16 to 32
> > off-the-shelf motherboards, all booting off the network and operating
> > completely disklessly. We're looking at amd64 architecture running
> > Debian, although we're flexible (at least with the architecture ;). Most
> > of my questions have to do with diskless operation.
>
> Jamie,
> why are you going diskless?
> IDE hard drives cost very little, and you can still do your network
> install.
> Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go.
>

IDE drives fail, they use power, you waste time cloning, and
depending on the toolkit you use you will run into problems
with image consistency.

I have run large systems of both kinds. The last system was
diskless and I don't see myself going back. I like changing
one file in one place and having the changes show up immediately.
I like installing a package once, and having it show up immediately,
so I don't have to reclone or take the node offline to update
the image.

Craig

>
> BTW, have a look at Clusterworld http://www.clusterworld.com
> They have a project for a low-cost cluster which is similar to your
> thoughts.
>
> Also, with the caveat that I work for a clustering company,
> why not look at a small turnkey cluster?
> I fully acknowledge that building a small cluster from scratch will be
> a good learning exercise, and you can get to grips with the motherboard,
> PXE etc.
> However if you are spending a research grant, I'd argue that it would be
> cost effective to buy a system with support from any one of the
> companies that do this.
> If you get a prebuilt cluster, the company will have done the research
> on PXE booting, chosen gigabit interfaces and switches which perform
> well, chosen components which will last. And when your power supplies
> fail, or a disk fails someone will come round to replace them.
> And you can get on with doing your science.

From alvin at iplink.net Fri Feb 25 09:22:05 2005
From: alvin at iplink.net (alvin)
Date: Wed Nov 25 01:03:51 2009
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <0ICG00219L4MPX@mta3.srv.hcvlny.cv.net>
References: <0ICG00219L4MPX@mta3.srv.hcvlny.cv.net>
Message-ID: <421F5EBD.2030305@iplink.net>

Alpay Kasal wrote:

>Donald Becker just mentioned "RPL", I do believe that is what I was
>referring to in my last response.
>
>>PS: PXE is not the only embedded method of bootnic. I forget the other
>>method but it's often paired with onboard realtek8139's, an old intel
>>standard (?)
which is useless these days but still found on some new
>>motherboards. Maybe someone else here knows what I'm referring to.

RPL, if I remember correctly, was an IBM/Novell protocol for booting.
There are reports it can be made to boot linux directly, but I had it
boot an etherboot image. Etherboot is another boot protocol; I have a
funny feeling it was based on a Sun protocol that was around before
Open Boot. Etherboot can be put into non-PXE flash bioses and has
support for things like serial console. There could be an argument
made to rip the PXE out of your new motherboard BIOS and replace it
with etherboot.

Alvin

From rgb at phy.duke.edu Fri Feb 25 09:38:10 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:03:51 2009
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <0ICG00219L4MPX@mta3.srv.hcvlny.cv.net>
References: <0ICG00219L4MPX@mta3.srv.hcvlny.cv.net>
Message-ID:

On Fri, 25 Feb 2005, Alpay Kasal wrote:

> Donald Becker just mentioned "RPL", I do believe that is what I was
> referring to in my last response.
>
> >PS: PXE is not the only embedded method of bootnic. I forget the other
> >method but it's often paired with onboard realtek8139's, an old intel
> >standard (?) which is useless these days but still found on some new
> >motherboards. Maybe someone else here knows what I'm referring to.

I imagine you're referring to bootp or etherboot. bootp is a part of
PXE. Most PXE implementations currently use dhcp and/or bootp followed
by tftp. These in order do NIC TCP initialization, path/config
information exchange, retrieval of a bootable image. At that point
control is passed to a second stage bootloader (usually in ROM BIOS)
that boots the retrieved image. Google is as always your friend and
can find you anything from HOWTOs to Intel white papers that precisely
define the process (that I'm describing very coarsely).
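[Editor's sketch] The dhcp-then-tftp handoff described above boils down to a few lines of ISC dhcpd configuration. The script below only writes such a fragment to a scratch directory so the pieces can be seen side by side; every address, range, and path in it is an illustrative assumption, not a known-good config:

```shell
# Write a minimal ISC dhcpd fragment for PXE clients to a scratch dir.
# One daemon answers the bootp-style query AND tells the NIC where the
# tftp server and second-stage loader are. All values are made up.
ROOT=${ROOT:-/tmp/pxe-sketch}
mkdir -p "$ROOT"
cat > "$ROOT/dhcpd.conf" <<'EOF'
subnet 10.0.0.0 netmask 255.255.255.0 {
    range 10.0.0.100 10.0.0.200;
    next-server 10.0.0.1;        # the tftp server holding boot images
    filename "pxelinux.0";       # second-stage boot loader to retrieve
}
EOF
echo "wrote $ROOT/dhcpd.conf"
```

A real setup would also run a tftp daemon exporting the directory that holds `pxelinux.0` and the kernel/initrd images.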
The reason I put and/or is that dhcpd typically does both of the first
two steps nowadays -- it isn't necessary to have a separate bootp
daemon. Once upon a time when Sun sold lots of diskless workstations
(by design) there was, but the loader sequence was more or less the
same but without dhcp -- a bootparamd handling bootp and network
initialization, tftp to retrieve an image into RAM, a local ROM loader
to boot it.

There is also the issue of PXE (per se) and etherboot (per se) -- see
e.g. http://www.ltsp.org/documentation/pxe.howto.html

Basically, this just involves what kind of image is passed back to the
boot loader -- an lzpxe image (bootable image packed up for PXE) or an
nbi (network bootable image) and what actually does the boot loading.
For MOST users, all of this is irrelevant and not worth knowing or
worrying about -- at most it alters how you prepare the bootable image
for actual transfer and loading at the other end.

To boot with PXE, you just set up dhcp and tftp (on a server)
appropriately, using instructions available lots of places on the web.
The image you boot and what happens afterward is completely under your
control -- we routinely boot a dos image via pxe, for example, to
reflash a BIOS or run memtest86 via PXE. You can find a free open
source DOS floppy image a variety of places on the web, e.g. here:
http://www.freedos.org/freedos/files/

(Note that dos on such a floppy doesn't do very much, typically -- it
functions pretty much strictly as a bootable program loader and
execution environment with an associated (fairly small) set of BIOS
calls and resource hooks built in. Not much of a kernel...)

HTH -- I'm skipping a lot of detail (and may even be getting some of
it wrong:-) but you can find that detail on the web and read to your
heart's content, if it matters to you.
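[Editor's sketch] "The image you boot ... is completely under your control" comes down to the pxelinux menu the tftp server hands out. The sketch below writes such a menu offering a normal kernel, memtest86, and a DOS floppy image (via syslinux's memdisk); all file names and values are assumptions for illustration:

```shell
# Write a sketch pxelinux.cfg/default with three boot choices: the
# normal diskless kernel, memtest86, and a FreeDOS floppy image (the
# kind used to reflash a BIOS). Names are illustrative only; the
# referenced files would live in the tftp root next to pxelinux.0.
ROOT=${ROOT:-/tmp/pxe-sketch}
mkdir -p "$ROOT/pxelinux.cfg"
cat > "$ROOT/pxelinux.cfg/default" <<'EOF'
DEFAULT linux
PROMPT 1
TIMEOUT 50

LABEL linux
    KERNEL vmlinuz
    APPEND initrd=initrd.img root=/dev/nfs ip=dhcp

LABEL memtest
    KERNEL memtest86

LABEL freedos
    KERNEL memdisk
    APPEND initrd=freedos.img
EOF
echo "wrote $ROOT/pxelinux.cfg/default"
```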
rgb

>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

From rsweet at aoes.com Fri Feb 25 09:33:58 2005
From: rsweet at aoes.com (Ryan Sweet)
Date: Wed Nov 25 01:03:51 2009
Subject: shall we write our own? Re: [Beowulf] O'Reilly Clusters Book Review
In-Reply-To:
References:
Message-ID:

Glen,

I have also had a look at the new ORA cluster book, hoping that they
had learned their lesson, and had a similar reaction. While I don't
wish to discredit M. Sloan, as writing any sort of book is always
going to be a lot of work and filled with compromise, I felt from the
very beginning that the community can do better. After reading over
your review, which, while scathing, was entirely accurate, I feel
resolved that the beowulf community _should_ do better.

Here's what I propose: let's make a "Beowulf.org Guide to Linux
Clustering", or whatever the heck else you want to call it. Let us
outline, review, improve, and comment on it here on this mailing list.
Here's the hard part - let's also set a deadline, with realistic
goals, and try to stick to it. Let's assign any publishing rights or
other "details" like that to the FSF or Linux Documentation Project.

Robert Brown has already done a lot of work on such a book, and
generously made it freely available. Maybe he is amenable to this
being a starting point? In any case, I would gladly provide hosting
for something like this, and coordinate the project, as well as edit
or write content.

There are many questions that arise:
Most importantly - What should go in the book?
In what order should these topics be covered?
Should there be an attempt to have a common style?
How (and how often) should it be revised?
Does the book target new beowulf admins, seasoned experts, or both and
some in-between?
Should mentioning vendors be allowed? What are the guidelines?
and so on.

Firstly, now that I've proposed the idea, I'll also start by
volunteering to write a chapter on diskless clustering. Second, please
take this opportunity to tell me why this is a bad idea, and while
you're at it send your comments on the questions above.

regards,
-Ryan

-- 
Ryan Sweet
Advanced Operations and Engineering Services
AOES Group BV  http://www.aoes.com
Phone +31(0)71 5795521 Fax +31(0)71572 1277

From rgb at phy.duke.edu Fri Feb 25 09:53:39 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:03:51 2009
Subject: [Beowulf] where can i learn to build a cluster machine?
In-Reply-To: <421EDEA2.30206@mtaonline.net>
References: <421EDEA2.30206@mtaonline.net>
Message-ID:

On Thu, 24 Feb 2005, Starship Warrior wrote:

> I am totally new to clusters but have been a list member for some time
> and read all the emails trying to learn more - so can anyone tell me
> where there is a good how to guide - I have three machines that I would
> like to use linux and cluster together just to learn more - two are 3.0
> Pentiums and the last one maybe a 3.5 or 6, not sure yet - the two
> have the same ASUS MB and will use SCSI drives, the third will also
> have an ASUS MB, just not sure yet what I will get
>
> thanks for any and all information

A "good howto guide" is a tough thing to define, because of the
breadth of the problem, but here goes. See:
http://www.phy.duke.edu/brahma (for example) as a cluster resource
clearinghouse, with links to other cluster resource clearinghouses
such as beowulf.org, beowulf underground and others. In particular
there are links to the beowulf howto, the beowulf FAQ, and an online
book on cluster engineering that can probably suffice to get you
started.
See also a series of columns I wrote last year for the then brand new
Cluster World Magazine on this very topic. See also the list archives
-- this is a FAQ beyond the FAQ and has been discussed/described on
list repeatedly over years. See also sites like the warewulf site --
there are now free cluster-in-a-box (so to speak) distributions that
should permit you to build and configure a learning cluster very
easily indeed, often without even installing an operating system image
on the nodes (so they can continue to run whatever you like except
when you boot them into a cluster).

To give you the direct answer, it goes something like the following:

  a) Hook systems into a common switched LAN e.g. an ethernet switch.

  b) If possible use decent quality PXE-aware NICs

  c) If possible use nodes with a decent amount of installed memory
(>= 192 MB) although it is possible to get by with less, with effort.

  d) Node hard disk is optional for at least some installation methods
(e.g. warewulf) but is useful and enables others.

  e) At least one system NEEDS ample hard disk and will serve as a
"server" or "head node" to your cluster. This node will manage boot
images, the distro you wish to install, NFS or other shared
filesystems, authentication, and gives you a place to "login to the
cluster". Note that this is a sloppy requirement -- there are many
different ways to manage this and I'm just describing one of the
simplest and most straightforward ones.

It then goes like this. Set up linux (of your choice) on the boot
server/head node. Learn to PXE boot (dhcp, tftp) and set up head node
as boot server. Learn to set up an installation repository for e.g.
kickstart or the distribution and packaging scheme of your choice and
do so via a mirror on your head node. Create accounts and NFS home
directories and so on on your head node. Then pick a kind of cluster
and install it.
This could be a bootable remote mountable node image on your server
(warewulf) or kickstarting an installation onto all your nodes native
or using FAI or something else. What you choose here depends on what
linux distro you're comfortable with and how serious a cluster you
want to build. For most beginners, I tend to suggest either using a
canned cluster (warewulf) or a simple NOW cluster (e.g. kickstarted
FC2 on all nodes).

Install "parallel clustering" packages as desired, e.g. pvm, lam mpi
or mpich, ganglia, wulfware (xmlsysd/wulfstat), whatever. Or not, you
can actually do and learn about clustering without them in at least
some modes of operation using just plain old compilers, modern perl,
ssh, and some scripts.

That's it. Write a parallel program (using PVM or MPI) or run a serial
program in parallel (lots of ways) and you can start to learn about
parallel speedup and scaling and everything...

   rgb

>
> Starship Warrior
> Cluster user wantabe LOL
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

From rsweet at aoes.com Fri Feb 25 09:52:59 2005
From: rsweet at aoes.com (Ryan Sweet)
Date: Wed Nov 25 01:03:51 2009
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109351864.2883.22.camel@localhost.localdomain>
References: <1109319374.6055.17.camel@Vigor45>
	<1109351864.2883.22.camel@localhost.localdomain>
Message-ID:

On Fri, 25 Feb 2005, Craig Tierney wrote:

>> why are you going diskless?
>> IDE hard drives cost very little, and you can still do your network
>> install.
>> Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go.
>>
>
> IDE drives fail, they use power, you waste time cloning, and
> depending on the toolkit you use you will run into problems
> with image consistency.
>
> I have run large systems of both kinds. The last system was
> diskless and I don't see myself going back. I like changing
> one file in one place and having the changes show up immediately.
> I like installing a package once, and having it show up immediately,
> so I don't have to reclone or take the node offline to update
> the image.

I think the term "diskless" is sometimes the problem when discussing
centrally installed and managed systems. Lots of "diskless" clusters
have GBs and GBs of local disk, only they are used for swap and temp
I/O, not for the OS. In 2000 I switched from locally installed system
images (using the very good - even back then - system-imager) to using
either nfsroot or warewulf style diskless systems, but have retained
the local disk for scratch I/O.

While I can understand debating over the merits of nfsroot vs RAM-disk
root, I fail to see many useful arguments for maintaining a local OS
install. However, that doesn't mean that local disks are bad. It all
depends upon the application, of course, but in many cases it's hard
to beat the local disk for temporary I/O, especially if you don't have
gobs and gobs of RAM to spare. Also PVFS is sufficiently mature that
you can easily combine all of the (very cheap) local disks into a
large parallel filesystem.

Using nfsroot you can switch from one "system image" (really just an
nfsroot file tree) to another one with a simple reboot. You have all
of the advantages of central configuration and control combined with
the convenience and speed of local I/O and local swap. It can be
_very_ useful in a situation where you have to support multiple user
communities with weird apps or strange requirements.
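[Editor's sketch] Retargeting a node at a different system image from a central point can be sketched with pxelinux's per-node config lookup: pxelinux searches pxelinux.cfg/ for a file named after the client's IP address in uppercase hex before falling back to "default", so switching configs is just swapping a symlink. The function names and directory layout below are my own assumptions:

```shell
# pxelinux looks in pxelinux.cfg/ for the client IP in uppercase hex
# (10.0.0.5 -> 0A000005), then shorter prefixes, then "default".
# Swapping the symlink changes what that one node boots next time.
# Function names and the config-tree layout are illustrative.
ip2hex() {
    # word-split the dotted quad into four printf arguments
    printf '%02X%02X%02X%02X\n' $(echo "$1" | tr '.' ' ')
}

set_node_image() {   # usage: set_node_image <ip> <config-name>
    cfgdir=${CFGDIR:-/tftpboot/pxelinux.cfg}
    ln -sf "$2" "$cfgdir/$(ip2hex "$1")"
}

# e.g. make 10.0.0.5 boot a hypothetical checkpointing config next time:
# set_node_image 10.0.0.5 linux-2.4-ckpt
```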
Using pxeboot and pxelinux, I've set up systems where the queue system could even request that a node use a specific system configuration before starting the job (eg: must have linux 2.4 with checkpointing in the kernel). Nodes might be available, but running another nfsroot cluster system image (say they are running RHEL, with no checkpointing, or for compatibility with some other commercial app they are running RH 7.2). The queue system tells the cluster master to reconfigure pxelinux so that the requested nodes default to the required config, by pointing them at another nfsroot tree. The cluster master tells the nodes to reboot, and when they are rebooted and running the appropriate image, the job runs. That sort of config requires a lot of glue, but it would be way too much headache to even attempt without "diskless" systems. regards, -Ryan -- Ryan Sweet Advanced Operations and Engineering Services AOES Group BV http://www.aoes.com Phone +31(0)71 5795521 Fax +31(0)71572 1277 From lindahl at pathscale.com Fri Feb 25 10:31:46 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> Message-ID: <20050225183146.GA1563@greglaptop.internal.keyresearch.com> On Fri, Feb 25, 2005 at 06:52:59PM +0100, Ryan Sweet wrote: > While I can understand debating over the merits of nfsroot vs RAM-disk > root, I fail to see many useful arguments for maintaining a local OS > install. An example of something that goes very wrong with NFS is upgrading a file to a new file with the same name. If that file is a binary or library that's in use anywhere in the cluster, you are likely to have a problem. Local disks and Scyld, on the other hand, do the right thing: existing processes using the binary or library continue to use the old version, while new ones use the new version. 
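[Editor's sketch] The local-disk behavior Greg describes follows from Unix rename semantics: an atomic rename points the name at a new inode while already-open descriptors keep the old inode alive (over NFS, a client that replaced or removed the file out from under another client's open descriptor typically produces a stale file handle instead). A local demonstration:

```shell
# On a local filesystem, mv-ing a new file over an old name leaves
# already-open descriptors attached to the old inode -- the behavior
# Greg describes for upgrading a binary or library in place.
cd "$(mktemp -d)"
echo "old library" > libfoo
exec 3< libfoo               # simulates a running process holding it open
echo "new library" > libfoo.new
mv libfoo.new libfoo         # atomic rename: the name now -> a new inode
cat <&3                      # prints "old library" -- old inode still alive
cat libfoo                   # prints "new library" -- fresh opens see it
exec 3<&-
```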
This disagreement is as old as the hills, by the way: in the good old days, when Sun was young, lots of people ran their pizza-box workstations diskless, but that went out of style when Ethernet's performance was stuck in place for a bunch of years. It's important to understand arguments you disagree with; your dismissal is not a good sign. > It can be _very_ useful in a situation where you have to support multiple > user communities with wierd apps or strange requirements. Yep. But your conclusion: > That sort of config requires > a lot of glue, but it would be way too much headache to even attempt > without "diskless" systems. Doesn't make any sense; I have seen people describe such systems where they download a disk image when a batch job wants a different software load. It's certainly doable that way: it does have different tradeoffs from the diskless case, but if it gives you a headache, it's probably because you don't like it, not because it's hard to do. -- greg From lindahl at pathscale.com Fri Feb 25 10:34:08 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] Microsoft HPC survey spam? Message-ID: <20050225183408.GA1624@greglaptop.internal.keyresearch.com> I got 2 emails from some survey company doing an HPC survey for Microsoft, one at pathscale.com and one at our previous name, keyresearch.com. Am I just unlucky, or are they spamming a list of posters to, say, this mailing list? -- greg From mathog at mendel.bio.caltech.edu Fri Feb 25 11:06:58 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] RE: S2466 Wake on Lan working, anyone? Message-ID: After messing around with this for a couple of days, with some very helpful messages from Donald Becker, I still had not managed to make WOL work on the S2466. 
Using pci-config it was possible to put the onboard NIC into the D3 state and to observe that the correct bit was set in the NIC when a magic packet hit it. But no matter how late into the poweroff sequence it was put into this state, once powered off (poweroff or shutdown -h), it wouldn't power back on after the magic packet arrived. I had emailed Tyan support but they never sent anything back. So today I called them, and the word came down that: (The following quotes may not be exact but are as close as I can remember.) "WOL only works on the AMD MPX boards from the S1 state". S1 state has the motherboard fully powered but the disks may be spun down. The rationale given for this appalling design decision (or more likely, cover up for the design error) was that "Most people who care about such things use a remote management card in the machine, so WOL would have been redundant". Sure, and that's why there are so many posts from people trying to make WOL work on the S2466 mobos! Well, we can all stop asking because apparently you can't get there from here. It's possible this is still a BIOS problem that they could fix but I'm leaning more towards the hypothesis that they neglected to put in the hardware support required for the NIC to actually trigger a power on event. The more I learn about Tyan the less I like them. Surely they've known all there is to know about the broken WOL for at least 3 years, yet they didn't ever post the reason for the lack of WOL function in their FAQ section for this motherboard. Could it possibly be that they didn't want to lose sales by admitting the board couldn't do WOL in a manner that was of any use to anybody? Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From roger at ERC.MsState.Edu Fri Feb 25 11:33:53 2005 From: roger at ERC.MsState.Edu (Roger L. Smith) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] Microsoft HPC survey spam? 
In-Reply-To: <20050225183408.GA1624@greglaptop.internal.keyresearch.com> References: <20050225183408.GA1624@greglaptop.internal.keyresearch.com> Message-ID: On Fri, 25 Feb 2005, Greg Lindahl wrote: > I got 2 emails from some survey company doing an HPC survey for > Microsoft, one at pathscale.com and one at our previous name, > keyresearch.com. Am I just unlucky, or are they spamming a list of > posters to, say, this mailing list? I got one too, but I'm not convinced it's a trend yet either. We're both also SC attendees. _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_ | Roger L. Smith Phone: 662-325-3625 | | Sr. Systems Administrator FAX: 662-325-7692 | | roger@ERC.MsState.Edu http://WWW.ERC.MsState.Edu/~roger | | Mississippi State University | |____________________________________ERC__________________________________| From ctierney at HPTI.com Fri Feb 25 11:38:44 2005 From: ctierney at HPTI.com (Craig Tierney) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: <421F6CAA.3040706@mail2.vcu.edu> References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> <421F6CAA.3040706@mail2.vcu.edu> Message-ID: <1109360324.2883.42.camel@localhost.localdomain> On Fri, 2005-02-25 at 11:21, Mike Davis wrote: > Craig, > > Reasons to run disks for physics work. > > 1. Large tmp files and checkpoints. Good reason, except when a node fails you lose your checkpoints. > > 2. Ability for distributed jobs to continue if master node fails. Jobs will continue to run once libraries are loaded. They just hang at the end. It all ends up being a risk assessment. We have been up for close to 6 months now. We have not had a failure of the NFS server. The load is all at boot time, but it does very little the rest of the time. I suspect that by making that statement I will be up at 2am tomorrow morning replacing hardware..... > > 3. 
saving network io for jobs rather than admin
>
> I actually seldom update compute nodes (unless an update is required
> for software required for research). I mount a /usr/global that does
> contain software. I also mount /home on each node.

I guess I wasn't as clear and someone else pointed out why disks are
good. I actually have disks in some of my compute nodes for exactly
these reasons. However, they are only for /tmp and swap. You do want
to consider how you design your network and the rest of your system to
boot diskless. Is the cost justified? For us, either the systems are
booting and all of the IO is image IO, or the nodes are running and
reading/writing files, so the IO doesn't interfere. We are exporting
our IO over the HSN (myrinet in this case) so the really fast IO isn't
interfering anyway. Your /usr/global does seem to be a good solution
that is half way between having everything local and pure diskless.

>
> An example of item 1 above are Gaussian jobs that we are now running
> that require >40GB of tmp space. For these jobs I have both an OS 20GB
> and tmp 100GB disk in each node. Due to a problematic scsi to ide
> converter, I have experienced item 2 too many times with one cluster,
> but even on the others I like knowing that work can continue even if
> the host is down (facilitated by a separate nfs server).

If you know your job load needs /tmp, disk is great. I have never had
users that needed to use space in this way, so moving away from
diskfull nodes wasn't an issue.

> Of course, I am definitely old school. I use static IP's, individual
> passwd files, and simple scripts to handle administration.

I still would probably run the system this way if it was disk-full. I
have run both ways and diskless has made my life much easier. Faster
to get the system up, faster to make changes, easier to deal with
hardware failures.
Craig > Mike > > > > Craig Tierney wrote: > > >On Fri, 2005-02-25 at 01:16, John Hearns wrote: > > > > > >>On Thu, 2005-02-24 at 18:20 -0500, Jamie Rollins wrote: > >> > >> > >>>Hello. I am new to this list, and to beowulfery in general. I am working > >>>at a physics lab and we have decided to put together a relatively small > >>>beowulf cluster for doing data analysis. I was wondering if people on > >>>this list could answer a couple of my newbie questions. > >>> > >>>The basic idea of the system is that it would be a collection of 16 to 32 > >>>off-the-shelf motherboards, all booting off the network and operating > >>>completely disklessly. We're looking at amd64 architecture running > >>>Debian, although we're flexible (at least with the architecture ;). Most > >>>of my questions have to do with diskless operation. > >>> > >>> > >>Jamie, > >> why are you going diskless? > >>IDE hard drives cost very little, and you can still do your network > >>install. > >>Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go. > >> > >> > >> > > > >IDE drives fail, they use power, you waste time cloning, and > >depending on the toolkit you use you will run into problems > >with image consistency. > > > >I have run large systems of both kinds. The last system was > >diskless and I don't see myself going back. I like changing > >one file in one place and having the changes show up immediately. > >I like installing a packing once, and having it show up immediately, > >so I don't have to reclone or take the node offline to update > >the image. > > > >Craig > > > > > > > > > >>BTW, have a look at Clusterworld http://www.clusterworld.com > >>They have a project for a low-cost cluster which is similar to your > >>thoughts. > >> > >> > >>Also, with the caveat that I work for a clustering company, > >>why not look at a small turnkey cluster? 
> >>I fully acknowledge that building a small cluster from scratch will be > >>a good learning exercise, and you can get to grips with the motherboard, > >>PXE etc. > >>However if you are spending a research grant, I'd argue that it would be > >>cost effective to buy a system with support from any one of the > >>companies that do this. > >>If you get a prebuilt cluster, the company will have done the research > >>on PXE booting, chosen gigabit interfaces and switches which perform > >>well, chosen components which will last. And when your power supplies > >>fail, or a disk fails someone will come round to replace them. > >>And you can get on with doing your science. > >> > >> > >> > > > > > > > >_______________________________________________ > >Beowulf mailing list, Beowulf@beowulf.org > >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > From eugen at leitl.org Fri Feb 25 12:31:06 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] [Bioclusters] Need some advice on a cluster for EST/cDNA assembly, clustering (fwd from gary@www.bioinformatics.org) Message-ID: <20050225203106.GR1404@leitl.org> ----- Forwarded message from Gary Van Domselaar ----- From: Gary Van Domselaar Date: Fri, 25 Feb 2005 14:26:04 -0500 (EST) To: bioclusters@bioinformatics.org Subject: [Bioclusters] Need some advice on a cluster for EST/cDNA assembly, clustering Reply-To: "Clustering, compute farming & distributed computing in life science informatics" Hey Gang, I've been called in at the last moment to "consult" on the purchase of a cluster for a sequencing project. Admittedly, I know nothing about life science clusters, despite having been subscribed to this list from its inception. I am not making any money on this consulting, just helping out a neighbouring academic lab. So what I know at this point is that they have about $Cdn 200K to spend. 
They have already talked to Sun, and Sun is offering them a "sweetheart" deal for (something) at about $120k. My only exposure is to a G4/G5 cluster from BioTeam. I am impressed with it and it works really well for my purposes, and I'm guessing it would work well for theirs too. I'm guessing a linux cluster would perform nicely too. The lab currently does not have a bioinformatician, but I think they have money for one. I'll probably just end up pointing them to Glen, Joe, and Chris, but any advice, suggestions, and pointers to where I can get a little more familiarity, well shucks, that would really be swell. Decidedly, g. -- Gary Van Domselaar, PhD. Postdoctoral Fellow, Computing Science and Biological Sciences University of Alberta Edmonton, AB, Canada Phone: 780-492-5969 Assistant Director, Bioinformatics.Org gary@bioinformatics.org _______________________________________________ Bioclusters maillist - Bioclusters@bioinformatics.org https://bioinformatics.org/mailman/listinfo/bioclusters ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050225/27b7d0c4/attachment.bin From eugen at leitl.org Fri Feb 25 12:42:41 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] Re: [Bioclusters] Re: Login & home directory strategies for PVM?
(fwd from mgutteri@fhcrc.org) Message-ID: <20050225204241.GU1404@leitl.org> ----- Forwarded message from Michael Gutteridge ----- From: Michael Gutteridge Date: Fri, 25 Feb 2005 11:51:56 -0800 To: "Clustering, compute farming & distributed computing in life science informatics" Subject: Re: [Bioclusters] Re: Login & home directory strategies for PVM? X-Mailer: Apple Mail (2.619.2) Reply-To: "Clustering, compute farming & distributed computing in life science informatics" Thanks... been thinking about local homes vs. pvfs since I don't really need anything but .bashrc. However, managing local home directories on 62+ nodes gets boring after a while... I rather prefer the idea of running pvm as you indicate, but I haven't had any luck finding out how to do this - do you have a pointer to something that describes that? I can't even pull together a good google term to find out how that's typically done. I will very likely end up using pvfs for database directories if I can make it robust enough, though. Sounds like pvfs2 has some great improvements in that area. >Lastly, you can port to mpich on a bproc system like Scyld, and get >rid of >pvmd's altogether. From my conversations with the developers, sounds like a port to MPI is underway. Thanks ... On Feb 24, 2005, at 11:05 AM, Michael Will wrote: >Just statically mount /home rather than doing automounting of >individual homes, >and you are fine. Also you could run the pvmd's as a user that does >not require >or have an nfs-mounted home but uses local scratch instead. > >Lastly, you can port to mpich on a bproc system like Scyld, and get >rid of >pvmd's altogether.
> >Michael _______________________________________________ Bioclusters maillist - Bioclusters@bioinformatics.org https://bioinformatics.org/mailman/listinfo/bioclusters ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050225/91f95fee/attachment.bin From hahn at physics.mcmaster.ca Fri Feb 25 14:18:16 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: <1109360324.2883.42.camel@localhost.localdomain> Message-ID: > > Reasons to run disks for physics work. > > 1. Large tmp files and checkpoints. > > Good reason, except when a node fails you lose your checkpoints. you mean s/node/disk/ right? sure, but doing raid1 on a "diskless" node is not insane. though frankly, if your disk failure rate is that high, I'd probably do something like intermittently store checkpoints off-node. > It all ends up being a risk assessment. We have been up for close > to 6 months now. We have not had a failure of the NFS server. The I have two nothing-installed clusters; one in use for 2+ years, the other for about 8 months. the older one has never had an NFS-related problem of any kind (it's a dual-xeon with 2 u160 channels and 3 disks on each; other than scsi, nothing gold-plated.) this cluster started out with 48 dual-xeons and a single 48pt 100bT switch with a gigabit uplink. the newer cluster has been noticeably less stable, mainly because I've been lazy.
in this cluster, there are 3 racks of 32 dual-opterons (fc2 x86_64) that netboot from a single head node. each rack has a gigabit switch which is 4x LACP'ed to a "top" switch, which has one measly gigabit to the head/fileserver. worse yet, the head/FS is a dual-opteron (good), but running a crappy old 2.4 ia32 kernel. as far as I can tell, you simply have to think a bit about the bandwidths involved. the first cluster has many nodes connected via thin pipes, aggregated through a switch to gigabit connecting to decent on-server bandwidth. the second cluster has lots more high-bandwidth nodes, connected through 12 incoming gigabits, bottlenecked down to a single connection to the head/file server (which is itself poorly configured). one obvious fix to the latter is to move some IO load onto a second fileserver, which I've done. great increase in stability, though enough IO from enough nodes can still cause problems. shortly I'll have logins, home directories and work/scratch all on separate servers. for a more scalable system, I would put a small fileserver in each rack, but still leave the compute nodes nothing-installed. I know that the folks at RQCHP/Sherbrooke have done something like this, very nicely, for their serial farm. it does mean you have a potentially significant number of other servers to manage, but they can be identically configured. heck, they could even net-boot and just grab a copy of the compute-node filesystems from a central source. the Sherbrooke solution involves smart automation of the per-rack server for staging user files as well (they're specifically trying to support parameterized montecarlo runs.) regards, mark hahn. 
From ctierney at HPTI.com Fri Feb 25 15:02:06 2005 From: ctierney at HPTI.com (Craig Tierney) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: References: Message-ID: <1109372526.2883.81.camel@localhost.localdomain> On Fri, 2005-02-25 at 15:18, Mark Hahn wrote: > > > Reasons to run disks for physics work. > > > 1. Large tmp files and checkpoints. > > > > Good reason, except when a node fails you lose your checkpoints. > > you means s/node/disk/ right? sure, but doing raid1 on a "diskless" > node is not insane. though frankly, if your disk failure rate is > that high, I'd probably do something like intermittently store > checkpoints off-node. Yes and no. If the node is down, it is a bit tough for your model to progress. Raid1 works well enough in software so that you don't need additional hardware except the disk. > > > It all ends up being a risk assessment. We have been up for close > > to 6 months now. We have not had a failure of the NFS server. The > > I have two nothing-installed clusters; on in use for 2+ years, > the other for about 8 months. the older one has never had an > NFS-related problem of any kind (it's a dual-xeon with 2 u160 > channels and 3 disks on each; other than scsi, nothing gold-plated.) > this cluster started out with 48 dual-xeons and a single 48pt > 100bT switch with a gigabit uplink. > > the newer cluster has been noticably less stable, mainly because > I've been lazy. in this cluster, there are 3 racks of 32 dual-opterons > (fc2 x86_64) that netboot from a single head node. each rack has a > gigabit switch which is 4x LACP'ed to a "top" switch, which has > one measly gigabit to the head/fileserver. worse yet, the head/FS > is a dual-opteron (good), but running a crappy old 2.4 ia32 kernel. > > as far as I can tell, you simply have to think a bit about the > bandwidths involved. 
the first cluster has many nodes connected > via thin pipes, aggregated through a switch to gigabit > connecting to decent on-server bandwidth. > > the second cluster has lots more high-bandwidth nodes, connected > through 12 incoming gigabits, bottlenecked down to a single > connection to the head/file server (which is itself poorly configured). > > one obvious fix to the latter is to move some IO load onto > a second fileserver, which I've done. great increase in stability, > though enough IO from enough nodes can still cause problems. > shortly I'll have logins, home directories and work/scratch all on > separate servers. > > for a more scalable system, I would put a small fileserver in each rack, > but still leave the compute nodes nothing-installed. I know that > the folks at RQCHP/Sherbrooke have done something like this, very nicely, > for their serial farm. it does mean you have a potentially significant > number of other servers to manage, but they can be identically configured. > heck, they could even net-boot and just grab a copy of the compute-node > filesystems from a central source. the Sherbrooke solution involves > smart automation of the per-rack server for staging user files as well > (they're specifically trying to support parameterized montecarlo runs.) Sandia does something similar to this with their CIT toolkit, but it is still diskless. For every N nodes, they have an NFS-redirector. It boots diskless, and caches all of the files that the clients read. The clients hit the redirector, and not the main filesystem. If you do have a disk in these nodes, there are probably some interesting things you can do with CacheFS when it becomes stable. Craig From rgb at phy.duke.edu Fri Feb 25 16:25:37 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:51 2009 Subject: shall we write our own? 
Re: [Beowulf] O'Reilly Clusters Book Review In-Reply-To: References: Message-ID: On Fri, 25 Feb 2005, Ryan Sweet wrote: > > Glenn, > > I have also had a look at the new ORA cluster book, hoping that they had > learned their lesson, and had a similar reaction. While I don't wish to > discredit M. Sloan, as writing any sort of book is always going to be a lot of > work and filled with compromise, I felt from the very beginning that the > community can do better. After reading over your review, which, while > scathing, was entirely accurate, I feel resolved that the beowulf community > _should_ do better. > > Here's what I propose: let's make a "Beowulf.org Guide to Linux Clustering", > or whatever the heck else you want to call it. Let us outline, review, > improve, and comment on it here on this mailing list. Here's the hard part - > lets also set a deadline, with realistic goals, and try to stick to it. Lets > assign any publishing rights or other "details" like that to the FSF or Linux > Documentation Project. > > Robert Brown has already done a lot of work on such a book, and generously > made it freely available. Maybe he is amenable to this being a starting > point? Sure. I periodically solicit help for such a project on the list -- this is the first time somebody has solicited me:-) My experience is that it is really pretty difficult to get people to actually contribute content. However, I've already got a very decent start going, I think, and as always if anybody wants to contribute content (under the OPL it is published under) I will cheerily include it, with attribution. Based on Glenn's comments, I was actually feeling (once again) like I ought to try to shake free enough time to do another full pass through the content to bring it up to date and see if I can finish off some of the missing chapters and -- possibly -- seek a paper publisher. 
I want to keep it online/free either way (and there are publishers out there that are comfortable with this) but a lot of people want to own a paper copy of stuff like this. I get a lot of requests for a printable PDF from people all over who found the html with google but missed the online pdf images right next door... > > In any case, I would gladly provide hosting for something like this, and > coordinate the project, as well as edit or write content. > > There are many questions that arise: > Most importantly - What should go in the book? > In what order should these topics be covered? > Should there be an attempt to have a common style? > How (and how often) should it be revised? > Does the book target new beowulf admins, seasoned experts, or both and > some in-between? > Should mentioning vendors be allowed? What are the guidelines? > > and so on. > > Firstly, now that I've proposed the idea, I'll also start by volunteering to > write a chapter on diskless clustering. > > Second, please take this opportunity to tell me why this is a bad idea, and > while your at it send your comments on the questions above. It isn't a bad idea, but I've had open requests for content on the list for years now, so don't hold your breath. My personal experience is that if you want something written, you gotta write it yourself;-) rgb > > regards, > -Ryan > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Fri Feb 25 16:53:01 2005 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: <20050225183146.GA1563@greglaptop.internal.keyresearch.com> References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> <20050225183146.GA1563@greglaptop.internal.keyresearch.com> Message-ID: On Fri, 25 Feb 2005, Greg Lindahl wrote: > On Fri, Feb 25, 2005 at 06:52:59PM +0100, Ryan Sweet wrote: > > > While I can understand debating over the merits of nfsroot vs RAM-disk > > root, I fail to see many useful arguments for maintaining a local OS > > install. > > An example of something that goes very wrong with NFS is upgrading a > file to a new file with the same name. If that file is a binary or > library that's in use anywhere in the cluster, you are likely to have > a problem. Local disks and Scyld, on the other hand, do the right > thing: existing processes using the binary or library continue to use > the old version, while new ones use the new version. > > This disagreement is as old as the hills, by the way: in the good old > days, when Sun was young, lots of people ran their pizza-box > workstations diskless, but that went out of style when Ethernet's > performance was stuck in place for a bunch of years. > > It's important to understand arguments you disagree with; your > dismissal is not a good sign. To add to Greg's remarks, yes diskless is a perfectly valid way to structure a LAN (and I'm, alas, as old as the hills and actually ran whole networks with sparc 1's, SLC's and ELC's diskless). They typically only had 4 MB or so of RAM (in later days as much as 16 or 32) and actually did remote swap as well as NFS home directories and binaries and so forth. They worked amazingly well given the times.
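Greg's point about in-place upgrades hinges on local rename semantics: atomically replacing a file leaves the old inode readable by any process that already holds it open, which is exactly what an NFS client cannot rely on. A minimal local-filesystem sketch (the file names here are hypothetical):

```shell
#!/bin/sh
# Demonstrates the local-disk behavior Greg describes: after an atomic
# rename, an already-open handle still reads the old version while new
# opens see the new one.  (An NFS client paging from the replaced file
# may instead hit a stale-handle error.)
cd "$(mktemp -d)"
echo v1 > libfoo.so
exec 3< libfoo.so            # stands in for a running process holding the library
echo v2 > libfoo.so.new
mv libfoo.so.new libfoo.so   # atomic replace: new opens now see v2
old=$(cat <&3)               # the old handle still reads v1
new=$(cat libfoo.so)
echo "$old $new"             # prints: v1 v2
```

The old inode only goes away once the last open descriptor on it is closed - which is why package upgrades on a local disk do not disturb running processes.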
The notion of "thin clients" goes back even before Sparcs -- I once ran a weird IBM PC clone in the mid-80s that had a proprietary "network" interface, a hook into the bios of a host PC, and ran "diskless" by leeching on the floppies and hard drive of the host. It was marginally cheaper than getting a regular PC with its own hard drive because back then disk cost a mint -- all peripherals cost a mint. PC's cost thousands of 1980's dollars. I've run diskless linux clusters off and on as well, mostly from necessity. If you have enough memory it's good. Still, I actually think that there are excellent reasons to consider and perform installs to local disk. Robustness, speed, ease of installation and maintenance with tools like PXE, kickstart and yum have taken just about all the sting out of it. Having local swap is good. Having local scratch is good. Decreasing memory occupancy may be good -- having local disk means local paging is possible with a small performance edge (depending on your network and so forth). Thin clients have been proposed periodically over the years, but they never quite take off -- it is just too damn convenient to have some measure of local robustness, and hard disks are cheap. > > > It can be _very_ useful in a situation where you have to support multiple > > user communities with weird apps or strange requirements. > > Yep. But your conclusion: > > > That sort of config requires > > a lot of glue, but it would be way too much headache to even attempt > > without "diskless" systems. > > Doesn't make any sense; I have seen people describe such systems where > they download a disk image when a batch job wants a different software > load. It's certainly doable that way: it does have different tradeoffs > from the diskless case, but if it gives you a headache, it's probably > because you don't like it, not because it's hard to do. Ya. Right now I think it is kinda cool just how MUCH one can do, all of it pretty easy.
In the old days it was God's Own PITA to set up diskless anything -- I wrote all kinds of stuff myself to get systems to boot diskless and at least I COULD do it because I'd run Suns from the old days, and then there was management of "packages" (binary architectures and shared this and that) with no particular organization on top of that. You young'uns have it all easy -- several different toolsets to choose from to set up diskless operation, several different toolsets to choose from to set up disk and manage packages, and far more homogeneous hardware. I certainly don't think that diskless is a knee-jerk obvious choice for either LANs or clusters, although sure, there are some advantages to it (just as there are some advantages to having local disks). rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From jmdavis at mail2.vcu.edu Fri Feb 25 10:21:30 2005 From: jmdavis at mail2.vcu.edu (Mike Davis) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: <1109351864.2883.22.camel@localhost.localdomain> References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> Message-ID: <421F6CAA.3040706@mail2.vcu.edu> Craig,

Reasons to run disks for physics work.
1. Large tmp files and checkpoints.
2. Ability for distributed jobs to continue if master node fails.
3. saving network io for jobs rather than admin

I actually seldom update compute nodes (unless an update is required for software required for research). I mount a /usr/global that does contain software. I also mount /home on each node. An example of item 1 above is Gaussian jobs that we are now running that require >40GB of tmp space. For these jobs I have both an OS 20GB and tmp 100GB disk in each node.
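The static mounts Mike describes might look like this in each node's /etc/fstab - a sketch only, with "nfsserver" standing in for the separate NFS server he mentions:

```text
# hypothetical fstab entries for the mounts described above
nfsserver:/usr/global   /usr/global   nfs   ro,hard,intr   0 0
nfsserver:/home         /home         nfs   rw,hard,intr   0 0
```

Mounting /usr/global read-only keeps compute nodes from scribbling on shared software; only /home needs to be writable.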
Due to a problematic scsi to ide converter, I have experienced item 2 too many times with one cluster, but even on the others I like knowing that work can continue even if the host is down (facilitated by a separate nfs server). Of course, I am definitely old school. I use static IP's, individual passwd files, and simple scripts to handle administration. Mike Craig Tierney wrote: >On Fri, 2005-02-25 at 01:16, John Hearns wrote: > > >>On Thu, 2005-02-24 at 18:20 -0500, Jamie Rollins wrote: >> >> >>>Hello. I am new to this list, and to beowulfery in general. I am working >>>at a physics lab and we have decided to put together a relatively small >>>beowulf cluster for doing data analysis. I was wondering if people on >>>this list could answer a couple of my newbie questions. >>> >>>The basic idea of the system is that it would be a collection of 16 to 32 >>>off-the-shelf motherboards, all booting off the network and operating >>>completely disklessly. We're looking at amd64 architecture running >>>Debian, although we're flexible (at least with the architecture ;). Most >>>of my questions have to do with diskless operation. >>> >>> >>Jamie, >> why are you going diskless? >>IDE hard drives cost very little, and you can still do your network >>install. >>Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go. >> >> >> > >IDE drives fail, they use power, you waste time cloning, and >depending on the toolkit you use you will run into problems >with image consistency. > >I have run large systems of both kinds. The last system was >diskless and I don't see myself going back. I like changing >one file in one place and having the changes show up immediately. >I like installing a package once, and having it show up immediately, >so I don't have to reclone or take the node offline to update >the image. > >Craig > > > > >>BTW, have a look at Clusterworld http://www.clusterworld.com >>They have a project for a low-cost cluster which is similar to your >>thoughts.
>> >> >>Also, with the caveat that I work for a clustering company, >>why not look at a small turnkey cluster? >>I fully acknowledge that building a small cluster from scratch will be >>a good learning exercise, and you can get to grips with the motherboard, >>PXE etc. >>However if you are spending a research grant, I'd argue that it would be >>cost effective to buy a system with support from any one of the >>companies that do this. >>If you get a prebuilt cluster, the company will have done the research >>on PXE booting, chosen gigabit interfaces and switches which perform >>well, chosen components which will last. And when your power supplies >>fail, or a disk fails someone will come round to replace them. >>And you can get on with doing your science. >> >> >> > > > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > From dld at cmb.usc.edu Fri Feb 25 10:29:35 2005 From: dld at cmb.usc.edu (Drake Diedrich) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] Re: motherboards for diskless nodesy In-Reply-To: References: <20050225010053.GA31456@app1.cmb.usc.edu> Message-ID: <20050225182935.GC31456@app1.cmb.usc.edu> On Thu, Feb 24, 2005 at 10:03:59PM -0500, Donald Becker wrote: > > The specific problem here is very likely the PXE server implementation, > not the client side. > Ah, thanks. I'd read a bit about some of the alternate paths PXE used, but didn't realize there were quite so many bugs in Intel's various implementations. I'm currently using Tim Hurman's PXE 1.42, ISC DHCPD 3.0.1 on the same segment (and if I think about it, I bet I'll find a race in there somewhere), and Jean-Pierre Lefebvre's atftpd 3.0.1. It's working well on many of our systems, and works poorly for the largest class (the compute nodes). 
The next time I have to do a major install I may try to get it working better on whatever class of machine is being installed that day. > > >Spending a couple gigs of that for a locally installed O/S isn't much of a > >drama, especially on ~16 nodes. > > But it's the long-term administrative effort that costs, not the disk > hardware. The need to maintain and update a persistent local O/S is the > root of most of that cost. > For small clusters though, local admin just isn't that much of a burden compared to the hassles of writing your own distributed filesystem, testing images, scheduling reboots, or crashing long-term jobs. I'm using Makefile targets for each request, so I can later make the same changes to new nodes. eg:

gsl:
	for h in `cat nodes` ; \
	do \
		ssh $$h apt-get install -y libgsl0 libgsl0-dev; \
	done

There are more elaborate cluster management/install systems (FAI, cfengine, ...), dsh to perform ssh in parallel, etc, but for a small research cluster with installation requirements that change daily, being able to make simple changes in-flight without any testing or scheduling updates, getting administrative approval, or really doing any of that hard stuff turns day-long tasks into a few minutes. User: can I have... Reply 2 minutes later: installed. It's not quite as simple as a single system image, but it's only about twice as much work as doing one node and retains all the flexibility, and doesn't require rebooting or re-imaging nodes and killing jobs. > >deleted when no longer in use. NFS (being stateless) doesn't have > >this behavior, so after an update you may occasionally have > >jobs/daemons die when they try to page in a file that has already been > >replaced. > > NFS isn't bad. Nor does it necessarily doom a server to unbearable > loads. For some types of file access, especially read-only access to > small (<8KB) configuration files such as ~/.foo.conf, it's pretty close > to optimal.
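The gsl Makefile target earlier in this message walks the nodes one at a time; a dsh-style parallel variant can be sketched with xargs. Hedged sketch: the nodes file (one hostname per line) comes from the example above, but the run_on_all helper, the -P4 fan-out, and the RUN override are hypothetical additions, and passwordless ssh to every node is assumed.

```shell
#!/bin/sh
# Hypothetical parallel counterpart to the sequential Makefile loop above.
# RUN defaults to ssh; set RUN=echo for a dry run that just prints the
# per-node commands instead of executing them.
run_on_all() {
    nodes_file=$1; shift
    # -I{} substitutes each hostname; -P4 runs up to four nodes at once
    xargs -P4 -I{} ${RUN:-ssh} {} "$@" < "$nodes_file"
}
```

For example, `run_on_all nodes apt-get install -y libgsl0 libgsl0-dev` would install the package on every host listed in nodes, four at a time.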
Oh, NFS is actually pretty good (most of gigE wire speed on large files), and I really like being able to do maintenance on the fileservers without killing jobs (shutdown and replace dead disks, switch kernels, etc). It's our fileservers that can't keep up: the cluster is able to pound them into the ground over NFS. If client-side NFS were worse, our fileservers would remain responsive during major job launches. :) Having users with on the order of a million small files each, most of which they try to open during the course of their jobs is pretty damaging. All of this is research code, so it tends to get written once (data not consolidated in a database), run once on the cluster, and then published. Localizing the damage helps, RAID10 helps, and convincing people to stage off a sacrificial scratch fileserver also helps. There are some relatively new distributed filesystems out there (Lustre, GFS, ...) that might survive this load better, but we haven't tested them, some aren't really unix filesystems at all, and we are a long way from ready to commit /home to one. From ballew at sublinear.net Fri Feb 25 10:54:31 2005 From: ballew at sublinear.net (Mark C. Ballew) Date: Wed Nov 25 01:03:51 2009 Subject: shall we write our own? Re: [Beowulf] O'Reilly Clusters Book Review In-Reply-To: References: Message-ID: <1109357671.4880.48.camel@sport> On Fri, 2005-02-25 at 18:33 +0100, Ryan Sweet wrote: > Glenn, > > I have also had a look at the new ORA cluster book, hoping that they had > learned their lesson, and had a similar reaction. While I don't wish to > discredit M. Sloan, as writing any sort of book is always going to be a lot of > work and filled with compromise, I felt from the very beginning that the > community can do better. After reading over your review, which, while > scathing, was entirely accurate, I feel resolved that the beowulf community > _should_ do better. 
> > Here's what I propose: let's make a "Beowulf.org Guide to Linux Clustering", > or whatever the heck else you want to call it. Let us outline, review, > improve, and comment on it here on this mailing list. Here's the hard part - > let's also set a deadline, with realistic goals, and try to stick to it. Let's > assign any publishing rights or other "details" like that to the FSF or Linux > Documentation Project. > > Robert Brown has already done a lot of work on such a book, and generously > made it freely available. Maybe he is amenable to this being a starting > point? > > In any case, I would gladly provide hosting for something like this, and > coordinate the project, as well as edit or write content. > > There are many questions that arise: > Most importantly - What should go in the book? > In what order should these topics be covered? > Should there be an attempt to have a common style? > How (and how often) should it be revised? > Does the book target new beowulf admins, seasoned experts, or both and > some in-between? > Should mentioning vendors be allowed? What are the guidelines? > > and so on. > > Firstly, now that I've proposed the idea, I'll also start by volunteering to > write a chapter on diskless clustering. > > Second, please take this opportunity to tell me why this is a bad idea, and > while you're at it send your comments on the questions above. I think a community-written book is a spectacular idea. The question I have is would it be better to just do a web-based book since the beowulf community is basically a moving target, or just put these into printed "editions" as well as a Copyleft'd book? I volunteer for any editing or proofing if such a project comes to life.

What goes in the book? Types of clusters, cluster purposes, cluster interconnects, common issues (HVAC, user admin), and scaling are some things that come to mind.
Order? Start with the basics. What is a cluster? Why clusters and not SMP beasts? What types are there? What software is there?
Style: Good question.
How often should it be revised? If it is web and dead tree based, the web version would constantly be updated with perhaps a yearly dead tree revision?
Who does it target? I think a book that targeted beowulf admins with some stuff for the seasoned expert would be good. New beowulf admins often find themselves swamped with options. Seasoned admins join the mailing list.
Vendors: I think that vendors should be allowed but only if there is a balance between free options and vendors, or in the case of hardware, why you'd go with a particular vendor (Myrinet vs. Infiniband, etc.). I think something like "you should go to PSSC or Penguin Computing" is a little bit of a stretch beyond an appendix on various vendors.

Mark -- Mark C. Ballew Reno, Nevada ballew@sublinear.net http://markballew.com PGP: 0xB2A33008 AIM: pdx110 -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part Url : http://www.scyld.com/pipermail/beowulf/attachments/20050225/41cd413c/attachment.bin From john.hearns at streamline-computing.com Sat Feb 26 02:50:48 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] O'Reilly Clusters Book Review In-Reply-To: References: Message-ID: <1109415052.6688.12.camel@ip13.2214.h2.fosdem.lan> On Fri, 2005-02-25 at 00:23 -0800, Glen Otero wrote: > > > ______________________________________________________________________ > > My review of O'Reilly's latest clusters book published at HPCwire > (http://www.tgc.com/hpcwire.html): > Glen, I am preparing a review of this book for the UK Unix Users Group. I haven't finished reading your review yet, but I feel I should say that I don't agree with the tone. I'm sure that you have valid points - yes I agree that Myrinet and Infiniband should be dismissed as 'emerging'.
But on balance I think it is a decent book, and I would recommend it. In fact I have flagged it up on this list. If I was to make a criticism, it would be about the section on MPI programming. There's nothing wrong with this - in fact I will refer to it myself. I would just have made it shorter and put in some references to existing tutorials on the web, or textbooks. But hey, I didn't write or edit the book. From john.hearns at streamline-computing.com Sat Feb 26 02:55:30 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: <1109351864.2883.22.camel@localhost.localdomain> References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> Message-ID: <1109415331.6688.16.camel@ip13.2214.h2.fosdem.lan> On Fri, 2005-02-25 at 10:17 -0700, Craig Tierney wrote: > > why are you going diskless? > > IDE hard drives cost very little, and you can still do your network > > install. > > Pick your favourite toolkit, Rocks, Oscar, Warewulf and away you go. > > > > IDE drives fail, they use power, you waste time cloning, and > depending on the toolkit you use you will run into problems > with image consistency. I agree - heck, I work with large Beowulves every day. But listen to what I said. For THIS APPLICATION in a small lab, where a researcher is looking to homebrew a system, I firmly believe that putting IDE drives on each node and then installing over the network is the way ahead. We here on the Beowulf list can argue the benefits of diskless versus disks. But for someone who just wants to get something working and off the ground, I say go the 'conventional' route. > I have run large systems of both kinds. The last system was > diskless and I don't see myself going back. I like changing > one file in one place and having the changes show up immediately.
> I like installing a package once, and having it show up immediately, > so I don't have to reclone or take the node offline to update > the image. Why take a node offline to do an update on a disk system? From john.hearns at streamline-computing.com Sat Feb 26 02:59:57 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: <20050225183146.GA1563@greglaptop.internal.keyresearch.com> References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> <20050225183146.GA1563@greglaptop.internal.keyresearch.com> Message-ID: <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan> On Fri, 2005-02-25 at 10:31 -0800, Greg Lindahl wrote: > > Doesn't make any sense; I have seen people describe such systems where > they download a disk image when a batch job wants a different software > load. It's certainly doable that way: it does have different tradeoffs > from the diskless case, but if it gives you a headache, it's probably I've always dreamed of using User Mode Linux images for this. In a Grid-based world, prepare a UML instance which has all the libraries and runtime to run your code. Ship it across the grid with your executable. The cluster at the receiving end can be running any distribution - it runs your UML in a sandbox. And before anyone says it, yes performance would be a dog, and I don't see how UML could access all those nice Myrinet and Infiniband cards. So I'm definitely blue-skying. From john.hearns at streamline-computing.com Sat Feb 26 03:04:47 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: References: Message-ID: <1109415887.6688.24.camel@ip13.2214.h2.fosdem.lan> On Fri, 2005-02-25 at 17:18 -0500, Mark Hahn wrote: > you mean s/node/disk/ right? sure, but doing raid1 on a "diskless" > node is not insane.
though frankly, if your disk failure rate is > that high, I'd probably do something like intermittently store > checkpoints off-node. We recently put in mirrored system disks on a cluster. The nodes in question are beefy SunFire V40z's, and the cluster is intended to run long (i.e. week-long) jobs, and the concern was raised about long jobs failing due to a disk failure. From atp at piskorski.com Sat Feb 26 13:32:17 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan> References: <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan> Message-ID: <20050226213217.GA85119@piskorski.com> On Sat, Feb 26, 2005 at 10:59:57AM +0000, John Hearns wrote: > On Fri, 2005-02-25 at 10:31 -0800, Greg Lindahl wrote: > > Doesn't make any sense; I have seen people describe such systems where > > they download a disk image when a batch job wants a different software > > load. It's certainly doable that way: it does have different tradeoffs > > from the diskless case, but if it gives you a headache, it's probably > > I've always dreamed of using User Mode Linux images for this. > And before anyone says it, yes performance would be a dog, In that case, you should look into Xen. I haven't heard of anyone using it for HPC yet, but if I remember right, they claim only a 3% or so performance loss running Linux virtualized under Xen vs. running on the bare metal: http://www.cl.cam.ac.uk/Research/SRG/netos/xen/ -- Andrew Piskorski http://www.piskorski.com/ From mathog at mendel.bio.caltech.edu Sat Feb 26 16:06:51 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] RE: S2466 Wake on Lan working, anyone? Message-ID: > > A much clearer demonstration of this failure is to run "shutdown -h 0" > and then attempt to power the machine on again at the front panel switch.
> > Apparently the state machine that controls the power on these things > becomes locked in its "shutdown" state and never enters its "cold and > ready to be booted" state. There is a solution to the problem you described. When using a 2.6.x kernel, force the "button" module to load by placing the line

  button

in /etc/modprobe.preload, then reboot and poweroff. At that point you can use the front power switch to restart the machine after a poweroff. At least with ACPI built into the kernel and "acpi=on" present on the boot line. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mathog at mendel.bio.caltech.edu Sat Feb 26 16:33:34 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] RE: S2466 Wake on Lan working, anyone? Message-ID: > We sold a lot of those boards, and I could not remember any claims for > WOL support, so I went looking through the literature. The _current_ literature says nothing. The literature when we bought this did. Google for S2466 and WOL and you'll find the old data sheets still available on the web, for instance here: http://www.bellmicro.com/product/HotProduct/tyan/d_s2466_210.pdf which indicates the presence of a 3 pin WOL header. What's the point of having a WOL header if the board won't do WOL? Note that they still don't claim that the board won't do WOL, only that it will only WOL from a state where the capability is of no use. As you point out, that line is missing from the current literature for the board here: ftp://ftp.tyan.com/datasheets/d_s2466_270.pdf so apparently somewhere between _210 and _270 WOL was eliminated from the product sheet. They fessed up to (and fixed) the USB problem but seem to have swept the WOL problem under the rug. Or maybe the original documentation was in error? Either way, the story changed. > I see no reference to WOL.
> Looking at their FAQ I see no mention of WOL: > http://www.tyan.com/support/html/f_s2466.html My point exactly. Even if they don't support it, people have asked enough that it should be in the FAQ. > While I understand that you want this to work, I see no reason to > complain that they do not properly support something they never claimed > to provide. But they used to claim WOL, they just don't anymore. > Could it be that you are expecting too much? And bitter about not > getting your way? Nope, bitter about a company removing WOL from the product spec rather than admitting explicitly, if only in the FAQ, that it wasn't supported. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From maurice at harddata.com Fri Feb 25 22:10:48 2005 From: maurice at harddata.com (Maurice Hilarius) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] Re: RE: S2466 Wake on Lan working, anyone? In-Reply-To: <200502252000.j1PK08cJ025090@bluewest.scyld.com> References: <200502252000.j1PK08cJ025090@bluewest.scyld.com> Message-ID: <422012E8.8030406@harddata.com> "David Mathog" wrote: >The more I learn about Tyan the less I like them. Surely they've >known all there is to know about the broken WOL for at least >3 years, yet they didn't ever post the reason for the lack of WOL >function in their FAQ section for this motherboard. Could it >possibly be that they didn't want to lose sales by admitting >the board couldn't do WOL in a manner that was of any use >to anybody? > We sold a lot of those boards, and I could not remember any claims for WOL support, so I went looking through the literature. Can you show me an instance of something where Tyan claim they provide WOL support on this board? Looking here: http://www.tyan.com/products/html/tigermpx.html ftp://ftp.tyan.com/datasheets/d_s2466_270.pdf I see no reference to WOL.
Looking at their FAQ I see no mention of WOL: http://www.tyan.com/support/html/f_s2466.html They DO sell a server management card that provides equivalent capability plus other features: http://www.tyan.com/products/html/m3289.html While I understand that you want this to work, I see no reason to complain that they do not properly support something they never claimed to provide. That their tech support even tried to help strikes me as going beyond the call of duty. Could it be that you are expecting too much? And bitter about not getting your way? From rsweet at aoes.com Sat Feb 26 02:35:37 2005 From: rsweet at aoes.com (Ryan Sweet) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: <20050225183146.GA1563@greglaptop.internal.keyresearch.com> References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> <20050225183146.GA1563@greglaptop.internal.keyresearch.com> Message-ID: Greg, Thanks for the injection of some perspective. It is clear that I chose my phrases poorly, and some alternate wording would have been more constructive ;-) On Fri, 25 Feb 2005, Greg Lindahl wrote: > On Fri, Feb 25, 2005 at 06:52:59PM +0100, Ryan Sweet wrote: > >> While I can understand debating over the merits of nfsroot vs RAM-disk >> root, I fail to see many useful arguments for maintaining a local OS >> install. > > An example of something that goes very wrong with NFS is upgrading a > file to a new file with the same name. If that file is a binary or > library that's in use anywhere in the cluster, you are likely to have > a problem. Local disks and Scyld, on the other hand, do the right > thing: existing processes using the binary or library continue to use > the old version, while new ones use the new version. In practice I think the importance of this depends upon the particular requirements of the site. Definitely there must be some places where it's important, and should be planned for.
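Greg's point about in-use binaries comes down to rename semantics: on a local POSIX filesystem, installing by write-to-temp plus rename leaves the old inode alive for any process that already has the file open, while over NFS the old handle can go stale. A small, hypothetical Python sketch (not part of the original thread; the file names are made up) of the rename-based install pattern:

```python
# Hypothetical sketch (not from the thread): on a local POSIX
# filesystem a rename-based upgrade leaves the old inode intact
# for any process that already has the file open -- the behaviour
# Greg describes for in-use binaries on local disk.
import os
import tempfile

d = tempfile.mkdtemp()
lib = os.path.join(d, "libfoo.so")       # stand-in for a shared library

with open(lib, "w") as f:
    f.write("old version")

reader = open(lib)                       # a "running process" holding it open

tmp = lib + ".new"                       # upgrade: write the new file, then
with open(tmp, "w") as f:                # atomically rename over the old name
    f.write("new version")
os.replace(tmp, lib)

old = reader.read()                      # the old inode is still readable
new = open(lib).read()                   # the name now points at the new file
reader.close()
print(old, "/", new)                     # prints: old version / new version
```

Over NFS, the reader in this sketch may instead hit a stale file handle, which is exactly why in-place upgrades on an nfsroot cluster need more care.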
However, most sites I've seen accept (prefer) that new versions of application software be installed alongside the old, rather than in replacement of it, and for system software updates they are usually willing to accept stopping running jobs and starting them again (either with the queue system or without). > This disagreement is as old as the hills, by the way: in the good old > days, when Sun was young, lots of people ran their pizza-box > workstations diskless, but that went out of style when Ethernet's > performance was stuck in place for a bunch of years. Or in some places it kept right on going. > It's important to understand arguments you disagree with; your > dismissal is not a good sign. > > Yep. But your conclusion: > Doesn't make any sense; I have seen people describe such systems where > they download a disk image when a batch job wants a different software > load. It's certainly doable that way: it does have different tradeoffs > from the diskless case, but if it gives you a headache, it's probably > because you don't like it, not because it's hard to do. Yes, I agree. I chose my words poorly in the first post, and came down rather hard against local installs. In the end the choice is about balancing managing complexity with the requirements of the particular site. I think that in a large percentage of use cases the admin will find that managing "diskless" (local disk for swap/scratch) systems is highly advantageous. regards, -Ryan -- Ryan Sweet Advanced Operations and Engineering Services AOES Group BV http://www.aoes.com Phone +31(0)71 5795521 Fax +31(0)71572 1277 From ajt at rri.sari.ac.uk Sat Feb 26 03:51:34 2005 From: ajt at rri.sari.ac.uk (Tony Travis) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> Message-ID: <422062C6.2020603@rri.sari.ac.uk> Ryan Sweet wrote: > [...]
> I think the term "diskless" is sometimes the problem when discussing > centrally installed and managed systems. Lots of "diskless" clusters > have GB and GB of local disks, only they are used for swap and temp > I/O, not for the OS. Hello, Ryan. I agree with you: It is common for so-called 'diskless' nodes to have a local disk for /tmp and swap. In our case I have also made a symbolic link /var/tmp -> /tmp on each node as well. In fact, Sun used to call this a 'dataless' client (i.e. no permanent data stored on the client, only local swap and temporary files). In the end, Sun abandoned support for 'dataless' clients in favour of their NFS-based cacheFS. The important thing about diskless/dataless/cacheFS clients is that they can easily be replaced with a new one if they go wrong without loss of permanent 'data'. Of course, the data associated with processes actually running on a node is lost if the node fails in use, but the new node is a plug-in replacement for the old one, and just needs to be rebooted. In our case, there is a little bit more to it than that because we have to add the new MAC address to the DHCP server for the fixed IP address used by the node, and partition/format the new local disk, but this is done by a script and takes about 2 minutes! It might be a lot less confusing if we talked about PXE booting dataless clients/nodes... Tony. -- Dr. A.J.Travis, | mailto:ajt@rri.sari.ac.uk Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751 Aberdeen AB21 9SB, Scotland, UK.
| fax:+44 (0)1224 716687 From reuti at staff.uni-marburg.de Sat Feb 26 04:27:50 2005 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan> References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> <20050225183146.GA1563@greglaptop.internal.keyresearch.com> <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan> Message-ID: <1109420870.42206b4640844@home.staff.uni-marburg.de> Quoting John Hearns : > On Fri, 2005-02-25 at 10:31 -0800, Greg Lindahl wrote: > > > > > Doesn't make any sense; I have seen people describe such systems where > > they download a disk image when a batch job wants a different software > > load. It's certainly doable that way: it does have different tradeoffs > > from the diskless case, but if it gives you a headache, it's probably > > I've always dreamed of using User Mode Linux images for this. > In a Grid-based world, prepare a UML instance which has all the > libraries and runtime to run your code. Ship it across the grid with > your executable. > The cluster at the receiving end can be running any distribution - it > runs your UML in a sandbox. I would like to have it also: if any queuing system wants to kill a job on a node: just shut down the virtual machine. And you also get rid of any semaphores and shared memory segments (and message queues), which may be left behind in other cases. I saw leftover semaphores not only on Linux, but also on AIX and SuperUX in case of a job abort. Is there any safe way to release them after a job? I already had the idea of catching them with a library which wraps the shmget(), .. calls by using LD_PRELOAD to get the IDs, and then releasing them in an epilog after the job (this seems to work, but of course only for dynamically linked applications). Just got the hint to look at Meiosys. Seems they have such features in their virtual machines.
Cheers - Reuti From hepu.deng at rmit.edu.au Fri Feb 25 16:56:05 2005 From: hepu.deng at rmit.edu.au (Hepu Deng) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] ICNC'05-FSKD'05 Final Call for Papers/Special Sessions/Sponsorship: Changsha China Message-ID: ---------------------------------------------------------------------- 2005 International Conference on Natural Computation (ICNC'05) International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'05) ---------------------------------------------------------------------- 27 - 29 August 2005, Changsha, China ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Home Page: http://www.xtu.edu.cn/nc2005 http://www.ntu.edu.sg/home/elpwang/nc2005 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ *** Submission Deadline: 15 March 2005 *** FINAL CALL FOR PAPERS, SPECIAL SESSIONS, AND SPONSORSHIP The ICNC'05-FSKD'05 will feature the most up-to-date research results in computational algorithms inspired by nature, including biological, ecological, and physical systems. It is an exciting and emerging interdisciplinary area in which a wide range of techniques and methods are being studied for dealing with large, complex, and dynamic problems. The joint conferences will also promote cross-fertilization over these exciting and yet closely-related areas. Registration to either conference will entitle a participant to the proceedings and technical sessions of both conferences, as well as the conference banquet, buffet lunches, and tours to some attractions in Changsha. Specific areas include, but are not limited to, neural computation, evolutionary computation, quantum computation, DNA computation, chemical computation, information processing in cells and tissues, molecular computation, computation with words, fuzzy computation, granular computation, artificial life, swarm intelligence, ant colonies, artificial immune systems, etc., with applications to knowledge discovery, finance, operations research, and more.
Publications ------------ The ICNC'05 and FSKD'05 conference proceedings will be published in Springer's Lecture Notes in Computer Science (LNCS) and Lecture Notes in Artificial Intelligence (LNAI), respectively. Both the LNCS and LNAI are indexed in SCI-Expanded. A selected number of authors will be invited to expand and revise their papers for possible inclusions in peer-reviewed international journals / edited books. Special Sessions ---------------- In addition to regular sessions, participants are encouraged to organize special sessions on specialized topics. Each special session should have at least 4 papers. Special session organizers will solicit submissions and conduct reviews on the submitted papers. Proposals for special sessions should be sent to the respective Program Chairs, i.e., Ke Chen (neural computation, Ke.Chen@manchester.ac.uk) Yew Soon Ong (other topics in ICNC'05, asysong@ntu.edu.sg) Yaochu Jin (FSKD'05, yaochu.jin@honda-ri.de) Keynote Speakers ---------------- Shun-ichi Amari, Japan Aike Guo, China Nikhil R. Pal, India Xin Yao, UK About Changsha, Hunan, China ---------------------------- Changsha, the capital of Hunan Province, is a historic and cultural city in southern China and a busy port on the Xiangjiang River, with a population over 6 million. Founded 3000 years ago, the city became the capital of the Zhou state (951-960 AD) and a leading commercial center during the Song dynasty (960-1279 AD). Changsha International Airport is easily accessible with direct flights to all major domestic and some international destinations. Other famous tourist destinations in Hunan include the Zhangjiajie National Park (natural heritage listed by UN) and Fenghuang (Phoenix) Ancient City. 
Important Dates --------------- Paper Submission : 15 March 2005 Decision Notification : 15 April 2005 Final Versions / Author Registration: 15 May 2005 Contact ------- Email: nc2005@xtu.edu.cn Phone/Fax: +86 732 829 2201 / 829 3249 Submission of Papers -------------------- Authors are invited to submit a full paper as an electronic file (postscript, pdf or Word format) at the conference website. Templates are available at both the conference website and the Springer website. Sponsorship / Exhibition ------------------------ The conferences will offer product vendors a sponsorship package and/or an opportunity to interact with conference participants. Product demonstration and exhibition can also be arranged. For more information, please visit the conference web page. Sponsor / Organizer ------------------- Xiangtan University, China Technical Co-Sponsor -------------------- IEEE Circuits and Systems Society IEEE Computational Intelligence Society IEEE Control Systems Society In Co-operation with -------------------- International Neural Network Society International Fuzzy Systems Association Chinese Association for Artificial Intelligence European Neural Network Society Fuzzy Mathematics and Systems Association of China Japanese Neural Network Society Asia-Pacific Neural Network Assembly Honorary Conference Chairs -------------------------- Shun-ichi Amari, Japan Lotfi A. Zadeh, USA International Advisory Board ---------------------------- Toshio Fukuda, Japan Kunihiko Fukushima, Japan Tom Gedeon, Australia Aike Guo, China Zhenya He, China Janusz Kacprzyk, Poland Nik Kasabov, New Zealand John A. Keane, UK Soo-Young Lee, Korea Erkki Oja, Finland Nikhil R. Pal, India Witold Pedrycz, Canada Jose Principe, USA Harold Szu, USA Shiro Usui, Japan Xindong Wu, USA Lei Xu, Hong Kong, China Xin Yao, UK Syozo Yasui, Japan Bo Zhang, China Yixin Zhong, China Jacek M. 
Zurada, USA General Chair ------------- He-An Luo, China General Co-Chairs ----------------- Lipo Wang, Singapore Yunqing Huang, China Program Chairs -------------- ICNC'05: Ke Chen, UK Yew Soon Ong, Singapore FSKD'05: Yaochu Jin, Germany Local Arrangement Chairs ------------------------ Renren Liu, China Xieping Gao, China Proceedings Chair ----------------- Fen Xiao, China Publicity Chair --------------- Hepu Deng, Australia Sponsorship/Exhibits Chairs --------------------------- Shaoping Ling, China Geok See Ng, Singapore Webmaster --------- Linai Kuang, China Yanyu Liu, China From jake at spiekerfamily.com Sun Feb 27 19:27:23 2005 From: jake at spiekerfamily.com (Jake Thebault-Spieker) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] Pi calculator/RAID accross all nodes/Mosix vs. OpenMosix Message-ID: <42228F9B.2040503@spiekerfamily.com> A couple of questions. 1. Is it possible to use the HD in the nodes on the cluster as one HD (kind of like an extended RAID array)? 2. Does anybody know of a program that will calculate pi, one digit at a time, infinitely that will run in parallel? 3. What is the difference between Mosix (www.mosix.org) and openMosix (www.openmosix.org)? I'm in the process of reading "Engineering a Beowulf Style Computer Cluster" by Robert Brown. I like it a lot and it contains lots of information. Thanks Mr. Brown! ;-) -- I think computer viruses should count as life. I think it says something about human nature that the only form of life we have created so far is purely destructive. We've created life in our own image. --Stephen Hawking Jake Thebault-Spieker From rgb at phy.duke.edu Sun Feb 27 23:10:38 2005 From: rgb at phy.duke.edu (Robert G.
Brown) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] motherboards for diskless nodes In-Reply-To: <1109420870.42206b4640844@home.staff.uni-marburg.de> References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> <20050225183146.GA1563@greglaptop.internal.keyresearch.com> <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan> <1109420870.42206b4640844@home.staff.uni-marburg.de> Message-ID: On Sat, 26 Feb 2005, Reuti wrote: > Quoting John Hearns : > > > On Fri, 2005-02-25 at 10:31 -0800, Greg Lindahl wrote: > > > > > > > > Doesn't make any sense; I have seen people describe such systems where > > > they download a disk image when a batch job wants a different software > > > load. It's certainly doable that way: it does have different tradeoffs > > > from the diskless case, but if it gives you a headache, it's probably > > > > I've always dreamed of using User Mode Linux images for this. > > In a Grid-based world, prepare a UML instance which has all the > > libraries and runtime to run your code. Ship it across the grid with > > your executable. > > The cluster at the receiving end can be running any distribution - it > > runs your UML in a sandbox. > > I would like to have it also: if any queuing system wants to kill a job on a > node: just shutdown the virtual machine. And you also get off of any semaphores > and shared memory segments (and message queues), which maybe left behind in > other cases. I saw leftover semaphores not only on Linux, but also on AIX and > SuperUX in case of a job abort. Is there any safe way to release them after a > job? I already got the idea, to catch them with a library which wraps the > shmget(),.. calls by using LD_PRELOAD to get the IDs, and then release them in > an epilog after the jobs (seems working, but of course only for dynamically > linked applications). > > Just got the hint to look at Meiosys. Seems they have such features in their > virtual machines. 
Another place to look for stuff not unlike this is the COD project at Duke. Except that with COD the "sandbox" is the whole computer. If your application needs a specific operating system or resource collection, you just prepare an appropriate image and boot the cluster (diskless or not) into that image long enough to run the application, then boot it back into something else. Clumsy as this sounds (and obviously overkill for certain classes of things) it has some significant advantages to consider. In addition to very definitely having all the right libraries and resources, there is security -- not to worry, the images you load contain YOUR account information and authentication information, so when you reboot into something else later, the entire system is taken down. Booting up a cluster into a new image can take as little as a few minutes, which is no big deal if the task will run for days. It eliminates the need for significant virtualization or something like vmware. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sun Feb 27 23:28:51 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] Pi calculator/RAID accross all nodes/Mosix vs. OpenMosix In-Reply-To: <42228F9B.2040503@spiekerfamily.com> References: <42228F9B.2040503@spiekerfamily.com> Message-ID: On Sun, 27 Feb 2005, Jake Thebault-Spieker wrote: > A couple of questions. > > 1. Is it possible to use the HD in the nodes on the cluster as one > HD (kind of like an extended RAID array)? Yes. Google for PVFS. > 2. Does anybody know of a program that will calculate pi, one digit at a > time, infinitely that will run in parallel? I don't know about one that will compute an infinite number of digits in PI, but the computation of PI via the arctan series is trivially partitionable in a variety of ways.
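One digit-at-a-time scheme that sidesteps the carry problem entirely is the Bailey-Borwein-Plouffe (BBP) formula (the "Bailey et al algorithm" Joe Landman mentions further down the thread): it yields individual hexadecimal digits of pi, so each cluster node can compute its assigned digits completely independently. A minimal, illustrative Python sketch (an editor's example, not code from the thread):

```python
# Illustrative sketch: nth hexadecimal digit of pi via the
# Bailey-Borwein-Plouffe (BBP) formula.  Each call is independent,
# so digits can be farmed out to cluster nodes with no carries
# to propagate between workers.

def bbp_term(j, n):
    """Fractional part of sum_k 16^(n-1-k) / (8k + j)."""
    s = 0.0
    for k in range(n):                  # head: modular exponentiation keeps
        d = 8 * k + j                   # only the fractional contribution
        s = (s + pow(16, n - 1 - k, d) / d) % 1.0
    k = n                               # tail: terms shrink by 16x each step
    while True:
        t = 16.0 ** (n - 1 - k) / (8 * k + j)
        if t < 1e-17:
            break
        s += t
        k += 1
    return s % 1.0

def pi_hex_digit(n):
    """The n-th hex digit of pi after the point (n >= 1)."""
    x = (4 * bbp_term(1, n) - 2 * bbp_term(4, n)
         - bbp_term(5, n) - bbp_term(6, n)) % 1.0
    return int(x * 16)

# pi = 3.243F6A88... in hexadecimal
print("".join("%X" % pi_hex_digit(i) for i in range(1, 9)))  # -> 243F6A88
```

The trade-off is the one rgb notes for the arctan approach in reverse: BBP gives hex (not decimal) digits, but because each digit is self-contained there is no string alignment or carry propagation to do when the workers' results are gathered.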
You'll spend more time working to sum and align the digits you get (as they obviously will have to be obtained and manipulated piecewise as strings) than you will doing the computation per se. It actually sounds like a decent exercise, as the carry from small digits may have to propagate iteratively back to larger ones as you extend the computation farther and farther. However, it ALSO sounds like one of those problems where parallelization may not do too well trying to beat a well-written serial version. Also, IIRC there are example programs for computing pi in parallel in lam and mpich, but I don't think they are geared for returning all the digits as a digit string. You might look at the following: http://www.mathpages.com/home/kmath373.htm or http://aemes.mae.ufl.edu/~uhk/PI.html for some of many online articles on pi and its computation. Google is your friend. > > 3. What is the difference between Mosix (www.mosix.org) and > openMosix (www.openmosix.org)? I don't know; I don't use Mosix. But somebody on list probably does. > I'm in the process of reading "Engineering a Beowulf Style Computer > Cluster" by Robert Brown. I like it a lot and it contains lots of > information. Thanks Mr. Brown! ;-) You're welcome! -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From joachim at ccrl-nece.de Mon Feb 28 01:46:08 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:51 2009 Subject: shall we write our own? Re: [Beowulf] O'Reilly Clusters Book Review In-Reply-To: References: Message-ID: <4222E860.2060508@ccrl-nece.de> Ryan Sweet wrote: > Here's what I propose: let's make a "Beowulf.org Guide to Linux > Clustering", or whatever the heck else you want to call it. Let us > outline, review, improve, and comment on it here on this mailing list.
> Here's the hard part - let's also set a deadline, with realistic goals, > and try to stick to it. Let's assign any publishing rights or other > "details" like that to the FSF or Linux Documentation Project. [...] > Second, please take this opportunity to tell me why this is a bad idea, > and while you're at it send your comments on the questions above. I'd say a wiki would be an easier start as it is self-organizing. Depending on how it develops, you can still turn it into a book. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From eugen at leitl.org Mon Feb 28 02:01:38 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:51 2009 Subject: shall we write our own? Re: [Beowulf] O'Reilly Clusters Book Review In-Reply-To: <4222E860.2060508@ccrl-nece.de> References: <4222E860.2060508@ccrl-nece.de> Message-ID: <20050228100138.GM1404@leitl.org> On Mon, Feb 28, 2005 at 10:46:08AM +0100, Joachim Worringen wrote: > I'd say a wiki would be an easier start as it is self-organizing. > Depending on how it develops, you can still turn it into a book. Yes, my vote goes to a Wiki, too. (I could host it, if necessary). -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net From landman at scalableinformatics.com Mon Feb 28 04:58:49 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] Pi calculator/RAID accross all nodes/Mosix vs.
OpenMosix In-Reply-To: References: <42228F9B.2040503@spiekerfamily.com> Message-ID: <42231589.4080706@scalableinformatics.com> >>2. Does anybody know of a program that will calculate pi, one digit at a >>time, infinitely that will run in parallel? > > > I don't know about one that will compute an infinite number of digits in > PI, but the computation of PI via the arctan series is trivially > partitionable in a variety of ways. You'll spend more time working to > sum and align the digits you get (as they obviously will have to be > obtained and manipulated piecewise as strings) than you will doing the > computation per se. It actually sounds like a decent exercise, as the > carry from small digits may have to propagate iteratively back to larger > ones as you extend the computation farther and farther. > http://mathworld.wolfram.com/PiDigits.html http://mathworld.wolfram.com/PiFormulas.html http://www.andrews.edu/~calkins/physics/Miracle.pdf and others. It is possible to calculate the digits individually using the Bailey et al algorithm. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From eugen at leitl.org Mon Feb 28 09:24:57 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] [Lustre-announce] CLUSTER FILE SYSTEMS, INC. RELEASES LUSTRE VERSION 1.2.4 TO THE GENERAL PUBLIC (fwd from jeff@clusterfs.com) Message-ID: <20050228172456.GE1404@leitl.org> ----- Forwarded message from Jeffrey Denworth ----- From: Jeffrey Denworth Date: Mon, 28 Feb 2005 10:19:22 -0500 To: lustre-announce@lists.clusterfs.com, lustre-discuss@lists.clusterfs.com Subject: [Lustre-announce] CLUSTER FILE SYSTEMS, INC. RELEASES LUSTRE VERSION 1.2.4 TO THE GENERAL PUBLIC X-Mailer: Apple Mail (2.619.2) Boston, MA - February 28, 2005 -
Cluster File Systems, Inc., the leader in high-performance parallel file systems, today released a major update to the Lustre file system on its public download site. Used on many of the world's largest Linux clusters, Lustre 1.2.4 represents a significant advancement in the development of a world-class open-source file system. This release further demonstrates Cluster File Systems' ongoing commitment to make new versions of Lustre available to the free software community on a regular basis. The Lustre file system is a next-generation cluster storage solution, designed to serve clusters with up to tens of thousands of nodes, manage petabytes of storage, and move hundreds of gigabytes per second with state-of-the-art security and management infrastructure. Improvements in Lustre 1.2.4 over the previous publicly available version include: - support for Intel Itanium, Intel EM64T, and AMD64 architectures - support for Linux 2.6 - support for the Quadrics QsNet II (Elan 4) interconnect - a disaster recovery tool (lfsck) - support for Object Storage Server addition - zero-configuration Lustre clients - very many improvements to performance and stability - demonstrated capability on systems with more than 1,200 nodes and 300 TB of storage, at more than 11 GB/s Downloading Lustre To download Lustre 1.2.4, please visit http://www.clusterfs.com/download.html About Cluster File Systems Founded in 2001, Cluster File Systems has established itself as the recognized leader in high-performance, scalable cluster file system technology. The company's premier Lustre cluster file system has demonstrated acceptance and capability on the world's fastest cluster supercomputers. Through partnerships with leading HPC storage, server, and software vendors, Cluster File Systems is helping cluster customers worldwide realize the benefits of scalable, reliable storage with Lustre. The company is headquartered in Boston, Massachusetts, with operations in North America, Europe, and Asia.
Lustre, the Lustre logo, Cluster File Systems, and CFS are trademarks of Cluster File Systems, Inc. in the United States. All other trademarks are the property of their respective holders. _______________________________________________ Lustre-announce mailing list Lustre-announce@lists.clusterfs.com https://lists.clusterfs.com/mailman/listinfo/lustre-announce ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050228/02c77c0f/attachment.bin From mike at etek.chalmers.se Mon Feb 28 00:32:48 2005 From: mike at etek.chalmers.se (Mikael Fredriksson) Date: Wed Nov 25 01:03:51 2009 Subject: [Beowulf] Pi calculator/RAID across all nodes/Mosix vs. OpenMosix In-Reply-To: References: <42228F9B.2040503@spiekerfamily.com> Message-ID: <4222D730.2090702@etek.chalmers.se> Robert G. Brown wrote: >>3. What is the difference between Mosix(www.mosix.org) and >>openMosix(www.openmosix.org)? > > > I don't know, don't use Mosix. But somebody on list probably does. Mosix is proprietary software; OpenMosix is open-source software. OpenMosix has its roots in the Mosix project. MF From rsweet at aoes.com Mon Feb 28 02:01:29 2005 From: rsweet at aoes.com (Ryan Sweet) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] Pi calculator/RAID across all nodes/Mosix vs. OpenMosix In-Reply-To: References: <42228F9B.2040503@spiekerfamily.com> Message-ID: >> >> 3. What is the difference between Mosix(www.mosix.org) and >> openMosix(www.openmosix.org)? > > I don't know, don't use Mosix. But somebody on list probably does. MOSIX started in the '70s, on the PDP-11.
It was ported to Linux in the '90s. It is a system for allowing a (pseudo)cluster-wide process ID space and cluster-wide load balancing via process migration. OpenMOSIX is a fork/re-write of MOSIX that was started a few years ago due to disagreements about the MOSIX license (which is not Open Source). OpenMOSIX is released under the GPL, and has a much larger developer and user community than MOSIX. If you are interested in single-system-image clustering, maybe also check out bproc http://sourceforge.net/projects/bproc (note - see Scyld for a complete bproc solution) or OpenSSI http://www.openssi.org and (definitely more experimental than the above, but promising) Kerrighed http://www.kerrighed.org/ SSI clusters are certainly interesting, but if you are new to clustering you may want to get your feet wet with a more traditional model first, so that you have a good reference point when reviewing and weighing options for SSI. regards, -Ryan -- Ryan Sweet Advanced Operations and Engineering Services AOES Group BV http://www.aoes.com Phone +31(0)71 5795521 Fax +31(0)71572 1277 From eno at dorsai.org Mon Feb 28 03:42:05 2005 From: eno at dorsai.org (Alpay Kasal) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] Thermal Kill-Switch Message-ID: <0ICM0051KDO1XP@mta9.srv.hcvlny.cv.net> Hello all. Can someone please give me some pointers to a vendor of some kind of Thermal Kill-Switch? If I'm not mistaken, it's an inline AC power device that powers off if the room reaches a certain temperature. I checked eBay, Google, and Home Depot. No dice. Thanks for the help. PS: not really looking to start a DIY project. I am hoping someone can point me to a low-cost, off-the-shelf device. And just in case this fits anyone's needs, I found this in the Beowulf.org archives...
http://www.apcc.com/products/family/index.cfm?id=47 I think it might be too pricey for my budget Alpay From rsweet at aoes.com Mon Feb 28 04:24:19 2005 From: rsweet at aoes.com (Ryan Sweet) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] So we will write our own book - next steps... In-Reply-To: <20050228100138.GM1404@leitl.org> References: <4222E860.2060508@ccrl-nece.de> <20050228100138.GM1404@leitl.org> Message-ID: On Mon, 28 Feb 2005, Eugen Leitl wrote: > On Mon, Feb 28, 2005 at 10:46:08AM +0100, Joachim Worringen wrote: > >> I'd say a wiki would be an easier start as it is self-organizing. >> Depending on how it develops, you can still turn it into a book. > > Yes, my vote goes to a Wiki, too. > > (I could host it, if necessary). I'm willing to try this as it may be a good way to bootstrap an effort, but I see a few problems with it, which may be real problems or may be imagined ones (hopefully this doesn't wander too far off-topic - jab me with a stick if it does): * I think it would be good to target the Linux Documentation Project, which uses DocBook (http://www.tldp.org/LDP/LDP-Author-Guide/html/docbook-why.html) DocBook has a lot of advantages for this sort of thing. If a wiki were used to organise the content, what does the actual data look like? Raw text in a database, or XML in a database, would be preferable for later conversion to DocBook. A wiki that used DocBook articles as a backend would be great. Google turns up a dead project on freshmeat. What I think would be bad is a wiki database containing 500 paragraphs of HTML, with different styles (if any), inconsistent tags, and so on. * most wikis seem to make it difficult to generate a printed copy or PDF version of the whole document - similarly, is it possible to make entire wikis available as a download for offline reading? * I've seen far more badly structured or confusing wikis than good ones.
The ones that I have seen that are good are much closer in form to a FAQ or a HOWTO, using the wiki more for collaborative editing than for organisational structure. Maybe what all this implies is a stronger editorial presence, I don't know. * Drupal's collaborative book feature looks like maybe an interesting middle road: http://drupal.org/node/284 though maybe it would have the same problems. Also re: >> Robert Brown has already done a lot of work on such a book, and generously >> made it freely available. Maybe he is amenable to this being a starting >> point? > > Sure. I periodically solicit help for such a project on the list -- > this is the first time somebody has solicited me:-) > > My experience is that it is really pretty difficult to get people to > actually contribute content. However, I've already got a very decent > start going, I think, and as always if anybody wants to contribute > content (under the OPL it is published under) I will cheerily include > it, with attribution. Well, we've even received a testimonial just yesterday about how helpful it has been. I do recall that the book has a rather personal style to it though (an asset that makes the publication less dull), which may (or may not?) make it seem awkward as the basis for a larger collaborative effort. > Based on Glenn's comments, I was actually feeling (once again) like I > ought to try to shake free enough time to do another full pass through > the content to bring it up to date and see if I can finish off some of > the missing chapters and -- possibly -- seek a paper publisher. I want > to keep it online/free either way (and there are publishers out there > that are comfortable with this) but a lot of people want to own a paper > copy of stuff like this. I get a lot of requests for a printable PDF > from people all over who found the html with google but missed the > online pdf images right next door... Which is another reason that maybe a wiki would not entirely serve (see above).
OK, this may be a loaded question - for Robert, how do you feel about people contributing to your book vs starting a new, collaborative effort which draws upon the strengths of what you have already done? For others, how would you feel about contributing to Robert's book using his LaTeX template, vs starting a new collaborative effort? Each approach has advantages, though if, as was mentioned, it's been difficult to get wider contributions for the book as it is, then maybe a more overtly collaborative approach would help. BTW - If Mr. Joseph Sloan is around, I hope you aren't taking offense (er... I guess I would have taken offense at the tone of Glen's review - but hopefully you appreciate brutal honesty ;-) - I just think there is a need for a book that draws upon all of the knowledge which is dispensed on this list and presents it in a balanced, well-considered and thorough fashion, offering something for beginners and advanced users alike. In either case I think I can devote a certain amount of dedicated time to the effort, as it overlaps with my need to develop some training material, because it has been very difficult to find good HPC consultants lately. Incidentally, with reference to the above, we're hiring: http://www.aoes.com/en/jobs/vn0416.html send me your CV if you are both able and interested. -- Ryan Sweet Advanced Operations and Engineering Services AOES Group BV http://www.aoes.com Phone +31(0)71 5795521 Fax +31(0)71572 1277 From makcaym at gmail.com Mon Feb 28 04:28:37 2005 From: makcaym at gmail.com (m a) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] Beowulf cluster usage statistics Message-ID: Hello, I would like to install a new Beowulf cluster. Is there any statistical info about the distribution of cluster software distributions around the world? Is there any info about typical Beowulf cluster specs? What would be your suggestion for newcomers on where to start?
thanks makcaym From john.hearns at streamline-computing.com Mon Feb 28 15:57:30 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] So we will write our own book - next steps... In-Reply-To: References: <4222E860.2060508@ccrl-nece.de> <20050228100138.GM1404@leitl.org> Message-ID: <1109635050.6244.17.camel@Vigor11> On Mon, 2005-02-28 at 13:24 +0100, Ryan Sweet wrote: > > > * Drupal's collaborative book feature looks like > maybe an interesting middle-road: http://drupal.org/node/284 though maybe it > would have the same problems. > A first look at this Drupal thing looks good. It would be nice to get together an overview of cluster interconnects. Why they are important, what the various choices are, and what the strengths and weaknesses are. As Glen originally commented, this is a very poor part of the O'Reilly book by Sloan. On this list, we have many expert people from industry, working for the companies which (for example) develop and support interconnects. From rgb at phy.duke.edu Mon Feb 28 17:28:12 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] So we will write our own book - next steps... In-Reply-To: References: <4222E860.2060508@ccrl-nece.de> <20050228100138.GM1404@leitl.org> Message-ID: On Mon, 28 Feb 2005, Ryan Sweet wrote: > > Based on Glenn's comments, I was actually feeling (once again) like I > > ought to try to shake free enough time to do another full pass through > > the content to bring it up to date and see if I can finish off some of > > the missing chapters and -- possibly -- seek a paper publisher. I want > > to keep it online/free either way (and there are publishers out there > > that are comfortable with this) but a lot of people want to own a paper > > copy of stuff like this.
I get a lot of requests > > from people all over who found the html with google but missed the > > online pdf images right next door... > > Which is another reason that maybe a wiki would not entirely serve (see > above). > > OK, this may be a loaded question - for Robert, how do you feel about people > contributing to your book vs starting a new, collaborative effort which draws > upon the strengths of what you have already done? I'm perfectly happy either way, as long as no further demands are put on my time (which is under a tension of a dozen kilonewtons or so:-). I put an OPL on the document for a reason -- as long as you don't publish an effort that contains a whole lot of my writing for money without giving me any (or getting permission to do so ahead of time), you're welcome to steal, reuse, borrow, adapt, or otherwise mutilate my efforts in anything you put together with some measure of attribution and the viral copyleft thing in force. As I also said, I welcome contributions -- if anybody wants to contribute chapters that would be great, and I'll even leave your name at the top of your chapters. One concept that I had for the book some time ago that isn't really implemented is to make it a kind of revolving "journal of cluster computing". I've written what might be viewed as a core/intro to cluster computing, with fairly detailed sections on at least some of the important stuff. What it NEEDS is somebody who is a Myrinet expert to write a chapter or article on "Using Myrinet in a Compute Cluster" -- stuff on getting it, installing it, plugging in the hardware drivers and so forth so that e.g. MPI can run on top of it, some example programs (toy code or real applications) that run on 100 BT TCP/IP and Myrinet side by side for timing and parallel speedup comparisons. Ditto for SCI. Ditto for Fiber Channel, InfiniBand, Gigabit Ethernet. An article on diskless clustering. An article on installing and using Warewulf on top of e.g.
RHEL, FC2 or FC3, CentOS, Caosity. An article on SGE. All by people who actually use all of the tools in their daily work. This is where I, or any possible author, come up short. I know something about all of the above, but I don't have direct experience with all of it and don't know a lot of people that do. Greg, probably, and Don. People in the business side of building clusters so that they end up hands on with lots of hardware configurations. A few people in the REALLY big cluster compute centers. So what we NEED is some of the real experts on the list to write expert-level but user-friendly contributions. If these were done as "articles" rather than chapters per se, it would also address the problem any such documentation has with information getting "stale" quickly. Without updates every year, a lot of the technical stuff has such a short half-life that any cluster book quickly becomes nearly useless beyond the intro level. I haven't done a major catchup on my book for a couple of whole years, and it is already woefully behind. Alas, my experience with co-authors so far hasn't been too positive. I think no fewer than five or six people have offered to do everything from write half the book with me as a full co-author to contribute a chapter here or there, and I have yet to see a single line of actual contributed text. Hence my cynicism -- we are busy, we are all busy. Writing is a LOT of work (I promise -- it is one of the things I "do"). Most folks don't realize how hard it is until they have to write a twenty- or thirty-page chapter (maybe with references and figures) that needs to pass some sort of review and that other people will read and everything. Twenty or thirty hours later... Well, apparently they quit before they get to the 20-30 hour mark. So I will watch with great interest as you try to get something together, and will applaud your energy and determination if you succeed.
A wiki/blog sort of thing actually isn't such a terrible idea if you can push it to the critical point where enough people participate and contribute. Sort of an online freeform journal. But then, this list is (if and as Google succeeds in getting to the online archives) already a pretty hellacious resource in that regard. > For others, how would you feel about contributing to Robert's book using his > Latex template, vs starting a new collaborative effort? > > Each approach has advantages though if, as was mentioned, its been difficult > to get wider contributions for the book as it is, then maybe a more overtly > collaborative approach would help. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From jake at spiekerfamily.com Mon Feb 28 18:11:55 2005 From: jake at spiekerfamily.com (Jake Thebault-Spieker) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] Pi calculator/RAID across all nodes/Mosix vs. OpenMosix Message-ID: <4223CF6B.5070805@spiekerfamily.com> OK, thanks for all the replies. Does anybody know of a use for a cluster? I know I'm going to build one, and I have no programming experience. My thought was that I would just calculate Pi out infinitely, but that doesn't look like it will work. The nodes will be Cyrix x86 processors w/ clock speeds of 133 MHz. They each have about 3 GB of HD space. Thoughts on something that I can calculate and log in a MySQL database? -- I think computer viruses should count as life. I think it says something about human nature that the only form of life we have created so far is purely destructive. We've created life in our own image. --Stephen Hawking Jake Thebault-Spieker From maillists at gauckler.ch Mon Feb 28 22:44:14 2005 From: maillists at gauckler.ch (Michael Gauckler) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
Message-ID: <1109659454.6544.2.camel@localhost.localdomain> Dear List, I would like to gather the data from several processes. Instead of the commonly used stride, I want to interleave the data: Rank 0: AAAAA -> ABCDABCDABCDABCDABCD Rank 1: BBBBB ----^---^---^---^---^ Rank 2: CCCCC -----^---^---^---^---^ Rank 3: DDDDD ------^---^---^---^---^ Since the stride of the receive type is indicated in multiples of its MPI type, no interleaving is possible (the smallest stride leads to AAAAABBBBBCCCCCDDDDD). Is there an elegant way to achieve this behaviour, as MPI_Gatherv seems to promise? Or do I need to do Send/Recv with self-aligned offsets? Thank you for your help! Michael
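[Editor's note: no reply is included in this chunk, so the following is an editorial sketch of the standard MPI answer, not a message from the thread. The layout Michael asks for is what MPI_Gatherv produces if the root builds a strided receive type (MPI_Type_vector(count, 1, nprocs, MPI_CHAR)), shrinks its extent to one element with MPI_Type_create_resized, and then gathers with recvcounts[r] = 1 and displs[r] = r, since displacements are measured in units of the resized extent. A plain-Python simulation of the resulting memory layout (hypothetical buffers, no MPI required):]

```python
def interleaved_gather(send_bufs):
    """Simulate the receive-buffer layout of MPI_Gatherv with a strided
    receive type whose extent has been resized to one element: rank r's
    data starts at index r (its displacement) and successive elements
    from that rank land nprocs apart (the vector stride)."""
    nprocs = len(send_bufs)        # number of ranks
    count = len(send_bufs[0])      # elements sent per rank (equal lengths)
    recv = [None] * (nprocs * count)
    for rank, buf in enumerate(send_bufs):
        for i, item in enumerate(buf):
            recv[rank + i * nprocs] = item  # displacement rank, stride nprocs
    return recv

gathered = interleaved_gather(
    [list("AAAAA"), list("BBBBB"), list("CCCCC"), list("DDDDD")])
print("".join(gathered))  # ABCDABCDABCDABCDABCD
```

In C, the corresponding root-side setup would be roughly: MPI_Type_vector(count, 1, nprocs, MPI_CHAR, &vec); MPI_Type_create_resized(vec, 0, sizeof(char), &rtype); MPI_Type_commit(&rtype); then MPI_Gatherv(sendbuf, count, MPI_CHAR, recvbuf, recvcounts, displs, rtype, root, comm). Variable names here are illustrative; the datatype calls are MPI-2 standard functions.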