From eugen at leitl.org Wed Mar 2 02:51:11 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] Re: [Bioclusters] error while running mpiblast (fwd from landman@scalableinformatics.com)
Message-ID: <20050302105110.GC13336@leitl.org>

----- Forwarded message from Joe Landman -----

From: Joe Landman
Date: Wed, 02 Mar 2005 00:25:05 -0500
To: "Clustering, compute farming & distributed computing in life science informatics"
Cc:
Subject: Re: [Bioclusters] error while running mpiblast
User-Agent: Mozilla Thunderbird 1.0 (X11/20041207)
Reply-To: "Clustering, compute farming & distributed computing in life science informatics"

James Cuff wrote:

>"Iam running this on SGI multiprocessor(numa)",
>
>you are running on a single shared (well, near unified, and SGI do this
>very, very well) memory server with, as you said and appear to understand,
>shared storage...
>
>*sigh*
>
>What on earth are you going to gain from MPI? Standard NCBI threads
>should do for you just fine, or maybe I've been smoking the funny stuff.

Hi James:

It is quite possible that mpiblast will scale better than NCBI blast on
this system. MPI forces you to pay attention to locality of reference,
so you tend to do a good job partitioning your code (that is, if it
scales). NCBI blast is built with pthreads, and I haven't seen it scale
much beyond about 10 CPUs on an SMP. The coarser grain of the mpiblast
partitioning (the pthread partitioning is very fine grained) will very
likely result in better scalability on a NUMA machine.

Not only that, but large multi-CPU NUMA machines have problems with
memory hotspots. I remember in the Origin days we used to play games
with DPLACE directives and whatnot to control memory layout, replication
of pages, etc. This was under Irix, and there were rich sets of tools to
help. I don't think many of them are available under Linux right now
(possibly in the SGI ProPack). You don't see much of a problem in
2/4-way systems.
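To make the "coarser grain" concrete: mpiblast splits the database into
one fragment per worker up front (via its mpiformatdb tool), so each
worker's references stay local to its own fragment. A toy sketch of
that partitioning arithmetic in plain Python — no MPI involved, and the
sequence count here is invented purely for illustration:

```python
# Toy sketch of mpiblast-style coarse-grained partitioning: split a
# database of N sequences into one contiguous fragment per worker.
# The counts are made up; real mpiblast fragments the formatted
# database on disk, not an in-memory index like this.

def partition(total_seqs, nworkers):
    """Return nworkers contiguous ranges covering total_seqs entries,
    with sizes differing by at most one."""
    base, extra = divmod(total_seqs, nworkers)
    frags, start = [], 0
    for w in range(nworkers):
        size = base + (1 if w < extra else 0)
        frags.append(range(start, start + size))
        start += size
    return frags

frags = partition(1000, 16)
print([len(f) for f in frags])  # 8 fragments of 63, then 8 of 62
```

Each worker then searches only its own fragment — exactly the
locality-of-reference property that fine-grained pthread partitioning
does not give you.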
It becomes serious when you load data into a page and you start getting
16 requestors for that page. Page migration is not a win here; read-only
page replication can be a huge win. Luckily, with MPI, all references
are local to begin with...

That said, I don't have ready access to one, so I cannot test this
hypothesis, though I might just throw together a BBS experiment to test
this. I'd love to play with a nice 9MB cache machine. This would be a
sweet blast engine :) Expensive... yes, but running out of cache is a
"good thing" (TM).

>If you _do_ happen to have multiple NUMA's in a cluster, (1) you are very
>lucky and (2) you should still listen to Joe's advice... Local is
>only local so far, try:
>
> Shared=/home/kalyani/toolkit/ncbi
> Local=/tmp/kalyani_mpiblast/
>
>(or as Joe maybe put it better)
>
> Shared=/home/kalyani/toolkit/ncbi
> Local=/mylocalfilesystemthatnoonewillmesswith/kalyani_mpiblast/

Lucas sent me a note indicating that in 1.3.0 they allow for shared and
local to coexist. Aaron/Lucas, if you are about, could you clarify some
of this? I don't want to lead people astray (and I will need to update
the SGE tool).

> WFM, YMMV..

Note: We have not built the mpiblast RPM for Itanium (nor, for that
matter, any of our RPMs). Is there any interest in this? Curious.

Joe

>
>Best,
>
>J.
>
>--
>James Cuff, D. Phil.
>Group Leader, Applied Production Systems
>Broad Institute of MIT and Harvard. 320 Charles Street,
>Cambridge, MA. 02141.
>Tel: 617-252-1925 Fax: 617-258-0903
>
>_______________________________________________
>Bioclusters maillist - Bioclusters@bioinformatics.org
>https://bioinformatics.org/mailman/listinfo/bioclusters

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615

_______________________________________________
Bioclusters maillist - Bioclusters@bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters

----- End forwarded message -----

--
Eugen* Leitl leitl
______________________________________________________________
ICBM: 48.07078, 11.61144 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
http://moleculardevices.org http://nanomachines.net

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050302/aba839cc/attachment.bin

From eugen at leitl.org Wed Mar 2 09:30:08 2005
From: eugen at leitl.org (Eugen Leitl)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] 2.6.11 is out; with InfiBand support
Message-ID: <20050302173008.GY13336@leitl.org>

Speaking of InfiniBand, I presume there are still no motherboards with
IB ports onboard?

http://www.internetnews.com/dev-news/article.php/3485401

February 24, 2005
Linux Kernel 2.6.11 Supports InfiniBand
By Sean Michael Kerner

The Linux world is bracing for the final release of the new Linux 2.6.11
kernel, which will include a long list of driver updates and patches,
with InfiniBand support perhaps being one of the most interesting new
additions.

Late last night, Linux creator Linus Torvalds issued the fifth release
candidate for the 2.6.11 kernel. The first 2.6.11 RC was issued on
Jan. 12; the second on Jan. 21; the third on Feb. 2; and the fourth on
Feb. 12.
In the RC5 posting, Torvalds indicated that it was likely the last RC
before the final release.

"Hey, I hoped -rc4 was the last one, but we had some laptop resource
conflicts, various ppc TLB flush issues, some possible stack overflows
in networking and a number of other details warranting a quick -rc5
before the final 2.6.11," Torvalds wrote.

"This time it's really supposed to be a quickie, so people who can,
please check it out, and we'll make the real 2.6.11 asap."

The long list of updates in the 2.6.11 kernel includes architecture
updates for x86-64, ia64, ppc, arm and mips, as well as updates to ACPI,
DRI (Direct Rendering Infrastructure, which permits direct access to
graphics hardware for X Window System users), ALSA (Advanced Linux Sound
Architecture, which provides MIDI and audio functionality to Linux),
SCSI and the XFS high-performance journaling filesystem.

The 2.6.11 kernel will also be significant in that it includes driver
support for the InfiniBand interconnect architecture. InfiniBand, which
derives its name from the underlying concept of "infinite bandwidth,"
is a switched fabric interconnect technology for high-performance
network devices that is common in a number of supercomputer clusters.

The upcoming inclusion of InfiniBand support in the Linux kernel is a
major step, according to the InfiniBand Trade Association.

"The inclusion of InfiniBand drivers in the upstream Linux kernel is a
significant milestone," Ross Schibler, CTO of InfiniBand vendor Topspin
Communications, told internetnews.com.

InfiniBand support was available previously in various Linux
distributions, but it wasn't part of the mainstream kernel.org Linux.

"This now means that anyone that downloads a kernel will have automatic
access to the software," explained Schibler. "It also means that any
upcoming distributions (Red Hat, SUSE, etc.) will have the software
included on their CDs.
Previously SUSE had it on a distribution, but only in the 'unsupported'
directory."

Schibler sees the inclusion of InfiniBand as a testament to the
maturation of the technology.

"Now that the technology has matured to such a point that Linus has
accepted it into the kernel, the way is paved for greater distribution
of the code and accelerated deployment of the technology," Schibler
said.

The previous kernel.org release, version 2.6.10, was issued on Dec. 24
after two release candidates. Linux distributions began including 2.6.10
thereafter, with Red Hat's Fedora Project being one of the first.

Fedora Core 3 initially shipped with the 2.6.9 kernel and then upgraded
to the 2.6.10 kernel on Jan. 13. Mandrakelinux's 10.2 Beta 3 also
includes the 2.6.10 release. SUSE Linux 9.2 currently includes the 2.6.8
kernel.

Including the most recent kernel in a distribution is not a particularly
easy task. The upcoming Debian release, code-named Sarge, will ship with
only the 2.6.8 kernel. In a release update e-mail, Debian Sarge release
manager Andreas Barth related that a meeting was recently held to review
the status of which kernel they would include.

"The team leads involved eventually decided to stay with kernel 2.6.8
and 2.4.27, rather than bumping the 2.6 kernel to 2.6.10," Barth wrote.
"This decision was made upon review of the known bugs in each of the
2.6 kernel versions; despite some significant bugs in the Debian 2.6.8
kernel tree, these bugs were weighed against the additional delays that
a kernel version bump would introduce in the schedule for
debian-installer RC3."

"As it happens, preparing 2.4 and 2.6 kernels with the security fixes
for all architectures took roughly two months from start to finish,
during which time preparation of the next debian-installer release
candidate has been entirely stalled," he added.
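For anyone planning to build 2.6.11 themselves, the new drivers live
under drivers/infiniband. A sketch of the relevant .config fragment —
the option names here are from memory and should be verified against
the "InfiniBand support" section of menuconfig:

```
CONFIG_INFINIBAND=m          # core InfiniBand stack
CONFIG_INFINIBAND_MTHCA=m    # Mellanox HCA driver
CONFIG_INFINIBAND_IPOIB=m    # IP-over-InfiniBand network interface
```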
--
Eugen* Leitl leitl
______________________________________________________________
ICBM: 48.07078, 11.61144 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
http://moleculardevices.org http://nanomachines.net

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050302/aad3689c/attachment.bin

From gotero at linuxprophet.com Wed Mar 2 11:08:19 2005
From: gotero at linuxprophet.com (Glen Otero)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] 2.6.11 is out; with InfiBand support
In-Reply-To: <20050302173008.GY13336@leitl.org>
References: <20050302173008.GY13336@leitl.org>
Message-ID: <5769e684325b11f3146f648fe06d327f@linuxprophet.com>

Arima and Iwill have mobos with IB LOM (Landed on Motherboard).

Glen

On Mar 2, 2005, at 9:30 AM, Eugen Leitl wrote:

> Speaking of InfiniBand, I presume there are still no motherboards with
> IB ports onboard?
>
> http://www.internetnews.com/dev-news/article.php/3485401
>
> [...]
>
> --
> Eugen* Leitl leitl
> ______________________________________________________________
> ICBM: 48.07078, 11.61144 http://www.leitl.org
> 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
> http://moleculardevices.org http://nanomachines.net
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

Glen Otero Ph.D.
Linux Prophet

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 5263 bytes
Desc: not available
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050302/3d6e786d/attachment.bin

From hahn at physics.mcmaster.ca Wed Mar 2 15:09:09 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] 2.6.11 is out; with InfiBand support
In-Reply-To: <5769e684325b11f3146f648fe06d327f@linuxprophet.com>
Message-ID:

> Arima and Iwill have mobos with IB LOM (Landed on Motherboard).

given the choice between a $150 pcie IB nic and having it onboard, I'd
choose the separate card. I know the IB salesdroids always say that
getting onto the MB will change everything, but this doesn't make sense.
IB is completely different from onboard gigabit, for instance, because
there is no ubiquitous IB infrastructure ready, waiting to be exploited.

the problem with "if you build it onboard, they will come" is also the
marginal cost. onboard gigabit is nearly the same cost as onboard 100bT,
very low, and you pretty much always want it. onboard IB is noticeably
higher than onboard GBE, noticeable in absolute terms, and you
definitely have no possible use for it on many systems. remember, most
people don't even saturate GBE yet, and GBE ports are damned cheap.
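to put the marginal-cost point in numbers, a back-of-envelope sketch:
the $150 PCIe IB NIC and ~$23/port GigE figures are from this thread;
the onboard-GigE-is-free assumption and the $250/port IB switch price
are purely hypothetical, for illustration only.

```python
# Back-of-envelope per-node interconnect cost. The $150 IB NIC and
# $23 GigE port figures are from this thread; the $250 IB switch-port
# price is a hypothetical placeholder.
gbe_nic, gbe_port = 0, 23     # onboard GigE NIC treated as free
ib_nic, ib_port = 150, 250    # ib_port is an assumed number

gbe_per_node = gbe_nic + gbe_port
ib_per_node = ib_nic + ib_port
print(f"GigE ${gbe_per_node}/node, IB ${ib_per_node}/node, "
      f"{ib_per_node / gbe_per_node:.1f}x more")
```

whatever the exact IB port price, the per-node gap stays an order of
magnitude, which is the whole marginal-cost argument.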
GBE nics are free, and switch ports are now down to $US 23/port:
http://froogle.google.com/froogle?q=netgear+GS748T&btnG=Search+Froogle

fundamentally, IB is still facing most of the same problems it always
has:

- requires fairly expensive, unique infrastructure
- not the greatest physical layer: it's easy to wind up with literally
  tons of IB cables.
- not clearly superior in performance vs alternatives.
- apparently designed by people who disliked existing technique or were
  ignorant of it.
- not a drop-in replacement for alternatives.

From bill at cse.ucdavis.edu Wed Mar 2 15:17:53 2005
From: bill at cse.ucdavis.edu (Bill Broadley)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] 2.6.11 is out; with InfiBand support
In-Reply-To: <5769e684325b11f3146f648fe06d327f@linuxprophet.com>
References: <20050302173008.GY13336@leitl.org> <5769e684325b11f3146f648fe06d327f@linuxprophet.com>
Message-ID: <20050302231753.GA5857@cse.ucdavis.edu>

On Wed, Mar 02, 2005 at 11:08:19AM -0800, Glen Otero wrote:
> Arima and Iwill have mobos with IB LOM (Landed on Motherboard).
>

Via pci-express? Or via an HTX[1] slot?

[1] http://www.hypertransport.org/products/productdetail.cfm?RecordID=65

When?

--
Bill Broadley
Computational Science and Engineering
UC Davis

From gotero at linuxprophet.com Wed Mar 2 15:21:19 2005
From: gotero at linuxprophet.com (Glen Otero)
Date: Wed Nov 25 01:03:52 2009
Subject: Fwd: [Beowulf] 2.6.11 is out; with InfiBand support
Message-ID: <0a773778f2243e80397520d276ba56b0@linuxprophet.com>

Begin forwarded message:

> From: Glen Otero
> Date: March 2, 2005 3:20:41 PM PST
> To: Bill Broadley
> Subject: Re: [Beowulf] 2.6.11 is out; with InfiBand support
>
> On Mar 2, 2005, at 3:17 PM, Bill Broadley wrote:
>
>> On Wed, Mar 02, 2005 at 11:08:19AM -0800, Glen Otero wrote:
>>> Arima and Iwill have mobos with IB LOM (Landed on Motherboard).
>>
>> Via pci-express?
>
> PCI-Express
>
>> Or via an HTX[1] slot?
>>
>> [1] http://www.hypertransport.org/products/productdetail.cfm?RecordID=65
>>
>> When?
>
> Available now, according to Mellanox. I've seen pictures of the boards.
>
>> --
>> Bill Broadley
>> Computational Science and Engineering
>> UC Davis
>
> Glen Otero Ph.D.
> Linux Prophet

Glen Otero Ph.D.
Linux Prophet

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: text/enriched
Size: 1172 bytes
Desc: not available
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050302/0ef3889c/attachment.bin

From jrajiv at hclinsys.com Tue Mar 1 03:07:53 2005
From: jrajiv at hclinsys.com (Rajiv)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] GRID APPLICATION
Message-ID: <012e01c51e4e$ea4b6860$0f120897@PMORND>

Dear All,

I have set up Globus 3.2 on two machines and I am able to submit a job
from one machine to another. I have a basic doubt about what application
to run in GRID environments. Shouldn't the GRID application use the
resources of both GRID machines simultaneously? Are there any
applications like this? So far I am only running remote jobs from one
machine to another - e.g. I can submit and run a LINPACK/GROMACS job
from the master of one cluster to the master of another cluster.
Regards,
Rajiv

From jakob at unthought.net Tue Mar 1 07:51:34 2005
From: jakob at unthought.net (Jakob Oestergaard)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] motherboards for diskless nodes
In-Reply-To: <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan>
References: <1109319374.6055.17.camel@Vigor45> <1109351864.2883.22.camel@localhost.localdomain> <20050225183146.GA1563@greglaptop.internal.keyresearch.com> <1109415598.6688.21.camel@ip13.2214.h2.fosdem.lan>
Message-ID: <20050301155134.GM347@unthought.net>

On Sat, Feb 26, 2005 at 10:59:57AM +0000, John Hearns wrote:
> On Fri, 2005-02-25 at 10:31 -0800, Greg Lindahl wrote:
> >
> > Doesn't make any sense; I have seen people describe such systems where
> > they download a disk image when a batch job wants a different software
> > load. It's certainly doable that way: it does have different tradeoffs
> > from the diskless case, but if it gives you a headache, it's probably
>
> I've always dreamed of using User Mode Linux images for this.
> In a Grid-based world, prepare a UML instance which has all the
> libraries and runtime to run your code. Ship it across the grid with
> your executable.
> The cluster at the receiving end can be running any distribution - it
> runs your UML in a sandbox.

Please see RFC 1925, corollary 6a:

  It is always possible to add another level of indirection.

Coming from truth 6:

  It is easier to move a problem around (for example, by moving the
  problem to a different part of the overall network architecture) than
  it is to solve it.

Your UML is, as its name implies, a user-space application, just like
the real application you were actually trying to run. If your
application cannot be run on a given distro, I pretty much doubt your
UML (which is a very, very complex user-mode application) will run.

What you want is KISS: Keep It Simple (Stupid).

Don't link to a gazillion libraries if you don't have to.
Link the libraries statically when feasible (gives you a performance
gain in many cases anyway). A statically linked application, or one with
only glibc linked dynamically, will run on a very wide range of
distributions.

Trust me on this; I make a living from selling an evil capitalistic
closed-source solution which needs to run on a very wide range of
distributions (and no, we do not link glibc statically because we're not
allowed to, but we keep our dependencies minimal and our binaries do run
on a very wide range of distributions).

> And before anyone says it, yes performance would be a dog,
> and I don't see how UML could access all those nice Myrinet and
> Infiniband cards. So I'm definitely blue-skying.

Again: adding layers of indirection is rarely a solution.

--
 / jakob

From peter at cs.usfca.edu Tue Mar 1 09:19:41 2005
From: peter at cs.usfca.edu (Peter Pacheco)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] Re: Pi calculator
In-Reply-To: <42231589.4080706@scalableinformatics.com>
References: <42231589.4080706@scalableinformatics.com>
Message-ID: <20050301171941.GA5545@cs.usfca.edu>

On Mon, Feb 28, 2005 at 07:58:49AM -0500, Joe Landman wrote:
>
> >>2. Does anybody know of a program that will calculate pi, one digit at a
> >>time, infinitely that will run in parallel?
> >
> >I don't know about one that will compute an infinite number of digits in
> >PI, but the computation of PI via the arctan series is trivially
> >partitionable in a variety of ways. You'll spend more time working to
> >sum and align the digits you get (as they obviously will have to be
> >obtained and manipulated piecewise as strings) than you will doing the
> >computation per se. It actually sounds like a decent exercise, as the
> >carry from small digits may have to propagate iteratively back to larger
> >ones as you extend the computation farther and farther.
>
> http://mathworld.wolfram.com/PiDigits.html
> http://mathworld.wolfram.com/PiFormulas.html
> http://www.andrews.edu/~calkins/physics/Miracle.pdf
>
> and others.
>
> It is possible to calculate the digits individually using the Bailey et
> al algorithm.
>
> Joe

I wrote a short MPI program last summer that uses the
Bailey-Borwein-Plouffe algorithm and the GMP library
(http://www.swox.com/gmp/) to compute arbitrarily many digits of pi.
Jake, send me email (peter@cs.usfca.edu) if you want a copy.

Best wishes,
Peter Pacheco
Department of Computer Science
University of San Francisco
San Francisco, CA 94117
(415) 422-6630
peter@cs.usfca.edu

From steve_heaton at ozemail.com.au Tue Mar 1 16:21:04 2005
From: steve_heaton at ozemail.com.au (steve_heaton@ozemail.com.au)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] So we will write our own book - next steps...
Message-ID: <20050302002104.EBYF8920.swebmail00.mail.ozemail.net@localhost>

G'day all

I humbly submit my A$0.02 as a novice Beowulf'er.

I don't have a problem with a series of collected articles. I agree it's
a great way to keep the journal/book fresh. The personal styles of the
authors don't present a challenge for me if the content is good quality.
This list is a great example!

I'd like to see "something in front of the punters" rather than aiming
for perfection with little output as a result. Esp. as we're initially
looking at a soft format. Articles need not be long and involved. Some
of the gems I've got from this list are 25 words or less ;) Another
advantage of the article approach.

Once we get to "things", my suggestion for an outline (per topic) is:

-) What is it? (with a bit of background/history etc)
-) How does it work? (Roughly)
-) How do I install/use it?
-) Tricks and tips (solutions to common problems)
-) Where to find more info (net refs, books etc)

Your basic FAQ thang :)

Vendors are in, but the editors wield a heavy hand on 'barrow pushing.
The vendors on this list seem good on the education side and rarely get
pulled into a p'ing contest. Something rare and beautiful compared to
other lists! They know their kit; I'd like their knowledge and
experience. No doubt they'll be flooded with sales as a result ;)

Ability to download the journal/book for offline reading is critical.

Editors are neutral moderators. E.g. they don't take sides on local HD
vs. net boot but will present (all) options without fear nor favour. The
goal is to leave the reader in a position to make an informed decision :)

I have every intention to contribute. The words are vapour until I do! =)

Cheers
Stevo

This message was sent through MyMail http://www.mymail.com.au

From ddw at dreamscape.com Tue Mar 1 21:02:48 2005
From: ddw at dreamscape.com (Dan Williams)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] Re: So we will write our own book - next steps...
In-Reply-To: <200503012001.j21K0Yik026268@bluewest.scyld.com>
References: <200503012001.j21K0Yik026268@bluewest.scyld.com>
Message-ID: <422548F8.7010603@dreamscape.com>

The question has been raised as to addressing the needs of beginners, as
well as advanced people. I am about as beginner as you can get. I have
never built or used a cluster, and am a Linux newbie besides.

If a rank beginner chapter is desired, I volunteer to write it, if
someone can hold my hand while I turn a pair of Pentium 100MHz
motherboards and miscellaneous parts I have in my junk pile into a
working (2 node) cluster. I am pretty good at writing non-fiction if
it's a subject I know or can learn about, but as of now, I only have the
vaguest notions of how to make a functioning cluster.

If there is interest in including a chapter that is detailed enough but
basic enough that someone who knows nothing on the subject can learn
enough to actually build a functioning cluster from junk parts, then I'm
your writer. I'll build a "proof of concept" junkyard cluster and write
about it, if someone can help me figure out how.
DDW

From eno at dorsai.org Wed Mar 2 01:00:43 2005
From: eno at dorsai.org (Alpay Kasal)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] where can i learn to build a cluster machine?
In-Reply-To:
Message-ID: <0ICP00HHNVJ1JO@mta2.srv.hcvlny.cv.net>

Holy Cow... This message is a keeper. Thanks a million Robert.

Alpay Kasal

-----Original Message-----
From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org]
On Behalf Of Robert G. Brown
Sent: Friday, February 25, 2005 12:54 PM
To: Starship Warrior
Cc: beowulf@beowulf.org
Subject: Re: [Beowulf] where can i learn to build a cluster machine?

...snip...

To give you the direct answer, it goes something like the following:

a) Hook systems into a common switched LAN, e.g. an ethernet switch.

b) If possible use decent quality PXE-aware NICs.

c) If possible use nodes with a decent amount of installed memory
   (>= 192 MB), although it is possible to get by with less, with
   effort.

d) Node hard disk is optional for at least some installation methods
   (e.g. warewulf) but is useful and enables others.

e) At least one system NEEDS ample hard disk and will serve as a
   "server" or "head node" to your cluster. This node will manage boot
   images, the distro you wish to install, NFS or other shared
   filesystems, authentication, and gives you a place to "login to the
   cluster". Note that this is a sloppy requirement -- there are many
   different ways to manage this and I'm just describing one of the
   simplest and most straightforward ones.

...snip...

rgb

--
Robert G. Brown    http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C.
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525
email: rgb@phy.duke.edu

From atp at piskorski.com Wed Mar 2 18:06:00 2005
From: atp at piskorski.com (Andrew Piskorski)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] 2.6.11 is out; with InfiBand support
In-Reply-To:
References:
Message-ID: <20050303020600.GA56437@piskorski.com>

On Wed, Mar 02, 2005 at 06:09:09PM -0500, Mark Hahn wrote:
> > Arima and Iwill have mobos with IB LOM (Landed on Motherboard).
>
> given the choice between a $150 pcie IB nic and having it onboard,
> I'd choose the separate card. I know the IB salesdroids always

Except, a single PCI-X Infiniband card currently costs $1000 or so,
right? (That's for a 4x 2 port card, but Froogle does not seem to know
of any cheaper cards.)

http://h30094.www3.hp.com/product.asp?sku=2603660&jumpid=ex_r2910_frooglesmb/accessories
http://www.costcentral.com/proddetail/HP_NC570C/376158B21/F35425/froogle/

--
Andrew Piskorski
http://www.piskorski.com/

From rene at renestorm.de Wed Mar 2 19:05:54 2005
From: rene at renestorm.de (rene)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
In-Reply-To: <1109659454.6544.2.camel@localhost.localdomain>
References: <1109659454.6544.2.camel@localhost.localdomain>
Message-ID: <200503030405.54791.rene@renestorm.de>

Hi Michael,

in my opinion, this would be a gather with length 1 but sent 4 times.
This seems to be the easiest and slowest way. If I'm not totally wrong,
your interleaving looks like an Alltoall followed by a reduce operation,
but why don't you sort the recv buffer afterwards?

Cu
Rene

> Dear List,
>
> I would like to gather the data from several processes.
> Instead of the commonly used stride, I want to interleave
> the data:
>
> Rank 0: AAAAA -> ABCDABCDABCDABCDABCD
> Rank 1: BBBBB ----^---^---^---^---^
> Rank 2: CCCCC -----^---^---^---^---^
> Rank 3: DDDDD ------^---^---^---^---^
>
> Since the stride of the receive type is indicated
> in multiples of its mpi_type, no interleaving is
> possible (the smallest striping factor leads to
> AAAAABBBBBCCCCCDDDDD).
>
> Is there a way to achieve this behaviour in an
> elegant way, as MPI_Gather promises it? Or do
> I need to do Send/Recv with self-aligned offsets?
>
> Thank you for your help!
>
> Michael
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From felix.rauch.valenti at gmail.com Wed Mar 2 17:00:49 2005
From: felix.rauch.valenti at gmail.com (Felix Rauch Valenti)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] Re: Pi calculator
In-Reply-To: <20050301171941.GA5545@cs.usfca.edu>
References: <42231589.4080706@scalableinformatics.com> <20050301171941.GA5545@cs.usfca.edu>
Message-ID: <4eafc81b050302170064486203@mail.gmail.com>

On Tue, 1 Mar 2005 09:19:41 -0800, Peter Pacheco wrote:
> I wrote a short MPI program last summer that uses
> the Bailey-Borwein-Plouffe algorithm and the GMP library
> (http://www.swox.com/gmp/) to compute arbitrarily many digits of pi.
> Jake, send me email (peter@cs.usfca.edu) if you want a copy.

If somebody really wants to spend zillions of cycles on calculating Pi
just for fun, you could also look for non-random patterns in Pi on the
way. Maybe you will become famous one day.
(insert reference to Carl Sagan's "Contact" here)

- Felix

From joachim at ccrl-nece.de Thu Mar 3 01:14:40 2005
From: joachim at ccrl-nece.de (Joachim Worringen)
Date: Wed Nov 25 01:03:52 2009
Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
In-Reply-To: <1109659454.6544.2.camel@localhost.localdomain> References: <1109659454.6544.2.camel@localhost.localdomain> Message-ID: <4226D580.2010206@ccrl-nece.de> Michael Gauckler wrote: > Dear List, > > I would like to gather the data from several processes. > Instead of the comonly used stride, I want to interleave > the data: > > Rank 0: AAAAA -> ABCDABCDABCDABCDABCD > Rank 1: BBBBB ----^---^---^---^---^ > Rank 2: CCCCC -----^---^---^---^---^ > Rank 3: DDDDD ------^---^---^---^---^ > > Since the stride of the receive type is indicated > in multpiles of its mpi_type, no interleaving is > possible (the smallest striping factor leads to > AAAAABBBBBBCCCCCDDDDD). > > Is there a way to achieve this behaviour in an > elegant way, as MPI_Gather promises it? Or do > I need to do Send/Recv with self-aligned offsets? Actually, I don't see an 'elegant' way to do this, either. The decision between multiple MPI_Gatherv() calls and an Irecv/Send/Waitall construct depends on the quality of the MPI implementation you use (MPI_Gatherv can be optimized well for small amounts of data), the characteristics of your interconnect (high latency gives more room for optimization) and the number of processes you use. For small process numbers, you won't see much of a difference anyway. You could also try to gather all data on the root in separate buffers, and then let this process send/recv to itself using the proper datatypes. Finally, if this communication is not a significant part of your runtime, you shouldn't spend much time optimizing it anyway. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From rgb at phy.duke.edu Thu Mar 3 04:54:46 2005 From: rgb at phy.duke.edu (Robert G.
Brown) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] Re: Pi calculator In-Reply-To: <4eafc81b050302170064486203@mail.gmail.com> References: <42231589.4080706@scalableinformatics.com> <20050301171941.GA5545@cs.usfca.edu> <4eafc81b050302170064486203@mail.gmail.com> Message-ID: On Thu, 3 Mar 2005, Felix Rauch Valenti wrote: > On Tue, 1 Mar 2005 09:19:41 -0800, Peter Pacheco wrote: > > I wrote a short MPI program last summer that uses > > the Bailey-Borwein-Plouffe algorithm and the GMP library > > (http://www.swox.com/gmp/) to compute arbitrarily many digits of pi. > > Jake, send me email (peter@cs.usfca.edu) if you want a copy. > > If somebody really wants to spend zillions of cycles on calculating Pi > just for fun, you could also look for non-random patterns in Pi on the > way. Maybe you will become famous one day. > (insert reference to Carl Sagan's "Contact" here) Just be sure that you look with a powerful statistical tool -- remembering those damnable typing monkeys. Pi is well known to have all sorts of non-random-looking patterns in it. Distributed (as far as all studies done to date that I found referenced on the web) completely randomly...;-) Wait! I see a cloud that looks like the Virgin Mary! Gotta go and write the Enquirer...:-) rgb (Still haven't taken my medicine this morning, and Deadline hisself is already pre-emptively hassling me for the column I haven't written yet for May:-) (I just HAVE to quit playing WoW until 2 am before cumulative sleep deprivation slays me like a dragon did last night.) (Hmmm, combine business with pleasure? Maybe I'll try to contact the WoW folks and get some detail about their realm cluster. That would make a nifty article for June...) (Damn, my interior monologue isn't working this morning. 
Must sleep...:-) > > - Felix > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Thu Mar 3 05:20:57 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: <4226D580.2010206@ccrl-nece.de> References: <1109659454.6544.2.camel@localhost.localdomain> <4226D580.2010206@ccrl-nece.de> Message-ID: On Thu, 3 Mar 2005, Joachim Worringen wrote: > Michael Gauckler wrote: > > Dear List, > > > > I would like to gather the data from several processes. > > Instead of the comonly used stride, I want to interleave > > the data: > > > > Rank 0: AAAAA -> ABCDABCDABCDABCDABCD > > Rank 1: BBBBB ----^---^---^---^---^ > > Rank 2: CCCCC -----^---^---^---^---^ > > Rank 3: DDDDD ------^---^---^---^---^ > > > > Since the stride of the receive type is indicated > > in multpiles of its mpi_type, no interleaving is > > possible (the smallest striping factor leads to > > AAAAABBBBBBCCCCCDDDDD). > > > > Is there a way to achieve this behaviour in an > > elegant way, as MPI_Gather promises it? Or do > > I need to do Send/Recv with self-aligned offsets? What about RMA-like commands? MPI_Get in a loop? Since that is controlled by the gatherer, one would presume that it preserves call order (although it is non-blocking). Or of course there are always raw sockets... where you have complete control. Depending on how critical it is that you preserve this strict interleaving order. rgb > > Actually, I don't see an 'elegant' way to do this, either. 
The decision > between multiple MPI_Gatherv() calls and a Irecv/Send/Waitall construct > depends on the quality of the MPI implementation you use (MPI_Gatherv > can be optimized well for small amounts of data), the characteristics of > you interconnect (high latency gives more room for optimization) and the > number of processes you use. For small process numbers, you wont see > much of a difference anyway. > > You could also try to gather all data on the root in separate buffers, > and then let this process send/recv to itself using the proper datatypes. > > Finally, if this communication is not a significant part of your > runtime, you shouldn't spend much time optimizing it anyway. > > Joachim > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From eugen at leitl.org Thu Mar 3 05:42:20 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] purchasing sources for Newisys 2100 (and 4300)? Message-ID: <20050303134219.GB13336@leitl.org> Question to resident hardware purchasers: where do you get your Newisys systems, in the EU (Germany, especially)? Small quantities. The company I work for has good prices for Sun V20z iron, but naturally I'm looking for better deals, especially with large memory configurations. I know this is the wrong place to ask, but I can't find a lead on the web. Thanks, -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050303/f67aa6aa/attachment.bin From gropp at mcs.anl.gov Thu Mar 3 06:23:10 2005 From: gropp at mcs.anl.gov (William Gropp) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: <1109659454.6544.2.camel@localhost.localdomain> References: <1109659454.6544.2.camel@localhost.localdomain> Message-ID: <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> At 12:44 AM 3/1/2005, Michael Gauckler wrote: >Dear List, > >I would like to gather the data from several processes. >Instead of the comonly used stride, I want to interleave >the data: > >Rank 0: AAAAA -> ABCDABCDABCDABCDABCD >Rank 1: BBBBB ----^---^---^---^---^ >Rank 2: CCCCC -----^---^---^---^---^ >Rank 3: DDDDD ------^---^---^---^---^ > >Since the stride of the receive type is indicated >in multpiles of its mpi_type, no interleaving is >possible (the smallest striping factor leads to >AAAAABBBBBBCCCCCDDDDD). > >Is there a way to achieve this behaviour in an >elegant way, as MPI_Gather promises it? Or do >I need to do Send/Recv with self-aligned offsets? You should be able to do this with MPI_Gather by creating a new datatype on the receiving process whose extent is the size of a single item; that will get you the correct offset for the first element. In order to receive the subsequent elements into the desired location, you need to use a vector type containing the number of elements. And for this to be fast, you need an MPI implementation that will handle the "resized" datatype efficiently (use MPI_Type_vector to create the full datatype and MPI_Type_create_resized to change its effective extent). If you are moving large amounts of data, separate send/recvs are probably a better choice. Bill >Thank you for your help! 
> > Michael > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf William Gropp http://www.mcs.anl.gov/~gropp From rross at mcs.anl.gov Thu Mar 3 08:35:37 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: References: <1109659454.6544.2.camel@localhost.localdomain> <4226D580.2010206@ccrl-nece.de> Message-ID: <42273CD9.5030503@mcs.anl.gov> Robert G. Brown wrote: > On Thu, 3 Mar 2005, Joachim Worringen wrote: > >>Michael Gauckler wrote: >> >>>I would like to gather the data from several processes. >>>Instead of the comonly used stride, I want to interleave >>>the data: >>> >>>Rank 0: AAAAA -> ABCDABCDABCDABCDABCD >>>Rank 1: BBBBB ----^---^---^---^---^ >>>Rank 2: CCCCC -----^---^---^---^---^ >>>Rank 3: DDDDD ------^---^---^---^---^ >>> >>>Since the stride of the receive type is indicated >>>in multpiles of its mpi_type, no interleaving is >>>possible (the smallest striping factor leads to >>>AAAAABBBBBBCCCCCDDDDD). >>> >>>Is there a way to achieve this behaviour in an >>>elegant way, as MPI_Gather promises it? Or do >>>I need to do Send/Recv with self-aligned offsets? > > > What about RMA-like commands? MPI_Get in a loop? Since that is > controlled by the gatherer, one would presume that it preserves call > order (although it is non-blocking). I would hope that one would read the spec instead! MPI_Get()s don't necessarily *do* anything until the corresponding synchronization call. This allows the implementation to aggregate messages. Call order (of the MPI_Get()s in an epoch) is ignored. > Or of course there are always raw sockets... where you have complete > control. Depending on how critical it is that you preserve this strict > interleaving order. > > rgb No you don't! 
You're just letting the kernel buffer things instead of the MPI implementation. Plus, Michael's original concern was doing this in an elegant way, not explicitly controlling the ordering. Joachim had some good options for MPI. Regards, Rob From rross at mcs.anl.gov Thu Mar 3 08:36:48 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> References: <1109659454.6544.2.camel@localhost.localdomain> <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> Message-ID: <42273D20.2000702@mcs.anl.gov> William Gropp wrote: > At 12:44 AM 3/1/2005, Michael Gauckler wrote: > >> Dear List, >> >> I would like to gather the data from several processes. >> Instead of the comonly used stride, I want to interleave >> the data: >> >> Rank 0: AAAAA -> ABCDABCDABCDABCDABCD >> Rank 1: BBBBB ----^---^---^---^---^ >> Rank 2: CCCCC -----^---^---^---^---^ >> Rank 3: DDDDD ------^---^---^---^---^ >> >> Since the stride of the receive type is indicated >> in multpiles of its mpi_type, no interleaving is >> possible (the smallest striping factor leads to >> AAAAABBBBBBCCCCCDDDDD). >> >> Is there a way to achieve this behaviour in an >> elegant way, as MPI_Gather promises it? Or do >> I need to do Send/Recv with self-aligned offsets? > > > You should be able to do this with MPI_Gather by creating a new datatype > on the receiving process whose extent is the size of a single item; that > will get you the correct offset for the first element. In order to > receive the subsequent elements into the desired location, you need to > use a vector type containing the number of elements. And for this to be > fast, you need an MPI implementation that will handle the "resized" > datatype efficiently (use MPI_Type_vector to create the full datatype > and MPI_Type_create_resized to change its effective extent). 
If you are > moving large amounts of data, separate send/recvs are probably a better > choice. > > Bill Nice! Rob From joachim at ccrl-nece.de Thu Mar 3 08:47:23 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> References: <1109659454.6544.2.camel@localhost.localdomain> <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> Message-ID: <42273F9B.4040506@ccrl-nece.de> William Gropp wrote: > You should be able to do this with MPI_Gather by creating a new datatype > on the receiving process whose extent is the size of a single item; that > will get you the correct offset for the first element. In order to > receive the subsequent elements into the desired location, you need to > use a vector type containing the number of elements. And for this to be > fast, you need an MPI implementation that will handle the "resized" > datatype efficiently (use MPI_Type_vector to create the full datatype > and MPI_Type_create_resized to change its effective extent). If you are > moving large amounts of data, separate send/recvs are probably a better > choice. Oh yes, I forgot, twiddling with LB and UB. I never liked this, esp. as an MPI implementor. Not especially 'elegant', but it should work. Good conformance test, BTW. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From rsweet at aoes.com Thu Mar 3 00:29:56 2005 From: rsweet at aoes.com (Ryan Sweet) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] GRID APPLICATION In-Reply-To: <012e01c51e4e$ea4b6860$0f120897@PMORND> References: <012e01c51e4e$ea4b6860$0f120897@PMORND> Message-ID: On Tue, 1 Mar 2005, Rajiv wrote: > Dear All, > I have setup Globus 3.2 on two machines and I am able to submit job from > one machine to another. 
I have a basic doubt about what application to run > in GRID environments. Shouldn't the GRID application use resources of both > the GRID machines simultaniously. Are there any applications like this. So > far I am only running remote jobs from on machine to another - for eg I can > submit > and run LINPACK/GROMACS job from one master of a cluster to a master of > another cluster. Dear Rajiv, There seem to be a lot of people building clusters and grid systems lately without applications to run on them ;-). That's nice to see, I guess, in as much as it indicates the broad level of acceptance for the technologies. It is very much the reverse of the "to scratch an itch" way in which things used to be done. ;-) Grid systems (the term means many things to many people - here I mean roughly "a collection of resources used in collaboration spanning multiple geographic locations and administrative domains") are used in a wide variety of ways. In your scenario, if you are building a globus system in order to learn about globus, and you can now run jobs on one host from another and vice-versa, you've already got a lot of the hard work done. If you would like to use multiple grid resources to simultaneously work on a larger problem than any of them can tackle when working alone, then you need a way to take your problem and partition it into chunks that can be submitted to various resources around the grid, in your example, split a larger problem in two, and submit half to each resource. It is not usually practical (though there are exceptions) to run jobs which have a parallel communication component (MPI or PVM) across grid resoures (submitting multiple local mpi jobs to multiple resources is ok though, provided that you have a way to verify that the resources can accept your mpi jobs and run them - thats where RSL and a broker, etc. come in). Some middleware to broker between the requirements of your job and the available resources is usually used. 
There are a lot of projects that do that, and any attempt I would make to list them would surely leave out one or more deserving ones. Google is your friend. For GROMACS there are lots of examples out there. Here is a very friendly one from the UK NGS: http://www.ngs.ac.uk/sites/ox/software/gromacs.html regards, -Ryan -- Ryan Sweet Advanced Operations and Engineering Services AOES Group BV http://www.aoes.com Phone +31(0)71 5795521 Fax +31(0)71572 1277 From rsweet at aoes.com Thu Mar 3 00:52:32 2005 From: rsweet at aoes.com (Ryan Sweet) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] So we will write our own book - next steps... In-Reply-To: <20050302002104.EBYF8920.swebmail00.mail.ozemail.net@localhost> References: <20050302002104.EBYF8920.swebmail00.mail.ozemail.net@localhost> Message-ID: The response to this thread has been great! I am keeping track of all the responses, and will try to present some sort of overview. I have a system ready for hosting. What I'd like to do is to review a few different wikis/collaboration systems/etc. to check on some issues such as their security risks, ease of installation/maintenance, printing/offline reading support, and so on. If you have a preference please send me suggestions for consideration off-list. Either I'll set up a few of the best ones and then we can choose among them, or if I don't have time I'll just set up the one I think is the best and if people who are actually contributing don't like it we can discuss changing it at that time. I think I will have something by Tuesday, though maybe earlier. If you are thinking of writing something, please go have a look at the (older but weathering well) FAQ on beowulf.org and then at Robert Brown's book first; to avoid re-inventing the wheel, update, acknowledge, and borrow.
regards, -Ryan -- Ryan Sweet Advanced Operations and Engineering Services AOES Group BV http://www.aoes.com Phone +31(0)71 5795521 Fax +31(0)71572 1277 From djholm at fnal.gov Thu Mar 3 06:58:33 2005 From: djholm at fnal.gov (Don Holmgren) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: <20050303020600.GA56437@piskorski.com> References: <20050303020600.GA56437@piskorski.com> Message-ID: I was just quoted quantity 2 PCI-X HCA's (4x 2 port Mellanox) by a reseller for $655 each. Last Fall we purchased a large quantity of PCI-E HCA's for considerably less than that unit price. Supposedly the "memory free" PCI-E HCA's that use host memory, rather than on-board sram, should move prices towards $100 sometime this year - we'll see (landed on motherboard pricing at ~ $70, see http://www.mellanox.com/news/press/pr_030105.html) Like other high performance network gear, it's tough to get accurate pricing information without going out and getting quotes. Don Holmgren On Wed, 2 Mar 2005, Andrew Piskorski wrote: > On Wed, Mar 02, 2005 at 06:09:09PM -0500, Mark Hahn wrote: > > > Arima and Iwill have mobos with IB LOM (Landed on Motherboard). > > > > given the choice between a $150 pcie IB nic and having it onboard, > > I'd choose the separate card. I know the IB salesdroids always > > Except, a single PCI-X Infiniband card currently costs $1000 or so, > right? (That's for a 4x 2 port card, but Froogle does not seem to > know of any cheaper cards.) 
> > http://h30094.www3.hp.com/product.asp?sku=2603660&jumpid=ex_r2910_frooglesmb/accessories > http://www.costcentral.com/proddetail/HP_NC570C/376158B21/F35425/froogle/ > > -- > Andrew Piskorski > http://www.piskorski.com/ > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From list-beowulf at onerussian.com Thu Mar 3 08:33:59 2005 From: list-beowulf at onerussian.com (Yaroslav Halchenko) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] DCC (debian cluster components) Message-ID: <20050303163359.GJ4482@washoe.rutgers.edu> Dear Debianized beowulfers or beowulfiezed Debian users, Does anyone have experience with http://www.irb.hr/en/cir/projects/dcc/ which was recently released? Project Goals We expect to integrate some existing technologies (like LDAP, System Installation Suite, Torque, C3...) and develop a production-grade toolset for easier cluster management, based on Debian GNU/Linux distribution. This involves development of automation mechanisms that provide a flexible platform for high-performance computation tasks, but also provide a system-administrator to have a secure, easy to maintain, reliable and good supported cluster administration toolbox, based on Debian/GNU Linux. -- .-. =------------------------------ /v\ ----------------------------= Keep in touch // \\ (yoh@|www.)onerussian.com Yaroslav Halchenko /( )\ ICQ#: 60653192 Linux User ^^-^^ [175555] -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20050303/f24dcef6/attachment.bin From rgb at phy.duke.edu Thu Mar 3 10:35:08 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv?
In-Reply-To: <42273CD9.5030503@mcs.anl.gov> References: <1109659454.6544.2.camel@localhost.localdomain> <4226D580.2010206@ccrl-nece.de> <42273CD9.5030503@mcs.anl.gov> Message-ID: On Thu, 3 Mar 2005, Rob Ross wrote: > > What about RMA-like commands? MPI_Get in a loop? Since that is > > controlled by the gatherer, one would presume that it preserves call > > order (although it is non-blocking). > > I would hope that one would read the spec instead! MPI_Get()s don't > necessarily *do* anything until the corresponding synchronization call. > This allows the implementation to aggregate messages. Call order (of > the MPI_Get()s in an epoch) is ignored. Ouch! I did read the spec, about ten seconds before replying, and note that I SAID it was non-blocking (from the spec) and was thinking about using it with sync's. However, yes, this is a bit oxymoronic on a reread (preserve call order vs non-blocking? Jeeze:-) and I consider myself whomped upside the head:-) Still, what IS to prevent him from alternating gets and synchronization calls while retrieving? I don't think there is any alternative to doing this in ANY non-blocking, potentially aggregating or parallel communications scenario, although where he puts the barrier might vary. Depending on whether he cares about the (ABCD)(ABCD)... order per se he might try (however he does the "get"ting or receiving, or whatever): get A (sync) get B (sync) get C (sync) (inefficient but absolutely guarantees loop order). If (ABCD)(BADC)(DCAB)... is ok (he doesn't care what order they arrive in but he doesn't want the A communications to be aggregated so that two A's get there before he gets the BCD from the same cycle of computations) then he should be able to do: get A get B get C get D (sync) get A get B get C... If I understand things correctly (where there is a very definite chance that I don't!) these (on top of any library) will in the first case be equivalent to a blocking TCP read from A, B, C...
in order (but probably not as efficient as TCP would be in this particular case because MPI is optimized against the near diametrically opposite assumption) while the second would be equivalent to using select to monitor a group of open sockets for I/O, reading from them in the order that data becomes available, but adding a toggle so you don't read from one twice before reading from all of them. Although there is likely more than one way to do it, and where in low-level programming one might well want to implement handshaking of some sort to trigger the next cycle's send (blocking the remote client's execution as necessary) to avoid overrunning buffers or exhausting memory on the master/aggregator if for any reason one host turns out to be "slow" relative to the others. In MPI one hopes all of that is handled for you, and more. > > Or of course there are always raw sockets... where you have complete > > control. Depending on how critical it is that you preserve this strict > > interleaving order. > > > > rgb > > No you don't! You're just letting the kernel buffer things instead of > the MPI implementation. Plus, Michael's original concern was doing this > in an elegant way, not explicitly controlling the ordering. Of course one has maximum control with raw sockets (or more generically, raw devices). Somewhere down inside MPI there ARE raw sockets (or equivalent at some level in the ISO/OSI stack). The MPI library wraps them up and hides all sorts of details while making all the device features uniform across the now "invisible" devices and in the process necessarily excluding one from ACCESSING all of those features a.k.a. the details being hidden. I may have misunderstood the recent discussion about the possible need for an MPI ABI but I thought that was what it was all about -- providing a measure of low level control that is BY DESIGN hidden by the API but is there, should one every try to code for it, in the actual kernel/device interface (e.g. 
regulated by ioctls). Note that at this low (device driver) level I would expect the kernel to handle at least some asynchronous low-level buffering and the primary interrupt processing for the physical device layer FOR the MPI implementation or any other program that uses the device -- you cannot safely escape this. This does not mean that you cannot control just where you stop using the kernel and rely on your own intermediary layers for handing the device above the level of raw read/write plus ioctls. That is the application layer or higher order kernel networking layers (depending on just where and how you access the device itself) may well manage buffers of their own, reliability, retransmission, RDMA, blocking/non-blocking of the interface, timeouts, and more. Low level networking is not easy, which is WHY people wrote PVM and the MPI network devices. So ultimately, all I was observing is that it is pretty straightforward (not necessarily easy, but straightforward) to write an application that very definitely and without any question goes and gets a chunk of data (say, contents of a struct) from an open socket on system A and puts it in a memory location (with appropriate pointers and sizeof and so forth for the struct), THEN gets a chunk of data from an open socket on system B and puts it in the next memory location, THEN gets a chunk of data from an open socket on system C and puts it ... In fact, since TCP generally does block on a read until there is data on the socket, it is relatively difficult to do it any other way in a simple loop over sockets -- you have to use select as noted above to avoid polling and non-blocking I/O, and in all cases one has to be pretty careful not to drop things and to handle cases where a stream runs slow or does other bad things. As I've learned the hard way. As far as elegance is concerned: a) That's a bit in the eye of the beholder. 
There are tradeoffs between simplicity of code and ease of development work vs performance and control but it is hard to say which is more "elegant". It's fairer to make the value-neutral statement that you have to work much harder to write a parallel application on top of raw sockets (no question at all;-) but have all the control and optimization potential available to userspace (at the ioctl level above the kernel and device driver itself) if you do so. To cite a metaphorical situation, is coding in assembler, whether one is coding a complete application or an embedded optimized operation, "inelegant"? Perhaps, but that's not the word I would have used. There are times when assembler is very elegant, in the sense that it directly encodes an algorithm with the greatest degree of optimization and control where a compiler might well generate indifferent code or fail to use all the features of the hardware. Once upon a time many many years ago I hand coded e.g. trig functions and arithmetic for the 8087 in assembler because my compiler generated 8088 code that ran about ten times more slowly. Elegant or inelegant? Compare to just this situation -- if for some reason you require e.g. absolute control over the order of utilization of your network links in a parallel computation (perhaps to avoid collisions or device/line contention, to do something exotic with transmission order and pattern on a hypercubical network) you may well find that MPI or PVM simply do not provide that degree of control, period. They try to "do the right thing" for a generic class of problem and simple assumptions as to the kind and number of interfaces and routes between nodes and load patterns along those routes, BUT there is no >>guarantee<< that the right thing they end up with (often chosen for robustness and optimization for the most common classes of problems) will be right for YOUR problem and network and no way to tweak it if it is not.
In that case, using raw network devices (whatever they might be) might well be the only way to achieve the requisite level of control and yes, might be worth a factor of 10 in speed. I'll bet money that if you polled the list, you'd find that there exist people who have gone in and hacked MPI at the source level to "break" it (de-optimize it for the most common applications so they run worse) or who have run over time several versions of MPI including "new and improved" ones, who have found empirically that there are applications for which the hacked/older "disimproved" versions perform better. b) Anyway, this explains why I mentioned raw sockets at the end. Note well the "Depending on how..." Maybe I read the original message incorrectly, but I thought that the issue was that (for reasons unknown) he wanted to guarantee collection in the order A then B then C... Why he wanted to do this wasn't clear, nor was it clear whether (in any given cycle) it would be ok to do A then C then B then... (and just not overlap the next A). If the strict interleaving wasn't an issue, then I would have thought just putting a barrier at the end of an ABC...Z cycle would have forced all communications to complete before starting the next cycle. So IF this really IS a critical requirement -- he has to read from A, complete the read blocking no fooling, move on and read from B, etc, no data parallelism or asynchronicity permitted that might violate this strict order (or if he's interleaving communications on four different network devices along different routes to different sub-clusters of nodes), then doing it within MPI might or might not be efficient. Raw TCP sockets (or lower level hardware-dependent I/O) would be a PITA to code, but you can pretty much guarantee that the resulting code is as efficient as possible, given the requirement, and it might be the ONLY way to accomplish a complicated interleave of node I/O for some very specific set of reasons.
If you do the considerable work required to make it so, of course, copy of the complete works of Stevens in hand...;-) > Joachim had some good options for MPI. I agree. I don't even disagree with what you say above -- I understand what you mean. I just think that we need more data before concluding that those options were enough. He described his design goal but not his motivation. For some design goals there are probably lots of good ways to do it in MPI. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Thu Mar 3 10:40:07 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: <42273D20.2000702@mcs.anl.gov> References: <1109659454.6544.2.camel@localhost.localdomain> <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> <42273D20.2000702@mcs.anl.gov> Message-ID: On Thu, 3 Mar 2005, Rob Ross wrote: OK, having re-reread everything, I conclude that you were completely right after all. I misunderstood what his question was. I'm still not certain that I understand, but if Bill has answered it, it definitely isn't what I thought. So double-whomp. I'll go sleep now. rgb > > > William Gropp wrote: > > At 12:44 AM 3/1/2005, Michael Gauckler wrote: > > > >> Dear List, > >> > >> I would like to gather the data from several processes. > >> Instead of the commonly used stride, I want to interleave > >> the data: > >> > >> Rank 0: AAAAA -> ABCDABCDABCDABCDABCD > >> Rank 1: BBBBB ----^---^---^---^---^ > >> Rank 2: CCCCC -----^---^---^---^---^ > >> Rank 3: DDDDD ------^---^---^---^---^ > >> > >> Since the stride of the receive type is indicated > >> in multiples of its mpi_type, no interleaving is > >> possible (the smallest striping factor leads to > >> AAAAABBBBBCCCCCDDDDD). 
> >> > >> Is there a way to achieve this behaviour in an > >> elegant way, as MPI_Gather promises it? Or do > >> I need to do Send/Recv with self-aligned offsets? > > > > > > You should be able to do this with MPI_Gather by creating a new datatype > > on the receiving process whose extent is the size of a single item; that > > will get you the correct offset for the first element. In order to > > receive the subsequent elements into the desired location, you need to > > use a vector type containing the number of elements. And for this to be > > fast, you need an MPI implementation that will handle the "resized" > > datatype efficiently (use MPI_Type_vector to create the full datatype > > and MPI_Type_create_resized to change its effective extent). If you are > > moving large amounts of data, separate send/recvs are probably a better > > choice. > > > > Bill > > Nice! > > Rob > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From lindahl at pathscale.com Thu Mar 3 10:55:16 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: References: <20050303020600.GA56437@piskorski.com> Message-ID: <20050303185516.GB1453@greglaptop.internal.keyresearch.com> On Thu, Mar 03, 2005 at 08:58:33AM -0600, Don Holmgren wrote: > I was just quoted quantity 2 PCI-X HCA's (4x 2 port Mellanox) by a > reseller for $655 each. Last Fall we purchased a large quantity of > PCI-E HCA's for considerably less than that unit price. 
> > Supposedly the "memory free" PCI-E HCA's that use host memory, rather > than on-board sram, should move prices towards $100 sometime this > year - we'll see (landed on motherboard pricing at ~ $70, see > http://www.mellanox.com/news/press/pr_030105.html) You're mixing retail price with wholesale price. The $69 price is apparently quantity 10,000, for just the chip, and the card that is inexpensive will be PCIe 4X, which hurts performance. I hear that today's street price for Mellanox-based cards is ~$600 in cluster-sized quantities, which matches what you report. > Like other high performance network gear, it's tough to get > accurate pricing information without going out and getting quotes. One nice thing about Myricom is that their prices have always been on the web -- all you need to know in addition is your discount. -- greg From rross at mcs.anl.gov Thu Mar 3 11:17:50 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: References: <1109659454.6544.2.camel@localhost.localdomain> <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> <42273D20.2000702@mcs.anl.gov> Message-ID: <422762DE.1090209@mcs.anl.gov> Sleep is good :). Robert G. Brown wrote: > On Thu, 3 Mar 2005, Rob Ross wrote: > > OK, having re-reread everything, I conclude that you were completely > right after all. I misunderstood what his question was. I'm still not > certain that I understand, but if Bill has answered it, it definitely > isn't what I thought. > > So double-whomp. I'll go sleep now. > > rgb > > >> >>William Gropp wrote: >> >>>At 12:44 AM 3/1/2005, Michael Gauckler wrote: >>> >>> >>>>Dear List, >>>> >>>>I would like to gather the data from several processes. 
>>>>Instead of the commonly used stride, I want to interleave >>>>the data: >>>> >>>>Rank 0: AAAAA -> ABCDABCDABCDABCDABCD >>>>Rank 1: BBBBB ----^---^---^---^---^ >>>>Rank 2: CCCCC -----^---^---^---^---^ >>>>Rank 3: DDDDD ------^---^---^---^---^ >>>> >>>>Since the stride of the receive type is indicated >>>>in multiples of its mpi_type, no interleaving is >>>>possible (the smallest striping factor leads to >>>>AAAAABBBBBCCCCCDDDDD). >>>> >>>>Is there a way to achieve this behaviour in an >>>>elegant way, as MPI_Gather promises it? Or do >>>>I need to do Send/Recv with self-aligned offsets? >>> >>> >>>You should be able to do this with MPI_Gather by creating a new datatype >>>on the receiving process whose extent is the size of a single item; that >>>will get you the correct offset for the first element. In order to >>>receive the subsequent elements into the desired location, you need to >>>use a vector type containing the number of elements. And for this to be >>>fast, you need an MPI implementation that will handle the "resized" >>>datatype efficiently (use MPI_Type_vector to create the full datatype >>>and MPI_Type_create_resized to change its effective extent). If you are >>>moving large amounts of data, separate send/recvs are probably a better >>>choice. >>> >>>Bill >> >>Nice! >> >>Rob >>_______________________________________________ >>Beowulf mailing list, Beowulf@beowulf.org >>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > > From rross at mcs.anl.gov Thu Mar 3 12:50:24 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? 
In-Reply-To: <42273F9B.4040506@ccrl-nece.de> References: <1109659454.6544.2.camel@localhost.localdomain> <6.2.1.2.2.20050303081639.098a3c10@pop.mcs.anl.gov> <42273F9B.4040506@ccrl-nece.de> Message-ID: <42277890.5070509@mcs.anl.gov> Joachim Worringen wrote: > William Gropp wrote: > >> You should be able to do this with MPI_Gather by creating a new >> datatype on the receiving process whose extent is the size of a single >> item; that will get you the correct offset for the first element. In >> order to receive the subsequent elements into the desired location, >> you need to use a vector type containing the number of elements. And >> for this to be fast, you need an MPI implementation that will handle >> the "resized" datatype efficiently (use MPI_Type_vector to create the >> full datatype and MPI_Type_create_resized to change its effective >> extent). If you are moving large amounts of data, separate send/recvs >> are probably a better choice. > > Oh yes, I forgot, twiddling with LB and UB. I never liked this, esp. as > an MPI implementor. Not especially 'elegant', but it should work. Good > conformance test, BTW. It needs to have a negative extent to really test things. The positive extents are easy! Rob From hahn at physics.mcmaster.ca Thu Mar 3 17:24:06 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: Message-ID: > Supposedly the "memory free" PCI-E HCA's that use host memory, rather > than on-board sram, should move prices towards $100 sometime this > year - we'll see (landed on motherboard pricing at ~ $70, see > http://www.mellanox.com/news/press/pr_030105.html) the IB world (still consisting only of Mellanox chips, right?) has done a good job pushing down adapter prices. can anyone comment on trends in switch pricing? thanks, mark hahn. 
From john.hearns at streamline-computing.com Fri Mar 4 01:33:18 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: <20050303163359.GJ4482@washoe.rutgers.edu> References: <20050303163359.GJ4482@washoe.rutgers.edu> Message-ID: <1109928798.5537.22.camel@Vigor45> On Thu, 2005-03-03 at 11:33 -0500, Yaroslav Halchenko wrote: > Dear Debianized beowulfers or beowulfiezed Debian users, > > Does anyone have experience with > http://www.irb.hr/en/cir/projects/dcc/ > which recently was released? > > Project Goals > > We expect to integrate some existing technologies (like LDAP, System > Installation Suite, Seems strange that they haven't chosen FAI (Fully Automated Installer). As an aside, there was a poster about FAI up outside the cluster developer's room at FOSDEM last weekend. From rgb at phy.duke.edu Fri Mar 4 05:03:00 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: <1109928798.5537.22.camel@Vigor45> References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: On Fri, 4 Mar 2005, John Hearns wrote: > On Thu, 2005-03-03 at 11:33 -0500, Yaroslav Halchenko wrote: > > Dear Debianized beowulfers or beowulfiezed Debian users, > > > > Does anyone have experience with > > http://www.irb.hr/en/cir/projects/dcc/ > > which recently was released? > > > > Project Goals > > > > We expect to integrate some existing technologies (like LDAP, System > > Installation Suite, > > Seems strange that they haven't chosen FAI (Fully Automated Installer). > As an aside, there was a poster about FAI up outside the cluster > developer's room at FOSDEM last weekend. Is FAI being loved by somebody at this point? There was a time a few years ago where it seemed to be lying fallow (although as always I could be mistaken about that). 
Toolsets like that usually need a fairly active and energetic human to care for them, if not several... rgb > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From roger at ERC.MsState.Edu Fri Mar 4 06:10:28 2005 From: roger at ERC.MsState.Edu (Roger L. Smith) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: References: Message-ID: On Thu, 3 Mar 2005, Mark Hahn wrote: > the IB world (still consting only of Mellanox chips, right?) > has done a good job pushing down adapter prices. > > can anyone comment on trends in switch pricing? I know at least one vendor has a 24 port model using the newer IB chipset for around $8,000. _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_ | Roger L. Smith Phone: 662-325-3625 | | Sr. Systems Administrator FAX: 662-325-7692 | | roger@ERC.MsState.Edu http://WWW.ERC.MsState.Edu/~roger | | Mississippi State University | |____________________________________ERC__________________________________| From gmpc at sanger.ac.uk Fri Mar 4 06:40:01 2005 From: gmpc at sanger.ac.uk (Guy Coates) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: > > Is FAI being loved by somebody at this point? There was a time a few > years ago where it seemed to be lying fallow (although as always I could > be mistaken about that). Toolsets like that usually need a fairly > active and energetic human to care for them, if not several... It is still alive. 
We're just in the process of rolling out a new cluster with it at this very moment. Works fine. Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK Tel: +44 (0)1223 834244 ex 7199 From laytonjb at charter.net Fri Mar 4 07:43:27 2005 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: References: Message-ID: <4228821F.90607@charter.net> Roger L. Smith wrote: >On Thu, 3 Mar 2005, Mark Hahn wrote: > > > >>the IB world (still consting only of Mellanox chips, right?) >>has done a good job pushing down adapter prices. >> >>can anyone comment on trends in switch pricing? >> >> > >I know at least one vendor has a 24 port model using the newer IB chipset >for around $8,000. > > > I just finished an interconnect survey article for Doug and ClusterWorld Magazine. As part of the article I have a nice table with list prices for 8 nodes and 128 nodes for the various interconnects. It should be out in the May issue so be sure to look for it. However, to match what Roger said, one IB vendor gave me a list price for 8-ports of IB for under $8,000. Jeff From landman at scalableinformatics.com Fri Mar 4 08:04:37 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: <4228821F.90607@charter.net> References: <4228821F.90607@charter.net> Message-ID: <42288715.10403@scalableinformatics.com> 8 ports under 8k or 24 ports under 8k? Jeffrey B. Layton wrote: > However, to match what Roger said, one IB vendor gave me > a list price for 8-ports of IB for under $8,000. 
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From laytonjb at charter.net Fri Mar 4 08:16:03 2005 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: <42288715.10403@scalableinformatics.com> References: <4228821F.90607@charter.net> <42288715.10403@scalableinformatics.com> Message-ID: <422889C3.5080902@charter.net> 8 ports under 8k, but it was a 24 port switch :) This includes all of the HCA's, switches (only one), cables, and software. Jeff > 8 ports under 8k or 24 ports under 8k? > > Jeffrey B. Layton wrote: > >> However, to match what Roger said, one IB vendor gave me >> a list price for 8-ports of IB for under $8,000. > From roger at ERC.MsState.Edu Fri Mar 4 08:34:32 2005 From: roger at ERC.MsState.Edu (Roger L. Smith) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: <422889C3.5080902@charter.net> References: <4228821F.90607@charter.net> <42288715.10403@scalableinformatics.com> <422889C3.5080902@charter.net> Message-ID: The price I stated was for a 24 port switch for around $8,000 list. As a matter of fact, I just confirmed this with the vendor. This does not include cables or HCAs. On Fri, 4 Mar 2005, Jeffrey B. Layton wrote: > 8 ports under 8k, but it was a 24 port switch :) > This includes all of the HCA's, switches (only one), > cables, and software. > > Jeff > > > 8 ports under 8k or 24 ports under 8k? > > > > Jeffrey B. Layton wrote: > > > >> However, to match what Roger said, one IB vendor gave me > >> a list price for 8-ports of IB for under $8,000. > > > _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_ | Roger L. Smith Phone: 662-325-3625 | | Sr. 
Systems Administrator FAX: 662-325-7692 | | roger@ERC.MsState.Edu http://WWW.ERC.MsState.Edu/~roger | | Mississippi State University | |____________________________________ERC__________________________________| From eugen at leitl.org Fri Mar 4 08:49:17 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] Re: OS X in a Classified Environment... (fwd from kstaats@terrasoftsolutions.com) Message-ID: <20050304164917.GR13336@leitl.org> ----- Forwarded message from Kai Staats ----- From: Kai Staats Date: Fri, 4 Mar 2005 09:22:42 -0700 To: scitech@lists.apple.com Subject: Re: OS X in a Classified Environment... Organization: Terra Soft Solutions, Inc. User-Agent: KMail/1.7 Reply-To: kstaats@terrasoftsolutions.com Bryan, [snip] > Army contractor for aerodynamics work. I would be interested to find > out what happened to the Navy sonar cluster compute project that used > G4 servers running Linux... The original 272 G4 Xserves implemented in 2003 continue to be in use on-board the subs (from what I understand). In addition, the TI04 project (this past summer) invoked the use of G5 Xserves running our 64-bit Linux OS, Y-HPC. If PowerPC continues to be used in the sonar imaging environment, Linux will continue to be the preferred OS due to its flexibility and ease of code migration to/from non-PowerPC systems that remain a part of the on-board imaging systems. More info here: http://www.terrasoftsolutions.com/realworld/showcase/dod/ ... with several other DoE/DoD customers: http://www.terrasoftsolutions.com/products/y-hpc/customers.shtml kai _______________________________________________ Do not post admin requests to the list. They will be ignored. 
Scitech mailing list (Scitech@lists.apple.com) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/scitech/eugen%40leitl.org This email sent to eugen@leitl.org ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050304/8e9e67dc/attachment.bin From canon at nersc.gov Thu Mar 3 10:28:02 2005 From: canon at nersc.gov (Shane Canon) Date: Wed Nov 25 01:03:52 2009 Subject: [Beowulf] RAID storage: Vendor vs. parts In-Reply-To: <421C2DA4.8090608@psc.edu> References: <421C2DA4.8090608@psc.edu> Message-ID: <42275732.2020306@nersc.gov> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 This was my experience as well. Most of the tools worked out of the box, but the partitioning was a real hang-up. The easiest way around this is to have a separate boot drive that the installer can partition in its normal manner and then to have the >2TB device be completely separate and configured after first boot. You then use the whole drive (no partitions). This worked for me. - --Shane Paul Nowoczynski wrote: | Alvin Oga wrote: | |> hi ya steve |> |> On Tue, 22 Feb 2005, Steve Cousins wrote: |> |> |> |>> That's what I'm shooting for. Anybody have good luck with volumes |>> greater |>> than 2 TB with Linux? I think LSI SCSI cards are needed (?) and the 2.6 |>> Kernel is needed with CONFIG_LBD=y. Any hints or notes about doing this |>> would be greatly appreciated. Google has not been much of a friend on |>> this unfortunately. I'm guessing I'd run into NFS limits too. |>> |> |> |> for files/volumes over 2TB ... 
it's a question of libs, apps and |> kernel everything has to work ... which is not always the case |> |> |> | We've got this working at PSC without too much pain.. even with scsi | block devices >2TB. The LBD is needed but it | doesn't solve all the problems with large disks, especially if you have | a single volume which is larger than | 2TB. The issue we ran into was that many disk related apps like mdadm | and [s]fdisk don't support | the BLKGETSIZE64 ioctl. So even though your kernel is using 64 bits, | some needed apps are not. There are also issues with disklabels for | devices >2TB. The normal dos-style disklabel used by linux | doesn't support them so you'll need a kernel patch for the "plaintext" | partition table made by Andries Brouwer. | If you're interested in running this on 2.6 I can give you the patch. | As far as cards go I think the adaptec u320 cards | are better. I've seen less scsi timeout weirdness with them (this could | be related to our disks). Performance wise | the lsi and adaptec are about the same.. we see ~400MB/sec when using | both channels - even with a sub pci-x bus. For a couple hundred bucks a | card this is really good news. | --paul | |> i don't play much with 2.6 kernels other than on suse-9.x boxes |> |> |> |>> Also, am I being overly cautious about having a spare RAID controller on |>> hand? How frequent do RAID controllers go bad compared to disks, power |>> supplies and fan modules? I'd guess that it would be very infrequent. |>> |> |> |> it's always better to have spare parts ... ( part of my requirement ) if |> they expect the systems to be available 24x7 ... |> - more importantly, how long can they wait, when silly inexpensive |> things die, before it gets replaced |> |> - dead fans is $2.oo - $15 each to keep the disks cool |> |> - power supply is $50 range ... 
but if one bought n+1 powersupply |> than its supposed to not be an issue anymore, but you will need to |> have its replacement handy |> |> - raid controllers should NOT die, nor cpu, mem, mb, nic, etc |> and it's not cheap to have these items floating around as spare |> parts |> |> - ethernet cables will go funky if random people have access |> to the patch panels ... ( keep the fingers away ) |> |> - ups will go bonkers too |> |> - what failure mode can one protect against and what will happen |> if "it" dies |> - best protection against downtime for users is to have an |> warm-swap server which is updated a hourly or daily ... ( my |> preference - 2nd identical or bigger-disk capacity system ) |> |> |> |>> Looking back at my own experience I think I've had to return one out |>> of 15 |>> in the last eight years, and that was bad as soon as I bought it. |>> |> |> |> seems too high of a return rate ?? 1 out of 15 ?? |> |> |> |>> If this is too off-topic let me know and I'll move it elsewhere. 
|>> |> |> |> ditto here |> 24x7x365 uptime compute environment is fun/frustrating stuff on tight |> budgets |> |> c ya |> alvin |> |> _______________________________________________ |> Beowulf mailing list, Beowulf@beowulf.org |> To change your subscription (digest mode or unsubscribe) visit |> http://www.beowulf.org/mailman/listinfo/beowulf |> |> | | _______________________________________________ | Beowulf mailing list, Beowulf@beowulf.org | To change your subscription (digest mode or unsubscribe) visit | http://www.beowulf.org/mailman/listinfo/beowulf -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFCJ1cxZd/2zrI5CioRAnYnAJ98qtd17aPK62aCw4UNt79klUdasQCglLyo kNXL0h7KGaQfFmla33Gxfn4= =osvt -----END PGP SIGNATURE----- From djholm at fnal.gov Thu Mar 3 14:31:56 2005 From: djholm at fnal.gov (Don Holmgren) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: <20050303185516.GB1453@greglaptop.internal.keyresearch.com> References: <20050303020600.GA56437@piskorski.com> <20050303185516.GB1453@greglaptop.internal.keyresearch.com> Message-ID: On Thu, 3 Mar 2005, Greg Lindahl wrote: > On Thu, Mar 03, 2005 at 08:58:33AM -0600, Don Holmgren wrote: > > > I was just quoted quantity 2 PCI-X HCA's (4x 2 port Mellanox) by a > > reseller for $655 each. Last Fall we purchased a large quantity of > > PCI-E HCA's for considerably less than that unit price. > > > > Supposedly the "memory free" PCI-E HCA's that use host memory, rather > > than on-board sram, should move prices towards $100 sometime this > > year - we'll see (landed on motherboard pricing at ~ $70, see > > http://www.mellanox.com/news/press/pr_030105.html) > > You're mixing retail price with wholesale price. The $69 price is > apparently quantity 10,000, for just the chip, and the card that > is inexpensive will be PCIe 4X, which hurts performance. 
Yes, it will be interesting to see what the motherboard vendors charge for an IB option. $69 (assuming they hit the 10K volume) + the price of the I/O connector + engineering cost + margin. I'm hoping that it will be < $150, but I may be too optimistic. Interesting comment about PCIe 4X hurting performance, thanks! The current PCI-E cards have two ports and use an 8X slot, but I'd guess that most cluster applications use only a single port. What's the performance penalty for using a single 4X port HCA in a 4X PCI-E slot compared with using a single port on a dual port card in an 8X slot? I believe the MemFree cards also incur a few tenths of a microsecond latency hit because of the need to access host memory, at least according to the preliminary benchmarks shown at SC'04. > > I hear that today's street price for Mellanox-based cards is ~$600 in > cluster-sized quantities, which matches what you report. We did a little better than that - for quantity 260, we paid < $450 for PCI-E HCA's. A couple of other bids were around $500. > > > Like other high performance network gear, it's tough to get > > accurate pricing information without going out and getting quotes. > > One nice thing about Myricom is that their prices have always been on > the web -- all you need to know in addition is your discount. > > -- greg Agreed. Among the many other nice things about Myricom. Don From djholm at fnal.gov Fri Mar 4 10:43:59 2005 From: djholm at fnal.gov (Don Holmgren) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: References: <4228821F.90607@charter.net> <42288715.10403@scalableinformatics.com> <422889C3.5080902@charter.net> Message-ID: I've made two purchases in the last 12 months of 24-port switches. Two switches last April came in at ~ $4000 each. 16 switches last Sept came in at ~ $3300 each. These were two different brands of switches, both based on the Mellanox Infiniscale III (24 port crossbar) silicon. 
Clearly YMMV on pricing. Don Holmgren On Fri, 4 Mar 2005, Roger L. Smith wrote: > > > The price I stated was for a 24 port switch for around $8,000 list. As a > matter of fact, I just confirmed this with the vendor. > > This does not include cables or HCAs. > > On Fri, 4 Mar 2005, Jeffrey B. Layton wrote: > > > 8 ports under 8k, but it was a 24 port switch :) > > This includes all of the HCA's, switches (only one), > > cables, and software. > > > > Jeff > > > > > 8 ports under 8k or 24 ports under 8k? > > > > > > Jeffrey B. Layton wrote: > > > > > >> However, to match what Roger said, one IB vendor gave me > > >> a list price for 8-ports of IB for under $8,000. > > > > > > > > _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_ > | Roger L. Smith Phone: 662-325-3625 | > | Sr. Systems Administrator FAX: 662-325-7692 | > | roger@ERC.MsState.Edu http://WWW.ERC.MsState.Edu/~roger | > | Mississippi State University | > |____________________________________ERC__________________________________| > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From jrajiv at hclinsys.com Thu Mar 3 22:51:09 2005 From: jrajiv at hclinsys.com (Rajiv) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] HA OSCAR for loadbalancing and failover Message-ID: <006d01c52086$8c50a8d0$0f120897@PMORND> Dear Sir, I am carrying out load balancing using OSCAR-3.0. We are also carrying out failover using the HA-OSCAR 1.0 beta release (High Availability OSCAR). We are required to achieve load balancing and failover for the following services: 1. HTTP. 2. FTP. 3. TELNET. 4. DHCP. 5. SQUID. 
Our setup is as follows: 1 primary server, 1 client node (using OSCAR-3.0), 1 standby server (using HA OSCAR). We have succeeded in building the cluster but are having problems with load balancing. We are trying to achieve load balancing using the PBS server (Portable Batch System), which comes built in with OSCAR-3.0. We are queuing the services as jobs and trying to distribute these jobs between the server and client node. But the problem we are facing is that we are not able to submit the job to the PBS server. Sir, firstly, we would like you to confirm whether we are on the right track for achieving load balancing. We would also like to know how you all have achieved load balancing. Regards, Rajiv From maurice at harddata.com Thu Mar 3 23:56:46 2005 From: maurice at harddata.com (Maurice Hilarius) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Re: Beowulf Digest, Vol 13, Issue 4 In-Reply-To: <200503031629.j23GThuk022748@bluewest.scyld.com> References: <200503031629.j23GThuk022748@bluewest.scyld.com> Message-ID: <422814BE.1070203@harddata.com> Andrew Piskorski wrote: >From: Andrew Piskorski >Subject: Re: [Beowulf] 2.6.11 is out; with InfiBand support > > >Except, a single PCI-X Infiniband card currently costs $1000 or so, >right? (That's for a 4x 2 port card, but Froogle does not seem to >know of any cheaper cards.) > > New "name brand" (sorry, it's under NDA) "Memory Free" cards will be selling for under $400 for PCI Express 4X, and under $600 for PCI Express 8X. Availability Q2 for production quantities. Still a lot more expensive than Myrinet, and Myri have their own surprises to reveal in that time frame as well. It's a great time for cluster computing: Opteron Rev E nForce4 dual core Opterons S-ATA2 SAS Economical 10Gb interconnects. Wow. With our best regards, Maurice W. Hilarius Telephone: 01-780-456-9771 Hard Data Ltd. 
FAX: 01-780-456-9772 11060 - 166 Avenue email:maurice@harddata.com Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 From lists at subnetz.org Fri Mar 4 05:43:27 2005 From: lists at subnetz.org (Tilman Koschnick) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: <1109943807.2081.51.camel@mother.subnetz.org> On Fri, 2005-03-04 at 08:03 -0500, Robert G. Brown wrote: > Is FAI being loved by somebody at this point? There was a time a few > years ago where it seemed to be lying fallow (although as always I could > be mistaken about that). Toolsets like that usually need a fairly > active and energetic human to care for them, if not several... I think it is. The latest (fairly long) changelog entry - version 2.6.6 - dates from 21 Jan 2005. I went to a talk by the FAI maintainer a couple of months ago, and he didn't give the impression he was going to abandon it any time soon. There was talk about porting FAI to Redhat/rpm, but I don't know what the state of this is. Cheers, Til From list-beowulf at onerussian.com Fri Mar 4 06:22:08 2005 From: list-beowulf at onerussian.com (Yaroslav Halchenko) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: <20050304142208.GC32176@washoe.rutgers.edu> On Fri, Mar 04, 2005 at 08:03:00AM -0500, Robert G. Brown wrote: > Is FAI being loved by somebody at this point? There was a time a few > years ago where it seemed to be lying fallow (although as always I could > be mistaken about that). Toolsets like that usually need a fairly > active and energetic human to care for them, if not several... I like FAI and although it is just a set of scripts, it seems to be stable. 
I used it for the first time 1.5 years ago to install the first 10 nodes of the cluster. I had to tweak it to make it work, but when we got 15 more nodes 5 months ago, my old FAI configuration required just a few adjustments to do its job. DCC seems to use an "image-based installation" model, as opposed to FAI's flexible cloned installation. For uniform clusters, image-based installation is probably better than FAI, which admits different classes of configuration and is thus more flexible. -- .-. =------------------------------ /v\ ----------------------------= Keep in touch // \\ (yoh@|www.)onerussian.com Yaroslav Halchenko /( )\ ICQ#: 60653192 Linux User ^^-^^ [175555] -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20050304/2795d8f4/attachment.bin From agshew at gmail.com Fri Mar 4 07:51:06 2005 From: agshew at gmail.com (Andrew Shewmaker) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: <1109928798.5537.22.camel@Vigor45> References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: On Fri, 04 Mar 2005 09:33:18 +0000, John Hearns wrote: > On Thu, 2005-03-03 at 11:33 -0500, Yaroslav Halchenko wrote: > > Dear Debianized beowulfers or beowulfiezed Debian users, > > > > Does anyone have experience with > > http://www.irb.hr/en/cir/projects/dcc/ > > which was recently released? > > > > Project Goals > > > > We expect to integrate some existing technologies (like LDAP, System > > Installation Suite, > > Seems strange that they haven't chosen FAI (Fully Automated Installer). > As an aside, there was a poster about FAI up outside the cluster > developer's room at FOSDEM last weekend. 
If you think it is strange that they appear to have chosen System Installation Suite over FAI because you believe SIS is focused on RPM distros (I was under that impression at one time), then you should know that SIS is primarily developed on Debian. -- Andrew Shewmaker From egan at sense.net Fri Mar 4 08:41:30 2005 From: egan at sense.net (Egan Ford) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Windows Server 2003 Compute Cluster Edition In-Reply-To: <422889C3.5080902@charter.net> Message-ID: <002201c520d9$04f6c380$e2054109@oberon> Unless given away, or offering better price/performance or a killer app, I estimate the number of Windows HPC clusters to be very small. I'd like to say zero, but I have customers today doing HPC on Windows. The apps are not available for other platforms. http://news.com.com/Windows+for+supercomputers+likely+out+by+fall/2100-1012_3-5598603.html?part=rss&tag=5598782&subj=news From tallpaul at speakeasy.org Fri Mar 4 12:43:12 2005 From: tallpaul at speakeasy.org (Paul English) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: On Fri, 4 Mar 2005, Robert G. Brown wrote: > On Fri, 4 Mar 2005, John Hearns wrote: > > > On Thu, 2005-03-03 at 11:33 -0500, Yaroslav Halchenko wrote: > > > Dear Debianized beowulfers or beowulfiezed Debian users, > > > > > > Does anyone have experience with > > > http://www.irb.hr/en/cir/projects/dcc/ > > > which was recently released? > > > > > > Project Goals > > > > > > We expect to integrate some existing technologies (like LDAP, System > > > Installation Suite, > > > > Seems strange that they haven't chosen FAI (Fully Automated Installer). > > As an aside, there was a poster about FAI up outside the cluster > > developer's room at FOSDEM last weekend. > > Is FAI being loved by somebody at this point? 
There was a time a few > years ago where it seemed to be lying fallow (although as always I could > be mistaken about that). Toolsets like that usually need a fairly > active and energetic human to care for them, if not several... The list is alive, and has posts, etc. I did not have a great deal of luck getting help with my questions, and the process is (if anything) more raw than Kickstart. I gave it a good try for several months because many of our machines are debian, but in the end I gave up. For clustering purposes, ROCKS has been much more useful - quick, useful responses on the mailing list, and a lot more of the lower-level crud is hidden with simpler utilities. For general network installs (workstations, general purpose servers), Kickstart was far easier to use and find answers for than FAI, although it could use some of the abstraction that ROCKS has. In the "modern" era of PXE, on "most networks" adding new machines with specific configurations could be done with a single command or GUI. We're not there yet. :-) Paul From maillists at gauckler.ch Fri Mar 4 14:30:26 2005 From: maillists at gauckler.ch (Michael Gauckler) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] MPI programming question: Interleaved MPI_Gatherv? In-Reply-To: <1109659454.6544.2.camel@localhost.localdomain> References: <1109659454.6544.2.camel@localhost.localdomain> Message-ID: <1109975426.5116.16.camel@localhost.localdomain> Dear List, thank you for all your replies concerning my question about interleaved gathers. (Interleaving was meant in terms of memory layout, not time of arrival of the messages.) Yes, there is a solution to this problem: change the lower and upper bounds of the datatype with the help of MPI_Type_create_resized. Through the lam-mpi mailing list I got a reply from Josh, which I would like to share with you because it even includes the source of a demo application (see below). Thank you very much! 
Yours, Michael ___ From: Josh Hursey Date: Tue, 1 Mar 2005 09:50:43 -0500 (15:50 CET) Yes, this can be achieved in an elegant way with MPI_Gather, but you need to adjust the receive datatype. You will need to create a new MPI_Datatype that will stride as you need it to. The trick is to shift the lower and upper bounds on this new strided datatype so it will interleave values. Something like:

/* Create a datatype to receive into. */
MPI_Type_vector( NUM_LOCAL_ELE, /* # of blocks */
                 1,             /* # of datatypes in a block (one for this array) */
                 gsize,         /* Stride between successive blocks */
                 MPI_CHAR,      /* Type of each block */
                 &old_type);
MPI_Type_commit( &old_type);

/* Resize the type to allow interleaving,
 * so make it only one MPI_CHAR wide */
MPI_Type_create_resized(old_type,
                        0,      /* Lower bound */
                        1,      /* New extent: one MPI_CHAR */
                        &new_type);
MPI_Type_commit( &new_type);

Then use the new_type as the receive type argument to the MPI_Gather function. I attached a sample code that does exactly this, and produces the following output:

$ mpirun -np 4 gather_interleave
Rank 0  A A A A A A A A A A A A
Rank 1  B B B B B B B B B B B B
Rank 2  C C C C C C C C C C C C
Rank 3  D D D D D D D D D D D D
Final:
        A B C D A B C D A B C D
        A B C D A B C D A B C D
        A B C D A B C D A B C D
        A B C D A B C D A B C D

Hope this helps. Josh

--------------------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

#define NUM_LOCAL_ELE 12

int main(int argc, char *argv[]){
    int rank, gsize, i, j;
    char local_array[NUM_LOCAL_ELE];
    char *collected_array = NULL;
    MPI_Datatype new_type, old_type;

    /* Initialize */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &gsize);

    /* Create a datatype to receive into. */
    MPI_Type_vector( NUM_LOCAL_ELE, /* # of blocks */
                     1,             /* # of datatypes in a block (one for this array) */
                     gsize,         /* Stride between successive blocks */
                     MPI_CHAR,      /* Type of each block */
                     &old_type);
    MPI_Type_commit( &old_type);

    /* Resize the type to allow interleaving,
     * so make it only one MPI_CHAR wide */
    MPI_Type_create_resized(old_type,
                            0,      /* Lower bound */
                            1,      /* New extent: one MPI_CHAR */
                            &new_type);
    MPI_Type_commit( &new_type);

    /* Initialize local array with characters:
     * Rank 0 = A A A...
     * Rank 1 = B B B...
     * Rank 2 = C C C...
     * ... */
    for(i = 0; i < NUM_LOCAL_ELE; ++i ) {
        local_array[i] = 'A' + rank;
    }

    /* Print out local array */
    sleep(rank * 1);
    printf("Rank %d", rank);
    for(i = 0; i < NUM_LOCAL_ELE; ++i) {
        printf("\t%c", local_array[i]);
    }
    printf("\n");

    if(rank == 0)
        collected_array = (char *)malloc(gsize * NUM_LOCAL_ELE * sizeof(char));

    MPI_Gather( local_array, NUM_LOCAL_ELE, MPI_CHAR,
                collected_array, 1, new_type,
                0, MPI_COMM_WORLD);

    /* Print out gathered array */
    if(rank == 0) {
        printf("Final:\n");
        for(i = 0; i < gsize; ++i) {
            for(j = 0; j < NUM_LOCAL_ELE; ++j) {
                printf("\t%c", collected_array[i*NUM_LOCAL_ELE+j]);
            }
            printf("\n");
        }
    }

    if (rank == 0) free(collected_array);
    MPI_Finalize();
    return 0;
}

On Tuesday, 01.03.2005, 07:44 +0100, Michael Gauckler wrote: > Dear List, > > I would like to gather the data from several processes. > Instead of the commonly used stride, I want to interleave > the data: > > Rank 0: AAAAA -> ABCDABCDABCDABCDABCD > Rank 1: BBBBB ----^---^---^---^---^ > Rank 2: CCCCC -----^---^---^---^---^ > Rank 3: DDDDD ------^---^---^---^---^ > > Since the stride of the receive type is indicated > in multiples of its mpi_type, no interleaving is > possible (the smallest striping factor leads to > AAAAABBBBBBCCCCCDDDDD). > > Is there a way to achieve this behaviour in an > elegant way, as MPI_Gather promises it? Or do > I need to do Send/Recv with self-aligned offsets? > > Thank you for your help! 
> > Michael From taj at www.linux.org.uk Thu Mar 3 14:39:28 2005 From: taj at www.linux.org.uk (Trent Jarvi) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] http://www.beowulf.org/overview/history.html Message-ID: Just a heads up. This page appears to be corrupted. While not all Beowulf clusters are supercomputers, one can build a Beowulf that is powerful enough to attract the interest of supercomputer users. Beyond the seasoned parallel programmer, Beowulf clusters have been built and used by programmers with little or no parallel programming experience. Beowulf clusters provide universities, often with limited resources, an excellent platform to teach parallel programming cNvq0ZhTgBrP kOLZWGuE0+ZiqlFOd2ml5US6LXQ/8jfnOSP4wydRdXTBOTOpewexZw1KyyFaZYgXTx5zQTNf 5QFWN4fE0H3CCkPYVhNTdPWIDurIhwMLdwxbCTM6fcG3+JA+1TpQX+s5ZlYw5+bvDqkre+1Y [...] -- Trent Jarvi taj@www.linux.org.uk From mwill at penguincomputing.com Fri Mar 4 15:52:18 2005 From: mwill at penguincomputing.com (Michael Will) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] HA OSCAR for loadbalancing and failover In-Reply-To: <006d01c52086$8c50a8d0$0f120897@PMORND> References: <006d01c52086$8c50a8d0$0f120897@PMORND> Message-ID: <200503041552.18491.mwill@penguincomputing.com> Even though it is an interesting idea to use a beowulf cluster for this, in particular when using several nodes to do loadbalancing with and automatic deployment of services, I think it is the wrong tool for the task you have set for yourself. Your requirements would probably be more easily fulfilled with a simple HA failover cluster (no oscar involved). See http://www.ultramonkey.org for details. Especially when you only have two servers, one as primary and one as standby, which is a classical active/passive config, then there is no reason to have the complexity of a beowulf style cluster. 
ultramonkey.org also mentions LVS, which helps with load balancing, and I believe they even have a solution for session synchronisation, which means that even when a failover of the loadbalancer occurs, a tcp/ip session does not die but gets redirected. You will not need any PBS then, but rather a package called 'heartbeat' that defines the services to be failed over in its own config files. Michael On Thursday 03 March 2005 10:51 pm, Rajiv wrote: > Dear Sir, > I am carrying out load balancing using OSCAR-3.0. We are also carrying out > failover using HA-OSCAR 1.0 beta release (High Availability OSCAR). We are > required to achieve load balancing > and failover for the following services: > 1. HTTP. > 2. FTP. > 3. TELNET. > 4. DHCP. > 5. SQUID. > Our setup is as follows: > 1 Primary server , 1 client node (using OSCAR-3.0) > 1 standby server (using HA OSCAR) > > We have succeeded in building the cluster but are > having problems regarding load balancing. We are trying to achieve > load balancing using the PBS server (Portable Batch System) which comes > built into OSCAR-3.0. We are queuing the services as jobs and trying to > distribute these jobs between the server and client node. But the problem > we are facing is that we are not able to submit the job to the PBS server. > Sir, firstly, we would like you to confirm if we are going on the right > track for achieving load balancing. We would like to know how you have > achieved load balancing? > > Regards, > Rajiv > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Michael Will, Linux Sales Engineer Tel: 415-954-2822 Toll Free: 888-PENGUIN Fax: 415-954-2899 www.penguincomputing.com Visit us at FOSE 2005! 
Washington Convention Center, Washington, DC April 5th-7th, 2005 Linux Pavilion, Booth 2225 From eugen at leitl.org Sat Mar 5 00:00:01 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] HA OSCAR for loadbalancing and failover In-Reply-To: <200503041552.18491.mwill@penguincomputing.com> References: <006d01c52086$8c50a8d0$0f120897@PMORND> <200503041552.18491.mwill@penguincomputing.com> Message-ID: <20050305075952.GH13336@leitl.org> On Fri, Mar 04, 2005 at 03:52:18PM -0800, Michael Will wrote: > Especially when you only have two servers, one as primary and one as standby, > which is a classical active/passive config, then there is no reason to have the > complexity of a beowulf style cluster. http://www.linux-ha.org/ (1.99?) supports up to 8 and beyond, but it needs some testing. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050305/e194bdd9/attachment.bin From gmpc at sanger.ac.uk Sat Mar 5 02:45:27 2005 From: gmpc at sanger.ac.uk (Guy Coates) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: > In the "modern" era of PXE, on "most networks" adding new machines > with specific configurations could be done with a single command or GUI. > We're not there yet. :-) The only thing that ever came close was RLX's Control Tower management software. It did "one click" management and provisioning of machines. 
On the blade systems it could even do "zero click configuration", as you could set a policy like "Any machines I put into slots 1-10 should automatically get configuration Y put on them." It was generic enough so that it could provision any operating system you could think of, and if it didn't do something you wanted it to, it was also easy to dig under the covers and hack the code. The only downside was the price tag, which was extortionate. Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK Tel: +44 (0)1223 834244 ex 7199 From kus at free.net Sat Mar 5 04:59:19 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: Message-ID: In message from Mark Hahn (Wed, 2 Mar 2005 18:09:09 -0500 (EST)): >> Arima and Iwill have mobos with IB LOM (Landed on Motherboard). > >given the choice between a $150 pcie IB nic and having it onboard, >I'd choose the separate card. I know the IB salesdroids always >say that getting onto the MB will change everything, but this >doesn't make sense. IB is completely different from onboard gigabit, >for instance, because there is no ubiquitous IB infrastructure >ready, waiting to be exploited. > >the problem with "if you build it onboard, they will come" is also >the marginal cost. onboard gigabit is nearly the same cost as >onboard 100bT, very low, and you pretty much always want it. >onboard IB is noticeably higher than onboard GBE, noticeable in >absolute terms, and you definitely have no possible use for it >on many systems. > >remember, most people don't even saturate GBE yet, Yes, I agree. But we are developing a quantum-chemical application whose speedup under parallelization is bandwidth-limited. And we find that the speedup on 6 processors with an IB 4x interconnect is about 34% higher than for Myrinet. 
Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow > and GBE >ports are damned cheap. GBE nics are free, and switch ports >are now down to $US 23/port: > >http://froogle.google.com/froogle?q=netgear+GS748T&btnG=Search+Froogle > >fundamentally, IB is still facing most of the same problems it always >has: > >- requires fairly expensive, unique infrastructure >- not the greatest physical layer: it's easy to wind up with > literally tons of IB cables. >- not clearly superior in performance vs alternatives. >- apparently designed by people who disliked existing technique > or were ignorant of it. >- not a drop-in replacement for alternatives. > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Sun Mar 6 06:39:06 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] http://www.beowulf.org/overview/history.html In-Reply-To: References: Message-ID: On Thu, 3 Mar 2005, Trent Jarvi wrote: > > Just a heads up. > > This page appears to be corrupted. > > While not all Beowulf clusters are supercomputers, one can build a Beowulf > that is powerful enough to attract the interest of supercomputer users. > Beyond the seasoned parallel programmer, Beowulf clusters have been built > and used by programmers with little or no parallel programming experience. > Beowulf clusters provide universities, often with limited resources, an > excellent platform to teach parallel programming cNvq0ZhTgBrP > kOLZWGuE0+ZiqlFOd2ml5US6LXQ/8jfnOSP4wydRdXTBOTOpewexZw1KyyFaZYgXTx5zQTNf > 5QFWN4fE0H3CCkPYVhNTdPWIDurIhwMLdwxbCTM6fcG3+JA+1TpQX+s5ZlYw5+bvDqkre+1Y > > [...] 
Corrupted and out of date, too:-) Nobody who looks at the top500 list (whatever my opinions about its basis;-) would nowadays say that one can "build a Beowulf that is powerful enough to attract the interest of supercomputer users". It's getting to be much more of a "seasoned parallel programmers (a.k.a. `old guys') can remember a time when parallel programming was carried out on `supercomputers', basically a name for a cluster with proprietary internal processor interconnects". Linux hasn't finished taking over the world, although it continues to make excellent progress with all sorts of economic and historical forces driving it. "Beowulfs" in the generic sense of COTS clusters with network interconnects for IPCs, pretty much have taken over the supercomputing world with only a few exceptions, and even those exceptions are relying less and less on anything like a custom communications bus. Not even the engineering of the dedicated systems scales, while using a "COTS" communication platform such as Myri or Dolphinics, or IB or even gigE lets you leverage all sorts of useful work done by other humans devoted to this one purpose or this purpose among others. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sun Mar 6 07:04:17 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] DCC (debian cluster components) In-Reply-To: References: <20050303163359.GJ4482@washoe.rutgers.edu> <1109928798.5537.22.camel@Vigor45> Message-ID: On Sat, 5 Mar 2005, Guy Coates wrote: > > In the "modern" era of PXE, on "most networks" adding new machines > > with specific configurations could be done with a single command or GUI. > > We're not there yet. :-) > > The only thing that ever came close was RLX's Control Tower management > software. 
It did "one click" management and provisioning of machines. > On the blade systems it could even do "zero click configuration", as you > could set policy like > > "Any machines I put into slots 1-10 should automatically get configuration > Y put on them." > > It was generic enough so that it could provision any operating system you > could think of, and if it didn't do something you wanted it to, it was > also easy to dig under the covers and hack the code. > > The only downside was the price tag, which was extortionate. > > Guy > > I agree that there is some work involved in building a PXE installable configuration for e.g. kickstart, but it isn't excessive. Take template kickstart file(s). Edit it to select package set(s) for same(different) node config(s). Put it(them) on server. Edit dhcpd.conf to point to the bootloader in /tftpboot. Edit boot.msg, pxelinux.cfg/default to point to (list of) node type(s) and point to the associated kickstart file(s) and boot option(s), respectively. Boot. There IS a GUI for building the kickstart file (under RH and FC, at least), although I suspect that most people would use this at most to build the template and then tune it up by hand -- it is actually easier to edit a flatfile with an editor once you see the layout. dhcpd doesn't have a GUI to front it AFAIK, so this remains a place one could do some work. It would be useful to build one that tests the URL paths to e.g. the kickstart files and the tftpboot paths to the initrd images. It would be even lovelier to have the same interface edit boot.msg and pxelinux.cfg/default at the same time so that all three could be consistent. This is the one place I find myself hopping from directory to directory to make matching changes, as I create an image for "fc3 workstation" or "fc3 node" for testing purposes and need to ensure that the matching kickstart file is in the right place and correctly corresponds to the dhcpd entry. 
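[For readers following the steps above, the pxelinux.cfg/default file for two node types might look something like the sketch below. The labels, kernel/initrd filenames, and kickstart URLs are made up for illustration; they are not from the original post.]

```
# /tftpboot/pxelinux.cfg/default -- sketch only; names are hypothetical
default fc3-node
prompt 1
timeout 100

label fc3-node
    kernel vmlinuz-fc3
    append initrd=initrd-fc3.img ks=http://server.example.org/ks/fc3-node.cfg

label fc3-workstation
    kernel vmlinuz-fc3
    append initrd=initrd-fc3.img ks=http://server.example.org/ks/fc3-workstation.cfg
```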
This is even more true if one wants to create an image indexed per IP number (the "right" way, arguably, for tftpboot to function) so that everything becomes totally automagic. I am always of two minds about high-level front ends to low-level admin commands. Yes, they are convenient and let newbies start to work with a low buy-in as far as the learning curve is concerned. However, they also SHIELD a newbie from learning what they really need to know to be an effective manager, and (in my own experience anyway) they rarely work stably as the various tools upon which they are built evolve. For one thing, the GUI designer almost always omits features and capabilities of the lower-level stuff, so eventually you want to do something you "know" can be done but the GUI doesn't. For another, somebody changes something really subtle, such as a default pathway in /tftpboot, and your GUI "breaks" and you have no idea why and won't until you learn, in a panic, all the things the GUI shielded you from. I tend to think GUIs work best when they are a standard part of a single administrative package co-developed by the package's maintainers. A GUI that spans multiple tools and functions simply has more issues. Hence redhat-config-kickstart has a chance to remain useful, but building its functionality into a sweeping redhat-config-pxe (including kickstart, dhcpd, tftp) is a bit dicier. STILL POSSIBLE, mind you -- it just needs somebody to love it and maintain it. That's why I asked about FAI -- if somebody doesn't actively maintain the GUIs, the scripts, the documentation, as the underlying toolset slowly changes, eventually the friendly front end fails to encompass all sorts of desirable functionality or starts to break, sometimes, for some users, trying to do some things. In FC or RH, there is no "supertool", but the individual tools work well and aren't THAT hard to learn -- most of them have dedicated HOWTOs. 
Some supertools exist in ROCKS and warewulf and so on, and have (at the moment) devoted maintainers who love their product. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Mon Mar 7 10:01:20 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] looking for a reference on failure rates Message-ID: <422C96F0.4090406@scalableinformatics.com> Hi folks: I am looking for a reference which describes failure rates of modern computer components as a function of temperature. The usual rule of thumb is that every 10 degrees above a certain value doubles the failure rate (or decreases lifetime). I would like to look at this analysis and refer to it for something I am working on. Thanks Joe From James.P.Lux at jpl.nasa.gov Mon Mar 7 12:00:25 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] looking for a reference on failure rates In-Reply-To: <422C96F0.4090406@scalableinformatics.com> References: <422C96F0.4090406@scalableinformatics.com> Message-ID: <6.1.1.1.2.20050307115932.041dcf70@mail.jpl.nasa.gov> At 10:01 AM 3/7/2005, Joe Landman wrote: >Hi folks: > > I am looking for a reference which describes failure rates of modern > computer components as a function of temperature. The usual rule of > thumb is that every 10 degrees above a certain value doubles the failure > rate (or decreases lifetime). I would like to look at this analysis and > refer to it for something I am working on. > > Thanks > >Joe How rigorous a reference? Or a general description of failure rates vs temp, for, e.g. microprocessors. James Lux, P.E. 
Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From landman at scalableinformatics.com Mon Mar 7 12:03:46 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] looking for a reference on failure rates In-Reply-To: <6.1.1.1.2.20050307115932.041dcf70@mail.jpl.nasa.gov> References: <422C96F0.4090406@scalableinformatics.com> <6.1.1.1.2.20050307115932.041dcf70@mail.jpl.nasa.gov> Message-ID: <422CB3A2.1020307@scalableinformatics.com> Hi Jim: Something I can refer to for primary literature for a paper. If it is anecdotal, that may be fine as well, though I will have to treat it differently. This is largely for microprocessors, disks, networks, etc. General digital equipment, with a focus on computers in clusters. Thanks! Joe Jim Lux wrote: > At 10:01 AM 3/7/2005, Joe Landman wrote: > >> Hi folks: >> >> I am looking for a reference which describes failure rates of modern >> computer components as a function of temperature. The usual rule of >> thumb is that every 10 degrees above a certain value doubles the >> failure rate (or decreases lifetime). I would like to look at this >> analysis and refer to it for something I am working on. >> >> Thanks >> >> Joe > > How rigorous a reference? Or a general description of failure rates vs > temp, for, e.g. microprocessors. > > > > James Lux, P.E. 
> Spacecraft Radio Frequency Subsystems Group > Flight Communications Systems Section > Jet Propulsion Laboratory, Mail Stop 161-213 > 4800 Oak Grove Drive > Pasadena CA 91109 > tel: (818)354-2075 > fax: (818)393-6875 > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From djholm at fnal.gov Mon Mar 7 11:59:56 2005 From: djholm at fnal.gov (Don Holmgren) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] looking for a reference on failure rates In-Reply-To: <422C96F0.4090406@scalableinformatics.com> References: <422C96F0.4090406@scalableinformatics.com> Message-ID: On Mon, 7 Mar 2005, Joe Landman wrote: > Hi folks: > > I am looking for a reference which describes failure rates of modern > computer components as a function of temperature. The usual rule of > thumb is that every 10 degrees above a certain value doubles the failure > rate (or decreases lifetime). I would like to look at this analysis and > refer to it for something I am working on. > > Thanks > > Joe > > Joe - Take a look at this Test and Measurement World article for starters: http://www.reed-electronics.com/tmworld/article/CA187523.html The rule of thumb that you mention comes from using an Arrhenius model to describe the relationship between temperature and failure rates. Arrhenius first published this equation (now named after him) in 1889 k(T) = A exp ( -Ea / RT) to explain the variation of reaction rates with temperature of several elementary chemical reactions. Here, k is the reaction rate, A is a constant, Ea is the activation energy for the reaction, R is the ideal gas constant, and T is the temperature in Kelvin. It turns out that many semiconductor degradation mechanisms - electromigration, corrosion, defect growth, etc. - fit this relationship well. 
Note that you'll usually see Boltzmann's constant (another 'k') instead of 'R' in the semiconductor reliability literature. Chemists use R and express Ea in units of kJoule/mole; physicists and engineers tend to use k and express Ea in electron volts. In the reliability literature, you'll often see the Arrhenius model written in terms of time to failure, which is proportional to the inverse of the reaction rate. At two different temperatures T1 and T2, the times to failure would be given by t1 = A exp (Ea / kT1) # k = Boltzmann's constant t2 = A exp (Ea / kT2) and so the ratio of lifetimes is given by t1/t2 = exp [ Ea/k * (1/T1 - 1/T2) ] If T1 is room temperature (~ 298K), an activation energy of about 0.54 eV would give a doubling in failure rate at a 10 degree C higher temperature. There's a handy chemist's page at http://antoine.frostburg.edu/chem/senese/101/kinetics/faq/temperature-and-reaction-rate.shtml that will let you plug in 3 of the 4 variables (T1, T2, Ea, reaction rate ratio) and it will give you the fourth. I've got a number of semiconductor reliability texts with tables of Ea versus failure mechanism - I can post the references if you request, though they're a bit dated (15 years old). Ea varies widely in these tables from about 0.3 eV to as high as 2.0 eV. There are even some negative Ea's, corresponding to failure mechanisms that decelerate with increasing temperature. The "factor of 2 with every 10 degrees" is only a very rough rule of thumb. Don Holmgren From landman at scalableinformatics.com Mon Mar 7 12:31:53 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] looking for a reference on failure rates In-Reply-To: References: <422C96F0.4090406@scalableinformatics.com> Message-ID: <422CBA39.60600@scalableinformatics.com> Hi Don: This is excellent. More detail than I need for this, but useful nonetheless. 
I am familiar with/have used Arrhenius models for rate prediction in the past, but did not make the connection to failure rates. The next question that this begs is the failure mode. Each failure mode will likely have a different set of activation energies. I'll go grab this info (have some old stuff around). Any similarity for disk drives and other components? Of course the question of what the activation energy would mean for a macroscopic failure (hard disks) is relevant ... Thanks! Joe Don Holmgren wrote: > > On Mon, 7 Mar 2005, Joe Landman wrote: > > >>Hi folks: >> >> I am looking for a reference which describes failure rates of modern >>computer components as a function of temperature. The usual rule of >>thumb is that every 10 degrees above a certain value doubles the failure >>rate (or decreases lifetime). I would like to look at this analysis and >>refer to it for something I am working on. >> >> Thanks >> >>Joe >> >> > > > Joe - > > Take a look at this Test and Measurement World article for starters: > > http://www.reed-electronics.com/tmworld/article/CA187523.html > > The rule of thumb that you mention comes from using an Arrhenius model > to describe the relationship between temperature and failure rates. > Arrhenius first published this equation (now named after him) in 1889 > > k(T) = A exp ( -Ea / RT) > > to explain the variation of reaction rates with temperature of several > elementary chemical reactions. Here, k is the reaction rate, A is a > constant, Ea is the activation energy for the reaction, R is the ideal > gas constant, and T is the temperature in Kelvin. It turns out that > many semiconductor degradation mechanisms - electromigration, corrosion, > defect growth, etc. - fit this relationship well. Note that you'll > usually see Boltzmann's constant (another 'k') instead of 'R' in the > semiconductor reliability literature. 
> Chemists use R and express Ea in
> units of kJoule/mole; physicists and engineers tend to use k and express
> Ea in electron volts.
>
> In the reliability literature, you'll often see the Arrhenius model
> written in terms of time to failure, which is proportional to the inverse
> of the reaction rate. At two different temperatures T1 and T2, the
> times to failure would be given by
>
>    t1 = A exp (Ea / kT1)    # k = Boltzmann's constant
>    t2 = A exp (Ea / kT2)
>
> and so the ratio of lifetimes is given by
>
>    t1/t2 = exp [ Ea/k * (1/T1 - 1/T2) ]
>
> If T1 is room temperature (~ 298K), an activation energy of about
> 0.54 eV would give a doubling in failure rate at a 10 degree C higher
> temperature.
>
> There's a handy chemist's page at
>
>    http://antoine.frostburg.edu/chem/senese/101/kinetics/faq/temperature-and-reaction-rate.shtml
>
> that will let you plug in 3 of the 4 variables (T1, T2, Ea, reaction
> rate ratio) and it will give you the fourth.
>
> I've got a number of semiconductor reliability texts with tables of Ea
> versus failure mechanism - I can post the references if you request,
> though they're a bit dated (15 years old). Ea varies widely in these
> tables from about 0.3 eV to as high as 2.0 eV. There are even some
> negative Ea's, corresponding to failure mechanisms that decelerate with
> increasing temperature. The "factor of 2 with every 10 degrees" is only
> a very rough rule of thumb.
>
> Don Holmgren

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615

From JSaxelby at vicr.com  Mon Mar  7 12:24:04 2005
From: JSaxelby at vicr.com (Saxelby, John)
Date: Wed Nov 25 01:03:53 2009
Subject: [Beowulf] The Arrhenius Equation and life
Message-ID: 

I am going out on a limb here, but from my work I have concluded that the claim that a 10 degree temperature rise halves life is bogus.
It is based on the Arrhenius Equation, which is for chemical reactions. It does not apply to semiconductors, for example, though it might apply to the oil in the bearings of a hard disk. This is a controversial subject, and the answer is not as simple as "10 C = 2X".

From James.P.Lux at jpl.nasa.gov  Mon Mar  7 15:26:17 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Wed Nov 25 01:03:53 2009
Subject: [Beowulf] looking for a reference on failure rates
In-Reply-To: <422CB3A2.1020307@scalableinformatics.com>
References: <422C96F0.4090406@scalableinformatics.com>
	<6.1.1.1.2.20050307115932.041dcf70@mail.jpl.nasa.gov>
	<422CB3A2.1020307@scalableinformatics.com>
Message-ID: <6.1.1.1.2.20050307150113.042f5398@mail.jpl.nasa.gov>

At 12:03 PM 3/7/2005, Joe Landman wrote:
>Hi Jim:
>
>  Something I can refer to for primary literature for a paper. If it is
> anecdotal, that may be fine as well, though I will have to treat it
> differently.
>
>  This is largely for microprocessors, disks, networks, etc. General
> digital equipment, with a focus on computers in clusters.
>
>  Thanks!
>
>J

One of our reliability guys recommended this:

   E.A. Amerasekera & F.N. Najm, "Failure Mechanisms in Semiconductor
   Devices", 2nd Ed., Wiley, NY, 1997

You might also take a look at MIL-HDBK-217F, which provides reliability math models for just about everything electronic. There might be some argument about the applicability of this in some instances, but it's certainly a commonly used document. Chapter 5 talks about microcircuits. Section 5.8 has the temperature factors (Ea in eV) for various logic families... CMOS looks like 0.35, BiCMOS and LSTTL are 0.5, Linears are 0.65. (Converting to life effects, it looks like a 20 C rise in temperature corresponds to twice the failure rate for CMOS, and a 20 C rise is a factor of 4.9 increase for Linears (10 deg = 2.3).) The actual assembly failure rate will depend on how many of each kind of part, what temperature they're at, etc.
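[Editor's note: Jim's handbook-derived factors follow from the same Arrhenius lifetime ratio Don posted earlier in the thread. A quick sketch reproducing them; the function and constant names are the editor's, not MIL-HDBK-217F's, and only the Ea values come from the message above.]

```python
import math

K_EV = 8.617e-5  # Boltzmann's constant in eV/K

def acceleration(ea_ev, t1_k, t2_k):
    """Failure-rate multiplier when moving from T1 to the hotter T2."""
    return math.exp(ea_ev / K_EV * (1.0 / t1_k - 1.0 / t2_k))

# Ea values quoted from MIL-HDBK-217F section 5.8; a 20 C rise from ~298 K
for family, ea in [("CMOS", 0.35), ("BiCMOS/LSTTL", 0.50), ("Linear", 0.65)]:
    print(f"{family}: {acceleration(ea, 298.0, 318.0):.1f}x per 20 C")
```

This reproduces Jim's numbers: roughly 2.4x per 20 C for CMOS, 4.9x per 20 C for Linears, and about 2.3x per 10 C for Linears.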
Something like a doubling per 10C is probably not too far from the overall effect. It's not going to be 10 times and it's not going to be 10% increase either. There's also a BellCore/Telcordia model which apparently takes into account burnin and testing. It might be more relevant, depending on the environment. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From jake at spiekerfamily.com Tue Mar 8 16:49:14 2005 From: jake at spiekerfamily.com (Jake Thebault-Spieker) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] DCC information? Message-ID: <422E480A.4070007@spiekerfamily.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 hi all, I've noticed that within the past week there have been some mentions of DCC. Does anyone know of a HOWTO, or a writeup somewhere that I can learn how to use it? I tried installing it, and didn't understand half of it. But it is now installed on what will be my master node. Thoughts? - -- I think computer viruses should count as life. I think it says something about human nature that the only form of life we have created so far is purely destructive. We've created life in our own image. 
- --Stephen Hawking

010010100110000101101
011011001010010000001
010100011010000110010
101100010011000010111
010101101100011101000
010110101010011011100
000110100101100101011
010110110010101110010

Jake Thebault-Spieker
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCLkgJI2YvXV9Bxi0RAnEDAJsGnN/eyA22+o9TWDpkUWd3vjXuaQCeLKfH
EXdUrEQYjO9dlgTi5Knidbw=
=XOkc
-----END PGP SIGNATURE-----

From sadat.ali.khan at gmail.com  Wed Mar  9 07:36:33 2005
From: sadat.ali.khan at gmail.com (sadat khan)
Date: Wed Nov 25 01:03:53 2009
Subject: [Beowulf] queries
Message-ID: 

I would like to know from the esteemed members what MPI and PVM really are, and whether they have become outdated. Another question: what is rock?

From tim at linux-force.com  Wed Mar  9 08:46:23 2005
From: tim at linux-force.com (tim@linux-force.com)
Date: Wed Nov 25 01:03:53 2009
Subject: [Beowulf] StorCloud call for participation extended
Message-ID: 

All,

The deadline has been extended to participate in the StorCloud challenge. Please visit the URL below, but ignore the March 1st deadline.

http://www.vtksolutions.com/StorCloud/2005/applications.html

Regards,
Tim Wilcox

From john.hearns at streamline-computing.com  Wed Mar  9 13:10:46 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Wed Nov 25 01:03:53 2009
Subject: [Beowulf] queries
In-Reply-To: 
References: 
Message-ID: <1110402646.5643.127.camel@Vigor11>

On Wed, 2005-03-09 at 21:06 +0530, sadat khan wrote:
> I would like to know from the esteemed members what MPI and PVM really
> are, and whether they have become outdated.
> Another question: what is rock?

Sadat,

Rocks is a clustering distribution.
http://rocks.npaci.edu/Rocks/
It is a distribution of Linux which makes it easy to construct a Beowulf.

There is also Rock Linux, which I was interested in at one point. But I don't think this is what you are interested in.
http://www.rocklinux.org/

From jonas.palencia at abbott.com  Wed Mar  9 14:31:28 2005
From: jonas.palencia at abbott.com (Jonas M Palencia)
Date: Wed Nov 25 01:03:53 2009
Subject: [Beowulf] scyld beowulf beoboot-install utility
Message-ID: 

Hi All,

We are running Scyld 28cz-7 on our cluster. One of our nodes (Compute Node 0) in the cluster was replaced because of a bad motherboard, so the MAC address has changed. The hard disk wasn't changed, but some Linux was installed onto it for testing purposes. I'm trying to add this node back to the cluster. Using beosetup, the new MAC address was registered as node 0. I tried to partition the disk using the beofdisk tool, then I restarted the node:

----------------------------------
[root@abcmc02 fdisk]# beofdisk -w -n 0
Disk /dev/hda: 4865 cylinders, 255 heads, 63 sectors/track
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls   #blocks   Id  System
/dev/hda1   *     0+      0       1-     8001    89  Unknown
/dev/hda2         1     516     516   4144770    82  Linux swap
/dev/hda3       517    4864    4348  34925310    83  Linux
/dev/hda4         0       -       0         0     0  Empty
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End  #sectors  Id  System
/dev/hda1   *        63     16064     16002  89  Unknown
/dev/hda2         16065   8305604   8289540  82  Linux swap
/dev/hda3       8305605  78156224  69850620  83  Linux
/dev/hda4             0         -         0   0  Empty
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)

The partition table on node 0 has been modified.
You must reboot each affected node for changes to take effect.

[root@abcmc02 fdisk]# beoboot-install 0 /dev/hda
Creating boot images...
Installing beoboot on partition 1 of /dev/hda.
mke2fs 1.32 (09-Nov-2002)
/dev/hda1: 11/2000 files (0.0% non-contiguous), 268/8001 blocks
Done
rcp: /boot/boot.b: No such file or directory
Failed to copy boot.b to node 0:/tmp/.beoboot-install.mnt
--------------------------------

I guess the problem is in the beoboot-install utility: it didn't find the /boot/boot.b file. Indeed, that file cannot be found on the master node. Could this be a bug? After rebooting, the node came up with an ERROR state in the BeoSetup window. Here's the log:

----------------
node_up: Initializing cluster node 0 at Wed Mar 9 15:44:55 EST 2005.
node_up: Setting system clock from the master.
node_up: Configuring loopback interface.
node_up: Loading device support modules for kernel version 2.4.27-294r0048.Scyldsmp.
setup_fs: Configuring node filesystems using /etc/beowulf/fstab...
setup_fs: Checking /dev/hda2 (type=swap)...
chkswap: /dev/hda2: Unable to find swap-space signature
setup_fs: FSCK failure. (OK for RAM disks)
setup_fs: Mounting /dev/hda2 on swap (type=swap; options=defaults)
swapon: /dev/hda2: Invalid argument
setup_fs: Failed to mount /dev/hda2 on swap (fatal).
---------------

thanks,
Jonas

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20050309/bcade501/attachment.html

From ocschwar at MIT.EDU  Wed Mar  9 21:07:01 2005
From: ocschwar at MIT.EDU (Omri Schwarz)
Date: Wed Nov 25 01:03:53 2009
Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster.
Message-ID: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu>

The Ageia.com physics processing unit's marketing literature seems so oriented to gaming that I wonder whether they would open their API enough that people in our market niche could look at it, and whether it is suitable for putting in clusters. But if they did, what do y'all think? Would specialty (albeit commodity) coprocessors hanging off a PCI slot be suitable for your applications?
http://ageia.com While I'm bringing this up, how about things like the MAP processor? http://www.srccomp.com/HardwareElements.htm#MAPProcessor From wangsj at yahoo.com Thu Mar 10 06:36:29 2005 From: wangsj at yahoo.com (Shih-Jon Wang) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] can't download beoqueue Message-ID: <20050310143629.11741.qmail@web30207.mail.mud.yahoo.com> Hi. Thomas. I can't download beoqueue... from the following link. Could you please send me a copy of it? Many Thanks! SJ http://www.weswulf.org/beoqueue.tar.gz From landman at scalableinformatics.com Thu Mar 10 07:23:17 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. In-Reply-To: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu> References: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu> Message-ID: <42306665.8020305@scalableinformatics.com> Omri Schwarz wrote: > The Ageia.com physics processing unit's marketing literature seems > so oriented to gaming that I wonder if they would open their API > enough for people in our market niche could look at it and whether > it is suitable for putting in clusters. But if they did, what do y'all > think? Would specialty (albeit commodity) coprocessors hanging off a > PCI slot be suitable for your applications? Some applications would do very well with application specific processing systems. > > http://ageia.com > > > While I'm bringing this up, how about things like the MAP > processor? > > http://www.srccomp.com/HardwareElements.htm#MAPProcessor Or any others. Inverting the question, if you pay 4000$US per dual CPU compute node (+/- a bit depending upon technology, config, supplier), what price (if any) would you be willing to pay for an accelerator that offered you an order of magnitude more performance per node, on your code, and sat in the PCI-e/X or HTX slots? 
And also as important: how hard would you be willing to work/how much effort committed to program these things? This makes lots of assumptions, such as such a beast existing, your code being mapped or mappable to it, and you being interested in this. Part of what motivates this question are things like the Cray XD1 FPGA board, or PathScale's processors (unless I misunderstood their functions). Other folks have CPUs on a card of various sorts, ranging from FPGA to DSPs. I am basically wondering aloud what sort of demand for such technology might exist. I assume the answer starts with "if the price is right" ... the question is what is that price, what are the features/functionality, and how hard do people want to work on such bits. Note: As Jeff Layton pointed out many times, the GPUs in a number of machines are being used by at least one group for CFD, so you can think of these as a sort of dedicated attached processor. They are not general purpose, but highly specialized computational pipelines. If you could have a more general one, what would it look like, what would it do/emphasize, and how much would it cost? I know there is no one answer, but I thought it would be fun to extend Omri's question. Curious. > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From rgb at phy.duke.edu Thu Mar 10 09:19:11 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. 
In-Reply-To: <42306665.8020305@scalableinformatics.com> References: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu> <42306665.8020305@scalableinformatics.com> Message-ID: On Thu, 10 Mar 2005, Joe Landman wrote: > Inverting the question, if you pay 4000$US per dual CPU compute node > (+/- a bit depending upon technology, config, supplier), what price (if > any) would you be willing to pay for an accelerator that offered you an > order of magnitude more performance per node, on your code, and sat in > the PCI-e/X or HTX slots? And also as important: how hard would you be > willing to work/how much effort committed to program these things? This > makes lots of assumptions, such as such a beast existing, your code > being mapped or mappable to it, and you being interested in this. > > Part of what motivates this question are things like the Cray XD1 FPGA > board, or PathScale's processors (unless I misunderstood their > functions). Other folks have CPUs on a card of various sorts, ranging > from FPGA to DSPs. I am basically wondering aloud what sort of demand > for such technology might exist. I assume the answer starts with "if > the price is right" ... the question is what is that price, what are > the features/functionality, and how hard do people want to work on such > bits. > > Note: As Jeff Layton pointed out many times, the GPUs in a number of > machines are being used by at least one group for CFD, so you can think > of these as a sort of dedicated attached processor. They are not > general purpose, but highly specialized computational pipelines. If you > could have a more general one, what would it look like, what would it > do/emphasize, and how much would it cost? I know there is no one > answer, but I thought it would be fun to extend Omri's question. Problems with coprocessing solutions include: a) Cost -- sometimes they are expensive, although they >>can<< yield commensurate benefits for some code as you point out. 
b) Availability -- I don't just mean whether or not vendors can get them; I mean COTS vs non-COTS. They are frequently one-of-a-kind beasts with a single manufacturer.

c) Usability. They typically require "special tools" to use them at all: cross-compilers, special libraries, code instrumentation. All of these things require fairly major programming effort to implement in your code to realize the speedup, and tend to decrease the general-purpose portability of the result, tying you even more tightly (after investing all this effort) to the (probably one) manufacturer of the add-on.

c) Continued Availability -- They also not infrequently disappear without a trace (as "general purpose" coprocessors, not necessarily as ASICs) within a year or so of being released and marketed. This is because Moore's Law is brutal, and even if a co-processor DOES manage to speed up your actual application (and not just a core loop that comprises 70% of your actual application) by a factor of ten, that's at most four or five years of ML advances. If your code has a base of 30% or so that isn't sped up at all (fairly likely), then your application runs maybe 2-3 times as fast at best, and ML eats it in 1-3 years.

d) Support. Using the tools and processors effectively requires a fair bit of knowledge, but there is usually a pitifully small set of other implementers of the non-mainstream technology and no good communications channels between them (with some exceptions, of course). You're likely to be mostly on your own while trying to get the tools installed, code written and debugged, and eventually made efficient. If the tool or processor turns out to be "broken" for your purpose, you aren't likely to get much help with this, either, as you're a fringe market (again, with possible exceptions).

Each of these alters the naive cost-benefit estimate of "Gee, it is 10x faster in my core loop and only makes my system cost 2x as much".
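[Editor's note: the cost-benefit arithmetic rgb runs through here (a 10x speedup on the 70% of the code the coprocessor touches, leaving 0.37x of the original run time) is Amdahl's law. A minimal sketch; the function name is the editor's, not the thread's.]

```python
def amdahl_time(accelerated_fraction, speedup):
    """Remaining run time (as a fraction of the original) when only
    part of the code benefits from the coprocessor -- Amdahl's law."""
    return (1.0 - accelerated_fraction) + accelerated_fraction / speedup

# rgb's numbers: 10x on the 70% core loop leaves 0.37x of the original time
print(f"{amdahl_time(0.70, 10.0):.2f}")   # 0.37
# even an infinitely fast coprocessor cannot beat the serial 30%
print(f"{amdahl_time(0.70, 1e9):.2f}")    # 0.30
```

The second line is the point of rgb's comparison: 0.37x has to compete with the ~0.5x you would get by simply buying twice as many ordinary nodes.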
Maybe it is 10x faster in the core loop that is 70% of your code, so that now the application runs in 0.37x the original time (good, but now has to be compared to perhaps 0.5x the time available from getting 2x as many ordinary systems). Maybe it takes you four months to get the cross-compiler installed and all your code ported and to then TWEAK the code so it really DOES give you the touted 10x speedup for your core loops, which may have to be reblocked and written using special instructions, which then also necessitates revalidating the results (in case bugs have crept in during the port). Maybe the company that made the core DSP releases a new one in the meantime (they've got ML to contend with as well) and it has a different instruction set, so that a year from now when you want to expand the cluster you either re-instrument all the code again or rely on warehoused chips of the old variety. Maybe in 1 year dual core, 64 bit CPUs are released that effectively double, then double again, what you can get out of COTS systems at near constant cost and your 32 bit CPU plus coprocessor suddenly is slower, less portable, AND more expensive. Or not. Maybe it speeds things up 10x, costs only 2x, will be available for at least 3 more years, has a user base with hundreds of users and a dedicated mailing list, has commercial or open source compiler support that requires only minor tweaks or the use of standard library calls to get most of the benefit, and is built to a standard so that four companies make the actual chips, not just one. I'm just reviewing the questions one would like to ask. Anecdotally I'm reminded of e.g. the 8087, Micro Way's old transputer sets (advertised in PC mag for decades), the i860 (IIRC), the CM-5, and many other systems built over the years that tried to provide e.g. 
a vector co-processor in parallel with a regular general purpose CPU, sometimes on the same motherboard and bus, sometimes on daughterboards or even on little mini-network connections hung off the bus somehow. None of these really caught on (except for the 8087, and it is an exercise for the studio audience as to why an add-on processor that really should have been a part of the original processor itself, made by the mfr of the actual crippled CPU from the beginning, succeeded), although nearly all of them were used by at least a few intrepid individuals to great benefit. Allowing that Nature is efficient in its process of natural selection, this seems like a genetic/memetic variation that generally lacks the CBA advantages required to make it a real success. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Thu Mar 10 09:57:29 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. In-Reply-To: References: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu> <42306665.8020305@scalableinformatics.com> Message-ID: <42308A89.5060907@scalableinformatics.com> Robert G. Brown wrote: > On Thu, 10 Mar 2005, Joe Landman wrote: [...] > Problems with coprocessing solutions include: > > a) Cost -- sometimes they are expensive, although they >>can<< yield > commensurate benefits for some code as you point out. I am imagining a co-processor in the 1/10 x -> 4x range compared to node cost: A graphics card is the prototype in mind. > > b) Availability -- I don't just mean whether or not vendors can get > them; I mean COTS vs non-COTS. They are frequently on-of-a-kind beasts > with a single manufacturer. 
This is an issue with almost anything except CPUs, where we have 2 (or 3 if you include PPC) manufacturers. > c) Usability. They typically require "special tools" to use them at > all. Cross-compilers, special libraries, code instrumentation. TANSTAAFL. The idea is that the cost to do the "port" has to be low (to zero). > All of > these things require fairly major programming effort to implement in > your code to realize the speedup, and tend to decrease the > general-purpose portability of the result, tying you even more tightly > (after investing all this effort) with the (probably one) manufacturer > of the add-on. > > c) Continued Availability -- They also not infrequently disappear > without a trace (as "general purpose" coprocessors, not necessarily as > ASICs) within a year or so of being released and marketed. This is > because Moore's Law is brutal, and even if a co-processor DOES manage to > speed up your actual application (and not just a core loop that > comprises 70% of your actual application) by a factor of ten, that's at > most four or five years of ML advances. If your code has a base of 30% > or so that isn't sped up at all (fairly likely) then your application > runs maybe 2-3 times as fast at best and ML eats it in 1-3 years. And this is why the pricing issue is important. At what point does it make economic sense to buy a coprocessor? In the case of graphics cards, the coprocessor has amazing economies of scale (and it needs it). You need similar economies of scale for a coprocessor system, which is why I think the cost should be similar to the node cost (like existing graphics cards at 1/10 to 4x node cost). > > d) Support. Using the tools and processors effectively requires a > fair bit of knowledge, but there is usually a pitifully small set of > other implementers of the non-mainstream technology and no good > communications channels between them (with some exceptions, of course). Hmmm. 
OpenGL uses C/C++/Fortran bindings to get at the power (at least I think there is a way to call GL from fortran). What I was thinking was a high level (C/Fortran/C++) interface to them ala OpenGL. Jeff Layton if you are around, what is the name of that compiler set for the GPUs? Brook? Something like that. > You're likely to be mostly on your own while trying to get the tools > installed, code written and debugged, and eventually made efficient. If > the tool or processor turns out to be "broken" for your purpose, you > aren't likely to get much help with this, either, as you're a fringe > market (again, with possible exceptions). > > Each of these alter the naive cost-benefit estimate of "Gee it is 10x > faster in more core loop and only makes my system cost 2x as much". > > Maybe it is 10x faster in the core loop that is 70% of your code, so > that now the application runs in 0.37x the original time (good, but now > has to be compared to perhaps 0.5x the time available from getting 2x as > many ordinary systems). Maybe it takes you four months to get the > cross-compiler installed and all your code ported and to then TWEAK the > code so it really DOES give you the touted 10x speedup for your core > loops, which may have to be reblocked and written using special > instructions, which then also necessitates revalidating the results (in > case bugs have crept in during the port). Maybe the company that made > the core DSP releases a new one in the meantime (they've got ML to > contend with as well) and it has a different instruction set, so that a > year from now when you want to expand the cluster you either > re-instrument all the code again or rely on warehoused chips of the old > variety. Again, I point to OpenGL as a prototypical interface for this. The underlying driver may change, but the interface is effectively constant to the programmer, regardless of how many pixel shaders exist in the pipeline. 
> Maybe in 1 year dual core, 64 bit CPUs are released that > effectively double, then double again, what you can get out of COTS > systems at near constant cost and your 32 bit CPU plus coprocessor > suddenly is slower, less portable, AND more expensive. Well, this has happened in the GPU market, and the GPUs have tracked with ML. This is an issue for anyone committing to any computer of any sort. ML is ML and it is going to drop the value of what you purchase very quickly. > > Or not. Maybe it speeds things up 10x, costs only 2x, will be available > for at least 3 more years, has a user base with hundreds of users and a > dedicated mailing list, has commercial or open source compiler support > that requires only minor tweaks or the use of standard library calls to > get most of the benefit, and is built to a standard so that four > companies make the actual chips, not just one. And an interface layer that masks the chip differences, so that when chips are changed out, the programs need not change (like OpenGL), though they can to take advantage of great new feature X (an additional MAC layer in the pipeline). > > I'm just reviewing the questions one would like to ask. > > Anecdotally I'm reminded of e.g. the 8087, Micro Way's old transputer > sets (advertised in PC mag for decades), the i860 (IIRC), the CM-5, and > many other systems built over the years that tried to provide e.g. a > vector co-processor in parallel with a regular general purpose CPU, > sometimes on the same motherboard and bus, sometimes on daughterboards > or even on little mini-network connections hung off the bus somehow. 
>
> None of these really caught on (except for the 8087, and it is an
> exercise for the studio audience as to why an add-on processor that
> really should have been a part of the original processor itself, made by
> the mfr of the actual crippled CPU from the beginning, succeeded),
> although nearly all of them were used by at least a few intrepid
> individuals to great benefit. Allowing that Nature is efficient in its
> process of natural selection, this seems like a genetic/memetic
> variation that generally lacks the CBA advantages required to make it a
> real success.

So there is an expression that I like attributing to myself, but I may have "borrowed" it from elsewhere: something designed to fail often will. The "general purpose" accelerator cards (transputer, NS32032, ...) all suffered from a lack of application focus, among other things. There was the prevalent attitude of "if you build it, then they will buy". These units largely failed to take hold apart from tiny niches.

OTOH, "specialized" accelerator cards (graphics cards, RAID cards, sound cards) have been a smashing success, as the CBA makes sense, they deliver a specific value, and they are easy to use. The take-home message is that any accelerator card needs to do the same. What these accelerator cards do is offload work from the CPU. Not all of them will work as businesses, and this isn't a magical formula for success.

Moreover, the "specialized" GPUs seem to have applicability in CFD and other areas. This is interesting, as it opens a possibility for significant acceleration of some computations. The fundamental question is whether or not there will be wide adoption. I am not seeing wide adoption of the GPU as a CFD engine right now, but what if you had a "CFD engine" chip that cost about the same as the GPU, stuck it on a card, and had a high-level language interface to it, so you hand it your expensive routines to crank on?
The physics chip bit got me thinking along the molecular dynamics lines last night, specifically the non-bonded calculations. I am sure others could regale us with their computational burdens (and I would like to hear them myself at some point; it is quite instructive to hear what people are worrying about).

I think the physics chip in hardware is a neat idea, though I think you need a high-level interface to it, open standards, and lots of support to make it work. Moreover, it needs to be programmable: not because physics changes so often, but because the implied models may differ from what you want.

As I said, I am curious, and I think it is an interesting idea. If done right, with the wind at the right angles and good user/community support, I think it could work :)

>
> rgb
>

-- joe

From laytonjb at charter.net  Thu Mar 10 10:22:52 2005
From: laytonjb at charter.net (Jeffrey B. Layton)
Date: Wed Nov 25 01:03:53 2009
Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster.
In-Reply-To: <42308A89.5060907@scalableinformatics.com>
References: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu>
	<42306665.8020305@scalableinformatics.com>
	<42308A89.5060907@scalableinformatics.com>
Message-ID: <4230907C.6090007@charter.net>

Joe Landman wrote:

> Hmmm. OpenGL uses C/C++/Fortran bindings to get at the power (at
> least I think there is a way to call GL from fortran). What I was
> thinking was a high level (C/Fortran/C++) interface to them ala
> OpenGL. Jeff Layton if you are around, what is the name of that
> compiler set for the GPUs? Brook? Something like that.

The code is BrookGPU:

http://graphics.stanford.edu/projects/brookgpu/

This is derived from the Merrimac project at Stanford:

http://merrimac.stanford.edu/

There are also some other tools. For example, Sh:

http://libsh.org

I mentioned these in the ClusterWatch column in the March 2005 issue.
Jeff From mathog at mendel.bio.caltech.edu Thu Mar 10 11:13:18 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. Message-ID: "Omri Schwarz" wrote: > Would specialty (albeit commodity) coprocessors hanging off a > PCI slot be suitable for your applications? Well, a couple of points: 1. There may very well be no PCI slot available for such a coprocessor. Forget about it for a 1U, and for a 2U the available slots may already be used up by a small graphics card, myrinet, hardware monitor etc. An open slot may not be sufficient, there are almost certainly going to be space and cooling problems. (I can't count the number of times an empty slot had to be left between cards on PCs I've worked on.) 2. Assuming that you can get the card in there and not fry anything you're going to have to at the very least rewrite the code to make use of the specialized hardware. Unless this board comes with a very, very clever compiler to automagically detect the bits it can do best you're looking at some serious time and/or money spent on programming. 3. Where's the market to pay for this? Perhaps some sort of specialized rendering engine might be able to sell enough units to the folks who make CGI movies. Similarly, a specialized FFT engine might find a home in many places. The physics engine that started this thread might be of some use to, well, physicists. Or maybe not, I'm going to guess that the "physics" it implements may not work so well when scaled up to multi-lightyear distances or down to the point where quantum mechanics is important. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From James.P.Lux at jpl.nasa.gov Thu Mar 10 11:48:19 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. 
In-Reply-To: References: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu> <42306665.8020305@scalableinformatics.com> Message-ID: <6.1.1.1.2.20050310112134.02826ff8@mail.jpl.nasa.gov> At 09:19 AM 3/10/2005, Robert G. Brown wrote: >On Thu, 10 Mar 2005, Joe Landman wrote: > > > > > Part of what motivates this question are things like the Cray XD1 FPGA > > board, or PathScale's processors (unless I misunderstood their > > functions). Other folks have CPUs on a card of various sorts, ranging > > from FPGA to DSPs. I am basically wondering aloud what sort of demand > > for such technology might exist. I assume the answer starts with "if > > the price is right" ... the question is what is that price, what are > > the features/functionality, and how hard do people want to work on such > > bits. > >Problems with coprocessing solutions include: > > a) Cost -- sometimes they are expensive, although they >>can<< yield >commensurate benefits for some code as you point out. > > b) Availability -- I don't just mean whether or not vendors can get >them; I mean COTS vs non-COTS. They are frequently one-of-a-kind beasts >with a single manufacturer. Definitely an issue. > c) Usability. They typically require "special tools" to use them at >all. Cross-compilers, special libraries, code instrumentation. All of >these things require fairly major programming effort to implement in >your code to realize the speedup, and tend to decrease the >general-purpose portability of the result, tying you even more tightly >(after investing all this effort) with the (probably one) manufacturer >of the add-on. To a certain extent, though, this is being mitigated by things like Signal Processing Workbench or Matlab, which have "plug ins" to convert generic algorithm descriptions (i.e. Simulink models, etc.) into runnable code on the coprocessor or FPGA.
As far as product lock-in goes, "in theory" one could just recompile for a new target processor, although I don't know if anyone's ever done this. It does greatly reduce the "time and cost to demonstrate capability" > c) Continued Availability -- They also not infrequently disappear >without a trace (as "general purpose" coprocessors, not necessarily as >ASICs) within a year or so of being released and marketed. This is >because Moore's Law is brutal, and even if a co-processor DOES manage to >speed up your actual application (and not just a core loop that >comprises 70% of your actual application) by a factor of ten, that's at >most four or five years of ML advances. If your code has a base of 30% >or so that isn't sped up at all (fairly likely) then your application >runs maybe 2-3 times as fast at best and ML eats it in 1-3 years. There are specialized applications, lending themselves to clusters, for which this might not hold. If we look at Xilinx FPGAs, for instance, while not quite doubling every 18 months, they ARE dramatically increasing in speed and size fairly quickly. And, it's not hugely difficult to take a design that ran at speed X on size Y Xilinx FPGA and port it to speed A on Size B Xilinx FPGA. Consider a classic big crunching ASIC/FPGA application, that of running many correlators in parallel to demodulate very faint signals buried in noise (specifically, raw data coming back from deep space probes), or some applications in radio astronomy. In the latter case, particularly, there's a lot of interest in taking an array of radio telescopes and simultaneously forming many beams, so you can look lots of directions at once, to look for transient events that are "interesting" (like supernovae). The radio astronomy community is relatively poor (Paul Allen's interest notwithstanding), so they've got an incentive to use cheap commodity processing for their needs, but off the shelf PCs might not hack it. 
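The "2-3 times as fast at best" figure rgb quotes above is just Amdahl's law: if a fraction f of the runtime is accelerated by a factor s, the overall speedup is 1/((1-f) + f/s). A quick sketch (the 70%/10x numbers are rgb's hypothetical, not a measurement):

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the runtime
    is accelerated by a factor s (Amdahl's law)."""
    return 1.0 / ((1.0 - f) + f / s)

# rgb's scenario: a core loop that is 70% of the runtime, sped up 10x.
# The untouched 30% dominates: the overall gain is only ~2.7x, not 10x.
print(round(amdahl_speedup(0.70, 10.0), 2))  # -> 2.7
```

A couple of years of Moore's-law improvement in plain CPUs eats a 2.7x head start, which is exactly rgb's point about coprocessor longevity.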
They're looking at a lot of architectures that strongly resemble the usual cluster... data from all antennas streams into a raft of processors via ethernet, and each processor forms some subset of beams either in space or frequency. They might have a coprocessor card in the machine that does some of the early really intensive beamforming computation. Take a look at the Allen Telescope Array or at the Square Kilometer Array or at LOFAR. >Anecdotally I'm reminded of e.g. the 8087, Micro Way's old transputer >sets (advertised in PC mag for decades), the i860 (IIRC), the CM-5, and >many other systems built over the years that tried to provide e.g. a >vector co-processor in parallel with a regular general purpose CPU, >sometimes on the same motherboard and bus, sometimes on daughterboards >or even on little mini-network connections hung off the bus somehow. > >None of these really caught on (except for the 8087, and it is an >exercise for the studio audience as to why an add-on processor that >really should have been a part of the original processor itself, made by >the mfr of the actual crippled CPU from the beginning, succeeded), That's pretty easy. In the good old days, you had an integer CPU and an add on FPU in almost all architectures. The FPU didn't have instruction decoding, sequencing, or anything like that.. more like an extra ALU tied to the internal bus. Just like having memory management in a separate chip. Intel and Motorola both used this approach. Intel did start to integrate the MMU into the chip with "segment registers" on the 8086, except that it provided zip, zero, none, nada memory protection. This was part of a strategy to keep the codebase compatible with the 8080. After all, who in their right mind would write a program bigger than 64K.. the user application code would never look at the segment registers, which would be managed by a multitasking OS.
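The 8086 scheme Jim describes is easy to make concrete: a physical address is the 16-bit segment register shifted left four bits plus the 16-bit offset, giving a 20-bit (1 MB) space with no protection at all. A minimal sketch:

```python
def phys_addr(segment, offset):
    """8086 real-mode address translation: 16-bit segment << 4
    plus 16-bit offset -> 20-bit physical address, no protection."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)

# A single segment register only ever exposes a 64K window...
assert phys_addr(0x1000, 0xFFFF) - phys_addr(0x1000, 0x0000) == 0xFFFF
# ...and different segment:offset pairs alias the same physical byte:
assert phys_addr(0xB800, 0x0000) == phys_addr(0xB000, 0x8000) == 0xB8000
```

Nothing stops user code from loading any value it likes into a segment register, which is why there is "zip, zero, none, nada" protection.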
Think of it as integrated "bank switching", which was quite popular in the 8bit processor world (and itself, an outgrowth of how PDP-11 memory management worked). It wasn't until the 80286 that there started to be some more sophistication, and really, it was the 386 that made decent memory management possible. Moto started with a virtual memory scheme and paging, and so became the darling of software folks who had come to expect such things from the PDP-11, DEC-10, DG, and even mainframe world. In any case, NONE of them could have fit the FPU on the die and had decent yields. Besides, you're talking processors that cost $200-400 (in 1980s) and processors with integrated FPUs would have cost upwards of $1K-$1.5K (because of the lower yield). As fab technology advanced, you could either build bigger faster processors (in the separate CPU/FPU model) or you could build integrated processors at the same slow speed. Even today, I'd venture to guess that the vast majority of CPU cycles spent on PCs are integer mode computations (bitblts and the like to make windows work). It's not like you need FP to do Word or PowerPoint, or even Excel. It's rendered 3D graphics that really drives FP performance in the consumer market. This drives an interesting battle between the graphics ASIC makers (so that an add on card can do the rendering) and the CPU makers (who want to put it onboard, so that the total system cost is less), and, as well the support provided by MS Windows to use either one effectively. The game market clearly doesn't want to have to try and support ALL the possible graphics cards out there (it was a nightmare trying to write high performance graphics applications back in the late 80's, early 90s. The few skilled folks who were good at it earned their shekels.) >although nearly all of them were used by at least a few intrepid >individuals to great benefit.
Allowing that Nature is efficient in its >process of natural selection, this seems like a genetic/memetic >variation that generally lacks the CBA advantages required to make it a >real success. > > rgb James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From krystyx at acd.net Thu Mar 10 09:14:33 2005 From: krystyx at acd.net (Krys Kaya-Sar) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] 2.6.11 is out; with InfiniBand support In-Reply-To: Message-ID: YMMV? my apologies, very new here. -----Original Message----- From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Don Holmgren Sent: Friday, March 04, 2005 13:44 To: Roger L. Smith Cc: Joe Landman; Beowulf@beowulf.org; Mark Hahn Subject: Re: [Beowulf] 2.6.11 is out; with InfiniBand support I've made two purchases in the last 12 months of 24-port switches. Two switches last April came in at ~ $4000 each. 16 switches last Sept came in at ~ $3300 each. These were two different brands of switches, both based on the Mellanox Infiniscale III (24 port crossbar) silicon. Clearly YMMV on pricing. Don Holmgren On Fri, 4 Mar 2005, Roger L. Smith wrote: > > > The price I stated was for a 24 port switch for around $8,000 list. As a > matter of fact, I just confirmed this with the vendor. > > This does not include cables or HCAs. > > On Fri, 4 Mar 2005, Jeffrey B. Layton wrote: > > > 8 ports under 8k, but it was a 24 port switch :) > > This includes all of the HCA's, switches (only one), > > cables, and software. > > > > Jeff > > > > > 8 ports under 8k or 24 ports under 8k? > > > > > > Jeffrey B. Layton wrote: > > > > > >> However, to match what Roger said, one IB vendor gave me > > >> a list price for 8-ports of IB for under $8,000. > > > > > > > > _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_ > | Roger L.
Smith Phone: 662-325-3625 | > | Sr. Systems Administrator FAX: 662-325-7692 | > | roger@ERC.MsState.Edu http://WWW.ERC.MsState.Edu/~roger | > | Mississippi State University | > |____________________________________ERC__________________________________| > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From redboots at ufl.edu Thu Mar 10 12:00:23 2005 From: redboots at ufl.edu (Paul Johnson) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] hpl - large problems fail In-Reply-To: <1110402646.5643.127.camel@Vigor11> References: <1110402646.5643.127.camel@Vigor11> Message-ID: <4230A757.1020203@ufl.edu> All: I have a 4 node cluster(dont snicker :) ) and Im trying to do some benchmarking with HPL. I want to test 2 of the nodes with 1Gb of ram each. I calculated the maximum problem size that can fit in 2Gb and still allow for memory for the operating system. That came out to be around 14500x14500. When I run that size of a test it always fails. The largest problem that I can test and not have it fail on me is 12500x12500. What is the reason behind this? Im confused on what is going on here. Thanks for any help. Regards, Paul -- Paul Johnson Graduate Student - Mechanical Engineering University of Florida - Gainesville, Fl http://plaza.ufl.edu/redboots Reclaim Your Inbox! http://www.mozilla.org/products/thunderbird From ocschwar at MIT.EDU Thu Mar 10 08:13:50 2005 From: ocschwar at MIT.EDU (Omri Schwarz) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. 
In-Reply-To: <42306665.8020305@scalableinformatics.com> References: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu> <42306665.8020305@scalableinformatics.com> Message-ID: On Thu, 10 Mar 2005, Joe Landman wrote: > > http://ageia.com > > > > > > While I'm bringing this up, how about things like the MAP > > processor? > > > > http://www.srccomp.com/HardwareElements.htm#MAPProcessor > > Or any others. > > Inverting the question, if you pay 4000$US per dual CPU compute node > (+/- a bit depending upon technology, config, supplier), what price (if > any) would you be willing to pay for an accelerator that offered you an > order of magnitude more performance per node, on your code, and sat in > the PCI-e/X or HTX slots? And also as important: how hard would you be > willing to work/how much effort committed to program these things? This > makes lots of assumptions, such as such a beast existing, your code > being mapped or mappable to it, and you being interested in this. One might presume that if a piece of kit becomes known as attractive to our community there would be a port of BLAS, LAPACK, FFTW and so on written for it in very short order. From ocschwar at MIT.EDU Thu Mar 10 10:12:50 2005 From: ocschwar at MIT.EDU (Omri Schwarz) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. In-Reply-To: References: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu> <42306665.8020305@scalableinformatics.com> Message-ID: On Thu, 10 Mar 2005, Robert G. Brown wrote: > On Thu, 10 Mar 2005, Joe Landman wrote: > > Problems with coprocessing solutions include: > > a) Cost -- sometimes they are expensive, although they >>can<< yield > commensurate benefits for some code as you point out. > The COTS variants of coprocessor products are driven by market demand for making Lara Croft look more like Angelina Jolie. For us, that's good. It means the price might be right. 
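Joe's inverted question above has a crude upper bound: an accelerator that genuinely makes one node do the work of k nodes is worth at most the price of the k-1 nodes it replaces. A back-of-envelope sketch using Joe's hypothetical $4000 node (the numbers are his thought experiment, not real quotes):

```python
def breakeven_price(node_cost, speedup):
    """Most you could rationally pay for a per-node accelerator that
    multiplies one node's delivered throughput by `speedup`: the cost
    of the extra nodes it replaces.  Ignores porting effort, power,
    and rack space, so it is strictly an upper bound."""
    return node_cost * (speedup - 1)

# An honest 10x on a $4000 dual-CPU node is worth up to $36000...
print(breakeven_price(4000, 10))          # -> 36000
# ...but a 10x on only the hot loop, diluted to ~2.7x overall
# (rgb's Amdahl point), is worth far less:
print(round(breakeven_price(4000, 2.7)))  # -> 6800
```

And that ceiling still has to cover the programming effort Joe asks about, which is why the answer usually starts with "if the price is right".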
> b) Availability -- I don't just mean whether or not vendors can get > them; I mean COTS vs non-COTS. They are frequently one-of-a-kind beasts > with a single manufacturer. Alas, the same market interaction above is driving prices down but puts pressure on these products having proprietary interfaces. Electronic Arts, Inc (17 days without a fatal case of karoshi!) would certainly like it that way. > c) Usability. They typically require "special tools" to use them at > all. Cross-compilers, special libraries, code instrumentation. All of > these things require fairly major programming effort to implement in > your code to realize the speedup, and tend to decrease the > general-purpose portability of the result, tying you even more tightly > (after investing all this effort) with the (probably one) manufacturer > of the add-on. But! If you're not an overly bespoke Beowulf user, i.e. you use linear algebra packages, DSP, or so on and so forth, and you devote the time to implement a library on such hardware, you instantly have a community of people with similar needs, an ability to help you, and your vendor now has a market that might be worth paying attention to. SGI seems to think so as far as GPUs go. From agshew at gmail.com Thu Mar 10 11:04:19 2005 From: agshew at gmail.com (Andrew Shewmaker) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. In-Reply-To: <42308A89.5060907@scalableinformatics.com> References: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu> <42306665.8020305@scalableinformatics.com> <42308A89.5060907@scalableinformatics.com> Message-ID: On Thu, 10 Mar 2005 12:57:29 -0500, Joe Landman wrote: > I think the physics chip in hardware is a neat idea, though I think you > need a high level interface to it, open standards, and lots of support
Moreover, it needs to be programmable: not because > physics changes so often, but because the implied models may differ from > what you want. I was interested in whether they were supporting Linux with their SDK [1]. Here's what I found: Their SDK is unsurprisingly MS centric, but it is built on something called the Open Dynamics Framework [2]. They don't have a Linux/OpenGL port yet, but it looks like they have been designing it so they easily can, and they say they probably will in the future. They are using C++ and provide Lua bindings for rapid prototyping. Their PSCL (Physics Scripting Language) documentation references the ODE (Open Dynamics Engine) [3], but I'm not quite sure how they fit together other than they collaborated on PSCL. It looks like ODE does currently run on Linux. [1] http://www.ageia.com/novodex_downloads.html [2] http://physicstools.org/forum1/ [3] http://ode.org/ -- Andrew Shewmaker From rgb at phy.duke.edu Thu Mar 10 12:29:11 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. In-Reply-To: <42308A89.5060907@scalableinformatics.com> References: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu> <42306665.8020305@scalableinformatics.com> <42308A89.5060907@scalableinformatics.com> Message-ID: On Thu, 10 Mar 2005, Joe Landman wrote: > So there is an expression that I like attributing to myself, but I may > have "borrowed" it from elsewhere. > > Something designed to fail often will. > > The "general purpose" accelerator cards (transputer, NS32032, ...) all > suffered from a lack of application focus among other things. There was > the prevalent attitude of "if you build it, then they will buy". These > units largely failed to take hold apart from tiny niches. 
> > OTOH, "specialized" accelerator cards (Graphics cards, RAID cards, Sound > cards) have been a smashing success, as the CBA makes sense, they > deliver a specific value, and they are easy to use. The take home > message is that any accelerator card needs to do the same. What these > accelerator cards do is offload work from the CPU. Not all of them will > work as businesses, and this isn't a magical formula for success. And you have the volume issue. Offhand, I can easily think of at least a few HPC coprocessor cards that might be useful in a cluster: A $30 PCI-bus card that does nothing but generate super-high-quality random numbers (uniform deviates of various widths and/or ints of various widths) at high speed (faster than the CPU can, which means say 50 megarands/second or better) and deliver them directly to memory without the CPU's help (so one can build a circular queue and keep it full with only occasional calls requesting the next block of rands) would be a Great Boon to Monte Carlo-heads like myself. A $30 linear algebra card. Yes, I know -- a lot of graphics cards are essentially vector processors and can be used in this way, but I'm not satisfied. These cards aren't DESIGNED to be used as general purpose LAPACK-like or BLAS-like engines. I can't help but think that one could design a set of such cards that would function like a little mini-cluster even within a single system, partitioning the problem and doing sub-blocks, all in parallel with the main CPU and working directly with memory. There are probably more, but random numbers and linear algebra are both major components of a lot of work. Look at the problem here. Your $30 graphics chip is used in tens of millions of units per year. Your $30 random number generator card a) has to "work", which is not trivial to arrange.
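The consumption pattern rgb describes -- the card fills blocks of rands into memory, and the program drains a circular queue, paying for a refill only once per block -- can be mocked up in a few lines, with os.urandom standing in for the hypothetical card:

```python
import os
import struct

class RandQueue:
    """Block-refilled queue of 32-bit rands.  os.urandom plays the
    role of the hypothetical $30 PCI card DMAing into the buffer."""

    def __init__(self, block=4096):
        self.block = block   # rands fetched per refill
        self.buf = []

    def _refill(self):
        raw = os.urandom(4 * self.block)
        self.buf = list(struct.unpack("<%dI" % self.block, raw))

    def next_rand(self):
        if not self.buf:
            self._refill()   # the only "expensive" call
        return self.buf.pop()

q = RandQueue(block=1024)
draws = [q.next_rand() for _ in range(5000)]  # ~5 refills total
assert all(0 <= r < 2**32 for r in draws)
```

The per-rand cost is then a list pop; only one call in every `block` draws touches the (here simulated) hardware.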
I have a bitwise random number test in dieharder (a GPL package for testing random numbers I'm working on) that every supposedly good rng in the GSL fails at six bits, most still fail at five bits, and quite a few fail at four. That is, forget uniform distribution of BYTES, let alone 4-byte sequences -- there are measurable deviations away from random for just 6 bit substrings of a long string of bits. Hardware rng's are often no better. b) you have to be able to sell enough to make money, and that will be tough at $30/card... Ditto for linear algebra, although there, there is a high end market and companies DO sell engines for a lot more than $30 to a very small market. And make money. We just want the best of both worlds... > > Moreover, the "specialized" GPUs seem to have applicability in CFD and > other areas. This is interesting as it opens a possibility for > significant acceleration of some computations. The fundamental > question is whether or not there will be wide adoption. I am not seeing > wide adoption of the GPU as a CFD engine right now, but what if you had > a "CFD engine" chip that cost about the same as the GPU, stuck it on a > card, and had a high level language interface to it, so you hand it your > expensive routines to crank on. > > The physics chip bit got me thinking along the molecular dynamics lines > last night, specifically the non-bonded calculations. I am sure others > could regale us with their computational burdens (and I would like to > hear them myself at some point in time, it is quite instructive to hear > what people are worrying about). Ya, stuff like this would be great -- ODE solvers on a chip or add-on card. But NOT easy to build and NOT that big a market. rgb > > > I think the physics chip in hardware is a neat idea, though I think you > need a high level interface to it, open standards, and lots of support
Moreover, it needs to be programmable: not because > physics changes so often, but because the implied models may differ from > what you want. > > As I said, I am curious, and I think it is an interesting idea. If done > right, with the wind at the right angles, good user/community support, I > think it could work :) > > > > > > > > rgb > > > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From Bogdan.Costescu at iwr.uni-heidelberg.de Thu Mar 10 12:38:15 2005 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. In-Reply-To: <42308A89.5060907@scalableinformatics.com> Message-ID: On Thu, 10 Mar 2005, Joe Landman wrote: > The physics chip bit got me thinking along the molecular dynamics lines > last night, specifically the non-bonded calculations. http://www.research.ibm.com/grape/ is already quite old... CHARMM and I think AMBER (as MD applications) were able to use the chips - given that these are 2 of the most used MD codes, it would suggest that there was a significant gain in speed to justify the extra coding. -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De From gmpc at sanger.ac.uk Thu Mar 10 14:11:37 2005 From: gmpc at sanger.ac.uk (Guy Coates) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] hpl - large problems fail In-Reply-To: <4230A757.1020203@ufl.edu> References: <1110402646.5643.127.camel@Vigor11> <4230A757.1020203@ufl.edu> Message-ID: On Thu, 10 Mar 2005, Paul Johnson wrote: > All: > > I have a 4 node cluster(dont snicker :) ) Everyone starts off small. and Im trying to do some > benchmarking with HPL. 
I want to test 2 of the nodes with 1Gb of > ram each. I calculated the maximum problem size that can fit in 2Gb > and still allow for memory for the operating system. That came out to > be around 14500x14500. When I run that size of a test it always fails. > The largest problem that I can test and not have it fail on me is > 12500x12500. > What is the reason behind this? Im confused on what is going on here. > Thanks for any help. Do you know what actually caused the failure? If your problem size was too big, and you are really out of memory, you should see some messages in the system log saying the out-of-memory-killer was activated and HPL was zapped. If you know your machines was not actually out of memory, then you have broken hardware on one of your nodes. Run memtest+ or memtest on your nodes (Possibly the world's most useful pieces of diagnostic software). http://www.memtest86.com http://www.memtest.org If you haven't seen it, IBM have a redpaper on tuning HPL, which gives some good starting parameters, problem-sizing tips and an overview of different BLAS libraries you can compile against to get that extra few Gflops of performance. Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK Tel: +44 (0)1223 834244 ex 7199 From ctierney at HPTI.com Thu Mar 10 14:29:07 2005 From: ctierney at HPTI.com (Craig Tierney) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] hpl - large problems fail In-Reply-To: <4230A757.1020203@ufl.edu> References: <1110402646.5643.127.camel@Vigor11> <4230A757.1020203@ufl.edu> Message-ID: <1110493747.2973.75.camel@hpti10.fsl.noaa.gov> On Thu, 2005-03-10 at 13:00, Paul Johnson wrote: > All: > > I have a 4 node cluster(dont snicker :) ) and Im trying to do some > benchmarking with HPL. I want to test 2 of the nodes with 1Gb of > ram each. I calculated the maximum problem size that can fit in 2Gb > and still allow for memory for the operating system. 
That came out to > be around 14500x14500. When I run that size of a test it always fails. > The largest problem that I can test and not have it fail on me is > 12500x12500. > What is the reason behind this? Im confused on what is going on here. > Thanks for any help. Run hpl on each node by itself, using all of the memory. You probably have a bad stick of memory somewhere. Craig From redboots at ufl.edu Thu Mar 10 14:56:16 2005 From: redboots at ufl.edu (Paul Johnson) Date: Wed Nov 25 01:03:53 2009 Subject: Clarification: [Beowulf] hpl - large problems fail In-Reply-To: References: <1110402646.5643.127.camel@Vigor11> <4230A757.1020203@ufl.edu> Message-ID: <4230D090.4060301@ufl.edu> Guy Coates wrote: >On Thu, 10 Mar 2005, Paul Johnson wrote: > > > >>All: >> >>I have a 4 node cluster(dont snicker :) ) >> >> > >Everyone starts off small. > >and Im trying to do some > > >>benchmarking with HPL. I want to test 2 of the nodes with 1Gb of >>ram each. I calculated the maximum problem size that can fit in 2Gb >>and still allow for memory for the operating system. That came out to >>be around 14500x14500. When I run that size of a test it always fails. >>The largest problem that I can test and not have it fail on me is >>12500x12500. >>What is the reason behind this? Im confused on what is going on here. >>Thanks for any help. >> >> > > >Do you know what actually caused the failure? > >If your problem size was too big, and you are really out of memory, you >should see some messages in the system log saying the out-of-memory-killer >was activated and HPL was zapped. > >If you know your machines was not actually out of memory, then you have >broken hardware on one of your nodes. Run memtest+ or memtest on your >nodes (Possibly the world's most useful pieces of diagnostic software). 
> >http://www.memtest86.com >http://www.memtest.org > > >If you haven't seen it, IBM have a redpaper on tuning HPL, which gives >some good starting parameters, problem-sizing tips and an overview of >different BLAS libraries you can compile against to get that extra few >Gflops of performance. > >Cheers, > >Guy > > > I should have been clearer in my description. It doesn't fail at the command prompt when I run it. It fails when it checks the solution to the linear equations. The residual is too high and fails. This is part of the data from my HPL.out file:

============================================================================
T/V               N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WC12R2L4      14500    64     1     2             388.43          5.233e+00
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =   284363.4669186 ...... FAILED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =   210262.3627204 ...... FAILED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =    41377.6398965 ...... FAILED
||Ax-b||_oo . . . . . . . . . . . . . . . .  =        0.001692
||A||_oo  . . . . . . . . . . . . . . . . .  =     3708.772315
||A||_1 . . . . . . . . . . . . . . . . . .  =     3695.221759
||x||_oo  . . . . . . . . . . . . . . . . .  =        6.847285
||x||_1 . . . . . . . . . . . . . . . . . .  =    19610.120504
============================================================================

Sorry for the confusion, Paul -- Paul Johnson Graduate Student - Mechanical Engineering University of Florida - Gainesville, Fl http://plaza.ufl.edu/redboots Reclaim Your Inbox! http://www.mozilla.org/products/thunderbird -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20050310/9349bbcf/attachment.html From mathog at mendel.bio.caltech.edu Thu Mar 10 16:22:39 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf Message-ID: "Robert G. Brown" wrote > A $30 PCI-bus card that does nothing but generate super-high-quality > random numbers (uniform deviates of various widths and/or ints of > various widths) at high speed (faster than the CPU can, which means say > 50 megarands/second or better) and deliver them directly to memory > without the CPUs help (so one can build a circular queue and keep it > full with only occasional calls requesting the next block of rands) > would be a Great Boon to Monte Carlo-heads like myself. I've often thought the same. The thing is, it doesn't have to be pseudorandom numbers, it can be real random numbers generated by a physical process. For instance, put some radioactive material on the card and count the decays observed in a fixed interval. In that case the faster you want the random numbers the hotter the source must be. Unfortunately a Curie is "only" 37,000,000,000 decays per second. If you want 50 Mrands per second and use 1 Curie of material that's only about 740 expected decays per interval, assuming that you can catch them all, so not as many bits in those rands as you might want. Also I'm pretty sure I don't want to be anywhere near a beowulf with 1 Curie of radiation in each node! When I was considering this it was just to generate a small number of random numbers at a much slower rate, in which case radiation equivalent to a smoke alarm would have sufficed. It might be safer and a good deal more practical to use shot noise instead. That could be integrated nicely onto a chip, with thousands of little amplifiers and adders all plugging away in parallel to generate your numbers. And does it really have to be in a PCI slot? 
Why not build it into memory? Then the numbers don't need to move across the bus at all, any read from that stick will contain a random number. Just be sure that said card has some way of passing the POST sequence. Tricky doing both I suppose. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rgb at phy.duke.edu Thu Mar 10 17:08:17 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf In-Reply-To: References: Message-ID: On Thu, 10 Mar 2005, David Mathog wrote: > "Robert G. Brown" wrote > > > A $30 PCI-bus card that does nothing but generate super-high-quality > > random numbers (uniform deviates of various widths and/or ints of > > various widths) at high speed (faster than the CPU can, which means say > > 50 megarands/second or better) and deliver them directly to memory > > without the CPUs help (so one can build a circular queue and keep it > > full with only occasional calls requesting the next block of rands) > > would be a Great Boon to Monte Carlo-heads like myself. > > I've often thought the same. The thing is, it doesn't have to be > pseudorandom numbers, it can be real random numbers generated > by a physical process. For instance, put some radioactive > material on the card and count the decays observed in a > fixed interval. In that case the faster you want the random > numbers the hotter the source must be. Unfortunately a > Curie is "only" 37,000,000,000 decays per second. If you want > 50 Mrands per second and use 1 Curie of material that's only > about 740 expected decays per interval, assuming that you can catch > them all, so not as many bits in those rands as you might want. > Also I'm pretty sure I don't want to be anywhere near a beowulf > with 1 Curie of radiation in each node!
When I was considering this > it was just to generate a small number of random numbers at a much > slower rate, in which case radiation equivalent to a smoke alarm > would have sufficed. Yeah, I've done all of these computations and looked into various quantum devices and entropy based generators. However, you'd be surprised how difficult it is to make a HARDWARE based random number device that passes e.g. diehard. There is a deep truth here. "Random Number Generator" really is an oxymoron in a universe governed by natural law. Even quantum randomness arises from the fact that the "system" generating it is an open one in contact with a universe/bath in an unknown state (and watch out or I'll hit you up with my "Generalized Master Equation" lecture...;-). Unpredictability, you can find under every leaf. Randomness? Not so easy. > It might be safer and a good deal more practical to use shot noise > instead. That could be integrated nicely onto a chip, with > thousands of little amplifiers and adders all plugging away > in parallel to generate your numbers. Lots of them out there, but not so easy to make random enough at the bit level to make diehard happy. Shot noise, thermal noise, photon counters, more. They tend not to be cheap, are not necessarily particularly random, and they are almost universally "slow as molasses" compared to a good pseudorandom number generator. > > And does it really have to be in a PCI slot? Why not build it > into memory, Then the numbers don't need to move across the bus > at all, any read from that stick will contain a random number. > Just be sure that said card has some way of passing the > POST sequence. Tricky doing both I suppose. Actually IIRC there was some effort by Intel to build something into a CPU or some other part of a mobo chipset, but I don't know what came of it.
The people who need this commercially are webvolken and encryption algorithms, where "unpredictable" is almost as good as "random to the nth bit of randomness". Nobody uses marginally secure encryption if they can help it. However, these folks are totally happy with kilorands/sec, let alone megarands/sec. I need hundreds of megarands/sec. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Thu Mar 10 17:48:42 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf In-Reply-To: References: Message-ID: <4230F8FA.9070405@scalableinformatics.com> Robert G. Brown wrote: > Actually IIRC there was some effort by Intel to build something into a > CPU or some other part of a mobo chipset, but I don't know what came of > it. The people who need this commercially are webvolken and encryption > algorithms, where "unpredictable" is almost as good as "random to the > nth bit of randomness". Nobody uses marginally secure encryption if > they can help it. However, these folks are totally happy with > kilorands/sec, let alone megarands/sec. I need hundreds of > megarands/sec. (thinking aloud) Hmmm. I bet we could generate this (if you don't mind tt800 or its MT ilk), in software using some really neat and very powerful (and inexpensive) COTS chips. Don't know if we could hit 0.1 GRPS (gigarands per second), but I bet it could come pretty close. I have been thinking about a USB dongle for the past few months that provided random ints at a pretty nice rate. The chip I have in mind for this might be able to be powered off the USB port also (I think, I need to see the overall power needs). Could do a PCI card... would cost more to fabricate and sell. The $30 mark might be low.
I'd guess closer to about $100 for that, unless we could get significant volumes ... (economies of scale work wonders for component pricing). (/thinking aloud) > > rgb > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From natorro at fisica.unam.mx Thu Mar 10 17:32:07 2005 From: natorro at fisica.unam.mx (Carlos Lopez Nataren) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] How to disable GUI in Mac OS X? Message-ID: <1110504727.3267.1.camel@linux> I'm doing some performance tests with a G5 Xserve, both with Linux and MacOS X, but it seems unfair to test the performance of MacOS X with the graphical interface up. Does anyone know how to disable the graphical interface so I just get a nice text console? Thanks a lot in advance Carlos -- Carlos Lopez Nataren Instituto de Fisica, UNAM From atp at piskorski.com Thu Mar 10 18:52:15 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. In-Reply-To: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu> References: <200503100507.j2A571J1000565@zygorthian-space-raiders.mit.edu> Message-ID: <20050311025215.GA88338@piskorski.com> On Thu, Mar 10, 2005 at 12:07:01AM -0500, Omri Schwarz wrote: > http://ageia.com Interesting. Commodity - or at least "cheap commodity" - means its primary market has to be something other than HPC, of course. 3D video games qualify. So, the main questions I see are: 1. What kind of HPC stuff is this Ageia PhysX chip good for, and how good? 2. How likely is it to get really, really widely used in non-HPC fields, thus driving down price? 3. What other such candidate chips are out there? Same questions for those. 4. What to do about it?
(E.g., "Heh, Ageia, there's this little HPC niche you could sell to with maybe no extra effort.") Question 1 seems pretty open. Looks like they've released nearly nothing about its actual hardware design and performance. Handles 40,000 rigid bodies, 0.13 um process, 125 million transistors, 25 Watts. That's about it so far. On question 2, who knows. But, supposedly Sega has licensed the PhysX chip. (Naturally, no word whether or not they will be shipping it in millions of game consoles.) Some other googling suggests that at least some big game developers are supporting it (e.g., "the Unreal Engine 3 thoroughly exploits the Novodex physics API"). So it doesn't seem totally impossible that either Ageia or some competitor (say, the GPU companies) might succeed in creating a new market niche. But it's all very early, AFAIK no one has even announced an actual board with the chip yet, much less offered one for sale. If I had to guess, they probably haven't fab'd (at TSMC) more than a couple sample lots yet. The word in the press is that they're shooting for retail hardware on sale for Christmas. What I find interesting is that if these sorts of highly realistic, "render explosions in real-time" simulations catch on, then even if this PhysX chip turns out to be useless for anything other than this year's latest first person shooter, its successors may not. Yet another potential opportunity for cluster users to ride on the mass market coattails... -- Andrew Piskorski http://www.piskorski.com/ From rgb at phy.duke.edu Thu Mar 10 19:31:07 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf In-Reply-To: <4230F8FA.9070405@scalableinformatics.com> References: <4230F8FA.9070405@scalableinformatics.com> Message-ID: On Thu, 10 Mar 2005, Joe Landman wrote: > > > Robert G.
Brown wrote: > > > Actually IIRC there was some effort by Intel to build something into a > > CPU or some other part of a mobo chipset, but I don't know what came of > > it. The people who need this commercially are webvolken and encryption > > algorithms, where "unpredictable" is almost as good as "random to the > > nth bit of randomness". Nobody uses marginally secure encryption if > > they can help it. However, these folks are totally happy with > > kilorands/sec, let alone megarands/sec. I need hundreds of > > megarands/sec. > > (thinking aloud) > > Hmmm. I bet we could generate this (if you dont mind tt800 or its MT > ilk), in software using some really neat and very powerful (and > inexpensive) COTS chips. Dont know if we could hit 0.1 GRPS (gigarand > per second), but I bet it could come pretty close. As I said, it isn't as easy as it sounds to generate truly random numbers. Pseudorandom yes, but there I can generate immediate data (from my dieharder program, for example) on how fast you can make them. On a dual 242 Opteron, for example, using the mt19937_1999 generator (one of the best, fastest rngs from the GSL) I get roughly 50 MegaRPS per processor, or 100 MRPS total. Generators of comparable or a bit better quality tend to run just a bit slower to a lot slower. The built in /dev/random entropy generator in linux is VERY slow and blocks when there is insufficient entropy. /dev/urandom doesn't block, is quite unpredictable, but is STILL very, very slow at less than a MROPS on nearly any modern hardware. This gives you some idea of how difficult it is to get 100 MRPS rates on ANY sort of hardware, even with pseudorandom number generators. Hardware generators have the same issues. Radioactivity based generators need to be pretty damn hot to generate 3.2 random gigabits per second. 
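For comparison with the MegaRPS figures rgb quotes, software PRNG throughput is easy to measure. A minimal timing harness (numpy's MT19937 stands in here for the GSL mt19937_1999 generator, so absolute rates will differ from the dual-Opteron numbers above):

```python
import time
import numpy as np
from numpy.random import Generator, MT19937

rng = Generator(MT19937(12345))  # Mersenne Twister, fixed seed
N = 10_000_000                   # ten million 32-bit rands

t0 = time.perf_counter()
draws = rng.integers(0, 2**32, size=N, dtype=np.uint32)
dt = time.perf_counter() - t0

# Throughput in millions of rands per second (MRPS); machine-dependent.
print(f"{N / dt / 1e6:.1f} MRPS")
```

Note the 3.2 gigabit figure above is just 100 MRPS times 32 bits per rand, which is the bar any hardware source would have to clear.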
Photon counters have all sorts of hardware response time issues (and tend not to be as poissonian/random as you might think, see Hanbury-Brown-Twiss or computations of antibunching in the fluorescent spectrum). "Entropy" based generators can be OK, but again, it is difficult to know all the timescales of decay/decorrelation of all aspects of order, as nearly ALL hardware based sources are not independent random for very short times -- rather you hope that they have "only" relatively short autocorrelation times and don't sample them too often. This is one of the things I might be working on a bit over the next year, and is one reason I started working on dieharder (the tool I use both to measure their quality of randomness, so to speak, and to time them as well). > I have been thinking about a USB dongle for the past few months that > provided random ints at a pretty nice rate. The chip I have in mind for > this might be able to be powered off the USB port also (I think, I need > to see the overall power needs). > > Could do a PCI card... would cost more to fabricate and sell. The $30 > mark might be low. I'd guess closer to about $100 for that, unless we > could get significant volumes ... (economies of scale work wonders for > component pricing). FIRST you have to come up with a design, THEN you have to build a prototype and test it. Chances are that when you test it you'll find that while perhaps (if you're lucky) it is unpredictable, it isn't "uniformly random" at the bit level. After all, plenty of things sample a random DISTRIBUTION (say a Gaussian) that is unpredictable but not uniform. Then you get to look for transformations that will take your non-uniform random distribution and make it uniform at the bit level. Typically this will cost you bits -- for example, if your original distribution has a bias in its 0 vs 1 numbers you can combine pairs of bits to create a distribution that is uniform in 0 and 1. 
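The pair-combining trick just described is the classic von Neumann extractor; a minimal sketch with a simulated biased source (the 70% bias is arbitrary, purely for illustration):

```python
import random

def von_neumann_extract(bits):
    """Debias a bit stream: for each non-overlapping pair, emit the
    first bit if the pair is (0,1) or (1,0), and discard (0,0) and
    (1,1).  Output is unbiased for independent input bits, at the
    cost of at least half the bits (usually far more)."""
    return [a for a, b in zip(bits[::2], bits[1::2]) if a != b]

random.seed(1)
# Simulated biased hardware source: ones appear 70% of the time.
raw = [1 if random.random() < 0.7 else 0 for _ in range(100_000)]
out = von_neumann_extract(raw)

print(f"input bias:  {sum(raw) / len(raw):.3f}")   # ~0.700
print(f"output bias: {sum(out) / len(out):.3f}")   # ~0.500
print(f"bits kept:   {len(out)} of {len(raw)}")    # ~42% survive
```

This only balances single-bit frequencies; as the rest of the argument points out, the pair, triple, and longer patterns still have to be checked separately.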
Then you have to look at 00, 01, 10, and 11 -- do they all occur binomially distributed (on average) with p = 0.25? How about 000, 001, 010, 011, 100, 101, 110, 111 with p = 0.125? Etc. (I find that NO pseudorandom number generators I've tested have the right six or greater bit distribution, that is, 000000, 000001, 000010, ... are not each binomially distributed with p = 1/64 in a very large bit sample.) Eventually you get it to pass diehard (and maybe even dieharder) at the expense of many bits and much consequent halving of the peak rates. If it is STILL really fast -- 100 MROPS is a good thing to shoot for as it is the equivalent of a dedicated dual opteron THEN you can look for a bus and market. Not easy at all. Or cheap. Intel and other companies have long been looking for a good way of doing this, and if you find one (especially one that could be incorporated on a single chip) you can probably sell it directly to them and retire. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From atp at piskorski.com Thu Mar 10 20:50:50 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf In-Reply-To: References: Message-ID: <20050311045050.GA31865@piskorski.com> On Thu, Mar 10, 2005 at 08:08:17PM -0500, Robert G. Brown wrote: > they can help it. However, these folks are totally happy with > kilorands/sec, let alone megarands/sec. I need hundreds of > megarands/sec. The VIA x86 cpus have "PadLock ACE" RNG hardware, but they only claim 1.5 mbit/s or so. Lots slower than your dual Opterons pushing PRNG code, but then, those little Vias are probably happy to get other work done too while spitting out those random bits. :) Some web links say that Diehard is happy with the Via's output. But then I've also heard that Diehard wasn't intended (still true?) 
for finding the type of non-randomness hardware generators are prone to: http://www.robertnz.net/hwrng.htm http://www.robertnz.net/true_rng.html I wonder if the Via's underlying randomness sampling mechanism is inherently rate limited, or if they could scale it up to a single small RNG-only chip with 100 or 1000 times the bit rate. There does seem to be some market for such a thing, so since they have not done so, probably not? -- Andrew Piskorski http://www.piskorski.com/ From john.hearns at streamline-computing.com Fri Mar 11 00:02:11 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] 2.6.11 is out; with InfiBand support In-Reply-To: References: Message-ID: <1110528131.5349.1.camel@Vigor45> On Thu, 2005-03-10 at 12:14 -0500, Krys Kaya-Sar wrote: > YMMV? my apologies, very new here. > Your Mileage May Vary If I'm not wrong, this comes from USA automobile advertisements. The manufacturers give a certain petrol (gas) mileage for the car, but warn the customer that the mileage they get may very from that. Same with computers - you might not achieve the same results. From eugen at leitl.org Fri Mar 11 00:03:14 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf In-Reply-To: References: Message-ID: <20050311080314.GF17303@leitl.org> On Thu, Mar 10, 2005 at 04:22:39PM -0800, David Mathog wrote: > "Robert G. Brown" wrote > > > A $30 PCI-bus card that does nothing but generate super-high-quality > > random numbers (uniform deviates of various widths and/or ints of > > various widths) at high speed (faster than the CPU can, which means say > > 50 megarands/second or better) and deliver them directly to memory > > without the CPUs help (so one can build a circular queue and keep it > > full with only occasional calls requesting the next block of rands) > > would be a Great Boon to Monte Carlo-heads like myself. 
> > I've often thought the same. The thing is, it doesn't have to be > pseudorandom numbers, it can be real random numbers generated Some can do both. http://www.via.com.tw/en/initiatives/padlock/hardware.jsp Unfortunately, the VIA CPUs are otherwise next to useless for numerics. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050311/97a9aac8/attachment.bin From eugen at leitl.org Fri Mar 11 00:24:48 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:53 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf In-Reply-To: References: Message-ID: <20050311082447.GG17303@leitl.org> On Thu, Mar 10, 2005 at 08:08:17PM -0500, Robert G. Brown wrote: > Yeah, I've done all of these computations and looked into various > quantum devices and entropy based generators. However, you'd be > surprised how difficult it is to make a HARDWARE based random number > device that passes e.g. diehard. There is a deep truth here. "Random http://www.via.com.tw/en/downloads/whitepapers/initiatives/padlock/evaluation_padlock_rng.pdf Support is there starting with kernel 2.6.11. > Actually IIRC there was some effort by Intel to build something into a > CPU or some other part of a mobo chipset, but I don't know what came of > it. The people who need this commercially are webvolken and encryption It's still there in some chipsets, but you can't rely on it being there if you don't control your hardware purchases. > algorithms, where "unpredictable" is almost as good as "random to the > nth bit of randomness". 
Nobody uses marginally secure encryption if > they can help it. However, these folks are totally happy with > kilorands/sec, let alone megarands/sec. I need hundreds of > megarands/sec. http://fp.gladman.plus.com/ACE/ claims almost 2 GByte/s throughput for AES. The bottleneck will be the GBit NIC on a PCI bus. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050311/84988f60/attachment.bin From john.hearns at streamline-computing.com Fri Mar 11 00:34:43 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. In-Reply-To: References: Message-ID: <1110530084.5349.24.camel@Vigor45> On Thu, 2005-03-10 at 11:13 -0800, David Mathog wrote: > "Omri Schwarz" wrote: > > > Would specialty (albeit commodity) coprocessors hanging off a > > PCI slot be suitable for your applications? > > Well, a couple of points: > > 1. There may very well be no PCI slot available for such a > coprocessor. Forget about it for a 1U, Don't wish to be a pain, but our 1U nodes have space for at least one PCI card. MSI dual-Opteron nodes can take two. One is on a riser card above the mainboard, one is at 180 degrees to that. Same motherboard used in Sun V20z, so can take two PCI. I'd imagine the same for other suppliers. Is your point that there may be a slot and a riser card, but there is not enough space to fit a coprocessor card in a 1U? I'll agree there. 
From john.hearns at streamline-computing.com Fri Mar 11 01:44:47 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf cluster. In-Reply-To: References: Message-ID: <1110534287.6204.3.camel@Vigor45> On Thu, 2005-03-10 at 11:13 -0800, David Mathog wrote: > "Omri Schwarz" wrote: > > > Would specialty (albeit commodity) coprocessors hanging off a > > PCI slot be suitable for your applications? > > 2. Assuming that you can get the card in there and not fry anything > you're going to have to at the very least rewrite the code to make > use of the specialized hardware. Unless this board comes with a > very, very clever compiler to automagically detect the bits it can > do best you're looking at some serious time and/or money spent on > programming. > We have looked at Clearspeed's product: http://www.clearspeed.com/products/apps.php They have ported standard libraries, and say that if your software uses BLAS etc. it will run without change. From rgb at phy.duke.edu Fri Mar 11 05:24:14 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf In-Reply-To: <20050311082447.GG17303@leitl.org> References: <20050311082447.GG17303@leitl.org> Message-ID: On Fri, 11 Mar 2005, Eugen Leitl wrote: > On Thu, Mar 10, 2005 at 08:08:17PM -0500, Robert G. Brown wrote: > > > Yeah, I've done all of these computations and looked into various > > quantum devices and entropy based generators. However, you'd be > > surprised how difficult it is to make a HARDWARE based random number > > device that passes e.g. diehard. There is a deep truth here. "Random > > http://www.via.com.tw/en/downloads/whitepapers/initiatives/padlock/evaluation_padlock_rng.pdf > > Support is there starting with kernel 2.6.11. 
Yes, and also see the attached, which is pretty much a linear predecessor of this white paper prepared by the same company for Intel. The von Neumann transformation is what I was referring to as a means of improving bitlevel statistics, but a) it costs you 1/2 of the total bits generated to use; b) it only fixes the first moment of the bit distribution (restores balance between 0's and 1's) but as both papers empirically note, it doesn't fix either serial correlations due to at best empirically known autocorrelation patterns or higher order (N at a time) bit correlations. Thus it is safe to say that while the sources are unpredictable, they are not "random" (distributed as random numbers to all orders). And they are slow. Or as you say (said) -- good for cryptography, not so good for simulation, except to seed a decent prng. However, linux folks can do just as well for cryptography with /dev/random, which doesn't require anything at all. At least that's what I think -- dieharder doesn't yet support the whole NIST STS suite but I personally think that a whole lot of both STS and diehard will reduce to a single more direct test of bitlevel randomness of a generator that I HAVE implemented. But since none of this is published (yet) and hence refereed, I could be mistaken. > > Actually IIRC there was some effort by Intel to build something into a > CPU or some other part of a mobo chipset, but I don't know what came of > it. The people who need this commercially are webvolken and encryption > It's still there in some chipsets, but you can't rely on it being there if > you don't control your hardware purchases. This was the origin of the white paper attached -- part of their original engineering effort, I believe. > > algorithms, where "unpredictable" is almost as good as "random to the > > nth bit of randomness". Nobody uses marginally secure encryption if > > they can help it.
However, these folks are totally happy with > > kilorands/sec, let alone megarands/sec. I need hundreds of > > megarands/sec. > > http://fp.gladman.plus.com/ACE/ > claims almost 2 GByte/s throughput for AES. The bottleneck will be the GBit > NIC on a PCI bus. I'll have to look into this. If the rands are simulation quality and the device not too expensive, this could easily be worth it. This is the sort of thing I was hoping to turn up in this discussion. rgb > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Fri Mar 11 05:28:15 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Quasi-Non-Von-Neumann hardware in a Beowulf In-Reply-To: <20050311082447.GG17303@leitl.org> References: <20050311082447.GG17303@leitl.org> Message-ID: On Fri, 11 Mar 2005, Eugen Leitl wrote: OK, THIS time I'll attach the Intel paper. Oh wait. I can't -- the list doesn't permit attachments. Damn, I think it is listed as the third hit on Google for "Intel random number generator". Gotta go. Busy morning. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From gmpc at sanger.ac.uk Fri Mar 11 06:46:24 2005 From: gmpc at sanger.ac.uk (Guy Coates) Date: Wed Nov 25 01:03:54 2009 Subject: Clarification: [Beowulf] hpl - large problems fail In-Reply-To: <4230D090.4060301@ufl.edu> References: <1110402646.5643.127.camel@Vigor11> <4230A757.1020203@ufl.edu> <4230D090.4060301@ufl.edu> Message-ID: > the command prompt when I run it. It fails when it checks the solution > to linear equations. The residual is too high and fails.
This is part > of the data from my HPL.out file: > This could still be dodgy memory; if bits get flipped then you can expect those sorts of numerical instabilities. Try running a single HPL job on each machine. If you get the correct answer on 3 machines and the wrong answer on one, then you've narrowed it down to hardware. If you get the wrong answer on all your machines then you probably have a software problem. Try recompiling HPL with no compiler optimisations, a different compiler and/or blas library. If that doesn't work, then it might just be possible that you are into weird hardware/kernel bug territory. I ran into similar HPL problems whilst benchmarking a rather large hardware purchase we made several years ago. The HPL residuals were coming out as NaN. Recompiling with a different compiler gave the same result. Rather worryingly, the same binaries ran correctly when run on different hardware. After a lot of head scratching and phone calls to an extremely worried vendor ("Hey, this kit you sold us can't do maths properly!") the problem was tracked down to a dodgy kernel module. It turned out that the module provided by the vendor to do console-over-lan stomped over the floating point registers under certain circumstances. Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK Tel: +44 (0)1223 834244 ex 7199 From ctierney at HPTI.com Fri Mar 11 07:06:21 2005 From: ctierney at HPTI.com (Craig Tierney) Date: Wed Nov 25 01:03:54 2009 Subject: Clarification: [Beowulf] hpl - large problems fail In-Reply-To: References: <1110402646.5643.127.camel@Vigor11> <4230A757.1020203@ufl.edu> <4230D090.4060301@ufl.edu> Message-ID: <1110553581.2823.11.camel@localhost.localdomain> On Fri, 2005-03-11 at 07:46, Guy Coates wrote: > > the command prompt when I run it. It fails when it checks the solution > > to linear equations. The residual is too high and fails.
This is part > > of the data from my HPL.out file: > > > > This could still be dodgy memory; if bits get flipped then you can expect > those sorts of numerical instabilities. > > Try running a single HPL job on each machine. If you get the correct > answer on 3 machines and the wrong answer on one, then you've narrowed it > down to hardware. > > If you get the wrong answer on all your machines then you probably have a > software problem. Try recompiling HPL with no compiler optimisations, a > different compiler and/or blas library. > > > If that doesn't work, then it might just be possible that you are into > wierd hardware/kernel bug territory. I ran into similar HPL problems > whilst benchmarking a rather large hardware purchase we made several years > ago. The HPL residuals were coming out as NaN. Recompiling with a > different compiler gave the same result. Rather worryingly, the same > binaries ran correctly when run on different hardware. After alot of head > scratching and phonecalls to an extremely worried vendor ("Hey, this kit > you sold us can't do maths properly!") the problem was tracked down to a > dodgy kernel module. It turned out that the module provided by the vendor > to do console-over-lan stomped over the floating point registers under > certain circumstances. > It could also be the interconnect. If you are using ethernet, I would think it is unlikely but I have seen issues with high-speed interconnects where they had a problem with the PCI slot, and we would get wrong answers when running HPL on more than 2 systems. Craig From nj at hemeris.com Fri Mar 11 00:28:33 2005 From: nj at hemeris.com (Nicolas Jungers) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] How to disable GUI in Mac OS X? 
In-Reply-To: <1110504727.3267.1.camel@linux> References: <1110504727.3267.1.camel@linux> Message-ID: <1110529713.6330.20.camel@lcube> On Thu, 2005-03-10 at 19:32 -0600, Carlos Lopez Nataren wrote: > I'm doing some performance test with a G5 Xserve, both with linux and > MacOS X, but it seem unfair to test the performance of MacOS X with the > graphical interface up, does anyone know how to disable the graphical > interface so I just get a nice text console? there are two ways to do it: - at boot, press Command-S to go into single user mode - at login, use >console as the login name Regards, Nicolas From eugen at leitl.org Fri Mar 11 12:47:29 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Via Now Shipping Dual-Processor Mini-ITX Board Message-ID: <20050311204729.GA17303@leitl.org> Link: http://slashdot.org/article.pl?sid=05/03/11/1840251 Posted by: timothy, on 2005-03-11 20:20:00 from the smallosity dept. An anonymous reader writes "Via is now shipping its [1]first dual-processor mini-ITX board. The DP-310 features two 1GHz processors, gigabit Ethernet, support for SATA drives, and a media-processing graphics chipset. It targets high-density applications -- according to Via, a 42-U rack with 168 processors would draw about 2.5 kilowatts, or about as much power as two hair dryers." This also looks like the basis for a nice car computer. Also on the small-computing front, an anonymous reader submits "General Micro, meanwhile, last week released what it calls the [2]world's fastest mini-ITX board, powered by a Pentium M clocked up to 2.3GHz. " References 1. http://linuxdevices.com/news/NS7109201579.html 2.
http://linuxdevices.com/news/NS5099841192.html ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050311/648bf59a/attachment.bin From trippm at gmail.com Fri Mar 11 12:56:26 2005 From: trippm at gmail.com (Mauricio Carrillo Tripp) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Terrible scaling with a Cisco Catalyst 2948G-GE-TX Switch Message-ID: Hi, I'm new in this list. I hope there's somebody here that can help me out. I've been using a cluster with 16 nodes for quite a while, the switch I'm using is a cheap DELL switch (PowerConnect 2624, ~$300). I run Molecular Dynamics Simulations using NAMD, and the scaling I got was good (around 75% performance on 16 nodes). I recently got another 32 pc's to build a new cluster, but this time I had to buy a more expensive switch (Cisco Catalyst 2948G-GE-TX, ~$3,000). To my surprise, the scaling is just TERRIBLE, and after a lot of tests I finally found that the problem is the switch (I'm not sure what or why exactly, though). Using the Cisco switch or the Dell switch on the same cluster running exactly the same program I get very different scaling (please see Fig. 3 at http://chem.acad.wabash.edu/~trippm/Clusters/performance.php). I logged in into the Cisco switch's interface and disabled spantree, enabled fastport and set the ports to be always 1000tx (instead of being auto detect), and nothing I do seems to help. So, if there is someone else using this type of Cisco switch, could you tell me if you found the same behaviour and figured out what was wrong? 
Any other thoughts will be appreciated too. Thanks. -- Mauricio Carrillo Tripp, PhD Department of Chemistry Wabash College trippm@wabash.edu http://chem.acad.wabash.edu/~trippm From hahn at physics.mcmaster.ca Fri Mar 11 21:35:16 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Terrible scaling with a Cisco Catalyst 2948G-GE-TX Switch In-Reply-To: Message-ID: > is a cheap DELL switch (PowerConnect 2624, ~$300). I run Molecular Dynamics relabeled SMC, iirc. reasonable commodity (in a good way) hardware. > I recently got another 32 pc's to build a new cluster, but this time I had to > buy a more expensive switch (Cisco Catalyst 2948G-GE-TX, ~$3,000). To definitely an older-generation switch - can't even do full line rate on all ports, since the backplane has just 12 Gbps bandwidth. http://www.cisco.com/en/US/products/hw/switches/ps606/ps5502/ (which is feeling-lucky if you google the switch's name). > my surprise, the scaling is just TERRIBLE, and after a lot of tests I finally > found that the problem is the switch (I'm not sure what or why exactly, though). bandwidth. the switch appears to be a hopped-up 100bT switch, kind of sad actually. > exactly the same program I get very different scaling (please see Fig. 3 at > http://chem.acad.wabash.edu/~trippm/Clusters/performance.php). when your speedup flattens and falls, you know you hit a bottleneck... > Any other thoughts will be appreciated too. ebay it; maybe someone will buy it for an office (where the design is actually OK, since ports won't saturate for long.) Cisco doesn't have any special magic in networking, certainly not in anything as commoditized as ethernet switching. 
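Mark's bandwidth point is easy to make concrete. A back-of-the-envelope sketch (the 48-port count is assumed from the model name; the 12 Gbps backplane figure is the one quoted in his message, taken at face value):

```python
# Oversubscription estimate for the Cisco 2948G-GE-TX, using the
# figures quoted in the thread (assumed, not vendor-verified).

PORTS = 48               # gigabit ports assumed from the "2948" model name
PORT_RATE_GBPS = 1.0     # line rate per port, each direction
BACKPLANE_GBPS = 12.0    # switching capacity quoted by Mark

# If every port tried to send at full line rate simultaneously:
demand = PORTS * PORT_RATE_GBPS
oversubscription = demand / BACKPLANE_GBPS
print(f"demand: {demand:.0f} Gbps vs backplane: {BACKPLANE_GBPS:.0f} Gbps")
print(f"oversubscription: {oversubscription:.0f}:1")

# Average share of a gigabit each port can sustain under full load:
per_port_mbps = BACKPLANE_GBPS / PORTS * 1000
print(f"sustained per-port share: {per_port_mbps:.0f} Mbps")
```

Even 32 active nodes present 32 Gbps of potential demand against a 12 Gbps fabric, so under heavy all-to-all communication the switch saturates long before the NICs do -- consistent with the flattening speedup curve.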
From mechti01 at luther.edu Fri Mar 11 19:19:40 2005 From: mechti01 at luther.edu (Timo Mechler) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Grants for Beowulf Clusters Message-ID: <1883.172.17.11.39.1110597580.squirrel@172.17.11.39> Hi all, I'm wondering what kind of success rate people are having with obtaining grants for Beowulf type Linux Clusters (for example, from the National Science Foundation). Let me give you a little bit more info as to why I'm asking this: I'm a junior undergraduate at a small liberal arts college in Iowa (~2600 students), and have solely been pursuing Beowulf clusters for well over a year now. I believe strongly that even though my school is small, several departments on campus could benefit from the use of a beowulf cluster in the research that does go on. I've been using older, slower machines as a proof of concept for now. Ideally, we would want a faster beowulf system eventually that offers significant improvements over anything desktop pc's have to offer nowadays. Being that money is an issue at smaller schools, is there any way I could obtain a grant for a beowulf cluster? If so, besides the NSF, what would be some other sources to apply to? Since some of you guys on this list come from big companies or universities, I would appreciate any insight and suggestions you can give me. All input is appreciated. Thanks in advance! Best Regards, -Timo Mechler -- Timo R. Mechler mechti01@luther.edu From jake at spiekerfamily.com Sat Mar 12 19:26:11 2005 From: jake at spiekerfamily.com (Jake Thebault-Spieker) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Folding@Home on a Beowulf? Message-ID: <4233B2D3.7000807@spiekerfamily.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Does anybody have any experience with Folding@Home (http://folding.stanford.edu)? I'd like to run it on my six-node cluster of 133MHz CPUs, but my cluster won't be online. Is there a way to get the folding jobs another way?
Like downloading them at a different location, then transferring them to the cluster? - -- I think computer viruses should count as life. I think it says something about human nature that the only form of life we have created so far is purely destructive. We've created life in our own image. - --Stephen Hawking 010010100110000101101 011011001010010000001 010100011010000110010 101100010011000010111 010101101100011101000 010110101010011011100 000110100101100101011 010110110010101110010 Jake Thebault-Spieker -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (MingW32) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFCM7LTI2YvXV9Bxi0RApkcAJ93zARE+h0iqhrqAmP0JbCxXSgdnwCg4eWw HuIQGgZXjZmXn99FcUKNFlc= =opi/ -----END PGP SIGNATURE----- From rgb at phy.duke.edu Mon Mar 14 06:47:02 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Grants for Beowulf Clusters In-Reply-To: <1883.172.17.11.39.1110597580.squirrel@172.17.11.39> References: <1883.172.17.11.39.1110597580.squirrel@172.17.11.39> Message-ID: On Fri, 11 Mar 2005, Timo Mechler wrote: > Hi all, > > I'm wondering what kind of success rate people are having with obtaining > grants for Beowulf type Linux Clusters (for example, from the National > Science Foundation). Let me give you a little bit more info as to why I'm > asking this: I'm a junior undergraduate at a small liberal arts college > in Iowa (~2600 students), and have solely been pursuing Beowulf clusters > for well over a year now. I believe strongly that even though that my > school is small, several departments on campus could benefit from the use > of a beowulf cluster in the research that does go on. I've been using > older, slower machines as a proof of concept for now. Ideally, we would > want a faster beowulf system eventually that offers significant > improvements over anything desktop pc's have to offer nowadays. 
Being > that money is an issue at smaller schools, is there any I could obtain a > grant for a beowulf cluster? If so, besides the NSF, what would be some > other sources to apply to? Since some of you guys on this come from big > companies or Univesities, I would appreciate any insight and suggestions > you can give me. All input is appreciated. Thanks in advance! Nearly all the university clusters in the world (with a few exceptions like my or Jeff Layton's home clusters:-) are purchased with grant money of one sort or another, so "success" or not it's the only game in town. Businesses, of course, often pay out of pocket, but that's the game in the big city (so to speak). MOST of the university clusters are likely sponsored to do some specific piece of grant-funded research. That is, I want to do Monte Carlo research into the dynamic and static critical properties of continuous Heisenberg ferromagnets, so I write a proposal to do this research that contains a hardware budget for the cluster upon which I plan to do it. However, the NSF has LOTS of grants for different categories, including grants to stimulate and improve undergraduate educational experiences or to build shared infrastructure. Few of them are likely to be available to an undergraduate, though. What I'd recommend is that you find a local faculty person (or three) that you can convince to share your vision and that might "need" the cluster either to teach students or to do some actual research (or both). Perhaps one from computer science, one from physics, one from chemistry or biology. See if all of you together can write an institutional grant proposal for a startup cluster. Note that this need not be terribly expensive or a lot of money -- you can build a perfectly reasonable starter cluster for around $20-25K if you can get the school to pick up the tab for power and cooling (likely to be a few thousand a year). 
If the cluster is thoroughly firewalled from the university network backbone, students could probably install it and administer it and manage it and write applications for it for the participating faculty or for course credit. It would be fun. Note that you CAN get started for a lot less money than even this. Barebones compute nodes can be had for almost nothing. Doug Eadline and Jeff Layton recently demonstrated this rather spectacularly here: http://www.clusterworld.com/value_cluster.shtml This is basically an 8 node cluster made with all new components for $2500. While they are experts (and hence can actually ride the bleeding edge of what will work for the least possible amount of money) this does indicate what can be done. More reasonably, you can get started for something like $500 for miscellaneous stuff (shelving, network switch and cabling), $1000 for a fairly nice desktop for a front end, fileserver, and head node, and somewhere between $300 and $600 per node, where the range in prices reflects the amount of memory and hard disk and kind of network per node. Check out pricewatch -- some of the barebones, no-OS systems are less than $300 even for e.g. AMD-64s (not exactly turtle-slow, that is:-). What this means is that NSF or not you CAN build a student cluster at your school. You can damn near fund it with a bake sale or other simple/fun fundraisers. This kind of money one can often convince the school to pony up out of its petty cash budget, or you can start a "cluster computing club" and (maybe) get a few thou out of your school's clubs and activities budget. Or you and seven friends can each contribute $500 tax-deductible dollars (yes, sigh, from your parents). Or you can walk main street and knock on doors and ask local businesses to fund it, or... Anyway, you get the point.
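rgb's itemized figures above tally up quickly; a minimal sketch using his own numbers (the 8-node count is borrowed from the ClusterWorld example he cites, so the totals are illustrative only):

```python
# Tally of the starter-cluster budget rgb sketches above. All prices
# are the figures from his message; node count matches the
# ClusterWorld value cluster.

MISC = 500                        # shelving, network switch, cabling
HEAD_NODE = 1000                  # desktop front end / fileserver / head node
NODE_LOW, NODE_HIGH = 300, 600    # per-node range (memory, disk, network)
NODES = 8

low = MISC + HEAD_NODE + NODES * NODE_LOW
high = MISC + HEAD_NODE + NODES * NODE_HIGH
print(f"8-node starter cluster: ${low} to ${high}")
```

Either end of the range sits well below the grant-scale $20-25K figure, which is the argument in a nutshell: a demonstrable small cluster first, the funding proposal afterward.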
You don't need NSF money to get started, and in fact it would be a lot easier for you to build a small cluster to get started first and THEN look for that $20K grant from NSF to make it into a middling big cluster! rgb > > Best Regards, > > -Timo Mechler > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From james.p.lux at jpl.nasa.gov Mon Mar 14 07:04:53 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Grants for Beowulf Clusters References: <1883.172.17.11.39.1110597580.squirrel@172.17.11.39> Message-ID: <003401c528a7$2db91180$30f49580@LAPTOP152422> I can't claim that I am successful at getting grants for clusters, however... If you can make a good case that a cluster will make it possible to solve some other "important" problem, the odds go up greatly. Think of a cluster as a tool, just like a microscope or an ultra centrifuge or a furnace. How would you justify getting the budget for a big microscope (like a SEM)? The key is to have a problem that everyone wants to attack, and the cluster being the way to attack it. You said you've been doing proof of concept.. Is that to prove that you can build a cluster, or that you've demonstrated some useful "work" with the cluster on a problem that someone is interested in (i.e. for which there is funding available). Otherwise, you're a solution looking for a problem. Jim Lux ----- Original Message ----- From: "Timo Mechler" To: Sent: Friday, March 11, 2005 7:19 PM Subject: [Beowulf] Grants for Beowulf Clusters > Hi all, > > I'm wondering what kind of success rate people are having with obtaining > grants for Beowulf type Linux Clusters (for example, from the National > Science Foundation). 
Let me give you a little bit more info as to why I'm > asking this: I'm a junior undergraduate at a small liberal arts college > in Iowa (~2600 students), and have solely been pursuing Beowulf clusters > for well over a year now. I believe strongly that even though that my > school is small, several departments on campus could benefit from the use > of a beowulf cluster in the research that does go on. I've been using > older, slower machines as a proof of concept for now. Ideally, we would > want a faster beowulf system eventually that offers significant > improvements over anything desktop pc's have to offer nowadays. Being From diep at xs4all.nl Mon Mar 14 09:25:31 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Grants for Beowulf Clusters Message-ID: <3.0.32.20050314182531.015df148@pop.xs4all.nl> Oh well, My chess program Diep is looking for a cluster to run on at important tournaments, including the world computer chess championships. Lots of publicity and bragging rights in such a case :) Volunteers can email me: diep@xs4all.nl Warning: I need some testing time; in a competitive league you can't afford to run without professional testing. Testing is 99% of the achievement. In German they say: "Übung macht den Meister" (practice makes perfect). That is exactly the problem with government grants: they do not understand the need for testing at all. Getting testing time is impossible there. That's why the majority of all serious software doesn't run on such clusters. Jonathan Schaeffer told me: "You must run on what you can test." In general, however, you see the biggest nonsense getting written to get system time. Marketing departments are even worse, though. Like: "this new machine is going to research DNA" -- while only some 0.5% of all system time goes to such problems. A 2048-processor Itanium2 supercomputer with the most expensive quad-capable Itanium2 CPUs (the dual CPUs are cheaper) is a bit overkill for that. But well...
At 07:04 AM 3/14/2005 -0800, Jim Lux wrote: >I can't claim that I am successful at getting grants for clusters, >however... >If you can make a good case that a cluster will make it possible to solve >some other "important" problem, the odds go up greatly. Think of a cluster >as a tool, just like a microscope or an ultra centrifuge or a furnace. How >would you justify getting the budget for a big microscope (like a SEM)? > >The key is to have a problem that everyone wants to attack, and the cluster >being the way to attack it. You said you've been doing proof of concept.. >Is that to prove that you can build a cluster, or that you've demonstrated >some useful "work" with the cluster on a problem that someone is interested >in (i.e. for which there is funding available). > >Otherwise, you're a solution looking for a problem. > >Jim Lux >----- Original Message ----- >From: "Timo Mechler" >To: >Sent: Friday, March 11, 2005 7:19 PM >Subject: [Beowulf] Grants for Beowulf Clusters > > >> Hi all, >> >> I'm wondering what kind of success rate people are having with obtaining >> grants for Beowulf type Linux Clusters (for example, from the National >> Science Foundation). Let me give you a little bit more info as to why I'm >> asking this: I'm a junior undergraduate at a small liberal arts college >> in Iowa (~2600 students), and have solely been pursuing Beowulf clusters >> for well over a year now. I believe strongly that even though that my >> school is small, several departments on campus could benefit from the use >> of a beowulf cluster in the research that does go on. I've been using >> older, slower machines as a proof of concept for now. Ideally, we would >> want a faster beowulf system eventually that offers significant >> improvements over anything desktop pc's have to offer nowadays. 
Being > > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From diep at xs4all.nl Mon Mar 14 09:34:12 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] The move to gigabit - technical questions Message-ID: <3.0.32.20050314183411.0122fa88@pop.xs4all.nl> Good evening, It's interesting to investigate what gigabit can do for small home clusters. Any latency-oriented approach is obviously doomed to fail at gigabit, but the cards are cheap; for 40 euro I see several on offer already. The first important question is of course how much system time those NICs eat when fully loading their bandwidth. Example: I have an old dual K7 here with PCI 2.2 (32-bit, 33MHz). Suppose I put a gigabit card in it. In, say, 6 messages a second I ship 8MB of data at a time. Ship and receive in turn: it ships a packet of 8MB, then receives a packet of 8MB. Other than the cost of the thread that stores the packet to RAM, does such a card in any way stop or block the CPUs, which are 100% loaded with searching software (my chess program Diep in this case)? What penalty, other than that thread handling the message, is there in terms of system time taken away from the 2 searching processes? Oh btw, I assume that gigabit can handle 48MB of user data a second? Vincent From gotero at linuxprophet.com Mon Mar 14 22:43:06 2005 From: gotero at linuxprophet.com (Glen Otero) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] pe's with SGE 6.0 Message-ID: <847768a8906d4db72b7d9d073b3f9c21@linuxprophet.com> I think I broke something while playing with grid engine 6.0, pvm-3.4.4-19, and mpich2. Anyone have pvm and mpi/mpich templates that they know work in creating pe's with SGE 6.0? Thanks! Glen Glen Otero Ph.D. Linux Prophet -------------- next part -------------- A non-text attachment was scrubbed...
Name: not available Type: text/enriched Size: 275 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050314/85297c2c/attachment.bin From atp at piskorski.com Tue Mar 15 03:05:41 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] The move to gigabit - technical questions In-Reply-To: <3.0.32.20050314183411.0122fa88@pop.xs4all.nl> References: <3.0.32.20050314183411.0122fa88@pop.xs4all.nl> Message-ID: <20050315110541.GA68651@piskorski.com> On Mon, Mar 14, 2005 at 06:34:12PM +0100, Vincent Diepeveen wrote: > Good evening, > > It's interesting to investigate what gigabit can do for small home clusters. > > Any latency oriented approach is doomed to fail obviously at gigabit. But > they're cheap. For 40 euro i see several getting offered already. Which is 50 USD or so? You can sometimes get gigabit ethernet for MUCH cheaper than that if you look around, particularly on Ebay. E.g., I recently bought 10 Intel Pro/1000 MT cards (the lowest end 32 bit PCI desktop kind, but still) for $11 each, from an Ebay vendor who turned out to be local. -- Andrew Piskorski http://www.piskorski.com/ From dag at sonsorol.org Tue Mar 15 03:25:23 2005 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Re: pe's with SGE 6.0 In-Reply-To: <47678a7e1f54e22f79f188cae34ec482@linuxprophet.com> References: <47678a7e1f54e22f79f188cae34ec482@linuxprophet.com> Message-ID: <4236C623.3040401@sonsorol.org> Hi Glen, Parallel environments (PE's) are "mostly" the same in Grid Engine 6 vs 5.3 in my experience. The main "gotcha" difference is that in SGE 6 you tell the *queue* the list of PE's it is able to support, while in SGE 5 the opposite occurred -- the PE itself was configured with a list of queues that it was active in. The other addition is the "urgency_slots" param (I think), which was not in SGE 5.3.
If you had PE definitions or deployment scripts that worked in SGE 5.3 but not in 6 it may be due to the above. The "pe_list" parameter has moved from the PE object itself and into the queue configuration. For SGE 6 there are still the usual PVM and MPI templates and examples that come with the distribution. Just look in $SGE_ROOT/pvm/ and $SGE_ROOT/mpi/. Reuti also just updated the Grid Engine tight LAMMPI HOWTO which is here: http://gridengine.sunsource.net/project/gridengine/howto/lam-integration/lam-integration.html Back to PE's ... This is what a generic loosely integrated MPICH PE would look like in SGE 6: > workgroupcluster:~ admin$ qconf -sp mpich > pe_name mpich > slots 512 > user_lists NONE > xuser_lists NONE > start_proc_args /common/sge/mpi/startmpi.sh $pe_hostfile > stop_proc_args /common/sge/mpi/stopmpi.sh > allocation_rule $fill_up > control_slaves FALSE > job_is_first_task TRUE > urgency_slots min Note that there is no list of queues that the PE runs in. This has moved. The "pe_list" is now part of the queue configuration: > workgroupcluster:~ admin$ qconf -sq all.q > qname all.q > hostlist @allhosts > seq_no 0 > load_thresholds np_load_avg=1.75 > suspend_thresholds NONE > nsuspend 1 > suspend_interval 00:05:00 > priority 0 > min_cpu_interval 00:05:00 > processors UNDEFINED > qtype BATCH INTERACTIVE > ckpt_list NONE > pe_list make mpich > rerun FALSE < .... SNIP .... > I've tried to list the differences between Grid Engine 5 and Grid Engine 6 at this URL: http://bioteam.net/dag/gridengine-6-features.html Not sure if I got it all but feedback/corrections are welcome. Regards, Chris Glen Otero wrote: > I think I broke something while playing with grid engine 6.0, > pvm-3.4.4-19, and mpich2. Anyone have pvm and mpi/mpich templates that > they know work in creating pe's with SGE 6.0? > > Thanks! > > Glen > > Glen Otero Ph.D. 
> -- Chris Dagdigian, BioTeam - Independent life science IT & informatics consulting Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193 PGP KeyID: 83D4310E iChat/AIM: bioteamdag Web: http://bioteam.net From felix.rauch.valenti at gmail.com Tue Mar 15 04:51:49 2005 From: felix.rauch.valenti at gmail.com (Felix Rauch Valenti) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Terrible scaling with a Cisco Catalyst 2948G-GE-TX Switch In-Reply-To: References: Message-ID: <4eafc81b05031504517e478fa9@mail.gmail.com> On Fri, 11 Mar 2005 15:56:26 -0500, Mauricio Carrillo Tripp wrote: [...] > I recently got another 32 pc's to build a new cluster, but this time I had to > buy a more expensive switch (Cisco Catalyst 2948G-GE-TX, ~$3,000). To > my surprise, the scaling is just TERRIBLE, and after a lot of tests I finally > found that the problem is the switch (I'm not sure what or why exactly, though). [...] I don't know about the 2948G you mention above, but we had some serious performance problems with a 2900XL about 4 years ago. You can find my mails from back then in the list archives: http://www.beowulf.org/archive/2001-August/004688.html http://www.beowulf.org/archive/2002-May/007161.html http://www.scyld.com/pipermail/beowulf/2001-May/003763.html Basically the switch's performance broke down completely as soon as more than half of its ports were fully loaded with full-duplex communication.
- Felix From eugen at leitl.org Tue Mar 15 08:24:51 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] CFP for ISPA'05 [Deadline April 1, 2005] (fwd from rabenseifner@hlrs.de) Message-ID: <20050315162451.GK17303@leitl.org> ----- Forwarded message from Rolf Rabenseifner ----- From: Rolf Rabenseifner Date: Tue, 15 Mar 2005 17:07:08 +0100 (CET) To: eugen@leitl.org Subject: CFP for ISPA'05 [Deadline April 1, 2005] Dear HLRS User or member of my course-invitation-list, as member of the program committee, I'm sending you the CFP for the ISPA 2005. Best regards Rolf Rabenseifner ------------------------------------------------------------------------- Third International Symposium on Parallel and Distributed Processing and Applications (ISPA 2005) Nanjing, China, Nov. 2-5, 2005 URL: http://keysoftlab.nju.edu.cn/ispa2005/ Following the traditions of previous successful ISPA conferences, ISPA '03 (held in Aizu-Wakamatsu City, Japan) and ISPA '04 (held in Hong Kong), the objective of ISPA '05 is to provide a forum for scientists and engineers in academia and industry to exchange and discuss their experiences, new ideas, research results, and applications about all aspects of parallel and distributed computing and networking. ISPA '05 will feature session presentations, workshops, tutorials and keynote speeches. 
Topics of particular interest include, but are not limited to : Computer networks Network routing and communication algorithms Parallel/distributed system architectures Tools and environments for software development Parallel/distributed algorithms Parallel compilers Parallel programming languages Distributed systems Wireless networks, mobile and pervasive computing Reliability, fault-tolerance, and security Performance evaluation and measurements High-performance scientific and engineering computing Internet computing and Web technologies Database applications and data mining Grid and cluster computing Parallel/distributed applications High performance bioinformatics Submissions should include an abstract, key words, the e-mail address of the corresponding author, and must not exceed 15 pages, including tables and figures, with PDF, PostScript, or MS Word format. Electronic submission through the submission website is strongly encouraged. Hard copies will be accepted only if electronic submission is not possible. Submission of a paper should be regarded as an undertaking that, should the paper be accepted, at least one of the authors will register and attend the conference to present the work. Important Dates: Workshop proposals due: April 1, 2005 Paper submission due: April 1, 2005 Acceptance notification: July 1, 2005 Camera-ready due: July 30, 2005 Conference: Nov. 2-5, 2005 Publication: The proceedings of the symposium will be published in Springer's Lecture Notes in Computer Science. A selection of the best papers for the conference will be published in a special issue of The Journal of Supercomputing and International Journal of High Performance Computing and Networking (IJHPCN). 
General Co-Chairs: Jack Dongarra, University of Tennessee, USA Jiannong Cao, Hong Kong Polytechnic University, China Jian Lu, Nanjing University, China Program Co-Chair: Yi Pan, Georgia State University, USA Daoxu Chen, Nanjing University, China Vice Program Co-Chairs: Algorithms Ivan Stojmenovic, University of Ottawa, Canada Architecture and Networks Mohamed Ould-Khaoua, University of Glasgow, UK Middleware and Grid Computing Mark Baker, University of Portsmouth, UK Software Jingling Xue, University of New South Wales, Australia Applications Zhi-Hua Zhou, Nanjing University, China Steering Committee Co-Chairs Sartaj Sahni, University of Florida, USA Yaoxue Zhang, Ministry of Education, China Minyi Guo, University of Aizu, Japan Steering Committee: Jiannong Cao, Hong Kong PolyU, China Francis Lau, Univ. of Hong Kong, China Yi Pan, Georgia State Univ. USA Li Xie, Nanjing University, China Jie Wu, Florida Altantic Univ. USA Laurence T. Yang, St. Francis Xavier Univ. Canada Hans P. Zima, California Institute of Technology, USA Weiming Zheng, Tsinghua University, China Local Organizing Committee Co-Chairs: Xianglin Fei, Nanjing University, China Baowen Xu, Southeast University, China Ling Chen, Yangzhou University, China Workshop Chair Guihai Chen, Nanjing University, China Tutorial Chair Yuzhong Sun, Institute of Computing Technology, CAS, China Publicity Chair: Cho-Li Wang, Univ. of Hong Kong, China Publication Chair: Hui Wang, University of Aizu, Japan Registration Chair: Xianglin Fei, Nanjing University, China Program Committee: See web page http://keysoftlab.nju.edu.cn/ispa2005/ for details. ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050315/274090b5/attachment.bin From eugen at leitl.org Tue Mar 15 09:26:57 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] [Bioclusters] Direct connect infiniband/quadrics? (fwd from farul@aldrich.com.my) Message-ID: <20050315172657.GR17303@leitl.org> ----- Forwarded message from Farul Mohd Ghazali ----- From: Farul Mohd Ghazali Date: Wed, 16 Mar 2005 01:22:40 +0800 To: "Clustering, compute farming & distributed computing in life science informatics" Subject: [Bioclusters] Direct connect infiniband/quadrics? X-Mailer: Apple Mail (2.619.2) Reply-To: "Clustering, compute farming & distributed computing in life science informatics" Has anyone had any experience with a direct connect/point-to-point implementation of Quadrics or Infiniband? I talked to a small lab doing some computational chemistry and molecular dynamics work and they're interested in setting up a cluster, but there is a need to justify the cost of a cluster before the budget can be approved. During the discussion, the idea of using direct connect Infiniband or Quadrics on two dual or quad Opteron nodes came up as a testbed platform to justify to management. From a price point of view, this is very attractive since it'll probably cost less than $40,000 (two quad Opterons, two Quadrics cards) for a testbed system. Money is tight... So, is this setup workable? In theory this should be faster than a gigabit-based interconnect, even if it's just two nodes, but I'd welcome any other ideas/suggestions. Thanks. -- "Leadership & Life-long Learning" -- Farul Mohd. Ghazali Manager, Systems & Bioinformatics Open Source Systems Sdn. Bhd.
www.aldrich.com.my Tel: +603-8656 0139/29 Fax: +603-8656 0132 _______________________________________________ Bioclusters maillist - Bioclusters@bioinformatics.org https://bioinformatics.org/mailman/listinfo/bioclusters ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050315/b0429236/attachment.bin From gotero at linuxprophet.com Tue Mar 15 10:10:32 2005 From: gotero at linuxprophet.com (Glen Otero) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Re: [Bioclusters] Direct connect infiniband/quadrics? In-Reply-To: <6b93220479d766d0bcbc61d55542b657@aldrich.com.my> References: <9287779F96BDD311804100508B8B64230D4643E7@KANATAEXCHANGE1> <41E688D7.7080302@scalableinformatics.com> <41E68DCA.7080305@georgetown.edu> <6b93220479d766d0bcbc61d55542b657@aldrich.com.my> Message-ID: <493750150fb6900d633f50fc43108429@linuxprophet.com> Speaking to the workable question--I've built small clusters just like you described with Quadrics, Infiniband and Rocks. Quadrics support isn't built into Rocks, but the Quadrics folks typically make their software Rocks-aware. Infiniband support is supposedly built into Rocks, but I haven't heard any success stories with it. It seems that it's too Infinicon-centric. So you may be limited to using Infinicon gear to make that work. I have experience making the openIB stuff work with Rocks, so that is the path I would recommend. Mellanox and Topspin are good to work with, as is Quadrics, when it comes to shoe-horning their software onto Rocks clusters. You'll see better performance even with two nodes. 
But I'd take a hard look at what it will cost you to scale out with either interconnect and decide if the difference in latency is worth the difference in price. Glen On Mar 15, 2005, at 9:22 AM, Farul Mohd Ghazali wrote: > > Has anyone had any experience with a direct connect/point-to-point > implementation of Quadrics or Inifiniband? I talked to a small lab > doing some computational chemistry and molecular dynamics work and > they're interested in setting up a cluster but there is a need to > justify the cost of a cluster before the budget can be approved. > > During the discussion, the idea of using direct connect infiniband or > quadrics on two dual or quad Opteron nodes came up as a testbed > platform to justify to management. From a price point of view, this is > very attractive since it'll probably cost less than $40,000 (two quad > Opterons, two Quadrics cards) for a testbed system. Money is tight... > > So, is this setup workable? In theory this should be faster than a > gigabit based interconnect, even if it's just two nodes but I'd > welcome any other ideas/suggestions. Thanks. > > > -- "Leadership & Life-long Learning" -- > > Farul Mohd. Ghazali > Manager, Systems & Bioinformatics > Open Source Systems Sdn. Bhd. > www.aldrich.com.my Tel: +603-8656 0139/29 Fax: +603-8656 0132 > > _______________________________________________ > Bioclusters maillist - Bioclusters@bioinformatics.org > https://bioinformatics.org/mailman/listinfo/bioclusters > > Glen Otero Ph.D. Linux Prophet -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/enriched Size: 2264 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050315/4a571112/attachment.bin From bropers at cct.lsu.edu Mon Mar 14 06:23:24 2005 From: bropers at cct.lsu.edu (Brian D. 
Ropers-Huilman) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Grants for Beowulf Clusters In-Reply-To: <1883.172.17.11.39.1110597580.squirrel@172.17.11.39> References: <1883.172.17.11.39.1110597580.squirrel@172.17.11.39> Message-ID: <42359E5C.6020504@cct.lsu.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Timo, Unfortunately, the NSF and DOE wells are running dry. Actually, we are quite confused as to what these major organizations are doing in terms of funding and believe they do not have a clear strategy as of yet. We even had Peter Freeman down last fall to give a talk and still were not completely clear on the NSF direction, though this recent article might shed some light [ http://www.informationweek.com/story/showArticle.jhtml?articleID=159401335&tid=5978 ]. In our estimation, NIH, with a focus on bioinformatics, is the best likely source. We have also learned that major collaboratories are more likely to be funded. These agencies no longer want to fund department X's little cluster, department Y's little cluster, and then department Z's as well. They would prefer that the departments come together and get funding for a single larger system. I realize this does not directly answer your question, but thought I would provide my viewpoint. P.S. I was born and raised in Platteville, Wisconsin, not 100 miles from Decorah, through my undergraduate studies at Madison and my first several years in the corporate world. Both my and my partner's families are still there and we both still call Wisconsin "home." Timo Mechler said the following on 2005.03.11 21:19: > Hi all, > > I'm wondering what kind of success rate people are having with obtaining > grants for Beowulf type Linux Clusters (for example, from the National > Science Foundation). 
Let me give you a little bit more info as to why I'm > asking this: I'm a junior undergraduate at a small liberal arts college > in Iowa (~2600 students), and have solely been pursuing Beowulf clusters > for well over a year now. I believe strongly that even though my > school is small, several departments on campus could benefit from the use > of a beowulf cluster in the research that does go on. I've been using > older, slower machines as a proof of concept for now. Ideally, we would > want a faster beowulf system eventually that offers significant > improvements over anything desktop pc's have to offer nowadays. Being > that money is an issue at smaller schools, is there any way I could obtain a > grant for a beowulf cluster? If so, besides the NSF, what would be some > other sources to apply to? Since some of you guys on this list come from big > companies or Universities, I would appreciate any insight and suggestions > you can give me. All input is appreciated. Thanks in advance! > > Best Regards, > > -Timo Mechler > - -- Brian D. Ropers-Huilman .:. Asst. Director .:. HPC and Computation Center for Computation & Technology (CCT) bropers@cct.lsu.edu Johnston Hall, Rm. 350 +1 225.578.3272 (V) Louisiana State University +1 225.578.5362 (F) Baton Rouge, LA 70803-1900 USA http://www.cct.lsu.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (Darwin) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFCNZ5bwRr6eFHB5lgRAoX9AKDYFTwIM+DL1TUFdVnOmoQrtZ/HEgCeIo/s uEz8BodFo/g0N11CbQhomQA= =K8Vk -----END PGP SIGNATURE----- From Karl.Bellve at umassmed.edu Mon Mar 14 07:43:32 2005 From: Karl.Bellve at umassmed.edu (Karl Bellve) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Reference Paper Message-ID: <4235B124.8000705@umassmed.edu> Is there a Beowulf paper by Donald Becker that I could reference in an upcoming publication? -- Cheers, Karl Bellve, Ph.D.
ICQ # 13956200 Biomedical Imaging Group TLCA# 7938 University of Massachusetts Email: Karl.Bellve@umassmed.edu Phone: (508) 856-6514 Fax: (508) 856-1840 PGP Public key: finger kdb@molmed.umassmed.edu From aleahy at knox.edu Mon Mar 14 09:23:48 2005 From: aleahy at knox.edu (Andrew Leahy) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Grants for Beowulf Clusters In-Reply-To: <1883.172.17.11.39.1110597580.squirrel@172.17.11.39> References: <1883.172.17.11.39.1110597580.squirrel@172.17.11.39> Message-ID: <4235C8A4.70908@knox.edu> We had some success at getting funding from the NSF/CCLI program for a college-level beowulf cluster: https://www.fastlane.nsf.gov/servlet/showaward?award=0089045 It might be a good idea to search through NSF Fastlane to see what else they've funded through CCLI. I personally don't believe that a dedicated beowulf cluster is necessarily a good investment for a strictly undergraduate institution. We may be outliers among liberal arts colleges, but many of our scientists aren't all that computationally-minded and so far our cluster hasn't seen any use outside of the mathematics and computer science departments. However, our computer scientists were looking for a linux lab (on our exclusively Windows campus) for use in instruction. So we argued that we could build a dual use linux lab/beowulf cluster that we could hold classes in and use as a cluster. Technically, this isn't a beowulf cluster. However, we gave each system dual processors (reasoning that a student simply typing away on a document wouldn't really notice if one of the processors was working on a computationally intensive program at the same time) and we equipped each system with two network cards, one of which was hooked up to our own dedicated "beowulf" network. So you might say it was "beowulf-like". I can't say that our grant was a tremendous success, but that's largely for external reasons. 
We built it at the height of the ".com" era, when there were gobs of computer science students. As it was originally envisioned, the course we designed would be a numerical analysis course with an emphasis on parallel algorithms (a sexy subject for CS types--there isn't anything else about distributed memory programming in the curriculum) primarily for solving large systems of linear equations. However, when ".com" went bust, CS enrollments dried up and we haven't really had a big audience. Right now, I'm retooling the course to focus on applied partial differential equations, again with an emphasis on solving large systems of equations with parallel algorithms. We'll see if it can pick up a broader audience among science students in general. If anybody else has ideas for distributed computing topics that would go well in an undergraduate numerical analysis course, please let me know. Andrew Leahy Knox College Timo Mechler wrote: > Hi all, > > I'm wondering what kind of success rate people are having with obtaining > grants for Beowulf type Linux Clusters (for example, from the National > Science Foundation). Let me give you a little bit more info as to why I'm > asking this: I'm a junior undergraduate at a small liberal arts college > in Iowa (~2600 students), and have solely been pursuing Beowulf clusters > for well over a year now. I believe strongly that even though my > school is small, several departments on campus could benefit from the use > of a beowulf cluster in the research that does go on. I've been using > older, slower machines as a proof of concept for now. Ideally, we would > want a faster beowulf system eventually that offers significant > improvements over anything desktop pc's have to offer nowadays. Being > that money is an issue at smaller schools, is there any way I could obtain a > grant for a beowulf cluster? If so, besides the NSF, what would be some > other sources to apply to?
Since some of you guys on this list come from big > companies or Universities, I would appreciate any insight and suggestions > you can give me. All input is appreciated. Thanks in advance! > > Best Regards, > > -Timo Mechler > --- [This E-mail scanned for viruses by Declude Virus] From billharman at comcast.net Mon Mar 14 12:47:34 2005 From: billharman at comcast.net (Bill Harman) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Grants for Beowulf Clusters In-Reply-To: <1883.172.17.11.39.1110597580.squirrel@172.17.11.39> Message-ID: <200503142044.j2EKiXSf014963@bluewest.scyld.com> You can always try the Oak Ridge National Lab approach - The Stone SouperComputer - their original Beowulf cluster, which is no longer in operation, but was made up of used and discarded PC's. http://stonesoup.esd.ornl.gov/ Go around to different departments in the university or to local businesses and ask for their used equipment; who knows, maybe you can build a 64-node heterogeneous cluster at the right price. Bill Harman, Salt Lake City office P - (801) 572-9252 F - (801) 571-4927 wharman@prism.net billharman@comcast.net skype: harman8015729252 -----Original Message----- From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Timo Mechler Sent: Friday, March 11, 2005 8:20 PM To: beowulf@beowulf.org Subject: [Beowulf] Grants for Beowulf Clusters Hi all, I'm wondering what kind of success rate people are having with obtaining grants for Beowulf type Linux Clusters (for example, from the National Science Foundation). Let me give you a little bit more info as to why I'm asking this: I'm a junior undergraduate at a small liberal arts college in Iowa (~2600 students), and have solely been pursuing Beowulf clusters for well over a year now. I believe strongly that even though my school is small, several departments on campus could benefit from the use of a beowulf cluster in the research that does go on.
I've been using older, slower machines as a proof of concept for now. Ideally, we would want a faster beowulf system eventually that offers significant improvements over anything desktop pc's have to offer nowadays. Being that money is an issue at smaller schools, is there any way I could obtain a grant for a beowulf cluster? If so, besides the NSF, what would be some other sources to apply to? Since some of you guys on this list come from big companies or Universities, I would appreciate any insight and suggestions you can give me. All input is appreciated. Thanks in advance! Best Regards, -Timo Mechler -- Timo R. Mechler mechti01@luther.edu _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From Glen.Gardner at verizon.net Mon Mar 14 14:41:37 2005 From: Glen.Gardner at verizon.net (Glen Gardner) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] The move to gigabit - technical questions References: <3.0.32.20050314183411.0122fa88@pop.xs4all.nl> Message-ID: <42361321.5070402@verizon.net> Gigabit will be a little faster than 100Mbit on a small cluster, but not a lot. I ended up using 5 cheap gigabit switches to make a gigabit concentrator for my 12 node cluster. It eliminated the tendency for the network to saturate under a heavy load. It also let me use gigabit network cards in my I/O node and controlling node with a small improvement in file I/O. The compute nodes remained at 100 Mbit to conserve power. The setup works rather nicely. Glen Vincent Diepeveen wrote: >Good evening, > >It's interesting to investigate what gigabit can do for small home clusters. > >Any latency oriented approach is doomed to fail obviously at gigabit. But >they're cheap. For 40 euro i see several getting offered already. > >First important question is of course how much system time those NIC's eat >when fully loading their bandwidth.
> >Example, i have an old dual k7 here with pci 2.2 (32 bits 33Mhz). >Suppose i put a gigabit card in it. > >In say 6 messages a second i ship 8MB data at a time. Ship and send in turn. > >So it ships a packet of 8MB, then receives a packet of 8MB. > >Other than the cost of the thread to store the packet to RAM, does such a >card in any way stop or block the cpu's which are 100% loaded with >searching software (my chessprogram diep in this case)? > >What penalty other than that thread handling the message is there in terms >of system time reduction to the 2 processes searching? > >Oh btw, i assume that gigabit can handle 48MB/s user data a second? > >Vincent > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > -- Glen E. Gardner, Jr. AA8C AMSAT MEMBER 10593 Glen.Gardner@verizon.net http://members.bellatlantic.net/~vze24qhw/index.html From jzamor at gmail.com Mon Mar 14 23:51:36 2005 From: jzamor at gmail.com (Josh Zamor) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Seg Fault with pvm_upkstr() and Linux. Message-ID: <19430ea71246a4e9a13899873914bff7@gmail.com> I have just started in on programming using PVM and have run into an odd problem. I have written a C (c99) program that calculates the factorial of a number by calculating parts of the range. I'm using GMP for dealing with large numbers (I have done this program successfully before using numerous methods including pthreads). The basic way it works for the cluster is that the program starts on a machine, determines the subrange to be calculated for each task and then waits for each process to come back with the answer for its subrange. The main process then finds the total result of the factorial from multiplying the subrange results back together... Pretty standard...
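[The partitioning scheme described above can be sketched in a few lines; this is an illustrative Python version rather than the poster's C/GMP/PVM code, with made-up function names:]

```python
from math import prod

def partial_factorial(lo, hi):
    # Product of lo..hi inclusive; one worker task computes one subrange.
    return prod(range(lo, hi + 1))

def parallel_factorial(n, ntasks):
    # Split 1..n into ntasks contiguous subranges, as each task would get.
    bounds = [1 + i * n // ntasks for i in range(ntasks)] + [n + 1]
    parts = [partial_factorial(bounds[i], bounds[i + 1] - 1)
             for i in range(ntasks)]
    # The parent multiplies the subrange results back together.
    return prod(parts)

print(parallel_factorial(10, 3))  # -> 3628800, i.e. 10!
```

[Each `partial_factorial` call corresponds to one worker task; only the combine step needs all the results back at the parent, which is why the scheme is "pretty standard" master/worker parallelism.]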
The problem is that after a subrange is calculated by a task the result is put into a character array (null terminated, created from GMP's mpz_get_str()), it is packaged using pvm_pkstr() and sent to the parent. The parent can receive this using pvm_recv(), but as soon as it tries to store this into a character array in the program using pvm_upkstr(), or explore it with pvm_bufinfo(), it segfaults. I've also tried this in a couple of different ways: passing ints works, but passing strings (either large or small) results in a segfault. Of the 2 machines in my "cluster" learning setup, one is running Mac OS X and the other is running Gentoo Linux. It is only the Linux box that segfaults; the OS X box finds the answer correctly. The segfault happens whenever the Linux box is part of the PVM created cluster, or when PVM is run alone (no other boxes in the cluster) on the Linux box. If anyone has any idea why this is happening, or why it is only happening on the Linux box, I would be grateful. Thanks, tech details follow: Gentoo Linux Box: Proc: AMD AthlonXP 1700+ RAM: 512MB. Kernel: 2.4.26-gentoo-r9 PVM: 3.4.5 GMP: 4.1.4 GCC: 3.3.5 Mac OSX: Proc: G4 1GHz RAM: 768MB Kernel: 10.3.8, Mach 7.8.0 PVM: 3.4.5 GMP: 4.1.4 GCC: 3.3 (20030304) Thanks again. Regards, -J Zamor jzamor@gmail.com From bjornts at mi.uib.no Tue Mar 15 05:47:53 2005 From: bjornts at mi.uib.no (Bjorn Tore Sund) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Re: Grants for Beowulf Clusters In-Reply-To: <200503142001.j2EK0FuW014048@bluewest.scyld.com> References: <200503142001.j2EK0FuW014048@bluewest.scyld.com> Message-ID: On Mon, 14 Mar 2005 beowulf-request@beowulf.org wrote: > Message: 4 > Date: Mon, 14 Mar 2005 07:04:53 -0800 > From: "Jim Lux" > Subject: Re: [Beowulf] Grants for Beowulf Clusters > To: "Timo Mechler" , > Message-ID: <003401c528a7$2db91180$30f49580@LAPTOP152422> > Content-Type: text/plain; charset="iso-8859-1" > > I can't claim that I am successful at getting grants for clusters, > however...
> If you can make a good case that a cluster will make it possible to solve > some other "important" problem, the odds go up greatly. Think of a cluster > as a tool, just like a microscope or an ultra centrifuge or a furnace. How > would you justify getting the budget for a big microscope (like a SEM)? > > The key is to have a problem that everyone wants to attack, and the cluster > being the way to attack it. You said you've been doing proof of concept.. > Is that to prove that you can build a cluster, or that you've demonstrated > some useful "work" with the cluster on a problem that someone is interested > in (i.e. for which there is funding available). > > Otherwise, you're a solution looking for a problem. Moving further away from the original question, I'd like to expand on the above. A Beowulf cluster can be used to solve problems. You need the problem. You also need to make sure that any and all clusters you get have the necessary usage volume to warrant the purchase. Assume you've successfully argued/proved that the problem you're trying to solve can be addressed by using a Beowulf cluster. Do you have the human resources to actually do so? There have been several cases of people getting clusters to address specific problems, only to discover they don't have time to learn how to code properly for them, nor does anyone else. And if your funding source is then of a type that frowns on their money being spent with nothing really happening as a result, finding funding for more useful stuff later isn't going to be easy. Bjørn -- "Stupidity is like a fractal; universal and infinitely repetitive." Bjørn Tore Sund, System administrator, Math. Department Phone: (+47) 555-84894 Fax: (+47) 555-89672 Mobile: (+47) 918 68075
University of Bergen VIP: 81724 Support: system@mi.uib.no Contact: teknisk@mi.uib.no Direct: bjornts@mi.uib.no From diep at xs4all.nl Tue Mar 15 12:11:44 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] The move to gigabit - technical questions Message-ID: <3.0.32.20050315211144.016291f0@pop.xs4all.nl> At 05:41 PM 3/14/2005 -0500, Glen Gardner wrote: >Gigabit will be a little faster than 100Mbit on a small cluster, but not >a lot. What is 'not a lot'. I would guess it's factor 10 faster in bandwidth? >I ended up using 5 cheap gigabit switches to make a gigabit concentrator >for my 12 node cluster. >It eliminated the tendency for the network to saturate under a heavy load. Very interesting, can you post a connection scheme and routing table? >It also let me use gigabit network cards in my I/O node and controlling >node with a small improvement in file I/O. Streaming i/o or random access? cheapo disk arrays get what is it, 400MB/s handsdown or so? that's raid5 readspeed, plenty security at a raid5 array. >The compute nodes remaind with 100 Mbit to conserve power. The setup >works rather nicely. what type of software do you run at it, embarrassingly parallel software? Vincent >Glen > >Vincent Diepeveen wrote: > >>Good evening, >> >>It's interesting to investigate what gigabit can do for small home clusters. >> >>Any latency oriented approach is doomed to fail obviously at gigabit. But >>they're cheap. For 40 euro i see several getting offered already. >> >>First important question is of course how much system time those NIC's eat >>when fully loading their bandwidth. >> >>Example, i have an old dual k7 here with pci 2.2 (32 bits 33Mhz). >>Suppose i put a gigabit card in it. >> >>In say 6 messages a second i ship 8MB data at a time. Ship and send in turn. >> >>So it ships a packet of 8MB, then receives a packet of 8MB. 
>> >>Other than the cost of the thread to store the packet to RAM, does such a >>card in any way stop or block the cpu's which are 100% loaded with >>searching software (my chessprogram diep in this case)? >> >>What penalty other than that thread handling the message is there in terms >>of system time reduction to the 2 processes searching? >> >>Oh btw, i assume that gigabit can handle 48MB/s user data a second? >> >>Vincent >> >>_______________________________________________ >>Beowulf mailing list, Beowulf@beowulf.org >>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> >> >> > >-- >Glen E. Gardner, Jr. >AA8C >AMSAT MEMBER 10593 >Glen.Gardner@verizon.net > > >http://members.bellatlantic.net/~vze24qhw/index.html > > > > > From james.p.lux at jpl.nasa.gov Tue Mar 15 13:09:00 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] The move to gigabit - technical questions References: <3.0.32.20050315211144.016291f0@pop.xs4all.nl> Message-ID: <004201c529a3$35e70db0$32a8a8c0@LAPTOP152422> ----- Original Message ----- From: "Vincent Diepeveen" To: "Glen Gardner" Cc: ; Sent: Tuesday, March 15, 2005 12:11 PM Subject: Re: [Beowulf] The move to gigabit - technical questions > At 05:41 PM 3/14/2005 -0500, Glen Gardner wrote: > >Gigabit will be a little faster than 100Mbit on a small cluster, but not > >a lot. > > What is 'not a lot'. > > I would guess it's factor 10 faster in bandwidth? I would guess it's not 10 times faster (leaving aside latency and bus bandwidth to the interface issues). I don't know about the details, but it's real common to have some sort of synchronization/equalization sequence at the front of the packet that runs at a lower bit rate than the payload rate. Wired "thicknet" ethernet, as I recall, had a 64 bit alternating 1/0 preamble before the actual packet contents w/header. 
It could be adequately modeled, though, as some sort of fixed per packet overhead time. This is especially true for wireless LANs (I know that not many clusters use these, but as they get faster, and there are more channels available, it gets attractive.. no cables!). 802.11a/b/g always starts at 2 Mb/sec for the preamble, and then shifts to a faster modulation (depending on the propagation). Jim Lux From brian at cypher.acomp.usf.edu Tue Mar 15 15:21:45 2005 From: brian at cypher.acomp.usf.edu (Brian R Smith) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Folding@Home on a Beowulf? In-Reply-To: <4233B2D3.7000807@spiekerfamily.com> References: <4233B2D3.7000807@spiekerfamily.com> Message-ID: <42376E09.30207@cypher.acomp.usf.edu> Not really a beowulf question, but i'll bite. Some time ago, we had a condor grid at my center with some pretty wild machines. The problem was that no one had any use for it (they were obsessed with running their projects on our beowulf despite the fact that they were mostly single processor jobs). So i decided i'd run folding@home on 6 of the nodes, just to put them to use somehow (I usually use one of our _true_ beowulfs for my work and anything serious). Just set up f@h on each machine, giving each a different machine id number; 1-6 should suffice. As for getting the data sets, you should probably find some way of getting these nodes online... Do you have a switch or some extra ethernet ports? Maybe use one of the machines as a router? You could also whip up a script that would "pretend" to be each instance of a machine id (1-6), receive the data, then place it in the proper directory in each node. Your script would run on a machine with internet access. It would start a copy of f@h as a particular machine id (i think you can do this by feeding it different config files). Once it has retrieved the data, kill the process, move the data to one of your nodes and repeat. As for sending the data back... i'm at a loss.
Its been a while since I've run this, as it is more of a distributed computing project than a beowulf-ready application, so i might be a little off on this. However, you get the general idea. -brian Jake Thebault-Spieker wrote: >-----BEGIN PGP SIGNED MESSAGE----- >Hash: SHA1 > >Does anybody have any experience with Folding@Home >(http://folding.stanford.edu)? I'd like to run it on my six node 133MHz >CPU, but my cluster won't be online. Is there a way to get the folding >jobs another way? Like downloading them at a different location, then >transferring them to the cluster? > >- -- >I think computer viruses should count as life. >I think it says something about human nature >that the only form of life we have created so far is purely destructive. >We've created life in our own image. >- --Stephen Hawking > >010010100110000101101 >011011001010010000001 >010100011010000110010 >101100010011000010111 >010101101100011101000 >010110101010011011100 >000110100101100101011 >010110110010101110010 > >Jake Thebault-Spieker >-----BEGIN PGP SIGNATURE----- >Version: GnuPG v1.2.5 (MingW32) >Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org > >iD8DBQFCM7LTI2YvXV9Bxi0RApkcAJ93zARE+h0iqhrqAmP0JbCxXSgdnwCg4eWw >HuIQGgZXjZmXn99FcUKNFlc= >=opi/ >-----END PGP SIGNATURE----- > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From rgb at phy.duke.edu Wed Mar 16 06:30:41 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] Seg Fault with pvm_upkstr() and Linux. In-Reply-To: <19430ea71246a4e9a13899873914bff7@gmail.com> References: <19430ea71246a4e9a13899873914bff7@gmail.com> Message-ID: On Tue, 15 Mar 2005, Josh Zamor wrote: > I have just started in on programming using PVM and have run into an > odd problem. 
I have written a C (c99) program that calculates the > factorial of a number by calculating parts of the range. I'm using the > GMP for dealing with large numbers (I have done this program > successfully before using numerous methods including pthreads). The > basic way it works for the cluster is that the program starts on a > machine, determines the subrange to be calculated for each task and > then waits for each process to come back with the answer for it's > subrange. The main process then finds the total result of the factorial > from multiplying the subrange results back together... Pretty > standard... > > The problem is that after a subrange is calculated by a task the result > is put into a character array (null terminated, created from GMP's > mpz_getstr()), it is packaged using pvm_pkstr() and sent to the parent. > The parent can receive this using pvm_recv(), but as soon as it tries > to store this into a character array in the program using pvm_upkstr(), > or explore it with pvm_bufinfo(), it segfaults. > > I've also tried this in a couple of different ways, passing ints, it > works, but passing strings (either large or small) results in a > segfault. This SOUNDS like programming error -- using a pointer as an int or vice versa. I'd do two things -- look at the actual result produced by GMP on the client side in some detail -- dumping it bytewise a character at a time isn't that dumb an idea. GMP introduces all sorts of new types, and I'll BET that these types are structs, not the actual data. So is the result a normal pointer-addressable string or a struct? Maybe what you are returning is a container for a pointer to anonymous memory on the client, not the contents of that memory... (Note that I've never used GMP so don't know, but you definitely need to check to make sure that what you are returning is an actual complete data object and not a container e.g. a struct or linked list). 
I assume that you've experimented and have no difficulty returning and unpacking ordinary ints, strings, or raw data blocks with PVM. If so you probably aren't making a pointer error on the master server side, although it never hurts to check. If you want other eyes on your actual code (might be useful if it is indeed programmer error) please post. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Wed Mar 16 06:41:40 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] The move to gigabit - technical questions In-Reply-To: <3.0.32.20050315211144.016291f0@pop.xs4all.nl> References: <3.0.32.20050315211144.016291f0@pop.xs4all.nl> Message-ID: On Tue, 15 Mar 2005, Vincent Diepeveen wrote: > At 05:41 PM 3/14/2005 -0500, Glen Gardner wrote: > >Gigabit will be a little faster than 100Mbit on a small cluster, but not > >a lot. > > What is 'not a lot'. > > I would guess it's factor 10 faster in bandwidth? (Maybe, you don't get QUITE 100% of the raw clock advantage in all applications on all hardware, Vincent;-). However, for most applications on most hardware you >>should<< get a significant advantage -- 80-95% of 10x, or 8-9.5x. Not just "a little". A really, really cheap switch might have problems with bisection bandwidth and chop this down for simultaneous flat-out bidirectional data streams, but relatively few parallel applications engage in flat-out bidirectional communications. Even if it does, your problem is more likely to be with resource contention (e.g. two hosts trying to talk to a third at the same time) than it is with actual bandwidth oversubscription. This is what Vincent is suggesting that you look into (or let us look into:-) below.
If your particular usage pattern does create resource contention, then you might well need to either hand-optimize the pattern to avoid saturating your cheap hardware, create a network with cheap components that effectively breaks up the pathological communications pattern (which it sounds like is what you actually did) or buy better hardware (either better gigE switches or a "real" HPC network). However you shouldn't really trash gigE itself -- it isn't at fault and your results aren't typical. rgb > > >I ended up using 5 cheap gigabit switches to make a gigabit concentrator > >for my 12 node cluster. > >It eliminated the tendency for the network to saturate under a heavy load. > > Very interesting, can you post a connection scheme and routing table? > > >It also let me use gigabit network cards in my I/O node and controlling > >node with a small improvement in file I/O. > > Streaming i/o or random access? > > cheapo disk arrays get what is it, 400MB/s handsdown or so? > > that's raid5 readspeed, plenty security at a raid5 array. > > >The compute nodes remaind with 100 Mbit to conserve power. The setup > >works rather nicely. > > what type of software do you run at it, > embarrassingly parallel software? > > Vincent > > >Glen > > > >Vincent Diepeveen wrote: > > > >>Good evening, > >> > >>It's interesting to investigate what gigabit can do for small home clusters. > >> > >>Any latency oriented approach is doomed to fail obviously at gigabit. But > >>they're cheap. For 40 euro i see several getting offered already. > >> > >>First important question is of course how much system time those NIC's eat > >>when fully loading their bandwidth. > >> > >>Example, i have an old dual k7 here with pci 2.2 (32 bits 33Mhz). > >>Suppose i put a gigabit card in it. > >> > >>In say 6 messages a second i ship 8MB data at a time. Ship and send in turn. > >> > >>So it ships a packet of 8MB, then receives a packet of 8MB. 
> >> > >>Other than the cost of the thread to store the packet to RAM, does such a > >>card in any way stop or block the cpu's which are 100% loaded with > >>searching software (my chessprogram diep in this case)? > >> > >>What penalty other than that thread handling the message is there in terms > >>of system time reduction to the 2 processes searching? > >> > >>Oh btw, i assume that gigabit can handle 48MB/s user data a second? > >> > >>Vincent > >> > >>_______________________________________________ > >>Beowulf mailing list, Beowulf@beowulf.org > >>To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > >> > >> > >> > > > >-- > >Glen E. Gardner, Jr. > >AA8C > >AMSAT MEMBER 10593 > >Glen.Gardner@verizon.net > > > > > >http://members.bellatlantic.net/~vze24qhw/index.html > > > > > > > > > > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Wed Mar 16 07:52:07 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:54 2009 Subject: [Beowulf] The move to gigabit - technical questions In-Reply-To: References: <3.0.32.20050315211144.016291f0@pop.xs4all.nl> Message-ID: On Wed, 16 Mar 2005, Robert G. Brown wrote: > On Tue, 15 Mar 2005, Vincent Diepeveen wrote: > > > At 05:41 PM 3/14/2005 -0500, Glen Gardner wrote: > > >Gigabit will be a little faster than 100Mbit on a small cluster, but not > > >a lot. > > > > What is 'not a lot'. > > > > I would guess it's factor 10 faster in bandwidth? I hate to reply to myself, but I should have noted that the below applies to BANDWIDTH, not latency, dominated communications. It was implicit from Vincent's reply, but I should have made it explicit. For lots of small packets gigabit's advantage probably won't be 10x, and this is another case where a higher-end network is indicated.
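[The small-packet point can be made concrete with a first-order model in which each message pays a fixed per-message overhead plus a size-dependent transmission time. The 50-microsecond overhead below is an assumed, typical-order figure for commodity gigabit of that era, not a measurement:]

```python
def effective_bw(msg_bytes, link_bytes_per_s, overhead_s):
    """First-order model: transfer time = fixed per-message overhead + size/bandwidth."""
    return msg_bytes / (overhead_s + msg_bytes / link_bytes_per_s)

OVERHEAD = 50e-6        # assumed per-message overhead (50 us), illustrative only
FAST_ETH = 100e6 / 8    # 100 Mbit/s link rate in bytes/s
GIG_ETH = 1e9 / 8       # 1 Gbit/s link rate in bytes/s

for size in (64, 8 * 2**20):  # a tiny packet vs. Vincent's 8 MB messages
    ratio = (effective_bw(size, GIG_ETH, OVERHEAD)
             / effective_bw(size, FAST_ETH, OVERHEAD))
    print(f"{size:>8} B messages: gigabit is {ratio:.1f}x faster")
```

[Under these assumed numbers, 8 MB messages recover nearly the full 10x factor while 64-byte messages gain almost nothing, consistent with both Glen's "not a lot" experience and rgb's bandwidth-vs-latency distinction.]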
However, the latency probably won't change a lot with different switches or switch arrangements, either, except for the worse along paths with multiple switch hops in between. I should also have pointed out to the original poster that there are nice tools (e.g. netperf, netpipe, lmbench) that will help him analyze his raw network performance outside of a particular application that might well have poor "networking" performance for reasons that have nothing to do with the actual network. There are also lots of articles out there both in the list archives, in Cluster World magazine back issues, in linux magazine back issues, and on various websites (including mine and brahma's) that can really help one understand just what ethernet is and how it works and what its numbers should be. It is the most widely implemented and widely understood network, good, bad, and ugly features notwithstanding. rgb > > (Maybe, you don't get QUITE 100% of the raw clock advantage in all > applications on all hardware, Vincent;-). However, for most > applications on most hardware you >>should<< get a signficant advantage > -- 80-95% of 10x, or 8-9.5x. Not a just "a little". > > A really, really cheap switch might have problems with bisection > bandwidth and chop this down for simultaneous flat-out bidirectional > data streams, but relatively few parallel applications engage in > flat-out bidirectional communications. Even if it does, your problem is > more likely to be with resource contention (e.g. two hosts trying to > talk to a third at the same time) than it is with actual bandwidth > oversubscription. This is what Vincent is suggesting that you look into > (or let us look into:-) below. 
> > If your particular usage pattern does create resource contention, then > you might well need to either hand-optimize the pattern to avoid > saturating your cheap hardware, create a network with cheap components > that effectively breaks up the pathological communications pattern > (which it sounds like is what you actually did) or buy better hardware > (either better gigE switches or a "real" HPC network). > > However you shouldn't really trash gigE itself -- it isn't at fault and > your results aren't typical. > > rgb > > > > > >I ended up using 5 cheap gigabit switches to make a gigabit concentrator > > >for my 12 node cluster. > > >It eliminated the tendency for the network to saturate under a heavy load. > > > > Very interesting, can you post a connection scheme and routing table? > > > > >It also let me use gigabit network cards in my I/O node and controlling > > >node with a small improvement in file I/O. > > > > Streaming i/o or random access? > > > > cheapo disk arrays get what is it, 400MB/s handsdown or so? > > > > that's raid5 readspeed, plenty security at a raid5 array. > > > > >The compute nodes remaind with 100 Mbit to conserve power. The setup > > >works rather nicely. > > > > what type of software do you run at it, > > embarrassingly parallel software? > > > > Vincent > > > > >Glen > > > > > >Vincent Diepeveen wrote: > > > > > >>Good evening, > > >> > > >>It's interesting to investigate what gigabit can do for small home clusters. > > >> > > >>Any latency oriented approach is doomed to fail obviously at gigabit. But > > >>they're cheap. For 40 euro i see several getting offered already. > > >> > > >>First important question is of course how much system time those NIC's eat > > >>when fully loading their bandwidth. > > >> > > >>Example, i have an old dual k7 here with pci 2.2 (32 bits 33Mhz). > > >>Suppose i put a gigabit card in it. > > >> > > >>In say 6 messages a second i ship 8MB data at a time. Ship and send in turn. 
> > >> > > >>So it ships a packet of 8MB, then receives a packet of 8MB. > > >> > > >>Other than the cost of the thread to store the packet to RAM, does such a > > >>card in any way stop or block the cpu's which are 100% loaded with > > >>searching software (my chessprogram diep in this case)? > > >> > > >>What penalty other than that thread handling the message is there in terms > > >>of system time reduction to the 2 processes searching? > > >> > > >>Oh btw, i assume that gigabit can handle 48MB/s user data a second? > > >> > > >>Vincent > > >> > > >>_______________________________________________ > > >>Beowulf mailing list, Beowulf@beowulf.org > > >>To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > >> > > >> > > >> > > > > > >-- > > >Glen E. Gardner, Jr. > > >AA8C > > >AMSAT MEMBER 10593 > > >Glen.Gardner@verizon.net > > > > > > > > >http://members.bellatlantic.net/~vze24qhw/index.html > > > > > > > > > > > > > > > > > > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From diep at xs4all.nl Wed Mar 16 08:10:33 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] The move to gigabit - technical questions Message-ID: <3.0.32.20050316171033.0162d318@pop.xs4all.nl> At 10:52 AM 3/16/2005 -0500, Robert G. Brown wrote: >On Wed, 16 Mar 2005, Robert G. Brown wrote: > >> On Tue, 15 Mar 2005, Vincent Diepeveen wrote: >> >> > At 05:41 PM 3/14/2005 -0500, Glen Gardner wrote: >> > >Gigabit will be a little faster than 100Mbit on a small cluster, but not >> > >a lot. >> > >> > What is 'not a lot'. >> > >> > I would guess it's factor 10 faster in bandwidth? > >I hate to reply to myself, but I should have noted that the below >applies to BANDWIDTH, not latency, dominated communictions. 
Well Robert, it's obvious you understood me correctly. I was talking about bandwidth, and I find a factor 8-9 higher bandwidth from moving 100 Mbit to 1 Gbit a *considerable* jump forward, especially because the price of such network cards is one that everyone can afford. For latency and even more bandwidth we all know there are highend network cards like Dolphin, Quadrics, Myrinet and hopefully also InfiniBand (I see a lot of postings regarding InfiniBand on the Linux kernel list; which are the actual *brands* that sell a concrete card right now which can be bought stand-alone? If so, URL?). For distributed shared memory (DSM), which supercomputers need, there is Quadrics. Is there any other highend network card where one can approach the RAM of the cards directly (like the Cray shmem library Quadrics has; I can find it on their homepage)? Obviously for any latency issue one doesn't buy cheapo gigabit cards. So the remaining interesting thing is what kind of bandwidth they can give us. On my 100 Mbit LANs I measured that I could effectively put roughly 60 Mbit of throughput on them (bidirectional, so 60 Mbit in total of sends and receives). An 8.5 to 9 times higher bandwidth would then give roughly 480 Mbit, which is about 60 MB/s. Obviously there will be theoretical test setups getting more; that is not so interesting when discussing the practical behavior of LANs in operational systems. Actually 60 MB/s would be too little for my chess program. lmbench is not so impressive as a benchmark: it doesn't measure TLB thrashing of main memory. Over the past years I've had so many discussions with professors who do not understand the difference between the latency lmbench gives them and their actual application, which is just busy TLB thrashing that lmbench doesn't measure very accurately. An example: on my dual K7, a single read of 8 bytes in a 400MB buffer eats up about 400 ns on average, using my own simplistic testset. 
Such researchers, however, work on paper with the ~60 ns that lmbench gives them. I feel the word 'latency' has been overused in that respect; the word 'bandwidth', however, is very clear in this context. >implicit from Vincent's reply, but I should have made it explicit. For >lots of small packets gigabit's advantage probably won't be 10x, and >this is another case where a higher-end network is indicated. However, >the latency probably won't change a lot with different switches or >switch arrangements, either, except for the worse along paths with >multiple switch hops in between. > >I should also have pointed out to the original poster that there are >nice tools (e.g. netperf, netpipe, lmbench) that will help him analyze >his raw network performance outside of a particular application that >might well have poor "networking" performance for reasons that have >nothing to do with the actual network. There are also lots of articles >out there both in the list archives, in Cluster World magazine back >issues, in linux magazine back issues, and on various websites >(including mine and brahma's) that can really help one understand just >what ethernet is and how it works and what its numbers should be. It is >the most widely implemented and widely understood network, good, bad, >and ugly features notwithstanding. > > rgb > >> >> (Maybe, you don't get QUITE 100% of the raw clock advantage in all >> applications on all hardware, Vincent;-). However, for most >> applications on most hardware you >>should<< get a signficant advantage >> -- 80-95% of 10x, or 8-9.5x. Not a just "a little". >> >> A really, really cheap switch might have problems with bisection >> bandwidth and chop this down for simultaneous flat-out bidirectional >> data streams, but relatively few parallel applications engage in >> flat-out bidirectional communications. Even if it does, your problem is 
two hosts trying to >> talk to a third at the same time) than it is with actual bandwidth >> oversubscription. This is what Vincent is suggesting that you look into >> (or let us look into:-) below. >> >> If your particular usage pattern does create resource contention, then >> you might well need to either hand-optimize the pattern to avoid >> saturating your cheap hardware, create a network with cheap components >> that effectively breaks up the pathological communications pattern >> (which it sounds like is what you actually did) or buy better hardware >> (either better gigE switches or a "real" HPC network). >> >> However you shouldn't really trash gigE itself -- it isn't at fault and >> your results aren't typical. >> >> rgb >> >> > >> > >I ended up using 5 cheap gigabit switches to make a gigabit concentrator >> > >for my 12 node cluster. >> > >It eliminated the tendency for the network to saturate under a heavy load. >> > >> > Very interesting, can you post a connection scheme and routing table? >> > >> > >It also let me use gigabit network cards in my I/O node and controlling >> > >node with a small improvement in file I/O. >> > >> > Streaming i/o or random access? >> > >> > cheapo disk arrays get what is it, 400MB/s handsdown or so? >> > >> > that's raid5 readspeed, plenty security at a raid5 array. >> > >> > >The compute nodes remaind with 100 Mbit to conserve power. The setup >> > >works rather nicely. >> > >> > what type of software do you run at it, >> > embarrassingly parallel software? >> > >> > Vincent >> > >> > >Glen >> > > >> > >Vincent Diepeveen wrote: >> > > >> > >>Good evening, >> > >> >> > >>It's interesting to investigate what gigabit can do for small home clusters. >> > >> >> > >>Any latency oriented approach is doomed to fail obviously at gigabit. But >> > >>they're cheap. For 40 euro i see several getting offered already. 
>> > >> >> > >>First important question is of course how much system time those NIC's eat >> > >>when fully loading their bandwidth. >> > >> >> > >>Example, i have an old dual k7 here with pci 2.2 (32 bits 33Mhz). >> > >>Suppose i put a gigabit card in it. >> > >> >> > >>In say 6 messages a second i ship 8MB data at a time. Ship and send in turn. >> > >> >> > >>So it ships a packet of 8MB, then receives a packet of 8MB. >> > >> >> > >>Other than the cost of the thread to store the packet to RAM, does such a >> > >>card in any way stop or block the cpu's which are 100% loaded with >> > >>searching software (my chessprogram diep in this case)? >> > >> >> > >>What penalty other than that thread handling the message is there in terms >> > >>of system time reduction to the 2 processes searching? >> > >> >> > >>Oh btw, i assume that gigabit can handle 48MB/s user data a second? >> > >> >> > >>Vincent >> > >> >> > >>_______________________________________________ >> > >>Beowulf mailing list, Beowulf@beowulf.org >> > >>To change your subscription (digest mode or unsubscribe) visit >> > http://www.beowulf.org/mailman/listinfo/beowulf >> > >> >> > >> >> > >> >> > > >> > >-- >> > >Glen E. Gardner, Jr. >> > >AA8C >> > >AMSAT MEMBER 10593 >> > >Glen.Gardner@verizon.net >> > > >> > > >> > >http://members.bellatlantic.net/~vze24qhw/index.html >> > > >> > > >> > > >> > > >> > > >> > >> >> > >-- >Robert G. Brown http://www.phy.duke.edu/~rgb/ >Duke University Dept. of Physics, Box 90305 >Durham, N.C. 27708-0305 >Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu > > > > From James.P.Lux at jpl.nasa.gov Wed Mar 16 09:22:29 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] The move to gigabit - technical questions In-Reply-To: References: <3.0.32.20050315211144.016291f0@pop.xs4all.nl> Message-ID: <6.1.1.1.2.20050316092047.0282ae10@mail.jpl.nasa.gov> At 06:41 AM 3/16/2005, Robert G. 
Brown wrote: >On Tue, 15 Mar 2005, Vincent Diepeveen wrote: > > > At 05:41 PM 3/14/2005 -0500, Glen Gardner wrote: > > >Gigabit will be a little faster than 100Mbit on a small cluster, but not > > >a lot. > > > > What is 'not a lot'. > > > > I would guess it's factor 10 faster in bandwidth? > >(Maybe, you don't get QUITE 100% of the raw clock advantage in all >applications on all hardware, Vincent;-). However, for most >applications on most hardware you >>should<< get a signficant advantage >-- 80-95% of 10x, or 8-9.5x. Not a just "a little". > One might want to go back through the archives and see what the difference between 10 Mbps and 100 Mbps Ethernet was, and more to the point, why it wasn't 10x faster. While the details might change, the basic issues remain: "wire speed" "protocol overhead" "physical interface to PC's CPU" "drivers" James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From diep at xs4all.nl Wed Mar 16 09:37:47 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Seg Fault with pvm_upkstr() and Linux. Message-ID: <3.0.32.20050316183746.0148e730@pop.xs4all.nl> At 12:51 AM 3/15/2005 -0700, Josh Zamor wrote: >I have just started in on programming using PVM and have run into an >odd problem. I have written a C (c99) program that calculates the >factorial of a number by calculating parts of the range. I'm using the >GMP for dealing with large numbers (I have done this program >successfully before using numerous methods including pthreads). The Did you configure GMP correctly? By default, for math with big numbers it does not use FFT-based multiplication but much slower methods. You might want to recompile it with FFT enabled in case you haven't done this yet. 
Please note that GMP is a very slow library; commercial libraries are up to a factor 3 faster (a linear constant-factor speedup from a more efficient implementation, not an asymptotic one). I personally use GMP with great pleasure. >basic way it works for the cluster is that the program starts on a >machine, determines the subrange to be calculated for each task and >then waits for each process to come back with the answer for it's >subrange. The main process then finds the total result of the factorial >from multiplying the subrange results back together... Pretty >standard... >The problem is that after a subrange is calculated by a task the result >is put into a character array (null terminated, created from GMP's >mpz_getstr()), it is packaged using pvm_pkstr() and sent to the parent. >The parent can receive this using pvm_recv(), but as soon as it tries >to store this into a character array in the program using pvm_upkstr(), >or explore it with pvm_bufinfo(), it segfaults. In general in parallel programming, the worst performance comes when all processes must report to 1 central process. It's far more efficient when each process is equal and 'divides' the work done. A simple illustration is a problem I had on a 512-processor SGI: each 'hub' can handle at most 680MB of data per second (shared between its 4 processors). However, if 499 other processors start reading/writing from/to this 'hub', real disasters happen. Things completely lock up, not only because all processors must divide the small bandwidth, but also because you get switch-latency overhead in the routers and switches. If they first must stream a few bytes of data from A to B and then suddenly from C to D, that's far less efficient than 1 switch/router streaming only from A to B. Switches and routers sometimes have their own caches, which are simply optimized for those streaming benchmark tests. Switch latency can cause serious problems if all processors want to use the same communication resources. 
The general rule is to keep the routers/switches as idle as possible and to make your software as embarrassingly parallel as possible. >I've also tried this in a couple of different ways, passing ints, it >works, but passing strings (either large or small) results in a >segfault. >Of my 2 "cluster" learning setup one is running Mac OSX and the other >is running Gentoo Linux. It is only the linux box that segfaults, the >OSX box finds the answer correctly. The segfault happens whenever the >linux box is part of the PVM created cluster, or when PVM is ran alone >(no other boxes in the cluster) on the linux box. Which compiler do you compile with? I hope gcc only, and not Intel C++? Intel C++ is notorious for playing games with floating point in order to do better on benchmarks. Are you working with floating point or with integers? >If anyone has any idea why this is happening, or is only happening on >the linux box I would be grateful. Thanks, tech details follow: >Gentoo Linux Box: > Proc: AMD AthlonXP 1700+ > RAM: 512MB. > Kernel: 2.4.26-gentoo-r9 > PVM: 3.4.5 > GMP: 4.1.4 > GCC: 3.3.5 >Mac OSX: > Proc: G4 1GHz > RAM: 768MB > Kernel: 10.3.8, Mach 7.8.0 > PVM: 3.4.5 > GMP: 4.1.4 > GCC: 3.3 (20030304) Are you using PGO with gcc? (PGO = profile guided optimizations.) There are major bugs in PGO even in the latest gcc 3.4.3. Those guys are all volunteers and very cool guys -- very slow in bugfixing as they have other jobs too, and I don't blame them. Vincent >Thanks again. > >Regards, >-J Zamor >jzamor@gmail.com >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From ajt at rri.sari.ac.uk Wed Mar 16 04:21:23 2005 From: ajt at rri.sari.ac.uk (Tony Travis) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Folding@Home on a Beowulf? 
In-Reply-To: <42376E09.30207@cypher.acomp.usf.edu> References: <4233B2D3.7000807@spiekerfamily.com> <42376E09.30207@cypher.acomp.usf.edu> Message-ID: <423824C3.2090809@rri.sari.ac.uk> Brian R Smith wrote: > [...] > Just set up f@h on each machine, giving each a different machine id > number, 1-6 should suffice. As for getting the data sets, you should > probably find some way of getting these nodes online... Do you have a > switch or some extra ethernet ports? Maybe use one of the machines as a > router? You could also whip up a script that would "pretend" to be each > instance of a machine id (1-6), receive the data, then place it in the > proper directory in each node. Hello, Brian. We've run both SETI@home and folding@home on our 64-node openMosix Beowulf cluster using David Ranch's software firewall on the 'head' node to allow IP masquerading of the compute nodes on the public internet through our private cluster LAN: http://en.tldp.org/HOWTO/IP-Masquerade-HOWTO/ It works very well :-) Tony. -- Dr. A.J.Travis, | mailto:ajt@rri.sari.ac.uk Rowett Research Institute, | http://www.rri.sari.ac.uk/~ajt Greenburn Road, Bucksburn, | phone:+44 (0)1224 712751 Aberdeen AB21 9SB, Scotland, UK. | fax:+44 (0)1224 716687 From jzamor at gmail.com Wed Mar 16 12:14:52 2005 From: jzamor at gmail.com (Josh Zamor) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Seg Fault with pvm_upkstr() and Linux. In-Reply-To: References: <19430ea71246a4e9a13899873914bff7@gmail.com> Message-ID: On Mar 16, 2005, at 7:30 AM, Robert G. Brown wrote: > I assume that you've experimented and have no difficulty returning and > unpacking ordinary ints, strings, or raw data blocks with PVM. If so > you probably aren't making a pointer error on the master server side, > although it never hurts to check. 
Actually, I had tried passing ints and the like in other programs and had no problem, but when I tried to pass just a string in my factorial program, not using GMP, but still having the GMP library, it would segfault. What I hadn't tried was to create a simple test program for just sending strings using PVM. I have now written a test program that does not use GMP and tries to send a basic string back from a child to the parent. This again will segfault on the linux machine and work just fine on the OSX machine (essentially BSD). The code that I tried is posted below. For convenience I have also posted the source of this test program, my simple factorial program, Makefiles, and script outputs from both OSX and the linux machine at the following address: http://www.cet.nau.edu/~jrz4/pvmTest/

--strPVM.c--

#include <stdio.h>
#include <pvm3.h>

int main(int argc, char** argv) {
    int info, mytid, myparent, child[2];

    if(mytid = pvm_mytid() < 0) {
        pvm_perror("Could not get mytid");
        return -1;
    }

    myparent = pvm_parent();
    if((myparent < 0) && (myparent != PvmNoParent)) {
        pvm_perror("Some odd errr for my parent");
        pvm_exit();
        return -1;
    }

    /* I am parent */
    if(myparent == PvmNoParent) {
        info = pvm_spawn(argv[0], NULL, PvmTaskDefault, NULL, 2, child);

        for(int i = 0; i < 2; ++i) {
            if(child[i] < 0)
                printf(" %d", child[i]);
            else
                printf("t%x\t", child[i]);
        }
        putchar('\n');

        if(info != 2) {
            pvm_perror("Kids didn't all spawn!");
            pvm_exit();
            return -1;
        }

        for(int i = 0; i < 2; ++i) {
            char* retStr;
            info = pvm_recv(-1, 11);
            info = pvm_upkstr(retStr);
            printf("Recieved return string: %s\n", retStr);
        }

        pvm_exit();
        return 0;
    } else { /* Child follows */
        char str[3];
        str[0] = 'a';
        str[1] = 'b';
        str[3] = (char)0;
        pvm_initsend(PvmDataDefault);
        pvm_pkstr(str);
        pvm_send(myparent, 11);
        pvm_exit();
        return 0;
    }
}

Also, for the character array "str" in the child segment above, I have tried using malloc to create the memory, and using const char arrays as well. 
While all of these methods give seg faults on the linux machine, the above way was the only way that I tried that worked and didn't give a bus error on Mac OSX. Thanks again. From diep at xs4all.nl Wed Mar 16 13:07:51 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Seg Fault with pvm_upkstr() and Linux. Message-ID: <3.0.32.20050316220751.01633148@pop.xs4all.nl> At 01:25 PM 3/16/2005 -0700, Josh Zamor wrote: > >On Mar 16, 2005, at 10:37 AM, Vincent Diepeveen wrote: >> >> Did you configure GMP correctly? >> >> For math with big numbers it default does not use FFT calculations but >> way >> slower methods. You might want to recompile it with FFT enabled in >> case you >> didn't do this yet. >> > >I actually haven't done this, though I'll certainly try that soon. To quote a friend of mine: "Good programmers do not blink their eyes to speedup scientific written software by a factor 1 million". >> >> In general in parallel programming the worst performance you get when >> all >> processes must report to 1 central process. >> >> It's far more efficient when each process is equal and 'divides' the >> work >> done. >> >> A simple calculation example of a problem i had at a 512 processor SGI >> is >> that each 'hub' can handle at most 680MB data per second (for 4 >> processors >> in total yes). >> >> However if 499 other processors start reading/writing from/to this >> 'hub' >> then real disasters will happen. >> >> Things will completely lock up. Not only because all processors must >> divide >> the small bandwidth, but also because you will get switch latency >> overhead >> problems of routers and switches. >> >> If they first must stream a few bytes data from A to B and then >> suddenly >> from C to D, that's far less efficient than when 1 switch/router must >> stream only from A to B. >> >> Switches and routers sometimes have their own cache which is optimized >> for >> those benchmark streaming tests simply. 
Switch latency can cause >> serious >> problems if all processors want to use the same communication >> resources. >> >> The general rule is to keep the routers/switches as less possible as >> busy >> and try to make embarrassingly as possible parallel software. > > >This is exactly the sort of thing that I will be looking for shortly, >do you have any recommendations on either books or online texts that >cover this sort of thing (best practices when programming for >clusters)? I'm currently just experimenting, but this is a field that I >think I want to get involved in. Paranoia, sir, is the only thing I can advise you. Never believe any datapoint a manufacturer gives you until you can prove it yourself. SGI claimed to me, for example, that a random lookup at a remote processor of an Origin3800, in a 512p partition, would cost me no more than 460 nanoseconds to get 8-128 bytes. Of course, when the time was there, I benchmarked it on 460 processors, and on average a read of 8 bytes took 5.8 us. Something like 100MB of RAM per processor; for every new read of 8 bytes, each processor does a random lookup to a random processor at a random memory location within that 460*100MB. SGI is no exception. Manufacturers in the highend have the problem that competitors quote such unrealistic numbers that the only way they can sell their stuff is by making an even more incredible claim. I had a cluster guy from another very large huge blue company swear to me that the one-way pingpong latency of the just-built 20xx processor machine was under 5 us, using the most-sold highend network card on the planet in supercomputers. So I asked a friend to run that pingpong on just a 128-node partition, and it was 8 us there. Let alone at 1000+. Later I confronted that person with it, and the answer was: "well i hope you realize that the problem is that what you measure is including the MPI overhead which can be significant, the numbers i quoted were measured without that stupid overhead". 
But well, you actually build software, and you *do* need to count that 'stupid overhead' as real. > >> >> which compiler do you compile with? >> >> I hope gcc only and not intel c++? >> >> intel c++ is notorious with floating points in order to get faster at >> benchmarks. >> >> Are you busy with floating point or with integers? > > >I am currently using gcc's c compiler, ver 3.3.x. And doing mostly >integer calculations currently. gcc should have no bugs there, except for PGO. > >> >> Are you using PGO with gcc? (pgo = profile guided optimizations) >> >> There is major bugs even in latest 3.4.3 gcc in the PGO. >> >> Those guys are all volunteers and very cool guys. >> >> Very slow in bugfixing as they have other jobs too, and i don't blame >> them. >> > >Actually, I haven't, but profilers are one of the things that I want to >get more familiar with... Thank you for the suggestions, I really >appreciate them having done limited parallel programming on this scale. I'm not referring to profilers but to, for example, first compiling with:

# gcc 3.3.3 (suse) in case of x86-64:
CFLAGS = -O3 -fprofile-arcs -march=k8 -mcpu=k8

Then run your program on a single cpu for a while, quit your program, and remove all object files. Then recompile with:

CFLAGS = -O3 -fbranch-probabilities -march=k8 -mcpu=k8
# note that in gcc 3.4.x 'mcpu' has been renamed to 'mtune'

Otherwise, by default, use something like this with the right name for the processor you use:

CFLAGS = -O2 -mcpu=athlon-xp -march=athlon-xp

Take care to optimize GMP for the processor in question; it makes a difference. >Regards, >-J Zamor >jzamor@gmail.com From rgb at phy.duke.edu Wed Mar 16 13:23:09 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Seg Fault with pvm_upkstr() and Linux. 
In-Reply-To: References: <19430ea71246a4e9a13899873914bff7@gmail.com> Message-ID: On Wed, 16 Mar 2005, Josh Zamor wrote:

> --strPVM.c--
>
> #include <stdio.h>
> #include <pvm3.h>
>
> int main(int argc, char** argv) {
>     int info, mytid, myparent, child[2];

Add char buf[3];

>     if(mytid = pvm_mytid() < 0) {
>         pvm_perror("Could not get mytid");
>         return -1;
>     }
>
>     myparent = pvm_parent();
>     if((myparent < 0) && (myparent != PvmNoParent)) {
>         pvm_perror("Some odd errr for my parent");
>         pvm_exit();
>         return -1;
>     }
>
>     /* I am parent */
>     if(myparent == PvmNoParent) {
>         info = pvm_spawn(argv[0], NULL, PvmTaskDefault, NULL, 2, child);
>
>         for(int i = 0; i < 2; ++i) {
>             if(child[i] < 0)
>                 printf(" %d", child[i]);
>             else
>                 printf("t%x\t", child[i]);
>         }
>         putchar('\n');
>
>         if(info != 2) {
>             pvm_perror("Kids didn't all spawn!");
>             pvm_exit();
>             return -1;
>         }
>
>         for(int i = 0; i < 2; ++i) {
>             char* retStr;

delete line ^^^^^^^^^^^^^^

>             info = pvm_recv(-1, 11);
>             info = pvm_upkstr(retStr);

change to info = pvm_upkstr(buf);

>             printf("Recieved return string: %s\n", buf);
>         }
>
>         pvm_exit();
>         return 0;
>     } else { /* Child follows */
>         char str[3];

delete, and change to

    buf[0] = 'a';
    buf[1] = 'b';
    buf[2] = (char)0;

(noting that:

>> str[3] = (char)0;

is a bug!) Or use snprintf(buf,3,"ab"); Or strncpy.

>         pvm_initsend(PvmDataDefault);
>         pvm_pkstr(buf);
                     ^^^
>         pvm_send(myparent, 11);
>         pvm_exit();
>         return 0;
>     }
> }

The code you sent looks to me to be wrong. You declare retStr to be a char *, but there is no space associated with it and you don't initialize it. I believe that pvm's pvm_upkstr COPIES the returned string into the destination rather than setting the pointer to point to anonymous memory somewhere, since the latter would leak sieve-like. Malloc'ing the vector should have worked also, but depending on where str[] was allocated, just referencing str[3] could have caused a segment violation. 
If you want to see an example of a guaranteed-working program that sends a simple string using pvm all packaged up neatly to use as a template for doing actual work, you might look at http://www.phy.duke.edu/~rgb/General/project_pvm.php I put this up as an example PVM program template for a CWM column last year sometime. It keeps the master and slave code separate (and builds them separately). I can't see any good reason that your "fork/exec" style code shouldn't work, though. Declaring the local variables inside conditional segments might cause their location to be relatively likely to cause segment violation problems when you access the string beyond its allocated length, although I confess that USUALLY this will "work". That's probably why it worked on the mac, and why it might work on linux if you move things around a bit but leave things otherwise the same. SO, two definite bugs -- must allocate the target memory block one way or another and must not reference an allocated vector past the end. We'll have to see if these are "the" bugs -- there may be others. Either one would cause a segment violation in particular, though. In a minute I'll try to compile your program and test the fixes I suggest above (I know, should probably have done this FIRST, right?:-). HTH, rgb > Also, for the character array "str" in the child segment above, I have > tried using malloc to create the memory, and using const char arrays as > well. While all of these methods give seg faults on the linux machine, > the above way was the only way that I tried that worked and didn't give > a bus error on Mac OSX. > > Thanks again. > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Wed Mar 16 13:34:58 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Seg Fault with pvm_upkstr() and Linux. 
In-Reply-To:
References: <19430ea71246a4e9a13899873914bff7@gmail.com>
Message-ID:

On Wed, 16 Mar 2005, Josh Zamor wrote:

(Stuff).

OK, it works perfecto-mundally.  See code below:

#include <stdio.h>
#include <string.h>
#include <pvm3.h>

int main(int argc, char** argv) {
    int info, mytid, myparent, child[2];
    char buf[1024];

    if(mytid = pvm_mytid() < 0) {
        pvm_perror("Could not get mytid");
        return -1;
    }

    myparent = pvm_parent();
    if((myparent < 0) && (myparent != PvmNoParent)) {
        pvm_perror("Some odd errr for my parent");
        pvm_exit();
        return -1;
    }

    /* I am parent */
    if(myparent == PvmNoParent) {
        info = pvm_spawn(argv[0], NULL, PvmTaskDefault, NULL, 2, child);

        for(int i = 0; i < 2; ++i) {
            if(child[i] < 0)
                printf(" %d", child[i]);
            else
                printf("t%x\t", child[i]);
        }
        putchar('\n');

        if(info != 2) {
            pvm_perror("Kids didn't all spawn!");
            pvm_exit();
            return -1;
        }

        for(int i = 0; i < 2; ++i) {
            info = pvm_recv(-1, 11);
            info = pvm_upkstr(buf);
            printf("Received return string: %s\n", buf);
        }

        pvm_exit();
        return 0;
    } else { /* Child follows */
        strcpy(buf,"Testing PVM's string line.");
        pvm_initsend(PvmDataDefault);
        pvm_pkstr(buf);
        pvm_send(myparent, 11);
        pvm_exit();
        return 0;
    }
}

rgb@lilith|B:1074>gcc -O3 -std=c99 -g -I/usr/share/pvm3/include -L/usr/share/pvm3/lib/LINUXI386 -o pvm_test pvm_test.c -lpvm3 -lm
rgb@lilith|B:1075>/tmp/pvm_test
t4000e	t4000f
Received return string: Testing PVM's string line.
Received return string: Testing PVM's string line.

(So it was probably one or the other of the bugs I pointed out.)

   rgb

--
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

From diep at xs4all.nl  Wed Mar 16 13:36:48 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Wed Nov 25 01:03:55 2009
Subject: [Beowulf] Seg Fault with pvm_upkstr() and Linux.
Message-ID: <3.0.32.20050316223648.01639c00@pop.xs4all.nl>

At 09:30 AM 3/16/2005 -0500, Robert G.
Brown wrote:
>On Tue, 15 Mar 2005, Josh Zamor wrote:
>I assume that you've experimented and have no difficulty returning and
>unpacking ordinary ints, strings, or raw data blocks with PVM.  If so
>you probably aren't making a pointer error on the master server side,
>although it never hurts to check.

I wouldn't count on it being bugfree; usually it takes a long time
before people discover the beautiful 'sizeof' operator in C, and as we
know it will give back '4' in a lot of cases on his 32-bit XP machine
and '8' on the 64-bit G4. For example, but not limited to:

printf("sizeof(long) = %i\n",(int)sizeof(long));

Also int is not safe, as the standards do not force it to be 32 bits.

Then there is the usual casting problem and compare problem. Notorious
is this compare:

int a;
unsigned int b;
if( b == a ) // a is converted to unsigned: surprising result if a < 0

Odds for bugs in the program are like 99.99%

>If you want other eyes on your actual code (might be useful if it is
>indeed programmer error) please post.
>
>   rgb
>
>--
>Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
>Duke University Dept. of Physics, Box 90305
>Durham, N.C. 27708-0305
>Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu
>
>_______________________________________________
>Beowulf mailing list, Beowulf@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From jzamor at gmail.com  Wed Mar 16 12:25:01 2005
From: jzamor at gmail.com (Josh Zamor)
Date: Wed Nov 25 01:03:55 2009
Subject: [Beowulf] Seg Fault with pvm_upkstr() and Linux.
In-Reply-To: <3.0.32.20050316183746.0148e730@pop.xs4all.nl>
References: <3.0.32.20050316183746.0148e730@pop.xs4all.nl>
Message-ID: <3ec3474f03b446c9a78f4288e3051ec8@gmail.com>

On Mar 16, 2005, at 10:37 AM, Vincent Diepeveen wrote:
> Did you configure GMP correctly?
>
> For math with big numbers it by default does not use FFT calculations
> but way slower methods.
> You might want to recompile it with FFT enabled in case you
> didn't do this yet.

I actually haven't done this, though I'll certainly try that soon.

> In general in parallel programming, you get the worst performance when
> all processes must report to 1 central process.
>
> It's far more efficient when each process is equal and 'divides' the
> work done.
>
> A simple calculation example of a problem i had at a 512 processor SGI
> is that each 'hub' can handle at most 680MB data per second (for 4
> processors in total, yes).
>
> However if 499 other processors start reading/writing from/to this
> 'hub' then real disasters will happen.
>
> Things will completely lock up. Not only because all processors must
> divide the small bandwidth, but also because you will get switch
> latency overhead problems of routers and switches.
>
> If they first must stream a few bytes of data from A to B and then
> suddenly from C to D, that's far less efficient than when 1
> switch/router must stream only from A to B.
>
> Switches and routers sometimes have their own cache which is simply
> optimized for those benchmark streaming tests. Switch latency can
> cause serious problems if all processors want to use the same
> communication resources.
>
> The general rule is to keep the routers/switches as idle as possible
> and to make the software as embarrassingly parallel as possible.

This is exactly the sort of thing that I will be looking for shortly.
Do you have any recommendations on either books or online texts that
cover this sort of thing (best practices when programming for
clusters)? I'm currently just experimenting, but this is a field that I
think I want to get involved in.

> Which compiler do you compile with?
>
> I hope gcc only and not intel c++?
>
> intel c++ is notorious with floating points in order to get faster at
> benchmarks.
>
> Are you busy with floating point or with integers?

I am currently using gcc's c compiler, ver 3.3.x.
And doing mostly integer calculations currently.

> Are you using PGO with gcc? (pgo = profile guided optimizations)
>
> There are major bugs even in the latest 3.4.3 gcc in the PGO.
>
> Those guys are all volunteers and very cool guys.
>
> Very slow in bugfixing as they have other jobs too, and i don't blame
> them.

Actually, I haven't, but profilers are one of the things that I want to
get more familiar with...

Thank you for the suggestions; I really appreciate them, having done
limited parallel programming on this scale.

Regards,
-J Zamor
jzamor@gmail.com

From steve_heaton at ozemail.com.au  Wed Mar 16 18:42:56 2005
From: steve_heaton at ozemail.com.au (steve_heaton@ozemail.com.au)
Date: Wed Nov 25 01:03:55 2009
Subject: [Beowulf] Re: The move to gigabit - technical questions
Message-ID: <20050317024256.CAXF29550.swebmail01.mail.ozemail.net@localhost>

G'day all

Somewhat relevant... Part of my benchtesting exercise on my DIY beowulf
was a comparison between running the onboard FastEthernet vs adding an
Intel 1000MT GigaE adapter.

I changed the MPI config to ensure the MPI traffic had the GigaE to
itself and "other" traffic went via the FastE.

I ran the full MPI perftest suite. Sample graphs on this web page:

http://members.ozemail.com.au/~sheaton/lss/
-> Computing

It was indeed "a bit" faster. There's some NetPipe results in there too.
I can provide more details if anyone is interested.

Note: I know magic can be worked with the Intel driver but this is
"vanilla" ATM.
Cheers Steve This message was sent through MyMail http://www.mymail.com.au From eugen at leitl.org Thu Mar 17 12:36:29 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] [Bioclusters] notes and pictures from a "wet lab baby-biocluster" project (fwd from dag@sonsorol.org) Message-ID: <20050317203629.GJ17303@leitl.org> ----- Forwarded message from Chris Dagdigian ----- From: Chris Dagdigian Date: Thu, 17 Mar 2005 14:42:10 -0500 To: "Clustering, compute farming & distributed computing in life science informatics" Subject: [Bioclusters] notes and pictures from a "wet lab baby-biocluster" project Organization: Bioteam Inc. User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.5) Gecko/20041217 Reply-To: "Clustering, compute farming & distributed computing in life science informatics" I've had a blast the past few days doing rack-and-stack work that I normally don't get to do much anymore. Rough notes and a link to the images follow... The pictures: ------------- http://bioteam.net/gallery/wetlabcluster The challenge: -------------- In 12 days or less, design a cluster, source the parts and put it together in good working order. The cluster must meet the following requirements: - Capable of operating in a wet lab setting - Managed and operated by biologists - Linux OS required (software dependencies ...) - Require no more than 2x 20-amp power circuits - ~ 4 terabyte raw storage requirement; HA or super-performance not a requirement - Quieter than the instruments surrounding it - Small enough to (roughly) fit under a lab bench - Have sufficient CPU power to meet analytical needs - Capable of automatically processing data coming off one or more high end instruments The components: --------------- I can't share details about the requirements gathering phase of the project. 
We studied the instrument, the science and the stuff that needed to be
done with the data coming off the instrument and determined that
approx. six dual-processor boxes with AMD Opteron CPUs would be
acceptable. We were under a massive time crunch, and some components
were ordered purely on the basis of "how fast can you ship to us..."

The parts list boiled down to the following pieces:

From CDW.com with rush delivery :)
- Digi CM 16 serial console server
- Pair of 20-amp APC rack-mount power distribution units
- Dirt cheap SMC 24-port gigabit ethernet unmanaged switch
- Box of serial DB9 to cat5 RJ45 adaptors for serial console
- Bulk quantities of 5ft grey cat5e cables (no time for special colors
  or lengths)

From IBM via a local reseller/integrator:
- 7x IBM eSeries 326 1U rackmount dual-Opteron servers (6 nodes + master)

From Apple:
- Apple Xserve RAID with 14x 400gb drives
- Apple PCI-X fiber channel HBA card & cables
- Xserve RAID spare parts kit

From Extrememac.com:
- Small form factor 12U "XRack Pro2" cabinet (http://www.xrackpro.com/)

The problems:
-------------

The biggest overall problem was that the Apple Xserve RAID was ordered
with Fedex shipping but without priority delivery. This means that the
storage arrived at 5pm the night before our final cluster-assembly work
day. It also arrived with damaged rackmount rails, but the damage was
not enough to make the hardware unusable.

Even worse, the cluster cabinet arrived at 1pm *on* our final work day.
This was in spite of the fact that the cabinet had been ordered via
credit card directly from Extrememac 7 or 8 days prior. As a vendor,
they were not really on the ball with things, but this could be normal
for a company that seems to mostly make iPod accessories. Hopefully
just a fluke experience.

The IBM hardware arrived quickly and the reseller/integrator did a good
job.
A minor hassle was that we had to order 15,000RPM Ultra320 SCSI drives
because the cheaper 10,000RPM drives were on some sort of IBM global
"short supply" list.

The biggest problem with IBM, and the reason I'll probably never
purchase eSeries servers again, is that apparently IBM refuses to sell
any sort of generic rail mounting kits for the e-series product line
(this is what the integrator told me; I have not verified this yet).
They ship with rail kits that *only* work in IBM-branded server
cabinets. Given that we were installing into a non-IBM 12U cabinet,
this was a big issue. Our integrator found a 3rd-party rail reseller
that makes compatible rails, but we could not order them in time. To me
this is just annoying, and (if true) due to the annoyance factor I'll
probably buy my dual Opterons from Sun in the future (assuming Sun will
sell me a generic rail kit...)

Final thoughts:
---------------

The 64-bit version of Suse 9.2 Professional handled the fibre channel
storage amazingly cleanly. It detected, mounted and provisioned the 2
Apple RAID LUNs into an LVM group with no problem at all. I was
expecting the Linux -> Apple RAID stuff to be a bit more scary.

I really like the XRack Pro2 cluster cabinet, or whatever its marketing
name is. Well assembled, with good options for choosing between quiet
vs. cooling. There is plenty of space for wiring and cable runs even if
all 12U are packed with equipment. We have everything powered up today
and working hard, and are monitoring the temperature conditions
internally. The Xserve RAID is one of the quietest storage arrays I've
ever seen - I thought it would be louder than the IBM rack-mounts but
this is not the case.

The biggest liability in this cluster is the lack of an internal UPS
capable of cleanly shutting down the Xserve RAID chassis. There was
simply no more room in the cabinet. We'll do external UPS for now, and
if we can squeeze out 1 compute node there is the possibility of
installing one of the 1U UPS systems made by APC.
-Chris -- Chris Dagdigian, BioTeam - Independent life science IT & informatics consulting Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193 PGP KeyID: 83D4310E iChat/AIM: bioteamdag Web: http://bioteam.net _______________________________________________ Bioclusters maillist - Bioclusters@bioinformatics.org https://bioinformatics.org/mailman/listinfo/bioclusters ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050317/483eb742/attachment.bin From diep at xs4all.nl Thu Mar 17 17:14:37 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Re: The move to gigabit - technical questions Message-ID: <3.0.32.20050318021437.01624af0@pop.xs4all.nl> G'morning At 01:42 PM 3/17/2005 +1100, steve_heaton@ozemail.com.au wrote: >G'day all > >Somewhat relevant... Part of my benchtesting exersize on my DIY beowulf was a comparison between running the onboard FastEthernet v's adding an Intel 1000MT GigaE adapter. > >I changed the MPI config to ensure the MPI traffic had the GigaE to itself and "other" traffic went via the FastE. > >I ran the full MPI perftest suite. Sample graphs on this web page: > >http://members.ozemail.com.au/~sheaton/lss/ >-> Computing > >It was indeed "a bit" faster. You ship 1.01E+02 = 1.01 * 10^2 = 101 bytes per block and reach roughly 240 megabit per second with it. Interesting to know is how much SYSTEM time you lose while shipping 240 megabit per second in say 2MB blocks, not 101 bytes. 101 bytes is real real little IMHO. 
A system is obviously pretty near useless when you have zero processortime left thanks to network. >There's some NetPipe results in there too. I can provide more details if any one is interested. How much processortime is left while running netpipe. Do your cards use DMA? >Note: I know magic can be worked with the Intel driver but >this is "vanilla" ATM. Vanilla is what we need with respect to gigabit. The non vanilla theoretic 'raw data throughput' and raw latency without protocol overhead, that's for the highends who for sure have solved the processor time issue already long long ago :) >Cheers >Steve > > >This message was sent through MyMail http://www.mymail.com.au > > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From laytonjb at charter.net Fri Mar 18 05:51:31 2005 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] [Fwd: [O-MPI users] Fwd: Thoughts on an MPI ABI] Message-ID: <423ADCE3.7090508@charter.net> Good morning, I've been a little behind on reading the mailing lists, but I saw this post on the Open-MPI mailing list and thought it might be of interest to people on this list since this has been a recent topic of discussion. Enjoy! Jeff -------------- next part -------------- An embedded message was scrubbed... 
From: Jeff Squyres Subject: [O-MPI users] Fwd: Thoughts on an MPI ABI Date: Sun, 13 Mar 2005 13:36:20 -0500 Size: 30766 Url: http://www.scyld.com/pipermail/beowulf/attachments/20050318/46e81e9e/ThoughtsonanMPIABI.mht From atp at piskorski.com Fri Mar 18 07:11:09 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Re: The move to gigabit - technical questions In-Reply-To: <3.0.32.20050318021437.01624af0@pop.xs4all.nl> References: <3.0.32.20050318021437.01624af0@pop.xs4all.nl> Message-ID: <20050318151108.GA74656@piskorski.com> On Fri, Mar 18, 2005 at 02:14:37AM +0100, Vincent Diepeveen wrote: > At 01:42 PM 3/17/2005 +1100, steve_heaton@ozemail.com.au wrote: > >Note: I know magic can be worked with the Intel driver but > >this is "vanilla" ATM. > > Vanilla is what we need with respect to gigabit. Well no, it's not. When he said "vanilla", I believe he meant, "Using the stock Linux driver settngs with no attempt at tuning for the needs of my particular HPC application." Thus he has done only part of the work necessary to fully answer the question, "How much better are these gigabit cards than using 100 megabit ethernet for me". Since he is using Intel Pro/1000 cards, he probably should also try using GAMMA rather than TCP/IP. -- Andrew Piskorski http://www.piskorski.com/ From diep at xs4all.nl Fri Mar 18 12:38:04 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Re: The move to gigabit - technical questions Message-ID: <3.0.32.20050318213803.0163a370@pop.xs4all.nl> At 10:11 AM 3/18/2005 -0500, Andrew Piskorski wrote: >On Fri, Mar 18, 2005 at 02:14:37AM +0100, Vincent Diepeveen wrote: >> At 01:42 PM 3/17/2005 +1100, steve_heaton@ozemail.com.au wrote: > >> >Note: I know magic can be worked with the Intel driver but >> >this is "vanilla" ATM. >> >> Vanilla is what we need with respect to gigabit. > >Well no, it's not. 
When he said "vanilla", I believe he meant, "Using
> the stock Linux driver settings with no attempt at tuning for the
> needs of my particular HPC application."  Thus he has done only part
> of the work necessary to fully answer the question, "How much better
> are these gigabit cards than using 100 megabit ethernet for me?"
>
> Since he is using Intel Pro/1000 cards, he probably should also try
> using GAMMA rather than TCP/IP.

I'm sorry but i fail to take GAMMA seriously at the moment.

All my machines here are dual machines. No chance they can work with
GAMMA, last time i checked that is.

> --
> Andrew Piskorski
> http://www.piskorski.com/
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From cap at nsc.liu.se  Fri Mar 18 01:12:07 2005
From: cap at nsc.liu.se (Peter Kjellström)
Date: Wed Nov 25 01:03:55 2009
Subject: [Beowulf] Re: The move to gigabit - technical questions
In-Reply-To: <20050317024256.CAXF29550.swebmail01.mail.ozemail.net@localhost>
References: <20050317024256.CAXF29550.swebmail01.mail.ozemail.net@localhost>
Message-ID: <200503181012.13311.cap@nsc.liu.se>

Hello,

e1000 is a really good performer _IF_ you switch off ITR
(InterruptThrottleRate) when loading the module (this is not the
default). Try adding the following to your modules.conf (or
modprobe.conf if 2.6):

options e1000 InterruptThrottleRate=0,0

(use =0 if one NIC, 0,0 if two, 0,0,0,0 if four... and so on).

/Peter

On Thursday 17 March 2005 03.42, steve_heaton@ozemail.com.au wrote:
> G'day all
>
> Somewhat relevant... Part of my benchtesting exercise on my DIY
> beowulf was a comparison between running the onboard FastEthernet vs
> adding an Intel 1000MT GigaE adapter.
>
> I changed the MPI config to ensure the MPI traffic had the GigaE to itself
> and "other" traffic went via the FastE.
> > I ran the full MPI perftest suite. Sample graphs on this web page: > > http://members.ozemail.com.au/~sheaton/lss/ > -> Computing > > It was indeed "a bit" faster. > > There's some NetPipe results in there too. I can provide more details if > any one is interested. > > Note: I know magic can be worked with the Intel driver but this is > "vanilla" ATM. > > Cheers > Steve > > > This message was sent through MyMail http://www.mymail.com.au -- ------------------------------------------------------------ Peter Kjellstr?m | National Supercomputer Centre | Sweden | http://www.nsc.liu.se -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050318/53596250/attachment.bin From jzamor at gmail.com Fri Mar 18 07:29:40 2005 From: jzamor at gmail.com (Josh Zamor) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Seg Fault with pvm_upkstr() and Linux. In-Reply-To: References: <19430ea71246a4e9a13899873914bff7@gmail.com> Message-ID: <3eee94462e89e0e116d3fab7eb45694b@gmail.com> On Mar 16, 2005, at 2:34 PM, Robert G. Brown wrote: > > OK, it works perfecto-mundally. See code below: > > #include > #include > #include > > int main(int argc, char** argv) { > int info, mytid, myparent, child[2]; > char buf[1024]; ... > for(int i = 0; i < 2; ++i) { > info = pvm_recv(-1, 11); > info = pvm_upkstr(buf); > printf("Received return string: %s\n", buf); > } > > > (So it was probably one or the other of the bugs I pointed out.) > That was it. I had tried that when experimenting with the factorial program, but I had made another mistake that masked the solution. And then of course I made the always fatal mistake of assuming that it must be somebody else's code, as it worked on one machine and not on the other. Thanks, I really appreciate your assistance. 
Regards,
-J Zamor
jzamor@gmail.com

From list-beowulf at onerussian.com  Fri Mar 18 09:19:06 2005
From: list-beowulf at onerussian.com (Yaroslav Halchenko)
Date: Wed Nov 25 01:03:55 2009
Subject: [Beowulf] maui scheduler/ time sharing/ suspend
Message-ID: <20050318171906.GE3114@washoe.rutgers.edu>

Hi All wulfers,

Presently I use

ii  torque  1.1.0p4-4     Tera-scale Open-source Resource and QUEue ma
ii  maui    3.2.6p11-2y1  Scheduler for PBS-based and other cluster en

to do batch job processing. I remember I've tried to do time sharing at
some point using maui but I didn't succeed. Does anyone have any
positive experience?

For now I just configured the nodes as having twice as many CPUs as in
reality, limiting queues/users to the real number of CPUs. This way we
can time share, but it is really handicapped: after a while some nodes
sit unloaded while others have 4 jobs running...

Please advise on configuring torque/pbs for time sharing and/or suspend
(either of maui or pbs_sched is fine).

-- .-. =------------------------------ /v\ ----------------------------= Keep in touch // \\ (yoh@|www.)onerussian.com Yaroslav Halchenko /( )\ ICQ#: 60653192 Linux User ^^-^^ [175555]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050318/02ef2e81/attachment.bin

From nelsoneci at gmail.com  Fri Mar 18 12:52:19 2005
From: nelsoneci at gmail.com (Nelson Castillo)
Date: Wed Nov 25 01:03:55 2009
Subject: [Beowulf] Can I write Etherboot to MBR?
Message-ID: <2accc2ff050318125257e783b5@mail.gmail.com>

Hi.

I'm booting a lot of machines in a diskless cluster. I don't want to
wipe the actual content of the hard disks. I'd like to overwrite the
MBR without using a boot loader.

I've been able to use a zlilo image with both grub and lilo. I've been
able to use a zdsk image booting from a floppy.
But I just want to do something like:

cat eepro100.zdsk > /dev/sda

just as I do

cat eepro100.zdsk > /dev/fd0

Can I do this?

Regards,
Nelson.-

--
Homepage : http://geocities.com/arhuaco
The first principle is that you must not fool yourself and you are the
easiest person to fool. -- Richard Feynman.

From linuxslacker at gmail.com  Fri Mar 18 15:52:14 2005
From: linuxslacker at gmail.com (Chris Peterson)
Date: Wed Nov 25 01:03:55 2009
Subject: [Beowulf] Gigabit Switch
Message-ID: <219b037a05031815521e616dbb@mail.gmail.com>

Hi,

This is a little off topic but.. We have a gigabit switch that needs to
be replaced because it does not support jumbo frames. Does anyone know
of a good 24-port switch with a mini-GBIC port? We have looked at the
SMC8624T, but it is out of our price range.

Chris Peterson

From rgb at phy.duke.edu  Sat Mar 19 05:54:51 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:03:55 2009
Subject: [Beowulf] Seg Fault with pvm_upkstr() and Linux.
In-Reply-To: <3eee94462e89e0e116d3fab7eb45694b@gmail.com>
References: <19430ea71246a4e9a13899873914bff7@gmail.com> <3eee94462e89e0e116d3fab7eb45694b@gmail.com>
Message-ID:

On Fri, 18 Mar 2005, Josh Zamor wrote:

> > (So it was probably one or the other of the bugs I pointed out.)
>
> That was it. I had tried that when experimenting with the factorial
> program, but I had made another mistake that masked the solution. And
> then of course I made the always fatal mistake of assuming that it must
> be somebody else's code, as it worked on one machine and not on the
> other. Thanks, I really appreciate your assistance.

:-)

In the 32 years or so I've been programming, I think I've encountered
problems in programs that really weren't >>my<< fault (as opposed to
bugs in a compiler or even a mature library) fewer than a dozen times.
So the odds were pretty good.
Especially for something like gcc+PVM, which is a combination widely
enough used and old enough that finding a new, unfixed, egregious bug
is really, really unlikely. It always helps to get another pair of eyes
on buggy code though, and the list contains a lot of pairs...;-)

   rgb

--
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

From mechti01 at luther.edu  Thu Mar 10 21:28:30 2005
From: mechti01 at luther.edu (Timo Mechler)
Date: Wed Nov 25 01:03:55 2009
Subject: [Beowulf] Errors running Linpack
Message-ID: <6.2.1.2.0.20050310232751.01e14d58@pop.luther.edu>

Hi Everyone,

Just got my cluster up and running. Hardware specs as follows:

2x PIII 550MHz, 320MB RAM, 2x 100Mbit NIC, 6.4GB HDD
1x PIII 500MHz, 320MB RAM, 2x 100Mbit NIC, 6.4GB HDD
5x PIII 450MHz, 320MB RAM, 2x 100Mbit NIC, 6.4GB HDD
Dual PIII 650MHz, 512MB RAM, 2x 100Mbit NIC, 18GB 10k SCSI HDD frontend

All running Rocks 3.3.0 and hooked into a DLink 24-port 10/100
unmanaged switch. I've been running Linpack to test things out and get
an idea of performance. Right now I'm getting around 1.5 Gflops. Does
that sound about right for performance?

Sometimes when I try to run Linpack, I get errors like those listed
below:

p2_11545:  p4_error: Found a dead connection while looking for messages: 1
rm_l_6_11443: (3.682629) net_send: could not write to fd=6, errno = 9
rm_l_6_11443:  p4_error: net_send write: -1
    p4_error: latest msg from perror: Bad file descriptor
rm_l_1_11525: (8.407278) net_send: could not write to fd=6, errno = 9
rm_l_1_11525:  p4_error: net_send write: -1
    p4_error: latest msg from perror: Bad file descriptor
rm_l_6_11443: (17.753033) net_send: could not write to fd=5, errno = 104

Is this hardware related? If not, what else might it be? Thanks in
advance for your help.
Regards, -Timo Mechler From lindahl at pathscale.com Sun Mar 20 18:33:48 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] [Fwd: [O-MPI users] Fwd: Thoughts on an MPI ABI] In-Reply-To: <423ADCE3.7090508@charter.net> References: <423ADCE3.7090508@charter.net> Message-ID: <20050321023348.GG3221@greglaptop> On Fri, Mar 18, 2005 at 08:51:31AM -0500, Jeffrey B. Layton wrote: > I've been a little behind on reading the mailing lists, but I saw > this post on the Open-MPI mailing list and thought it might > be of interest to people on this list since this has been a recent > topic of discussion. Jeff, Are you planning on forwarding the entire discussion? One posting out of context is unlikely to give a very good picture of the whole discussion. -- greg From laytonjb at charter.net Mon Mar 21 04:07:52 2005 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] [Fwd: [O-MPI users] Fwd: Thoughts on an MPI ABI] In-Reply-To: <20050321023348.GG3221@greglaptop> References: <423ADCE3.7090508@charter.net> <20050321023348.GG3221@greglaptop> Message-ID: <423EB918.90908@charter.net> Greg Lindahl wrote: >On Fri, Mar 18, 2005 at 08:51:31AM -0500, Jeffrey B. Layton wrote: > > > >> I've been a little behind on reading the mailing lists, but I saw >>this post on the Open-MPI mailing list and thought it might >>be of interest to people on this list since this has been a recent >>topic of discussion. >> >> > >Jeff, > >Are you planning on forwarding the entire discussion? One posting out >of context is unlikely to give a very good picture of the whole >discussion. > >-- greg > > Greg, I apologize. I forgot to cc the OpenMPI mailing list. However, I've only seen one other post from the OpenMPI list regarding this topic - being the one from you. To be fair do you want me to forward that one with the cc to the OpenMPI list? 
Jeff From steve_heaton at ozemail.com.au Sat Mar 19 19:59:45 2005 From: steve_heaton at ozemail.com.au (Fringe Dweller) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Re: The move to gigabit - technical questions In-Reply-To: <200503182001.j2IK0AZf003079@bluewest.scyld.com> References: <200503182001.j2IK0AZf003079@bluewest.scyld.com> Message-ID: <423CF531.4080301@ozemail.com.au> > When he said "vanilla", I believe he meant, "Using > the stock Linux driver settngs with no attempt at tuning for the needs > of my particular HPC application." Indeed :) > Since he is using Intel Pro/1000 cards, he probably should also try > using GAMMA rather than TCP/IP. > Yep, the kind of tweak that Peter Kjellstr?m has proposed (InterruptThrottleRate=0) is still on the cards. As is the investigation of GAMMA. Two for two Andrew ;) There's also sand pit time for the relevance of jumbo packets, falcON et al style Nbody hierarchy methodologies etc etc. I'll be documenting as I go for anyone interested to follow. Generally Nbody code seems to push the net hard during the initial data distribution, burns the processors hard on the calcs then the net gets a squeak again when the results are assembled. ...Then when my Fair Godmother turns up with that pile of dual Opertons I'll be ready =) Cheers Steve From oplehto at csc.fi Sun Mar 20 09:56:35 2005 From: oplehto at csc.fi (Olli-Pekka Lehto) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Questions regarding interconnects Message-ID: Hello, I'm writing a paper on current and emerging cluster interconnect technologies as a part of my University studies. I have included 1GbE (incl. RDMA), 10GbE, Quadrics, InfiniBand and Myrinet. The goal is to provide an introduction to the subject maybe more from a network engineer's point of view with an overview on the key features and the pros/cons of each solution. 
I have some questions on which I hope you could help me out with: What do you see as the key differentiating factors in the quality of an MPI implementation? This far I have come up with the following: -Completeness of the implementation -Latency/bandwidth -Asynchronous communication -Smart collective communication Are there any NICs on the market which utilize the 10GBase-CX4 standard and if there is are there any clusters which use them? Do you see it as a viable choice for an interconnect considering the relatively low cost of InfiniBand and that fact that 10GBase-T is not that far in the future? When do you estimate that commodity Gigabit NICs with integrated RDMA support will arrive to the market? (or will they?) best regards, Olli-Pekka -- Olli-Pekka Lehto, Systems Specialist, Systems Services, CSC PO Box 405 02101 Espoo, Finland; tel +358 9 457 2215, fax +358 9 457 2302 CSC is the Finnish IT Center for Science, www.csc.fi, e-mail: Olli-Pekka.Lehto@csc.fi From senot at sciences.univ-metz.fr Mon Mar 21 06:00:14 2005 From: senot at sciences.univ-metz.fr (Philippe SENOT) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] AMD Athlon with Intel Fortran Compiler, which options? Message-ID: <423ED36E.90708@sciences.univ-metz.fr> Hi, I would like to know which options I must use to compile some progs on dual athlon 2800+ with intel fortran compiler. And if you know others free fortran compiler for this platform. Thanks -- Philippe SENOT Gestionnaire du parc informatique Institut de Physique,1 Bd Arago 57078 METZ Cedex 3, FRANCE Tel. (+33) 03.87.31.58.63 Fax. (+33) 03.87.54.72.57 Abroad replace 03 by 3 above senot@sciences.univ-metz.fr -------------- next part -------------- Skipped content of type multipart/related From fant at pobox.com Mon Mar 21 06:32:01 2005 From: fant at pobox.com (Andrew D. 
Fant) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Sources for Switch Performance Data Message-ID: <423EDAE1.5090808@pobox.com> Good morning all, There have been many discussions over the past few months about suitable switches for Ethernet-based clusters, and I was wondering if anyone could point me towards sites that offer some sort of head-to-head comparison of switch performance in cluster applications. For obvious reasons, I can't afford to buy one of each and test them all myself. I've heard that Cisco gear can be over-priced for performance per port, and that some people swear by lower-cost commodity gigE switches for their clusters, but I am hoping to be more analytical about performance numbers, especially cost per sustainable Mb/s per port. If anyone can point me in the right direction for these sorts of numbers, I would be very grateful. Andy From deadline at clusterworld.com Mon Mar 21 12:36:11 2005 From: deadline at clusterworld.com (Douglas Eadline - ClusterWorld Magazine) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Why Do Clusters Suck? Message-ID: So why do clusters suck? Not that I want to start a flame war or anything (at least not yet). This question is a backhanded way of saying "What do we need to move the cluster community forward and make clusters better?" If you have been reading ClusterWorld you may have noticed that we have recently been hinting at this question. To help organize this process, we have begun The Cluster Agenda Initiative (http://www.clusterworld.com/agenda06/). The web page explains it pretty well, but in short the Agenda is an effort to help "get the community a bit more organized". It is meant to be the starting point and road map where resources, best practices and challenges are identified - it is NOT an endpoint or a standards document. In order to determine what the "issues" are, we were invited by Tom Sterling to ask the question "Why do Clusters Suck?".
You can supply your answer at: http://basement-supercomputing.com/agenda/agendaform.html I invite users, administrators, vendors, and anyone remotely interested in clusters, to check out the site and either fill out the form or contact me directly. Your input is important. We are going to finalize the Agenda Framework at the ClusterWorld Summit in May. Doug ---------------------------------------------------------------- Editor-in-chief ClusterWorld Magazine Desk: 610.865.6061 Fax: 610.865.6618 www.clusterworld.com From mathog at mendel.bio.caltech.edu Mon Mar 21 15:51:24 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Daisychained rcp script Message-ID: Here's a script for copying a file across a list of nodes. ftp://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/pdist_file.sh It uses a daisychain method similar to that in "dolly". I'm a bit curious how it holds up on larger sites with different network hardware. We have a switched 100baseT network with data starting on the headnode and going to up to 20 nodes; all nodes are identical. Here are some timings with nothing else running:

Nodes  Time (s)     Mb/s       Repeater Nodes
1      8            10.8       0
2      8.9 - 9.3    9.7 - 9.3  1
1-5    13.5 - 14.8  6.4 - 5.8  4
1-10   13.5 - 17.4  6.4 - 5.0  9
1-20   19.5 - 20.5  4.4 - 4.2  19

The test file was 86.4 Mb (for no good reason.) 1-5 means the first 5 nodes were written. Repeater nodes are those that read from the net, store locally, and also write to the net. There's always a first node (only writes to the net) and a last node (only reads from the net and stores to disk.) Ideally the daisychain method would scale up better than this. I think that what's happening is that there is more and more chance of wasted time on the repeater nodes, because node N+1 is writing to N+2 when it should be reading from node N.
Also the writes to disk and reads/writes from the network are not synch'd very well, so again, it's probably doing the wrong thing at the wrong time, which introduces progressively more delays. Consequently it uses less and less of the available bandwidth as the number of nodes in the chain increases. That said, it's still a lot faster moving data out this way than with 20 sequential rcp's. It also doesn't massacre the NFS server as would 20 of these simultaneously: rsh remotenode "cp /nfsmount/data /localdisk" I also timed this using my variant of dolly 0.57C, which should be about the same as 0.58. Interestingly, even though dolly reports that it is moving

Time: 8.935656
MBytes/s: 9.674

when I use "time" to measure the actual elapsed time, the transfer actually takes 16.0 seconds total elapsed time, for 5.4 Mb/s. (And that doesn't count the 1 second or so for rsh to set up the 20 slave dolly processes.) So dolly is a little better than my simple script, but it also can't keep the network running flat out. Anybody have a better "daisychain" (or other) data replicator? Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From felix.rauch.valenti at gmail.com Mon Mar 21 23:04:12 2005 From: felix.rauch.valenti at gmail.com (Felix Rauch Valenti) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Daisychained rcp script In-Reply-To: References: Message-ID: <4eafc81b05032123041ff367b0@mail.gmail.com> On Mon, 21 Mar 2005 15:51:24 -0800, David Mathog wrote: > Here's a script for copying a file across a list of nodes. > > ftp://saf.bio.caltech.edu/pub/software/linux_or_unix_tools/pdist_file.sh > > It uses a daisychain method similar to that in "dolly". I'm > a bit curious how it holds up on larger sites with different network > hardware. We have a switched 100baseT network with data starting > on the headnode and going to up to 20 nodes, all nodes are identical.
> Here are some timings with nothing else running:
>
> Nodes  Time (s)     Mb/s       Repeater Nodes
> 1      8            10.8       0
> 2      8.9 - 9.3    9.7 - 9.3  1
> 1-5    13.5 - 14.8  6.4 - 5.8  4
> 1-10   13.5 - 17.4  6.4 - 5.0  9
> 1-20   19.5 - 20.5  4.4 - 4.2  19

I only had a quick look at your script, but it seems that it uses (named) pipes and "tee", so I'd guess it does more data copies than "dolly" (which implements the whole replication in a single C program). That could explain a difference in throughput between 1 node and multiple nodes, because the repeater nodes limit the performance. A reason for the further decrease in performance with higher numbers of nodes might be that the pipes and your network connection don't use the same blocksize (I'm not sure though), which could result in "hiccups" in the daisychain due to bad synchronisation between data streams. [...] > I also timed this using my variant of dolly 0.57C, which should > be about the same as 0.58. Interestingly, even though dolly > reports that it is moving > > Time: 8.935656 > MBytes/s: 9.674 > > when I use "time" to measure the actual elapsed time the transfer > actually takes 16.0 seconds total elapsed time, for 5.4 Mb/s. > (And that doesn't count the 1 second or so for rsh to set up > the 20 slave dolly processes.) So dolly is a little better > than my simple script but it also can't keep the network > running flat out. I didn't check dolly's code, but I guess it doesn't measure the startup and teardown phases, because I was mostly interested in throughput for very large files (that's what dolly was written for). To get rid of the startup phase in dolly -- and thus achieve higher throughputs for medium sized files -- one might want to use a dolly daemon. Such a daemon would be started once, set up all the daisychain connections, and then wait for files to transmit. Thus, the file replication could start immediately after writing the file to the dolly server daemon, without any setup or teardown delays.
- Felix From baenni at kiecks.de Tue Mar 22 05:03:50 2005 From: baenni at kiecks.de (baenni@kiecks.de) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] newbie question about mpich2 on heterogenous cluster Message-ID: <200503221403.50619.baenni@kiecks.de> Dear List I installed mpich2-1.0 on my little cluster (2 Linux nodes and 3 Solaris nodes). I first worked only on the two Linux nodes, where the programs run without trouble. But when I include the Solaris nodes, i.e. when I run the programs on a heterogeneous cluster, it ends up in error messages. For some reason, the -arch parameter is not implemented in mpich2-1.0. Does anyone have experience with such problems? Can I run mpich2 on a heterogeneous cluster? Thanks in advance for any help

mpiexec -n 1 -host shaw -path /home1/00117cfd/CFD_3D/example/PARALLEL/cpi _cpi : -n 1 -host devienne -path /home1/00117cfd/CFD_3D/example/PARALLEL/cpi _cpi : -n 1 -host gallay -path /export/home/baenni/example/PARALLEL/cpi _cpi : -n 2 -host gallay1 -path /export/home/baenni/example/PARALLEL/cpi _cpi

aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(821): MPI_Bcast(buf=0x8145480, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(229):
MPIC_Send(48):
MPIC_Wait(308):
MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(492):
connection_recv_fail(1728):
MPIDU_Socki_handle_read(590): connection closed by peer (set=0,sock=1)
aborting job:
Fatal error in MPI_Bcast: Internal MPI error!, error stack:
MPI_Bcast(821): MPI_Bcast(buf=1786e0, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(197):
MPIC_Recv(98):
MPIC_Wait(308):
MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(849): [ch3:sock] received packet of unknown type (369098752)
rank 4 in job 19 shaw_33110 caused collective abort of all ranks exit
status of rank 4: killed by signal 9 From jsquyres at open-mpi.org Tue Mar 22 07:46:28 2005 From: jsquyres at open-mpi.org (Jeff Squyres) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Alternative to MPI ABI Message-ID: I have tried to reply to Greg's answer to my long-winded prior post about my thoughts on an MPI ABI (http://www.open-mpi.org/community/lists/users/2005/03/0021.php), but I simply have not had the time to compose a suitably-detailed/precise reply. Instead, I would like to propose an alternative to an MPI ABI. Create a new software project (preferably open source, preferably with a BSD-like license so that ISVs can incorporate this software into their products) that provides a compatibility layer for all the different MPI implementations out there. Let's call it MorphMPI. It would contain the following main components:

1. its own mpi.h / mpif.h
2. its own wrapper compilers (mpicc et al.)
3. its own library (perhaps named libmpi.*)

mpi.h contains all the normal mpi.h things (prototypes for all the MPI and PMPI functions, declarations of all the MPI constants, etc.), and then potentially a remapping from MPI functions to MorphMPI functions (e.g., "#define MPI_Send morph_mpi.mpi_send", where morph_mpi is a struct full of function pointers). The wrapper compilers do the standard wrapper compiler things: enabling finding mpi.h / mpif.h, automatically finding and linking to MorphMPI's library(ies), etc. The library is where the bulk of the work will be. In MorphMPI's equivalent to MPI_INIT (perhaps named Morph_MPI_Init()), it dlopen's a back-end MPI implementation and sets oodles of internal tables to point to the back-end MPI functions and constants. For example, morph_mpi.mpi_send is set equal to the result of a dlsym to find the symbol for "MPI_Send". Morph_MPI_Init() can do some clever / user-friendly things to pick which back-end MPI to dlopen, what dependent libraries also need to be dlopen'ed, etc. This can be arbitrarily feature-ized.
There are still some technical issues to solve, but an industrious developer can figure them out. For example, how to handle MPI compile-time constants (e.g., "MPI_Comm mycomm = MPI_COMM_WORLD;")? One possible solution is to have MorphMPI have a wrapper function for each MPI function (e.g., "#define MPI_Send Morph_MPI_Send"). The wrapper function does a translation of the MorphMPI MPI handles to the back-end MPI handles. If MorphMPI's handles are integers, this can be relatively straightforward, something along the lines of:

int Morph_MPI_Send(...dtype, ..., comm) {
    return morph_mpi.mpi_send(..., Morph_MPI_datatypes[dtype], ...,
                              Morph_MPI_communicators[comm]);
}

You get the idea. There's a slight performance penalty for the translation layer, but for those who want an MPI ABI, this might well be an acceptable price to pay. ------ Of course, such a compatibility layer doesn't have to be *exactly* like this. I simply proposed one possible implementation -- there are several other, similar ways to do it. S/He who implements, wins. :-) The main ideas of this proposal are:

1. A 3rd-party project can provide MPI ABI-like functionality (with all the benefits and drawbacks therein)
2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on pay-per-view (read: no need for time-consuming, potentially fruitless attempts to get MPI implementors to agree on anything)
3. With an appropriate FOSS license, anyone who wants ABI-like functionality can have it, but those who don't want it don't have it forced upon them
4. MPI implementors can keep doing what they do best: working on making their software great

This seems like a perfect project for a bright Master's student. Anyone care to open up a SourceForge project for it?
:-) -- {+} Jeff Squyres {+} The Open MPI Project {+} http://www.open-mpi.org/ From rbw at ahpcrc.org Tue Mar 22 08:41:54 2005 From: rbw at ahpcrc.org (Richard Walsh) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Questions regarding interconnects In-Reply-To: References: Message-ID: <42404AD2.3060400@ahpcrc.org> Olli-Pekka Lehto wrote: > Hello, > > I'm writing a paper on current and emerging cluster interconnect > technologies as a part of my University studies. I have included 1GbE > (incl. RDMA), 10GbE, Quadrics, InfiniBand and Myrinet. The goal is to > provide an introduction to the subject maybe more from a network > engineer's point of view with an overview on the key features and the > pros/cons of each solution. I have some questions on which I hope you > could help me out with: I think that integrating a custom interconnect for comparison into your analysis would be useful, to contrast the capabilities of "commodity cluster" interconnects with those of the presumptive custom leading edge. I would choose the Cray X1e or Altix interconnects for this. > > What do you see as the key differentiating factors in the quality of > an MPI implementation? This far I have come up with the following: > -Completeness of the implementation > -Latency/bandwidth > -Asynchronous communication > -Smart collective communication I think that explicit treatment/comparison of the interconnect's RDMA capabilities is important, as they support both MPI-2 and the new-ish UPC and CAF compilers for cluster systems. I can send you a recent article I wrote comparing Quadrics to the Cray X1 interconnect relative to the performance of these global address space programming models (UPC and CAF). Another thing to look at is the latency advantage/potential of alternative paths to the processor (i.e. HT/InfiniPath). > > Are there any NICs on the market which utilize the 10GBase-CX4 > standard and if there are, are there any clusters which use them?
Do you > see it as a viable choice for an interconnect considering the > relatively low cost of InfiniBand and the fact that 10GBase-T is not > that far in the future? > > When do you estimate that commodity Gigabit NICs with integrated RDMA > support will arrive on the market? (or will they?) Ammasso already sells one. > > > best regards, > Olli-Pekka Richard Walsh AHPCRC From becker at scyld.com Tue Mar 22 09:10:11 2005 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: Message-ID: On Tue, 22 Mar 2005, Jeff Squyres wrote: > Instead, I would like to propose an alternative to an MPI ABI. This is related to one of my pet "thought projects" for several years. We have largely completed developing an alternate cluster programming model that is better suited to most applications. The current MPI design is the wrong model for clusters.

It represents a static view of the cooperating machines. The MPI "cluster size", MPI_Comm_size(MPI_COMM_WORLD, ...), is static for all time. There is no way to take advantage of new machines, or reduce the number of machines that the application depends on.

MPI has a model of initialize-compute-terminate. There is no explicit support for checkpointing, executing as a service, or running "forever".

MPI has no concept of failures. There is no mechanism to report that a rank has failed, and no way to recover from that failure. MPI applications may issue undone work to other ranks and complete the actual task, but they cannot terminate cleanly with missing nodes.

MPI's strength is collective mathematically-oriented operations, not communication. I understand that even the name "Message Passing..." indicates that stream communication isn't the focus, but many applications expect and work well with a sockets-based model.

Cross-architecture jobs are theoretically supported, but very difficult to implement. The capability adds complexity without benefit.
Communicators besides MPI_COMM_WORLD are rarely used. The capability adds complexity with little benefit.

Some of the implications of using a dynamic model are:
- We need information and scheduling interfaces
- We need true dynamic sizing, with process creation and termination primitives
- We need status signaling external to application messages.

There need to be new information interfaces which:
- report usable nodes (which nodes are up, ready and will permit us to start processes)
- report the capability of those nodes (speed, total memory)
- report the availability of those nodes (current load, available memory)
Each of these information types is different and may be provided by a different library and subsystem. We created 'beostat', a status and statistics library, to provide most of this information.

There needs to be an explicit scheduler or mapper interface. We use 'beomap', which can utilize an external scheduler or internally create a list of usable compute nodes. An application should be able to run single-threaded if it decides that multiple processes are not useful. An application should be able to use only a subset of the provided processors if they will not be useful (e.g. an application that uses a regular grid might choose to use only 16 of 23 provided nodes). The unused nodes should be truly unused: if they crash or are otherwise unexpectedly removed, they shouldn't affect correct, error-free completion. Ideally processes should never be started on those nodes.

There need to be new process creation primitives. We already have a well-tested model for this: Unix process management. The only additional element is the ability to specify remote nodes when creating processes.
A standard set of calls should not have the same names, but should use exactly the Unix semantics so that the library only needs to "wrap" the actual system calls.

There need to be asynchronous signaling methods. We already have these as well: Unix signals. Their (still modest) complexity represents many years of experience and demonstrates that getting the semantics for handling asynchronous events right is difficult. > 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on > pay-per-view (read: no need for time-consuming, potentially fruitless > attempts to get MPI implementors to agree on anything) Hmmmm, does this show up on the sports page? Donald Becker Scyld Software Annapolis MD 21403 410-990-9993 From kus at free.net Tue Mar 22 09:24:12 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] AMD Athlon with Intel Fortran Compiler, which options? In-Reply-To: <423ED36E.90708@sciences.univ-metz.fr> Message-ID: In message from Philippe SENOT (Mon, 21 Mar 2005 15:00:14 +0100): >Hi, > >I would like to know which options I must use to compile some progs >on dual athlon 2800+ with intel fortran compiler. You may use all the optimization options available for the Pentium III. Of course, I assume the 32-bit version of the Intel compiler. >And if you know others free fortran compiler for this platform. g77 or (soon to be available) g95. Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow > >Thanks > >-- > >Philippe SENOT >Gestionnaire du parc informatique >Institut de Physique,1 Bd Arago >57078 METZ Cedex 3, FRANCE >Tel. (+33) 03.87.31.58.63 >Fax.
(+33) 03.87.54.72.57 >Abroad replace 03 by 3 above >senot@sciences.univ-metz.fr > > > From joachim at ccrl-nece.de Tue Mar 22 10:22:36 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: References: Message-ID: <4240626C.5000408@ccrl-nece.de> Donald Becker wrote: > On Tue, 22 Mar 2005, Jeff Squyres wrote: > > >>Instead, I would like to propose an alternative to an MPI ABI. > > > This is related to one of my pet "thought projects" for several years. > We have largely completed developing an alternate cluster programming > model that is better suited to most applications. Are you sure you know what most application developers want, and are capable of handling? > The current MPI design is the wrong model for clusters. > It represents a static view of the cooperating machines > The MPI "cluster size", MPI_Comm_size(MPI_COMM_WORLD, ...), is > static for all time. There is no way to take advantage of new > machines, or reduce the number of machines that the application > depends on. New nodes can be added (MPI_COMM_UNIVERSE!). For the removal of nodes, you are right, see "failures". > MPI has a model of initialize-compute-terminate. > There is no explicit support for checkpointing, executing as a > service, or running "forever". There are a lot of examples of either application-driven checkpointing, automatic checkpointing or a combination of both. What kind of support do you have in mind? MPI was not designed to let machines communicate on some system level. What kind of "MPI service" do you have in mind? Did you check the whole dynamic-process chapter of MPI-2 (accept, connect, ...)? I also think that some sort of defined interaction with the scheduling systems is desirable for MPI-2 dynamic processes (to get an answer to the question "where can I run my new processes?"). But this needs to be very lightweight and abstract, to allow a wide range of actual implementations.
> MPI has no concept of failures

This is true, and is addressed by current research projects (MPI-FT, FT-MPI, maybe Open-MPI) and should be covered by a potential update of the MPI standard. > MPI's strength is collective mathematically-oriented operations, not > communication. I understand that even the name "Message Passing.." > indicates that stream communication isn't the focus, but many > applications expect and work well with a sockets-based model. Do you think a one-model-for-everything strategy would be successful? > Cross-architecture jobs are theoretically supported, but very > difficult to implement. The capability adds complexity without > benefit. MPI provides everything you need. The fact that it is hard to implement or set up is a problem of the implementation, not the standard. > Communicators besides MPI_COMM_WORLD are rarely used. The capability > adds complexity with little benefit. Hmm, so you don't want communicators? I'd say that they are critical to any sort of dynamic environment. Besides that, they are already used today in many more complex codes, and you can't cleanly implement a library calling MPI without them. Largely, what you describe sounds more like some variant of grid computing. MPI is not about grid computing; it's mostly about performance, but it can be embedded in a grid environment. But it should not try to represent a grid environment itself. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From mathog at mendel.bio.caltech.edu Tue Mar 22 11:42:57 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Re: Why Do Clusters Suck? Message-ID: On Mon, 21 Mar 2005 Douglas Eadline wrote: > > So why do clusters suck? Would that they did suck - it would make cooling them a lot easier. Unfortunately, while most of them blow pretty well, their sucking sucks.
Which has nothing to do with your original post, assuming either of these messages actually makes it through anybody's spam filters. More to the point, while there is certainly a lot of room for improvement, an awful lot of work is getting done today using existing cluster technology, and it's far from clear to me that an advance in cluster management software would result in much more productivity. As opposed to, for instance, improving network throughput, CPU power, or component reliability by a factor of 10, any one of which would lead to an immediate and dramatic productivity increase. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rgb at phy.duke.edu Tue Mar 22 11:51:12 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: References: Message-ID: Hmmm, looks like the list is about to have Doug's much-desired discussion of Community Goals online. I'll definitely play. The following is a standard rgb thingie, so humans with actual work to do might not want to read it all right now... (Jeff L., I do have a special message coming to the list "just for you";-) On Tue, 22 Mar 2005, Donald Becker wrote:

> Some of the implications of using a dynamic model are
> - We need information and scheduling interfaces
> - We need true dynamic sizing, with process creation and termination
>   primitives
> - We need status signaling external to application messages.
>
> There need to be new information interfaces which
> - report usable nodes (which nodes are up, ready and will permit us
>   to start processes)
> - report the capability of those nodes (speed, total memory)
> - report the availability of those nodes (current load, available
>   memory)
> Each of these information types is different and may be provided
> by a different library and subsystem.
We created 'beostat', a status > and statistics library, to provide most of this information. > There needs to be an explicit scheduler or mapper interface. > We use 'beomap', which can utilize an external scheduler or > internally create a list of usable compute nodes. Agreed. I also have been working on daemons and associated library tools to provide some of this information as well on the fully GPL/freely distributable side of things for a long time. I've just started a new project (xmlbenchd) that should be able to provide really detailed capabilities information about nodes via a daemon interface from "plug in" benchmarks, both micro and macro (supplemented with data snarfed from /proc using xmlsysd code fragments). xmlsysd already provides CPU clock, total memory, L2 cache size, total and available memory, and PID snapshots of running jobs. Both daemons use (or will use in the case of xmlbenchd) xml tags to wrap all output, which should make creating applications that use the data pretty simple, and there is a provided library that can manage connections to a specified cluster. XMLified tags also FORCE one to organize and present the data hierarchically and extensibly. I've tried (unsuccessfully) to convince Linus to lean on the maintainers of the /proc interface to achieve some small degree of hierarchical organization and uniformity in presentation, but it looks like it will remain the disorganized and inconsistent mess that it is now for the indefinite future (speaking as one who has now written quite a lot of code parsing it, and having had to hand-write and customize that code pretty much per /proc-based file). (And yes, I know there are performance issues, but frankly either screw ASCII altogether and make /proc a binary interface with an accompanying library for real efficiency -- parts of it might as well be binary now for all the documentation or consistency -- or provide a good topdown hierarchical and consistent organization of the data in ASCII. 
Or do both but with different refresh rates (fast binary, slow ascii/xml), but don't do ugly. What's there now is ugly.) There are several advantages to using daemons (compared to a kernel module or kernel-level insertion and a library of extended systems calls) for providing system information both locally and across a network: a) The daemons access information read only on the host system and can be safely run from userspace (an actual user or nobody/xinetd) with very familiar and standardized tools to control access on the basis of privacy rather than modification. b) They can be trivially packaged as e.g. rpm's per distribution and don't have to be rebuilt every time you update/change the kernel, which means that kernels can dynamically update to fix e.g. security problems, bugs, performance issues. Even fairly major kernel changes will generally not break the tools (as long as /proc remains structurally intact). c) It's a bit safer and easier to modify and develop objects that live in userspace than it is things that will run as root or inside the kernel. A bug is always a problem in code, but bugs won't necessarily immediately compromise root on a system if they live in an anonymous daemon, nor will they cause a system crash leading to a reboot. d) One hopes to be able to extensibly separate the SOURCE of a daemon (especially a GPL daemon) from the COMMUNICATIONS "language" the daemon speaks, so that there can be multiple implementations and so that applications that use the information can be written independent of the implementation. A "trivial" ascii-based communications/command set and XML-encapsulated returns mean that you can talk to the daemon with telnet, perl, c code, absolutely anything that can manage a connection, and write applications without worrying about how the information is being provided. 
This is possible with a kernel-based tool as well, of course, but again, the level of programming expertise required to do safe kernel programming is a fairly solid barrier against multiple implementations, leaving control over the interface itself in a single set of hands (which can be both good and bad, depending on the hands:-). The weakness in what I'm doing is that I'm a single human being and have numerous responsibilities outside of these projects, so I haven't yet written ALL the applications that could use this information. It's a sort of "if I build it, they will come" thing, even though this isn't really horribly likely (sorry Jeff, don't wait for people to pound down your door -- if you want a nifty piece of code written, the only way to get there is to carry it yourself). Still, I totally agree that this is EXACTLY the kind of information that needs to be available via an open, standard, universal, extensible interface. > An application should be able to run single-threaded if it decides > that multiple processes are not useful. Sure. > An application should be able to use only a subset of provided > processors if they will not be useful (e.g. an application that uses > a regular grid might choose to use only 16 of 23 provided nodes. > The unused nodes should be truly unused: if they crash or are > otherwise unexpectedly removed they shouldn't affect correct, > error-free completion. Ideally processes should never be started on > those nodes. Absolutely. And this needs to be done in such a way that the programmer doesn't have to work too hard to arrange it. I imagine that this CAN be done with e.g. PVM or some MPIs (although I'm not sure about the latter), but is it easy? > There needs to be new process creation primitives. > We already have a well-tested model for this: Unix process > management. The only additional element is the ability to specify > remote nodes when creating processes.
> Monitoring and handling
> terminating processes does not need extensions. We use BProc, but
> the concept of remote_fork() existed long before BProc. A standard
> set of calls should not have the same names, but should use exactly
> the Unix semantics so that the library only needs to "wrap" the
> actual system calls.

Agreed, but while dealing with this one also needs to think about security. Grids and other distributed parallel computing paradigms are increasingly popular as problems that map well into them proliferate, and with open networks of potential compute resources comes a need to be able to manage SECURE and AUTHENTICATED new process creation. On a WAN, this will likely require both host and user identification, bidirectional encryption of traffic, and more -- something much closer to ssh and/or ssl than rsh. I also personally would much prefer that the actual primitives run outside the kernel for the reasons noted above -- at this point in time it should be quite possible to build a system that can run remotely submitted user applications in a chroot jail on top of an absolutely standard distribution and kernel with not only no particularly special privileges, but with FEWER privileges than any real user of the system. This may be naive or not a viable solution for all kinds of programs, but either way I think that we need to develop a viable, efficient, and controllable security model for clustering. Up to now too many tools assume that a "cluster" is de facto firewalled and that this makes it ok to be sloppy about security. This in turn severely restricts their portability and generality.

> There needs to be asynchronous signaling methods
> We already have this as well: Unix signals. Their (still modest)
> complexity represents many years of experience and demonstrates that
> getting the semantics for handling asynchronous events is difficult.

Amen. Although I'm not sure that we have a network equivalent of async/out-of-band signalling. Wish we did.
This is where life gets complicated from the point of view of root control and security -- Unix signals live in a very solidly defined place in rootspace with a very well defined user interface. It gives me a bit of a headache to think of how one might implement something "transparently similar" across a network, especially a WAN, securely, with or without a kernel insertion. Another point is that Unix signals tend to be largely predefined with only a couple provided for "user defined" purposes. I'd think that cluster signals would be almost all user defined, with a relatively small set that provide predefined/default functionality similar to their underlying Unix counterparts. That is, cluster signals might need to be more extensible. However it would be very good to implement at least some signals that behave "just like" their Unix counterparts for cluster-distributed apps and to add a few others that are just as universally defined but cluster specific (such as a signal for "checkpointing" that -- if implemented -- causes all node tasks to stop what they are doing and checkpoint, possibly exiting immediately afterwards).

> > 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on
> > pay-per-view (read: no need for time-consuming, potentially fruitless
> > attempts to get MPI implementors to agree on anything)
>
> Hmmmm, does this show up on the sports page?

It's actually very interesting -- the Head Node article in CWM seemed to me to be a prediction that MPI was "finished" in the sense that it is complete and all anyone ever needs. At the same time, on this list, very knowledgeable people seem to be suggesting that far from being "finished" in the sense of being complete and not needing any further tinkering, it may be "finished" in the sense that it needs such a radical rewrite and set of extensions that the new product might as well be given a new name altogether, even if it does share call names for the actual message passing primitives.
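A single-host sketch of that hypothetical "checkpoint" signal, using SIGUSR1 (an ordinary user-defined Unix signal) as a stand-in for whatever cluster-wide delivery mechanism would actually carry it; a real node task would serialize its state to stable storage in the handler:

```python
# Sketch of a node task that checkpoints on a user-defined signal.
# SIGUSR1 stands in for a cluster-wide "checkpoint" signal here; the
# state dictionary and handler are invented for illustration.
import os
import signal

state = {"step": 0, "checkpointed": False}

def on_checkpoint(signum, frame):
    # A real handler would write state to stable storage here,
    # and possibly exit immediately afterwards.
    state["checkpointed"] = True

signal.signal(signal.SIGUSR1, on_checkpoint)

for _ in range(5):                       # the task's ordinary work loop
    state["step"] += 1

os.kill(os.getpid(), signal.SIGUSR1)     # stand-in for cluster delivery
print(state["step"], state["checkpointed"])
```

The hard part the text identifies -- delivering this securely across a network to every node task -- is exactly what this local sketch does not solve.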
This also brings to mind the ways that PVM and MPI are often differentiated -- by their control interface. PVM has its lovely console, where one can go in (in userspace) and feel like one is "configuring a virtual cluster" with at least some controls and information-gathering tools (even though they fall far short of what they SHOULD be). PVM also does allow one to add hosts and configure a cluster INSIDE a PVM program -- one CAN write a program that starts on node a, which creates a virtual cluster consisting of a, b and c, spawns tasks across the cluster, and when the subtask on node c completes, releases it from the virtual cluster while a and b carry on. PVM also has a crude interface to signal(), although I'm not sure it is exactly what you are suggesting for MPI or general cluster support. To summarize, I think that the basic argument being advanced (correct me if I'm wrong) is that there should be a whole layer of what amount to meta-information tools inserted underneath message passing libraries of any flavor so that (for example) the pvm console command "conf" returns a number that MEANS SOMETHING for "speed", and in fact so that the pvm library (or mpi library or an add-on library used both by the pvm console code and the user's application) can contain primitives that can retrieve e.g. the actual speeds of the system (where the plural is deliberate) that might be needed to manage e.g. resource allocation, task partitioning, dynamic code optimization, and more. I also think that it is worth looking at the philosophical differences between PVM and MPI more closely, as to me at least they ARE very different for all that they are both message passing libraries.

> Donald Becker
> Scyld Software
> Annapolis MD 21403 410-990-9993
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- Robert G.
Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Tue Mar 22 11:52:36 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] For -- mostly -- Jeff Layton...;-) Message-ID: This is a complete message from rgb. (Just so you could say you've read at least one... see this month's CWM:-) rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From ctierney at HPTI.com Tue Mar 22 13:26:06 2005 From: ctierney at HPTI.com (Craig Tierney) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Re: Why Do Clusters Suck? In-Reply-To: References: Message-ID: <1111526485.2641.133.camel@localhost.localdomain> On Tue, 2005-03-22 at 12:42, David Mathog wrote: > On Mon, 21 Mar 2005 Douglas Eadline wrote: > > > > So why do clusters suck? > > Would that they did suck - it would make cooling them a lot > easier. Unfortunately while most of them blow pretty well their > sucking sucks. > > Which has nothing to do with your original post, assuming either of > these messages actually make it through anybody's spam filters. > > More to the point, while there is certainly a lot > of room for improvement, an awful lot of work is > getting done today using existing cluster technology > and it's far from clear to me that an advance > in cluster management software would result in much more > productivity. As opposed to, for instance, improving > network throughput, CPU power, or component reliability by > a factor of 10, any one of which would lead to an immediate > and dramatic productivity increase. > Would it? As far as hardware specs, it depends on your needs. 
Myrinet or IB is more than enough bandwidth for us (weather and ocean models, nearest neighbor communications); we prefer better latency. CPU power is nice, but you can use the same CPUs in a cluster that you can use in big iron, you just have to pay more. That isn't a cluster issue. We have over a thousand nodes and hardware reliability has never significantly impacted our users and their productivity. Some HA setups might help with filesystems and admin servers, but we are already at >99.9% availability on hardware that is a single point of failure in the cluster without HA. Our biggest problem is the immaturity of development tools. Another way to put that is "my compiler doesn't reproduce the bugs in the other compilers my users are accustomed to using" or "Fortran isn't a standard, it is a suggestion". It is a rare creature that writes clean, portable code. It is all too common to hear developers tell me things like "does it work if you turn off bounds checking?". I spend way too much time with new users trying to explain to them the difference between 'code porting' and 'bug fixing'. Craig

> Regards,
>
> David Mathog
> mathog@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

From becker at scyld.com Tue Mar 22 13:35:18 2005 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: Message-ID: On Tue, 22 Mar 2005, Robert G. Brown wrote:

> Hmmm, looks like the list is about to have Doug's much desired
> discussion of Community Goals online. I'll definitely play.

Yes, Doug's prep work for the ClusterWorld Summit in May triggered my initial response.
> The following is a standard rgb thingie, so humans with actual work to
> do might not want to read it all right now... (Jeff L., I do have a
> special message coming to the list "just for you";-)
>
> On Tue, 22 Mar 2005, Donald Becker wrote:

I sent this just a few minutes ago... how did you write a chapter-long reply? Presumably clones, but how do you synchronize them?

> > There needs to be new information interfaces which
> > - report usable nodes (which nodes are up, ready and will permit us
> > to start processes)
> > - report the capability of those nodes (speed, total memory)
> > - report the availability of those nodes (current load, available
> > memory)
> > Each of these information types is different and may be provided
> > by a different library and subsystem. We created 'beostat', a status
> > and statistics library, to provide most of this information.
>
> Agreed. I also have been working on daemons and associated library
> tools to provide some of this information as well on the fully
> GPL/freely distributable side of things for a long time.

There are GPL versions of BeoStat and BeoMap. (Note: GPL, not LGPL.) Admittedly they are older versions, but they are still valid. We are pretty good about not changing the API unless there is a flaw that can't be worked around. Many other projects seem to take the approach of "that's last week's API". That said, we are designing a new API for BeoStat and extensions to BeoMap. We have to make significant changes for hyperthreading and multi-core, and how they relate to NUMA. We are taking this opportunity to clean up the ugliness that lingers from back when the Alpha was "the" 64-bit processor.

> I've just
> started a new project (xmlbenchd) that should be able to provide really
> detailed capabilities information about nodes via a daemon interface
> from "plug in" benchmarks, both micro and macro (supplemented with data
> snarfed from /proc using xmlsysd code fragments).
The trick is providing useful capability information without introducing complexity. I don't see benchmark results, even microBMs, as being directly usable for local schedulers like BeoMap.

> xmlsysd already provides CPU clock, total memory, L2 cache size, total
> and available memory, and PID snapshots of running jobs.

BeoStat provides static info: the CPU clock speed and total memory. It provides dynamic info on load average (the standard three values per node), CPU utilization (per processor! not per node), memory used, and network traffic for up to four interfaces. I don't see L2 cache size as being useful. Beostat provides the processor type, which is marginally more useful but still largely unused. Nor is the PID count useful in old-style clusters. I know that Ganglia reports it, but like most Ganglia statistics it's mostly because the number is easy to get, not because they know what to do with it! (I wrote the BeoStat->Ganglia translator, and consider most of their decisions as being, uhmmm, ad hoc.) The PID count *is* useful in Scyld clusters, since we run mostly applications, not 20 or 30 daemons. Not that BeoStat doesn't have its share of useless information. Like reporting the available disk space -- a number which is best ignored.

> use (or will use in the case of xmlbenchd) xml tags to wrap all output,

Acckkk! XML! Shoot that man before he reproduces. (too late)

> XMLified tags also FORCE one to organize and present the data

My, my, my. Just like Pascal/ADA/etc forces you to write structured programs?

> hierarchically and extensibly. I've tried (unsuccessfully) to convince
> Linus to lean on the maintainers of the /proc interface to achieve some
> small degree of hierarchical organization and uniformity in

Doh! Don't you dare use one of my own pet peeves against me! /proc is a hodge-podge of formats, and people change them without thinking them through. I still remember the change to /proc/net/dev to add a field.
The fact that the new field was only used for PPP, while breaking every existing installation ("who uses 'ifconfig' or 'netstat'?"), didn't seem to deter the change, and once made it was impossible to change back. But that still doesn't make XML the right thing. Having a real design and keeping interfaces stable until the next big design change is the answer.

> There are several advantages to using daemons (compared to a kernel

Many people assume that much of what Scyld does is in the kernel. There are only a few basic mechanisms in the kernel, largely BProc, with the rest implemented as user-level subsystems. And most of those subsystems are partially usable in a non-Scyld system. The reason to use kernel features is to implement security mechanism (never policy) to allow applications to run unchanged. We use kernel hooks only for the unified process space, process migration and the security mechanism. Managing process table entries and correctly forwarding signals can only work with a kernel interface. Having the node security mechanism in the kernel allows us to implement policy with unprivileged user-level libraries. That means an application can provide its own tuned scheduler function, or use a dynamic library provided by the end user. Otherwise the scheduler must run as a privileged daemon, tunable only by the system administrator. It could be worse: some process migration systems put the scheduler policy in the kernel itself!

> Still, I totally agree that this is EXACTLY the kind of information that
> needs to be available via an open standard, universal, extensible,
> interface.

Strike the word "extensible". That's a sure way to end up with a mechanism that is complex and doesn't do anything well.

> > An application should be able to use only a subset of provided
> > processors if they will not be useful (e.g. an application that uses
> > a regular grid might choose to use only 16 of 23 provided nodes.
>
> Absolutely.
> And this needs to be done in such a way that the programmer
> doesn't have to work too hard to arrange it. I imagine that this CAN be
> done with e.g. PVM or some MPIs (although I'm not sure about the latter)
> but is it easy?

It is with our BeoMPI, which runs single-threaded until it hits MPI_Init(). That means the application can modify its own schedule, or read its configuration information, before deciding it will use MPI. One aspect of our current approach is that it requires remote_fork(). Scyld Beowulf already has this, but even with our system a remote fork may be more expensive than just a remote exec() if the address space is dirty. I believe that an MPI-like library can get much of the same benefit, at the cost of a little extra programming, by providing flags that are only set when this is a remote or slave (non-rank-0) process. (That last line is confusing: consider the case where MPI rank 0 is supposed to end up on a remote machine, with no processes left on the originating machine.)

> > There needs to be new process creation primitives.
> > We already have a well-tested model for this: Unix process
>
> Agreed, but while dealing with this one also needs to think about
> security.

We have ;->.

> Grids and other distributed parallel computing paradigms are

Grids have a fundamental security problem: how do you know what you are running on the remote machine? Is it the same binary? With the same libraries? Linked in the same order? With the same kernel? Really the same kernel, or one with changed semantics like RedHat's "2.4-w/2.6-threading"? That's not even covering the malicious angle: "thank you for providing me with the credentials for reading all of your files". Operationally, Grids have the problem that they must define both the protocols and semantics before they can even start to work, and then there will be a lifetime of backwards and forward compatibility issues.
You won't see this at first, just like the first version of Perl/Python/Java was "the first portable language". But version skew and semantic compatibility is *the* issue to deal with, not "how can we hack it to do something for SCXX".

> > > 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on
> > > pay-per-view (read: no need for time-consuming, potentially fruitless
> > > attempts to get MPI implementors to agree on anything)
> >
> > Hmmmm, does this show up on the sports page?
>
> It's actually very interesting -- the Head Node article in CWM seemed to
> me to be a prediction that MPI was "finished" in the sense that it is
...
> "finished" in the sense of being complete and needing any further
..
> "finished" in the sense that it needs such a radical rewrite

I fall into the "finished -- let's not break it by piecemeal changes" camp. Many "add checkpointing to MPI" and "add fault tolerance to MPI" projects have been funded for years. We need a new model that handles dynamic growth, with failures being just a minor aspect of the design. I don't see how MPI evolves into that new thing.

> To summarize, I think that the basic argument being advanced (correct me
> if I'm wrong) is that there should be a whole layer of what amount to
> meta-information tools inserted underneath message passing libraries of
> any flavor so that (for example) the pvm console command "conf" returns
> a number that MEANS SOMETHING for "speed", and in fact so that the pvm

Slight disagreement here: I think we need multiple subsystems that work well together, rather than a single do-it-all library. The architecture for a status-and-statistics system (BeoStat for Scyld) is different than for the scheduler (e.g. BeoMap), even though one may depend on the API of the other. If we put it all into one big library, it will be difficult to evolve and fix. (I'm assuming we can avoid API creep, which may not hold true.)
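As a footnote to the status-and-statistics point: the dynamic per-node numbers a library like BeoStat ships around start life in the local kernel, so a collector only has to sample them and put them on the wire. A plain-Python sketch of reading them locally (not any real BeoStat call):

```python
import os

# The standard three values per node that a status library reports:
# 1-, 5- and 15-minute load averages, the same triple the kernel
# exposes in /proc/loadavg. A collector daemon would sample this
# periodically on each node and forward it to whoever asks.
one, five, fifteen = os.getloadavg()
print(f"load: {one:.2f} {five:.2f} {fifteen:.2f}")
```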
Donald Becker Scyld Software Annapolis MD 21403 410-990-9993

From mathog at mendel.bio.caltech.edu Tue Mar 22 14:12:05 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Re: Why Do Clusters Suck? Message-ID:

> On Tue, 2005-03-22 at 12:42, David Mathog wrote:
> >
> > More to the point, while there is certainly a lot
> > of room for improvement, an awful lot of work is
> > getting done today using existing cluster technology
> > and it's far from clear to me that an advance
> > in cluster management software would result in much more
> > productivity. As opposed to, for instance, improving
> > network throughput, CPU power, or component reliability by
> > a factor of 10, any one of which would lead to an immediate
> > and dramatic productivity increase.
>
> Would it?

Yes. Programs tend to be CPU limited and/or bandwidth limited. If you improve the relevant components the program will speed up to the point that something else becomes the new bottleneck. For most of our work now the CPU or memory bandwidth is limiting, but for some operations (data distribution) the network bandwidth is.

> Myrinet or IB is more than enough bandwidth for us

Ok. Now imagine what would happen if you dropped back to 100baseT, which is what I'm still using.

> (weather and ocean models, nearest neighbor communications), we prefer
> better latency.

> We have over a thousand nodes and hardware
> reliability has never significantly impacted our users and their
> productivity.

We've lost up to 2 of our 20 nodes at a time. Most of our tasks depend upon particular data set slices being distributed across the nodes. When one node goes down it takes several hours to redistribute the data appropriately among the remaining nodes. If I had 1000 nodes this would become enough of a problem that I'd have to redo the data distribution method and build in something resembling RAID-like redundancy.
> Our biggest problem is the immaturity of development
> tools.

I feel your pain on that one.

> It is all too common to hear
> developers tell me things like "does it work if you turn off bounds
> checking?".

Egads! I'm a big fan of building and testing programs on as many completely different platforms as possible, and with every possible warning enabled. That does wonders for wringing latent bugs out of code. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech

From landman at scalableinformatics.com Tue Mar 22 14:22:14 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Re: Why Do Clusters Suck? In-Reply-To: <1111526485.2641.133.camel@localhost.localdomain> References: <1111526485.2641.133.camel@localhost.localdomain> Message-ID: <42409A96.4050709@scalableinformatics.com> Craig Tierney wrote:

> On Tue, 2005-03-22 at 12:42, David Mathog wrote:
>> On Mon, 21 Mar 2005 Douglas Eadline wrote:
>>> So why do clusters suck?
>> Would that they did suck - it would make cooling them a lot
>> easier. Unfortunately while most of them blow pretty well their
>> sucking sucks.

Sometimes they neither suck nor blow, as their fans are clogged, and they basically overheat. But that is another thread.

[...]

> Our biggest problem is the immaturity of development
> tools. Another way to put that is "my compiler doesn't reproduce
> the bugs in the other compilers my users are accustomed to using"
> or "Fortran isn't a standard, it is a suggestion". It is a rare creature
> that writes clean, portable code. It is all too common to hear
> developers tell me things like "does it work if you turn off bounds
> checking?". I spend way too much time with new users trying to explain
> to them the difference between 'code porting' and 'bug fixing'.

me: "how do you know it works"
them: "it compiles with no errors"
me: "no... how do you know it works, functions correctly?"
them: (puzzled look) "it compiles with no errors ..."

-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web: http://www.scalableinformatics.com phone: +1 734 786 8423 fax: +1 734 786 8452 cell: +1 734 612 4615

From epaulson at cs.wisc.edu Tue Mar 22 11:17:40 2005 From: epaulson at cs.wisc.edu (Erik Paulson) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: <4240626C.5000408@ccrl-nece.de> References: <4240626C.5000408@ccrl-nece.de> Message-ID: <20050322191740.GD19499@cobalt.cs.wisc.edu> On Tue, Mar 22, 2005 at 07:22:36PM +0100, Joachim Worringen wrote:

> Largely, what you describe sounds more like some variant of grid
> computing. MPI is not about grid computing, it's mostly about
> performance, but can be embedded in a grid environment. But it should
> not try to represent a grid environment itself.

I think the point Don is trying to make is that MPI was designed to be a parallel environment for machines that could be tightly-coupled MPPs or piles of machines in racks or shelves - and anything that couldn't be done well on both sides was left out of the standard. The situation on the ground today is that effectively all that is left is piles of machines - some with fast interconnects, but most with just ethernet, and all with a higher-than-we'd-like failure rate. But we're still constrained by a programming model that is designed to take the same program and run it efficiently on a T3E or a cluster of PCs. I'd also bet that if we looked at most MPI programs out there, they really could be restructured very easily to take advantage of a dynamic environment, if MPI properly exposed that to them (and MPI-2 is _not_ good enough). From looking at Don's earlier post, I was thinking that PVM gave us a lot of what was listed (and the LAM underneath LAM/MPI, and HARNESS, etc).
If the OpenMPI effort decides to revise some of the MPI standard, I hope they focus on exposing the cluster reality that most parallel programs run on, and don't worry so much about making it run efficiently on machines that no longer exist. -Erik From stuart.midgley at anu.edu.au Tue Mar 22 13:33:54 2005 From: stuart.midgley at anu.edu.au (Stuart Midgley) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] AMD Athlon with Intel Fortran Compiler, which options? In-Reply-To: References: Message-ID: <3239900dd6a7dd4f3b6454cbc4dd0600@anu.edu.au> Actually you can use everything except the SSE3 stuff. Once the 2.6GHz Opterons are out and about, you will be able to use SSE3 as well. I have managed to compile code in 64bit mode on an opteron using the Intel compilers. If I remember correctly, I used something like -fast -xW whereas -fast usually turns on -xP which gives SSE3 stuff. Stu. On 23/03/2005, at 4:24, Mikhail Kuzminsky wrote: > In message from Philippe SENOT (Mon, 21 > Mar 2005 15:00:14 +0100): >> Hi, >> >> I would like to know which options I must use to compile some progs >> on dual athlon 2800+ with intel fortran compiler. > You may use all the optimization options available for Pentium III. > Of course, I assume 32-bit version of Intel compiler. > >> And if you know others free fortran compiler for this platform. > g77 or (must be soon available) g95. > > Mikhail Kuzminsky > Zelinsky Institute of Organic Chemistry > Moscow -- <---------------------------------------------------------------------> Dr Stuart Midgley | stuart.midgley@anu.edu.au Supercomputer Facility | smidgley@netspace.net.au Leonard Huxley Building 56 | +61 (0)2 6125 5988 Work Australian National University | +61 (0)2 6125 8199 Fax CANBERRA ACT 0200 | +61 (0)4 1125 2488 Mob From ctierney at HPTI.com Tue Mar 22 15:01:09 2005 From: ctierney at HPTI.com (Craig Tierney) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Re: Why Do Clusters Suck? 
In-Reply-To: References: Message-ID: <1111532469.2641.224.camel@localhost.localdomain> On Tue, 2005-03-22 at 15:12, David Mathog wrote:

> > On Tue, 2005-03-22 at 12:42, David Mathog wrote:
> > >
> > > More to the point, while there is certainly a lot
> > > of room for improvement, an awful lot of work is
> > > getting done today using existing cluster technology
> > > and it's far from clear to me that an advance
> > > in cluster management software would result in much more
> > > productivity. As opposed to, for instance, improving
> > > network throughput, CPU power, or component reliability by
> > > a factor of 10, any one of which would lead to an immediate
> > > and dramatic productivity increase.
> >
> > Would it?
>
> Yes. Programs tend to be CPU limited and/or bandwidth
> limited. If you improve the relevant components the program
> will speed up to the point that something else becomes the new
> bottleneck. For most of our work now the CPU or memory bandwidth
> is limiting but for some operations (data distribution) the
> network bandwidth is.
>
> > Myrinet or IB is more than enough bandwidth for
> > us
>
> Ok. Now imagine what would happen if you dropped back to 100baseT,
> which is what I'm still using.

I think we are talking about two different things. I thought we were talking 'Why do clusters suck'. My response was to indicate that your problems aren't necessarily related to clusters. It might be related to your cluster, but not clusters in general. Of course if we don't throw in "It depends" we will start more arguments. If you want to compare network bandwidth of a cluster to an Altix or Cray then yes, the bandwidth is woefully inadequate. As far as CPUs, you can buy nodes with Itanium and build a cheap Altix if you don't want shared memory. Cray vector processors are a bit difficult. Even for IBM systems, you can get the POWER5 in a small form factor and build a cluster.
For the nodes themselves, you can buy systems with redundant power and redundant disk. You can buy from a system vendor that qualifies their hardware much more rigorously than another. When you buy cheap hardware you get cheap hardware.

> > (weather and ocean models, nearest neighbor communications), we prefer
> > better latency.
>
> > We have over a thousand nodes and hardware
> > reliability has never significantly impacted our users and their
> > productivity.
>
> We've lost up to 2 of our 20 nodes at a time. Most of our
> tasks depend upon particular data set slices being distributed
> across the nodes. When one node goes down it takes several
> hours to redistribute the data appropriately among the
> remaining nodes. If I had 1000 nodes this would become
> enough of a problem that I'd have to redo the data distribution
> method and build in something resembling RAID-like redundancy.

Did you have to architect your system this way? This is an issue with your problem and your solution, not clusters. Losing nodes is only critical if you lose the data. You should be able to pull the disk, plop it in another node, turn it on and keep going. That shouldn't take long, as long as someone has access. Is the downtime of that process the most cost-effective way to maintain your availability? I don't know the format of your data, but you could buy one more node and add it to the cluster. At certain intervals, copy the data from one node (assuming local disk) to the next. This is probably better than redistributing your data from 20 nodes to 18 nodes. In case of a failure, have the extra node kick in and redistribute the task, not the data, to the right nodes. You will have to balance the number of checkpoints against the amount of processing that gets done, but it is doable. Or, I am completely wrong. I don't really know your problem. My point is: have you considered finding a way to not have to redistribute all of the data?
> Our biggest problem is the immaturity of development
> tools.

I feel your pain on that one.

> It is all too common to hear
> developers tell me things like "does it work if you turn off bounds
> checking?".

Egads! I'm a big fan of building and testing programs on as many completely different platforms as possible, and with every possible warning enabled. That does wonders for wringing latent bugs out of code.

What you have are scientists playing computational scientists. The code isn't what is important (for my users), the results are. However, scientists are rarely willing to give a dime of funding to hire computational scientists, even though in the long run they would probably write more papers and get more science done. Some get it, but not as many as should. Craig

> Regards,
>
> David Mathog
> mathog@caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

From stuart.midgley at anu.edu.au Tue Mar 22 15:12:17 2005 From: stuart.midgley at anu.edu.au (Stuart Midgley) Date: Wed Nov 25 01:03:55 2009 Subject: [Beowulf] Re: Why Do Clusters Suck? In-Reply-To: <1111532469.2641.224.camel@localhost.localdomain> References: <1111532469.2641.224.camel@localhost.localdomain> Message-ID: Of course, you can buy expensive hardware (take our AUD10mil Alphaserver SC - based on 127 ES45's for example) and still have every CPU replaced, every dimm replaced and every disk replaced over 3 years :) We are still losing a node disk (10k rpm Ultra SCSI 320 disks) roughly every 3-4 days. Fortunately, our Quadrics has never failed. Contrast this to our 152 node Dell cluster (single cpu PR350's), which has had a couple of disks and a couple of fans fail in 2 years... Stu.

> For the nodes themselves, you can buy systems with redundant power and
> redundant disk. You can buy from a system vendor that qualifies
> their hardware much more rigorously than another. When you buy cheap
> hardware you get cheap hardware.
--
<--------------------------------------------------------------------->
Dr Stuart Midgley              | stuart.midgley@anu.edu.au
Supercomputer Facility         | smidgley@netspace.net.au
Leonard Huxley Building 56     | +61 (0)2 6125 5988 Work
Australian National University | +61 (0)2 6125 8199 Fax
CANBERRA ACT 0200              | +61 (0)4 1125 2488 Mob

From fant at pobox.com Tue Mar 22 15:04:14 2005
From: fant at pobox.com (Andrew D. Fant)
Date: Wed Nov 25 01:03:55 2009
Subject: [Beowulf] Re: Why Do Clusters Suck?
In-Reply-To: <42409A96.4050709@scalableinformatics.com>
References: <1111526485.2641.133.camel@localhost.localdomain> <42409A96.4050709@scalableinformatics.com>
Message-ID: <4240A46E.4040408@pobox.com>

Joe Landman wrote:
>
> Craig Tierney wrote:
>
>> Our biggest problem is the immaturity of development tools. Another
>> way to put that is "my compiler doesn't reproduce the bugs in the
>> other compilers my users are accustomed to using" or "Fortran isn't a
>> standard, it is a suggestion". It is a rare creature that writes
>> clean, portable code. It is all too common to hear developers tell me
>> things like "does it work if you turn off bounds checking?". I spend
>> way too much time with new users trying to explain to them the
>> difference between 'code porting' and 'bug fixing'.
>
> me: "how do you know it works"
> them: "it compiles with no errors"
> me: "no... how do you know it works, functions correctly?"
> them: (puzzled look) "it compiles with no errors ..."

Amen to that, Joe. My personal complaint is that there aren't enough good standard test/validation suites out there for cluster building. Some libraries like Atlas include them, but they are also tied to that specific package. It would be really great if as a community we could do something like the Linux test project oriented towards cluster-building and scientific computing.
Something that I can run when my boss wants "proof" that upgrading a library didn't completely rejigger the numerical stability of the results. I know that the stock answer here is that we ought to generate our own regression tests based on our own particular application set, but I think it would be a boon for a more generic framework and solution to evolve. If nothing else, it would offer a basis for heterogeneous systems in a grid environment to trust each other's results without necessarily requiring full application cross-validation. It might be a pipe dream, but I like it 8-)

Andy

From ctierney at HPTI.com Tue Mar 22 15:28:44 2005
From: ctierney at HPTI.com (Craig Tierney)
Date: Wed Nov 25 01:03:56 2009
Subject: [Beowulf] Re: Why Do Clusters Suck?
In-Reply-To:
References: <1111532469.2641.224.camel@localhost.localdomain>
Message-ID: <1111534124.2641.241.camel@localhost.localdomain>

On Tue, 2005-03-22 at 16:12, Stuart Midgley wrote:
> Of course, you can buy expensive hardware (take our AUD10mil
> Alphaserver SC - based on 127 ES45's for example) and still have every
> CPU replaced, every DIMM replaced and every disk replaced over 3 years
> :) We are still losing a node disk (10k rpm Ultra SCSI 320 disks)
> roughly every 3-4 days. Fortunately, our Quadrics has never failed.
>
> Contrast this to our 152 node Dell cluster (single cpu PR350's), which
> has had a couple of disks and a couple of fans fail in 2 years...

Corollary to "when you buy cheap hardware you get cheap hardware": "when you buy expensive hardware it still might be cheap". We had a similar problem with Compaq XP1000 workstations. Every time we turned the system off we would lose 5-10 power supplies (out of 275). A couple of years later we bought some whitebox systems. Three times the node count with 1/10 the total hardware problems.

Craig

> Stu.
>
> > For the nodes themselves, you can buy systems with redundant power and
> > redundant disk.
> > You can buy from a system vendor that qualifies
> > their hardware much more rigorously than another. When you buy cheap
> > hardware you get cheap hardware.
>
> --
> <--------------------------------------------------------------------->
> Dr Stuart Midgley              | stuart.midgley@anu.edu.au
> Supercomputer Facility         | smidgley@netspace.net.au
> Leonard Huxley Building 56     | +61 (0)2 6125 5988 Work
> Australian National University | +61 (0)2 6125 8199 Fax
> CANBERRA ACT 0200              | +61 (0)4 1125 2488 Mob

From stuart.midgley at anu.edu.au Tue Mar 22 15:34:08 2005
From: stuart.midgley at anu.edu.au (Stuart Midgley)
Date: Wed Nov 25 01:03:56 2009
Subject: [Beowulf] Why Do Clusters Suck?
In-Reply-To:
References:
Message-ID: <235ed951f8958ecdfd7d9881380e6605@anu.edu.au>

On 22/03/2005, at 7:36, Douglas Eadline - ClusterWorld Magazine wrote:
> So why do clusters suck?

From my position, this issue is really complex. In the Australian scene, the main reason "clusters suck" has nothing to do with distros, hardware or associated software. It is more an issue with support staff. It is easy to buy hardware, software and download a distro. However, it is very difficult to get good support staff.

Clusters, by their nature and design, are not simple beasts. When everything is running well, you can manage them with almost no staff. However, when something goes wrong the diagnostic/resolution cycle can be long and very complex. An error in an MPI program could be the actual user code, the MPI layer, a system software issue, the interconnect, some hardware failure, or a combination of any of these. Getting good staff to understand and handle all these layers is difficult.

Spending $100k will get you a reasonable-sized cluster on the floor within a few weeks, which will last say 3 years. Yet, in the staff space, $100k doesn't even get you a good system administrator for a single year. And, a system administrator is not always what is required.
They may not have a good understanding of MPI/applications etc.

How to make clusters less sucky? Well, for large-cluster users/system administrators, decent training would be a good start. Training which takes people through the process of building, installing, breaking and fixing a cluster. Of course, then there is the MPI/application side of things, which would be another course. Try to wrap 10 years' worth of system/computational experience up into a 5-day course ;)

Stu.

--
<--------------------------------------------------------------------->
Dr Stuart Midgley              | stuart.midgley@anu.edu.au
Supercomputer Facility         | smidgley@netspace.net.au
Leonard Huxley Building 56     | +61 (0)2 6125 5988 Work
Australian National University | +61 (0)2 6125 8199 Fax
CANBERRA ACT 0200              | +61 (0)4 1125 2488 Mob

From landman at scalableinformatics.com Tue Mar 22 15:43:32 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed Nov 25 01:03:56 2009
Subject: [Beowulf] AMD Athlon with Intel Fortran Compiler, which options?
In-Reply-To: <3239900dd6a7dd4f3b6454cbc4dd0600@anu.edu.au>
References: <3239900dd6a7dd4f3b6454cbc4dd0600@anu.edu.au>
Message-ID: <4240ADA4.6050701@scalableinformatics.com>

I thought they still had processor check code in there, which checks the Intel CPU id and defaults to a p3/p4 base without SSE. You can compile it, it just might not select that code path (SSE*). It does this for Opterons.

Stuart Midgley wrote:
> Actually you can use everything except the SSE3 stuff. Once the 2.6GHz
> Opterons are out and about, you will be able to use SSE3 as well.
>
> I have managed to compile code in 64bit mode on an opteron using the
> Intel compilers. If I remember correctly, I used something like
>
> -fast -xW
>
> whereas -fast usually turns on -xP which gives SSE3 stuff.
>
> Stu.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615

From becker at scyld.com Tue Mar 22 16:18:31 2005
From: becker at scyld.com (Donald Becker)
Date: Wed Nov 25 01:03:56 2009
Subject: [Beowulf] Why Do Clusters Suck?
In-Reply-To: <235ed951f8958ecdfd7d9881380e6605@anu.edu.au>
Message-ID:

On Wed, 23 Mar 2005, Stuart Midgley wrote:
> On 22/03/2005, at 7:36, Douglas Eadline - ClusterWorld Magazine wrote:
> > So why do clusters suck?
> From my position, this issue is really complex. In the Australian
> scene, the main reason "clusters suck" has nothing to do with distros,
> hardware or associated software. It is more an issue with support
> staff...
...
> Clusters, by their nature and design, are not simple beasts.

Just like "Math is hard", "Computers are hard". But there are many things that can be done to make clusters barely more difficult to use than single computers.

If you are already used to running a cluster, you may not realize all of the extra complexity that you have introduced. This is especially true when you write ad hoc programs and scripts. When they work, everything looks fine. But do they work in any other environment, or if anything changes? And what happens when they break?

> When everything is running well, you can manage them with almost no staff.
> However, when something goes wrong the diagnostic/resolution cycle can
> be long and very complex.

Yup. It's not how easy it looks when things go right, but how complex the system is when things go wrong. That's a corollary to "an abstraction layer is worse than useless when you have to look underneath". It's important to have diagnosable, documented tools and a system that is as simple as possible.

> How to make clusters less sucky? Well, for large-cluster
> users/system administrators, decent training would be a good start.
> Training which takes people through the process of building,
> installing, breaking and fixing a cluster. Of course, then there is
> the MPI/application side of things which would be another course. Try
> to wrap 10 years' worth of system/computational experience up into a
> 5-day course ;)

I'm the instructor for many of our introductory training courses. That is one of my motivations to make our system as simple as possible. Sometimes it's faster to write the code to avoid an exception to a general rule than to figure out how to explain it.

A good example is handling heterogeneous hardware. I don't mean mixing Alphas with Itaniums with Opterons. I mean the gritty, everyday kind of minor system differences. Similar-looking systems with different Ethernet adapters. A mix of diskless and disk-based systems. Different versions of PXE. Toss in a few dual processor machines with one CPU removed, a mix of memory sizes, and that flaky disk that you can't quite admit is broken.

Each of these differences can potentially be handled automatically. If you do a full install, the installer might handle the differences and you might not even notice them... until you consider long-term administration. What happens when you do an update? How do you recover when a system disk goes bad? A cluster need not be a collection of workstation environments, and treating it like one adds more complexity than someone with a lot of experience might initially perceive.

From rgb at phy.duke.edu Wed Mar 23 05:48:50 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:03:56 2009
Subject: [Beowulf] Re: Why Do Clusters Suck?
In-Reply-To: <4240A46E.4040408@pobox.com>
References: <1111526485.2641.133.camel@localhost.localdomain> <42409A96.4050709@scalableinformatics.com> <4240A46E.4040408@pobox.com>
Message-ID:

On Tue, 22 Mar 2005, Andrew D. Fant wrote:

> My personal complaint is that there aren't enough good standard
> test/validation suites out there for cluster building.
> Some libraries
> like Atlas include them, but they are also tied to that specific
> package. It would be really great if as a community we could do
> something like the Linux test project oriented towards cluster-building
> and scientific computing. Something that I can run when my boss wants
> "proof" that upgrading a library didn't completely rejigger the
> numerical stability of the results. I know that the stock answer here
> is that we ought to generate our own regression tests based on our own
> particular application set, but I think it would be a boon for a more
> generic framework and solution to evolve. If nothing else, it would
> offer a basis for heterogeneous systems in a grid environment to trust
> each other's results without necessarily requiring full application
> cross-validation. It might be a pipe dream, but I like it 8-)

Hmmm. OK, how's this. Just supposing that I finish building xmlbenchd before an infinite amount of time elapses (I've once again gotten mired in teaching and haven't had time to work on it for a week+). Suppose xmlbenchd can run any given program inside a fairly standard timing wrapper (probably a perl script for maximum portability and ease of use). Suppose that the perl script, which will certainly contain the command line for the application, will therefore (for fixed random number seeds where appropriate) produce some sort of fixed output.

Then it would be trivial to add a segment to at LEAST diff the output with the expected output, and not a horrible amount of work to actually compute a chisq difference between the two. I can easily introduce xml tags for returning a validation score on the actual result (or even a set of such scores) because extensibility IS useful during the time a new thing is being invented (sorry, Don:-) if not beyond.
This would permit the best of both worlds:

a) I expect to assemble a set of macro-level applications to function somewhat like the spec suite does today but without the "corporation" baggage, for distribution WITH the package. At that point I will actually solicit this list for good candidates for primary inclusion. This set can actually be quite large, permitting users to preselect at configuration time the ones to run for their particular site. For the ones that are selected, I will go ahead and do the validation test when I wrap them up in the timing script.

b) Users who want to wrap their OWN application set up for automated benchmarking inside the provided template script will then be able to follow fairly simple instructions and (presuming that they know enough perl to be able to parse their application's output file(s)) validate as well as time to their heart's content.

This may not be sufficient for all users -- I'm probably not going to write a core loop that would permit a sweep of an input parameter in the command line, for example, and for testing e.g. special function calls in the GSL that change algorithms at certain breakpoints, that kind of thing is really necessary. However, folks with more advanced needs will presumably be more advanced programmers, and the perl to add such a sweep and generate a more complex validation isn't terribly challenging.

Would that do?

rgb

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525  email: rgb@phy.duke.edu

From landman at scalableinformatics.com Wed Mar 23 06:49:21 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed Nov 25 01:03:56 2009
Subject: [Beowulf] Re: Why Do Clusters Suck?
In-Reply-To:
References: <1111526485.2641.133.camel@localhost.localdomain> <42409A96.4050709@scalableinformatics.com> <4240A46E.4040408@pobox.com>
Message-ID: <424181F1.5060503@scalableinformatics.com>

Robert G.
Brown wrote:
> On Tue, 22 Mar 2005, Andrew D. Fant wrote:

[...]

> Hmmm. OK, how's this. Just supposing that I finish building xmlbenchd
> before an infinite amount of time elapses (I've once again gotten mired
> in teaching and haven't had time to work on it for a week+). Suppose
> xmlbenchd can run any given program inside a fairly standard timing
> wrapper (probably a perl script for maximum portability and ease of
> use). Suppose that the perl script, which will certainly contain the
> command line for the application, will therefore (for fixed random
> number seeds where appropriate) produce some sort of fixed output.

Hmmm. Have you seen the input and output to BBS (in retrospect, a poorly named tool)? In BBS (bioinformatics benchmark system), you create input experiments using simple XML documents that describe the binary, the arguments, and where STD* should go or come from. You get output in XML format for the experiment in the form of timing. It would not be too difficult a stretch to include regex pattern matchers on the output, or similar, to specifically test for "goodness".

That is, suppose I want to run a nice simple GAMESS run: This package implements a GAMESS benchmark we have found useful for testing compiler and machine options and configurations. GPL 2.0. This will run 2 GAMESS timing experiments (benchmarks), with names of test1-1CPU and test1-2CPU, hand the appropriate arguments to the program, and run each test in order. We can launch multiple instances, and simultaneous invocations of multiple instances with different experiments. The output from this looks like this (after filtering through simplexml, included in the BBS package): We could add in some tags like ... or similar.

FWIW: BBS is at http://www.scalableinformatics.com/BBS , is GPL, and is in active use by a number of groups/companies for testing purposes.
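A hypothetical sketch of what a BBS experiment description along these lines might look like. The element names and paths below are illustrative guesses only, not BBS's actual schema (the original XML examples did not survive):

```xml
<!-- Illustrative only: element names and paths are invented,
     not taken from the real BBS schema. -->
<experiment name="test1-1CPU">
  <description>GAMESS benchmark, single CPU</description>
  <binary>/usr/local/gamess/gamess.x</binary>
  <arguments>test1.inp</arguments>
  <stdout>test1-1CPU.out</stdout>
  <ncpus>1</ncpus>
</experiment>
```

The idea described in the text is simply that the binary, its arguments, and the STD* redirections are all declared in one XML document per experiment, and the timing results come back as XML as well.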
> Then it would be trivial to add a segment to at LEAST diff the output
> with the expected output, and not a horrible amount of work to actually
> compute a chisq difference between the two. I can easily introduce xml
> tags for returning a validation score on the actual result (or even a
> set of such scores) because extensibility IS useful during the time a
> new thing is being invented (sorry, Don:-) if not beyond. This would
> permit the best of both worlds:

Could leverage what exists rather than re-inventing this particular wheel. Let me know if there are particular things you would like to see in the output comparison. Could do a chi-square, but this makes more sense for numerical bits than non-numerical bits (BBS doesn't care, and maybe the solution to this is analysis plug-ins that implement the appropriate tests).

> a) I expect to assemble a set of macro-level applications to function
> somewhat like the spec suite does today but without the "corporation"
> baggage, for distribution WITH the package. At that point I will
> actually solicit this list for good candidates for primary inclusion.
> This set can actually be quite large, permitting users to preselect at
> configuration time the ones to run for their particular site. For the
> ones that are selected, I will go ahead and do the validation test for
> when I wrap them up in the timing script.

Heh...

bbsrun --experiment "test1-1CPU" --debug < gamess.xml

Include the XML with your package, and bbs can (largely) do the rest.

> b) Users who want to wrap their OWN application set up for automated
> benchmarking inside the provided template script will then be able to
> follow fairly simple instructions and (presuming that they know enough
> perl to be able to parse their application's output file(s)) validate as
> well as time to their heart's content.
> This may not be sufficient for all users -- I'm probably not going to
> write a core loop that would permit a sweep of an input parameter in the
> command line, for example, and to test e.g. special function calls in
> the GSL that change algorithms at certain breakpoints that kind of thing
> is really necessary. However, folks with more advanced needs will
> presumably be more advanced programmers and the perl to add such a sweep
> and generate a more complex validation isn't terribly challenging.

:)

> Would that do?

As most of this exists in BBS now, and it is in active use, I would say yes. :)

> rgb

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615

From erwan at seanodes.com Wed Mar 23 04:16:19 2005
From: erwan at seanodes.com (Velu Erwan)
Date: Wed Nov 25 01:03:56 2009
Subject: [Beowulf] What kind of I/O benchmark ?
Message-ID: <1111580179.21799.42.camel@R1.seanodes.com>

Hi folks,
I'm searching for the best way to stress a storage system using some real applications. Running benchmarks such as bonnie++, Iozone or b_eff_io gives some raw numbers about what your storage infrastructure is able to provide. Benchmarks like mpi-tile-io seem to take another approach, trying to match what the "real" performance of your storage would be for a given kind of application (visualization, in the case of mpi-tile-io). Are other people working on this kind of application-oriented benchmark? I have heard that some BLAST code could be used for this purpose; do some of you follow this way of validating/benchmarking clusters?

From rgb at phy.duke.edu Wed Mar 23 07:14:56 2005
From: rgb at phy.duke.edu (Robert G.
Brown) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: References: Message-ID: On Tue, 22 Mar 2005, Donald Becker wrote: > > On Tue, 22 Mar 2005, Donald Becker wrote: > > I sent this just a few minutes ago... how did you write a chapter-long > reply? Presumably clones, but how do you synchronize them? Easy. Spawn a few (don't ask) and send them an MPI_Barrier(:-P) > > Agreed. I also have been working on daemons and associated library > > tools to provide some of this information as well on the fully > > GPL/freely distributable side of things for a long time. > > There are GPL versions of BeoStat and BeoMap. (Note: GPL not LGPL.) > > Admittedly they are older versions, but they are still valid. We are > pretty good about not changing the API unless there is a flaw that can't > be worked around. Many other projects seem to take the approach of > "that's last weeks API". > > That said, we are designing a new API for BeoStat and extensions to > BeoMap. We have to make significant changes for hyperthreading and > multi-core, and how they relate to NUMA. We are taking this opportunity > to clean up the ugliness that lingers from back when the Alpha was "the" > 64 bit processor. And I'm not critiquing your corporate efforts in any way -- as I said, I hardly have time to do the things I want to do in this arena because I have to make a (meager) living, care for a family, and advance my World of Warcraft character at the expense of sleep. Finding a corporate model that pays you to do what you'd likely want to be doing anyway is laudable. I just happen to work on the fully GPL side of things and think that there are certain wheels that need to be reinvented several times before they are really gotten right. 
> > I've just
> > started a new project (xmlbenchd) that should be able to provide really
> > detailed capabilities information about nodes via a daemon interface
> > from "plug in" benchmarks, both micro and macro (supplemented with data
> > snarfed from /proc using xmlsysd code fragments).
>
> The trick is providing useful capability information, without
> introducing complexity. I don't see benchmark results, even microBMs, as
> being directly usable for local schedulers like BeoMap.

For local schedulers, perhaps not, because they tend to run in homogeneous environments anyway. I am thinking ahead to gridware. A grid scheduler really does need to be able to ask for N nodes that have at least M memory, L L2 cache size (which might e.g. affect whether the presumed code blocking runs fast or slow), random memory access rates of at least R, and stream numbers of at least [a,b,c,d]. Or it might ask for N nodes that can run a particular "fingerprint" macro application in at most S seconds (plus sundry memory size constraints). The macro application to be tested might even be their own.

The grid might contain Celerons (small cache) to Xeons and Opterons, 32 bit and 64 bit memory pathways. Simply knowing CPU clock isn't enough, as e.g. Opterons or AMDs in general might have significantly lower clocks than P4's or Celeries and still be superior in speed. A typical grid user may not have the faintest idea of how all those parameters contribute to overall performance, but if the daemon is rigged to return a "fingerprint" vector of results, it becomes at least possible to write e.g. a nifty front end GUI with little slider bars or the like for users to set capability requests.

Note well that the complexity is there, like it or not.
One can do what is usually done and simply ignore it, or provide a tool that can produce a projective picture of the complexity that is "sufficient", so that most programmers can find within it a metric that can be used to optimize performance of their code in some way.

This is the other place that I see xmlbenchd as being (more) useful: inside applications. ATLAS's optimization design is predicated on the existence of a hierarchy of memory latencies with close to an order of magnitude difference between them, as well as certain CPU instructions that speed certain orderings of operations by as much as a factor of 2-3. Algorithms and block sizes are switched to take maximum advantage of the PARTICULAR L2 cache size of a given architecture in its PARTICULAR latency relationship to both registers and regular memory. However, ATLAS's autotuning build process is really, really complex, but because it is prepackaged, it hides all the complexity of the system inside its gradient-searching build scripts.

I think that a perfectly legitimate question in computer science is whether or not this is really necessary -- whether in particular a suitable set of projective measures exists that can be extracted by MBs and used to "tune" a linear algebra library at runtime rather than at build time. By extension, whether or not there are programs out there with perhaps simpler blocking and algorithmic decisions that can be decided on the basis of one or more of the projective measures, where it may be as simple as choosing how to manage trigonometry in code (using sqrt() calls instead of evaluating sin/cos/tan).

It is quite harmless for a daemon to run a rather large set of benchmarks -- for example to time all the math routines in libm (not that I'M going to write the code for this:-) -- so that anybody writing code can run a GUI, select both host and e.g. sin() from scrolled lists, and have the function's timings on the PARTICULAR SYSTEM instantly displayed.
Or (if you prefer) click a couple of things and see not only stream, but a graph of stream as a function of vector size, from vectors of less than a page through 20 MB in length, on a log scale. I can't help but think that this information would be really useful to many programmers, even if they didn't actually write hooks to query the daemons into their actual code.

My goal is to make this so simple that it is just plain automatic -- install the rpm, boot the system, wait a bit (for the daemon to run benchmarks in specified windows of idle time, e.g. during the first boot), and from then on one has access to the information. This is what is NOT true with e.g. stream and lmbench today. Just getting lmbench is not an exercise for the faint of heart, as you have to install tools to get the tool. Stream is better, but it certainly isn't a prepackaged component of all distributions where you can just "yum install stream" and have not only stream installed but run, and its results placed somewhere permanent from which they can be retrieved in a heartbeat. And stream isn't enough -- it only provides four projective measures of system performance on a single plane of the primary relevant dimension!

> > xmlsysd already provides CPU clock, total memory, L2 cache size, total
> > and available memory, and PID snapshots of running jobs.
>
> BeoStat provides static info: the CPU clock speed and total memory.
> It provides dynamic info on load average (the standard three values per
> node), CPU utilization (per processor! not per node), memory used, and
> network traffic for up to four interfaces.

xmlsysd wraps up quite a bit more information. wulfstat's "memory" display shows pretty much the same set of information as running "free" on all the nodes inside a delay loop -- this can be useful when debugging e.g. a memory leak (including the ones I had when writing wulfstat itself -- using a tool to debug itself:-).
However, wulfstat's default display is close to this, as it is the most important information, I agree.

> I don't see L2 cache size as being useful. Beostat provides the processor
> type, which is marginally more useful but still largely unused.

Useful or not, it is trivial to provide, and as I argue above it SHOULD be useful to programmers who can access the information inside applications seeking to optimize block sizes and strides for certain vector operations. Whether it is useful at this moment or not may be more related to the fact that most programmers don't have the patience to write the code to parse the information out of /proc/cpuinfo, or just "know" what it is for the architecture of their particular cluster and do a rebuild after altering a few #defines instead of a dynamic optimization as a consequence. This is fine (again) for homogeneous environments but less good for grids, especially ones that mix generations of hardware.

> Nor is the PID count useful in old-style clusters. I know that Ganglia
> reports it, but like most Ganglia statistics it's mostly because the
> number is easy to get, not because they know what to do with it! (I wrote
> the BeoStat->Ganglia translator, and consider most of their decisions as
> being, uhmmm, ad hoc.) The PID count *is* useful in Scyld clusters, since
> we run mostly applications, not 20 or 30 daemons.
>
> Not that BeoStat doesn't have its share of useless information.
> Like reporting the available disk space -- a number which is best ignored.

I didn't mean pid count. xmlsysd/wulfstat provides a top-like view of running processes, with user-specifiable filters to exclude/include unwanted/wanted processes. The default excludes all root processes, for example.

> > use (or will use in the case of xmlbenchd) xml tags to wrap all output,
>
> Acckkk! XML! Shoot that man before he reproduces. (too late)

I'll just spawn more clones...;-)

> > XMLified tags also FORCE one to organize and present the data
>
> My, my, my.
> Just like Pascal/ADA/etc forces you to write structured programs?

Ah, I can see that we'll have to agree to semi-disagree here. Yes, pascal sucks, partly because a structured program isn't what its designers thought that it was. However, ANSI C, as opposed to K&R C, does not suck, because it does indeed force one towards more structure where it counts.

XML can certainly be used correctly or abused, and in my cynical view the first cut at an xml encapsulation of any given data structure is likely to be wrong, just like the first cut of writing the key structs in a C or C++ application (the "data objects") is likely to be wrong. However, FOR ITS INTENDED PURPOSE it incorporates a particular discipline, simply by its requirement for strict nesting of tags. Yes, there is nothing to stop one from loading multiple data objects inside a single tag and forcing an end user to parse them out the hard way, and it is sometimes not easy to see what should be a tag by itself and what should be in an attribute, but still, a good xml encapsulation of the kind of data I'm talking about is pretty much a 1:1 map onto a data structure and should precisely mirror that data structure.
Oh, I agree, but the "real design" is going to have exactly the same hierarchical features that xml attempts to enforce or it will be more of the same old crap. Does one have to use xml to design a decent data hierarchy with consistent parsing rules? No, of course not. /etc/passwd, /etc/group, /etc/shadow for example are a triplet of files that are a living counterexample (although they are not autodocumenting, which I personally think is a useful feature of xml tags). Do even the best programmers in the world come CLOSE to achieving hierarchical consistency as a general rule? They do not, as /proc clearly demonstrates although there are plenty of other data views in /etc that are equally poignant counterexamples. The thing that is nice about xml is that it IS, like it or not, a consistently parseable view of structured data with rules that are intended to enforce what all programmers should be doing anyway. It doesn't guarantee that they will accomplish this intent, and it can be munged. But it isn't as EASY to produce garbage as it is with free-form roll-your-own interfaces. I also think that you underestimate the importance of extensibility. One major PITA about /proc is that in the ordinary course of the evolution of new technologies one eventually adds a new feature or device (such as a new network) that is clearly in the "network" hierarchy, but that has new objects that are a part of its essential description. How CAN one add the new data without breaking old tools? With xml the problem doesn't even exist. Tags that aren't parsed are ignored, and if one's hierarchical description was halfway decent in the first place the addition of a new kind of network can "inherit" all the relevant old features, add new tags for the new features, and permit new tools to be written or old ones to be modified that can use the new information without breaking the old ones in any way. This is a problem with e.g. /etc/passwd. Suppose one suddenly needed to add a field. 
For example, let's imagine that /etc/passwd is going to be modified to function across an entire toplevel domain, e.g. a University, for single sign-in purposes. In addition to the usual information, it now will need a field hierarchy to set access permissions by e.g. department, and may need different shadowed passwords per department, or different user id's per department. It is impossible to add these to /etc/passwd now without breaking more things than one can possibly imagine. The only ways I can think of to do it are to overload the one data field that doesn't have a prescribed function (often done, actually, as a hack) or create another file cross-referenced by e.g. user id. If /etc/passwd were laid out in xml (or an EQUIVALENT hierarchy), it would be trivial and would break nothing. This is very similar to the WYSIWYG vs markup debate. There are those who swear that WYSIWYG editors are great and permit complete idiots to produce lovely looking professional documents. They are, of course, totally wrong 90% of the time -- what you actually get out of most WYSIWYG editors is a pile of user-formatted crap that doesn't even vaguely comply with a unified style (what size and style font should I use here to start sections, today, hmmm:-). Markup or e.g. latex "force" a consistent hierarchical view of plain old text documents the same way that xml CAN "force" such a view for data objects. Now I really must go... rgb > > There are several advantages to using daemons (compared to a kernel > > Many people assume that much of what Scyld does is in the kernel. > There are only a few basic mechanisms in the kernel, largely BProc, with > the rest implemented as user-level subsystems. And most of those subsystems > are partially usable in a non-Scyld system. 
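[Editorial aside: the /etc/passwd thought experiment above is easy to sketch. The tag names here are invented for illustration, not any real schema: a tool that predates the new <departments> subtree keeps working unchanged, because tags it does not ask for are simply ignored.]

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of a passwd entry.  The <departments>
# subtree is the "new field" from the single-sign-on scenario; all tag
# names here are invented for illustration.
entry = ET.fromstring("""
<user>
  <login>kalyani</login>
  <uid>1001</uid>
  <gid>100</gid>
  <home>/home/kalyani</home>
  <shell>/bin/bash</shell>
  <departments>
    <department name="physics"><uid>2001</uid></department>
  </departments>
</user>
""")

# An "old" tool that knows only the original fields still works:
print(entry.findtext("login"), entry.findtext("home"))

# A "new" tool can exploit the extra hierarchy without breaking the old one:
for dept in entry.iter("department"):
    print(dept.get("name"), dept.findtext("uid"))
```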
> > The reason to use kernel features is > to implement security mechanism (never policy) > to allow applications to run unchanged > We use kernel hooks only for the unified process space, process migration > and the security mechanism. Managing process table entries and correctly > forwarding signals can only work with a kernel interface. > > Having the node security mechanism in the kernel allows us to implement > policy with unprivileged user-level libraries. That means an application > can provide its own tuned scheduler function, or use a dynamic library > provided by the end user. > > Otherwise the scheduler must run as a privileged daemon, tunable only by > the system administrator. It could be worse: some process migration > systems put the scheduler policy in the kernel itself! > > > Still, I totally agree that this is EXACTLY the kind of information that > > needs to be available via an open standard, universal, extensible > > interface. > > Strike the word "extensible". That's a sure way to end up with a > mechanism that is complex and doesn't do anything well. > > > > An application should be able to use only a subset of provided > > > processors if they will not be useful (e.g. an application that uses > > > a regular grid might choose to use only 16 of 23 provided nodes). > > Absolutely. And this needs to be done in such a way that the programmer > > doesn't have to work too hard to arrange it. I imagine that this CAN be > > done with e.g. PVM or some MPIs (although I'm not sure about the latter) > > but is it easy? > > It is with our BeoMPI, which runs single threaded until it hits > MPI_Init(). That means the application can modify its own schedule, or > read its configuration information before deciding it will use MPI. > > One aspect of our current approach is that it requires remote_fork(). > Scyld Beowulf already has this, but even with our system a remote fork may > be more expensive than just a remote exec() if the address space is dirty.
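[Editorial aside: the "16 of 23 provided nodes" case above can be sketched with a toy selection policy. The largest-power-of-two rule here is an invented example of what an application might choose for a regular grid -- not anything BeoMPI, PVM, or MPI prescribes.]

```python
def usable_nodes(provided, factor=2):
    """Largest power of `factor` not exceeding the offered node count.

    A toy selection policy for applications wanting a regular grid:
    offered 23 nodes it settles for 16, offered 64 it takes all 64.
    """
    n = 1
    while n * factor <= provided:
        n *= factor
    return n

print(usable_nodes(23))  # 16
print(usable_nodes(64))  # 64
```

The point of the surrounding discussion is that the application, not the scheduler, should be able to make this decision cheaply, before (or instead of) committing to the full set of offered processors.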
> I believe that an MPI-like library can get much of the same benefit, at > the cost of a little extra programming, by providing flags that are only > set when this is a remote or slave (non-rank-0) process. (That > last line is confusing: consider the case where MPI rank 0 is supposed to > end up on a remote machine, with no processes left on the originating > machine.) > > > > There need to be new process creation primitives. > > > We already have a well-tested model for this: Unix process > > Agreed, but while dealing with this one also needs to think about > > security. > > We have ;->. > > > Grids and other distributed parallel computing paradigms are > > Grids have a fundamental security problem: how do you know what you are > running on the remote machine? Is it the same binary? With the same > libraries? Linked in the same order? With the same kernel? Really the > same kernel, or one with changed semantics like RedHat's > "2.4-w/2.6-threading". That's not even covering the malicious angle: > "thank you for providing me with the credentials for reading all of your > files". > > Operationally, Grids have the problem that they must define both the > protocols and semantics before they can even start to work, and then there > will be a lifetime of backwards and forward compatibility issues. > You won't see this at first, just like the first version of > Perl/Python/Java was "the first portable language". But version skew and > semantic compatibility is *the* issue to deal with, not "how can we hack > it to do something for SCXX". > > > > > 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on > > > > pay-per-view (read: no need for time-consuming, potentially fruitless > > > > attempts to get MPI implementors to agree on anything) > > > > > > Hmmmm, does this show up on the sports page? > > > > It's actually very interesting -- the Head Node article in CWM seemed to > > me to be a prediction that MPI was "finished" in the sense that it is > ...
> > "finished" in the sense of being complete and needing any further > .. > > "finished" in the sense that it needs such a radical rewrite > > I fall into the "finished -- lets not break it by piecemeal changes" camp. > > Many "add checkpointing to MPI" and "add fault tolerance to MPI" projects > have been funded for years. We need a new model that handles dynamic > growth, with failures being just a minor aspect of the design. I don't > see how MPI evolves into that new thing. > > > To summarize, I think that the basic argument being advanced (correct me > > if I'm wrong) is that there should be a whole layer of what amount to > > meta-information tools inserted underneath message passing libraries of > > any flavor so that (for example) the pvm console command "conf" returns > > a number that MEANS SOMETHING for "speed", and in fact so that the pvm > > Slight disagreement here: I think we need multiple subsystems that work > well together, rather than a single do-it-all library. The architecture > for a status-and-statistics system (BeoStat for Scyld) is different than > for the scheduler (e.g. BeoMap), even though one may depend on the API of > the other. If we put it all into one big library, it will be difficult to > evolve and fix. (I'm assuming we can avoid API creep, which may not hold > true.) > > Donald Becker > Scyld Software > Annapolis MD 21403 410-990-9993 > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Wed Mar 23 07:30:46 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] What kind of I/O benchmark ? 
In-Reply-To: <1111580179.21799.42.camel@R1.seanodes.com> References: <1111580179.21799.42.camel@R1.seanodes.com> Message-ID: <42418BA6.9030405@scalableinformatics.com> Velu Erwan wrote: > Are other people working on this kind of application-oriented > benchmark? yes. > I have heard that some BLAST code could be used for this; do some > of you follow this way of validating/benchmarking clusters? Yes it can. You will be throttled by the speed of the mmap implementation if you are using NCBI BLAST, and you will spend most of the time paging if you don't segment your databases (nt in particular). But you could use BLAST to benchmark the IO. Not sure how useful it is though for what you want to measure. What specifically are you trying to measure (in the context of which applications are you going to run)? If you do large-block sequential I/Os you will want a very different benchmark test from small-block random I/Os and seeks. Joe > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From bropers at cct.lsu.edu Wed Mar 23 07:19:29 2005 From: bropers at cct.lsu.edu (Brian D. Ropers-Huilman) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] What kind of I/O benchmark ? In-Reply-To: <1111580179.21799.42.camel@R1.seanodes.com> References: <1111580179.21799.42.camel@R1.seanodes.com> Message-ID: <42418901.6040809@cct.lsu.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I am currently engaged in a "bake-off" of sorts between two different vendors. We used IOzone as an initial test, but picked three of our most representative codes to do the remainder of the tests.
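[Editorial aside: Joe's sequential-versus-random distinction can be sketched with a toy harness. This uses a small temporary file, so the page cache hides the real difference; a serious run would use files larger than RAM. The access-pattern skeleton is the point, not the numbers.]

```python
import os, random, tempfile, time

# Toy sequential-vs-random read harness illustrating the two workload
# shapes discussed above.  BLOCK/NBLOCKS are deliberately tiny.
BLOCK = 4096
NBLOCKS = 256

fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(BLOCK * NBLOCKS))
os.close(fd)

def read_blocks(path, offsets):
    total = 0
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(BLOCK))
    return total

seq = [i * BLOCK for i in range(NBLOCKS)]
rnd = seq[:]
random.shuffle(rnd)           # same blocks, shuffled order: seek-bound

for name, offsets in (("sequential", seq), ("random", rnd)):
    t0 = time.perf_counter()
    nbytes = read_blocks(path, offsets)
    print(name, nbytes, "bytes in", time.perf_counter() - t0, "s")

os.remove(path)
```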
With these codes we scaled up the number of concurrent processors, the number of files, and the file size, including parallel and single node writes and reads. As always, the best test is your own code. Velu Erwan said the following on 2005.03.23 06:16: > Hi folks, > I'm searching for the best way to stress a storage system using some > real applications. I mean, using bonnie++, Iozone, or b_eff_io > benchmarks can give some raw numbers for what your storage > infrastructure is able to provide. > > Some benchmarks like mpi-tile-io seem to take another approach, trying to > match what the "real" performance of your storage could be for a > kind of application (visualization, for mpi-tile-io). > > Are other people working on this kind of application-oriented > benchmark? > > I have heard that some BLAST code could be used for this; do some > of you follow this way of validating/benchmarking clusters? - -- Brian D. Ropers-Huilman .:. Asst. Director .:. HPC and Computation Center for Computation & Technology (CCT) bropers@cct.lsu.edu Johnston Hall, Rm. 350 +1 225.578.3272 (V) Louisiana State University +1 225.578.5362 (F) Baton Rouge, LA 70803-1900 USA http://www.cct.lsu.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (Darwin) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFCQYkBwRr6eFHB5lgRAo5VAJ47cboq960SlxTHVjQv556qx5cbdwCfVtHZ /FuqKe5n/yW1Z0U0K1+Uryc= =8qVQ -----END PGP SIGNATURE----- From rgb at phy.duke.edu Wed Mar 23 08:13:33 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Re: Why Do Clusters Suck? In-Reply-To: <424181F1.5060503@scalableinformatics.com> References: <1111526485.2641.133.camel@localhost.localdomain> <42409A96.4050709@scalableinformatics.com> <4240A46E.4040408@pobox.com> <424181F1.5060503@scalableinformatics.com> Message-ID: On Wed, 23 Mar 2005, Joe Landman wrote: > Robert G. Brown wrote: > > On Tue, 22 Mar 2005, Andrew D.
Fant wrote: > > [...] > > > Hmmm. OK, how's this. Just supposing that I finish building xmlbenchd > > before an infinite amount of time elapses (I've once again gotten mired > > in teaching and haven't had time to work on it for a week+). Suppose > > xmlbenchd can run any given program inside a fairly standard timing > > wrapper (probably a perl script for maximum portability and ease of > > use). Suppose that the perl script, which will certainly contain the > > command line for the application which therefore will (for fixed random > > number seeds where appropriate) produce some sort of fixed output. > > Hmmm. Have you seen the input and output to BBS (in retrospect, a > poorly named tool)? > > In BBS (bioinformatics benchmark system), you create input experiments > or similar. > > FWIW: BBS is at http://www.scalableinformatics.com/BBS , is GPL, and is > in active use by a number of groups/companies for testing purposes. Yeah, like that. Very much like that. I'll look into this for sure. Might be time to eliminate the leading "B" and just make it "BS";-). I have a functioning first cut on the benchmarking tags needed for more general benchmarking contexts (still not adequate, but a starting point) implemented in my working copy of benchmaster (micro, not macro) -- perhaps we can merge the two xmls without breaking either one of them, or we can add my tags to yours where they are different. For example (output from the savage benchmark): Content-Length: 978 Benchmaster 1.1.2 lilith GenuineIntel Celeron (Coppermine) 801.923 128 KB 320940 53224 cpu cycle counter nanotimer 86.694 savage ./benchmaster -t 9 xtest = tan(atan(exp(log(sqrt(xtest*xtest))))) 1024 4.63e-01 9.25e+02 9.38e+02 1.08e+00 Clearly very similar -- there will be a mirror input/configuration file for xmlbenchd that contains e.g. the first 3 fields of to tell it what to run, as well as tags to set scheduling policy and some other stuff. 
Note that benchmaster already does internal statistics on a set of independent timing runs by default -- hence the min/max/mean/stddev for timing. This doesn't validate the result of savage per se, although the savage code itself could easily do so. This is the easy part and most people will come up with very similar tag sets for this sort of encapsulation. What I'm working on is how to both input and output tables/vectors of results, e.g. what one gets from running benchmaster's encapsulation of stream for a series of command line specified vector lengths. I want to be able to encapsulate at least a vector if not a 2D/3D table (performance line or performance surface) for graphical presentation. But we can talk offline about this. It looks like we are indeed on the same conceptual page here. > > Then it would be trivial to add a segment to at LEAST diff the output > > with the expected output, and not a horrible amount of work to actually > > compute a chisq difference between the two. I can easily introduce xml > > tags for returning a validation score on the actual result (or even a > > set of such scores) because extensibility IS useful during the time a > > new thing is being invented (sorry, Don:-) if not beyond. This would > > permit the best of both worlds: > > Could leverage what exists rather than re-inventing this particular > wheel. Let me know if there are particular things you would like to see > in the output comparison. Could do a chi-square, but this makes more > sense for numerical bits than non-numerical bits (BBS doesn't care, and > maybe the solution to this is analysis plug-ins that implement the > appropriate tests). Absolutely. As I said, perhaps we can do a merge of some sort, or (xml being what it is) a hierarchical encapsulation. I'm glad to see that this tool IS out there -- this kind of memetic exchange is what makes GPL development "interesting".
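[Editorial aside: the chisq validation score discussed above is a one-liner over the numeric fields of the output. A sketch, where the per-value tolerance `sigma` is an assumed parameter and the reference numbers are just the savage figures quoted earlier in the thread:]

```python
def chisq_score(expected, actual, sigma=1e-6):
    """Reduced chi-square between two equal-length numeric outputs.

    A score near zero means the run reproduced the reference output to
    within the tolerance `sigma`; a large score flags a miscompare.
    """
    if len(expected) != len(actual):
        raise ValueError("output length mismatch")
    total = sum(((e - a) / sigma) ** 2 for e, a in zip(expected, actual))
    return total / len(expected)

# Reference values borrowed from the savage benchmark output above.
reference = [4.63e-01, 9.25e+02, 9.38e+02]
print(chisq_score(reference, reference))                         # 0.0
print(chisq_score(reference, [4.63e-01, 9.25e+02, 9.39e+02]) > 0)  # True
```

As noted in the thread, this only makes sense for the numerical bits; non-numerical output still needs a plain diff or a format-aware plug-in.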
> > a) I expect to assemble a set of macro-level applications to function > > somewhat like the spec suite does today but without the "corporation" > > baggage, for distribution WITH the package. At that point I will > > actually solicit this list for good candidates for primary inclusion. > > This set can actually be quite large, permitting users to preselect at > > configuration time the ones to run for their particular site. For the > > ones that are selected, I will go ahead and do the validation test for > > when I wrap them up in the timing script. > > Heh... > > bbsrun --experiment "test1-1CPU" --debug < gamess.xml > > Include the XML with your package, and bbs can (largely) do the rest. > > > > > b) Users who want to wrap their OWN application set up for automated > > benchmarking inside the provided template script will then be able to > > follow fairly simple instructions and (presuming that they know enough > > perl to be able to parse their application's output file(s)) validate as > > well as time to their heart's content. > > > > This may not be sufficient for all users -- I'm probably not going to > > write a core loop that would permit a sweep of an input parameter in the > > command line, for example, and to test e.g. special function calls in > > the GSL that change algorithms at certain breakpoints, that kind of thing > > is really necessary. However, folks with more advanced needs will > > presumably be more advanced programmers and the perl to add such a sweep > > and generate a more complex validation isn't terribly challenging. > > :) > > > > > Would that do? > > As most of this exists in BBS now, and it is in active use, I would say > yes. :) I'll give it a look. BBS does indeed look like it is within spittin' distance of what is required on the operational front.
We'll see if the xml's can be merged painlessly (probably so given that mine is defined mostly in my head in a single working copy of benchmaster and hence NOT in production, so there is little barrier to it being changed). It might require too much revision for BBS to remain unbroken, though, so we may yet need to create "son-o-BBS". rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Wed Mar 23 08:19:48 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Re: Why Do Clusters Suck? In-Reply-To: References: <1111526485.2641.133.camel@localhost.localdomain> <42409A96.4050709@scalableinformatics.com> <4240A46E.4040408@pobox.com> <424181F1.5060503@scalableinformatics.com> Message-ID: <42419724.8020902@scalableinformatics.com> Robert G. Brown wrote: [...] > > >>or similar. >> >>FWIW: BBS is at http://www.scalableinformatics.com/BBS , is GPL, and is >>in active use by a number of groups/companies for testing purposes. > > > Yeah, like that. Very much like that. I'll look into this for sure. > Might be time to eliminate the leading "B" and just make it "BS";-). I ROTFLMAO ! > have a functioning first cut on the benchmarking tags needed for more > general benchmarking contexts (still not adequate, but a starting point) > implemented in my working copy of benchmaster (micro, not macro) -- > perhaps we can merge the two xmls without breaking either one of them, Sounds good. [...] > Absolutely. As I said, perhaps we can do a merge of some sort, or (xml > being what it is) a hierarchical encapsulation. I'm glad to see that > this tool IS out there -- this kind of memetic exchange is what makes > GPL development "interesting". Agreed. [...] >>>Would that do? >> >>As most of this exists in BBS now, and it is in active use, I would say >>yes. 
:) > > > I'll give it a look. BBS does indeed look like it is within spittin' > distance of what is required on the operational front. We'll see if the > xml's can be merged painlessly (probably so given that mine is defined > mostly in my head in a single working copy of benchmaster and hence NOT > in production, so there is little barrier to it being changed). It > might require too much revision for BBS to remain unbroken, though, so > we may yet need to create "son-o-BBS". I have been working up a roadmap for it, and the first B needs to be dropped or exchanged in favor of other things. Somehow, I thought Cluster Benchmark System might elicit some cease and desist letters from lawyers, so still thinking on this ... > > rgb > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From gmpc at sanger.ac.uk Wed Mar 23 08:25:08 2005 From: gmpc at sanger.ac.uk (Guy Coates) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] What kind of I/O benchmark ? In-Reply-To: <1111580179.21799.42.camel@R1.seanodes.com> References: <1111580179.21799.42.camel@R1.seanodes.com> Message-ID: > > I have heard that some BLAST code could be used for this; do some > of you follow this way of validating/benchmarking clusters? > Yes; blast makes a good addition to the benchmarks you already mentioned. The IO pattern is memory-mapped streaming reads, typically of "large" (>2GB) files. Blast won't tell you much about your write performance though. Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK Tel: +44 (0)1223 834244 ex 7199 From rgb at phy.duke.edu Wed Mar 23 08:45:27 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] What kind of I/O benchmark ?
In-Reply-To: <42418901.6040809@cct.lsu.edu> References: <1111580179.21799.42.camel@R1.seanodes.com> <42418901.6040809@cct.lsu.edu> Message-ID: On Wed, 23 Mar 2005, Brian D. Ropers-Huilman wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > I am currently engaged in a "bake-off" of sorts between two different > vendors. We used IOzone as an initial test, but picked three of our most > representative codes to do the remainder of the tests. With these codes we > scaled up the number of concurrent processors, the number of files, and the > file size, including parallel and single node writes and reads. That reminds me (for Joe) that this is also a design goal of xmlbenchd's tag set (in possible BBS merge). It needs to be able to run and display benchmarks that DO demonstrate the scaling properties of a cluster, as this is by far the biggest weakness in e.g. linpack and the top500 today. The top500 listings are utterly useless for determining scaling properties for the architectures being listed, as this is minimally a CURVE of speed vs number of nodes and ideally a SURFACE of speed vs number of nodes vs "size". As indicated, IIRC, in Don et al.'s very own book on how to build a beowulf;-). How much beyond 2d/3d to go I'm not certain. For example, in a given stream-like RDMA memory test, one could conceivably vary vector size (one ordinal variable), number of nodes (two ordinal variables), and number of simultaneous test threads (three ordinal variables), stride (four), size of data object being accessed (five) as well as any number of discrete descriptive (non-ordinal) variables (e.g. random number seed for a shuffled/random non-streaming test, a flag to force the test to run backwards). It is possible that the performance surface for this decomposes into distinct projections on representative planes at e.g.
the boundaries where the task is split onto nodes or runs over L1 and L2 cache boundaries, so that running a relatively small set of 2d/3d views suffices to determine the 5d/6d surfaces, but this is really a computer science research question. I'd LIKE for the tool to be usable to ANSWER this question, as that would give the world's Real Computer Scientists the opportunity to focus on the relevant metrics when e.g. optimizing BLAST. In particular, as it seems like its optimal design might depend nontrivially on just the parameters varied in this example. > As always, the best test is your own code. So let's make it EASY to test your own code and "publish" the result in a consistent way... :-) rgb > > Velu Erwan said the following on 2005.03.23 06:16: > > Hi folks, > > I'm searching the best way to stress a storage system but using some > > real applications. I mean, using some bonnie++, Iozone, b_eff_io > > benchmarks could give some raw performances about what your storage > > infrastructure is able to provide. > > > > Some benchmarks like mpi-tile-io seems starting another way by trying to > > match what could be the "real" performance of your storage regarding a > > kind of application (visualization for mpi-tile-io). > > > > Does other people are working on such kind of application oriented > > benchmark ? > > > > I was heard that some BLAST code could be used for such use, does some > > of you follow this way of validating/benchmarking clusters ? > > - -- > Brian D. Ropers-Huilman .:. Asst. Director .:. HPC and Computation > Center for Computation & Technology (CCT) bropers@cct.lsu.edu > Johnston Hall, Rm. 
350 +1 225.578.3272 (V) > Louisiana State University +1 225.578.5362 (F) > Baton Rouge, LA 70803-1900 USA http://www.cct.lsu.edu/ > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.2.4 (Darwin) > Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org > > iD8DBQFCQYkBwRr6eFHB5lgRAo5VAJ47cboq960SlxTHVjQv556qx5cbdwCfVtHZ > /FuqKe5n/yW1Z0U0K1+Uryc= > =8qVQ > -----END PGP SIGNATURE----- > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Wed Mar 23 08:56:48 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Re: Why Do Clusters Suck? In-Reply-To: <42419724.8020902@scalableinformatics.com> References: <1111526485.2641.133.camel@localhost.localdomain> <42409A96.4050709@scalableinformatics.com> <4240A46E.4040408@pobox.com> <424181F1.5060503@scalableinformatics.com> <42419724.8020902@scalableinformatics.com> Message-ID: On Wed, 23 Mar 2005, Joe Landman wrote: > > might require too much revision for BBS to remain unbroken, though, so > > we may yet need to create "son-o-BBS". > > I have been working up a roadmap for it, and the first B needs to be > dropped or exchanged in favor of other things. Somehow, I thought > Cluster Benchmark System might elicit some cease and desist letters from > lawyers, so still thinking on this ... Feel free to co-opt plain old benchd or xmlbenchd. Or XBS -- XML-based Benchmark System. 
This might create a desirable level of abstraction between the "daemon" -- which is simply a tool that can run the shell that runs the benchmark(s) according to policy, extract the results and put them into some sort of database (to be determined, actually), and on demand return those results via the daemon interface -- and the shell itself that runs the benchmarks, which might be an XBS-compliant microbenchmark binary (after we finish a draft of the unified XML and I finish hacking benchmaster to comply with it) or an XBS-script-wrapped non-compliant microbenchmark binary (e.g. stream or lmbench or netperf or netpipe) or an XBS-script-wrapped macro benchmark, a.k.a. "any old application, ideally but not limited to your own" (e.g. BLAST or linpack or a partitioned lattice problem written in MPI or PVM) done with wall clock time. BTW, at some point this discussion may become boring or offensive as a "waste of time" to list members. I'm hoping not yet, as this is VERY USEFUL to me as I seek to figure out what should be in the toolset and how it should be split up into (as Don points out) RELATIVELY simple and disjoint tools with distinct functionality and an ABI between them, but if so just yell and we can try to set up an offline group or list (which I'm sure we'll do soon anyway). rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From kus at free.net Wed Mar 23 09:08:29 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] AMD Athlon with Intel Fortran Compiler, which options? In-Reply-To: <4240ADA4.6050701@scalableinformatics.com> Message-ID: In message from Joe Landman (Tue, 22 Mar 2005 18:43:32 -0500): >I thought they still had processor check code in there, that checks >the intel CPU id, and defaults to p3/p4 base w/o SSE.
You can compile >it, it just might not select that code path (SSE*). Does this for > Opterons. The checks of CPUID (at runtime) were added to 8.1 versions of Intel compiler. > >Stuart Midgley wrote: >> Actually you can use everything except the SSE3 stuff. If I remember correctly, Athlon's have no SSE2 support. Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow >> Once the >>2.6GHz >> Opterons are out and about, you will be able to use SSE3 as well. >> >> I have managed to compile code in 64bit mode on an opteron using the >> Intel compilers. If I remember correctly, I used something like >> >> -fast -xW >> >> whereas -fast usually turns on -xP which gives SSE3 stuff. >> >> Stu. > > >-- >Joseph Landman, Ph.D >Founder and CEO >Scalable Informatics LLC, >email: landman@scalableinformatics.com >web : http://www.scalableinformatics.com >phone: +1 734 786 8423 >fax : +1 734 786 8452 >cell : +1 734 612 4615 From kus at free.net Wed Mar 23 11:13:13 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Measurement of TCP traffic between nodes Message-ID: Dear colleagues, I want to measure message sizes (and "frequencies" of messages exchanges) between cluster nodes under parallelization of some applications. These applications use parallelization tools working over TCP (Linda, DDI), so I need to know the size of TCP packages and the number of TCP packages per second. 1) There is no other traffic on the corresponding interfaces, and (at least for 2-nodes "subcluster"; it's used for simplicity) I may use simple ifconfig output (in a loop). Do I understand correctly, that the number of bytes transmitted at "application level" is something about (for example for transmission) Tx_bytes - TX_packets*40 ? (i.e. 
a) there are no bytes from Ethernet headers in the ifconfig output b) the number of application bytes transmitted per TCP packet = size_of_packet - 40 (on average)) 2) Are tools like tcpdump better for this task (pls take into account that I need only timestamps and data sizes) ? Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow

From brian at cmrl.wustl.edu Wed Mar 23 07:41:46 2005 From: brian at cmrl.wustl.edu (Brian Henerey) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] cluster storage design Message-ID: <200503231535.j2NFZh7Q016327@bluewest.scyld.com> Hello all, I have a 32 node cluster with 1 master and 1 data storage server with 1.5 TB of storage. The master used to have storage:/home mounted on /home via NFS. I moved the 1.5TB RAID array of storage so it was directly on the master. This decreased the time it took for our program to run by a factor of 4. I read somewhere that mounting the data to the master via NFS was a bad idea for performance, but am not sure what the best alternative is. I don't want to have to move data on/off the master each time I run a job because this will slow it down as more people are using it. I know there are probably many solutions but I'm curious what the people on this list do. It seems to me that SANs are very expensive compared to just building servers with 4 x 500GB hard drives. I've considered just launching my lam-mpi jobs from whatever storage server has the appropriate data on it, but this doesn't seem ideal. How does performance compare between having the data local on the master and running it off PVFS? Thanks in advance, Brian Henerey Systems Analyst Cardiovascular Magnetic Resonance Laboratories Washington University Medical School 660 S. Euclid Ave Campus Box 8086 St. Louis, MO 63110 314-454-8368 314-454-7490(Fax) -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20050323/6c56c851/attachment.html From rgb at phy.duke.edu Wed Mar 23 12:09:01 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Measurement of TCP traffic between nodes In-Reply-To: References: Message-ID: On Wed, 23 Mar 2005, Mikhail Kuzminsky wrote: > > Dear colleagues, > > I want to measure message sizes (and "frequencies" of messages > exchanges) between cluster nodes under parallelization of > some applications. These applications use parallelization tools > working over TCP (Linda, DDI), so I need to know the size of TCP > packages and the number of TCP packages per second. > > 1) There is no other traffic on the corresponding interfaces, > and (at least for 2-nodes "subcluster"; it's used for simplicity) I > may use simple ifconfig output (in a loop). Do I understand correctly, > that > the number of bytes transmitted at "application level" is > something about (for example for transmission) > Tx_bytes - TX_packets*40 ? > (i.e. a) there is no bytes from ethernet headers - in ifconfig output > b) the number of application transmitted bytes per TCP packet = > > size_of_packet - 40 (in average)) I'm not sure that this is a safe assumption. Let me try to analyze this briefly, although other folks will need to check me as it is easy to get confused here. * Don't use ifconfig in a loop. ifconfig's data comes from /proc/net/dev in columns. 
Easier to parse, and lower overhead:

rgb@ganesh|B:1007>cat /proc/net/dev > /tmp/dev

(contents of /tmp/dev)

Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 2487503    8119    0    0    0     0          0         0  2487503    8119    0    0    0     0       0          0
  sit0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  usb0:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0
  eth0:2244262754 11251382    0    0    0     0          0         0 1079809175 2321952    0    0    0     0       0          0
  eth1:       0       0    0    0    0     0          0         0        0       0    0    0    0     0       0          0

* xmlsysd will return this information wrapped in xml tags if you'd rather parse it that way.

* From RFC 791, the bitwise layout of an IP header is:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |        Header Checksum        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Source Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Destination Address                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

or 20 bytes without options (the trailing Options/Padding word is present only when options are used).

* From RFC 768, the bitwise layout of a UDP header is:

     0      7 8     15 16    23 24    31
    +--------+--------+--------+--------+
    |     Source      |   Destination   |
    |      Port       |      Port       |
    +--------+--------+--------+--------+
    |                 |                 |
    |     Length      |    Checksum     |
    +--------+--------+--------+--------+
    |          data octets ...
    +---------------- ...

or 8 more bytes.
* From RFC 793, the bitwise layout of a TCP header is:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

or 20 bytes without options.

So the minimal IP+TCP headers come to 40 bytes, plus 14 in the ethernet header (plus 8 in the preamble plus 4 in the trailing CRC that are usually ignored). Thus each TCP packet carries at least 40 bytes of IP+TCP overhead, or 54 bytes counting the ethernet header (and anything shorter than 64 bytes gets padded up to the minimum ethernet frame size on the wire). For single messages that aren't part of a stream, this header cost is paid per message, so you'll want to count 40 or 54 bytes. For streaming messages the second and subsequent packets still carry full IP+TCP headers (there is no abbreviated form, and TCP options such as timestamps can add a few bytes more), so subtracting 40 bytes per packet is a reasonable lower bound on the header overhead.

> 2) Are the tools like tcpdump better for this task (pls
> take into account that I need only timestamps and data sizes) ?

tcpdump would let you see each packet's headers, which contain everything you need to know, per packet, if you can figure out how to extract it, so I'd say yes. The downside is that it generates a lot of data from a short session of monitoring.
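For what it's worth, the sampling approach can be sketched in a few lines of Python. This is my sketch, not anything from the original posts; the field positions assume the 16-column /proc/net/dev layout shown in the dump above, and "eth0" is just an example interface:

```python
# A sketch (not from the original post): sample /proc/net/dev twice and
# estimate application-level TX bytes by subtracting the ~54 bytes of
# minimal header (20 IP + 20 TCP + 14 Ethernet) per packet worked out above.

HEADER_OVERHEAD = 20 + 20 + 14  # minimal IP + TCP + Ethernet headers, in bytes

def parse_net_dev(text, iface):
    """Return (rx_bytes, rx_packets, tx_bytes, tx_packets) for iface."""
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip the two header lines
        name, data = line.split(":", 1)
        if name.strip() == iface:
            f = data.split()
            # columns 0,1 = RX bytes,packets; columns 8,9 = TX bytes,packets
            return int(f[0]), int(f[1]), int(f[8]), int(f[9])
    raise ValueError("interface %s not found" % iface)

def tx_payload_estimate(before, after):
    """Approximate application bytes sent between two samples."""
    tx_bytes = after[2] - before[2]
    tx_packets = after[3] - before[3]
    return tx_bytes - tx_packets * HEADER_OVERHEAD

# Usage on a node (sample, wait, sample again):
#   s1 = parse_net_dev(open("/proc/net/dev").read(), "eth0")
#   time.sleep(10)
#   s2 = parse_net_dev(open("/proc/net/dev").read(), "eth0")
#   print(tx_payload_estimate(s1, s2), "payload bytes in 10 s")
```

Dividing by the sampling interval gives bytes per second; the same arithmetic on columns 0 and 1 gives the receive side.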
rgb > > > Yours > Mikhail Kuzminsky > Zelinsky Institute of Organic Chemistry > Moscow > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu

From landman at scalableinformatics.com Wed Mar 23 12:36:41 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] cluster storage design In-Reply-To: <200503231535.j2NFZh7Q016327@bluewest.scyld.com> References: <200503231535.j2NFZh7Q016327@bluewest.scyld.com> Message-ID: <4241D359.1010606@scalableinformatics.com> Hi Brian: Brian Henerey wrote: > Hello all, > > I have a 32 node cluster with 1 master and 1 data storage server with > 1.5 TB's of storage. The master used to have storage:/home mounted on > /home via NFS. I moved the 1.5TB RAID array of storage so it was > directly on the master. This decreased the time it took for our > program to run by a factor of 4. I read somewhere that mounting the > data to the master via NFS was a bad idea for performance, but am not > sure what the best alternative is. I don't want to have to move data > on/off the master each time I run a job because this will slow it down > as more people are using it. >

If your problems are I/O bound, and you have enough local storage on each compute node, and you can move the data in a reasonable amount of time, the local I/O will likely be the fastest solution. You have already discovered this when you moved to a local attached RAID. If you have multiple parallel reads/writes to the data from each compute node, you will want some sort of distributed system. If the master thread is the only one doing IO, then you want the fast storage where it is.
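The "move the data, then do local I/O" approach can be wrapped in a small job script. The following is only a sketch of the idea with made-up paths and a made-up command; a real version would also clean up the scratch directory and handle copy failures:

```python
# Hypothetical wrapper (not from the original post): copy input from shared
# storage to node-local scratch in one bulk transfer, then run the job
# against the local copy, so compute-phase I/O never touches NFS.
import os
import shutil
import subprocess
import tempfile

def run_with_local_staging(shared_input, cmd, scratch_root="/tmp"):
    """Stage shared_input into local scratch under scratch_root, run cmd there."""
    workdir = tempfile.mkdtemp(prefix="job_", dir=scratch_root)
    local_input = os.path.join(workdir, os.path.basename(shared_input))
    shutil.copy(shared_input, local_input)   # one sequential read over NFS
    rc = subprocess.call(cmd + [local_input], cwd=workdir)
    return rc, workdir                       # results land in local scratch
```

Whether this pays off depends, as Joe says, on the staging time versus the run time: the single bulk copy replaces many small reads during the run.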
NFS provides effectively a single point of data flow, and hence is a limiting factor (generally). I presume most of your users start their runs on the head node itself (based upon your description). If this is the case and the method that you want the users to continue to use, then you shouldn't necessarily change it. If the file system is not broken, you don't need to fix it. If you want to increase the bandwidth to the file server, you might look at channel bonding some ethernets together. This works quite nicely in a number of cases, and it is cheap/easy to do. I might recommend this if you determine you need it....

... and that is the critical aspect. Are your runs slowing down as more users start running on the system? Can you identify the culprit (NFS, disk io, buffer cache, memory pressure, network traffic, ...)? Basically the question is, do you need to make any changes, and if so, what changes make the most sense? To answer that, you need to watch how your system is being used, and identify the hotspots, as well as talk with users about their needs. More often than not, simple things (kernel tuning/tweaking, adding more memory, etc) can go a really long way to fix problems.

Joe

> I know there are probably many solutions but I'm curious what the > people on this list do. It seems to me that SAN's are very expensive > compared to just building servers with 4 x 500GB hard drives. I've > considered just launching my lam-mpi jobs from whatever storage server > has the appropriate data on it, but this doesn't seem ideal. > > How does performance compare from having the data local on the master > via running it off a PVFS? > > Thanks in advance, > > Brian Henerey > > Systems Analyst > > Cardiovascular Magnetic Resonance Laboratories > > Washington University Medical School > > 660 S. Euclid Ave > > Campus Box 8086 > > St.
Louis, MO 63110 > > 314-454-8368 314-454-7490(Fax) > >------------------------------------------------------------------------ > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > >

From wburke999 at msn.com Wed Mar 23 12:42:39 2005 From: wburke999 at msn.com (William Burke) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Re: Grid Engine, Parallel Environment, Scheduling, Myrinet, and MPICH Message-ID: I can't get PE to work on a 50 node class II Beowulf. It has a front-end Sunfire v40 (qmaster host) and 49 Sunfire v20s (execution hosts) running Linux, configured to communicate data over Myrinet using MPICH-GM version 1.26.14a. These are the requirements of the N1GE environment to handle:

1. Serial-type jobs for pre-processing the data - average runtime 15 minutes.
2. Output is pipelined into parallel processing jobs - runtime range 1-6 hours.
3. Post-processing serial jobs run concurrently.

I have set up a Parallel Environment called mpich-gm and a straightforward FIFO scheduling schema for testing. When I submit parallel jobs they hang in limbo in a 'qw' state pending submission. I am not sure why the scheduler does not see jobs that I submit. I used the myrinet mpich template located in the $SGE_ROOT/< sge_cell >/mpi/myrinet directory to configure the pe (parallel environment), plus I copied the sge_mpirun script to the $SGE_ROOT/< sge_cell >/bin directory. I configured a Production.q queue that runs only parallel jobs. As a last sanity check I ran a trace on the scheduler, submitted a simple parallel job, and these are the results I got from the logs:

JOB RUN Window

[wems@wems examples]$ qsub -now y -pe mpich-gm 1-4 -b y hello++
Your job 277 ("hello++") has been submitted.
Waiting for immediate job to be scheduled.

Your qsub request could not be scheduled, try again later.
[wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++ Your job 278 ("hello++") has been submitted. [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++ Your job 279 ("hello++") has been submitted. This is the 2nd window SCHEDULER LOG [root@wems bin]# qconf -tsm [root@wems bin]# qconf -tsm [root@wems bin]# cat /WEMS/grid/default/common/schedd_runlog Wed Mar 23 06:08:55 2005|-------------START-SCHEDULER-RUN------------- Wed Mar 23 06:08:55 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available Wed Mar 23 06:08:55 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available Wed Mar 23 06:08:55 2005|queues dropped because they are temporarily not available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on Wed Mar 23 06:08:55 2005|--------------STOP-SCHEDULER-RUN------------- Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN------------- Wed Mar 23 06:11:37 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available Wed Mar 23 06:11:37 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available Wed Mar 23 06:11:37 2005|queues dropped because they are temporarily not available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN------------- [root@wems bin]# qstat job-ID prior name user state submit/start at queue slots ja-task-ID ---------------------------------------------------------------------------- ------------------------------------- 279 0.55500 hello++ wems qw 03/23/2005 06:11:43 1 [root@wems bin]# BTW that node wems10.grid.wni.com has connectivity issues and I have not removed it from the cluster queue. 
What causes N1GE to return "no pending jobs to perform scheduling on" in the schedd_runlog even though there are available slots ready to take jobs? I had no problem submitting serial jobs; only the parallel jobs behave this way. Are there N1GE - Myrinet issues that I am not aware of? FYI the same binary (hello++) runs with no problems from the command line. Since I generally run scripts from qsub instead of binaries, I created a script to run the mpich executable, but that yielded the same result.

I have an additional question regarding setting a queue.conf parameter called "subordinate_list". How is it read from the result of qconf -mq ? Example i.e., subordinate_list low_pri.q=5,small.q. Which queue has priority over the other based on the slots?

William Burke Tellitec Sollutions -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20050323/611516b3/attachment.html

From mathog at mendel.bio.caltech.edu Wed Mar 23 14:02:51 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] cluster storage design Message-ID: Joe Landman wrote: > > Brian Henerey wrote: > > > Hello all, > > > > I have a 32 node cluster with 1 master and 1 data storage server with > > 1.5 TB's of storage. The master used to have storage:/home mounted on > > /home via NFS. I moved the 1.5TB RAID array of storage so it was > > directly on the master. This decreased the time it took for our > > program to run by a factor of 4. I read somewhere that mounting the > > data to the master via NFS was a bad idea for performance, but am not > > sure what the best alternative is. I don't want to have to move data > > on/off the master each time I run a job because this will slow it down > > as more people are using it.
> > > > If your problems are I/O bound, and you have enough local storage on > each compute node, and you can move the data in a reasonable amount of > time, the local I/O will likely be the fastest solution. You have > already discovered this when you moved to a local attached RAID. If you > have multiple parallel reads/writes to the data from each compute node, > you will want some sort of distributed system. If the master thread is > the only one doing IO, then you want the fast storage where it is.

Also keep in mind that if the data used on the nodes fits into memory _and_ you tend to run the same software over and over, then typically that data will only need to be read off disk once on each node and will subsequently be accessed from the file system cache. That mode of data access is many times faster than physically reading from a disk. So don't toss out the idea of local data storage if the cluster happens to have slowish disks on the compute nodes. It will also cache from NFS but it may take a very, very long time for all nodes to read it at once.

Depending on your cluster topology, interconnect, and budget you might also consider multiple file servers. That will speed things up at the cost of a bit more hardware, more complexity (which node mounts which file server). Also, for that to work well data should be mostly reads, since writes to a common file need to go to M file servers instead of just one.

Finally, and this effect can be surprisingly large - be careful about writes of results back to a single file server. When N nodes naively direct stdout back to a single NFS server the line by line writes can drive that server into the ground. Conversely, if the nodes write to /tmp and then when done copy that file to the NFS server in one fell swoop it may work better, especially if the processes finish asynchronously.
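A cheap way to soften that copy-back burst (my sketch, not from the post; the paths are hypothetical and the 30-second stagger is an arbitrary default) is to add a random backoff before each node writes its results to the shared server:

```python
# Sketch: copy a node's local results back to shared storage with a random
# stagger, so N nodes that finish together don't all hit the NFS server at
# the same instant. All paths here are hypothetical examples.
import os
import random
import shutil
import socket
import time

def copy_back(local_file, nfs_dir, max_stagger=30.0):
    """Copy local_file into nfs_dir, prefixed by hostname, after 0..max_stagger s."""
    time.sleep(random.uniform(0.0, max_stagger))  # spread the write burst out
    dest = os.path.join(
        nfs_dir, "%s_%s" % (socket.gethostname(), os.path.basename(local_file)))
    shutil.copy(local_file, dest)  # one bulk write instead of line-by-line I/O
    return dest
```

The hostname prefix keeps nodes from clobbering each other's files in the shared accumulation directory, and the single `shutil.copy` preserves the "one fell swoop" write pattern described above.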
If they all finish at the same time think twice before having them all do:

cp /tmp/$HOSTNAME_output.txt /nfsmntpoint/accum_dir/

simultaneously.

> NFS provides effectively a single point of data flow, and hence is a > limiting factor (generally).

Also double check that NFS is using hard mounts. Else you may fall prey to the dreaded "big block of nulls" problem.

Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech

From john.hearns at streamline-computing.com Wed Mar 23 14:10:31 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Re: Grid Engine, Parallel Environment, Scheduling, Myrinet, and MPICH In-Reply-To: References: Message-ID: <33783.212.159.87.168.1111615831.squirrel@webmail.streamline-computing.com> > I can't get PE to work on a 50 node class II Beowulf. It has a front-end William, why don't you ask this question on the Gridengine mailing list please? Sorry I can't give you much more constructive help, or some diagnostics to run on Gridengine, but truth be told I am dog-tired after working since 8:30 this morning on a cluster and getting home at 10pm. If I was less tired I'd try to help, honest.

From alvin at ns.Linux-Consulting.com Wed Mar 23 16:36:22 2005 From: alvin at ns.Linux-Consulting.com (Alvin Oga) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] cluster storage design In-Reply-To: <200503231535.j2NFZh7Q016327@bluewest.scyld.com> References: <200503231535.j2NFZh7Q016327@bluewest.scyld.com> Message-ID: <20050324003622.GA24947@Maggie.Linux-Consulting.com> On Wed, Mar 23, 2005 at 09:41:46AM -0600, Brian Henerey wrote: > > I have a 32 node cluster with 1 master and 1 data storage server with 1.5 > TB's of storage. The master used to have storage:/home mounted on /home via > NFS. I moved the 1.5TB RAID array of storage so it was directly on the > master. This decreased the time it took for our program to run by a factor > of 4. yes ..
that is a good thing

> I read somewhere that mounting the data to the master via NFS was a > bad idea for performance, but am not sure what the best alternative is. I > don't want to have to move data on/off the master each time I run a job > because this will slow it down as more people are using it.

for users, you have 2 choices: /home on one big "home server", or automagically sync users' loginID and pwd from node to node ( little more work.. but not as bad as it sounds ) if the "home server" dies ... everybody is dead if each node is standalone .. there are no issues with the "master" dying

for running jobs .... an automated queue is good ... users don't necessarily dictate which nodes to run the jobs on, but a good queuer will allow users to specify preferences

for "/data" where all nodes share a common big 100TB data farm ..
- you have NFS or SANs or ??
- getting good nic cards and good switches helps a lot
- change your NFS parameters to send 16K or 32K bytes at a time instead of the small default
- dual or quad channel bonding should help with thruput too..
- a TB sized "/data" shouldn't be noticeably slow across the nodes
- /data should be on the machine where the apps use it the most
- since /data is probably shared across multiple nodes, it might be worth it ( definitely worth it ) to buy another 4 or 8 disks and use them as backups of /data on other nodes
- you now have 3 "master nodes" with local /data
- you will have to rsync and rdiff your changes from node to node
- 1 TB of disks is about $600 nowadays ( 4x $150 each )
- structuring your /data into /data/xxx and /data/yyy and /data/zzz will allow multiple nodes to have all of their data local to where the disk i/o is being done, as opposed to across the slow ethernet

> I know there are probably many solutions but I'm curious what the people on > this list do. It seems to me that SAN's are very expensive compared to just > building servers with 4 x 500GB hard drives.
I've considered just launching > my lam-mpi jobs from whatever storage server has the appropriate data on it, > but this doesn't seem ideal. for me ... lots of redundant IDE disks is way way better/faster than san/nas > How does performance compare from having the data local on the master via > running it off a PVFS? c ya alvin

From apseyed at bu.edu Wed Mar 23 15:11:41 2005 From: apseyed at bu.edu (apseyed@bu.edu) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] cluster storage design In-Reply-To: from "David Mathog" at Mar 23, 2005 02:02:51 PM Message-ID: <200503232311.j2NNBgYg101594@acsrs3.bu.edu> I concur with David: when necessary, running jobs on local compute-node disk takes an immense load off a storage node / NFS file server. Here is some brief documentation and a template on our website for using this method (/scratch can be /tmp): http://www.bu.edu/dbin/sph/departments/biostatistics/linga_documentation.php#scratch Cheers, Patrice > > Joe Landman wrote: > > > > Brian Henerey wrote: > > > > > Hello all, > > > > > > I have a 32 node cluster with 1 master and 1 data storage server with > > > 1.5 TB's of storage. The master used to have storage:/home mounted on > > > /home via NFS. I moved the 1.5TB RAID array of storage so it was > > > directly on the master. This decreased the time it took for our > > > program to run by a factor of 4. I read somewhere that mounting the > > > data to the master via NFS was a bad idea for performance, but am not > > > sure what the best alternative is. I don't want to have to move data > > > on/off the master each time I run a job because this will slow it down > > > as more people are using it. > > > > > > > If your problems are I/O bound, and you have enough local storage on > > each compute node, and you can move the data in a reasonable amount of > > time, the local I/O will likely be the fastest solution. You have > > already discovered this when you moved to a local attached RAID.
If you > > have multiple parallel reads/writes to the data from each compute node, > > you will want some sort of distributed system. If the master thread is > > the only one doing IO, then you want the fast storage where it is. > > Also keep in mind that if the data used on the nodes fits > into memory _and_ you tend to run the same software over and > over, then typically that data will only need to be read off disk once > on each node and will subsequently be accessed from the file system > cache. That mode of data access is many times faster > than physically reading from a disk. So don't toss out the idea > of local data storage if the cluster happens to have slowish disks > on the compute nodes. It will also cache from NFS but it may take > a very, very long time for all nodes to read it at once. > > Depending on your cluster topology, interconnect, and budget > you might also consider multiple file servers. That will > speed things up at the cost of a bit more hardware, more > complexity (which node mounts which file server). Also, for that to > work well data should be mostly reads, since writes to a common > file need to go to M file servers instead of just one. > > Finally, and this effect can be surprising large - be careful about > writes of results back to a single file server. When N nodes naively > direct stdout back to a single NFS server the line by line writes can > drive that server into the ground. Conversely, if the nodes > write to /tmp and then when done copy that fall to the NFS server > in one fell swoop it may work better, especially if the processes > finish asynchronously. If they all finish at the same time think > twice before having them all do: > > cp /tmp/$HOSTNAME_output.txt /nfsmntpoint/accum_dir/ > > simultaneously. > > > > NFS provides effectively a single point of data flow, and hence is a > > limiting factor (generally). > > Also double check that NFS is using hard mounts. 
Else you may > fall prey to the dreaded "big block of nulls" problem. > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From reuti at staff.uni-marburg.de Wed Mar 23 15:25:30 2005 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] In-Reply-To: References: Message-ID: <1111620330.4241faeabe336@home.staff.uni-marburg.de> Hi, I'd suggest to move over to the SGE users list at: http://gridengine.sunsource.net/servlets/ProjectMailingListList But anyway, let's sort the things out: Quoting William Burke : > I can't get PE to work on a 50 node class II Beowulf. It has a front-end > Sunfire v40 (qmaster host) and 49 Sunfire v20s (execution hosts) running > Linux configured to communicate data over Myrinet using MPICH-GM version > 1.26.14a. Although there is a special Myrinet directory, you can also try to use the files in the mpi directory instead. > These are the requirements of the N1GE environment to handle: > > 1. Serial type jobs for pre-processing the data - average runtime 15 > minutes. > 2. Output is pipelined into parallel processing jobs - range of runtime > 1- 6 hours. > 3. Concurrently running is post-processing serial jobs. > > I have setup a Parallel Environment called mpich-gm and a straight-forward > FIFO scheduling schema for testing. When I submit parallel jobs they hang > in > limbo in a 'qw' state pending submission. I am not sure why the scheduler > does not see jobs that I submit. > > > > I used the myrinet mpich template located $SGE_ROOT/< sge_cell > >/mpi/myrinet > directory to configure the pe (parallel environment) plus I copied the > sge_mpirun script to the $SGE_ROOT/< sge_cell >/bin directory. 
I > configured > a Production.q queue that runs only parallel jobs. As a last sanity check I > ran a trace on the scheduler, submitted a simple parallel job, and this is > the results that I got from the logs: Can you please give more details of your queue and PE setup (qconf -sq/sp output). > JOB RUN Window > > [wems@wems examples]$ qsub -now y -pe mpich-gm 1-4 -b y hello++ > > Your job 277 ("hello++") has been submitted. > > Waiting for immediate job to be scheduled. > > > > Your qsub request could not be scheduled, try again later. > > [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++ > > Your job 278 ("hello++") has been submitted. > > [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++ > > Your job 279 ("hello++") has been submitted. You can't start a parallel job this way, as there is no mpirun used. When you used your mentioned script, you get the same behavior (and there you used mpirun -np $NSLOTS ...)? > This is the 2nd window SCHEDULER LOG > > [root@wems bin]# qconf -tsm > > [root@wems bin]# qconf -tsm > > [root@wems bin]# cat /WEMS/grid/default/common/schedd_runlog > > Wed Mar 23 06:08:55 2005|-------------START-SCHEDULER-RUN------------- > > Wed Mar 23 06:08:55 2005|queue instance "all.q@wems10.grid.wni.com" dropped > because it is temporarily not available > > Wed Mar 23 06:08:55 2005|queue instance "Production.q@wems10.grid.wni.com" > dropped because it is temporarily not available > > Wed Mar 23 06:08:55 2005|queues dropped because they are temporarily not > available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com > > Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on > > Wed Mar 23 06:08:55 2005|--------------STOP-SCHEDULER-RUN------------- > > Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN------------- > > Wed Mar 23 06:11:37 2005|queue instance "all.q@wems10.grid.wni.com" dropped > because it is temporarily not available > > Wed Mar 23 06:11:37 2005|queue instance 
"Production.q@wems10.grid.wni.com" > dropped because it is temporarily not available > > Wed Mar 23 06:11:37 2005|queues dropped because they are temporarily not > available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com > > Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on > > Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN------------- > > [root@wems bin]# qstat > > job-ID prior name user state submit/start at queue > slots ja-task-ID > > ---------------------------------------------------------------------------- > ------------------------------------- > > 279 0.55500 hello++ wems qw 03/23/2005 06:11:43 > 1 > > [root@wems bin]#

Do you have an admin account for SGE? I'd prefer not to do anything in SGE as root.

> BTW that node wems10.grid.wni.com has connectivity issues and I have not > removed it from the cluster queue. > > > > What causes this type of problem in N1GE to return "no pending jobs to > perform scheduling on" in the schedd_runlog even though there are available > slots ready to take jobs? > > I had no problem submitting serial jobs, only the parallel jobs resulted as > such. Are there N1GE - Myrinet issue that I am not aware of? FYI the same > binary (hello++) runs with no problems from the command line.

If you just start hello++, it will not run in parallel I think. Not really an issue: you have to make a small change to the mpirun.ch_gm.pl to make all processes stay in the same process group, so that they get correctly killed in case of a job abort: http://gridengine.sunsource.net/howto/mpich-integration.html

> Since I generally run scripts from qsub instead of binaries I created a > script to run the mpich executable but that yield the same result. > > > > I have an additional question regarding setting a queue.conf parameter > called "subordinate_list". How is it read from the result of qconf -mq > ? > > Example > > i.e., subordinate_list low_pri.q=5,small.q.
The queue "low_pri.q" will be suspended when 5 or more slots of "" are filled. The queue "small.q" will be suspended if all slots of "" are filled. Cheers - Reuti From list-beowulf at onerussian.com Thu Mar 24 07:40:07 2005 From: list-beowulf at onerussian.com (Yaroslav Halchenko) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Jumbo Gigabit Switches again (SMC vs HP) Message-ID: <20050324154007.GI20659@washoe.rutgers.edu> Dear Beowulfers, Sorry to bring up the old question, but can anyone advise me on choosing between the cheaper SMC8648T, which supports 9K jumbo frames, and the more expensive (but, I think, with better warranty and support) HP ProCurve Switch 2848? Here is its brochure: http://www.hp.com/rnd/pdfs/datasheets/switch_2800_series.pdf It is unclear, though, what jumbo frame size the HP supports. The goals I'm looking for: * high density -- we have 26 nodes + 1 service connection, so a 24-port switch is not enough any more * high throughput from the server (there is not that much MPI going on on our cluster) * reduced server load (now with 23 nodes, intensive I/O brings the server to its knees - the crowd of nfsd's starves CPU time), which is why I'm trying to grab a switch with jumbo frames Please advise between the two -- Yaroslav Halchenko Research Assistant, Psychology Department, Rutgers-Newark Office: (973) 353-5440x263 | FWD: 82823 | Fax: (973) 353-1171 Student Ph.D. @ CS Dept. NJIT, Master @ CS Dept. UNM -------------- next part -------------- A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20050324/a8c3bdc2/attachment.bin From edwardsa at afrl.kirtland.af.mil Thu Mar 24 10:02:18 2005 From: edwardsa at afrl.kirtland.af.mil (Art Edwards) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Jumbo frame cards and switches Message-ID: <200503241802.j2OI2JJ7000237@bell.kirtland.af.mil> I have spent some time looking at posts on Gb cards and Gb switches for jumbo frames. We currently have broadcom GB BCM7501 cards and the catalyst 4000 switch. I am able to set mtu to 9000 on the cards, but I have found out that the switch does not handle jumbo frames. My questions are: 1. If I can set the mtu to 9000, does this mean the card can actually send and receive messages this size? 2. How much would I have to spend for a 64-port Gb switch that can handle jumbo frames? Are there vendor recommendations? Art Edwards -- Art Edwards Senior Research Physicist Air Force Research Laboratory Electronics Foundations Branch KAFB, New Mexico (505) 853-6042 (v) (505) 846-2290 (f) From lindahl at pathscale.com Thu Mar 24 17:34:48 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: References: Message-ID: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> > Create a new software project (preferably open source, preferably with > a BSD-like license so that ISVs can incorporate this software into > their products) that provides a compatibility layer for all the > different MPI implementations out there. Let's call it MorphMPI. Jeff, A similar idea was actually written up by Bill Gropp and was mentioned by him 5 weeks ago on the beowulf list. Quoth he: | I wrote a paper that appeared in the EuroPVMMPI'02 meeting that discusses | the issues of a common ABI.
The paper is "Building Library Components That | Can Use Any MPI Implementation" and is available as | http://www-unix.mcs.anl.gov/~gropp/bib/papers/2002/gmpishort.pdf . I think this overall idea falls short of the benefit of an ABI for a couple of reasons. The first is that it's unlikely to get wide distribution if it doesn't come with MPI implementations. The second is that it's harder to maintain "out of tree"; the minute that an MPI implementation changes something, MorphMPI is going to be broken. You keep on coming back to this: > 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on > pay-per-view (read: no need for time-consuming, potentially fruitless > attempts to get MPI implementors to agree on anything) Was there a big fight over the Fortran interface? It nails down the types because it has to. All the ABI does is make C look like Fortran; no internals need change in any implementation. -- greg From lindahl at pathscale.com Thu Mar 24 18:41:36 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Questions regarding interconnects In-Reply-To: References: Message-ID: <20050325024136.GA5033@greglaptop.internal.keyresearch.com> On Sun, Mar 20, 2005 at 07:56:35PM +0200, Olli-Pekka Lehto wrote: > What do you see as the key differentiating factors in the quality of an > MPI implementation? This far I have come up with the following: > -Completeness of the implementation > -Latency/bandwidth > -Asynchronous communication > -Smart collective communication Those are superficial differences. What people actually want is performance. If dumb collectives gave better performance, would you actually care that they were dumb? What if collective performance was dominated by (1) small packet latency and (2) OS jitter? Likewise, people want asynchronous communication because they imagine that it will give them better performance. 
Finally, latency/bandwidth is less relevant to real apps than the latency/bandwidth at the message size that the apps actually use. For most interconnects, the predicted latency/bandwidth at 2k packets isn't that close to what you'd predict from published 0-byte latency and infinite-size bandwidth. > Are there any NICs on the market which utilize the 10GBase-CX4 > standard and if there is are there any clusters which use them? You can't buy a big switch for it, so there might be small clusters, but people don't talk about small clusters much. Orion's 96-node clusters, if I read my tea leaves right, are hooked together using 10G-CX4 uplinks. But that's just building a 96-port 1-gig switch for cheap. > When do you estimate that commodity Gigabit NICs with integrated RDMA > support will arrive to the market? (or will they?) They arrived a while ago, didn't seem to make much of a splash. I don't personally think much of offload. Just one man's (likely-to-be-disputed) opinion, -- greg From hahn at physics.mcmaster.ca Thu Mar 24 21:47:45 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Questions regarding interconnects In-Reply-To: <20050325024136.GA5033@greglaptop.internal.keyresearch.com> Message-ID: > > What do you see as the key differentiating factors in the quality of an > > MPI implementation? This far I have come up with the following: > > -Completeness of the implementation this really depends on the "maturity" of the application. I know of one application which has covered a lot of ground, including cray-shm, openmp and mpi-2 (with heavy use of post-mpi-1 features.) it cares about completeness, but a new app written from scratch doesn't. > > -Latency/bandwidth it would be hard to argue that these don't matter. as Greg points out, zero-byte latency and infinite-byte bandwidth don't necessarily predict the performance that real-app-sized packets will see. 
then again, if a more accurate prediction were desired, just fit three lines to the s-curve. that's the appeal of quoting half-bandwidth packet size anyway, isn't it? > > -Asynchronous communication this appeals because people recognize that it *could* provide higher performance. it seems like most implementations are fairly disappointing in how they implement asynchrony, but that's not a reason to ignore asynchrony present in your program. > > -Smart collective communication this would appeal more widely if there was hardware support that gave a real speedup (as in Quadrics) rather than shifting code from app-space to library-space. in other words, people care less about convenience functions. libmpi.a may do a wonderful O(nlogn) bcast, but it would be a lot sexier if the interconnect provided hardware acceleration. > Likewise, people want asynchronous communication because they imagine > that it will give them better performance. I think there's more to it than that. any programmer notices when there are dependencies and when there is slack. if there was a smart MPI/interconnect coprocessor, taking advantage of the slack would turn asynchrony into better performance - basically latency hiding. > > When do you estimate that commodity Gigabit NICs with integrated RDMA > > support will arrive to the market? (or will they?) > > They arrived a while ago, didn't seem to make much of a splash. I don't > personally think much of offload. TOE folk don't seem to understand the concept of fast-paths. sure, RDMA is attractive, but does that mean the whole TCP stack (plus some new extra RDMA gunk) needs to go onto the nic? suppose you had a nic which could generate packets in response to very specific filters on incoming packets. in other words, "reflex" responses to the expected state transitions, avoiding host involvement if the pattern is as expected. of course, it's also true that TCP has very little justification in a cluster setting, so what's TOE for? 
trying to run really giant webservers on a single K6-2? most internet-related TCP services can be quite readily clusterized in the first place, so scaling is not a problem. one could easily argue that network state machines have shown far less innovation and paradigm shift than graphics accelerators. and look at the awesome amount of offload in your video card - it could easily have more transistors and flops than your host cpu. as far as I can tell, this argument only fails because the mass market is not anywhere close to being net-bottlenecked, and that it's harder to throw hardware at networking. it's easy to be limited by graphics (turn up the resolution, framerate, quality, AA, etc), and it's easy to throw another dozen pixel pipelines at the problem. imagine if you had an interconnect coprocessor with 220M transistors and 30 GB/s private memory bandwidth sitting on 16x PCI-E. the only thing I can think of to use that horsepower for would be a distributed directory-based shared-memory scheme that implemented FP collectives... From diep at xs4all.nl Fri Mar 25 06:04:53 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Questions regarding interconnects Message-ID: <3.0.32.20050325150452.016766a0@pop.xs4all.nl> I feel it is very important to look at 'shmem' capabilities. That avoids so many problems. To give a simple example, if I want to modify a searching node then: In MPI you ship a nonblocking message from node A to B. In order for B to receive, it has to have either a special thread that regularly polls. If you have a thread that polls, say, every 10 milliseconds, then what's the use of using a high-end network card (other than its DMA capabilities)? If you *poll* within the searching thread (that eats all system time) sometimes for the MPI, then that's the best solution. However, it's very expensive to poll. Perhaps someone can calculate for me exactly how many cpu cycles I would lose at say a 2.2 GHz processor.
On the other hand using the 'shmem', what happens is that A ships a nonblocking write to B of just a few bytes. The network card in B simply writes it in the RAM. Now and then the searching process at B only has to poll its own main memory to see whether it has a '1'. So sometimes you lose a TLB-thrashing access to it, but other times it comes from L2 cache. A TLB-thrashing access is even at old chipsets just 400 ns, and at a dual Opteron around 133 ns. An L2 cache lookup is 20 cycles in the case of a K7 and 13 cycles in the case of an Opteron. That last case is roughly 5-6 nanoseconds. So for short messages which are latency sensitive, the 'shmem' of Quadrics is just far superior. Do other cards implement something similar? As far as I know they do not. The overhead of the MPI implementation layer *receiving* bytes is just so huge. A card's theoretical one-way ping-pong latency is just irrelevant to that, because a one-way ping-pong program on any card eats 100% system time, effectively losing a full CPU. If you lose a full CPU, the efficiency of your software degrades incredibly. In fact, at nodes with 1 CPU you hardly have any system time left. So what is really important is measuring the effective wall-clock time you lose before you receive the message, without hurting the main processor too much. In addition, measurement is needed of how much main-processor time you lose to the network. In the end what matters is how quickly the main processor gets the job for its process and how much system time this main process can use from the main processor(s). The network is just a tool to deliver data from A to B and should not get in the way of the real job to be done. Vincent At 10:41 AM 3/22/2005 -0600, Richard Walsh wrote: >Olli-Pekka Lehto wrote: > >> Hello, >> >> I'm writing a paper on current and emerging cluster interconnect >> technologies as a part of my University studies. I have included 1GbE >> (incl. RDMA), 10GbE, Quadrics, InfiniBand and Myrinet.
The goal is to >> provide an introduction to the subject maybe more from a network >> engineer's point of view with an overview on the key features and the >> pros/cons of each solution. I have some questions on which I hope you >> could help me out with: > > I think that integrating a custom interconnect for comparison into >your analysis would be useful to contrast > the capabilities of "commodity cluster" interconnects with those of >the presumptive custom leading edge. > I would choose the Cray X1e or Altix interconnects for this. > >> >> What do you see as the key differentiating factors in the quality of >> an MPI implementation? This far I have come up with the following: >> -Completeness of the implementation >> -Latency/bandwidth >> -Asynchronous communication >> -Smart collective communication > > I think that explicit treatment/comparison of the interconnect's >RDMA capabilities is important as they support > both MPI-2 and the new-ish UPC and CAF compilers for cluster >systems. I can send you a recent article I wrote > comparing Quadrics to the Cray X1 interconnect relative to the >performance of these global address space programming > models (UPC and CAF). > > Another thing to look at is the latency advantage/potential of >alternative paths to the processor (i.e. HT/Infinipath) > >> >> Are there any NICs on the market which utilize the 10GBase-CX4 >> standard and if there is are there any clusters which use them? Do you >> see it as a viable choice for an interconnect considering the >> relatively low cost of InfiniBand and that fact that 10GBase-T is not >> that far in the future? >> >> When do you estimate that commodity Gigabit NICs with integrated RDMA >> support will arrive to the market? (or will they?) > > > AMASSO already sells one.
> >> >> >> best regards, >> Olli-Pekka > > > >Richard Walsh >AHPCRC > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From jsquyres at open-mpi.org Fri Mar 25 07:00:17 2005 From: jsquyres at open-mpi.org (Jeff Squyres) Date: Wed Nov 25 01:03:56 2009 Subject: [O-MPI users] Re: [Beowulf] Alternative to MPI ABI In-Reply-To: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> Message-ID: On Mar 24, 2005, at 8:34 PM, Greg Lindahl wrote: > A similar idea was actually written up by Bill Gropp and was mentioned > by him 5 weeks ago on the beowulf list. Quoth he: > > | I wrote a paper that appeared in the EuroPVMMPI'02 meeting that > discusses > | the issues of a common ABI. The paper is "Building Library > Components That > | Can Use Any MPI Implementation" and is available as > | http://www-unix.mcs.anl.gov/~gropp/bib/papers/2002/gmpishort.pdf . > > I think this overall idea falls short of the benefit of an ABI for a > couple of reasons. The first is that it's unlikely to get wide > distribution if it doesn't come with MPI implementations. The second > is that it's harder to maintain "out of tree"; the minute that an MPI > implementation changes something, MorphMPI is going to be broken. Yes, I read this paper. I freely admit that my post was inspired by this paper (and several other factors). Mea culpa for not citing it (sorry Bill! :-( ). I'm just widening the scope of the ideas a little, and suggesting that a bright MS student can actually go implement it with *FAR* less work than trying to do an MPI ABI and in a *MUCH* shorter timescale. Software dependencies are a fact of life. But also consider that MPI's don't change their innards frequently (or at all). 
If an implementation chooses integers for MPI handles, for example, they'll stay with integers -- they won't suddenly change to pointers between version a.b.c and a.b.(c+1). So the perturbation rate is actually quite low; a MorphMPI that relies on MPI handles being integers for a specific MPI implementation (for example) would be relatively stable. >> 2. Cancel the MPI Implementor's Ultimate Prize Fighting Cage Match on >> pay-per-view (read: no need for time-consuming, potentially fruitless >> attempts to get MPI implementors to agree on anything) > > Was there a big fight over the Fortran interface? It nails down the > types because it has to. I was not involved in MPI-1, so I cannot say. > All the ABI does is make C look like Fortran; > no internals need change in any implementation. That is an extremely inaccurate statement. :-) -- {+} Jeff Squyres {+} The Open MPI Project {+} http://www.open-mpi.org/ From lindahl at pathscale.com Fri Mar 25 09:26:49 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Questions regarding interconnects In-Reply-To: <3.0.32.20050325150452.016766a0@pop.xs4all.nl> References: <3.0.32.20050325150452.016766a0@pop.xs4all.nl> Message-ID: <20050325172649.GB4470@greglaptop.hsd1.ca.comcast.net> On Fri, Mar 25, 2005 at 03:04:53PM +0100, Vincent Diepeveen wrote: > Do other cards implement something similar? > > As far as i know they do not. They do; in fact, checking for message arrival by only having to look in main memory is a standard feature of just about all OS-bypass card implementations.
-- greg From andrewxwang at yahoo.com.tw Fri Mar 25 06:51:25 2005 From: andrewxwang at yahoo.com.tw (Andrew Wang) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Re: Grid Engine, Parallel Environment, Scheduling, Myrinet, and MPICH In-Reply-To: Message-ID: <20050325145125.2232.qmail@web18003.mail.tpe.yahoo.com> Please send your question to the SGE mailing list: http://gridengine.sunsource.net/project/gridengine/maillist.html The "users" list is what you want. BTW, you should try commands like "qstat -f", or "qhost" to find out the status of the machines. Also, do serial jobs work? Andrew. --- William Burke wrote: > I can't get PE to work on a 50 node class II > Beowulf. It has a front-end > Sunfire v40 (qmaster host) and 49 Sunfire v20s > (execution hosts) running > Linux configured to communicate data over Myrinet > using MPICH-GM version > 1.26.14a. > > > > These are the requirements of the N1GE environment > to handle: > > 1. Serial type jobs for pre-processing the data - > average runtime 15 > minutes. > 2. Output is pipelined into parallel processing jobs > - range of runtime > 1- 6 hours. > 3. Concurrently running is post-processing serial > jobs. > > I have setup a Parallel Environment called mpich-gm > and a straight-forward > FIFO scheduling schema for testing. When I submit > parallel jobs they hang in > limbo in a 'qw' state pending submission. I am not > sure why the scheduler > does not see jobs that I submit. > > > > I used the myrinet mpich template located > $SGE_ROOT/< sge_cell >/mpi/myrinet > directory to configure the pe (parallel environment) > plus I copied the > sge_mpirun script to the $SGE_ROOT/< sge_cell >/bin > directory. I configured > a Production.q queue that runs only parallel jobs.
> As a last sanity check I > ran a trace on the scheduler, submitted a simple > parallel job, and this is > the results that I got from the logs: > > > > > > JOB RUN Window > > [wems@wems examples]$ qsub -now y -pe mpich-gm 1-4 > -b y hello++ > > Your job 277 ("hello++") has been submitted. > > Waiting for immediate job to be scheduled. > > > > Your qsub request could not be scheduled, try again > later. > > [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y > hello++ > > Your job 278 ("hello++") has been submitted. > > [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y > hello++ > > Your job 279 ("hello++") has been submitted. > > > > This is the 2nd window SCHEDULER LOG > > [root@wems bin]# qconf -tsm > > [root@wems bin]# qconf -tsm > > [root@wems bin]# cat > /WEMS/grid/default/common/schedd_runlog > > Wed Mar 23 06:08:55 > 2005|-------------START-SCHEDULER-RUN------------- > > Wed Mar 23 06:08:55 2005|queue instance > "all.q@wems10.grid.wni.com" dropped > because it is temporarily not available > > Wed Mar 23 06:08:55 2005|queue instance > "Production.q@wems10.grid.wni.com" > dropped because it is temporarily not available > > Wed Mar 23 06:08:55 2005|queues dropped because they > are temporarily not > available: all.q@wems10.grid.wni.com > Production.q@wems10.grid.wni.com > > Wed Mar 23 06:08:55 2005|no pending jobs to perform > scheduling on > > Wed Mar 23 06:08:55 > 2005|--------------STOP-SCHEDULER-RUN------------- > > Wed Mar 23 06:11:37 > 2005|-------------START-SCHEDULER-RUN------------- > > Wed Mar 23 06:11:37 2005|queue instance > "all.q@wems10.grid.wni.com" dropped > because it is temporarily not available > > Wed Mar 23 06:11:37 2005|queue instance > "Production.q@wems10.grid.wni.com" > dropped because it is temporarily not available > > Wed Mar 23 06:11:37 2005|queues dropped because they > are temporarily not > available: all.q@wems10.grid.wni.com > Production.q@wems10.grid.wni.com > > Wed Mar 23 06:11:37 2005|no pending jobs to perform > scheduling on 
> > Wed Mar 23 06:11:37 > 2005|--------------STOP-SCHEDULER-RUN------------- > > [root@wems bin]# qstat > > job-ID prior name user state > submit/start at queue > slots ja-task-ID > > ---------------------------------------------------------------------------- > ------------------------------------- > > 279 0.55500 hello++ wems qw > 03/23/2005 06:11:43 > 1 > > [root@wems bin]# > > > > BTW that node wems10.grid.wni.com has connectivity > issues and I have not > removed it from the cluster queue. > > > > What causes this type of problem in N1GE to return > "no pending jobs to > perform scheduling on" in the schedd_runlog even > though there are available > slots ready to take jobs? > > I had no problem submitting serial jobs, only the > parallel jobs resulted as > such. Are there N1GE - Myrinet issue that I am not > aware of? FYI the same > binary (hello++) runs with no problems from the > command line. > > Since I generally run scripts from qsub instead of > binaries I created a > script to run the mpich executable but that yield > the same result. > > > > I have an additional question regarding setting a > queue.conf parameter > called "subordinate_list". How is it read from the > result of qconf -mq > ? > > Example > > i.e., subordinate_list > low_pri.q=5,small.q. > > > > Which queue has priority over the other based on the > slots? > > > > > > William Burke > > Tellitec Sollutions > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or > unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From laytonjb at charter.net Fri Mar 25 09:42:46 2005 From: laytonjb at charter.net (Jeffrey B.
Layton) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] [Fwd: [O-MPI users] Fwd: Thoughts on an MPI ABI] In-Reply-To: <423EB918.90908@charter.net> References: <423ADCE3.7090508@charter.net> <20050321023348.GG3221@greglaptop> <423EB918.90908@charter.net> Message-ID: <42444D96.8070808@charter.net> Jeffrey B. Layton wrote: > Greg, > > I apologize. I forgot to cc the OpenMPI mailing list. However, > I've only seen one other post from the OpenMPI list regarding > this topic - being the one from you. To be fair do you want me to > forward that one with the cc to the OpenMPI list? I've been delinquent in posting the links to the conversation on the Open-MPI mailing list. Here is the first post to the Open-MPI mailing list which was cross-posted from the comp.parallel.mpi newsgroup. http://www.open-mpi.org/community/lists/users/2005/03/0014.php This is the link to the original posting that I forwarded to the beowulf mailing list: http://www.open-mpi.org/community/lists/users/2005/03/0021.php If you follow this thread you will see Greg's response to Jeff's initial post. Jeff From atp at piskorski.com Fri Mar 25 10:12:58 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Bernhard Kuhn's real time interrupt patch Message-ID: <20050325181258.GA83760@piskorski.com> Last year sometime I was reading about Bernhard Kuhn's real time interrupt patch (aka rtirq-patch) for Linux: http://home.t-online.de/home/Bernhard_Kuhn/rtirq/20040304/rtirq.html http://linuxdevices.com/articles/AT6105045931.html He seems to work on embedded systems, but for those there are other real-time options. But since it does not involve a separate real-time kernel, low-latency cluster interconnects seem like the obvious niche where his patch might be a big win. So I was wondering, has anyone tried that? 
I believe the low-latency OS-bypass NICs are already doing some fancy mix of polling and waiting for slow Linux interrupts, but shouldn't Kuhn's rtirq-patch still help? -- Andrew Piskorski http://www.piskorski.com/ From diep at xs4all.nl Fri Mar 25 10:53:17 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Questions regarding interconnects Message-ID: <3.0.32.20050325195317.016766a0@pop.xs4all.nl> At 09:26 AM 3/25/2005 -0800, Greg Lindahl wrote: >On Fri, Mar 25, 2005 at 03:04:53PM +0100, Vincent Diepeveen wrote: > >> Do other cards implement something similar? >> >> As far as i know they do not. > >They do, in fact checking for message arrival by only having to look >in main memory is a standard feature of just about all OS-bypass card >implementations. Function name, please, and the cost in nanoseconds or clock cycles of those function calls when used in a program. So the cost at the software level, not the hardware level. "Paper supports everything" - Arturo Ochoa Vincent >-- greg >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From patrick at myri.com Fri Mar 25 11:34:43 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: References: Message-ID: <424467D3.3050009@myri.com> Hi Jeff, Jeff Squyres wrote: > You get the idea. This is pretty much the position I reached after talking with other MPI developers and some ISVs. Bill's paper shows that it is technically possible; what is missing is the critical mass. We may have that today. > There's a slight performance penalty for the translation layer, but for > those who want an MPI ABI, this might well be an acceptable price to pay. This performance penalty would be negligible for the current MPI operations on today's machines.
This common external interface could be designed to do the translation faster. This design has my vote. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Fri Mar 25 11:47:44 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: References: Message-ID: <42446AE0.3080505@myri.com> Hi Don, It looks a lot more like a runtime environment (care about resources, scheduling, failover) than a programming model to me. MPI is much more a programming model than a runtime environment, so I don't think they are that orthogonal. Donald Becker wrote: > MPI has a model of initialize-compute-terminate. > There is no explicit support for checkpointing, executing as a > service, or running "forever". There is no explicit support but people have been checkpointing parallel jobs for a while. You just need to flush all pending communications: pass a token a couple of times and you will have a clean cut. > MPI's strength is collective mathematically-oriented operations, not > communication. I understand that even the name "Message Passing.." > indicates that stream communication isn't the focus, but many > applications expect and work well with a sockets-based model. Aaargh. Sockets are definitely not a programming model suited for parallel codes. Client/server, maybe, but not tightly coupled applications. Look at the bodies left from trying to do zero-copy and OS-bypass sockets: when you apply all of the constraints, you basically gut out the stream paradigm. > Communicators besides MPI_COMM_WORLD are rarely used. The capability > adds complexity with little benefit. Most libraries use communicators for isolation. Look at the BLACS contexts for example. Patrick -- Patrick Geoffray Myricom, Inc.
http://www.myri.com From patrick at myri.com Fri Mar 25 11:59:18 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> Message-ID: <42446D96.7080904@myri.com> Hi Greg, Greg Lindahl wrote: > I think this overall idea falls short of the benefit of an ABI for a > couple of reasons. The first is that it's unlikely to get wide > distribution if it doesn't come with MPI implementations. The second > is that it's harder to maintain "out of tree"; the minute that an MPI > implementation changes something, MorphMPI is going to be broken. I don't see it that way. First, the implementations of the translation layers will be done by each MPI implementation. MorphMPI (no offense Jeff, but we've got to choose a somewhat cooler name) just defines the common interface, nothing more. If the common layer does not change, the translation does not have to change unless something on the side of the MPI implementation changes, and its maintainer should then keep its local translation layer up to date. > Was there a big fight over the Fortran interface? It nails down the > types because it has to. All the ABI does is make C look like Fortran; > no internals need change in any implementation. You don't change internals, you translate them. Let's say you use pointers in your MPI implementation and the common layer specifies integers. In your translation layer, you translate pointers into integers by putting them in a table. The amount of work depends on how far your internals are from the common interface and, hopefully, it will be a midpoint for everybody. Patrick -- Patrick Geoffray Myricom, Inc.
http://www.myri.com From patrick at myri.com Fri Mar 25 12:32:47 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Questions regarding interconnects In-Reply-To: References: Message-ID: <4244756F.70405@myri.com> Olli-Pekka Lehto wrote: > What do you see as the key differentiating factors in the quality of an > MPI implementation? This far I have come up with the following: Most of the time, the MPI implementation is not the factor driving performance, the underlying hardware is. > Are there any NICs on the market which utilize the 10GBase-CX4 standard Our next product, Myrinet 10G, speaks XAUI at the link level, and the copper version uses a CX4 port. With the default firmware, it is a 10GBase-CX4 NIC speaking Ethernet. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Fri Mar 25 13:02:24 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Questions regarding interconnects In-Reply-To: <3.0.32.20050325150452.016766a0@pop.xs4all.nl> References: <3.0.32.20050325150452.016766a0@pop.xs4all.nl> Message-ID: <42447C60.5090303@myri.com> Hi Vincent, Vincent Diepeveen wrote: > I feel very important to look at is 'shmem' capabilities. > In order for B to receive, it has to have either a special thread > that regularly polls. If you have a thread that polls say each 10 > milliseconds, then what's the use of using a highend network > card (other than it's DMA capabilities)? You are in a situation where you don't have to wait for the message to arrive, you can move on and check 10 ms later. In this case, you don't care about network speed. > However, it's very expensive to poll. No, it's not. Not in the OS-bypass world. > On the other hand using the 'shmem', what happens is that A ships a > nonblocking write to B of just a few bytes. The network card in B simply > writes it in the RAM.
> > Now and then the searching process at B only has to poll its own main > memory to see whether it has a '1'. So sometimes you lose a TLB thrashing > call to it, but other times it comes from L2 cache. It's still polling. With message passing, you actually poll a queue in the MPI lib instead of a specific location in the user application. That helps when you are looking for several messages from several sources (you've got to poll several locations in your model). > So for short messages which are latency sensitive that 'shmem' of Quadrics > is just far superior. You are getting confused with words. "SHMEM" is a legacy shared memory interface that was used on Cray machines like the T3D. It's not a standard per se; it's a software interface. The implementations usually rest on top of remote memory operations (PUT/GET). It always strikes me when people put "one-sided" and "latency sensitive" in the same sentence. "One-sided" means that you don't want to involve the remote side in the communication, and "latency sensitive" means the other side is waiting for the communication. In your example, you will be looking to see if someone has written in your memory every X ms. In this case, what do you care about latency? > Do other cards implement something similar? You can do PUT on most high speed networks; this is a pretty basic functionality. The SHMEM interface may not be used because it makes sense only for former Cray customers, but look for portable RMA implementations like ARMCI for example. > As far as I know they do not. Do more research. > The overhead of the MPI implementation layer *receiving* bytes is just so > so huge. A card's theoretical one-way pingpong latency is just irrelevant to > that, because a one-way pingpong program on any card eats 100% > system time, effectively losing a full cpu. You are mistaken about the MPI receive overhead. You are also mistaken in your belief that one-sided operations are the silver bullet.
RMA operations may be more appropriate to an application design, but they share many constraints with message passing: you have to poll to know when it's done, and you have to tell the other side where to write (equivalent to posting a recv). They have drawbacks, like usually not scaling in space (each sender should write to a different location). Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Fri Mar 25 13:10:15 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Bernhard Kuhn's real time interrupt patch In-Reply-To: <20050325181258.GA83760@piskorski.com> References: <20050325181258.GA83760@piskorski.com> Message-ID: <42447E37.1020509@myri.com> Hi Andrew, Andrew Piskorski wrote: > So I was wondering, has anyone tried that? I believe the low-latency > OS-bypass NICs are already doing some fancy mix of polling and waiting > for slow Linux interrupts, but shouldn't Kuhn's rtirq-patch still > help? I don't think this patch improves the interrupt latency or the interrupt overhead when there is no contention. Context switching is the metric you want to improve. With OS-bypass NICs, interrupts are rare and there is usually no contention. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From lindahl at pathscale.com Fri Mar 25 14:03:06 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: <42446D96.7080904@myri.com> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> Message-ID: <20050325220306.GD2045@greglaptop.internal.keyresearch.com> On Fri, Mar 25, 2005 at 02:59:18PM -0500, Patrick Geoffray wrote: > I don't see it that way. First, the implementations of the translation > layers will be done by each MPI implementation. In which case it's basically the same as doing an ABI. Or did I miss something?
Does this somehow save a significant amount of work for anyone? > >Was there a big fight over the Fortran interface? It nails down the > >types because it has to. All the ABI does is make C look like Fortran; > >no internals need change in any implementation. > > You don't change internals, you translate them. Let's say you use pointers > in your MPI implementation and the common layer specifies integers. In > your translation layer, you translate pointers into integers by putting > them in a table. The amount of work depends on how far your internals are > from the common interface and, hopefully, it will be a midpoint for everybody. Patrick, if you read what both Jeff and I wrote, I believe it's clear we both understand that part, because we've both mentioned that particular implementation solution. What I was trying to understand was why Jeff thought this was a huge nightmare. -- greg From diep at xs4all.nl Fri Mar 25 14:39:15 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Questions regarding interconnects Message-ID: <3.0.32.20050325233914.01682260@pop.xs4all.nl> At 04:02 PM 3/25/2005 -0500, Patrick Geoffray wrote: >Hi Vincent, > >Vincent Diepeveen wrote: >> What I feel is very important to look at is 'shmem' capabilities. > >> In order for B to receive, it has to have a special thread >> that regularly polls. If you have a thread that polls say each 10 >> milliseconds, then what's the use of using a high-end network >> card (other than its DMA capabilities)? >You are in a situation where you don't have to wait for the message to >arrive; you can move on and check 10 ms later. In this case, you don't >care about network speed. > >> However, it's very expensive to poll. >No, it's not. Not in the OS-bypass world. Your definition of cheap is what's defined as 'expensive' in mine. >> On the other hand using the 'shmem', what happens is that A ships a >> nonblocking write to B of just a few bytes.
The network card in B simply >> writes it in the RAM. > > >> Now and then the searching process at B only has to poll its own main >> memory to see whether it has a '1'. So sometimes you lose a TLB thrashing >> call to it, but other times it comes from L2 cache. >It's still polling. With message passing, you actually poll a queue in >the MPI lib instead of a specific location in the user application. That >helps when you are looking for several messages from several sources >(you've got to poll several locations in your model). The only way to avoid polling is by accepting death-penalty runqueue latencies or wasting a full cpu. Message passing MPI has the queue overrun problem. You must do several calls, which makes it altogether dead slow. I do not know about your software, but for mine losing time to TLB thrashing will *considerably* slow it down. So I try to save every slow memory call I can. Yet it's obvious that you can't avoid regularly wasting one poll. I'm prepared to lose *that* poll time to a single memory location. >> So for short messages which are latency sensitive that 'shmem' of Quadrics >> is just far superior. >You are getting confused with words. "SHMEM" is a legacy shared memory >interface that was used on Cray machines like the T3D. It's not a >standard per se; it's a software interface. The implementations usually >rest on top of remote memory operations (PUT/GET). You are correct of course here; my historical knowledge regarding Crays is really limited. I've only run on a few supercomputers, and recently the Cray T3D wasn't one of them. >It always strikes me when people put "one-sided" and "latency sensitive" >in the same sentence. "One-sided" means that you don't want to involve >the remote side in the communication, and "latency sensitive" means the >other side is waiting for the communication. >In your example, you will be looking to see if someone has written in your >memory every X ms. In this case, what do you care about latency? That's what my problem is with MPI.
The majority of researchers using MPI will be checking that MPI queue so much, in order not to slow down, that the program as a whole slows down. If they avoid it, the only advantage to using MPI is the huge bandwidth a card delivers. When cheap 10 Gbit cards arrive, of course *that* advantage is gone too, and MPI has really few benefits. Calling it portable is not a good argument IMHO, because it is so hard to get system time at big supercomputers/superclusters that you can work full-time to get something to run on them anyway, as the paperwork you have to do to get the system time is already half a man-year. >> Do other cards implement something similar? >You can do PUT on most high speed networks, this is a pretty basic >functionality. The SHMEM interface may not be used because it makes >sense only for former Cray customers, but look for portable RMA >implementations like ARMCI for example. I'm pretty sure Quadrics still offers the shmem functionality to its users. See Prof. Aad v/d Steen's "Supercomputers in Europe" report for the Dutch government (NWO, NCF). >> As far as I know they do not. > >Do more research. >> The overhead of the MPI implementation layer *receiving* bytes is just so >> so huge. A card's theoretical one-way pingpong latency is just irrelevant to >> that, because a one-way pingpong program on any card eats 100% >> system time, effectively losing a full cpu. > >You are mistaken about the MPI receive overhead. You are also mistaken Clearly I'm not mistaken; otherwise you would show up with the actual number of memory calls needed for the MPI overhead. What I do know is that you must always first avoid buffer overrun in MPI. That means for example:

MPI_Isend(....)
MPI_Test(&Req,&flg,&Stat)
while(!flg) {
    Myprogram_MsgPending(); // Important: read in messages and process them while waiting
                            // on completion. Otherwise our own input buffer can overflow
                            // and we get a deadlock.
    MPI_Test(&Req,&flg,&Stat);
}

So that's 4 function calls where there should be 1, just for sending/polling. That means I'm going to lose 3 unnecessary calls where I just want to lose 1. All I want to do is modify the job of a remotely running processor as fast as possible. So I want to write into one of its arrays, say 4-64 bytes at most. As soon as the remote processor polls during its search, it can use the new parameters and continue. In search this gives an exponential speedup when using YBW (youngest brother wait) + alpha-beta + nullmove + shared transposition tables (hashtables). So it is crucial to inform other processors as soon as possible. >in your belief that one-sided operations are the silver bullet. RMA >operations may be more appropriate to an application design, but they >share many constraints with message passing: you have to poll to know >when it's done, you have to tell the other side where to write >(equivalent to posting a recv). They have drawbacks like usually not >scaling in space (each sender should write to a different location). The silver bullet is that an array in the remote processor receives the data without either this processor or the remote one needing to do 6 function calls first. The local processor shipping the data just wants 1 nonblocking function call; the remote processor just wants to now and then check a single variable to see whether there is data. Message passing seems to me too slow for that. Just the function calls already look like a big barrier and don't make programming simpler:

MPI_Isend(....)
MPI_Test(&Req,&flg,&Stat)
while(!flg) {
    Myprogram_MsgPending(); // Important: read in messages and process them while waiting
                            // on completion. Otherwise our own input buffer can overflow
                            // and we get a deadlock.
    MPI_Test(&Req,&flg,&Stat);
}

Vincent >Patrick >-- > >Patrick Geoffray >Myricom, Inc.
>http://www.myri.com > > From jsquyres at open-mpi.org Fri Mar 25 14:49:13 2005 From: jsquyres at open-mpi.org (Jeff Squyres) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: <20050325220306.GD2045@greglaptop.internal.keyresearch.com> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> <20050325220306.GD2045@greglaptop.internal.keyresearch.com> Message-ID: On Mar 25, 2005, at 5:03 PM, Greg Lindahl wrote: >> I don't see it that way. First, the implementations of the translation >> layers will be done by each MPI implementation. > > In which case it's basically the same as doing an ABI. Or did I miss > something? Does this somehow save a significant amount of work for > anyone? YES! MorphMPI (or, as Patrick suggests, we need a cooler name -- PatrickMPI? ;-) ) is the work of 1 clever grad student (or anyone else industrious enough). Elapsed time: a few months. Making even 2 MPI implementations agree on an ABI is an enormous amount of work. Given that two major MPI implementations take opposite sides on the pointers-vs-integers for MPI handles debate (and I suspect that neither is willing to change), just getting them to agree on one of them will be a major amount of work. Then changing the internals of one of those MPIs to match the other is another enormous amount of work (death by a million cuts). And MPI handles is only one issue. Consider all the rest of the issues... Elapsed time: 2 years (that's optimistic). Also, as I pointed out in my original alternate proposal, with PatrickMPI, only those who want to use an ABI will use it. Those who do *not* want an ABI do not have to have it forced upon them.
-- {+} Jeff Squyres {+} The Open MPI Project {+} http://www.open-mpi.org/ From patrick at myri.com Fri Mar 25 15:03:15 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: <20050325220306.GD2045@greglaptop.internal.keyresearch.com> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> <20050325220306.GD2045@greglaptop.internal.keyresearch.com> Message-ID: <424498B3.80802@myri.com> Greg Lindahl wrote: > Patrick, if you read what both Jeff and I wrote, I believe it's clear > we both understand that part, because we've both mentioned that > particular implementation solution. What I was trying to understand > was why Jeff thought this was a huge nightmare. What Jeff thought would be a nightmare, I believe, is having to decide on a common interface and then forcing the MPI implementations to adopt this interface internally instead of having them translate on the fly. It's like having a common language in the EU. Either we decide it will be French and everybody else starts to teach French at school, or we choose Esperanto, and everybody translates their national language into Esperanto when they want to deal with another EU member. There would be a lot of blood on the floor before French was adopted. > something? Does this somehow save a significant amount of work for > anyone? It does not, but the pill is much easier to swallow because nobody has to fight to try to impose the interface they happen to use. Am I still drunk and missing something big? Patrick -- Patrick Geoffray Myricom, Inc.
http://www.myri.com From lindahl at pathscale.com Fri Mar 25 15:16:46 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: <424498B3.80802@myri.com> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> <20050325220306.GD2045@greglaptop.internal.keyresearch.com> <424498B3.80802@myri.com> Message-ID: <20050325231646.GB3269@greglaptop.internal.keyresearch.com> On Fri, Mar 25, 2005 at 06:03:15PM -0500, Patrick Geoffray wrote: > What Jeff thought is a nightmare, I believe, is to have to decide a > common interface and then force the MPI implementations to adopt this > interface internally instead of having them translating on the fly. Ah. But no one ever suggested that, so we're all set -- it's an artifact of the poor communication content of PowerPoint slides that anyone thought I had suggested altering everyone's internals. > > something? Does this somehow save a significant amount of work for > > anyone? > > It does not, but the pill is much easier to swallow because nobody has > to fight to try to impose the interface they happen to use. Am I still > drunk and missing something big ? If I understand it correctly, MorphMPI imposes the same interface as an ABI -- a common mpi.h. The only thing it avoids is having a shared library with a common name and interface; instead it will have a separate routine per MPI implementation that dlopens all the appropriate libs for the implementation in use, with the usual chaos of trying to find where they are located. In any case, I think this sort of discussion is more of an implementation detail than a fundamental thing that would obviate having an ABI... either way you're going to want to pick the right contents for mpi.h. Apple or Orange, it's the same committee process. 
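The dlopen-based dispatch Greg describes can be sketched roughly as below. This is a hedged illustration, not MorphMPI code: the C math library and its `cos` symbol merely stand in for an MPI implementation's shared object and one of its entry points, and the `libm.so.6` name is Linux/glibc-specific.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Sketch of run-time dispatch: open the implementation library chosen
 * at startup and resolve an entry point from it.  A MorphMPI-style
 * layer would do this once per MPI_* symbol; here libm stands in for
 * an MPI implementation. */
static void *resolve_symbol(const char *libname, const char *symbol)
{
    void *lib = dlopen(libname, RTLD_NOW);
    if (lib == NULL)
        return NULL;                /* library not found */
    return dlsym(lib, symbol);      /* NULL if the symbol is missing */
}
```

The "usual chaos" of locating libraries is exactly the `libname` argument: something, an environment variable or a wrapper script, has to know which shared object to pass in and where it lives.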
-- greg From lindahl at pathscale.com Fri Mar 25 15:26:28 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:56 2009 Subject: [O-MPI users] Re: [Beowulf] Alternative to MPI ABI In-Reply-To: References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> <20050325220306.GD2045@greglaptop.internal.keyresearch.com> Message-ID: <20050325232628.GA3781@greglaptop.internal.keyresearch.com> On Fri, Mar 25, 2005 at 05:49:13PM -0500, Jeff Squyres wrote: > MorphMPI (or, as Patrick suggests, we need a cooler name -- PatrickMPI? > ;-) ) is the work of 1 clever grad student (or anyone else industrious > enough). Elapsed time: a few months. Right. > Making even 2 MPI implementations agree on an ABI is an enormous amount > of work. Given that two major MPI implementations take opposite sides > on the pointers-vs-integers for MPI handles debate (and I suspect that > neither is willing to change), just getting them to agree on one of > them will be a major amount of work. Then changing the internals of > one of those MPIs to match the other is another enormous amount of work > (death by a million cuts). You yourself said how MPI implementers would actually implement this without needing to change any internals: make the C interface routines do the same thing that F77 does today. Elapsed time: a few months, same as MorphMPI. No internals need to be changed. I suppose the good news is that if this is your main objection, then it's gone. > Also, as I pointed out in my original alternate proposal, with > PatrickMPI, only those who want to use an ABI will use it. Those who > do *not* want an ABI do not have to have it forced upon them. I missed where anyone was being forced to do anything. MPI implementers can support the ABI alongside their current interface or not, it's their choice.
-- greg From patrick at myri.com Fri Mar 25 15:32:02 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:56 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> <20050325220306.GD2045@greglaptop.internal.keyresearch.com> Message-ID: <42449F72.7070303@myri.com> Jeff Squyres wrote: > MorphMPI (or, as Patrick suggests, we need a cooler name -- PatrickMPI? PMPI is already taken... I am very bad with names. UMPI for Universal MPI ? Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Fri Mar 25 15:47:34 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] Alternative to MPI ABI In-Reply-To: <20050325231646.GB3269@greglaptop.internal.keyresearch.com> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> <20050325220306.GD2045@greglaptop.internal.keyresearch.com> <424498B3.80802@myri.com> <20050325231646.GB3269@greglaptop.internal.keyresearch.com> Message-ID: <4244A316.3030008@myri.com> Greg Lindahl wrote: > On Fri, Mar 25, 2005 at 06:03:15PM -0500, Patrick Geoffray wrote: > > >>What Jeff thought is a nightmare, I believe, is to have to decide a >>common interface and then force the MPI implementations to adopt this >>interface internally instead of having them translating on the fly. > > > Ah. But no one ever suggested that, so we're all set -- it's an Sorry, I got lost in the catch up. > In any case, I think this sort of discussion is more of an > implementation detail than a fundamental thing that would obviate > having an ABI... either way you're going to want to pick the right > contents for mpi.h. Apple or Orange, it's the same committee process. True. There will be choices and you can still lobby so that your translation layer is as empty as possible. 
We can either do it Kyoto-style, where we trade translation points, or we can do it US-style and fight on the beach in Capri... Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From atp at piskorski.com Fri Mar 25 15:52:38 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] Bernhard Kuhn's real time interrupt patch In-Reply-To: <42447E37.1020509@myri.com> References: <20050325181258.GA83760@piskorski.com> <42447E37.1020509@myri.com> Message-ID: <20050325235238.GA64948@piskorski.com> On Fri, Mar 25, 2005 at 04:10:15PM -0500, Patrick Geoffray wrote: > I don't think this patch improves the interrupt latency or the interrupt > overhead when there is no contention. Context switching is the metric > you want to improve. > > With OS-bypass NICs, interrupts are rare and there is usually no contention. Contention for what, the NIC? I was thinking that Kuhn looked at interrupt latency under heavy CPU usage, but I guess I was confused. What he actually measured was greatly tightening up the distribution of interrupt latency under varying network (ping), disk (looping on 'find /'), and "graphics" (glxgears) loads. I don't understand the ramifications of that. So is it true that kernel interrupt latency only matters for non-OS-bypass NICs? (E.g., your typical ethernet card.) And stuff like Myrinet and SCI doesn't care?
-- Andrew Piskorski http://www.piskorski.com/ From patrick at myri.com Fri Mar 25 15:53:02 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:57 2009 Subject: [O-MPI users] Re: [Beowulf] Alternative to MPI ABI In-Reply-To: <20050325232628.GA3781@greglaptop.internal.keyresearch.com> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> <20050325220306.GD2045@greglaptop.internal.keyresearch.com> <20050325232628.GA3781@greglaptop.internal.keyresearch.com> Message-ID: <4244A45E.4010904@myri.com> Greg Lindahl wrote: > You yourself said how MPI implementers would actually implement this > without needing to change any internals: Make the C interface routines > do the same thing that F77 does today. Elapsed time: a few months, > same as MorphMPI. No internals need to be changed. Now I get it, I think. Basically, you propose that we make the same choices that were done for the F77 interface, ie integer instead of pointers for example. Sounds fine to me. We can start from that and we can still change things if we have consensus. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Fri Mar 25 16:10:09 2005 From: patrick at myri.com (Patrick Geoffray) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] Bernhard Kuhn's real time interrupt patch In-Reply-To: <20050325235238.GA64948@piskorski.com> References: <20050325181258.GA83760@piskorski.com> <42447E37.1020509@myri.com> <20050325235238.GA64948@piskorski.com> Message-ID: <4244A861.5080508@myri.com> Andrew Piskorski wrote: > On Fri, Mar 25, 2005 at 04:10:15PM -0500, Patrick Geoffray wrote: > > >>I don't think this patch improves the interrupt latency or the interrupt >>overhead when there is no contention. Context switching is the metric >>you want to improve. >> >>With OS-bypass NICs, interrupts are rare and there is usually no contention. > > > Contention for what, the NIC? CPUs. 
I didn't look at the patch in detail, but it seems that it basically plays with priorities. So yes, it would help under load, but I don't think it will reduce the critical path. > So is it true that kernel interrupt latency only matters for > non-OS-bypass NICs? (E.g., your typical ethernet card.) And stuff > like Myrinet and SCI doesn't care? On Myrinet/MX, interrupts are used for Ethernet emulation, to improve progression on large messages with some runtime flag, or to report something rare and unexpected. They can also be requested by the application in some situations (multi-threaded with more threads than CPUs, so you don't want to busy poll), but it's not common. In Ethernet mode, especially at 10 Gb/s, interrupt cost is very important. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From diep at xs4all.nl Fri Mar 25 17:16:12 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] Questions regarding interconnects Message-ID: <3.0.32.20050326021612.01682208@pop.xs4all.nl> Hello Patrick, >Do more research. I hope you don't mind that I grab the opportunity to ask you a technical question. Feel free to answer when you have time after Easter. Suppose we have nodes A, B, .... Each node is a dual, connected with Myrinet. Each node has a single card inside. Each node has 3 threads. One thread is just busy shipping large messages using MPI, so eating a lot of bandwidth from the card. One such message is really a lot of megabytes. Now let's suppose thread B.1 is busy receiving a really huge message of several megabytes. That takes considerable time. In the meantime a small message of 4 bytes arrives for thread B.2 which is latency-critical. 3 questions. Q1: does B.2 have to wait for B.1 to receive the entire message? Q2: in case B.2 doesn't need to wait for B.1 to receive the entire message, but can receive it in between, what is the switch time latency of the Myrinet hardware?
So what is the worst-case time it takes to receive the message (I understand on average it is 50% better, but I work with worst cases in my software; suppose that n processors are ready with an explosion that wipes everything out, and I want to interrupt the stuff then). Q3: is there a difference among the currently offered Myrinet cards here; if so, which one can and which one can't? Many thanks for answering, Vincent At 04:02 PM 3/25/2005 -0500, Patrick Geoffray wrote: >Hi Vincent, > >> The overhead of the MPI implementation layer *receiving* bytes is just so >> so huge. A card's theoretical one-way pingpong latency is just irrelevant to >> that, because a one-way pingpong program on any card eats 100% >> system time, effectively losing a full cpu. > >You are mistaken about the MPI receive overhead. You are also mistaken >in your belief that one-sided operations are the silver bullet. RMA >operations may be more appropriate to an application design, but they >share many constraints with message passing: you have to poll to know >when it's done, you have to tell the other side where to write >(equivalent to posting a recv). They have drawbacks like usually not >scaling in space (each sender should write to a different location). > >Patrick >-- > >Patrick Geoffray >Myricom, Inc. >http://www.myri.com > > From lindahl at pathscale.com Fri Mar 25 17:58:18 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] Questions regarding interconnects In-Reply-To: <3.0.32.20050325233914.01682260@pop.xs4all.nl> References: <3.0.32.20050325233914.01682260@pop.xs4all.nl> Message-ID: <20050326015818.GA4702@greglaptop.internal.keyresearch.com> On Fri, Mar 25, 2005 at 11:39:15PM +0100, Vincent Diepeveen wrote: > What I do know is that you must always first avoid buffer overrun in MPI. > That means for example:
>
> MPI_Isend(....)
> MPI_Test(&Req,&flg,&Stat)
> while(!flg) {
>     Myprogram_MsgPending(); // Important: read in messages and process them while waiting
>                             // on completion. Otherwise our own input buffer can overflow
>                             // and we get a deadlock.
>     MPI_Test(&Req,&flg,&Stat);
> }

Vincent, this code is simply mistaken. When you do an MPI_Isend(), there's no reason to immediately do MPI_Test(). Just wait until you actually want to reuse the buffer you used in the Isend(), and then do the MPI_Test(), which will likely succeed. MPI_Test()=True does NOT mean the other side has received the message! It just means that the buffer can be reused. The message may not even be in flight; it might be queued somewhere on the sender. In most implementations, for short outgoing messages, the MPI_Test() will always be true, because the send will be done immediately in MPI_Isend(). This is called an "eager send". > All I want to do is modify the job of a remotely running processor as > fast as possible. So I want to write into one of its arrays, say 4-64 bytes at > most. Sounds like a job for message passing! If I do this with message passing, the remote running program occasionally checks to see if it has data. If so, it puts it in the right place, and starts using it. If I do this using the Cray SHMEM method, you seem to think you can just drop the data in the right place and everything will be faster. But that's not true. I have to worry about the program attempting to use the data in the destination location when only part of the message has arrived. To avoid that, I need some way of indicating that all of the data has arrived, and atomically using either the old or the new data. In SHMEM, the way you express this is to have the data arrive into a buffer (using a PUT) and then do a second PUT to a flag to indicate that the data has all arrived. Then the recipient checks the flag, and if it's set, copies the data from the buffer to its final location. Then it can use it.
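That buffer-plus-flag protocol can be sketched with a second thread standing in for the remote node and plain memory writes standing in for the network PUTs. The names and the use of C11 atomics are illustrative assumptions, not a SHMEM API:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>

/* Buffer-plus-flag pattern: the "sender" deposits the payload, then
 * sets a flag; the "recipient" polls the flag and only then copies
 * the payload to its final location.  A thread stands in for the
 * remote node, memcpy for the network PUT. */
enum { MSG_LEN = 16 };

static char landing_buf[MSG_LEN];   /* where the first "PUT" lands */
static atomic_int msg_ready;        /* the flag, written second    */

static void *sender(void *payload)
{
    memcpy(landing_buf, payload, MSG_LEN);  /* first PUT: the data  */
    atomic_store_explicit(&msg_ready, 1,    /* second PUT: the flag */
                          memory_order_release);
    return NULL;
}

/* Poll the flag; once it is set, the whole payload is guaranteed
 * visible and can be copied out. */
static void receive(char *final_dst)
{
    while (!atomic_load_explicit(&msg_ready, memory_order_acquire))
        ;                                   /* busy poll */
    memcpy(final_dst, landing_buf, MSG_LEN);
}
```

The release/acquire pair plays the role of the ordering guarantee the network would have to provide: the flag must not become visible before the data it announces.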
This is the same work as the MPI case, just done in a different way. In short, one-sided messaging does not help synchronization. What it does do is occasionally get rid of a copy. But people who haven't yet written a one-sided program always imagine it will make synchronization easier. In any case, it's not trivial for a recipient that only occasionally gets a message to efficiently check for it, in any paradigm. That seems to be what you're attempting to do. Few MPI programs do that. -- greg From ayoung at penguincomputing.com Fri Mar 25 19:40:30 2005 From: ayoung at penguincomputing.com (Adam Young) Date: Wed Nov 25 01:03:57 2009 Subject: [O-MPI users] Re: [Beowulf] Alternative to MPI ABI In-Reply-To: <42449F72.7070303@myri.com> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> <20050325220306.GD2045@greglaptop.internal.keyresearch.com> <42449F72.7070303@myri.com> Message-ID: <4244D9AE.5050207@penguincomputing.com> MPBI Patrick Geoffray wrote: > Jeff Squyres wrote: > >> MorphMPI (or, as Patrick suggests, we need a cooler name -- PatrickMPI? > > > PMPI is already taken... > I am very bad with names. UMPI for Universal MPI ? > > Patrick From jsquyres at open-mpi.org Sat Mar 26 03:47:41 2005 From: jsquyres at open-mpi.org (Jeff Squyres) Date: Wed Nov 25 01:03:57 2009 Subject: [O-MPI users] Re: [Beowulf] Alternative to MPI ABI In-Reply-To: <20050325232628.GA3781@greglaptop.internal.keyresearch.com> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> <20050325220306.GD2045@greglaptop.internal.keyresearch.com> <20050325232628.GA3781@greglaptop.internal.keyresearch.com> Message-ID: <5b7d13b355dc65349c7a683439e64e01@open-mpi.org> On Mar 25, 2005, at 6:26 PM, Greg Lindahl wrote: >> Making even 2 MPI implementations agree on an ABI is an enormous >> amount >> of work.
Given that two major MPI implementations take opposite sides >> on the pointers-vs-integers for MPI handles debate (and I suspect that >> neither is willing to change), just getting them to agree on one of >> them will be a major amount of work. Then changing the internals of >> one of those MPIs to match the other is another enormous amount of >> work >> (death by a million cuts). > > You yourself said how MPI implementers would actually implement this > without needing to change any internals: Make the C interface routines > do the same thing that F77 does today. Elapsed time: a few months, > same as MorphMPI. No internals need to be changed. > > I suppose the good news is that if this is your main objection, > then it's gone. Er... no. Interesting: it seems that you are assuming that mpi.h should use integers for MPI handles. Regardless of which way you choose, your statement "No internals have to change" is inaccurate. At a minimum, *EVERY* MPI API function in somebody's implementation will have to change. I'm not splitting hairs on what "internals" means; what I'm saying is that code in the MPI implementations [that have chosen "wrong"] has to change. It doesn't matter whether it's code in the API calls or down in the progress engine; a lot of code has to change. And potentially a bunch of other infrastructure has to be changed to match (depending on how the MPI works). And let's not forget that this issue is actually one of the essential elements of the pointers-vs-integers debate. Some MPI implementations (both of mine included) have very good reasons for not having the C API calls do the same thing as the Fortran API calls. But that's a different conversation (one that I unfortunately do not have time to have via e-mail). >> Also, as I pointed out in my original alternate proposal, with >> PatrickMPI, only those who want to use an ABI will use it. Those who >> do *not* want an ABI do not have to have it forced upon them.
> > I missed where anyone was being forced to do anything. MPI > implementers can support the ABI alongside their current interface or > not, it's their choice. Er... no. Well, let's think about this for a minute. For an MPI implementation to support two interfaces, it will need 2 mpi.h's, 2 sets of MPI API functions, and 2 corresponding sets of infrastructure to match. I have difficulty seeing MPI implementors wanting to support this -- the software engineering issues alone are tremendously unattractive (e.g., the testing scenarios have -- at least -- doubled). It'll also lead to user confusion. "Oh, yes, our product supports ABC MPI." / "But I'm using ABC MPI, and it doesn't work." / "Oh, you need to use the MPI ABI of ABC MPI..." To have a single MPI implementation support multiple instances of its API, it [at least effectively] needs to be installed twice. Users therefore have to choose which to compile/link against, etc. So from the user's perspective, MPI ABC API is effectively Yet Another MPI (as compared to MPI ABC non-ABI). In short: if an MPI implementation wants to support an MPI ABI, I have difficulty believing that it will be anything other than its one-and-only main interface. So, sure, I guess an MPI implementation isn't *forced* to only support an MPI ABI, it's just *strongly recommended*. ;-) ----- I guess I don't understand your reluctance to accept a MorphMPI-like solution: - it will work - it will take far less time to implement - it does not require a committee (there's no need to standardize its mpi.h) - no MPI implementors have to agree on anything - no existing MPI implementations need to change - no software engineering issues or practices for existing MPI implementations need to change - people who want it will use it (and those who don't, won't) Are you trying to jump start MPI-3? ----- Sidenote: I'll try to keep up with this conversation, but I can't promise anything -- it's reaching a frequency that is difficult for me to match. 
-- {+} Jeff Squyres {+} The Open MPI Project {+} http://www.open-mpi.org/ From lindahl at pathscale.com Sat Mar 26 16:50:45 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:57 2009 Subject: [O-MPI users] Re: [Beowulf] Alternative to MPI ABI In-Reply-To: <5b7d13b355dc65349c7a683439e64e01@open-mpi.org> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> <20050325220306.GD2045@greglaptop.internal.keyresearch.com> <20050325232628.GA3781@greglaptop.internal.keyresearch.com> <5b7d13b355dc65349c7a683439e64e01@open-mpi.org> Message-ID: <20050327005045.GA3760@greglaptop.hsd1.ca.comcast.net> On Sat, Mar 26, 2005 at 06:47:41AM -0500, Jeff Squyres wrote: > Regardless of which way you choose, your statement "No internals have > to change" is inaccurate. At a minimum, *EVERY* MPI API function in > somebody's implementation will have to change. That's what I call the interface, yes. It's a similar size effort to creating an F77 interface in an MPI implementation. We've agreed on this several times already; I'm not sure why we need to agree on it again, or why the exact definition of the word "internals" is so important. > I guess I don't understand your reluctance to accept a MorphMPI-like > solution: You have repeated your original MorphMPI attributes. I responded to them, and I don't see any sign that you've read my response. This is not the way discussions are usually held. -- greg From wburke999 at msn.com Sat Mar 26 18:31:22 2005 From: wburke999 at msn.com (William Burke) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] In-Reply-To: <1111620330.4241faeabe336@home.staff.uni-marburg.de> Message-ID: Reuti, >> I'd suggest to move over to the SGE users list at: >> http://gridengine.sunsource.net/servlets/ProjectMailingListList I have but I do not see my name yet? How long is the verification process? 
>> Although there is a special Myrinet directory, you can also try to use >> the files in the mpi directory instead. The mpi directory's mpich.template doesn't use mpirun.ch_gm so how does it know what version of mpirun to use? If I use the mpi directory, what changes do I have to make? >> Can you please give more details of your queue and PE setup (qconf -sq/sp >> output SEE BELOW >> Do you have an admin account for SGE? I'd prefer not to do anything in >> SGE as root. Yes, it's grid... SEE BELOW >> Not really an issue: you have to make a small change to the >> mpirun.ch_gm.pl to make all jobs staying in the same process group to get >> them correctly killed in case of a job abort: I have to double check that in: http://gridengine.sunsource.net/howto/mpich-integration.html Here is the new problem; I have this situation in the PE: My jobs won't run: when I run my script it goes into pending mode for about 10 sec (status qw), SGE submits to N hosts (status t), the jobs hang in status t, then quickly exit. When I investigated both the Jobscript_name.{pe|po}JobID output, it states that SGE can't make links in the /WEMS/grid/tmp/549.1.Production.q/ directory. It looks like the startmpi.sh script links files in $TMPDIR and from my understanding the value of $TMPDIR is derived from the tmpdir parameter in the queue's configuration. I have designated this attribute as '/WEMS/grid/tmp/' but according to the errorlog qsub_wrf.sh.pe549 it is '/WEMS/grid/tmp/549.1.Production.q/' Possibly the source of the problem is here, so what created the '549.1.Production.q' addendum? I then checked the permissions of /WEMS/grid/tmp [wems@wems grid]$ ls -ltr /WEMS/grid | grep tmp drwxrwxrwx 2 root root 4096 Mar 26 17:34 tmp As a sanity check within the startmpi.sh I echo out the ls -ltr of $TMPDIR: drwxr-xr-x 2 65534 65534 4096 Mar 26 2005 549.1.Production.q As expected, there is no UID/GID that is 65534 in my /etc/passwd.
Furthermore, there is only write permission for UID/GID 65534, so if it (N1GE) is the only one writing and reading this directory, what else could be preventing the writing into that directory? I thought maybe there was a lock file in /WEMS/grid/tmp so I checked... [wems@wems tmp]$ ls -al /WEMS/grid/tmp total 8 drwxrwxrwx 2 root root 4096 Mar 26 17:34 . drwxr-xr-x 22 grid grid 4096 Mar 26 04:20 .. No avail, so I am out of solutions. Is this a known issue when using Myrinet, MPICH, and tight integration, or am I overlooking something? I am using the sge_mpirun script instead of the mpirun script. Have you seen any problem like this before? I also suspect that the editor may be reading the PE mpich configuration file's argument start_proc_args incorrectly since the editor wraps the string of arguments around to the next line according to the /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs file. In CHECK 5 it says "The mpirun command "\" does not exist" SEE CHECK 5, CHECK 8, then CHECK 7 BELOW. Oh yeah, this may be a silly question but where does SGE get $pe_hostfile, $TMPDIR from and what is the process of how it acquires these variables? I would like some clarification. Thanks, William Things that I checked CHECK 0.5 [root@wems wrfprd]# cat qsub_wrf.sh #!/bin/sh #$ -S /bin/ksh #$ -pe mpich 32 #$ -l h_rt=10800 #$ -q Production.q # #.
/WEMS/wems/external/WRF/wrfsi/etc/setup-mpi.sh cd /WEMS/wems/data/WRF/wni001a/wrfprd echo 'This is the job ID '$JOB_ID > /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs echo 'This is the pe_hostfile '$PE_HOSTFILE >> /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.ps echo 'This is the tmpdir '$TMPDIR >> /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.ps /WEMS/grid/mpi/myrinet/sge_mpirun /WEMS/wems/external/WRF/wrfsi/../run/wrf.exe >> /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs 2>&1 CHECK 1 [wems@wems wems]$ qsub -pe mpich 32 -P test -q Production.q /WEMS/wems/data/WRF/wni001a/wrfprd/qsub_wrf.sh CHECK 2 [wems@wems grid]$ cat qsub_wrf.sh.pe549 ln: creating symbolic link `/WEMS/grid/tmp/549.1.Production.q/mpirun.sge' to `/WEMS/pkgs/mpich-gm-1.2.6.14a/bin/mpirun.ch_gm': Permission denied /WEMS/grid/mpi/myrinet/startmpi.sh[142]: cannot create /WEMS/grid/tmp/549.1.Production.q/machines: Permission denied cat: /WEMS/grid/tmp/549.1.Production.q/machines: No such file or directory ln: creating symbolic link `/WEMS/grid/tmp/549.1.Production.q/rsh' to `/WEMS/grid/mpi/rsh': Permission denied CHECK 3 [wems@wems grid]$ cat qsub_wrf.sh.po549 -catch_rsh /WEMS/grid/wems-hosts2 /WEMS/pkgs/mpich-gm-1.2.6.14a/bin/mpirun.ch_gm this is the value of mpirun /WEMS/pkgs/mpich-gm-1.2.6.14a/bin/mpirun.ch_gm I am doing a ls -ltr on $TMPDIR total 4 drwxr-xr-x 2 65534 65534 4096 Mar 26 2005 549.1.Production.q Machine file is /WEMS/grid/tmp/549.1.Production.q/machines CHECK 4 [wems@wems grid]$ cat Queue-config qname Production.q hostlist @Parallel seq_no 0 load_thresholds np_load_avg=1.75 suspend_thresholds NONE nsuspend 1 suspend_interval 00:05:00 priority 0 min_cpu_interval 00:05:00 processors 2 qtype BATCH ckpt_list NONE pe_list mpich rerun FALSE slots 2 tmpdir /WEMS/grid/tmp shell /bin/ksh prolog NONE epilog NONE shell_start_mode posix_compliant starter_method NONE suspend_method NONE resume_method NONE terminate_method NONE notify 00:00:60 owner_list NONE user_lists Test_A xuser_lists 
NONE subordinate_list NONE complex_values NONE projects test xprojects NONE calendar NONE initial_state default s_rt INFINITY h_rt INFINITY s_cpu INFINITY h_cpu INFINITY s_fsize INFINITY h_fsize INFINITY s_data INFINITY h_data INFINITY s_stack INFINITY h_stack INFINITY s_core INFINITY h_core INFINITY s_rss INFINITY h_rss INFINITY s_vmem INFINITY h_vmem INFINITY CHECK 5 [wems@wems grid]$ cat mpich-PE-config pe_name mpich slots 78 user_lists Test_A xuser_lists NONE start_proc_args /WEMS/grid/mpi/myrinet/startmpi.sh -catch_rsh \ /WEMS/grid/wems-hosts2 \ /WEMS/pkgs/mpich-gm-1.2.6.14a/bin/mpirun.ch_gm stop_proc_args /WEMS/grid/mpi/myrinet/stopmpi.sh allocation_rule $fill_up control_slaves TRUE job_is_first_task FALSE urgency_slots min CHECK 6 [wems@wems wems]# cat /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.ps This is the pe_hostfile /WEMS/grid/default/spool/wems18/active_jobs/388.1/pe_hostfile This is the tmpdir /WEMS/grid/tmp/388.1.Production.q This is the pe_hostfile /WEMS/grid/default/spool/wems07/active_jobs/389.1/pe_hostfile This is the tmpdir /WEMS/grid/tmp//389.1.Production.q This is the pe_hostfile /WEMS/grid/default/spool/wems24/active_jobs/390.1/pe_hostfile This is the tmpdir /WEMS/grid/tmp/398.1.Production.q This is the pe_hostfile /WEMS/grid/default/spool/wems22/active_jobs/549.1/pe_hostfile This is the tmpdir /WEMS/grid/tmp/549.1.Production.q This is the pe_hostfile This is the tmpdir CHECK 7 [wems@wems wems]$ cat /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs This is the job ID 549 The mpirun command "\" does not exist There must be a problem with the mpich parallel environment CHECK 8 [root@wems wrfprd]# cat qsub_wrf.sh #!/bin/sh #$ -S /bin/ksh #$ -pe mpich 32 #$ -l h_rt=10800 #$ -q Production.q # #. 
/WEMS/wems/external/WRF/wrfsi/etc/setup-mpi.sh cd /WEMS/wems/data/WRF/wni001a/wrfprd echo 'This is the job ID '$JOB_ID > /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs echo 'This is the pe_hostfile '$PE_HOSTFILE >> /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.ps echo 'This is the tmpdir '$TMPDIR >> /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.ps /WEMS/grid/mpi/myrinet/sge_mpirun /WEMS/wems/external/WRF/wrfsi/../run/wrf.exe >> /WEMS/wems/data/WRF/wni001a/log/050811200.wrf.pbs 2>&1 exit -----Original Message----- From: Reuti [mailto:reuti@staff.uni-marburg.de] Sent: Wednesday, March 23, 2005 6:26 PM To: William Burke Cc: beowulf@beowulf.org Subject: Re: [Beowulf] Hi, I'd suggest to move over to the SGE users list at: http://gridengine.sunsource.net/servlets/ProjectMailingListList But anyway, let's sort the things out: Quoting William Burke : > I can't get PE to work on a 50 node class II Beowulf. It has a front-end > Sunfire v40 (qmaster host) and 49 Sunfire v20s (execution hosts) running > Linux configured to communicate data over Myrinet using MPICH-GM version > 1.26.14a. Although there is a special Myrinet directory, you can also try to use the files in the mpi directory instead. > These are the requirements of the N1GE environment to handle: > > 1. Serial type jobs for pre-processing the data - average runtime 15 > minutes. > 2. Output is pipelined into parallel processing jobs - range of runtime > 1- 6 hours. > 3. Concurrently running is post-processing serial jobs. > > I have setup a Parallel Environment called mpich-gm and a straight-forward > FIFO scheduling schema for testing. When I submit parallel jobs they hang > in > limbo in a 'qw' state pending submission. I am not sure why the scheduler > does not see jobs that I submit. > > > > I used the myrinet mpich template located $SGE_ROOT/< sge_cell > >/mpi/myrinet > directory to configure the pe (parallel environment) plus I copied the > sge_mpirun script to the $SGE_ROOT/< sge_cell >/bin directory. 
I > configured > a Production.q queue that runs only parallel jobs. As a last sanity check I > ran a trace on the scheduler, submitted a simple parallel job, and this is > the results that I got from the logs: Can you please give more details of your queue and PE setup (qconf -sq/sp output). > JOB RUN Window > > [wems@wems examples]$ qsub -now y -pe mpich-gm 1-4 -b y hello++ > > Your job 277 ("hello++") has been submitted. > > Waiting for immediate job to be scheduled. > > > > Your qsub request could not be scheduled, try again later. > > [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++ > > Your job 278 ("hello++") has been submitted. > > [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++ > > Your job 279 ("hello++") has been submitted. You can't start a parallel job this way, as there is no mpirun used. When you used your mentioned script, you get the same behavior (and there you used mpirun -np $NSLOTS ...)? > This is the 2nd window SCHEDULER LOG > > [root@wems bin]# qconf -tsm > > [root@wems bin]# qconf -tsm > > [root@wems bin]# cat /WEMS/grid/default/common/schedd_runlog > > Wed Mar 23 06:08:55 2005|-------------START-SCHEDULER-RUN------------- > > Wed Mar 23 06:08:55 2005|queue instance "all.q@wems10.grid.wni.com" dropped > because it is temporarily not available > > Wed Mar 23 06:08:55 2005|queue instance "Production.q@wems10.grid.wni.com" > dropped because it is temporarily not available > > Wed Mar 23 06:08:55 2005|queues dropped because they are temporarily not > available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com > > Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on > > Wed Mar 23 06:08:55 2005|--------------STOP-SCHEDULER-RUN------------- > > Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN------------- > > Wed Mar 23 06:11:37 2005|queue instance "all.q@wems10.grid.wni.com" dropped > because it is temporarily not available > > Wed Mar 23 06:11:37 2005|queue instance 
"Production.q@wems10.grid.wni.com" > dropped because it is temporarily not available > > Wed Mar 23 06:11:37 2005|queues dropped because they are temporarily not > available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com > > Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on > > Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN------------- > > [root@wems bin]# qstat > > job-ID prior name user state submit/start at queue > slots ja-task-ID > > ---------------------------------------------------------------------------- > ------------------------------------- > > 279 0.55500 hello++ wems qw 03/23/2005 06:11:43 > 1 > > [root@wems bin]# Do you have an admin account for SGE? I'd prefer not to do anything in SGE as root. > BTW that node wems10.grid.wni.com has connectivity issues and I have not > removed it from the cluster queue. > > > > What causes this type of problem in N1GE to return "no pending jobs to > perform scheduling on" in the schedd_runlog even though there are available > slots ready to take jobs? > > I had no problem submitting serial jobs, only the parallel jobs resulted as > such. Are there N1GE - Myrinet issue that I am not aware of? FYI the same > binary (hello++) runs with no problems from the command line. If you just start hello++, it will not run in parallel I think. Not really an issue: you have to make a small change to the mpirun.ch_gm.pl to make all jobs staying in the same process group to get them correctly killed in case of a jobb abort: http://gridengine.sunsource.net/howto/mpich-integration.html > Since I generally run scripts from qsub instead of binaries I created a > script to run the mpich executable but that yield the same result. > > > > I have an additional question regarding setting a queue.conf parameter > called "subordinate_list". How is it read from the result of qconf -mq > ? > > Example > > i.e., subordinate_list low_pri.q=5,small.q. 
The queue "low_pri.q" will be suspended, when 5 or more slots of "" are filled. The "small.q" will be suspended, if all slots of "" are filled. Cheers - Reuti -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20050326/c2adc1f8/attachment.html From reuti at staff.uni-marburg.de Sun Mar 27 04:08:33 2005 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:03:57 2009 Subject: Myrinet setup (was: RE: [Beowulf]) In-Reply-To: References: Message-ID: <1111925313.4246a2411bcdf@home.staff.uni-marburg.de> Hi Will, Quoting William Burke : > I have but I do not see my name yet? How long is the verification process? Did you register as an observer? AFAIK you can post to the SGE list without being registered. > The mpi directory's mpich.template doesn't use mpirun.ch_gm so how does it > know what version of mpirun to use? If I use the mpi what changes do I have > to make? You can have more than one MPI implementation installed in your cluster, and it may need some planning to set up the correct $PATH for each of the implementations you want to use (and the mpirun located this way must match the version of MPI your program was compiled with). You may use a "which mpirun" to check it in your job script. Also the supplied sge_mpirun will not use any Myrinet version on its own - it's just a wrapper to the mpirun you set in the PE, so that you don't have to specify the usual options 'mpirun -machinefile $TMPDIR/machines -np $NSLOTS mypgm'. I must admit: it seems that the Myrinet stuff was more for 5.3 and not updated, as in 6.0 you can have more than one line for "start_proc_args" in your PE definition - so it just grabs the last \ in the first and only line beginning with "start_proc_args" as the mpirun command - which gives the error message you got, that "\" does not exist. As I said: we can use the default MPICH integration also for Myrinet and proceed this way.
> >> Can you please give more details of your queue and PE setup (qconf > -sq/sp Thx, I will keep the stuff. First one additional question (before I route you in the wrong direction): is it necessary for you to have a shared $TMPDIR for SGE? This is the one you set in your queue configuration (tmpdir /WEMS/grid/tmp) and seems for now to be on a file server. More common and faster is to use the local /tmp on the nodes for this (you are right: SGE wants to create a directory there for this job and some files for its own usage - but you are free to use this directory $TMPDIR also in your job script). It will be created for your job, and cleanly deleted after the job, so you won't have any leftover files. Cheers - Reuti From agrajag at dragaera.net Mon Mar 28 07:03:25 2005 From: agrajag at dragaera.net (Sean Dilda) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] Jumbo frame cards and switches In-Reply-To: <200503241802.j2OI2JJ7000237@bell.kirtland.af.mil> References: <200503241802.j2OI2JJ7000237@bell.kirtland.af.mil> Message-ID: <42481CBD.7050504@dragaera.net> Art Edwards wrote: > I have spent some time looking at posts on Gb cards and Gb switches for jumbo > frames. We currently have broadcom GB BCM7501 cards and the catalyst > 4000 switch. I am able to set mtu to 9000 on the cards, but I have found > out that the switch does not handle jumbo frames. My questions are: > > 1. If I can set the mtu to 9000, does this mean the card can actually > send and receive messages this size? Being able to set the mtu to 9000 means that the driver for the card thinks it can do it. If it makes you feel any better, I've been using jumbo frames on BCM5704's for a while now.
From wburke999 at msn.com Sun Mar 27 07:00:56 2005 From: wburke999 at msn.com (William Burke) Date: Wed Nov 25 01:03:57 2009 Subject: Myrinet setup (was: RE: [Beowulf]) In-Reply-To: <1111925313.4246a2411bcdf@home.staff.uni-marburg.de> Message-ID: Hi Reuti, Quoting Reuti [reuti@staff.uni-marburg.de]: > First one additional question (before I route you in the wrong direction): > is it necessary for you to have a shared $TMPDIR for SGE? ...More common > and faster is to use the local /tmp on the nodes for this Actually there is no reason that I should have a shared $TMPDIR for SGE, except that I read somewhere (I am not sure where) that it was recommended to share the $TMPDIR. However, I just searched the SGE and N1GE docs and found no evidence supporting that notion, and now, thinking about it, setting $TMPDIR to local /tmp would simplify things. Thxs Regards, William -----Original Message----- From: Reuti [mailto:reuti@staff.uni-marburg.de] Sent: Sunday, March 27, 2005 7:09 AM To: William Burke Cc: users@gridengine.sunsource.net; beowulf@beowulf.org; dag@sonsorol.org; 'John Hearns' Subject: Myrinet setup (was: RE: [Beowulf]) From vinicius at centrodecitricultura.br Tue Mar 29 05:22:54 2005 From: vinicius at centrodecitricultura.br (Vinicius de Lima) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] LAM-MPI with SSH Message-ID: <424956AE.2070600@centrodecitricultura.br> Hi, I installed Fedora Core 3 to build a Cluster, and lam-mpi was already installed. How do you change it to use SSH? Because it is configured to use RSH. Tks, Vinicius.
From msuess at uni-kassel.de Tue Mar 29 08:01:31 2005 From: msuess at uni-kassel.de (Michael Suess) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] Call for Help: An Evaluation of Parallel Programming Systems Message-ID: <200503291801.38011.msuess@uni-kassel.de> Dear Colleagues, with this mail I want to kindly ask for your help regarding a small research project of our group, which you may find interesting as well. It has the working title: "Evaluation of parallel programming systems" and aims at collecting information about the present state of the art of parallel programming. Some of the key questions to be answered are: - What parallel programming systems are used in real world applications? - Which parallel programming systems have been developed in the past, and which are being developed at the moment? - What are the strengths of these various systems, and what are their problems? - How do the systems compare to each other? - Which application areas and architectures are the systems useful for? For this purpose, we have set up two subprojects, which might be interesting for you: At http://www.plm.eecs.uni-kassel.de/parasurvey/ we have put a small survey online. Its main goal is to figure out what parallel programming systems are in use and known today. Filling it out will take about 10 minutes and (besides being notified of the results) you can win one of two $50 book gift certificates from Amazon.com. The second subproject is a Wiki (with the fitting name "Parawiki") with the purpose of providing the parallel programming community a place to share and research information about parallel programming. It even has a unique feature (called "trees"), which can be used to gather an easy overview over certain topics. Please consider participating in this effort and providing information about your parallel programming system of choice there.
The Parawiki is located here: http://parawiki.plm.eecs.uni-kassel.de/ Of course, the whole project has a home page as well, with a more extensive overview: http://www.plm.eecs.uni-kassel.de/plm/?eval_pps If you want even more information, we have put up a short technical report about the project here: http://www.plm.eecs.uni-kassel.de/plm/fileadmin/pm/publications/suess/introEvalPPS.pdf Thank you very much for your cooperation, Best Regards from Kassel/Germany, Michael Suess P.S.: I apologize if you receive this mail through multiple channels, since of course I have tried to reach as many participants as possible. If you know other sources on the net, where people involved in parallel programming gather, it would be very kind if you could forward this message there. -- "What we do in life, echoes in eternity..." M.: msuess@uni-kassel.de | T.: +49-561-804-6269 | F.: +49-561-804-6219 Research Associate, Programming Languages / Methodologies Research Group University of Kassel, Wilhelmshöher Allee 73, D-34121 Kassel Public PGP key: http://www.suessnetz.de/michael/michaelsuess.gpg PGP key fingerprint: A744 AFBA CA93 620B 8701 AB98 D4CF 4F3C 945A 61FE -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050329/29614b78/attachment.bin From rwm at absoft.com Tue Mar 29 13:34:34 2005 From: rwm at absoft.com (Rodney Mach) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] Re: Beowulf Digest, Vol 13, Issue 47 In-Reply-To: <200503292000.j2TK0GYK017135@bluewest.scyld.com> References: <200503292000.j2TK0GYK017135@bluewest.scyld.com> Message-ID: <4249C9EA.7050509@absoft.com> Oi Vinicius, You can export the LAMRSH variable to select ssh, for example export LAMRSH='ssh -x' -Rod > > Today's Topics: > > 1.
LAM-MPI with SSH (Vinicius de Lima) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 29 Mar 2005 10:22:54 -0300 > From: Vinicius de Lima > Subject: [Beowulf] LAM-MPI with SSH > To: Beowulf > Message-ID: <424956AE.2070600@centrodecitricultura.br> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Hi, > > I installed Fedora Core 3 to build a Cluster, but the lam-mpi had > already install. > How do ypu change to SSH? > Because must to configure with RSH. > > Tks, > Vinicius. > > From joachim at ccrl-nece.de Tue Mar 29 23:36:15 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Wed Nov 25 01:03:57 2009 Subject: [O-MPI users] Re: [Beowulf] Alternative to MPI ABI In-Reply-To: <20050327005045.GA3760@greglaptop.hsd1.ca.comcast.net> References: <20050325013448.GB4459@greglaptop.internal.keyresearch.com> <42446D96.7080904@myri.com> <20050325220306.GD2045@greglaptop.internal.keyresearch.com> <20050325232628.GA3781@greglaptop.internal.keyresearch.com> <5b7d13b355dc65349c7a683439e64e01@open-mpi.org> <20050327005045.GA3760@greglaptop.hsd1.ca.comcast.net> Message-ID: <424A56EF.8040607@ccrl-nece.de> Greg Lindahl wrote: > On Sat, Mar 26, 2005 at 06:47:41AM -0500, Jeff Squyres wrote: >>I guess I don't understand your reluctance to accept a MorphMPI-like >>solution: > > You have repeated your original MorphMPI attributes. I responded to > them, and I don't see any sign that you've read my response. This is > not the way discussions are usually held. Jeff explained very well the problems he sees with your additional F77-like-C-API (which, BTW, is largely realized in MPICH as all opaque types there are integer handles): two interfaces means two mpi.h and two libraries to maintain. I doubt that any implementor is keen on this. Instead, a layer on top would be easier, and not unrealistic to maintain: a serious MPI implementation will rarely (never!?) change its mpi.h/mpif.h. 
We (for NEC's MPI implementation across all platforms) didn't do so for 5 years when we completed the full MPI-2 functionality. And even before, we only added things, but didn't change existing definitions. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From eugen at leitl.org Thu Mar 31 02:26:20 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] CiSE Cluster Issue includes Macs (fwd from d@daugerresearch.com) Message-ID: <20050331102620.GT24702@leitl.org> ----- Forwarded message from "Dean Dauger, Ph. D." ----- From: "Dean Dauger, Ph. D." Date: Wed, 30 Mar 2005 10:56:30 -0800 To: scitech@lists.apple.com Cc: "Dr. Dean Dauger" Subject: CiSE Cluster Issue includes Macs X-Mailer: Apple Mail (2.619) Hello All, A special Cluster issue of Computing in Science and Engineering, a joint publication of IEEE and the American Institute of Physics, was just published (Mar/Apr05). One of its articles is about Mac clusters: "Plug-and-Play" Cluster Computing: High-Performance Computing for the Mainstream * * "This article is a must-read for anyone who wants to apply clustering ... but faces limited resources for actually managing the cluster." - Prof. George K. Thiruvathukal in his Guest Editor's Introduction for the issue. We invite you to view a PDF preprint of the article via the above link. Thank you, Dean _______________________________________________ Do not post admin requests to the list. They will be ignored.
Scitech mailing list (Scitech@lists.apple.com) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/scitech/eugen%40leitl.org This email sent to eugen@leitl.org ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20050331/5bf80bd8/attachment.bin From danmaftei at gmail.com Tue Mar 29 14:56:07 2005 From: danmaftei at gmail.com (Dan Maftei) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] how to run gamess parallel on beowulf cluster Message-ID: <4249DD07.3080007@gmail.com> Try using export DDI_RSH=/usr/bin/ssh (using bash) or setenv DDI_RSH "/usr/bin/ssh" (using csh) From nelsoneci at gmail.com Tue Mar 29 17:23:47 2005 From: nelsoneci at gmail.com (Nelson Castillo) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] Linux NIC order : (ether)boots from eth1 but Linux says it's eth0 Message-ID: <2accc2ff0503291723771ea78a@mail.gmail.com> Hi. I'm booting a diskless cluster. Some nodes boot from eth0, others from eth1 or eth2. All nodes are identical. n@mdk:~$ lspci | grep Ethernet 0000:02:02.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 08) 0000:02:08.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 34) 0000:02:09.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 34) The etherboot image of the nodes that boot from eth0 has support only for Ethernet Pro 100. They work fine. The nodes that boot from eth1 have support for 3c90x and they work fine. I tell etherboot to boot from the first NIC found. But I have an issue.
Linux kernel 2.6.8 sees eth1 (3com) as eth0, but only when they boot from the etherboot image. I don't know why they inherit the etherboot order. Kernel log reads (Root over NFS / get IP from DHCP): IP-Config: Complete device=eth0, addr=10.0.1.31, mask=255.255.255.0, gw=10.0.1.1, host=10.0.1.31, domain=mydomain.com, nis-domain=(none), bootserver=10.0.1.1, rootserver=10.0.1.1, rootpath= (note! : device=eth0) ****** My question is: can I override this and tell Linux to call this interface eth1? ****** Linux names it eth0. eth0 Link encap:Ethernet HWaddr 00:50:DA:CE:07:7F I would also like to know why Linux does this. Regards, Nelson.- -- Homepage : http://geocities.com/arhuaco The first principle is that you must not fool yourself and you are the easiest person to fool. -- Richard Feynman. From scho at cs.hku.hk Tue Mar 29 19:27:42 2005 From: scho at cs.hku.hk (Roy S.C. Ho) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] CFP: MSN 2005 Message-ID: We apologize if you receive multiple copies of this CFP. ------------------------------------------------------------------------- International Conference on Mobile Ad-hoc and Sensor Networks (MSN) 13-15 Dec 2005, Wuhan, China http://www.cs.cityu.edu.hk/~msn05 This conference provides a forum for researchers and practitioners to exchange research results and share development experiences.
Topics of interest include, but are not limited to, the following areas in mobile ad hoc and sensor networks: - Network architecture and protocols - Software platforms and development tools - Self-organization and synchronization - Routing and data dissemination - Failure resilience and fault isolation - Energy management - Data, information, and signal processing - Security and privacy - Network planning, provisioning, and deployment - Network modeling and performance evaluation - Developments and applications - Integration with other systems Publications: Original and previously unpublished technical papers are solicited for presentation at the conference and publication in the proceedings. The proceedings will be published in the Springer LNCS series. Selected papers will be published in journal special issues. Important Deadlines: Paper submission: 1 June 2005, Acceptance notification: 15 Aug 2005, Camera ready: 10 Sept 2005 Steering Committee Co-Chair: Lionel Ni, Hong Kong Univ of Sci and Technology Jinnan Liu, Wuhan University General Co-Chair: Taieb Znati, University of Pittsburgh Yanxiang He, Wuhan University Program Co-Chair: Jie Wu, Florida Atlantic University Xiaohua Jia, City Univ of Hong Kong Program Vice Chairs: Ivan Stojmenovic, University of Ottawa Jang-Ping Sheu, National Central University, Taiwan Jianzhong Li, Harbin Institute of Technology, China Publicity Chairs: Makoto Takizawa, Tokyo Denki University, Japan Weifa Liang, The Australian National University Jiannong Cao, Hong Kong Polytechnic University Local Organization Chair: Chuanhe Huang, Wuhan University ------------------------------------------------------------------------- From scho at cs.hku.hk Tue Mar 29 21:44:04 2005 From: scho at cs.hku.hk (Roy S.C. Ho) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] CFP: APPT 2005 Message-ID: We apologize if you receive multiple copies of this CFP.
------------------------------------------------------------------------ Preliminary Call For Papers Sixth International Workshop on Advanced Parallel Processing Technologies (APPT 2005) 27-28 Oct. 2005, Hong Kong, China http://www.comp.polyu.edu.hk/APPT05 APPT is a biennial workshop on parallel and distributed processing. Its scope covers all aspects of parallel and distributed computing technologies, including architectures, software systems and tools, algorithms, and applications. APPT originated from collaborations between researchers from China and Germany and has evolved into an international workshop. The past five workshops were held in Beijing, Koblenz, Changsha, Ilmenau, and Xiamen, respectively. APPT'05 will be the sixth in the series. Following the success of the last five workshops, APPT'05 provides a forum for scientists and engineers in academia and industry to exchange and discuss their experiences, new ideas, and results about research in the areas related to parallel and distributed processing. Papers are solicited. Topics of particular interest include, but are not limited to: - Parallel / Distributed System Architectures - Advanced Microprocessor Architecture - Middleware, Software Tools and Environments - Parallelizing Compilers - Software Engineering issues - Interconnection Networks - Network Protocols - Task Scheduling and Load Balancing - Grid Computing, Cluster Computing, Peer-to-Peer Computing - Internet & Web computing - Pervasive and Mobile Computing - Security in networks and distributed systems - Fault tolerance and dependability - Image Generation and Processing: Rendering Techniques, Virtual Reality, Visualization, Graphic Processing, etc SUBMISSION GUIDELINES Prospective authors are invited to submit a full paper in English (not exceeding 10 pages (12 pt) in length) presenting original and unpublished research results and experience.
Papers will be selected based on their originality, timeliness, significance, relevance, and clarity of presentation. Papers must be submitted in PDF (preferably) or Postscript that is interpretable by Ghostscript. Submissions will be carried out electronically on the Web via a link found at the conference web page http://www.comp.polyu.edu.hk/APPT05/. Submissions imply the willingness of at least one author to register, attend the conference, and present the paper. There will be two best student paper awards to recognise distinguished student research. PUBLICATION The proceedings of the symposium will be published in Springer's Lecture Notes in Computer Science (Pending). Important Dates: Paper submission: 1 May 2005 Acceptance notification: 1 July 2005 Camera-ready due: 15 July 2005 APPT'05 workshop: 27-28 Oct 2005 Sponsored by Architecture Professional Committee of China Computer Federation. Organized by Department of Computing, Hong Kong Polytechnic University Supported by IEEE HK, ACM HK ORGANIZING COMMITTEE General Chair Xingming Zhou, Member of Chinese Academy of Science. National Lab for Parallel and Distributed Processing, China Vice General Co-Chairs Xiaodong Zhang, College of William and Mary, USA David A. Bader, Univ. of New Mexico, USA Program Co-Chairs Jiannong Cao, Hong Kong Polytechnic Univ, H.K. Wolfgang Nejdl, Univ. of Hannover, Germany Publicity Chair Cho-Li Wang, Univ. of Hong Kong, H.K. Publication Chair TBD Local Organisation Chair Allan K.Y. Wong, H.K. PolyU, H.K. Finance / Registration Chair Ming Xu, National Lab for Parallel and Distributed Processing, China PROGRAM COMMITTEE Srinivas Aluru, Iowa State University, USA Jose Nelson Amaral, University of Alberta, Canada Nancy Amato, Texas A&M University, USA Wentong Cai, Nanyang Technological Univ. Singapore Y. K. Chan, City Univ.
of Hong Kong, Hong Kong John Feo, Cray Inc., USA Tarek El-Ghazawi, George Washington University, USA Ananth Grama, Purdue University, USA Binxing Fang, Harbin Institute of Technology, China Guang Gao, University of Delaware, USA Manfred Hauswirth, EPFL, Switzerland Bruce Hendrickson, Sandia National Lab., USA Zhenzhou Ji, Harbin Institute of Technology, China Mehdi Jazayeri, Technical University of Vienna, Austria Ashfaq Khokhar, University of Illinois, Chicago, USA Ajay Kshemkalyani, Univ. of Illinois, Chicago, USA Xiaoming Li, Peking University, China Francis Lau, University of Hong Kong, China Xinsong Liu, Electronical Sciences University, China Yunhao Liu, Hong Kong University of Science and Technology, China Xinda Lu, Shanghai Jiao Tong University, China Siwei Luo, Northern Jiao Tong University, China Beth Plale, Indiana University, USA Bernhard Plattner, Swiss Federal Institute of Tech., Switzerland Sartaj Sahni, University of Florida, USA Nahid Shahmehri, Linköpings universitet, Sweden Chengzheng Sun, Griffith University, Australia Zhimin Tang, Institute of Computing, CAS, China Bernard Traversat, Sun Microsystems Peter Triantafillou, University of Patras, Greece Lars Wolf, Tech. Universität Braunschweig, Germany Jie Wu, Florida Atlantic University, USA Li Xiao, Michigan State Univ, USA Cheng-Zhong Xu, Wayne State University, USA Weimin Zheng, Tsinghua University, China ------------------------------------------------------------------------ From jmdavis at mail2.vcu.edu Wed Mar 30 13:37:24 2005 From: jmdavis at mail2.vcu.edu (Mike Davis) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD Message-ID: <424B1C14.1000304@mail2.vcu.edu> What OSes are Opteron clusters out there running? Is anyone running FC2 on opterons? I'm looking at opterons for our next cluster, but I'm not sure about what OS to run. Thus far we've been with RH and/or RHAS. But, the next cluster will be big and I'm just not sure what we should run.
Mike Davis From lindahl at pathscale.com Thu Mar 31 09:13:47 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: <424B1C14.1000304@mail2.vcu.edu> References: <424B1C14.1000304@mail2.vcu.edu> Message-ID: <20050331171347.GB1284@greglaptop.internal.keyresearch.com> On Wed, Mar 30, 2005 at 04:37:24PM -0500, Mike Davis wrote: > I'm looking at opterons for our next cluster, but I'm not sure about > what OS to run. Thus far we've been with RH and or RHAS. But, the next > cluster will be big and I'm just not sure what we should run. In order to get the best performance out of your Opterons, you're going to want to run a recent 64-bit distro with a 2.6 kernel, such as FC 3 or RHEL 4. -- greg From landman at scalableinformatics.com Thu Mar 31 09:16:01 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: <424B1C14.1000304@mail2.vcu.edu> References: <424B1C14.1000304@mail2.vcu.edu> Message-ID: <424C3051.5050306@scalableinformatics.com> Hi Mike: Opterons will do better with a 2.6 kernel (2.6.9 and higher). If you are going to use RHEL, you might want to look at Rocks (RHEL3 based) or Warewulf which should be able to sit atop RHEL4. If you want to use a Redhat work-alike, you might want to look closely at Centos4. I am sure others will take issue with this, but I would strongly advise against using a rolling beta OS (FC-x) as the basis for a production cycle machine. If it is a purely experimental cluster, go for it. If it is supposed to provide cycles to a wide group, you might look more closely at a supported/supportable distribution. We have had good luck with SuSE 9.x (x>=1), RHEL3, CentosX on clusters using a variety of meta-distributions (warewulf, Rocks, others). Most of our customers seem to prefer the RHEL series, so we tend to work with that more than others, but YMMV. 
joe Mike Davis wrote: > What OSes are Opteron clusters out there running. Is anyone running FC2 > on opterons? > > I'm looking at opterons for our next cluster, but I'm not sure about > what OS to run. Thus far we've been with RH and or RHAS. But, the next > cluster will be big and I'm just not sure what we should run. > > Mike Davis > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From dag at sonsorol.org Thu Mar 31 09:34:24 2005 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: <424C3051.5050306@scalableinformatics.com> References: <424B1C14.1000304@mail2.vcu.edu> <424C3051.5050306@scalableinformatics.com> Message-ID: <424C34A0.2040904@sonsorol.org> I second Joe's comments. All of our Opteron systems run Suse 9.2 by default and we use Centos-4 for "Redhat" compatible functionality when required since Redhat has explicitly chosen to price themselves out of the cluster market for everything except 2-way 32bit boxes. Suse 9.1/9.2 on Opteron and Suse Enterprise Linux (SLES 8/9) on Itanium Systems (meaning our SGI Altix) have been extremely stable and useful in our work. Highly recommended. -Chris Joe Landman wrote: > Hi Mike: > > Opterons will do better with a 2.6 kernel (2.6.9 and higher). If you > are going to use RHEL, you might want to look at Rocks (RHEL3 based) or > Warewulf which should be able to sit atop RHEL4. If you want to use a > Redhat work-alike, you might want to look closely at Centos4. 
> > I am sure others will take issue with this, but I would strongly > advise against using a rolling beta OS (FC-x) as the basis for a > production cycle machine. If it is a purely experimental cluster, go > for it. If it is supposed to provide cycles to a wide group, you might > look more closely at a supported/supportable distribution. > > We have had good luck with SuSE 9.x (x>=1), RHEL3, CentosX on clusters > using a variety of meta-distributions (warewulf, Rocks, others). Most > of our customers seem to prefer the RHEL series, so we tend to work with > that more than others, but YMMV. > > joe > > Mike Davis wrote: > >> What OSes are Opteron clusters out there running. Is anyone running >> FC2 on opterons? >> >> I'm looking at opterons for our next cluster, but I'm not sure about >> what OS to run. Thus far we've been with RH and or RHAS. But, the next >> cluster will be big and I'm just not sure what we should run. >> >> Mike Davis >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > -- Chris Dagdigian, BioTeam - Independent life science IT & informatics consulting Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193 PGP KeyID: 83D4310E iChat/AIM: bioteamdag Web: http://bioteam.net From landman at scalableinformatics.com Thu Mar 31 10:04:28 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: <424C34A0.2040904@sonsorol.org> References: <424B1C14.1000304@mail2.vcu.edu> <424C3051.5050306@scalableinformatics.com> <424C34A0.2040904@sonsorol.org> Message-ID: <424C3BAC.5050009@scalableinformatics.com> Actually Redhat now has HPC pricing per node. There are other good reasons to look elsewhere for HPC distributions though, specifically their lack of a good high-performance, scalable per-node file system.
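Joe's point about per-node file systems is easy to check on a running node. A minimal sketch (assuming GNU coreutils `stat`; the `/tmp` default is only an illustration, substitute your actual scratch mount):

```shell
# Print the filesystem type backing a node's scratch area.
# stat -f reports filesystem (not file) metadata; %T is the
# filesystem type name, e.g. ext2/ext3, xfs, or tmpfs.
scratch=${1:-/tmp}
fstype=$(stat -f -c %T "$scratch")
echo "filesystem for $scratch: $fstype"
```

On a stock RHEL-style install this typically reports an ext2/ext3 variant; a SuSE node with XFS scratch would report xfs.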
SuSE at least makes XFS and JFS available, and you can build/install a system with these. Redhat prefers that you use ext3. Another issue with RHEL3 was the ancient kernels with many backports of advanced functionality from modern kernels. Additionally, adding modules for new hardware support into their boot process is a minor nightmare... Chris Dagdigian wrote: > > I second Joe's comments. > > All of our Opteron systems run Suse 9.2 by default and we use Centos-4 > for "Redhat" compatible functionality when required since Redhat has > explicitly chosen to price themselves out of the cluster market for > everything except 2-way 32bit boxes.
>>> >>> I'm looking at opterons for our next cluster, but I'm not sure about >>> what OS to run. Thus far we've been with RH and or RHAS. But, the >>> next cluster will be big and I'm just not sure what we should run. >>> >>> Mike Davis >>> _______________________________________________ >>> Beowulf mailing list, Beowulf@beowulf.org >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> >> >> > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From dag at sonsorol.org Thu Mar 31 12:35:33 2005 From: dag at sonsorol.org (Chris Dagdigian) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: References: <424B1C14.1000304@mail2.vcu.edu> <424C3051.5050306@scalableinformatics.com> <424C34A0.2040904@sonsorol.org> <424C3BAC.5050009@scalableinformatics.com> Message-ID: <424C5F15.1080805@sonsorol.org> Hi Jamie, Leaving all distro related politics aside, the main reason for the widespread use of 'bloated' or commercial Linux distributions in clusters is the following: (1) ISV support for commercial end user applications. In some industries it is not uncommon to have scientific or engineering application licenses that cost more than the cluster itself. These applications are often extremely powerful and extremely complex and place tremendous demands on the host system. In order to get the best, most useful, help out of the software vendor's product engineers one must usually be running on a limited subset of OS distributions supported by the vendor. (2) Reducing the finger pointing loop when things go wrong in complex IT configurations. 
This has only happened to me once or twice but as an example: If I have to connect a cluster I/O node to an enterprise SAN fabric I'd rather my Linux host OS be something that the SAN/FC switch vendor officially certifies and has qualified. The cost of the commercial Linux distro license is going to be far cheaper than the time/effort it would take to teach myself low-level fibre channel transport and fabric debugging should things go wrong. Again, just my $.02 -- it totally depends what kind of work you are doing on your cluster(s). -Chris Jamie Rollins wrote: > Any decent distro should support kernel 2.6 with amd64. But can someone > give me one good reason why you would use anything other than a > streamlined distro like Debian? Why pay for all the bloat in something > like redhat when your nodes are probably going to be running a single > process anyway? > > jamie. > From rgb at phy.duke.edu Thu Mar 31 13:32:56 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: <424B1C14.1000304@mail2.vcu.edu> References: <424B1C14.1000304@mail2.vcu.edu> Message-ID: On Wed, 30 Mar 2005, Mike Davis wrote: > What OSes are Opteron clusters out there running. Is anyone running FC2 > on opterons? Yes, works fine. rgb > > I'm looking at opterons for our next cluster, but I'm not sure about > what OS to run. Thus far we've been with RH and or RHAS. But, the next > cluster will be big and I'm just not sure what we should run. > > Mike Davis > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Thu Mar 31 13:39:22 2005 From: rgb at phy.duke.edu (Robert G.
Brown) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: <20050331171347.GB1284@greglaptop.internal.keyresearch.com> References: <424B1C14.1000304@mail2.vcu.edu> <20050331171347.GB1284@greglaptop.internal.keyresearch.com> Message-ID: On Thu, 31 Mar 2005, Greg Lindahl wrote: > On Wed, Mar 30, 2005 at 04:37:24PM -0500, Mike Davis wrote: > > > I'm looking at opterons for our next cluster, but I'm not sure about > > what OS to run. Thus far we've been with RH and or RHAS. But, the next > > cluster will be big and I'm just not sure what we should run. > > In order to get the best performance out of your Opterons, you're > going to want to run a recent 64-bit distro with a 2.6 kernel, such as > FC 3 or RHEL 4. We're running FC2 with e.g. rgb@s00|B:1001>uname -a Linux s00 2.6.7-1.494.2.2smp #1 SMP Tue Aug 3 09:34:09 EDT 2004 x86_64 x86_64 x86_64 GNU/Linux It's been pretty stable and gives decent numerical performance. Are there particular advantages to FC3 and the more recent 2.6 kernels? I'm running only one experimental AMD 64 box with FC3 on it (since Duke has decided to use only every other FC release in production to stretch the cycle a bit) and don't have an opteron per se set up that way for comparison. rgb > > -- greg > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C.
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From lindahl at pathscale.com Thu Mar 31 13:51:11 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: References: <424B1C14.1000304@mail2.vcu.edu> <20050331171347.GB1284@greglaptop.internal.keyresearch.com> Message-ID: <20050331215111.GC3149@greglaptop.internal.keyresearch.com> On Thu, Mar 31, 2005 at 04:39:22PM -0500, Robert G. Brown wrote: > Are there particular advantages to FC3 and the more recent 2.6 kernels? I don't think there's much of a difference from FC2->FC3 and from 2.6.7->2.6.11. The main issue there would be C++ users; if your code doesn't cleanly compile with gcc-3.4... -- greg From landman at scalableinformatics.com Thu Mar 31 16:54:24 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: References: <424B1C14.1000304@mail2.vcu.edu> <424C3051.5050306@scalableinformatics.com> <424C34A0.2040904@sonsorol.org> <424C3BAC.5050009@scalableinformatics.com> Message-ID: <424C9BC0.1010605@scalableinformatics.com> Commercial support of apps, drivers, connectivity. Corporate IT staff are (unless they are empowered) unlikely to support things they are unfamiliar with, or when they have go/no-go decision authority, unlikely to give the go-ahead to a distro that does not have a 1800-help-me number attached to it. For academic staff, they will likely pick the distro they are comfortable with, or one that they see lots of cluster people using. At the end of the day, the cluster admin is going to be asked whether or not they trust their production computing system to distribution X. Distribution X needs to support all their mix of stuff, which likely supports Redhat, SuSE, and possibly a third. Anything other than those and they are on their own with their support community.
It is awfully lonely using Stampede Linux on a production cluster, and running into a problem with IB, when it comes time to ask a question and get help/support. Jamie Rollins wrote: > Any decent distro should support kernel 2.6 with amd64. But can someone > give me one good reason why you would use anything other than a > streamlined distro like Debian? Why pay for all the bloat in something > like redhat when your nodes are probably going to be running a single > process anyway? > > jamie. > > > > On Thu, 31 Mar 2005, Joe Landman wrote: > > >>Actually Redhat now has HPC pricing per node. There are other good >>reasons to look elsewhere for HPC distributions though, specifically due >>to their lack of good high performance/scalable per-node file system. >>SuSE at least makes XFS and JFS available, and you can build/install a >>system with these. Redhat prefers that you use ext3. Another issue for >>the RHEL3 were the ancient kernels with many backports of advanced >>functionality from modern kernels. Additionally adding modules for new >>hardware support into their boot process is a minor nightmare... >> >> >> >> >>Chris Dagdigian wrote: >> >>>I second Joe's comments. >>> >>>All of our Opteron systems run Suse 9.2 by default and we use Centos-4 >>>for "Redhat" compatible functionality when required since Redhat has >>>explicitly chosen to price themselves out of the cluster market for >>>everything except 2-way 32bit boxes. >>> >>>Suse 9.1/9.2 on Opteron and Suse Enterprise Linux (SLES 8/9) on Itanium >>>Systems (meaning our SGI Altix) have been extremely stable and useful in >>>our work. Highly recommended. >>> >>>-Chris >>> >>> >>> >>>Joe Landman wrote: >>> >>> >>>>Hi Mike: >>>> >>>> Opterons will do better with a 2.6 kernel (2.6.9 and higher). If >>>>you are going to use RHEL, you might want to look at Rocks (RHEL3 >>>>based) or Warewulf which should be able to sit atop RHEL4.
If you >>>>want to use a Redhat work-alike, you might want to look closely at >>>>Centos4. >>>> >>>> I am sure others will take issue with this, but I would strongly >>>>advise against using a rolling beta OS (FC-x) as the basis for a >>>>production cycle machine. If it is a purely experimental cluster, go >>>>for it. If it is supposed to provide cycles to a wide group, you >>>>might look more closely at a supported/supportable distribution. >>>> >>>> We have had good luck with SuSE 9.x (x>=1), RHEL3, CentosX on >>>>clusters using a variety of meta-distributions (warewulf, Rocks, >>>>others). Most of our customers seem to prefer the RHEL series, so we >>>>tend to work with that more than others, but YMMV. >>>> >>>>joe >>>> >>>>Mike Davis wrote: >>>> >>>> >>>>>What OSes are Opteron clusters out there running. Is anyone running >>>>>FC2 on opterons? >>>>> >>>>>I'm looking at opterons for our next cluster, but I'm not sure about >>>>>what OS to run. Thus far we've been with RH and or RHAS. But, the >>>>>next cluster will be big and I'm just not sure what we should run. 
>>>>> >>>>>Mike Davis >>>>>_______________________________________________ >>>>>Beowulf mailing list, Beowulf@beowulf.org >>>>>To change your subscription (digest mode or unsubscribe) visit >>>>>http://www.beowulf.org/mailman/listinfo/beowulf >>>> >>>> >>>> >>-- >>Joseph Landman, Ph.D >>Founder and CEO >>Scalable Informatics LLC, >>email: landman@scalableinformatics.com >>web : http://www.scalableinformatics.com >>phone: +1 734 786 8423 >>fax : +1 734 786 8452 >>cell : +1 734 612 4615 >> >>_______________________________________________ >>Beowulf mailing list, Beowulf@beowulf.org >>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From list-beowulf at onerussian.com Thu Mar 31 08:58:52 2005 From: list-beowulf at onerussian.com (Yaroslav Halchenko) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] Linux NIC order : (ether)boots from eth1 but Linux says it's eth0 In-Reply-To: <2accc2ff0503291723771ea78a@mail.gmail.com> References: <2accc2ff0503291723771ea78a@mail.gmail.com> Message-ID: <20050331165851.GV32590@washoe.rutgers.edu> In general you can rename interfaces the way you want Package: ifrename Description: Rename network interfaces based on various static criteria Ifrename allows the user to decide what name a network interface will have. Ifrename can use a variety of selectors to specify how interface names match the network interfaces on the system; the most common selector is the interface MAC address but in your case it is just a matter of the order in which these interfaces got acknowledged by the kernel. So if both drivers are compiled as modules, then the one loaded first gets eth0.
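To make the two options concrete, here is a hypothetical config sketch (the MAC address is the one from Nelson's eth0 output earlier in the thread; real node MACs will differ, and the driver names should be checked against the running kernel):

```
# /etc/iftab -- consulted by ifrename: pin interface names to MAC addresses
# so the 3Com NIC is always named eth1 no matter which driver probes first.
eth1 mac 00:50:DA:CE:07:7F

# Alternatively, control the module load order in /etc/modprobe.conf
# (2.6 kernels): the driver aliased to eth0 is loaded first when the
# interface is brought up, so the Intel NIC keeps eth0.
alias eth0 e100
alias eth1 3c59x
alias eth2 3c59x
```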
so you just need to make sure the modules load in the right sequence -- Yarik On Tue, Mar 29, 2005 at 08:23:47PM -0500, Nelson Castillo wrote: > Hi. > I'm booting a diskless cluster. Some nodes boot from eth0, others > from eth1 or eth2. All nodes are identical. > n@mdk:~$ lspci | grep Ethernet > 0000:02:02.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro > 100] (rev 08) > 0000:02:08.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX > [Cyclone] (rev 34) > 0000:02:09.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX > [Cyclone] (rev 34) > The etherboot image of the nodes that boot from eth0 have support > only for Ethernet Pro 100. They work fine. > The nodes that boot from eth1 have support for 3c90x and they work > fine. I tell etherboot to boot from the first nic found. But I have an issue. > Linux kernel 2.6.8 sees eth1 (3com) as eth0! only when they boot from > the etherboot image. I don't know why they inherit the etherboot order. > Kernel log reads (Root over NFS / get ip from DHCP): > IP-Config: Complete > device=eth0, addr=10.0.1.31, mask=255.255.255.0, gw=10.0.1.1, > host=10.0.1.31, domain=mydomain.com, nis-domain=(none), > bootserver=10.0.1.1, rootserver=10.0.1.1, rootpath= > (note! : device=eth0) > ****** > My question is: can I override this and tell Linux to call this interface eth1? > ****** > Linux names it eth0. > eth0 Link encap:Ethernet HWaddr 00:50:DA:CE:07:7F > I also would like to know why Linux does this. > Regards, > Nelson.- -- .-. =------------------------------ /v\ ----------------------------= Keep in touch // \\ (yoh@|www.)onerussian.com Yaroslav Halchenko /( )\ ICQ#: 60653192 Linux User ^^-^^ [175555] -------------- next part -------------- A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20050331/ecade423/attachment.bin From jlb17 at duke.edu Thu Mar 31 09:03:03 2005 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: <424B1C14.1000304@mail2.vcu.edu> References: <424B1C14.1000304@mail2.vcu.edu> Message-ID: On Wed, 30 Mar 2005 at 4:37pm, Mike Davis wrote > What OSes are Opteron clusters out there running. Is anyone running FC2 > on opterons? > > I'm looking at opterons for our next cluster, but I'm not sure about > what OS to run. Thus far we've been with RH and or RHAS. But, the next > cluster will be big and I'm just not sure what we should run. If you like RH but are afraid of the costs for a cluster (a valid concern), you can look at the clones. We run CentOS here and are quite happy with it. -- Joshua Baker-LePain Department of Biomedical Engineering Duke University From jrollins at ligo.mit.edu Thu Mar 31 11:00:31 2005 From: jrollins at ligo.mit.edu (Jamie Rollins) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: <424C3BAC.5050009@scalableinformatics.com> References: <424B1C14.1000304@mail2.vcu.edu> <424C3051.5050306@scalableinformatics.com> <424C34A0.2040904@sonsorol.org> <424C3BAC.5050009@scalableinformatics.com> Message-ID: Any decent distro should support kernel 2.6 with amd64. But can someone give me one good reason why you would use anything other than a streamlined distro like Debian? Why pay for all the bloat in something like redhat when your nodes are probably going to be running a single process anyway? jamie. On Thu, 31 Mar 2005, Joe Landman wrote: > Actually Redhat now has HPC pricing per node. There are other good > reasons to look elsewhere for HPC distributions though, specifically due > to their lack of good high performance/scalable per-node file system.
> SuSE at least makes XFS and JFS available, and you can build/install a > system with these. Redhat prefers that you use ext3. Another issue with > RHEL3 was the ancient kernels with many backports of advanced > functionality from modern kernels. Additionally, adding modules for new > hardware support into their boot process is a minor nightmare... > > > > > Chris Dagdigian wrote: > > > > I second Joe's comments. > > > > All of our Opteron systems run Suse 9.2 by default and we use Centos-4 > > for "Redhat" compatible functionality when required since Redhat has > > explicitly chosen to price themselves out of the cluster market for > > everything except 2-way 32bit boxes. > > > > Suse 9.1/9.2 on Opteron and Suse Enterprise Linux (SLES 8/9) on Itanium > > Systems (meaning our SGI Altix) have been extremely stable and useful in > > our work. Highly recommended. > > > > -Chris > > > > > > > > Joe Landman wrote: > > > >> Hi Mike: > >> > >> Opterons will do better with a 2.6 kernel (2.6.9 and higher). If > >> you are going to use RHEL, you might want to look at Rocks (RHEL3 > >> based) or Warewulf, which should be able to sit atop RHEL4. If you > >> want to use a Redhat work-alike, you might want to look closely at > >> Centos4. > >> > >> I am sure others will take issue with this, but I would strongly > >> advise against using a rolling beta OS (FC-x) as the basis for a > >> production cycle machine. If it is a purely experimental cluster, go > >> for it. If it is supposed to provide cycles to a wide group, you > >> might look more closely at a supported/supportable distribution. > >> > >> We have had good luck with SuSE 9.x (x>=1), RHEL3, CentosX on > >> clusters using a variety of meta-distributions (warewulf, Rocks, > >> others). Most of our customers seem to prefer the RHEL series, so we > >> tend to work with that more than others, but YMMV. > >> > >> joe > >> > >> Mike Davis wrote: > >> > >>> What OSes are Opteron clusters out there running?
Is anyone running > >>> FC2 on opterons? > >>> > >>> I'm looking at opterons for our next cluster, but I'm not sure about > >>> what OS to run. Thus far we've been with RH and or RHAS. But, the > >>> next cluster will be big and I'm just not sure what we should run. > >>> > >>> Mike Davis > >>> _______________________________________________ > >>> Beowulf mailing list, Beowulf@beowulf.org > >>> To change your subscription (digest mode or unsubscribe) visit > >>> http://www.beowulf.org/mailman/listinfo/beowulf > >> > >> > >> > > > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics LLC, > email: landman@scalableinformatics.com > web : http://www.scalableinformatics.com > phone: +1 734 786 8423 > fax : +1 734 786 8452 > cell : +1 734 612 4615 > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From kewley at gps.caltech.edu Thu Mar 31 12:38:26 2005 From: kewley at gps.caltech.edu (David Kewley) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: <424C3BAC.5050009@scalableinformatics.com> References: <424B1C14.1000304@mail2.vcu.edu> <424C34A0.2040904@sonsorol.org> <424C3BAC.5050009@scalableinformatics.com> Message-ID: <200503311238.26290.kewley@gps.caltech.edu> Joe Landman wrote on Thursday 31 March 2005 10:04: > Actually Redhat now has HPC pricing per node. There are other good > reasons to look elsewhere for HPC distributions though, specifically due > to their lack of good high performance/scalable per-node file system. In your experience/knowledge, what are the issues that make e.g. xfs better than ext3 on RHEL4? I ask specifically about RHEL4 because RH has worked hard on ext3 in time for RHEL4. 
My understanding is that RH chose to support ext3 but not xfs because: 1) they have in-house expertise for ext3 but not for xfs, and 2) they believe that xfs has no real advantages over ext3. If customers show RH that there are real-life needs for xfs that are not satisfied by ext3, then RH may well be willing to invest in in-house xfs expertise. David From laurence at scalablesystems.com Thu Mar 31 17:58:01 2005 From: laurence at scalablesystems.com (Laurence Liew) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] OS for 64 bit AMD In-Reply-To: <424C9BC0.1010605@scalableinformatics.com> References: <424B1C14.1000304@mail2.vcu.edu> <424C3051.5050306@scalableinformatics.com> <424C34A0.2040904@sonsorol.org> <424C3BAC.5050009@scalableinformatics.com> <424C9BC0.1010605@scalableinformatics.com> Message-ID: <424CAAA9.50403@scalablesystems.com> Hi, I would like to expand on Joe's argument for a commercially supported distro for *production* sites. Most of our customers run expensive commercial applications like Fluent, Cadence, etc., which are CERTIFIED to run on, say, SLES9 or RHEL3/4. Often the costs of these apps are much more than a commercial distro like RHEL. What the cluster IT admin wants is that whatever he puts on his cluster is known to work and is supported by the expensive software he is running. He knows he can call their supportline and ask for help. We all know Debian runs fine, CentOS runs fine, FC2,3,X will run fine... but most IT cluster admins basically want to do their job 9am - 5pm and go home and sleep soundly, knowing that if anything breaks or does not work, they can call/ask for help and the vendor MUST support them because they paid for it. Most of us on this list can support a cluster running on a non-commercial OS... but at most commercial and a large number of academic sites, the IT admins or researchers just want to get their job and research done... they don't want to bother or care much about the OS...
which is most of the time a small fraction of the cost of the cluster. If you look at Red Hat's HPC pricing, at US$79 per node (undiscounted), it is a fraction of the cost of a 2-way server. If you have a 128-node cluster (~US$300-500K), your electrical/heating/support costs would be much more than the US$10,112/year for RHEL HPC Edition. Saving US$10K to buy an additional 2-3 nodes just does not make sense when you are running a large production system supporting expensive hardware (SANs, tape backups) and expensive commercial software. All of our commercial customers and most of our academic customers recognise the value of a commercial distro and pay for it willingly (or unwillingly)... but at least they know who they can choke when things go wrong. Cheers! Laurence Joe Landman wrote: > Commercial support of apps, drivers, connectivity. Corporate IT staff > are (unless they are empowered) unlikely to support things they are > unfamiliar with, or, when they have go/no-go decision authority, unlikely > to give the go-ahead to a distro that does not have a 1800-help-me > number attached to it. For academic staff, they will likely pick the > distro they are comfortable with, or one that they see lots of cluster > people using. > > At the end of the day, the cluster admin is going to be asked whether or > not they trust their production computing system to distribution X. > Distribution X needs to support all their mix of stuff, which likely > supports Redhat, SuSE, and possibly a third. Anything other than > those and they are on their own with their support community. > > It is awfully lonely using Stampede Linux on a production cluster and > running into a problem with IB, when it comes time to ask a question > and get help/support. > > > Jamie Rollins wrote: > >> Any decent distro should support kernel 2.6 with amd64.
But can someone >> give me one good reason why you would use anything other than a >> streamlined distro like Debian? Why pay for all the bloat in something >> like redhat when your nodes are probably going to be running a single >> process anyway? >> >> jamie. [rest of quoted thread snipped] -- Laurence Liew, CTO Email: laurence@scalablesystems.com Scalable Systems Pte Ltd Web : http://www.scalablesystems.com (Reg.
No: 200310328D) 7 Bedok South Road Tel : 65 6827 3953 Singapore 469272 Fax : 65 6827 3922 From chimou at mail.wsu.edu Thu Mar 31 18:03:30 2005 From: chimou at mail.wsu.edu (Jack Chen) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] gigE switch suggestion In-Reply-To: <424C9BC0.1010605@scalableinformatics.com> Message-ID: Hi All, Can anyone please offer a suggestion on the choice of a gigabit Ethernet switch? This is for a 16-node cluster with gigabit connections. I know having jumbo frame support is important, and it seems like most SMC switches have this support (and are lower in price). Now, I have trouble deciding which of two 24-port switches to get (both support jumbo frames). The first is the SMC8524T EZ Switch 10/100/1000, which costs around US$450. The second is the SMC8624T TigerSwitch 10/100/1000, which costs around US$1800. Below are the product overviews of the two switches from the SMC website. All we need is to support the interconnect between the nodes. I guess the question is: can I go with the cheaper switch, and if I do, what limitations will I face? Thanks in advance. Jack ___ SMC8524T EZ Switch http://www.smc.com/index.cfm?event=viewProduct&localeCode=EN_USA&pid=1157 Overview The SMC8524T is a feature-rich, high-performance 24-port 10/100/1000 unmanaged switch designed for power users looking to migrate to gigabit speeds on their network. Delivering a combination of Ethernet, Fast Ethernet, and Gigabit connectivity in one compact solution, the SMC8524T removes server bottlenecks and speeds up access time for your users in just one move. Each port supports auto-MDI/MDI-X to simplify integration into a network. The SMC8524T also supports Jumbo Packets, complies with the IEEE802.3, IEEE802.3u, and IEEE802.3ab/Gigabit Ethernet standards, and features store-and-forward mode with wire-speed filtering to ensure data integrity.
Automatic Source Address Learning and Aging ensure superior delivery of data, while broadcast storm control and runt and CRC filtering prevent erroneous packets from propagating, optimizing the network bandwidth. The SMC8524T also features IEEE802.3x-compliant full-duplex flow control and HOL blocking prevention. At A Glance LEDs show ongoing switch status and simplify troubleshooting. ___ SMC8624T TigerSwitch http://www.smc.com/index.cfm?event=viewProduct&localeCode=EN_USA&pid=1182 Overview The SMC8624T is an all-Gigabit Layer 2 switch designed for departmental connections. It provides 24 10/100/1000 Mbps ports, plus 4 mini-GBIC slots for flexible fiber backbone attachment. Designed to handle the heavy traffic associated with enterprise workgroups and power users, this tri-speed high-density gigabit switch delivers blistering throughput speeds up to 48 Gbps using a non-blocking switching architecture. For even greater performance, the SMC8624T provides MultiLink Trunking with LACP, which can boost network bandwidth to 800 Mbps. Advanced features including four levels of priority queuing, VLAN with GVRP, SNMP, and IGMP snooping allow a department to effectively and securely deploy a bottleneck-free switching network for easy integration with a larger enterprise or campus network. From sarahw at cse.unsw.edu.au Thu Mar 31 23:27:43 2005 From: sarahw at cse.unsw.edu.au (Sarah Webster) Date: Wed Nov 25 01:03:57 2009 Subject: [Beowulf] MPI-2 communication performance on gigabit clusters Message-ID: Hello, I am a research student investigating the performance of MPI-2 communication primitives on commodity clusters. I've been looking for information/advice on how programs that run on gigabit clusters are penalised by the slow communication infrastructure, and ways to avoid this penalty. I've seen similar work using the InfiniBand and Myrinet networking technologies. In particular I'm looking at one-sided communication.
Any advice on where I can find information/statistics on the performance of mpich2 or LAM MPI one-sided communication primitives? Thanks, Sarah
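A back-of-the-envelope way to see the gigabit penalty Sarah describes (a minimal sketch; the latency and bandwidth figures are assumed ballpark values for illustration, not measurements of any particular hardware) is a first-order message cost model, time = latency + size/bandwidth:

```python
def transfer_time(size_bytes, latency_s, bandwidth_bps):
    """First-order message cost model: per-message latency plus
    serialization time (size / bandwidth, bandwidth in bytes per second)."""
    return latency_s + size_bytes / bandwidth_bps

# Assumed ballpark figures (illustrative only):
GIGE = dict(latency_s=50e-6, bandwidth_bps=125e6)     # ~50 us, ~1 Gb/s
MYRINET = dict(latency_s=6e-6, bandwidth_bps=250e6)   # ~6 us, ~2 Gb/s

for size in (1 << 10, 1 << 20):  # 1 KiB and 1 MiB messages
    t_ge = transfer_time(size, **GIGE)
    t_my = transfer_time(size, **MYRINET)
    print(f"{size:>8} B: GigE {t_ge * 1e6:8.1f} us, "
          f"Myrinet {t_my * 1e6:8.1f} us, ratio {t_ge / t_my:.1f}x")
```

Under these assumed numbers, small messages suffer far more on GigE (the ratio is latency-dominated) than large ones (bandwidth-dominated), which is one reason batching many small one-sided operations into a single synchronization epoch tends to pay off more on gigabit Ethernet than on Myrinet or InfiniBand.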