From kilian.cavalotti.work at gmail.com Mon Nov 3 06:19:04 2008 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Nehalem Xeons In-Reply-To: References: Message-ID: <490F0858.8010803@gmail.com> Igor Kozin wrote: >> Did you really hold the Nehalem Xeon chips in your hands? They probably >> It would be nice to hear from you some numbers concerning Harpertown vs >> Nehalem performance > > those who know will not be able tell you because of the NDA. NDA for Core i7 ended today. What's the expiration date for Nehalem Xeons' NDA? That should give a pretty good idea of their release date. Unless of course the NDA's expiration date is also under NDA... :) > i think it is fair to say that the major difference is expected in the > memory bandwidth. That's indeed what appears from the different reviews of the desktop version of Nehalem (Core i7), which are poping up all around the Internet today. http://www.theinquirer.net/gb/inquirer/news/2008/11/03/core-i7-reviews-counting Cheers, -- Kilian From deadline at eadline.org Wed Nov 5 05:58:34 2008 From: deadline at eadline.org (Douglas Eadline) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] File Systems O'Plenty on ClusterMonkey Message-ID: <48084.192.168.1.213.1225893514.squirrel@mail.eadline.org> I wanted to let everyone know that we recently posted part three of Jeff Layton's File Systems O'Plenty series. This was a huge undertaking by Jeff. And, as far as I know the most comprehensive overview of Parallel File Systems to date. Indeed, the summary table at the end of part three is worth the price of admission alone! Part One: The Basics, Taxonomy and NFS http://www.clustermonkey.net//content/view/220 Part Two: NAS, AoE, iSCSI, and more! http://www.clustermonkey.net//content/view/233 Part Three: Object Based Storage http://www.clustermonkey.net//content/view/235 -- Doug From deadline at eadline.org Wed Nov 5 06:49:45 2008 From: deadline at eadline.org (Douglas Eadline) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Update: File Systems O'Plenty on ClusterMonkey Message-ID: <38406.192.168.1.213.1225896585.squirrel@mail.eadline.org> UPDATE: There is an issue with the next page links in Part One. I'm investigating. -- Doug I wanted to let everyone know that we recently posted part three of Jeff Layton's File Systems O'Plenty series. This was a huge undertaking by Jeff. And, as far as I know the most comprehensive overview of Parallel File Systems to date. Indeed, the summary table at the end of part three is worth the price of admission alone! Part One: The Basics, Taxonomy and NFS http://www.clustermonkey.net//content/view/220/32/ Part Two: NAS, AoE, iSCSI, and more! http://www.clustermonkey.net//content/view/233 Part Three: Object Based Storage http://www.clustermonkey.net//content/view/235 -- Doug -- Doug From mfatica at gmail.com Wed Nov 5 11:57:18 2008 From: mfatica at gmail.com (Massimiliano Fatica) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Two HPC engineer positions at NVIDIA Message-ID: <8e6393ac0811051157q3ed57c7fg6e15466f7aa2799e@mail.gmail.com> I have two positions open in my group at NVIDIA for an HPC engineer. Requirements for the senior position: The ideal candidate should be an expert in the field of high performance computer and have experience in one or more "professional" vertical markets such as CFD, computational chemistry, computational finance, oil and gas, etc. Responsibilities will also include demos, benchmarking, building cluster middleware. 
REQUIREMENTS: - Programming experience in C or Fortan is required - Knowledge of GPU Computing ( CUDA in specific) a plus - Linux as main O/S, familiarity with cluster deployment - Experience and good knowledge of parallel programming ( MPI, OpenMP) and HPC - Strong analytical and mathematical skills - MS required, PhD helpful Requirements for the junior position: The ideal candidate should have experience in the field of high performance computer and have knowledge in one or more "professional" vertical markets such as CFD, computational chemistry, computational finance, oil and gas, etc. Responsibilities will also include demos, benchmarking, building cluster middleware. REQUIREMENTS: - Programming experience in C or Fortan is required - Knowledge of GPU Computing ( CUDA in specific) a plus - Linux as main O/S, familiarity with cluster deployment - Experience and good knowledge of parallel programming ( MPI, OpenMP) and HPC - BS required, MS helpful If you are interested, please send me your resume (asci or pdf, no Word documents). I will also be at SC08, if you want to know more about the positions come to the NVIDIA booth and look for me. Massimiliano From deadline at eadline.org Thu Nov 6 05:41:51 2008 From: deadline at eadline.org (Douglas Eadline) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] 10th Annual Beowulf Bash: Announcement and sponsorship opportunity In-Reply-To: References: Message-ID: <49135.192.168.1.213.1225978911.squirrel@mail.eadline.org> Please see the professionally produced web page (i.e. Don or I had nothing to do with it) http://xandmarketing.com/beobash/ -- Doug > > Subject: 10th Annual Beowulf Bash: Announcement and sponsorship > opportunity > > > Tenth Annual Beowulf Bash > And > LECCIBG > > November 17 2008 9pm at Pete's Dueling Piano Bar > > > We are finalizing the plans for this year's combined Beowulf Bash > and LECCIBG > > It will take place, as usual, with the IEEE SC Conference. > This year SC08 is in Austin during the week of Nov 17 2008 > > As in previous years, the attraction is the conversations with > other attendees. We will have drinks and light snacks, with a short > greeting by the sponsors about 10:15pm. > > The venue is in the lively area of Austin near 6th street, very close to > many of the conference hotels and within walking distance of the rest. > > November 17 2008 9-11:30pm > Monday, Immediately after the SC08 Opening Gala > Pete's Dueling Piano Bar > http://www.petesduelingpianobar.com > > If your company (or even you as an individual) would like to help > sponsor the event, please contact me, becker@beowulf.org before early > November. (We can accommodate last-minute sponsorship, but your name > won't be on the printed info.) > > Our "headlining" sponsor list for 2008 is currently: > Penguin/Scyld (organizing sponsor) http://penguincomputing.com > AMD http://amd.com > NVIDIA http://nvidia.com > > Sponsors > - get their name up in lights (Well, if their sign is lighted. Bring a > sign, and we'll do our best to make certain the room is not too dark.) > - are part of the brief greeting in the middle of the party. 
> - have the opportunity for technical, hands-on demos at the Bash > - will have their logos on the beowulf.org BeoBash 2008 web pages > and on the 2008 yearbook page > > > > > > > -- > Donald Becker Never send mail to beowulf-bait@boewulf.org > Penguin Computing / Scyld Software > www.penguincomputing.com www.scyld.com > Annapolis MD and San Francisco CA > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From xclski at yahoo.com Thu Nov 6 08:24:21 2008 From: xclski at yahoo.com (Ellis Wilson) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] 10th Annual Beowulf Bash: Announcement and sponsorship opportunity Message-ID: <222898.48734.qm@web37901.mail.mud.yahoo.com> I love the disclaimer below the bro: "Professional model. Do not attempt this activity on your own." Ellis Douglas Eadline wrote: > Please see the professionally produced web page > (i.e. Don or I had nothing to do with it) > > http://xandmarketing.com/beobash/ > > -- > Doug > >> Subject: 10th Annual Beowulf Bash: Announcement and sponsorship >> opportunity >> >> >> Tenth Annual Beowulf Bash >> And >> LECCIBG >> >> November 17 2008 9pm at Pete's Dueling Piano Bar >> >> >> We are finalizing the plans for this year's combined Beowulf Bash >> and LECCIBG >> >> It will take place, as usual, with the IEEE SC Conference. >> This year SC08 is in Austin during the week of Nov 17 2008 >> >> As in previous years, the attraction is the conversations with >> other attendees. We will have drinks and light snacks, with a short >> greeting by the sponsors about 10:15pm. >> >> The venue is in the lively area of Austin near 6th street, very close to >> many of the conference hotels and within walking distance of the rest. >> >> November 17 2008 9-11:30pm >> Monday, Immediately after the SC08 Opening Gala >> Pete's Dueling Piano Bar >> http://www.petesduelingpianobar.com >> >> If your company (or even you as an individual) would like to help >> sponsor the event, please contact me, becker@beowulf.org before early >> November. (We can accommodate last-minute sponsorship, but your name >> won't be on the printed info.) >> >> Our "headlining" sponsor list for 2008 is currently: >> Penguin/Scyld (organizing sponsor) http://penguincomputing.com >> AMD http://amd.com >> NVIDIA http://nvidia.com >> >> Sponsors >> - get their name up in lights (Well, if their sign is lighted. Bring a >> sign, and we'll do our best to make certain the room is not too dark.) >> - are part of the brief greeting in the middle of the party. 
>> - have the opportunity for technical, hands-on demos at the Bash >> - will have their logos on the beowulf.org BeoBash 2008 web pages >> and on the 2008 yearbook page >> >> >> >> >> >> >> -- >> Donald Becker Never send mail to beowulf-bait@boewulf.org >> Penguin Computing / Scyld Software >> www.penguincomputing.com www.scyld.com >> Annapolis MD and San Francisco CA >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > From thpierce at gmail.com Tue Nov 4 11:26:22 2008 From: thpierce at gmail.com (Tom Pierce) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] IBM memo - Lazy Linux: 11 secrets for lazy cluster admins Message-ID: <25e9e5ad0811041126o59e20eafoe0f4b243567111c4@mail.gmail.com> Dear Cluster Admins, This is an interesting 20 pages. http://www.ibm.com/developerworks/linux/library/l-11sysadtips/index.html?ca=drs- It was published 22 Oct 2008 And it has a link to the earlier version from July 2008 Lazy Linux: 10 essential tricks for admins They like xCAT, but they make interesting arguements for it. ----------------------- Thanks Tom From rreis at aero.ist.utl.pt Thu Nov 6 13:48:46 2008 From: rreis at aero.ist.utl.pt (Ricardo Reis) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] why the price diference ?! Message-ID: Heck! After reading some posts here and going to cluster monkey I really got seduced by the SDR infiniband idea. So, next cluster extension, thats what I'm gonna get. Except... Mellanox MHES14-XTC Infinihost III Lx, Single Port 4x Infiniband, PCIe4x buy at colfax : 125 USD roughly, now, 98 eur ask for Mellanox Portugal (or europe) for a quote and you get 488 EUROS!!! sorry to polute the list but I got take this out of my chest... Ricardo Reis 'Non Serviam' PhD student @ Lasef Computational Fluid Dynamics, High Performance Computing, Turbulence http://www.lasef.ist.utl.pt & Cultural Instigator @ R?dio Zero http://www.radiozero.pt http://www.flickr.com/photos/rreis/ From gerry.creager at tamu.edu Fri Nov 7 09:50:59 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] RRDtools graphs of temp from IPMI Message-ID: <49148003.1090703@tamu.edu> We're collecting a CSV dataset of node temps from IPMI for our Dell 1950s. Now we're trying to plot this in RRDtools. Anyone got a good script already cast to do this or do we need to start reinventing the annular transportation device? Thanks, gerry -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From bernard at vanhpc.org Fri Nov 7 13:54:14 2008 From: bernard at vanhpc.org (Bernard Li) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] RRDtools graphs of temp from IPMI In-Reply-To: <49148003.1090703@tamu.edu> References: <49148003.1090703@tamu.edu> Message-ID: Hi Gerry: On Fri, Nov 7, 2008 at 9:50 AM, Gerry Creager wrote: > We're collecting a CSV dataset of node temps from IPMI for our Dell 1950s. > Now we're trying to plot this in RRDtools. Anyone got a good script > already cast to do this or do we need to start reinventing the annular > transportation device? Have you considered using software like Cacti or Ganglia ontop of rrdtool? It should be fairly easy to add user-defined metrics to those systems for graphing. 
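For anyone who would rather stay with plain rrdtool than bring in Cacti or Ganglia, a minimal sketch of the CSV-to-graph path looks something like the following (Python driving the rrdtool command line; the file names, the 300-second step and the single "temp" data source are assumptions rather than details from Gerry's setup):

#!/usr/bin/env python
# Feed a CSV of "epoch_timestamp,temperature" rows into an RRD and graph it.
# Assumes rrdtool is on $PATH, samples arrive roughly every 300 seconds and
# the CSV is sorted by timestamp (rrdtool rejects out-of-order updates).
import csv, os, subprocess

RRD, CSV, PNG = "node-temp.rrd", "node-temp.csv", "node-temp.png"

def run(*args):
    subprocess.check_call(args)

if not os.path.exists(RRD):
    # One GAUGE data source, two weeks of 5-minute averages.
    run("rrdtool", "create", RRD, "--step", "300",
        "DS:temp:GAUGE:600:U:U", "RRA:AVERAGE:0.5:1:4032")

for ts, temp in csv.reader(open(CSV)):
    run("rrdtool", "update", RRD, "%s:%s" % (ts, temp))

run("rrdtool", "graph", PNG, "--start", "end-1d",
    "DEF:t=%s:temp:AVERAGE" % RRD, "LINE2:t#ff0000:node temp (C)")

One RRD per node (or one data source per node in a single RRD) scales this up; the graphing tools Bernard mentions below essentially automate the create/update/graph cycle shown here.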
Cheers, Bernard From landman at scalableinformatics.com Fri Nov 7 18:21:41 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] RRDtools graphs of temp from IPMI In-Reply-To: <49148003.1090703@tamu.edu> References: <49148003.1090703@tamu.edu> Message-ID: <4914F7B5.9040402@scalableinformatics.com> Gerry Creager wrote: > We're collecting a CSV dataset of node temps from IPMI for our Dell > 1950s. Now we're trying to plot this in RRDtools. Anyone got a good > script already cast to do this or do we need to start reinventing the > annular transportation device? > > Thanks, gerry probably a little cut and paste from here ... http://search.cpan.org/~mschilli/RRDTool-OO-0.22/lib/RRDTool/OO.pm You can also parse the CSV here, so it should be fairly simple to create what you need. You can see code which generates RRDgraphs from here http://mailgraph.schweikert.ch/pub/mailgraph-1.14.tar.gz . -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From csamuel at vpac.org Sat Nov 8 16:11:35 2008 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] RRDtools graphs of temp from IPMI In-Reply-To: <49148003.1090703@tamu.edu> Message-ID: <1450691330.2243241226189495951.JavaMail.root@mail.vpac.org> ----- "Gerry Creager" wrote: > We're collecting a CSV dataset of node temps from IPMI > for our Dell 1950s. On our SuperMicro boxes we just inject the system and both CPU temperatures into Ganglia from each node. cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Sat Nov 8 16:17:06 2008 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] RRDtools graphs of temp from IPMI In-Reply-To: <1426527578.2243271226189768265.JavaMail.root@mail.vpac.org> Message-ID: <1856588083.2243291226189826105.JavaMail.root@mail.vpac.org> ----- "Gerry Creager" wrote: > Anyone got a good script already cast to do this or > do we need to start reinventing the annular > transportation device? I forgot to attach our ATD! ;-) The reason it worries about high load is that we used to see processes hang trying to read from the IPMI device, but haven't seen that with more recent kernels.. cheers, Chris

#!/bin/sh
# Only run on nodes that actually have an IPMI device.
[ -e /dev/ipmi0 ] || exit 0
# Skip the probe if the 15-minute load average is too high (see note above).
load=$(cat /proc/loadavg | awk '{print $3}' | awk -F. '{print $1}')
if [ $load -gt 16 ]
then
    exit 0
fi
# Grab every "Temp" sensor as name:value and push each reading into Ganglia.
TEMPS=$(/usr/bin/ipmitool sensor | grep Temp | awk '{print $1 ":" $4 }')
for i in $TEMPS; do
    type=`echo $i | awk -F: '{print $1}'`
    temp=`echo $i | awk -F: '{print $2}'`
    gmetric -n "$type Temp" -v "$temp" -t float -u Celsius
done

-- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O.
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From alsimao at gmail.com Sat Nov 8 12:18:46 2008 From: alsimao at gmail.com (Alcides Simao) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: Beowulf Digest, Vol 57, Issue 6 In-Reply-To: <200811082000.mA8K08l6027459@bluewest.scyld.com> References: <200811082000.mA8K08l6027459@bluewest.scyld.com> Message-ID: <7be8c36b0811081218q58a8b673qd29995b371843776@mail.gmail.com> Hello all! I'm a beginner in beowulfing. I presently work in a lab where beowulfing is starting to be our way to make quantum-mechanical calculations of higher order - creepy, buzzing, wierd stuff. Well, I have no experience paralelizing, nor how to make a cluster. So I figure I could start by an easy aproach : build a 2 pc cluster. Can you please help me? I'm a devoted member of The Church of Emacs, and of Saint iGNUtius. Thanks 4 your time! Best, 'Newbie' Alcides -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081108/187536eb/attachment.html From gdjacobs at gmail.com Sat Nov 8 18:28:42 2008 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: Beowulf Digest, Vol 57, Issue 6 In-Reply-To: <7be8c36b0811081218q58a8b673qd29995b371843776@mail.gmail.com> References: <200811082000.mA8K08l6027459@bluewest.scyld.com> <7be8c36b0811081218q58a8b673qd29995b371843776@mail.gmail.com> Message-ID: <49164ADA.6090705@gmail.com> Alcides Simao wrote: > Hello all! > > I'm a beginner in beowulfing. I presently work in a lab where beowulfing > is starting to be our way to make quantum-mechanical calculations of > higher order - creepy, buzzing, wierd stuff. > > Well, I have no experience paralelizing, nor how to make a cluster. So I > figure I could start by an easy aproach : build a 2 pc cluster. Can you > please help me? I'm a devoted member of The Church of Emacs, and of > Saint iGNUtius. > > Thanks 4 your time! > > Best, > > 'Newbie' Alcides This is covered every couple of months, so searching back in the list for keywords and phrases like "beginner" and "getting started" is a not bad idea. In summary, though, a Beowulf is a network of commodity computers designed to perform operations cooperatively. Take pile o' pc's, install a free operating system (historically Linux) on each, add the networking stuff, start programming. Everything in addition is an engineering consideration to solve or ameliorate issues as the cluster gets larger and more sophisticated. Check out these links: http://www.clustermonkey.net//content/view/41/33/ http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php -- Geoffrey D. Jacobs From gerry.creager at tamu.edu Sat Nov 8 21:14:10 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] RRDtools graphs of temp from IPMI In-Reply-To: <4914F7B5.9040402@scalableinformatics.com> References: <49148003.1090703@tamu.edu> <4914F7B5.9040402@scalableinformatics.com> Message-ID: <491671A2.1080101@tamu.edu> To Bernard, Chris and Joe, especially, Thanks! Now, for the flame-bait. Bernard suggests cacti and/or ganglia to handle this. Our group have heard some mutterings that ganglia is a "chatty" applicaiton and could cause some potential hits on or 1 Gbe interconnect fabric. A little background on our current implementation: 126 dual-quad core Xeon Dell 1950's interconnected with gigabit ethernet. No, it's not the world's best MPI machine, but it should... 
and does... perform admirably for throughput applications where most jobs can be run on a node (or two) but which don't use MPI as much as, e.g., OpenMP, or in some cases, even run on a single core but use all the RAM. So, we're worried a bit about having everything talk on the same gigabit backplane, hence, so far, no ganglia. What are the issues I might want to worry about in this regard, especially as we expand this cluster to more nodes (potentially going to 2k cores, or, essentially doubling? Thanks, gerry Joe Landman wrote: > Gerry Creager wrote: >> We're collecting a CSV dataset of node temps from IPMI for our Dell >> 1950s. Now we're trying to plot this in RRDtools. Anyone got a good >> script already cast to do this or do we need to start reinventing the >> annular transportation device? >> >> Thanks, gerry > > probably a little cut and paste from here ... > http://search.cpan.org/~mschilli/RRDTool-OO-0.22/lib/RRDTool/OO.pm > > You can also parse the CSV here, so it should be fairly simple to create > what you need. You can see code which generates RRDgraphs from here > http://mailgraph.schweikert.ch/pub/mailgraph-1.14.tar.gz . > -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From cwest at astro.umass.edu Sat Nov 8 21:45:18 2008 From: cwest at astro.umass.edu (Craig West) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] RRDtools graphs of temp from IPMI In-Reply-To: <491671A2.1080101@tamu.edu> References: <49148003.1090703@tamu.edu> <4914F7B5.9040402@scalableinformatics.com> <491671A2.1080101@tamu.edu> Message-ID: <491678EE.6040807@astro.umass.edu> Gerry, Like others, I too use ganglia - and have a custom script which reports cpu temps (and fan speeds) for the nodes. However, I changed the default method of communication for ganglia (multicast) to reduce the chatter. I use a unicast setup, where each node reports directly to the monitoring server - which is a dedicated machine for monitoring all the systems - and performing other tasks (dhcp, ntp, imaging, etc) Each node is using less than 1KB/sec to transmit all the ganglia information, including my extra metrics. For the useful recording information you get from this data its worth the rather small network chatter. You can tune the metrics further, turn off the ones you don't want, or have them report less often. I'd suggest installing it, if you still think it is chatty, then remove it and look for another option. I find it useful in that you can see when a node died, what the load on the node was when it crashed, what the network traffic is, etc... I also use cacti - but only for the head servers, switches, etc. I find it has too much over head for the nodes. It is however useful in that it can send emails to alert you to problems, and allows for graphing of SNMP devices. Craig. Gerry Creager wrote: > Now, for the flame-bait. Bernard suggests cacti and/or ganglia to > handle this. Our group have heard some mutterings that ganglia is a > "chatty" applicaiton and could cause some potential hits on or 1 Gbe > interconnect fabric. > > A little background on our current implementation: 126 dual-quad core > Xeon Dell 1950's interconnected with gigabit ethernet. No, it's not > the world's best MPI machine, but it should... and does... 
perform > admirably for throughput applications where most jobs can be run on a > node (or two) but which don't use MPI as much as, e.g., OpenMP, or in > some cases, even run on a single core but use all the RAM. > > So, we're worried a bit about having everything talk on the same > gigabit backplane, hence, so far, no ganglia. > > What are the issues I might want to worry about in this regard, > especially as we expand this cluster to more nodes (potentially going > to 2k cores, or, essentially doubling? From prentice at ias.edu Mon Nov 10 05:44:04 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] RRDtools graphs of temp from IPMI In-Reply-To: <491671A2.1080101@tamu.edu> References: <49148003.1090703@tamu.edu> <4914F7B5.9040402@scalableinformatics.com> <491671A2.1080101@tamu.edu> Message-ID: <49183AA4.6030808@ias.edu> Gerry Creager wrote: > Now, for the flame-bait. Bernard suggests cacti and/or ganglia to > handle this. Our group have heard some mutterings that ganglia is a > "chatty" applicaiton and could cause some potential hits on or 1 Gbe > interconnect fabric. The noisy ganglia issue was discussed a few months ago. The issue isn't network traffic as much as it is operating system noise, or jitter, and that noise becomes more of an issue as the cluster size grows. Search the archives for "ganglia", "noise" or "jitter". You can also google for "operating system noise" or "operating system jitter" to find some academic papers on the topic. The links to a few were posted here as part of the aforementioned discussion on ganglia. The advantage of using IPMI is that it is hardware based, so there shouldn't be any OS noise to slow down the MPI calculations. -- Prentice From d.love at liverpool.ac.uk Tue Nov 11 06:26:51 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: Active directory with Linux References: <87vdvi6tx0.fsf@liv.ac.uk> <269510977.1901111225139194459.JavaMail.root@mail.vpac.org> Message-ID: <87d4h2wd78.fsf@liv.ac.uk> Chris Samuel writes: > Well we were told that AD doesn't permit anonymous access. , for example, has instructions for 2000 and 2003 servers. > Bear in mind we're Linux geeks here, not Windows geeks.. ;-) I hope you don't think I'm a Windows geek! Just passing on what I know from having had to tangle with AD admin previously and having to get things working here eventually post-eDirectory; I guess plenty of us are in similar boats with this. >> or the `machine' account. The latter is what you get from >> `joining the domain' (e.g. with Samba) > > Whilst I couldn't be certain I suspect their security > policy would have classed that as just being an implementation > of the former, and it too would have been locked out after > N failed attempts and hence locked out all users. It would be the same on Windows boxes, surely, allowing a DoS attack. > We got the impression that AD didn't permit them to > make an exception to this policy either.. :-( I think you can control the lockout policy with fairly fine granularity, and I think it's actually off by default, but don't have a system to check. I guess it's documented OTW somewhere. -- IBM^WMicrosoft is not a necessary evil; IBM^WMicrosoft is not necessary. 
-- Ted Nelson updated From d.love at liverpool.ac.uk Tue Nov 11 06:31:50 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: Active directory with Linux References: <2120044993.1785371224805118397.JavaMail.root@mail.vpac.org> <4905C611.1090201@ias.edu> Message-ID: <87bpwmwcyx.fsf@liv.ac.uk> Prentice Bisbal writes: > I looked at implementing Fedora Directory Server a few months ago to > provide LDAP services to our Linux systems and synchronize passwords > with our AD servers. For authentication, you should use an authentication protocol, i.e. Kerberos -- what AD uses (not that I'd want to encourage use of AD if you have any choice in the matter). That actually gives you single sign-on -- e.g. for interacting with the directory server itself or, potentially, resources used by your beowulf jobs -- too. In comparison with the case at issue, it also means you store keys, not passwords, although having the key is similar to knowing the password. I think LDAP vendors do people a disservice by pushing abuse of a directory service as an authentication service, and there's a lot of confusion about it. Put your account data in LDAP (which may be better than, say, NIS, even within a cluster), and authenticate with Kerberos. > To do this, it must store the user passwords in > cleartest in the replication logs, where they are in LDIF format, and > clearly labelled as clear-text passwords. Even if you shorten the > retention time of the replication logs, If you're going to do replication, you have to keep the replicated data secure in transit, and I'd always expect that to use TLS or similar. If the logs are insecure on the server, I'd worry about the directory service independent of replication. (Login passwords may not be the only sensitive data stored in the directory, and for various reasons it's not clear that encrypting the directory's database is appropriate.) > I decided this was completely unsafe and abandoned the project. Not long > after (the next day, in fact) Slashdot reported that people had been > hack into Redhat/Fedora Directory server. For what it's worth, SDS is (now) a different product, presumably with a different security regime, and some crack reported in slashdot probably isn't a good basis for choosing a directory server. It's probably beside the point for an authentication service, though. [I hope that didn't come across as unintentionally obnoxious.] From d.love at liverpool.ac.uk Tue Nov 11 06:35:24 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: RRDtools graphs of temp from IPMI References: <49148003.1090703@tamu.edu> Message-ID: <874p2ewcsz.fsf@liv.ac.uk> "Bernard Li" writes: > Have you considered using software like Cacti or Ganglia ontop of > rrdtool? It should be fairly easy to add user-defined metrics to > those systems for graphing. In case it's helpful, there's a ganglia example of what I used to do with in-band ipmitool at , but I'm currently using the following out-of-band with FreeIPMI. It's site-specific due to the IPMI hosts, at least (ipmi, head name, and the -W kludges for Sun IPMI bugs). -------------- next part -------------- A non-text attachment was scrubbed... 
Name: freeipmi-gmetric-temp Type: application/x-sh Size: 1439 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20081111/83f3ec93/freeipmi-gmetric-temp.sh From d.love at liverpool.ac.uk Tue Nov 11 06:41:06 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: RRDtools graphs of temp from IPMI References: <1426527578.2243271226189768265.JavaMail.root@mail.vpac.org> <1856588083.2243291226189826105.JavaMail.root@mail.vpac.org> Message-ID: <87wsfauxz1.fsf@liv.ac.uk> Chris Samuel writes: > The reason it worries about high load is that we > used to see processes hang trying to read from the > IPMI device, but haven't seen that with more recent > kernels.. How recent? We've seen similar trouble on Supermicros with a SuSE 10.3 (2.6.22.17) kernel, hence doing it out-of-band, as I just posted. (Sorry I basically duplicated the in-band one of yours.) It involves the kipmi0 kernel thread going CPU-bound and sometimes getting a huge load average from failed ipmitool instances hanging around. By the way, the IPMI temperature sensors don't work on our H8DCE-HTE/AOC-IPMI20-E Supermicros, although lmsensors does work. Does anyone know a fix for that? From d.love at liverpool.ac.uk Tue Nov 11 06:51:46 2008 From: d.love at liverpool.ac.uk (Dave Love) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: RRDtools graphs of temp from IPMI In-Reply-To: <49183AA4.6030808@ias.edu> (Prentice Bisbal's message of "Mon, 10 Nov 2008 08:44:04 -0500") References: <49148003.1090703@tamu.edu> <4914F7B5.9040402@scalableinformatics.com> <491671A2.1080101@tamu.edu> <49183AA4.6030808@ias.edu> Message-ID: <87ljvquxh9.fsf@liv.ac.uk> Prentice Bisbal writes: > The advantage of using IPMI is that it is hardware based, so there > shouldn't be any OS noise to slow down the MPI calculations. Yes iff you do it out-of-band, which is probably less convenient, and is significantly slower on our systems, at least. Actually, is it clear that interacting with the firmware generally won't cause any jitter? I could imagine it might. Our vendor installed a daemon doing in-band IPMI sensor probes which I didn't initially know about, since it wasn't sending the ganglia metrics correctly anyhow. I don't know whether that means they disagree with the effect on MPI performance or what. From apittman at concurrent-thinking.com Tue Nov 11 06:53:37 2008 From: apittman at concurrent-thinking.com (Ashley Pittman) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: RRDtools graphs of temp from IPMI In-Reply-To: <87wsfauxz1.fsf@liv.ac.uk> References: <1426527578.2243271226189768265.JavaMail.root@mail.vpac.org> <1856588083.2243291226189826105.JavaMail.root@mail.vpac.org> <87wsfauxz1.fsf@liv.ac.uk> Message-ID: <1226415217.888.397.camel@bruce.priv.wark.uk.streamline-computing.com> On Tue, 2008-11-11 at 14:41 +0000, Dave Love wrote: > Chris Samuel writes: > > > The reason it worries about high load is that we > > used to see processes hang trying to read from the > > IPMI device, but haven't seen that with more recent > > kernels.. > > How recent? We've seen similar trouble on Supermicros with a SuSE 10.3 > (2.6.22.17) kernel, hence doing it out-of-band, as I just posted. > (Sorry I basically duplicated the in-band one of yours.) It involves > the kipmi0 kernel thread going CPU-bound and sometimes getting a huge > load average from failed ipmitool instances hanging around. 
Even when it does work, running "ipmitool sensor" in-band can often take 30 seconds to complete, which isn't great for performance. Ashley, From csamuel at vpac.org Tue Nov 11 12:03:26 2008 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: Active directory with Linux In-Reply-To: <87d4h2wd78.fsf@liv.ac.uk> Message-ID: <29380078.621226433798838.JavaMail.csamuel@ubuntu> ----- "Dave Love" wrote: > Chris Samuel writes: > > > Well we were told that AD doesn't permit anonymous access. > > , for > example, has instructions for 2000 and 2003 servers. Thanks! That's useful for future reference but I don't know if the admins for their AD servers would have felt comfortable making (or been permitted to make) those changes. > > Bear in mind we're Linux geeks here, not Windows geeks.. ;-) > > I hope you don't think I'm a Windows geek! Grin, you're more of one than we are, we've managed to escape mostly unscathed from it.. ;-) cheers! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From csamuel at vpac.org Tue Nov 11 12:08:21 2008 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: RRDtools graphs of temp from IPMI In-Reply-To: <87wsfauxz1.fsf@liv.ac.uk> Message-ID: <23088134.641226434073945.JavaMail.csamuel@ubuntu> ----- "Dave Love" wrote: > Chris Samuel writes: > > > The reason it worries about high load is that we > > used to see processes hang trying to read from the > > IPMI device, but haven't seen that with more recent > > kernels.. > > How recent? We've seen similar trouble on Supermicros with a SuSE > 10.3 (2.6.22.17) kernel, hence doing it out-of-band, as I just posted. They seemed to go away somewhere around 2.6.27, I believe. > (Sorry I basically duplicated the in-band one of yours.) It involves > the kipmi0 kernel thread going CPU-bound and sometimes getting a huge > load average from failed ipmitool instances hanging around. Sounds very much like what we were seeing on ours! cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From lindahl at pbm.com Tue Nov 11 12:59:22 2008 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: RRDtools graphs of temp from IPMI In-Reply-To: <87wsfauxz1.fsf@liv.ac.uk> References: <1426527578.2243271226189768265.JavaMail.root@mail.vpac.org> <1856588083.2243291226189826105.JavaMail.root@mail.vpac.org> <87wsfauxz1.fsf@liv.ac.uk> Message-ID: <20081111205922.GB31962@bx9> On Tue, Nov 11, 2008 at 02:41:06PM +0000, Dave Love wrote: > By the way, the IPMI temperature sensors don't work on our > H8DCE-HTE/AOC-IPMI20-E Supermicros, although lmsensors does work. Does > anyone know a fix for that? We recently had to update our SIMSO firmware to avoid spurious events; perhaps that will help with your problem?
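Since Dave's out-of-band FreeIPMI script was scrubbed by the list archive, here is a rough sketch of the same general idea -- polling each BMC over the LAN with ipmitool so nothing heavy runs on the compute nodes -- rather than his actual script; the BMC host names, the credentials and the simple "Temp" string match are all assumptions:

#!/usr/bin/env python
# Out-of-band temperature collection: query each BMC over the LAN, so nothing
# runs on the compute node itself, then hand the readings to Ganglia via
# gmetric. BMC names, credentials and the "Temp" match are made up here.
import subprocess

BMCS = ["node%02d-ipmi" % i for i in range(1, 5)]   # hypothetical BMC names
USER, PASSWD = "admin", "changeme"                  # hypothetical credentials

for bmc in BMCS:
    out = subprocess.Popen(
        ["ipmitool", "-I", "lanplus", "-H", bmc, "-U", USER, "-P", PASSWD,
         "sensor"], stdout=subprocess.PIPE).communicate()[0]
    for line in out.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2 or "Temp" not in fields[0]:
            continue
        name, value = fields[0], fields[1]
        if value in ("na", ""):     # unreadable sensor
            continue
        subprocess.call(["gmetric", "-n", "%s %s" % (bmc, name),
                         "-v", value, "-t", "float", "-u", "Celsius"])

Polling like this keeps kipmi0 and the in-band /dev/ipmi0 path out of the picture entirely, at the cost of having to manage BMC credentials centrally and of the slower LAN-side interface noted above.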
-- g From dnlombar at ichips.intel.com Tue Nov 11 15:39:02 2008 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: RRDtools graphs of temp from IPMI In-Reply-To: <87ljvquxh9.fsf@liv.ac.uk> References: <49148003.1090703@tamu.edu> <4914F7B5.9040402@scalableinformatics.com> <491671A2.1080101@tamu.edu> <49183AA4.6030808@ias.edu> <87ljvquxh9.fsf@liv.ac.uk> Message-ID: <20081111233902.GB31883@nlxdcldnl2.cl.intel.com> On Tue, Nov 11, 2008 at 06:51:46AM -0800, Dave Love wrote: > Prentice Bisbal writes: > > > The advantage of using IPMI is that it is hardware based, so there > > shouldn't be any OS noise to slow down the MPI calculations. > > Yes iff you do it out-of-band, which is probably less convenient, and is > significantly slower on our systems, at least. Actually, is it clear > that interacting with the firmware generally won't cause any jitter? I > could imagine it might. IPMI lives on the BMC. That's not to say there's zero interaction, but the impact of OOB IPMI should really be very limited. > Our vendor installed a daemon doing in-band IPMI sensor probes which I > didn't initially know about, since it wasn't sending the ganglia metrics > correctly anyhow. I don't know whether that means they disagree with > the effect on MPI performance or what. It's a preference issue; there are reasonable arguments to be made in either direction. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From alsimao at gmail.com Tue Nov 11 14:01:28 2008 From: alsimao at gmail.com (Alcides Simao) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: Beowulf Digest, Vol 57, Issue 9 In-Reply-To: <200811112000.mABK0DFj030674@bluewest.scyld.com> References: <200811112000.mABK0DFj030674@bluewest.scyld.com> Message-ID: <7be8c36b0811111401v6824c562o1c4698650eeee71@mail.gmail.com> Hello all! I've heard that there are some motherboards, if I can recall I believe it's Intel, that make use of a Yukon driver, that happens not to work well under Linux, and hence, it is a serious problem for Beowulfing. Can someone develop this? Best, Alcides PS: Can someone tell me how to start training on CUDA programming? -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081111/cb46bcab/attachment.html From prentice at ias.edu Wed Nov 12 07:37:27 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: Active directory with Linux In-Reply-To: <87bpwmwcyx.fsf@liv.ac.uk> References: <2120044993.1785371224805118397.JavaMail.root@mail.vpac.org> <4905C611.1090201@ias.edu> <87bpwmwcyx.fsf@liv.ac.uk> Message-ID: <491AF837.5030008@ias.edu> Dave Love wrote: > Prentice Bisbal writes: > >> I looked at implementing Fedora Directory Server a few months ago to >> provide LDAP services to our Linux systems and synchronize passwords >> with our AD servers. > > For authentication, you should use an authentication protocol, > i.e. Kerberos -- what AD uses (not that I'd want to encourage use of AD > if you have any choice in the matter). That actually gives you single > sign-on -- e.g. for interacting with the directory server itself or, > potentially, resources used by your beowulf jobs -- too. In comparison > with the case at issue, it also means you store keys, not passwords, > although having the key is similar to knowing the password. 
I think > LDAP vendors do people a disservice by pushing abuse of a directory > service as an authentication service, and there's a lot of confusion > about it. Put your account data in LDAP (which may be better than, say, > NIS, even within a cluster), and authenticate with Kerberos. I agree. I'm a big fan of Kerberos. > >> To do this, it must store the user passwords in >> cleartest in the replication logs, where they are in LDIF format, and >> clearly labelled as clear-text passwords. Even if you shorten the >> retention time of the replication logs, > > If you're going to do replication, you have to keep the replicated data > secure in transit, and I'd always expect that to use TLS or similar. If > the logs are insecure on the server, I'd worry about the directory > service independent of replication. (Login passwords may not be the > only sensitive data stored in the directory, and for various reasons > it's not clear that encrypting the directory's database is appropriate.) I'm pretty sure that the replication was done over TLS. The cleartext passwords are only needed when replicating with AD synchronization, if I recall correctly, since AD uses a different password hashing algorithm. I've found that Microsoft offers (for free!) a pam module (pam_sso, or something like that) that will do AD hashing of a password on the client, so a cleartext password is never sent to the AD server. Much better solution, IMHO. Not sure why RHDS doesn't do this themselves, unless software patents on MS's hashing algorithm prevents them from doing so. >> I decided this was completely unsafe and abandoned the project. Not long >> after (the next day, in fact) Slashdot reported that people had been >> hack into Redhat/Fedora Directory server. > > For what it's worth, SDS is (now) a different product, presumably with a > different security regime, and some crack reported in slashdot probably > isn't a good basis for choosing a directory server. It's probably > beside the point for an authentication service, though. Wasn't sure - I think I mentioned that in my e-mail. Thanks for the clarification/removal of doubt. -- Prentice From gus at ldeo.columbia.edu Wed Nov 12 08:55:43 2008 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed Nov 25 01:07:57 2009 Subject: [Beowulf] Re: Beowulf Digest, Vol 57, Issue 9 In-Reply-To: <7be8c36b0811111401v6824c562o1c4698650eeee71@mail.gmail.com> References: <200811112000.mABK0DFj030674@bluewest.scyld.com> <7be8c36b0811111401v6824c562o1c4698650eeee71@mail.gmail.com> Message-ID: <491B0A8F.7030709@ldeo.columbia.edu> Hello Alcides and list 1) I don't know about motherboards and Yukon, but I had problems on Linux with the D-Link DGE-530T rev. 11 NIC (GigE), which uses the sklin98 driver, also from Marvell. The NICs never worked, even when I built the driver using the Marvell source code. I've read several postings out there reporting problems with GigE drivers and interfaces. You may find some information searching for the appropriate keywords on the Beowulf and on the ROCKS Cluster list archives: http://www.beowulf.org/archive/index.html http://marc.info/?l=npaci-rocks-discussion 2) There are plenty of free CUDA programming materials on the NVidia site: http://www.nvidia.com/object/cuda_home.html# http://www.nvidia.com/object/cuda_learn.html http://www.nvidia.com/object/cuda_develop.html CUDA is not as friendly as other parallel APIs, like MPI and OpenMP, though. ** I hope this helps. 
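For a very first hands-on step, a complete CUDA example can be surprisingly small if driven from Python with the PyCUDA wrapper -- this is only an illustration of the programming model, not something taken from the NVIDIA materials above, and it assumes a reasonably recent PyCUDA plus a working CUDA toolkit; the vector length of 400 is arbitrary:

# Element-wise multiply of two vectors on the GPU. The kernel is ordinary
# CUDA C; PyCUDA compiles it at runtime and handles the host/device copies.
import numpy
import pycuda.autoinit               # creates a context on the first GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)

# drv.In/drv.Out copy the numpy arrays to and from the device automatically.
multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1))
assert numpy.allclose(dest, a * b)

The kernel string is ordinary CUDA C, so whatever is learned this way carries straight over to the plain C toolchain that the NVIDIA tutorials use.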
Gus Correa -- --------------------------------------------------------------------- Gustavo J. Ponce Correa, PhD - Email: gus@ldeo.columbia.edu Lamont-Doherty Earth Observatory - Columbia University P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Alcides Simao wrote: > Hello all! > > I've heard that there are some motherboards, if I can recall I believe > it's Intel, that make use of a Yukon driver, that happens not to work > well under Linux, and hence, it is a serious problem for Beowulfing. > > Can someone develop this? > > Best, > > Alcides > > PS: Can someone tell me how to start training on CUDA programming? > >------------------------------------------------------------------------ > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From hearnsj at googlemail.com Wed Nov 12 09:09:05 2008 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Re: Beowulf Digest, Vol 57, Issue 9 In-Reply-To: <7be8c36b0811111401v6824c562o1c4698650eeee71@mail.gmail.com> References: <200811112000.mABK0DFj030674@bluewest.scyld.com> <7be8c36b0811111401v6824c562o1c4698650eeee71@mail.gmail.com> Message-ID: <9f8092cc0811120909w6e701700p411ece96aacbc1ca@mail.gmail.com> 2008/11/11 Alcides Simao > Hello all! > > I've heard that there are some motherboards, if I can recall I believe it's > Intel, that make use of a Yukon driver, that happens not to work well under > Linux, and hence, it is a serious problem for Beowulfing. > > I think this has been covered on the Beowulf list before. Think seriously about getting hold of a set of separate Intel Pro-1000 network cards if you are going to run MPI over Ethernet. The Intel drivers are well developed, and the cards perform well. By all means run your cluster management and NFS storage over the on-board chipsets, but you may find it wort the extra expense to have separate NICs for the MPI. I agree with the point abotu the Marvell driver - I recall a session I ahd with a system at University of Newcastle. The external connection was to their campus LAN - which was a 100Mbps connection to a Cisco switch. In our lab, the external connection ran just run on a gig E connection to Nortel. We piled gbytes of data up and down it. But connect to a slow LAN - and it stops after 20 minutes. the cause being explicit congestion notification packets. I COULD have spent time updating the driver etc. etc. But I took the road of fitting a PCI-e Intel card and configuring that up with the external IP Address. Worked fine. I hate to say it, but it depends on how much you value your time. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081112/0fa009aa/attachment.html From orion at cora.nwra.com Thu Nov 13 08:50:02 2008 From: orion at cora.nwra.com (Orion Poplawski) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Hardware considerations for running wrf Message-ID: <491C5ABA.2010003@cora.nwra.com> Some folks here are starting to run the WRF model and we may get a couple machines to run it. Can anyone with experience running the WRF model comment on what kind of hardware might be particularly suited for it? 
-- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion@cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From Craig.Tierney at noaa.gov Thu Nov 13 09:48:36 2008 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Hardware considerations for running wrf In-Reply-To: <491C5ABA.2010003@cora.nwra.com> References: <491C5ABA.2010003@cora.nwra.com> Message-ID: <491C6874.9010706@noaa.gov> Orion Poplawski wrote: > Some folks here are starting to run the WRF model and we may get a > couple machines to run it. Can anyone with experience running the WRF > model comment on what kind of hardware might be particularly suited for it? > How big are the domains? How reliably do you need the runs to finish? If you are only running WRF on a few nodes, you can get away with any standard x86_64 node (although I would try and wait for Nehalem) and gigE. If you want to be running with more than 8 nodes, you might start to consider Infiniband for efficiency. WRF isn't that hard to run; the question you should really be asking is: what is the most maintainable system (in my budget) that I can get? A rack of nodes is not a cluster. Do you want to build and configure every part of the cluster yourself, or do you want a vendor to wheel it in, create some user accounts, and off the users go? Once you start providing a cluster to your users, they are going to have expectations of reliability and performance, and those things are going to take effort to maintain. And then you need to make time to educate your users that 100% uptime and 100% of runs completed successfully is not easy (or even possible), and setting expectations will become a necessity. Craig -- Craig Tierney (craig.tierney@noaa.gov) From kus at free.net Thu Nov 13 09:54:35 2008 From: kus at free.net (Mikhail Kuzminsky) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Clos network vs fat tree Message-ID: Sorry, is it correct to say that fat tree topology is equal to *NON-BLOCKING* Clos network w/addition of "uplinks" ? I.e. any non-blocking Clos network w/corresponding addition of uplinks gives fat tree ? I read somewhere that an exact proof of "non-blocking" was given for Clos networks with >= 3 levels. But most popular Infiniband fat trees have only 2 levels. (Yes, I know that "non-blocking" for Clos network isn't "absolute" :-)) Mikhail From csamuel at vpac.org Thu Nov 13 10:33:37 2008 From: csamuel at vpac.org (Chris Samuel) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Idle core on overloaded CPU? In-Reply-To: <4901B095.6060004@aei.mpg.de> Message-ID: <17398798.921226601211834.JavaMail.csamuel@ubuntu> ----- "Carsten Aulbert" wrote: Hi Carsten, > on Intel X3220 CPU based systems (4 physical cores) I came across the > following thing (Debian etch, with vanilla kernel 2.6.25.9): Do you get the same issue with a recent kernel ? cheers, Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O.
Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency From lindahl at pbm.com Thu Nov 13 11:53:37 2008 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Clos network vs fat tree In-Reply-To: References: Message-ID: <20081113195337.GA6352@bx9> On Thu, Nov 13, 2008 at 08:54:35PM +0300, Mikhail Kuzminsky wrote: > I read somewhere that an exact proof of "non-blocking" was given for > Clos networks with >= 3 levels. But most popular Infiniband fat trees have > only 2 levels. Two is three: if you have

  spine     spine
    |  \   /  |
   leaf     leaf
   /  \     /  \
node  node node  node

that's a 3-level network in Clos-speak, because the image you should really imagine is

node   node
   \   /
    leaf
   /    \
spine    spine
   \    /
    leaf
   /    \
node   node

Or, to think about it another way, you have to go through 3 switches max to get from a node to another. This is also sometimes called the network diameter. > (Yes, I know that "non-blocking" for Clos network isn't "absolute" :-)) So useless that it doesn't even help to mention "non-blocking", unless each node only talks to exactly 1 other node. In which case I won't complain if you mention the standard "latency" and "bandwidth" benchmarks ;-) -- greg From serge.fonville at gmail.com Wed Nov 12 12:06:10 2008 From: serge.fonville at gmail.com (Serge Fonville) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Transparant clustering Message-ID: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> Hi, I am new to clustering and beowulf (I only played with Microsoft Clustering Service). I have always thought that it should be possible to build a cluster which works like a single system. So that when I open an SSH session to the cluster I get a connection as normal while in fact I am connecting to the clustered system. I started reading a lot, and it seems as if this can be done with beowulf. I just wonder if the head node would make things more difficult (since that can go down as well). Is this at all possible (using beowulf) and how would I go about configuring this? I know this isn't very clear (I am just exploring), so please ask away. Thanks a very big lot in advance!!! Regards, Serge Fonville -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081112/2ba49645/attachment.html From hahn at mcmaster.ca Thu Nov 13 14:52:20 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Transparant clustering In-Reply-To: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> References: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> Message-ID: > Service). I have always thought that it should be possible to build a > cluster which works like a single system. So that when I open an SSH session > to the cluster I get a connection as normal while in fact I am connecting to > the clustered system. the ssh connection is not interesting; what happens after that _is_. so you ssh to a cluster, and through lvs or similar, you get put onto some node. then what? is it supposed to still "work like" a single system? it does as long as you don't need more than one node, but that's not only boring but begs the question of why a cluster? making a cluster really "work like" a single system means that no thread should be aware of the fact that some other thread is on a different node. this means a single pid space, transparent shared memory, etc.
and even then (an SGI altix is such a machine), none of this is transparent in a strong sense (ie, the thread will indeed be able to tell when a cacheline is remote...) > I started reading a lot, and it seems as if this can > be done with beowulf. I just wonder if the head node would make things more > difficult (since that can go down as well). Is this at all possible (using > beowulf) and how would I go about configuring this? are you confusing high-availability with clustering? avoiding single points of failure is laudable, but you quickly start to move away from anything that resembles high-performance (and necessarily start relying on more replication and thus cost...) > I know this isn't very clear (I am just exploring), so please ask away. well, "what do you mean?" pretty much covers it. it's certainly possible to avoid a single login node as a single point of failure. it's also possible to use HA techniques to avoid other SPOF's (such as where the scheduler runs, or filesystems, etc). but "working like a single system" is much harder. From niftyompi at niftyegg.com Thu Nov 13 15:26:17 2008 From: niftyompi at niftyegg.com (Nifty niftyompi Mitch) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Clos network vs fat tree In-Reply-To: References: Message-ID: <20081113232617.GA4694@compegg.wr.niftyegg.com> On Thu, Nov 13, 2008 at 08:54:35PM +0300, Mikhail Kuzminsky wrote: > Sorry, is it correct to say that fat tree topology is equal to > *NON-BLOCKING* Clos network w/addition of "uplinks" ? I.e. any > non-blocking Clos network w/corresponding addition of uplinks gives fat > tree ? > > I read somewhere that an exact proof of "non-blocking" was given for > Clos networks with >= 3 levels. But most popular Infiniband fat trees have > only 2 levels. > > (Yes, I know that "non-blocking" for Clos network isn't "absolute" :-)) Since Infiniband routing is static I suspect that the topology may match but the behavior will not. http://en.wikipedia.org/wiki/Clos_network#Strict-sense_nonblocking_Clos_networks_.28m_.E2.89.A5_2n_-_1.29_-_the_original_1953_Clos_result See the bit: "If m >= n, the Clos network is rearrangeably nonblocking, meaning that an unused input on an ingress switch can always be connected to an unused output on an egress switch, but for this to take place, existing calls may have to be rearranged by assigning them to different centre stage switches in the Clos network [2]. To prove this, it is..." The key word is "rearrangeably nonblocking". If 30 seconds of homework is sufficient: the key to Clos topology research is that it is focused on telco switching, where a call is 'routed' when it is made and torn down on disconnect. This is not the same problem space as a packet switched network at a couple of levels. -- T o m M i t c h e l l Found me a new hat, now what? From Bogdan.Costescu at iwr.uni-heidelberg.de Fri Nov 14 00:23:19 2008 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Transparant clustering In-Reply-To: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> References: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> Message-ID: On Wed, 12 Nov 2008, Serge Fonville wrote: > I have always thought that it should be possible to build a cluster > which works like a single system. Have you looked at ScaleMP ? > I just wonder if the head node would make things more difficult > (since that can go down as well).
You have to better define your expectations of reliability and performance - they often go in different directions. -- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 E-mail: bogdan.costescu@iwr.uni-heidelberg.de From kilian.cavalotti.work at gmail.com Fri Nov 14 00:44:33 2008 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Transparant clustering In-Reply-To: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> References: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> Message-ID: <200811140944.34067.kilian.cavalotti.work@gmail.com> On Wednesday 12 November 2008 21:06:10 Serge Fonville wrote: > I am new to clustering and beowulf (I only played with Microsoft Clustering > Service). I have always thought that it should be possible to build a > cluster which works like a single system. You're describing something close to a Single System Image cluster (http://en.wikipedia.org/wiki/Single-system_image) You may want to give a look at OpenSSI (http://en.wikipedia.org/wiki/OpenSSI, http://openssi.org) Cheers, -- Kilian From eugen at leitl.org Fri Nov 14 01:35:23 2008 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Transparant clustering In-Reply-To: <200811140944.34067.kilian.cavalotti.work@gmail.com> References: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> <200811140944.34067.kilian.cavalotti.work@gmail.com> Message-ID: <20081114093523.GV11544@leitl.org> On Fri, Nov 14, 2008 at 09:44:33AM +0100, Kilian CAVALOTTI wrote: > You may want to give a look at OpenSSI (http://en.wikipedia.org/wiki/OpenSSI, > http://openssi.org) Is that stable or just dead? I can't quite tell. From hahn at mcmaster.ca Fri Nov 14 07:12:40 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Transparant clustering In-Reply-To: <20081114093523.GV11544@leitl.org> References: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> <200811140944.34067.kilian.cavalotti.work@gmail.com> <20081114093523.GV11544@leitl.org> Message-ID: > Is that stable or just dead? I can't quite tell. is there a difference? From andres_polindara at hotmail.com Thu Nov 13 20:07:59 2008 From: andres_polindara at hotmail.com (Cesar Andres Polindara Lopez) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Getting started on parallelization Message-ID: Hi to all. I have developed a finite element model for the simulation of poroelastic materials. The model requires the implementation of a time integration scheme (also known as step by step algorithms) hence iterations are performed. For each iteration I have to solve a linear system of equations so I have to calculate the inverse of a matrix and then execute a multiplication. I'm trying to see if there's any chance to parallelize my code. I'm just getting started on the subject of parallelization and I would appreciate if anyone can give me a clue where to begin. I'm not quite sure what would be the best strategy to solve my problem. C?sar. _________________________________________________________________ Explore the seven wonders of the world http://search.msn.com/results.aspx?q=7+wonders+world&mkt=en-US&form=QBRE -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20081113/ac79feaa/attachment.html From serge.fonville at gmail.com Fri Nov 14 00:41:16 2008 From: serge.fonville at gmail.com (Serge Fonville) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Transparant clustering In-Reply-To: References: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> Message-ID: <680cbe0e0811140041w63bfd3d5wb183517116682723@mail.gmail.com> Thanks for the response (and for the questions :-)) I'll try and elaborate a bit more. I currently have two equal systems. (XEON 3220,8GB,80GB RAID1) I want to run a couple of websites (using Tomcat) and two database servers (PostgreSQL and MySQL) I have contintued reading a lot and I think I am starting to have a clear idea on what is possible Basically I want it to appear as a single system to the outside world while in fact there are more (currently just two). They should divide all usage of resources equally. If one goes down the other notices and takes over everything, if it comes up again they are synchronized (I am aware of the split brain issue) either server has four network interfaces and can also be connected through an RS-232 cable. Alternatives seem to be to create a tomcat cluster, a mysql cluster and pgcluster for clustering tomcat, but that would not be as scalable as I would have hoped. There are no specific requirements as to availability (one internet connection, one switch, one power group) currently a very simple project (it is supposed to grow bigger and larger over time) I also tried looking into OpenSSI, kerrighed ,LinuxPMI and using heartbeat with drbd and lvs. All these possibilities are a bit overwhelming to start with, so I hope the most developed project of all (which seems to be beowulf) seemed to me the best starting point. Regards, Serge Fonville On Fri, Nov 14, 2008 at 9:23 AM, Bogdan Costescu < Bogdan.Costescu@iwr.uni-heidelberg.de> wrote: > On Wed, 12 Nov 2008, Serge Fonville wrote: > > I have always thought that it should be possible to build a cluster which >> works like a single system. >> > > Have you looked at ScaleMP ? > > I just wonder if the head node would make things more difficult (since >> that can go down as well). >> > > You have to better define your expectations of reliability and performance > - they often go in different directions. > > -- > Bogdan Costescu > > IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany > Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 > E-mail: bogdan.costescu@iwr.uni-heidelberg.de > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081114/ce4b3f5d/attachment.html From lindahl at pbm.com Fri Nov 14 14:00:34 2008 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Getting started on parallelization In-Reply-To: References: Message-ID: <20081114220034.GA19744@bx9> On Thu, Nov 13, 2008 at 11:07:59PM -0500, Cesar Andres Polindara Lopez wrote: > I have developed a finite element model for the simulation of > poroelastic materials. The model requires the implementation of a time > integration scheme (also known as step by step algorithms) hence > iterations are performed. For each iteration I have to solve a linear > system of equations so I have to calculate the inverse of a matrix and > then execute a multiplication. Good news: look up BLAS, and you will find a standard interface for doing matrix computations. 
There are parallel implementations available: Atlas, AMD ACML, Intel's math lib, etc etc. Bad news: Are you really inverting the matrix? You should probably be doing something like LU decomposition. -- greg From becker at scyld.com Fri Nov 14 14:07:02 2008 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] 10th Annual Beowulf Bash: Austin TX Nov 17 2008 9pm Message-ID: Tenth Annual Beowulf Bash And LECCIBG November 17 2008 9pm at Pete's Dueling Piano Bar We have finalized the plans for this year's combined Beowulf Bash and LECCIBG http://www.xandmarketing.com/beobash/ It will take place, as usual, with the IEEE SC Conference. This year SC08 is in Austin during the week of Nov 17 2008 As in previous years, the attraction is the conversations with other attendees. We will have drinks and light snacks, with a short greeting by the sponsors about 10:15pm. The venue is in the lively area of Austin near 6th street, very close to many of the conference hotels and within walking distance of the rest. November 17 2008 9-11:30pm Monday, Immediately after the SC08 Opening Gala Pete's Dueling Piano Bar http://www.petesduelingpianobar.com If your company (or even you as an individual) would like to help sponsor the event, please contact me, becker@beowulf.org before early November. (We can accommodate last-minute sponsorship, but your name won't be on the printed info.) Our "headlining" sponsor list for 2008 is AMD AMD (Lead sponsor) http://amd.com Other sponsors are Penguin/Scyld (organizing sponsor) http://penguincomputing.com XAND Marketing (organizing sponsor) http://xandmarketing.com NVIDIA http://nvidia.com Terascala http://www.terascala.com/ Panasas http://www.panasas.com/ Clustermonkey http://www.clustermonkey.net/ -- Donald Becker becker@scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA From dnlombar at ichips.intel.com Fri Nov 14 14:34:05 2008 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Transparant clustering In-Reply-To: <680cbe0e0811140041w63bfd3d5wb183517116682723@mail.gmail.com> References: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> <680cbe0e0811140041w63bfd3d5wb183517116682723@mail.gmail.com> Message-ID: <20081114223405.GA8697@nlxdcldnl2.cl.intel.com> On Fri, Nov 14, 2008 at 01:41:16AM -0700, Serge Fonville wrote: > Thanks for the response (and for the questions :-)) > > I'll try and elaborate a bit more. > I currently have two equal systems. (XEON 3220,8GB,80GB RAID1) > I want to run a couple of websites (using Tomcat) and two database servers (PostgreSQL and MySQL) That's a very different type of clustering. This list, the Beowulf list, is about about clustering for HPC, to increase compute performance, usually for scientific and engineering calculations. See below. > I have contintued reading a lot and I think I am starting to have a clear idea on what is possible > Basically I want it to appear as a single system to the outside world while in fact there are more (currently just two). > They should divide all usage of resources equally. If one goes down the other notices and takes over everything, if it comes up again they are synchronized (I am aware of the split brain issue) either server has four network interfaces and can also be connected through an RS-232 cable. There are hardware and software solutions to this problem. 
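[A quick aside on the linear-algebra advice a few messages above ("solve,
don't invert"): a minimal Fortran sketch of that pattern using LAPACK's
dgesv is below. It is only an illustration -- the 3x3 system is invented,
and it assumes a LAPACK/BLAS library (reference netlib, ACML, MKL, ...) is
linked in; in the finite-element code the matrix and right-hand side would
come from the assembled system.]

      program solve_demo
      implicit none
      integer, parameter :: n = 3, nrhs = 1
      real(8) :: a(n,n), b(n)
      integer :: ipiv(n), info

      ! Invented example system a*x = b, for illustration only.
      a = reshape( (/ 4.0d0, 1.0d0, 0.0d0,   &
                      1.0d0, 3.0d0, 1.0d0,   &
                      0.0d0, 1.0d0, 2.0d0 /), (/ n, n /) )
      b = (/ 1.0d0, 2.0d0, 3.0d0 /)

      ! dgesv = LU factorization with partial pivoting, then a solve.
      ! On return b holds the solution x; no explicit inverse is formed.
      call dgesv(n, nrhs, a, n, ipiv, b, n, info)
      if (info /= 0) print *, 'dgesv failed, info =', info
      print *, 'x =', b
      end program solve_demo

[Build with something like "gfortran solve_demo.f90 -llapack -lblas". For
repeated solves with the same matrix, factor once with dgetrf and reuse the
factors via dgetrs; the ScaLAPACK/PBLAS route mentioned below is the
distributed-memory version of the same idea.]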
At the conceptual level, you have a system--the load balancer--that routes incoming requests to one-of-N backing servers. If any of the backing servers fails, it's simply ignored and future incoming requests are routed to the remaining server(s). As long as the remaining server(s) can handle the load, all is well and your service is provided. Here's LVS, a software solution: You may also want to consider Linux-HA to ensure your LVS server is robust: There are many other details to consider, but you'll learn of those as you research more appropriate solutions. HTH -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From reuti at staff.uni-marburg.de Fri Nov 14 14:59:49 2008 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Getting started on parallelization In-Reply-To: <20081114220034.GA19744@bx9> References: <20081114220034.GA19744@bx9> Message-ID: <89017A90-F212-4943-8E95-1EE3DF01319A@staff.uni-marburg.de> Am 14.11.2008 um 23:00 schrieb Greg Lindahl: > On Thu, Nov 13, 2008 at 11:07:59PM -0500, Cesar Andres Polindara > Lopez wrote: > >> I have developed a finite element model for the simulation of >> poroelastic materials. The model requires the implementation of a >> time >> integration scheme (also known as step by step algorithms) hence >> iterations are performed. For each iteration I have to solve a linear >> system of equations so I have to calculate the inverse of a matrix >> and >> then execute a multiplication. > > Good news: look up BLAS, and you will find a standard interface for http://netlib.org/liblist.html > doing matrix computations. There are parallel implementations > available: Atlas, AMD ACML, Intel's math lib, etc etc. If you want to extend it to mutilple machines, you can also look into PBLAS as part of ScaLAPACK (above link) and/or it's Intel implementation in Intel's MKL. -- Reuti > Bad news: Are you really inverting the matrix? You should probably be > doing something like LU decomposition. > > -- greg > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From serge.fonville at gmail.com Fri Nov 14 14:56:30 2008 From: serge.fonville at gmail.com (Serge Fonville) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Transparant clustering In-Reply-To: <20081114223405.GA8697@nlxdcldnl2.cl.intel.com> References: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> <680cbe0e0811140041w63bfd3d5wb183517116682723@mail.gmail.com> <20081114223405.GA8697@nlxdcldnl2.cl.intel.com> Message-ID: <680cbe0e0811141456k6de0e556m3139480b2a80d623@mail.gmail.com> Thank you for your answer. I proably misunderstood the purpose of beowulf, thanks for clarifying I will continue my search for a soution. Thanks a lot all for the help On Fri, Nov 14, 2008 at 11:34 PM, Lombard, David N < dnlombar@ichips.intel.com> wrote: > On Fri, Nov 14, 2008 at 01:41:16AM -0700, Serge Fonville wrote: > > Thanks for the response (and for the questions :-)) > > > > I'll try and elaborate a bit more. > > I currently have two equal systems. (XEON 3220,8GB,80GB RAID1) > > I want to run a couple of websites (using Tomcat) and two database > servers (PostgreSQL and MySQL) > > That's a very different type of clustering. 
This list, the Beowulf list, > is about about clustering for HPC, to increase compute performance, usually > for scientific and engineering calculations. See below. > > > I have contintued reading a lot and I think I am starting to have a clear > idea on what is possible > > Basically I want it to appear as a single system to the outside world > while in fact there are more (currently just two). > > They should divide all usage of resources equally. If one goes down the > other notices and takes over everything, if it comes up again they are > synchronized (I am aware of the split brain issue) either server has four > network interfaces and can also be connected through an RS-232 cable. > > There are hardware and software solutions to this problem. At the > conceptual > level, you have a system--the load balancer--that routes incoming requests > to > one-of-N backing servers. If any of the backing servers fails, it's simply > ignored and future incoming requests are routed to the remaining server(s). > As long as the remaining server(s) can handle the load, all is well and > your > service is provided. > > Here's LVS, a software solution: > You may also want to consider Linux-HA to ensure your LVS server is robust: > > > There are many other details to consider, but you'll learn of those as you > research more appropriate solutions. > > HTH > -- > David N. Lombard, Intel, Irvine, CA > I do not speak for Intel Corporation; all comments are strictly my own. > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081114/57b0b282/attachment.html From diep at xs4all.nl Sun Nov 16 17:55:57 2008 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Getting started on parallelization In-Reply-To: References: Message-ID: Hi Cesar, Before doing the hard work of parallellization, do some big efforts of figuring out which methods are there to speedup your calculation in algorithmic manner. Find all software that's doing what you are after and find experts there and talk to the guys; the reality is in every world of computation that what you find on the net is just enough for a beginners level usually; only when really digging hard you can find the very best, if that's at the net somewhere anyway; even when you see source code of something that performs well, the theories behind it and tiny details implemented you might miss not knowing the reasons behind the choices taken. In general a lot of algorithms out of the past get total hammered down by newer ones which have total other parallel capabilities usually (much harder to parallellize sometimes) and eating up a lot more RAM usual; publications simply are never accurate as the guys can make money with something that works better, and most publications get done by persons who have an expertise grade of just 4 years, which usually is not even enough to know all existing public theory. Let alone that those who make money will give away their secrets/ideas nor get paid for publishing ideas. There is a lot known out there. Vincent p.s. how does that feel being a Colombian getting more trade access to USA? On Nov 14, 2008, at 5:07 AM, Cesar Andres Polindara Lopez wrote: > Hi to all. > > I have developed a finite element model for the simulation of > poroelastic materials. The model requires the implementation of a > time integration scheme (also known as step by step algorithms) > hence iterations are performed. 
For each iteration I have to solve > a linear system of equations so I have to calculate the inverse of > a matrix and then execute a multiplication. I'm trying to see if > there's any chance to parallelize my code. > I'm just getting started on the subject of parallelization and I > would appreciate if anyone can give me a clue where to begin. I'm > not quite sure what would be the best strategy to solve my problem. > > C?sar. > > Explore the seven wonders of the world Learn more! > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From becker at scyld.com Mon Nov 17 00:01:02 2008 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Final Announcement: 10th Annual Beowulf Bash 9pm Nov 17 2008 Message-ID: Final Announcement: 10th Annual Beowulf Bash 9pm Nov 17 2008 Tenth Annual Beowulf Bash And LECCIBG 9pm November 17 2008 Pete's Dueling Piano Bar 421 E. 6th St. Austin, TX It will take place, as usual, with the IEEE SC Conference. We've picked the only time that doesn't conflict with vendor events, Monday evening just after the Opening Gala. Pete's is nearby -- a short walk away. As in previous years, the primary attraction is the conversations with other attendees. We will supplement this with some fun professional entertainment: "Dueling Pianos". We will have drinks and snacks, along with a few give-aways There will be a short greeting by the sponsors somewhere around 10pm (anytime from 9:45 to 10:15pm) -- try to be there by then. Again: Monday, November 17 2008 9-11:30pm Immediately after the SC08 Opening Gala Pete's Dueling Piano Bar http://www.petesduelingpianobar.com Our headlining sponsor for 2008 is AMD (yes, that means they put up a big chunk of the money) AMD http://amd.com Other sponsors are Penguin/Scyld (organizing sponsor) http://penguincomputing.com XAND Marketing (organizing sponsor) http://xandmarketing.com NVIDIA http://nvidia.com Terascala http://terascala.com/ Clustermonkey http://clustermonkey.net/ ClusterCorp http://clustercorp.com/ -- Donald Becker becker@scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA From kilian.cavalotti.work at gmail.com Mon Nov 17 00:40:49 2008 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Transparant clustering In-Reply-To: References: <680cbe0e0811121206v5732f41qbace71a4621ba1f1@mail.gmail.com> <20081114093523.GV11544@leitl.org> Message-ID: <200811170940.50181.kilian.cavalotti.work@gmail.com> On Friday 14 November 2008 16:12:40 Mark Hahn wrote: > > Is that stable or just dead? I can't quite tell. As far as I can tell from what i read on the mailing lists, the development pace has slowed down quite a bit, but the project is still alive. The website seems pretty much outdated, though. And I didn't use it recently, so I can't really tell how usable it is. > is there a difference? Well... Is the Linux kernel stable? Many people seem to think this way. Is it dead? Looks darn alive and kicking to me. 
:) Cheers, -- Kilian From deadline at eadline.org Tue Nov 18 11:49:30 2008 From: deadline at eadline.org (Douglas Eadline) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Not to be missed: Walt Ligon Performs in SC08 Music Series In-Reply-To: References: Message-ID: <59895.140.221.238.198.1227037770.squirrel@mail.eadline.org> If you are at the show, come by the SC08 Music Room on Wednesday at 2:30PM to see Walt Ligon (of PVFS frame) perform. Those attending the Beowulf Bash had a preview last night, so I'm looking forward to tomorrows performance. -- Doug -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. From dnlombar at ichips.intel.com Tue Nov 18 13:30:51 2008 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] Linux authentication via AD Message-ID: <20081118213051.GB27402@nlxdcldnl2.cl.intel.com> Just passing on a link: Authenticate Linux Clients with Active Directory (Technet) -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From rfinch at water.ca.gov Tue Nov 18 11:55:30 2008 From: rfinch at water.ca.gov (Finch, Ralph) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters Message-ID: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> As we know by now GPUs can run some problems many times faster than CPUs (e.g. http://folding.stanford.edu/English/FAQ-highperformance). From what I understand GPUs are useful only with certain classes of numerical problems and discretization schemes, and of course the code must be rewritten to take advantage of the GPU. I'm part of a group that is purchasing our first beowulf cluster for a climate model and an estuary model using Chombo (http://seesar.lbl.gov/ANAG/chombo/). Getting up to speed (ha) on clusters I started wondering if packages like Chombo, and numerical problems generally, would be rewritten to take advantage of GPUs and GPU clusters, if the latter exist. From decades ago when I actually knew something I vaguely recall that PDEs can be classed as to parabolic, hyperbolic or elliptic. And there are explicit and implicit methods in time. Are some of these classifications much better suited for GPUs than others? Given the very substantial speed improvements with GPUs, will there be a movement to GPU clusters, even if there is a substantial cost in problem reformulation? Or are GPUs only suitable for a rather narrow range of numerical problems? Ralph Finch, P.E. California Dept. of Water Resources Delta Modeling Section, Bay-Delta Office Room 215-13 1416 9th Street Sacramento CA 95814 916-653-7552 rfinch@water.ca.gov From hearnsj at googlemail.com Wed Nov 19 13:11:54 2008 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> Message-ID: <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> 2008/11/18 Finch, Ralph > > p Given the very substantial speed improvements with GPUs, > will there be a movement to GPU clusters, even if there is a substantial > cost in problem reformulation? Or are GPUs only suitable for a rather > narrow range of numerical problems? > > My take? Yes, there WILL be a movement to GPU clusters. Note the tense. 
It has not happened yet. Speaking as someone responsible for running commercial codes on clusters, I've recently been talking to a former colleague in medical imaging whose group is getting good results with CUDA, and someone who is getting good results in CFD work. BUT if you are not writing your own codes, you should be looking at a Beowulf type cluster. Find yourself a vendor who you have the warm-and-fuzzies with. Seriously, as you leftpondians say it is like dating. Also speak with the other researchers who are running thes models - maybe they behave well with Infiniband interconnects, or work well with Myrinet. Let's be very honest here - we all have a huge amount of computer power on our desks, many times that of the original Cray 1 systems. The art is to install, care for and to run Beowulf class systems efficiently. Yes, for certain algorithms and certain problems CUDA and Firestream accelerate things by 10, 20...100 times. But don't lose track of the amount of power in the current generation - and imminent Shanghai and Nehalem systems. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081119/1146c5e2/attachment.html From xclski at yahoo.com Wed Nov 19 16:18:35 2008 From: xclski at yahoo.com (Ellis Wilson) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> Message-ID: <330326.56339.qm@web37907.mail.mud.yahoo.com> From: John Hearns To: "Finch, Ralph" ; beowulf@beowulf.org Sent: Wednesday, November 19, 2008 4:11:54 PM Subject: Re: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters ... and to run Beowulf class systems efficiently. Yes, for certain algorithms and certain problems CUDA and Firestream accelerate things by 10, 20...100 times. But don't lose track of the amount of power in the current generation - and imminent Shanghai and Nehalem systems. ... This said, I've had a nagging question of late - if I purchase an ATI desktop graphics card that has streams akin to the official workstation FireStream GPU processor, can I write code using their SDK that will work on the desktop cards, often times within my college kid budget, unlike the FireStream series? Ellis From hearnsj at googlemail.com Wed Nov 19 23:23:47 2008 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <330326.56339.qm@web37907.mail.mud.yahoo.com> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> <330326.56339.qm@web37907.mail.mud.yahoo.com> Message-ID: <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> 2008/11/20 Ellis Wilson > ... > > This said, I've had a nagging question of late - if I purchase an ATI > desktop graphics card that has streams akin to the official workstation > FireStream GPU processor, can I write code using their SDK that will work on > the desktop cards, often times within my college kid budget, unlike the > FireStream series? > Ellis, I can't say re. the Firestream cards, but for Nvidia the answer is a resounding yes. Virtually any recent card can run CUDA code. If you Google you can get a list of compatible cards. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081120/447ebe0a/attachment.html From peter.st.john at gmail.com Thu Nov 20 05:21:48 2008 From: peter.st.john at gmail.com (Peter St. John) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> Message-ID: Regarding the hyperbolic (etc) classification of PDEs: you may want qualitative theory instead of (or in additon to) number crunching. I'd suggest stopping over at UC Davis, where I count at least half a dozen PDE folks in the applied math program (more if you count the turbulence and fluid dynamics folks). Never miss an opportunity to buy a mathematician lunch :-) Peter On Tue, Nov 18, 2008 at 2:55 PM, Finch, Ralph wrote: > As we know by now GPUs can run some problems many times faster than CPUs > (e.g. http://folding.stanford.edu/English/FAQ-highperformance). From > what I understand GPUs are useful only with certain classes of numerical > problems and discretization schemes, and of course the code must be > rewritten to take advantage of the GPU. > > I'm part of a group that is purchasing our first beowulf cluster for a > climate model and an estuary model using Chombo > (http://seesar.lbl.gov/ANAG/chombo/). Getting up to speed (ha) on > clusters I started wondering if packages like Chombo, and numerical > problems generally, would be rewritten to take advantage of GPUs and GPU > clusters, if the latter exist. From decades ago when I actually knew > something I vaguely recall that PDEs can be classed as to parabolic, > hyperbolic or elliptic. And there are explicit and implicit methods in > time. Are some of these classifications much better suited for GPUs > than others? Given the very substantial speed improvements with GPUs, > will there be a movement to GPU clusters, even if there is a substantial > cost in problem reformulation? Or are GPUs only suitable for a rather > narrow range of numerical problems? > > Ralph Finch, P.E. > California Dept. of Water Resources > Delta Modeling Section, Bay-Delta Office > Room 215-13 > 1416 9th Street > Sacramento CA 95814 > 916-653-7552 > rfinch@water.ca.gov > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081120/7875621f/attachment.html From hahn at mcmaster.ca Thu Nov 20 06:58:27 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> <330326.56339.qm@web37907.mail.mud.yahoo.com> <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> Message-ID: > Ellis, I can't say re. the Firestream cards, but for Nvidia the answer is a > resounding yes. AMD had some PR recently (check the reg and inq) about supporting their stream stuff across the whole product line, including chipset-integrated gpus. 
that seems intelligent, given that lines between CPU and GPU are obviously
blurring in the future (Larrabee, Fusion, etc).

IMO, it would be crazy to invest too much in the current gen of gp-gpu
programming stuff.  doing some pilot stuff with both vendors probably makes
sense, but the field really does need OpenCL to succeed.  I hope the OpenCL
people are not too OpenGL-ish, and realize that they need to target SSE and
SSE512 as well.

> Virtually any recent card can run CUDA code. If you Google you can get a
> list of compatible cards.

not that many NVidia cards support DP yet though, which is probably
important to anyone coming from the normal HPC world...  there's some
speculation that NV will try to keep DP as a market segmentation feature
to drive HPC towards high-cost Tesla cards, much as vendors have
traditionally tried to herd high-end vis into 10x priced cards.

regards, mark hahn.

From hahn at mcmaster.ca  Thu Nov 20 07:11:13 2008
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed Nov 25 01:07:58 2009
Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters
In-Reply-To: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov>
References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov>
Message-ID: 

> As we know by now GPUs can run some problems many times faster than CPUs

it's good to cultivate some skepticism.  the paper that quotes 40x does so
with a somewhat tilted comparison.  (I consider this comparison fair: a host
with 2x 3.2 GHz QC Core2 vs 1 current high-end GPU card.  former delivers
102.4 SP Gflops; latter is something like 1.2 Tflop.  those are all
peak/theoretical.  the nature of the problem determines how much slower
real workloads are - I suggest that as not-suited-ness increases,
performance falls off _faster_ for the GPU.)

> what I understand GPUs are useful only with certain classes of numerical
> problems and discretization schemes, and of course the code must be

I think it's fair to say that GPUs are good for graphics-like loads, or more
generally: fairly small data, accessed data-parallel or with very regular
and limited sharing, with high work-per-data.

> I'm part of a group that is purchasing our first beowulf cluster for a
> climate model and an estuary model using Chombo
> (http://seesar.lbl.gov/ANAG/chombo/). Getting up to speed (ha) on

offhand, I'd guess that adaptive grids will be substantially harder to run
efficiently on a GPU than a uniform grid.

> than others? Given the very substantial speed improvements with GPUs,
> will there be a movement to GPU clusters, even if there is a substantial
> cost in problem reformulation? Or are GPUs only suitable for a rather
> narrow range of numerical problems?

GP-GPU tools are currently immature, and IMO the hardware probably needs a
generation of generalization before it becomes really widely used.

OTOH, GP-GPU has obviously drained much of the interest away from e.g. FPGA
computation.  I don't know whether there is still enough interest in vector
computers to drain anything...
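[To put a rough number on "high work-per-data", with figures that are only
illustrative: a 5-point relaxation/Jacobi sweep does about 4 flops per grid
point while moving at least 16 bytes per point in double precision (one
value read, one written, ignoring write-allocate traffic), i.e. an
arithmetic intensity of roughly

    I = \frac{4\ \mathrm{flop}}{16\ \mathrm{byte}} = 0.25\ \mathrm{flop/byte},

and attainable performance is bounded by about

    P \le \min\bigl(P_{\mathrm{peak}},\ I \cdot B_{\mathrm{mem}}\bigr).

With, say, ~140 GB/s of memory bandwidth on a current high-end GPU versus a
few tens of GB/s (theoretical) on a dual-socket host, a low-intensity
stencil like this is bandwidth-bound on both sides, so the realistic win is
set by the bandwidth ratio -- roughly an order of magnitude -- rather than
by the peak-flops numbers.]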
From landman at scalableinformatics.com Thu Nov 20 07:43:15 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> Message-ID: <49258593.1010608@scalableinformatics.com> Quick intervention from SC08 show Mark Hahn wrote: >> As we know by now GPUs can run some problems many times faster than CPUs > > it's good to cultivate some skepticism. the paper that quotes 40x > does so with a somewhat tilted comparison. (I consider this comparison > fair: a host with 2x 3.2 GHz QC Core2 vs 1 current high-end CPU card. > former delivers 102.4 SP Gflops; latter is something like 1.2 Tflop. > those are all peak/theoretical. the nature of the problem determines > how much slower real workloads are - I suggest that as not-suited-ness > increases, performance falls off _faster_ for the GPU.) Not always. [shameless plug] A project I have spent some time with is showing 117x on a 3-GPU machine over a single core of a host machine (3.0 GHz Opteron 2222). The code is mpihmmer, and the GPU version of it. See http://www.mpihmmer.org for more details. Ping me offline if you need more info. [/shameless plug] >> what I understand GPUs are useful only with certain classes of numerical >> problems and discretization schemes, and of course the code must be > > I think it's fair to say that GPUs are good for graphics-like loads, ... not entirely true. We are seeing good performance with a number of calculations that share similar features. Some will not work well on GPUs, those with lots of deep if-then or conditional constructs. If you can refactor these such that the conditionals are hoisted out of the inner loops, this is a good thing for GPUs. > or more generally: fairly small data, accessed data-parallel or with > very regular and limited sharing, with high work-per-data. ... not small data. You can stream data. Hi work per data is advisable on any NUMA like machine with penalties for data motion (cache based architectures, NUMA, MPI, ...). You want as much data reuse as you can get, or to structure the stream to leverage the maximum bandwidth. [...] >> than others? Given the very substantial speed improvements with GPUs, >> will there be a movement to GPU clusters, even if there is a substantial >> cost in problem reformulation? Or are GPUs only suitable for a rather >> narrow range of numerical problems? > > GP-GPU tools are currently immature, and IMO the hardware probably needs > a generation of generalization before it becomes really widely used. Hrmm... Cuda is pretty good. Still needs some polish, but people can use it, and are generating real apps from it. We are seeing pretty wide use ... I guess the issue is what one defines as "wide". > OTOH, GP-GPU has obviously drained much of the interest away from eg > FPGA computation. I don't know whether there is still enough interest There is still some of it on the show floor. Some things FPGAs do very well. But the cost for this performance has been prohibitive, and GPUs are basically decimating the business model that has been in use for FPGAs. > in vector computers to drain anything... Hmmm.... There is a (micro)vector machine in your CPU anyway. 
Joe > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Thu Nov 20 08:23:31 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <49258593.1010608@scalableinformatics.com> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <49258593.1010608@scalableinformatics.com> Message-ID: > [shameless plug] > > A project I have spent some time with is showing 117x on a 3-GPU machine over > a single core of a host machine (3.0 GHz Opteron 2222). The code is > mpihmmer, and the GPU version of it. See http://www.mpihmmer.org for more > details. Ping me offline if you need more info. > > [/shameless plug] I'm happy for you, but to me, you're stacking the deck by comparing to a quite old CPU. you could break out the prices directly, but comparing 3x GPU (modern? sounds like pci-express at least) to a current entry-level cluster node (8 core2/shanghai cores at 2.4-3.4 GHz) be more appropriate. at the VERY least, honesty requires comparing one GPU against all the cores in a current CPU chip. with your numbers, I expect that would change the speedup from 117 to around 15. still very respectable. I apologize for not RTFcode, but does the host version of hmmer you're comparing with vectorize using SSE? >> or more generally: fairly small data, accessed data-parallel or with very >> regular and limited sharing, with high work-per-data. > > ... not small data. You can stream data. can you sustain your 117x speedup if your data is in host memory? by small, I meant the on-gpu-card memory, mainly, which is fast but often more limited than host memory. sidebar: it's interesting that ram is incredibly cheap these days, and we typically spec a middle-of-the-road machine at 2GB/core. but even 4GB/core is not much more expensive, but to be honest, the number of users who need that much is fairly small. >> GP-GPU tools are currently immature, and IMO the hardware probably needs a >> generation of generalization before it becomes really widely used. > > Hrmm... Cuda is pretty good. Still needs some polish, but people can use > it, and are generating real apps from it. We are seeing pretty wide use ... > I guess the issue is what one defines as "wide". Cuda is NV-only, and forces the programmer to face a lot of limits and weaknesses. at least I'm told so by our Cuda users - things like having to re-jigger code to avoid running out of registers. from my perspective, a random science prof is going to be fairly put off by that sort of thing unless the workload is really impossible to do otherwise. (compared to the traditional cluster+MPI approach, which is portable, scalable and at least short-term future-proof.) thanks, mark. 
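[For readers new to the thread, the "traditional cluster+MPI approach"
being contrasted here usually means explicit domain decomposition with halo
(ghost-cell) exchange. The sketch below shows the simplest 1-D version of
that pattern; it is a generic illustration, not code from mpihmmer or any
other project named above, and assumes an MPI library whose Fortran module
is available (any recent MPICH or Open MPI, built with
"mpif90 halo_sketch.f90").]

      program halo_sketch
      use mpi
      implicit none
      integer, parameter :: nloc = 1000      ! cells owned by this rank
      real(8) :: u(0:nloc+1)                 ! local cells plus two halo cells
      integer :: rank, nprocs, left, right, ierr
      integer :: status(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! Neighbours on a 1-D chain; MPI_PROC_NULL makes the ends no-ops.
      left  = rank - 1
      right = rank + 1
      if (rank == 0)          left  = MPI_PROC_NULL
      if (rank == nprocs - 1) right = MPI_PROC_NULL

      u = dble(rank)                         ! made-up initial data

      ! One halo exchange (this would sit inside the iteration loop).
      call MPI_Sendrecv(u(1),      1, MPI_DOUBLE_PRECISION, left,  0, &
                        u(nloc+1), 1, MPI_DOUBLE_PRECISION, right, 0, &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_Sendrecv(u(nloc),   1, MPI_DOUBLE_PRECISION, right, 1, &
                        u(0),      1, MPI_DOUBLE_PRECISION, left,  1, &
                        MPI_COMM_WORLD, status, ierr)

      ! ... local stencil update on u(1:nloc) would go here ...

      call MPI_Finalize(ierr)
      end program halo_sketch

[The same decomposition carries over unchanged from two cores to hundreds
of nodes, which is the portability/scalability point being made above.]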
From hearnsj at googlemail.com  Thu Nov 20 08:35:16 2008
From: hearnsj at googlemail.com (John Hearns)
Date: Wed Nov 25 01:07:58 2009
Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters
In-Reply-To: 
References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <49258593.1010608@scalableinformatics.com>
Message-ID: <9f8092cc0811200835t115abb49ifd113fecf3407811@mail.gmail.com>

>
> I'm happy for you, but to me, you're stacking the deck by comparing to a
> quite old CPU. you could break out the prices directly, but comparing 3x
> GPU (modern? sounds like pci-express at least)

Mark, all CUDA-capable cards are PCI-Express (off the top of my head).

> to a current entry-level cluster node (8 core2/shanghai cores at 2.4-3.4
> GHz) be more appropriate.
>
> at the VERY least, honesty requires comparing one GPU against all the cores
> in a current CPU chip. with your numbers, I expect that would change the
> speedup from 117 to around 15. still very respectable.
>
> I apologize for not RTFcode, but does the host version of hmmer you're
> comparing with vectorize using SSE?
>

Good question. I'd really like to see the numbers on this one also.
As is clear to the list, I'm really enthusiastic about CUDA. But, as you
say, there is no point in that if your compiler/application could make
equally good use of current on-chip SSE (etc. etc.)

(Currently sitting in an Altix Performance and Tuning class, and my head is
spinning with this stuff. Pun very much intended.)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20081120/0f6ebdac/attachment.html

From jan.heichler at gmx.net  Thu Nov 20 08:39:26 2008
From: jan.heichler at gmx.net (Jan Heichler)
Date: Wed Nov 25 01:07:58 2009
Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters
In-Reply-To: 
References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <49258593.1010608@scalableinformatics.com>
Message-ID: <1283731746.20081120173926@gmx.net>

Hello Mark,

On Thursday, 20 November 2008, you wrote:

>> [shameless plug]

>> A project I have spent some time with is showing 117x on a 3-GPU machine over
>> a single core of a host machine (3.0 GHz Opteron 2222). The code is
>> mpihmmer, and the GPU version of it. See http://www.mpihmmer.org for more
>> details. Ping me offline if you need more info.

>> [/shameless plug]

MH> I'm happy for you, but to me, you're stacking the deck by comparing to a
MH> quite old CPU. you could break out the prices directly, but comparing 3x
MH> GPU (modern? sounds like pci-express at least) to a current entry-level
MH> cluster node (8 core2/shanghai cores at 2.4-3.4 GHz) be more appropriate.

Instead of benchmarking some CPU vs. some GPU, wouldn't it be fairer to

a) compare systems of similar costs (1k, 2k, 3k EUR/USD)
b) compare systems with a similar power footprint

?

What does it help that 3 GPUs are 1000x faster than an Asus Eee PC?

Jan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20081120/5db746e4/attachment.html From hearnsj at googlemail.com Thu Nov 20 08:49:25 2008 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <1283731746.20081120173926@gmx.net> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <49258593.1010608@scalableinformatics.com> <1283731746.20081120173926@gmx.net> Message-ID: <9f8092cc0811200849l2f91f45cl794b912df8f4eccd@mail.gmail.com> 2008/11/20 Jan Heichler > > > > What does it help that 3 GPUs are 1000x faster than a Asus Eee PC? > > I read on HPCwire that SGI are demoing a concept machine made of stuffing 1000's of Atom processors in a rack. Seems to me. to be honest. to be bucking the current trends. Has anyone seen it? Maybe some "black" customer has a very, very good reason for having a machine architected like that. (Hmmm. NSA? Codebreaking?) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081120/c37de38b/attachment.html From alscheinine at tuffmail.us Thu Nov 20 08:58:13 2008 From: alscheinine at tuffmail.us (Alan Louis Scheinine) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <49258593.1010608@scalableinformatics.com> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <49258593.1010608@scalableinformatics.com> Message-ID: <49259725.2070105@tuffmail.us> Mark Hahn wrote: > OTOH, GP-GPU has obviously drained much of the interest away from eg > FPGA computation. I don't know whether there is still enough interest > in vector computers to drain anything... Joe Landman replied: > Hmmm.... There is a (micro)vector machine in your CPU anyway. What is missing in PCs is the very high main memory bandwidth of vector machines for datasets larger than cache. What do we see in the near future with regard to increased memory bandwidth? Best regards, Alan Scheinine From laytonjb at att.net Thu Nov 20 09:32:33 2008 From: laytonjb at att.net (Jeff Layton) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters Message-ID: <607269.75268.qm@web80705.mail.mud.yahoo.com> I disagree with Mark on investing into GP-GPUs. I think it's a good thing to do for the simple reason of understanding the programming model. I've been watching people work with GP-GPUs for several years and there is always this big hump that they have to get over - understanding how to take their algorithm and re-write it for SIMD. Once they get over this hump, then things get easier. This is also independent of precision. It doesn't matter if you learn in SP or DP - as long as you learn. I would love to see a common language for GP-GPUs, but my guess is that OpenCL will be a bit slow. In the meantime, CUDA is the leader and gaining ground. I haven't had a chance to talk to PGI about their new compiler that has GP-GPU capability - but it sounds really fantastic (PGI makes a really great compiler). Jeff P.S. Sorry for the top posting, but this silly web based email tool can't indent or do much of anything useful :) ________________________________ From: Mark Hahn To: Beowulf Mailing List Sent: Thursday, November 20, 2008 9:58:27 AM Subject: Re: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters > Ellis, I can't say re. 
the Firestream cards, but for Nvidia the answer is a > resounding yes. AMD had some PR recently (check the reg and inq) about supporting their stream stuff across the whole product line, including chipset-integrated gpus. that seems intelligent, given that lines between CPU and GPU are obviously blurring in the future (Larrabee, Fusion, etc). IMO, it would be crazy to invest too much in the current gen of gp-gpu programming stuff. doing some pilot stuff with both vendors probably makes sense, but the field really does need OpenCL to succeed. I hope the OpenCL people are not too OpenGL-ish, and realize that they need to target SSE and SSE512 as well. > Virtually any recent card can run CUDA code. If you Google you can get a > list of compatible cards. not that many NVidia cards support DP yet though, which is probably important to anyone coming from the normal HPC world... there's some speculation that NV will try to keep DP as a market segmentation feature to drive HPC towards high-cost Tesla cards, much as vendors have traditionally tried to herd high-end vis into 10x priced cards. regards, mark hahn. _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081120/dd9f2345/attachment.html From laytonjb at att.net Thu Nov 20 09:36:04 2008 From: laytonjb at att.net (Jeff Layton) Date: Wed Nov 25 01:07:58 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters Message-ID: <504859.61845.qm@web80707.mail.mud.yahoo.com> >> what I understand GPUs are useful only with certain classes of numerical >> problems and discretization schemes, and of course the code must be > I think it's fair to say that GPUs are good for graphics-like loads, > or more generally: fairly small data, accessed data-parallel or with > very regular and limited sharing, with high work-per-data. >From my limited experience I would agree. Getting to the high work-per-data is absolutely key to getting the huge speedups. >> I'm part of a group that is purchasing our first beowulf cluster for a >> climate model and an estuary model using Chombo >> (http://seesar.lbl.gov/ANAG/chombo/). Getting up to speed (ha) on > offhand, I'd guess that adaptive grids will be substantially harder > to run efficiently on a GPU than a uniform grid. One key thing is that unstructured grid codes don't work as well. The problem is the indirect addressing. I know two of the developers at Nvidia and both are CFD gurus - I will ping them to get more details because I know they were looking at this (unstructured vs. structured). Jeff P.S. I had to do the indentation by hand on this stupid email web-based email tool :) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20081120/46520a9e/attachment.html From diep at xs4all.nl Thu Nov 20 09:41:42 2008 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <1283731746.20081120173926@gmx.net> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <49258593.1010608@scalableinformatics.com> <1283731746.20081120173926@gmx.net> Message-ID: On Nov 20, 2008, at 5:39 PM, Jan Heichler wrote: > Hallo Mark, > > > > Donnerstag, 20. November 2008, meintest Du: > > > > >> [shameless plug] > > > > >> A project I have spent some time with is showing 117x on a 3-GPU > machine over > > >> a single core of a host machine (3.0 GHz Opteron 2222). The > code is > > >> mpihmmer, and the GPU version of it. See http:// > www.mpihmmer.org for more > > >> details. Ping me offline if you need more info. > > > > >> [/shameless plug] > > > > MH> I'm happy for you, but to me, you're stacking the deck by > comparing to a > > MH> quite old CPU. you could break out the prices directly, but > comparing 3x > > MH> GPU (modern? sounds like pci-express at least) to a current > entry-level > > MH> cluster node (8 core2/shanghai cores at 2.4-3.4 GHz) be more > appropriate. > > > > Instead of benchmarking some CPU vs. some GPU wouldn't it be fairer to > > > > a) compare systems of similar costs (1k, 2k, 3k EUR/USD) > > b) compare systems with a similar power footprint > > > > ? > > > > What does it help that 3 GPUs are 1000x faster than a Asus Eee PC? > > > Exactly. http://re.jrc.ec.europa.eu/energyefficiency/html/ standby_initiative_data%20centers.htm The correct comparision is comparing power usage, as that is what is 'hot' these days. Just plain cash money compare is not enough. Weird yet true. In 3d world nations like for example China, India power is not a concern at all, not for government related tasks either. The slow adaptation to manycores, even for workloads that would do well on them (just in theory), is definitely limited by portability. Had some ESA dude on the phone a few days ago. I heard the word "portability" just a bit too much. That's why they do just too much with ugly slow JAVA code. Not fast enough at 1 pc? Put another 100 there. I was told exactly the same reasoning (portability problem) for other projects where i tried to sneak in GPU computing (regardless which manufacturer). Portability was also the KILLER there. If you write burocratic paper documents then CUDA is not portable and never will be of course, as the hardware is simply different from a CPU. Yet that code must be portable between oldie Sun, UNIX type machines and modern quadcores as well as new GPU hardware, inc ase you want to introduce GPU's. Not realistic of course. Just enjoy the speedup i'd say, if you can get it. They can spend millions on hardware, but not even a couple of hundreds of thousands on customized software to solve the problem of portability by having a plugin that is doing the crunching just for gpu's. Idiotic yet that's the sole truth. So to speak, manycores will only make it in there when NASA writes a big article online bragging how fast their supercomputing codes are at todays gpu's where they own a 100k from to do number crunching. I would argue for workloads favourable to GPU's, which is just a very few as of now, NVIDIA/AMD is up to 10x faster than a quadcore, if you know how to get it out of the card. 
Probably gpgpu for now is the cheap alternative for a few very specific tasks of 3d world nations therefore. May they lead us in the path ahead... In itself very funny that burocratic reasons (portability) is the biggest problem limiting progress. When you speak to hardware designers about say for example 32 core cpu's, they laugh loud. The only scalable hardware for now at 1 cpu giving a big punch, it seems to be manycores. All those managers simply have put their mind in a big storage bunker where alternatives are not allowed in. Even an economic crisis will not help it. They have to get bombarded with actual products that are interesting to them, that get a huge speedup at GPU's, to start understanding the advantage of it. The few who do understand already, they all keep their stuff so secret, and usually guys who are not exactly very good in parallellization may "try out" the GPU in question. That's another recipe for disaster of course. Logically that they never even get a speedup over a simple quadcore. If you compare assembler level SSE2 (modified intel primitives in SSE2 so you want) with a clumsy guy (not in his own thinking) who tries out the GPU for a few weeks, obviously it is gonna fail. Something algorithmic optimized for like 20-30 years now for pc type hardware, that suddenly must get ported within a few weeks to GPU. There is not many who can do that. You need complete different algorithmic approach for that. Something that is memory bound CAN get rewritten to cpu bound. Sometimes even without losing speed. Just because they didn't have the luxury of such huge cpu crunching power, they never tried! But that optimization step of 20 years is a big limit to GPU's. Add to it that intel is used to GIVE AWAY hardware to developers. I'll have to see nvidia do that. If those same guys as the above guys who failed, have that hardware for years at home, they MIGHT get to some ideas and tell their boss. It's those reports of those guys currently which adds to the storage bunker thinking. It is wrong to assume that experts can predict the future. Vincent > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Thu Nov 20 10:40:27 2008 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> <330326.56339.qm@web37907.mail.mud.yahoo.com> <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> Message-ID: <4925AF1B.5020001@tamu.edu> Mark Hahn wrote: >> Ellis, I can't say re. the Firestream cards, but for Nvidia the answer >> is a >> resounding yes. > > AMD had some PR recently (check the reg and inq) about supporting their > stream stuff across the whole product line, including chipset-integrated > gpus. that seems intelligent, given that lines between CPU and GPU are > obviously blurring in the future (Larrabee, Fusion, etc). > > IMO, it would be crazy to invest too much in the current gen of gp-gpu > programming stuff. doing some pilot stuff with both vendors probably > makes sense, but the field really does need OpenCL to succeed. I hope > the OpenCL people are not too OpenGL-ish, and realize that they need to > target SSE and SSE512 as well. 
Also, Portland Group is working to make their compilers work with CUDA programming methods >> Virtually any recent card can run CUDA code. If you Google you can get a >> list of compatible cards. > > not that many NVidia cards support DP yet though, which is probably > important to anyone coming from the normal HPC world... there's some > speculation that NV will try to keep DP as a market segmentation feature > to drive HPC towards high-cost Tesla cards, much as vendors have > traditionally tried to herd high-end vis into 10x priced cards. Ah, but the GLX280 does... gerry -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.862.3982 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From landman at scalableinformatics.com Thu Nov 20 11:55:04 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <49258593.1010608@scalableinformatics.com> Message-ID: <4925C098.2080501@scalableinformatics.com> Mark Hahn wrote: >> [shameless plug] >> >> A project I have spent some time with is showing 117x on a 3-GPU >> machine over a single core of a host machine (3.0 GHz Opteron 2222). >> The code is mpihmmer, and the GPU version of it. See >> http://www.mpihmmer.org for more details. Ping me offline if you need >> more info. >> >> [/shameless plug] > > I'm happy for you, but to me, you're stacking the deck by comparing to a > quite old CPU. you could break out the prices directly, but comparing 3x Hmmm... This is the machine the units were hosted in. The 2222 is not "quite old" by my definition of old. My experience with this code on Barcelona has been that it hasn't added much performance. Will quantify this more for you in the future. > GPU (modern? sounds like pci-express at least) to a current entry-level > cluster node (8 core2/shanghai cores at 2.4-3.4 GHz) be more appropriate. Hey ... messenger ... don't shoot? :) We would love to have a Shanghai. I don't have one in the lab. I just asked AMD for one. I honestly don't expect it to make much of a difference. > at the VERY least, honesty requires comparing one GPU against all the cores > in a current CPU chip. with your numbers, I expect that would change We are not being dishonest, in fact I was responding to the "can't really get good performance" thread. You can. This code scales linearly with the number of cores. Our mpi version scales linearly across compute nodes. > the speedup from 117 to around 15. still very respectable. Look, the performance is good. The cost to get this performance is very low from an acquisition side. The effort to get performance is relatively speaking, quite low. I want to emphasize this. It won't work for every code. There are large swaths of code it won't work for. This is life, and as with all technologies, YMMV. > I apologize for not RTFcode, but does the host version of hmmer you're > comparing with vectorize using SSE? JP did the vectorization. Performance was about 60% better than the baseline. I (and Joydeep at AMD) rewrote 30 lines of code and got 2x. There are papers referenced on the website that talk about this. > >>> or more generally: fairly small data, accessed data-parallel or with >>> very regular and limited sharing, with high work-per-data. >> >> ... not small data. You can stream data. 
> > can you sustain your 117x speedup if your data is in host memory? I believe that the databases are being streamed from host ram and disk. > by small, I meant the on-gpu-card memory, mainly, which is fast but > often more limited than host memory. The database sizes are 3-4GB and growing rapidly. The tests were originally run on GTX260s, which have 1GB ram or less. > sidebar: it's interesting that ram is incredibly cheap these days, > and we typically spec a middle-of-the-road machine at 2GB/core. > but even 4GB/core is not much more expensive, but to be honest, > the number of users who need that much is fairly small. > >>> GP-GPU tools are currently immature, and IMO the hardware probably >>> needs a generation of generalization before it becomes really widely >>> used. >> >> Hrmm... Cuda is pretty good. Still needs some polish, but people can >> use it, and are generating real apps from it. We are seeing pretty >> wide use ... I guess the issue is what one defines as "wide". > > Cuda is NV-only, and forces the programmer to face a lot of limits and > weaknesses. at least I'm told so by our Cuda users - things like having Er ... ok. Cuda is getting pretty much all the mind-share. We have asked AMD to support it. AMD is doing something else, CTM was not successful, and I haven't heard what the new strategy is. OpenCL looks like it will be "designed by committee". > to re-jigger code to avoid running out of registers. from my perspective, > a random science prof is going to be fairly put off by that sort of thing > unless the workload is really impossible to do otherwise. (compared to This is not my experience. > the traditional cluster+MPI approach, which is portable, scalable and at > least short-term future-proof.) If you go to the site, you will discover that mpihmmer is in fact cluster+MPI. It was extended to include GPU, FPGA, ... . Again, please don't shoot the messenger. > > thanks, mark. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From lindahl at pbm.com Thu Nov 20 12:16:27 2008 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] Badness in Shanghai upgrade Message-ID: <20081120201626.GC25370@bx9> I had been planning for a long time to upgrade a bunch of my AMD dual-core systems to quad-cores when quad-cores were cheap enough -- same socket, no problem, right? Well, the motherboard that I have apparently doesn't supply quite enough power to the memory when it's being driven by a Shanghai quad-core cpu. My test machine panics once per 2-3 days. I have a couple of Barcelonas and they work fine with this same mobo. Ooopsie. Good thing I don't own that many of 'em. -- greg From bill at cse.ucdavis.edu Thu Nov 20 13:59:00 2008 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] Badness in Shanghai upgrade In-Reply-To: <20081120201626.GC25370@bx9> References: <20081120201626.GC25370@bx9> Message-ID: <4925DDA4.5010406@cse.ucdavis.edu> Greg Lindahl wrote: > I had been planning for a long time to upgrade a bunch of my AMD > dual-core systems to quad-cores when quad-cores were cheap enough -- > same socket, no problem, right? > > Well, the motherboard that I have apparently doesn't supply quite > enough power to the memory Same dimms right? 
> when it's being driven by a Shanghai > quad-core cpu. Which speed shanghai? How big is your power supply? > My test machine panics once per 2-3 days. I have a > couple of Barcelonas and they work fine with this same mobo. Which speed barcelona? That's really surprising since I believe all shanghai cpus are lower power than all barcelonas. Did you upgrade the BIOS? Oh, what speed does the shanghai claim it's running the memory at? Shanghai can drive ddr2-800 instead of the barcelona's ddr2-667. If your motherboard and dimms are limited to 667 and you drive them at 800 that might be your problem. From ntmoore at gmail.com Thu Nov 20 19:52:47 2008 From: ntmoore at gmail.com (Nathan Moore) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors Message-ID: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> Hi All, I'm getting to the end of a semester of computational physics at my institution, and thought it would be fin to close the semester with a discussion of parallel programming. Initially, I was simply planning to discuss MPI, but while reading through the gfortran man page I realized that gcc now supports OpenMP directives. Given that the machines my students are using are all dual core, I started working on a simple example that I hoped would show a nice speedup from the "easy" library. The specific problem I'm working on is a 2-d solution to the laplace equation (electrostatics). The bulk of the computation is a recursion relation, applied to elements of a 2-d array, according to the following snippet. Of course, by now I should know that "simple" never really is. When I compile with gfortran and run with 1 or 2 cores (ie, OMP_NUM_THREADS=2, export OMP_NUM_THREADS) there is basically no difference in execution time. Any suggestions? I figured that this would be a simple example to parallelize. Is there a better example for OpenMP parallelization? Also, is there something obvious I'm missing in the example below? Nathan Moore integer,parameter::Nx=1000 integer,parameter::Ny=1000 real*8 v(Nx,Ny) integer boundary(Nx,Ny) v_cloud = -1.0e-4 v_ground = 0.d0 convergence_v = dabs(v_ground-v_cloud)/(1.d0*Ny*Ny) ! initialize the the boundary conditions do i=1,Nx do j=1,Ny v_y = v_ground + (v_cloud-v_ground)*(j*dy/Ly) boundary(i,j)=0 v(i,j) = v_y ! we need to ensure that the edges of the domain are held as boundary if(i.eq.0 .or. i.eq.Nx .or. j.eq.0 .or. j.eq.Ny) then boundary(i,j)=1 endif end do end do 10 converged = 1 !$OMP PARALLEL !$OMP DO do i=1,Nx do j=1,Ny if(boundary(i,j).eq.0) then old_v = v(i,j) v(i,j) = 0.25*(v(i-1,j)+v(i+1,j)+v(i,j+1)+v(i,j-1)) dv = dabs(old_v-v(i,j)) if(dv.gt.convergence_v) then converged = 0 endif endif end do end do !$OMP ENDDO !$OMP END PARALLEL if(converged.eq.0) then goto 10 endif -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081120/978f6317/attachment.html From landman at scalableinformatics.com Thu Nov 20 20:55:24 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> Message-ID: <49263F3C.3090005@scalableinformatics.com> Nathan Moore wrote: > Any suggestions? I figured that this would be a simple example to > parallelize. Is there a better example for OpenMP parallelization? 
> Also, is there something obvious I'm missing in the example below? A few thoughts ... Initialize your data in parallel as well. No reason not to. But optimize that code a bit. You don't need v_y = v_ground + (v_cloud-v_ground)*(j*dy/Ly) boundary(i,j)=0 v(i,j) = v_y when v(i,j)= v_ground + (v_cloud-v_ground)*(j*dy/Ly) boundary(i,j)=0 will eliminate the explicit temporary variable. Also the i.eq.0 test is guaranteed never to be hit in the if-then construct, as with the j.eq.0. You can (and should) replace that if-then construct with a set of loops of the form do j=1,Ny boundary(Nx,j) = 1 end do do i=1,Nx boundary(i,Ny) = 1 end do Also, what sticks out to me is that old_v may be viewed as "shared" versus "private". I know OpenMP is supposed to do the right thing here, but you might need to explicitly mark old_v as private. And dv for that matter. Note also that this inner loop is attempting to do a convergence test. You are looking to set a globally shared value from within an inner loop. This is not a good thing to do. This means accesses to that globally shared variable are going to be locked. I would suggest a slightly different inner loop and convergence test: (note ... this relies on something I havent tried in fortran so adjustment may be needed) real*8 vnew(Nx,Ny),dv(Nx,Ny) do i=1,Nx do j=1,Ny ! notice that the if-then construct is gone ... ! vnew eq 0.0 for boundaries vnew(i,j) = 0.25*(v(i-1,j)+v(i+1,j)+v(i,j+1)+v(i,j-1))* dabs(boundary(i,j).eq.0) dv(i,j) = (dabs(v(i,j)-vnew(i,j)) - convergence_v )* dabs(boundary(i,j).eq.0) end do end do ! now all you need is a "linear scan" to find positive elements in ! dv. You can approach these as sum reductions, and do them in ! parallel do i=1,Nx sum=0.0 do j=1,Ny sum = sum + dabs(dv(i,j) .gt. 0.0) * dv(i,j) end do if (sum .gt. 0.0) converged = 0 end do The basic idea is to replace the inner loop conditionals and remove as many of the shared variables as possible. Also c.f. examples here: http://www.linux-mag.com/id/4609 specifically the Riemann zeta function (fairly trivial). -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From lindahl at pbm.com Thu Nov 20 22:06:23 2008 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] Windows in top 10 Message-ID: <20081121060623.GA4643@bx9> A while back a customer of PathScale issued a press release about how they were running Windows on their supercomputer, with InfiniPath interconnect. We were kinda surprised, since we didn't have Windows drivers. So I was curious about the Dawning 5000A... the previous supercomputers in this series ran Linux. But the top500 entry trumpets Windows. Well... http://tyan.com/newsroom_pressroom_detail.aspx?id=1289 This press release says it runs Linux and Windows. That's a bit more credible; I wonder if they actually had to buy the Windows software? -- greg From diep at xs4all.nl Thu Nov 20 23:56:59 2008 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] Windows in top 10 In-Reply-To: <20081121060623.GA4643@bx9> References: <20081121060623.GA4643@bx9> Message-ID: hi, i'm using now 10 yuan = 1 euro, just for ease of calculation: 2 billion RMB = 2 billion yuah (assumption) = 200 million euro 200 mln / 1920 nodes = 104,166.666666667 euro a node 104k euro a node in short. 
That's soon 200k dollars. It doesn't matter what gets delivered for that: if it has just 16 cores, then it is a factor of 5-10 too expensive. Even if there are 4 expensive FPGA cards or so in each node, that just doesn't matter. On top of that, 128 GB of DDR2-800 ECC RAM is really cheap.

We all know this typical government problem of overpaying for hardware. But it sure means there was enough budget for Windows.

Vincent

On Nov 21, 2008, at 7:06 AM, Greg Lindahl wrote:

> A while back a customer of PathScale issued a press release about how
> they were running Windows on their supercomputer, with InfiniPath
> interconnect. We were kinda surprised, since we didn't have Windows
> drivers.
>
> So I was curious about the Dawning 5000A... the previous
> supercomputers
> in this series ran Linux. But the top500 entry trumpets Windows.
> Well...
>
> http://tyan.com/newsroom_pressroom_detail.aspx?id=1289
>
> This press release says it runs Linux and Windows. That's a bit more
> credible; I wonder if they actually had to buy the Windows software?
>
> -- greg
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From hbugge at platform.com Fri Nov 21 00:55:05 2008
From: hbugge at platform.com (=?iso-8859-1?Q?H=E5kon?= Bugge)
Date: Wed Nov 25 01:07:59 2009
Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters
In-Reply-To:
References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <49258593.1010608@scalableinformatics.com>
Message-ID:

Mark,

Guess you're too humble ;-)

At 17:23 20.11.2008, Mark Hahn wrote:
>I'm happy for you, but to me, you're stacking
>the deck by comparing to a quite old CPU. you
>could break out the prices directly, but comparing 3x
>GPU (modern? sounds like pci-express at least)
>to a current entry-level cluster node (8
>core2/shanghai cores at 2.4-3.4 GHz) be more appropriate.
>
>at the VERY least, honesty requires comparing one GPU against all the cores
>in a current CPU chip. with your numbers, I
>expect that would change the speedup from 117 to
>around 15. still very respectable.

I compiled the serial hmm version using the default make file (gcc -O2 -g) and ran it on an Opteron 2220 (2.8 GHz). Then I compiled the MPI version using Intel compiler 10.1 (icc -axS -O3) and ran it on a not-yet-to-be-released two-socket machine using 16 MPI processes. The latter ran 145x faster. So soon, the 15x is below 1x...

So, YMMV!

Håkon

From hbugge at platform.com Fri Nov 21 01:01:09 2008
From: hbugge at platform.com (=?iso-8859-1?Q?H=E5kon?= Bugge)
Date: Wed Nov 25 01:07:59 2009
Subject: [Beowulf] Windows in top 10
In-Reply-To: <20081121060623.GA4643@bx9>
References: <20081121060623.GA4643@bx9>
Message-ID:

At 07:06 21.11.2008, Greg Lindahl wrote:
>A while back a customer of PathScale issued a press release about how
>they were running Windows on their supercomputer, with InfiniPath
>interconnect. We were kinda surprised, since we didn't have Windows
>drivers.
>
>So I was curious about the Dawning 5000A... the previous supercomputers
>in this series ran Linux. But the top500 entry trumpets Windows. Well...
>
>http://tyan.com/newsroom_pressroom_detail.aspx?id=1289
>
>This press release says it runs Linux and Windows. That's a bit more
>credible; I wonder if they actually had to buy the Windows software?

The top500 list is not very trustworthy in my opinion.
I guess I posted an example some years ago of a Myrinet cluster on the list owned by Statoil (Norwegian Oil company). Only problem was that Statoil didn't have any Myrinet clusters. They had clusters with the same spec except the interconnect though. So the entry kinda depicted _if_ that cluster had been equipped with Myrinet, _then_ it would (could?) have been able to achieve the result. H?kon From franz.marini at mi.infn.it Fri Nov 21 02:44:35 2008 From: franz.marini at mi.infn.it (Franz Marini) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> <330326.56339.qm@web37907.mail.mud.yahoo.com> <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> Message-ID: <1227264275.20502.5.camel@merlino.mi.infn.it> Hello, On Thu, 2008-11-20 at 09:58 -0500, Mark Hahn wrote: > > Virtually any recent card can run CUDA code. If you Google you can get a > > list of compatible cards. > > not that many NVidia cards support DP yet though, which is probably > important to anyone coming from the normal HPC world... there's some > speculation that NV will try to keep DP as a market segmentation > feature to drive HPC towards high-cost Tesla cards, much as vendors > have traditionally tried to herd high-end vis into 10x priced cards. That's simply not true. Every newer card from NVidia (that is, every G200-based card, right now, GTX260, GTX260-216 and GTX280) supports DP, and nothing indicates that NV will remove support in future cards, quite the contrary. The distinction between Tesla and GeForce cards is that the former have no display output, they usually have more ram, and (but I'm not sure about this one) they are clocked a little lower. F. --------------------------------------------------------- Franz Marini Prof. R. A. Broglia Theoretical Physics of Nuclei, Atomic Clusters and Proteins Research Group Dept. of Physics, University of Milan, Italy. email : franz.marini@mi.infn.it phone : +39 02 50317226 --------------------------------------------------------- From eugen at leitl.org Fri Nov 21 04:03:29 2008 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] SGI shows off Molecule concept machine Message-ID: <20081121120329.GE11544@leitl.org> http://www.theregister.co.uk/2008/11/20/sgi_molecule_concept/ SGI shows off Molecule concept machine SC08 A dense cluster of Intel Atoms By Timothy Prickett Morgan ? Get more from this author Posted in Servers, 20th November 2008 22:05 GMT While supercomputer maker Silicon Graphics was showing off its existing Altix lines of Xeon and Itanium servers at the SC08 supercomputing show in Austin, Texas, this week, the most interesting thing the company touted was not yet a real computer, but a concept system, called Molecule. The Molecule machine takes a few pages out of IBM's BlueGene massively parallel supercomputer book, and the main one is that for some workloads, where a large number of compute nodes need to be brought to bear to run a simulation, sometimes it makes more sense to have relatively modest processors instead of big fat ones. IBM built the BlueGene/L super from its embedded PowerPC 440 dual-core processors. 
SGI's Molecule concept machine would be built from Intel dual-core Atom x64 chips, which are based on 45 nanometer processes and are designed for netbooks and other portable computing devices where long battery life, not computing power, is the limit of usefulness. The chips run at between 800 MHz and 1.67 GHz and implement HyperThreading, so they can deliver up to two virtual threads per core. With the BlueGene box, IBM controlled not only the chip but also the interface off the chip and out into the system interconnect. Michael Brown, sciences segment manager at SGI who was showing off the Molecule concept box, says that SGI can't really control the interconnect Intel will put on Atom boards. But presumably a fast enough interconnect could be designed to plug multiple Atom boards into a chassis. The Molecule concept machine puts a dual-core Atom N330, code-named "Diamondville," on a system board that is about the size of a credit card. This particular chip runs at 1.6 GHz and has a thermal design point of about 8 watts. The Atom N330 is not a true dual-core chip, but rather two single-core Atoms side-by-side in a single chip package (it really isn't even a socket) that is mounted to the board. Brown said that the future "Lincroft" iteration of the Atom chip, which will put a DDR2 memory controller on the chip, and thereby eliminate the need for an external chipset since the Molecule boards have no direct attached storage other than main memory, would be an interesting possibility. But Brown made no commitments to SGI actually using this chip. In any event, the Molecule board had four memory DIMMs soldered directly to the board and linked to the chip, which provided 2 GB of memory capacity. The interconnect is along the side of the board as the memory chips, and would plug into a backplane of some sort that would reach out to external storage and networks, much as blade servers do inside their chassis. The Molecule design glues two of these Atom boards to a hollow ceramic cartridge that is used to hold the boards in place, to draw heat off the boards, and to channel cooling air that comes in through the bottom of the chassis and is diverted at a 90 degree angle out the back of the chassis. The cartridges interlace to create a bunch of channels, and have fins and baffles inside to direct airflow very precisely. SGI calls this Atom board packaging Kelvin. SGI's Molecule Kelvin Packaging Kelvin, lording over the Atoms in the Molecule The concept machine at the SC08 show was a 3U rack that contained 180 of the Atom boards, for a total of 360 cores. These boards would present 720 virtual threads to a clustered application, and have 720 GB of main memory (using 512 MB DDR2 DIMMs mounted on the board) and a total of 720 GB/sec of memory bandwidth. The important thing to realize, explained Brown, is that if the interconnect was architected correctly, the entire memory inside the chassis could be searched in one second. That memory bandwidth, Brown explained, was up to 15 TB/sec per rack, or about 20 times that of a single-rack cluster these days. This setup would be good for applications where cache memory or out-of-order execution don't help, but massive amounts of threads do help. (Search, computational fluid dynamics, seismic processing, stochastic modeling, and others were mentioned). The other advantages that the Molecule system might have are low energy use and low cost. 
The aggregate memory bandwidth in a rack of these machines (that's 10,080 cores with 9.8 TB of memory) would deliver about 7 times the GB per second per watt of a rack of x64 servers in a cluster today, according to Brown. On applications where threads rule, the Molecule would do about 7 times the performance per watt of x64 servers, and on SPEC-style floating point tests, it might even deliver twice the performance per watt. On average, SGI is saying performance per watt should be around 3.5 times that of a rack of x64 servers. One more thing: It has no moving parts, and that increases reliability. And if storage needs to be added to the Molecule architecture, it will be flash memory. The Molecule aims to run off-the-shelf HPC applications on top of Linux or Windows. Brown said that SGI was showing off the concept box to solicit input from prospective customers even before it creates an alpha box. If SGI sees enough interest, it could take 12 to 18 months to produce the concept. If the idea is sound, let's hope it doesn't take that long. From bill at cse.ucdavis.edu Fri Nov 21 04:36:43 2008 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> Message-ID: <4926AB5B.8040609@cse.ucdavis.edu> Fortran isn't one of my better languages, but I did manage to tweak your code into something that I believe works the same and is openMP friendly. I put a copy at: http://cse.ucdavis.edu/bill/OMPdemo.f When I used the pathscale compiler on your code it said: "told.f", line 27: Warning: Referenced scalar variable OLD_V is SHARED by default "told.f", line 29: Warning: Referenced scalar variable DV is SHARED by default "told.f", line 31: Warning: Referenced scalar variable CONVERGED is SHARED by default I rewrote your code to get rid of those, I didn't know some of the constants you mentioned dy and Ly. So I just wrote my own initialization. I skipped the boundary conditions by just restricting the start and end of the loops. Your code seemed to be interpolating between the current iteration (i-1 and j-1) and the last iteration (i+1 and j+1). Not sure if that was intentional or not. In any case I just processed the array v into v2, then if it didn't converge I processed the v2 array back into v. To make each loop independent I made converge a 1D array which stored the sum of that row's error. Then after each array was processed I walked the 1-d array to see if we had converged. I exit when all pixels are below the convergence value. It scales rather well on a dual socket barcelona (amd quad core), my version iterates a 1000x1000 array with a range of values from 0-200 over 1214 iterations to within a convergence of 0.02. CPUs time Scaling ================= 1 54.51 2 27.75 1.96 faster 4 14.14 3.85 faster 8 7.75 7.03 faster Hopefully my code is doing what you intended. Alas, with gfortran (4.3.1 or 4.3.2), I get a segmentation fault as soon as I run. Same if I compile with -g and run it under the debugger. I'm probably doing something stupid. 
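For anyone following along without downloading OMPdemo.f, here is a minimal sketch of the structure Bill describes: the sweep reads only from v and writes only to vnew, each column's error goes into its own slot of a 1-D array, and a serial walk over that array decides convergence. This is not Bill's code; the boundary values, the tolerance, the copy-back (instead of ping-ponging the two arrays), and the per-column maximum (instead of a row sum) are all illustrative choices. It should build with something like gfortran -fopenmp -O2 jacobi.f90.

program jacobi_omp
  implicit none
  integer, parameter :: Nx = 1000, Ny = 1000
  real*8,  parameter :: tol = 1.0d-4          ! assumed convergence threshold
  real*8  :: v(Nx,Ny), vnew(Nx,Ny), colerr(Ny)
  integer :: i, j, iter
  logical :: converged

  ! assumed boundary condition: the j=Ny edge is held at 1, everything else starts at 0
  v       = 0.0d0
  v(:,Ny) = 1.0d0
  vnew    = v
  converged = .false.

  do iter = 1, 100000
     ! interior sweep: reads only v, writes only vnew, so every (i,j)
     ! update is independent and the column loop parallelizes cleanly
     !$omp parallel do private(i)
     do j = 2, Ny-1
        colerr(j) = 0.0d0
        do i = 2, Nx-1
           vnew(i,j) = 0.25d0*( v(i-1,j) + v(i+1,j) + v(i,j-1) + v(i,j+1) )
           colerr(j) = max(colerr(j), abs(vnew(i,j) - v(i,j)))
        end do
     end do
     !$omp end parallel do

     ! serial walk of the 1-D error array; no shared scalar is written
     ! inside the parallel region, so there is nothing to lock
     converged = maxval(colerr(2:Ny-1)) < tol

     ! copy the interior back; the boundary rows and columns never change
     !$omp parallel do private(i)
     do j = 2, Ny-1
        do i = 2, Nx-1
           v(i,j) = vnew(i,j)
        end do
     end do
     !$omp end parallel do

     if (converged) exit
  end do

  write(*,*) 'iterations:', iter, ' converged:', converged
end program jacobi_omp

Keeping every (i,j) update dependent only on the previous sweep is what makes the loop data-parallel; it is also why this Jacobi-style iteration typically needs more sweeps than the in-place Gauss-Seidel-style update in the original code.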
From hearnsj at googlemail.com Fri Nov 21 05:21:17 2008 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] SGI shows off Molecule concept machine In-Reply-To: <20081121120329.GE11544@leitl.org> References: <20081121120329.GE11544@leitl.org> Message-ID: <9f8092cc0811210521i260f3100xde9a1553d4871197@mail.gmail.com> 2008/11/21 Eugen Leitl > > http://www.theregister.co.uk/2008/11/20/sgi_molecule_concept/ > > SGI shows off Molecule concept machine > > Cool! (pun intended). I wonder out loud if some customer with a black hat and some deep pockets has already placed an order :-) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081121/8135f7fc/attachment.html From hahn at mcmaster.ca Fri Nov 21 06:05:54 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <1227264275.20502.5.camel@merlino.mi.infn.it> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> <330326.56339.qm@web37907.mail.mud.yahoo.com> <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> <1227264275.20502.5.camel@merlino.mi.infn.it> Message-ID: >>> Virtually any recent card can run CUDA code. If you Google you can get a >>> list of compatible cards. >> >> not that many NVidia cards support DP yet though, which is probably >> important to anyone coming from the normal HPC world... there's some >> speculation that NV will try to keep DP as a market segmentation >> feature to drive HPC towards high-cost Tesla cards, much as vendors >> have traditionally tried to herd high-end vis into 10x priced cards. > > That's simply not true. Every newer card from NVidia (that is, every which part is not true? the speculation? OK - speculation is always just speculation. it _is_ true that only the very latest NV generation, essentially three bins of one card, does support DP. > and nothing indicates that NV will remove support in future cards, quite > the contrary. hard to say. NV is a very competitively driven company, that is, makes decisions for competitive reasons. it's a very standard policy to try to segment your market, to develop higher margin segments that depend on restricted features. certainly NV has done that before (hence the existence of Quadro and Tesla) though it's not clear to me whether they will have any meaningful success given the other players in the market. segmentation is a play for a dominant incumbent, and I don't think NV is or believes itself so. AMD obviously seeks to avoid giving NV any advantage, and ATI has changed its outlook somewhat since AMDification. and Larrabee threatens to eat both their lunches. > The distinction between Tesla and GeForce cards is that the former have > no display output, they usually have more ram, and (but I'm not sure > about this one) they are clocked a little lower. both NV and ATI have always tried to segment "professional graphics" into a higher-margin market. this involves tying the pro drivers to features found only in the pro cards. it's obvious that NV _could_ do this with Cuda, though I agree they probably won't. the original question was whether there is a strong movement towards gp-gpu clusters. I think there is not, because neither the hardware nor software is mature. 
Cuda is the main software right now, and is NV-proprietary, and is unlikley to target ATI and Intel gp-gpu hardware. finally, it needs to be said again: current gp-gpus deliver around 1 SP Tflop for around 200W. a current cpu (3.4 GHz Core2) delivers about 1/10 as many flops for something like 1/2 the power. (I'm approximating cpu+nb+ram.) cost for the cpu approach is higher (let's guess 2x, but again it's hard to isolate parts of a system.) so we're left with a peak/theoretical difference of around 1 order of magnitude. that's great! more than enough to justify use of a unique (read proprietary, nonportable) development tool for some places where GPUs work especially well (and/or CPUs work poorly). and yes, adding gp-gpu cards to a cluster is a fairly modest price/power premium if you expect to use it. Joe's hmmer example sounds like an excellent example, since it shows good speedup, and the application seems to be well-suited to gp-gpu strengths (and it has a fairly small kernel that needs to be ported to Cuda.) but comparing all the cores of a July 2008 GPU card to a single core on a 90-nm, n-3 generation chip really doesn't seem appropriate to me. From jan.heichler at gmx.net Fri Nov 21 06:23:58 2008 From: jan.heichler at gmx.net (Jan Heichler) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <1227264275.20502.5.camel@merlino.mi.infn.it> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> <330326.56339.qm@web37907.mail.mud.yahoo.com> <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> <1227264275.20502.5.camel@merlino.mi.infn.it> Message-ID: <619899985.20081121152358@gmx.net> Hallo Franz, Freitag, 21. November 2008, meintest Du: FM> That's simply not true. Every newer card from NVidia (that is, every FM> G200-based card, right now, GTX260, GTX260-216 and GTX280) supports DP, FM> and nothing indicates that NV will remove support in future cards, quite FM> the contrary. FM> The distinction between Tesla and GeForce cards is that the former have FM> no display output, they usually have more ram, and (but I'm not sure FM> about this one) they are clocked a little lower. Don't forget that Teslas have ECC-RAM. Normal Graphic cards don't care about flipped memory bits. That does not count when processing DirectX or OpenGL - but it does for computation. So a highend GPU can miscalculate... Jan -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081121/0f2df55e/attachment.html From gdjacobs at gmail.com Fri Nov 21 07:01:35 2008 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <49263F3C.3090005@scalableinformatics.com> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> <49263F3C.3090005@scalableinformatics.com> Message-ID: <4926CD4F.6050307@gmail.com> Joe Landman wrote: > Nathan Moore wrote: > >> Any suggestions? I figured that this would be a simple example to >> parallelize. Is there a better example for OpenMP parallelization? >> Also, is there something obvious I'm missing in the example below? > > A few thoughts ... > > Initialize your data in parallel as well. No reason not to. But > optimize that code a bit. 
You don't need > > v_y = v_ground + (v_cloud-v_ground)*(j*dy/Ly) > boundary(i,j)=0 > v(i,j) = v_y > > when > > v(i,j)= v_ground + (v_cloud-v_ground)*(j*dy/Ly) > boundary(i,j)=0 > > will eliminate the explicit temporary variable. Also the i.eq.0 test is > guaranteed never to be hit in the if-then construct, as with the j.eq.0. > > You can (and should) replace that if-then construct with a set of loops > of the form > > do j=1,Ny > boundary(Nx,j) = 1 > end do > do i=1,Nx > boundary(i,Ny) = 1 > end do > > Also, what sticks out to me is that old_v may be viewed as "shared" > versus "private". I know OpenMP is supposed to do the right thing here, > but you might need to explicitly mark old_v as private. And dv for > that matter. > > Note also that this inner loop is attempting to do a convergence test. > You are looking to set a globally shared value from within an inner > loop. This is not a good thing to do. This means accesses to that > globally shared variable are going to be locked. > > I would suggest a slightly different inner loop and convergence test: > (note ... this relies on something I havent tried in fortran so > adjustment may be needed) > > > real*8 vnew(Nx,Ny),dv(Nx,Ny) > > do i=1,Nx > do j=1,Ny > ! notice that the if-then construct is gone ... > ! vnew eq 0.0 for boundaries > vnew(i,j) = 0.25*(v(i-1,j)+v(i+1,j)+v(i,j+1)+v(i,j-1))* > dabs(boundary(i,j).eq.0) > dv(i,j) = (dabs(v(i,j)-vnew(i,j)) - convergence_v )* > dabs(boundary(i,j).eq.0) > end do > end do If this were done with MPI, one would have to be careful of the boundaries on the matrix as it's partitioned for computation. OpenMP is intelligent enough to hold off computation on the tiles south and east of the first until the first is done, and so forth? > ! now all you need is a "linear scan" to find positive elements in > ! dv. You can approach these as sum reductions, and do them in > ! parallel > do i=1,Nx > sum=0.0 > do j=1,Ny > sum = sum + dabs(dv(i,j) .gt. 0.0) * dv(i,j) > end do > if (sum .gt. 0.0) converged = 0 > end do > > The basic idea is to replace the inner loop conditionals and remove as > many of the shared variables as possible. Yup, keep things pipelined. > Also c.f. examples here: http://www.linux-mag.com/id/4609 specifically > the Riemann zeta function (fairly trivial). > -- Geoffrey D. Jacobs From franz.marini at mi.infn.it Fri Nov 21 07:08:52 2008 From: franz.marini at mi.infn.it (Franz Marini) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <619899985.20081121152358@gmx.net> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> <330326.56339.qm@web37907.mail.mud.yahoo.com> <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> <1227264275.20502.5.camel@merlino.mi.infn.it> <619899985.20081121152358@gmx.net> Message-ID: <1227280132.20502.31.camel@merlino.mi.infn.it> Hallo Jan, On Fri, 2008-11-21 at 15:23 +0100, Jan Heichler wrote: > Hallo Franz, > > > Freitag, 21. November 2008, meintest Du: > > > FM> That's simply not true. Every newer card from NVidia (that is, > every > > FM> G200-based card, right now, GTX260, GTX260-216 and GTX280) > supports DP, > > FM> and nothing indicates that NV will remove support in future cards, > quite > > FM> the contrary. 
> > > FM> The distinction between Tesla and GeForce cards is that the former > have > > FM> no display output, they usually have more ram, and (but I'm not > sure > > FM> about this one) they are clocked a little lower. > > > Don't forget that Teslas have ECC-RAM. Normal Graphic cards don't care > about flipped memory bits. That does not count when processing DirectX > or OpenGL - but it does for computation. So a highend GPU can > miscalculate... Ja, wirchlich... ;) Yeah, that's an advantage I was forgetting about, and for cluster use, or a multi-GPU system in a deskside computer, it could really matter... In order not to flood the list with answers, I'm gonna answer Mark here, too : On Fri, 2008-11-21 at 09:05 -0500, Mark Hahn wrote: > > and nothing indicates that NV will remove support in future cards, quite > > the contrary. > > hard to say. NV is a very competitively driven company, that is, makes > decisions for competitive reasons. it's a very standard policy to try > to segment your market, to develop higher margin segments that depend > on restricted features. certainly NV has done that before (hence the > existence of Quadro and Tesla) though it's not clear to me whether they > will have any meaningful success given the other players in the market. > segmentation is a play for a dominant incumbent, and I don't think NV > is or believes itself so. AMD obviously seeks to avoid giving NV any > advantage, and ATI has changed its outlook somewhat since AMDification. > and Larrabee threatens to eat both their lunches. > > > The distinction between Tesla and GeForce cards is that the former have > > no display output, they usually have more ram, and (but I'm not sure > > about this one) they are clocked a little lower. > > both NV and ATI have always tried to segment "professional graphics" > into a higher-margin market. this involves tying the pro drivers to > features found only in the pro cards. True, although, as far as I remember, the only real distinction between Quadro and GeForce cards are hardware support for antialiased lines which is present in the former (I could be wrong though, and there may be some more substantial differences)... > it's obvious that NV _could_ > do this with Cuda, though I agree they probably won't. > > the original question was whether there is a strong movement towards > gp-gpu clusters. I think there is not, because neither the hardware > nor software is mature. Cuda is the main software right now, and is > NV-proprietary, and is unlikley to target ATI and Intel gp-gpu hardware. > > finally, it needs to be said again: current gp-gpus deliver around > 1 SP Tflop for around 200W. a current cpu (3.4 GHz Core2) delivers > about 1/10 as many flops for something like 1/2 the power. (I'm > approximating cpu+nb+ram.) cost for the cpu approach is higher (let's > guess 2x, but again it's hard to isolate parts of a system.) > > so we're left with a peak/theoretical difference of around 1 order of > magnitude. that's great! more than enough to justify use of a unique > (read proprietary, nonportable) development tool for some places where > GPUs work especially well (and/or CPUs work poorly). and yes, adding > gp-gpu cards to a cluster is a fairly modest price/power premium if > you expect to use it. > > Joe's hmmer example sounds like an excellent example, since it shows good > speedup, and the application seems to be well-suited to gp-gpu strengths > (and it has a fairly small kernel that needs to be ported to Cuda.) 
> but comparing all the cores of a July 2008 GPU card to a single core on a > 90-nm, n-3 generation chip really doesn't seem appropriate to me. I think we can agree on all these points, although I'm sure Joe's comparison, or, better, Joe's cpu used in the comparison has not been a deliberate choice to somehow make the GPU version stand out more. Regarding the proprietary-ness of CUDA, I would argue that being proprietary also means that it probably better targets the NV GPU architecture, and a more general, portable solution, like OpenCL (which seems to be closer than expected, by the way) will possibly mean a somewhat less optimal use of the GPU. Maybe I'm wrong, though, I guess we will just have to wait a few more months to find out :) I'm gonna get back to some real work now, have a good day, F. --------------------------------------------------------- Franz Marini Prof. R. A. Broglia Theoretical Physics of Nuclei, Atomic Clusters and Proteins Research Group Dept. of Physics, University of Milan, Italy. email : franz.marini@mi.infn.it phone : +39 02 50317226 --------------------------------------------------------- From landman at scalableinformatics.com Fri Nov 21 07:23:51 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <4926CD4F.6050307@gmail.com> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> <49263F3C.3090005@scalableinformatics.com> <4926CD4F.6050307@gmail.com> Message-ID: <4926D287.7030109@scalableinformatics.com> Geoff Jacobs wrote: > If this were done with MPI, one would have to be careful of the > boundaries on the matrix as it's partitioned for computation. OpenMP is > intelligent enough to hold off computation on the tiles south and east > of the first until the first is done, and so forth? No... I didn't address the interior vs exterior. I have a nice worked example where I convert this sort of code into an exterior, a skin, and the communication for an MPI and OpenMP version. Scales pretty well. You are right, I should have fixed that as well. > >> ! now all you need is a "linear scan" to find positive elements in >> ! dv. You can approach these as sum reductions, and do them in >> ! parallel >> do i=1,Nx >> sum=0.0 >> do j=1,Ny >> sum = sum + dabs(dv(i,j) .gt. 0.0) * dv(i,j) >> end do >> if (sum .gt. 0.0) converged = 0 >> end do >> >> The basic idea is to replace the inner loop conditionals and remove as >> many of the shared variables as possible. > > Yup, keep things pipelined. That was the idea, though I didn't compile/test the code to be sure it would work. That and I usually try to avoid real coding when I am tired at night. Coding and beer don't mix (for me). 
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hearnsj at googlemail.com Fri Nov 21 07:24:58 2008 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <1227280132.20502.31.camel@merlino.mi.infn.it> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> <330326.56339.qm@web37907.mail.mud.yahoo.com> <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> <1227264275.20502.5.camel@merlino.mi.infn.it> <619899985.20081121152358@gmx.net> <1227280132.20502.31.camel@merlino.mi.infn.it> Message-ID: <9f8092cc0811210724h782b80a0ma92a74a3338acedd@mail.gmail.com> 2008/11/21 Franz Marini > H > Regarding the proprietary-ness of CUDA, I would argue that being > proprietary also means that it probably better targets the NV GPU > architecture, and a more general, portable solution, like OpenCL (which > seems to be closer than expected, by the way) will possibly mean a > somewhat less optimal use of the GPU. M Guys, I'm going to be controversial here. The market may SAY otherwise, but the market does not give a rat's behind about proprietariness. Tell a scientist that her N-body dynamics astrophysics model will run 500 times faster on a certain GPU and she'll get more papers published and an invite to a conference in Hawaii next year and you'll see those grant dollars being spent. Tell and engineer that his Nastran model or his CFD simulation will finish whilst he goes off to lunch/coffee and he'll bite your hand off. It all comes down to codes - when the ISV codes use these things, you'll see the uptake. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081121/a04afa51/attachment.html From ntmoore at gmail.com Fri Nov 21 07:35:15 2008 From: ntmoore at gmail.com (Nathan Moore) Date: Wed Nov 25 01:07:59 2009 Subject: Fwd: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <6009416b0811210735p2fe089a8r7bd3fe40bddfdbbf@mail.gmail.com> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> <49264B79.9040301@cse.ucdavis.edu> <6009416b0811210735p2fe089a8r7bd3fe40bddfdbbf@mail.gmail.com> Message-ID: <6009416b0811210735x64abccfdie623b2a77f1d593c@mail.gmail.com> ---------- Forwarded message ---------- From: Nathan Moore Date: Fri, Nov 21, 2008 at 9:35 AM Subject: Re: [Beowulf] OpenMP on AMD dual core processors To: Bill Broadley You're right about the recursive definition, v(i,j) = 0.25*(v(i-1,j)+v(i+1,j)+v(i,j+1)+v(i,j-1)) It is an old serial programming trick that makes the computation go faster with little convergence penalty. I was thinking that two arrays would have a memory latency (reading in and out simultaneously), but I see what you mean about forcing the computation to be serial. On Thu, Nov 20, 2008 at 11:47 PM, Bill Broadley wrote: > OpenMP only works on loops that are independent. So something like: > do j=1,Ny > v(j) = v(j) + 1 > > So 100 CPUs could each run with a different value for J and not conflict. 
> > Your code however: > do i=1,Nx > do j=1,Ny > if(boundary(i,j).eq.0) then > old_v = v(i,j) > v(i,j) = 0.25*(v(i-1,j)+v(i+1,j)+v(i,j+1)+v(i,j-1)) > > Neither the i loop nor the j loop can be parallelized because the value if > i-1 > and j-1 have been referenced. Does that code even work? Is it intentional > that the v(i-1) value is from the current iteration, but v(i+1) value is > from > the previous iteration? > > Seems like a much better idea to have a new array that is built entirely > from > the previous timestep. That would allow it to converge faster, coverge is > more cases, and also parallelize. > > Make sense? > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081121/73a62902/attachment.html From ntmoore at gmail.com Fri Nov 21 07:38:29 2008 From: ntmoore at gmail.com (Nathan Moore) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <4926AB5B.8040609@cse.ucdavis.edu> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> <4926AB5B.8040609@cse.ucdavis.edu> Message-ID: <6009416b0811210738h27251837i80aebb5ee30546d3@mail.gmail.com> Thanks a ton for the worked out example! I had a similar problem with gfortran, and it only appeared with large array sizes (bigger than 4000x4000 as I recall). "ulimit" was no help, I assume there's a memory constraint built in somewhere. (as an aside, I once ran into a similar problem with perl - the release on linux would only allow 200MB array sizes, but the version available on a sun machine would allow GB of array sizes) On Fri, Nov 21, 2008 at 6:36 AM, Bill Broadley wrote: > Fortran isn't one of my better languages, but I did manage to tweak your > code > into something that I believe works the same and is openMP friendly. > > I put a copy at: > http://cse.ucdavis.edu/bill/OMPdemo.f > > When I used the pathscale compiler on your code it said: > "told.f", line 27: Warning: Referenced scalar variable OLD_V is SHARED by > default > "told.f", line 29: Warning: Referenced scalar variable DV is SHARED by > default > "told.f", line 31: Warning: Referenced scalar variable CONVERGED is SHARED > by > default > > I rewrote your code to get rid of those, I didn't know some of the > constants > you mentioned dy and Ly. So I just wrote my own initialization. I skipped > the boundary conditions by just restricting the start and end of the loops. > > Your code seemed to be interpolating between the current iteration (i-1 and > j-1) and the last iteration (i+1 and j+1). Not sure if that was > intentional > or not. In any case I just processed the array v into v2, then if it > didn't > converge I processed the v2 array back into v. To make each loop > independent > I made converge a 1D array which stored the sum of that row's error. Then > after each array was processed I walked the 1-d array to see if we had > converged. I exit when all pixels are below the convergence value. > > It scales rather well on a dual socket barcelona (amd quad core), my > version > iterates a 1000x1000 array with a range of values from 0-200 over 1214 > iterations to within a convergence of 0.02. 
> > CPUs time Scaling > ================= > 1 54.51 > 2 27.75 1.96 faster > 4 14.14 3.85 faster > 8 7.75 7.03 faster > > Hopefully my code is doing what you intended. > > Alas, with gfortran (4.3.1 or 4.3.2), I get a segmentation fault as soon as > I > run. Same if I compile with -g and run it under the debugger. I'm > probably > doing something stupid. > > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081121/683120af/attachment.html From ntmoore at gmail.com Fri Nov 21 07:42:37 2008 From: ntmoore at gmail.com (Nathan Moore) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <39705.128.83.67.198.1227279925.squirrel@webmail.lncc.br> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> <39705.128.83.67.198.1227279925.squirrel@webmail.lncc.br> Message-ID: <6009416b0811210742i67c06368yc5488f940eef22ec@mail.gmail.com> What a relief that no one castigated me for including a "goto"! I'm teaching out of Kupferschmid's "Classical Fortran", and he makes a rather compelling case that goto s actually more pedagogically sound (for beginners) than "do while" Successive steps in the equilibration are time-dependant, so they're IMPOSSIBLE to parallelize. I suppose I could get around the re-initialization by including a "SINGLE" directive around the iteration control structure. Nathan On Fri, Nov 21, 2008 at 9:05 AM, wrote: > Hi All > > I thing the problem could be the convergence "loop" test and the criation > of threads > > 10 converged = 1 > !$OMP PARALLEL > !$OMP DO > ..... > !$OMP ENDDO > !$OMP END PARALLEL > if(converged.eq.0) then > > goto 10 > endif > > Each time you "goto 10" > the compiler "create" and "initialize" the threads > and this is time comsuming. > try to change the convergence test to a > reduce operation this will > take time but not some much as !$OMP > PARALLEL > > I hope its help > > Renato Silva > > > > > Hi All, > > > > I'm getting to the end of a semester of computational physics at my > > institution, and thought it would be fin to close the semester with a > > discussion of parallel programming. Initially, I was simply planning to > > discuss MPI, but while reading through the gfortran man page I realized > > that > > gcc now supports OpenMP directives. > > > > Given that the machines my students are using are all dual core, I > started > > working on a simple example that I hoped would show a nice speedup from > > the > > "easy" library. > > > > The specific problem I'm working on is a 2-d solution to the laplace > > equation (electrostatics). The bulk of the computation is a recursion > > relation, applied to elements of a 2-d array, according to the following > > snippet. > > > > Of course, by now I should know that "simple" never really is. When I > > compile with gfortran and run with 1 or 2 cores (ie, OMP_NUM_THREADS=2, > > export OMP_NUM_THREADS) there is basically no difference in execution > > time. > > > > > > Any suggestions? I figured that this would be a simple example to > > parallelize. Is there a better example for OpenMP parallelization? Also, > > is there something obvious I'm missing in the example below? 
> > > > Nathan Moore > > > > integer,parameter::Nx=1000 > > integer,parameter::Ny=1000 > > real*8 v(Nx,Ny) > > integer boundary(Nx,Ny) > > > > v_cloud = -1.0e-4 > > v_ground = 0.d0 > > > > convergence_v = dabs(v_ground-v_cloud)/(1.d0*Ny*Ny) > > > > ! initialize the the boundary conditions > > do i=1,Nx > > do j=1,Ny > > v_y = v_ground + (v_cloud-v_ground)*(j*dy/Ly) > > boundary(i,j)=0 > > v(i,j) = v_y > > ! we need to ensure that the edges of the domain are held > > as > > boundary > > if(i.eq.0 .or. i.eq.Nx .or. j.eq.0 .or. j.eq.Ny) then > > boundary(i,j)=1 > > endif > > end do > > end do > > > > 10 converged = 1 > > !$OMP PARALLEL > > !$OMP DO > > do i=1,Nx > > do j=1,Ny > > if(boundary(i,j).eq.0) then > > old_v = v(i,j) > > v(i,j) = > > 0.25*(v(i-1,j)+v(i+1,j)+v(i,j+1)+v(i,j-1)) > > dv = dabs(old_v-v(i,j)) > > if(dv.gt.convergence_v) then > > converged = 0 > > endif > > endif > > end do > > end do > > !$OMP ENDDO > > !$OMP END PARALLEL > > if(converged.eq.0) then > > goto 10 > > endif > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081121/127b013f/attachment.html From ntmoore at gmail.com Fri Nov 21 07:45:36 2008 From: ntmoore at gmail.com (Nathan Moore) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <4926D287.7030109@scalableinformatics.com> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> <49263F3C.3090005@scalableinformatics.com> <4926CD4F.6050307@gmail.com> <4926D287.7030109@scalableinformatics.com> Message-ID: <6009416b0811210745w2ae0b7dbu11c3cfc18101c43c@mail.gmail.com> Hi Joe, I found the article you wrote for Linux Journal right about the time you emailed last night - thanks for the reference and the suggestions! I find that 1 beer and coding is ok, but num_beer .ge. 2 makes me too poetic and insufficiently detail oriented. Debugging the next morning is never fun. On Fri, Nov 21, 2008 at 9:23 AM, Joe Landman < landman@scalableinformatics.com> wrote: > Geoff Jacobs wrote: > > If this were done with MPI, one would have to be careful of the >> boundaries on the matrix as it's partitioned for computation. OpenMP is >> intelligent enough to hold off computation on the tiles south and east >> of the first until the first is done, and so forth? >> > > No... I didn't address the interior vs exterior. I have a nice worked > example where I convert this sort of code into an exterior, a skin, and the > communication for an MPI and OpenMP version. Scales pretty well. You are > right, I should have fixed that as well. > > >> ! now all you need is a "linear scan" to find positive elements in >>> ! dv. You can approach these as sum reductions, and do them in >>> ! parallel >>> do i=1,Nx >>> sum=0.0 >>> do j=1,Ny >>> sum = sum + dabs(dv(i,j) .gt. 0.0) * dv(i,j) >>> end do >>> if (sum .gt. 0.0) converged = 0 >>> end do >>> >>> The basic idea is to replace the inner loop conditionals and remove as >>> many of the shared variables as possible. >>> >> >> Yup, keep things pipelined. 
>> > > That was the idea, though I didn't compile/test the code to be sure it > would work. That and I usually try to avoid real coding when I am tired at > night. Coding and beer don't mix (for me). > > > > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics LLC, > email: landman@scalableinformatics.com > web : http://www.scalableinformatics.com > http://jackrabbit.scalableinformatics.com > phone: +1 734 786 8423 x121 > fax : +1 866 888 3112 > cell : +1 734 612 4615 > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081121/f29eb512/attachment.html From tom.elken at qlogic.com Fri Nov 21 08:03:23 2008 From: tom.elken at qlogic.com (Tom Elken) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] Windows in top 10 In-Reply-To: References: <20081121060623.GA4643@bx9> Message-ID: <6DB5B58A8E5AB846A7B3B3BFF1B4315A02990AA4@AVEXCH1.qlogic.org> > 104k euro a node in short. That's soon 200k dollar. Hmmm. The trend doesn't indicate that. It was up to $1.60 per Euro this summer, but ~ $1.25 per Euro now: http://finance.yahoo.com/currency/convert?amt=1&from=EUR&to=USD&submit=C onvert Your point on the high price per node is still very relevant, though. -Tom > > It doesn't matter what gets delivered for that, if it has > just 16 cores, > then it is a factor 5-10 too expensive. > From dnlombar at ichips.intel.com Fri Nov 21 07:55:50 2008 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> Message-ID: <20081121155550.GB5471@nlxdcldnl2.cl.intel.com> On Thu, Nov 20, 2008 at 07:52:47PM -0800, Nathan Moore wrote: > Hi All, > > I'm getting to the end of a semester of computational physics at > my institution, and thought it would be fin to close the semester > with a discussion of parallel programming. Initially, I was simply > planning to discuss MPI, but while reading through the gfortran man > page I realized that gcc now supports OpenMP directives. Intel offers support on teaching parallel computing here: -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From rgb at phy.duke.edu Fri Nov 21 08:44:38 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <6009416b0811210745w2ae0b7dbu11c3cfc18101c43c@mail.gmail.com> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> <49263F3C.3090005@scalableinformatics.com> <4926CD4F.6050307@gmail.com> <4926D287.7030109@scalableinformatics.com> <6009416b0811210745w2ae0b7dbu11c3cfc18101c43c@mail.gmail.com> Message-ID: On Fri, 21 Nov 2008, Nathan Moore wrote: > Hi Joe, > > I found the article you wrote for Linux Journal right about the time you > emailed last night - thanks for the reference and the suggestions!? I find > that 1 beer and coding is ok, but num_beer .ge. 2 makes me too poetic and > insufficiently detail oriented.? Debugging the next morning is never fun. Nathan, This is simply a matter of practice. Try coding with num_beer .ge. 10 for a few weeks, and then fall back to a lesser range. 
I'd say .ge. 12 but it is so difficult to get vomit out of a keyboard... it might take you a while to build up enough new smooth endoplasmic reticula to be able to cope with the really high ranges. For a special treat, try balancing out the beer fuzz with cocaine or methamphetamines. Much better than mere caffeine. One can type so FAST, you know. :-) rgb > > On Fri, Nov 21, 2008 at 9:23 AM, Joe Landman > wrote: > Geoff Jacobs wrote: > > If this were done with MPI, one would have to be > careful of the > boundaries on the matrix as it's partitioned for > computation. OpenMP is > intelligent enough to hold off computation on the > tiles south and east > of the first until the first is done, and so forth? > > > No... I didn't address the interior vs exterior. ?I have a nice worked > example where I convert this sort of code into an exterior, a skin, > and the communication for an MPI and OpenMP version. ?Scales pretty > well. You are right, I should have fixed that as well. > > > ! now all you need is a "linear scan" to find > positive elements in > ! dv. ?You can approach these as sum > reductions, and do them in > ! parallel > do i=1,Nx > ?sum=0.0 > ?do j=1,Ny > ?sum = sum + dabs(dv(i,j) .gt. 0.0) * dv(i,j) > ?end do > ?if (sum .gt. 0.0) converged = 0 > end do > > The basic idea is to replace the inner loop > conditionals and remove as > many of the shared variables as possible. > > > Yup, keep things pipelined. > > > That was the idea, though I didn't compile/test the code to be sure it > would work. ?That and I usually try to avoid real coding when I am > tired at night. ?Coding and beer don't mix (for me). > > > > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics LLC, > email: landman@scalableinformatics.com > web ?: http://www.scalableinformatics.com > ? ? ? http://jackrabbit.scalableinformatics.com > phone: +1 734 786 8423 x121 > fax ?: +1 866 888 3112 > cell : +1 734 612 4615 > > > > > -- > - - - - - - - ? - - - - - - - ? - - - - - - - > Nathan Moore > Assistant Professor, Physics > Winona State University > AIM: nmoorewsu > - - - - - - - ? - - - - - - - ? - - - - - - - > > Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From rgb at phy.duke.edu Fri Nov 21 08:53:29 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <6009416b0811210742i67c06368yc5488f940eef22ec@mail.gmail.com> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> <39705.128.83.67.198.1227279925.squirrel@webmail.lncc.br> <6009416b0811210742i67c06368yc5488f940eef22ec@mail.gmail.com> Message-ID: On Fri, 21 Nov 2008, Nathan Moore wrote: > What a relief that no one castigated me for including a "goto"!? I'm > teaching out of Kupferschmid's "Classical Fortran", and he makes a rather > compelling case that goto s actually more pedagogically sound (for > beginners) than "do while" My friend, long before I whacked you for using a goto (or horrors, teaching impressionable young minds about the command that is the fundamental basis for some of the worst spaghetti code ever written) I'd whack you about teaching them "fortran" at all, especially "classical" fortran (which I interpret as being something around Fortran IV, which is where I got off the Fortran choo-choo). 
(rgb calmly strips nekkid in his office and puts on his handy-dandy adiabatic suit, gulping the rest of his beer and mildly shocking the undergraduates he was interacting with as he prepares for the incoming missiles...:-) > > Successive steps in the equilibration are time-dependant, so they're > IMPOSSIBLE to parallelize.? I suppose I could get around the > re-initialization by including a "SINGLE" directive around the iteration > control structure. > > Nathan > > On Fri, Nov 21, 2008 at 9:05 AM, wrote: > Hi All > > I thing the problem could be the? convergence "loop" test and > the criation of threads > > 10 converged = 1 > !$OMP PARALLEL > !$OMP DO > ..... > !$OMP ENDDO > !$OMP END PARALLEL > if(converged.eq.0) then > > goto 10 > endif > Each time you "goto 10" > the compiler "create" and "initialize" the threads > and this is time comsuming. > try to change the convergence test to a > reduce operation this will > take time but not some much as !$OMP > PARALLEL > I hope its help > > Renato Silva > > > > > Hi All, > > > > I'm getting to the end of a semester of computational physics at my > > institution, and thought it would be fin to close the semester with > a > > discussion of parallel programming. Initially, I was simply planning > to > > discuss MPI, but while reading through the gfortran man page I > realized > > that > > gcc now supports OpenMP directives. > > > > Given that the machines my students are using are all dual core, I > started > > working on a simple example that I hoped would show a nice speedup > from > > the > > "easy" library. > > > > The specific problem I'm working on is a 2-d solution to the laplace > > equation (electrostatics). The bulk of the computation is a > recursion > > relation, applied to elements of a 2-d array, according to the > following > > snippet. > > > > Of course, by now I should know that "simple" never really is. When > I > > compile with gfortran and run with 1 or 2 cores (ie, > OMP_NUM_THREADS=2, > > export OMP_NUM_THREADS) there is basically no difference in > execution > > time. > > > > > > Any suggestions? I figured that this would be a simple example to > > parallelize. Is there a better example for OpenMP parallelization? > Also, > > is there something obvious I'm missing in the example below? > > > > Nathan Moore > > > > integer,parameter::Nx=1000 > > integer,parameter::Ny=1000 > > real*8 v(Nx,Ny) > > integer boundary(Nx,Ny) > > > > v_cloud = -1.0e-4 > > v_ground = 0.d0 > > > > convergence_v = dabs(v_ground-v_cloud)/(1.d0*Ny*Ny) > > > > ! initialize the the boundary conditions > > do i=1,Nx > > do j=1,Ny > > v_y = v_ground + (v_cloud-v_ground)*(j*dy/Ly) > > boundary(i,j)=0 > > v(i,j) = v_y > > ! we need to ensure that the edges of the domain are held > > as > > boundary > > if(i.eq.0 .or. i.eq.Nx .or. j.eq.0 .or. j.eq.Ny) then > > boundary(i,j)=1 > > endif > > end do > > end do > > > > 10 converged = 1 > > !$OMP PARALLEL > > !$OMP DO > > do i=1,Nx > > do j=1,Ny > > if(boundary(i,j).eq.0) then > > old_v = v(i,j) > > v(i,j) = > > 0.25*(v(i-1,j)+v(i+1,j)+v(i,j+1)+v(i,j-1)) > > dv = dabs(old_v-v(i,j)) > > if(dv.gt.convergence_v) then > > converged = 0 > > endif > > endif > > end do > > end do > > !$OMP ENDDO > > !$OMP END PARALLEL > > if(converged.eq.0) then > > goto 10 > > endif > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > > > > -- > - - - - - - - ? 
- - - - - - - ? - - - - - - - > Nathan Moore > Assistant Professor, Physics > Winona State University > AIM: nmoorewsu > - - - - - - - ? - - - - - - - ? - - - - - - - > > Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From landman at scalableinformatics.com Fri Nov 21 09:45:59 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> <49263F3C.3090005@scalableinformatics.com> <4926CD4F.6050307@gmail.com> <4926D287.7030109@scalableinformatics.com> <6009416b0811210745w2ae0b7dbu11c3cfc18101c43c@mail.gmail.com> Message-ID: <4926F3D7.80601@scalableinformatics.com> Robert G. Brown wrote: > This is simply a matter of practice. Try coding with num_beer .ge. 10 > for a few weeks, and then fall back to a lesser range. I'd say .ge. 12 > but it is so difficult to get vomit out of a keyboard... it might take > you a while to build up enough new smooth endoplasmic reticula to be > able to cope with the really high ranges. > > For a special treat, try balancing out the beer fuzz with cocaine or > methamphetamines. Much better than mere caffeine. One can type so > FAST, you know. > > > :-) wait ... let me go back and get a coffee so it can come out of several facial orifices at once .... -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Fri Nov 21 10:14:22 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <9f8092cc0811210724h782b80a0ma92a74a3338acedd@mail.gmail.com> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> <330326.56339.qm@web37907.mail.mud.yahoo.com> <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> <1227264275.20502.5.camel@merlino.mi.infn.it> <619899985.20081121152358@gmx.net> <1227280132.20502.31.camel@merlino.mi.infn.it> <9f8092cc0811210724h782b80a0ma92a74a3338acedd@mail.gmail.com> Message-ID: <4926FA7E.7080402@scalableinformatics.com> John Hearns wrote: > > > 2008/11/21 Franz Marini > > > H > Regarding the proprietary-ness of CUDA, I would argue that being > proprietary also means that it probably better targets the NV GPU > architecture, and a more general, portable solution, like OpenCL (which > seems to be closer than expected, by the way) will possibly mean a > somewhat less optimal use of the GPU. M > > > Guys, I'm going to be controversial here. > The market may SAY otherwise, but the market does not give a rat's > behind about proprietariness. Absolutely true. The market cares about price and price performance. > Tell a scientist that her N-body dynamics astrophysics model will run > 500 times faster on a certain GPU and > she'll get more papers published and an invite to a conference in Hawaii > next year and you'll see those > grant dollars being spent. Yup. 
> Tell and engineer that his Nastran model or his CFD simulation will > finish whilst he goes off to lunch/coffee and he'll > bite your hand off. :) The idea is for users, that increasing throughput is most important. Minimizing the wallclock time at the most reasonable price, or getting the least price and a reasonable wall clock time. > It all comes down to codes - when the ISV codes use these things, you'll > see the uptake. Yup. Thats it. Not controversial, but quite true. Without the ISVs, there will be no clear winner. With the ISVs, you will see one emerge. My take is that one is emerging. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From lindahl at pbm.com Fri Nov 21 12:16:18 2008 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <6009416b0811210738h27251837i80aebb5ee30546d3@mail.gmail.com> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> <4926AB5B.8040609@cse.ucdavis.edu> <6009416b0811210738h27251837i80aebb5ee30546d3@mail.gmail.com> Message-ID: <20081121201618.GA8965@bx9> On Fri, Nov 21, 2008 at 09:38:29AM -0600, Nathan Moore wrote: > I had a similar problem with gfortran, and it only appeared with large array > sizes (bigger than 4000x4000 as I recall). "ulimit" was no help, I assume > there's a memory constraint built in somewhere. With OpenMP the compiler has to set up multiple stacks, and some are more clever than others. If you were using the PathScale compiler, for example, and overran one of the thread stacks, it would print out an error message saying what happened and how to raise that limit. gfortran's omp probably has documentation which discusses how to raise the stack limit. It's not a simple ulimit... -- greg From lindahl at pbm.com Fri Nov 21 12:18:18 2008 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] Windows in top 10 In-Reply-To: <20081121060623.GA4643@bx9> References: <20081121060623.GA4643@bx9> Message-ID: <20081121201818.GB8965@bx9> On Thu, Nov 20, 2008 at 10:06:23PM -0800, Greg Lindahl wrote: > This press release says it runs Linux and Windows. That's a bit more > credible; I wonder if they actually had to buy the Windows software? BTW, I wasn't claiming that Windows was expensive compared to the rest of the machine. I was hinting that it was another publicity stunt, like my previous example. -- greg From prentice at ias.edu Fri Nov 21 16:47:23 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Wed Nov 25 01:07:59 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <1227280132.20502.31.camel@merlino.mi.infn.it> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> <330326.56339.qm@web37907.mail.mud.yahoo.com> <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> <1227264275.20502.5.camel@merlino.mi.infn.it> <619899985.20081121152358@gmx.net> <1227280132.20502.31.camel@merlino.mi.infn.it> Message-ID: <4927569B.2070502@ias.edu> Franz Marini wrote: >>> The distinction between Tesla and GeForce cards is that the former have >>> no display output, they usually have more ram, and (but I'm not sure >>> about this one) they are clocked a little lower. 
>> both NV and ATI have always tried to segment "professional graphics" >> into a higher-margin market. this involves tying the pro drivers to >> features found only in the pro cards. > > True, although, as far as I remember, the only real distinction between > Quadro and GeForce cards are hardware support for antialiased lines which > is present in the former (I could be wrong though, and there may be some > more substantial differences)... The Quadro cards above a certain level (280 NVS or 580 NVS, I think - I never could keep track of model 3's) could do hardware stereo 3D graphics, which the Computational Chem/Molecular Modeling folks love, a d happily pay extra for it. The other nvidia model lines do not have this capability. -- Prentice From gdjacobs at gmail.com Sat Nov 22 07:11:05 2008 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters In-Reply-To: <9f8092cc0811210724h782b80a0ma92a74a3338acedd@mail.gmail.com> References: <3EE436BDAFE0044A8D6997B74EC3FAB1034E6FB4@sacex2.ad.water.ca.gov> <9f8092cc0811191311w652bcd24t926d3a0592117ae3@mail.gmail.com> <330326.56339.qm@web37907.mail.mud.yahoo.com> <9f8092cc0811192323u25067283o321b6d8cef97966c@mail.gmail.com> <1227264275.20502.5.camel@merlino.mi.infn.it> <619899985.20081121152358@gmx.net> <1227280132.20502.31.camel@merlino.mi.infn.it> <9f8092cc0811210724h782b80a0ma92a74a3338acedd@mail.gmail.com> Message-ID: <49282109.9000004@gmail.com> John Hearns wrote: > > > 2008/11/21 Franz Marini > > > H > Regarding the proprietary-ness of CUDA, I would argue that being > proprietary also means that it probably better targets the NV GPU > architecture, and a more general, portable solution, like OpenCL (which > seems to be closer than expected, by the way) will possibly mean a > somewhat less optimal use of the GPU. M > > > Guys, I'm going to be controversial here. > The market may SAY otherwise, but the market does not give a rat's > behind about proprietariness. > Tell a scientist that her N-body dynamics astrophysics model will run > 500 times faster on a certain GPU and > she'll get more papers published and an invite to a conference in Hawaii > next year and you'll see those > grant dollars being spent. > Tell and engineer that his Nastran model or his CFD simulation will > finish whilst he goes off to lunch/coffee and he'll > bite your hand off. > > It all comes down to codes - when the ISV codes use these things, you'll > see the uptake. We will see general solutions develop coming from multilateral groups (OpenCL) and Microsoft (DirectCL?) With a third player coming into the market in the form of Intel, no ISV will be interested in locking themselves to any particular API when they have viable multi platform options. Choosing proprietary solutions will automatically deny any ISV a significant portion of the consumer or professional computing market. Why do you think Adobe accelerated Photoshop using shader math? -- Geoffrey D. 
Jacobs From bill at cse.ucdavis.edu Sat Nov 22 12:45:09 2008 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <20081121201618.GA8965@bx9> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> <4926AB5B.8040609@cse.ucdavis.edu> <6009416b0811210738h27251837i80aebb5ee30546d3@mail.gmail.com> <20081121201618.GA8965@bx9> Message-ID: <49286F55.9050401@cse.ucdavis.edu> Greg Lindahl wrote: > On Fri, Nov 21, 2008 at 09:38:29AM -0600, Nathan Moore wrote: > >> I had a similar problem with gfortran, and it only appeared with large array >> sizes (bigger than 4000x4000 as I recall). "ulimit" was no help, I assume >> there's a memory constraint built in somewhere. > > With OpenMP the compiler has to set up multiple stacks, and some are > more clever than others. If you were using the PathScale compiler, for > example, and overran one of the thread stacks, it would print out an > error message saying what happened and how to raise that limit. I see warnings from Pathscale-3.2 when I run 2 8k x 8k arrays of doubles using 1 or 2 threads (but not 4 or 8 threads): ** OpenMP warning: requested pthread stack too large, using 4294967296 bytes instead But it still seems to work. On my machine (with 8GB ram) I can run 2 8k x 8k arrays of doubles without problem. For runs with 8kx8k arrays, convergence_v = 1, 24 iterations and gcc-4.3.2 -O3 -fopenmp: real 3m39.818s real 2m21.298s real 2m19.850s real 1m39.412s Pathscale-3.2 -O3 -mp: real 3m10.803s real 2m24.492s real 1m43.183s real 1m20.669s I believe that arrays in C/Fortran have to be contiguous by default and that depending on kernel (32 vs 64bit), PAE, and BIOS settings sometimes all physical memory isn't contiguous. From spambox at emboss.co.nz Thu Nov 20 11:01:31 2008 From: spambox at emboss.co.nz (Michael Brown) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] What class of PDEs/numerical schemes suitable for GPUclusters In-Reply-To: <504859.61845.qm@web80707.mail.mud.yahoo.com> References: <504859.61845.qm@web80707.mail.mud.yahoo.com> Message-ID: <49374AF0C0544FE8AFA318ADF4FAFCBB@Forethought> Jeff Layton wrote: >> offhand, I'd guess that adaptive grids will be substantially harder >> to run efficiently on a GPU than a uniform grid. > > One key thing is that unstructured grid codes don't work as well. > The problem is the indirect addressing. Bingo. GPUs are still GPUs, and are still heavily optimized for coherent data access patterns. If cell (x, y) depends on data at (x, y), then cell (x + 1, y) better depend on data at cell (x + 1, y) or performance will suffer terribly. In C-speak: x += C[i][j]; is good, and x += C[Idx[i][j]]; is bad. Similarly bad is non-coherent branching, due to the thread grouping. The ideal workload is one that had minimal or no branching, and can be mapped into a computational model where you have a 1-, 2-, or 3-dimensional arrangement of cells, where the computation (including the relative position for any data lookups) for each cell does not change. IME, as soon as you depart significantly from this workload, you often start to see order of magnitude drops in performance. Additionally, the round-trip CPU->GPU->CPU latency is horrific (in the order of 1 ms on my 8800GTX on Vista, though I'm not sure about the newer cards or other OSes) so unless you can get a good pipeline going, bouncing computation between the CPU and GPU can wreck the overall performance. 
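To make the indirect-addressing point above concrete, here is a tiny host-side Fortran illustration (not GPU code; u, flux, eflux and nbr are made-up names). The structured loop reads its neighbours at fixed offsets, so consecutive iterations touch consecutive memory; the unstructured loop has to gather through an index array, which is exactly the data-dependent access pattern that tends to fall off the fast path on a GPU:

      integer, parameter :: n = 1000, nedges = 3000
      real*8  u(n), flux(n), eflux(nedges)
      integer nbr(2, nedges)
      integer i, e

! Structured grid: neighbours at fixed offsets -> contiguous, coalescable reads.
      do i = 2, n-1
         flux(i) = 0.5d0 * (u(i+1) - u(i-1))
      end do

! Unstructured grid: neighbours found through an index array -> each read is
! a data-dependent gather, with no guarantee of locality.
      do e = 1, nedges
         eflux(e) = 0.5d0 * (u(nbr(1,e)) - u(nbr(2,e)))
      end do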
This also makes it very hard to scale out to more than one card. I've spent a fair amount of time tweaking a bit of software that at its core is a RKF45 adaptive integrator on a number of independent entities, with some other GPU-unfriendly code (very branchy and with PRNGs). The optimal method that I've found for this code is to do the integration substeps on the GPU, but all other processing on the CPU. The GPU doesn't worry if the requested substep has excessive error, it just passes back the better step-size to the CPU and doesn't update the data. The CPU then notices that the returned "next" stepsize is smaller than the stepsize it sent, and handles the situation correctly. Subdividing steps on the GPU (or simply looping around with the smaller step sizes until the error is sufficiently small) is a performance loss. Additionally, since the entities are essentially independent, I can have multiple sets in progress at once. The peak seems to be to break it into 4 sets, presumably corresponding to one being sent to the GPU, one being processed on the GPU, one coming back from the GPU, and one being processed on the CPU. The performance gain going from 1 set to 4 is about a factor of 2.5. Cheers, Michael From rssr at lncc.br Fri Nov 21 07:05:25 2008 From: rssr at lncc.br (rssr@lncc.br) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] OpenMP on AMD dual core processors In-Reply-To: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> References: <6009416b0811201952y141fee6fibce557b4988c752d@mail.gmail.com> Message-ID: <39705.128.83.67.198.1227279925.squirrel@webmail.lncc.br> Hi All I thing the problem could be the? convergence "loop" test and the criation of threads 10 converged = 1 !$OMP PARALLEL !$OMP DO ..... !$OMP ENDDO !$OMP END PARALLEL if(converged.eq.0) then goto 10 endif Each time you "goto 10" the compiler "create" and "initialize" the threads and this is time comsuming. try to change the convergence test to a reduce operation this will take time but not some much as !$OMP PARALLEL I hope its help Renato Silva > Hi All, > > I'm getting to the end of a semester of computational physics at my > institution, and thought it would be fin to close the semester with a > discussion of parallel programming. Initially, I was simply planning to > discuss MPI, but while reading through the gfortran man page I realized > that > gcc now supports OpenMP directives. > > Given that the machines my students are using are all dual core, I started > working on a simple example that I hoped would show a nice speedup from > the > "easy" library. > > The specific problem I'm working on is a 2-d solution to the laplace > equation (electrostatics). The bulk of the computation is a recursion > relation, applied to elements of a 2-d array, according to the following > snippet. > > Of course, by now I should know that "simple" never really is. When I > compile with gfortran and run with 1 or 2 cores (ie, OMP_NUM_THREADS=2, > export OMP_NUM_THREADS) there is basically no difference in execution > time. > > > Any suggestions? I figured that this would be a simple example to > parallelize. Is there a better example for OpenMP parallelization? Also, > is there something obvious I'm missing in the example below? > > Nathan Moore > > integer,parameter::Nx=1000 > integer,parameter::Ny=1000 > real*8 v(Nx,Ny) > integer boundary(Nx,Ny) > > v_cloud = -1.0e-4 > v_ground = 0.d0 > > convergence_v = dabs(v_ground-v_cloud)/(1.d0*Ny*Ny) > > ! 
initialize the the boundary conditions > do i=1,Nx > do j=1,Ny > v_y = v_ground + (v_cloud-v_ground)*(j*dy/Ly) > boundary(i,j)=0 > v(i,j) = v_y > ! we need to ensure that the edges of the domain are held > as > boundary > if(i.eq.0 .or. i.eq.Nx .or. j.eq.0 .or. j.eq.Ny) then > boundary(i,j)=1 > endif > end do > end do > > 10 converged = 1 > !$OMP PARALLEL > !$OMP DO > do i=1,Nx > do j=1,Ny > if(boundary(i,j).eq.0) then > old_v = v(i,j) > v(i,j) = > 0.25*(v(i-1,j)+v(i+1,j)+v(i,j+1)+v(i,j-1)) > dv = dabs(old_v-v(i,j)) > if(dv.gt.convergence_v) then > converged = 0 > endif > endif > end do > end do > !$OMP ENDDO > !$OMP END PARALLEL > if(converged.eq.0) then > goto 10 > endif > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081121/2952b1e9/attachment.html From naveed at caltech.edu Fri Nov 21 16:57:54 2008 From: naveed at caltech.edu (Naveed Near-Ansari) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] 10th Annual Beowulf Bash: Austin TX Nov 17 2008 9pm In-Reply-To: References: Message-ID: Thanks for putting it on. A good time was had by all. Naveed On Nov 14, 2008, at 4:07 PM, Donald Becker wrote: > > > Tenth Annual Beowulf Bash > And > LECCIBG > > November 17 2008 9pm at Pete's Dueling Piano Bar > > We have finalized the plans for this year's combined Beowulf Bash > and LECCIBG > > http://www.xandmarketing.com/beobash/ > > It will take place, as usual, with the IEEE SC Conference. > This year SC08 is in Austin during the week of Nov 17 2008 > > As in previous years, the attraction is the conversations with > other attendees. We will have drinks and light snacks, with a short > greeting by the sponsors about 10:15pm. > > The venue is in the lively area of Austin near 6th street, very > close to > many of the conference hotels and within walking distance of the rest. > > November 17 2008 9-11:30pm > Monday, Immediately after the SC08 Opening Gala > Pete's Dueling Piano Bar > http://www.petesduelingpianobar.com > > If your company (or even you as an individual) would like to help > sponsor the event, please contact me, becker@beowulf.org before early > November. (We can accommodate last-minute sponsorship, but your name > won't be on the printed info.) > > Our "headlining" sponsor list for 2008 is AMD > AMD (Lead sponsor) http://amd.com > > Other sponsors are > Penguin/Scyld (organizing sponsor) http://penguincomputing.com > XAND Marketing (organizing sponsor) http://xandmarketing.com > NVIDIA http://nvidia.com > Terascala http://www.terascala.com/ > Panasas http://www.panasas.com/ > Clustermonkey http://www.clustermonkey.net/ > > > > -- > Donald Becker becker@scyld.com > Penguin Computing / Scyld Software > www.penguincomputing.com www.scyld.com > Annapolis MD and San Francisco CA > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From hahn at mcmaster.ca Sun Nov 23 15:00:03 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] gpu numbers Message-ID: one thing I was surprised at is the substantial penalty that the current gtx280-based gpus pay for double-precision. 
I think I understand the SP throughput - since these are genetically graphics processors, their main flop-relevant op is blend: pixA * alpha + pixB * beta that's 3 sp flops, and indeed the quoted 933 glops = 240 cores @ 1.3 GHz * 2mul1add/cycle. I'm a little surprised that they quote only 78 DP gflops - 1/12 the SP rate. I counted ops when doing base-10 multiplication on paper, and it seemed to require only 4x each SP mul. I guess the problem might simply be that each core isn't OOO like CPUs, or that emulating DP does't optimally utilize the available 2mul+add. note also: 78 DP Gflops/~200W. 3.2 GHz QC CPU: 51 DP Gflops/~200W. figuring power is a bit tricky, but price is even worse. for power, NV claims <200W (not less than 150 in any of the GTX280 reviews, though). but you have to add in a host, which will probably be around 300W; assuming you go for the C1070, the final is 4*78/(800+300). a comparison CPU-based machine would be something like 2*51/350W. amusingly, almost the same DP flops per watt ;) does anyone know whether the reputed hordes of commercial Cuda apps mostly stick to SP? From mfatica at gmail.com Sun Nov 23 17:33:56 2008 From: mfatica at gmail.com (Massimiliano Fatica) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] gpu numbers In-Reply-To: References: Message-ID: <8e6393ac0811231733q7f354574t501da7dcc7525e82@mail.gmail.com> On the GT200, there are 30 multiprocessors, each with 8 single precision (SP) units, 1 double precision (DP) unit and 1 special function (SFU) unit. Each SP and DP unit can perform a multiply and add, the SFU unit if not busy computing a transcendental function can perform a single precision multiply (this is where the 3 comes from in the peak performance number for single). So, the peak performance numbers are: SP: 240*3*Clock DP: 30*2*Clock The C1060 has a clock of 1.296Ghz (SP peak =933 Gflops, DP peak=77 Gflops ), the S1070 has a clock of 1.44Ghz (SP peak =1036 Gflops, DP peak=86 Gflops ). These are peak numbers, in reality the difference between single and double is between 4x and 6x (most of the double precision codes are running close to 80-90% of peak, you can really feed data to the unit). The power numbers are including not only the GPU but also the memory (and we are talking about 4GB of GDDR3 memory) that can account for several tens of Watts. There was a CUDA tutorial at SC08, these are some numbers presented from Hess Corporation on the performance of a GPU cluster for seismic imaging: a 128-GPU cluster (32 S1070) out-perform a 3000 CPU cluster, with speed ups varying from 5x to 60x depending on the algorithms. If we are talking double precision, this is a preliminary Linpack result for a small problem (only 4GB) on a standard Sun Ultra 24 (1 Core2 Extreme CPU Q6850 @ 3GHz, standard 530W power supply) with a 1 Tesla C1060 : ================================================================================ T/V N NB P Q Time Gflops -------------------------------------------------------------------------------- WR00L2L2 23040 960 1 1 97.91 8.328e+01 -------------------------------------------------------------------------------- ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0048141 ...... PASSED ================================================================================ The workstation alone is performing around 38 Gflops. So even in double precision, you can use a cheap single socket machine and still get results comparable to more expensive server configuration with Xeon and multi-socket motherboard. 
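Plugging the quoted C1060 clock into the peak formulas above simply restates the numbers already given, but it also makes Mark's 1/12 observation explicit:

  $P_{SP} = 240 \times 3 \times 1.296~\text{GHz} \approx 933~\text{Gflops}$
  $P_{DP} = 30 \times 2 \times 1.296~\text{GHz} \approx 77.8~\text{Gflops}$
  $P_{DP}/P_{SP} = (30 \times 2)/(240 \times 3) = 60/720 = 1/12$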
If you are talking clusters, you can reduce the number of nodes and get a significant saving on network if you are using IB or 10GigE Massimiliano On Sun, Nov 23, 2008 at 3:00 PM, Mark Hahn wrote: > one thing I was surprised at is the substantial penalty that the current > gtx280-based gpus pay for double-precision. > I think I understand the SP throughput - since these are genetically > graphics processors, their main flop-relevant op is blend: > pixA * alpha + pixB * beta > that's 3 sp flops, and indeed the quoted 933 glops = 240 cores @ 1.3 GHz * > 2mul1add/cycle. I'm a little surprised > that they quote only 78 DP gflops - 1/12 the SP rate. > I counted ops when doing base-10 multiplication on paper, > and it seemed to require only 4x each SP mul. I guess the problem might > simply be that each core isn't OOO like CPUs, > or that emulating DP does't optimally utilize the available 2mul+add. > > note also: 78 DP Gflops/~200W. 3.2 GHz QC CPU: 51 DP Gflops/~200W. > figuring power is a bit tricky, but price is even worse. for power, > NV claims <200W (not less than 150 in any of the GTX280 reviews, though). > but you have to add in a host, which will probably be around 300W; > assuming you go for the C1070, the final is 4*78/(800+300). > a comparison CPU-based machine would be something like 2*51/350W. > amusingly, almost the same DP flops per watt ;) > > does anyone know whether the reputed hordes of commercial Cuda apps > mostly stick to SP? > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From coutinho at dcc.ufmg.br Sun Nov 23 17:37:21 2008 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] gpu numbers In-Reply-To: References: Message-ID: 2008/11/23 Mark Hahn > one thing I was surprised at is the substantial penalty that the current > gtx280-based gpus pay for double-precision. > I think I understand the SP throughput - since these are genetically > graphics processors, their main flop-relevant op is blend: > pixA * alpha + pixB * beta This is most used in texture fetching. Unfortunately, for nvidia cards texture fetching units can't do general purpose processing. They can only do texture fetch operations like old gpus. > > that's 3 sp flops, and indeed the quoted 933 glops = 240 cores @ 1.3 GHz * > 2mul1add/cycle. The most desnse instruction that the general purpose units (stream processors) can do multiply-add, so it's: 240 cores @ 1.3 GHz * 1mul1add/cycle = 624 gflops. > I'm a little surprised > that they quote only 78 DP gflops - 1/12 the SP rate. > I counted ops when doing base-10 multiplication on paper, > and it seemed to require only 4x each SP mul. I guess the problem might > simply be that each core isn't OOO like CPUs, > or that emulating DP does't optimally utilize the available 2mul+add. As Gtx280-based gpus main purpose is games, the architecture is heavily focused on SP operations, like the cell processor. In cell, the DP throughput is nearly 1/10 of it's SP throughput. > > note also: 78 DP Gflops/~200W. 3.2 GHz QC CPU: 51 DP Gflops/~200W. > figuring power is a bit tricky, but price is even worse. for power, > NV claims <200W (not less than 150 in any of the GTX280 reviews, though). > but you have to add in a host, which will probably be around 300W; > assuming you go for the C1070, the final is 4*78/(800+300). 
> a comparison CPU-based machine would be something like 2*51/350W. > amusingly, almost the same DP flops per watt ;) But remember that it memory interface can do 100GB/s, four times the best Nehalens commercially available, and its cores can have 1024 threads (32 warps) so it has better conditions to sustain high throughput (if your application use coherent data acces, so we come back to what Michael said). > > > does anyone know whether the reputed hordes of commercial Cuda apps > mostly stick to SP? > As Cuda started to support DP only since gtx 280 was launched, I think the answer is yes. :) -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081123/6ab15199/attachment.html From spambox at emboss.co.nz Mon Nov 24 17:19:32 2008 From: spambox at emboss.co.nz (Michael Brown) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] QsNet-1 parts, last call Message-ID: <92D3B1FC3E7C44BEAB5FC688BABB614D@Forethought> Hello all, As you may remember, I've ended up with a 128-way QsNet1 setup. Unfortunately, the house I've been renting has been sold, and the new place doesn't have the space to store it all. As a result, I'm going to be scrap-metalling the setup on about the 10th of December, give or take a few days. So, anyone who's interested in parts will have to get back to me before then. Essentially what you will get is the QM400 (64-bit 66 MHz PCI) cards, the cables (about 15m long IIRC), and a QM-401X (16-way) switch card. The cards cannot be used as-is, since they expect +/- 24V from the chassis. However, by desoldering the DC-DC converter on the card, and connecting up to an external +/-3.3V supply, the card appears to operate correctly. I've tested it using two Meanwell RS-100-3.3 supplies (one for +3.3, one for -3.3). If you want, I can remove the regulator and solder in the other wires. You'll have to actually plug the wires into whatever supply you're using - the Meanwell RS-100 supplies (for example) have exposed mains voltages on the terminals, so for legal reasons I can't connect it all up for you. I can say that it all fits quite nicely into the Jaycar 2U rack enclosure, with a couple of 12V fans running off a unregulated 12V supply (transformer + diodes). To put a number on it, I'm saying AU$320 + shipping, for a 16 cards + cables + switch card, though no reasonable offers will be refused since I'd like to see it used instead of scrap-metalled. The cost is basically to cover the materials + time to package it all up, plus a bit to cover what I would get for the copper in the cables. Note that the cables are big and heavy (for some reason 1.2 kg/cable rings a bell, but I'll have to check), so if you live overseas (I'm in Canberra, Australia) shipping could be a bit. I've also got the chassis (2 PSUs, fans, 16 QM-402's, and a clock card) if anyone is interested, but it's REALLY big and heavy so you'll have to come and pick that up yourself if you're interested. For reference, QsNet1 has about 350 MB/sec bandwidth and MPI latency of somewhere around 4.5 - 5.0 us, depending on the platform (~2 us for the lower-level interface). So it's a big step up from ethernet, for example :) You need to use a patched Linux kernel (x86, x86-64, and IA64 supported, versions 2.6.18 and earlier IIRC) but it's not all that difficult to get set up. There's binaries for RHEL, and I got it to build with a bit of coercion on Debian Etch (4.0) IA-64. 
Sorry about the spam, Michael Brown From mathog at caltech.edu Tue Nov 25 10:19:26 2008 From: mathog at caltech.edu (David Mathog) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] tools for cluster event logging? Message-ID: What would be a good tool for logging cluster specific messages, and nothing else, on a single server? The purpose of this is to let computer nodes send messages like "node XXX hardware failure, shutting down", or "node xxx, boot sequence completed" messages to a central repository. But I do not want any other messages logged to the repository from the clients. I suppose syslog could be used for this, but the trick would be to choose a facility/priority for it such that nothing other than the desired cluster messages was ever sent. In other words, something like: logger -p cluster.info "this is a cluster message" Unfortunately there is no "cluster" facility, and I do not know which one of the 20 or so defined facilities (auth, authpriv... local7) will never be used by some other part of the client OS. The main reason I'm looking for this now, after so many years of doing without it, is that changes in umount and umount.nfs and the NFS umount section of the distro I use have resulted in the loss of the "unmount request" messages which used to be logged on the NFS server when a client shut down normally. (In brief, "umount -l /mountpoint" used to send these, but it no longer does.) In the past I used those messages, and the corresponding "mount request" messages to determine what the clients were doing, or if they had crashed or shut down normally. Since that isn't possible now, I want to modify the init and hardware monitor scripts to send specific messages. I am running ganglia, but that doesn't have this particular capability, at least as far as I can tell. Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From hearnsj at googlemail.com Tue Nov 25 10:36:46 2008 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] tools for cluster event logging? In-Reply-To: References: Message-ID: <9f8092cc0811251036g28a7e5bbhaf7852ba779a0cf9@mail.gmail.com> 2008/11/25 David Mathog > > I suppose syslog could be used for this, but the trick would be to > choose a facility/priority for it such that nothing other than the > desired cluster messages was ever sent. In other words, something > like: > > logger -p cluster.info "this is a cluster message" > > Unfortunately there is no "cluster" facility, and I do not know > which one of the 20 or so defined facilities (auth, authpriv... local7) > will never be used by some other part of the client OS. > > Sounds good to me. Maybe you could just log all those other types of messages to the central syslog server, and use logwatch, or another log parser,. to filter out the 'noise' and just email you the interesting ones? Not 100% what you want. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081125/4bd28304/attachment.html From hearnsj at googlemail.com Tue Nov 25 10:42:28 2008 From: hearnsj at googlemail.com (John Hearns) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] tools for cluster event logging? In-Reply-To: References: Message-ID: <9f8092cc0811251042s2b5d192ak95c716f72929e0c5@mail.gmail.com> 2008/11/25 David Mathog > What would be a good tool for logging cluster specific messages, and > nothing else, on a single server? 
How about an SNMP trap, and have a specific cluster MIB? I know this idea sounds daft, but SNMP is well known and all the packages are readily available. Just blue skying really. At one time, six years ago, I thought of using Beep for messages like this: http://beepcore.org/ Also worth a thought.
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081125/ed1e1a1f/attachment.html
From mathog at caltech.edu Tue Nov 25 10:52:34 2008 From: mathog at caltech.edu (David Mathog) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] tools for cluster event logging? Message-ID: Figured out that there is no "news" on any of my cluster machines, so I usurped that facility for these messages. (It is sort of "news", right?). On the clients added to /etc/syslog.conf news.* @safserver.cluster news.* -/var/log/messages On the server added "-r" to the syslogd boot, and changed /etc/syslog.conf from: *.info;mail,news,authpriv.none -/var/log/messages to *.info;mail,news.*,authpriv.none -/var/log/messages Now the clients can send messages to the server with logger -p news.warn "Shutting down on hardware error" and it shows up in that node's messages as: Nov 25 10:50:00 monkey01 root: Shutting down on hardware error and in the cluster messages file as: Nov 25 10:50:00 monkey01.cluster root: Shutting down on hardware error Sending .emerg messages puts a message on all server terminal windows though, which is a little annoying. I just won't send those I guess. Anyway, this should do the trick. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
From lynesh at cardiff.ac.uk Tue Nov 25 10:56:07 2008 From: lynesh at cardiff.ac.uk (Huw Lynes) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] tools for cluster event logging? In-Reply-To: References: Message-ID: <1227639367.5977.4.camel@desktop> On Tue, 2008-11-25 at 10:19 -0800, David Mathog wrote: > What would be a good tool for logging cluster specific messages, and > nothing else, on a single server? The purpose of this is to let > computer nodes send messages like "node XXX hardware failure, shutting > down", or "node xxx, boot sequence completed" messages to a central > repository. But I do not want any other messages logged to the > repository from the clients. > > I suppose syslog could be used for this, but the trick would be to > choose a facility/priority for it such that nothing other than the > desired cluster messages was ever sent.
In other words, something > like: > > logger -p cluster.info "this is a cluster message" > > Unfortunately there is no "cluster" facility, and I do not know > which one of the 20 or so defined facilities (auth, authpriv... local7) > will never be used by some other part of the client OS. > If you use syslog-ng as your central syslog server you can filter messages based on strings. So if you preface all your cluster messages with CLUSTER_MESSAGE or somesuch you can filter them to a custom destination which can be a log file or a command, or both. for examples see: http://www.campin.net/newlogcheck.html -- Huw Lynes | Advanced Research Computing HEC Sysadmin | Cardiff University | Redwood Building, Tel: +44 (0) 29208 70626 | King Edward VII Avenue, CF10 3NB From prentice at ias.edu Tue Nov 25 11:12:21 2008 From: prentice at ias.edu (Prentice Bisbal) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] tools for cluster event logging? In-Reply-To: References: Message-ID: <492C4E15.90301@ias.edu> David Mathog wrote: > Figured out that there is no "news" on any of my cluster machines, so I > usurped that facility for these messages. (It is sort of "news", > right?). On the clients added to /etc/syslog.conf > > news.* @safserver.cluster > news.* -/var/log/messages > Why not use one of the "local" facilities? You should be able to use local0 - local7. Are they already in use at your site? -- Prentice From mathog at caltech.edu Tue Nov 25 14:20:34 2008 From: mathog at caltech.edu (David Mathog) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] tools for cluster event logging? Message-ID: Here is a little init.d script "cluster_notify" I put together that uses syslog to log init state changes that take place on the computer nodes on the master node. For instance here is what comes out when one node is rebooted: Nov 25 13:23:21 monkey01.cluster logger: init change: 3 6 Nov 25 13:24:43 monkey01.cluster logger: init change: N 3 The script is for a Mandriva system, so it would probably work unchanged on RedHat or Fedora. For Debian changes are probably needed. cat cluster_notify #!/bin/sh # chkconfig: 2345 11 89 # description: This startup script tells the cluster about init changes ### BEGIN INIT INFO # Provides: cluster_notify # Required-Start: $network $syslog # Required-Stop: $network $syslog # Default-Start: 2345 # Short-Description: Inform cluster log of init state on client # Description: Inform cluster log of init state on client ### END INIT INFO # Local variables LFILE=/var/lock/subsys/cluster_notify # Source function library. . /etc/init.d/functions # Source networking configuration. . /etc/sysconfig/network # Check that networking is up. [ "$NETWORKING" = "no" ] && exit 0 doit(){ gprintf "Cluster_Notify, new init level" transition=`/sbin/runlevel` /bin/logger -p news.info "init change: $transition" echo } RETVAL=0 case "$1" in start) touch $LFILE doit ;; stop) rm -f $LFILE doit ;; *) gprintf "Usage: %s {start,stop}\n" "$0" RETVAL=1 ;; esac exit $RETVAL Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mathog at caltech.edu Tue Nov 25 14:40:38 2008 From: mathog at caltech.edu (David Mathog) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] after update sgeexecd not starting correctly on reboot Message-ID: This is an odd one, and I hope one of you has seen it and fixed it, because the only way I have been able to trigger the bug is through a reboot. I updated one node from Mandriva 2007.1 to 2008.1. 
Those are both 2.6.x kernels, and are as you might guess about a year apart. Both use the exact same SGE distribution, which is NFS mounted on /usr/SGE6. On a reboot of the newer system, /etc/rc.d/init.d/sgeexecd, which is the last thing to start in runlevel 3 (except for S99local, which doesn't do anything except "touch /var/lock/subsys/local") fails. First it spews a bunch of lines which look like a script did "set", and as a side effect, this pushes all the other text lines off the console, and then it emits can't determine path to Grid Engine binaries without starting sge_execd. On the older system the exact same scipt starts up with none of this drama, leaving sge_execd running. However, once I logon as root at the console on the newer system, it happily starts up with: /etc/rc.d/init.d/sgeexecd start There are no SGE variables defined in .bashrc etc. The init script has these prerequisites, as on the older system: # Provides: sgeexecd # Required-Start: $network $remote_fs Ring any bells? I think maybe the NFS mounting is different, so that the remote_fs prerequisite isn't really satisfied, even though the associated script has run. The sgeexecd script does include a test: while [ ! -d "$SGE_ROOT" -a $count -le 120 ]; do count=`expr $count + 1` sleep 1 done but since SGE_ROOT is the mount point, the test will be true whether or not the NFS mount has completed. Maybe I'll change that to $SGE_ROOT/bin and see if it helps. Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mathog at caltech.edu Tue Nov 25 16:08:15 2008 From: mathog at caltech.edu (David Mathog) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] Re: after update sgeexecd not starting correctly on reboot Message-ID: > I think maybe the NFS mounting is different, so that the remote_fs > prerequisite isn't really satisfied, even though the associated script > has run. The sgeexecd script does include a test: > > while [ ! -d "$SGE_ROOT" -a $count -le 120 ]; do > count=`expr $count + 1` > sleep 1 > done This seems to have been it. Changing "$SGE_ROOT" to "$SGE_ROOT/bin" let SGE came up ok in a couple of consecutive reboots. Not definitive proof that was the issue, but at least it seems like progress. Apparently it was getting to this part of the SGE init script before $SGE_ROOT was actually mounted, the -d test always passed, NFS mounted or not, and of course the SGE start up failed since none of that code from the remote system was reachable. Just for kicks I added an echo line within the loop, so that if it sticks there it will show up on the console. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From reuti at staff.uni-marburg.de Wed Nov 26 04:15:38 2008 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] Re: after update sgeexecd not starting correctly on reboot In-Reply-To: References: Message-ID: <41798679-677C-4808-A665-9311F9488E97@staff.uni-marburg.de> Hi David, Am 26.11.2008 um 01:08 schrieb David Mathog: >> I think maybe the NFS mounting is different, so that the remote_fs >> prerequisite isn't really satisfied, even though the associated >> script >> has run. The sgeexecd script does include a test: >> >> while [ ! -d "$SGE_ROOT" -a $count -le 120 ]; do >> count=`expr $count + 1` >> sleep 1 >> done > > This seems to have been it. Changing "$SGE_ROOT" to "$SGE_ROOT/bin" > let SGE came up ok in a couple of consecutive reboots. 
Not definitive > proof that was the issue, but at least it seems like progress. > Apparently it was getting to this part of the SGE init script before > $SGE_ROOT was actually mounted, the -d test always passed, NFS > mounted or not, and of course the SGE start up failed since none of > that > code from the remote system was reachable. Just for kicks I added an > echo line within the loop, so that if it sticks there it will show > up on the console. may I beg you to enter an issue at http://gridengine.sunsource.net/ of this? -- Reuti From p2s2-chairs at mcs.anl.gov Tue Nov 25 09:46:09 2008 From: p2s2-chairs at mcs.anl.gov (p2s2-chairs@mcs.anl.gov) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] Call For Papers: Intl. Workshop on Parallel Programming Models and Systems Software for HEC (P2S2) Message-ID: <200811251746.mAPHk9Ok000456@pakkled.iit.edu> CALL FOR PAPERS =============== Second International Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2) Sept. 22nd, 2009 To be held in conjunction with ICPP-09: The 38th International Conference on Parallel Processing, Sept. 22-25, 2009, Vienna, Austria Website: http://www.mcs.anl.gov/events/workshops/p2s2 SCOPE ----- The goal of this workshop is to bring together researchers and practitioners in parallel programming models and systems software for high-end computing systems. Please join us in a discussion of new ideas, experiences, and the latest trends in these areas at the workshop. TOPICS OF INTEREST ------------------ The focus areas for this workshop include, but are not limited to: * Systems software for high-end scientific and enterprise computing architectures o Communication sub-subsystems for high-end computing o High-performance file and storage systems o Fault-tolerance techniques and implementations o Efficient and high-performance virtualization and other management mechanisms for high-end computing * Programming models and their high-performance implementations o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel, Fortress and others o Hybrid Programming Models * Tools for Management, Maintenance, Coordination and Synchronization o Software for Enterprise Data-centers using Modern Architectures o Job scheduling libraries o Management libraries for large-scale system o Toolkits for process and task coordination on modern platforms * Performance evaluation, analysis and modeling of emerging computing platforms PROCEEDINGS ----------- Proceedings of this workshop will be published by the IEEE Computer Society (together with the ICPP conference proceedings) in CD format only and will be available at the conference. SUBMISSION INSTRUCTIONS ----------------------- Submissions should be in PDF format in U.S. Letter size paper. They should not exceed 8 pages (all inclusive). Submissions will be judged based on relevance, significance, originality, correctness and clarity. Please visit workshop website at: http://www.mcs.anl.gov/events/workshops/p2s2/ for the submission link. JOURNAL SPECIAL ISSUE --------------------- The best papers selected for the workshop will be published in a special issue of the International Journal of High Performance Computing Applications (IJHPCA) on Parallel Programming Models and Systems Software for High-End Computing. IMPORTANT DATES --------------- Paper Submission: Feb. 
27th, 2009 Author Notification: May 1st, 2009 Camera Ready: June 5th, 2009 PROGRAM CHAIRS -------------- * Pavan Balaji (Argonne National Laboratory) * Abhinav Vishnu (Pacific Northwest National Laboratory) PUBLICITY CHAIR --------------- * Yong Chen, Illinois Institute of Technology STEERING COMMITTEE ------------------ * William D. Gropp (University of Illinois Urbana-Champaign) * Dhabaleswar K. Panda (Ohio State University) * Vijay Saraswat (IBM Research) PROGRAM COMMITTEE ----------------- * Taisuke Boku, University of Tsukuba, Japan * Ron Brightwell, Sandia National Laboratory * Narayan Desai, Argonne National Laboratory * Richard Graham, Oak Ridge National Laboratory * Zhiyi Huang, University of Otago, New Zealand * Hyun-Wook Jin, Konkuk University, Korea * Matthew Koop, Ohio State University * Sriram Krishnamoorthy, Pacific Northwest National Laboratory * Zhiling Lan, Illinois Institute of Technology * Doug Lea, State University of New York at Oswego * Jiuxing Liu, IBM Research * Guillaume Mercier, INRIA, France * Jarek Nieplocha, Pacific Northwest National Laboratory * Scott Pakin, Los Alamos National Laboratory * Fabrizio Petrini, IBM Research * Arun Raghunath, Intel * Vivek Sarkar, Rice University * Bronis de Supinksi, Lawrence Livermore National Laboratory * Sayantan Sur, IBM Research * Rajeev Thakur, Argonne National Laboratory * Jesper Traff, NEC, Europe * Weikuan Yu, Oak Ridge National Laboratory If you have any questions, please contact us at p2s2-chairs@mcs.anl.gov ======================================================================== If you do not want to receive any more announcements regarding the P2S2 workshop, please unsubscribe here: https://lists.mcs.anl.gov/mailman/listinfo/p2s2-announce ======================================================================== From ntmoore at gmail.com Wed Nov 26 10:32:22 2008 From: ntmoore at gmail.com (Nathan Moore) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] OpenMP wierdness on dual AMD 2350 box w/ SL5.2 x86_64 In-Reply-To: <6009416b0811261027r7a683dfeua843413d902c1cd6@mail.gmail.com> References: <6009416b0811261027r7a683dfeua843413d902c1cd6@mail.gmail.com> Message-ID: <6009416b0811261032g254ea412uc8a1651dd8b8b33f@mail.gmail.com> After the help last week on openmp, I got inspired and bought a dual-quad opteron machine for the department to show 8-way scaling for my students ("Hey, its much cheaper than something new in the optics lab", my dept. chair laughed). I've been working on said machine over the past few days and found something really weird in an OpenMP example program I descrobed to the list. The machine is a dual-proc AMD Opteron 2350, Tyan n3600T (S2937) mainboard, w/ 8GB ram. Initially, I installed the i386 version of Scientific Linux 5.2, but then realized that only half of the RAM was usable, and re-installed SL5.2 x86_64 this morning. The example program is appended to the end of this email. Again, it is a 2-d finite-difference solution to the laplace equation (the context being to "predict" lightning strikes based on the potential between the ground and some clouds overhead). The program scales beautifully up to OMP_NUM_THREADS~6 or 7, but when I set the number of threads to 8, something weird happens, and instead of taking the normal 241 iterations to converge, the program converges after 1 step. This reeks of numerical instability to me, but my programming experience in x86_64 is very limited. 
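For reference only (a generic form, not a diagnosis of the 8-thread result): in free-form Fortran source, an OpenMP directive is continued by ending each continued line with an ampersand, and the data-scoping clauses normally sit on the PARALLEL (or combined PARALLEL DO) construct. A sketch of the initialization loop from the listing below written that way, assuming the same variable names:

! Nx and Ny are parameters, so they need no data-sharing clause even
! under DEFAULT(NONE).
!$OMP PARALLEL DO DEFAULT(NONE) &
!$OMP    SHARED(v, boundary, v_cloud, v_ground, Ly, dy) &
!$OMP    PRIVATE(i, j)
      do j = 1, Ny
         do i = 1, Nx
            boundary(i,j) = 0
            v(i,j) = v_ground + (v_cloud - v_ground)*(j*dy/Ly)
         end do
      end do
!$OMP END PARALLEL DO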
I'm using gfortran, with the simple compile string, gfortran clouds_example_OpenMP.f90 -m64 -fopenmp Any insight into what obvious mistake I'm making would be wonderful! The stability of the output seems erratic to me. Sometimes when OMP_NUM_THREADS=7 the result converges normally after 241 iterations and at other times, the result converges after 1 iteration (indicating some sort of problem with hardware???) This didn't occur yesterday when the machine was running SL5.2, i386. Simulation Output: [nmoore@aykroyd clouds]$ OMP_NUM_THREADS=1 [nmoore@aykroyd clouds]$ export OMP_NUM_THREADS [nmoore@aykroyd clouds]$ ./a.out Hello World from thread 0 There are 1 threads ... convergence criteria is \Delta V < 0.250000003725290 iterations necessary, 241 initialization time, 0 simulation time, 57 [nmoore@aykroyd clouds]$ OMP_NUM_THREADS=2 [nmoore@aykroyd clouds]$ export OMP_NUM_THREADS [nmoore@aykroyd clouds]$ ./a.out Hello World from thread 0 Hello World from thread 1 There are 2 threads ... convergence criteria is \Delta V < 0.250000003725290 iterations necessary, 241 initialization time, 0 simulation time, 28 [nmoore@aykroyd clouds]$ OMP_NUM_THREADS=4 [nmoore@aykroyd clouds]$ export OMP_NUM_THREADS [nmoore@aykroyd clouds]$ ./a.out Hello World from thread 3 Hello World from thread 1 Hello World from thread 0 Hello World from thread 2 There are 4 threads ... convergence criteria is \Delta V < 0.250000003725290 iterations necessary, 241 initialization time, 0 simulation time, 14 [nmoore@aykroyd clouds]$ OMP_NUM_THREADS=8 [nmoore@aykroyd clouds]$ export OMP_NUM_THREADS [nmoore@aykroyd clouds]$ ./a.out Hello World from thread 2 ... convergence criteria is \Delta V < 0.250000003725290 iterations necessary, 1 initialization time, 0 simulation time, 0 Code listing: nmoore@aykroyd clouds]$ cat clouds_example_OpenMP.f90 ! ! use omp_lib ! IMPLICIT NONE integer,parameter::Nx=2000 integer,parameter::Ny=2000 real*8 v(Nx,Ny), dv(Nx,Ny) integer boundary(Nx,Ny) integer i,j,converged,i1,i2 integer t0,t1,t2 real*8 convergence_v, v_cloud, v_ground, max_dv real*8 bump_P,old_v real*8 Lx,Ly,dx,dy,v_y ! real*8 lightning_rod_center, lightning_rod_height ! real*8 house_center, house_height, house_width integer num_iterations ! integer:: id, nthreads !$omp parallel private(id) id = omp_get_thread_num() write (*,*) 'Hello World from thread', id !$omp barrier if ( id == 0 ) then nthreads = omp_get_num_threads() write (*,*) 'There are', nthreads, 'threads' end if !$omp end parallel ! initial time t0 = secnds(0.0) dx =0.1d0 ! differential lengths, m dy =0.1d0 Lx = Nx*dx ! system sizes, m Ly = Ny*dy print *,"\nSimulation has bounds:\n\tX: 0,",Lx,"\n\tY: 0,",Ly print *,"\tNx = ",Nx,"\n\tNy = ",Ny print *,"\tdx = ",dx,"\n\tdy = ",dy v_cloud = -10000.d0 ! volts v_ground = 0.d0 ! initialize the the boundary conditions ! ! first, set the solution function (v), to look like a ! parallel-plate capacitor ! ! note that there is one large parallel section and several ! parallel do's inside that region !$OMP PARALLEL ! !$OMP DO !$OMP& SHARED(Nx,Ny,boundary,v_cloud,v_ground,Ly,dy,v) !$OMP& PRIVATE(i,j) do j=1,Ny do i=1,Nx boundary(i,j)=0 v(i,j) = v_ground + (v_cloud-v_ground)*(j*dy/Ly) end do end do !$OMP END DO ! !$OMP DO !$OMP& SHARED(Nx,Ny,boundary) !$OMP& PRIVATE(i) do i=1,Nx boundary(i,1)=1 ! we need to ensure that the edges of boundary(i,Ny)=1 ! the domain are held as boundary end do !$OMP END DO ! !$OMP DO !$OMP& SHARED(boundary,Nx) !$OMP& PRIVATE(j) do j=1,Ny boundary(1,j)=1 boundary(Nx,j)=1 end do !$OMP END DO !$OMP END PARALLEL ! 
set up an interesting feature on the lower boundary ! do this in parallel with SECTIONS directive ! !$OMP PARALLEL !$OMP& SHARED(v,boundary,Nx,Ny,dx,dy,Lx,Ly,lightning_rod_height) !$OMP& PRIVATE(lightning_rod_center,house_center,house_height,house_width)) !$OMP SECTIONS !$OMP SECTION ! Lightning_rod lightning_rod_center = Lx*0.6d0 lightning_rod_height = 5.0d0 call create_lightning_rod(v_ground,lightning_rod_center,lightning_rod_height,dx,dy,Nx,Ny,v,boundary) !$OMP SECTION lightning_rod_center = Lx*0.5d0 call create_lightning_rod(v_ground,lightning_rod_center,lightning_rod_height,dx,dy,Nx,Ny,v,boundary) !$OMP SECTION lightning_rod_center = Lx*0.7d0 call create_lightning_rod(v_ground,lightning_rod_center,lightning_rod_height,dx,dy,Nx,Ny,v,boundary) !$OMP SECTION ! house house_center = 0.4d0*Lx house_height = 5.0d0 house_width = 5.0d0 call create_house(v_ground,house_center,house_height,house_width,dx,dy,Nx,Ny,v,boundary) !$OMP END SECTIONS !$OMP END PARALLEL ! initialization done t1 = secnds(0.0) ! main solution iteration ! ! repeat the recursion relation until the maximum change ! from update to update is less than a convergence epsilon, convergence_v = (0.05)*dabs(v_ground-v_cloud)/(1.d0*Ny) print *,"\nconvergence criteria is \Delta V < ",convergence_v num_iterations = 0 ! ! iteration implemented with a goto or a do-while converged=0 do while( converged .eq. 0) converged = 1 num_iterations = num_iterations + 1 !$OMP PARALLEL !$OMP DO !$OMP& PRIVATE(i,j) !$OMP& SHARED(Ny,Nx,dv,v,boundary)) do j=2,(Ny-1) do i=2,(Nx-1) dv(i,j) = 0.25d0*(v(i-1,j)+v(i+1,j)+v(i,j+1)+v(i,j-1)) - v(i,j) dv(i,j) = dv(i,j)*(1.d0-DFLOAT(boundary(i,j))) end do end do !$OMP END DO max_dv = 0.d0 !$OMP DO !$OMP& PRIVATE(i,j) !$OMP& SHARED(NX,NY,dv,v)) !$OMP& REDUCTION(MAX:max_dv) do j=2,(Ny-1) do i=2,(Nx-1) v(i,j) = v(i,j) + dv(i,j) if(dv(i,j) .gt. max_dv) then max_dv = dv(i,j) endif end do end do !$OMP END DO !$OMP END PARALLEL if(max_dv .ge. convergence_v) then converged = 0 endif end do ! simulation finished t2 = secnds(0.0) print *," iterations necessary, ",num_iterations print *," initialization time, ",t1-t0 print *," simulation time, ",t2-t1 open(unit=10,file="v_output.dat") write(10,*) "# x\ty\tv(x,y)" do j=1,Ny !do i=1,Nx ! skipping the full array print to save execution time ! 
the printed data file is normally sent to gnuplot w/ splot i=10 write (10,*) i*dx,j*dy,v(i,j) !enddo write (10,*)" " end do close(10) stop end subroutine create_lightning_rod(v_ground,lightning_rod_center,lightning_rod_height,dx,dy,Nx,Ny,v,boundary) IMPLICIT NONE integer Nx, Ny,j,boundary(Nx,Ny) integer j_limit integer index_lightning_rod_center real*8 v(Nx,Ny) real*8 lightning_rod_center,lightning_rod_height real*8 dx, dy, v_ground index_lightning_rod_center = lightning_rod_center/dx j_limit = lightning_rod_height/dy do j=1,j_limit v(index_lightning_rod_center,j) = v_ground boundary(index_lightning_rod_center,j) = 1 end do print *,"Created a lightning rod of height ",lightning_rod_height print *,"\ty_index ",j_limit print *,"\tx-position ",lightning_rod_center print *,"\tx_index ",index_lightning_rod_center end subroutine subroutine create_house(v_ground,house_center,house_height,house_width,dx,dy,Nx,Ny,v,boundary) IMPLICIT NONE integer Nx, Ny, boundary(Nx,Ny) real*8 v(Nx,Ny) real*8 v_ground, dx, dy integer i,j,i_limit,j_limit, index_house_center real*8 house_center,house_height,house_width index_house_center = house_center/dx i_limit = 0.5d0*house_width/dx j_limit = house_height/dy do j=1,j_limit do i=(index_house_center-i_limit),(index_house_center+i_limit) v(i,j) = v_ground boundary(i,j) = 1 end do end do print *,"Created a house of height ",house_height print *,"\ty_index ",j_limit print *,"\twidth ",house_width print *,"\thouse bounds: ",index_house_center-i_limit,index_house_center+i_limit end subroutine -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081126/3d2c4fa8/attachment.html From mathog at caltech.edu Wed Nov 26 11:03:34 2008 From: mathog at caltech.edu (David Mathog) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] Re: OpenMP wierdness on dual AMD 2350 box w/ SL5.2 x86_64 Message-ID: "Nathan Moore" wrote > The program scales beautifully up to OMP_NUM_THREADS~6 or 7, but when I set > the number of threads to 8, something weird happens, and instead of taking > the normal 241 iterations to converge, the program converges after 1 step. > This reeks of numerical instability to me, but my programming experience in > x86_64 is very limited. I did not read through your code very carefully, but are you by any chance running out of memory somewhere, without the code in place to catch and report the error? In other words, it might be a coincidence that this happens at 8 threads. If you increase the size of the arrays you may find the bug moves to a smaller number of threads. 
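A minimal sketch of the kind of check being suggested here, in case it is
useful: if the large work arrays are made allocatable and requested with a
stat= argument, an out-of-memory condition is reported explicitly instead of
passing silently. The array names and sizes below are illustrative, not taken
from the program earlier in the thread.

     ! Sketch only: request the big arrays at run time and report failure.
     program alloc_check
       implicit none
       integer, parameter :: nx = 2000, ny = 2000
       real*8, allocatable :: v(:,:), dv(:,:)
       integer :: ierr

       allocate(v(nx,ny), dv(nx,ny), stat=ierr)
       if (ierr /= 0) then
          print *, 'allocation of roughly ', 2_8*nx*ny*8, &
                   ' bytes failed, stat = ', ierr
          stop 1
       end if

       v  = 0.d0
       dv = 0.d0
       print *, 'arrays allocated and initialized'
       deallocate(v, dv)
     end program alloc_check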
Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From ntmoore at gmail.com Wed Nov 26 11:34:30 2008 From: ntmoore at gmail.com (Nathan Moore) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] OpenMP wierdness on dual AMD 2350 box w/ SL5.2 x86_64 In-Reply-To: <492DA1F3.90801@cse.ucdavis.edu> References: <6009416b0811261027r7a683dfeua843413d902c1cd6@mail.gmail.com> <6009416b0811261032g254ea412uc8a1651dd8b8b33f@mail.gmail.com> <492DA1F3.90801@cse.ucdavis.edu> Message-ID: <6009416b0811261134q1e56fab0q3fc8ca9b4b1d77cf@mail.gmail.com> on the web, http://course1.winona.edu/nmoore/clouds_example_OpenMP_f90.html On Wed, Nov 26, 2008 at 1:22 PM, Bill Broadley wrote: > I'd be happy to take a look, but email formatting of f90 sometimes seems to > cause issues. Could you send it as an attachment or put it on a webpage? > > Nathan Moore wrote: > > After the help last week on openmp, I got inspired and bought a dual-quad > > opteron machine for the department to show 8-way scaling for my students > > ("Hey, its much cheaper than something new in the optics lab", my dept. > > chair laughed). > > > > I've been working on said machine over the past few days and found > something > > really weird in an OpenMP example program I descrobed to the list. > > > > The machine is a dual-proc AMD Opteron 2350, Tyan n3600T (S2937) > mainboard, > > w/ 8GB ram. Initially, I installed the i386 version of Scientific Linux > > 5.2, but then realized that only half of the RAM was usable, and > > re-installed SL5.2 x86_64 this morning. > > > > The example program is appended to the end of this email. Again, it is a > > 2-d finite-difference solution to the laplace equation (the context being > to > > "predict" lightning strikes based on the potential between the ground and > > some clouds overhead). > > > > The program scales beautifully up to OMP_NUM_THREADS~6 or 7, but when I > set > > the number of threads to 8, something weird happens, and instead of > taking > > the normal 241 iterations to converge, the program converges after 1 > step. > > This reeks of numerical instability to me, but my programming experience > in > > x86_64 is very limited. > > > > I'm using gfortran, with the simple compile string, > > gfortran clouds_example_OpenMP.f90 -m64 -fopenmp > > > > Any insight into what obvious mistake I'm making would be wonderful! > > > > The stability of the output seems erratic to me. Sometimes when > > OMP_NUM_THREADS=7 the result converges normally after 241 iterations and > at > > other times, the result converges after 1 iteration (indicating some sort > of > > problem with hardware???) > > > > This didn't occur yesterday when the machine was running SL5.2, i386. > > > > > > > > > > Simulation Output: > > > > [nmoore@aykroyd clouds]$ OMP_NUM_THREADS=1 > > [nmoore@aykroyd clouds]$ export OMP_NUM_THREADS > > [nmoore@aykroyd clouds]$ ./a.out > > Hello World from thread 0 > > There are 1 threads > > ... > > convergence criteria is \Delta V < 0.250000003725290 > > iterations necessary, 241 > > initialization time, 0 > > simulation time, 57 > > > > > > > > > > [nmoore@aykroyd clouds]$ OMP_NUM_THREADS=2 > > [nmoore@aykroyd clouds]$ export OMP_NUM_THREADS > > [nmoore@aykroyd clouds]$ ./a.out > > Hello World from thread 0 > > Hello World from thread 1 > > There are 2 threads > > ... 
> > convergence criteria is \Delta V < 0.250000003725290 > > iterations necessary, 241 > > initialization time, 0 > > simulation time, 28 > > > > > > > > > > [nmoore@aykroyd clouds]$ OMP_NUM_THREADS=4 > > [nmoore@aykroyd clouds]$ export OMP_NUM_THREADS > > [nmoore@aykroyd clouds]$ ./a.out > > Hello World from thread 3 > > Hello World from thread 1 > > Hello World from thread 0 > > Hello World from thread 2 > > There are 4 threads > > ... > > convergence criteria is \Delta V < 0.250000003725290 > > iterations necessary, 241 > > initialization time, 0 > > simulation time, 14 > > > > > > > > > > [nmoore@aykroyd clouds]$ OMP_NUM_THREADS=8 > > [nmoore@aykroyd clouds]$ export OMP_NUM_THREADS > > [nmoore@aykroyd clouds]$ ./a.out > > Hello World from thread 2 > > ... > > convergence criteria is \Delta V < 0.250000003725290 > > iterations necessary, 1 > > initialization time, 0 > > simulation time, 0 > > > > Code listing: > > > > nmoore@aykroyd clouds]$ cat clouds_example_OpenMP.f90 > > ! > > ! > > use omp_lib > > ! > > IMPLICIT NONE > > integer,parameter::Nx=2000 > > integer,parameter::Ny=2000 > > real*8 v(Nx,Ny), dv(Nx,Ny) > > integer boundary(Nx,Ny) > > integer i,j,converged,i1,i2 > > integer t0,t1,t2 > > real*8 convergence_v, v_cloud, v_ground, max_dv > > real*8 bump_P,old_v > > real*8 Lx,Ly,dx,dy,v_y > > ! > > real*8 lightning_rod_center, lightning_rod_height > > ! > > real*8 house_center, house_height, house_width > > integer num_iterations > > ! > > integer:: id, nthreads > > !$omp parallel private(id) > > id = omp_get_thread_num() > > write (*,*) 'Hello World from thread', id > > !$omp barrier > > if ( id == 0 ) then > > nthreads = omp_get_num_threads() > > write (*,*) 'There are', nthreads, 'threads' > > end if > > !$omp end parallel > > > > ! initial time > > t0 = secnds(0.0) > > > > dx =0.1d0 ! differential lengths, m > > dy =0.1d0 > > Lx = Nx*dx ! system sizes, m > > Ly = Ny*dy > > > > print *,"\nSimulation has bounds:\n\tX: 0,",Lx,"\n\tY: 0,",Ly > > print *,"\tNx = ",Nx,"\n\tNy = ",Ny > > print *,"\tdx = ",dx,"\n\tdy = ",dy > > > > v_cloud = -10000.d0 ! volts > > v_ground = 0.d0 > > > > ! initialize the the boundary conditions > > ! > > ! first, set the solution function (v), to look like a > > ! parallel-plate capacitor > > ! > > ! note that there is one large parallel section and several > > ! parallel do's inside that region > > !$OMP PARALLEL > > ! > > !$OMP DO > > !$OMP& SHARED(Nx,Ny,boundary,v_cloud,v_ground,Ly,dy,v) > > !$OMP& PRIVATE(i,j) > > do j=1,Ny > > do i=1,Nx > > boundary(i,j)=0 > > v(i,j) = v_ground + (v_cloud-v_ground)*(j*dy/Ly) > > end do > > end do > > !$OMP END DO > > ! > > !$OMP DO > > !$OMP& SHARED(Nx,Ny,boundary) > > !$OMP& PRIVATE(i) > > do i=1,Nx > > boundary(i,1)=1 ! we need to ensure that the edges of > > boundary(i,Ny)=1 ! the domain are held as boundary > > end do > > !$OMP END DO > > ! > > !$OMP DO > > !$OMP& SHARED(boundary,Nx) > > !$OMP& PRIVATE(j) > > do j=1,Ny > > boundary(1,j)=1 > > boundary(Nx,j)=1 > > end do > > !$OMP END DO > > !$OMP END PARALLEL > > > > > > ! set up an interesting feature on the lower boundary > > ! do this in parallel with SECTIONS directive > > ! > > !$OMP PARALLEL > > !$OMP& SHARED(v,boundary,Nx,Ny,dx,dy,Lx,Ly,lightning_rod_height) > > !$OMP& > PRIVATE(lightning_rod_center,house_center,house_height,house_width)) > > !$OMP SECTIONS > > > > !$OMP SECTION > > ! 
Lightning_rod > > lightning_rod_center = Lx*0.6d0 > > lightning_rod_height = 5.0d0 > > call > > > create_lightning_rod(v_ground,lightning_rod_center,lightning_rod_height,dx,dy,Nx,Ny,v,boundary) > > > > !$OMP SECTION > > lightning_rod_center = Lx*0.5d0 > > call > > > create_lightning_rod(v_ground,lightning_rod_center,lightning_rod_height,dx,dy,Nx,Ny,v,boundary) > > > > !$OMP SECTION > > lightning_rod_center = Lx*0.7d0 > > call > > > create_lightning_rod(v_ground,lightning_rod_center,lightning_rod_height,dx,dy,Nx,Ny,v,boundary) > > > > !$OMP SECTION > > ! house > > house_center = 0.4d0*Lx > > house_height = 5.0d0 > > house_width = 5.0d0 > > call > > > create_house(v_ground,house_center,house_height,house_width,dx,dy,Nx,Ny,v,boundary) > > > > !$OMP END SECTIONS > > !$OMP END PARALLEL > > > > ! initialization done > > t1 = secnds(0.0) > > > > > > > > > > > > ! main solution iteration > > ! > > ! repeat the recursion relation until the maximum change > > ! from update to update is less than a convergence epsilon, > > convergence_v = (0.05)*dabs(v_ground-v_cloud)/(1.d0*Ny) > > print *,"\nconvergence criteria is \Delta V < ",convergence_v > > num_iterations = 0 > > > > ! > > ! iteration implemented with a goto or a do-while > > converged=0 > > do while( converged .eq. 0) > > > > converged = 1 > > num_iterations = num_iterations + 1 > > !$OMP PARALLEL > > !$OMP DO > > !$OMP& PRIVATE(i,j) > > !$OMP& SHARED(Ny,Nx,dv,v,boundary)) > > do j=2,(Ny-1) > > do i=2,(Nx-1) > > dv(i,j) = > > 0.25d0*(v(i-1,j)+v(i+1,j)+v(i,j+1)+v(i,j-1)) - v(i,j) > > dv(i,j) = dv(i,j)*(1.d0-DFLOAT(boundary(i,j))) > > end do > > end do > > !$OMP END DO > > > > max_dv = 0.d0 > > !$OMP DO > > !$OMP& PRIVATE(i,j) > > !$OMP& SHARED(NX,NY,dv,v)) > > !$OMP& REDUCTION(MAX:max_dv) > > do j=2,(Ny-1) > > do i=2,(Nx-1) > > v(i,j) = v(i,j) + dv(i,j) > > if(dv(i,j) .gt. max_dv) then > > max_dv = dv(i,j) > > endif > > end do > > end do > > !$OMP END DO > > !$OMP END PARALLEL > > > > if(max_dv .ge. convergence_v) then > > converged = 0 > > endif > > > > end do > > > > > > > > > > > > > > > > ! simulation finished > > t2 = secnds(0.0) > > > > print *," iterations necessary, ",num_iterations > > print *," initialization time, ",t1-t0 > > print *," simulation time, ",t2-t1 > > > > > > open(unit=10,file="v_output.dat") > > write(10,*) "# x\ty\tv(x,y)" > > do j=1,Ny > > !do i=1,Nx > > ! skipping the full array print to save execution time > > ! 
the printed data file is normally sent to gnuplot w/ splot > > i=10 > > write (10,*) i*dx,j*dy,v(i,j) > > !enddo > > write (10,*)" " > > end do > > close(10) > > > > > > stop > > end > > > > > > > > > > subroutine > > > create_lightning_rod(v_ground,lightning_rod_center,lightning_rod_height,dx,dy,Nx,Ny,v,boundary) > > IMPLICIT NONE > > integer Nx, Ny,j,boundary(Nx,Ny) > > integer j_limit > > integer index_lightning_rod_center > > real*8 v(Nx,Ny) > > real*8 lightning_rod_center,lightning_rod_height > > real*8 dx, dy, v_ground > > > > index_lightning_rod_center = lightning_rod_center/dx > > j_limit = lightning_rod_height/dy > > do j=1,j_limit > > v(index_lightning_rod_center,j) = v_ground > > boundary(index_lightning_rod_center,j) = 1 > > end do > > > > print *,"Created a lightning rod of height ",lightning_rod_height > > print *,"\ty_index ",j_limit > > print *,"\tx-position ",lightning_rod_center > > print *,"\tx_index ",index_lightning_rod_center > > > > > > end subroutine > > > > > > > > > > > > > > > > subroutine > > > create_house(v_ground,house_center,house_height,house_width,dx,dy,Nx,Ny,v,boundary) > > IMPLICIT NONE > > integer Nx, Ny, boundary(Nx,Ny) > > real*8 v(Nx,Ny) > > real*8 v_ground, dx, dy > > integer i,j,i_limit,j_limit, index_house_center > > real*8 house_center,house_height,house_width > > > > index_house_center = house_center/dx > > i_limit = 0.5d0*house_width/dx > > j_limit = house_height/dy > > do j=1,j_limit > > do i=(index_house_center-i_limit),(index_house_center+i_limit) > > v(i,j) = v_ground > > boundary(i,j) = 1 > > end do > > end do > > > > print *,"Created a house of height ",house_height > > print *,"\ty_index ",j_limit > > print *,"\twidth ",house_width > > print *,"\thouse bounds: > > ",index_house_center-i_limit,index_house_center+i_limit > > > > end subroutine > > > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- - - - - - - - - - - - - - - - - - - - - - Nathan Moore Assistant Professor, Physics Winona State University AIM: nmoorewsu - - - - - - - - - - - - - - - - - - - - - -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081126/40611227/attachment.html From ntmoore at gmail.com Thu Nov 27 06:50:25 2008 From: ntmoore at gmail.com (Nathan Moore) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] OpenMP wierdness on dual AMD 2350 box w/ SL5.2 x86_64 In-Reply-To: References: <6009416b0811261027r7a683dfeua843413d902c1cd6@mail.gmail.com> <6009416b0811261032g254ea412uc8a1651dd8b8b33f@mail.gmail.com> Message-ID: <6009416b0811270650q68dcda7cj6e00385d3719808a@mail.gmail.com> Dmitri, Perfect! Thanks so much for the response. Your guess about the barrier was exactly correct. The problem has disappeared. I was ignorant about the proper way to specify shared variables. Thanks for the correction. Is the following use of reduction acceptable? !$OMP PARALLEL !$OMP& PRIVATE(i,j) !$OMP& SHARED(Ny,Nx,dv,v,boundary)) ! !$OMP DO ... ... !$OMP END DO max_dv = 0.d0 !$OMP BARRIER !$OMP DO !$OMP& REDUCTION(MAX:max_dv) do j=2,(Ny-1) do i=2,(Nx-1) v(i,j) = v(i,j) + dv(i,j) if(dabs(dv(i,j)) .gt. 
max_dv) then
              max_dv = dv(i,j)
           endif
        end do
     end do
!$OMP END DO

On Thu, Nov 27, 2008 at 4:07 AM, Dmitri Chubarov wrote:

> Nathan, hello,
>
> I gave your code a second look and noticed this:
>
>> !$OMP PARALLEL
>> !$OMP DO
>> ....
>> !$OMP END DO
>>
>> max_dv = 0.d0
>> !$OMP DO
>> ....
>> !$OMP END DO
>> !$OMP END PARALLEL
>
> There is a BARRIER missing between max_dv = 0.d0 and the following loop.
> One of the threads in the pool might have been late and reached this
> statement after the rest had already completed the reduction loop.
>
> The barrier is also important to ensure that no thread would use the values
> of dv(i,j) in the reduction loop before they are updated by the main
> computational loop above.
>
> Finally,
>     if(dv(i,j) .gt. max_dv) then
>         max_dv = dv(i,j)
>     endif
> does not look right, since it would not handle negative values of dv(i,j)
> correctly. I assume it should read
>     max_dv = max(max_dv, dabs(dv(i,j)))
>
> Best regards,
> Dmitri Chubarov
>
> --
> Junior Researcher
> Siberian Branch of the Russian Academy of Sciences
> Institute of Computational Technologies

--
- - - - - - - - - - - - - - - - - - - - -
Nathan Moore
Assistant Professor, Physics
Winona State University
AIM: nmoorewsu
- - - - - - - - - - - - - - - - - - - - -
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20081127/5346f032/attachment.html

From tvixel at gmail.com  Wed Nov 26 12:39:48 2008
From: tvixel at gmail.com (Thomas Vixel)
Date: Wed Nov 25 01:08:00 2009
Subject: [Beowulf] cli alternative to cluster top?
Message-ID: 

I've been googling for a top-like cli tool to use on our cluster, but the
closest thing that comes up is Rocks' "cluster top" script. That could be
tweaked to work via the cli, but due to factors beyond my control
(management) all functionality has to come from a pre-fab program rather
than a software stack with local, custom modifications.

I'm sure this has come up more than once in the HPC sector as well -- could
anyone point me to any top-like apps for our cluster?

For reference, wulfware/wulfstat was nixed as well because of the xmlsysd
dependency.

From vlad at geociencias.unam.mx  Thu Nov 27 17:37:27 2008
From: vlad at geociencias.unam.mx (Vlad Manea)
Date: Wed Nov 25 01:08:00 2009
Subject: [Beowulf] SSH to compute nodes hangs
Message-ID: <492F4B57.2030109@geociencias.unam.mx>

Hi all,

I am having a problem with ssh and the firewall on my new Rocks 5.1 cluster:
-firewall on
-I can ping all compute nodes but ssh hangs...

-firewall off
-I can ssh into compute nodes.

Since I installed the system I have not touched the iptables. Below is the
iptables ruleset on my system. Any idea what is really happening and how to
fix it?
Thanks, Vlad /sbin/iptables-save # Generated by iptables-save v1.3.5 on Wed Nov 26 13:17:53 2008 *filter :INPUT ACCEPT [0:0] :FORWARD ACCEPT [0:0] :OUTPUT ACCEPT [14815:3295535] :RH-Firewall-1-INPUT - [0:0] -A INPUT -j RH-Firewall-1-INPUT -A FORWARD -j RH-Firewall-1-INPUT -A RH-Firewall-1-INPUT -i lo -j ACCEPT -A RH-Firewall-1-INPUT -p icmp -m icmp --icmp-type any -j ACCEPT -A RH-Firewall-1-INPUT -p esp -j ACCEPT -A RH-Firewall-1-INPUT -p ah -j ACCEPT -A RH-Firewall-1-INPUT -d 224.0.0.251 -p udp -m udp --dport 5353 -j ACCEPT -A RH-Firewall-1-INPUT -p udp -m udp --dport 631 -j ACCEPT -A RH-Firewall-1-INPUT -p tcp -m tcp --dport 631 -j ACCEPT -A RH-Firewall-1-INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT -A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT -A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited COMMIT # Completed on Wed Nov 26 13:17:53 2008 From gmkurtzer at gmail.com Sat Nov 29 23:33:07 2008 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] cli alternative to cluster top? In-Reply-To: References: Message-ID: <571f1a060811292333q4aa31ab6x5b7995baa4145445@mail.gmail.com> Warewulf has a real time top like command for the cluster nodes and has been known to scale up to the thousands of nodes: http://www.runlevelzero.net/images/wwtop-screenshot.png We are just kicking off Warewulf development again now that Perceus has gotten to a critical mass and Caos NSA 1.0 has been released. We should have our repositories for Warewulf-3 pre-releases up shortly but if you need something ASAP, please contact me offline and I will get you what you need. Thanks! Greg On Wed, Nov 26, 2008 at 12:39 PM, Thomas Vixel wrote: > I've been googling for a top-like cli tool to use on our cluster, but > the closest thing that comes up is Rocks' "cluster top" script. That > could be tweaked to work via the cli, but due to factors beyond my > control (management) all functionality has to come from a pre-fab > program rather than a software stack with local, custom modifications. > > I'm sure this has come up more than once in the HPC sector as well -- > could anyone point me to any top-like apps for our cluster? > > For reference, wulfware/wulfstat was nixed as well because of the > xmlsysd dependency. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Greg Kurtzer http://www.infiscale.com/ http://www.runlevelzero.net/ http://www.perceus.org/ http://www.caoslinux.org/ From rgb at phy.duke.edu Sun Nov 30 08:45:44 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] cli alternative to cluster top? In-Reply-To: <571f1a060811292333q4aa31ab6x5b7995baa4145445@mail.gmail.com> References: <571f1a060811292333q4aa31ab6x5b7995baa4145445@mail.gmail.com> Message-ID: On Sat, 29 Nov 2008, Greg Kurtzer wrote: > Warewulf has a real time top like command for the cluster nodes and > has been known to scale up to the thousands of nodes: > > http://www.runlevelzero.net/images/wwtop-screenshot.png > > We are just kicking off Warewulf development again now that Perceus > has gotten to a critical mass and Caos NSA 1.0 has been released. We > should have our repositories for Warewulf-3 pre-releases up shortly > but if you need something ASAP, please contact me offline and I will > get you what you need. > > Thanks! 
> Greg > > On Wed, Nov 26, 2008 at 12:39 PM, Thomas Vixel wrote: >> I've been googling for a top-like cli tool to use on our cluster, but >> the closest thing that comes up is Rocks' "cluster top" script. That >> could be tweaked to work via the cli, but due to factors beyond my >> control (management) all functionality has to come from a pre-fab >> program rather than a software stack with local, custom modifications. >> >> I'm sure this has come up more than once in the HPC sector as well -- >> could anyone point me to any top-like apps for our cluster? >> >> For reference, wulfware/wulfstat was nixed as well because of the >> xmlsysd dependency. That's fine, but I'm curious. How do you expect to run a cluster information tool over a network without a socket at both ends? If not xmlsysd, then something else -- sshd, xinetd, dedicated or general purpose, where the latter almost certainly will have have higher overhead? Or are you looking for something with a kernel level network interface, more like scyld? rgb >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >> > > > > -- > Greg Kurtzer > http://www.infiscale.com/ > http://www.runlevelzero.net/ > http://www.perceus.org/ > http://www.caoslinux.org/ > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From becker at scyld.com Sun Nov 30 08:52:20 2008 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] cli alternative to cluster top? In-Reply-To: Message-ID: On Wed, 26 Nov 2008, Thomas Vixel wrote: > I've been googling for a top-like cli tool to use on our cluster, but > the closest thing that comes up is Rocks' "cluster top" script. That > could be tweaked to work via the cli, but due to factors beyond my > control (management) all functionality has to come from a pre-fab > program rather than a software stack with local, custom modifications. > > I'm sure this has come up more than once in the HPC sector as well -- > could anyone point me to any top-like apps for our cluster? Most remote job mechanisms only think about starting remote processes, not about the full create-monitor-control-report functionality. The Scyld system (currently branded "Clusterware") defaults to using a built-in unified process space. That presents all of the processes running over the cluster in a process space on the master machine, with fully POSIX semantics. It neatly solves your need with... the standard 'top' program. Most scheduling systems also have a way to monitor processes that they start, but I haven't found one that takes advantage of all information available and reports it quickly/efficiently. There are many advantages of the Scyld implementation -- no new or modified process management tools need to be written. Standard utilities such as 'top' and 'ps' work unmodified, as well as tools we didn't specifically plan for e.g. GUI versions of 'pstree'. -- The 'killall' program works over the cluster, efficiently. 
-- All signals work as expected, including 'kill -9'. (Most remote process starting mechanisms will just kill off the local endpoint, leaving the remote process running-but-confused.) -- Process groups and controlling-TTY groups works properly for job control and signals -- Running jobs report their status and statistics accurately -- an updated 'rusage' structure is sent once per second, and a final rusage structure and exit status is sent when the process terminates. The "downside" is that we explicitly use Linux features and details, relying on kernel-version-specific features. That's an issue if it's a one-off hack, but we've been using this approach continuously for a decade, since the Linux 2.2 kernel and over multiple architectures. We've been producing supported commercial releases since 2000, longer than anyone else in the business. -- Donald Becker becker@scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA From landman at scalableinformatics.com Sun Nov 30 08:55:40 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] cli alternative to cluster top? In-Reply-To: References: Message-ID: <4932C58C.6020706@scalableinformatics.com> Thomas Vixel wrote: > I've been googling for a top-like cli tool to use on our cluster, but > the closest thing that comes up is Rocks' "cluster top" script. That > could be tweaked to work via the cli, but due to factors beyond my > control (management) all functionality has to come from a pre-fab > program rather than a software stack with local, custom modifications. > > I'm sure this has come up more than once in the HPC sector as well -- > could anyone point me to any top-like apps for our cluster? We have a ctop we have written a while ago. Depends upon pdsh, though with a little effort, even that could be removed (albeit being a somewhat slower program as a result). Our version is Perl based, open source, and quite a few of our customers do use it. I had looked at hooking it into wulfstat at some point. Doug Eadline has a top he had written (is that correct Doug?) for clusters some time ago. > > For reference, wulfware/wulfstat was nixed as well because of the > xmlsysd dependency. Sometimes I wonder about the 'logic' underpinning some of the decisions I hear about. ctop could work with plain ssh, though you will need to make sure that all nodes are able to be reached via passwordless ssh (shouldn't be an issue for most of todays clusters), and you will need some mechanism to tell ctop which nodes you wish to include in the list. We have used /etc/cluster/hosts.cluster in the past to list hostnames/ip addresses of the cluster nodes. Let me know if you have pdsh implemented. BTW: ctop is OSS (GPLv2). It should be available on our download site as an RPM/source RPM (http://downloads.scalableinformatics.com). If there is enough interest in it, I'll put it into our public Mercurial repository as well. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From becker at scyld.com Sun Nov 30 09:19:29 2008 From: becker at scyld.com (Donald Becker) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] cli alternative to cluster top? In-Reply-To: Message-ID: On Sun, 30 Nov 2008, Robert G. 
Brown wrote: > On Sat, 29 Nov 2008, Greg Kurtzer wrote: > > > Warewulf has a real time top like command for the cluster nodes and > > has been known to scale up to the thousands of nodes: > > > > http://www.runlevelzero.net/images/wwtop-screenshot.png > > On Wed, Nov 26, 2008 at 12:39 PM, Thomas Vixel wrote: > >> I've been googling for a top-like cli tool to use on our cluster, but > >> the closest thing that comes up is Rocks' "cluster top" script. That > >> could be tweaked to work via the cli, but due to factors beyond my > >> control (management) all functionality has to come from a pre-fab > >> program rather than a software stack with local, custom modifications. > >> > >> I'm sure this has come up more than once in the HPC sector as well -- > >> could anyone point me to any top-like apps for our cluster? > >> > >> For reference, wulfware/wulfstat was nixed as well because of the > >> xmlsysd dependency. > > That's fine, but I'm curious. How do you expect to run a cluster > information tool over a network without a socket at both ends? If not > xmlsysd, then something else -- sshd, xinetd, dedicated or general > purpose, where the latter almost certainly will have have higher > overhead? Or are you looking for something with a kernel level network > interface, more like scyld? The theoretical architecture of our system has all of the process control communication going over persistent TCP/IP sockets. The master node has a 'master daemon'. As compute nodes boot and join the cluster their 'slave daemon' opens a single TCP socket to the master daemon. Having a persistent connection is a key element to performance. It eliminates the cost and delay of name lookup, reverse name lookup, socket establishment and authentication. (Example: The MPICH people learned this lesson -- MPD is much faster than MPICH v1 using 'rsh'.) We optimized our system extensively, down to the number of bytes in efficiently constructed and parsed packets. But to get scalability to thousands of nodes and processes, we found that we needed to "cheat". While connections are established to the user-level daemon, we optimize by having some of the communication handled by a kernel module that shares the socket. The optimization isn't needed for 'only' hundreds of nodes and processes, or if you are willing to dedicate most of a very powerful head node to process control. But 'thousands' is much more challenging than 'hundreds'. -- Donald Becker becker@scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA From lindahl at pbm.com Sun Nov 30 10:28:39 2008 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] cli alternative to cluster top? In-Reply-To: References: <571f1a060811292333q4aa31ab6x5b7995baa4145445@mail.gmail.com> Message-ID: <20081130182839.GA17239@bx9> On Sun, Nov 30, 2008 at 11:45:44AM -0500, Robert G. Brown wrote: > That's fine, but I'm curious. How do you expect to run a cluster > information tool over a network without a socket at both ends? There's always "qstat". The OP didn't really say what sorts of information he was looking for... -- greg From rgb at phy.duke.edu Sun Nov 30 12:44:07 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:08:00 2009 Subject: [Beowulf] cli alternative to cluster top? 
In-Reply-To: <20081130182839.GA17239@bx9>
References: <571f1a060811292333q4aa31ab6x5b7995baa4145445@mail.gmail.com> <20081130182839.GA17239@bx9>
Message-ID: 

On Sun, 30 Nov 2008, Greg Lindahl wrote:

> On Sun, Nov 30, 2008 at 11:45:44AM -0500, Robert G. Brown wrote:
>
>> That's fine, but I'm curious. How do you expect to run a cluster
>> information tool over a network without a socket at both ends?
>
> There's always "qstat". The OP didn't really say what sorts of
> information he was looking for...

:-) Hey, didn't think of that -- an enormous Quake cluster?

Although I didn't realize that qstat worked by electronic telepathy;-)

   rgb

> -- greg

Robert G. Brown                        Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977

From jbardin at bu.edu  Sun Nov 30 11:23:31 2008
From: jbardin at bu.edu (james bardin)
Date: Wed Nov 25 01:08:00 2009
Subject: [Beowulf] SSH to compute nodes hangs
In-Reply-To: <492F4B57.2030109@geociencias.unam.mx>
References: <492F4B57.2030109@geociencias.unam.mx>
Message-ID: 

On Thu, Nov 27, 2008 at 8:37 PM, Vlad Manea wrote:
> Hi all,
>
> I am having a problem with ssh and the firewall on my new Rocks 5.1 cluster:
> -firewall on
> -I can ping all compute nodes but ssh hangs...
>
> -firewall off
> -I can ssh into compute nodes.
>

Do you mean that you can connect with ssh, but it later hangs?

If that's the case, I see this with RH-based systems fairly often. There are
a few bugs open, but no real consensus on what's going on. It seems to have
something to do with the ip_conntrack module and the network itself.

What I do is get rid of the "--state NEW" requirement for port 22:

-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 22 -j ACCEPT

-jim

From landman at scalableinformatics.com  Sun Nov 30 14:08:05 2008
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed Nov 25 01:08:00 2009
Subject: [Beowulf] cli alternative to cluster top?
In-Reply-To: 
References: <571f1a060811292333q4aa31ab6x5b7995baa4145445@mail.gmail.com> <20081130182839.GA17239@bx9>
Message-ID: <49330EC5.50700@scalableinformatics.com>

Robert G. Brown wrote:
> On Sun, 30 Nov 2008, Greg Lindahl wrote:
>
>> On Sun, Nov 30, 2008 at 11:45:44AM -0500, Robert G. Brown wrote:
>>
>>> That's fine, but I'm curious. How do you expect to run a cluster
>>> information tool over a network without a socket at both ends?
>>
>> There's always "qstat". The OP didn't really say what sorts of
>> information he was looking for...
>
> :-) Hey, didn't think of that -- an enormous Quake cluster?
>
> Although I didn't realize that qstat worked by electronic telepathy;-)

Too bad we can't use EPR pairs for this ...

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
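For anyone who wants to try the change James Bardin describes above on a
stock CentOS 5 / Rocks node, a rough sketch of one way to apply it by hand
follows. This is an assumption-laden illustration rather than a supported
Rocks procedure: check the rule positions on your own node first, and keep in
mind that Rocks regenerates node configuration on reinstall, so a hand edit
may not survive.

# Sketch only: inspect the current rules and their positions first.
iptables -L RH-Firewall-1-INPUT --line-numbers -n

# Remove the stateful port-22 rule quoted earlier in the thread ...
iptables -D RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT

# ... and accept port 22 without the conntrack NEW match, ahead of the final REJECT.
iptables -I RH-Firewall-1-INPUT -p tcp -m tcp --dport 22 -j ACCEPT

# Persist across reboots on RHEL/CentOS 5 (a node reinstall may overwrite this).
service iptables save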