From rgb at phy.duke.edu  Fri Jun  1 09:06:22 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:06:05 2009
Subject: [Beowulf] HDTV video file sizes
In-Reply-To: <1932e3120705291017t4f11eed9gcc36cd120697e216@mail.gmail.com>
References: <1088659434.1180453767761.JavaMail.root@fepweb03>
	<1932e3120705291017t4f11eed9gcc36cd120697e216@mail.gmail.com>
Message-ID: 

On Tue, 29 May 2007, Jim Windle wrote:

> So if Netflix isn't lying when they say they have shipped over a billion
> movies that means they have moved roughly 5 exabytes of data via the US
> mail.  I wonder how that compares to the amount moved over the internet
> during the same time period?
>
> compressed data rates appear to be 20-50 Mbps (lower than 20

Oh, there's little doubt about this sort of thing.  With a DSL
bottleneck, it's MUCH faster for me to drive to Duke and do an install
from its mirrors via a 1 Gbps local network than it is to wait at home
for the data to squeeze through my little pipe.  And every time I drive
to and from Duke carrying my laptop, I move 10 GB/minute between
locations, which (at a GB per six seconds) is slightly HIGHER bandwidth
than the campus Gbps backbone.  If you want to move terabytes at high
bandwidth, box up some portable multi-terabyte RAIDs and fly them there.

However, I can transfer data home while doing other things.  I cannot
drive and do other things.  Network transfers are often parallelizable
in a classic sense and can complete while other things are happening (as
they are now on my laptop as I type this).

   rgb

-- 
Robert G. Brown    http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C.
27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525  email: rgb@phy.duke.edu

From mathog at caltech.edu  Sat Jun  2 17:39:46 2007
From: mathog at caltech.edu (David Mathog)
Date: Wed Nov 25 01:06:05 2009
Subject: [Beowulf] network transfer issue to disk, old versus new hardware
Message-ID: 

I can't quite wrap my head around a recent nettee result; perhaps one of
the network gurus here can explain it.  The tests were these:

A. Sustained write to disk:

     sync; accudate; dd if=/dev/zero bs=512 count=1000000 of=test.dat; \
       sync; accudate

   (accudate is a little utility of mine which is like date but gives
   times to milliseconds.  Subtract the times and calculate the
   sustained write rate to disk.)

B. Transfer of 512 MB from one node to another:

     first node:  dd if=/dev/zero bs=512 count=1000000 | \
                    nettee -in - -next secondnode -v 63
     second node: nettee -out test.dat

C. Same as B, but buffer nettee output:

     second node: nettee -out - | mbuffer -m 4000000 >test.dat

D. Calculate the transfer rate if read from network and write to disk
   are strictly sequential (alternating read, write)
   = 1/(1/11.7 + 1/(speed from A))

E. Ratio: Observed (B) / expected (D)

F. Pipe speed (lowest of 5 consecutive tests; it varies a lot, probably
   because of other activity on the nodes, even though they were
   quiescent; the highest was around 970 MB/s for both platforms):

     dd if=/dev/zero bs=512 count=1000000 >/dev/null

G.
   Raw network speed (move the data, then throw it out):

     first node:  dd if=/dev/zero bs=512 count=1000000 | \
                    nettee -in - -next secondnode -v 63
     second node: nettee -out /dev/null

This was carried out on two different sets of hardware, both with
100BaseT networks (different switches though):

  Old: Athlon MP 2200+, Tyan S2466MPX mobo, 2.6.19.3 kernel, 512 MB RAM
  New: Athlon64 3700+ CPU, ASUS A8N5X mobo, 2.6.21.1 kernel, 1 GB RAM

Here are the results, all in megabytes/sec:

        OLD     NEW
  A     17      40
  B     7.4     10.47
  C     7.4     11.43
  D     6.9     9.05
  E     1.07    1.16
  F     743     603
  G     11.77   11.71

Start with G: in both cases the hardware could push data across the
network at almost exactly the same speed.  From A we see that the disks
on the older machines are considerably slower than the ones on the newer
machines (hdparm showed the same values for OLD/NEW, so it isn't an
obvious misconfiguration).  From D we expect OLD to be slower than NEW,
and B shows that that is indeed the case.  It's a little better than
pure sequential because there's some parallelism in the read part of the
network transfer, giving ratios greater than 1 (E).  There's plenty of
pipe bandwidth (F).  Yet when we put mbuffer in (C) there is no speedup
AT ALL on OLD, and a nice one (as expected) on NEW.

Everything is as it should be for NEW, but why isn't mbuffer doing its
thing on the OLD machines?

Thanks,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From Wally-Edmondson at utc.edu  Fri Jun  1 08:24:40 2007
From: Wally-Edmondson at utc.edu (Wally Edmondson)
Date: Wed Nov 25 01:06:05 2009
Subject: [Beowulf] IBRIX Experiences
Message-ID: <46603A38.3060006@utc.edu>

On Thu, 10 May 2007, Ian Reynolds wrote:

> Hey all -- we're considering IBRIX for a parallel storage cluster
> solution with an EMC Clarion CX3-20 at the center, as well as a handful
> of storage servers -- total of roughly 40 client servers, mix of 32 and
> 64 bit OSs.
>
> Can anyone offer their experiences with IBRIX, good or bad?
> We have worked with gpfs extensively, so any comparisons would also be
> helpful.

It looks like you aren't getting many answers to your question, Ian.
I'll quickly share my IBRIX experiences.  I have been running IBRIX
since late 2004 on around 540 diskless clients and 50 regular servers
and workstations with 8 segment servers and a Fusion Manager connected
to a DDN S2A 3000 couplet with 20TB of usable storage.  The storage is
1Gb FibreChannel to the Segment Servers and it's non-bonded GigE for
everything else.

I'll start with the bad, I guess.  We had our share of problems with the
1.x version of the software in the early days.  I suppose all parallel
filesystems with 600 clients are going to hit bumps.  That's what CFS
said back then, anyways.  Stability wasn't a problem, but occasionally a
file wouldn't be readable and to fix it you had to copy the file, stuff
like that.  This was no longer an issue beginning with version 2.0.

You have to get a new build of the software if you want to change
kernels.  There are two RPMs, one generic for the major kernel number
and the other specific to your kernel, containing some modules.  They
only support RHEL/CentOS and SLES as far as I know, and SLES was only
recently added.  I asked about Ubuntu and they don't yet support it,
which sucks because I would like to use it on some workstations.

Oh, and make sure that the segment servers can always see each other.
Use at least two links through different switches.  We had some bad
switch ports that caused the segment servers to miss heartbeats.  This
caused automatic failovers to segment servers that also couldn't be
seen.  This is a disaster.  I thought it was IBRIX's fault the whole
time.  Turned out to be intermittent switch port problems.  It was
avoidable with a little bit more planning and a better understanding of
how the whole thing worked.

Redundancy is set up with buddies rather than globally, so you tell it
that one server should watch some other server's back.
It works, but it could be a problem if a failing server's buddy is down
or a server goes down while it owns a failed segment.  In either case,
some percentage of your files won't be accessible until one of the
servers is fixed.  It hasn't happened to me, but it is a possibility.  I
can bring down four of my eight servers without a problem, for instance,
but it needs to be the right four.  Servers have failed and it has never
been a problem for me.  The running jobs never know the difference.

Support has been top-notch.  Last year, we had a catastrophic storage
controller failure following a scheduled power outage, major corruption,
the works.  A guy at IBRIX stayed with me all weekend on the phone and
AIM.  He logged in and remotely restored all the files he could (tens of
thousands).  Apparently he could have restored more if I had already
been running 2.0 or higher.  They know their product very well.

I'm not sure if I am the right person to compare it to GPFS or Lustre,
since I looked into those products back in 2004 and haven't really
researched them since.  My setup is simple, too, so I only use the
basics.  The performance is fine, using nearly all of my GigE pipes.
With more segment servers and faster storage you could get some pretty
amazing speeds.  I don't use the quotas or multiple interfaces.  Their
GUI looks nice at first but you really don't need it because their
command-line tools make sense and have excellent help output if you
forget something.  Adding new clients is a breeze.  There is a Windows
client now but I haven't used it.  I use CIFS exports and it works just
fine.  I also use NFS exports for my few remaining Solaris clients.
Everything is very customizable and the documentation seems pretty
thorough.  You can put any storage you like behind it, which is nice.  I
think I could use USB keys if I felt like it.

I have been very pleased with IBRIX overall, especially since we
upgraded out of 1.x land.
It's usually the last thing on my mind, so I guess that's a good thing.
That's all I have time for right now.  Let me know if you have any
specific questions.

Wally

From ruhollah.mb at gmail.com  Fri Jun  1 13:41:05 2007
From: ruhollah.mb at gmail.com (Ruhollah Moussavi Baygi)
Date: Wed Nov 25 01:06:05 2009
Subject: [Beowulf] ssh connection problem
In-Reply-To: 
References: <1bef2ce30705270147k430800b5x303e56410aba640b@mail.gmail.com>
Message-ID: <1bef2ce30706011341g3f93fe9bo3a13121efa00d678@mail.gmail.com>

Hi,

Thank you for your answers.  But please ignore the content of the
'links' I posted; I didn't mean to send you those links.  I had googled
for a solution to our cluster's 'Disconnecting: ...' problem, and
because I couldn't find a proper solution that way, I posted it to
Beowulf; I just copy-pasted the 'Disconnecting: ...' sentence from my
gmail, which is why you can see 'links' in my email.

Returning to our problem, the results of 'netstat -i' and 'netstat -s'
are as follows, respectively.  Please note that:

a) I use cat 6 cable,
b) electrical noise is very unlikely,
c) the head-node has two NICs: eth0 is for the internal zone, i.e. the
   computing nodes, and is running with no problem; eth1 is for the
   external zone, i.e. for our users connecting via ssh.  This one has
   the disconnecting problem.
d) it doesn't seem that there is any switch/router problem, because on
   the same network there is another machine which users connect to via
   ssh with no problem.
___________________________________________________________________

[root@node01 ~]# netstat -i
Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 586745989      0      0      0 598858710      0      0      0 BMRU
eth1   1500   0    701868      0      0      0    325542      0      0      0 BMRU
lo    16436   0      1959      0      0      0      1959      0      0      0 LRU

[root@node01 ~]# netstat -s
Ip:
    585891011 total packets received
    0 forwarded
    0 incoming packets discarded
    585887228 incoming packets delivered
    597668214 requests sent out
Icmp:
    34 ICMP messages received
    21 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 25
        timeout in transit: 5
        echo requests: 4
    601 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 597
        echo replies: 4
Tcp:
    78 active connections openings
    360 passive connection openings
    0 failed connection attempts
    18 connection resets received
    8 connections established
    585798178 segments received
    597666644 segments send out
    16197 segments retransmited
    94 bad segments received.
    1682 resets sent
Udp:
    1005 packets received
    596 packets to unknown port received.
    0 packet receive errors
    1019 packets sent
TcpExt:
    2 resets received for embryonic SYN_RECV sockets
    26 packets pruned from receive queue because of socket buffer overrun
    ArpFilter: 0
    60 TCP sockets finished time wait in fast timer
    1 packets rejects in established connections because of timestamp
    734435 delayed acks sent
    127 delayed acks further delayed because of locked socket
    Quick ack mode was activated 7963 times
    724 packets directly queued to recvmsg prequeue.
    6030 packets directly received from backlog
    164431 packets directly received from prequeue
    571897537 packets header predicted
    138 packets header predicted and directly queued to user
    TCPPureAcks: 44870
    TCPHPAcks: 458279645
    TCPRenoRecovery: 0
    TCPSackRecovery: 2875
    TCPSACKReneging: 0
    TCPFACKReorder: 0
    TCPSACKReorder: 0
    TCPRenoReorder: 0
    TCPTSReorder: 0
    TCPFullUndo: 0
    TCPPartialUndo: 0
    TCPDSACKUndo: 1
    TCPLossUndo: 7099
    TCPLoss: 626
    TCPLostRetransmit: 0
    TCPRenoFailures: 0
    TCPSackFailures: 1635
    TCPLossFailures: 169
    TCPFastRetrans: 4294
    TCPForwardRetrans: 23
    TCPSlowStartRetrans: 1130
    TCPTimeouts: 8329
    TCPRenoRecoveryFail: 0
    TCPSackRecoveryFail: 279
    TCPSchedulerFailed: 0
    TCPRcvCollapsed: 2731
    TCPDSACKOldSent: 8194
    TCPDSACKOfoSent: 0
    TCPDSACKRecv: 7125
    TCPDSACKOfoRecv: 0
    TCPAbortOnSyn: 0
    TCPAbortOnData: 28
    TCPAbortOnClose: 8
    TCPAbortOnMemory: 0
    TCPAbortOnTimeout: 12
    TCPAbortOnLinger: 0
    TCPAbortFailed: 0
    TCPMemoryPressures: 0
___________________________________________________________________

-- 
Best,
Ruhollah Moussavi Baygi

On 5/29/07, Robert G. Brown wrote:
>
> On Sun, 27 May 2007, Ruhollah Moussavi Baygi wrote:
>
> > Hi everybody at Beowulf,
> >
> > I have a serious problem with ssh connection to our cluster.  Every
> > hint/help/suggestion, which can help me to solve it, is highly
> > appreciated.
> >
> > Most of the time, when users want to connect and run their programs
> > from their own PCs, the ssh connection failed, especially during
> > transfer of files from/to the head-node.  Our users' PCs are mainly
> > WindowsXP, so they use packages like SSH Secure Shell for connection
> > and file transfer, or Putty for connection and WinSCP for file
> > transfer.
> >
> > The error message is as follows:
> >
> > 'Disconnecting: Corrupted MAC on input'
>
> This sounds to me like hardware problems.  What does your physical
> network look like?  Is it built with the right cables, within spec, with
> decent switches?  Do you see other evidence of network packet
> corruption?
> > < http://www.google.com/history/url?url=http://ubuntuforums.org/showthread.php%3Ft%3D202076&ei=wkJZRsGfHZf-0gTehKXrDQ&sig2=lIzQGYq3zN0Tz2EC8b4dAw&zx=JGkABbsjtaA&ct=w >
> >
> > or
> >
> > 'Disconnecting: bad packet
>
> Yes, sounds like bad hardware.  Perhaps your cables aren't cat 5?
> Perhaps your electrical power has noise?  Perhaps your switch(es) are
> broken or have been taken over by trolls?  This sounds like you're
> failing packet checksum tests or experiencing pretty serious TCP
> collision problems.
>
> What do the network statistics look like on the interfaces in question?
>
>    rgb
>
> > length...< http://www.google.com/search?q=disconnecting:+bad+packet+length+from+windows+to+linux+machine&hl=en >',
> > followed by a long integer.
> >
> > This problem has practically made our cluster unusable.  So, I would
> > be thankful for any coming advice.
>
> --
> Robert G. Brown   http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525  email: rgb@phy.duke.edu

-- 
Best,
Ruhollah Moussavi Baygi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20070602/11a5e345/attachment.html

From ctierney at hypermall.net  Sat Jun  2 21:11:10 2007
From: ctierney at hypermall.net (Craig Tierney)
Date: Wed Nov 25 01:06:05 2009
Subject: [Beowulf] tftp permission denied
In-Reply-To: 
References: 
Message-ID: <46623F5E.4010201@hypermall.net>

fahad saeed wrote:
> Hello All,
>
> I am trying to install Fedora Core 6 using the network (since I only
> have 1 CD-ROM, installed on the head node, and no CD-ROM/floppy drive
> on the slave node...), so
>
> I used this how-to to configure my tftp server and all seems to go well...
>
> http://www.opensourcehowto.org/how-t...a-install.html
>
> Now the problem is that when I boot my slave node and 'command' it to
> boot from the network (using Intel Boot Agent 1.1.07) I get this error:
>
> PXE-T00 permission denied
> PXE-E36 error received from tftp server
>
> although the slave node does recognise the master node and its IP etc.
>
> Any help would be highly appreciated, as I have no idea what to do next...
>
> Thanks in advance... and please help!!
>
> Fahad

Have you tried to copy the file via tftp from the server node:

# tftp localhost
# get "blah"

See if that works.

I was seeing something similar to this on RHEL5 this past week.  I
haven't got an answer yet, but it seemed that I could only transfer
files that ended in .bin.  I wonder if it is a security or selinux
issue, but I haven't tracked it down yet.

Craig

> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From john.hearns at streamline-computing.com  Sun Jun  3 01:39:54 2007
From: john.hearns at streamline-computing.com (John Hearns)
Date: Wed Nov 25 01:06:05 2009
Subject: [Beowulf] tftp permission denied
In-Reply-To: <46623F5E.4010201@hypermall.net>
References: <46623F5E.4010201@hypermall.net>
Message-ID: <46627E5A.7070801@streamline-computing.com>

Craig Tierney wrote:
> fahad saeed wrote:
>> Now the problem is that when I boot my slave node and 'command' it to
>> boot from the network (using Intel Boot Agent 1.1.07) I get this
>> error:
>>
>> PXE-T00 permission denied
>> PXE-E36 error received from tftp server

> # tftp localhost
> # get "blah"

I'll add another debugging tip to that one: stop the tftp daemon
service, then start it on the command line (as root) with the following
flags:

-l -vv -s /path/to/your/tftpdirectory

Then try Craig's tip, i.e. can you transfer a file by hand from
'localhost'?  Then reboot a
compute node and follow the tftp request > See if that works. > > I was seeing something similar to this on RHEL5 this past week. > I haven't got an answer yet, but it seemed that I could only > transfer files that ended in .bin. I wonder if it is a security > or selinux issue, but I haven't tracked it down yet. > > Craig From hahn at mcmaster.ca Sun Jun 3 11:25:13 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:06:05 2009 Subject: [Beowulf] tftp permission denied In-Reply-To: <46627E5A.7070801@streamline-computing.com> References: <46623F5E.4010201@hypermall.net> <46627E5A.7070801@streamline-computing.com> Message-ID: > stop the tftp daemon service, then start it on the command line (as root) > with the following flags: > > -l -vv -s /path/to/your/tftpdirectory yes, definitely. this sort of problem calls for debugging on the server side - verbose server settings is probably enough, but I wouldn't shy away from running the server under strace to see what it's really doing... From Bogdan.Costescu at iwr.uni-heidelberg.de Mon Jun 4 05:07:36 2007 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] network transfer issue to disk, old versus new hardware In-Reply-To: References: Message-ID: On Sat, 2 Jun 2007, David Mathog wrote: > I can't quite wrap my head around a recent nettee result, perhaps > one of the network gurus here can explain it. IMHO, it's not a network issue, as is shown by your G results. > sync; accudate; dd if=/dev/zero bs=512 count=1000000 of=test.dat; All your tests use bs=512 - why ? This makes unnecessary trips to kernel code and back which result in an increased number of context switches and significant slowdown. My guess is that this (high number of context switches) plus a high interrupt rate (disk and network simultaneously) is the reason for your results. 
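The block-size overhead Bogdan describes is easy to demonstrate in isolation.  A minimal sketch (GNU dd assumed; absolute rates vary by machine, but the relative gap is the point):

```shell
# Same 512,000,000 bytes through the kernel at two block sizes: the
# first run makes a million tiny read()/write() pairs, the second only
# 512, so the timing difference is almost pure per-syscall overhead.
dd if=/dev/zero of=/dev/null bs=512 count=1000000
dd if=/dev/zero of=/dev/null bs=1MB count=512
```

On most machines the bs=1MB run reports a much higher MB/s figure, which is why dd throughput tests usually use large block sizes.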
> Old: Athlon MP 2200+, Tyan S2466MPX mobo, 2.6.19.3 kernel, 512 MB RAM

I used to have the exact same hardware as cluster nodes (but with dual
CPUs; whether you also have duals is not clear from your post) and tried
to convert 2 of them to small file-servers - same problem of disk +
network simultaneous activity.  After benchmarking, I gave up - this was
almost 2 years ago and I don't have the exact numbers anymore, but a
single PIV 3GHz on a consumer-grade mainboard was able to provide
significantly better performance for the same task.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De

From mathog at caltech.edu  Mon Jun  4 10:02:13 2007
From: mathog at caltech.edu (David Mathog)
Date: Wed Nov 25 01:06:06 2009
Subject: [Beowulf] network transfer issue to disk, old versus new hardware
Message-ID: 

Bogdan Costescu wrote:
> On Sat, 2 Jun 2007, David Mathog wrote:
>
> > I can't quite wrap my head around a recent nettee result, perhaps
> > one of the network gurus here can explain it.
>
> IMHO, it's not a network issue, as is shown by your G results.
>
> > sync; accudate; dd if=/dev/zero bs=512 count=1000000 of=test.dat;
>
> All your tests use bs=512 - why ?  This makes unnecessary trips to
> kernel code and back which result in an increased number of context
> switches and significant slowdown.

It's a convenient number; it may slow things down slightly, but clearly
it isn't rate limiting, since piping that straight to /dev/null gives
rates of 650 MB/sec or higher.

In any case, I figured the problem out.  The issue was that the distro
(Mandriva 2007.0) installed a while back on the older machines turns on
"athcool".
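A minimal sketch of checking a node for this kind of power-saving setup (the tool and service names are assumptions for a typical 2007-era distro; note that athcool itself is a one-shot utility rather than a daemon):

```shell
# If athcool is installed, `athcool stat` reports whether the
# power-saving (bus disconnect) bit is currently enabled.
command -v athcool >/dev/null 2>&1 && athcool stat

# Otherwise look for the usual frequency-scaling daemons by name
# (an assumption, not an exhaustive list).
for svc in cpufreqd cpuspeed powernowd; do
  if pidof "$svc" >/dev/null 2>&1; then
    echo "$svc is running"
  fi
done
```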
Athcool does cut the idle temperatures of the nodes considerably, but
apparently it also prevents them from performing this sort of transfer
at full speed, whether or not mbuffer is used.  When I turned athcool
off, on just the receiving node, the transfer rate for:

  sender:   dd if=/dev/zero bs=512 count=1000000 | \
              nettee -in - -v 63 -next next_node
  receiver: nettee -out test.dat

jumped from 7.7 MB/sec to 11.6 MB/sec.  So apparently athcool gets in
the way by preventing rapid shifts from disk to network IO, no matter
which process is doing them.  Which is interesting, because it didn't
have any measurable effect on CPU-bound processes.  I had thought it
would shut itself off and get out of the way when the CPU rate was high,
but apparently not.  When imaging nodes athcool isn't running, but I'll
have to keep this in mind when doing routine transfers of data across
the nodes.

On the newer machines cpufreq runs instead of athcool, and it didn't
make very much difference whether that was running or not.  Apparently
this power saver does a much better job of detecting higher CPU load and
"getting out of the way" when it's present.

Thanks,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From mathog at caltech.edu  Mon Jun  4 10:39:52 2007
From: mathog at caltech.edu (David Mathog)
Date: Wed Nov 25 01:06:06 2009
Subject: [Beowulf] network transfer issue to disk, old versus new hardware
Message-ID: 

Bogdan Costescu wrote:
> > Old: Athlon MP 2200+, Tyan S2466MPX mobo, 2.6.19.3 kernel, 512Mb RAM
>
> I used to have the exact same hardware as cluster nodes (but with dual
> CPU, whether you also have duals is not clear from your post)

These are single CPU machines.

> and tried to convert 2 of them to small file-servers - same problem of
> disk + network simultaneous activity.
> After benchmarking, I gave up - this was almost 2 years ago and I
> don't have the exact numbers anymore, but a single PIV 3GHz on a
> consumer-grade mainboard was able to provide significantly better
> performance for the same task.

Was athcool running on these?

I've done some more benchmarking with athcool on/off, and it changed the
write speed for the dd-generated 512MB file from just under 18MB/sec to
31 MB/sec.  Even with that change, there is clearly something else going
on in the network + disk department, since the "expected sequential"
rate only changes from 7.1 to 8.5MB/sec.  The "hdparm -tT" results were
around 520MB/sec cached reads in either case, but the timed buffered
disk reads went from 24MB/sec to 44MB/sec (both with large variances,
but not THAT large.)

Thankfully this is entirely irrelevant to those of you who have long
since retired these older Tyan systems.

Regards,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From atp at piskorski.com  Mon Jun  4 13:38:07 2007
From: atp at piskorski.com (Andrew Piskorski)
Date: Wed Nov 25 01:06:06 2009
Subject: [Beowulf] cheap SMC 8524T gigabit switches, and performance of
In-Reply-To: 
References: 
Message-ID: <20070604203805.GA72069@tehun.pair.com>

FYI, some vendor called Unity Electronics is currently selling a bunch
of 24-port SMC 8524T gigabit switches for c. $120 each on Ebay:

http://search.ebay.com/search/search.dll?satitle=SMC+gigabit&sass=unityelectronics.com
http://unityelectronics.com/product-product_id/3942/m/SMC/p/SMC8524T
http://unityelectronics.com/product-product_id/3941/m/SMC/p/SMC8516T

I haven't actually tried using it yet, but the one I received is part
number 751.7398, and appears to be new in box as advertised.
And that reminded me of the interesting thread from April, below, on
performance testing of some (small) SMC gigabit switches:

http://www.beowulf.org/archive/2007-April/017924.html

On Mon, Apr 02, 2007 at 03:58:06PM -0500, Bruce Allen wrote:
> Subject: Re: [Beowulf] How to Diagnose Cause of Cluster Ethernet Errors?

> Just for kicks have a look at these figures:
> http://www.lsc-group.phys.uwm.edu/beowulf/nemo/design/SMC_8508T_Performance.html

> Here are some more testing results from different edge switches:
> http://www.lsc-group.phys.uwm.edu/beowulf/nemo/design/switching.html

Bruce, it's interesting how your bandwidth tests show the SMC 8508T
721.0154 switch started out with true wire-speed and 9k jumbo frame
performance, 721.8129 was worse, and then 722.8486 was yet worse again.
Compared to the previous part number, each subsequent revision of the
supposedly "same" SMC8508T model degraded performance!  And your tests
were with only 2 of the 8 ports on each switch, so I wonder how much
worse they'd be when using all ports at once.

It's also interesting that all 3 part numbers showed the same
performance for the 2 kb MTU.  The iterative cheapening of the hardware
seems to have only broken the large frame sizes.

However, I'm confused by part of your results: some of your crossover
cable and 5-port switch results show a big bandwidth advantage when
using jumbo frames - bandwidth takes a huge jump up from around 125 MB/s
with a 2k MTU to 225 with 4k.  But your 8508T results, on the other
hand, are much better at 2k, around 200 MB/s, and then gradually move up
to about the same 225 at 4k.  Any idea why you saw those different
behaviors?
-- 
Andrew Piskorski
http://www.piskorski.com/

From Bogdan.Costescu at iwr.uni-heidelberg.de  Tue Jun  5 07:03:38 2007
From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Wed Nov 25 01:06:06 2009
Subject: [Beowulf] network transfer issue to disk, old versus new hardware
In-Reply-To: 
References: 
Message-ID: 

On Mon, 4 Jun 2007, David Mathog wrote:

> Athcool does cut the idle temperatures of the nodes considerably,
> but apparently also prevents them from performing this sort of
> transfer at full speed, whether or not buffer is used.

Well, near the top of the athcool website there is a warning, and one of
the listed items is 'a slowdown in harddisk performance' - so nothing
new here ;-)

> Which is interesting because it didn't have any measurable effect on
> CPU bound processes.  I had thought it would shut itself off and get
> out of the way when the CPU rate was high, but apparently not.

CPU-bound and I/O-bound processes use the processor in different ways...
When doing only I/O, the processor is often waiting for the hardware, so
the load on the processor is low.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De

From Bogdan.Costescu at iwr.uni-heidelberg.de  Tue Jun  5 07:24:49 2007
From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Wed Nov 25 01:06:06 2009
Subject: [Beowulf] network transfer issue to disk, old versus new hardware
In-Reply-To: 
References: 
Message-ID: 

On Mon, 4 Jun 2007, David Mathog wrote:

> Was athcool running on these?

No.  Given that our cluster nodes are in use most of the time, it makes
no sense to think much about idling...
And when it's known that some cluster nodes will not be used for some
time (like one day or more), I prefer to just turn them off - most of
them are built from consumer-grade components, so this brings them
closer to their typical life-cycle. ;-)

> Thankfully this is entirely irrelevant to those of you who have long
> since retired these older Tyan systems.

... and if you didn't have any of those to begin with, then consider
yourself lucky :-)

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De

From hahn at mcmaster.ca  Tue Jun  5 09:40:22 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed Nov 25 01:06:06 2009
Subject: [Beowulf] network transfer issue to disk, old versus new hardware
In-Reply-To: 
References: 
Message-ID: 

>> Athcool does cut the idle temperatures of the nodes considerably, but
>> apparently also prevents them from performing this sort of transfer
>> at full speed, whether or not buffer is used.
>
> Well, near the top of the athcool website there is a warning and one
> of the listed items is 'a slowdown in harddisk performance' - so
> nothing new here ;-)

athcool works by putting the cpu-northbridge interface into a low-power
mode.  the difficulties people had with it were that this sort of
down-clocking was new at the time, and not well-handled by all chips,
probably on both the chipset and cpu sides.  errata centered on how long
it took to stabilize the PLLs involved.

things are quite different nowadays - AMD put the northbridge entirely
on-cpu, so it has full control, and can modulate clocks extensively and
differentially.  I don't know how common (or effective) it is to
modulate HT power, but such features show up prominently in recent HT
revs.  it's interesting to speculate about Intel - mostly it solved this
by dominating the chipset market for its own CPUs.
I'm guessing Intel will fall somewhat behind AMD in system-wide power
savings, at least until CSI.  even then, I'm a little unclear how good
Intel's initial implementation will be - the fact that they've chosen to
not simply adopt HT indicates to me that Intel will be re-learning AMD's
lessons.

>> Which is interesting because it didn't have any measurable effect on
>> CPU bound processes.  I had thought it would shut itself off and get
>> out of the

I'd expect athcool to not affect a cache-friendly cpu-bound process, but
to hurt pretty badly if you have cache misses.  networking (using the
normal network stack) counts as memory-bound, I think, rather than the
kinds of IO which might be more DMA-intensive.  that is, if a disk is
streaming many MB into memory, the CPU's northbridge interface should be
able to go low-power (though most disk transfers are only in the 64K
range...)

regards, mark hahn.

From naveed at caltech.edu  Mon Jun  4 15:08:07 2007
From: naveed at caltech.edu (Naveed Near-Ansari)
Date: Wed Nov 25 01:06:06 2009
Subject: [Beowulf] Re: IBRIX Experiences (Wally Edmondson)
In-Reply-To: <200706030347.l533l4Mo009549@bluewest.scyld.com>
References: <200706030347.l533l4Mo009549@bluewest.scyld.com>
Message-ID: <1180994887.5878.21.camel@aeolis.gps.caltech.edu>

On Sat, 2007-06-02 at 20:47 -0700, beowulf-request@beowulf.org wrote:
> Date: Fri, 01 Jun 2007 11:24:40 -0400
> From: Wally Edmondson
> Subject: Re: [Beowulf] IBRIX Experiences
> To: beowulf@beowulf.org
> Message-ID: <46603A38.3060006@utc.edu>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On Thu, 10 May 2007, Ian Reynolds wrote:
>
> > Hey all -- we're considering IBRIX for a parallel storage cluster
> > solution with an EMC Clarion CX3-20 at the center, as well as a
> > handful of storage servers -- total of roughly 40 client servers,
> > mix of 32 and 64 bit OSs.
> >
> > Can anyone offer their experiences with IBRIX, good or bad?
We have > > worked with gpfs extensively, so any comparisons would also be helpful. > > It looks like you aren't getting many answers to your question, Ian. I'll quickly share > my IBRIX experiences. I have been running IBRIX since late 2004 on around 540 > diskless clients and 50 regular servers and workstations with 8 segment servers and a > Fusion Manager connected to a DDN S2A 3000 couplet with 20TB of usable storage. The > storage is 1Gb FibreChannel to the Segment Servers and it's non-bonded GigE for > everything else. > > I'll start with the bad, I guess. We had our share of problems with the 1.x version > of the software in the early days. I suppose all parallel filesystems with 600 > clients are going to hit bumps. That's what CFS said back then, anyways. Stability > wasn't a problem, but occasionally a file wouldn't be readable and to fix it you had > to copy the file, stuff like that. This was no longer an issue beginning with > version 2.0. You have to get a new build of the software if you want to change > kernels. There are two RPMs, one generic for the major kernel number and the other > specific to your kernel containing some modules. They only support RHEL/CENTOS and > SLES as far as I know, and SLES was only recently added. I asked about Ubuntu and > they don't yet support it, which sucks because I would like to use it on some > workstations. Oh, and make sure that the segment servers can always see each other. > Use at least two links through different switches. We had some bad switch ports > that caused the segment servers to miss heartbeats. This caused automatic failovers > to segment servers that also couldn't be seen. This is a disaster. I thought it was > IBRIX's fault the whole time. Turned out to be intermittent switch port problems. > It was avoidable with a little bit more planning and a better understanding of how > the whole thing worked.
Redundancy is set up with buddies rather than globally, so > you tell it that one server should watch some other server's back. It works, but it > could be a problem if a failing server's buddy is down or a server goes down while it > owns a failed segment. In either case, some percentage of your files won't be > accessible until one of the servers is fixed. It hasn't happened to me, but it is a > possibility. I can bring down four of my eight servers without a problem, for > instance, but it needs to be the right four. Servers have failed and it has never > been a problem for me. The running jobs never know the difference. > > Support has been top-notch. Last year, we had a catastrophic storage controller > failure following a scheduled power outage, major corruption, the works. A guy at > IBRIX stayed with me all weekend on the phone and AIM. He logged in and remotely > restored all the files he could (tens of thousands). Apparently he could have > restored more if I had already been running 2.0 or higher. They know their product > very well. I'm not sure if I am the right person to compare it to GPFS or Lustre > since I looked into those products back in 2004 and haven't really researched them > since. My setup is simple, too, so I only use the basics. The performance is fine, > using nearly all of my GigE pipes. With more segment servers and faster storage you > could get some pretty amazing speeds. I don't use the quotas or multiple interfaces. > Their GUI looks nice at first but you really don't need it because their > command-line tools make sense and have excellent help output if you forget something. > Adding new clients is a breeze. There is a Windows client now but I haven't used > it. I use CIFS exports and it works just fine. I also use NFS exports for my few > remaining Solaris clients. Everything is very customizable and the documentation > seems pretty thorough. You can put any storage you like behind it, which is nice. 
I > think I could use USB keys if I felt like it. I have been very pleased with IBRIX > overall, especially since we upgraded out of 1.x land. It's usually the last thing > on my mind, so I guess that's a good thing. That's all I have time for right now. > Let me know if you have any specific questions. > > Wally > I would agree with some of this. The support is indeed top notch, but our switch to 2.x wasn't as smooth. we have had some problems with files not writing and some performance issues. this is being used on 520 nodes. For us, a lot of our (recent) problems have been related to ibrix. Ibrix has been very good about helping fix things. I have had the same experience with ibrix being there when i needed them. when i have a problem, they work on it until fixed regardless of whether it is nighttime or weekends. At this point, i think we are stable and you probably would not have the same issues on a new system. -- Naveed Near-Ansari California Institute of Technology Division of Geology and Planetary Science From vaughanc at gmail.com Tue Jun 5 06:51:32 2007 From: vaughanc at gmail.com (Chris Vaughan) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] tftp permission denied In-Reply-To: References: <46623F5E.4010201@hypermall.net> <46627E5A.7070801@streamline-computing.com> Message-ID: <216ee070706050651s4ba40c10o165f11c8f08be0b4@mail.gmail.com> Hi, What do your default.cfg and dhcp.conf files look like? I remember having this issue before and I fixed it in one of those two files. On 6/3/07, Mark Hahn wrote: > > stop the tftp daemon service, then start it on the command line (as root) > > with the following flags: > > > > -l -vv -s /path/to/your/tftpdirectory > > yes, definitely. this sort of problem calls for debugging on the server side > - verbose server settings is probably enough, but I wouldn't shy away from > running the server under strace to see what it's really doing...
> _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- ------------------------------ Christopher Vaughan From aohara at haverford.edu Tue Jun 5 11:47:28 2007 From: aohara at haverford.edu (aohara@haverford.edu) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] xhpl and HPL.dat directory Message-ID: <41822.165.82.168.219.1181069248.squirrel@165.82.168.219> Hi, I'm working on benchmarking a recently installed cluster at Haverford College and we've been using the hpl benchmark. Currently, I've been testing the performance of each individual node blade in an attempt to look at bottlenecking in accessing the memory. Since we have several identical nodes, it would be nice to have a different set of parameters running on each node. However, xhpl (installed in my home directory under my account) will only look for the HPL.dat file in the top directory (i.e. /n/home/aohara) and not in the same directory as a copy of the xhpl (for example I put a submission script, xhpl, and HPL.dat in the folder /n/home/aohara/newrun, but it runs the parameters of the file /n/home/aohara/HPL.dat instead). If anybody knows of a way to give a directive about the location of HPL.dat to xhpl, that would be extremely helpful.
Thank you very much, Andrew O'Hara '09 Haverford College Physics Department From lindahl at pbm.com Wed Jun 6 10:06:13 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] xhpl and HPL.dat directory In-Reply-To: <41822.165.82.168.219.1181069248.squirrel@165.82.168.219> References: <41822.165.82.168.219.1181069248.squirrel@165.82.168.219> Message-ID: <20070606170613.GA9230@bx9.net> On Tue, Jun 05, 2007 at 02:47:28PM -0400, aohara@haverford.edu wrote: > However, xhpl (installed in my home directory under > my account) will only look for the HPL.dat file in the top directory Use the Source, Luke. -- greg From mitch48 at sbcglobal.net Wed Jun 6 14:54:58 2007 From: mitch48 at sbcglobal.net (Tom Mitchell) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] tftp permission denied In-Reply-To: References: <46623F5E.4010201@hypermall.net> <46627E5A.7070801@streamline-computing.com> Message-ID: <20070606215458.GB11062@xtl1.xtl.tenegg.com> On Sun, Jun 03, 2007 at 02:25:13PM -0400, Mark Hahn wrote: > Date: Sun, 3 Jun 2007 14:25:13 -0400 (EDT) > From: Mark Hahn > To: Beowulf Mailing List > Subject: Re: [Beowulf] tftp permission denied > > >stop the tftp daemon service, then start it on the command line (as root) > >with the following flags: > > > >-l -vv -s /path/to/your/tftpdirectory > > yes, definitely. this sort of problem calls for debugging on the server > side > - verbose server settings is probably enough, but I wouldn't shy away from > running the server under strace to see what it's really doing... All of the previous advice applies, plus: Check /etc/xinetd.d/tftp, /etc/hosts.allow, /etc/hosts.deny Then check the ipfilter and security settings. TFTP is at a different port than FTP. If you are using the GUI Security Level Configuration tool you will have to enable TFTP under "Other ports". If ip filtering is blocking packets into the server 'verbose' flags will have nothing to be verbose about.
The quick test is to disable filtering and test.

ftp     21/tcp
ftp     21/udp          fsp fspd
tftp    69/tcp
tftp    69/udp
sftp    115/tcp
sftp    115/udp

Both ftp and tftp get used by bad boys out on the Internet so watch the ownership, permissions, settings and logs. Most system admins will want to restrict TFTP access to your local hosts/networks. For the network programmers interested in historic bugs out there give this a quick read. http://en.wikipedia.org/wiki/Sorcerer's_Apprentice_Syndrome Later, mitch -- T o m M i t c h e l l Found me a new place to hang my hat :-) Now it got bought. From jlb17 at duke.edu Thu Jun 7 11:50:39 2007 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] Intel ESB2/82563EB NICs and RHEL/CentOS Message-ID: I have 6 dual Xeon 5160 compute nodes with Supermicro X7DVL-E motherboards. These boards have onboard Intel 82563EB NICs (PCI ID 8086:1096) and the systems are all running CentOS 4. When I first installed them, I was running CentOS 4.4 (kernel 2.6.9-42.0.x), which included version e1000-7.0.39. The network interfaces were very unreliable -- they would randomly stop and then re-start passing traffic. I downloaded version e1000-7.3.20 from intel.com, and they worked just fine. With the release of CentOS 4.5 (kernel 2.6.9-55) and its inclusion of e1000-7.2.7, I decided to give the stock driver a try again, but it had the same issues. Again, upgrading to a more recent version from intel.com (e1000-7.5.5 in this case) fixed the problem. I'm planning on moving these systems to CentOS 5 shortly (kernel 2.6.18-8.x), but it too includes e1000-7.2.7. Has anybody else seen this issue? I'm wondering whether it is motherboard specific or if it's an issue with the NIC itself. Thanks!
-- Joshua Baker-LePain Department of Biomedical Engineering Duke University From hahn at mcmaster.ca Fri Jun 8 09:11:10 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces Message-ID: I had a user grumble about how it was not trivial to get a basic backtrace on our clusters. his jobs tend to be 32-128p, and run for a week, so it's not ideal to run them under the debugger. turns out to be fairly simple to produce a backtrace.so which can be LD_PRELOAD'ed - it contains a constructor which registers a signal handler, which obtains the backtrace and translates and prints the corresponding file:func:line. does this sound like something of interest to other HPC sites? regards, mark hahn. From toon.knapen at fft.be Sun Jun 10 22:32:35 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: References: Message-ID: <466CDE73.7020901@fft.be> Interesting indeed. On which platform is this backtrace.so available (obtaining backtraces is highly platform dependent AFAIK) ? toon Mark Hahn wrote: > I had a user grumble about how it was not trivial to get a basic > backtrace on our clusters. his jobs tend to be 32-128p, > and run for a week, so it's not ideal to run them under the debugger. > > turns out to be fairly simple to produce a backtrace.so which can > be LD_PRELOAD'ed - it contains a constructor which registers a signal > handler, which obtains the backtrace and translates and prints the > corresponding file:func:line. > > does this sound like something of interest to other HPC sites? > > regards, mark hahn.
> _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From bencer at cauterized.net Mon Jun 11 04:33:13 2007 From: bencer at cauterized.net (Jorge Salamero Sanz) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] MPI performance gain with jumbo frames Message-ID: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> hi all, new to this list, so don't know if this is offtopic. i'd like to know experiences about MPI performance gain with jumbo frames. i manage a beowulf cluster (42 athlon xp, gentoo linux) with gigabit ethernet where fluent, openfoam and other mpi apps are run. with NFS i'm sure which kind of gain i would have, but with MPI apps i'm worried about after seeing this page http://www.scl.ameslab.gov/Projects/IBMCluster/Benchmarks.html regards From ctierney at hypermall.net Mon Jun 11 08:27:46 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466CDE73.7020901@fft.be> References: <466CDE73.7020901@fft.be> Message-ID: <466D69F2.60005@hypermall.net> Toon Knapen wrote: > Interesting indeed. On which platform is this backtrace.so available > (obtaining backtraces is highly platform dependent AFAIK) ? > The Intel Compiler provides backtraces. I think (from memory) that you compile with -g --traceback. Craig > toon > > Mark Hahn wrote: >> I had a user grumble about how it was not trivial to get a basic >> backtrace on our clusters. his jobs tend to be 32-128p, >> and run for a week, so it's not ideal to run them under the debugger. >> >> turns out to be fairly simple to produce a backtrace.so which can >> be LD_PRELOAD'ed - it contains a constructor which registers a signal >> handler, which obtains the backtrace and translates and prints the >> corresponding file:func:line.
>> does this sound like something of interest to other HPC sites? >> >> regards, mark hahn. >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From deadline at eadline.org Mon Jun 11 08:43:02 2007 From: deadline at eadline.org (Douglas Eadline) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> Message-ID: <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> 1) The results you reference are rather old. Does this reflect your hardware? 2) To support Jumbo Frames you need both NICs and a switch that support them. 3) It is possible to achieve wire speed from GigE, you need something other than 32 bit PCI connections, however. (PCIe, PCI-X) 4) While Jumbo Frames can help NFS, the effect on MPI can vary by application. Have you run any tests to see exactly what your network performance is? (i.e. Netpipe) You may find these articles helpful: http://www.clustermonkey.net//content/view/38/34/ http://www.clustermonkey.net//content/view/39/34/ -- Doug > hi all, > > new to this list, so don't know if this is offtopic. > > i'd like to know experiences about MPI performance gain with jumbo frames. > i > manage a beowulf cluster (42 athlon xp, gentoo linux) with gigabit > ethernet > where fluent, openfoam and other mpi apps are run.
> > with NFS i'm sure which kind of gain i would have, but with MPI apps i'm > worried about after seeing this page > http://www.scl.ameslab.gov/Projects/IBMCluster/Benchmarks.html > > regards > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From laytonjb at charter.net Mon Jun 11 08:57:09 2007 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> Message-ID: <466D70D5.5050701@charter.net> Doug brings up some good points. If you want to try Jumbo Frames to improve MPI performance you might have to tweak the TCP buffers as well. There are some links around the web on this. Sometimes it helps performance, sometimes it doesn't. Your mileage may vary. Jeff > 1) The results you reference are rather old. Does this > reflect your hardware? > > 2) To support Jumbo Frames you need both NICs and a switch > that support them. > > 3) It is possible to achieve wire speed from > GigE, you need something other than 32 bit PCI > connections, however. (PCIe, PCI-X) > > 4) While Jumbo Frames can help NFS, the effect on MPI > can vary by application. Have you run any tests to > see exactly what your network performance is? > (i.e. Netpipe) > > You may find these articles helpful: > > http://www.clustermonkey.net//content/view/38/34/ > > http://www.clustermonkey.net//content/view/39/34/ > > -- > Doug > > > >> hi all, >> >> new to this list, so don't know if this is offtopic. >> >> i'd like to know experiences about MPI performance gain with jumbo frames.
>> i >> manage a beowulf cluster (42 athlon xp, gentoo linux) with gigabit >> ethernet >> where fluent, openfoam and other mpi apps are run. >> >> with NFS i'm sure which kind of gain i would have, but with MPI apps i'm >> worried about after seeing this page >> http://www.scl.ameslab.gov/Projects/IBMCluster/Benchmarks.html >> >> regards >> >> >> >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > > -- > Doug > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From toon.knapen at fft.be Mon Jun 11 12:12:46 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466D69F2.60005@hypermall.net> References: <466CDE73.7020901@fft.be> <466D69F2.60005@hypermall.net> Message-ID: <466D9EAE.8010105@fft.be> > > The Intel Compiler provides backtraces. I think (from memory) that > you compile with -g --traceback. > Thanks. I had no idea. However from the man page at http://www.intel.com/software/products/compilers/docs/clin/icc_txt.htm I read: -[no]traceback Tell the compiler to generate [not generate] extra information in the object file to allow the display of source file traceback information at run time when a severe error occurs. This is intended for use with C code that is to be linked into a Fortran program. I do not understand the last sentence though.
I do not see how this can be specific to C code linked into a Fortran program (and thus linked against the fortran runtime library) t From toon.knapen at fft.be Mon Jun 11 12:15:37 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> Message-ID: <466D9F59.7070901@fft.be> Ashley Pittman wrote: > It's highly dependent to implement but I should imagine most people who > need backtraces use a debugger, the libc backtrace() function or > libbacktrace which can be used from either inside or outside the target > process, these tend to be platform independent. > libbacktrace is AFAICT also gcc specific. Or do you have any pointers to some more platform-info on libbacktrace ? thanks, t From lindahl at pbm.com Mon Jun 11 12:35:26 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466D9F59.7070901@fft.be> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> Message-ID: <20070611193526.GE6911@bx9.net> On Mon, Jun 11, 2007 at 09:15:37PM +0200, Toon Knapen wrote: > libbacktrace is AFAICT also gcc specific. That would be hard, given that the PathScale and Intel compilers are extremely gcc-compatible. By the way, some MPIs already offer backtraces: OpenMPI, PathScale MPI, perhaps others.
-- greg From hahn at mcmaster.ca Mon Jun 11 12:43:12 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466D9F59.7070901@fft.be> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> Message-ID: >> It's highly dependent to implement but I should imagine most people who >> need backtraces use a debugger, suppose your program is running on a hundred nodes for a week before you hit the event you want the backtrace for... yes, debugger+coredump can be used, but for obvious reasons, we normally recommend users _not_ have them enabled. >> the libc backtrace() function or >> libbacktrace which can be used from either inside or outside the target >> process, these tend to be platform independent. I started with the libc backtrace function, but wanted something better than its backtrace_symbols() companion. > libbacktrace is AFAICT also gcc specific. Or do you have any pointers to some more > platform-info on libbacktrace ? I believe it's binutils/libc-specific, not compiler-specific. at least "pathcc -O3 -fno-inline-functions -g" gave me a meaningful backtrace on an mpi tester. anyway, appended is my current version of backtrace.c - I think it's interesting and potentially useful, especially considering that it's not really complex:

/* print a backtrace.
   written by Mark Hahn, SHARCnet, 2007.

   gcc -fPIC backtrace.c /usr/lib64/libbfd-2.15.92.0.2.so -shared -o backtrace.so

   using -lbfd chokes on a symbol addressing issue with (static) libbfd.a
   on my system. your libbfd version number may differ.

   LD_PRELOAD=./backtrace.so ./tester
   signal(11)
   Obtained 9 stack frames.
   file: /home/hahn/private/tester.c, line: 10, func dosegv
   file: /home/hahn/private/tester.c, line: 14, func bar
   file: /home/hahn/private/tester.c, line: 17, func foo
   file: /home/hahn/private/tester.c, line: 29, func main

   all symbols (globals and functions) are static to avoid contamination.

   you need -g on the target program, and potentially something like
   -fno-inline-functions to dissuade the compiler from disappearing some
   functions.
*/

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <execinfo.h>
#include <bfd.h>

#define MAX_FRAMES (20)

/* globals retained across calls to resolve. */
static bfd* abfd = 0;
static asymbol **syms = 0;
static asection *text = 0;

static void resolve(char *address) {
    if (!abfd) {
        char ename[1024];
        int l = readlink("/proc/self/exe",ename,sizeof(ename));
        if (l == -1) {
            perror("failed to find executable\n");
            return;
        }
        ename[l] = 0;

        bfd_init();

        abfd = bfd_openr(ename, 0);
        if (!abfd) {
            perror("bfd_openr failed: ");
            return;
        }
        /* oddly, this is required for it to work... */
        bfd_check_format(abfd,bfd_object);

        unsigned storage_needed = bfd_get_symtab_upper_bound(abfd);
        syms = (asymbol **) malloc(storage_needed);
        unsigned cSymbols = bfd_canonicalize_symtab(abfd, syms);

        text = bfd_get_section_by_name(abfd, ".text");
    }
    long offset = ((long)address) - text->vma;
    if (offset > 0) {
        const char *file;
        const char *func;
        unsigned line;
        if (bfd_find_nearest_line(abfd, text, syms, offset, &file, &func, &line) && file)
            printf("file: %s, line: %u, func %s\n",file,line,func);
    }
}

static void print_trace() {
    void *array[MAX_FRAMES];
    size_t size;
    size_t i;
    void *approx_text_end = (void*) ((128+100) * 2<<20);

    size = backtrace (array, MAX_FRAMES);
    printf ("Obtained %zd stack frames.\n", size);
    for (i = 0; i < size; i++) {
        if (array[i] < approx_text_end) {
            resolve(array[i]);
        }
    }
}

static void handler(int sig) {
    printf("signal(%d)\n",sig);
    print_trace();
    _exit(1);
}

static void __attribute__((constructor)) init() {
    static struct sigaction sa;
    sa.sa_handler = handler;
    sigaction(SIGABRT, &sa, 0);
    sigaction(SIGFPE, &sa, 0);
    sigaction(SIGSEGV, &sa, 0);
}

From ctierney at hypermall.net Mon Jun 11 13:58:59 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> Message-ID: <466DB793.1040903@hypermall.net> Mark Hahn wrote: >>> It's highly dependent to implement but I should imagine most people who >>> need backtraces use a debugger, > > suppose your program is running on a hundred nodes for a week before you > hit the event you want the backtrace for... > yes, debugger+coredump can be used, but for obvious reasons, > we normally recommend users _not_ have them enabled. Sorry to start a flame war....
Make sure that your code generates the exact same answer with debug/backtrace enabled and disabled, then you add user-level checkpointing so that you can restart where you want. Then you run up until the problem and restart with the last checkpoint. Run for a week without checkpointing? Just begging for trouble. Craig > >>> the libc backtrace() function or >>> libbacktrace which can be use from either inside or outside the target >>> process, these tend to be platform independent. > > I started with the libc backtrace function, but wanted something better > than its backtrace_symbols() companion. > >> libbacktrace is AFAICT also gcc specific. Or do you any pointers to >> some more platform-info on libbacktrace ? > > I believe it's binutils/libc-specific, not compiler-specific. at least > "pathcc -O3 -fno-inline-functions -g" gave me a meaningful backtrace on > an mpi tester. > > anyway, appended is my current version of backtrace.c - I think it's > interesting and potentially useful, especially considering that it's not > really complex: > > /* print a backtrace. > written by Mark Hahn, SHARCnet, 2007. > > gcc -fPIC backtrace.c /usr/lib64/libbfd-2.15.92.0.2.so -shared -o > backtrace.so > > using -lbfd chokes on a symbol addressing issue with (static) libbfd.a > on my system. your libbfd version number may differ. > > LD_PRELOAD=./backtrace.so ./tester > signal(11) > Obtained 9 stack frames. > file: /home/hahn/private/tester.c, line: 10, func dosegv > file: /home/hahn/private/tester.c, line: 14, func bar > file: /home/hahn/private/tester.c, line: 17, func foo > file: /home/hahn/private/tester.c, line: 29, func main > > all symbols (globals and functions) are static to avoid contamination. > > you need -g on the target program, and potentially something like > -fno-inline-functions to dissuade the compiler from disappearing some > functions. 
> */ > > #define _GNU_SOURCE > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <unistd.h> > #include <signal.h> > #include <execinfo.h> > #include <bfd.h> > > #define MAX_FRAMES (20) > > /* globals retained across calls to resolve. */ > static bfd* abfd = 0; > static asymbol **syms = 0; > static asection *text = 0; > > static void resolve(char *address) { > if (!abfd) { > char ename[1024]; > int l = readlink("/proc/self/exe",ename,sizeof(ename)); > if (l == -1) { > perror("failed to find executable\n"); > return; > } > ename[l] = 0; > > bfd_init(); > > abfd = bfd_openr(ename, 0); > if (!abfd) { > perror("bfd_openr failed: "); > return; > } > /* oddly, this is required for it to work... */ > bfd_check_format(abfd,bfd_object); > > unsigned storage_needed = bfd_get_symtab_upper_bound(abfd); > syms = (asymbol **) malloc(storage_needed); > unsigned cSymbols = bfd_canonicalize_symtab(abfd, syms); > > text = bfd_get_section_by_name(abfd, ".text"); > } > long offset = ((long)address) - text->vma; > if (offset > 0) { > const char *file; > const char *func; > unsigned line; > if (bfd_find_nearest_line(abfd, text, syms, offset, &file, > &func, &line) && file) > printf("file: %s, line: %u, func %s\n",file,line,func); > } > } > > static void print_trace() { > void *array[MAX_FRAMES]; > size_t size; > size_t i; > void *approx_text_end = (void*) ((128+100) * 2<<20); > > size = backtrace (array, MAX_FRAMES); > printf ("Obtained %zd stack frames.\n", size); > for (i = 0; i < size; i++) { > if (array[i] < approx_text_end) { > resolve(array[i]); > } > } > } > > static void handler(int sig) { > printf("signal(%d)\n",sig); > print_trace(); > _exit(1); > } > > static void __attribute__((constructor)) init() { > static struct sigaction sa; > sa.sa_handler = handler; > sigaction(SIGABRT, &sa, 0); > sigaction(SIGFPE, &sa, 0); > sigaction(SIGSEGV, &sa, 0); > } > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit >
http://www.beowulf.org/mailman/listinfo/beowulf > From hahn at mcmaster.ca Mon Jun 11 19:00:02 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466DB793.1040903@hypermall.net> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> Message-ID: > Sorry to start a flame war.... what part do you think was inflamed? > Make sure that your code generates the exact same answer with debug/backtrace > enabled and disabled, part of the point of my very simple backtrace.so is that it has zero runtime overhead and doesn't require any special compilation. > then you add user-level checkpointing so that you can I'm most curious to hear people's experience with checkpointing. all our more serious, established codes do checkpointing, but it's extremely foreign to people writing newish codes. and, of course, it's a lot of extra work. I'm not arguing against checkpointing, just acknowledging that although we _require_ it, we don't actually demand "proof-of-checkpointability". > restart where you want. Then you > run up until the problem and restart with the last checkpoint. restarting from checkpoint is fine (the code in question could actually do it), but still means you have hours of running, presumably under a debugger. > Run for a week without checkpointing? Just begging for trouble. suppose you have 2k users, with ~300 active at any instant, and probably 200 unrelated codes running. while we do require checkpointing (I usually say "every 6-8 cpu hours"), I suspect that many users never do. how do you check/validate/encourage/support checkpointing? 
part of the reason I got a kick out of this simple backtrace.so is indeed that it's quite possible to conceive of a checkpoint.so which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent job of checkpointing at least serial codes non-intrusively. regards, mark hahn. From ctierney at hypermall.net Mon Jun 11 20:54:28 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> Message-ID: <466E18F4.90701@hypermall.net> Mark Hahn wrote: >> Sorry to start a flame war.... > > what part do you think was inflamed? It was when I was trying to say "Real codes have user-level checkpointing implemented and no code should ever run for 7 days." > >> Make sure that your code generates the exact same answer with >> debug/backtrace enabled and disabled, > > part of the point of my very simple backtrace.so is that it has zero > runtime overhead and doesn't require any special compilation. > Does the Intel version have overhead? I never measured it before, but I never thought it was much. >> then you add user-level checkpointing so that you can > > I'm most curious to hear people's experience with checkpointing. > all our more serious, established codes do checkpointing, but it's > extremely foreign to people writing newish codes. > and, of course, it's a lot of extra work. I'm not arguing against > checkpointing, just acknowledging that although we _require_ it, > we don't actually demand "proof-of-checkpointability". > I included checkpointing in an ocean-model once. It was very easy, but that was most likely because of how it was organized (Fortran 77, most data structures were shared). I don't think that it is foreign to people writing new codes. It is foreign to scientists. 
Software developers (who could be scientists) would think of this from the beginning (I hope). >> restart where you want. Then you >> run up until the problem and restart with the last checkpoint. > > restarting from checkpoint is fine (the code in question could > actually do it), but still means you have hours of running, > presumably under a debugger. > >> Run for a week without checkpointing? Just begging for trouble. > > suppose you have 2k users, with ~300 active at any instant, > and probably 200 unrelated codes running. while we do require > checkpointing (I usually say "every 6-8 cpu hours"), I suspect that many > users never do. how do you check/validate/encourage/support > checkpointing? > Set your queue maximums to 6-8 hours. Prevents system hogging, encourages checkpointing for long runs. Make sure your IO system can support the checkpointing because it can create a lot of load. > part of the reason I got a kick out of this simple backtrace.so > is indeed that it's quite possible to conceive of a checkpoint.so > which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent job > of checkpointing at least serial codes non-intrusively. > BTW, I like your code. I had a script written for me in the past (by Greg Lindahl in a galaxy far-far away). The one modification I would make is to print out the MPI ID environment variable (MPI flavors vary in how it is set). Then when it crashes, you know which process actually died.
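Craig's suggested modification could be sketched as a small probe of the launcher environment. The variable names below are assumptions -- each MPI flavor (and launcher) sets its own rank variable, and a given installation may use none of these, so consult your own MPI's documentation.

```shell
# Sketch only: report whichever per-process rank variable this MPI
# flavor happens to set, so a crash report can name the process that
# actually died. The variable names are assumptions, not a standard:
#   OMPI_COMM_WORLD_RANK  - Open MPI
#   PMI_RANK              - MPICH/MVAPICH launched via PMI
#   MV2_COMM_WORLD_RANK   - MVAPICH2
#   SLURM_PROCID          - SLURM's launcher
mpi_rank_from_env() {
    for v in OMPI_COMM_WORLD_RANK PMI_RANK MV2_COMM_WORLD_RANK SLURM_PROCID; do
        val=$(printenv "$v") && { echo "$v=$val"; return 0; }
    done
    echo "(no known rank variable set)"
}
echo "MPI rank: $(mpi_rank_from_env)"
```

A backtrace or signal handler wrapper would emit this line alongside the stack so the dead rank is identifiable.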
Craig From lindahl at pbm.com Mon Jun 11 21:20:11 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466E18F4.90701@hypermall.net> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> Message-ID: <20070612042011.GA759@bx9.net> On Mon, Jun 11, 2007 at 08:54:28PM -0700, Craig Tierney wrote: > I don't think that it is foreign to people writing new codes. > It is foreign to scientists. Most serious supercomputing scientists -- those who have finite cpu allotments in particular -- put in checkpointing when they realize it saves them valuable resources. Until they lose work or money, it's not a priority. > BTW, I like your code. I had a script written for me in the past > (by Greg Lindahl in a galaxy far-far away). Hey, and here I was avoiding saying "You guys don't remember me talking about easy backtrace in conferences in 2000 and 2001? I was pretty insufferable on the topic..." That implementation used gdb and had zero overhead other than the memory gdb took. But fewer processes are always better, and OpenMPI and Intel and PathScale MPI & compilers all use a library implementation somewhat like Mark's. -- greg From gerry.creager at tamu.edu Mon Jun 11 21:55:02 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466E18F4.90701@hypermall.net> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> Message-ID: <466E2726.50407@tamu.edu> I've tried to stay out of this. Really, I have. Craig Tierney wrote: > Mark Hahn wrote: >>> Sorry to start a flame war.... >> >> what part do you think was inflamed?
> > It was when I was trying to say "Real codes have user-level > checkpointing implemented and no code should ever run for 7 > days." A number of my climate simulations will run for 7-10 days to get century-long simulations to complete. I've run geodesy simulations that ran for up to 17 days in the past. I like to think that my codes are real enough! Real codes do have user-level checkpointing, though. And even better codes can be restarted without a lot of user intervention by invoking a run-time flag and going off for coffee. >>> Make sure that your code generates the exact same answer with >>> debug/backtrace enabled and disabled, >> >> part of the point of my very simple backtrace.so is that it has zero >> runtime overhead and doesn't require any special compilation. >> > > Does the Intel version have overhead? I never measured it before, > but I never thought it was much. Can't speak to the Intel compiler, as with their terms of use I've abandoned it and never tried its traceback or checkpointing capabilities. PGI, which I do use, and old IBM Fort-G and Fort-H did have overhead issues. The PGI compiler is what I tend to use almost all the time for my model compiling so I'm not able to speak to much of this new-fangled language stuff you're talking about :-) >>> then you add user-level checkpointing so that you can >> >> I'm most curious to hear people's experience with checkpointing. >> all our more serious, established codes do checkpointing, but it's >> extremely foreign to people writing newish codes. >> and, of course, it's a lot of extra work. I'm not arguing against >> checkpointing, just acknowledging that although we _require_ it, >> we don't actually demand "proof-of-checkpointability". >> > > I included checkpointing in an ocean-model once. It was very easy, > but that was most likely because of how it was organized (Fortran 77, > most data structures were shared). > > I don't think that it is foreign to people writing new codes.
> It is foreign to scientists. Software developers (who could be > scientists) would think of this from the beginning (I hope). Let's see. WRF and MM5 on the atmospheric front, support user-level checkpointing and restart capabilities. So does ADCIRC and Wave Watch-III. And ROMS. So, the oceans side is covered. The older *nix version of PAGES (geodesy) didn't but it was easily added. Most folks didn't use PAGES like I did, and thus checkpointing was pretty useless. I'm not dabbling in genomics or protein folding but most of the folks I know who are, are computer scientists who "followed the money" and are collaborating on projects with discipline scientists, implementing code to support the "real" work. So, I strongly suspect they're implementing checkpointing, too. >>> restart where you want. Then you >>> run up until the problem and restart with the last checkpoint. >> >> restarting from checkpoint is fine (the code in question could >> actually do it), but still means you have hours of running, >> presumably under a debugger. >> >>> Run for a week without checkpointing? Just begging for trouble. >> >> suppose you have 2k users, with ~300 active at any instant, >> and probably 200 unrelated codes running. while we do require >> checkpointing (I usually say "every 6-8 cpu hours"), I suspect that >> many users never do. how do you check/validate/encourage/support >> checkpointing? >> > > Set your queue maximums to 6-8 hours. Prevents system hogging, > encourages checkpointing for long runs. Make sure your IO system > can support the checkpointing because it can create a lot of load. And how do you support my operational requirements with this policy during hurricane season? Let's see... "Stop that ensemble run now so the Monte Carlo chemists can play for awhile, then we'll let you back on. Don't worry about the timeliness of your simulations. No one needs a 35-member ensemble for statistical forecasting, anyway." Did I miss something? Yeah, we really do that. 
With boundary-condition munging we can run a statistical set of simulations and see what the probabilities are and where, for instance, maximum storm surge is likely to go. If we don't get sufficient membership in the ensemble, the statistical strength of the forecasting procedure decreases. Gerry >> part of the reason I got a kick out of this simple backtrace.so >> is indeed that it's quite possible to conceive of a checkpoint.so >> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent >> job of checkpointing at least serial codes non-intrusively. >> > > BTW, I like your code. I had a script written for me in the past > (by Greg Lindahl in a galaxy far-far away). The one modification > I would make is to print out the MPI ID evnironment variable (MPI > flavors vary how it is set). Then when it crashes, you know which > process actually died. > > Craig > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From lindahl at pbm.com Mon Jun 11 22:49:34 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466E2726.50407@tamu.edu> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> <466E2726.50407@tamu.edu> Message-ID: <20070612054934.GA8063@bx9.net> On Mon, Jun 11, 2007 at 11:55:02PM -0500, Gerry Creager wrote: > And how do you support my operational requirements with this policy > during hurricane season? 
By not over-generalizing from a general policy to a place where it doesn't apply? Craig has worked in weather forecasting, you know. You don't run your ensemble elements as separate jobs? Isn't that asking for disaster if something goes wrong? -- greg From Hakon.Bugge at scali.com Tue Jun 12 01:00:23 2007 From: Hakon.Bugge at scali.com (=?iso-8859-1?Q?H=E5kon?= Bugge) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] Re: Beowulf Digest, Vol 40, Issue 9 In-Reply-To: <200706081900.l58J06gq014487@bluewest.scyld.com> References: <200706081900.l58J06gq014487@bluewest.scyld.com> Message-ID: <20070612080145.A55E235AA28@mail.scali.no> At 21:00 08.06.2007, Mark Hahn wrote: >Message: 1 >Date: Fri, 8 Jun 2007 12:11:10 -0400 (EDT) >From: Mark Hahn >Subject: [Beowulf] backtraces >To: Beowulf Mailing List > >I had a user grumble about how it was not trivial to get >a basic backtrace on our clusters. his jobs tend to be 32-128p, >and run for a week, so it's not ideal to run them under the debugger. Using Scali MPI Connect, you can easily install signal handlers. When the signal(s) is caught, the application continues to run, the offending process writes out its registers and you can conveniently attach to it with your favorite debugger. Regards, Håkon From gerry.creager at tamu.edu Tue Jun 12 05:11:57 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <20070612054934.GA8063@bx9.net> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> <466E2726.50407@tamu.edu> <20070612054934.GA8063@bx9.net> Message-ID: <466E8D8D.80808@tamu.edu> Greg Lindahl wrote: > On Mon, Jun 11, 2007 at 11:55:02PM -0500, Gerry Creager wrote: > >> And how do you support my operational requirements with this policy >> during hurricane season?
> > By not over-generalizing from a general policy to a place where it > doesn't apply? Craig has worked in weather forecasting, you know. Actually, the tone sounded like it was already over-generalized. I merely followed the trend. > You don't run your ensemble elements as separate jobs? Isn't that > asking for disaster if something goes wrong? Actually, it depends on what you call a "job". Apparently IBM's LoadLeveler (hardly a Beowulf implementation, but what I'm working with right now) thinks that the job-file defines the job. I can check-point, sleep or do quite a bit more within the normal job script but IBM wants to treat that as a "job". Most of my runs on that machine complete in a couple of clock hours for a single ensemble member, or less. The job, however, can take 8-12 hours with WRF, Holland winds, ADCIRC, WaveWatch, SWAN and ELCIRC in ensemble mode. Some of my WRF climate runs can go for days, however. Those are cycle hogs. gerry -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From rgb at phy.duke.edu Tue Jun 12 06:08:32 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> Message-ID: On Mon, 11 Jun 2007, Mark Hahn wrote: > part of the reason I got a kick out of this simple backtrace.so > is indeed that it's quite possible to conceive of a checkpoint.so > which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent job of > checkpointing at least serial codes non-intrusively. IIRC, condor has just such a library that it uses both for serial job migration and checkpointing. rgb > > regards, mark hahn. 
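For illustration, the raw material such a checkpoint.so (or condor's library) would start from is easy to see on any Linux node. This sketch only enumerates a process's memory mappings and open file descriptors -- it makes no attempt at the hard part, which is actually saving and restoring that state.

```shell
# Sketch only: enumerate the state a hypothetical /proc-based
# checkpointer would have to capture for a target process -- its
# memory mappings and its open file descriptors. Restoring this state
# is left to real tools (condor, BLCR). The target pid here is this
# shell itself, purely as a convenient stand-in.
pid=$$
maps=$(wc -l < "/proc/$pid/maps")
fds=$(ls "/proc/$pid/fd" | wc -l)
echo "pid $pid: $maps memory mappings, $fds open fds"
```

A real checkpointer would additionally need to capture register state, signal dispositions, and file offsets, which is why it is only "possibly decent" for serial codes and much harder for anything holding network connections.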
> _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From dnlombar at ichips.intel.com Tue Jun 12 07:02:45 2007 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> Message-ID: <20070612140245.GA14845@nlxdcldnl2.cl.intel.com> On Mon, Jun 11, 2007 at 10:00:02PM -0400, Mark Hahn wrote: > > part of the reason I got a kick out of this simple backtrace.so > is indeed that it's quite possible to conceive of a checkpoint.so > which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly > decent job of checkpointing at least serial codes non-intrusively. > Have you looked at Berkeley Lab Checkpoint/Restart (BLCR)? It goes far beyond serial codes; with proper support, it does MPI too... -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From apittman at concurrent-thinking.com Mon Jun 11 08:36:45 2007 From: apittman at concurrent-thinking.com (Ashley Pittman) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466CDE73.7020901@fft.be> References: <466CDE73.7020901@fft.be> Message-ID: <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> On Mon, 2007-06-11 at 07:32 +0200, Toon Knapen wrote: > Interesting indeed. On which platform is this backtrace.so available > (obtaining backtraces is highly platform dependent AFAIK) ?
It's highly dependent to implement, but I should imagine most people who need backtraces use a debugger, the libc backtrace() function, or libbacktrace, which can be used from either inside or outside the target process; these tend to be platform independent. > Mark Hahn wrote: > > I had a user grumble about how it was not trivial to get a basic > > backtrace on our clusters. his jobs tend to be 32-128p, > > and run for a week, so it's not ideal to run them under the debugger. It really shouldn't be that difficult; on a Quadrics cluster at least you can use the command "padb -x -r " from anywhere in the cluster to see a backtrace from any given rank. Ashley, From arnoldg at ncsa.uiuc.edu Mon Jun 11 12:31:56 2007 From: arnoldg at ncsa.uiuc.edu (Galen Arnold) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466D9EAE.8010105@fft.be> References: <466CDE73.7020901@fft.be> <466D69F2.60005@hypermall.net> <466D9EAE.8010105@fft.be> Message-ID: >> >> The Intel Compiler provides backtraces. I think (from memory) that >> you compile with -g --traceback. ...only for fortran source code [it's in icc in case you're linking with fortran]. -Galen From tmalas at ee.bilkent.edu.tr Tue Jun 12 00:25:37 2007 From: tmalas at ee.bilkent.edu.tr (Tahir Malas) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] Two problems related to slowness and TASK_UNINTERRUPTABLE process Message-ID: <01ae01c7acc2$dfa8e810$d80cb38b@bs> Hi all, We have an 8 dual quad-core node HP cluster connected via Infiniband. We use Voltaire DDR cards and a 24-port switch. We also use OFED 1.1 and MVAPICH 0.9.7. We have two interesting problems that we could not overcome yet: 1. In our test program which mimics the communications in our code, the nodes are paired as follows: (0 and 1), (2 and 3), (4 and 5), (6 and 7). We perform one to one communications between these pairs of nodes simultaneously. We use blocking MPI send and receive commands to communicate an integer array of various sizes.
In addition, we consider different numbers of processes: (a) 1 process per node, 8 processes overall: One link is established between the pairs of nodes. (b) 2 process per node, 16 processes overall: Two links are established between the pairs of nodes. (c) 4 process per node, 32 processes overall: Four links are established between the pairs of nodes. (d) 8 process per node, 64 processes overall: Eight links are established between the pairs of nodes. We obtain logical timings, except for the following interesting comparison: For 32 processes (4 process per node), the arrays with 512-Byte size are communicated slower than the 4096-Byte size arrays. For both of them, we send/receive 1,000,000 arrays and take the average to find the time per package. Only package size changes. We have made many trials and confirmed this abnormal case is persistent. More specifically, communication of 4k-Byte packages are 2 times faster than the communication of 512-Byte packages. The OSU bandwidth and latency test around these points shows: Byte MB/s 256 417.53 512 592.34 1024 691.02 2048 857.35 4096 906.04 8192 1022.52 Time (usec) 256 4.79 512 5.48 1024 6.60 2048 8.30 4096 11.02 So this behavior does not seem reasonable to us. 2. SOMETIMES, after the test with overall 32 processes, one of the four processes at node3 hangs in TASK_UNINTERRUPTABLE "D" state. Hence, the test program shows a "done." and waits for sometime. We can neither kill the process nor soft reboot the node. We have to wait for that process to terminate, which can last long. Does anybody have some comments in these issues? 
Thanks in advance, Tahir Malas Bilkent University Electrical and Electronics Engineering Department From hahn at mcmaster.ca Tue Jun 12 08:14:55 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] Two problems related to slowness and TASK_UNINTERRUPTABLE process In-Reply-To: <01ae01c7acc2$dfa8e810$d80cb38b@bs> References: <01ae01c7acc2$dfa8e810$d80cb38b@bs> Message-ID: > For 32 processes (4 process per node), the arrays with 512-Byte size are > communicated slower than the 4096-Byte size arrays. For both of them, we do you mean that this is not the case in other configurations? an interconnect _should_ have some steep rise in effective bandwidth as packet size is increased. it's a useful metric to know the packet size at which half-peak bandwidth is achieved, since this offers some "sense of scale" to programmers judging whether their own packet sizes are appropriate. > this abnormal case is persistent. More specifically, communication of > 4k-Byte packages are 2 times faster than the communication of 512-Byte > packages. perhaps I'm dense this morning, but what's unexpected about that? > The OSU bandwidth and latency test around these points shows: > Byte MB/s > 256 417.53 > 512 592.34 > 1024 691.02 > 2048 857.35 > 4096 906.04 > 8192 1022.52 the osu_bw test is a streaming, fire-and-forget one which strongly rewards message aggregation. (this is not necessarily deceptive - it's measuring a real communication pattern, though it's not the only way to quantify bandwidth.) you can see that it's aggregating because the reported bandwidth for small packets is much higher than you'd expect if each packet took the latency reported below. (unless my math is wrong, 256/(2*4.79e-6) = 26.7 MB/s) > Time (usec) > 256 4.79 > 512 5.48 > 1024 6.60 > 2048 8.30 > 4096 11.02 > So this behavior does not seem reasonable to us. > > 2. 
SOMETIMES, after the test with overall 32 processes, one of the four > processes at node3 hangs in TASK_UNINTERRUPTABLE "D" state. Hence, the test > program shows a "done." and waits for some time. We can neither kill the > process nor soft reboot the node. We have to wait for that process to > terminate, which can last long. does /proc/$pid/wchan (on the 'D' state process) tell you anything? do all the ranks return from MPI_Finalize? regards, mark hahn. From ctierney at hypermall.net Tue Jun 12 08:34:22 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466E2726.50407@tamu.edu> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> <466E2726.50407@tamu.edu> Message-ID: <466EBCFE.4020606@hypermall.net> Gerry Creager wrote: > I've tried to stay out of this. Really, I have. > > Craig Tierney wrote: >> Mark Hahn wrote: >>>> Sorry to start a flame war.... >>> >>> what part do you think was inflamed? >> >> It was when I was trying to say "Real codes have user-level >> checkpointing implemented and no code should ever run for 7 >> days." > > A number of my climate simulations will run for 7-10 days to get > century-long simulations to complete. I've run geodesy simulations that > ran for up to 17 days in the past. I like to think that my codes are > real enough! > NCAR and GFDL run climate simulations for weeks as well. What is the longest period of time any one job can run? It is 8-12 hours. I can verify these numbers if needed, but I can guarantee you that no one is allowed to put their job in for 17 days. With explicit permission they may get 24 hours, but that would be for unique situations. > Real codes do have user-level checkpointing, though.
And even better > codes can be restarted without a lot of user intervention by invoking a > run-time flag and going off for coffee. > You mean there are people that bother to implement checkpointing and then don't make the code like:

if (checkpoint files exist in my directory) then
    load checkpoint files
else
    start from scratch
end

???? >> Set your queue maximums to 6-8 hours. Prevents system hogging, >> encourages checkpointing for long runs. Make sure your IO system >> can support the checkpointing because it can create a lot of load. > > And how do you support my operational requirements with this policy > during hurricane season? Let's see... "Stop that ensemble run now so > the Monte Carlo chemists can play for awhile, then we'll let you back > on. Don't worry about the timeliness of your simulations. No one needs > a 35-member ensemble for statistical forecasting, anyway." Did I miss > something? > You kick off the users that are not running operational codes because their work is (probably) not as time constrained. Also, if you take so long to get your answer in an operational mode that the answer doesn't matter anymore, you need a faster computer. I would think that if you cannot spit out a 12-hour hurricane forecast in a couple of hours I would be concerned how valuable the answer would be. Craig > Yeah, we really do that. With boundary-condition munging we can run a > statistical set of simulations and see what the probabilities are and > where, for instance, maximum storm surge is likely to go. If we don't > get sufficient membership in the ensemble, the statistical strength of > the forecasting procedure decreases. > > Gerry > >>> part of the reason I got a kick out of this simple backtrace.so >>> is indeed that it's quite possible to conceive of a checkpoint.so >>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent >>> job of checkpointing at least serial codes non-intrusively. >>> >> >> BTW, I like your code.
I had a script written for me in the past >> (by Greg Lindahl in a galaxy far-far away). The one modification >> I would make is to print out the MPI ID evnironment variable (MPI >> flavors vary how it is set). Then when it crashes, you know which >> process actually died. >> >> Craig >> >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > From gerry.creager at tamu.edu Tue Jun 12 09:19:53 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466EBCFE.4020606@hypermall.net> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> <466E2726.50407@tamu.edu> <466EBCFE.4020606@hypermall.net> Message-ID: <466EC7A9.4010609@tamu.edu> Craig Tierney wrote: > Gerry Creager wrote: >> I've tried to stay out of this. Really, I have. >> >> Craig Tierney wrote: >>> Mark Hahn wrote: >>>>> Sorry to start a flame war.... >>>> >>>> what part do you think was inflamed? >>> >>> It was when I was trying to say "Real codes have user-level >>> checkpointing implemented and no code should ever run for 7 >>> days." >> >> A number of my climate simulations will run for 7-10 days to get >> century-long simulations to complete. I've run geodesy simulations >> that ran for up to 17 days in the past. I like to think that my codes >> are real enough! >> > > NCAR and GFDL run climate simulations for weeks as well. How longest > period of time any one job can run? It is 8-12 hours. I can verify > these numbers if needed, but I can guarantee you that no one is allowed > to put their job in for 17 days. With explicit permission they may get > 24 hours, but that would be for unique situations. 
On the p575, we have similar constraints and I do work within those. In my lab, I can control access a bit more and have considerably fewer (and truly grateful) users, so if we need to run "forever" we can implement that. >> Real codes do have user-level checkpointing, though. And even better >> codes can be restarted without a lot of user intervention by invoking >> a run-time flag and going off for coffee. >> > > You mean there are people that bother to implement checkpointing and > then don't make it code like: > > if (checkpoint files exist in my directory) then > load checkpoint files > else > start from scratch > end > > ???? Yes, there are. No, I'm not one of them. My stuff does do a restart if it stops and finds evidence of a need to continue. However, I've seen this failure time and time again over the years. >>> Set your queue maximums to 6-8 hours. Prevents system hogging, >>> encourages checkpointing for long runs. Make sure your IO system >>> can support the checkpointing because it can create a lot of load. >> >> And how do you support my operational requirements with this policy >> during hurricane season? Let's see... "Stop that ensemble run now so >> the Monte Carlo chemists can play for awhile, then we'll let you back >> on. Don't worry about the timeliness of your simulations. No one >> needs a 35-member ensemble for statistical forecasting, anyway." Did >> I miss something? >> > > You kick-off the users that are not running operational codes because > their work is (probably) not as time constrained. Also, if you take > so long to get your answer in an operational mode that the answer > doesn't matter anymore, you need a faster computer. I would think that > if you cannot spit out a 12-hour hurricane forecast in a couple of > hours I would be concerned how valuable the answer would be. Several points in here. 1. Preemption is one approach I finally got the admin to buy into for forecasting codes. 2. 
MY operational codes for an individual simulation don't take long to run, save the fact that we don't do a 12 hr hurricane sim, but an 84 hour sim for the weather side (WRF). Saving grace here is that the nested grids are not too large so they can run to completion in a couple of wall-clock hours. 3. When one starts trying to twiddle initial conditions statistically to create an ensemble, one then has to run all the ensemble members. One usually starts with central cases first, especially if one "knows" which are central and which are peripheral. If one run takes 30 min on 128 processors, and one thinks one needs 57 members run, one exceeds a wall-clock day. And needs a bigger, faster computer, or at least a bigger queue reservation. If one does this without preemption, one gets all results back at the end of the hurricane season and declares success after 3 years of analysis instead of providing data in near real time. Part of this involves the social engineering required on my campus to get HPC efforts to work at all... Alas, nothing has to do with backtraces. gerry >> Yeah, we really do that. With boundary-condition munging we can run a >> statistical set of simulations and see what the probabilities are and >> where, for instance, maximum storm surge is likely to go. If we don't >> get sufficient membership in the ensemble, the statistical strength of >> the forecasting procedure decreases. >> >> Gerry >> >>>> part of the reason I got a kick out of this simple backtrace.so >>>> is indeed that it's quite possible to conceive of a checkpoint.so >>>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly decent >>>> job of checkpointing at least serial codes non-intrusively. >>>> >>> >>> BTW, I like your code. I had a script written for me in the past >>> (by Greg Lindahl in a galaxy far-far away). The one modification >>> I would make is to print out the MPI ID evnironment variable (MPI >>> flavors vary how it is set). 
Then when it crashes, you know which >>> process actually died. >>> >>> Craig >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf@beowulf.org >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> > > -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From surs at cse.ohio-state.edu Tue Jun 12 08:09:01 2007 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] Re: [mvapich-discuss] Two problems related to slowness and TASK_UNINTERRUPTABLE process In-Reply-To: <01ae01c7acc2$dfa8e810$d80cb38b@bs> References: <01ae01c7acc2$dfa8e810$d80cb38b@bs> Message-ID: <466EB70D.2000306@cse.ohio-state.edu> Hi Tahir, Thanks for sharing this data and your observations. It is interesting. We have a more recent release, MVAPICH-0.9.9 which is available from our website (mvapich.cse.ohio-state.edu) as well as with OFED-1.2 distribution. Could you please try out our newer release and see if the results change/remain the same? Thanks, Sayantan. Tahir Malas wrote: > Hi all, > We have an 8 dual quad-core node HP cluster connected via Infiniband. We use > Voltaire DDR cards and 24-port switch. We also use OFED 1.1 and MVAPICH > 0.9.7. We have two interesting problems that we could not overcome yet: > > 1. In our test program which mimics the communications in our code, the > nodes are paired as follows: (0 and 1), (2 and 3), (4 and 5), (6 and 7). We > perform one to one communications between these pairs of nodes > simultaneously. We use blocking MPI send and receive commands to communicate > an integer array of various sizes. 
> In addition, we consider different numbers of processes:
> (a) 1 process per node, 8 processes overall: one link is established between the pairs of nodes.
> (b) 2 processes per node, 16 processes overall: two links are established between the pairs of nodes.
> (c) 4 processes per node, 32 processes overall: four links are established between the pairs of nodes.
> (d) 8 processes per node, 64 processes overall: eight links are established between the pairs of nodes.
>
> We obtain logical timings, except for the following interesting comparison:
>
> For 32 processes (4 processes per node), the 512-Byte arrays are communicated slower than the 4096-Byte arrays. For both of them, we send/receive 1,000,000 arrays and take the average to find the time per package. Only the package size changes. We have made many trials and confirmed that this abnormal case is persistent. More specifically, communication of 4k-Byte packages is 2 times faster than communication of 512-Byte packages.
>
> The OSU bandwidth and latency tests around these points show:
>
> Byte    MB/s
> 256     417.53
> 512     592.34
> 1024    691.02
> 2048    857.35
> 4096    906.04
> 8192    1022.52
>
> Byte    Time (usec)
> 256     4.79
> 512     5.48
> 1024    6.60
> 2048    8.30
> 4096    11.02
>
> So this behavior does not seem reasonable to us.
>
> 2. SOMETIMES, after the test with overall 32 processes, one of the four processes at node3 hangs in the TASK_UNINTERRUPTABLE "D" state. Hence, the test program shows a "done." and waits for some time. We can neither kill the process nor soft-reboot the node. We have to wait for that process to terminate, which can take long.
>
> Does anybody have comments on these issues?
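A quick sanity check of the OSU numbers above (an editorial sketch, not part of the original mail): with the simple model time ≈ latency + size/bandwidth, the measured latency/bandwidth pairs predict that a 512-byte message is strictly cheaper than a 4096-byte one, which is what makes the application result look anomalous.

```python
# Sketch: expected per-message time from the OSU latency/bandwidth tables
# above, using the simple model t = latency + size / bandwidth.
# Latencies in microseconds, bandwidths in MB/s, as reported.
osu = {
    512:  {"lat_us": 5.48,  "bw_mbs": 592.34},
    4096: {"lat_us": 11.02, "bw_mbs": 906.04},
}

def expected_us(size_bytes):
    m = osu[size_bytes]
    # bytes / (MB/s) gives microseconds when MB is taken as 10^6 bytes
    return m["lat_us"] + size_bytes / m["bw_mbs"]

for size in (512, 4096):
    print(size, round(expected_us(size), 2))  # 512 -> 6.34, 4096 -> 15.54
```

On this model a 512-byte message should cost about 6.3 us against about 15.5 us for 4096 bytes, so the reported application behavior (512-byte transfers twice as slow as 4096-byte ones) is the opposite of what the micro-benchmarks predict.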
> Thanks in advance, > Tahir Malas > Bilkent University > Electrical and Electronics Engineering Department > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > -- http://www.cse.ohio-state.edu/~surs From ctierney at hypermall.net Tue Jun 12 14:48:33 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] backtraces In-Reply-To: <466EC7A9.4010609@tamu.edu> References: <466CDE73.7020901@fft.be> <1181576205.10115.34.camel@bruce.priv.wark.uk.streamline-computing.com> <466D9F59.7070901@fft.be> <466DB793.1040903@hypermall.net> <466E18F4.90701@hypermall.net> <466E2726.50407@tamu.edu> <466EBCFE.4020606@hypermall.net> <466EC7A9.4010609@tamu.edu> Message-ID: <466F14B1.8070508@hypermall.net> > Several points in here. > 1. Preemption is one approach I finally got the admin to buy into for > forecasting codes. > 2. MY operational codes for an individual simulation don't take long to > run, save the fact that we don't do a 12 hr hurricane sim, but an 84 > hour sim for the weather side (WRF). Saving grace here is that the > nested grids are not too large so they can run to completion in a couple > of wall-clock hours. > 3. When one starts trying to twiddle initial conditions statistically > to create an ensemble, one then has to run all the ensemble members. One > usually starts with central cases first, especially if one "knows" which > are central and which are peripheral. If one run takes 30 min on 128 > processors, and one thinks one needs 57 members run, one exceeds a > wall-clock day. And needs a bigger, faster computer, or at least a > bigger queue reservation. If one does this without preemption, one gets > all results back at the end of the hurricane season and declares success > after 3 years of analysis instead of providing data in near real time. 
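Gerry's ensemble arithmetic above is easy to verify (a sketch added editorially; the 30-minute runtime and 57-member count are his figures, the 24-hour day is the obvious bound):

```python
# Sketch of the ensemble wall-clock arithmetic from the mail above:
# 57 members at 30 minutes each, run back to back on a single
# 128-processor allocation, against a 24-hour wall-clock day.
members = 57
minutes_per_run = 30

total_hours = members * minutes_per_run / 60
print(total_hours)       # machine time in hours if the runs are serialized
print(total_hours > 24)  # exceeds a wall-clock day
```

At 28.5 hours serialized, the campaign misses a 24-hour forecast window unless members run concurrently or preempt other work.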
> So there are 57 jobs of 30 minutes each. Get your user to rewrite their scripts so it isn't one job. That shouldn't be too hard. > Part of this involves the social engineering required on my campus to > get HPC efforts to work at all... Alas, nothing has to do with backtraces. Very true (on both parts). Craig > > gerry > >>> Yeah, we really do that. With boundary-condition munging we can run >>> a statistical set of simulations and see what the probabilities are >>> and where, for instance, maximum storm surge is likely to go. If we >>> don't get sufficient membership in the ensemble, the statistical >>> strength of the forecasting procedure decreases. >>> >>> Gerry >>> >>>>> part of the reason I got a kick out of this simple backtrace.so >>>>> is indeed that it's quite possible to conceive of a checkpoint.so >>>>> which uses /proc/$pid/fd and /proc/$pid/maps to do a possibly >>>>> decent job of checkpointing at least serial codes non-intrusively. >>>>> >>>> >>>> BTW, I like your code. I had a script written for me in the past >>>> (by Greg Lindahl in a galaxy far-far away). The one modification >>>> I would make is to print out the MPI ID evnironment variable (MPI >>>> flavors vary how it is set). Then when it crashes, you know which >>>> process actually died. 
>>>> >>>> Craig >>>> >>>> _______________________________________________ >>>> Beowulf mailing list, Beowulf@beowulf.org >>>> To change your subscription (digest mode or unsubscribe) visit >>>> http://www.beowulf.org/mailman/listinfo/beowulf >>> >> >> > From wrankin at ee.duke.edu Wed Jun 13 09:40:17 2007 From: wrankin at ee.duke.edu (Bill Rankin) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <466D70D5.5050701@charter.net> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> <466D70D5.5050701@charter.net> Message-ID: <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> Doug and Jeff have good points (and some good links). One thing to also pay attention to is the CPU utilization during the bandwidth and application testing. We found that on our cluster (various Dells with built-in GigE NICs), while we did not see huge differences in effective bandwidth, the CPU overhead was notably less when using Jumbo Frames. Again, YMMV. Good luck, -bill On Jun 11, 2007, at 11:57 AM, Jeffrey B. Layton wrote: > Doug brings up some good points. If you want to try Jumbo > Frames to improve MPI performance you might have to > tweak the TCP buffers as well. There are some links around > the web on this. Sometimes it helps performance, sometimes > it doesn't. Your mileage may vary.
> > Jeff From deadline at eadline.org Wed Jun 13 16:02:10 2007 From: deadline at eadline.org (Douglas Eadline) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> <466D70D5.5050701@charter.net> <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> Message-ID: <36018.192.168.1.1.1181775730.squirrel@mail.eadline.org> So this begs the question, if we are "core rich and packet small" do we care about packet size and overhead? In other words if we have plenty of cores when do we not care about communication overhead. Most GigE drivers have various interrupt coalescence strategies and of course Jumbo Frames to lessen the processor load, but if we have multi-core do we need to care about this as much ... any thoughts? -- Doug > Doug and Jeff have good points (and some good links). On thing to > also pay attention to is the CPU utilization during the bandwidth and > application testing. We found that on our cluster (various Dells > with built in GigE NICs) while we did not see huge differences in > effective bandwidth, the CPU overhead was notably less when using > Jumbo Frames. > > Again, YMMV. > > Good luck, > > -bill > > On Jun 11, 2007, at 11:57 AM, Jeffrey B. Layton wrote: > >> Doug brings up some good points. If you want to try Jumbo >> Frames to improve MPI performance you might have to >> tweak the TCP buffers as well. There are some links around >> the web on this. Sometimes it helps performance, sometimes >> it doesn't. Your mileage may vary. >> >> Jeff > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > !DSPAM:467021b9234289691080364! 
> -- Doug From laytonjb at charter.net Wed Jun 13 16:30:16 2007 From: laytonjb at charter.net (laytonjb@charter.net) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] MPI performance gain with jumbo frames Message-ID: <1022887901.1181777416954.JavaMail.root@fepweb12> More questions: One of the purposes of interrupt coalescence is to reduce the load on the CPU by ganging interrupt requests together (sorry for all of the technical jargon there). In a multi-core situation, do the interrupts affect all of the cores or just one core? If the interrupts affect all of the cores, then interrupt coalescence might be a good thing (even if the latency is much higher). I think Doug has some benchmarks that show some strange things when running NPB on multi-core nodes. This might show us something about what's going on. > > So this begs the question, if we are "core rich and packet small" > do we care about packet size and overhead? In other words if we have > plenty of cores when do we not care about communication > overhead. Most GigE drivers have various interrupt coalescence > strategies and of course Jumbo Frames to lessen the processor > load, but if we have multi-core do we need to care about this > as much ... any thoughts? > > -- > Doug > > > > Doug and Jeff have good points (and some good links). On thing to > > also pay attention to is the CPU utilization during the bandwidth and > > application testing. We found that on our cluster (various Dells > > with built in GigE NICs) while we did not see huge differences in > > effective bandwidth, the CPU overhead was notably less when using > > Jumbo Frames. > > > > Again, YMMV. > > > > Good luck, > > > > -bill > > > > On Jun 11, 2007, at 11:57 AM, Jeffrey B. Layton wrote: > > > >> Doug brings up some good points. If you want to try Jumbo > >> Frames to improve MPI performance you might have to > >> tweak the TCP buffers as well. There are some links around > >> the web on this. 
Sometimes it helps performance, sometimes > >> it doesn't. Your mileage may vary. > >> > >> Jeff > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > !DSPAM:467021b9234289691080364! > > > > > -- > Doug > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Wed Jun 13 16:37:05 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <36018.192.168.1.1.1181775730.squirrel@mail.eadline.org> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> <466D70D5.5050701@charter.net> <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> <36018.192.168.1.1.1181775730.squirrel@mail.eadline.org> Message-ID: <20070613233705.GA14997@bx9.net> On Wed, Jun 13, 2007 at 07:02:10PM -0400, Douglas Eadline wrote: > So this begs the question, if we are "core rich and packet small" > do we care about packet size and overhead? That's not quite the question. In many programs, there is no possible overlap between communication and computation, so they don't care how high the overhead is, although for smaller messages lower overhead can mean higher bandwidth (that "message rate" thing, again.) If you can overlap, then you do care about overhead, especially for Ethernet, where the cpu overhead is often unequally distributed over your cores. 
-- greg From lindahl at pbm.com Wed Jun 13 16:50:17 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <1022887901.1181777416954.JavaMail.root@fepweb12> References: <1022887901.1181777416954.JavaMail.root@fepweb12> Message-ID: <20070613235017.GA16124@bx9.net> On Wed, Jun 13, 2007 at 04:30:16PM -0700, laytonjb@charter.net wrote: > In a multi-core situation, > do the interrupts affect all of the cores or just one core? One core gets each interrupt. cat /proc/interrupts to see how this works in your system. > I personally like the concept that Level 5 Networks used in conjunction > with their GigE cards - user space drivers. This is how everyone does their EtherNot devices: InfiniPath, Myrinet, Quadrics, yadda yadda. Then the next question is, why are you bothering with TCP? With EtherNot, you can avoid all of the interrupts. A typical InfiniPath system only has a couple of interrupts after running for weeks; bringing the link up causes a couple. MPI doesn't cause any. -- greg From jmack at wm7d.net Wed Jun 13 07:29:29 2007 From: jmack at wm7d.net (Joseph Mack NA3T) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] programming multicore clusters Message-ID: I've googled the internet and searched the Beowulf archives for "hybrid" || "multicore" and the only definitive statement I've found is by Greg Lindahl, 17 Dec 2004 "Most of the folks interested in hybrid models a few years ago have now given it up". I assume this was from the era of 2-way SMP nodes. Multicore CPUs are being projected for 15yrs into the future (statement by Pat Gelsinger, Intel's CTO, quoted in http://cook.rfe.org/grid.pdf) I expect the programming model will be a little different for single image machines like the Altix, than for beowulfs where each node has its own kernel (and which I assume will be running dual quadcore mobos). 
Still, if a flat, one-network model is used, all processes communicate through the off-board networking. Someone with a quadcore machine, running MPI on a flat network, told me that their application scales poorly to 4 processors. Instead, if processes on cores within a package were working on adjacent parts of the compute volume and communicated through the on-board networking, then for a quadcore machine the off-board networking bandwidth requirement would drop by a factor of 4 and scaling would improve. In a quadcore machine, if 4 OMP/threads processes are started on each quadcore package, could they be rescheduled at the end of their timeslice, on different cores arriving at a cold cache? On a large single image machine, could a thread be scheduled on another node and have to communicate over the off-board network? In a single image machine (with a single address space) how does the OS know to malloc memory from the on-board memory, rather than some arbitrary location (on another board)? I expect everyone here knows all this. How is everyone going to program the quadcore machines? Thanks Joe -- Joseph Mack NA3T EME(B,D), FM05lw North Carolina jmack (at) wm7d (dot) net - azimuthal equidistant map generator at http://www.wm7d.net/azproj.shtml Homepage http://www.austintek.com/ It's GNU/Linux!

From lfarkas at bppiac.hu Wed Jun 13 09:11:05 2007 From: lfarkas at bppiac.hu (Farkas Levente) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] network raid filesystem Message-ID: <46701719.1030605@bppiac.hu> hi, we have 10-20 servers on a LAN, each with 4 HDDs. we'd like to create one big filesystem on these servers' hard disks. we'd like to create it in a redundant way, i.e.: - if one (or more) of the HDDs or servers fails, the whole filesystem stays usable and consistent. - any server in this farm can see the same storage. it's something like a big network raid5-6 storage, where we have about 40-80 partitions added to the same filesystem, with an fs over it which hides all the internal network raid functionality. is there any such solution? i can't find any easy way to do this on our linux servers. thank you for your help in advance. -- Levente "Si vis pacem para bellum!"

From pal at di.fct.unl.pt Wed Jun 13 16:00:33 2007 From: pal at di.fct.unl.pt (Paulo Afonso Lopes) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] MPI performance gain with jumbo frames In-Reply-To: <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> References: <1269.155.210.32.73.1181561593.squirrel@webmail.cauterized.net> <56470.192.168.1.1.1181576582.squirrel@mail.eadline.org> <466D70D5.5050701@charter.net> <8B8F145A-38A7-4DD0-9BA9-D3A9EB0D758E@ee.duke.edu> Message-ID: <30836.89.26.129.109.1181775633.squirrel@www.di.fct.unl.pt> I can report a decrease of circa 10% CPU use per GbE link in an IBM x335 (dual Xeon 2.6GHz) with on-board Broadcom NICs and an SMC switch, when going from standard 1500-byte to 9K frames on the netperf benchmark, at full bandwidth (circa 80MB/s). Best Regards, paulo > Doug and Jeff have good points (and some good links). One thing to > also pay attention to is the CPU utilization during the bandwidth and > application testing. We found that on our cluster (various Dells > with built-in GigE NICs) while we did not see huge differences in > effective bandwidth, the CPU overhead was notably less when using > Jumbo Frames. > > Again, YMMV.
> > Good luck, > > -bill > -- Paulo Afonso Lopes | Tel: +351- 21 294 8536 Departamento de Informática | 294 8300 ext.10763 Faculdade de Ciências e Tecnologia | Fax: +351- 21 294 8541 Universidade Nova de Lisboa | e-mail: pal@di.fct.unl.pt 2829-516 Caparica, PORTUGAL

From tmalas at ee.bilkent.edu.tr Wed Jun 13 05:37:08 2007 From: tmalas at ee.bilkent.edu.tr (Tahir Malas) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] Two problems related to slowness and TASK_UNINTERRUPTABLE process In-Reply-To: References: <01ae01c7acc2$dfa8e810$d80cb38b@bs> Message-ID: <000f01c7adb7$8ea0c5f0$d80cb38b@bs> > -----Original Message----- > From: Mark Hahn [mailto:hahn@mcmaster.ca] > Sent: Tuesday, June 12, 2007 6:15 PM > To: Tahir Malas > Cc: mvapich-discuss@cse.ohio-state.edu; beowulf@beowulf.org; > teoman.terzi@gmail.com; 'Ozgur Ergul' > Subject: Re: [Beowulf] Two problems related to slowness and > TASK_UNINTERRUPTABLE process > > > For 32 processes (4 process per node), the arrays with 512-Byte size are > > communicated slower than the 4096-Byte size arrays. For both of them, we > > do you mean that this is not the case in other configurations? > an interconnect _should_ have some steep rise in effective bandwidth > as packet size is increased. it's a useful metric to know the packet > size at which half-peak bandwidth is achieved, since this offers some > "sense of scale" to programmers judging whether their own packet sizes > are appropriate. > > > this abnormal case is persistent. More specifically, communication of > > 4k-Byte packages are 2 times faster than the communication of 512-Byte > > packages. > > perhaps I'm dense this morning, but what's unexpected about that?

Considering the latency and bandwidth measures, my expectation for the communication times in microseconds is:

512: 5.48 + 512/592.34 = 6.34
4096: 11.02 + 4096/906.04 = 15.54

Our test:

512: 29.434
4096: 16.209

So isn't the communication time for 512 bytes unexpectedly slow?

> > > > 2.
SOMETIMES, after the test with overall 32 processes, one of the four > > processes at node3 hangs in TASK_UNINTERRUPTABLE "D" state. Hence, the > test > > program shows a "done." and waits for sometime. We can neither kill the > > process nor soft reboot the node. We have to wait for that process to > > terminate, which can last long. > > does /proc/$pid/wchan (on the 'D' state process) tell you anything? > do all the ranks return from MPI_Finalize? > The file tells "__lock_buffer". Yes, all ranks return; but I think, this problematic process (i.e. one of the processes on node3) returns always the latest. Thanks, and regards, Tahir. From lindahl at pbm.com Wed Jun 13 22:55:39 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] programming multicore clusters In-Reply-To: References: Message-ID: <20070614055539.GA26746@bx9.net> On Wed, Jun 13, 2007 at 07:29:29AM -0700, Joseph Mack NA3T wrote: > "Most of the folks interested in hybrid models a few years > ago have now given it up". > > I assume this was from the era of 2-way SMP nodes. No, the main place you saw that style was on IBM SPs with 8+ cores/node. > I expect the programming model will be a little different > for single image machines like the Altix, than for beowulfs > where each node has its own kernel (and which I assume will > be running dual quadcore mobos). Most Altixes spend most of their time running MPI programs. Or at least that was certainly the case with Origin. > Still if a flat, one network model is used, all processes > communicate through the off-board networking. No, the typical MPI implementation does not use off-board networking for messages to local ranks. You use the same MPI calls, but the underlying implementation uses shared memory when possible. > Someone with a > quadcore machine, running MPI on a flat network, told me > that their application scales poorly to 4 processors. 
Which could be because he's out of memory bandwidth, or network bandwidth, or message rate. There are a lot of potential reasons. > In a quadcore machine, if 4 OMP/threads processes are > started on each quadcore package, could they be rescheduled > at the end of their timeslice, on different cores arriving > at a cold cache? Most MPI and OpenMP implementations lock processes to cores for this very reason. > In a single image machine (with > a single address space) how does the OS know to malloc > memory from the on-board memory, rather than some arbitrary > location (on another board)? Generally the default is to always malloc memory local to the process. Linux grew this feature when it started being used on NUMA machines like the Altix and the Opteron. > I expect everyone here knows all this. How is everyone going > to program the quadcore machines? Using MPI? You can go read up on new approaches like UPC, Co-Array Fortran, Global Arrays, Titanium, Chapel/X-10/Fortress, etc., but MPI is going to be the market leader for a long time. -- greg

From jerker at Update.UU.SE Thu Jun 14 02:10:16 2007 From: jerker at Update.UU.SE (Jerker Nyberg) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] network raid filesystem In-Reply-To: <46701719.1030605@bppiac.hu> References: <46701719.1030605@bppiac.hu> Message-ID: Hi, Here are some pointers to some free software distributed parallel file system projects. They have different goals and are in different stages of development, but most of them aim for fault tolerance (mirroring). I recommend that you take a look at GlusterFS, although I haven't tried it myself yet.

Ceph - early development.
Gfarm file system - for grid computing.
GlusterFS - computing clusters.
Hadoop file system - build your own search engine.
Lustre - mirroring on the roadmap for Q3 2008, but may use shared disks for fault tolerance now.
MogileFS - intended for websites serving images, media, etc.
PVFS2 - may use shared disks for fault tolerance.

Regards, Jerker Nyberg.
Uppsala Sweden. From jmack at wm7d.net Thu Jun 14 05:53:58 2007 From: jmack at wm7d.net (Joseph Mack NA3T) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] programming multicore clusters In-Reply-To: <20070614055539.GA26746@bx9.net> References: <20070614055539.GA26746@bx9.net> Message-ID: On Wed, 13 Jun 2007, Greg Lindahl wrote: >> Still if a flat, one network model is used, all processes >> communicate through the off-board networking. > > No, the typical MPI implementation does not use off-board networking > for messages to local ranks. You use the same MPI calls, but the > underlying implementation uses shared memory when possible. My apparently erroneous assumption was that in a beowulf of quadcore processors, each processor would be assigned a random rank, in which case adjacent processors in the quadcore package would not be working on adjacent parts of the compute space. What's the mechanism for assigning a processor a particular rank? (a url, or pointer to the MPI docs is fine). How does MPI know that one process is running on the same mobo and to use shared memory and that another process is running off-board? I take it there's a map somewhere other than the machines.LINUX file? >> Someone with a quadcore machine, running MPI on a flat >> network, told me that their application scales poorly to >> 4 processors. > > Which could be because he's out of memory bandwith, or > network bandwidth, or message rate. There are a lot of > postential reasons. OK >> In a quadcore machine, if 4 OMP/threads processes are >> started on each quadcore package, could they be >> rescheduled at the end of their timeslice, on different >> cores arriving at a cold cache? > > Most MPI and OpenMP implementations lock processes to > cores for this very reason. am off googling >> In a single image machine (with a single address space) >> how does the OS know to malloc memory from the on-board >> memory, rather than some arbitary location (on another >> board)? 
> > Generally the default is to always malloc memory local to > the process. Linux grew this feature when it started being > used on NUMA machines like the Altix and the Opteron. ditto >> I expect everyone here knows all this. How is everyone >> going to program the quadcore machines? > > Using MPI? I see. It's a bit clearer now. > You can go read up on new approaches like UPC, Co-Array > Fortran, Global Arrays, Titanium, Chapel/X-10/Fortress, > etc, but MPI is going to be the market leader for a long > time. Thanks Joe -- Joseph Mack NA3T EME(B,D), FM05lw North Carolina jmack (at) wm7d (dot) net - azimuthal equidistant map generator at http://www.wm7d.net/azproj.shtml Homepage http://www.austintek.com/ It's GNU/Linux! From rosing at peakfive.com Thu Jun 14 09:30:28 2007 From: rosing at peakfive.com (Matt) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] RE: programming multicore clusters In-Reply-To: <200706141419.l5EEITie028318@bluewest.scyld.com> References: <200706141419.l5EEITie028318@bluewest.scyld.com> Message-ID: <18033.27940.74696.340347@lala.site> Joseph Mack writes: > I expect everyone here knows all this. How is everyone going > to program the quadcore machines? We used OpenMP on the node and MPI between the nodes. It's ugly and horrendous to look at or comprehend. The only saving grace is that our source code is serial plus custom directives and we have tools to generate OpenMP or calls to a MPI based library or both. So we put all the difficult stuff in the directives. We don't have any SMP nodes anymore so it will take some time to resurrect that ability. Using straight MPI is the lowest common denominator and simplest, but doesn't use the machine very efficiently. I think it'll only get worse with more cores. I'd be interested in your experience and what you find out. 
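Matt's hybrid scheme above (message passing between nodes, shared memory within a node) can be sketched structurally in plain Python, standing in for MPI between nodes and OpenMP within a node; the core count and the toy sum-of-squares workload are invented for illustration, and nothing here is real MPI or OpenMP.

```python
# Structural sketch only (plain Python, not MPI/OpenMP): the outer loop
# plays the role of MPI ranks on separate nodes, the inner thread pool
# plays the role of OpenMP threads sharing a node's memory.
from concurrent.futures import ThreadPoolExecutor

CORES_PER_NODE = 4  # assumption, matching the quadcore discussion

def node_work(chunk):
    # "OpenMP" layer: split this node's chunk across threads sharing memory.
    lo, hi = chunk
    step = max(1, (hi - lo) // CORES_PER_NODE)
    subchunks = [(i, min(i + step, hi)) for i in range(lo, hi, step)]
    with ThreadPoolExecutor(max_workers=CORES_PER_NODE) as pool:
        return sum(pool.map(lambda c: sum(x * x for x in range(c[0], c[1])),
                            subchunks))

def hybrid_sum_of_squares(n, nodes=2):
    # "MPI" layer: divide the domain across nodes and combine the results.
    # A sequential loop stands in for the inter-node message passing so
    # the sketch stays self-contained.
    step = n // nodes
    chunks = [(i * step, (i + 1) * step if i < nodes - 1 else n)
              for i in range(nodes)]
    return sum(node_work(c) for c in chunks)

print(hybrid_sum_of_squares(1000))  # same answer as a serial loop
```

The two-level decomposition is the whole point: the inner layer never touches the "network", only the outer layer would.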
Matt From jhh3851 at yahoo.com Thu Jun 14 14:04:59 2007 From: jhh3851 at yahoo.com (Joseph Han) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] RE: programming multicore clusters In-Reply-To: <200706141900.l5EJ08Uv014613@bluewest.scyld.com> Message-ID: <288079.89270.qm@web55015.mail.re4.yahoo.com> > Joseph Mack writes: > > > I expect everyone here knows all this. How is everyone going > > to program the quadcore machines? > > We used OpenMP on the node and MPI between the nodes. It's ugly and > horrendous to look at or comprehend. The only saving grace is that our > source code is serial plus custom directives and we have tools to > generate OpenMP or calls to a MPI based library or both. So we put all > the difficult stuff in the directives. We don't have any SMP nodes > anymore so it will take some time to resurrect that ability. > > Using straight MPI is the lowest common denominator and simplest, but > doesn't use the machine very efficiently. I think it'll only get worse > with more cores. > > I'd be interested in your experience and what you find out. > > Matt > > I don't know the answer to this, but what about MPI implementations which enable local host optimization automatically? For example, MPICH, Intel MPI, and HP-MPI among others all do so if asked. Is running a program using OpenMP on an SMP/multi-core box more efficient than an MPI code with an implementation using localhost optimization?
Joseph From lindahl at pbm.com Thu Jun 14 20:33:15 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] RE: programming multicore clusters In-Reply-To: <288079.89270.qm@web55015.mail.re4.yahoo.com> References: <200706141900.l5EJ08Uv014613@bluewest.scyld.com> <288079.89270.qm@web55015.mail.re4.yahoo.com> Message-ID: <20070615033315.GA21732@bx9.net> On Thu, Jun 14, 2007 at 02:04:59PM -0700, Joseph Han wrote: > Is running a program using OpenMP on an SMP/multi-core box more efficient than > an MPI code with an implementation using localhost optimization? One good example comes from codes which have both pure MPI and hybrid MPI/OpenMP implementations. There's published data from John Michalakes showing MM5 is faster in pure MPI mode. In fact, I've never seen a bid involving pure MPI and hybrid codes where hybrid was faster. There are some unusual cases where hybrid can be a win:

* Codes with extreme load imbalance, like NASA's CFD code Overflow. But it's hard to tell how a good MPI implementation of Overflow would perform; if it turns out that it's simply unusually OpenMP-friendly, that's not really a useful datapoint.

* Codes where a pure MPI code runs out of decomposition, but the hybrid code doesn't.

* Codes where there's a big read-only database that can be shared within a node. But you can share between MPI processes using Sys5 shared memory segments, or you can mmap the database as a file, which shares it.

Hybrid can be a lose when MPI interconnect hardware benefits from being driven from multiple cores. All in all, hybrid programming has been an incredible waste of time, ranking up with HDF in the all-time failures in HPC.
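Greg's last alternative above, sharing a big read-only database between ranks on a node by mmap-ing it as a file, looks roughly like this in miniature. This is an editorial sketch: the temp file and the 1000-double record layout are invented, and in real use each MPI rank on the node would independently perform the same map of one pre-built file.

```python
# Sketch: processes on one node share a read-only "database" by mmap-ing
# the same file; the kernel keeps a single copy of the pages in cache
# rather than one private copy per process.
import mmap
import os
import struct
import tempfile

# Build a toy database: 1000 little-endian float64 records (invented layout).
records = [float(i) for i in range(1000)]
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack("<1000d", *records))

# Each rank on the node would execute this part independently:
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as db:
        # Random access without reading the whole file into private memory.
        rec42 = struct.unpack_from("<d", db, 42 * 8)[0]
        print(rec42)  # -> 42.0

os.unlink(path)
```

The Sys5 shared-memory route Greg mentions achieves the same sharing without a backing file, at the cost of explicit segment management.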
-- greg From toon.knapen at fft.be Fri Jun 15 04:49:49 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] programming multicore clusters In-Reply-To: <20070614055539.GA26746@bx9.net> References: <20070614055539.GA26746@bx9.net> Message-ID: <46727CDD.6040808@fft.be> Greg Lindahl wrote: > Most MPI and OpenMP implementations lock processes to cores for this > very reason. AFAICT this is not always the case. E.g. on systems with glibc, this functionality (set_process_affinity and such) is only available starting from libc-2.3.4. In another mail in the same thread: > One good example comes from codes which have both pure MPI and hybrid > MPI/OpenMPI implementations. There's published data from John > Michalakes MM5 is faster in pure MPI mode. > > In fact I've never seen a bid involving pure MPI and hybrid codes > where hybrid was faster. > Mixing OpenMP and MPI in one and the same algorithm does indeed not generally provide a big advantage. However MPI and OpenMP can be used on different scales. E.g. you can obtain a big boost when running an MPI-code where each process performs local dgemm's for instance by using an OpenMP'd dgemm implementation. This is an example where running mixed-mode makes a lot of sense. toon From hahn at mcmaster.ca Fri Jun 15 05:46:49 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] RE: programming multicore clusters In-Reply-To: <288079.89270.qm@web55015.mail.re4.yahoo.com> References: <288079.89270.qm@web55015.mail.re4.yahoo.com> Message-ID: > Is running a program using OpenMP on a SMP/multi-core box more efficient that > an MPI code with an implementation using localhost optimization? beyond 2-4p, all machines are message passing. 
take a look at Intel's recent products: they have products with one or two dual-core chips in a package, but if you want dual sockets, you get two FSBs - partly for fanout/loading reasons, and partly because truly symmetric, flat SMP machines just don't scale. OK, so once you accept that even shared-memory machines are actually passing messages, the question becomes: what kind of protocol and message size do you want? on a typical message-passing SMP machine (multi-socket x86_64, even SGI Altix), the message size is a cache line (64 or 128B afaik). that's a pretty OK number, but to make effective use of it, you have to write your code so you make sure to pack as much relevant data into these appropriately aligned and sized chunks of memory, knowing that they'll implicitly become packets. you have to marshal your packets, if you will. gosh! the same term is used in explicit msg-passing... in other words, you have to adopt a message-passing methodology regardless of whether your packets are fixed-sized implicit things, or variable-sized, explicit ones. the main difference is in how your messages are addressed - by a simple flat memory address, or by something typically like . in some cases, implicit, memory-based addressing is a real win - mainly if many of your remote one-sided references are to a space that can remain unsynchronized for an extended time (say per timestep). I don't think I've ever seen a paper that tried to quantify this directly, though it would be most interesting...

ccNUMA - provides automatic synchrony by tracking the state of each cache line. but limited by cache size, and perhaps this tracking is irrelevant given your access patterns. the level of consistency may also hurt you, since a naive programmer will waste major cpu time on false sharing or hot cache lines.

RDMA - similar to ccNUMA except with no 'O' or 'E' states, or tracking of states at all. no hardware-supported consistency guarantees, but also significantly higher latency.
explicit msg-passing - different addressing, explicit list of data, not purely what's in a cacheline, but also explicit synchronization, which may seem too rigid. latency not that much higher than RDMA. for the classic example of one worker wanting to collect state from its grid neighbors, direct memory access seems the most natural. but MPI codes can handle this pretty successfully by either using a nonblocking irecv or by having a data-serving thread. either one is, admittedly, extra overhead. unless most of your IPC is this kind of async, unsync, passive data reference, I wouldn't think twice: go MPI. the current media frenzy about multicore systems (nothing new!) doesn't change the picture much. regards, mark hahn. From gerry.creager at tamu.edu Fri Jun 15 06:00:24 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] programming multicore clusters In-Reply-To: References: Message-ID: <46728D68.9000503@tamu.edu> For the foreseeable future, I'm not developing much but will use the hybrid SMP/DM capabilities in WRF. Takes advantage of SMP availability, and supports message passing between SMP nodes. I've not used this capability for benchmarking but it appears to offer significant gains. As we get more hybrid HPC capabilities, planning for this will be more important. A lot of system administrators (based on a statistical sample of 4 local) have decreed that this is inefficient and one should either do isolated shared memory or distributed memory so that we don't make our Gaussian users feel unloved. I'm skeptical. gerry Joseph Mack NA3T wrote: > I've googled the internet and searched the Beowulf archives > for "hybrid" || "multicore" and the only definitive statement I've found > is by Greg Lindahl, 17 Dec 2004 > > "Most of the folks interested in hybrid models a few years ago have now > given it up". > > I assume this was from the era of 2-way SMP nodes.
> > Multicore CPUs are being projected for 15yrs into the future (statement > by Pat Gelsinger, Intel's CTO, quoted in > http://cook.rfe.org/grid.pdf) > > I expect the programming model will be a little different > for single image machines like the Altix, than for beowulfs > where each node has its own kernel (and which I assume will > be running dual quadcore mobos). > > Still if a flat, one network model is used, all processes communicate > through the off-board networking. Someone with a quadcore machine, > running MPI on a flat network, told me that their application scales > poorly to 4 processors. Instead, if processes on cores within a package > were working on adjacent parts of the compute volume and communicated > through the on-board networking, then for a quadcore machine, the > off-board networking bandwidth requirement would drop by a factor of 4 > and scaling would improve. > > In a quadcore machine, if 4 OMP/threads processes are started on each > quadcore package, could they be rescheduled at the end of their > timeslice, on different cores arriving at a cold cache? On a large > single image machine, could a thread be scheduled on another node and > have to communicate over the off-board network? In a single image > machine (with a single address space) how does the OS know to malloc > memory from the on-board memory, rather than some arbitrary location (on > another board)? > > I expect everyone here knows all this. How is everyone going to program > the quadcore machines?
> > Thanks Joe -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From hahn at mcmaster.ca Fri Jun 15 06:02:13 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] programming multicore clusters In-Reply-To: <46727CDD.6040808@fft.be> References: <20070614055539.GA26746@bx9.net> <46727CDD.6040808@fft.be> Message-ID: >> Most MPI and OpenMP implementations lock processes to cores for this >> very reason. > > AFAICT this is not always the case. E.g. on systems with glibc, this > functionality (set_process_affinity and such) is only available starting from > libc-2.3.4. jan 2005 ;) > Mixing OpenMP and MPI in one and the same algorithm does indeed not generally > provide a big advantage. I'm curious why this would be. do you have examples or analysis? > scales. E.g. you can obtain a big boost when running an MPI-code where each > process performs local dgemm's for instance by using an OpenMP'd dgemm > implementation. This is an example where running mixed-mode makes a lot of > sense. if you take this approach, you'd do blocking to divide the node's work among threads, no? or would performance require that a thread's block fit in its private cache? if threads indeed do blocking, then the difference between hybrid and straight MPI approaches would mainly be down to time spent rearranging the matrices to set up for dgemm. or would the threaded part of the hybrid approach not do blocking? 
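The blocking Mark asks about can be made concrete with a tile-partitioned multiply, where each tile pair is an independent unit of work that a hybrid code would hand to an OpenMP thread inside the MPI process. A simplified, serial sketch with illustrative sizes (a production dgemm does far more, e.g. packing and cache-sized tiles):

```c
/* Sketch of block-partitioned C += A*B. Each (ib, jb) tile of C is an
 * independent unit of work; in a hybrid code the two outer loops would
 * carry a "#pragma omp parallel for collapse(2)", giving each OpenMP
 * thread its own tiles. N and BS are illustrative; N % BS == 0 assumed. */
#include <stddef.h>

enum { N = 8, BS = 4 }; /* matrix order and tile size */

void blocked_gemm(const double *A, const double *B, double *C)
{
    for (size_t ib = 0; ib < N; ib += BS)        /* tile row of C */
        for (size_t jb = 0; jb < N; jb += BS)    /* tile column of C */
            for (size_t kb = 0; kb < N; kb += BS)
                /* multiply one BS x BS tile pair, accumulating into C */
                for (size_t i = ib; i < ib + BS; i++)
                    for (size_t k = kb; k < kb + BS; k++)
                        for (size_t j = jb; j < jb + BS; j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}
```

Whether each thread's tile should fit in its private cache, as Mark wonders, is precisely the BS tuning question; the tiling itself is the same whether the tiles go to threads or stay in one rank.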
From toon.knapen at fft.be Fri Jun 15 06:17:16 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Wed Nov 25 01:06:06 2009 Subject: [Beowulf] RE: programming multicore clusters In-Reply-To: References: <288079.89270.qm@web55015.mail.re4.yahoo.com> Message-ID: <4672915C.4040405@fft.be> Mark Hahn wrote: > unless most of your IPC is this kind of async, unsync, passive data > reference, I wouldn't think twice: go MPI. the current media frenzy > about multicore systems (nothing new!) doesn't change the picture much. Because of everybody going multi-core, everybody is pushing to go multi-threading to exploit these architectures (e.g. the gaming-world and many more). IIUC you're saying that MPI might better exploit these architectures? Interesting POV! t From eugen at leitl.org Fri Jun 15 06:24:05 2007 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] RE: programming multicore clusters In-Reply-To: <4672915C.4040405@fft.be> References: <288079.89270.qm@web55015.mail.re4.yahoo.com> <4672915C.4040405@fft.be> Message-ID: <20070615132405.GG17691@leitl.org> On Fri, Jun 15, 2007 at 03:17:16PM +0200, Toon Knapen wrote: > Because of everybody going multi-core, everybody is pushing to go > multi-threading to exploit these architectures (e.g. the gaming-world > and many more). IIUC you're saying that MPI might better exploit these > architectures? Interesting POV! Why "interesting"? Do you disagree that message-passing is the way to go, since shared memory doesn't scale? 
-- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From toon.knapen at fft.be Fri Jun 15 06:46:19 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] programming multicore clusters In-Reply-To: References: <20070614055539.GA26746@bx9.net> <46727CDD.6040808@fft.be> Message-ID: <4672982B.8040800@fft.be> Mark Hahn wrote: >>> Most MPI and OpenMP implementations lock processes to cores for this >>> very reason. >> >> AFAICT this is not always the case. E.g. on systems with glibc, this >> functionality (set_process_affinity and such) is only available >> starting from libc-2.3.4. > > jan 2005 ;) But I'm sure most MPI-implementations are still available on linux-distributions that do not have libc-2.3.4 (or higher). > >> Mixing OpenMP and MPI in one and the same algorithm does indeed not >> generally provide a big advantage. > > I'm curious why this would be. do you have examples or analysis? Maybe my statement was not careful enough in wording. Basically I've never seen an implementation of an algorithm containing a mix of OpenMP and MPI benefit from this mix. > >> scales. E.g. you can obtain a big boost when running an MPI-code where >> each process performs local dgemm's for instance by using an OpenMP'd >> dgemm implementation. This is an example where running mixed-mode >> makes a lot of sense. > > if you take this approach, you'd do blocking to divide the node's work > among threads, no? or would performance require that a thread's block > fit in its private cache? if threads indeed do blocking, then the > difference between hybrid and straight MPI approaches would mainly be > down to time spent rearranging the matrices to set up for dgemm. > or would the threaded part of the hybrid approach not do blocking?
Indeed, every thread will work on its block so OpenMP and MPI approaches are alike. It is therefore interesting to compare e.g. the scalability of GotoBLAS (using OpenMP) to that of BLACS (using MPI). I have papers somewhere which show great scalability of GotoBLAS up to 8 threads. toon From toon.knapen at fft.be Fri Jun 15 06:53:23 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] RE: programming multicore clusters In-Reply-To: <20070615132405.GG17691@leitl.org> References: <288079.89270.qm@web55015.mail.re4.yahoo.com> <4672915C.4040405@fft.be> <20070615132405.GG17691@leitl.org> Message-ID: <467299D3.3010306@fft.be> "Interesting" because I found it a very enlightened argument/POV in this whole multi-core frenzy. I _certainly_ do not disagree; I do not know yet if I totally agree (see my mail on BLACS and GotoBLAS in this same thread). My mail was actually not really intended for the whole beowulf-ml. I just found it shockingly revealing and I think many multi-thread advocates would have a hard time responding to such a clear statement. t Eugen Leitl wrote: > On Fri, Jun 15, 2007 at 03:17:16PM +0200, Toon Knapen wrote: > >> Because of everybody going multi-core, everybody is pushing to go >> multi-threading to exploit these architectures (e.g. the gaming-world >> and many more). IIUC you're saying that MPI might better exploit these >> architectures? Interesting POV! > > Why "interesting"? Do you disagree that message-passing is the way to > go, since shared memory doesn't scale?
> From landman at scalableinformatics.com Fri Jun 15 06:57:08 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] RE: programming multicore clusters In-Reply-To: <4672915C.4040405@fft.be> References: <288079.89270.qm@web55015.mail.re4.yahoo.com> <4672915C.4040405@fft.be> Message-ID: <46729AB4.6000109@scalableinformatics.com> Toon Knapen wrote: > Mark Hahn wrote: > >> unless most of your IPC is this kind of async, unsync, passive data >> reference, I wouldn't think twice: go MPI. the current media frenzy >> about multicore systems (nothing new!) doesn't change the picture much. > > Because of everybody going multi-core, everybody is pushing to go > multi-threading to exploit these architectures (e.g. the gaming-world > and many more). IIUC you're saying that MPI might better exploit these > architectures? Interesting POV! Multicore has some interesting upsides. The downsides, oversubscription of memory bandwidth for the memory pipes out of the sockets, remind me of the days of larger SMP boxes with big busses in the early/mid 90s. First, shared memory is nice and simple as a programming model. Multicore suggests that shared memory should be very easy to exploit. You have to worry about contention, affinity, and everything else we used to have to worry about a decade ago with the big machines. Your precious resources that you need to optimize utilization of are no longer CPU cycles, but bandwidth. Second, MPI is a more complex model. It forces you to reconsider how the algorithm is mapped to the hardware. And it makes no assumptions about the hardware, at least in the API. In the implementation, it might be taught about multi-core, and optimizing communication within boxes via shm sockets, and between boxes by other methods. I think a few of the MPI toolkits do this today (Scali, Intel, OpenMPI, ...). Neither one of these modalities takes into account the fact that memory bandwidth is finite out of a socket.
Technically this is an implementation issue, but as we hit larger and larger core counts, some codes, well, larger fractions of the parallel code base, are likely to run into this resource contention issue. We were seeing contention for fabric interconnects (e.g. bus contention) with LAMMPS runs for a customer last year simply between single and dual core. It was significant enough that the customer opted for single core. This contention is not going to get better as you increase the number of cores. Since MPI does, in part, depend upon resources being contended for (interconnect), it is not at all clear to me that MPI will be the *best* choice for programming all the cores, though it certainly would be a simple choice. Greg is right when he notes that the hybrid model is a challenge. Unfortunately we appear to be facing a regime with multiple layers of hierarchies. So this will need resolution. You can create a globally "optimal" code via MPI, that may not be as efficient locally as you like, and will likely grow less so with more cores, or a locally optimal never-get-out-of-the-box code via shared memory. Shared memory scales nicely on NUMA machines, assuming 1-2 cores per memory controller. It won't/doesn't scale with 8 cores and one memory bus. How well does stream run on clovertown? NAS parallel? The issue is, at the end of the day, the contended-for resources.
Joe > > t > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From lindahl at pbm.com Fri Jun 15 11:46:43 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] programming multicore clusters In-Reply-To: <46727CDD.6040808@fft.be> References: <20070614055539.GA26746@bx9.net> <46727CDD.6040808@fft.be> Message-ID: <20070615184642.GA25305@bx9.net> On Fri, Jun 15, 2007 at 01:49:49PM +0200, Toon Knapen wrote: > AFAICT this is not always the case. E.g. on systems with glibc, this > functionality (set_process_affinity and such) is only available starting > from libc-2.3.4. Nearly every statement about Linux is untrue at some point in the past. > E.g. you can obtain a big boost when running an > MPI-code where each process performs local dgemm's for instance by using > an OpenMP'd dgemm implementation. This is an example where running > mixed-mode makes a lot of sense. First off, I see people using *threaded* DGEMM, not OpenMP. Second, I've never seen anyone show an actual benefit -- can you name an example? i.e. "for N=foo, I get a 13% speedup on..." -- greg From lindahl at pbm.com Fri Jun 15 11:49:36 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] programming multicore clusters In-Reply-To: <46728D68.9000503@tamu.edu> References: <46728D68.9000503@tamu.edu> Message-ID: <20070615184936.GB25305@bx9.net> On Fri, Jun 15, 2007 at 08:00:24AM -0500, Gerry Creager wrote: > For the forseeable future, I'm not developing much but will use the > hybrid SMP/DM capabilities in WRF. 
Takes advantage of SMP availability, > and supports message passing between SMP nodes. I've not used this > capability for benchmarking but it appears to offer significant gains. Gerry, why do you think WRF is going to behave any different from MM5 in this aspect? MM5 slows down in hybrid mode. MPI "takes advantage" of SMP. In fact, on big SMPs like Origin and Altix, MPI programs are often faster than shared memory or OpenMP programs, unless you've done a lot of work on the OpenMP program to improve locality. Why? Because you're forced to worry about locality in an MPI program all the time. > As we get more hybrid HPC capabilities planning for this will be more > important. I predict a lot of people will waste a lot of time without seeing a benefit. -- greg From gerry.creager at tamu.edu Fri Jun 15 12:10:11 2007 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] programming multicore clusters In-Reply-To: <20070615184936.GB25305@bx9.net> References: <46728D68.9000503@tamu.edu> <20070615184936.GB25305@bx9.net> Message-ID: <4672E413.2090204@tamu.edu> Greg Lindahl wrote: > On Fri, Jun 15, 2007 at 08:00:24AM -0500, Gerry Creager wrote: > >> For the forseeable future, I'm not developing much but will use the >> hybrid SMP/DM capabilities in WRF. Takes advantage of SMP availability, >> and supports message passing between SMP nodes. I've not used this >> capability for benchmarking but it appears to offer significant gains. > > Gerry, why do you think WRF is going to behave any different from MM5 > in this aspect? MM5 slows down in hybrid mode. MPI "takes advantage" > of SMP. In fact, on big SMPs like Origin and Altix, MPI programs are > often faster than shared memory or OpenMP programs, unless you've done > a lot of work on the OpenMP program to improve locality. Why? Because > you're forced to worry about locality in an MPI program all the time. Potentially false advertising. 
NCAR/MMM are of the opinion that it runs faster. I've not tested it yet... >> As we get more hybrid HPC capabilities planning for this will be more >> important. > > I predict a lot of people will waste a lot of time without seeing a > benefit. Likely correct. gerry -- Gerry Creager -- gerry.creager@tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From lindahl at pbm.com Fri Jun 15 12:16:38 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] RE: programming multicore clusters In-Reply-To: <46729AB4.6000109@scalableinformatics.com> References: <288079.89270.qm@web55015.mail.re4.yahoo.com> <4672915C.4040405@fft.be> <46729AB4.6000109@scalableinformatics.com> Message-ID: <20070615191638.GC25305@bx9.net> On Fri, Jun 15, 2007 at 09:57:08AM -0400, Joe Landman wrote: > First, shared memory is nice and simple as a programming model. Uhuh. You know, there are some studies going where students learning parallel programming do the same algorithm with MPI and with shared memory. Would you like to make a bet as to whether they found shared memory much easier? > In the implementation, it > might be taught about multi-core, and optimizing communication within > boxes via shm sockets, and between boxes by other methods. I think a > few of the MPI toolkits do this today (Scali, Intel, OpenMPI, ...). "a few" should be "almost all". 
-- greg From landman at scalableinformatics.com Fri Jun 15 12:44:17 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] RE: programming multicore clusters In-Reply-To: <20070615191638.GC25305@bx9.net> References: <288079.89270.qm@web55015.mail.re4.yahoo.com> <4672915C.4040405@fft.be> <46729AB4.6000109@scalableinformatics.com> <20070615191638.GC25305@bx9.net> Message-ID: <4672EC11.7080804@scalableinformatics.com> Greg Lindahl wrote: > On Fri, Jun 15, 2007 at 09:57:08AM -0400, Joe Landman wrote: > >> First, shared memory is nice and simple as a programming model. > > Uhuh. You know, there are some studies going where students learning > parallel programming do the same algorithm with MPI and with shared > memory. Would you like to make a bet as to whether they found shared > memory much easier? I don't know which "studies" you are referring to. Having taught multiple graduate level courses on MPI/OpenMP programming, I can tell you what I observed from my students. They largely just "get" OpenMP. It won't get them great overall performance, as there aren't many large multiprocessor SMPs around for them to work on. Be that as it may, they had little problem developing good code. Compare this to MPI, and these same exact students had a difficult time of it. 
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From James.P.Lux at jpl.nasa.gov Fri Jun 15 15:51:32 2007 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] RE: programming multicore clusters In-Reply-To: <20070615191638.GC25305@bx9.net> References: <288079.89270.qm@web55015.mail.re4.yahoo.com> <4672915C.4040405@fft.be> <46729AB4.6000109@scalableinformatics.com> <20070615191638.GC25305@bx9.net> Message-ID: <6.2.3.4.2.20070615154746.03189f50@mail.jpl.nasa.gov> At 12:16 PM 6/15/2007, Greg Lindahl wrote: >On Fri, Jun 15, 2007 at 09:57:08AM -0400, Joe Landman wrote: > > > First, shared memory is nice and simple as a programming model. > >Uhuh. You know, there are some studies going where students learning >parallel programming do the same algorithm with MPI and with shared >memory. Would you like to make a bet as to whether they found shared >memory much easier? only if they have a test and set/semaphore mechanism provided. One thing I find that's nice about message passing, as a conceptual model, is that the concept of "simultaneity" cannot exist. There's always some finite time between the data existing in place A and the same data existing in place B. So, if you program in a message passing model, you have to explicitly think about such things. Right from the start you have to deal with issues like "ships passing in the night", and that's a big hurdle from the "one process on one giant block of memory" that most things start out with. Shared Memory is sort of a parallel hardware implementation of the multiple threads in a classic single CPU multithreaded kernel. Instead of context switching all the time, each processor keeps its own context. James Lux, P.E. 
Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From toon.knapen at fft.be Sat Jun 16 05:36:17 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] programming multicore clusters In-Reply-To: <20070615184642.GA25305@bx9.net> References: <20070614055539.GA26746@bx9.net> <46727CDD.6040808@fft.be> <20070615184642.GA25305@bx9.net> Message-ID: <4673D941.8060208@fft.be> Greg Lindahl wrote: > On Fri, Jun 15, 2007 at 01:49:49PM +0200, Toon Knapen wrote: > >> AFAICT this is not always the case. E.g. on systems with glibc, this >> functionality (set_process_affinity and such) is only available starting >> from libc-2.3.4. > > Nearly every statement about Linux is untrue at some point in the > past. Indeed, this is true for every system that is still in development. But as I responded to Mark Hahn, there are still many linux distributions deployed that have libc-2.3.3 or older. I guess your products (I had a quick look but could not find the info directly) are also still supporting linux distributions with libc-2.3.3 or older. > >> E.g. you can obtain a big boost when running an >> MPI-code where each process performs local dgemm's for instance by using >> an OpenMP'd dgemm implementation. This is an example where running >> mixed-mode makes a lot of sense. > > First off, I see people using *threaded* DGEMM, not OpenMP. I did not differentiate between these two in my previous mail because to me it's an implementation issue. Both come down to using multiple threads. > Second, > I've never seen anyone show an actual benefit -- can you name an > example? i.e. "for N=foo, I get a 13% speedup on..." We have benchmarked our code with using multiple BLAS implementations and so far GotoBLAS came out as a clear winner. 
Next we tested GotoBLAS using 1, 2, and 4 threads and depending on the linear solver (of which one is http://graal.ens-lyon.fr/MUMPS/) we had a speedup of between 30% and 70% when using 2 or 4 threads. The scalability of GotoBLAS with respect to the number of threads is actually much better. But of course when integrated in a solver, the speedup is strongly dependent on the size of the matrices being passed to BLAS: the larger the better. toon From richard.walsh at comcast.net Sat Jun 16 06:44:10 2007 From: richard.walsh at comcast.net (richard.walsh@comcast.net) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] RE: programming multicore clusters Message-ID: <061620071344.12885.4673E92A00060B8F000032552200750438089C040E99D20B9D0E080C079D@comcast.net> From: Joe Landman > > > Greg Lindahl wrote: > > On Fri, Jun 15, 2007 at 09:57:08AM -0400, Joe Landman wrote: > > > >> First, shared memory is nice and simple as a programming model. > > > > Uhuh. You know, there are some studies going where students learning > > parallel programming do the same algorithm with MPI and with shared > > memory. Would you like to make a bet as to whether they found shared > > memory much easier? > > I don't know which "studies" you are referring to. Having taught > multiple graduate level courses on MPI/OpenMP programming, I can tell > you what I observed from my students. They largely just "get" OpenMP. > It won't get them great overall performance, as there aren't many large > multiprocessor SMPs around for them to work on. Be that as it may, they > had little problem developing good code. Compare this to MPI, and these > same exact students had a difficult time of it. We did a study at the AHPCRC attempting to measure the "ease of programming" of MPI versus UPC/CAF. Having observed how it was done, the mix of experience in the group looked at, and noting the complexity of measuring "ease of programming", I would say the conclusions drawn were of nearly no value.
Explicitness (MPI) tends to force one to think more carefully about the potential pitfalls and complexities of the coding problem (in some cases delivering better code), while slowing you down in the short-run. Implicitness (UPC, CAF, OpenMP) tends to speed the initial development of the code, while allowing more novice programmers to make both parallel programming and performance errors. This tendency is reflected in the design ideas behind UPC (more implicit shared memory references) and CAF (more explicit shared memory references). While both are small footprint, I tend to like the CAF model better, which reminds the programmer of every remote reference with a square bracket at the end of its co-array expressions (Raffiniert ist der Herr CAF, aber boshaft ist er nicht, i.e. "subtle is Herr CAF, but malicious he is not" ... ;-) ...) I might add a point beyond ease-of-use related to granularity ... coding some algorithms that have a natural fine-grained-ness can be prevented entirely by the cumbersomeness of explicit message-passing models. The algorithmic flexibility provided by small-footprint shared memory and PGAS models can be a liberating experience for the programmer, just like a very good symbolism can be in mathematics. Of course, across the spectrum of commodity resources OpenMP does not scale, and UPC and CAF do not yet equal the performance of well-written MPI code. Although it would seem that much MPI code is not that "well-written". As to how parallel programming will evolve in this context I think that my signature quote below is relevant. Regards, rbw -- "Making predictions is hard, especially about the future." Niels Bohr -- Richard Walsh Thrashing River Consulting-- 5605 Alameda St. Shoreview, MN 55126 From xclski at yahoo.com Fri Jun 15 11:12:02 2007 From: xclski at yahoo.com (Ellis Wilson) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] Diskless booting - NIC BIOS Message-ID: <374241.82317.qm@web37906.mail.mud.yahoo.com> Hi all.
For about a year now I've been interested in cluster computing following my reading of rgb's lengthier online text (much thanks). After a short email to him some time ago, he directed me to this list which I have been enjoying for about two months now. My question arises at a level which may (or may not) frustrate some of the professionals on this list, but as it is, I am in college and have very small means. My apologies in advance :). I have accumulated some computers from friends and the like and am interested in mounting them in a rack I've designed, however, some completely lack onboard ethernet ports (showing their age). This fact completely voids the possibility of them having a network-boot-capable BIOS, and forces me to research NICs which enable this function. Shopping around at my favorite online store I am having a bit of difficulty pinpointing NICs which would do this. Is there a key term I am overlooking or a simpler solution to this issue (no, I really can't just go and buy 8 new motherboards, though I'd love to)? Regards, Ellis Wilson -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070615/ef9fabf1/attachment.html From consultrmann at yahoo.com Sat Jun 16 06:00:31 2007 From: consultrmann at yahoo.com (Richard Mann) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] Cluster Howto Message-ID: <824524.67092.qm@web55104.mail.re4.yahoo.com> Hello, I just read your comment on "[Beowulf] FreeBSD 6.1 and single system image -- http://www.beowulf.org/archive/2006-July/015983.html". I've been looking all over for a 'working' example of FreeBSD w/cluster configs. However, I'm more interested in a "massive" storage config. Is this possible? If so, do you have any suggestions on what software/modules I should use to accomplish this?
Thanks in advance! --Rich -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070616/7e77d485/attachment.html From galtons at aecl.ca Sat Jun 16 22:20:55 2007 From: galtons at aecl.ca (Galton, Simon) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] Concurrently open sockets limit on Linux system Message-ID: Folks, I ask here in anticipation that somebody out there in cluster-land has run across this limitation and can advise me on moving past it... We are looking at an application which uses a proprietary license manager. The client connects to the license manager at start time and (according to lsof on the system running the license manager) seems to hold a socket open during the duration of the job. We want to run a couple of hundred of these jobs on our cluster, but after job 126 the client can no longer connect to the license manager. The server hosting the license manager is otherwise fine, and you can continue to perform network-based operations on and against it... The vendor feels that they have not coded a specific limit; I'm wondering if it's file descriptors or somesuch. I raised the limit of FDs on the system to 65000+ and verified that the change took effect; no change to the application's behaviour. It's a C-based app, compiled with gcc on a Fedora Core 6 system, I believe. Any thoughts? Simon CONFIDENTIAL AND PRIVILEGED INFORMATION NOTICE This e-mail, and any attachments, may contain information that is confidential, subject to copyright, or exempt from disclosure. Any unauthorized review, disclosure, retransmission, dissemination or other use of or reliance on this information may be unlawful and is strictly prohibited.
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070617/021f033a/attachment.html From landman at scalableinformatics.com Sun Jun 17 11:52:03 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] Concurrently open sockets limit on Linux system In-Reply-To: References: Message-ID: <467582D3.8020407@scalableinformatics.com> Hi Simon. Galton, Simon wrote: > (according to lsof on the system running the license manager) seems to hold > a socket open during the duration of the job. > > We want to run a couple of hundred of these jobs on our cluster, but > after job 126 the client can no longer connect to the license manager. This seems to suggest that somewhere there is a 128-socket limit on the service. How is the license manager run? What does limit or ulimit -a tell you for the user that the license manager runs as? Any notes in system logs? > The server hosting the license manager is otherwise fine, and you can > continue to perform network-based operations on and against it... > > The vendor feels that they have not coded a specific limit; I'm > wondering if it's file descriptors or somesuch. I raised the limit of > FDs on the system to 65000+ and verified that the change took effect; no > change to the application's behaviour. It's a C-based app, compiled with > gcc on a Fedora Core 6 system, I believe. Doesn't sound like an FD limit, but rather a socket limit. Older inetd's had something like 20-40 connections per process.
If you are serving this through xinetd or similar, you might be able to raise these limits.

Joe

> Any thoughts?
>
> Simon

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615

From hahn at mcmaster.ca Sun Jun 17 12:19:55 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] Concurrently open sockets limit on Linux system
In-Reply-To: References: Message-ID:

> The vendor feels that they have not coded a specific limit; I'm wondering if

in cases like this, I tend to do things like replace the license manager by a script that first does "ulimit -a" before execing the actual program.
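Such a wrapper can be sketched as follows. This is only a hedged illustration, not anything from the thread itself: the daemon path /opt/lm/lmgrd and the log location are invented placeholders, and Python is used purely for brevity (the original suggestion was a shell script around a C binary).

```python
"""Wrapper sketch: record the resource limits a daemon will inherit,
then exec the real binary in place.  All paths are hypothetical."""
import os
import resource

# Limits most likely to cap the number of concurrent client connections.
INTERESTING = [
    ("RLIMIT_NOFILE", resource.RLIMIT_NOFILE),  # max open file descriptors
    ("RLIMIT_NPROC", resource.RLIMIT_NPROC),    # max processes/threads
]

def snapshot_limits():
    """Return {name: (soft, hard)} for the limits listed above."""
    return dict((name, resource.getrlimit(res)) for name, res in INTERESTING)

def wrap_and_exec(daemon_path, argv, logfile="/tmp/lm-limits.log"):
    """Append the inherited limits to logfile, then exec the daemon."""
    with open(logfile, "a") as log:
        for name, (soft, hard) in sorted(snapshot_limits().items()):
            log.write("%s soft=%r hard=%r\n" % (name, soft, hard))
    os.execv(daemon_path, argv)  # replaces this process; never returns

# Usage (hypothetical): wrap_and_exec("/opt/lm/lmgrd", ["/opt/lm/lmgrd"])
```

If the log shows a soft RLIMIT_NOFILE far below the 65000+ set system-wide, the daemon is not inheriting the raised limit -- a common surprise when it is started via su or an init script.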
if the license manager is being execed through su to run unprivileged, for instance, it's not always obvious whether some ulimit is in effect. similarly, I often cut to the chase and run such a daemon under strace, to see what it's doing that fails. 128 clients is such a low number that it doesn't sound like something more exotic, such as an ephemeral-port-range limit. 128 is remarkably low, though - you'd expect a multi-connection daemon to burn one fd per connection, but even a very desktop-y setting of NOFILE to 1024 would imply that the daemon is keeping ~8 fd's open per connection.

> it's file descriptors or somesuch. I raised the limit of FDs on the system
> to 65000+ and verified that the change took effect; no change to the

the system-wide (/proc/sys) setting is not likely to be the issue.

> CONFIDENTIAL AND PRIVILEGED INFORMATION NOTICE
>
> This e-mail, and any attachments, may contain information that

are you aware that this nonsense has _no_ legal standing?

regards, mark hahn.

From brian.dobbins at yale.edu Sun Jun 17 13:30:57 2007
From: brian.dobbins at yale.edu (Brian Dobbins)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] Diskless booting - NIC BIOS
In-Reply-To: <374241.82317.qm@web37906.mail.mud.yahoo.com> References: <374241.82317.qm@web37906.mail.mud.yahoo.com> Message-ID: <46759A01.5080203@yale.edu>

Hi Ellis,

I wasn't sure from your post whether you meant the nodes had /no/ network whatsoever, or simply no capabilities for network booting from the NICs in the system. If it's the latter, and assuming these systems have a floppy drive, I'd suggest looking into using the Etherboot software to handle network booting. No need to spend extra money.
:) The webpages will explain more, but essentially (from memory - it's been a while!), if you set up a DHCP / TFTP server for the images somewhere on the network, just create a boot floppy with the correct network drivers for the node, stick it in, power on, and provided the DHCP/TFTP servers are correctly configured, the node should boot up, initialize the network, send out a request to the DHCP server, and then (from the information handed back), request a boot image from the TFTP server. To create this boot floppy, you can probably just visit the Rom-O-Matic page ( http://rom-o-matic.net/ ) and select the type of card you have, but definitely read over the Etherboot documentation, too ( http://www.etherboot.org/ ). If you're not certain what type of card is in the nodes, I'd suggest putting a Knoppix CD in, booting up, starting the network, and then listing the modules that are loaded - the network drivers should be in that list. If you get stuck, drop me a note and I'll be glad to try to walk you through it - I'm a pack rat, and probably still have all the old configuration files from when I last did this, too. Finally, in terms of the DHCP/TFTP management, are you handling that by yourself, or using some already-written package? The initial cluster that I used Etherboot on used the Warewulf package - I'd recommend you take a look at it, too. The webpage is ( http://www.warewulf-cluster.org/ ). The guy developing it, Greg Kurtzer, is really helpful, too, so if you get stuck in that stage of things, you won't pull out all your hair in frustration. Good luck! - Brian (Naturally, RGB also helped me out in the past -- anyone know if there is some analogue in the Beowulf realm to the Erdos number for RGB? I can't imagine there's anyone he /hasn't/ helped!) 
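The boot flow Brian describes (node broadcasts a DHCP request, gets an address plus a boot-image name, then pulls the image over TFTP) is driven mostly by the DHCP server's configuration. A rough ISC dhcpd.conf sketch follows; every address, MAC, and filename here is an invented placeholder, and the .nbi image name assumes an Etherboot-style tagged image (e.g. built with mknbi) rather than PXE's pxelinux.0:

```
# Hypothetical dhcpd.conf fragment for network-booting one node.
subnet 192.168.1.0 netmask 255.255.255.0 {
    range 192.168.1.100 192.168.1.200;
    option routers 192.168.1.1;
    next-server 192.168.1.1;        # the TFTP server holding boot images
    filename "vmlinuz-node.nbi";    # Etherboot tagged image (placeholder)

    host node01 {
        hardware ethernet 00:11:22:33:44:55;   # made-up MAC
        fixed-address 192.168.1.101;
    }
}
```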
Brian Dobbins Yale Engineering HPC

From jeff.johnson at wsm.com Sun Jun 17 12:26:33 2007
From: jeff.johnson at wsm.com (Jeff Johnson)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] Diskless booting - NIC BIOS
Message-ID: <46758AE9.90707@wsm.com>

Ellis,

The term you want to look for is PXE (Preboot Execution Environment). Most Intel and Broadcom PCI NICs will support this. There are other ways, using open-source BIOS images for some cards that use firmware images produced by the Etherboot project (www.etherboot.org). They may have some bootable firmware images for older cards you already own that are not PXE capable.

Simply put: set up your machines with PXE-capable cards or NICs using Etherboot, have a DHCP server that will hand out address leases and a TFTP server that will offer network boot kernels and ramdisk images, and you should be up and running pretty quickly. There are some cluster environments that provide this in a nicely bundled image:
Warewulf - www.perceus.org/portal/project/warewulf
OSCAR - oscar.openclustergroup.org
Rocks - www.rocksclusters.org

--Jeff

-- Best Regards, Jeff Johnson Vice President Engineering/Technology Western Scientific, Inc jeff.johnson@wsm.com http://www.wsm.com 5444 Napa Street - San Diego, CA 92110 Tel 800.443.6699 +001.619.220.6580 Fax +001.619.220.6590 "Braccae tuae aperiuntur"

From galtons at aecl.ca Sun Jun 17 14:30:25 2007
From: galtons at aecl.ca (Galton, Simon)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] Concurrently open sockets limit on Linux system
Message-ID:

Joe - good ideas! As a debugging step we even ran the LM daemon as root; ulimit -a reports a file limit of 65535... It's a permanently running daemon, not going through xinetd, and it, annoyingly, produces no logs and sends nothing to syslog. Thanks for your thoughts...
Simon

-----Original Message-----
From: Joe Landman [mailto:landman@scalableinformatics.com]
Sent: 2007 Jun 17 2:52 PM
To: Galton, Simon
Cc: 'beowulf@beowulf.org'
Subject: Re: [Beowulf] Concurrently open sockets limit on Linux system

This seems to suggest somewhere that there is a 128 socket limit on the service. How is the license manager run? What does limit or ulimit -a tell you for the user that the license manager runs as? Any notes in system logs? Doesn't sound like an FD limit, but as a socket limit. Older inetd's had something like 20-40 connections per process. If you are serving this through xinetd or similar, you might be able to raise these limits.

Joe
From galtons at aecl.ca Sun Jun 17 14:37:46 2007
From: galtons at aecl.ca (Galton, Simon)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] Concurrently open sockets limit on Linux system
Message-ID:

Mark -- more good thoughts! Ulimit doesn't seem to be an issue here, and the darn thing is definitely only holding open one connection per client (according to lsof, that is).

Interestingly, /proc/sys/fs/file-nr reports something odd:

cat /proc/sys/fs/file-nr
1605 0 65535

This suggests that there are 0 free allocated file descriptors. I'm not clear on the implications here. I can certainly continue to open files on this box, and make new remote connections via ssh...

Simon

-----Original Message-----
From: Mark Hahn [mailto:hahn@mcmaster.ca]
Sent: 2007 Jun 17 3:20 PM
To: Galton, Simon
Cc: 'beowulf@beowulf.org'
Subject: Re: [Beowulf] Concurrently open sockets limit on Linux system

in cases like this, I tend to do things like replace the license manager by a script that first does "ulimit -a" before execing the actual program. similarly, I often cut to the chase and run such a daemon under strace, to see what it's doing that fails.

From hahn at mcmaster.ca Sun Jun 17 22:48:36 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] Concurrently open sockets limit on Linux system
In-Reply-To: References: Message-ID:

> Interestingly, /proc/sys/fs/file-nr reports something odd:
> cat /proc/sys/fs/file-nr
> 1605 0 65535
>
> This suggests that there are 0 free allocated file descriptors. I'm not

according to Documentation/filesystems/proc.txt on a recentish kernel, 2.6 always reports a zero there. the field is a left-over from previous implementations which kept a fh-specific pool around.

> clear on the implications here.
> I can certainly continue to open files on this box, and make new remote
> connections via ssh...

I would strace the license daemon...

From rgb at phy.duke.edu Mon Jun 18 03:24:57 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] Diskless booting - NIC BIOS
In-Reply-To: <374241.82317.qm@web37906.mail.mud.yahoo.com> References: <374241.82317.qm@web37906.mail.mud.yahoo.com> Message-ID:

On Fri, 15 Jun 2007, Ellis Wilson wrote:

Hey, Ellis.

IIRC, these days it is fairly difficult to get a NIC that does NOT support PXE booting (very different from the case four or five years ago, when many still did not and you had to pay a premium price for it). But if you look up any specific card on its manufacturer's website, it should tell you under the product's technical specifications. Name-brand cards -- e.g. 3com or Intel -- are nearly a sure thing. Or if it is a really cheap ($10) no-name you can just buy one and try it out. If it PXE boots, you will usually be given a network boot option that "appears" at boot time. A REALLY old motherboard might have difficulty with this, but most should not.

rgb

-- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email: rgb@phy.duke.edu

From tmalas at ee.bilkent.edu.tr Mon Jun 18 09:23:04 2007
From: tmalas at ee.bilkent.edu.tr (Tahir Malas)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] RE: [mvapich-discuss] Two problems related to slowness and TASK_UNINTERRUPTABLE process
In-Reply-To: <466EB70D.2000306@cse.ohio-state.edu> References: <01ae01c7acc2$dfa8e810$d80cb38b@bs> <466EB70D.2000306@cse.ohio-state.edu> Message-ID: <00da01c7b1c4$f2e2ba80$d80cb38b@bs>

Hi Sayantan,

We have installed OFED 1.2, and our two problems have gone!
Now there are neither suspended processes nor inconsistent communication times:

PACKAGE SIZE 512 BYTES: 1.76
PACKAGE SIZE 4096 BYTES: 13.83

Our earlier tests with OFED 1.1 gave 512: 29.434 and 4096: 16.209.

Thanks and regards, Tahir Malas Bilkent University Electrical and Electronics Engineering Department Phone: +90 312 290 1385

> -----Original Message-----
> From: Sayantan Sur [mailto:surs@cse.ohio-state.edu]
> Sent: Tuesday, June 12, 2007 6:09 PM
> To: Tahir Malas
> Cc: mvapich-discuss@cse.ohio-state.edu; beowulf@beowulf.org;
> teoman.terzi@gmail.com; 'Ozgur Ergul'
> Subject: Re: [mvapich-discuss] Two problems related to slowness and
> TASK_UNINTERRUPTABLE process
>
> Hi Tahir,
>
> Thanks for sharing this data and your observations. It is interesting.
> We have a more recent release, MVAPICH-0.9.9 which is available from our
> website (mvapich.cse.ohio-state.edu) as well as with the OFED-1.2
> distribution. Could you please try out our newer release and see if the
> results change/remain the same?
>
> Thanks,
> Sayantan.
>
> Tahir Malas wrote:
> > Hi all,
> > We have an 8-node dual quad-core HP cluster connected via Infiniband. We use
> > Voltaire DDR cards and a 24-port switch. We also use OFED 1.1 and MVAPICH
> > 0.9.7. We have two interesting problems that we could not overcome yet:
> >
> > 1. In our test program which mimics the communications in our code, the
> > nodes are paired as follows: (0 and 1), (2 and 3), (4 and 5), (6 and 7). We
> > perform one-to-one communications between these pairs of nodes
> > simultaneously. We use blocking MPI send and receive commands to communicate
> > an integer array of various sizes. In addition, we consider different
> > numbers of processes:
> > (a) 1 process per node, 8 processes overall: One link is established between
> > the pairs of nodes.
> > (b) 2 processes per node, 16 processes overall: Two links are established
> > between the pairs of nodes.
> > (c) 4 processes per node, 32 processes overall: Four links are established
> > between the pairs of nodes.
> > (d) 8 processes per node, 64 processes overall: Eight links are established
> > between the pairs of nodes.
> >
> > We obtain logical timings, except for the following interesting comparison:
> >
> > For 32 processes (4 processes per node), the 512-byte arrays are
> > communicated more slowly than the 4096-byte arrays. For both of them, we
> > send/receive 1,000,000 arrays and take the average to find the time per
> > package. Only the package size changes. We have made many trials and confirmed
> > that this abnormal case is persistent. More specifically, communication of
> > 4096-byte packages is 2 times faster than communication of 512-byte
> > packages.
> >
> > The OSU bandwidth and latency tests around these points show:
> >
> > Bytes   Bandwidth (MB/s)
> > 256     417.53
> > 512     592.34
> > 1024    691.02
> > 2048    857.35
> > 4096    906.04
> > 8192    1022.52
> >
> > Bytes   Latency (usec)
> > 256     4.79
> > 512     5.48
> > 1024    6.60
> > 2048    8.30
> > 4096    11.02
> >
> > So this behavior does not seem reasonable to us.
> >
> > 2. SOMETIMES, after the test with 32 processes overall, one of the four
> > processes at node3 hangs in TASK_UNINTERRUPTABLE "D" state. Hence, the test
> > program shows a "done." and waits for some time. We can neither kill the
> > process nor soft-reboot the node. We have to wait for that process to
> > terminate, which can take a long time.
> > Does anybody have some comments on these issues?
> > Thanks in advance,
> > Tahir Malas
> > Bilkent University
> > Electrical and Electronics Engineering Department
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss@cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
> --
> http://www.cse.ohio-state.edu/~surs

From lindahl at pbm.com Mon Jun 18 17:00:50 2007
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] programming multicore clusters
In-Reply-To: <4673D941.8060208@fft.be> References: <20070614055539.GA26746@bx9.net> <46727CDD.6040808@fft.be> <20070615184642.GA25305@bx9.net> <4673D941.8060208@fft.be> Message-ID: <20070619000050.GA16368@bx9.net>

> Indeed, this is true for every system that is still in development.
> But as I responded to Mark Hahn, there are still many linux
> distributions deployed that have libc-2.3.3 or older. I guess your
> products (I had a quick look but could not find the info directly) are
> also still supporting linux distributions with libc-2.3.3 or older.

My memory is that older versions of x86_64 libc have a different set of affinity functions (different # of args). PathScale supported both.

> >First off, I see people using *threaded* DGEMM, not OpenMP.
>
> I did not differentiate between these two in my previous mail because to
> me it's an implementation issue. Both come down to using multiple threads.

It's extremely inconvenient to express an efficient DGEMM in OpenMP, just like it's pretty inconvenient to express an efficient serial DGEMM. So you won't find anyone using an OpenMP DGEMM. You can call everything in the universe an implementation issue if you like.

> We have benchmarked our code using multiple BLAS implementations
> and so far GotoBLAS came out as a clear winner.
> Next we tested GotoBLAS using 1, 2 and 4 threads and, depending on the
> linear solver (of which one is http://graal.ens-lyon.fr/MUMPS/), we had
> a speedup of between 30% and 70% when using 2 or 4 threads.

Sorry, did you compare against a pure MPI implementation? For example the HPL code can run either way, so it's easy to compare. But if you're comparing a serial code to a threaded code, it's no surprise that the threaded code can be faster, especially solving a problem which is not memory intensive. In fact I'd expect an even bigger win than 1.7X; perhaps you aren't using Opterons ;-)

-- greg

From xclski at yahoo.com Mon Jun 18 17:26:43 2007
From: xclski at yahoo.com (Ellis Wilson)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] Diskless booting - NIC BIOS
In-Reply-To: <46759A01.5080203@yale.edu> Message-ID: <649380.30815.qm@web37914.mail.mud.yahoo.com>

Thanks Brian, Matt, and rgb,

The floppy idea is great (I think I remember now reading about it in rgb's book, but had forgotten), and I certainly will look into that. The motherboards are in some cases years and years old; one computer I'm deciding whether I'll use or not does have a 400 MHz processor in it, so their age is sufficient to make me worry.

In response to how I'm handling "DHCP/TFTP management", I am ridiculously interested in knowing how everything works, so I install each computer using a method that follows guidelines similar to Linux From Scratch. I'll likely be using dhcpcd for DHCP and I'm currently looking into TFTP options. If there is a suggested program of choice among you all, please feel free to let me know. I opt for the Linux From Scratch route, one, because I hate going to class and thereby have plenty of time on my hands (hah), and two, because the majority of computers I deal with are old and need all the free resources they can get.

Again, much thanks to those who helped.
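On the TFTP question Ellis raises: on Fedora-era systems the stock tftp-server package ran in.tftpd out of xinetd, so "choosing a TFTP server" mostly meant enabling that service. A sketch of the xinetd entry follows; the /tftpboot directory is an assumption (point -s at wherever the boot images actually live):

```
# Hypothetical /etc/xinetd.d/tftp entry (tftp-server / tftpd-hpa style).
service tftp
{
        socket_type     = dgram
        protocol        = udp
        wait            = yes
        user            = root
        server          = /usr/sbin/in.tftpd
        server_args     = -s /tftpboot
        disable         = no
}
```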
Ellis

Brian Dobbins wrote:
> Hi Ellis,
>
> I wasn't sure from your post whether you meant the nodes had /no/
> network whatsoever, or simply no capabilities for network booting from
> the NICs in the system. If it's the latter, and assuming these systems
> have a floppy drive, I'd suggest looking into using the Etherboot
> software to handle network booting. No need to spend extra money. :)

From becker at scyld.com Tue Jun 19 02:25:17 2007
From: becker at scyld.com (Donald Becker)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] Reminder -- No BayBUG meeting this month
Message-ID:

Just a reminder that there is no west coast BayBUG meeting this month. Stay tuned (or volunteer!) for the fall speaker schedule...

-- Donald Becker becker@scyld.com Penguin Computing / Scyld Software www.penguincomputing.com www.scyld.com Annapolis MD and San Francisco CA

From zahirt at cs.rmit.edu.au Tue Jun 19 00:34:30 2007
From: zahirt at cs.rmit.edu.au (zahir @RMIT CS)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] extension of OTM 07 Deadlines...
Message-ID:

Dear colleagues, Our apologies if you have received a similar message more than once. We just wanted to let you know that the deadlines of the OTM 07 Federated Conferences (involving the DOA, CoopIS, ODBASE, IS and GADA international conferences) have been extended.
The new deadlines are:
Abstract submission: 21st June, 2007
Paper submission: 27th June, 2007

For any information, please consult the following URL: http://www.cs.rmit.edu.au/fedconf

Regards, Zahir & Robert OTM 2007 General Co-Chairs

======================================================================
OTM 2007 Federated Conferences - Call For Papers
November 25 - November 30, 2007 Vilamoura, Algarve, Portugal

BRIEF OVERVIEW "OnTheMove (OTM) to Meaningful Internet Systems and Ubiquitous Computing" co-locates five successful related and complementary conferences:
- International Symposium on Distributed Objects and Applications (DOA'07)
- International Conference on Ontologies, Databases and Applications of Semantics (ODBASE'07)
- International Conference on Cooperative Information Systems (CoopIS'07)
- International Symposium on Grid computing, high-performAnce and Distributed Applications (GADA'07)
- International Symposium on Information Security (IS'07)

Each conference covers multiple research vectors, viz. theory (e.g. underlying formalisms), conceptual (e.g. technical designs and conceptual solutions) and applications (e.g. case studies and industrial best practices). All five conferences share the scientific study of the distributed, conceptual and ubiquitous aspects of modern computing systems, and share the resulting application-pull created by the WWW.

IMPORTANT DATES:
- Abstract Submission Deadline June 21, 2007
- Paper Submission Deadline June 27, 2007
- Acceptance Notification August 22, 2007
- Camera Ready Due September 10, 2007
- Registration Due September 10, 2007
- OTM Conferences November 25 - 30, 2007

PROGRAM COMMITTEE CHAIRS
CoopIS PC Co-Chairs (coopis2007@cs.rmit.edu.au)
- Francisco Curbera, IBM, USA
- Frank Leymann, University of Stuttgart, Germany
- Mathias Weske, University of Potsdam, Germany
DOA PC Co-Chairs (doa2007@cs.rmit.edu.au)
- Pascal Felber, Université de Neuchâtel, Switzerland
- Aad van Moorsel, Newcastle University, UK
- Calton Pu, Georgia Tech, USA
ODBASE PC Co-Chairs (odbase2007@cs.rmit.edu.au)
- Tharam Dillon, University of Technology Sydney, Australia
- Michele Missikoff, CNR, Italy
- Steffen Staab, University of Koblenz-Landau, Germany
GADA PC Co-Chairs (gada2007@cs.rmit.edu.au)
- Pilar Herrero, Universidad Politécnica de Madrid, Spain
- Daniel Katz, Louisiana State University and Jet Propulsion Laboratory, USA
- María S. Pérez, Universidad Politécnica de Madrid, Spain
- Domenico Talia, Università della Calabria, Italy
IS PC Co-Chairs (is2007@cs.rmit.edu.au)
- Mário M. Freire, University of Beira Interior, Portugal
- Simão Melo de Sousa, University of Beira Interior, Portugal
- Vitor Santos, Microsoft, Portugal
- Jong Hyuk Park, Hanwha S&C Co. Ltd., Korea

From rgb at phy.duke.edu Tue Jun 19 02:58:31 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:06:07 2009
Subject: [Beowulf] Diskless booting - NIC BIOS
In-Reply-To: <649380.30815.qm@web37914.mail.mud.yahoo.com> References: <649380.30815.qm@web37914.mail.mud.yahoo.com> Message-ID:

On Mon, 18 Jun 2007, Ellis Wilson wrote:

> Thanks Brian, Matt, and rgb,
>
> The floppy idea is great (I think I remember now reading about it in
> rgb's book, but had forgotten), and I certainly will look into that.
> The motherboards are in some cases years and years old; one computer I'm
> deciding whether I'll use or not does have a 400 MHz processor in it, so
> their age is sufficient to make me worry.

Two points. One is that these days if your system has a BIOS that can manage booting from CD, I'd advise booting from CD instead of floppy. There are a variety of reasons for this -- CDs are cheap, you can put a large kernel on one, you can actually put a whole Linux image on it and avoid having to "boot diskless" over the network, although of course you can still do that as well.
Floppies are pretty much obsolete at this point and it isn't easy to get a properly bootable image of a modern kernel to live on one -- I think you'll find building tight kernels that will fit moderately frustrating.

Second, remember that one dual-socket, dual-core 64-bit Opteron system -- currently available for maybe $1600 if you shop hard -- is going to be faster than a 32-node 400 MHz P6 cluster, and the latter will cost around $3000/year to leave powered on 24x7 (estimate $1/watt/year, even if you're not paying for it...:-). So you're building your cluster to learn and have fun, not for speed or to save money. If you have real work to do and want to do it as cheaply as possible, it would be wiser to go with a very small cluster of dual-core 64-bit modern CPUs.

> In response to how I'm handling "DHCP/TFTP management", I am
> ridiculously interested in knowing how everything works, so I install
> each computer using a method that follows guidelines similar to
> Linux From Scratch. I'll likely be using dhcpcd for DHCP and I'm
> currently looking into TFTP options. If there is a suggested program of
> choice among you all, please feel free to let me know. I opt for the
> Linux From Scratch route, one, because I hate going to class and thereby
> have plenty of time on my hands (hah), and two, because the majority of
> computers I deal with are old and need all the free resources they can
> get.

I personally would strongly advise a newbie learning about clusters to go one of three routes these days.

Route one would be warewulf -- this project comes with diskless booting more or less directly supported (in a sense, the core of the project IS a diskless distribution system for node images) and is distribution neutral, within reason. There are Real Humans using it and mailing lists and so on to support it, both important.
Routes two and three would be to go either Debian or Fedora Core -- both have advantages (and no, I'm not getting myself trapped in a religious debate over which one is good and which one is evil -- I actually think both are pretty good) and disadvantages. Setting up for diskless booting is actually pretty easy with the stock tftpd and dhcpd from FC (the distro I generally use these days). There is plenty to learn using one of the high end distros without quite having to build everything yourself, which can be frustrating as much as illuminating, and there are some WONDERFUL tools in the newer bleeding edges of these distros. One other thing to play with that can suck you right in but that should prove to be very rewarding in the future is virtualization -- look over vmware-player and the library of VM appliances, including prebuilt ready-to-play cluster nodes. Xen promises to be similarly useful although so far it appears to me to be more cumbersome and less stable when supporting a workstation as opposed to a stripped down server image. Might be very good for stripped down node images, though. Virtualization lets you REALLY play with your system(s). You can actually boot (say) FC and run (say) Debian and even (yuk) Windows in two VMs and switch freely between all three images in multiple workspaces. Tres cool. rgb > Again, much thanks to those who helped. > > Ellis > > Brian Dobbins wrote: > Hi Ellis, > > I wasn't sure from your post whether you meant the nodes had /no/ > network whatsoever, or simply no capabilities for network booting from > the NICs in the system. If it's the latter, and assuming these systems > have a floppy drive, I'd suggest looking into using the Etherboot > software to handle network booting. No need to spend extra money.
:) > > The webpages will explain more, but essentially (from memory - it's > been a while!), if you set up a DHCP / TFTP server for the images > somewhere on the network, just create a boot floppy with the correct > network drivers for the node, stick it in, power on, and provided the > DHCP/TFTP servers are correctly configured, the node should boot up, > initialize the network, send out a request to the DHCP server, and then > (from the information handed back), request a boot image from the TFTP > server. To create this boot floppy, you can probably just visit the > Rom-O-Matic page ( http://rom-o-matic.net/ ) and select the type of card > you have, but definitely read over the Etherboot documentation, too ( > http://www.etherboot.org/ ). If you're not certain what type of card is > in the nodes, I'd suggest putting a Knoppix CD in, booting up, starting > the network, and then listing the modules that are loaded - the network > drivers should be in that list. > > If you get stuck, drop me a note and I'll be glad to try to walk you > through it - I'm a pack rat, and probably still have all the old > configuration files from when I last did this, too. Finally, in terms > of the DHCP/TFTP management, are you handling that by yourself, or using > some already-written package? The initial cluster that I used Etherboot > on used the Warewulf package - I'd recommend you take a look at it, > too. The webpage is ( http://www.warewulf-cluster.org/ ). The guy > developing it, Greg Kurtzer, is really helpful, too, so if you get stuck > in that stage of things, you won't pull out all your hair in frustration. > > Good luck! > - Brian > > (Naturally, RGB also helped me out in the past -- anyone know if there > is some analogue in the Beowulf realm to the Erdos number for RGB? I > can't imagine there's anyone he /hasn't/ helped!) > > Brian Dobbins > Yale Engineering HPC > > > > --------------------------------- > Fussy? Opinionated? Impossible to please? Perfect. 
Join Yahoo!'s user panel and lay it on us. -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From eugen at leitl.org Wed Jun 20 03:05:21 2007 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] HLRS Courses Fall 2007 - Overview Message-ID: <20070620100521.GJ17691@leitl.org> For German supercomputing folks mostly. ----- Forwarded message from Rolf Rabenseifner ----- From: Rolf Rabenseifner Date: Wed, 20 Jun 2007 11:14:38 +0200 (CEST) To: eugen@leitl.org Subject: HLRS Courses Fall 2007 - Overview Dear Madam or Sir, The fall program of the Parallel Programming Workshops 2007 is now available: http://www.hlrs.de/news-events/events/2007/parallel_prog_fall2007/ Could you please forward this announcement to interested colleagues, since this mailing list often cannot directly reach those interested in the courses on "parallel programming" and "sequential programming". Registration is already open. I wish you a pleasant summer. With kind regards, Rolf Rabenseifner
=====================================================================
Call for Participation
=====================================================================
PARALLEL PROGRAMMING WORKSHOPS 2007
No.  ______Date______  ___Location____  ______________Content__________
CSCS Aug. 8-10, 2007   CSCS, Manno(CH)  Parallel Programming (MPI and OpenMP) [3 days, in English !!]
-E-  Sep. 17-21, 2007  LRZ, Garching    Iterative Linear Solvers and Parallelization [5 days, in German]
F-a  Oct. 8-9, 2007    HLRS, Stuttgart  Parallel programming with MPI [2 days, in English !!]
F-b  Oct. 10, 2007     HLRS, Stuttgart  Shared memory parallel programming with OpenMP [1 day, in English !!]
F-c  Oct. 11-12, 2007  HLRS, Stuttgart  Advanced topics in parallel programming [2 days, in English !!]
-G-  Oct. 15-19, 2007  HLRS, Stuttgart  Introduction to Computational Fluid Dynamics [5 days, in German]
-H-  Nov. 26-28, 2007  NIC, Juelich     Parallel Programming (MPI and OpenMP) [3 days, in German]
Registration and further information: http://www.hlrs.de/news-events/events/2007/parallel_prog_fall2007/
=====================================================================
Programming courses (sequential programming):
No.  ______Date______  ___Location____  ______________Content__________
FTN  Oct. 22-26, 2007  HLRS, Stuttgart  Fortran for Scientific Computing [5-day course, in German]
Registration for the sequential-programming courses: http://www.hlrs.de/news-events/events/2007/prog_lang_fall2007/
=====================================================================
Hands-on sessions will help participants to test and understand the lectures.
=====================================================================
Please do not miss our "Open Day", July 21, 11:00-17:00.
The HLRS has been named a "Selected Place 2007" in "Germany, Land of Ideas": http://www.hlrs.de/news-events/2007/land-der-ideen/
---------------------------------------------------------------------
Please forward this announcement to any colleagues who may be interested. Our apologies if you receive multiple copies.
---------------------------------------------------------------------
Dr. Rolf Rabenseifner .. . . . . . . . . . email rabenseifner@hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart .. . . . . . . . . fax : ++49(0)711/685-65832
Head of Dpmt Parallel Computing .. .. www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . .
(Office: Allmandring 30) --------------------------------------------------------------------- ----- End forwarded message ----- -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From alenzo at mail.rochester.edu Tue Jun 19 12:07:01 2007 From: alenzo at mail.rochester.edu (A Lenzo) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] Resources for starting a Beowulf Cluster Message-ID: <002b01c7b2a5$07ad14c0$f6339780@libra.cc.rochester.edu> Hello all, I am new to Linux and need help with the setup of my Beowulf Cluster. Can anyone suggest a few good resources? I currently have 1 master node and 2 slave nodes, but now I am not sure how to proceed. For starters, I would like to be able to create a user account on the master node and have it appear on the slave nodes. I've figured out that the first step is to copy over several files as follows: /etc/group /etc/passwd /etc/shadow And this lets me now log into any node with a given password, but the home directory of that given user does not carry over. All comments welcome! Thanks! A Lenzo PS - I am using Fedora Core 6. From andrew.robbie at gmail.com Tue Jun 19 04:18:44 2007 From: andrew.robbie at gmail.com (Andrew Robbie) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] Diskless booting - NIC BIOS Message-ID: On 19/06/2007, at 7:58 PM, Robert G. Brown wrote: > On Mon, 18 Jun 2007, Ellis Wilson wrote: > >> Thanks Brian, Matt, and rgb, >> >> The floppy idea is great (I think I remember now reading about it in >> rgb's book, but had forgotten), and I certainly will look into that. >> The motherboards are in some cases years and years old; one >> computer I'm >> deciding whether I'll use or not does have a 400mhz processor in >> it, so >> their age is sufficient to make me worry. > > Two points. 
One is that these days if your system has a BIOS that can > manage booting from CD, I'd advise booting from CD instead of floppy. > There are a variety of reasons for this -- CD's are cheap, you can > put a > large kernel on it, you can actually put a whole linux image on it and > avoid having to "boot diskless" over the network, although of > course you > can still do that as well. Floppies are pretty much obsolete at this > point and it isn't easy to get a properly bootable image of a modern > kernel to live on one -- I think you'll find building tight kernels > that > will fit moderately frustrating. I disagree that floppies are obsolete! But putting the kernel on the floppy is. On our old cluster we used floppies with a custom build of grub (easy to do). Just build grub with support for the network adapters in your cluster and throw in a config file. The config file can list any number of boot kernels (and optionally associated root paths) which is really handy for having eg a production kernel, a debug kernel, a testing kernel, memtest, etc. Though the grub config file is hardcoded into the binary which is not ideal. When I had to create new grub configs it only took a few minutes to dd the floppy image. Far, far quicker than burning 20 copies of a CD. The same technique can be extended to modern computers with PXE Grub, which is even better because the config file can be sucked off the TFTP server too. > Second, remember that one dual dual core 64-bit opteron processor > system > -- currently available for maybe $1600 if you shop hard -- is going to > be faster than a 32 node 400 MHz P6 cluster, and the latter will cost > around $3000/year to leave powered on 24x7 (estimate $1/watt/year, > even > if you're not paying for it...:-). So you're building your cluster to > learn and have fun, not for speed or to save money. 
If you have real > work to do and want to do it as cheaply as possible, it would be wiser > to go with a very small cluster of dual core 64 bit modern CPUs. Very true. But it is usually someone else paying for juice. Though logically I should be able to go to the building management people and say 'I can save you $1000 if you give me $2000', in practice I don't think it would work... > > One other thing to play with that can suck you right in but that > should > prove to be very rewarding in the future is virtualization -- look > over > vmware-player and the library of VM appliances, including prebuilt > ready-to-play cluster nodes. I can highly recommend this approach if you need to run stuff on windows. Far easier to netboot linux and start a vmware instance than to try to netboot windows (though emBoot makes things easier). Regards, Andrew From alenzo at mail.rochester.edu Wed Jun 20 06:32:51 2007 From: alenzo at mail.rochester.edu (A Lenzo) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] Resources for starting a Beowulf Cluster (NFS Setup?) Message-ID: <000d01c7b33f$836d7a10$f6339780@libra.cc.rochester.edu> Hello all, I am new to Linux and need help with the setup of my Beowulf Cluster. Can anyone suggest a few good resources? Here is a description of my current hurdle: I have 1 master node and 2 slave nodes. For starters, I would like to be able to create a user account on the master node and have it appear on the slave nodes. I've figured out that the first step is to copy over several files as follows: /etc/group /etc/passwd /etc/shadow And this lets me now log into any node with a given password, but the home directory of that given user does not carry over. All comments welcome! Thanks! A Lenzo PS - I am using Fedora Core 6.
From rodmur at maybe.org Wed Jun 20 08:03:58 2007 From: rodmur at maybe.org (Dale Harris) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] air filtration systems Message-ID: <20070620150358.GS12688@maybe.org> Is anyone using any standalone air filtration systems in their machine rooms? Any recommendations? -- Dale Harris rodmur@maybe.org rodmur@gmail.com /.-) From amjad11 at gmail.com Wed Jun 20 02:29:31 2007 From: amjad11 at gmail.com (amjad ali) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] Barcelona for CFD, FEM, and Weather Forecasting codes Message-ID: <428810f20706200229s3cbf8fa8y5821f09f8682866c@mail.gmail.com> Hello all, AMD's Quad Core Barcelona is about to be released. Could any of you please advise whether a cluster with 2 Quad Core AMD CPUs, RAM 2 GB/core and Infiniband is quite suitable for CFD, FEM and Weather Forecasting type parallel codes? Or would the same amount invested in a cluster with 2 Dual Core AMD Opterons, RAM 2 GB/core and Infiniband be more suitable for these codes? regards, Amjad Ali. From ctierney at hypermall.net Wed Jun 20 22:12:02 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] Resources for starting a Beowulf Cluster (NFS Setup?) In-Reply-To: <000d01c7b33f$836d7a10$f6339780@libra.cc.rochester.edu> References: <000d01c7b33f$836d7a10$f6339780@libra.cc.rochester.edu> Message-ID: <467A08A2.9060400@hypermall.net> A Lenzo wrote: > Hello all, > > I am new to Linux and need help with the setup of my Beowulf Cluster. Can > anyone suggest a few good resources? > > Here is a description of my current hurdle: I have 1 master node and 2 slave > nodes. For starters, I would like to be able to create a user account on > the master node and have it appear on the slave nodes. I've figured out > that the first step is to copy over several files as follows: > > /etc/group > /etc/passwd > /etc/shadow > You could also use network-based authentication to remove the need to copy. LDAP and NIS work.
However, you will hear many opinions on the subject. > And this lets me now log into any node with a given password, but the home > directory of that given user does not carry over. > For the home filesystem, you need to use a networkable filesystem so that the same image is consistent across all of the nodes. For /home and in small clusters, NFS is traditionally used. Configure your master to export /home to your clients. Have your clients automatically mount the filesystem as /home. Craig From eugen at leitl.org Thu Jun 21 01:44:19 2007 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] [tt] Nvidia unveils Tesla, moves into supercomputing Message-ID: <20070621084419.GS17691@leitl.org> ----- Forwarded message from Brian Atkins ----- From: Brian Atkins Date: Wed, 20 Jun 2007 16:23:29 -0500 To: transhumantech Subject: [tt] Nvidia unveils Tesla, moves into supercomputing User-Agent: Thunderbird 2.0.0.4 (Windows/20070604) http://www.tgdaily.com/content/view/32557/135/ Santa Clara (CA) - Nvidia today announced Tesla, a third product line next to the GeForce and Quadro graphics products. The company aims to use Tesla cards and the massive floating point horsepower of its graphics processors to take over a portion of the lucrative supercomputing market. The core of each Tesla device is a GeForce 8-series GPU as well as the general component layout of the high-end Quadro FX 5600 workstation graphics card with 1.5 GB of memory. The only noteworthy difference between the FX 5600 and a Tesla card is the fact that the supercomputing-targeted devices lack the graphics outputs on the backpanel, which, we were told, allows Nvidia to increase the clock speed on Tesla.
While the actual clock speed of the Tesla GeForce GPU is kept under wraps, Nvidia said that one processor (used in the C870 add-in card) is good for a performance of 518 GFlops, two processors (used in the deskside supercomputer D870, which integrates two C870 cards) will bring 1 TFlops; the Tesla GPU server with four processors will hit 2 TFlops. In terms of pure number crunching horsepower, Nvidia told us that one GeForce GPU can match the combined performance of 40 x86 processors. In addition to the raw performance, Tesla also makes a case for power efficiency: The C870 is rated at a maximum power consumption of 170 watts and the GPU server at 800 watts, which may sound like a lot at first look. However, 40 low-power x86 processors would run at a typical 1600 watts. With a common power budget of about 25 kilowatts per server rack, a Tesla GPU server rack has a theoretical maximum performance of more than 60 TFlops - which would put the floating point rating of such a device among the 15 fastest supercomputers currently ranked on the Top 500 Supercomputer list. Similarities to ATI's stream processor card, implications for developers Readers who have been following recent general purpose GPU announcements will remember that ATI has a product in its portfolio that is very similar to the Tesla C870 - the stream processor card (which is based on an R580 GPU and 1 GB of memory). Both products follow the same concept of making the massive processing capability provided by shader processors available to run arbitrary code instead of graphics code. Developers such as John Stone and James Phillips, senior research programmers at the Beckman Institute for Advanced Science and Technology at the University of Illinois, have been looking at accelerators such as GPUs for some time, but have been limited mainly by bugs in shader drivers. Stone told us that much of his work with GPUs in the past was focused "on finding driver bugs" and "writing his applications around them"
in order to make the technology usable for scientific simulations. "There can be a lot of rounding errors and because of this very fact, I wasn't very excited about working with GPUs," he said. However, both AMD and Nvidia came up with a programming model to solve this problem. On AMD's side, it is called CTM ("close to metal") and on Nvidia's side it is CUDA ("Compute Unified Device Architecture"). At this time, it appears to come down to personal preference which model a developer favors: for example, some universities are working with CTM (such as Stanford's Folding@Home project) and some are working with CUDA. Stone and Phillips are focusing on the Nvidia model, as they claim its C++-based language model is easier to deal with than AMD's CTM version, which uses a low-level assembly language. While CUDA works very much like a regular programming model and, according to Stone, can deliver results very quickly, the big challenge in exploiting these devices will be the knowledge needed to write advanced parallelized code for these GPGPUs. Stone believes that coders who have written code for (massively parallel) supercomputers before will have an especially easy transition. Of course, knowledge of the hardware, graphics processing and a good look at the parallelizable parts of applications help to take advantage of the technology. Shane Ryoo, a graduate research assistant at the University of Illinois at Urbana-Champaign, said that CUDA will allow programmers with some experience in developing threaded applications to get "really good results right off the bat." However, it is the fine-tuning process that will increase the value of GPGPUs: Ryoo noted that the expert knowledge that allows developers to squeeze the best possible performance out of GPUs can sometimes accelerate application code by a factor of 5x or greater.
Nvidia is well aware of this challenge and has begun assisting universities in establishing classes and developing course material focusing on massively parallel programming and CUDA in particular. Eventually, the company hopes, GPGPU programming will become a standard part of computer science course work and help to educate a whole new generation of programmers. So far, Nvidia has taught courses at the University of Illinois, the University of California, the University of North Carolina and Purdue University. Nvidia said that several universities are developing their own courses, including the University of Virginia, the University of Pennsylvania, Oregon State University and the University of Wisconsin. Caltech, MIT, Berkeley and Stanford have been offering "legacy" GPGPU and GPU programming classes, according to Nvidia chief scientist David Kirk. The payoff: Accelerated applications If the capabilities of these GPGPUs are exploited, there can be a big payoff. Stone, who is working on Nanoscale Molecular Dynamics (NAMD) as well as Visual Molecular Dynamics (VMD), said that a virus simulation that took 110 CPU hours on an SGI Altix Itanium 2 supercomputer at NCSA required only 27 GPU minutes on a GeForce 8 graphics processor - which translates into a 240x speedup. In an example that showcases an impact that can touch many lives, Ryoo and his team are working on an interactive, medical MRI application that substantially increases the resolution of MRI scans thanks to the added processing power. As a result, they expect to be able to deliver much finer images, which allow physicians to detect tumors at an earlier stage or differentiate between a blip and an actual tumor.
In a demonstration shown during an Nvidia event, a representative from Headwave, a company that provides geophysical data analysis, highlighted a 4D application, which allows users to visualize gigabytes and apparently even terabytes of data at a three-dimensional scale and even apply a time filter to display changes to geological layers over time. The company claims that GPUs are accelerating their application by about 2000% and are delivering an output of about 2000 MB/s. In fairness, we should mention that Tesla (or stream processor cards, for that matter) will not be able to replace supercomputers, which continue to provide a memory bandwidth a few Tesla cards cannot match. Scientists such as Stone believe that products such as Tesla will make their way into supercomputers to create an overall more balanced environment. "Number crunching was the limiting factor up until now. Now Infiniband will be a problem," he said. GPGPUs are likely to have a greater impact on deskside supercomputers in the short term. While scientists today have to apply for expensive supercomputer time and in most cases have to wait several days until their application can be processed - if those requests are not turned down anyway - there is now an opportunity to run many of those tests on a desk right in the lab. Conceivably, GPGPUs will allow more scientists to run more and higher quality simulations in less time. Cost and impact on the consumer Nvidia's Tesla products will start at $1300 for the single GPU add-in card; the 2-GPU deskside unit will run for $7500 and the 4 GPU server, which soon will also be offered in an 8 GPU version, will sell for $12,000. Leaving out of consideration that, at least to our knowledge, Tesla is not yet available, these apparently lofty price tags turn out to be bargains at a closer look. The C870 not only undercuts the ATI stream processor card, which currently sells for about $2000, but also Nvidia's own workstation products.
The C870, at $1300, compares to a Quadro FX 5600 graphics card, which requires an investment in the neighborhood of $3000 and up. Clearspeed's CSX600 accelerator card, which provides a performance of about 100 GFlops, is selling in volume for about $7500. A representative of Evolved Machines told us that the company plans to offer a 12 TFlops Tesla server, which will cost somewhere between $60,000 and $70,000, but will be fast enough to match the floating point performance of the 19th fastest supercomputer on the Top-500 list. Stone told us that even if the GPUs per se may appear expensive from a consumer point of view, they "are available for far less money than the next best thing that is available today." So, what does that mean for the consumer? Clearly, there is only an indirect benefit for most consumers, which we may see in improved research results down the road. However, as with all technologies, these GPUs will get cheaper over time, and even today a $1300 card would be in reach for enthusiasts, who often spend substantially more than $5000 on their rig. The fact is that there is no magic necessary to make these cards work on a PC - and CUDA even works with GeForce 8 graphics cards, which can be had for less than $250 in the case of 8600-series models. The real question is: When will there be applications that take advantage of this technology, and will they provide enough incentive for consumers to purchase a GeForce 8 card? Industry experts believe that it will be up to developers to come up with new applications that will take advantage of the capability of GPGPUs on the desktop. Nvidia CEO Jen-Hsun Huang told TG Daily that Tesla will be strictly focused on the enterprise market and will not be making its way to the consumer market.
In the end, it will be up to the GeForce product groups to leverage CUDA on desktop computers, but at least for now, Nvidia has little motivation to push this technology for the average consumer: "Perhaps in the future," said Huang, "[this technology] could do physics on the PC, but this would need a Windows API." -- Brian Atkins Singularity Institute for Artificial Intelligence http://www.singinst.org/ _______________________________________________ tt mailing list tt@postbiota.org http://postbiota.org/mailman/listinfo/tt ----- End forwarded message ----- -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From rgb at phy.duke.edu Thu Jun 21 03:49:41 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] Resources for starting a Beowulf Cluster (NFS Setup?) In-Reply-To: <000d01c7b33f$836d7a10$f6339780@libra.cc.rochester.edu> References: <000d01c7b33f$836d7a10$f6339780@libra.cc.rochester.edu> Message-ID: On Wed, 20 Jun 2007, A Lenzo wrote: > Hello all, > > I am new to Linux and need help with the setup of my Beowulf Cluster. Can > anyone suggest a few good resources? > > Here is a description of my current hurdle: I have 1 master node and 2 slave > nodes. For starters, I would like to be able to create a user account on > the master node and have it appear on the slave nodes. I've figured out > that the first step is to copy over several files as follows: > > /etc/group > /etc/passwd > /etc/shadow > > And this lets me now log into any node with a given password, but the home > directory of that given user does not carry over. I'd suggest getting a good book on Unix/Linux systems administration at your local friendly bookstore.
Most of this is standard stuff for managing any LAN, and the one by Nemeth, Snyder and Hein (Linux Administration Handbook) is likely as good as any. You want to: a) NFS export your home directory from the master. Basically this involves making an entry in /etc/exports (with PRECISELY the right format, sorry, RTMP) and doing chkconfig nfs on, /etc/init.d/nfs start. God willing and the crick don't rise, and after you turn off selinux completely and drive a stake through its heart and use system-config-security to enable at least NFS in addition to ssh, then with luck you'll be able to go to a node/client and do: mount -t nfs master:/home /home (and add a suitable line to /etc/fstab to make this automagical on boot) and have it "just work". b) There are two ways to handle the user account, password, /etc/hosts, and other system db synchronization. For a tiny cluster with one or two users they are pretty much break even. One is to do what you've done -- create e.g. /etc/[passwd,group,shadow,hosts] on the master and then rsync them to the nodes as root, taking care not to break them or you'll be booting them single user to clean them up or reinstalling them altogether! When a new account is added, rerun the rsyncs. You can even write a tiny script that will rsync exactly what is needed. Or, you can learn to use NIS, which scales to a much larger (department/organization sized) enterprise and cluster with dozens or hundreds of user accounts. For that you'll NEED the systems administration book or one like it -- NIS is not for the faint of heart. I've done NIS management before, and know how to use it, but elect to go the other way for my home LAN/cluster because even 8-10 systems and 4-5 users are about break even compared to a judicious and infrequent set of rsyncs, and a cluster is even simpler in this regard. FWIW, local (non-NIS) dbs are somewhat faster for certain classes of parallel operation although this is not generally a major issue for most code. 
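The rsync route can be sketched in a few lines of shell. This is only an illustration -- the node names here are hypothetical, and the leading echo makes it a dry run (drop the echo, and run as root, once the list is right):

```shell
#!/bin/sh
# Push the account databases from the master to each node.
# NODES is a hypothetical list -- substitute your own hostnames.
# The leading 'echo' makes this a dry run; remove it to actually sync.
NODES="node01 node02"
FILES="/etc/passwd /etc/shadow /etc/group /etc/hosts"
for node in $NODES; do
    echo rsync -a $FILES "${node}:/etc/"
done
```

Rerun it after every account change; for a handful of nodes and users this stays simpler than a full NIS setup.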
Hope this helps, rgb > > All comments welcome! > > Thanks! > A Lenzo > > PS - I am using Fedora Core 6. > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From hahn at mcmaster.ca Thu Jun 21 07:57:45 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] any gp-gpu clusters? Message-ID: Hi all, is anyone messing with GPU-oriented clusters yet? I'm working on a pilot which I hope will be something like 8x workstations, each with 2x recent-gen gpu cards. the goal would be to host cuda/rapidmind/ctm-type gp-gpu development. part of the motive here is just to create a gpu-friendly infrastructure into which commodity cards can be added and refreshed every 8-12 months, as opposed to "investing" in quadro-level cards which are too expensive to toss when obsoleted. nvidia's 1U tesla (with two g80 chips) looks potentially attractive, though I'm guessing it'll be premium/quadro-priced - not really in keeping with the hyper-moore's-law mantra... if anyone has experience with clustered gp-gpu stuff, I'm interested in comments on particular tools, experiences, configuration of the host machines and networks, etc. for instance, is it naive to think that gp-gpu is most suited to flops-heavy-IO-light apps, and therefore doesn't necessarily need a hefty (IB, 10Geth) network? thanks, mark hahn.
From diep at xs4all.nl Thu Jun 21 15:06:33 2007 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] [tt] Nvidia unveils Tesla, moves into supercomputing References: <20070621084419.GS17691@leitl.org> Message-ID: <004001c7b450$72bb0130$0900a8c0@objection> Instead of the bla bla that nvidia and ati produce, please let them create a few clear pdf's that describe, for each specific graphics card, EXACTLY how BIG the caches are on each card. How on planet earth can you program for a card without knowing how the caches work, let alone their size? intel and amd definitely don't make a major secret out of the size of their caches. What we see in reviews of new graphics cards, however, is that a few hardware sites simply must GUESS how big it is. Nearly all descriptions are written from graphics programmers' viewpoints instead of CPU programmers' viewpoints. That makes the step of getting a CPU-intensive program to work on a GPU a really big one. Is it so hard for ATI/NVIDIA to write a clear document like that about their latest flagship and put it online for free download? Additionally, I miss one important instruction on those GPUs, which CPUs have had from the 386 on: if your GPU can only do 32-bit integer data types, then provide a parallel multiplication that takes 2x32-bit inputs and produces a 2x32-bit output. It is a fairy tale that FFT is faster in floating point; it just happens to be the case that most SIMD has no integer equivalent so far.
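The instruction being asked for here is just a widening multiply: two 32-bit operands in, a 64-bit product out as a high/low pair of 32-bit words. As a sketch, the shell's native 64-bit arithmetic can stand in for it (operands chosen to keep the product inside the signed 64-bit range):

```shell
#!/bin/sh
# 32x32 -> 64-bit widening multiply, split into high/low 32-bit words
# (x86 has exposed this via MUL's EDX:EAX pair since the 386).
a=$(( 0xFFFFFFFF ))
b=$(( 0x7FFFFFFF ))
p=$(( a * b ))
hi=$(( p >> 32 ))
lo=$(( p & 0xFFFFFFFF ))
printf '%08x %08x\n' "$hi" "$lo"   # prints: 7ffffffe 80000001
```

On a GPU restricted to 32-bit integers, the same hi/lo decomposition is what a native 64-bit result register pair would provide in one instruction.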
Thanks, Vincent ----- Original Message ----- From: "Eugen Leitl" To: Sent: Thursday, June 21, 2007 10:44 AM Subject: [Beowulf] [tt] Nvidia unveils Tesla, moves into supercomputing > ----- Forwarded message from Brian Atkins ----- > > From: Brian Atkins > Date: Wed, 20 Jun 2007 16:23:29 -0500 > To: transhumantech > Subject: [tt] Nvidia unveils Tesla, moves into supercomputing > User-Agent: Thunderbird 2.0.0.4 (Windows/20070604) > > http://www.tgdaily.com/content/view/32557/135/ > > Santa Clara (CA) - Nvidia today announced Tesla, a third product line next > to > the GeForce and Quadro graphics products. The company aims to use Tesla > cards > and the massive floating point horsepower of its graphics processors to > take > over a portion of the lucrative supercomputing market. > > The core of each Tesla device is a GeForce 8-series GPU as well as the > general > component layout of the high-end Quadro FX 5600 workstation graphics card > with > 1.5 GB of memory. The only noteworthy difference between the FX 5600 and a > Tesla > card is the fact that the supercomputing-targeted devices lack the > graphics > outputs on the backpanel, which, we were told, allows Nvidia to increase > the > clock speed on Tesla. > > While the actual clock speed of the Tesla GeForce GPU is kept under wraps, > Nvidia said that one processor (used in the C870 add-in card) is good for > a > performance of 518 GFlops, two processors (used in the deskside > supercomputer > D870, which integrates two C870 cards) will bring 1 TFlops; the Tesla GPU > server > with four processors will hit 2 TFlops. > > In terms of pure number crunching horsepower, Nvidia told us that one > GeForce > GPU can match the combined performance of 40 x86 processors. In addition > to the > raw performance, Tesla also makes a case for power efficiency: The C870 is > rated > at a maximum power consumption of 170 watts and the GPU server at 800 > watts, > which may sound like a lot at first look. 
However, 40 low-power x86 processors > would > run at a typical 1600 watts. With a common power budget of about 25 > kilowatts > per rackserver, a Tesla GPU server rack has a theoretical maximum > performance of > more than 60 TFlops - which would put the floating point rating of such a > device > among the 15 fastest supercomputers currently ranked on the Top 500 > Supercomputer list. > > > Similarities to ATI's stream processor card, implications for developers > > Readers who have been following recent general purpose GPU announcements > will > remember that ATI has a product in its portfolio that is very similar to the > Tesla > C870 - the stream processor card (which is based on an R580 GPU and 1 GB of > memory). Both products follow the same concept of making the massive > processing > capability provided by shader processors available to run arbitrary code > instead > of graphics code. > > Developers such as John Stone and James Philips, senior research > programmers at > the Beckman Institute of Advanced Science and Technology at the University > of > Illinois, have been looking at accelerators such as GPUs for some time, but > have been > limited mainly by bugs in shader drivers. Stone told us that much of his > work > with GPUs in the past was focused "on finding driver bugs" and "writing > his > applications around them" in order to make the technology usable for > scientific > simulations. "There can be a lot of rounding errors and because of this > very > fact, I wasn't very excited about working with GPUs," he said. > > However, both AMD and Nvidia came up with a programming model to solve > this > problem. On AMD's side, it is called CTM ("close to metal") and on Nvidia's > side > it is CUDA ("Compute Unified Device Architecture"). 
At this time, it > appears to > come down to personal liking which model is preferred by a developer, as, > for > example, there are some universities that are working with CTM (such as > Stanford's Folding@Home project) and there are some that are working with > CUDA. > Stone and Philips are focusing on the Nvidia model as they claim its > C++-based > language model is easier to deal with than AMD's CTM version, which uses a > low-level assembly language. > > While CUDA works very much like a regular programming model and, according > to > Stone, can deliver results very quickly, the big challenge in exploiting > these > devices will be the knowledge needed to write advanced parallelized code for these > GPGPUs. > Stone believes that especially coders who have written code for (massively > parallel) supercomputers before will have an easy transition opportunity. > Of > course, knowledge of the hardware, graphics processing and a good look at > the > parallelizable parts of applications help to take advantage of the > technology. > > Shane Ryoo, a graduate research assistant at the University of Illinois at > Urbana-Champaign, said that CUDA will allow programmers with some > experience in > developing threaded applications to get "really good results right off the > bat." > However, it will be the fine-tuning process that will increase the value > of > GPGPUs: Ryoo noted that the expert knowledge that allows developers to > squeeze > the best possible performance out of GPUs can sometimes accelerate > application > code by a factor of 5x or greater. > > Nvidia is well aware of this challenge and has begun assisting > universities in > establishing classes and developing course material focusing on massively > parallel programming and CUDA in particular. Eventually, the company > hopes, > GPGPU programming will become a standard part of computer science course > work > and help to educate a whole new generation of programmers. 
So far, Nvidia > has > taught courses at the University of Illinois, The University of > California, the > University of North Carolina and Purdue University. Nvidia said that > several > universities are developing their own courses, including the University of > Virginia, the University of Pennsylvania, Oregon State University and the > University of Wisconsin. Caltech, MIT, Berkeley and Stanford have been > offering > "legacy" GPGPU and GPU programming classes, according to Nvidia chief > scientist > David Kirk. > > The payoff: Accelerated applications > > > If the capabilities of these GPGPUs are exploited, there can be a big > payoff. > Stone, who is working on Nanoscale Molecular Dynamics (NAMD) as well as > Visual > Molecular Dynamics (VMD), said that a virus simulation that took 110 CPU > hours > on an SGI Altix Itanium 2 supercomputer at NCSA required only 27 GPU > minutes on a > GeForce 8 graphics processor - which translates into a 240x speedup. > > In an example that showcases an impact that can touch many lives, Ryoo and > his > team are working on an interactive, medical MRI application that > substantially > increases the resolution of MRI scans thanks to the added processing > power. As a > result, they expect to be able to deliver much finer images, which allow > physicians to detect tumors at an earlier stage or differentiate between a > blip > and an actual tumor. > > In a demonstration shown during an Nvidia event, a representative from > Headwave, a company that provides geophysical data analysis, highlighted a > 4D > application, which allows users to visualize gigabytes and apparently even > terabytes of data in a three-dimensional scale and even apply a time > filter to > display changes to geological layers over time. The company claims that > GPUs are > accelerating their application by about 2000% and are delivering an output > of > about 2000 MB/s. 
> > In fairness, we should mention that Tesla (or stream processor cards for > that > matter) will not be able to replace supercomputers, which continue to > provide a > memory bandwidth a few Tesla cards cannot match. Scientists such as Stone > believe that products such as Tesla will make their way into > supercomputers to > create an overall more balanced environment. "Number crunching was the > limiting > factor up until now. Now Infiniband will be a problem," he said. > > GPGPUs are likely to have a greater impact on deskside supercomputers in > the > short term. While scientists today have to apply for expensive > supercomputer > time and in most cases have to wait several days until their application > can be > processed - if those requests are not turned down anyway - there is now an > opportunity to run many of those tests on a desk right in the lab. > Conceivably, > GPGPUs will allow more scientists to run more and higher quality > simulations in > less time. > > > Cost and impact on the consumer > > Nvidia's Tesla products will start at $1300 for the single GPU add-in > card; the > 2-GPU deskside unit will run for $7500 and the 4 GPU server, which soon > will > also be offered in an 8 GPU version, will sell for $12,000. Leaving out of > consideration that, at least to our knowledge, Tesla is not yet available, > these > apparently lofty price tags turn out to be bargains on a closer look. > > The C870 not only undercuts the ATI stream processor card, which currently > sells > for about $2000, but also Nvidia's own workstation products. The C870, at > $1300, > compares to a Quadro FX 5600 graphics card, which requires an investment > in the > neighborhood of $3000 and up. Clearspeed's CSX600 accelerator card, which > provides a performance of about 100 GFlops, is selling in volume for about > $7500. 
> > A representative of Evolved Machines told us that the company plans to > offer a 12 TFlops Tesla server, which will cost somewhere between > $60,000 and > $70,000, but will be fast enough to match the floating point performance > of the > 19th fastest supercomputer on the Top-500 list. > > Stone told us that even if the GPUs per se may appear to be expensive from > a > consumer point of view, they "are available for far less money than the > next > best thing that is available today." > > So, what does that mean for the consumer? Clearly, there is only an > indirect > benefit for most consumers that we may see in improved research results > down the > road. However, as with all technologies, these GPUs will get cheaper over time > and > even today, a $1300 card would be in reach for enthusiasts, who often > spend > substantially more than $5000 on their rig. The fact is that there is no > magic > necessary to make these cards work on a PC - and CUDA even works with > GeForce 8 > graphics cards, which can be had for less than $250 in the case of > 8600-series > models. The real question is: When will there be applications that take > advantage of this technology and will they provide enough incentive for > consumers to purchase a GeForce 8 card? Industry experts believe that it > will be > up to developers to come up with new applications that will take advantage > of > the capability of GPGPUs on the desktop. > > Nvidia CEO Jen-Hsun Huang told TG Daily that Tesla will be strictly > focused on > the enterprise market and will not be making its way to the consumer > market. In > the end, it will be up to the GeForce product groups to leverage CUDA on > desktop > computers, but at least for now, Nvidia has little motivation to push this > technology for the average consumer: "Perhaps in the future," said Huang, > "[this > technology] could do physics on the PC, but this would need a Windows > API." 
> > -- > Brian Atkins > Singularity Institute for Artificial Intelligence > http://www.singinst.org/ > _______________________________________________ > tt mailing list > tt@postbiota.org > http://postbiota.org/mailman/listinfo/tt > > ----- End forwarded message ----- > -- > Eugen* Leitl leitl http://leitl.org > ______________________________________________________________ > ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org > 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From landman at scalableinformatics.com Thu Jun 21 15:20:48 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] question for licensed software users Message-ID: <467AF9C0.8010707@scalableinformatics.com> Hi folks Is there any one out there, using flexlm, who is *not* having problems with flexlm? Just curious. It is being particularly vicious to several of our customers. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From jmdavis1 at vcu.edu Thu Jun 21 22:07:55 2007 From: jmdavis1 at vcu.edu (Mike Davis) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] question for licensed software users In-Reply-To: <467AF9C0.8010707@scalableinformatics.com> References: <467AF9C0.8010707@scalableinformatics.com> Message-ID: <467B592B.2090704@vcu.edu> Joe, No problems here. But then again, I've been using Flexlm on various platforms for more than 10 years. In general, I run it on the head node if I need access from outside the cluster. Otherwise, it can run on any node in the cluster. What are the problems? 
Mike Davis Joe Landman wrote: > Hi folks > > Is there any one out there, using flexlm, who is *not* having > problems with flexlm? > > Just curious. It is being particularly vicious to several of our > customers. > > Joe > From toon.knapen at fft.be Fri Jun 22 02:03:42 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] question for licensed software users In-Reply-To: <467AF9C0.8010707@scalableinformatics.com> References: <467AF9C0.8010707@scalableinformatics.com> Message-ID: <467B906E.2090801@fft.be> What kind of problems: developing using FlexLm or deploying an app which needs to talk to a flexlm-server ? A few years ago we had an interesting problem with flexlm on windows 32bit: The flexlm library was loading plenty of other system libraries at a base-address near 1.4Gb. This implied that, even if there was 2Gb of RAM in the machine, the biggest allocation possible was maximally 1.2Gb. toon Joe Landman wrote: > Hi folks > > Is there any one out there, using flexlm, who is *not* having problems > with flexlm? > > Just curious. It is being particularly vicious to several of our > customers. > > Joe > From landman at scalableinformatics.com Fri Jun 22 04:57:39 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] question for licensed software users In-Reply-To: <467B906E.2090801@fft.be> References: <467AF9C0.8010707@scalableinformatics.com> <467B906E.2090801@fft.be> Message-ID: <467BB933.3070802@scalableinformatics.com> Toon Knapen wrote: > What kind of problems: developing using FlexLm or deploying an app which > needs to talk to a flexlm-server ? flexlm server suddenly and inexplicably switching its lmhostid. After a reboot. Renders all licenses unusable. This happened recently at two customers sites. One was a windows license server, one was a linux license server. Restarting the boxes didn't help. Restarting the daemon didn't help. 
Looking at the license files didn't indicate an expiration. The logs didn't indicate this either. It just appeared (in the logs) that the thing stopped believing it was server 1 (the lmhostid is a mac address BTW), and suddenly started believing it was a server 2 (well, it took the other mac address as its lmhostid). Very annoying. Have seen this on multiple boxes over the past few years, ones we haven't touched as well as ones we try to "fix" (e.g. alter the setup so that the flexlm daemon will in fact run correctly). This impacts multiple vendor products BTW. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From fly at anydata.co.uk Fri Jun 22 06:31:03 2007 From: fly at anydata.co.uk (Fred Youhanaie) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] question for licensed software users In-Reply-To: <467BB933.3070802@scalableinformatics.com> References: <467AF9C0.8010707@scalableinformatics.com> <467B906E.2090801@fft.be> <467BB933.3070802@scalableinformatics.com> Message-ID: <467BCF17.3010405@anydata.co.uk> Joe Landman wrote: > > flexlm server suddenly and inexplicably switching its lmhostid. After a > reboot. Renders all licenses unusable. I remember seeing something like this a few years ago, I think that relates to having multiple license servers, and they decide among themselves who does the serving, which, IIRC, is based on the servers' IP addresses, the lowest numeric address wins. HTH Cheers f. From eric-shook at uiowa.edu Fri Jun 22 08:16:02 2007 From: eric-shook at uiowa.edu (Eric Shook) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] any gp-gpu clusters? 
In-Reply-To: References: Message-ID: <467BE7B2.9020803@uiowa.edu> Hi Mark, I am a part of new research group that is considering adding gp-gpu technologies to our cluster, unfortunately we have the same questions which you raised. Which platform (ctm or cuda), development tools, configuration, etc. If we decided to add gpu technologies it would most likely only be added to 1-2 hosts so we can test its viability. So we are not developing a gpu-oriented cluster like you asked, but if the viability testing is successful we may look at it in the future. Do you have experience developing for GPUs? If so what was your experiences and/or results? Most particularly how high is the learning curve? thanks, Eric Mark Hahn wrote: > Hi all, > is anyone messing with GPU-oriented clusters yet? > > I'm working on a pilot which I hope will be something like 8x > workstations, each with 2x recent-gen gpu cards. > the goal would be to host cuda/rapidmind/ctm-type gp-gpu development. > > part of the motive here is just to create a gpu-friendly infrastructure > into which commodity cards can be added and refreshed every 8-12 > months. as opposed to "investing" in quadro-level cards which are too > expensive enough to toss when obsoleted. > > nvidia's 1U tesla (with two g80 chips) looks potentially attractive, > though I'm guessing it'll be premium/quadro-priced - not really in > keeping with the hyper-moore's-law mantra... > > if anyone has experience with clustered gp-gpu stuff, I'm interested in > comments on particular tools, experiences, configuration of the host > machines and networks, etc. for instance, is it naive to think that > gp-gpu is most suited to flops-heavy-IO-light apps, and therefore doesn't > necessarily need a hefty (IB, 10Geth) network? > > thanks, mark hahn. 
> _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Eric Shook From laytonjb at charter.net Fri Jun 22 08:34:46 2007 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] any gp-gpu clusters? In-Reply-To: References: Message-ID: <467BEC16.3050606@charter.net> I have no idea if this will help anyone, but here is an article that might help get started or at least provide some links: http://www.linux-mag.com/launchpad/business-class-hpc/main/3533 WARNING: You have to register to read the article (sorry about that). From what I understand, CTM is really just the low-level definition of the interface to AMD Stream processors. On the other hand CUDA is a real compiler with added features to make coding for GPUs easier. It also has a BLAS and FFT library. I think NVIDIA is ahead in the tools department, but I don't expect AMD to stay behind. Jeff > Hi all, > is anyone messing with GPU-oriented clusters yet? > > I'm working on a pilot which I hope will be something like 8x > workstations, each with 2x recent-gen gpu cards. > the goal would be to host cuda/rapidmind/ctm-type gp-gpu development. > > part of the motive here is just to create a gpu-friendly > infrastructure into which commodity cards can be added and refreshed > every 8-12 months. as opposed to "investing" in quadro-level cards > which are too expensive enough to toss when obsoleted. > > nvidia's 1U tesla (with two g80 chips) looks potentially attractive, > though I'm guessing it'll be premium/quadro-priced - not really in > keeping with the hyper-moore's-law mantra... > > if anyone has experience with clustered gp-gpu stuff, I'm interested > in comments on particular tools, experiences, configuration of the host > machines and networks, etc. 
for instance, is it naive to think that > gp-gpu is most suited to flops-heavy-IO-light apps, and therefore doesn't > necessarily need a hefty (IB, 10Geth) network? > > thanks, mark hahn. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From eric-shook at uiowa.edu Fri Jun 22 09:15:29 2007 From: eric-shook at uiowa.edu (Eric Shook) Date: Wed Nov 25 01:06:07 2009 Subject: [Beowulf] any gp-gpu clusters? In-Reply-To: <467BEC16.3050606@charter.net> References: <467BEC16.3050606@charter.net> Message-ID: <467BF5A1.4090304@uiowa.edu> Hi Jeff, I registered to see article and I must admit it was an excellent read. It provided a nice high-level overview of GPU programming and introduced libraries / languages that I was previously not aware of. I will be looking into these options as well. Thank you for posting the link. Eric Jeffrey B. Layton wrote: > I have no idea if this will help anyone, but here is an article > that might help get started or at least provide some links: > > http://www.linux-mag.com/launchpad/business-class-hpc/main/3533 > > WARNING: You have to register to read the article (sorry > about that). > > From what I understand, CTM is really just the low-level definition > of the interface to AMD Stream processors. On the other hand > CUDA is a real compiler with added features to make coding > for GPUs easier. It also has a BLAS and FFT library. > > I think NVIDIA is ahead in the tools department, but I don't > expect AMD to stay behind. > > Jeff > > >> Hi all, >> is anyone messing with GPU-oriented clusters yet? >> >> I'm working on a pilot which I hope will be something like 8x >> workstations, each with 2x recent-gen gpu cards. >> the goal would be to host cuda/rapidmind/ctm-type gp-gpu development. 
>> >> part of the motive here is just to create a gpu-friendly >> infrastructure into which commodity cards can be added and refreshed >> every 8-12 months. as opposed to "investing" in quadro-level cards >> which are too expensive enough to toss when obsoleted. >> >> nvidia's 1U tesla (with two g80 chips) looks potentially attractive, >> though I'm guessing it'll be premium/quadro-priced - not really in >> keeping with the hyper-moore's-law mantra... >> >> if anyone has experience with clustered gp-gpu stuff, I'm interested >> in comments on particular tools, experiences, configuration of the host >> machines and networks, etc. for instance, is it naive to think that >> gp-gpu is most suited to flops-heavy-IO-light apps, and therefore doesn't >> necessarily need a hefty (IB, 10Geth) network? >> >> thanks, mark hahn. >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Eric Shook (319) 335-6714 Technical Lead, Systems and Operations - GROW http://grow.uiowa.edu From lindahl at pbm.com Fri Jun 22 10:15:41 2007 From: lindahl at pbm.com (Greg Lindahl) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] any gp-gpu clusters? In-Reply-To: <467BEC16.3050606@charter.net> References: <467BEC16.3050606@charter.net> Message-ID: <20070622171541.GA31515@bx9.net> On Fri, Jun 22, 2007 at 11:34:46AM -0400, Jeffrey B. Layton wrote: > On the other hand CUDA is a real compiler with added features to > make coding for GPUs easier. It's based on the GPLed bits from PathScale's compilers. 
It's really interesting to see what you can do with this compiler, it's now been used to create high-performance compilers for RISC, CISC, VLIW, and stream processors. -- greg From alenzo at mail.rochester.edu Fri Jun 22 12:43:34 2007 From: alenzo at mail.rochester.edu (A Lenzo) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Setting up NFS directory mount on client In-Reply-To: <200706221517.l5MFHqLH030863@bluewest.scyld.com> References: <200706221517.l5MFHqLH030863@bluewest.scyld.com> Message-ID: <005801c7b505$a18a0730$f6339780@libra.cc.rochester.edu> Hello all, First, thank you to everyone who provided me help on my last problem. I set up a small test cluster in order to learn NFS. I have a server and two nodes on my test network. Running Fedora Core 6. I made an account on the server called barney. What I want to be able to do is log into any machine with this account and access files without having to set up the account on every machine. So I knew NFS would let me share the home directory of barney across the network. It is almost working now. Right now, using the instructions I found here: Linux Home Server HOWTO - Network File System I can log into any node and the directory I want appears here: /media/mytestsrv/home/barney So that's great! It is sharing. But I need it to be here: /home/barney I can't figure out how to do that. I edited the /etc/fstab file with the following line: mytestsrv:/ /home nfs4 auto,rw,nodev,sync,_netdev,proto=tcp,retry=10,rsize=32768,wsize=32768,hard,intr 0 0 but then the files I want appear in: /home/home/barney and when I log in I get this message: Could not chdir to home directory /home/barney: No such file or directory What can I do to get my home directory into the right place? It is definitely sharing, but unless the home directory is actually in /home, I'm not quite there. Thanks again, Linux and NFS gurus! 
From onetoleo at yahoo.es Thu Jun 21 09:09:34 2007 From: onetoleo at yahoo.es (Leonardo Ismael) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] hi! Message-ID: <120580.34827.qm@web27408.mail.ukl.yahoo.com> Hi! I'm a student from Buenos Aires, Argentina. I want to build a Beowulf system as a project at my university. I'll ask some questions via e-mail soon... --------------------------------- [Call any PC in the world for free. Calls to landlines and mobiles from 1 cent per minute.] http://es.voice.yahoo.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070621/51631d81/attachment.html From geoff at galitz.org Thu Jun 21 15:28:54 2007 From: geoff at galitz.org (Geoff Galitz) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] question for licensed software users In-Reply-To: <467AF9C0.8010707@scalableinformatics.com> Message-ID: Flex has generally worked out for us. We have run into problems when some vendors wrap up the flex startup/shutdown sequence too tightly with their application. We would have situations where flex was starting up with the wrong PATH environment. Flex would start but not function properly in this case. If I had to rate it on a scale of one to ten, I'd have to give it 7.5 or 8 as far as how well it worked out. -geoff On 6/21/07 3:20 PM, "Joe Landman" wrote: > Hi folks > > Is there any one out there, using flexlm, who is *not* having > problems with flexlm? > > Just curious. It is being particularly vicious to several of our > customers. > > Joe -- Geoff Galitz, geoff@galitz.org Oakland, California Lommersdorf, Deutschland From rb at hcl.in Thu Jun 21 21:35:12 2007 From: rb at hcl.in (Balamurugan.R) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] preemptive kernel and preemptive schedulers Message-ID: <467B5180.8070904@hcl.in> An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20070622/a2014bc4/attachment.html From TPierce at rohmhaas.com Fri Jun 22 06:44:11 2007 From: TPierce at rohmhaas.com (Thomas H Dr Pierce) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] question for licensed software users In-Reply-To: <467BB933.3070802@scalableinformatics.com> Message-ID: Dear Joe Landman, The only issues that I have seen with Flexlm involve using a server with multiple ethernet cards. The lmhostid is the "default" mac address, but making changes to the server (new OS, changing network cards) has changed the "default" eth0 definition at times. I have had eth0 and eth1 "change" identities as I patch the OS or add ethernet cards. It has never happened without cause, even though it was unexpected at the time of the software change. ------ Sincerely, Tom Pierce Joe Landman Sent by: beowulf-bounces@beowulf.org 06/22/2007 07:57 AM Please respond to landman@scalableinformatics.com To Toon Knapen cc beowulf@beowulf.org Subject Re: [Beowulf] question for licensed software users Toon Knapen wrote: > What kind of problems: developing using FlexLm or deploying an app which > needs to talk to a flexlm-server ? flexlm server suddenly and inexplicably switching its lmhostid. After a reboot. Renders all licenses unusable. This happened recently at two customers sites. One was a windows license server, one was a linux license server. Restarting the boxes didn't help. Restarting the daemon didn't help. Looking at the license files didn't indicate an expiration. The logs didn't indicate this either. It just appeared (in the logs) that the thing stopped believing it was server 1 (the lmhostid is a mac address BTW), and suddenly started believing it was a server 2 (well, it took the other mac address as its lmhostid). Very annoying. Have seen this on multiple boxes over the past few years, ones we haven't touched as well as ones we try to "fix" (e.g. alter the setup so that the flexlm daemon will in fact run correctly). 
This impacts multiple vendor products BTW. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070622/43e91dd4/attachment.html From hahn at mcmaster.ca Fri Jun 22 17:07:34 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Setting up NFS directory mount on client In-Reply-To: <005801c7b505$a18a0730$f6339780@libra.cc.rochester.edu> References: <200706221517.l5MFHqLH030863@bluewest.scyld.com> <005801c7b505$a18a0730$f6339780@libra.cc.rochester.edu> Message-ID: > I can't figure out how to do that. I edited the /etc/fstab file with the > following line: > > mytestsrv:/ /home nfs4 auto,rw,nodev,sync,_netdev,proto=tcp,retry=10,rsiz > e=32768,wsize=32768,hard,intr 0 0 mytestsrv appears to be exporting its whole filesystem, which is a bit unusual. normally, a home server would just export /home - that is, an /etc/exports line like: /home 10.0.0.0/255.0.0.0(rw,async) > but then the files I want appear in: > /home/home/barney note that you _could_ also make the fstab entry (on the client side) mytestsrv:/home /home ... (without changing the exports entry). out of curiosity, how are you finding nfs4? I'd be most interested if you've performed any performance comparisons... regards, mark hahn. From rgb at phy.duke.edu Fri Jun 22 17:26:43 2007 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Setting up NFS directory mount on client In-Reply-To: <005801c7b505$a18a0730$f6339780@libra.cc.rochester.edu> References: <200706221517.l5MFHqLH030863@bluewest.scyld.com> <005801c7b505$a18a0730$f6339780@libra.cc.rochester.edu> Message-ID: On Fri, 22 Jun 2007, A Lenzo wrote: > I can't figure out how to do that. I edited the /etc/fstab file with the > following line: > > mytestsrv:/ /home nfs4 auto,rw,nodev,sync,_netdev,proto=tcp,retry=10,rsiz > e=32768,wsize=32768,hard,intr 0 0 > > > but then the files I want appear in: > /home/home/barney As I said, getting a good (not terribly expensive!) book on this would be worth its weight in CDs. Here are a few lines from the /etc/fstabs in my house. uriel (suitably defined in /etc/hosts) is my primary household server -- a small md RAID 5 array. Note that the home directory is on its own PARTITION on uriel -- it is a Very Good Idea to put a shared home directory (or an unshared home directory) onto its own partition, as that makes doing a full reinstall of the operating system possible/easy.
SO, here's the relevant portion of uriel's fstab: rgb@lilith|B:1018>uriel cat /etc/fstab /dev/md0 / ext3 defaults 1 1 /dev/md1 /var/www/html ext3 defaults 1 1 /dev/md2 /home ext3 defaults 1 1 Here are its /etc/exports: rgb@lilith|B:1019>uriel cat /etc/exports # /etc/exports # A list of filesystems or directories to export matched with hosts # with permission to mount /home *.rgb.private.net(rw,no_root_squash,sync) /var/www/html *.rgb.private.net(rw,no_root_squash,sync) Here is the NFS part of /etc/fstab on serpent, one of uriel's clients: rgb@lilith|B:1020>serpent cat /etc/fstab LABEL=/1 / ext3 defaults 1 1 devpts /dev/pts devpts gid=5,mode=620 0 0 tmpfs /dev/shm tmpfs defaults 0 0 proc /proc proc defaults 0 0 sysfs /sys sysfs defaults 0 0 LABEL=SWAP-hda5 swap swap defaults 0 0 uriel:/home /home nfs defaults 0 0 uriel:/var/www/html /var/www/html nfs defaults 0 0 (Sorry, the tabs are all collapsed so it isn't as pretty as it is in the real thing, where the fields line up). The home directory looks like: rgb@lilith|B:1021>serpent dir /home patrick sam william rgb sfi wrankin jsquyres lost+found Music (where I put our household music collection on this as well so I can listen from any machine in the house). All of these are user accounts. Any of these people (sons, wives, a couple of colleagues) can log in to some or all of my home systems and have "their" home directories. HTH, rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From hahn at MCMASTER.CA Fri Jun 22 17:23:44 2007 From: hahn at MCMASTER.CA (Mark Hahn) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] preemptive kernel and preemptive schedulers In-Reply-To: <467B5180.8070904@hcl.in> References: <467B5180.8070904@hcl.in> Message-ID: > RT Linux (I am a newbie to RT=Linux) is based on the concept of Preemptive kernels where the jobs are > temporally preempted say a FIFO scheduler implemented at the kernel level. when people say "RT Linux", they normally mean various flavors of hard-RT hacks (not meant in an entirely disparaging way;) on such systems, it much more than simply suspending user-space - back in the day, the idea of preempting the whole kernel was extremely controversial. my impression is that since then, the (real, kernel.org kernel) has become drastically more predictable in latency, and in many ways more preemptable. even ~6 years ago, it was quite feasible to run realtime applications on the kernel.org kernel - I did realtime video generation and response timing for psychophysics labs. it wasn't hard-RT in the sense of "provably always meets all deadlines", but for a lab, it just worked... > I know that the same is achieved by submitting the job through a job scheduler(ca. PBS Pro) installed > on top of the OS. a job scheduler is entirely different from hard-RT stuff. a job scheduler is primarily concerned with managing cluster resources, which may indeed include suspending jobs. at least the conventional ones like PBS/LSF/etc are most definitely _not_ hard-RT - if nothing else, they tend to respond on the order of several seconds, not microseconds... > I wanted to know whether schedulers can be alternatives for preemptive kernels. sure. in the most abstract sense, they are the same. the original RTLinux stuff was based on an executive/hypervisor which preempted the kernel much as a normal job scheduler might preempt a job. 
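For what it's worth, the soft-RT capability of the stock kernel is reachable straight from user space via the `chrt` tool from util-linux (a sketch; switching to SCHED_FIFO needs root or CAP_SYS_NICE, and option spellings vary a little between chrt versions):

```shell
#!/bin/sh
# Show the valid priority range for each scheduling policy,
# including SCHED_FIFO.
chrt -m

# Run a command under SCHED_FIFO at priority 10; unprivileged users
# get a permission error and fall through to the message below.
chrt -f 10 sleep 1 || echo "no privilege for SCHED_FIFO"
```

This is soft-RT only: low and predictable latency most of the time, with no hard guarantee of meeting every deadline.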
these days, the boundaries are further blurred by virtualization... regards, mark hahn. From buccaneer at rocketmail.com Sat Jun 23 05:17:33 2007 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Setting up NFS directory mount on client In-Reply-To: Message-ID: <200750.54071.qm@web30609.mail.mud.yahoo.com> Great advice coming to you. You now know how to handle the exports. Here are my $0.02: (1) If you are firewalled off from the world, you can implement NIS, which is easy, though not secure. (2) I would not export /home, but rather place home dirs where the file system is more robust (I like raid and am also known for having trust issues.) (3) learn how to set up autofs. From diep at xs4all.nl Sat Jun 23 08:57:16 2007 From: diep at xs4all.nl (Vincent Diepeveen) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] any gp-gpu clusters? References: Message-ID: <009c01c7b5af$307e8250$0900a8c0@objection> Hello Mark, Well, I've spent the past few weeks investigating cards, and it seems that so far the marketing department is far ahead of actual performance. For the 8800 card, the fastest FFT that I could find claims 100 gflop out of a very expensive 8800 card that on paper should deliver nearly half a teraflop. That is quite disappointing. And we haven't even investigated that FFT yet, as it seems to do something that most of us don't need at all; what we all need is far more complicated to get working really well on those cards. We also didn't even discuss how to do big matrix calculations, knowing the complexity of implementing this into the architecture.
You mention a thought that many have had already, namely if you build a cluster, that within a year or 2 you can quite easily upgrade the cards in each node. Though this sounds interesting, right now a single card isn't delivering more than what a quadcore can deliver you, whereas this quadcore can do much more and can use more RAM. When the power6 system got presented in Amsterdam a week ago (40 Tflop in 2008, right now it's power5 and 14 Tflop), i still can remember how one scientist was very happy with the 64GB of ram that each node has, as RAM speeds his calculations up more than additionally processing power. So he for sure won't line up for calculating within videocards with limited RAM. If you plan to put a card or 4 into a single node, please realize that a single quadcore node eats about 172 watt (when not using videocard nor i/o) or 180 watt when using a videocard, this with all 4 cores at full usage. This where a single videocard is having a TDP of far over 200 watt, so at full usage. If you plan to put in a videocard or 4 @ 225 watt each, you have some monster of an energy bill in return. The easiest programming language (CUDA) also delivers the smallest amount of performance it seems, versus ATI's 2900 card. The advantages of using a bunch of videocards in a single node is basically next: a) the speculation that the next generation videocards from ATI and NVIDIA will deliver great performance for those who can use the card b) the theoretic possibility to save upon network costs, as the network is basically a pci-e 16x slot at the mainboard. So where one card is perhaps nearly equal to a quadcore, just on paper, for something that needs very little RAM; it is obvious that if you put in 4 of those cards that you still just need 1 network card in the node to connect the network. 
c) on paper it would be possible that nodes equipped with 2 videocards, 1 simple card to address the system and 1 card to do calculations upon, can be used by 2 users at the same time. One person could use on paper the videocard and the other one the rest of the node. This is however wishful thinking as of now. Which university is going to put in a monster that eats 200 watt or so at full performance and that just 1 or 2 users can use? There are however a few weaknesses that remain: a) you need n+1 cards in a system to use n cards for calculations b) The measured latency, so not theoretic but practical latency measured here, between host RAM and the card's RAM is far worse than what network cards deliver; 50 us roughly for the 8800 versus 1.5 us roughly for network cards one-way ping pong latency. The bandwidth is not better either, and with several cards per node that will probably deteriorate. c) the limited amount of RAM on-card and the huge price for cards that do have more than half a gigabyte DDR3; nvidia's high clocked cards really are quite expensive. d) the huge mass production that ATI and NVIDIA must achieve in order to sell those cards to keep the price a bit affordable, instead of thousands a card, is counterproductive in our direction. For just graphics all they need is single precision floating point, whereas the few guys (that's people in this beowulf list) who want a card that is programmable like a cpu and use it for DSP type workloads are quite limited in number. They need to produce and sell tens of millions of those cards, so selling a couple of thousands for calculation type workloads is not really interesting to ati/nvidia, and it is rather wishful thinking that cards will get really optimized for what we really need. e) it is very hard to get information about the cards, like how caches work; yes, it's not even clear how BIG caches are on a card and what bottlenecks are on the cards.
So programming for those cards in the manner that HPC needs, namely getting the utmost performance out of them, is totally impossible to do with some generic programming language. It requires complete fulltime dedication, friends at nvidia or ati to get more info, and so on. It is very specialized work, in short. This is currently by far the biggest obstacle to start programming for those cards. f) the few attempts that have been tried so far had very disappointing results for whatever reason; the lack of information basically means that the huge marketing balloons of ATI and NVIDIA promising nearly half a teraflop per card now are just not even close to reality. Every project on it so far has failed to deliver more performance than existing generic code already delivers at c2q. That said, on paper there is a theoretic possibility that such cards in future (perhaps end 2007) get huge Teraflop capabilities single precision, which cpus won't have any time soon, so keeping an eye on them is very interesting. As of now the graphics cards are simply our only hope to get great gflop capabilities for a small price. Giving up that dream not many of us will want to do. Yet so far it is a mystery how to beat a 3Ghz core2 @ 16 cores dual Xeon node with a big L2/L3 with such a graphics card that has such tiny caches and is lobotomized everywhere, so that the total number of instructions it can process on paper simply can never be reached. To keep objective, ATI's latest 2900 card has 64 streaming processors, which ATI markets as 320 by the way, a direct factor-of-5 exaggeration, and is clocked at just 742Mhz. So you start at a disadvantage against core2 of a factor: 2.4 GHz / 0.742 GHz = 3.2 So you must somewhere win a factor 3.2 to just *keep the same speed* for your code. This while on 22 July the 2.4 GHz quadcore drops to 266 dollars, whereas the ati2900 is currently priced nearly 400 EURO here. It is very hard to compete when you already must make up for a factor 3+ to start with.
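Vincent's clock handicap is just the ratio of the two clocks; a quick back-of-the-envelope check using his figures (assuming, purely for illustration, equal work per clock otherwise):

```shell
# 2.4 GHz Core 2 clock vs. 0.742 GHz ATI 2900 clock: the factor the
# GPU code must win back per clock merely to break even.
awk 'BEGIN { printf "%.1f\n", 2.4 / 0.742 }'
# -> 3.2
```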
That 4.7Ghz power6 is far more interesting in that sense, yet i know in advance i won't get any system time at it, whereas i CAN buy a videocard for a couple of hundreds of euro's. The future will provide answers therefore whether future graphics chips can kick butt for a small price, i sure hope so. Thanks, Vincent ----- Original Message ----- From: "Mark Hahn" To: "Beowulf Mailing List" Sent: Thursday, June 21, 2007 4:57 PM Subject: [Beowulf] any gp-gpu clusters? > Hi all, > is anyone messing with GPU-oriented clusters yet? > > I'm working on a pilot which I hope will be something like 8x > workstations, each with 2x recent-gen gpu cards. > the goal would be to host cuda/rapidmind/ctm-type gp-gpu development. > > part of the motive here is just to create a gpu-friendly infrastructure > into which commodity cards can be added and refreshed every 8-12 months. > as opposed to "investing" in quadro-level cards which are too expensive > enough to toss when obsoleted. > > nvidia's 1U tesla (with two g80 chips) looks potentially attractive, > though I'm guessing it'll be premium/quadro-priced - not really in keeping > with the hyper-moore's-law mantra... > > if anyone has experience with clustered gp-gpu stuff, I'm interested in > comments on particular tools, experiences, configuration of the host > machines and networks, etc. for instance, is it naive to think that > gp-gpu is most suited to flops-heavy-IO-light apps, and therefore doesn't > necessarily need a hefty (IB, 10Geth) network? > > thanks, mark hahn. 
> _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From toon.knapen at fft.be Mon Jun 25 00:48:44 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] preemptive kernel and preemptive schedulers In-Reply-To: References: <467B5180.8070904@hcl.in> Message-ID: <467F735C.5050103@fft.be> To me the difference is that a (or all) scheduler will not be able to 'continue' (and thus migrate) the job it preempted on any of the cpu's available to the scheduler. You might find more info on this in this thread: http://www.beowulf.org/archive/2007-April/018245.html toon Mark Hahn wrote: >> RT Linux (I am a newbie to RT=Linux) is based on the concept of >> Preemptive kernels where the jobs are >> temporally preempted say a FIFO scheduler implemented at the kernel >> level. > > when people say "RT Linux", they normally mean various flavors of > hard-RT hacks (not meant in an entirely disparaging way;) > > on such systems, it much more than simply suspending user-space - back > in the > day, the idea of preempting the whole kernel was extremely controversial. > my impression is that since then, the (real, kernel.org kernel) has > become drastically more predictable in latency, and in many ways more > preemptable. > > even ~6 years ago, it was quite feasible to run realtime applications on > the kernel.org kernel - I did realtime video generation and response > timing for psychophysics labs. it wasn't hard-RT in the sense of > "provably always meets all deadlines", but for a lab, it just worked... > >> I know that the same is achieved by submitting the job through a job >> scheduler(ca. PBS Pro) installed >> on top of the OS. > > a job scheduler is entirely different from hard-RT stuff. 
a job > scheduler is > primarily concerned with managing cluster resources, which may indeed > include suspending jobs. at least the conventional ones like > PBS/LSF/etc are most > definitely _not_ hard-RT - if nothing else, they tend to respond on the > order > of several seconds, not microseconds... > >> I wanted to know whether schedulers can be alternatives for preemptive >> kernels. > > sure. in the most abstract sense, they are the same. the original RTLinux > stuff was based on an executive/hypervisor which preempted the kernel > much as a normal job scheduler might preempt a job. these days, the > boundaries are further blurred by virtualization... > > regards, mark hahn. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From dnlombar at ichips.intel.com Mon Jun 25 08:06:04 2007 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] preemptive kernel and preemptive schedulers In-Reply-To: <467B5180.8070904@hcl.in> References: <467B5180.8070904@hcl.in> Message-ID: <20070625150604.GA3951@nlxdcldnl2.cl.intel.com> On Fri, Jun 22, 2007 at 10:05:12AM +0530, Balamurugan.R wrote: [something in HTML, that I didn't want to muck about with in Mutt] As Mark Hahn mentioned, the two schedulers you're looking at are very different. The kernel's scheduler determines which *processes* are to be run at each moment on the system. You can see the processes being scheduled and run via top, or better yet, atop; you can see how the processes all relate via pstree. As an example of these processes, I started the mutt program to read this email by typing the command mutt into a bash shell prompt; that process, the bash shell, read the command and created a new process from the mutt program, which then allowed me to read the email. 
To answer the email, mutt started the vim editor, allowing me to type this response you are now reading. Those are examples of the various processes. The job scheduler determines which *jobs* are to run; jobs are a sequence of commands that you want to run, likely composed of many processes as described above. Once the job scheduler launches the job, the running of the individual processes is controlled by the kernel's scheduler. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From mathog at caltech.edu Mon Jun 25 09:52:36 2007 From: mathog at caltech.edu (David Mathog) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Re: question for licensed software users Message-ID: > I have had eth0 and eth1 "change" identities as I patch the OS or add > ethernet cards. Recent versions of Linux, such as Mandriva 2007.1, have /etc/iftab and/or /etc/udev/rules.d/61-net_config.rules files. Both of these associate one specific MAC with eth0, eth1, etc. The original intent was noble - they were trying to provide a way to allow eth0 to always be the wired and eth1 the wireless network connection, for instance. However if these files get the least bit out of sync with the actual hardware all hell can break loose. For instance, if one clones a single NIC machine that uses these mechanisms the MAC won't match, eth0 won't be used and a new eth1 will be magically created. Unfortunately the firewall doesn't know about eth1 and everything network related then breaks. Result, most likely the machine will hang during boot. Others have reported machines which create a new eth# device at each boot, abandoning all the previous ones. The general fix for these sorts of bugs is to delete both of these files, and at the next boot the udev file will be recreated and will match the hardware. I have not seen a need for /etc/iftab and just leave it deleted.
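For reference, the MAC pinning described above lives in rules of roughly this shape (a sketch only; the file name and match-key spelling vary by distro and udev version, and the MAC addresses here are made up):

```
# /etc/udev/rules.d/61-net_config.rules (Mandriva-style name; other
# distros use z25_persistent-net.rules or similar)
KERNEL=="eth*", SYSFS{address}=="00:16:3e:aa:bb:cc", NAME="eth0"
KERNEL=="eth*", SYSFS{address}=="00:16:3e:dd:ee:ff", NAME="eth1"
```

Newer udev releases spell the match key ATTR{address} instead of SYSFS{address}.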
Now, back to Joe's problem: for the linux machines that are having flexlm problems, if the nature of the problem is that eth0 and eth1 are swapping around at random, and those distros have these mechanisms, be sure these two files exist and are configured properly so that eth0 and eth1 are rigidly mapped to fixed MAC addresses. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From dnlombar at ichips.intel.com Mon Jun 25 10:56:07 2007 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Setting up NFS directory mount on client In-Reply-To: References: <200706221517.l5MFHqLH030863@bluewest.scyld.com> <005801c7b505$a18a0730$f6339780@libra.cc.rochester.edu> Message-ID: <20070625175607.GA6308@nlxdcldnl2.cl.intel.com> On Fri, Jun 22, 2007 at 08:26:43PM -0400, Robert G. Brown wrote: >... > Any of these people (sons, wives, a couple of colleagues) can log in to Hmmm. Trying to tell us something, here? ;) -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From Daniel.Pfenniger at obs.unige.ch Mon Jun 25 09:57:38 2007 From: Daniel.Pfenniger at obs.unige.ch (Daniel Pfenniger) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Re: question for licensed software users In-Reply-To: References: Message-ID: <467FF402.6040800@obs.unige.ch> Hi, I also encountered a NIC problem with Maple flexlm. Flexlm checks for the existence of the original eth0 NIC present at Maple install time. This interface later went bad, so a second one was added and used instead of eth0. But then Maple was prevented from starting by flexlm. After some search it was found that after each reboot one has to initialize eth0 once (ifconfig eth0 ... up), even if it is disabled later, in order to satisfy flexlm. Needless to say, the time lost finding the cause of the flexlm dysfunction was yet another argument to hate licensed software.
Dan David Mathog wrote: >> I have had eth0 and eth1 "change" identities as I patch the OS or add >> ethernet cards. > > Recent versions of Linux, such as Mandriva 2007.1, have an /etc/iftab > and/or /etc/udev/rules.d/61-net_config.rules files. Both of these > associate one specific MAC with eth0, eth1, etc.. > The original intent was noble - they were trying to provide a > way to allow eth0 to always be the wired and eth1 the wireless > network connection, for instance. However if these files > get the least bit out of sync with the actual hardware > all hell can break loose. For instance, if one clones a single NIC > machine that uses these mechanisms the MAC won't match, eth0 won't be > used and a new eth1 will be magically created. Unfortunately > the firewall doesn't know about eth1 and everything network > related then breaks. Result, most likely the machine will hang > during boot. Others have reported machines which create a new > eth# device at each boot, abandoning all the previous ones. The general > fix for these sorts of bugs is to delete both of these files, and > at the next boot the udev file will be recreated and will match the > hardware. I have not seen a need for /etc/iftab and just leave it deleted. > > Now, back to Joe's problem, for the linux machines that are having > flexlm problems, if the nature of the problem is that eth0 and eth1 > are swapping around at random, and those distros have these mechanisms, > be sure these two files exist and are configured properly so that > eth0 and eth1 are rigidly mapped to fixed MAC addresses. > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rgb at phy.duke.edu Mon Jun 25 12:08:09 2007 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Setting up NFS directory mount on client In-Reply-To: <20070625175607.GA6308@nlxdcldnl2.cl.intel.com> References: <200706221517.l5MFHqLH030863@bluewest.scyld.com> <005801c7b505$a18a0730$f6339780@libra.cc.rochester.edu> <20070625175607.GA6308@nlxdcldnl2.cl.intel.com> Message-ID: On Mon, 25 Jun 2007, Lombard, David N wrote: > On Fri, Jun 22, 2007 at 08:26:43PM -0400, Robert G. Brown wrote: >> ... >> Any of these people (sons, wives, a couple of colleagues) can log in to I didn't say they were all mine, did I? In fact, I didn't say ANY of them were mine...;-) Heck, according to my ONE wife of 28 years as of last Saturday (on a bad day:-) even my sons aren't, actually, my sons... Perhaps I should rephrase it to "these people (who may or may not be related to me but for whom I act as a household systems administrator for non-Windows related matters, for Windows they are On Their Own) can log to..." rgb > > Hmmm. Trying to tell us something, here? ;) > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From mfatica at gmail.com Sat Jun 23 10:26:40 2007 From: mfatica at gmail.com (Massimiliano Fatica) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] any gp-gpu clusters? In-Reply-To: References: Message-ID: <8e6393ac0706231026w2ab6f8d1n3e1330eb082fdd83@mail.gmail.com> Mark,I am messing up with a GPU oriented cluster. I am now on travel to ISC, where I will show a sustained Teraflop with a workstation with 4 Tesla cards using VMD to do ion placement (for the list member going to Dresden stop by to the Nvidia booth to see the demo in action). This was a computation that used to take 100 CPU hours on an Altix and it is now done in the matter of minutes. 
Yes, the whole system probably consumes 900W ( the tdp of a tesla is 170W not 220W), but I can assure you that is nothing compared to a big Altix machine and you can put under your desk and do some real science. Several groups are building gpu-oriented cluster. Once mine is completed ( 8 compute nodes, each one with 2 Tesla boards) , it should be accessible for testing to academic and research group. People interested in testing their CUDA codes on cluster could drop me an email. On a side note, it is interesting to see all the speculations from people that have never used CUDA (and most of the time don't have a clue...) and at the same time to see quality software (mostly open source like VMD, NAMD, SOFA ) achieving pretty impressive results and enabling new science. Massimiliano PS: Usual disclaimer, I work in the GPU Computing group at NVIDIA. On 6/21/07, Mark Hahn wrote: > > Hi all, > is anyone messing with GPU-oriented clusters yet? > > I'm working on a pilot which I hope will be something > like 8x workstations, each with 2x recent-gen gpu cards. > the goal would be to host cuda/rapidmind/ctm-type gp-gpu development. > > part of the motive here is just to create a gpu-friendly > infrastructure into which commodity cards can be added and > refreshed every 8-12 months. as opposed to "investing" in > quadro-level cards which are too expensive enough to toss when obsoleted. > > nvidia's 1U tesla (with two g80 chips) looks potentially attractive, > though I'm guessing it'll be premium/quadro-priced - not really in > keeping with the hyper-moore's-law mantra... > > if anyone has experience with clustered gp-gpu stuff, I'm interested > in comments on particular tools, experiences, configuration of the host > machines and networks, etc. for instance, is it naive to think that > gp-gpu is most suited to flops-heavy-IO-light apps, and therefore doesn't > necessarily need a hefty (IB, 10Geth) network? > > thanks, mark hahn. 
> _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070623/d3bbb462/attachment.html From geoff at galitz.org Mon Jun 25 19:01:17 2007 From: geoff at galitz.org (Geoff Galitz) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] IMPI network monitoring Message-ID: Hi folks, I have some older Dell PowerEdge 850's. These guys have IPMI 1.1 capabilities. I cannot seem to find a way to interrogate these systems and get the status of the network interfaces. In particular I want to see if there are dropped packets, bad frames, collisions... That kind of thing. Going through the OS is not an option as they are running as an embedded platform. Is this possible to get this data via IPMI? -geoff -- Geoff Galitz, geoff@galitz.org Oakland, California Lommersdorf, Deutschland From andrew.robbie at gmail.com Tue Jun 26 07:50:09 2007 From: andrew.robbie at gmail.com (Andrew Robbie) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Re: question for licensed software users In-Reply-To: <467FF402.6040800@obs.unige.ch> References: <467FF402.6040800@obs.unige.ch> Message-ID: <89A71750-2A2A-4210-9F68-5EC22F1DDCB0@gmail.com> On 26/06/2007, at 2:57 AM, Daniel Pfenniger wrote: > Hi, > > I encountered also a NIC problem with Maple flexlm. Flexlm checks > the existence of the original eth0 NIC present at Maple install time. > This interface was later bad, so a second one was added and > used instead of eth0. But then Maple was then prevented to start by > flexlm. After some search it was found that after each reboot one has > to initialize eth0 once (ifconfig eth0 ... up), even if disabled later > in order to satisfy flexlm. 
It is possible under linux (and sometimes windows depending on the driver) to tell a card to use a different MAC address. If you are throwing out a bad NIC (ie two nodes with the same MAC will never appear on the network) this is a possible solution. It has to be done at every reboot, but that is easily accomplished by creating a startup script (or using rc.local). man ifconfig. > No need to say that the lost time finding the cause of flexlm > disfunction > was yet another argument to hate licensed software. Talk to your vendor. The more people who complain the better. Andrew > > Dan > > > David Mathog wrote: >>> I have had eth0 and eth1 "change" identities as I patch the OS or >>> add >>> ethernet cards. >> >> Recent versions of Linux, such as Mandriva 2007.1, have an /etc/iftab >> and/or /etc/udev/rules.d/61-net_config.rules files. Both of these >> associate one specific MAC with eth0, eth1, etc.. >> The original intent was noble - they were trying to provide a >> way to allow eth0 to always be the wired and eth1 the wireless >> network connection, for instance. However if these files >> get the least bit out of sync with the actual hardware >> all hell can break loose. For instance, if one clones a single NIC >> machine that uses these mechanisms the MAC won't match, eth0 won't be >> used and a new eth1 will be magically created. Unfortunately >> the firewall doesn't know about eth1 and everything network >> related then breaks. Result, most likely the machine will hang >> during boot. Others have reported machines which create a new >> eth# device at each boot, abandoning all the previous ones. The >> general >> fix for these sorts of bugs is to delete both of these files, and >> at the next boot the udev file will be recreated and will match the >> hardware. I have not seen a need for /etc/iftab and just leave it >> deleted. 
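Concretely, the workaround Andrew describes is a few lines in rc.local (a fragment, not a script to run as-is; the MAC below is made up -- substitute the address your license was keyed to):

```
# /etc/rc.d/rc.local -- pin eth0 to the MAC the license manager expects.
# The interface must be down while the hardware address is changed.
ifconfig eth0 down
ifconfig eth0 hw ether 00:16:3E:AA:BB:CC
ifconfig eth0 up
```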
>> >> Now, back to Joe's problem, for the linux machines that are having >> flexlm problems, if the nature of the problem is that eth0 and eth1 >> are swapping around at random, and those distros have these >> mechanisms, >> be sure these two files exist and are configured properly so that >> eth0 and eth1 are rigidly mapped to fixed MAC addresses. >> >> Regards, >> >> David Mathog >> mathog@caltech.edu >> Manager, Sequence Analysis Facility, Biology Division, Caltech >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From ganesh at alvasystems.net Tue Jun 26 09:23:37 2007 From: ganesh at alvasystems.net (Ganesh Shetty) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] any gp-gpu clusters? Message-ID: I did (or at least attempted) something similar with the ATI stream processor card(s) using the Peakstream VM. We evaluated Cuda - do not want to get into that here. But now that GOOG has acquired Peakstream, we might have to take a second look at CUDA. -Ganesh > > Mark,I am messing up with a GPU oriented cluster. > > I am now on travel to ISC, where I will show a sustained Teraflop with a > workstation with 4 Tesla cards using VMD to do ion placement (for the list > member going to Dresden stop by to the Nvidia booth to see the demo in > action). This was a computation that used to take 100 CPU hours on an Altix > and it is now done in the matter of minutes. Yes, the whole system probably > consumes 900W ( the tdp of a tesla is 170W not 220W), but I can assure you > that is nothing compared to a big Altix machine and you can put under your > desk and do some real science.
> > Several groups are building gpu-oriented clusters. Once mine is completed (8 > compute nodes, each one with 2 Tesla boards), it should be accessible for > testing to academic and research groups. People interested in testing their > CUDA codes on a cluster can drop me an email. > > On a side note, it is interesting to see all the speculation from people > that have never used CUDA (and most of the time don't have a clue...) and at > the same time to see quality software (mostly open source, like VMD, NAMD, > SOFA) achieving pretty impressive results and enabling new science. > > > Massimiliano > PS: Usual disclaimer, I work in the GPU Computing group at NVIDIA. > > > > On 6/21/07, Mark Hahn wrote: > > > > Hi all, > > is anyone messing with GPU-oriented clusters yet? > > > > I'm working on a pilot which I hope will be something > > like 8x workstations, each with 2x recent-gen gpu cards. > > the goal would be to host cuda/rapidmind/ctm-type gp-gpu development. > > > > part of the motive here is just to create a gpu-friendly > > infrastructure into which commodity cards can be added and > > refreshed every 8-12 months. as opposed to "investing" in > > quadro-level cards which are too expensive to toss when obsoleted. > > > > nvidia's 1U tesla (with two g80 chips) looks potentially attractive, > > though I'm guessing it'll be premium/quadro-priced - not really in > > keeping with the hyper-moore's-law mantra... > > > > if anyone has experience with clustered gp-gpu stuff, I'm interested > > in comments on particular tools, experiences, configuration of the host > > machines and networks, etc. for instance, is it naive to think that > > gp-gpu is most suited to flops-heavy-IO-light apps, and therefore doesn't > > necessarily need a hefty (IB, 10Geth) network? > > > > thanks, mark hahn.
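As a starting point for the kind of host configuration Mark asks about, a node's GPU inventory can be checked with stock tools. This is a purely illustrative sketch, not from the thread; lspci output strings and the /dev/nvidia* device names depend on the system and driver:

```shell
#!/bin/sh
# Rough sketch: what GPU hardware does this node expose?
# Output is environment-dependent; the match strings are illustrative.
gpu_report() {
    if command -v lspci >/dev/null 2>&1; then
        # PCI devices that look like display/GPU hardware
        lspci | grep -iE 'vga|3d|display|nvidia' || echo "no GPU found via lspci"
    else
        echo "lspci not available"
    fi
    # CUDA-era NVIDIA drivers create /dev/nvidia* device nodes when loaded
    ls /dev/nvidia* 2>/dev/null || echo "no /dev/nvidia* device nodes"
}
gpu_report
```

Running this across the cluster (e.g. via ssh in a loop) gives a quick picture of which nodes actually carry usable cards before any CUDA code is deployed.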
> > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > -- Ganesh P Shetty From landman at scalableinformatics.com Tue Jun 26 20:19:45 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] IMPI network monitoring In-Reply-To: References: Message-ID: <4681D751.8050103@scalableinformatics.com> Hi Geoff Geoff Galitz wrote: > > Hi folks, > > I have some older Dell PowerEdge 850's. These guys have IPMI 1.1 > capabilities. I cannot seem to find a way to interrogate these systems and > get the status of the network interfaces. In particular I want to see if > there are dropped packets, bad frames, collisions... That kind of thing. These items probably aren't available. I don't see such things in the IPMI 2.0 implementations. Not sure it is in the 2.0 spec either. > > Going through the OS is not an option as they are running as an embedded > platform. Hmmm > > Is it possible to get this data via IPMI? If you have the capability to develop a custom i2c or similar interface to the network hardware, you should be able to monitor it that way. Then you would also need to include a method for ipmi to return custom measurements from i2c. This would require hardware and software changes. Joe > > -geoff > > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From laytonjb at charter.net Wed Jun 27 04:50:20 2007 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] any gp-gpu clusters?
In-Reply-To: References: Message-ID: <46824EFC.9070309@charter.net> Ganesh Shetty wrote: > I did (or at least attempted) something similar with the ATI stream processor card(s) > using the Peakstream VM. We evaluated Cuda - do not want to get into that here. > > But now that GOOG has acquired Peakstream, we might have to take a second look at CUDA. > > -Ganesh > There's also RapidMind. It's somewhat similar to Peakstream. Jeff From deadline at clustermonkey.net Wed Jun 27 06:30:22 2007 From: deadline at clustermonkey.net (Douglas Eadline) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Are You Ready for "Intel Cluster Ready" Message-ID: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> Intel has announced their new "Cluster Ready" program. I have a short write-up with links on Cluster Monkey. http://www.clustermonkey.net//content/view/204/1/ It is an Intel centric spec for clusters. A "good thing" in general I think, though I have concerns. (read the post) Opinions ? (yes a dangerous but worthy question on this list!) -- Doug From peter.st.john at gmail.com Wed Jun 27 07:53:41 2007 From: peter.st.john at gmail.com (Peter St. John) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Re: question for licensed software users In-Reply-To: <89A71750-2A2A-4210-9F68-5EC22F1DDCB0@gmail.com> References: <467FF402.6040800@obs.unige.ch> <89A71750-2A2A-4210-9F68-5EC22F1DDCB0@gmail.com> Message-ID: Can you use a competitor of Flexlm, such as IBM's LUM, or is Flexlm required by Maple? If the latter, I'd complain to Maple. It's sad to me that there are folks who need proprietary UIs to do science, while there are businesses paying C programmers to bucket-brigade cruft that should be handled by pretty, and expensive, (N+1)GL packages with ribbons tied around them. C'est la vie. Peter On 6/26/07, Andrew Robbie wrote: > > > On 26/06/2007, at 2:57 AM, Daniel Pfenniger wrote: > > > Hi, > > > > I also encountered a NIC problem with Maple flexlm.
Flexlm checks > > the existence of the original eth0 NIC present at Maple install time. > > This interface later went bad, so a second one was added and > > used instead of eth0. But then Maple was prevented from starting by > > flexlm. After some search it was found that after each reboot one has > > to initialize eth0 once (ifconfig eth0 ... up), even if it is disabled later, > > in order to satisfy flexlm. > > It is possible under linux (and sometimes windows depending on the > driver) to tell a card to use a different MAC address. If you are > throwing out a bad NIC (i.e. two nodes with the same MAC will never > appear on the network) this is a possible solution. It has to be done > at every reboot, but that is easily accomplished by creating a > startup script (or using rc.local). man ifconfig. > > > No need to say that the time lost finding the cause of the flexlm > > dysfunction > > was yet another argument to hate licensed software. > > Talk to your vendor. The more people who complain the better. > > Andrew > > > > > > Dan > > > > > > David Mathog wrote: > >>> I have had eth0 and eth1 "change" identities as I patch the OS or > >>> add > >>> ethernet cards. > >> > >> Recent versions of Linux, such as Mandriva 2007.1, have /etc/iftab > >> and/or /etc/udev/rules.d/61-net_config.rules files. Both of these > >> associate one specific MAC with eth0, eth1, etc. > >> The original intent was noble - they were trying to provide a > >> way to allow eth0 to always be the wired and eth1 the wireless > >> network connection, for instance. However, if these files > >> get the least bit out of sync with the actual hardware > >> all hell can break loose. For instance, if one clones a single-NIC > >> machine that uses these mechanisms the MAC won't match, eth0 won't be > >> used and a new eth1 will be magically created. Unfortunately > >> the firewall doesn't know about eth1 and everything network > >> related then breaks. Result: most likely the machine will hang > >> during boot.
Others have reported machines which create a new > >> eth# device at each boot, abandoning all the previous ones. The > >> general > >> fix for these sorts of bugs is to delete both of these files, and > >> at the next boot the udev file will be recreated and will match the > >> hardware. I have not seen a need for /etc/iftab and just leave it > >> deleted. > >> > >> Now, back to Joe's problem, for the linux machines that are having > >> flexlm problems, if the nature of the problem is that eth0 and eth1 > >> are swapping around at random, and those distros have these > >> mechanisms, > >> be sure these two files exist and are configured properly so that > >> eth0 and eth1 are rigidly mapped to fixed MAC addresses. > >> > >> Regards, > >> > >> David Mathog > >> mathog@caltech.edu > >> Manager, Sequence Analysis Facility, Biology Division, Caltech > >> _______________________________________________ > >> Beowulf mailing list, Beowulf@beowulf.org > >> To change your subscription (digest mode or unsubscribe) visit > >> http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070627/af9ab73c/attachment.html From peter.st.john at gmail.com Wed Jun 27 08:17:06 2007 From: peter.st.john at gmail.com (Peter St. 
John) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Are You Ready for "Intel Cluster Ready" In-Reply-To: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> References: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> Message-ID: Doug, I just want to note: "...[Intel's standard] adds a requirement on the message layer implementation that differences in the device-level API be hidden from the application code. *An example* [emphasis mine] of an implementation of a message layer that meets this requirement is the Intel (tm) MPI Library..." This makes me a bit optimistic that other parts than Intel's (hopefully all parts, e.g. AMD microprocessors) can conform to the standard. Requiring MPI (as opposed to say virtual machines) is probably a necessary limitation to the standard's scope. Plainly they don't require any particular unix. And probably they will want to permit MS compilers to conform, don't you think? Peter On 6/27/07, Douglas Eadline wrote: > > Intel has announced their new "Cluster Ready" > program. I have a short write-up with links on > Cluster Monkey. > > http://www.clustermonkey.net//content/view/204/1/ > > It is an Intel centric spec for clusters. A "good thing" > in general I think, though I have concerns. (read the post) > > Opinions ? (yes a dangerous but worthy question on this list!) > > -- > Doug > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070627/1489a81b/attachment.html From buccaneer at rocketmail.com Wed Jun 27 08:35:18 2007 From: buccaneer at rocketmail.com (Buccaneer for Hire.) 
Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Are You Ready for "Intel Cluster Ready" In-Reply-To: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> Message-ID: <854102.82264.qm@web30608.mail.mud.yahoo.com> --- Douglas Eadline wrote: > > Intel has announced their new "Cluster Ready" > program. I have a short write-up with links on > Cluster Monkey. > > http://www.clustermonkey.net//content/view/204/1/ > > It is an Intel centric spec for clusters. A "good > thing" > in general I think, though I have concerns. (read > the post) > > Opinions ? (yes a dangerous but worthy question on > this list!) (1) It is a commercial concern, and that is what one expects for the most part. (2) If you can profit from it. We are currently testing different MPIs for performance reasons. From kball at pathscale.com Wed Jun 27 09:30:29 2007 From: kball at pathscale.com (Kevin Ball) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Are You Ready for "Intel Cluster Ready" In-Reply-To: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> References: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> Message-ID: <1182961829.14030.38.camel@ammonite> Wow... they require, on every node: Java Runtime Environment Perl Python Tcl Kitchen Sink* *(Okay, only figuratively) But I guess we already knew 'lean and mean' is not something Intel thinks about very often.
(yes a dangerous but worthy question on this list!) > > -- > Doug > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From i.kozin at dl.ac.uk Wed Jun 27 10:11:27 2007 From: i.kozin at dl.ac.uk (Kozin, I (Igor)) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Are You Ready for "Intel Cluster Ready" References: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> <1182961829.14030.38.camel@ammonite> Message-ID: I've heard about it but was expecting it to be like a hardware spec (a la PC spec). Rather surprising to see so many software requirements. It's sort of understandable why Perl or Python (MPICH2/IntelMPI) are required. But Java, Tcl and even Python should be optional. The funny thing is we have a cluster which should comply (loads of Intel software) and yet every time I update Intel compilers the installers complain the libs are not supported (because it's Suse 10.1). Everything works fine. So what? Igor -----Original Message----- From: beowulf-bounces@beowulf.org on behalf of Kevin Ball Sent: Wed 27/06/2007 17:30 To: Douglas Eadline Cc: Beowulf@beowulf.org Subject: Re: [Beowulf] Are You Ready for "Intel Cluster Ready" Wow... they require, on every node: Java Runtime Environment Perl Python Tcl Kitchen Sink* *(Okay, only figuratively) But I guess we already knew 'lean and mean' is not something Intel thinks about very often. -Kevin On Wed, 2007-06-27 at 06:30, Douglas Eadline wrote: > Intel has announced their new "Cluster Ready" > program. I have a short write-up with links on > Cluster Monkey. > > http://www.clustermonkey.net//content/view/204/1/ > > It is an Intel centric spec for clusters. A "good thing" > in general I think, though I have concerns. (read the post) > > Opinions ? (yes a dangerous but worthy question on this list!) 
> > -- > Doug > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From brian.dobbins at yale.edu Wed Jun 27 11:50:43 2007 From: brian.dobbins at yale.edu (Brian Dobbins) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Are You Ready for "Intel Cluster Ready" In-Reply-To: References: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> <1182961829.14030.38.camel@ammonite> Message-ID: <4682B183.1070307@yale.edu> Hi Doug and everyone else, I remember some of our initial clusters running with really tiny ramdisks, and the idea of putting anything non-essential on the nodes seemed like blasphemy, but as a quick counterpoint, I just installed some nodes and included Tcl as required by the (environment) Modules package. I think this in turn is used by OSCAR's "switcher", and NPACI Rocks - in fact, in section 8.9.6 of the PDF, they do list OSCAR and Rocks as 'satisfying the requirements.' Having something like Modules or SoftEnv makes life much easier and is well worth the small amount of space on a system, in my opinion. (Though, to be fair, I think SoftEnv doesn't need Tcl at all.) The point is, memory and disk are cheap, and even with a ramdisk, an extra 1MB (for Tcl) is hardly anything to sweat in most cases. Python? Besides MPICH and its derivatives, some OS utilities (yum, for example) use it. Also, I didn't see it in there, but does the spec say they have to be locally installed, or can these packages be mounted via NFS? I've only glanced very quickly at the document itself (anyone else see the "Intel Confidential" markings on every page?), but it might just be that the Java, X11, etc. 
packages that they look for are required to run the full suite of Intel cluster tools. Chances are, this also mimics the setup they have at their labs. So, basically, this seems to me to be geared towards ensuring customers that they work with have an environment that is up to par with their own platforms... nothing more, nothing less. It's like the situation with the Intel compilers that Igor mentions - you can install them on SuSE and they work fine, but it isn't technically 'supported'. (Heh, naturally enough, the spec also seems to call for at least one 'Intel 64' processor per node. No 32-bit and, understandably enough, no AMD.) In short, mostly everything IS optional in a cluster, but not if you want support. Seems pretty much the case with any system from any vendor, no? Cheers, - Brian Brian Dobbins Yale Engineering HPC Kozin, I (Igor) wrote: > I've heard about it but was expecting it to be like a hardware spec (a la PC spec). Rather surprising to see so many software requirements. It's sort of understandable why Perl or Python (MPICH2/IntelMPI) are required. But Java, Tcl and even Python should be optional. The funny thing is we have a cluster which should comply (loads of Intel software) and yet every time I update Intel compilers the installers complain the libs are not supported (because it's Suse 10.1). Everything works fine. So what? > > Igor > From dnlombar at ichips.intel.com Wed Jun 27 14:19:37 2007 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Are You Ready for "Intel Cluster Ready" In-Reply-To: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> References: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> Message-ID: <20070627211937.GA30768@nlxdcldnl2.cl.intel.com> On Wed, Jun 27, 2007 at 09:30:22AM -0400, Douglas Eadline wrote: > > Intel has announced their new "Cluster Ready" > program. I have a short write-up with links on > Cluster Monkey. 
> > http://www.clustermonkey.net//content/view/204/1/ > > It is an Intel centric spec for clusters. A "good thing" > in general I think, though I have concerns. (read the post) > The "Intel Wall" that you mention in your article strictly refers to the *runtime* components, which are freely downloadable directly from intel.com. Our motivation is to ensure that customers and ISVs can rely on a specific set of capabilities, and not have to wonder if they're present. Also, note that while those specific runtimes must be present, there's no requirement that applications must use them; there's also no intention of excluding any other compilers, MPI implementations, kernel libraries, &etc. Perhaps more to the point, the spec requires that applications which do use packages outside the spec install those "extra" packages too, in a way that doesn't interfere with other applications on the cluster. The problems being addressed here are applications interfering with each other, e.g., by installing conflicting utilities or libraries without clear separation. Finally, you will also note the spec doesn't discuss *how* the cluster is built or what management tools are present. We've tried very hard to include: - open source and proprietary stacks - diskful and diskless nodes - fully distributed and SSI kernels - "Enterprise" or community distros In the best of all possible worlds, we'll see certified clusters built in all those combinations, and more. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own.
From dnlombar at ichips.intel.com Wed Jun 27 14:32:41 2007 From: dnlombar at ichips.intel.com (Lombard, David N) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Are You Ready for "Intel Cluster Ready" In-Reply-To: <4682B183.1070307@yale.edu> References: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> <1182961829.14030.38.camel@ammonite> <4682B183.1070307@yale.edu> Message-ID: <20070627213241.GA1065@nlxdcldnl2.cl.intel.com> On Wed, Jun 27, 2007 at 02:50:43PM -0400, Brian Dobbins wrote: > Hi Doug and everyone else, > >... > > The point is, memory and disk are cheap, and even with a ramdisk, > an extra 1MB (for Tcl) is hardly anything to sweat in most cases. > Python? Besides MPICH and its derivatives, some OS utilities (yum, > for example) use it. Also, I didn't see it in there, but does the > spec say they have to be locally installed, or can these packages > be mounted via NFS? The spec doesn't care. They need to be "accessible" if I recall correctly. > I've only glanced very quickly at the document itself (anyone else > see the "Intel Confidential" markings on every page?), but it might Sigh, proofreading is a thankless and error-prone process. > just be that the Java, X11, etc. packages that they look for are > required to run the full suite of Intel cluster tools. Not all are required for Intel tools, but other already registered applications have required these in toto. Disk and memory should not be the reason that applications fail in surprising ways. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From hahn at mcmaster.ca Wed Jun 27 21:28:35 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Re: question for licensed software users In-Reply-To: <89A71750-2A2A-4210-9F68-5EC22F1DDCB0@gmail.com> References: <467FF402.6040800@obs.unige.ch> <89A71750-2A2A-4210-9F68-5EC22F1DDCB0@gmail.com> Message-ID:
It has to be done at every reboot, but that is easily > accomplished by creating a startup script (or using rc.local). man ifconfig. everything he said was right, though I'd recommend using the /sbin/ip tool instead of ifconfig. both will work, but 'ip' is a very nice, modernized tool for network configuration. regards, mark hahn. From hahn at mcmaster.ca Wed Jun 27 22:02:15 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] IMPI network monitoring In-Reply-To: References: Message-ID: > capabilities. I cannot seem to find a way to interrogate these systems and > get the status of the network interfaces. In particular I want to see if > there are dropped packets, bad frames, collisions... That kind of thing. I wouldn't expect to be able to do that. things like dropped packets are inherently OS-dependent, and not simply some counter on the NIC. further, the IPMI "coprocessor" (BMC) is, afaikt, normally limited to interacting only with the I2C bus on the system, which would connect to things like the fan controller. but I'd expect a nic to not have an I2C connection, and only talk over PCI. > Going through the OS is not an option as they are running as an embedded > platform. I think you mean "going through user-space like ssh host ifconfig". but does your environment actually preclude querying the OS directly, such as adding a kernel hook which responds to specific packets with data pulled from the (OS-level) counters for dropped packets? for instance, I believe ICMP packets are handled entirely within the kernel... full SNMP in the kernel is probably a terrible idea, but something more basic might be eminently doable. regards, mark hahn. 
From eugen at leitl.org Thu Jun 28 10:01:00 2007 From: eugen at leitl.org (Eugen Leitl) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] [info] top500 for june 2007 Message-ID: <20070628170100.GA7079@leitl.org> ----- Forwarded message from Alejandro Dubrovsky ----- From: Alejandro Dubrovsky Date: Fri, 29 Jun 2007 01:01:21 +1000 To: info@postbiota.org Cc: transhumantech Subject: [info] top500 for june 2007 X-Mailer: Evolution 2.8.3 ( exponential graph haters avert your eyes. all others check out this beauty http://www.top500.org/lists/2007/06/performance_development pasted below, general highlights http://www.top500.org/lists/2007/06/highlights/general ) General highlights from the Top 500 since the last edition All changes are from November 2006 to June 2007: * The entry level to the list moved up to the 4.005 TFlop/s mark on the Linpack benchmark, compared to 2.737 TFlop/s six months ago. * The last system on the list would have been listed at position 216 in the last TOP500 just six months ago. This is the largest turnover rate ever seen in the 15 years of the TOP500 project. * Total accumulated performance has grown to 4.92 PFlop/s, compared to 3.54 PFlop/s six months ago and 2.79 PFlop/s one year ago. * The entry point for the top 100 increased in six months from 6.65 TFlop/s to 9.29 TFlop/s. * A total of 289 systems (57.8 percent) are now using Intel processors. This is slightly up from six months ago (261 systems, 52.5 percent) and represents a typical fraction recently seen for Intel chips in the TOP500. * The AMD Opteron family, which passed the IBM Power processors six months ago, remained the second most common processor family with 105 systems (21 percent) down from 113 systems (22.6 percent) six months ago. 85 systems (17 percent) use IBM Power processors down from 93 systems (18.6 percent) six months ago. * Dual core processors are the dominant chip architecture.
The most impressive growth was in the number of systems using the Intel Woodcrest dual-core chips, which grew in six months from 31 to 205. * Another 90 systems use Opteron dual core processors up from 75 six months ago. * 373 systems are labeled as clusters, making this the most common architecture in the TOP500 with a stable share of 74.6 percent. * InfiniBand technology is strongly increasing its share to 127 systems up from 78 six months ago. But Gigabit Ethernet is still the most used internal system interconnect technology (207 systems, down from 211 six months ago). * For quite some time, IBM and Hewlett-Packard have sold the bulk of systems at all performance levels of the TOP500. * IBM had been ahead of HP since June 2004 but has lost the lead in the number of systems this time with 38.4 percent (down from 47.2) compared to HP with 40.6 percent (up from 31.6). * IBM remains the clear leader in the TOP500 list in performance with 41.9 percent of installed performance (down from 49.5) compared to HP with 24.5 percent (up from 16.5). * In the system category again no other manufacturer could break the 5 percent barrier, but Dell got very close with 4.8 percent. * In the performance category the manufacturers with more than 5 percent are: Dell (9 percent of performance), Cray (7.3 percent of performance), and SGI (5.7 percent), each of which benefits from large systems in the TOP10. * IBM (82) and HP (181) together sold 263 out of 269 systems at commercial and industrial customers and have this important market segment clearly cornered. * The U.S. is clearly the leading consumer of HPC systems with 281 of the 500 systems. The European share (127 systems up from 95) recovered and is again larger than the Asian share (72 down from 79 systems). * Dominant countries in Asia are Japan with 23 systems (down from 30) and China with 13 systems (down from 18). * In Europe, the UK has established itself as the No. 1 with 43 systems (32 six months ago). Germany has to live with the No.
2 spot with 24 systems (19 six months ago). _______________________________________________ info mailing list info@postbiota.org http://postbiota.org/mailman/listinfo/info ----- End forwarded message ----- -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From geoff at galitz.org Wed Jun 27 10:30:23 2007 From: geoff at galitz.org (Geoff Galitz) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] Are You Ready for "Intel Cluster Ready" In-Reply-To: References: <45680.192.168.1.1.1182951022.squirrel@mail.eadline.org> <1182961829.14030.38.camel@ammonite> Message-ID: <24688.209.204.185.82.1182965423.squirrel@webmail.sonic.net> Those software requirements are indicative of web server farms rather than scientific HPC. -geoff > I've heard about it but was expecting it to be like a hardware spec (a la > PC spec). Rather surprising to see so many software requirements. It's > sort of understandable why Perl or Python (MPICH2/IntelMPI) are required. > But Java, Tcl and even Python should be optional. The funny thing is we > have a cluster which should comply (loads of Intel software) and yet every > time I update Intel compilers the installers complain the libs are not > supported (because it's Suse 10.1). Everything works fine. So what? > > Igor > > > -----Original Message----- > From: beowulf-bounces@beowulf.org on behalf of Kevin Ball > Sent: Wed 27/06/2007 17:30 > To: Douglas Eadline > Cc: Beowulf@beowulf.org > Subject: Re: [Beowulf] Are You Ready for "Intel Cluster Ready" > > Wow... they require, on every node: > > Java Runtime Environment > Perl > Python > Tcl > Kitchen Sink* > > *(Okay, only figuratively) > > But I guess we already knew 'lean and mean' is not something Intel > thinks about very often.
> > -Kevin > > > On Wed, 2007-06-27 at 06:30, Douglas Eadline wrote: >> Intel has announced their new "Cluster Ready" >> program. I have a short write-up with links on >> Cluster Monkey. >> >> http://www.clustermonkey.net//content/view/204/1/ >> >> It is an Intel centric spec for clusters. A "good thing" >> in general I think, though I have concerns. (read the post) >> >> Opinions ? (yes a dangerous but worthy question on this list!) >> >> -- From duane at duaneberry.net Thu Jun 28 03:29:58 2007 From: duane at duaneberry.net (duane@duaneberry.net) Date: Wed Nov 25 01:06:08 2009 Subject: [Beowulf] cold cathode fluorescent backlighting In-Reply-To: <6394934c0705251214r4ec157f6o882f391ec8c0e7d9@mail.gmail.com> References: <6394934c0705251214r4ec157f6o882f391ec8c0e7d9@mail.gmail.com> Message-ID: <50811.70.22.66.55.1183026598.squirrel@email.powweb.com> > Not having an Electronics background my questions may seem naive. However > as the following issues give me concern I should very much appreciate it > if they could be sorted out with some reliable knowledge. Even naive questions are a quest for knowledge and therefore honorable. [Even if they are a bit off-topic ;) ] Executive summary: You have nothing to worry about. For a truly informed opinion look for the list of approving authorities that must be clearly visible on each device. In the USA that would be the FCC and the UL. Other countries have their own equivalents. The standards they use for testing should be publicly available and include radiation metrics if applicable. Primer on Radiation Alpha and Beta "radiation" are actually particles like neutrons. These are the most dangerous forms of radiation BUT they are also the easiest to stop. If I remember arightly Alpha can be stopped by a sheet of paper and Beta is blocked by normal clothing. Gamma radiation (aka X-rays) is true radiation and requires some thing like lead shielding to stop. 
This is why an X-ray technician steps behind the lead shield or leaves the room after they get the film and the emitter positioned around your body.

*** NOTHING *** in the world of consumer electronics, including computers and peripherals, emits Gamma radiation.

> Firstly, do Liquid Crystal Display TV or computer monitors emit any
> ionizing radiation?

For CRT's there has been long debate, for LCD's no.

> If the LCD screen becomes damaged through the inadvertent use of the wrong
> type of cleaner or by using any abrasive cloth could it expose one to
> increased ionizing radiation?

No.

> Regarding the cold cathode fluorescent backlights of monitors I read in
> the Wikipedia encyclopedia under Cold Cathode that some ccfls use a source
> of beta radiation to start the ionization process. If this is the case
> then could LCD televisions expose us to beta or gamma radiation? I should
> like to replace my CRT TV with an LCD TV, but the thought of a radioactive
> material being present causes me much anxiety.

Even if a beta source inside a monitor/TV was continuously emitting, the physical construction would provide more than enough shielding. For power and regulatory considerations I doubt the source is on continuously.

IMHO moving from CRT to LCD is a good move in general.

Wikipedia is a good source of keywords for use in further research. I would not consider Wikipedia an authoritative source of information on its own.
> Looking forward to your informed response,
>
> Julia Howard
> email: juliarachel_howard@yahoo.co.uk
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

From tripatthi at yahoo.com Thu Jun 28 06:03:11 2007
From: tripatthi at yahoo.com (Saaantosh Tripatthi)
Date: Wed Nov 25 01:06:08 2009
Subject: [Beowulf] A newbie question regarding compilation of cpilog in Windows installation of mpich2
Message-ID: <352600.9712.qm@web31807.mail.mud.yahoo.com>

Hi,

I am new to parallel programming. I installed the Windows binary for mpich2 from the ANL website. Now when I try to compile one of the example files it gives an error. A typical line is clog_inttypes.h:15: error: parse error before "CLOG_int8_t". I would be grateful for any input.

regards,
Saaantosh

---------------------------------
Ready for the edge of your seat? Check out tonight's top picks on Yahoo! TV.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20070628/35cb31c5/attachment.html

From rgb at phy.duke.edu Thu Jun 28 13:56:16 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed Nov 25 01:06:08 2009
Subject: [Beowulf] cold cathode fluorescent backlighting
In-Reply-To: <50811.70.22.66.55.1183026598.squirrel@email.powweb.com>
References: <6394934c0705251214r4ec157f6o882f391ec8c0e7d9@mail.gmail.com>
	<50811.70.22.66.55.1183026598.squirrel@email.powweb.com>
Message-ID:

On Thu, 28 Jun 2007, duane@duaneberry.net wrote:

> Even if a beta source inside a monitor/TV was continuously emitting the
> physical construction would provide more than enough shielding. For
> power and regulatory considerations I doubt the source is on
> continuously.
>
> IMHO moving from CRT to LCD is a good move in general.
Actually, TVs and monitors at one point in time were notorious sources of soft x-rays from where the electrons hit the glass. Those of us who are old enough remember that we were told not to sit too close because it could damage our eyes. Those of us who sat too close anyway have a higher risk of cataracts and skin cancers.

However, one reason that sets are so damn heavy now, and why they discourage them from being put into landfills, is that long ago they mandated lead in the glass in sufficient quantity to block the x-rays. CRT glass is a whopping 70% lead by mass (less by volume -- lead is much more dense than glass). It is a landfill hazard as studies have shown that lead can leach from the glass.

Also, the beta source (generally a hot wire) IS kept hot at all times the CRT is "on", which can be a lot of the time, especially if you use a #%Q!* screensaver instead of a blank screen for idle mode on a workstation. I have measured (with a kill-a-watt) color CRTs drawing roughly 100W or even a bit over, compared to LCDs drawing around 30W or even a bit less.

On the biohazard side, the tubes inside an LCD contain mercury vapor, just like all of those compact fluorescent bulbs. Sooner or later nearly all that mercury will ALSO make it into the environment. There is less mercury per display (in terms of mass) than lead; mercury is more toxic than lead as heavy metals go. So it is literally a matter of choosing your poison (and don't forget the arsenic in semiconductors while you're at it).

I don't know about the relative toxicity of solid state e.g. LED designs. I'm guessing that it would be the least, and would probably consume the least energy as well.

So yes, I think that LCDs are, on average, far better for the planet and your pocketbook than CRTs (remember, an 80W power differential can add up to $100's in power savings over the lifetime of a monitor), but not perfect. LEDs, if/when they ever appear (Cree, are you listening?)
would almost certainly be better than either in all ways.

   rgb

>
> Wikipedia is a good source of keywords for use in further research. I
> would not consider Wikipedia an authoritative source of information on
> its own.

--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu

From joelja at bogus.com Thu Jun 28 17:14:39 2007
From: joelja at bogus.com (Joel Jaeggli)
Date: Wed Nov 25 01:06:08 2009
Subject: [Beowulf] cold cathode fluorescent backlighting
In-Reply-To:
References: <6394934c0705251214r4ec157f6o882f391ec8c0e7d9@mail.gmail.com>
	<50811.70.22.66.55.1183026598.squirrel@email.powweb.com>
Message-ID: <46844EEF.6040308@bogus.com>

Robert G. Brown wrote:
> So yes, I think that LCDs are, on average, far better for the planet and
> your pocketbook than CRTs (remember, an 80W power differential can add
> up to $100's in power savings over the lifetime of a monitor), but not
> perfect. LEDs, if/when they ever appear (Cree, are you listening?)
> would almost certainly be better than either in all ways.

LED backlit displays are already commercially available (for a year or more in some cases, like cellphones and high-end LCD TVs), with the new Mac being a notable but not the first example in a laptop. As lumens/watt continues to increase their advantages over CCFLs will continue to grow... At the same time direct emissive displays (OLED) will eventually challenge LCD in most areas where LCD currently challenges other technology.

> rgb
>
>> Wikipedia is a good source of keywords for use in further research. I
>> would not consider Wikipedia an authoritative source of information on
>> its own.
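rgb's "$100's in power savings" figure above is easy to sanity-check with back-of-envelope arithmetic. The daily usage hours, monitor lifetime, and electricity price below are illustrative assumptions, not numbers from the thread:

```shell
# Rough check of the ~80 W CRT-vs-LCD differential rgb quotes.
# Assumed: 8 hours/day of use, 5 year lifetime, $0.10 per kWh.
awk 'BEGIN {
    watts = 80; hours = 8; years = 5; rate = 0.10
    kwh = watts / 1000 * hours * 365 * years
    printf "%.0f kWh over %d years = $%.2f\n", kwh, years, kwh * rate
}'
# prints: 1168 kWh over 5 years = $116.80
```

So at 2007-era residential rates the claim holds up: low hundreds of dollars, and roughly double that if the monitor runs sixteen hours a day.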
From James.P.Lux at jpl.nasa.gov Fri Jun 29 11:27:03 2007
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Wed Nov 25 01:06:08 2009
Subject: [Beowulf] cold cathode fluorescent backlighting
In-Reply-To:
References: <6394934c0705251214r4ec157f6o882f391ec8c0e7d9@mail.gmail.com>
	<50811.70.22.66.55.1183026598.squirrel@email.powweb.com>
Message-ID: <6.2.3.4.2.20070629111948.032a8ef0@mail.jpl.nasa.gov>

At 01:56 PM 6/28/2007, you wrote:
>On Thu, 28 Jun 2007, duane@duaneberry.net wrote:
>
>> Even if a beta source inside a monitor/TV was continuously emitting the
>>physical construction would provide more than enough shielding. For
>>power and regulatory considerations I doubt the source is on
>>continuously.
>>
>> IMHO moving from CRT to LCD is a good move in general.
>
>Actually, TVs and monitors at one point in time were notorious sources
>of soft x-rays from where the electrons hit the glass.

Actually, another big source of X-rays was the HV rectifier tube. Older tubes actually used to dimly fluoresce, but along there somewhere, they started using lead glass in the envelope.

> Those of us who
>are old enough remember that we were told not to sit too close
>because it could damage our eyes. Those of us who sat too close anyway
>have a higher risk of cataracts and skin cancers.

Are you sure that's not from playing out in the sun all summer in the days before SPF50 and Solumbra fabric?

>Also, the beta source (generally a hot wire) IS kept hot at all times
>the CRT is "on",

The "instant on" feature actually doesn't keep the cathode at full temperature... typically it keeps it about half way there, so that when you turn it on, it only takes a second or two to come up to full temp/emission. Keeping it at operating temp all the time would lead to reduced cathode/filament life from evaporation. Since the emission/evaporation rate is a very nonlinear function of temperature, running it just a bit cooler makes it last a lot longer (e.g. incandescent light bulb life goes inversely as roughly the twelfth power of applied voltage).

>On the biohazard side, the tubes inside an LCD contain mercury vapor,
>just like all of those compact fluorescent bulbs.

White LEDs?

Jim

From deadline at eadline.org Fri Jun 29 13:47:55 2007
From: deadline at eadline.org (Douglas Eadline)
Date: Wed Nov 25 01:06:08 2009
Subject: [Beowulf] cold cathode fluorescent backlighting
In-Reply-To: <6.2.3.4.2.20070629111948.032a8ef0@mail.jpl.nasa.gov>
References: <6394934c0705251214r4ec157f6o882f391ec8c0e7d9@mail.gmail.com>
	<50811.70.22.66.55.1183026598.squirrel@email.powweb.com>
	<6.2.3.4.2.20070629111948.032a8ef0@mail.jpl.nasa.gov>
Message-ID: <36741.68.44.87.235.1183150075.squirrel@mail.eadline.org>

--snip--

>> Those of us who
>>are old enough remember that we were told not to sit too close
>>because it could damage our eyes. Those of us who sat too close anyway
>>have a higher risk of cataracts and skin cancers.
>
> are you sure that's not from playing out in the sun all summer in the
> days before SPF50 and Solumbra fabric?

You mean like playing out in the sun all day and then coming in at night, watching "Lost in Space" (up close), with a glass of Tang. Sun, sugar, and X-rays, those were the days. Notice I did not mention the X-ray machine that would determine your shoe size -- when buying your Keds or PF Flyers. I really could "jump higher and run faster." *

--
Doug

* for those who are not chronologically gifted, the previous statements will not make much sense

From laytonjb at charter.net Fri Jun 29 14:07:01 2007
From: laytonjb at charter.net (Jeffrey B. Layton)
Date: Wed Nov 25 01:06:08 2009
Subject: [Beowulf] Participate in a survey about cluster management?
Message-ID: <46857475.5040204@charter.net>

Good evening,

I've been kicking around an idea to do something like a survey about cluster management tools.
I'm working on a fairly extensive questionnaire about aspects of cluster management tools and I've already targeted some people I would like to send it to (and some of you probably know who you are :) But I wanted to extend the invitation to participate in the survey to anyone who is a cluster admin, has been a cluster admin, worked with cluster management tools, etc.

Unfortunately I can't give away an iPod like Don can, but I can promise you that I will read your answers and the article will be public (it will be put on ClusterMonkey). If you are interested, please send me an email and I will add you to the list. If you decide you don't want to fill out the survey after seeing it - there's no problem. I will just have rgb keep sending you very long detailed emails until you scream uncle. Actually I won't do that, but you don't have to submit anything if you don't want to.

Thanks!

Jeff

From James.P.Lux at jpl.nasa.gov Fri Jun 29 13:59:59 2007
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Wed Nov 25 01:06:08 2009
Subject: [Beowulf] cold cathode fluorescent backlighting
In-Reply-To: <36741.68.44.87.235.1183150075.squirrel@mail.eadline.org>
References: <6394934c0705251214r4ec157f6o882f391ec8c0e7d9@mail.gmail.com>
	<50811.70.22.66.55.1183026598.squirrel@email.powweb.com>
	<6.2.3.4.2.20070629111948.032a8ef0@mail.jpl.nasa.gov>
	<36741.68.44.87.235.1183150075.squirrel@mail.eadline.org>
Message-ID: <6.2.3.4.2.20070629135935.032bd478@mail.jpl.nasa.gov>

At 01:47 PM 6/29/2007, Douglas Eadline wrote:
>--snip--
>
> >> Those of us who
> >>are old enough remember that we were told not to sit too close
> >>because it could damage our eyes. Those of us who sat too close anyway
> >>have a higher risk of cataracts and skin cancers.
> >
> > are you sure that's not from playing out in the sun all summer in the
> > days before SPF50 and Solumbra fabric?
>
>You mean like playing out in the sun all day and then coming in
>at night, watching "Lost in Space" (up close), with a glass of
>Tang. Sun, sugar, and X-rays, those were the days. Notice I did
>not mention the X-ray machine that would determine your shoe
>size -- when buying your Keds or PF Flyers. I really could "jump
>higher and run faster." *

I'm sure your body was built stronger 12 ways too.

> --
> Doug
>
>* for those who are not chronologically gifted, the previous
>statements will not make much sense

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875

From gdjacobs at gmail.com Sat Jun 30 19:30:22 2007
From: gdjacobs at gmail.com (Geoff Jacobs)
Date: Wed Nov 25 01:06:08 2009
Subject: [Beowulf] Resources for starting a Beowulf Cluster (NFS Setup?)
In-Reply-To:
References: <000d01c7b33f$836d7a10$f6339780@libra.cc.rochester.edu>
Message-ID: <468711BE.2090406@gmail.com>

Robert G. Brown wrote:
> On Wed, 20 Jun 2007, A Lenzo wrote:
>
>> Hello all,
>>
>> I am new to Linux and need help with the setup of my Beowulf Cluster. Can
>> anyone suggest a few good resources?
>>
>> Here is a description of my current hurdle: I have 1 master node and 2 slave
>> nodes. For starters, I would like to be able to create a user account on
>> the master node and have it appear on the slave nodes. I've figured out
>> that the first step is to copy over several files as follows:
>>
>> /etc/group
>> /etc/passwd
>> /etc/shadow
>>
>> And this lets me now log into any node with a given password, but the home
>> directory of that given user does not carry over.
>
> I'd suggest getting a good book on Unix/Linux systems administration at
> your local friendly bookstore.
> Most of this is standard stuff for
> managing any LAN, and the one by Nemeth, Snyder and Hein (Linux
> Administration Handbook) is likely as good as any.
>
> You want to:
>
> a) NFS export your home directory from the master. Basically this
> involves making an entry in /etc/exports (with PRECISELY the right
> format, sorry, RTMP) and doing chkconfig nfs on, /etc/init.d/nfs start.
> God willing and the crick don't rise, and after you turn off selinux
> completely and drive a stake through its heart and use
> system-config-security to enable at least NFS in addition to ssh, then
> with luck you'll be able to go to a node/client and do:
>
> mount -t nfs master:/home /home
>
> (and add a suitable line to /etc/fstab to make this automagical on boot)
> and have it "just work".
>
> b) There are two ways to handle the user account, password,
> /etc/hosts, and other system db synchronization. For a tiny cluster
> with one or two users they are pretty much break even. One is to do
> what you've done -- create e.g. /etc/[passwd,group,shadow,hosts] on the
> master and then rsync them to the nodes as root, taking care not to
> break them or you'll be booting them single user to clean them up or
> reinstalling them altogether! When a new account is added, rerun the
> rsyncs. You can even write a tiny script that will rsync exactly what
> is needed. Or, you can learn to use NIS, which scales to a much larger
> (department/organization sized) enterprise and cluster with dozens or
> hundreds of user accounts.
>
> For that you'll NEED the systems administration book or one like it --
> NIS is not for the faint of heart. I've done NIS management before, and
> know how to use it, but elect to go the other way for my home
> LAN/cluster because even 8-10 systems and 4-5 users are about break even
> compared to a judicious and infrequent set of rsyncs, and a cluster is
> even simpler in this regard. FWIW, local (non-NIS) dbs are somewhat
> faster for certain classes of parallel operation although this is not
> generally a major issue for most code.
>
> Hope this helps,
>
> rgb

What about integrating rsync into the password scripts? Fundamentally, I don't trust NIS.

--
Geoffrey D. Jacobs
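The "tiny script" rgb mentions for pushing account files to the nodes could be as small as the sketch below. The node names are placeholders, and the rsync command is echoed rather than executed, so nothing is changed until the leading "echo" is removed:

```shell
#!/bin/sh
# Sketch of rgb's rsync approach to account synchronization: push the
# master's account databases to each slave node after adding an account.
# NODES is a placeholder list; replace it with your real slave nodes.
FILES="/etc/passwd /etc/shadow /etc/group"
NODES="node01 node02"
for node in $NODES; do
    # -a preserves ownership and permissions; run as root.
    # Remove the leading "echo" to actually perform the sync.
    echo rsync -a $FILES "root@$node:/etc/"
done
```

Rerun it after each account change; as rgb warns, a botched push to /etc can leave a node bootable only in single-user mode, so test against one node before looping over all of them.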